Notes on ARM and Cortex

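I once heard this story: someone asked Socrates why he had become the wisest man in Athens. He replied that the Athenians believed they knew something, yet did not realize that they actually knew nothing, whereas he knew only one thing: that he knew nothing. I hope to keep the same attitude while digging into technical fields.
Because of my work, I have done Real-Time OS and Linux porting on ARM processors. With this article I hope to organize the related information (reviewing the old to learn the new). If you are already experienced with the ARM architecture, this article will probably be of limited help; it is mainly meant for people who want a deeper understanding of ARM-based product development. My own knowledge is limited, so if anything is missing, please do not hesitate to point it out.
According to the ARM website http://www.arm.com/about/company-profile/index.php, ARM was founded in 1990. To date more than 15 billion ARM-based chips have been sold, and ARM has sold more than 600 processor licenses to over 200 companies, collecting licensing fees on ARM chips. More than 95% of the world's mobile phones and more than 25% of consumer electronics products currently use ARM processors.
As the name ARM (Advanced RISC Machines) suggests, this is a company focused on RISC (Reduced Instruction Set Computer) processors. The earliest ARM1 prototype was designed in 1985 by Acorn Computers in Cambridge, England, and manufactured by the American company VLSI. This is why, as Wikipedia shows, the early ARM1, ARM2, ARM250, ARM3... processors were all used by Acorn as the core processors of its computers.
On 1978/12/5, physicist Hermann Hauser and engineer Chris Curry founded CPU (Cambridge Processing Unit) in Cambridge, England; in 1979 CPU was renamed Acorn Computers. In 1985, Roger Wilson and Steve Furber designed their own first-generation 32-bit, 6 MHz processor and used it to build a RISC computer, abbreviated ARM (Acorn RISC Machine).
Acorn subsequently ran into financial difficulties and was acquired by Olivetti, becoming an independent Olivetti research subsidiary. On 1990/11/27, ARM received investment from Apple and the chip vendor VLSI and became an independent processor company, starting out in a barn. The well-remembered Apple Newton PDA, for example, used the ARM610 processor. (References: http://www5.cnfol.com/big5/news.cnfol.com/100823/101,1587,8274016,00.shtml and http://big5.buynow.com.cn/gate/big5/www.cnbeta.com/articles/131786.htm )
A bit of archaeology: today's processor architectures mainly follow either the Von Neumann memory architecture proposed in the 1940s, in which program and data share the same bus, or the later Harvard architecture, in which program and data use separate buses. The advantage of the latter is that program and data memory accesses can happen at the same time. Early parts such as ARM7 and 8051 generally use the Von Neumann architecture, with a single cache serving both instructions and data, while newer microprocessor architectures (e.g. ARM11 or Cortex A) usually use the Harvard architecture: the processor has an I-Cache and a D-Cache and separate fetch buses for instructions and data, which improves efficiency. (References: http://en.wikipedia.org/wiki/ARM7 and http://en.wikipedia.org/wiki/Harvard_architecture ).
For a classification of ARM cores by Von Neumann and Harvard architecture, see also http://stenlyho.blogspot.com/2008/08/armcpu.html, summarized below: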
Processor Family | # of pipeline stages | Memory Organization | Clock Rate | MIPS/MHz
ARM6 | 3 | Von Neumann | 25MHz | -
ARM7 | 3 | Von Neumann | 66MHz | 0.9
ARM8 | 5 | Von Neumann | 72MHz | 1.2
ARM9 | 5 | Harvard | 200MHz | 1.1
ARM10 | 6 | Harvard | 400MHz | 1.25
StrongARM | 5 | Harvard | 233MHz | 1.15
ARM11 | 8 | Von Neumann/Harvard | 550MHz | 1.2
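ARM processors use a RISC (Reduced Instruction Set Computing) architecture. RISC favors simple, frequently used instructions and avoids complex ones. It uses fixed-length instruction encodings (ARM supports 32-bit, 16-bit, and mixed 16/32-bit encodings) and single-cycle instructions, which suits pipelined execution, and it provides many registers so that data-processing instructions operate only on registers, while only dedicated load/store instructions can access memory. CISC architectures, by contrast, keep adding new instructions as needs arise, so the architecture grows ever more complex, even though in real applications not every instruction is commonly used. Taking the x86 instruction set, a CISC architecture, as an example, instructions have variable lengths; the examples below are 1, 2, 7 and 11 bytes long: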
(1 byte) 0x48 = dec eax
(2 bytes) 0x89 F9 = mov ecx,edi
(7 bytes) 0x8B BC 24 A4 01 00 00 = mov edi,dword ptr [esp+000001A4h]
(11 bytes) 0x81 BC 24 14 01 00 00 FF 00 00 00 = cmp dword ptr [esp+00000114h],0FFh
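ARM speeds up instruction processing through pipelining. If an interrupt occurs during pipelined execution, the instructions already in the pipeline are completed before the interrupt is taken. ARM7 supports the following 3-stage pipeline:
Fetch → Decode → Execute
where
Fetch: fetches the instruction
Decode: decompresses Thumb instructions into ARM instructions, decodes ARM instructions, and selects registers
Execute: reads registers/memory, performs arithmetic and logic operations, and writes results back to registers/memory
In every CPU cycle the processor works on the 'Fetch', 'Decode' and 'Execute' actions at the same time, instead of taking one instruction from Fetch all the way to completion before starting the next one, as the following table shows: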
Time | Fetch | Decode | Execute
Cycle#1 | Instruction#1 | - | -
Cycle#2 | Instruction#2 | Instruction#1 | -
Cycle#3 | Instruction#3 | Instruction#2 | Instruction#1
Cycle#4 | Instruction#4 | Instruction#3 | Instruction#2
Cycle#5 | Instruction#5 | Instruction#4 | Instruction#3
Cycle#6 | Instruction#6 | Instruction#5 | Instruction#4
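The overlap in the table above can be reproduced with a few lines of Python. This is only a sketch of an ideal pipeline with no stalls or branches; the function names `pipeline_schedule` and `format_schedule` are made up for illustration and do not come from any ARM tool.

```python
def pipeline_schedule(num_instructions, stages=("Fetch", "Decode", "Execute")):
    """Return one row per cycle; each row maps a stage name to the
    instruction number occupying it (None when the stage is idle)."""
    depth = len(stages)
    total_cycles = num_instructions + depth - 1  # cycles needed to drain the pipeline
    rows = []
    for cycle in range(1, total_cycles + 1):
        row = {}
        for position, stage in enumerate(stages):
            # The instruction in this stage entered Fetch 'position' cycles ago.
            instr = cycle - position
            row[stage] = instr if 1 <= instr <= num_instructions else None
        rows.append(row)
    return rows


def format_schedule(rows, stages=("Fetch", "Decode", "Execute")):
    """Render the schedule as a plain-text table."""
    lines = ["Cycle  " + "  ".join(f"{s:<8}" for s in stages)]
    for number, row in enumerate(rows, start=1):
        cells = [f"#{row[s]}" if row[s] is not None else "-" for s in stages]
        lines.append(f"{number:<5}  " + "  ".join(f"{c:<8}" for c in cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_schedule(pipeline_schedule(6)))
```

With six instructions the sketch needs eight cycles in total, two more than the table shows, because it also lets the last instructions drain through Decode and Execute.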
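To keep arithmetic instructions from accessing memory outside the memory stage, which would break the pipeline's ability to overlap execution, ARM allows only dedicated load/store instructions to read and write memory.
Early ARM6 and ARM7 cores have a pipeline of about 3 stages, ARM8 and ARM9 about 5 stages, and the later ARM11 8 stages. A deeper pipeline does not necessarily pay off, however: if the code flow hits a branch (e.g. a Branch to another block of code), the contents of the pipeline become invalid and instruction fetch has to start over.
Simply put, pipelining splits the processing of an instruction into several distinct steps. For example,
ARM9 supports the following 5-stage pipeline:
Fetch → Decode → Execute → Memory → Write Back
where
Fetch: fetches the instruction
Decode: decodes ARM/Thumb instructions and reads registers
Execute: performs logic operations and computes memory access addresses
Memory: reads data from or writes data back to memory
Write Back: writes the result of the operation or Load back into a register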
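From ARM10 onward, Branch Prediction is supported, reducing the chance of a pipeline flush caused by a branch during pipelined execution. ARM10 supports the following 6-stage pipeline:
Fetch → Issue → Decode → Execute → Memory → Write Back
where
Fetch: performs branch prediction (Branch Predictor), instruction address calculation, and instruction fetch
Issue: decodes ARM/Thumb instructions; if the instruction is not a valid ARM/Thumb instruction, the Coprocessor Signal is used to decide whether it is a coprocessor instruction
Decode: register read, Result Forward, ScoreBoard
Execute: performs arithmetic and logic operations, computes Branch/Data memory addresses, and multiplies
Memory: reads data from or writes data back to memory, accesses coprocessor data, and handles multiply-add
Write Back: writes the result of the operation or Load back into a register
ARM11 uses a scalar pipeline and, from the Issue stage, splits into three pipelines: ALU (arithmetic logic unit), MAC (multiply/accumulate) and Load/Store. In each cycle it can dispatch one operation to one of these pipelines, giving the 8-stage scalar pipeline shown below. (ARM1156T2-S has a 9-stage pipeline in which the Fetch part is extended to 3 stages; the details are not discussed here, see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0338g/I1002919.html)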
Fetch#1→ Fetch#2→ Decode→ISS (ALU Pipeline)→ Shifter→ALU→ SAT→ Write Back
______________________________(MAC Pipeline) → MAC1→MAC2→ MAC3→ Write Back
______________________________(Load/Store Pipeline)→ LS Add→DC1→ DC2→ Write Back
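Compared with earlier versions, ARM11 uses two Fetch pipeline stages to support two branch prediction mechanisms. The first Fetch stage performs Dynamic Branch Prediction based on history, using a Branch-Target Address Cache (BTAC) with 64 entries, each in one of 4 states (Strongly taken, Weakly taken, Weakly not-taken and Strongly not-taken), that records how recent branches behaved. The second Fetch stage performs Static Branch Prediction and handles the branches that are not covered by the first stage.
Branch prediction with a high hit rate avoids pipeline flushes and lets the processor run more efficiently. According to the reference material, ARM11's dynamic and static branch prediction reach a hit rate of about 85% under typical conditions, and in most cases between 80% and 95% (depending on code size).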
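The four states named above behave like a 2-bit saturating counter per entry. The following Python sketch illustrates that idea only; the class name, the way an address is mapped to an entry, and the initial state are assumptions for illustration, not the actual ARM11 BTAC design.

```python
STATES = ("strongly not-taken", "weakly not-taken", "weakly taken", "strongly taken")


class TwoBitPredictor:
    """Table of 2-bit saturating counters (0..3), one per entry."""

    def __init__(self, entries=64):
        self.entries = entries
        # Assumption: every entry starts as "weakly not-taken".
        self.counters = [1] * entries

    def _index(self, branch_address):
        # Assumption: word-aligned branch address, simple modulo indexing.
        return (branch_address >> 2) % self.entries

    def predict(self, branch_address):
        """Return True when the branch is predicted taken."""
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        """Move the counter one step toward the actual outcome."""
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def hit_rate(predictor, branch_address, outcomes):
    """Fraction of outcomes that the predictor guessed correctly."""
    hits = 0
    for taken in outcomes:
        if predictor.predict(branch_address) == taken:
            hits += 1
        predictor.update(branch_address, taken)
    return hits / len(outcomes)


if __name__ == "__main__":
    # A loop branch: taken nine times, then falls through, repeated ten times.
    loop_outcomes = ([True] * 9 + [False]) * 10
    print(hit_rate(TwoBitPredictor(), 0x8000, loop_outcomes))
```

For a loop-closing branch like the one in the usage example, the counter mispredicts only the loop exit (plus the very first iteration), which is why two bits of state are already enough for a high hit rate on loop-heavy code.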
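In summary:
Stage | Name | Description
#1 | Fetch#1 | Dynamic Branch Prediction, instruction address calculation, and instruction fetch
#2 | Fetch#2 | Static Branch Prediction
#3 | Decode | Decodes ARM/Thumb instructions; if the instruction is not a valid ARM/Thumb instruction, the Coprocessor Signal is used to decide whether it is a coprocessor instruction. Static BP, R Stack
#4 | ISS (Instruction Issue) | Reads registers and dispatches the instruction to one of three paths: the ALU pipeline for arithmetic and logic, the MAC pipeline for multiply-accumulate, and the Load/Store pipeline for data access.
Stage | ALU Pipeline | MAC Pipeline | Load/Store Pipeline
#5 | Shifter: shifts the operand of the data-processing instruction | MAC1: first stage of the multiply-accumulate operation | LS Add: computes the memory address of the Load/Store operation
#6 | ALU: performs integer arithmetic and logic operations | MAC2: second stage of the multiply-accumulate operation | DC1: first stage of the Data Cache access
#7 | SAT: saturates the result | MAC3: third stage of the multiply-accumulate operation | DC2: second stage of the Data Cache access
#8 | Write Back: writes the result of the operation or Load back into a register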
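Next comes the ARM Cortex A series. With this architecture ARM introduced a superscalar pipeline, which lets the processor handle more than one instruction in parallel per cycle. Taking Cortex A8 as an example, it has a 13-stage integer pipeline and a 10-stage NEON multimedia pipeline. For integer instructions, Cortex A8 has a dual-issue, in-order pipeline: unlike earlier ARM cores, which could process only one integer instruction at a time, Cortex A8 can issue two integer instructions together and process them in parallel in one cycle through two integer ALU pipelines.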
[Figure: Cortex A8 pipeline overview]
13-Stage Integer Pipeline: F#0-F#2 (Instruction Fetch), D#0-D#4 (Instruction Decode with Dual-Issue), E#0-E#5 (Architectural Register File; ALU/MUL Pipeline 0, ALU Pipeline 1, Load/Store Pipeline 0 or 1)
10-Stage NEON Pipeline: M#0-M#3 (NEON Instruction Queue, NEON Instruction Decode, NEON Register File), N#1-N#6 (Integer ALU Pipe, Integer MUL Pipe, Integer Shift Pipe, Non-IEEE FP Add Pipe, Non-IEEE FP Mul Pipe, IEEE FP Engine, Load/Store Permute Pipe)
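Cortex A8 has two ALU pipelines. ALU 0 and ALU 1 are symmetric and can process two integer operations at the same time. Because of how the pipelines are built, multiply instructions are paired with ALU 0 (that is, pipeline 0 handles both integer ALU instructions and multiply instructions), while Load/Store instructions can be paired with either ALU 0 or ALU 1.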
where, for the 13-stage integer pipeline:
Stage | Name | Description
0 | F#0 | Generates the address of the instruction to fetch; in the documentation this stage is not counted among the 13 pipeline stages. (AGU, Address Generation Unit)
1 | F#1 | RAM + TLB. Supports a two-level Global History Branch Predictor:
    1. BTB (Branch Target Buffer): determines whether the address being fetched is a branch instruction and which target address it jumps to. It holds 512 entries. On a BTB hit, the GHB is consulted next.
    2. GHB (Global History Buffer): contains 4096 2-bit counters that encode the strength and direction of the prediction. The GHB is indexed with a 10-bit history of the ten most recent branches and 4 bits of the PC (Program Counter).
    In addition, the Return Stack (RS) records eight 32-bit Link Register values. When a function-return instruction is seen, the eight most recent Link Register values in the Return Stack help the Dynamic Branch Predictor predict the likely branch target.
2 | F#2 | Provides a 12-entry Fetch Queue
3-7 | D#0-D#4 | Decode.
8 | E#0 | Architectural Register File, feeding ALU/MUL Pipeline 0, ALU Pipeline 1, and Load/Store Pipeline 0 or 1
9-13 | E#1-E#5 | Execution. At E#4 (stage 12) each of the three pipelines sends a BP Update (to F#0).
10-Stage NEON Pipeline
Stage | Name | Instruction Decode | Load and Store with Alignment
1 | M#0 | 16-entry NEON Instruction Queue / Instruction Decode | Mux L1/MCR
2 | M#1 | Decode Queue and Read/Write Check | 8-entry Load Queue
3 | M#2 | Score-Board and Issue-Logic | Load Align
4 | M#3 | NEON Register Read and M3 fwding muxes | Mux with NRF
Stages 5-10 (N#1-N#6), by execution pipe:
Integer ALU Pipe: FMT (N#1), ALU (N#2), ABS (N#3)
Integer MUL Pipe: DUP (N#1), MUL#1 (N#2), MUL#2 (N#3), ACC#1 (N#4), ACC#2 (N#5)
Integer Shift Pipe: SHIFT#1 (N#1), SHIFT#2 (N#2), SHIFT#3 (N#3)
Non-IEEE FP Add Pipe: FFMT (N#1), FADD#1 (N#2), FADD#2 (N#3), FADD#3 (N#4), FADD#4 (N#5)
Non-IEEE FP Mul Pipe: FDUP (N#1), FMUL#1 (N#2), FMUL#2 (N#3), FMUL#3 (N#4), FMUL#4 (N#5)
IEEE Single/Double precision VFP: VFP (N#1), Write Back (N#2)
Load/Store and Permute Pipe: PERM#1 (N#1), PERM#2 (N#2), Store Align (N#3), 8-entry Store Queue (N#4)
N#6: Write Back (Update to ARM/NEON Register File)
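Cortex A8 has two cache levels. The L1 cache consists of a 16 KB or 32 KB I-Cache and D-Cache (Harvard architecture) with one parity bit per byte. Each cache is 4-way set associative and uses a Hash Virtual Address Buffer (HVAB) to predict which of the four ways a pipeline access to the L1 cache will hit, which reduces both access time and power. Write-Back and Write-Through policies are supported.
The L2 cache supports sizes from 64 KB to 2 MB and is shared by instructions and data. It offers a high-speed interface between L1 and L2, which avoids the performance loss caused by the processor frequently going out to the external AXI bus and competing with other peripherals for resources. The L2 cache is 8-way set associative, can optionally support ECC or parity, and supports Write-Back, Write-Through and Write-Allocate.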
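To make the "4-way" organization concrete, the sketch below splits an address into tag, set index and line offset for a set-associative cache. The 64-byte line size is an assumption made for the example (the text above does not state it), the function name `split_address` is hypothetical, and the sketch ignores the virtual/physical distinction and the HVAB way prediction.

```python
def split_address(address, size_bytes=32 * 1024, ways=4, line_bytes=64):
    """Split an address into (tag, set index, byte offset within the line).

    size_bytes / ways / line_bytes must be powers of two.
    line_bytes=64 is an assumed value for illustration.
    """
    sets = size_bytes // (ways * line_bytes)   # number of sets
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = sets.bit_length() - 1         # log2(sets)
    offset = address & (line_bytes - 1)
    index = (address >> offset_bits) & (sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    tag, index, offset = split_address(0x12345678)
    print(hex(tag), index, hex(offset))
```

With these parameters a 32 KB cache has 128 sets; the hardware looks up the set selected by the index and compares the tag against all four ways of that set.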
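ARM Cortex A8 supports the new NEON multimedia instruction set through a coprocessor architecture. ARM identifies coprocessor instructions mainly at instruction Decode and Issue time, by checking with the coprocessor whether it supports the instruction.
The NEON pipeline is attached after the ARM core's integer pipeline, so all exception handling and branch prediction issues have already been resolved before an instruction reaches it. In addition, memory Load/Store operations are carried out ahead of the NEON pipeline by the ARM core's Load/Store pipeline from the L1 D-Cache, and the data is stored in the NEON pipeline's Load/Store Data Queue.
NEON has its own instruction buffer (NEON Instruction Queue). Based on ARM's dual-issue architecture, at most two valid NEON instructions can be dispatched per processor cycle, and NEON instructions can Load/Store 128 bits of data at a time from the L1 or L2 cache.
NEON has three integer SIMD pipelines (an integer multiply-accumulate pipeline, an integer shift pipeline and an integer ALU pipeline), one Load-Store/Permute pipeline (responsible for NEON Load/Store and for data transfers with the Integer Unit), two SIMD single-precision floating-point pipelines (one for multiplication, one for addition) and one non-pipelined Vector Floating-Point unit (VFPLite, which follows the ARM VFPv3 specification, complies with IEEE 754, and is backward compatible with ARM's earlier floating-point implementations). NEON instructions execute in order in the pipeline, and the data being processed belongs either to NEON integer SIMD instructions or to NEON floating-point instructions.
As processor clock rates rise, what each pipeline stage can do in one cycle becomes simpler (each cycle is correspondingly shorter), which goes hand in hand with a larger number of pipeline stages. As long as branch prediction is accurate and pipeline flushes are rare, the extra stages turn the higher clock rate into a real performance gain.
Every instruction in the ARM instruction set carries a 4-bit condition field. In a pipelined architecture the processor can check the PSR (Program Status Register) directly to decide whether the instruction should execute, which helps performance.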
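As a rough illustration of how the 4-bit condition field is checked against the N, Z, C and V flags of the PSR, here is a small Python sketch. It follows the standard ARM condition encoding (EQ=0000 through AL=1110); the function name `condition_passed` is chosen for this example only.

```python
COND_NAMES = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
              "HI", "LS", "GE", "LT", "GT", "LE", "AL"]


def condition_passed(cond, n, z, c, v):
    """Return True when an instruction with condition field 'cond' (0..15)
    should execute, given the N, Z, C, V flags as booleans."""
    base = cond >> 1
    if base == 0:      # EQ / NE
        result = z
    elif base == 1:    # CS / CC
        result = c
    elif base == 2:    # MI / PL
        result = n
    elif base == 3:    # VS / VC
        result = v
    elif base == 4:    # HI / LS
        result = c and not z
    elif base == 5:    # GE / LT
        result = n == v
    elif base == 6:    # GT / LE
        result = (n == v) and not z
    else:              # AL (always)
        result = True
    # The low bit inverts the test, except for the 1111 encoding.
    if (cond & 1) and cond != 0b1111:
        result = not result
    return result


if __name__ == "__main__":
    # Flags after comparing two equal values: Z=1, C=1, N=0, V=0.
    for code, name in enumerate(COND_NAMES):
        print(name, condition_passed(code, n=False, z=True, c=True, v=False))
```

Because the check only needs the current flags, short if/else sequences can be written without branches, so there is nothing for the pipeline to flush.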
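ARM core names are also self-describing. For example, ARM7-TDMI (ARM7 + Thumb + Debug + Multiplier + ICE) means that this ARM7 core supports 16-bit Thumb code, on-chip JTAG debug (IEEE 1149.1), a hardware multiplier, and the ICE-RT embedded logic / trace macrocell. Likewise, J indicates support for the Jazelle instruction set and F indicates support for vector floating point.
A brief introduction to ARM's latest Cortex processors: from the early ARM7 (armv4), ARM9 (armv5) and ARM11 (armv6) to today's Cortex (armv7) architecture, every generation has brought new instructions (e.g. v4T introduced the Thumb instruction set, v5E introduced enhanced DSP instructions, v6 added Thumb2 and SIMD instructions) along with many architectural and performance improvements. With Cortex, ARM for the first time released three product lines at once, described below:
Cortex A (Application) series: mainly for high-performance open platforms, generally equipped with an MMU, e.g. Symbian, Linux/Android or Windows Mobile/Phone.
Cortex R (Real-Time) series: for high-end embedded products such as automotive electronics and robotic arms, which need a powerful, highly reliable processor that reacts quickly to events.
Cortex M (Microcontroller) series: for embedded and single-chip products, targeting the real-time, low-power and low-cost applications previously served by single-chip parts such as the 8051.
Taiwan's Nuvoton has also launched a low-priced Cortex-M0 processor (hmm... under 1 USD, as I understand it). Cortex-M3, for its part, supports only a commonly used subset of the Thumb2 instruction set (no ARM instruction set) together with an interrupt vector table, providing a high-density, high-performance execution environment.

The differences between the ARM families are described one by one below.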
ARM Family | ARM Architecture | ARM Core | Features | Cache (I/D) | MMU/MPU | Performance | Applied Product
ARM1 | ARMv1 | ARM1 | - | None | None | - | ARM Evaluation System second processor for BBC Micro
ARM2 | ARMv2 | ARM2 | ARMv2 added the MUL (multiply) instruction | None | None | 4 MIPS @ 8 MHz, 0.33 DMIPS/MHz | Acorn Archimedes, Chessmachine
ARM2 | ARMv2a | ARM250 | Integrated MEMC (MMU), Graphics and IO processor. ARMv2a added the SWP and SWPB (swap) instructions. | None | MEMC1a | 7 MIPS @ 12 MHz | Acorn Archimedes
ARM3 | ARMv2a | ARM3 | First integrated memory cache. | 4 KB unified | None | 12 MIPS @ 25 MHz, 0.50 DMIPS/MHz | Acorn Archimedes
ARM6 | ARMv3 | ARM60 | ARMv3 first to support 32-bit memory address space (previously 26-bit) | None | None | 10 MIPS @ 12 MHz | 3DO Interactive Multiplayer, Zarlink GPS Receiver
ARM6 | ARMv3 | ARM600 | As ARM60, cache and coprocessor bus (for FPA10 floating-point unit). | 4 KB unified | None | 28 MIPS @ 33 MHz | -
ARM6 | ARMv3 | ARM610 | As ARM60, cache, no coprocessor bus. | 4 KB unified | None | 17 MIPS @ 20 MHz, 0.65 DMIPS/MHz | Acorn Risc PC 600, Apple Newton 100 series
ARM7 | ARMv3 | ARM700 | - | KB unified | None | - | Acorn Risc PC prototype CPU card
ARM7 | ARMv3 | ARM710 | As ARM700, no coprocessor bus. | KB unified | None | - | Acorn Risc PC 700
ARM7 | ARMv3 | ARM710a | As ARM710 | KB unified | None | 40 MHz, 0.68 DMIPS/MHz | Acorn Risc PC 700, Apple eMate 300, Psion Series 5 (ARM7100), Acorn A7000 (ARM7500), Acorn A7000+ (ARM7500FE), Network Computer (ARM7500FE)
ARM7TDMI | ARMv4T | ARM7TDMI(-S) | 3-stage pipeline, Thumb | None | None | 15 MIPS @ 16.8 MHz, 63 DMIPS @ 70 MHz | Game Boy Advance, Nintendo DS, Apple iPod, Lego NXT, Juice Box, Garmin Navigation Devices (1990s – early 2000s)
ARM7TDMI | ARMv4T | ARM710T | As ARM7TDMI, cache | 8 KB unified | MMU | 36 MIPS @ 40 MHz | Psion Series 5mx, Psion Revo/Revo Plus/Diamond Mako
ARM7TDMI | ARMv4T | ARM720T | As ARM7TDMI, cache, MMU with Fast Context Switch Extension | 8 KB unified | MMU | 60 MIPS @ 59.8 MHz | Zipit Wireless Messenger
ARM7TDMI | ARMv4T | ARM740T | As ARM7TDMI, cache | 8 KB unified | MPU | - | -
ARM7EJ | ARMv5TEJ | ARM7EJ-S | 5-stage pipeline, Thumb, Jazelle DBX, Enhanced DSP instructions | None | None | - | -
ARM8 | ARMv4 | ARM810 | 5-stage pipeline, static branch prediction, double-bandwidth memory | 8 KB unified | MMU | 84 MIPS @ 72 MHz, 1.16 DMIPS/MHz | Acorn Risc PC prototype CPU card
StrongARM | ARMv4 | SA-1 | 5-stage pipeline | 16 KB/8–16 KB | MMU | 203–206 MHz, 1.0 DMIPS/MHz | SA-110: Apple Newton 2×00 series, Acorn Risc PC, Rebel/Corel Netwinder, Chalice CATS; SA-1100: Psion netBook; SA-1110: LART (computer), Intel Assabet, Ipaq H36x0, Balloon2, Zaurus SL-5×00, HP Jornada 7xx, Jornada 560 series, Palm Zire 31
ARM9TDMI | ARMv4T | ARM9TDMI | 5-stage pipeline, Thumb | None | None | - | -
ARM9TDMI | ARMv4T | ARM920T | As ARM9TDMI, cache, MMU with FCSE (Fast Context Switch Extension) | 16 KB/16 KB | MMU | 200 MIPS @ 180 MHz | Armadillo, GP32, GP2X (first core), Tapwave Zodiac (Motorola i.MX1), Hewlett-Packard HP-49/50 Calculators, Sun SPOT, HTC TyTN, FIC Neo FreeRunner, Garmin Navigation Devices (mid–late 2000s), TomTom navigation devices
ARM9TDMI | ARMv4T | ARM922T | As ARM9TDMI, caches | 8 KB/8 KB | MMU | - | -
ARM9TDMI | ARMv4T | ARM940T | As ARM9TDMI, caches | 4 KB/4 KB | MPU | - | GP2X (second core), Meizu M6 Mini Player
ARM9E | ARMv5TE | ARM946E-S | Thumb, Enhanced DSP instructions, caches, TCM (tightly coupled memories) | Variable | MPU | - | Nintendo DS, Nokia N-Gage, Canon PowerShot A470, Canon EOS 5D Mark II, Conexant 802.11 chips, Samsung S5L2010
ARM9E | ARMv5TE | ARM966E-S | Thumb, Enhanced DSP instructions, TCM (tightly coupled memories) | None | - | - | -
ARM9E | ARMv5TE | ARM968E-S | As ARM966E-S | None | - | - | -
ARM9E | ARMv5TEJ | ARM926EJ-S | Thumb, Jazelle DBX, Enhanced DSP instructions, caches, TCM (tightly coupled memories) | Variable | MMU | 220 MIPS @ 200 MHz | Mobile phones: Sony Ericsson (K, W series); Siemens and Benq (x65 series and newer); LG Arena; GPH Wiz; Squeezebox Duet Controller (Samsung S3C2412); Squeezebox Radio; Buffalo TeraStation Live (NAS); Drobo FS (NAS); Western Digital MyBook I World Edition; Western Digital MyBook II World Edition; Seagate FreeAgent DockStar STDSD10G-RK; Seagate FreeAgent GoFlex Home; Chumby Classic
ARM9E | ARMv5TE | ARM996HS | Clockless processor, as ARM966E-S, TCM (tightly coupled memories) | None | MPU | - | -
ARM10E | ARMv5TE | ARM1020E | 6-stage pipeline, Thumb, Enhanced DSP instructions, (VFP) | 32 KB/32 KB | MMU | - | -
ARM10E | ARMv5TE | ARM1022E | As ARM1020E | 16 KB/16 KB | MMU | - | -
ARM10E | ARMv5TEJ | ARM1026EJ-S | Thumb, Jazelle DBX, Enhanced DSP instructions, (VFP) | Variable | MMU or MPU | - | -
XScale | ARMv5TE | XScale | 7-stage pipeline, Thumb, Enhanced DSP instructions | 32 KB/32 KB | MMU | 133–400 MHz | 80219: Thecus N2100; IOP321: Iyonix; PXA210/PXA250: Zaurus SL-5600, iPAQ H3900, Sony CLIE NX60, NX70V, NZ90; PXA255: Gumstix basix & connex, Palm Tungsten E2, Zaurus SL-C860, Mentor Ranger & Stryder, iRex ILiad; PXA263: Sony CLIE NX73V, NX80V; PXA26x: Palm Tungsten T3; PXA27x: Gumstix verdex, "Trizeps-Modules", "eSOM270-Module" PXA270 COM, HTC Universal, HP hx4700, Zaurus SL-C1000, 3000, 3100, 3200, Dell Axim x30, x50, and x51 series, Motorola Q, Balloon3, Trolltech Greenphone, Palm TX, Motorola Ezx Platform A728, A780, A910, A1200, E680, E680i, E680g, E690, E895, Rokr E2, Rokr E6, Fujitsu Siemens LOOX N560, Toshiba Portege G500, Treo 650-755p, Zipit Z2, HP iPaq 614c Business Navigator, I-mate PDA2; PXA3XX: Samsung Omnia; PXA900: Blackberry 8700, Blackberry Pearl (8100); IXP42x: NSLU2
XScale | ARMv5TE | Bulverde | Wireless MMX, Wireless SpeedStep added | 32 KB/32 KB | MMU | 312–624 MHz | -
XScale | ARMv5TE | Monahans | Wireless MMX2 added, 32 KB/32 KB (L1), optional L2 cache up to 512 KB | 32 KB/32 KB | MMU | up to 1.25 GHz | -
ARM11 | ARMv6 | ARM1136J(F)-S | 8-stage pipeline, SIMD, Thumb, Jazelle DBX, (VFP), Enhanced DSP instructions | Variable | MMU | - | OMAP2420: Nokia E90, Nokia N93, Nokia N95, Nokia N82, Zune, BUGbase, Nokia N800, Nokia N810; MSM7200: Eten Glofiish, HTC TyTN II, HTC Nike; Freescale i.MX31: original Zune 30 GB, Toshiba Gigabeat S and Kindle DX; Freescale MXC300-30: Nokia E63, Nokia E71, Nokia 5800, Nokia E51, Nokia 6700 Classic, Nokia 6120 Classic, Nokia 6210 Navigator, Nokia 6220 Classic, Nokia 6290, Nokia 6710 Navigator, Nokia 6720 Classic, Nokia E75, Nokia N97, Nokia N81; Qualcomm MSM7201A: HTC Dream, HTC Magic, Motorola i1, Motorola Z6, HTC Hero, Samsung SGH-i627 (Propel Pro), Sony Ericsson Xperia X10 Mini Pro; Qualcomm MSM7227: ZTE Link, HTC Legend, HTC Aria, Viewsonic ViewPad 7
ARM11 | ARMv6T2 | ARM1156T2(F)-S | 9-stage pipeline, SIMD, Thumb-2, (VFP), Enhanced DSP instructions | Variable | MPU | - | -
ARM11 | ARMv6ZK | ARM1176JZ(F)-S | As ARM1136EJ(F)-S, TrustZone | Variable | MMU | 965 DMIPS @ 772 MHz, up to 2600 DMIPS with four processors | Apple iPhone (original and 3G), Apple iPod touch (1st and 2nd Generation), Motorola RIZR Z8, Motorola RIZR Z10, Nintendo 3DS; S3C6410: Samsung Omnia II, Samsung Moment, SmartQ 5, Tablet PC; Qualcomm MSM7627: Palm Pixi and Motorola Calgary/Devour
ARM11 | ARMv6K | ARM11 MPCore | As ARM1136EJ(F)-S, 1–4 core SMP | Variable | MMU | - | -
Cortex-A | ARMv7-A | Cortex-A5 | VFP, NEON, Jazelle RCT, Thumb/Thumb-2, 1–4 cores, Variable (L1 + L2) Cache, MMU + TrustZone | Variable | MMU | 1.57 DMIPS/MHz per core | -
Cortex-A | ARMv7-A | Cortex-A8 | VFP, NEON, Jazelle RCT, Thumb-2, 13-stage superscalar pipeline, Variable (L1 + L2) Cache, MMU + TrustZone | Variable | MMU | up to 2 000 (2.0 DMIPS/MHz in speed from 600 MHz to greater than 1 GHz) | HTC Desire, SBM7000, Oregon State University OSWALD, Gumstix Overo Earth, Pandora, Apple iPhone 3GS, Apple iPod touch (3rd and 4th Generation), Apple iPad (A4), Apple iPhone 4 (A4), Archos 5, BeagleBoard, Motorola Droid, Motorola Droid X, Motorola Droid 2, Motorola Droid R2D2 Edition, Palm Pre, Samsung Omnia HD, Samsung Wave S8500, Samsung i9000 Galaxy S, Sony Ericsson Satio, Touch Book, Nokia N900, Meizu M9, Google Nexus S, Sharp PC-Z1 "Netwalker".
Cortex-A | ARMv7-A | Cortex-A9 MPCore | Application profile, VFPv3 FPU, NEON, Thumb-2, Jazelle RCT/DBX, out-of-order speculative issue superscalar, 1–4 core SMP, 32 KB/32 KB L1, up to 4 MB L2, MMU + TrustZone | Variable | MMU | 2.5 DMIPS/MHz per core, 10 000 DMIPS @ 2 GHz on Performance Optimized TSMC 40G (dual core) | LG Optimus 2X, Motorola Atrix 4G, Motorola DROID BIONIC, Motorola Xoom, Pandaboard
Cortex-A | ARMv7-A | Cortex-A15 MPCore | Application profile, VFPv4 FPU, NEON, Thumb-2, Jazelle RCT/DBX, out-of-order speculative issue superscalar, Large Physical Address Extensions (LPAE), Hardware virtualization, 1–4 SMP cores, 32 KB/32 KB L1, up to 4 MB L2, MMU + TrustZone | Variable | MMU | - | -
Cortex-R | ARMv7-R | Cortex-R4(F) | Real-time profile, Thumb-2, (FPU), variable cache, MPU optional | Variable | MPU | 600 DMIPS @ 475 MHz | -
Cortex-M | ARMv6-M | Cortex-M0 | Microcontroller profile, Thumb-2 subset (16-bit Thumb instructions & BL, MRS, MSR, ISB, DSB, and DMB). Hardware multiply instruction optional | None | None | 0.9 DMIPS/MHz | -
Cortex-M | ARMv6-M | Cortex-M1 | FPGA targeted, Microcontroller profile, Thumb-2 subset (16-bit Thumb instructions & BL, MRS, MSR, ISB, DSB, and DMB), TCM (tightly coupled memory) optional. | None | None | Up to 136 DMIPS @ 170 MHz (0.8 DMIPS/MHz, MHz achievable FPGA-dependent) | -
Cortex-M | ARMv7-M | Cortex-M3 | Microcontroller profile, Thumb-2 only. Hardware divide instruction, no cache, MPU optional. | None | MPU | 1.25 DMIPS/MHz | -
Cortex-M | ARMv7-ME | Cortex-M4 | Microcontroller profile, both Thumb and Thumb-2, FPU. Hardware MAC, SIMD and divide instructions, MPU optional | None | MPU | 1.25 DMIPS/MHz | -
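The ARM architecture is briefly described below.
An ARM processor generally starts executing at address 0x00000000 and is in SVC (Supervisor) mode after reset. Through the system coprocessor it can be configured as little-endian (the least significant byte is stored at the lowest address) or big-endian (the most significant byte is stored at the lowest address). ARM uses Memory Mapped I/O (x86 uses I/O Mapped I/O, where the I/O space can only be accessed through the in/out instructions).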
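The two byte orders can be seen by laying out a 32-bit value in memory. The sketch below uses Python's standard `struct` module; the helper name `memory_image` is made up for this example.

```python
import struct


def memory_image(value, endian):
    """Return the 4 bytes of a 32-bit value in address order
    (index 0 is the lowest address)."""
    fmt = "<I" if endian == "little" else ">I"
    return struct.pack(fmt, value)


if __name__ == "__main__":
    value = 0x12345678
    for endian in ("little", "big"):
        image = memory_image(value, endian)
        print(endian, " ".join(f"{b:02X}" for b in image))
```

Little-endian places 0x78 at the lowest address, while big-endian places 0x12 there, which is why a memory dump of the same word looks reversed between the two settings.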
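ARM supports eight processor modes, listed below.
(Reference: ARMv7-AR Architecture Reference Manual)
Processor mode | xPSR mode encoding | Privilege | Description
USR | b10000 | Unprivileged | User mode. For example, applications on ARM Linux run in this mode.
FIQ | b10001 | Privileged | Fast interrupt mode.
IRQ | b10010 | Privileged | General interrupt handling.
SVC (Supervisor) | b10011 | Privileged | Supervisor (protected) mode. An RTOS that does not separate privilege levels, or the kernel mode of an OS that does, generally runs in this mode. A software interrupt triggered by the user through SWI (or SVC) also enters SVC mode (the Linux kernel, for instance, implements system calls with SWI).
MON (Monitor) | b10110 | Privileged | Exists only when the processor supports the Security Extensions. The SMC (Secure Monitor Call) instruction takes the system into Secure Mode; alternatively, the Secure Configuration Register can be set so that IRQ/FIQ/Abort are all handled in Secure Mode.
ABT (Abort) | b10111 | Privileged | Memory access fault mode (entered on a Data Abort or Prefetch Abort).
UND (Undefined) | b11011 | Privileged | Undefined instruction mode. When the processor meets an instruction it cannot decode, it first checks with the coprocessor whether it is a coprocessor instruction; if not, an exception is raised and this mode is entered. Software debuggers can also plant an undefined instruction to serve as a breakpoint.
SYS | b11111 | Privileged | System privileged mode. It shares the same registers as User mode (R0-R15/CPSR); the main difference is that User mode is unprivileged.
When debugging, the 5 bits of xPSR M[4:0] (in CPSR for the current mode, in SPSR for the previous one) tell you the current and previous processor modes, which helps infer what happened before and after a problem.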
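When reading a register dump, the mode field can be decoded mechanically. The following sketch maps M[4:0] to the mode names in the table above and also extracts the I, F and T bits (bits 7, 6 and 5 of the PSR); the function name `decode_psr` and the returned dictionary layout are choices made for this example.

```python
MODES = {
    0b10000: ("USR", False),
    0b10001: ("FIQ", True),
    0b10010: ("IRQ", True),
    0b10011: ("SVC", True),
    0b10110: ("MON", True),
    0b10111: ("ABT", True),
    0b11011: ("UND", True),
    0b11111: ("SYS", True),
}


def decode_psr(psr):
    """Decode the mode and a few control bits of a CPSR/SPSR value."""
    name, privileged = MODES.get(psr & 0x1F, ("unknown", None))
    return {
        "mode": name,
        "privileged": privileged,
        "irq_masked": bool(psr & (1 << 7)),   # I bit
        "fiq_masked": bool(psr & (1 << 6)),   # F bit
        "thumb": bool(psr & (1 << 5)),        # T bit
    }


if __name__ == "__main__":
    # Example value: SVC mode with IRQ and FIQ masked, ARM state.
    print(decode_psr(0x600000D3))
```

Decoding both CPSR and the banked SPSR this way shows, for example, that an Abort handler (CPSR in ABT) was entered from User mode (SPSR in USR).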
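An ARM processor has 31 general-purpose 32-bit registers (not counting the Monitor mode R13 and R14 on processors with the Security Extensions): R0-R15, R13/R14_svc, R13/R14_abt, R13/R14_und, R13/R14_irq and R8-R14_fiq. It also has 6 status registers: CPSR, SPSR_svc, SPSR_abt, SPSR_und, SPSR_irq and SPSR_fiq.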
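Register banking per mode ("-" means the mode has no banked copy of its own and uses the User mode register; System through Monitor are the Privileged Modes, FIQ through Monitor are the Exception Modes):
Description | Application View | User Mode | System Mode | FIQ Mode | IRQ Mode | Supervisor Mode | Abort Mode | Undefined Mode | Monitor Mode
Function argument #0 | R0 | R0_usr | - | - | - | - | - | - | -
Function argument #1 | R1 | R1_usr | - | - | - | - | - | - | -
Function argument #2 | R2 | R2_usr | - | - | - | - | - | - | -
Function argument #3 | R3 | R3_usr | - | - | - | - | - | - | -
- | R4 | R4_usr | - | - | - | - | - | - | -
- | R5 | R5_usr | - | - | - | - | - | - | -
- | R6 | R6_usr | - | - | - | - | - | - | -
- | R7 | R7_usr | - | - | - | - | - | - | -
- | R8 | R8_usr | - | R8_fiq | - | - | - | - | -
- | R9 | R9_usr | - | R9_fiq | - | - | - | - | -
- | R10 | R10_usr | - | R10_fiq | - | - | - | - | -
- | R11 | R11_usr | - | R11_fiq | - | - | - | - | -
- | R12 | R12_usr | - | R12_fiq | - | - | - | - | -
- | SP | SP_usr | - | SP_fiq | SP_irq | SP_svc | SP_abt | SP_und | SP_mon
- | LR | LR_usr | - | LR_fiq | LR_irq | LR_svc | LR_abt | LR_und | LR_mon
- | PC | PC | - | - | - | - | - | - | -
- | APSR | CPSR | - | - | - | - | - | - | -
- | - | - | - | SPSR_fiq | SPSR_irq | SPSR_svc | SPSR_abt | SPSR_und | SPSR_mon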
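Generally, 16-bit Thumb code only uses R0-R7 (3-bit register index), while the 32-bit ARM instruction set can use the full R0-R12. R13 (SP), R14 (LR) and R15 (PC) are used in every mode.
Apart from cores that support an interrupt vector table (e.g. Cortex M3), ARM cores generally support the following eight exception vectors (the sixth is Reserved). The V bit (bit 13) of the CP15 c1 register decides whether the vector table sits at the low address (V=0: 0x00000000-0x0000001C) or at the high address (V=1: 0xFFFF0000-0xFFFF001C). If a product is at an early stage of development and runs in an environment without MMU-based memory protection through User/Privileged modes, it is advisable to place the vector table at the high address, to avoid system errors caused by NULL pointers during development.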
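The relationship between the V bit and the vector addresses can be written down directly. This is a simplified sketch: it only models the V bit of the CP15 c1 (SCTLR) value and deliberately ignores anything else that can move the vectors on cores with the Security Extensions; the names `VECTOR_OFFSETS` and `vector_address` are illustrative.

```python
VECTOR_OFFSETS = {
    "reset": 0x00,
    "undefined": 0x04,
    "svc": 0x08,
    "prefetch_abort": 0x0C,
    "data_abort": 0x10,
    # 0x14 is reserved
    "irq": 0x18,
    "fiq": 0x1C,
}

SCTLR_V = 1 << 13  # V bit of CP15 c1


def vector_address(exception, sctlr):
    """Return the vector address for an exception, given a CP15 c1 value."""
    base = 0xFFFF0000 if sctlr & SCTLR_V else 0x00000000
    return base + VECTOR_OFFSETS[exception]


if __name__ == "__main__":
    for name in VECTOR_OFFSETS:
        low = vector_address(name, 0)
        high = vector_address(name, SCTLR_V)
        print(f"{name:<15} {low:#010x} {high:#010x}")
```

With the table at 0xFFFF0000, a stray write through a NULL pointer lands on ordinary memory at address 0 instead of silently corrupting a vector.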
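Once an exception occurs, the current CPSR is first saved into the SPSR of the mode that corresponds to the exception (it can be used to inspect the system state before the exception). Then the base LR value for that exception plus the exception-dependent offset is stored in LR. The exception-dependent offset for each exception is shown below: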
Exception | Base LR Value | Offset for processor state of ARM | Offset for processor state of Thumb or ThumbEE | Offset for processor state of Jazelle
Undefined Instruction | Address of the undefined instruction | 4 | 2 | 2 or 4
SVC | Address of SVC instruction | 4 | 2 | X
SMC | Address of SMC instruction | 4 | 4 | X
Prefetch Abort | Address of aborted instruction fetch | 4 | 4 | 4
Data Abort | Address of instruction that generated the abort | 8 | 8 | 8
IRQ/FIQ | Address of next instruction to execute | 4 | 4 | 4
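The table translates into a simple lookup. The sketch below covers only the ARM and Thumb columns (the Jazelle column is left out), and the names `LR_OFFSETS` and `link_register_value` are invented for this example.

```python
# Exception-dependent offsets, keyed by exception and instruction set state.
LR_OFFSETS = {
    "undefined":      {"arm": 4, "thumb": 2},
    "svc":            {"arm": 4, "thumb": 2},
    "smc":            {"arm": 4, "thumb": 4},
    "prefetch_abort": {"arm": 4, "thumb": 4},
    "data_abort":     {"arm": 8, "thumb": 8},
    "irq":            {"arm": 4, "thumb": 4},
    "fiq":            {"arm": 4, "thumb": 4},
}


def link_register_value(exception, base_address, state="arm"):
    """Return the value written to LR on exception entry.

    base_address is the "Base LR Value" column of the table, e.g. the
    address of the aborting instruction for a Data Abort."""
    return (base_address + LR_OFFSETS[exception][state]) & 0xFFFFFFFF


if __name__ == "__main__":
    # A load at 0x8000 in ARM state takes a Data Abort.
    lr = link_register_value("data_abort", 0x8000, "arm")
    print(hex(lr))        # LR_abt
    print(hex(lr - 8))    # address of the faulting instruction
```

In a crash dump, subtracting the offset from the banked LR gives back the address in the "Base LR Value" column, which is usually the first thing to look up in the map file.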
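After that, the processor sets the PC according to the location of the exception handler, updates CPSR Mode[4:0] to the mode of the exception taken, and masks the corresponding interrupts to prevent re-entry (IRQ is always disabled; for FIQ, Secure Monitor and Reset, both IRQ and FIQ are disabled). The TE bit (bit 30) of the CP15 c1 register decides which instruction set state the exception handler runs in (TE=0: the handler runs in the ARM instruction set; TE=1: in the Thumb instruction set). CPSR E (bit 9) of the exception mode decides the data endianness while the exception is handled, and CPSR IT[7:0] is set to 0. The exception handler then starts executing.
If we want to put the ARM core into a suspended, low-power state (similar to clock gating a device, without switching off power through the PMIC), we can use the WFI (Wait For Interrupt) instruction: the core waits for an external interrupt such as IRQ or FIQ (on a product, a phone key press or a Real-Time Clock interrupt) to wake the processor and resume normal execution. Alternatively, the System Controller can switch off the processor's power (entering Doze Mode); compared with WFI, however, the processor then has to be re-initialized and the relevant state has to be saved beforehand in TCM (Tightly-Coupled Memory). Which approach to use has to be evaluated against what the product needs to achieve.
Typically, WFI is placed in the system's Idle Task. When no work is waiting to run, the system hands control to the lowest-priority Idle Task; the Idle Task then checks when the system will next wake up, decides whether to put the external memory into a power-saving mode, and lets the processor enter the suspended low-power state through WFI.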
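The exception types, their priorities and their vector addresses are listed below.
(Reference: ARMv7-AR Architecture Reference Manual).
Vector address | Exception type | Priority | Processor mode | Processor actions when the exception is taken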
0x0000-0000 (0xFFFF-0000) | Reset | 1 | SVC | TakeReset()
    // Enter Supervisor mode and (if relevant) Secure state, and reset CP15. This affects the banked versions and values of various registers accessed later in the code. Also reset other system components.
    CPSR.M = '10011'; // Supervisor mode
    if HaveSecurityExt() then SCR.NS = '0';
    ResetCP15Registers();
    ResetDebugRegisters();
    if HaveAdvSIMDorVFP() then FPEXC.EN = '0'; SUBARCHITECTURE_DEFINED further resetting;
    if HaveThumbEE() then TEECR.XED = '0';
    if HaveJazelle() then JMCR.JE = '0'; SUBARCHITECTURE_DEFINED further resetting;
    // Further CPSR changes: all interrupts disabled, IT state reset, instruction set and endianness according to the SCTLR values produced by the above call to ResetCP15Registers().
    CPSR.I = '1'; CPSR.F = '1'; CPSR.A = '1';
    CPSR.IT = '00000000';
    CPSR.J = '0'; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb
    CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian
    // All registers, bits and fields not reset by the above pseudocode or by the BranchTo() call below are UNKNOWN bitstrings after reset. In particular, the return information registers R14_svc and SPSR_svc have UNKNOWN values, so that it is impossible to return from a reset in an architecturally defined way. Branch to Reset vector.
    BranchTo(ExcVectorBase() + 0);
0x0000-0004 (0xFFFF-0004) | Undefined Instruction | 6 | UND | TakeUndefInstrException()
    // Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC minus 2 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required return address offsets of 2 or 4 respectively.
    new_lr_value = if CPSR.T == '1' then PC-2 else PC-4;
    new_spsr_value = CPSR;
    // Enter Undefined ('11011') mode, and ensure Secure state if initially in Monitor ('10110') mode. This affects the banked versions of various registers accessed later in the code.
    if CPSR.M == '10110' then SCR.NS = '0';
    CPSR.M = '11011';
    // Write return information to registers, and make further CPSR changes: IRQs disabled, IT state reset, instruction set and endianness to SCTLR-configured values.
    SPSR[] = new_spsr_value;
    R[14] = new_lr_value;
    CPSR.I = '1';
    CPSR.IT = '00000000';
    CPSR.J = '0'; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb
    CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian
    // Branch to Undefined Instruction vector.
    BranchTo(ExcVectorBase() + 4);
    Note: under the ARMv7 architecture, an Undefined Instruction can also be made to behave like a NOP; the processor then does not raise an exception and simply ignores the instruction.
0x0000-0008 (0xFFFF-0008) | Software interrupt: SWI, Secure Monitor Call (SMC) | 6 | SVC / SMC Mode | TakeSVCException(), TakeSMCException()
    TakeSVCException()
    // Determine return information. SPSR is to be the current CPSR, after changing the IT[] bits to give them the correct values for the following instruction, and LR is to be the current PC minus 2 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the next instruction (the SVC instruction having size 2 or 4 bytes respectively).
    ITAdvance();
    new_lr_value = if CPSR.T == '1' then PC-2 else PC-4;
    new_spsr_value = CPSR;
    // Enter Supervisor ('10011') mode, and ensure Secure state if initially in Monitor ('10110') mode. This affects the banked versions of various registers accessed later in the code.
    if CPSR.M == '10110' then SCR.NS = '0';
    CPSR.M = '10011';
    // Write return information to registers, and make further CPSR changes: IRQs disabled, IT state reset, instruction set and endianness to SCTLR-configured values.
    SPSR[] = new_spsr_value;
    R[14] = new_lr_value;
    CPSR.I = '1';
    CPSR.IT = '00000000';
    CPSR.J = '0'; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb
    CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian
    // Branch to SVC vector.
    BranchTo(ExcVectorBase() + 8);

    TakeSMCException()
    // Determine return information. SPSR is to be the current CPSR, after changing the IT[] bits to give them the correct values for the following instruction, and LR is to be the current PC minus 0 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the next instruction (with the SMC instruction always being 4 bytes in length).
    ITAdvance();
    new_lr_value = if CPSR.T == '1' then PC else PC-4;
    new_spsr_value = CPSR;
    // Enter Monitor ('10110') mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code.
    if CPSR.M == '10110' then SCR.NS = '0';
    CPSR.M = '10110';
    // Write return information to registers, and make further CPSR changes: interrupts disabled, IT state reset, instruction set and endianness to SCTLR-configured values.
    SPSR[] = new_spsr_value;
    R[14] = new_lr_value;
    CPSR.I = '1'; CPSR.F = '1'; CPSR.A = '1';
    CPSR.IT = '00000000';
    CPSR.J = '0'; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb
    CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian
    // Branch to SMC vector.
    BranchTo(MVBAR + 8);
0x0000-000C (0xFFFF-000C) | Instruction memory access fault: Prefetch Abort | 5 | ABT | TakePrefetchAbortException()
    // Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC minus 0 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the current instruction plus 4.
    new_lr_value = if CPSR.T == '1' then PC else PC-4;
    new_spsr_value = CPSR;
    // Determine whether this is an external abort to be trapped to Monitor mode.
    trap_to_monitor = HaveSecurityExt() && SCR.EA == '1' && IsExternalAbort();
    // Enter Abort ('10111') or Monitor ('10110') mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code.
    if CPSR.M == '10110' then SCR.NS = '0';
    CPSR.M = if trap_to_monitor then '10110' else '10111';
    // Write return information to registers, and make further CPSR changes: IRQs disabled, other interrupts disabled if appropriate, IT state reset, instruction set and endianness to SCTLR-configured values.
    SPSR[] = new_spsr_value;
    R[14] = new_lr_value;
    CPSR.I = '1';
    if trap_to_monitor then
        CPSR.F = '1'; CPSR.A = '1';
    else
        if !HaveSecurityExt() || SCR.NS == '0' || SCR.AW == '1' then CPSR.A = '1';
    CPSR.IT = '00000000';
    CPSR.J = '0'; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb
    CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian
    // Branch to correct Prefetch Abort vector.
    if trap_to_monitor then
        BranchTo(MVBAR + 12);
    else
        BranchTo(ExcVectorBase() + 12);
0x0000-0010 (0xFFFF-0010) | Data memory access fault: Data Abort | 2 | ABT | TakeDataAbortException()
    // Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC plus 4 for Thumb or 0 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the current instruction plus 8. For an asynchronous abort, the PC and CPSR are considered to have already moved on to their values for the instruction following the instruction boundary at which the exception occurred.
    new_lr_value = if CPSR.T == '1' then PC+4 else PC;
    new_spsr_value = CPSR;
    // Determine whether this is an external abort to be trapped to Monitor mode.
    trap_to_monitor = HaveSecurityExt() && SCR.EA == '1' && IsExternalAbort();
    // Enter Abort ('10111') or Monitor ('10110') mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code.
    if CPSR.M == '10110' then SCR.NS = '0';
    CPSR.M = if trap_to_monitor then '10110' else '10111';
    // Write return information to registers, and make further CPSR changes: IRQs disabled, other interrupts disabled if appropriate, IT state reset, instruction set and endianness to SCTLR-configured values.
    SPSR[] = new_spsr_value;
    R[14] = new_lr_value;
    CPSR.I = '1';
    if trap_to_monitor then
        CPSR.F = '1'; CPSR.A = '1';
    else
        if !HaveSecurityExt() || SCR.NS == '0' || SCR.AW == '1' then CPSR.A = '1';
    CPSR.IT = '00000000';
    CPSR.J = '0'; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb
    CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian
    // Branch to correct Data Abort vector.
    if trap_to_monitor then
        BranchTo(MVBAR + 16);
    else
        BranchTo(ExcVectorBase() + 16);
0x0000-0014 (0xFFFF-0014) | Reserved, unused
0x0000-0018 (0xFFFF-0018) | External general interrupt: IRQ | 4 | IRQ | TakeIRQException()
    // Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC minus 0 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the instruction boundary at which the interrupt occurred plus 4. For this purpose, the PC and CPSR are considered to have already moved on to their values for the instruction following that boundary.
    new_lr_value = if CPSR.T == '1' then PC else PC-4;
    new_spsr_value = CPSR;
    // Determine whether IRQs are trapped to Monitor mode.
    trap_to_monitor = HaveSecurityExt() && SCR.IRQ == '1';
    // Enter IRQ ('10010') or Monitor ('10110') mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code.
    if CPSR.M == '10110' then SCR.NS = '0';
    CPSR.M = if trap_to_monitor then '10110' else '10010';
    // Write return information to registers, and make further CPSR changes: IRQs disabled, other interrupts disabled if appropriate, IT state reset, instruction set and endianness to SCTLR-configured values.
    SPSR[] = new_spsr_value;
    R[14] = new_lr_value;
    CPSR.I = '1';
    if trap_to_monitor then
        CPSR.F = '1'; CPSR.A = '1';
    else
        if !HaveSecurityExt() || SCR.NS == '0' || SCR.AW == '1' then CPSR.A = '1';
    CPSR.IT = '00000000';
    CPSR.J = '0'; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb
    CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian
    // Branch to correct IRQ vector.
    if trap_to_monitor then
        BranchTo(MVBAR + 24);
    elsif SCTLR.VE == '1' then
        IMPLEMENTATION_DEFINED branch to an IRQ vector;
    else
        BranchTo(ExcVectorBase() + 24);
0x0000-001C (0xFFFF-001C) | Fast interrupt | FIQ | Priority 3 | Mode on entry: FIQ
TakeFIQException()
// Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC minus 0 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the instruction boundary at which the interrupt occurred plus 4. For this purpose, the PC and CPSR are considered to have already moved on to their values for the instruction following that boundary.
new_lr_value = if CPSR.T == ‘1’ then PC else PC-4;
new_spsr_value = CPSR;
// Determine whether FIQs are trapped to Monitor mode.
trap_to_monitor = HaveSecurityExt() && SCR.FIQ == ‘1’;
// Enter FIQ (‘10001’) or Monitor (‘10110’) mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code.
if CPSR.M == ‘10110’ then SCR.NS = ‘0’;
CPSR.M = if trap_to_monitor then ‘10110’ else ‘10001’;
// Write return information to registers, and make further CPSR changes: IRQs disabled, other interrupts disabled if appropriate, IT state reset, instruction set and endianness to SCTLR-configured values.
SPSR[] = new_spsr_value;
R[14] = new_lr_value;
CPSR.I = ‘1’;
if trap_to_monitor then
CPSR.F = ‘1’; CPSR.A = ‘1’;
else
if !HaveSecurityExt() || SCR.NS == ‘0’ || SCR.FW == ‘1’ then CPSR.F = ‘1’;
if !HaveSecurityExt() || SCR.NS == ‘0’ || SCR.AW == ‘1’ then CPSR.A = ‘1’;
CPSR.IT = ‘00000000’;
CPSR.J = ‘0’; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb
CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian
// Branch to correct FIQ vector.
if trap_to_monitor then
BranchTo(MVBAR + 28);
elsif SCTLR.VE == ‘1’ then
IMPLEMENTATION_DEFINED branch to an FIQ vector;
else
BranchTo(ExcVectorBase() + 28);
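As for the behaviour of Abort mode: when an abort exception is taken, IRQs are disabled while FIQs stay enabled. Depending on the SoC you are developing on, if other interrupt sources (a timer, for example) are wired to FIQ, the abort handler needs to disable FIQ right away. Otherwise another interrupt can re-enter while the core is in Abort mode, which makes it hard to pin down the real problem when analysing a system failure.
ARM's cache, MMU and MPU management is implemented through coprocessor 15 (CP15). When the processor's PC points to an invalid memory location, a Prefetch Abort is triggered; the processor then updates the fault status code in the CP15 IFSR (Instruction Fault Status Register) and records the memory address that triggered the Prefetch Abort in the IFAR (Instruction Fault Address Register).
The IFSR is a 32-bit read/write register that can only be accessed in privileged modes. Its format is as follows: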
Bits | Field | Description
31-13 | UNK/SBZP | Unknown on reads, Should-Be-Zero-or-Preserved on writes (applies to bits [31:13,11,9:4]).
12 | ExT | External abort type.
11 | 0 | Reserved (one of the UNK/SBZP bits listed above).
10 | FS[4] | Fault status bits.
9-4 | UNK/SBZP | Unknown on reads, Should-Be-Zero-or-Preserved on writes.
3-0 | FS[3:0] | Fault status bits.
The IFSR can be read and written with CP15 instructions, for example:
MRC p15,0,<Rt>,c5,c0,1 ; Read CP15 Instruction Fault Status Register
MCR p15,0,<Rt>,c5,c0,1 ; Write CP15 Instruction Fault Status Register
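As a small illustration of the IFSR layout described above, the following Python sketch splits a raw IFSR value (for example one captured from a register dump) into its fields. The function name `decode_ifsr` and the dictionary keys are hypothetical, chosen only for this note; the bit positions are the ones listed in the table.

```python
def decode_ifsr(ifsr: int) -> dict:
    """Split a raw 32-bit IFSR value into the fields from the table above."""
    ext = (ifsr >> 12) & 0x1          # bit 12: ExT, external abort type
    fs_high = (ifsr >> 10) & 0x1      # bit 10: FS[4]
    fs_low = ifsr & 0xF               # bits 3-0: FS[3:0]
    return {"ExT": ext, "FS": (fs_high << 4) | fs_low}


if __name__ == "__main__":
    # Hypothetical raw value: FS[4] set and FS[3:0] = 0b0101
    print(decode_ifsr(0x00000405))
```

The five fault status bits are not contiguous in the register, which is why FS[4] has to be moved down from bit 10 before it is combined with FS[3:0].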
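The IFAR is a 32-bit read/write register that can only be accessed in privileged modes. On a Prefetch Abort, the IFAR holds the memory address at which the abort occurred.
The IFAR can be read and written with CP15 instructions, for example: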
MRC p15,0,<Rt>,c6,c0,2 ; Read CP15 Instruction Fault Address Register
MCR p15,0,<Rt>,c6,c0,2 ; Write CP15 Instruction Fault Address Register
The DFSR is a 32-bit read/write register that can only be accessed in privileged modes. Its format is as follows:
Bits | Field | Description
31-13 | UNK/SBZP | Unknown on reads, Should-Be-Zero-or-Preserved on writes (applies to bits [31:13,9:8]).
12 | ExT | External abort type.
11 | WnR | Write not Read bit. Indicates whether the abort was caused by a write or a read access: 0 = abort caused by a read access, 1 = abort caused by a write access. For faults on CP15 cache maintenance operations, including the VA to PA translation operations, this bit always returns a value of 1.
10 | FS[4] | Fault status bits.
9-8 | b00 | Reserved (one of the UNK/SBZP bits listed above).
7-4 | Domain | The domain of the fault address.
3-0 | FS[3:0] | Fault status bits.
The DFSR can be read and written with CP15 instructions, for example:
MRC p15,0,<Rt>,c5,c0,0 ; Read CP15 Data Fault Status Register
MCR p15,0,<Rt>,c5,c0,0 ; Write CP15 Data Fault Status Register
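The DFSR fields can be pulled apart the same way. The sketch below is only an illustration of the bit layout in the table above; `decode_dfsr` and its key names are hypothetical, and the meaning of individual FS codes is deliberately left out because it is not listed in this note.

```python
def decode_dfsr(dfsr: int) -> dict:
    """Split a raw 32-bit DFSR value into the fields from the table above."""
    wnr = (dfsr >> 11) & 0x1                      # bit 11: 1 = write, 0 = read
    fs = (((dfsr >> 10) & 0x1) << 4) | (dfsr & 0xF)  # FS[4] (bit 10) + FS[3:0]
    return {
        "ExT": (dfsr >> 12) & 0x1,                # bit 12: external abort type
        "WnR": wnr,
        "access": "write" if wnr else "read",
        "FS": fs,
        "Domain": (dfsr >> 4) & 0xF,              # bits 7-4: domain of the fault address
    }


if __name__ == "__main__":
    # Hypothetical raw value: write access, FS = 0b00101, domain 0
    print(decode_dfsr(0x00000805))
```

In an abort handler this kind of decoding is typically paired with the DFAR value, so that a log line shows both the faulting address and whether the access was a read or a write.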
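The DFAR is a 32-bit read/write register that can only be accessed in privileged modes. On a Data Abort, the DFAR holds the memory address that was being accessed when the abort occurred.
The DFAR can be read and written with CP15 instructions, for example: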
MRC p15,0,<Rt>,c6,c0,0 ; Read CP15 Data Fault Address Register
MCR p15,0,<Rt>,c6,c0,0 ; Write CP15 Data Fault Address Register
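With the arrival of the ARMv7 architecture, the instruction sets ARM supports are ARMv32 (the 32-bit ARM instruction set), Thumb, Thumb2, ThumbEE (Thumb Execution Environment) and Jazelle. The J and T bits of the CPSR (Current Program Status Register), at bit 24 and bit 5 respectively, show which state the processor is in, as listed below (see the ARMv7-AR Architecture Reference Manual).
J | T | Instruction set state
0 | 0 | ARM
0 | 1 | Thumb
1 | 0 | Jazelle
1 | 1 | ThumbEE
See the pseudocode below (from the ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition). Among these four instruction set states, the only switch that cannot be made directly is from ThumbEE to ARMv32; all other states can be switched directly as needed, and the switch is done with a Branch Exchange instruction.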
// CurrentInstrSet()
// =================
InstrSet CurrentInstrSet()
    case ISETSTATE of
        when ‘00’ result = InstrSet_ARM;
        when ‘01’ result = InstrSet_Thumb;
        when ‘10’ result = InstrSet_Jazelle;
        when ‘11’ result = InstrSet_ThumbEE;
    return result;
// SelectInstrSet()
// ================
SelectInstrSet(InstrSet iset)
    case iset of
        when InstrSet_ARM
            if CurrentInstrSet() == InstrSet_ThumbEE then
                UNPREDICTABLE;
            else
                ISETSTATE = ‘00’;
        when InstrSet_Thumb
            ISETSTATE = ‘01’;
        when InstrSet_Jazelle
            ISETSTATE = ‘10’;
        when InstrSet_ThumbEE
            ISETSTATE = ‘11’;
    return;
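The pseudocode above can be mirrored in a few lines of Python, which is handy when decoding CPSR values from a debugger log. This is only a sketch: the names `InstrSet`, `current_instr_set` and `select_instr_set` are made up for this note, and the UNPREDICTABLE case is modelled as an exception purely for illustration.

```python
from enum import Enum

J_BIT = 1 << 24   # CPSR bit 24
T_BIT = 1 << 5    # CPSR bit 5


class InstrSet(Enum):
    ARM = 0b00
    THUMB = 0b01
    JAZELLE = 0b10
    THUMBEE = 0b11


def current_instr_set(cpsr: int) -> InstrSet:
    """ISETSTATE is the pair J:T taken from the CPSR."""
    j = 1 if cpsr & J_BIT else 0
    t = 1 if cpsr & T_BIT else 0
    return InstrSet((j << 1) | t)


def select_instr_set(cpsr: int, target: InstrSet) -> int:
    """Return a new CPSR value with J:T set for the requested state."""
    if target is InstrSet.ARM and current_instr_set(cpsr) is InstrSet.THUMBEE:
        # The pseudocode marks this transition as UNPREDICTABLE.
        raise ValueError("ThumbEE -> ARM is UNPREDICTABLE")
    cleared = cpsr & ~(J_BIT | T_BIT)
    j = J_BIT if target.value & 0b10 else 0
    t = T_BIT if target.value & 0b01 else 0
    return cleared | j | t


if __name__ == "__main__":
    print(current_instr_set(0x00000030))          # T set, J clear
    print(hex(select_instr_set(0x10, InstrSet.THUMBEE)))
```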
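The handling depends on the version of the ARM core. ARMv4T cores, which support Thumb, translate each 16-bit Thumb instruction into the corresponding 32-bit ARM instruction in the Decode stage of the three-stage pipeline for further processing. ARMv7 designs that introduce a superscalar pipeline fetch, according to the current instruction set state, two 32-bit ARM instructions or two 16-bit Thumb instructions at a time for parallel processing further down the pipeline.
When designing system software, we usually evaluate which instruction set to use on the processor at hand so that the product benefits most. The ARM instruction set gives the highest performance, but because every instruction is a fixed 32 bits, the compiled code is larger. Thumb instructions are a fixed 16 bits; the compiled code is only about 70% of the size of the ARM build, and its performance is also only about 70% of ARM execution. If your processor supports Thumb2 and the module being developed is a video codec, the ARM instruction set is the better choice for good audio and video results; if the module is a user interface or has modest performance requirements, choosing Thumb or Thumb2 is a way to save memory.
For a comparison of ARM, Thumb and Thumb2 performance, see the article Improving ARM Code Density and Performance (http://www.cs.uiuc.edu/class/fa05/cs433ug/PROCESSORS/Thumb2.pdf) by Richard Phelan of ARM. For the same functionality implemented in C, a Thumb2 build can reach up to 98% of the performance of the ARM build, while the code needs only 74% of the memory of the original ARM code. As another comparison, take 1MB of mixed ARM+Thumb code in which 200 KB is ARM code and 800 KB is Thumbv4 code. If everything is rebuilt as Thumb2, the ARM part shrinks from 200 KB to 150 KB and the Thumbv4 part from 800 KB to 760 KB, saving about 90 KB of code space.
These figures still depend on the processor version in use (which determines the Thumb and Thumb2 instruction set versions available) and on the options passed to the compiler. With RVCT, for example, whatever level from -O0 to -O3 the user selects, the default is to compile with -Ospace; if the user selects -Otime, different optimisation rules apply (including whether auto inline is used), and the resulting code size differs as well. Basically, though, the inherent advantage of Thumb2 is its mixed 16/32-bit execution model together with a fairly rich instruction set (the completeness of Thumb2 varies between ARM cores; Cortex M3, for example, supports only a Thumb2 subset). For code that is not especially performance-critical it is a good choice.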
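The 1MB example above is simple arithmetic, and the sketch below just replays it so the numbers can be checked or swapped for your own map-file figures. The function name `thumb2_savings_kb` is hypothetical; the input sizes are the ones quoted in the text.

```python
def thumb2_savings_kb(arm_kb: int, thumb_kb: int,
                      arm_after_kb: int, thumb_after_kb: int):
    """Return (KB saved, new size as a fraction of the old size)."""
    before = arm_kb + thumb_kb
    after = arm_after_kb + thumb_after_kb
    return before - after, after / before


if __name__ == "__main__":
    # 200 KB of ARM code -> 150 KB, 800 KB of Thumbv4 code -> 760 KB
    saved, ratio = thumb2_savings_kb(200, 800, 150, 760)
    print(f"saved {saved} KB, image is {ratio:.0%} of the original size")
```

Most of the saving comes from the former ARM code (50 KB out of 200 KB), while the former Thumb code shrinks by only 40 KB out of 800 KB, which matches the point that Thumb2 mainly helps the parts that previously had to be built as 32-bit ARM.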
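In addition, ARMv32 instructions have a fixed length of 32 bits, which encode the condition, the operation code (opcode), whether the CPSR is affected, and the destination and operand registers. The general format of the ARMv32 instruction set is shown below (reference: ARMv7-AR Architecture Reference Manual.pdf).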
ARM instruction class | Bit pattern (bit 31 down to bit 0) | Example instructions
Data Processing (Registers) | Cond. 000 op1 op2 op3 0 | AND,EOR,SUB,RSB,ADD,ADC,SBC,RSC,TST,TEQ,CMP,CMN,ORR,MOV,LSL,LSR,ASR,RRX,ROR,BIC,MVN
Data Processing (Register-shifted register) | Cond. 000 op1 0 op2 1 | AND,EOR,SUB,RSB,ADD,ADC,SBC,RSC,TST,TEQ,CMP,CMN,ORR,LSL,LSR,ASR,ROR,BIC,MVN
Data Processing (Immediate) | Cond. 001 op Rn | AND,EOR,SUB,ADR,RSB,ADD,ADR,ADC,SBC,RSC,TST,TEQ,CMP,CMN,ORR,MOV,BIC,MVN
Multiply and multiply-accumulate | Cond. 0000 op 1001 | MUL,MLA,UMAAL,MLS,UMULL,UMLAL,SMULL,SMLAL
Saturating addition and subtraction | Cond. 00010 op 0 0101 | QADD,QSUB,QDADD,QDSUB
Halfword multiply and multiply-accumulate | Cond. 00010 op1 0 1 op 0 | SMLABB,SMLABT,SMLATB,SMLATT,SMLAWB,SMLAWT,SMULWB,SMULWT,SMLALBB,SMLALBT,SMLALTB,SMLALTT,SMULBB,SMULBT,SMULTB,SMULTT
Extra load/store instructions | Cond. 000 op1 Rn 1 op2 1 | STRH,LDRH,LDRD,LDRSB,STRD,LDRSH
Extra load/store instructions (unprivileged) | Cond. 0000 1 op Rt 1 op2 1 | STRHT,LDRHT,LDRSBT,LDRSHT
Synchronization primitives | Cond. 0001 op 1001 | SWP,SWPB,STREX,LDREX,STREXD,LDREXD,STREXB,LDREXB,STREXH,LDREXH
MSR (immediate) and hints | Cond. 00110 op 10 op1 op2 | NOP,YIELD,WFE,WFI,SEV,DBG,MSR
Miscellaneous instructions | Cond. 00010 op 0 op1 0 op2 | MRS,MSR,BX,CLZ,BXJ,BLX,BKPT,SMC
Load/Store word and unsigned byte | Cond. 01 A op1 Rn B | STR,STRT,LDR,LDRT,STRB,STRBT,LDRB,LDRBT
Media instructions | Cond. 011 op1 Rd op2 1 Rn | USAD8,USADA8,SBFX,BFC,BFI,UBFX
Parallel addition and subtraction, signed | Cond. 011000 op1 op2 1 | SADD16,SASX,SSAX,SSUB16,SADD8,SSUB8,QADD16,QASX,QSUB16,QADD8,QSUB8,SHADD16,SHASX,SHSAX,SHSUB16,SHADD8,SHSUB8
Parallel addition and subtraction, unsigned | Cond. 011001 op1 op2 1 | UADD16,UASX,USAX,USUB16,UADD8,USUB8,UQADD16,UQASX,UQSAX,UQSUB16,UQADD8,UQSUB8,UHADD16,UHASX,UHSAX,UHSUB16,UHADD8,UHSUB8
Packing, unpacking, saturation, and reversal | Cond. 01101 op1 A op2 1 | PKH,SSAT,USAT,SXTAB16,SEL,SSAT16,SXTAB,SXTB,REV,SXTAH,SXTH,REV16,UXTAB16,UXTB16,USAT16,UXTAB,UXTB,RBIT,UXTAH,UXTH,REVSH
Signed multiplies | Cond. 01110 op1 A op2 1 | SMLAD,SMUAD,SMLSD,SMUSD,SMLALD,SMLSLD,SMMLA,SMMUL,SMMLS
Branch, branch with link, and block data transfer | Cond. 10 op Rn R | STMDA,STMED,LDMDA,LDMFA,STM,STMIA,STMEA,LDMDB,LDMEA,STMIB,STMFA,LDMIB,LDMED,LDM,B,BL,BLX
Supervisor call, and coprocessor instructions | Cond. 11 op1 Rn coproc op | STC,STC2,LDC,LDC2,MCRR,MCRR2,MRRC,MRRC2,CDP,CDP2,MCR,MCR2,MRC,MRC2,SVC (previously SWI)
Unconditional instructions | 1111 op1 Rn op | SRS,RFE,BL,BLX,LDC,LDC2,STC,STC2,MCRR,MCRR2,MRRC,MRRC2,CDP,CDP2,MCR,MCR2,MRC,MRC2
Miscellaneous instructions, memory hints, and Advanced SIMD instructions | 11110 op1 Rn op2 | CPS,SETEND,PLI,PLD,PLDW,CLREX,DSB,DMB,ISB
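In the 32-bit ARM instruction set, the top 4 bits of each instruction are the execution condition code, summarised below for reference.
Cond. | Meaning | Flag values tested in the CPSR
b0000 | EQ (Equal) | Z set
b0001 | NE (Not equal) | Z clear
b0010 | CS or HS (Higher or same (unsigned >=)) | C set
b0011 | CC or LO (Lower (unsigned <)) | C clear
b0100 | MI (Negative) | N set
b0101 | PL (Positive or zero) | N clear
b0110 | VS (Overflow) | V set
b0111 | VC (No overflow) | V clear
b1000 | HI (Higher (unsigned >)) | C set and Z clear
b1001 | LS (Lower or same (unsigned <=)) | C clear or Z set
b1010 | GE (Signed >=) | N and V the same
b1011 | LT (Signed <) | N and V differ
b1100 | GT (Signed >) | Z clear, N and V the same
b1101 | LE (Signed <=) | Z set, or N and V differ
b1110 | AL | Always executed (unconditional)
b1111 | NV | The instruction is not executed (in later architecture versions this encoding is used for the unconditional instruction classes listed in the table above)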
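The condition table above translates directly into a small lookup. The following Python sketch evaluates a 4-bit condition field against the N, Z, C and V flags; the function name `condition_passed` is borrowed loosely from the style of the architecture pseudocode, but this implementation is only an illustration written for this note, and it treats b1111 as "never" exactly as the table does.

```python
def condition_passed(cond: int, n: bool, z: bool, c: bool, v: bool) -> bool:
    """Evaluate a 4-bit ARM condition field against the N, Z, C, V flags."""
    table = {
        0b0000: z,                    # EQ
        0b0001: not z,                # NE
        0b0010: c,                    # CS / HS
        0b0011: not c,                # CC / LO
        0b0100: n,                    # MI
        0b0101: not n,                # PL
        0b0110: v,                    # VS
        0b0111: not v,                # VC
        0b1000: c and not z,          # HI
        0b1001: (not c) or z,         # LS
        0b1010: n == v,               # GE
        0b1011: n != v,               # LT
        0b1100: (not z) and n == v,   # GT
        0b1101: z or n != v,          # LE
        0b1110: True,                 # AL
        0b1111: False,                # NV, following the table above
    }
    return table[cond & 0xF]


if __name__ == "__main__":
    # After CMP r0, #0 with r0 == 0 the Z flag is set, so an EQ-conditional
    # instruction such as STREXEQ would execute:
    print(condition_passed(0b0000, n=False, z=True, c=True, v=False))
```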
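Unlike ARMv32, whose instructions are all a fixed 32 bits, Thumb instructions are a fixed 16 bits, while Thumb2 provides both 16-bit and 32-bit instruction formats and executes faster than Thumb. In compiled code, if the five bits 15-11 of a halfword are 0b11101, 0b11110 or 0b11111, the halfword starts a 32-bit Thumb2 instruction. The general format of the Thumb/Thumb2 instruction sets is shown below (reference: ARMv7-AR Architecture Reference Manual.pdf).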
Thumb/Thumb2 instruction class | Bit pattern (first 16 bits, bit 15 down to bit 0, followed by the second 16 bits where present) | Example instructions
Shift (immediate), add, subtract, move and compare | 00 Opcode | LSL,LSR,ASR,ADD,SUB,MOV,CMP
Data Processing | 010000 Opcode | AND,EOR,LSL,LSR,ASR,ADC,SBC,ROR,TST,RSB,CMP,CMN,ORR,MUL,BIC,MVN
Special data instructions and branch and exchange | 010001 Opcode | ADD,CMP,MOV,BX,BLX
Load/store single data item | 0101 opB | STR,STRH,STRB,LDRSB,LDR,LDRH,LDRB,LDRSH
Load/store single data item | 0110 opB | STR,LDR
Load/store single data item | 0111 opB | STRB,LDRB
Load/store single data item | 1000 opB | STRH,LDRH
Load/store single data item | 1001 opB | STR,LDR
Miscellaneous 16-bit instructions | 1011 Opcode | SETEND,CPS,ADD,SUB,CBNZ,SXTH,SXTB,UXTH,UXTB,CBNZ,CBZ,PUSH,REV,REV16,REVSH,POP,BKPT
If-then and hints | 10111111 opA opB | IT,NOP,YIELD,WFE,WFI,SEV
Conditional branch and supervisor call | 1101 Opcode | B,SVC (previously SWI)
Data processing (modified immediate) | 111100 op S Rn 0 Rd | AND,TST,BIC,ORR,MOV,ORN,MVN,EOR,TEQ,ADD,CMN,ADC,SBC,SUB,CMP,RSB
Data processing (plain binary immediate) | 111101 op Rn 0 | ADD,ADR,MOV,SUB,ADR,MOVT,SSAT,SSAT16,SBFX,BFI,BFC,USAT,USAT16,UBFX
Branches and miscellaneous control | 11110 op 1 op1 op2 | B,MSR,BXJ,SUBS,SMC (previously SMI),BL,BLX
Change Processor State, and hints | 111100111010100 op1 op2 | CPS,NOP,YIELD,WFE,WFI,SEV,DBG
Miscellaneous control instructions | 111100111011100 op | ENTERX,LEAVEX,CLREX,DSB,DMB,ISB
Load/Store Multiple | 1110100 op 0 L Rn | SRS,RFE,STM,STMIA,STMEA,LDM,LDMIA,LDMFD,POP,STMDB,STMFD,PUSH,LDMDB,LDMEA,SRS,RFE
Load/Store dual, Load/Store exclusive, table branch | 1110100 op1 1 op2 Rn op3 | STREX,LDREX,STRD,LDRD,STREXB,STREXH,STREXD,TBB,TBH,LDREXB,LDREXH,LDREXD
Load word | 1111100 op1 101 Rn op2 | LDR,LDRT
Load halfword, memory hints | 1111100 op1 011 Rn Rt op2 | LDRH,LDRHT,LDRSH,LDRSHT,PLD,PLDW
Load byte, memory hints | 1111100 op1 001 Rn Rt op2 | LDRB,LDRBT,LDRSB,LDRSBT,PLD,PLDW,PLI
Store single data item | 11111000 op1 0 op2 | STRB,STRBT,STRH,STRHT,STRT,STR
Data processing (shifted register) | 1110101 op S Rn Rd | AND,TST,BIC,ORR,MOV,ORN,MVN,EOR,TEQ,PKH,ADD,CMN,ADC,SBC,SUB,CMP,RSB
Data processing (register) | 11111010 op1 Rn 1111 op2 | LSL,LSR,ASR,ROR,SXTAH,SXTH,UXTAH,UXTH,SXTAB16,SXTB16,UXTAB16,UXTB16,SXTAB,SXTB,UXTAB,UXTB
Parallel addition and subtraction, signed | 111110101 op1 111100 op2 | SADD16,SASX,SSAX,SSUB16,SADD8,SSUB8,QADD16,QASX,QSUB16,QADD8,QSUB8,SHADD16,SHASX,SHSUB16,SHADD8,SHSUB8
Parallel addition and subtraction, unsigned | 111110101 op1 111101 op2 | UADD16,UASX,USAX,USUB16,UADD8,USUB8,UQADD16,UQASX,UQSAX,UQSUB16,UQADD8,UQSUB8,UHADD16,UHASX,UHSAX,UHSUB16,UHADD8,UHSUB8
Miscellaneous operations | 1111101010 op1 111110 op2 | QADD,QDADD,QSUB,QDSUB,REV,REV16,RBIT,REVSH,SEL,CLZ
Multiply, multiply accumulate, and absolute difference | 111110110 op1 Ra 00 op2 | MLA,MUL,MLS,SMLABB,SMLABT,SMLATB,SMLATT,SMULBB,SMULBT,SMULTB,SMULTT,SMLAD,SMUAD,SMLAWB,SMLAWT,SMULWB,SMULWT,SMLSD,SMUSD,SMMLA,SMMUL,SMMLS,USAD8,USADA8
Long multiply, long multiply accumulate, and divide | 111110111 op1 op2 | SMULL,SDIV,UMULL,UDIV,SMLAL,SMLALBB,SMLALBT,SMLALTB,SMLALTT,SMLALD,SMLSLD,UMLAL,UMAAL
Coprocessor instructions | 11111 op1 Rn coproc op | STC,STC2,LDC,LDC2,MCRR,MCRR2,MRRC,MRRC2,CDP,CDP2,MCR,MCR2,MRC,MRC2
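The rule for telling 16-bit and 32-bit Thumb2 encodings apart (bits 15-11 of the first halfword equal to 0b11101, 0b11110 or 0b11111) is easy to express in code. The Python sketch below is a minimal illustration, for instance as the first step of a toy disassembler; the function names are hypothetical.

```python
def is_32bit_thumb(first_halfword: int) -> bool:
    """True if this halfword is the first half of a 32-bit Thumb2 instruction."""
    top5 = (first_halfword >> 11) & 0x1F
    return top5 in (0b11101, 0b11110, 0b11111)


def instruction_lengths(halfwords):
    """Walk a list of halfwords and return the byte length of each instruction."""
    lengths = []
    i = 0
    while i < len(halfwords):
        size = 4 if is_32bit_thumb(halfwords[i]) else 2
        lengths.append(size)
        i += size // 2
    return lengths


if __name__ == "__main__":
    # 0x2102 and 0x4770 are 16-bit encodings; 0xF04F begins with 0b11110,
    # so it is the first half of a 32-bit instruction (second half 0x0303 here).
    print(instruction_lengths([0x2102, 0xF04F, 0x0303, 0x4770]))
```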
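Next, let us go through some basic features of ARM processors.
A, A brief introduction to the instructions added in each ARM version
For the instructions supported by each version of the ARM core, see the ARM® and Thumb®-2 Instruction Set Quick Reference Card (http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001l/QRC0001_UAL.pdf ). Only the items I consider worth introducing are described here.
Version or instruction | Description
ARMv4 | Adds the 16-bit Thumb instruction set.
ARMv5 | Adds VFPv2 support.
ARMv5 | Supports Jazelle.
ARMv5 | BLX: a branch instruction that saves the return address in the Link Register and can change the instruction set state.
ARMv5 | BKPT: a breakpoint instruction.
ARMv5 | CLZ: a counting instruction that returns the number of zero bits above the most significant 1. If the register is all zeros the result is 32; if bit 31 is set the result is 0. This is very helpful when optimising multimedia codecs.
ARMv5 | Others such as QADD, QSUB, QDADD and QDSUB (signed saturating add, subtract, doubling add and doubling subtract) and SMULxy, SMLAxy, SMULWy, SMLAWy, SMLALxy (multiply instructions) were also added in ARMv5 cores.
ARMv6 | Adds Thumb2 support.
ARMv6 | Supports TrustZone.
ARMv6 | Supports SIMD.
ARMv7-A/R | Adds VFPv3 support.
ARMv7-A/R | Supports NEON Advanced SIMD.
ARMv7-A/R | Supports ThumbEE.
ARMv7-M (for low cost) | Does not support the ARM instruction set; only the Thumb2 16/32-bit instruction set is supported.
ARMv7-M (for low cost) | Integrated NVIC interrupt controller supporting up to 240 interrupts.
SVC | In newer ARM processors the SWI instruction is renamed SVC. The machine encoding stays the same (for example EFxxxxxx); the new name reflects further refinement of the SWI (SVC) behaviour in newer processors.
LDREX/STREX | Instructions added from ARMv6 onwards to guarantee processor-level register/memory exclusive access. LDREX and STREX are used as a pair.
In the example below, the user reads a memory location with LDREX. If the value in that location is modified before STREX executes, the STREX fails and its first register, R0, is non-zero (non-exclusive by this CPU). Conversely, if the value has not been touched, the STREX succeeds and R0 is 0 (exclusive access by this CPU).
These processor-level exclusive instructions are well suited to multitasking or multi-core environments. The spin lock in the ARM port of the Linux kernel is also implemented with them.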
try
LDREX r0, [LockAddr] ; load the lock value
CMP r0, #0 ; is the lock free?
STREXEQ r0, r1, [LockAddr] ; try and claim the lock
CMPEQ r0, #0 ; did this succeed?
BNE try ; no – try again
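To make the success and failure paths of the loop above concrete, here is a toy Python model of a single CPU's exclusive monitor. It is a deliberately simplified sketch, not the architectural definition: the class `ExclusiveMonitor`, its method names and the dictionary used as memory are all hypothetical, and real hardware tracks reservations at a coarser granularity.

```python
class ExclusiveMonitor:
    """Toy model: LDREX tags an address, an intervening write clears the tag."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.tagged = None

    def ldrex(self, addr: int) -> int:
        self.tagged = addr
        return self.memory[addr]

    def observe_write(self, addr: int) -> None:
        """Called when some other observer writes to addr."""
        if self.tagged == addr:
            self.tagged = None

    def strex(self, addr: int, value: int) -> int:
        """Return 0 on success, 1 on failure, like the status register of STREX."""
        if self.tagged != addr:
            return 1
        self.memory[addr] = value
        self.tagged = None
        return 0


def try_lock(monitor: ExclusiveMonitor, lock_addr: int) -> bool:
    """One pass of the LDREX / CMP / STREXEQ sequence shown above."""
    if monitor.ldrex(lock_addr) != 0:   # lock already held
        return False
    return monitor.strex(lock_addr, 1) == 0


if __name__ == "__main__":
    mem = {0x100: 0}
    mon = ExclusiveMonitor(mem)
    print(try_lock(mon, 0x100), mem[0x100])
```

In the assembly version the `BNE try` simply repeats this pass until it returns success.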
B, Using branch instructions to switch between instruction sets
We can refer to the example in "Chapter 5. Interworking ARM and Thumb" of the RealView® Compilation Tools Developer Guide, shown in the following code:
PRESERVE8 ;Preserves eight-byte alignment of the stack
AREA TestCode,CODE,READONLY ; Name this block of code.
ENTRY ; Mark first instruction to call.
; Program entry point
start
ADR R0, ThumbProg ; Generate branch target address and set bit 0, hence arrive at target in Thumb state.
ORR R0,R0,#1 ; In effect the target becomes ThumbProg+1; the BX below then switches the processor to Thumb state
BX R0 ; Branch exchange to ThumbProg.
; Thumb code region
THUMB ; Subsequent instructions are Thumb code.
ThumbProg
MOVS R2, #2 ; Load R2 with value 2.
MOVS R3, #3 ; Load R3 with value 3.
ADDS R2, R2, R3 ; R2 = R2 + R3
ADR R0, ARMProg
BX R0 ; Branch exchange to ARMProg.
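; Thumb instructions are 2 bytes, while ARMv32 code is fetched with 4-byte alignment, so the assembler pads 2 bytes of 0x00 here so that the following ARMv32 code executes correctly.
; ARM code region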
ARM ; Subsequent instructions are ARM code.
ARMProg
MOV R4, #4
MOV R5, #5
ADD R4, R4, R5
; Terminate the program.
stop MOV R0, #0x18 ; angel_SWIreason_ReportException
LDR R1, =0x20026 ; ADP_Stopped_ApplicationExit
SWI 0x123456 ; ARM semihosting
END ; Mark end of this file.
Assemble and link with the following commands:
armasm --debug --apcs=/interwork ARMThumbMixedCode.s
armlink ARMThumbMixedCode.o -o ARMThumbMixedCode.elf
Run the resulting ARMThumbMixedCode.elf on an ARM processor and watch the CPSR. Execution starts in ARM state (CPSR T bit = 0). Then, with
ADR R0, ThumbProg
ORR R0,R0,#1
BX R0
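R0 is loaded with the target address of the Thumb code ORed with 1 in its least significant bit, and BX switches to Thumb state (CPSR T bit = 1) and runs the code from ThumbProg onwards. At the end of the Thumb code, with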
ADR R0, ARMProg
BX R0
R0 is loaded with the address of the ARM code, and BX switches directly back to ARM state (CPSR T bit = 0) and continues with the ARM code from ARMProg onwards.
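The role of bit 0 in a BX target can be summarised in a couple of lines. The following Python sketch models only that aspect of BX (the function name `bx` and the tuple it returns are made up for this note): bit 0 selects the state and is not part of the address that is fetched from.

```python
def bx(target: int):
    """Model BX Rm: return (address to fetch from, resulting instruction set state)."""
    state = "Thumb" if target & 1 else "ARM"
    address = target & 0xFFFFFFFE     # bit 0 is a state selector, not an address bit
    return address, state


if __name__ == "__main__":
    thumb_prog = 0x8024                 # hypothetical address of a Thumb routine
    print(bx(thumb_prog | 1))           # what ADR + ORR #1 + BX does
    print(bx(0x8000))                   # an ARM target: bit 0 clear
```

This is why the ARM-to-Thumb direction in the listing needs the extra ORR, while the Thumb-to-ARM direction can use the address from ADR unchanged.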
The ranges that the different branch instructions can reach are listed below (in general the maximum is 32MB for ARM, 16MB for Thumb2 and 4MB for Thumb).
Instruction | Range (Thumb2 16/32-bit) | Range (ARM 32-bit)
B (Branch to target address) | +/-16MB | +/-32MB
CBNZ, CBZ (Compare and Branch on Nonzero, Compare and Branch on Zero) | 0-126B | X
BL, BLX (immediate) (Call a subroutine; Call a subroutine, change instruction set) | +/-16MB | +/-32MB
BLX (register) (Call a subroutine, optionally change instruction set) | Any | Any
BX (Branch to target address, change instruction set) | Any | Any
BXJ (Change to Jazelle state) | |
TBB, TBH (Table Branch (byte offsets) and Table Branch (halfword offsets)) | 0-510B and 0-131070B | X
(Reference: ARMv7-AR Architecture Reference Manual)
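C, Veneer: supporting instruction set changes across different object files
From the previous example we know that when the ARM<->Thumb(2) state change happens within a single source file (object file), the change is performed directly by the code itself. When it happens in calls between different source files, however, the build options of each object file come into play. Deciding whether a call between functions in two object files needs a state change is therefore left to the final link step: armlink checks whether caller and callee use the same instruction set and, if they do not, inserts veneer code, so that the linked executable copes with the instruction set differences among the object files it combines.
We can create a file arm.s with the following content: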
PRESERVE8
AREA Arm,CODE,READONLY ; Name this block of code.
IMPORT ThumbProg
ENTRY ; Mark 1st instruction to call.
ARMProg
MOV R0,#1 ; Set R0 to show in ARM code.
BL ThumbProg ; Call Thumb subroutine.
MOV R2,#3 ; Set R2 to show returned to ARM.
; Terminate execution.
MOV R0, #0x18 ; angel_SWIreason_ReportException
LDR R1, =0x20026 ; ADP_Stopped_ApplicationExit
SVC 0x123456 ; ARM semihosting (formerly SWI)
END
Then create a file thumb.s with the following content:
AREA Thumb,CODE,READONLY ; Name this block of code.
THUMB ; Subsequent instructions are Thumb.
EXPORT ThumbProg
ThumbProg
MOVS R1, #2 ; Set R1 to show reached Thumb code.
BX lr ; Return to the ARM function.
END ; Mark end of this file.
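Build them as follows:
armasm --debug --apcs=/interwork arm.s
armasm --thumb --debug --apcs=/interwork thumb.s
armlink arm.o thumb.o -o arm_thumb_veneer.elf
Because the call crosses object files, this differs from the single-object-file case, where we had to set bit 0 of the Thumb entry address to 1 and jump with BX so that ARMv32 code could switch to the Thumb instruction set. In the example above we simply call ThumbProg, and the veneer mechanism performs the switch from ARMv32 to Thumb code.
Below, we disassemble arm_thumb_veneer.elf to confirm what the veneer does: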
$a
Arm
0x00008000: e3a00001 …. MOV r0,#1
0x00008004: eb000004 …. BL $Ven$AT$I$$ThumbProg ; 0x801c
0x00008008: e3a02003 . .. MOV r2,#3
0x0000800c: e3a00018 …. MOV r0,#0x18
0x00008010: e59f1000 …. LDR r1,[pc,#0] ; [0x8018] = 0x20026
0x00008014: ef123456 V4.. SVC #0x123456 ; formerly SWI
$d
0x00008018: 00020026 &… DCD 131110
$a
$Ven$AT$I$$ThumbProg
0x0000801c: e28fc001 …. ADR r12,{pc}+9 ; 0x8025
0x00008020: e12fff1c ../. BX r12
$t
Thumb
ThumbProg
0x00008024: 2102 .! MOVS r1,#2
0x00008026: 4770 pG BX lr
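We can see that, in ARM state, the BL at address 0x00008004 jumps to the veneer code at 0x0000801c. Just like the bit 0 setting we did by hand within a single object file, the generated veneer sets R12 to 0x00008025 and then uses BX to change state and run the Thumb code at 0x00008024.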
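The addresses in the disassembly can be verified by hand. The Python sketch below decodes the 24-bit offset of an ARM-state B/BL encoding and reproduces the two values seen in the listing; the helper name `arm_branch_target` is hypothetical, and the sketch assumes the usual ARM-state rule that the PC reads as the instruction address plus 8.

```python
def arm_branch_target(instr_addr: int, encoding: int) -> int:
    """Target of an ARM-state B/BL: PC (instr_addr + 8) plus sign-extended imm24 * 4."""
    imm24 = encoding & 0x00FFFFFF
    if imm24 & 0x00800000:            # sign-extend the 24-bit field
        imm24 -= 1 << 24
    return (instr_addr + 8 + (imm24 << 2)) & 0xFFFFFFFF


if __name__ == "__main__":
    # 0x00008004: eb000004  BL $Ven$AT$I$$ThumbProg
    print(hex(arm_branch_target(0x8004, 0xEB000004)))
    # 0x0000801c: e28fc001  ADR r12,{pc}+9  ->  (0x801c + 8) + 1
    print(hex(0x801C + 8 + 1))
```

The first value is the veneer address and the second is the Thumb entry point with bit 0 set, which is exactly what BX needs in order to switch state.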
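Veneers are generated automatically by the ARM linker at the final link step, when it finds that ARM and Thumb code in different object files call each other, or that a call between ARM/Thumb/Thumb2 code exceeds the branch range. They can be classified as follows:
Veneer type | Description
Calls between ARM and Thumb(2) | Calls across object files between ARM and Thumb, or between ARM and Thumb2, need a veneer to change state.
Calls beyond the ARM/Thumb/Thumb2 branch range | ARM<->ARM calls beyond 32MB, Thumb2<->Thumb2 calls beyond 16MB and Thumb<->Thumb calls beyond 4MB need veneer code to complete the call.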
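D, CPSR (Current Program Status Register) and SPSR (Saved Program Status Register)
The PSR (Program Status Register) records program status: it reflects the current processor mode and instruction set state, and holds the flags on which conditional (Cond.) instruction execution is decided.
For example, if bits 4-0 of the CPSR read b10111, we know that the exception handler we are in was entered because of an abort. Checking bits 4-0 of the SPSR next, a value of b10011 (SVC mode) or b10000 (User mode) tells us which mode's code the processor was executing when the abort was triggered. Furthermore, if abort re-entry caused by a poorly designed exception handler is a concern, comparing the modes in CPSR and SPSR shows whether the abort was re-entered, which helps to pin down and fix latent system problems. The meaning of each field is summarised below.
Bits | Field | Description
4-0 | Mode[4:0] | Mode bits: b10000 (0x0010) User mode; b10001 (0x0011) FIQ mode; b10010 (0x0012) IRQ mode; b10011 (0x0013) Supervisor mode; b10111 (0x0017) Abort mode; b11011 (0x001b) Undefined mode; b11111 (0x001F) System mode; b10110 (0x0016) Secure Monitor
5 | T | Thumb state bit: 0 = ARM, 1 = Thumb
6 | F | FIQ disable: 1 = disabled, 0 = enabled
7 | I | IRQ disable: 1 = disabled, 0 = enabled
8 | A | Imprecise abort bit (A-bit). Indicates if imprecise data abort exceptions are masked.
9 | E | Data endianness bit (E-bit). Indicates the current load/store endian setting of the core. Can be set/cleared with the SETEND instruction.
10 | c | IT state bits
11 | b | IT state bits
12 | a | IT state bits
15-13 | IT_cond | IT state bits
19-16 | GE[3:0] | Greater than or equal to
23-20 | DNM (RAZ) |
24 | J | Java state bit
25 | e | IT state bits
26 | d | IT state bits
27 | Q | Sticky overflow
28 | V | Overflow
29 | C | Carry/Borrow/Extend
30 | Z | Zero
31 | N | Negative/Less than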
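The CPSR/SPSR comparison described above is straightforward to automate when analysing crash dumps. The Python sketch below decodes the fields from the table and applies the abort re-entry check; `decode_psr`, `is_abort_reentry` and the dictionary keys are hypothetical names used only for this illustration.

```python
MODES = {
    0b10000: "User", 0b10001: "FIQ", 0b10010: "IRQ", 0b10011: "Supervisor",
    0b10110: "Secure Monitor", 0b10111: "Abort", 0b11011: "Undefined",
    0b11111: "System",
}


def decode_psr(psr: int) -> dict:
    """Decode the main CPSR/SPSR fields listed in the table above."""
    return {
        "mode": MODES.get(psr & 0x1F, "unknown"),
        "T": (psr >> 5) & 1,      # Thumb state
        "F": (psr >> 6) & 1,      # 1 = FIQ disabled
        "I": (psr >> 7) & 1,      # 1 = IRQ disabled
        "A": (psr >> 8) & 1,      # imprecise aborts masked
        "E": (psr >> 9) & 1,      # data endianness
        "J": (psr >> 24) & 1,     # Java state
        "Q": (psr >> 27) & 1,     # sticky overflow
        "NZCV": (psr >> 28) & 0xF,
    }


def is_abort_reentry(cpsr: int, spsr: int) -> bool:
    """True if we are in Abort mode and were already in Abort mode before."""
    return (cpsr & 0x1F) == 0b10111 and (spsr & 0x1F) == 0b10111


if __name__ == "__main__":
    print(decode_psr(0x600000D3))           # hypothetical value from a dump
    print(is_abort_reentry(0x97, 0x97))
```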
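E, APSR (Application Program Status Register): ALU status flags
Unlike the CPSR and SPSR, the APSR mainly serves as the register of ALU (Arithmetic Logic Unit) status flags, which decide whether conditional instructions are executed.
The CPSR also contains the APSR flags. Other items, such as the processor state, whether interrupts are enabled, the current instruction set state and the IT block execution state, are not part of the APSR.
Bits | Field | Description
15-0 | Reserved | Reserved
19-16 | GE[3:0] | Greater than or equal to (SIMD status bits: greater than or equal to for each 8/16-bit slice)
23-20 | Reserved | Reserved
26-24 | RAZ/SBZP |
27 | Q | Sticky overflow
28 | V | Overflow
29 | C | Carry/Borrow/Extend
30 | Z | Zero
31 | N | Negative/Less than
Next, the other main features of ARM processors are introduced.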

1, Coprocessors

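Besides the ARMv32, Thumb and Thumb2 instruction sets, ARM can support extended instructions through coprocessors. Whenever the ARM core meets an instruction it does not recognise, it offers the instruction to the coprocessors. If no coprocessor recognises it as valid, or the system has no matching coprocessor, the Undefined Instruction vector is taken and the corresponding software handles the error (Undefined Instruction is also commonly used to plant breakpoints for debugging).
ARM supports 16 coprocessors, numbered 0-15. For example, CP15 (System Control Coprocessor 15) is generally used for cache and MMU configuration, CP14 (Debug Control Coprocessor 14) provides the debug registers, and the newer NEON MPE (Media Processing Engine) SIMD instructions and VFP floating point are supported through CP10 and CP11 (see the Cortex A8 Technical Reference Manual, http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0450b/ch02s01s02.html).
Among the coprocessor instructions, ARM defines dedicated instruction sets for VFP and NEON. Taking the NEON instruction set supported on Cortex-A as an example, adding the vectorize option lets the compiler optimise the code and generate NEON instructions on its own, as shown below:
armcc --vectorize -c vector.c --cpu Cortex-A8 -Otime
Then disassemble vector.o, as shown below:
0x000000a0: f2944a40 @J.. VMULL.S16 q2,d4,d0[0] ; => NEON instruction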
0x000000a4: f428774f Ow(. VLD1.16 {d7},[r8]
0x000000a8: e2868006 …. ADD r8,r6,#6
0x000000ac: e2866008 .`.. ADD r6,r6,#8
0x000000b0: f428674f Og(. VLD1.16 {d6},[r8]
0x000000b4: f2974248 HB.. VMLAL.S16 q2,d7,d0[1]
0x000000b8: f426174f O.&. VLD1.16 {d1},[r6]
0x000000bc: f2964260 `B.. VMLAL.S16 q2,d6,d0[2]
0x000000c0: f2914268 hB.. VMLAL.S16 q2,d1,d0[3]
0x000000c4: f2222844 D(“. VADD.I32 q1,q1,q2
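These ARM NEON instructions are supported through CP10 and CP11, so users do not have to issue coprocessor data transfer instructions themselves to benefit from NEON. If instead a coprocessor-based GPU is used, neither the compiler nor the processor has a matching GPU instruction set, so the GPU operations have to be wrapped in coprocessor instructions, and the developers optimise the graphics library according to how well they know the GPU instructions in order to get GPU acceleration.
ARM's coprocessor architecture can also be used to support peripherals. Besides the ARM core driving peripherals, a coprocessor can be connected to peripherals and do the necessary work. A graphics coprocessor, for example, can compute and process data on its own and update display memory. To the ARM core, a coprocessor is simply another cooperating processor attached to ARM's data and control buses; when the ARM core meets an instruction it cannot decode, it starts the handshaking procedure with the coprocessor to carry on execution.
The coprocessor and ARM handshake through the following three signals.
A, CPI (Co-processor instruction) signal
Every coprocessor in the system listens for this signal; ARM asserts it whenever it meets an instruction it does not recognise.
B, CPA (Co-processor absent) signal
After receiving CPI, a coprocessor fetches the instruction and uses CPA to tell ARM whether it supports it. If the coprocessor supports the instruction it drives CPA low (if it does not, CPA stays high), together with:
B.1, a 1-bit signal: 1 means the instruction can be handled by a coprocessor, 0 means the instruction is not supported.
B.2, a 4-bit (0-15) coprocessor ID indicating which coprocessor can handle the instruction (16 at most).
If no coprocessor can handle the instruction, CPA stays high and ARM goes on to take the Undefined Instruction vector.
C, CPB (Co-processor busy) signal
Once a coprocessor has told ARM it can execute the instruction, ARM uses CPB to check the coprocessor's state. If the signal is high, the coprocessor has not yet finished the previous instruction; if CPB is low, the coprocessor is ready to handle the next coprocessor instruction.
While an ARM program is executing a coprocessor instruction, ARM waits for that instruction to finish before the program continues. In a typical multitasking system, for example Linux Kernel 2.6, the system timer interrupt fires 1000 times per second to drive the kernel scheduler. If an interrupt arrives while ARM is waiting for the coprocessor, the coprocessor instruction is abandoned; when the interrupt handler finishes, execution returns to the address of that instruction in the application and the coprocessor instruction is issued again. On Linux, if the scheduler has meanwhile switched to another application, the instruction is only re-executed the next time the original application is scheduled.
In general, coprocessor instructions fall into the following three types.
1, Operations purely internal to the coprocessor, with no involvement of the ARM side
For this type, ARM does not wait for the coprocessor to return data or to finish the instruction; the data processing is purely internal to the coprocessor and ARM can continue immediately. CDP (Coprocessor Data Processing) is an example: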
CDP{cond} <cp#>,<op>,<dest>,<lhs>,<rhs>,{info}
{cond} | Execution condition; the instruction executes only when the condition is met (the optional condition code)
<cp#> | Coprocessor number (0-15, 4 bits, the co-processor number)
<op> | Number of the coprocessor operation to perform (0-15, 4 bits, the desired operation code)
<dest> | Coprocessor register that receives the result (0-15, 4 bits, the co-processor destination register)
<lhs> and <rhs> | Coprocessor registers the data is read from (0-15, 4 bits, the co-processor source registers)
{info} | Info (0-7, 3 bits, the optional additional information field)
Another example is an FPU coprocessor instruction:
ADF {cond}<P>{R} <dest>,<lhs>,<rhs>
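{cond} | Execution condition; the instruction executes only when the condition is met (the optional condition code)
<P> | the precision of the operation
{R} | the optional rounding mode and the other fields are as above.
<dest> | Destination register
<lhs> | Source register
<rhs> | Source register
Or an example of a graphics coprocessor instruction: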
CDP 2,<palette>,<entry>,<value>,<component>
<cp#>2 is the co-processor number
<palette>the op-code for setting the palette
<entry>the logical colour number (0-15) (the <dest> field)
<value>the intensity for that component (0-65535) (the <lhs> and <rhs>) field.
<component>the red, green or blue component (0-2) (the info field)
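2, Data exchanged between ARM registers and coprocessor registers
With this second type, ARM and the coprocessor process and update data through registers. Because the contents of ARM registers change, ARM must wait until the coprocessor has finished before continuing, so that the data exchange behaves the way the program was designed to. MRC (Move to ARM core Register from Coprocessor) and MCR (Move to Coprocessor from ARM core Register) are two examples: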
MRC{cond} <cp#>,<op>,<ARM dest>,<lhs>,<rhs>,{info}
MCR{cond} <cp#>,<op>,<ARM srce>,<lhs>,<rhs>,{info}
{cond} | Execution condition (the optional condition code)
<cp#> | Coprocessor number (4 bits, 0-15, the co-processor number)
<op> | Number of the coprocessor operation to perform (3 bits, 0-7, the operation code required)
<ARM dest>/<ARM srce> | ARM register that exchanges data with the coprocessor (4 bits, 0-15, the ARM source/destination register)
<lhs> and <rhs> | Coprocessor registers that exchange data with ARM (4 bits, 0-15, co-processor register numbers)
{info} | Additional information (3 bits, 0-7, optional extra information)
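To see how the fields above end up in a machine word, the following Python sketch packs them into the 32-bit ARM encoding of MRC/MCR as I understand it from the architecture manual (cond, 1110, opc1, L, CRn, Rt, coproc, opc2, 1, CRm). The function name and parameter names are hypothetical; in terms of the syntax above, `opc1` is `<op>`, `rt` is the ARM register, `crn` and `crm` are `<lhs>` and `<rhs>`, and `opc2` is `{info}`.

```python
def encode_cp_reg_transfer(to_arm: bool, coproc: int, opc1: int, rt: int,
                           crn: int, crm: int, opc2: int = 0,
                           cond: int = 0b1110) -> int:
    """Encode MRC (to_arm=True) or MCR (to_arm=False) as a 32-bit ARM instruction."""
    fields = {"cond": (cond, 15), "coproc": (coproc, 15), "opc1": (opc1, 7),
              "rt": (rt, 15), "crn": (crn, 15), "crm": (crm, 15),
              "opc2": (opc2, 7)}
    for name, (value, limit) in fields.items():
        if not 0 <= value <= limit:
            raise ValueError(f"{name} out of range: {value}")
    return ((cond << 28) | (0b1110 << 24) | (opc1 << 21) |
            ((1 if to_arm else 0) << 20) | (crn << 16) | (rt << 12) |
            (coproc << 8) | (opc2 << 5) | (1 << 4) | crm)


if __name__ == "__main__":
    # MRC p15,0,r0,c5,c0,1  (read the IFSR into r0)
    print(hex(encode_cp_reg_transfer(True, 15, 0, 0, 5, 0, 1)))
    # MCR p15,0,r0,c5,c0,1  (write the IFSR from r0)
    print(hex(encode_cp_reg_transfer(False, 15, 0, 0, 5, 0, 1)))
```

The only difference between the two encodings is bit 20, which selects the direction of the transfer.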
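3, The coprocessor accesses data in external memory for processing
Just as ARM's LDR/STR instructions load memory data into ARM registers or write ARM register data to memory, there are similar instructions for coprocessors, LDC/STC. They can index into an array within a range of no more than -255 to +255 words (-1020 to +1020 bytes).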
LDC{cond}{L} <cp#>,<dest>,<address>
STC{cond}{L} <cp#>,<srce>,<address>
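{cond} | Execution condition (the optional condition code)
{L} | (optional bit meaning 'long transfer')
<cp#> | Coprocessor number (4 bits, 0-15, the co-processor number)
<dest>/<srce> | Coprocessor register that exchanges data with memory (4 bits, 0-15, the co-processor source/destination register)
<address> | (specifies the address at which to start transferring data)
Like the ARM instruction set, a coprocessor can restrict some instructions so that they execute only in SVC mode (Supervisor mode). ARM can use the SPVMD signal to tell external devices or coprocessors whether the running ARM program is in SVC mode, so that a peripheral or coprocessor can decide whether to go ahead with the operation or instruction. A coprocessor can abort the execution of an instruction by raising an abort.
As for floating point, not every ARM platform has a VFP coprocessor. When building with the RVCT ARM compiler, the option "--fpu=vfp" selects hardware VFP coprocessor instructions so that the compiler generates machine code using the VFP instruction set; if the platform has no VFP, the option "--fpu=softvfp" makes the compiler handle floating point in software and not generate VFP coprocessor instructions.
Alternatively, with a library that supports VFP, the compiler can still generate VFP coprocessor machine code. When ARM reaches these instructions it goes through the coprocessor procedure described above; if the platform has no VFP coprocessor, an Undefined Instruction trap is raised, and the ARM-side software that handles the trap calls the software implementation of the VFP instructions (reference: RealView Compilation Tools Libraries and Floating Point Support Guide).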

2, Jazelle

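According to ARM's website (http://www.arm.com/products/processors/technologies/jazelle.php), Jazelle is a project ARM started in 2002. Its main goal is to let Java bytecode run directly on the processor, like ordinary ARMv32 instructions, without going through a software JVM layer, which speeds up the execution of Java bytecode. The first ARM architecture with the Jazelle instruction set is ARMv5TEJ, and the first processor product based on it is the ARM926EJ-S.
An ARM core with Jazelle converts each Java instruction it reads, after the pipeline fetch, into one or more ARM instructions. The Java VM prepares the bytecode program to run and then uses BXJ to branch and switch to the bytecode state, jumping into the Java program. Because Jazelle implements mainly the commonly used Java bytecodes, an unsupported bytecode is handed to the software Java VM (back in ARM or Thumb state), and the Java VM later switches back to Jazelle state as needed. I have not worked with Jazelle hands-on, but since an Undefined Instruction exception can be raised in Jazelle state, my guess is that the software Java VM is notified through the Undefined Instruction exception: the handler checks in the SPSR whether the previous state was Jazelle and, if so, lets the software Java VM execute the bytecode (LR tells where that bytecode is). Another possibility is an approach similar to the ThumbEE handler: an unsupported Java bytecode is passed through that handler to the software Java VM. (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Bhhggafj.html and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344c/Chdiciaj.html ) The above is for reference only :)
See Sun's CLDC HI (HotSpot Implementation) release notes (http://download.oracle.com/javame/config/cldc/cldc-opt-impl/cldc-hi-2.0-web/doc/release/CLDC_HI-release-notes.html ). CLDC HI supports Java just-in-time compilation, which compiles Java bytecode to ARM code on the fly to speed up Java applications. The document also states that enabling Jazelle acceleration requires a licence from ARM:
For the avoidance of doubt, distribution of products containing software code to exercise the BXJ instruction and enable the use of the ARM Jazelle architecture extension without a JTEK licensing agreement from ARM is expressly forbidden. Because an ARM JTEK licence is required to enable this feature, real products probably lean towards JIT for acceleration (usually optimised for the ARMv5 instruction set).
Jazelle is currently applied in two main forms:
A, Jazelle DBX (Direct Binary Execution): provides the ability to execute Java bytecode on the ARM processor. Developers can check whether CPSR J (bit 24) is 1 and T (bit 5) is 0 to confirm that the processor is in Jazelle state. DBX reduces the cost of having a Java VM interpret bytecode with ARM instructions by executing it with direct processor support. For Jazelle to work effectively, the Java VM must support it (a Jazelle-aware JVM) so that the JSRs (Java Specification Requests) and JAR loading that Java applications need fit properly into the Jazelle flow. According to ARM, about 95% of application bytecode can be executed directly by Jazelle.
B, Jazelle RCT (Runtime Compile Target): this technology is aimed at converting Java bytecode to ARM machine code (according to the references, the final program grows 4-8 times after bytecode is converted to ARM machine code). JIT or DAC (Dynamic Adaptive Compilation) analyses and compiles bytecode dynamically depending on how it executes. Jazelle RCT is meant to avoid the longer application start-up time and the power and performance impact caused by converting bytecode to machine code in software. In practice, AOT (ahead-of-time) compilation to ARM machine code can also be done when an application is installed or downloaded; of course, the 4-8 times growth in storage has to be taken into account in the evaluation.
Dalvik in Android 2.2 and later also supports JIT. I believe software JIT will be the mainstream; because of the licensing restrictions, Jazelle DBX or RCT is unlikely to become mainstream (and Jazelle needs processor support as well).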

3, ARMv32

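ARMv32 is ARM's native 32-bit instruction set, and it gives the best execution performance compared with the Thumb (16-bit) and Thumb2 (16/32-bit) execution environments. ARM instructions are aligned to 4-byte addresses in memory. The 32-bit ARM instruction set is normally used for performance-critical code, exception handling and system initialisation. A feature of its encoding is that the top 4 bits of every instruction hold the condition under which it executes.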

4, Thumb

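The 16-bit Thumb instruction set was designed as a subset of the commonly used 32-bit ARM instructions; the processor converts each 16-bit Thumb instruction it loads into the corresponding 32-bit ARM instruction for execution. Developers can tell that the core is in Thumb state when CPSR T is 1 and J is 0. Note that the same C code compiled to 16-bit Thumb runs somewhat slower, and the instruction set is less rich than 32-bit ARM, but the code density is higher and memory is saved.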

5, Thumb2

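The Thumb2 instruction set provides both 16-bit and 32-bit instructions. As with Thumb, CPSR T = 1 and J = 0 indicate Thumb2 state (the processor has to support it, of course). According to ARM's website (http://www.arm.com/products/processors/technologies/instruction-set-architectures.php ), Thumb-2 needs 31% less code memory than ARM code and delivers 38% more performance than the original Thumb code (results vary somewhat with the test code).
Cortex M3, for example, uses Thumb2 code throughout and does not support the ARM instruction set, which gives a smaller code footprint than ARM with less performance loss than Thumb code.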

6, Thumb2 Execution Environment (Thumb-2EE)

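First, the ARM page http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ic/Cjafgdih.html states that, unless stated otherwise, ThumbEE instructions are identical to Thumb instructions. From the instruction set point of view, ThumbEE stays as close as possible to the existing Thumb instructions. Developers first enter Thumb state (CPSR T bit = 1), then use the ENTERX instruction (changes from Thumb state to ThumbEE state; it has no effect in ThumbEE state) to enter ThumbEE state (CPSR J bit = 1), and use the LEAVEX instruction (changes from ThumbEE state to Thumb state; it has no effect in Thumb state) to leave ThumbEE state. ThumbEE state also provides the instructions CHKA (check array) and HB, HBL, HBLP and HBP (handler branch: branch to a specified handler).
In addition, CP14 register c0 provides the ThumbEE Configuration Register: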
MRC p14, 6, <Rd>, c0, c0, 0 ; Read ThumbEE Configuration Register
MCR p14, 6, <Rd>, c0, c0, 0 ; Write ThumbEE Configuration Register
The ThumbEE HandlerBase Register can also be set so that, when an exception occurs in ThumbEE state, the application gets a chance to correct it:
MRC p14, 6, <Rd>, c1, c0, 0 ; Read ThumbEE HandlerBase Register
MCR p14, 6, <Rd>, c1, c0, 0 ; Write ThumbEE HandlerBase Register
During development, a THUMBX code region makes the assembler generate ThumbEE code (see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ic/CIHBCDGA.html ). I wrote the following reference code:
AREA ToThumbX, CODE, READONLY
ENTRY
ARM
start
MRS R0,CPSR
BIC R0,R0,#0x1F
ORR R0,R0,#0x10 ;Switch to User Mode.
MSR CPSR_c,R0
MOVS R1,#0x7000 ; handler_ThumbEE =0x7000-4
MCR p14, 6, R1, c1, c0, 0 ; Write ThumbEE HandlerBase Register
ADR r0, enter_Thumb + 1
BX r0 ; Enter Thumb state
THUMB
enter_Thumb
NOP
ENTERX ; Enter ThumbEE state
THUMBX
enter_ThumbX
MOVS R2, #2 ; Load R2 with value 2.
MOV.W R3, #3 ; Load R3 with value 3. => 32-bit Thumb2 instruction
ADDS R2, R2, R3 ; R2 = R2 + R3
MOVS r0,#0 ;char *p; and p=0;
MOVS r1,#2
STRB r1,[r0,#0] ;*p=2; => NULL pointer store, meant to trigger a Data Abort.
MOVS r0,#1
LEAVEX ; Leave ThumbEE state
THUMB
leave_ThumbX
NOP
END
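Build it with a toolchain version that supports the Cortex-A8 instruction set, as follows:
armcc --cpu=Cortex-A8 thumbee.s -o thumbee.elf
We set handler_ThumbEE to 0x7000 by writing the ThumbEE HandlerBase Register through CP14. The code then switches from SVC mode to User mode, enters Thumb state and uses ENTERX to enter ThumbEE state. It first verifies that 16-bit and 32-bit Thumb2 instructions work in ThumbEE state, then deliberately provokes a NULL pointer Data Abort. In ThumbEE state the PC is pointed directly at address 0x7000-4 to run the handler we prepared. Inside the handler we can see that the processor is still in User mode (no mode change took place) and that the instruction set state is still ThumbEE.
This basic check shows that in ThumbEE state an exceptional condition is routed to the ThumbEE handler, which gives the application a chance to recover instead of taking a processor-level abort that terminates the application outright.
To illustrate with SEH (Structured Exception Handling) on Windows, an application can install an SEH handler with code like the following: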
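handler = (DWORD) problem_fixing_seh;
__asm {
    mov  eax, handler   ; address of the new handler
    push eax            ; handler field of the new registration record
    push fs:[0]         ; previous head of the handler chain
    mov  fs:[0], esp    ; make the new record the head of the chain
}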
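As long as the SEH handler is installed when the application starts, an application that is interrupted by a memory error caused by a design flaw gets a chance to jump to the handler function problem_fixing_seh first. In that function we can record the state at the moment of the fault, including the registers, the stack and any other necessary information, which gives the application developer a chance to use this information to fix the problem.
Applied to consumer-electronics development on a platform that has an MMU and separates User mode and Kernel mode, ThumbEE lets an application install its ThumbEE handler at startup. If the application later fails because of a design flaw, the developer can use the handler to collect information about the fault, and can even send it back over a communication channel, which helps narrow the problem down. (Of course, it is better still if the application can resolve the problem by itself.)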

7, VFP (Vector Floating Point) and Advanced SIMD (NEON)

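ARM processors hand vector floating-point work to a coprocessor, which provides economical single-precision and double-precision floating-point capability and is compatible with the ANSI/IEEE Std 754-1985 binary floating-point arithmetic standard.
ARM's Advanced SIMD instruction set, which we usually call NEON, provides 64-bit and 128-bit SIMD (Single Instruction Multiple Data) instructions that accelerate multimedia applications. NEON is an instruction set provided through Coprocessors 10 and 11; the coprocessor has its own registers, its own instructions and an independently executing processing unit. NEON supports 8/16/32/64-bit integers and 32-bit single-precision floating point, and a single NEON SIMD instruction can perform up to 16 operations at once.
For NEON performance comparisons, the document ARM Architecture & NEON (http://www.stanford.edu/class/ee282/handouts/lect.10.arm_soc.4pp.pdf) is recommended; it also includes a comparison with the ATOM N270. NEON provides 32 registers of 64 bits (D0-D31), which can alternatively be used as 16 registers of 128 bits through the Q view (Q0-Q15). When handling multimedia data, one register can hold up to four 32-bit floating-point values or sixteen signed chars. Taking the FFT (Fast Fourier Transform) used in AAC encoding as cited in the ARM document, using ARMv7 NEON instructions instead of only ARMv6 SIMD instructions can give roughly four times the execution efficiency, and the FFT in ffmpeg can reach roughly 12 times (for reference only).
To use NEON, you can add --vectorize when compiling with armcc; for the other options needed (such as the optimization level and target CPU), please refer to the ARM compiler documentation.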
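The lane counts quoted above are simply the register width divided by the element width. The sketch below models one 128-bit Q register as a plain C union and spells out, as a scalar loop, what a single 16-lane byte addition computes. The type and function names are hypothetical and this is not NEON intrinsic code; it only illustrates the data layout, assuming a 4-byte float.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of one 128-bit Q register (not a NEON type). */
typedef union {
    uint8_t  u8[16];   /* sixteen 8-bit lanes  */
    int8_t   s8[16];   /* sixteen signed chars */
    uint32_t u32[4];   /* four 32-bit lanes    */
    float    f32[4];   /* four single-precision floats */
} q_reg_model;

/* Number of elements that fit in a register of the given width. */
static size_t lanes_per_register(size_t register_bits, size_t element_bits)
{
    return register_bits / element_bits;
}

/* What one 16-lane unsigned byte add does, written as a loop.
 * Each lane wraps modulo 256 independently of its neighbours. */
static q_reg_model add_u8_lanes(q_reg_model a, q_reg_model b)
{
    q_reg_model r;
    for (size_t i = 0; i < 16; ++i) {
        r.u8[i] = (uint8_t)(a.u8[i] + b.u8[i]);
    }
    return r;
}
```

A SIMD unit performs the sixteen lane additions of add_u8_lanes in one instruction, which is where the "up to 16 operations" figure comes from.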

8, Security Extensions (TrustZone)

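Speaking of TrustZone: the PC industry also has the Trusted Computing Group, TCG (http://www.trustedcomputinggroup.org/), which promotes this kind of platform security mechanism. It has defined the TPM (Trusted Platform Module) and the software architecture that runs on top of it, the TCG Software Stack, so that applications and kernel drivers on a PC are likewise divided into Trusted and Non-Trusted execution environments. Besides recent versions of Windows, Linux Kernel 2.6 and later also support it: in make menuconfig, go to Device Drivers ---> Character devices ---> [*] TPM Hardware Support and, depending on the TPM interface available on the platform, select National Semiconductor TPM Interface or Atmel TPM Interface (the options vary with the Linux Kernel version you have).
Microsoft's operating systems currently support BitLocker, which is built on TPM technology (see http://windows.microsoft.com/zh-TW/windows7/Learn-more-about-BitLocker-Drive-Encryption). If the TPM detects that the startup files on the operating system drive have been changed, the system is forced into recovery mode and the recovery password must be entered before the data can be read correctly again; if the TPM data is found to be inconsistent, the recovery password is likewise required. The Windows TPM mechanism can also be combined with a startup key or a PIN: if a user's laptop is lost, a third party who has neither the USB drive holding the matching encryption key nor the correct PIN cannot open the data on the operating system drive. On a Windows PC, the TPM therefore lets us prevent a drive holding important data from being deliberately removed and read on another computer, and it limits the risk of data loss even if the computer is stolen by a third party.
TrustZone is ARM's proposal for dividing the processor architecture into two parallel execution environments, Secure and Non-Secure (the Secure World and the Normal World). Switching between the two environments goes through Secure Monitor mode. The concept is shown below: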
Normal World | Secure World
Non-Secure User Mode (Application) | Secure User Mode (Application)
Non-Secure Privileged Mode (Kernel/Driver) | Secure Privileged Mode (Kernel/Driver)
Monitor Mode (Exception)
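With TrustZone, applications or kernel code that do not belong to the Secure region cannot access data that belongs to it, which can guard products such as smartphones against the security problems brought by downloaded third-party malware. The software security mechanisms associated with TrustZone were co-developed with Trusted Logic S.A. On a processor that supports TrustZone, a region of memory is reserved for applications or kernel code that run only in the Secure world, so the MMU must support a corresponding field as well (which is why processors with an MMU, such as the ARM1176 or Cortex-A, are a very good fit). Only then can the memory management mechanism partition execution and data memory in hardware into Secure and Non-Secure spaces and stop a malicious Non-Secure application from crossing the boundary. According to ARM's material, a working TrustZone environment must include:
A. A processor that supports TrustZone.
B. An on-chip Boot ROM that supports the security setup at startup. (Boot code stored in external Flash is at risk of being modified.)
C. On-chip storage for configuration data or the master key (possibly OTP, One Time Programmable).
D. On-chip RAM for storing DRM data or other important secrets.
E. Peripherals that can be configured for use by trusted software only.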
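ARM supplies a security module provided by Trusted Logic S.A. that supports security protocols consistent with TrustZone behaviour. Through the security checks offered by these software services, ARM aims to support SIM locking, IMEI protection, secure boot (ensuring that the operating system kernel about to be loaded has not been modified; OMTP, the Open Mobile Terminal Platform, has also defined Secure Boot requirements), DRM (Digital Rights Management) for copyright-protected content, digital signatures, and electronic banking. According to ARM's documents, the accompanying software modules include Trusted Interpreter, TrustZone Access Driver, TrustZone Monitor, Secure Kernel, Secure Key Storage, SIM Lock, E-Wallet, and API Framework.
SMI (Software Monitor Instruction) and SMC share the same machine encoding. Like SVC, the instruction raises a synchronous exception, but it has to be issued from a privileged mode and is taken in Monitor mode, which gives Non-Secure code a way to switch to the Secure state through Monitor mode. As shown below, TrustZone lets us configure whether Asynchronous Abort, IRQ, FIQ, DMA, TLB, Coprocessor and other peripheral interrupts and management units are brought under Secure state management.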
Secure Configuration Register (SCR)
Bit | Field | Description
[31:7] | UNK/SBZP | Unknown on reads, Should-Be-Zero-or-Preserved on writes.
6 | nET | Not Early Termination. This bit disables early termination.
5 | AW | A bit writable. This bit controls whether the A bit in the CPSR can be modified in Non-secure state: 0 = the CPSR.A bit can be modified only in Secure state; 1 = the CPSR.A bit can be modified in any security state.
4 | FW | F bit writable. This bit controls whether the F bit in the CPSR can be modified in Non-secure state: 0 = the CPSR.F bit can be modified only in Secure state; 1 = the CPSR.F bit can be modified in any security state.
3 | EA | External Abort handler. This bit controls which mode handles external aborts: 0 = Abort mode handles external aborts; 1 = Monitor mode handles external aborts.
2 | FIQ | FIQ handler. This bit controls which mode the processor enters when a Fast Interrupt (FIQ) is taken: 0 = FIQ mode entered when FIQ is taken; 1 = Monitor mode entered when FIQ is taken.
1 | IRQ | IRQ handler. This bit controls which mode the processor enters when an Interrupt (IRQ) is taken: 0 = IRQ mode entered when IRQ is taken; 1 = Monitor mode entered when IRQ is taken.
0 | NS | Non Secure bit. Except when the processor is in Monitor mode, this bit determines the security state of the processor: 0 = Secure state; 1 = Non-secure state.
The Secure Configuration Register can be read and modified with the following instructions:
MRC p15,0,<Rt>,c1,c1,0 ; Read CP15 Secure Configuration Register
MCR p15,0,<Rt>,c1,c1,0 ; Write CP15 Secure Configuration Register
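The SCR bit layout in the table above can be captured as C masks. The following is a minimal sketch under the assumption that the register value has already been read into a 32-bit variable (for example by the MRC shown above); the macro and function names are hypothetical and are not taken from an ARM header. The update helper leaves bits [31:7] exactly as they were read, matching the Should-Be-Zero-or-Preserved rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the SCR fields listed in the table. */
#define SCR_NS   (1u << 0)  /* Non-secure bit                     */
#define SCR_IRQ  (1u << 1)  /* IRQ taken in Monitor mode          */
#define SCR_FIQ  (1u << 2)  /* FIQ taken in Monitor mode          */
#define SCR_EA   (1u << 3)  /* External abort to Monitor mode     */
#define SCR_FW   (1u << 4)  /* CPSR.F writable in Non-secure state */
#define SCR_AW   (1u << 5)  /* CPSR.A writable in Non-secure state */
#define SCR_NET  (1u << 6)  /* Not Early Termination              */
#define SCR_DEFINED_MASK 0x7Fu

static bool scr_fiq_to_monitor(uint32_t scr) { return (scr & SCR_FIQ) != 0; }
static bool scr_irq_to_monitor(uint32_t scr) { return (scr & SCR_IRQ) != 0; }

/* Read-modify-write on a copy of SCR: set the bits in `set`, clear the
 * bits in `clear`, and keep the reserved bits [31:7] as they were read. */
static uint32_t scr_update(uint32_t old, uint32_t set, uint32_t clear)
{
    uint32_t reserved = old & ~SCR_DEFINED_MASK;
    uint32_t fields   = ((old & ~clear) | set) & SCR_DEFINED_MASK;
    return reserved | fields;
}
```

For example, scr_update(scr, SCR_FIQ, 0) produces the value to write back with MCR when FIQs should be routed to Monitor mode; the actual register access still has to be done in assembly from a Secure privileged mode.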
Non-Secure Access Control Register
Bit | Field | Description
[31:19] | SBZ | Should-Be-Zero on writes.
18 | DMA | Reserves the DMA channels and registers for the Secure world and determines the page tables, Secure or Non-Secure, to use for DMA transfers: 0 = DMA reserved for the Secure world only and the Secure page tables are used for DMA transfers (reset value); 1 = DMA can be used by the Non-Secure world and the Non-Secure page tables are used for DMA transfers.
17 | TL | Prevents operations in the Non-Secure world from locking page tables in TLB lockdown entries. The Invalidate Single Entry or Invalidate ASID match operations can match a TLB lockdown entry but an Invalidate All operation only applies to unlocked entries: 0 = Reserve TLB Lockdown registers for Secure operation only (reset value); 1 = TLB Lockdown registers available for Secure and Non-Secure operation.
16 | CL | Prevents operations in the Non-Secure world from changing cache lockdown entries: 0 = Reserve cache lockdown registers for Secure operation only (reset value); 1 = Cache lockdown registers available for Secure and Non-Secure operation.
[15:14] | SBZ | Should-Be-Zero on writes.
13 | CP13 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
12 | CP12 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
11 | CP11 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
10 | CP10 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
9 | CP9 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
8 | CP8 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
7 | CP7 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
6 | CP6 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
5 | CP5 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
4 | CP4 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
3 | CP3 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
2 | CP2 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
1 | CP1 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
0 | CP0 | Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only (reset value); 1 = Secure or Non-Secure access.
The Non-Secure Access Control Register can be read and modified with the following instructions:
MRC p15, 0, <Rd>, c1, c1, 2 ; Read Non-Secure Access Control Register data
MCR p15, 0, <Rd>, c1, c1, 2 ; Write Non-Secure Access Control Register data
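Since VFP and NEON sit on Coprocessors 10 and 11, a common use of this register is to open those two coprocessors to the Normal World. The sketch below shows the bit arithmetic only, using the layout from the table above; the macro and function names are hypothetical, and the value would still have to be written with the MCR instruction from the Secure world.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the NSACR fields listed in the table. */
#define NSACR_CP(n)  (1u << (n))   /* coprocessor n, valid for n = 0..13 */
#define NSACR_CL     (1u << 16)    /* cache lockdown registers */
#define NSACR_TL     (1u << 17)    /* TLB lockdown registers   */
#define NSACR_DMA    (1u << 18)    /* DMA channels             */
#define NSACR_WRITABLE_MASK (0x3FFFu | NSACR_CL | NSACR_TL | NSACR_DMA)

/* True when the Non-Secure world may access coprocessor `cp`. */
static bool nsacr_cp_allowed(uint32_t nsacr, unsigned cp)
{
    return cp <= 13u && (nsacr & NSACR_CP(cp)) != 0;
}

/* Value to write so that CP10 and CP11 (VFP/NEON) become accessible
 * from the Non-Secure world; Should-Be-Zero bits are forced to zero. */
static uint32_t nsacr_allow_vfp_neon(uint32_t nsacr)
{
    return (nsacr | NSACR_CP(10) | NSACR_CP(11)) & NSACR_WRITABLE_MASK;
}
```

Starting from the reset value 0, nsacr_allow_vfp_neon(0) yields 0xC00, leaving every other coprocessor and the lockdown and DMA resources reserved for the Secure world.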
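When the processor is in Monitor mode (CPSR M[4:0] = b10110), it is always in the Secure state. CP15 reads and writes (MRC and MCR) issued in that mode are then steered by the SCR.NS bit: if NS (Non-Secure) is 0, the access goes to the Secure banked registers; if NS is 1, the access goes to the Non-Secure banked registers. Outside Monitor mode, NS = 0 means the processor is in the Secure state and NS = 1 means it is in the Non-Secure state.
Pausing here.
Honestly, having written this far, it feels like there is far too much left to cover... @_@. Since these are notes, written as they come, I will stop here for now and write the next instalment of the ARM and Cortex notes when I have time.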
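The difference between "which security state is the core in" and "which banked CP15 copy does an access reach" is easy to mix up, so here is a small sketch that keeps the two questions apart. It is a model of the rule described above, with hypothetical names, and takes the CPSR and SCR values as plain integers rather than reading the real registers.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu
#define CPSR_MODE_MON  0x16u      /* M[4:0] = b10110, Monitor mode */
#define SCR_NS         (1u << 0)

typedef enum { BANK_SECURE, BANK_NON_SECURE } cp15_bank;

/* Security state of the core: Monitor mode is Secure regardless of
 * SCR.NS; in every other mode SCR.NS decides. */
static bool cpu_is_secure(uint32_t cpsr, uint32_t scr)
{
    if ((cpsr & CPSR_MODE_MASK) == CPSR_MODE_MON) {
        return true;
    }
    return (scr & SCR_NS) == 0;
}

/* Banked CP15 copy reached by MRC/MCR: chosen by SCR.NS, which is what
 * lets Monitor mode code reach either world's copy while staying Secure. */
static cp15_bank cp15_selected_bank(uint32_t scr)
{
    return (scr & SCR_NS) ? BANK_NON_SECURE : BANK_SECURE;
}
```

In a world-switch routine this is why Monitor code can set SCR.NS to 1, save or restore the Normal World's banked registers, and still be executing as Secure code.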