Posted by Lawpig on June 1, 2017

Notes on ARM and Cortex

hlchou@mail2000.com.tw, by loda

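I once heard this story: someone asked Socrates why he was the wisest man in Athens, and he replied that the Athenians believed they knew things without realizing that they actually knew nothing, whereas he knew only one thing, namely that he knew nothing. I hope to keep the same attitude as I dig into technical subjects.

Through work I have done Real-Time OS and Linux porting on ARM processors, and I hope this article can organize what I learned along the way (reviewing the old to learn the new). If you are already experienced with the ARM architecture, this article will probably be of limited help; it is mainly meant for people who want a deeper understanding of ARM-based product development. My own knowledge is limited, so if anything is missing or wrong, please do not hesitate to point it out.

According to ARM's website (http://www.arm.com/about/company-profile/index.php), ARM was founded in 1990. To date more than 15 billion ARM-based chips have been sold, and ARM has sold more than 600 processor licenses to over 200 companies, collecting licensing fees on ARM chips. More than 95% of the world's mobile phones and more than 25% of consumer electronics products currently use an ARM processor.

As the name ARM (Advanced RISC Machines) suggests, this is a company focused on processors with a RISC (Reduced Instruction Set Computer) architecture. The earliest ARM1 prototype was designed in 1985 at Acorn Computers in Cambridge, England, and manufactured by the American company VLSI, which is why, as Wikipedia shows, the early ARM1, ARM2, ARM250 and ARM3 processors were all used by Acorn as the core processors of its computers.

On December 5, 1978, the physicist Hermann Hauser and the engineer Chris Curry founded CPU (Cambridge Processing Unit) in Cambridge, England, and in 1979 the company was renamed Acorn Computers. In 1985 Roger Wilson and Steve Furber designed their own first-generation 32-bit, 6 MHz processor and used it to build a computer with a RISC instruction set, abbreviated ARM (Acorn RISC Machine).

Acorn later ran into financial difficulties and was acquired by Olivetti, becoming an independent Olivetti research subsidiary. On November 27, 1990, ARM received investment from Apple and the chip maker VLSI and became an independent processor company, starting out in a barn. The memorable Apple Newton PDA, for example, used the ARM610 processor. (References: http://www5.cnfol.com/big5/news.cnfol.com/100823/101,1587,8274016,00.shtml and http://big5.buynow.com.cn/gate/big5/www.cnbeta.com/articles/131786.htm )

A little archaeology: today's processor architectures fall mainly into the Von Neumann memory architecture, proposed in the 1940s, in which program and data share one bus, and the later Harvard architecture, in which program and data travel on separate buses so that instruction and data accesses can proceed at the same time. The early ARM7 generally uses the Von Neumann architecture, with a single cache serving both instructions and data, whereas newer cores (for example ARM11 or Cortex-A) usually use the Harvard architecture: the processor has separate I-cache and D-cache and separate instruction and data fetch buses, which improves efficiency. (References: http://en.wikipedia.org/wiki/ARM7 and http://en.wikipedia.org/wiki/Harvard_architecture )

For how ARM cores are classified as Von Neumann or Harvard, see also http://stenlyho.blogspot.com/2008/08/armcpu.html , summarized below: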

Processor Family	#of pipeline stages	Memory Organization	Clock Rate	MIPS/MHz
ARM6	3	Von Neumann	25MHz
ARM7	3	Von Neumann	66MHz	0.9
ARM8	5	Von Neumann	72MHz	1.2
ARM9	5	Harvard	200MHz	1.1
ARM10	6	Harvard	400MHz	1.25
StrongARM	5	Harvard	233MHz	1.15
ARM11	8	Von Neumann/Harvard	550MHz	1.2

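ARM processors use a RISC (Reduced Instruction Set Computing) architecture. RISC favors simple, frequently used instructions and avoids complex ones. It uses fixed-length instruction encodings (32-bit, 16-bit, or a 16/32-bit mix) and mostly single-cycle instructions that pipeline easily, and it relies on a large register file so that data-processing instructions operate only on registers, with only dedicated load/store instructions accessing memory. A CISC architecture, by contrast, keeps adding instructions as needs arise, so the architecture grows ever more complex, even though in practice not every instruction is used often. Taking the CISC x86 instruction set as an example, instruction lengths are not fixed; the examples below are 1, 2, 7 and 11 bytes long: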

(1 byte) 0x48 = dec eax

(2 bytes) 0x89 F9 = mov ecx,edi

(7 bytes) 0x8B BC 24 A4 01 00 00 = mov edi,dword ptr [esp+000001A4h]

(11 bytes) 0x81 BC 24 14 01 00 00 FF 00 00 00 = cmp dword ptr [esp+00000114h],0FFh

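ARM speeds up instruction processing with a pipeline. If an interrupt arrives while the pipeline is running, the processor finishes the instruction being executed before it enters the interrupt. The ARM7 supports the following 3-stage pipeline:

Fetch → Decode → Execute

where:

Fetch	Fetches the instruction
Decode	Decompresses Thumb instructions into ARM instructions, decodes the ARM instruction, and selects registers
Execute	Reads registers/memory, performs arithmetic and logic operations, and writes back to registers/memory

In every CPU cycle the processor handles the 'Fetch', 'Decode' and 'Execute' steps at the same time, each for a different instruction, rather than taking one instruction from fetch through completion before starting the next, as the table below shows: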

Time	Fetch	Decode	Execute
Cycle#1	Instruction#1
Cycle#2	Instruction#2	Instruction#1
Cycle#3	Instruction#3	Instruction#2	Instruction#1
Cycle#4	Instruction#4	Instruction#3	Instruction#2
Cycle#5	Instruction#5	Instruction#4	Instruction#3
Cycle#6	Instruction#6	Instruction#5	Instruction#4
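
The occupancy table above can be reproduced with a few lines of Python. This is only a sketch of an ideal pipeline with no stalls, flushes or multi-cycle instructions; the function name `pipeline_table` is made up for illustration.

```python
def pipeline_table(num_instructions, stages=("Fetch", "Decode", "Execute")):
    """Return {cycle: {stage: instruction number}} for an ideal, stall-free pipeline."""
    depth = len(stages)
    total_cycles = num_instructions + depth - 1
    table = {}
    for cycle in range(1, total_cycles + 1):
        row = {}
        for stage_index, stage in enumerate(stages):
            # Instruction i occupies stage s (0-based) during cycle i + s.
            instruction = cycle - stage_index
            if 1 <= instruction <= num_instructions:
                row[stage] = instruction
        table[cycle] = row
    return table


table = pipeline_table(6)
for cycle, row in table.items():
    print(f"Cycle#{cycle}", row)
```

Passing a longer `stages` tuple (for example the five ARM9 stage names) gives the corresponding table for a deeper pipeline under the same idealized assumptions.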

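To keep arithmetic instructions from accessing memory outside the memory-access stage, which would break the pipeline's ability to overlap execution, ARM allows only dedicated load/store instructions to read and write memory. The early ARM6 and ARM7 have a 3-stage pipeline, ARM8 and ARM9 have about 5 stages, and the later ARM11 has 8. A deeper pipeline does not necessarily bring higher performance, however: when the code flow hits a branch (for example, a branch to another block of code), the instructions already in the pipeline become invalid and instruction fetch has to start again.

In short, a pipeline divides instruction processing into several distinct steps. For example:

The ARM9 supports the following 5-stage pipeline:

Fetch → Decode → Execute → Memory → Write Back

where:

Fetch	Fetches the instruction
Decode	Decodes the ARM/Thumb instruction and reads registers
Execute	Performs logic operations and computes memory access addresses
Memory	Reads data from or writes data to memory
Write Back	Writes the result of the operation or load back to the register file

From ARM10 onward, branch prediction is supported, which reduces how often a branch forces a pipeline flush during execution. ARM10 supports the following 6-stage pipeline:

Fetch → Issue → Decode → Execute → Memory → Write Back

where:

Fetch	Branch prediction by the Branch Predictor, instruction address calculation, and instruction fetch
Issue	ARM/Thumb instruction decode; if the instruction is not a valid ARM/Thumb instruction, the coprocessor signals determine whether it is a coprocessor instruction
Decode	Register read, result forwarding, scoreboard
Execute	Arithmetic and logic operations, branch/data memory address calculation, multiplication
Memory	Reads data from or writes data to memory, coprocessor data access, multiply-add processing
Write Back	Writes the result of the operation or load back to the register file

ARM11 uses a scalar pipeline and, at the issue stage, supports three kinds of pipelines: ALU (arithmetic logic unit), MAC (multiply/accumulate) and Load/Store. In each cycle it can dispatch one operation to one of these pipelines, as the 8-stage scalar pipeline below shows. (The ARM1156T2-S has a 9-stage pipeline in which the fetch pipeline is extended to 3 stages; the details are not covered here, see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0338g/I1002919.html)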

Fetch#1→ Fetch#2→ Decode→ISS (ALU Pipeline)→ Shifter→ALU→ SAT→ Write Back

______________________________(MAC Pipeline) → MAC1→MAC2→ MAC3→ Write Back

______________________________(Load/Store Pipeline)→ LS Add→DC1→ DC2→ Write Back

Compared with earlier cores, ARM11 uses two fetch stages to support two branch-prediction mechanisms. The first fetch stage performs dynamic branch prediction based on history: a 64-entry Branch-Target Address Cache (BTAC) records recent branches, each entry in one of four states (strongly taken, weakly taken, weakly not-taken, strongly not-taken). The second fetch stage performs static branch prediction and handles branches that the first stage does not cover. Accurate branch prediction avoids pipeline flushes and makes the processor run more efficiently. According to the reference material, ARM11's dynamic and static prediction together achieve a hit rate of about 85% in typical execution, and in most cases it falls between 80% and 95% (depending on code size).
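
The four prediction states behave like a 2-bit saturating counter. The sketch below illustrates that idea only. It is not the ARM11 BTAC itself: the direct-mapped indexing by PC, the absence of tags and target addresses, and the initial state are all simplifying assumptions, and the names `TwoBitPredictor` and `run` are hypothetical.

```python
STATES = ("strongly not-taken", "weakly not-taken", "weakly taken", "strongly taken")


class TwoBitPredictor:
    def __init__(self, entries=64):
        self.entries = entries
        # Assumption: every entry starts at "weakly not-taken" (counter value 1).
        self.counters = [1] * entries

    def _index(self, pc):
        # Assumption: direct-mapped by word address, no tag check.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Predict taken when the counter is in one of the two 'taken' states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def run(outcomes, pc=0x8000):
    """Count correct predictions for one branch with the given outcome history."""
    predictor = TwoBitPredictor()
    hits = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            hits += 1
        predictor.update(pc, taken)
    return hits


# A loop back-edge: taken nine times, then not taken when the loop exits.
loop_branch = [True] * 9 + [False]
print(run(loop_branch), "of", len(loop_branch), "predicted correctly")
```

With this history the predictor misses only the first iteration and the loop exit, which is the usual motivation for using two bits rather than one.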

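A brief summary:

#1	Fetch#1		Dynamic branch prediction, instruction address calculation, and instruction fetch
#2	Fetch#2		Static branch prediction
#3	Decode		ARM/Thumb instruction decode; if the instruction is not a valid ARM/Thumb instruction, the coprocessor signals determine whether it is a coprocessor instruction. Static BPR Stack
#4	ISS (Instruction Issue)		Register read and dispatch to one of three execution paths: the ALU pipeline for arithmetic/logic operations, the MAC pipeline for multiply-accumulate, and the Load/Store pipeline for data access.
ALU Pipeline				MAC Pipeline		Load/Store Pipeline
#5	Shifter	Shifts the operand of the arithmetic/logic instruction		MAC1	Multiply-accumulate stage 1	LS Add	Computes the memory address for the load/store operation
#6	ALU	Integer arithmetic and logic operations		MAC2	Multiply-accumulate stage 2	DC1	Data cache access stage 1
#7	SAT	Saturates the result of the operation		MAC3	Multiply-accumulate stage 3	DC2	Data cache access stage 2
#8	Write Back		Writes the result of the operation or load back to the register file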

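Next, the ARM Cortex-A series. Here ARM introduced a superscalar pipeline, which lets the processor handle more than one instruction per cycle in parallel. Taking the Cortex-A8 as an example, it has a 13-stage integer pipeline and a 10-stage pipeline for NEON media instructions. For integer instructions the Cortex-A8 is a dual-issue, in-order design: unlike earlier ARM cores, which handled one integer instruction at a time, the Cortex-A8 can issue two integer instructions together and process them in parallel in one cycle through its two integer ALU pipelines.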

[Figure: Cortex-A8 pipeline overview. 13-stage integer pipeline: F#0-F#2 (Instruction Fetch), D#0-D#4 (Instruction Decode with Dual-Issue), E#0-E#5 (Architectural Register File; ALU/MUL Pipeline 0, ALU Pipeline 1, Load/Store Pipeline 0 or 1). 10-stage NEON pipeline: M#0-M#3 (NEON Instruction Queue, NEON Instruction Decode, NEON Register File), N#1-N#6 (Integer ALU Pipe, Integer MUL Pipe, Integer Shift Pipe, Non-IEEE FP Add Pipe, Non-IEEE FP Mul Pipe, IEEE FP Engine, Load/Store Permute Pipe).]

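The Cortex-A8 has two ALU pipelines, ALU 0 and ALU 1, which are symmetric and can process two integer operations at the same time. Because of how the pipelines are arranged, multiply instructions are tied to ALU 0 (that is, pipeline 0 handles both integer arithmetic/logic and multiply instructions), while load/store instructions can pair with either ALU 0 or ALU 1.

where: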

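13-Stage Integer Pipeline

Stage	Name	Description
0	F#0	Generates the address of the instruction to fetch (AGU, Address Generation Unit). The documentation does not count this stage as part of the 13-stage pipeline.
1	F#1	RAM + TLB access. Two-level global-history branch predictor (BTB and GHB) plus a return stack, described below the table.
2	F#2	12-entry fetch queue
3-7	D#0-D#4	Decode
8	E#0	Architectural Register File; ALU/MUL Pipeline 0, ALU Pipeline 1, Load/Store Pipeline 0 or 1
9-13	E#1-E#5	Execution; E#4 sends the branch-predictor update (BP Update) back to F#0

F#1 supports a two-level Global History Branch Predictor, made up of:

1. BTB (Branch Target Buffer)

Determines whether the address being fetched holds a branch instruction, and if so the target address to jump to. It holds 512 entries. On a BTB hit, the GHB is consulted next.

2. GHB (Global History Buffer)

Contains 4096 2-bit counters that encode the strength and direction of branch predictions. The GHB is indexed by a 10-bit history of the last ten branches together with 4 bits of the PC (Program Counter).

In addition, the Return Stack (RS) records eight 32-bit Link Register values. When a function-return instruction is detected, the eight most recent Link Register values in the Return Stack help the dynamic branch predictor predict the likely branch target.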

10-Stage NEON Pipeline

Stage	Name	Instruction Decode	Load and Store with Alignment
1	M#0	16-entry NEON Instruction Queue / Instruction Decode	Mux L1/MCR
2	M#1	Decode Queue and Read/Write Check	8-entry Load Queue
3	M#2	Score-Board and Issue-Logic	Load Align
4	M#3	NEON Register Read and M3 forwarding muxes	Mux with NRF

Execution pipes: Integer ALU Pipe, Integer MUL Pipe, Integer Shift Pipe, Non-IEEE FP Add Pipe, Non-IEEE FP Mul Pipe, IEEE Single/Double precision VFP, Load/Store and Permute Pipe.

Stage	Name	Operations (in the order they appear in the source figure)
5	N#1	FMT, DUP, SHIFT#1, FFMT, FDUP, VFP, PERM#1
6	N#2	ALU, MUL#1, SHIFT#2, FADD#1, FMUL#1, Write Back, PERM#2
7	N#3	ABS, MUL#2, SHIFT#3, FADD#2, FMUL#2, Store Align
8	N#4	ACC#1, FADD#3, FMUL#3, 8-entry Store Queue
9	N#5	ACC#2, FADD#4, FMUL#4
10	N#6	Write Back (Update to ARM/NEON Register File)

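The Cortex-A8 has two levels of cache. The L1 cache provides separate 16 KB or 32 KB instruction and data caches (Harvard architecture), with one parity bit per byte. Each L1 cache is 4-way set associative, and a Hash Virtual Address Buffer (HVAB) predicts which of the four ways an access from the pipeline will hit, reducing both access time and power. Write-back and write-through policies are supported.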
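
To make the "4-way" organization concrete, the sketch below splits an address into tag, set index and byte offset for a set-associative cache. The 64-byte line size is an assumption made for the example (the text above does not state it), the HVAB way prediction is not modeled, and `cache_geometry` and `split_address` are hypothetical helper names.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    index_bits = sets.bit_length() - 1          # log2 of the number of sets
    return sets, offset_bits, index_bits


def split_address(addr, size_bytes=32 * 1024, ways=4, line_bytes=64):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


sets, offset_bits, index_bits = cache_geometry(32 * 1024, 4, 64)
print(f"{sets} sets, {offset_bits} offset bits, {index_bits} index bits")

tag, index, offset = split_address(0x80012345)
print(f"tag=0x{tag:X} index={index} offset={offset}")
```

Each set holds four lines (one per way); a lookup compares the tag against all four, which is the work that a way predictor tries to cut down.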

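The L2 cache ranges from 64 KB to 2 MB and is shared by instructions and data. A high-speed interface connects L1 and L2, which keeps the processor from frequently going out to the external AXI bus and competing with other peripherals for it, with the performance cost that implies. The L2 cache is 8-way set associative, optionally supports ECC and parity, and supports write-back, write-through and write-allocate policies.

The ARM Cortex-A8 supports the new NEON media instruction set through a coprocessor-style architecture. ARM identifies coprocessor instructions at decode or issue time by checking with the coprocessor whether it supports the instruction.

The NEON pipeline is attached after the ARM integer pipeline, so all exceptions and branch predictions have already been resolved before an instruction reaches it. Likewise, memory loads and stores are carried out ahead of the NEON pipeline by the ARM core's load/store pipeline against the L1 D-cache, and the data is held in the NEON pipeline's Load/Store Data Queue.

NEON has its own instruction buffer (the NEON Instruction Queue). Given ARM's dual-issue design, up to two valid NEON instructions can be dispatched per processor cycle, and NEON instructions can load or store 128 bits at a time from the L1 or L2 cache.

NEON has three integer SIMD pipelines (integer multiply-accumulate, integer shift, and integer ALU), one load-store/permute pipeline (responsible for NEON loads and stores and for moving data to and from the integer unit), two SIMD single-precision floating-point pipelines (one for multiply, one for add), and one non-pipelined Vector Floating-Point unit (VFPLite, which follows the ARM VFPv3 specification, complies with IEEE 754, and is backward compatible with ARM's earlier floating-point implementations). NEON instructions execute in order in the pipeline, and each one is either a NEON integer SIMD instruction or a NEON floating-point instruction.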

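As clock speeds rise, each pipeline stage can do less work per cycle (since each cycle is shorter), which goes hand in hand with more pipeline stages. As long as branch prediction is accurate and the pipeline is rarely flushed, the extra stages pay off through the higher clock rate.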
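
A rough back-of-the-envelope model of this trade-off, offered as a sketch only: the branch fraction, misprediction rate, flush penalties and clock rates below are illustrative assumptions rather than measured figures for any ARM core, and `effective_cpi` and `ns_per_instruction` are hypothetical helper names.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


def ns_per_instruction(cpi, clock_mhz):
    """Average time per instruction in nanoseconds at the given clock rate."""
    return cpi * 1000.0 / clock_mhz


# Assumed workload: 20% of instructions are branches, 15% of those are mispredicted.
# Assumed flush penalty: roughly one cycle per pipeline stage.
shallow = effective_cpi(1.0, 0.20, 0.15, flush_penalty=8)    # 8-stage pipeline
deep = effective_cpi(1.0, 0.20, 0.15, flush_penalty=13)      # 13-stage pipeline

print("8-stage @ 550 MHz :", round(ns_per_instruction(shallow, 550), 3), "ns/instruction")
print("13-stage @ 1 GHz  :", round(ns_per_instruction(deep, 1000), 3), "ns/instruction")
```

Under these assumptions the deeper pipeline pays more cycles per misprediction but still finishes each instruction sooner because of its higher clock; raising `mispredict_rate` shows how quickly that advantage shrinks.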

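Every instruction in the ARM instruction set carries a 4-bit condition field. In a pipelined design the processor can check the flags in the PSR (Program Status Register) directly to decide whether the instruction should execute, which helps performance.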
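
As an illustration, the sketch below extracts the condition field (bits 31:28 of a 32-bit ARM instruction) and evaluates it against the N, Z, C and V flags. The helper names `instruction_condition` and `condition_passed` are made up for this example; the flag tests follow the standard ARM condition code table (EQ/NE, CS/CC, MI/PL, VS/VC, HI/LS, GE/LT, GT/LE, AL).

```python
def instruction_condition(instruction):
    """Return the 4-bit condition field held in bits 31:28 of an ARM instruction."""
    return (instruction >> 28) & 0xF


def condition_passed(cond, n, z, c, v):
    """Evaluate a 4-bit ARM condition code against the N, Z, C, V flags."""
    base = cond >> 1
    if base == 0:
        result = z                    # EQ / NE
    elif base == 1:
        result = c                    # CS / CC
    elif base == 2:
        result = n                    # MI / PL
    elif base == 3:
        result = v                    # VS / VC
    elif base == 4:
        result = c and not z          # HI / LS
    elif base == 5:
        result = n == v               # GE / LT
    elif base == 6:
        result = (n == v) and not z   # GT / LE
    else:
        result = True                 # AL
    # The low bit inverts the test, except for the 0b1111 encoding.
    if (cond & 1) and cond != 0b1111:
        result = not result
    return bool(result)


beq = 0x0A000000   # BEQ with a zero offset: condition field 0b0000 (EQ)
cond = instruction_condition(beq)
print(cond, condition_passed(cond, n=False, z=True, c=False, v=False))
```

An instruction whose condition fails simply does nothing, so short conditional sequences can be written without a branch and without risking a pipeline flush.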

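ARM core names are also meaningful. For example, ARM7-TDMI (ARM7 + Thumb + Debug + Multiplier + ICE) denotes an ARM7 core that supports 16-bit Thumb code, on-chip JTAG debug (IEEE 1149.1), a hardware multiplier, and the ICE-RT embedded logic/trace macrocell. Similarly, J indicates support for the Jazelle instruction set and F indicates vector floating-point support.

A quick introduction to ARM's latest Cortex processors. From the early ARM7 (ARMv4) and ARM9 (ARMv5) through ARM11 (ARMv6) to today's Cortex (ARMv7), every generation has brought new instructions (for example, v4T introduced the Thumb instruction set, v5E added enhanced DSP instructions, and v6 added Thumb-2 and SIMD instructions) along with many architectural and performance improvements. With Cortex, ARM for the first time launched three product tiers at once, described below:

Cortex-A (Application) series	Mainly for high-performance open platforms, which generally include an MMU, running for example Symbian, Linux/Android, or Windows Mobile/Phone.
Cortex-R (Real-Time) series	For high-end embedded systems such as automotive electronics and robotic arms, which demand a powerful processor, high reliability, and fast response to events.
Cortex-M (Microcontroller) series	For embedded and single-chip products, targeting the real-time, low-power, low-cost applications previously served by microcontrollers such as the 8051. Taiwan's Nuvoton has released low-cost Cortex-M0 processors (as I understand it, under 1 USD), and the Cortex-M3 supports only a commonly used subset of Thumb-2 (no ARM instruction set) plus an interrupt vector table, providing a high-density, high-performance execution environment.

The differences across the ARM families are listed below.

(Reference: http://en.wikipedia.org/wiki/ARM_architecture)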

ARM Family	ARM Architecture	ARM Core	Features	Cache (I/D)	MMU/MPU	Performance	Applied Product
ARM1	ARMv1	ARM1		None	None		ARM Evaluation System second processor for BBC Micro
ARM2	ARMv2	ARM2	ARMv2 added the MUL (multiply) instruction	None	None	4 MIPS @ 8 MHz 0.33 DMIPS/MHz	Acorn Archimedes, Chessmachine
ARM2	ARMv2a	ARM250	Integrated MEMC (MMU), Graphics and IO processor. ARMv2a added the SWP and SWPB (swap) instructions.	None	MEMC1a	7 MIPS @ 12 MHz	Acorn Archimedes
ARM3	ARMv2a	ARM3	First integrated memory cache.	4 KB unified	None	12 MIPS @ 25 MHz 0.50 DMIPS/MHz	Acorn Archimedes
ARM6	ARMv3	ARM60	ARMv3 first to support 32-bit memory address space (previously 26-bit)	None	None	10 MIPS @ 12 MHz	3DO Interactive Multiplayer, Zarlink GPS Receiver
		ARM600	As ARM60, cache and coprocessor bus (for FPA10 floating-point unit).	4 KB unified	None	28 MIPS @ 33 MHz
		ARM610	As ARM60, cache, no coprocessor bus.	4 KB unified	None	17 MIPS @ 20 MHz 0.65 DMIPS/MHz	Acorn Risc PC 600, Apple Newton 100 series
ARM7	ARMv3	ARM700		8 KB unified	None		Acorn Risc PC prototype CPU card
		ARM710	As ARM700, no coprocessor bus.	8 KB unified	None		Acorn Risc PC 700
		ARM710a	As ARM710	8 KB unified	None	40 MHz 0.68 DMIPS/MHz	Acorn Risc PC 700, Apple eMate 300, Psion Series 5 (ARM7100), Acorn A7000 (ARM7500), Acorn A7000+ (ARM7500FE), Network Computer (ARM7500FE)
ARM7TDMI	ARMv4T	ARM7TDMI(-S)	3-stage pipeline, Thumb	None	None	15 MIPS @ 16.8 MHz 63 DMIPS @ 70 MHz	Game Boy Advance, Nintendo DS, Apple iPod, Lego NXT, Juice Box, Garmin Navigation Devices (1990s – early 2000s)
		ARM710T	As ARM7TDMI, cache	8 KB unified	MMU	36 MIPS @ 40 MHz	Psion Series 5mx, Psion Revo/Revo Plus/Diamond Mako
		ARM720T	As ARM7TDMI, cache, MMU with Fast Context Switch Extension	8 KB unified	MMU	60 MIPS @ 59.8 MHz	Zipit Wireless Messenger
		ARM740T	As ARM7TDMI, cache	8 KB unified	MPU
ARM7EJ	ARMv5TEJ	ARM7EJ-S	5-stage pipeline, Thumb, Jazelle DBX, Enhanced DSP instructions	None	None
ARM8	ARMv4	ARM810	5-stage pipeline, static branch prediction, double-bandwidth memory	8 KB unified	MMU	84 MIPS @ 72 MHz 1.16 DMIPS/MHz	Acorn Risc PC prototype CPU card
StrongARM	ARMv4	SA-1	5-stage pipeline	16 KB/8–16 KB	MMU	203–206 MHz 1.0 DMIPS/MHz	SA-110 Apple Newton 2×00 series, Acorn Risc PC, Rebel/Corel Netwinder, Chalice CATS SA-1100 Psion netBook SA-1110 LART (computer), Intel Assabet, Ipaq H36x0, Balloon2, Zaurus SL-5×00, HP Jornada 7xx, Jornada 560 series, Palm Zire 31
ARM9TDMI	ARMv4T	ARM9TDMI	5-stage pipeline, Thumb	None	None
		ARM920T	As ARM9TDMI, cache, MMU with FCSE (Fast Context Switch Extension)	16 KB/16 KB	MMU	200 MIPS @ 180 MHz	Armadillo, GP32, GP2X (first core), Tapwave Zodiac (Motorola i.MX1), Hewlett-Packard HP-49/50 Calculators, Sun SPOT, HTC TyTN, FIC Neo FreeRunner, Garmin Navigation Devices (mid–late 2000s), TomTom navigation devices
		ARM922T	As ARM9TDMI, caches	8 KB/8 KB	MMU
		ARM940T	As ARM9TDMI, caches	4 KB/4 KB	MPU		GP2X (second core), Meizu M6 Mini Player
ARM9E	ARMv5TE	ARM946E-S	Thumb, Enhanced DSP instructions, caches, TCM (tightly coupled memories)	Variable	MPU		Nintendo DS, Nokia N-Gage, Canon PowerShot A470, Canon EOS 5D Mark II ,Conexant 802.11 chips, Samsung S5L2010
		ARM966E-S	Thumb, Enhanced DSP instructions, TCM (tightly coupled memories)	None
		ARM968E-S	As ARM966E-S	None
	ARMv5TEJ	ARM926EJ-S	Thumb, Jazelle DBX, Enhanced DSP instructions, caches, TCM (tightly coupled memories)	Variable	MMU	220 MIPS @ 200 MHz,	Mobile phones: Sony Ericsson (K, W series);Siemens and Benq (x65 series and newer); LG Arena; GPH Wiz; Squeezebox DuetController (Samsung S3C2412).Squeezebox Radio; Buffalo TeraStation Live (NAS);Drobo FS (NAS); Western Digital MyBook I World Edition; Western Digital MyBook II World Edition; Seagate FreeAgent DockStarSTDSD10G-RK; Seagate FreeAgent GoFlex Home; Chumby Classic
	ARMv5TE	ARM996HS	Clockless processor, as ARM966E-S, TCM (tightly coupled memories)	None	MPU
ARM10E	ARMv5TE	ARM1020E	6-stage pipeline, Thumb, Enhanced DSP instructions, (VFP)	32 KB/32 KB	MMU
	ARMv5TE	ARM1022E	As ARM1020E	16 KB/16 KB	MMU
	ARMv5TEJ	ARM1026EJ-S	Thumb, Jazelle DBX, Enhanced DSP instructions, (VFP)	Variable	MMU or MPU
XScale	ARMv5TE	XScale	7-stage pipeline, Thumb, Enhanced DSP instructions	32 KB/32 KB	MMU	133–400 MHz	80219 Thecus N2100 IOP321 Iyonix PXA210/PXA250 Zaurus SL-5600, iPAQ H3900, Sony CLIE NX60, NX70V, NZ90 PXA255 Gumstix basix & connex, Palm Tungsten E2, Zaurus SL-C860, Mentor Ranger & Stryder, iRex ILiad PXA263 Sony CLIE NX73V, NX80V PXA26x Palm Tungsten T3 PXA27x Gumstix verdex, "Trizeps-Modules", "eSOM270-Module" PXA270 COM, HTC Universal, HP hx4700, Zaurus SL-C1000, 3000, 3100, 3200, Dell Axim x30, x50, and x51 series, Motorola Q, Balloon3, Trolltech Greenphone, Palm TX, Motorola Ezx Platform A728, A780, A910, A1200, E680, E680i, E680g, E690, E895, Rokr E2, Rokr E6, Fujitsu Siemens LOOX N560, Toshiba Portege G500, Treo 650-755p, Zipit Z2, HP iPaq 614c Business Navigator, I-mate PDA2 PXA3XX Samsung Omnia PXA900 Blackberry 8700, Blackberry Pearl (8100) IXP42x NSLU2
		Bulverde	Wireless MMX, Wireless SpeedStep added	32 KB/32 KB	MMU	312–624 MHz
		Monahans	Wireless MMX2 added, 32 KB/32 KB (L1), optional L2 cache up to 512 KB	32 KB/32 KB	MMU	up to 1.25 GHz
ARM11	ARMv6	ARM1136J(F)-S	8-stage pipeline, SIMD, Thumb, Jazelle DBX, (VFP), Enhanced DSP instructions	Variable	MMU		OMAP2420 Nokia E90, Nokia N93, Nokia N95, Nokia N82, Zune, BUGbase, Nokia N800, Nokia N810 MSM7200 Eten Glofiish, HTC TyTN II, HTC Nike Freescale i.MX31 original Zune 30 GB, Toshiba Gigabeat S and Kindle DX Freescale MXC300-30 Nokia E63, Nokia E71, Nokia 5800, Nokia E51, Nokia 6700 Classic, Nokia 6120 Classic, Nokia 6210 Navigator, Nokia 6220 Classic, Nokia 6290, Nokia 6710 Navigator, Nokia 6720 Classic, Nokia E75, Nokia N97, Nokia N81 Qualcomm MSM7201A HTC Dream, HTC Magic, Motorola i1, Motorola Z6, HTC Hero, Samsung SGH-i627 (Propel Pro), Sony Ericsson Xperia X10 Mini Pro Qualcomm MSM7227 ZTE Link, HTC Legend, HTC Aria, Viewsonic ViewPad 7
	ARMv6T2	ARM1156T2(F)-S	9-stage pipeline, SIMD, Thumb-2, (VFP), Enhanced DSP instructions	Variable	MPU
	ARMv6ZK	ARM1176JZ(F)-S	As ARM1136EJ(F)-S, TrustZone	Variable	MMU	965 DMIPS @ 772 MHz up to 2600 DMIPS with four processors	Apple iPhone (original and 3G), Apple iPod touch (1st and 2nd Generation), Motorola RIZR Z8, Motorola RIZR Z10, Nintendo 3DS S3C6410 Samsung Omnia II, Samsung Moment, SmartQ 5, Tablet PC Qualcomm MSM7627 Palm Pixi and Motorola Calgary/Devour
	ARMv6K	ARM11 MPCore	As ARM1136EJ(F)-S, 1–4 core SMP	Variable	MMU
Cortex-A	ARMv7-A	Cortex-A5	VFP, NEON, Jazelle RCT, Thumb/Thumb-2, 1–4 cores,Variable (L1 + L2) Cache, MMU + TrustZone	Variable	MMU	1.57 DMIPS / MHz per core
		Cortex-A8	VFP, NEON, Jazelle RCT, Thumb-2, 13-stage superscalar pipeline, Variable (L1 + L2) Cache, MMU + TrustZone	Variable	MMU	up to 2 000 (2.0 DMIPS/MHz in speed from 600 MHz to greater than 1 GHz)	HTC Desire, SBM7000, Oregon State University OSWALD, Gumstix Overo Earth, Pandora, Apple iPhone 3GS, Apple iPod touch (3rd and 4th Generation), Apple iPad (A4), Apple iPhone 4 (A4), Archos 5, BeagleBoard, Motorola Droid, Motorola Droid X, Motorola Droid 2, Motorola Droid R2D2 Edition, Palm Pre, Samsung Omnia HD, Samsung Wave S8500, Samsung i9000 Galaxy S, Sony Ericsson Satio, Touch Book, Nokia N900, Meizu M9, Google Nexus S, Sharp PC-Z1 “Netwalker”.
		Cortex-A9 MPCore	Application profile, VFPv3 FPU, NEON, Thumb-2, Jazelle RCT/DBX, out-of-order speculative issue superscalar, 1–4 core SMP, 32 KB/32 KB L1, up to 4 MB L2, MMU + TrustZone	Variable	MMU	2.5 DMIPS/MHz per core, 10 000 DMIPS @ 2 GHz on Performance Optimized TSMC40G(dual core)	LG Optimus 2X, Motorola Atrix 4G,Motorola DROID BIONIC, Motorola Xoom Pandaboard
		Cortex-A15 MPCore	Application profile, VFPv4 FPU, NEON, Thumb-2, Jazelle RCT/DBX, out-of-order speculative issue superscalar, Large Physical Address Extensions (LPAE), Hardware virtualization, 1–4 SMP cores, 32 KB/32 KB L1, up to 4 MB L2, MMU + TrustZone	Variable	MMU
Cortex-R	ARMv7-R	Cortex-R4(F)	Real-time profile, Thumb-2, (FPU), variable cache, MPU optional	Variable	MPU	600 DMIPS @ 475 MHz
Cortex-M	ARMv6-M	Cortex-M0	Microcontroller profile, Thumb-2 subset (16-bit Thumb instructions & BL, MRS, MSR, ISB, DSB, and DMB). Hardware multiply instruction optional	None	None	0.9 DMIPS/MHz
	ARMv6-M	Cortex-M1	FPGA targeted, Microcontroller profile, Thumb-2 subset (16-bit Thumb instructions & BL, MRS, MSR, ISB, DSB, and DMB),TCM(tightly coupled memory)optional.	None	None	Up to 136 DMIPS @ 170 MHz (0.8 DMIPS/MHz, MHz achievable FPGA-dependent)
	ARMv7-M	Cortex-M3	Microcontroller profile, Thumb-2 only. Hardware divide instruction, no cache, MPU optional.	None	MPU	1.25 DMIPS/MHz
	ARMv7-ME	Cortex-M4	Microcontroller profile, both Thumb and Thumb-2, FPU. Hardware MAC, SIMD and divide instructions, MPU optional	None	MPU	1.25 DMIPS/MHz

A brief description of the ARM architecture follows.

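An ARM processor normally starts executing at address 0x00000000 and comes out of reset in SVC (Supervisor) mode. Through the system coprocessor it can be configured as little-endian (the least significant byte is stored at the lowest address) or big-endian (the most significant byte is stored at the lowest address). ARM uses memory-mapped I/O, whereas x86 also has a separate I/O-mapped space that is reached only through the in/out instructions.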
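
A quick way to see the two byte orders is to pack the same 32-bit value both ways. This sketch uses Python's standard `struct` module on the host machine and only illustrates the byte layout; the value 0x12345678 is an arbitrary example and nothing here is ARM-specific.

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)   # little-endian: least significant byte first
big = struct.pack(">I", value)      # big-endian: most significant byte first

for offset, (le_byte, be_byte) in enumerate(zip(little, big)):
    print(f"offset {offset}: little-endian 0x{le_byte:02X}, big-endian 0x{be_byte:02X}")
```

Reading the four bytes back with the wrong format character yields a byte-swapped value, which is the typical symptom of an endianness mismatch when porting drivers or parsing wire formats.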

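ARM supports eight processor modes, listed below.

(Reference: ARMv7-AR Architecture Reference Manual)

Processor mode	xPSR mode encoding	Privilege	Description
USR	b10000	Unprivileged	User mode. Applications on ARM Linux, for example, run in this mode.
FIQ	b10001	Privileged	Fast interrupt mode
IRQ	b10010	Privileged	General interrupt handling
SVC (Supervisor)	b10011	Privileged	Supervisor (protected) mode. An RTOS that does not separate privilege levels, or the kernel mode of an OS that does, normally runs in this mode. A software interrupt raised by the user with SWI (or SVC) also enters SVC mode (the Linux kernel, for example, implements system calls with SWI).
MON (Monitor)	b10110	Privileged	Present only when the processor implements the Security Extensions. The SMC (Secure Monitor Call) instruction takes the system into this Secure mode, and the Secure Configuration Register can be set so that IRQ, FIQ and aborts raised by the system are also handled here.
ABT (Abort)	b10111	Privileged	Memory access fault mode (entered on a Data Abort or Prefetch Abort).
UND (Undefined)	b11011	Privileged	Undefined instruction mode. When the processor meets an instruction it cannot decode, it first checks with the coprocessors whether it is a coprocessor instruction; if not, an exception is raised and this mode is entered. Software debuggers can also plant an undefined instruction to serve as a breakpoint.
SYS	b11111	Privileged	System mode. It shares the same registers as User mode (R0-R15 and CPSR, with no SPSR); the main difference is that User mode is unprivileged.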

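When debugging, the five bits M[4:0] of the xPSR (CPSR for the current mode, SPSR for the mode before the exception) show the current and previous processor modes, which helps in working out what led up to a problem.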
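
A small decoder for those bits, as a sketch: the mode encodings come from the table above, the I/F/T bit positions (7, 6 and 5) are the standard CPSR layout, and the register values in the example are invented for illustration. `decode_psr` is a hypothetical helper name.

```python
MODES = {
    0b10000: "USR",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "SVC",
    0b10110: "MON",
    0b10111: "ABT",
    0b11011: "UND",
    0b11111: "SYS",
}


def decode_psr(psr):
    """Decode the mode field and the I/F/T bits of a CPSR or SPSR value."""
    return {
        "mode": MODES.get(psr & 0x1F, "reserved"),
        "irq_masked": bool(psr & (1 << 7)),   # I bit
        "fiq_masked": bool(psr & (1 << 6)),   # F bit
        "thumb": bool(psr & (1 << 5)),        # T bit
    }


# Invented example values, as they might be read from a crash dump.
cpsr = 0x600000D3
spsr = 0x60000010

print("current :", decode_psr(cpsr))
print("previous:", decode_psr(spsr))
```

For these example values the current mode is SVC with IRQ and FIQ masked, and the saved SPSR shows the processor came from User mode, which would be consistent with a system call in progress.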

An ARM processor has 31 32-bit general-purpose registers (not counting the Monitor-mode R13 and R14 added by the Security Extensions): R0-R15, R13/R14_svc, R13/R14_abt, R13/R14_und, R13/R14_irq and R8-R14_fiq. It also has 6 status registers: CPSR, SPSR_svc, SPSR_abt, SPSR_und, SPSR_irq and SPSR_fiq.

			Privileged Modes
				Exception Modes
Register description	Application View	User Mode	System Mode	FIQ Mode	IRQ Mode	Supervisor Mode	Abort Mode	Undefined Mode	Monitor Mode
Function argument #0	R0	R0_usr
Function argument #1	R1	R1_usr
Function argument #2	R2	R2_usr
Function argument #3	R3	R3_usr
	R4	R4_usr
	R5	R5_usr
	R6	R6_usr
	R7	R7_usr
	R8	R8_usr		R8_fiq
	R9	R9_usr		R9_fiq
	R10	R10_usr		R10_fiq
	R11	R11_usr		R11_fiq
	R12	R12_usr		R12_fiq
	SP	SP_usr		SP_fiq	SP_irq	SP_svc	SP_abt	SP_und	SP_mon
	LR	LR_usr		LR_fiq	LR_irq	LR_svc	LR_abt	LR_und	LR_mon
	PC	PC
	APSR	CPSR
				SPSR_fiq	SPSR_irq	SPSR_svc	SPSR_abt	SPSR_und	SPSR_mon
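
The banking rules in the table can be written as a lookup from (mode, register number) to the physical register that is actually accessed. This is a sketch with made-up helper names (`physical_register`, `count_registers`); it uses the R13/R14 numbering rather than the SP/LR aliases and covers only the general-purpose registers, not the PSRs.

```python
# Register numbers that each mode banks privately; everything else is shared with User mode.
BANKED = {
    "fiq": tuple(range(8, 15)),   # R8-R14
    "irq": (13, 14),
    "svc": (13, 14),
    "abt": (13, 14),
    "und": (13, 14),
    "mon": (13, 14),              # only with the Security Extensions
}


def physical_register(mode, n):
    """Name the physical register seen as Rn from the given processor mode."""
    if not 0 <= n <= 15:
        raise ValueError("register number must be 0-15")
    if n == 15:
        return "PC"
    if n in BANKED.get(mode, ()):
        return f"R{n}_{mode}"
    return f"R{n}_usr"


def count_registers(modes):
    """Count the distinct general-purpose registers reachable from the given modes."""
    return len({physical_register(mode, n) for mode in modes for n in range(16)})


print(physical_register("fiq", 8), physical_register("irq", 8), physical_register("svc", 13))
print(count_registers(["usr", "sys", "fiq", "irq", "svc", "abt", "und"]))
```

Counting over the seven modes without Monitor mode gives the 31 registers quoted in the text; adding "mon" brings in its banked R13 and R14 as well.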

In general, 16-bit Thumb code uses only R0-R7 (a 3-bit register index), while the 32-bit ARM instruction set can use the full R0-R12. R13 (SP), R14 (LR) and R15 (PC) are used in every mode.

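Apart from cores with a vectored interrupt table (such as the Cortex-M3), ARM cores generally support the following eight exception vectors (the sixth is reserved). The V bit (bit 13) of the CP15 c1 register selects whether the vector table sits at low addresses (V=0: 0x00000000-0x0000001C) or high addresses (V=1: 0xFFFF0000-0xFFFF001C). Early in product development, in an environment without an MMU enforcing memory protection between User and privileged modes, it is advisable to place the vector table at the high addresses, so that NULL-pointer bugs during development do not turn into system-level failures.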
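
The address of each vector follows directly from the V bit and the fixed offsets. A minimal sketch, assuming only the two base addresses described above (any other vector relocation mechanism a particular core may offer is ignored); `vector_address` and the offset table names are hypothetical.

```python
VECTOR_OFFSETS = {
    "reset": 0x00,
    "undefined": 0x04,
    "svc": 0x08,
    "prefetch_abort": 0x0C,
    "data_abort": 0x10,
    "reserved": 0x14,
    "irq": 0x18,
    "fiq": 0x1C,
}

V_BIT = 1 << 13   # V bit of the CP15 c1 control register


def vector_address(exception, control_register):
    """Return the vector address for an exception given the CP15 c1 register value."""
    base = 0xFFFF0000 if control_register & V_BIT else 0x00000000
    return base + VECTOR_OFFSETS[exception]


for name in VECTOR_OFFSETS:
    low = vector_address(name, 0)
    high = vector_address(name, V_BIT)
    print(f"{name:15s} low=0x{low:08X} high=0x{high:08X}")
```

With the low vectors, a write through a NULL pointer plus a small offset lands inside this table, which is why the high-vector setting is the safer choice when no memory protection is available.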

When an exception occurs, the processor first saves the current CPSR into the SPSR of the mode that handles the exception (so the state before the exception can be inspected), then stores into LR the base address plus an exception-dependent offset. The exception-dependent offset for each exception is:

Exception	Base LR value	Offset in ARM state	Offset in Thumb or ThumbEE state	Offset in Jazelle state
Undefined Instruction	Address of the undefined instruction	4	2	2 or 4
SVC	Address of SVC instruction	4	2	X
SMC	Address of SMC instruction	4	4	X
Prefetch Abort	Address of aborted instruction fetch	4	4	4
Data Abort	Address of instruction that generated the abort	8	8	8
IRQ/FIQ	Address of next instruction to execute	4	4	4
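
The table translates into a one-line calculation of the value the handler finds in LR. The sketch below covers only the ARM and Thumb columns (Jazelle is left out), and the addresses in the example are invented; `exception_lr` is a hypothetical helper name.

```python
# Offsets from the table above: exception -> {instruction set state: offset}
LR_OFFSETS = {
    "undefined": {"arm": 4, "thumb": 2},
    "svc": {"arm": 4, "thumb": 2},
    "smc": {"arm": 4, "thumb": 4},
    "prefetch_abort": {"arm": 4, "thumb": 4},
    "data_abort": {"arm": 8, "thumb": 8},
    "irq": {"arm": 4, "thumb": 4},
    "fiq": {"arm": 4, "thumb": 4},
}


def exception_lr(exception, base_address, state="arm"):
    """Return the LR value on exception entry: base LR value plus the table offset."""
    return (base_address + LR_OFFSETS[exception][state]) & 0xFFFFFFFF


# Invented example: an instruction at 0x8000 triggers each kind of exception.
svc_lr = exception_lr("svc", 0x8000)            # points at the instruction after the SVC
abort_lr = exception_lr("data_abort", 0x8000)   # faulting instruction is at LR - 8

print(hex(svc_lr), hex(abort_lr), hex(abort_lr - 8))
```

This is why a data abort handler that wants to retry the faulting access returns to LR minus 8, while an SVC handler returns to LR unchanged.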

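The processor then sets the PC to the address of the exception handler, updates CPSR M[4:0] to the mode of the exception, and masks the relevant interrupts to prevent re-entry (IRQ is always masked; FIQ, Secure Monitor and Reset entry mask both IRQ and FIQ). The TE bit (bit 30) of the CP15 c1 register selects the instruction set state in which the exception handler runs (TE=0: ARM, TE=1: Thumb), CPSR.E (bit 9), loaded from SCTLR.EE, sets the data endianness during the exception, and CPSR IT[7:0] is cleared to 0. Then the exception handler starts executing.

To suspend the ARM core in a low-power state (similar to clock gating a device, without the PMIC cutting power), the WFI (Wait For Interrupt) instruction makes the core wait for an external interrupt such as IRQ or FIQ. In a product this would be a phone keypress or a Real-Time Clock interrupt, which wakes the processor and resumes normal execution. Alternatively, the System Controller can cut power to the processor (entering Doze mode), but compared with WFI the processor must then be re-initialized and its state saved beforehand in TCM (Tightly-Coupled Memory). Which approach fits depends on what the product needs to achieve.

WFI usually lives in the system's idle task. When no work is waiting to run, the system hands control to the lowest-priority idle task, which looks at when the system next needs to wake up, decides whether to put external memory into a power-saving mode, and then executes WFI to suspend the processor in a low-power state.

The exception types, priorities and vector addresses are listed below.

(Reference: ARMv7-AR Architecture Reference Manual).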

中斷位址	中斷類型	優先級	對應處理器模式	發生時處理器對應的動作.
0x0000-0000 (0xFFFF-0000)	系統重置 Reset	1	SVC	TakeReset() // Enter Supervisor mode and (if relevant) Secure state, and reset CP15. This affects the banked versions and values of various registers accessed later in the code. Also reset other system components. CPSR.M = ‘10011’; // Supervisor mode if HaveSecurityExt() then SCR.NS = ‘0’; ResetCP15Registers(); ResetDebugRegisters(); if HaveAdvSIMDorVFP() then FPEXC.EN = ‘0’; SUBARCHITECTURE_DEFINED further resetting; if HaveThumbEE() then TEECR.XED = ‘0’; if HaveJazelle() then JMCR.JE = ‘0’; SUBARCHITECTURE_DEFINED further resetting; // Further CPSR changes: all interrupts disabled, IT state reset, instruction set and endianness according to the SCTLR values produced by the above call to ResetCP15Registers(). CPSR.I = ‘1’; CPSR.F = ‘1’; CPSR.A = ‘1’; CPSR.IT = ‘00000000’; CPSR.J = ‘0’; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian // All registers, bits and fields not reset by the above pseudocode or by the BranchTo() call below are UNKNOWN bitstrings after reset. In particular, the return information registers R14_svc and SPSR_svc have UNKNOWN values, so that it is impossible to return from a reset in an architecturally defined way. Branch to Reset vector. BranchTo(ExcVectorBase() + 0);
0x0000-0004 (0xFFFF-0004)	未定義指令集 Undefined Instruction	6	UND	TakeUndefInstrException() // Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC minus 2 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required return address offsets of 2 or 4 respectively. new_lr_value = if CPSR.T == ‘1’ then PC-2 else PC-4; new_spsr_value = CPSR; // Enter Undefined (‘11011’) mode, and ensure Secure state if initially in Monitor (‘10110’) mode. This affects the banked versions of various registers accessed later in the code. if CPSR.M == ‘10110’ then SCR.NS = ‘0’; CPSR.M = ‘11011’; // Write return information to registers, and make further CPSR changes: IRQs disabled, IT state reset, instruction set and endianness to SCTLR-configured values. SPSR[] = new_spsr_value; R[14] = new_lr_value; CPSR.I = ‘1’; CPSR.IT = ‘00000000’; CPSR.J = ‘0’; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian // Branch to Undefined Instruction vector. BranchTo(ExcVectorBase() + 4); //// 在ARMv7架構下,也可以讓Undefined Instruction執行類似NOP的動作,處理器不會觸發Exception,只是忽略該指令的執行.
0x0000-0008 (0xFFFF-0008)	軟體中斷 SWI 或 Secure Monitor Call (SMC)	6	SVC 或 SMC Mode	TakeSVCException() // Determine return information. SPSR is to be the current CPSR, after changing the IT[] bits to give them the correct values for the following instruction, and LR is to be the current PC minus 2 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the next instruction (the SVC instruction having size 2 or 4 bytes respectively). ITAdvance(); new_lr_value = if CPSR.T == ‘1’ then PC-2 else PC-4; new_spsr_value = CPSR; // Enter Supervisor (‘10011’) mode, and ensure Secure state if initially in Monitor (‘10110’) mode. This affects the banked versions of various registers accessed later in the code. if CPSR.M == ‘10110’ then SCR.NS = ‘0’; CPSR.M = ‘10011’; // Write return information to registers, and make further CPSR changes: IRQs disabled, IT state reset, instruction set and endianness to SCTLR-configured values. SPSR[] = new_spsr_value; R[14] = new_lr_value; CPSR.I = ‘1’; CPSR.IT = ‘00000000’; CPSR.J = ‘0’; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian // Branch to SVC vector. BranchTo(ExcVectorBase() + 8); 或 TakeSMCException() // Determine return information. SPSR is to be the current CPSR, after changing the IT[] bits to give them the correct values for the following instruction, and LR is to be the current PC minus 0 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the next instruction (with the SMC instruction always being 4 bytes in length). ITAdvance(); new_lr_value = if CPSR.T == ‘1’ then PC else PC-4; new_spsr_value = CPSR; // Enter Monitor (‘10110’) mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code. 
if CPSR.M == ‘10110’ then SCR.NS = ‘0’; CPSR.M = ‘10110’; // Write return information to registers, and make further CPSR changes: interrupts disabled, IT state reset, instruction set and endianness to SCTLR-configured values. SPSR[] = new_spsr_value; R[14] = new_lr_value; CPSR.I = ‘1’; CPSR.F = ‘1’; CPSR.A = ‘1’; CPSR.IT = ‘00000000’; CPSR.J = ‘0’; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian // Branch to SMC vector. BranchTo(MVBAR + 8);
0x0000-000C (0xFFFF-000C)	指令記憶體存取錯誤 Prefetch Abort	5	ABT	TakePrefetchAbortException() // Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC minus 0 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the current instruction plus 4. new_lr_value = if CPSR.T == ‘1’ then PC else PC-4; new_spsr_value = CPSR; // Determine whether this is an external abort to be trapped to Monitor mode. trap_to_monitor = HaveSecurityExt() && SCR.EA == ‘1’ && IsExternalAbort(); // Enter Abort (‘10111’) or Monitor (‘10110’) mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code. if CPSR.M == ‘10110’ then SCR.NS = ‘0’; CPSR.M = if trap_to_monitor then ‘10110’ else ‘10111’; // Write return information to registers, and make further CPSR changes: IRQs disabled, other interrupts disabled if appropriate, IT state reset, instruction set and endianness to SCTLR-configured values. SPSR[] = new_spsr_value; R[14] = new_lr_value; CPSR.I = ‘1’; if trap_to_monitor then CPSR.F = ‘1’; CPSR.A = ‘1’; else if !HaveSecurityExt() \|\| SCR.NS == ‘0’ \|\| SCR.AW == ‘1’ then CPSR.A = ‘1’; CPSR.IT = ‘00000000’; CPSR.J = ‘0’; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian // Branch to correct Prefetch Abort vector. if trap_to_monitor then BranchTo(MVBAR + 12); else BranchTo(ExcVectorBase() + 12);
0x0000-0010 (0xFFFF-0010)	資料記憶體存取錯誤 Data Abort	2	ABT	TakeDataAbortException() // Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC plus 4 for Thumb or 0 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the current instruction plus 8. For an asynchronous abort, the PC and CPSR are considered to have already moved on to their values for the instruction following the instruction boundary at which the exception occurred. new_lr_value = if CPSR.T == ‘1’ then PC+4 else PC; new_spsr_value = CPSR; // Determine whether this is an external abort to be trapped to Monitor mode. trap_to_monitor = HaveSecurityExt() && SCR.EA == ‘1’ && IsExternalAbort(); // Enter Abort (‘10111’) or Monitor (‘10110’) mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code. if CPSR.M == ‘10110’ then SCR.NS = ‘0’; CPSR.M = if trap_to_monitor then ‘10110’ else ‘10111’; // Write return information to registers, and make further CPSR changes: IRQs disabled, other interrupts disabled if appropriate, IT state reset, instruction set and endianness to SCTLR-configured values. SPSR[] = new_spsr_value; R[14] = new_lr_value; CPSR.I = ‘1’; if trap_to_monitor then CPSR.F = ‘1’; CPSR.A = ‘1’; else if !HaveSecurityExt() \|\| SCR.NS == ‘0’ \|\| SCR.AW == ‘1’ then CPSR.A = ‘1’; CPSR.IT = ‘00000000’; The System Level Programmers’ Model ARM DDI 0406B Copyright © 1996-1998, 2000, 2004-2008 ARM Limited. All rights reserved.B1-57 CPSR.J = ‘0’; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian // Branch to correct Data Abort vector. if trap_to_monitor then BranchTo(MVBAR + 16); else BranchTo(ExcVectorBase() + 16);
0x0000-0014 (0xFFFF-0014)	Reserved	Unused
0x0000-0018 (0xFFFF-0018)	IRQ (normal external interrupt)	4	IRQ	TakeIRQException() // Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC minus 0 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the instruction boundary at which the interrupt occurred plus 4. For this purpose, the PC and CPSR are considered to have already moved on to their values for the instruction following that boundary. new_lr_value = if CPSR.T == '1' then PC else PC-4; new_spsr_value = CPSR; // Determine whether IRQs are trapped to Monitor mode. trap_to_monitor = HaveSecurityExt() && SCR.IRQ == '1'; // Enter IRQ ('10010') or Monitor ('10110') mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code. if CPSR.M == '10110' then SCR.NS = '0'; CPSR.M = if trap_to_monitor then '10110' else '10010'; // Write return information to registers, and make further CPSR changes: IRQs disabled, other interrupts disabled if appropriate, IT state reset, instruction set and endianness to SCTLR-configured values. SPSR[] = new_spsr_value; R[14] = new_lr_value; CPSR.I = '1'; if trap_to_monitor then CPSR.F = '1'; CPSR.A = '1'; else if !HaveSecurityExt() || SCR.NS == '0' || SCR.AW == '1' then CPSR.A = '1'; CPSR.IT = '00000000'; CPSR.J = '0'; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian // Branch to correct IRQ vector. if trap_to_monitor then BranchTo(MVBAR + 24); elsif SCTLR.VE == '1' then IMPLEMENTATION_DEFINED branch to an IRQ vector; else BranchTo(ExcVectorBase() + 24);
0x0000-001C (0xFFFF-001C)	FIQ (fast interrupt)	3	FIQ	TakeFIQException() // Determine return information. SPSR is to be the current CPSR, and LR is to be the current PC minus 0 for Thumb or 4 for ARM, to change the PC offsets of 4 or 8 respectively from the address of the current instruction into the required address of the instruction boundary at which the interrupt occurred plus 4. For this purpose, the PC and CPSR are considered to have already moved on to their values for the instruction following that boundary. new_lr_value = if CPSR.T == '1' then PC else PC-4; new_spsr_value = CPSR; // Determine whether FIQs are trapped to Monitor mode. trap_to_monitor = HaveSecurityExt() && SCR.FIQ == '1'; // Enter FIQ ('10001') or Monitor ('10110') mode, and ensure Secure state if initially in Monitor mode. This affects the banked versions of various registers accessed later in the code. if CPSR.M == '10110' then SCR.NS = '0'; CPSR.M = if trap_to_monitor then '10110' else '10001'; // Write return information to registers, and make further CPSR changes: IRQs disabled, other interrupts disabled if appropriate, IT state reset, instruction set and endianness to SCTLR-configured values. SPSR[] = new_spsr_value; R[14] = new_lr_value; CPSR.I = '1'; if trap_to_monitor then CPSR.F = '1'; CPSR.A = '1'; else if !HaveSecurityExt() || SCR.NS == '0' || SCR.FW == '1' then CPSR.F = '1'; if !HaveSecurityExt() || SCR.NS == '0' || SCR.AW == '1' then CPSR.A = '1'; CPSR.IT = '00000000'; CPSR.J = '0'; CPSR.T = SCTLR.TE; // TE=0: ARM, TE=1: Thumb CPSR.E = SCTLR.EE; // EE=0: little-endian, EE=1: big-endian // Branch to correct FIQ vector. if trap_to_monitor then BranchTo(MVBAR + 28); elsif SCTLR.VE == '1' then IMPLEMENTATION_DEFINED branch to an FIQ vector; else BranchTo(ExcVectorBase() + 28);

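Take Abort mode as an example: when an Abort exception is taken, IRQs are disabled but FIQs stay enabled. Depending on the SoC, if other interrupt sources (for example a timer) are wired to FIQ on your chip, the Abort handler should disable FIQ immediately. Otherwise another interrupt can be taken while the core is in Abort mode, which makes it harder to locate the real problem when analyzing a system fault.

On ARM processors the cache, MMU, and MPU are managed through coprocessor 15 (CP15). When the PC points to an invalid memory location, a Prefetch Abort is raised; the processor then writes the fault status code to the IFSR (Instruction Fault Status Register) in CP15 and records the address that triggered the Prefetch Abort in the IFAR (Instruction Fault Address Register).

IFSR is a 32-bit read/write register that can be accessed only in privileged modes. Its format is as follows.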

Bits	Field	Description
31-13	UNK/SBZP (Bits [31:13,11,9:4])	UNK/SBZP: unknown on reads, Should-Be-Zero-or-Preserved on writes.
12	ExT	External abort type.
11	UNK/SBZP	UNK/SBZP: unknown on reads, Should-Be-Zero-or-Preserved on writes.
10	FS[4]	Fault status bits.
9-4	UNK/SBZP	UNK/SBZP: unknown on reads, Should-Be-Zero-or-Preserved on writes.
3-0	FS[3:0]	Fault status bits.

IFSR can be read and written with CP15 instructions, for example:

MRC p15,0,<Rt>,c5,c0,1 ; Read CP15 Instruction Fault Status Register

MCR p15,0,<Rt>,c5,c0,1 ; Write CP15 Instruction Fault Status Register

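IFAR is a 32-bit read/write register that can be accessed only in privileged modes. After a Prefetch Abort, it holds the memory address at which the abort occurred.

IFAR can be read and written with CP15 instructions, for example: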

MRC p15,0,<Rt>,c6,c0,2 ; Read CP15 Instruction Fault Address Register

MCR p15,0,<Rt>,c6,c0,2 ; Write CP15 Instruction Fault Address Register

DFSR is a 32-bit read/write register that can be accessed only in privileged modes. Its format is as follows.

Bits	Field	Description
31-13	UNK/SBZP (Bits [31:13,9:8])	UNK/SBZP: unknown on reads, Should-Be-Zero-or-Preserved on writes.
12	ExT	External abort type.
11	WnR	Write not Read bit. Indicates whether the abort was caused by a write or a read access: 0 = abort caused by a read access, 1 = abort caused by a write access. For faults on CP15 cache maintenance operations, including the VA to PA translation operations, this bit always returns a value of 1.
10	FS[4]	Fault status bits.
9-8	b00
7-4	Domain	The domain of the fault address.
3-0	FS[3:0]	Fault status bits.

DFSR can be read and written with CP15 instructions, for example:

MRC p15,0,<Rt>,c5,c0,0 ; Read CP15 Data Fault Status Register

MCR p15,0,<Rt>,c5,c0,0 ; Write CP15 Data Fault Status Register

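DFAR is a 32-bit read/write register that can be accessed only in privileged modes. After a Data Abort, it holds the memory address that was being accessed when the abort occurred.

DFAR can be read and written with CP15 instructions, for example: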

MRC p15,0,<Rt>,c6,c0,0 ; Read CP15 Data Fault Address Register

MCR p15,0,<Rt>,c6,c0,0 ; Write CP15 Data Fault Address Register
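Once a handler has read IFSR or DFSR with MRC, the value still has to be split into the fields listed in the tables above. The following is a minimal Python sketch of that decoding, meant for offline analysis of a dumped register value. The function name `decode_fsr` and the `FAULT_NAMES` table are illustrative assumptions; the table covers only a few short-descriptor fault encodings and should be checked against the ARM Architecture Reference Manual before being relied on.

```python
# Partial, illustrative mapping of FS[4:0] values to fault names
# (short-descriptor format). Assumption: verify against the ARM ARM.
FAULT_NAMES = {
    0b00001: "alignment fault",
    0b00101: "translation fault (section)",
    0b00111: "translation fault (page)",
    0b01001: "domain fault (section)",
    0b01011: "domain fault (page)",
    0b01101: "permission fault (section)",
    0b01111: "permission fault (page)",
}


def decode_fsr(value, is_data):
    """Split a 32-bit IFSR (is_data=False) or DFSR (is_data=True) value."""
    # FS is a 5-bit code: bit 10 supplies FS[4], bits 3-0 supply FS[3:0].
    fs = (((value >> 10) & 0x1) << 4) | (value & 0xF)
    fields = {
        "fs": fs,
        "ext": (value >> 12) & 0x1,  # ExT, external abort type
        "fault": FAULT_NAMES.get(fs, "other (see the ARM ARM)"),
    }
    if is_data:
        # WnR and Domain exist only in the DFSR layout.
        fields["wnr"] = (value >> 11) & 0x1
        fields["domain"] = (value >> 4) & 0xF
    return fields


if __name__ == "__main__":
    # Example DFSR value: WnR set, FS = 0b00101.
    print(decode_fsr(0x00000805, is_data=True))
```

A value captured in the Abort handler can be passed in unchanged; for a Prefetch Abort, call it with `is_data=False` so that the bits which are UNK/SBZP in IFSR are not interpreted.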

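With the ARMv7 architecture, the instruction sets supported by ARM include the 32-bit ARM instruction set (called ARMv32 in this article), Thumb, Thumb2, ThumbEE (Thumb Execution Environment), and Jazelle. The J and T bits of the CPSR (Current Program Status Register), bits 24 and 5 respectively, show which state the processor is in, as listed below (see the ARMv7-AR Architecture Reference Manual).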

J	T	Instruction set state
0	0	ARM
0	1	Thumb
1	0	Jazelle
1	1	ThumbEE

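According to the pseudocode below (from the ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition), the processor can switch directly between these four instruction set states as needed, with one exception: switching from ThumbEE straight to the ARMv32 state is not allowed (it is UNPREDICTABLE). The state is changed with a Branch and Exchange instruction.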

// CurrentInstrSet()
// =================
InstrSet CurrentInstrSet()
    case ISETSTATE of
        when '00' result = InstrSet_ARM;
        when '01' result = InstrSet_Thumb;
        when '10' result = InstrSet_Jazelle;
        when '11' result = InstrSet_ThumbEE;
    return result;

// SelectInstrSet()
// ================
SelectInstrSet(InstrSet iset)
    case iset of
        when InstrSet_ARM
            if CurrentInstrSet() == InstrSet_ThumbEE then
                UNPREDICTABLE;
            else
                ISETSTATE = '00';
        when InstrSet_Thumb
            ISETSTATE = '01';
        when InstrSet_Jazelle
            ISETSTATE = '10';
        when InstrSet_ThumbEE
            ISETSTATE = '11';
    return;
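The pseudocode above can be mirrored in a few lines of Python that work on a raw CPSR value, which is handy when inspecting register dumps. This is only a sketch: the names `current_instr_set` and `select_instr_set` are chosen here for illustration, ISETSTATE is modelled as the J:T bit pair of the CPSR, and the UNPREDICTABLE case is represented by raising an exception.

```python
J_BIT = 24  # CPSR.J
T_BIT = 5   # CPSR.T

# ISETSTATE is the two-bit value J:T.
STATE_NAMES = {0b00: "ARM", 0b01: "Thumb", 0b10: "Jazelle", 0b11: "ThumbEE"}


def current_instr_set(cpsr):
    """Return the instruction set state encoded by CPSR.J and CPSR.T."""
    j = (cpsr >> J_BIT) & 1
    t = (cpsr >> T_BIT) & 1
    return STATE_NAMES[(j << 1) | t]


def select_instr_set(cpsr, target):
    """Return a new CPSR value with J:T set for `target`.

    Switching from ThumbEE directly to ARM is UNPREDICTABLE in the
    architecture; this model raises ValueError for that case.
    """
    if target == "ARM" and current_instr_set(cpsr) == "ThumbEE":
        raise ValueError("ThumbEE -> ARM is UNPREDICTABLE")
    code = {name: bits for bits, name in STATE_NAMES.items()}[target]
    cleared = cpsr & ~((1 << J_BIT) | (1 << T_BIT))
    return cleared | ((code >> 1) << J_BIT) | ((code & 1) << T_BIT)


if __name__ == "__main__":
    cpsr = 0x000001D3  # Supervisor mode, ARM state
    print(current_instr_set(cpsr))
    print(hex(select_instr_set(cpsr, "Thumb")))
```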

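The handling differs between ARM core designs. An ARMv4T core with Thumb support, for example, translates each 16-bit Thumb instruction into the corresponding 32-bit ARM instruction in the Decode stage of its three-stage pipeline before further processing. ARMv7 designs that introduce a superscalar pipeline fetch, depending on the current instruction set state, two 32-bit ARM instructions or two 16-bit Thumb instructions at a time and process them in parallel in the later pipeline stages.

When designing system software, we usually decide which instruction set to use based on the processor, to get the best result for the product. The ARM instruction set gives the highest performance, but because every instruction is 32 bits wide the compiled code is larger. Thumb instructions are a fixed 16 bits; compiled Thumb code is roughly 70% of the size of the ARM code, and its performance is also only roughly 70% of that of ARM code. If your processor supports Thumb2 and the module under development is a video codec, the ARM instruction set is the better choice for audio/video quality. If the module is user interface code or has modest performance requirements, Thumb or Thumb2 is a way to save memory.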

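For a comparison of ARM, Thumb, and Thumb2 performance, see the article Improving ARM Code Density and Performance by Richard Phelan of ARM (http://www.cs.uiuc.edu/class/fa05/cs433ug/PROCESSORS/Thumb2.pdf). For the same functionality implemented in C, code compiled to Thumb2 reaches up to 98% of the performance of the ARM instruction set while needing only 74% of the memory of the original ARM code. As another example, take a 1 MB image of mixed ARM and Thumb code in which the ARM code occupies 200 KB and the Thumbv4 code occupies 800 KB. If everything is compiled as Thumb2, the ARM part shrinks from 200 KB to 150 KB and the Thumbv4 part from 800 KB to 760 KB, saving about 90 KB of code space.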
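The saving quoted in the last example is simple arithmetic, and the same calculation can be reused for your own image sizes. The sketch below uses the four figures from the paragraph above; the function name `thumb2_saving` is an illustrative choice, and 1 MB is treated as 1000 KB as in the text.

```python
def thumb2_saving(arm_kb, thumb_kb, arm_after_kb, thumb_after_kb):
    """Return (KB saved, new size as a fraction of the old size)."""
    before = arm_kb + thumb_kb
    after = arm_after_kb + thumb_after_kb
    return before - after, after / before


if __name__ == "__main__":
    # Figures from the text: 200 KB ARM + 800 KB Thumbv4,
    # recompiled to 150 KB + 760 KB of Thumb2.
    saved, ratio = thumb2_saving(200, 800, 150, 760)
    print(saved, "KB saved; new image is", round(ratio * 100), "% of the old one")
```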

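These figures still depend on the processor version in use (which determines the Thumb and Thumb2 instruction set versions) and on the compiler options. With RVCT, for example, whichever of -O0 to -O3 the user selects, the compiler defaults to -Ospace; if the user selects -Otime, different optimization rules apply (including whether functions are auto-inlined), and the resulting code size differs as well. Even so, Thumb2 has the inherent advantage of mixing 16-bit and 32-bit instructions and offers a fairly rich instruction set (how complete it is varies between ARM cores; the Cortex-M3, for example, supports only a Thumb2 subset), so it is a good choice for code that is not performance-critical.

In addition, ARMv32 instructions are a fixed 32 bits long and encode the condition, the opcode (OP code), whether the CPSR is updated, and the destination and operand registers. The general ARMv32 instruction formats are shown below (see the ARMv7-AR Architecture Reference Manual).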

ARM instruction class	Bits																																Example instructions
	31	30	29	28	27	26	25	24	23		22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	10
Data Processing (Registers)	Cond.				0	0	0	op1														op2					op3		0				AND,EOR,SUB,RSB,ADD,ADC,SBC,RSC,TST,TEQ,CMP,CMN,ORR,MOV,LSL,LSR,ASR,RRX,ROR,BIC,MVN
Data Processing (Register-shifted register)	Cond.				0	0	0	op1																		0	op2		1				AND,EOR,SUB,RSB,ADD,ADC,SBC,RSC,TST,TEQ,CMP,CMN,ORR,LSL,LSR,ASR,ROR,BIC,MVN
Data Processing (Immediate)	Cond.				0	0	1	op						Rn																			AND,EOR,SUB,ADR,RSB,ADD,ADR,ADC,SBC,RSC,TST,TEQ,CMP,CMN,ORR,MOV,BIC,MVN
Multiply and multiply-accumulate	Cond.				0	0	0	0	op																	1	0	0	1				MUL,MLA,UMAAL,MLS,UMULL,UMLAL,SMULL,SMLAL
Saturating addition and subtraction	Cond.				0	0	0	1	0	op			0													0	1	0	1				QADD,QSUB,QDADD,QDSUB
Halfword and multiply and multiply-accumulate	Cond.				0	0	0	1	0	op1			0													1		op	0				SMLABB,SMLABT,SMLATB,SMLATT,SMLAWB,SMLAWT,SMULWB,SMULWT,SMLALBB,SMLALBT,SMLALTB,SMLALTT,SMULBB,SMULBT,SMULTB,SMULTT
Extra load/store instructions	Cond.				0	0	0	op1						Rn												1	op2		1				STRH,LDRH,LDRD,LDRSB,STRD,LDRSH,
Extra load/store instructions (unprivileged)	Cond.				0	0	0	0				1	op					Rt								1	op2		1				STRHT,LDRHT,LDRSBT,LDRSHT
Synchronization primitives	Cond.				0	0	0	1	op																	1	0	0	1				SWP,SWPB,STREX,LDREX,STREXD,LDREXD,STREXB,LDREXB,STREXH,LDREXH,
MSR(immediate) and hints	Cond.				0	0	1	1	0	op		1	0	op1												op2							NOP,YIELD,WFE,WFI,SEV,DBG,MSR
Miscellaneous instructions	Cond.				0	0	0	1	0	op			0	op1												0	op2						MRS,MSR,BX,CLZ,BXJ,BLX,BKPT,SMC
Load/Store word and unsigned byte	Cond.				0	1	A	op1						Rn															B				STR,STRT,LDR,LDRT,STRB,STRBT,LDRB,LDRBT
Media instructions	Cond.				0	1	1	op1										Rd								op2			1	Rn			USAD8,USADA8,SBFX,BFC,BFI,UBFX
Parallel addition and subtraction,signed	Cond.				0	1	1	0	0	0		op1														op2			1				SADD16,SASX,SSAX,SSUB16,SADD8,SSUB8,QADD16,QASX,QSUB16,QADD8,QSUB8,SHADD16,SHASX,SHSAX,SHSUB16SHADD8,SHSUB8
Parallel addition and subtraction,unsigned	Cond.				0	1	1	0	0	1		op1														op2			1				UADD16,UASX,USAX,USUB16,UADD8,USUB8,UQADD16,UQASX,UQSAX,UQSUB16,UQADD8,UQSUB8,UHADD16,UHASX,UHSAX,UHSUB16,UHADD8,UHSUB8
Packing,unpacking,saturation, and reversal	Cond.				0	1	1	0	1	op1				A												op2			1				PKH,SSAT,USAT,SXTAB16,SEL,SSAT16,SXTAB,SXTB,REV,SXTAH,SXTH,REV16,UXTAB16,UXTB16,USAT16,UXTAB,UXTB,RBIT,UXTAH,UXTH,REVSH
Signed multiplies	Cond.				0	1	1	1	0	op1								A								op2			1				SMLAD,SMUAD,SMLSD,SMUSD,SMLALD,SMLSLD,SMMLA,SMMUL,SMMLE
Branch,branch with link, and block data transfer	Cond.				1	0	op							Rn				R															STMDA,STMED,LDMDA,LDMFA,STM,STMIA,STMEA,LDMDB,LDMEA,STMIB,STMFA,LDMIB,LDMED,LDM,B,BL,BLX
Supervisor call,and coprocessor instructions	Cond.				1	1	op1							Rn								coproc							op				STC,STC2,LDC,LDC2,MCRR,MCRR2,MRRC,MRRC2,CDP,CDP2,MCR,MCR2,MRC,MRC2,SVC(previously SWI).
Unconditional instructions	1	1	1	1	op1									Rn															op				SRS,RFE,BL,BLX,LDC,LDC2,STC,STC2,MCRR,MCRR2,MRRC,MRRC2,CDP,CDP2,MCR,MCR2,MRC,MRC2
Miscellaneous instructions,memory hints, and Advanced SIMD instructions	1	1	1	1	0	op1								Rn												op2							CPS,SETEND,PLI,PLD,PLDW,CLREX,DSB,DMB,ISB,

The top 4 bits of a 32-bit ARM instruction are the condition code, summarized below for reference.

(See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473b/CEGBHJCJ.html )

Cond.	Meaning	CPSR flag state
b0000	EQ (Equal)	Z set
b0001	NE (Not equal)	Z clear
b0010	CS or HS (Higher or same (unsigned >=))	C set
b0011	CC or LO (Lower (unsigned <))	C clear
b0100	MI (Negative)	N set
b0101	PL (Positive or zero)	N clear
b0110	VS (Overflow)	V set
b0111	VC (No overflow)	V clear
b1000	HI (Higher (unsigned >))	C set and Z clear
b1001	LS (Lower or same (unsigned <=))	C clear or Z set
b1010	GE (Signed >=)	N and V the same
b1011	LT (Signed <)	N and V differ
b1100	GT (Signed >)	Z clear, and N and V the same
b1101	LE (Signed <=)	Z set, or N and V differ
b1110	AL (Always)	Executed unconditionally
b1111	NV (Never)	Instruction not executed; in ARMv5 and later this encoding is used for the unconditional instruction space instead
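The table can be turned into a small evaluator that tells whether an instruction with a given condition field would execute for a given set of flags. The Python sketch below follows the structure of the ConditionPassed() pseudocode in the ARMv7 manual: the upper three bits pick a base test and the lowest bit inverts it. The function name `condition_passed` is an illustrative choice. Note one assumption: the encoding 0b1111 is treated like AL, as the ARMv7 pseudocode does, because from ARMv5 onward it no longer means "never".

```python
def condition_passed(cond, n, z, c, v):
    """Evaluate a 4-bit ARM condition field against the N, Z, C, V flags.

    Flags are passed as booleans. Modelled on the ARMv7 ConditionPassed()
    pseudocode; 0b1111 is treated as "always" (an assumption noted above).
    """
    base = (cond >> 1) & 0b111
    if base == 0b000:      # EQ / NE
        result = z
    elif base == 0b001:    # CS / CC
        result = c
    elif base == 0b010:    # MI / PL
        result = n
    elif base == 0b011:    # VS / VC
        result = v
    elif base == 0b100:    # HI / LS
        result = c and not z
    elif base == 0b101:    # GE / LT
        result = n == v
    elif base == 0b110:    # GT / LE
        result = (n == v) and not z
    else:                  # AL (and 0b1111)
        result = True
    # The low bit selects the inverted condition, except for 0b1111.
    if (cond & 1) and cond != 0b1111:
        result = not result
    return bool(result)


if __name__ == "__main__":
    # After "CMP r0, r0" the flags are Z=1, C=1, N=0, V=0.
    for name, cond in (("EQ", 0b0000), ("NE", 0b0001), ("GT", 0b1100), ("LE", 0b1101)):
        print(name, condition_passed(cond, n=False, z=True, c=True, v=False))
```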

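Unlike ARMv32 instructions, which are always 32 bits, Thumb instructions are a fixed 16 bits, while Thumb2 provides both 16-bit and 32-bit instruction formats and performs better than Thumb. In compiled code, if bits 15-11 of a halfword are 0b11101, 0b11110, or 0b11111, that halfword is the first half of a 32-bit Thumb2 instruction. The general Thumb/Thumb2 instruction formats are shown below (see the ARMv7-AR Architecture Reference Manual).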

Thumb/Thumb2 instruction class	Bits																																	Example instructions
Thumb/Thumb2 instruction class	1st 16 bits																2nd 16 bits																	Example instructions
	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0	15	14	13	12	11	10	9	8	7		6		5	4	3	2	10
Shift(immediate),add,subtract,move and compare	0	0	Opcode																															LSL,LSR,ASR,ADD,SUB,MOV,CMP
Data Processing	0	1	0	0	0	0	Opcode																											AND,EOR,LSL,LSR,ASR,ADC,SBC,ROR,TST,RSB,CMP,CMN,ORR,MUL,BIC,MVN
Special data instructions and branch and exchange	0	1	0	0	0	1	Opcode																											ADD,CMP,MOV,BX,BLX
Load/store single data item	0	1	0	1	opB																													STR,STRH,STRB,LDRSB,LDR,LDRH,LDRB,LDRSH
Load/store single data item	0	1	1	0	opB																													STR,LDR,
Load/store single data item	0	1	1	1	opB																													STRB,LDRB
Load/store single data item	1	0	0	0	opB																													STRH,LDRH
Load/store single data item	1	0	0	1	opB																													STR,LDR
Miscellaneous 16bits instructions	1	0	1	1	Opcode																													SETEND,CPS,ADD,SUB,CBNZ,SXTH,SXTB,UXTH,UXTB,CBNZ,CBZ,PUSH,REV,REV16,REVSH,POP,BKPT,
If-then and hints	1	0	1	1	1	1	1	1	opA				opB																					IT,NOP,YIELD,WFE,WFI,SEV
Conditional branch and supervisor call	1	1	0	1	Opcode																													B,SVC(previously SWI)
Data processing(modified immediate)	1	1	1	1	0		0	op				S	Rn				0				Rd													AND,TST,BIC,ORR,MOV,ORN,MVN,EOR,TEQ,ADD,CMN,ADC,SBC,SUB,CMP,RSB
Data processing(plain binary immediate)	1	1	1	1	0		1	op					Rn				0																	ADD,ADR,MOV,SUB,ADR,MOVT,SSAT,SSAT16,SBFX,BFI,BFC,USAT,USAT16,UBFX
Branched and miscellaneous control	1	1	1	1	0	op											1	op1			op2													B,MSR,BXJ,SUBS,SMC(previously SMI),BL,BLX
Change Processor State ,and hints	1	1	1	1	0	0	1	1	1	0	1	0					1	0		0		op1				op2								CPS,NOP,YIELD,WFE,WFI,SEV,DBG
Miscellaneous control instructions	1	1	1	1	0	0	1	1	1	0	1	1					1	0		0						op								ENTERX,LEAVEX,CLREX,DSB,DMB,ISB
Load/Store Multiple	1	1	1	0	1	0	0	op		0		L	Rn																					SRS,RFE,STM,STMIA,STMEA,LDM,LDMIA,LDMFD,POP,STMDB,STMFD,PUSH,LDMDB,LDMEA,SRS,RFE
Load/Store dual,Load/Store exclusive,table branch	1	1	1	0	1	0	0	op1		1	op2		Rn													op3								STREX,LDREX,STRD,LDRD,STREXB,STREXH,STREXD,TBB,TBH,LDREXB,LDREXH,LDREXD
Load word	1	1	1	1	1	0	0	op1		1	0	1	Rn								op2													LDR,LDRT
Load halfword, memory hints	1	1	1	1	1	0	0	op1		0	1	1	Rn				Rt				op2													LDRH,LDRHT,LDRSH,LDRSHT,PLD,PLDW,
Load byte, memory hints	1	1	1	1	1	0	0	op1		0	0	1	Rn				Rt				op2													LDRB,LDRBT,LDRSB,LDRSBT,PLD,PLDW,PLI
Store single data item	1	1	1	1	1	0	0	0	op1			0									op2													STRB,STRBT,STRH,STRHT,STRT,STR
Data processing(shifted register)	1	1	1	0	1	0	1	op				S	Rn								Rd													AND,TST,BIC,ORR,MOV,ORN,MVN,EOR,TEQ,PKH,ADD,CMN,ADC,SBC,SUB,CMP,RSB
Data processing(register)	1	1	1	1	1	0	1	0	op1				Rn				1	1	1	1						op2								LSL,LSR,ASR,ROR,SXTAH,SXTH,UXTAH,UXTH,SXTAB16,SXTB16,UXTAB16,UXTB16,SXTAB,SXTB,UXTAB,UXTB,
Parallel addition and subtraction,signed	1	1	1	1	1	0	1	0	1	op1							1	1	1	1						0		0	op2					SADD16,SASX,SSAX,SSUB16,SADD8,SSUB8,QADD16,QASX,QSUB16,QADD8,QSUB8,SHADD16,SHASX,SHSUB16,SHADD8,SHSUB8
Parallel addition and subtraction,unsigned	1	1	1	1	1	0	1	0	1	op1							1	1	1	1						0		1	op2					UADD16,UASX,USAX,USUB16,UADD8,USUB8,UQADD16,UQASX,UQSAX,UQSUB16,UQADD8,UQSUB8,UHADD16,UHASX,UHSAX,UHSUB16,UHADD8,UHSUB8
Miscellaneous operations	1	1	1	1	1	0	1	0	1	0	op1						1	1	1	1						1		0	op2					QADD,QDADD,QSUB,QDSUB,REV,REV16,RBIT,REVSH,SEL,CLZ
Multiply,multiply accumulate,and absolute difference	1	1	1	1	1	0	1	1	0	op1							Ra									0		0	op2					MLA,MUL,MLS,SMLABB,SMLABT,SMLATB,SMLATT,SMULBB,SMULBT,SMULTB,SMULTT,SMLAD,SMUAD,SMLAWB,SMLAWT,SMULWB,SMULWT,SMLSD,SMUSD,SMMLA,SMMUL,SMMLS,USAD8,USADA8
Long multiply,long multiply accumulate,and divide	1	1	1	1	1	0	1	1	1	op1																op2								SMULL,SDIV,UMULL,UDIV,SMLAL,SMLALBB,SMLALBT,SMLALTB,SMLALTT,SMLALD,SMLSLD,UMLAL,UMAAL
Coprocessor instructions	1	1	1		1	1	op1						Rn								coproc									op				STC,STC2,LDC,LDC2,MCRR,MCRR2,MRRC,MRRC2,CDP,CDP2,MCR,MCR2,MRC,MRC2
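The rule for telling 16-bit and 32-bit encodings apart (bits 15-11 of the first halfword equal to 0b11101, 0b11110, or 0b11111) is easy to apply mechanically. The following Python sketch checks one halfword and uses that check to split a stream of halfwords into instructions. The function names are illustrative, and the halfword values in the example are only sample encodings (0x2102 and 0x4770 are the MOVS and BX lr encodings that appear in the disassembly later in this article).

```python
def is_thumb32_first_halfword(halfword):
    """True if bits 15-11 mark the start of a 32-bit Thumb2 instruction."""
    return ((halfword >> 11) & 0x1F) in (0b11101, 0b11110, 0b11111)


def split_thumb_stream(halfwords):
    """Group a list of halfwords into 16-bit and 32-bit instructions."""
    instructions = []
    i = 0
    while i < len(halfwords):
        if is_thumb32_first_halfword(halfwords[i]) and i + 1 < len(halfwords):
            instructions.append((halfwords[i], halfwords[i + 1]))
            i += 2
        else:
            instructions.append((halfwords[i],))
            i += 1
    return instructions


if __name__ == "__main__":
    stream = [0x2102, 0xF000, 0xF800, 0x4770]
    for instr in split_thumb_stream(stream):
        print(len(instr) * 16, "bits:", [hex(h) for h in instr])
```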

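Next, we introduce some basic features of the ARM processor.

A, A brief introduction to newly added ARM instructions

For the instructions supported by each version of the ARM core, see the ARM® and Thumb®-2 Instruction Set Quick Reference Card (http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001l/QRC0001_UAL.pdf ). Only a few that the author considers worth introducing are described here.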

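Added in ARMv4 (ARMv4T)	Thumb 16-bit instruction set
Added in ARMv5	VFPv2 support. Jazelle support. BLX: a branch with link that can also change the instruction set state. BKPT: breakpoint instruction. CLZ: count leading zeros, that is, the number of zero bits above the most significant 1 bit; the result is 32 if the register is all zeros and 0 if bit 31 is set, which is very helpful when optimizing multimedia codecs. Others, such as QADD, QSUB, QDADD, and QDSUB (saturating signed add, subtract, doubling add, and doubling subtract) and SMULxy, SMLAxy, SMULWy, SMLAWy, and SMLALxy (multiply instructions), were also added in ARMv5 cores.
Added in ARMv6	Thumb2 support. TrustZone support. SIMD support.
Added in ARMv7-A/R	VFPv3 support. NEON Advanced SIMD support. ThumbEE support.
Added in ARMv7-M (for low cost)	The ARM instruction set is not supported; only the Thumb2 16/32-bit instruction set is supported (with an integrated NVIC interrupt controller supporting up to 240 interrupts).
SVC	On newer ARM processors the SWI instruction is renamed SVC. The machine encoding is unchanged (for example EFxxxxxx), but the new name reflects further refinement of the SWI (SVC) behavior on newer processors.
LDREX and STREX	Added from ARMv6 onward to guarantee exclusive register/memory access at the processor level. LDREX and STREX are used as a pair. In the example that follows, LDREX reads a value from memory; if that location is modified before STREX executes, STREX fails and its status register R0 is non-zero (non-exclusive by this CPU). If the location was not touched, STREX succeeds and R0 is 0 (exclusive access by this CPU). These exclusive instructions suit multitasking and multi-core environments well; the spin lock of the ARM Linux kernel is also implemented with them.

try LDREX r0, [LockAddr] ; load the lock value

CMP r0, #0 ; is the lock free?

STREXEQ r0, r1, [LockAddr] ; try and claim the lock

CMPEQ r0, #0 ; did this succeed?

BNE try ; no, try again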
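To make the LDREX/STREX pairing more concrete, here is a deliberately simplified Python model of an exclusive monitor. It is a sketch under stated assumptions, not a description of real hardware: the class `ExclusiveMemory`, its methods, and `try_lock` are hypothetical names, every store to an address is assumed to clear all reservations on that address, and details such as reservation granules and implementation-defined monitor behavior are ignored.

```python
class ExclusiveMemory:
    """Toy model: one reservation per CPU, cleared by any store to it."""

    def __init__(self):
        self.mem = {}      # address -> value
        self.monitor = {}  # cpu id -> address tagged by the last LDREX

    def ldrex(self, cpu, addr):
        """Load a value and tag the address for this CPU."""
        self.monitor[cpu] = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        """Plain store: writes memory and drops reservations on addr."""
        self.mem[addr] = value
        for cpu in [c for c, a in self.monitor.items() if a == addr]:
            del self.monitor[cpu]

    def strex(self, cpu, addr, value):
        """Return 0 on success and 1 on failure, like the STREX status."""
        if self.monitor.get(cpu) != addr:
            return 1
        self.store(addr, value)
        return 0


def try_lock(memory, cpu, lock_addr):
    """One pass of the LDREX / CMP / STREXEQ sequence; True if claimed."""
    if memory.ldrex(cpu, lock_addr) != 0:
        return False  # lock already held
    return memory.strex(cpu, lock_addr, 1) == 0


if __name__ == "__main__":
    m = ExclusiveMemory()
    print(try_lock(m, cpu=0, lock_addr=0x1000))  # CPU 0 claims the lock
    print(try_lock(m, cpu=1, lock_addr=0x1000))  # CPU 1 sees it taken
```

In the model, a second CPU that writes the lock between another CPU's LDREX and STREX makes that STREX return 1, which corresponds to the `BNE try` retry path in the assembly sequence.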

B, Using branch instructions to switch between instruction sets

See the example in "Chapter 5. Interworking ARM and Thumb" of the RealView® Compilation Tools Developer Guide, shown below.

PRESERVE8 ;Preserves eight-byte alignment of the stack

AREA TestCode,CODE,READONLY ; Name this block of code.

ENTRY ; Mark first instruction to call.

; Program entry point

start

ADR R0, ThumbProg ; Generate branch target address and set bit 0, hence arrive at target in Thumb state.

ORR R0,R0,#1 ; The target becomes ThumbProg+1; the BX below then switches the processor to Thumb state

BX R0 ; Branch exchange to ThumbProg.

; Thumb code section

THUMB ; Subsequent instructions are Thumb code.

ThumbProg

MOVS R2, #2 ; Load R2 with value 2.

MOVS R3, #3 ; Load R3 with value 3.

ADDS R2, R2, R3 ; R2 = R2 + R3

ADR R0, ARMProg

BX R0 ; Branch exchange to ARMProg.

; Thumb instructions are 2 bytes, while ARMv32 code is fetched on 4-byte alignment; the assembler pads 2 bytes of 0x00 here so that the following ARMv32 code runs correctly.

; ARM code section

ARM ; Subsequent instructions are ARM code.

ARMProg

MOV R4, #4

MOV R5, #5

ADD R4, R4, R5

; End of program.

stop MOV R0, #0x18 ; angel_SWIreason_ReportException

LDR R1, =0x20026 ; ADP_Stopped_ApplicationExit

SWI 0x123456 ; ARM semihosting

END ; Mark end of this file.

Assemble and link with the following commands:

armasm --debug --apcs=/interwork ARMThumbMixedCode.s

armlink ARMThumbMixedCode.o -o ARMThumbMixedCode.elf

Run the resulting ARMThumbMixedCode.elf on an ARM processor and watch the CPSR. Execution starts in ARM state (CPSR T bit is 0). Then, through

ADR R0, ThumbProg

ORR R0,R0,#1

BX R0

R0 holds the target address of the Thumb code ORed with 1 in its lowest bit, and BX switches to Thumb state (CPSR T bit is 1) and runs the code after ThumbProg. At the end of the Thumb code, through

ADR R0, ARMProg

BX R0

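R0 holds the address of the ARM code, and BX switches straight back to ARM state (CPSR T bit is 0) and continues with the ARM code after ARMProg.

The ranges reachable by the different branch instructions are listed below (in general, up to 32 MB for ARM, 16 MB for Thumb2, and 4 MB for Thumb).

Instruction	Range (Thumb2 16/32-bit)	Range (ARM 32-bit)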
B (Branch to target address)	+/–16MB	+/–32MB
CBNZ, CBZ(Compare and Branch on Nonzero, Compare and Branch on Zero)	0-126B	X
BL, BLX (immediate) (Call a subroutine ,Call a subroutine, change instruction set)	+/–16MB	+/–32MB
BLX (register) (Call a subroutine, optionally change instruction set)	Any	Any
BX (Branch to target address, change instruction set)	Any	Any
BXJ (Change to Jazelle state)	–	–
TBB, TBH (Table Branch (byte offsets) and Table Branch (halfword offsets))	0-510B and 0-131070B	X

(Reference:ARMv7-AR Architecture Reference Manual)
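The ranges in the table follow directly from the width of the immediate field and the unit it counts in. The Python sketch below reproduces the numbers; the helper names are illustrative, and the field widths used (a 24-bit word offset for ARM B/BL, a 24-bit halfword offset for the 32-bit Thumb2 B/BL, a 22-bit halfword offset for the original Thumb BL pair, a 6-bit halfword offset for CBZ/CBNZ, and byte or halfword table entries for TBB/TBH) are stated here as the assumptions behind the arithmetic.

```python
MB = 1024 * 1024


def signed_range(imm_bits, scale):
    """(min, max) byte offset of a signed immediate counted in `scale` bytes."""
    span = 1 << (imm_bits - 1)
    return -span * scale, (span - 1) * scale


def unsigned_max(imm_bits, scale):
    """Largest byte offset of an unsigned immediate counted in `scale` bytes."""
    return ((1 << imm_bits) - 1) * scale


if __name__ == "__main__":
    print("ARM B/BL   :", signed_range(24, 4))   # about +/-32 MB
    print("Thumb2 B/BL:", signed_range(24, 2))   # about +/-16 MB
    print("Thumb BL   :", signed_range(22, 2))   # about +/-4 MB
    print("CBZ/CBNZ   :", unsigned_max(6, 2))    # 0 to 126 bytes
    print("TBB        :", unsigned_max(8, 2))    # 0 to 510 bytes
    print("TBH        :", unsigned_max(16, 2))   # 0 to 131070 bytes
```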

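C, Veneer: supporting ARM instruction set switches across different object files

From the previous example we know that when the ARM<->Thumb(2) state change happens inside one source file (one object file), it is performed directly in the code. When it happens in calls between two or more source files, however, the different options used to build each object file come into play. Deciding whether a call between functions in two object files needs a state change becomes the job of the final link step: the ARM linker checks whether caller and callee use the same instruction set and, if they do not, inserts veneer code, so that the executable produced by armlink copes with the instruction set differences among the object files it links.

Create a file arm.s with the following content: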

PRESERVE8

AREA Arm,CODE,READONLY ; Name this block of code.

IMPORT ThumbProg

ENTRY ; Mark 1st instruction to call.

ARMProg

MOV R0,#1 ; Set R0 to show in ARM code.

BL ThumbProg ; Call Thumb subroutine.

MOV R2,#3 ; Set R2 to show returned to ARM.

; Terminate execution.

MOV R0, #0x18 ; angel_SWIreason_ReportException

LDR R1, =0x20026 ; ADP_Stopped_ApplicationExit

SVC 0x123456 ; ARM semihosting (formerly SWI)

END

Then create a file thumb.s with the following content:

AREA Thumb,CODE,READONLY ; Name this block of code.

THUMB ; Subsequent instructions are Thumb.

EXPORT ThumbProg

ThumbProg

MOVS R1, #2 ; Set R1 to show reached Thumb code.

BX lr ; Return to the ARM function.

END ; Mark end of this file.

Build them as follows:

armasm --debug --apcs=/interwork arm.s

armasm --thumb --debug --apcs=/interwork thumb.s

armlink arm.o thumb.o -o arm_thumb_veneer.elf

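Because the call crosses object files, this case differs from the single-object case, where we had to set bit 0 of the Thumb entry address to 1 and jump with BX so that ARMv32 code could switch to Thumb. In the example above we simply call ThumbProg, and the veneer mechanism takes care of switching from ARMv32 to Thumb code.

Below, we disassemble arm_thumb_veneer.elf to confirm what the veneer does.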

Arm

0x00008000: e3a00001 …. MOV r0,#1

0x00008004: eb000004 …. BL $Ven$AT$I$$ThumbProg ; 0x801c

0x00008008: e3a02003 . .. MOV r2,#3

0x0000800c: e3a00018 …. MOV r0,#0x18

0x00008010: e59f1000 …. LDR r1,[pc,#0] ; [0x8018] = 0x20026

0x00008014: ef123456 V4.. SVC #0x123456 ; formerly SWI

0x00008018: 00020026 &… DCD 131110

$Ven$AT$I$$ThumbProg

0x0000801c: e28fc001 …. ADR r12,{pc}+9 ; 0x8025

0x00008020: e12fff1c ../. BX r12

Thumb

ThumbProg

0x00008024: 2102 .! MOVS r1,#2

0x00008026: 4770 pG BX lr

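In ARM state, the BL at address 0x00008004 jumps to the veneer code at address 0x0000801c. Just as we set bit 0 to 1 by hand within a single object file, the generated veneer sets R12 to 0x00008025 and then uses BX to switch state and run the Thumb code at 0x00008024.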
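The addresses in the disassembly can be checked by hand from the machine words. The Python sketch below decodes the BL word 0xeb000004 at 0x8004 and reproduces the value the veneer places in R12. The function names are illustrative; the calculation assumes the usual ARM-state rule that the PC reads as the address of the current instruction plus 8, and that BX uses bit 0 of the target to select Thumb state.

```python
def arm_bl_target(addr, instr):
    """Target of an ARM B/BL at `addr`: PC (addr + 8) + sign-extended imm24 * 4."""
    imm24 = instr & 0x00FFFFFF
    if imm24 & 0x00800000:   # sign-extend the 24-bit field
        imm24 -= 1 << 24
    return addr + 8 + (imm24 << 2)


def bx_destination(value):
    """Split a BX operand into (branch address, resulting state)."""
    state = "Thumb" if value & 1 else "ARM"
    return value & ~1, state


if __name__ == "__main__":
    veneer = arm_bl_target(0x8004, 0xEB000004)
    print(hex(veneer))                        # address of the veneer
    r12 = veneer + 8 + 1                      # ADD r12, pc, #1 inside the veneer
    addr, state = bx_destination(r12)
    print(hex(r12), hex(addr), state)
```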

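Veneers are generated automatically by the ARM linker at the final link stage, when it finds that ARM and Thumb code call each other across object files, or that a call between ARM/Thumb/Thumb2 code exceeds the branch range. They fall into the following categories.

Veneer type	Description
Calls between ARM and Thumb(2)	ARM<->Thumb and ARM<->Thumb2 calls across object files need a veneer to change state.
Beyond the ARM/Thumb/Thumb2 branch range	When an ARM<->ARM call exceeds 32 MB, a Thumb2<->Thumb2 call exceeds 16 MB, or a Thumb<->Thumb call exceeds 4 MB, veneer code is needed to complete the call.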

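D, CPSR (Current Program Status Register) and SPSR (Saved Program Status Register)

The Program Status Register (PSR) records the program state: the current processor mode, the instruction set state, and the flags on which conditionally executed (Cond.) instructions depend.

For example, if bits 4-0 of the CPSR read b10111, we know the current exception handler is running because an Abort occurred. Checking bits 4-0 of the SPSR next, b10011 (SVC mode) or b10000 (User mode), tells us which mode the processor was executing in before the Abort was raised. If you are worried about Abort re-entry caused by a badly designed exception handler, comparing the modes in the CPSR and SPSR also shows whether the Abort was re-entered, which helps pin down and fix latent system problems. The meaning of each field is summarized below.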

Bits	Field	Description
4-0	Mode[4:0]	Mode bits: b10000 (0x10) User mode; b10001 (0x11) FIQ mode; b10010 (0x12) IRQ mode; b10011 (0x13) Supervisor mode; b10111 (0x17) Abort mode; b11011 (0x1B) Undefined mode; b11111 (0x1F) System mode; b10110 (0x16) Secure Monitor mode
5	T	Thumb state bit: 0=ARM, 1=Thumb
6	F	FIQ disable: 1=disabled, 0=enabled
7	I	IRQ disable: 1=disabled, 0=enabled
8	A	Imprecise abort bit (A bit): indicates whether imprecise data abort exceptions are masked
9	E	Data endianness bit (E bit): indicates the current load/store endian setting of the core; can be set/cleared with the SETEND instruction
10	c	IT state bits
11	b
12	a
15-13	IT_cond
19-16	GE[3:0]	Greater than or equal to
23-20	DNM (RAZ)
24	J	Java state bit
25	e	IT state bits
26	d	IT state bits
27	Q	Sticky overflow
28	V	Overflow
29	C	Carry/Borrow/Extend
30	Z	Zero
31	N	Negative/Less than
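The mode comparison described above (reading the mode bits of CPSR and SPSR to see where an Abort came from and whether it was re-entered) can be sketched in Python for use on register dumps. The names `decode_psr` and `abort_reentered` are illustrative choices, and only a subset of the fields in the table is decoded.

```python
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10110: "Monitor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}


def decode_psr(psr):
    """Decode the mode, T/F/I bits and NZCV flags of a CPSR or SPSR value."""
    flag_bits = ((31, "N"), (30, "Z"), (29, "C"), (28, "V"))
    return {
        "mode": MODES.get(psr & 0x1F, "reserved"),
        "thumb": bool(psr & (1 << 5)),
        "fiq_masked": bool(psr & (1 << 6)),
        "irq_masked": bool(psr & (1 << 7)),
        "flags": "".join(name for bit, name in flag_bits if psr & (1 << bit)),
    }


def abort_reentered(cpsr, spsr):
    """True if an Abort was taken while already in Abort mode."""
    return decode_psr(cpsr)["mode"] == "Abort" and decode_psr(spsr)["mode"] == "Abort"


if __name__ == "__main__":
    cpsr, spsr = 0x00000197, 0x60000010  # in Abort mode, came from User mode
    print(decode_psr(cpsr))
    print(decode_psr(spsr))
    print("re-entered:", abort_reentered(cpsr, spsr))
```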

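E, APSR (Application Program Status Register): ALU status flags

Unlike the CPSR and SPSR, the APSR mainly serves as the register of ALU (Arithmetic Logic Unit) status flags, which decide whether conditional instructions are executed.

The CPSR also contains the APSR flags. Everything else, such as the processor mode, the interrupt enable state, the current instruction set state, and the IT block execution state, is not part of the APSR.

Bits	Field	Description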
15-0	Reserved	Reserved
19-16	GE[3:0]	Greater than or equal to (SIMD status bits – greater than or equal to for each 8/16-bit slice)
23-20	Reserved	Reserved
26-24	RAZ/SBZP
27	Q	Sticky Overflow
28	V	Overflow
29	C	Carry/Borrow/Extend
30	Z	Zero
31	N	Negative/Less than

Next, we introduce other major features of the ARM processor.

1, Coprocessors

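Besides the ARMv32, Thumb, and Thumb2 instruction sets, ARM supports extended instructions through coprocessors. Whenever the ARM processor meets an instruction it does not recognize, it offers the instruction to the coprocessors. If no coprocessor recognizes it as a valid instruction, or the system has no matching coprocessor, the Undefined Instruction vector is taken and software handles the error (Undefined Instruction is also commonly used to plant breakpoints for debugging).

ARM supports 16 coprocessors, numbered 0-15. For example, CP15 (System Control Coprocessor 15) is generally used for cache and MMU configuration, and CP14 (Debug Control Coprocessor 14) provides the debug registers. The newer NEON MPE (Media Processing Engine) SIMD instructions and VFP floating point are supported through CP10 and CP11 (see the Cortex-A8 Technical Reference Manual and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0450b/ch02s01s02.html).

Among the coprocessor instructions currently available, ARM defines dedicated instruction sets for VFP and NEON. Taking the NEON instruction set supported on Cortex-A as an example, adding the vectorize option is enough for the compiler to optimize the source automatically and generate the corresponding NEON instructions, as shown below.

armcc --vectorize -c vector.c --cpu Cortex-A8 -Otime

Then disassemble vector.o: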

0x000000a0: f2944a40 @J.. VMULL.S16 q2,d4,d0[0] => NEON instruction

0x000000a4: f428774f Ow(. VLD1.16 {d7},[r8]

0x000000a8: e2868006 …. ADD r8,r6,#6

0x000000ac: e2866008 .`.. ADD r6,r6,#8

0x000000b0: f428674f Og(. VLD1.16 {d6},[r8]

0x000000b4: f2974248 HB.. VMLAL.S16 q2,d7,d0[1]

0x000000b8: f426174f O.&. VLD1.16 {d1},[r6]

0x000000bc: f2964260 `B.. VMLAL.S16 q2,d6,d0[2]

0x000000c0: f2914268 hB.. VMLAL.S16 q2,d1,d0[3]

0x000000c4: f2222844 D(“. VADD.I32 q1,q1,q2

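These NEON instructions are supported through CP10 and CP11, so users need not issue coprocessor data transfer instructions themselves to benefit from NEON. If you use a coprocessor-based GPU, on the other hand, neither the compiler nor the processor supports the GPU's instruction set, so the GPU operations have to be wrapped in coprocessor instructions, and developers optimize the graphics library according to how well they understand the GPU instructions in order to get GPU acceleration.

The ARM coprocessor architecture can also support peripherals. Besides the ARM core driving peripherals, a coprocessor can connect to peripherals and do the necessary work itself. A graphics coprocessor, for instance, can compute and process data on its own and update display memory. To the ARM core, a coprocessor is simply another cooperating processor attached to the ARM data and control buses. When the ARM processor meets an instruction it cannot decode, it starts the handshaking sequence with the coprocessors to carry the instruction through.

The coprocessor and the ARM core handshake through the following three signals.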

A, CPI (Co-processor instruction) Signal

Every coprocessor in the system listens for this signal, which the ARM core asserts whenever it meets an instruction it does not recognize.

B, CPA (Co-processor absent) Signal

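When a coprocessor sees the CPI signal, it fetches the instruction and uses the CPA signal to tell the ARM core whether it supports it. If the coprocessor supports the instruction it drives CPA low (if not, CPA stays high), and it reports the following:

B.1, A 1-bit signal bit: 1 means the instruction can be handled by the coprocessor, 0 means the instruction is not supported.

B.2, A 4-bit (0-15) coprocessor ID identifying which coprocessor (of at most 16) can handle the instruction.

If no coprocessor can handle the instruction, the CPA signal stays high and the ARM core goes on to take the Undefined Instruction vector.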

C, CPB (Co-processor busy) Signal

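Once a coprocessor has answered that it can execute the instruction, the ARM core checks the coprocessor's state through the CPB signal. High means the coprocessor has not yet finished the previous instruction; low means it is ready to process the next coprocessor instruction.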
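The three-signal handshake can be summarized as a tiny state model. The Python sketch below is only a behavioral illustration of the description above, not a hardware-accurate simulation: the class `Coprocessor`, the `busy_cycles` counter, and the function `issue_coprocessor_instruction` are hypothetical names, and signal levels are represented as 0 (low) and 1 (high).

```python
class Coprocessor:
    """Toy coprocessor that answers the CPA and CPB signals."""

    def __init__(self, cp_id, busy_cycles=0):
        self.cp_id = cp_id
        self.busy_cycles = busy_cycles  # cycles left on the previous instruction

    def cpa(self, cp_num):
        """CPA level: 0 (low) means 'present, I accept'; 1 (high) means absent."""
        return 0 if cp_num == self.cp_id else 1

    def cpb(self):
        """CPB level: 1 (high) while still busy, 0 (low) when ready."""
        if self.busy_cycles > 0:
            self.busy_cycles -= 1
            return 1
        return 0


def issue_coprocessor_instruction(cp_num, coprocessors):
    """Model CPI being raised for an instruction aimed at coprocessor cp_num.

    Returns ("undefined", 0) if CPA stays high for every coprocessor,
    otherwise ("executed", cycles spent waiting on CPB).
    """
    takers = [cp for cp in coprocessors if cp.cpa(cp_num) == 0]
    if not takers:
        return "undefined", 0   # ARM takes the Undefined Instruction vector
    waited = 0
    while takers[0].cpb() == 1:  # busy-wait until CPB goes low
        waited += 1
    return "executed", waited


if __name__ == "__main__":
    system = [Coprocessor(10), Coprocessor(15, busy_cycles=2)]
    print(issue_coprocessor_instruction(15, system))
    print(issue_coprocessor_instruction(7, system))
```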

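While an ARM program is executing a coprocessor instruction, the ARM core waits for that instruction to finish before the program continues. In a typical multitasking system, for example a Linux 2.6 kernel whose system timer interrupt fires 1000 times per second to drive the scheduler, an ARM interrupt may arrive while the core is waiting for the coprocessor. The coprocessor instruction is then abandoned; when the interrupt handler finishes, execution returns to the address of that instruction in the application and the coprocessor instruction is executed again. On Linux, if the scheduler has meanwhile switched to another application, the instruction is re-executed only when the original application is scheduled again.

In general, coprocessor instructions fall into the following three types.

1, Operations purely internal to the coprocessor, with no involvement from the ARM core

For this type, the ARM core neither waits for the coprocessor to return data nor waits for it to finish the instruction. The work is purely internal data processing in the coprocessor, and the ARM core continues immediately. Take the CDP (Coprocessor Data Processing) instruction as an example: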

CDP{cond} <cp#>,<op>,<dest>,<lhs>,<rhs>,{info}

{cond}	Condition code; the instruction executes only when the condition is met (the optional condition code)
<cp#>	Coprocessor number (0-15, 4 bits, the co-processor number)
<op>	Coprocessor operation code to execute (0-15, 4 bits, the desired operation code)
<dest>	Destination register in the coprocessor (0-15, 4 bits, the co-processor destination register)
<lhs> and <rhs>	Source registers in the coprocessor (0-15, 4 bits, the co-processor source registers)
{info}	Info (0-7, 3 bits, the optional additional information field)

Another example is an FPU coprocessor instruction:

ADF {cond}<P>{R} <dest>,<lhs>,<rhs>

{cond}	Condition code; the instruction executes only when the condition is met (the optional condition code)
<P>	the precision of the operation
{R}	the optional rounding mode and the other fields are as above.
<dest>	Destination register
<lhs>	Source register
<rhs>	Source register

Or a graphics coprocessor instruction:

CDP 2,<palette>,<entry>,<value>,<component>

<cp#>	2 is the co-processor number
<palette>	the op-code for setting the palette
<entry>	the logical colour number (0-15) (the <dest> field)
<value>	the intensity for that component (0-65535) (the <lhs> and <rhs>) field.
<component>	the red, green or blue component (0-2) (the info field)

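2, Data exchanged between ARM registers and coprocessor registers

In the second type, the ARM core and the coprocessor exchange and update data through registers. Because ARM registers are modified, the ARM core must wait until the coprocessor has finished before continuing, so that the data exchange behaves as the program expects. MRC (Move to ARM core Register from Coprocessor) and MCR (Move to Coprocessor from ARM core Register) are two examples: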

MRC{cond} <cp#>,<op>,<ARM dest>,<lhs>,<rhs>,{info}
MCR{cond} <cp#>,<op>,<ARM srce>,<lhs>,<rhs>,{info}

{cond}	Condition code (the optional condition code)
<cp#>	Coprocessor number (4 bits, 0-15, the co-processor number)
<op>	Coprocessor operation number to execute (3 bits, 0-7, the operation code required)
<ARM dest>/<ARM srce>	ARM register that exchanges data with the coprocessor (4 bits, 0-15, the ARM source/destination register)
<lhs> and <rhs>	Coprocessor registers that exchange data with the ARM core (4 bits, 0-15, co-processor register numbers)
{info}	Additional information (3 bits, 0-7, optional extra information)
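The field widths in the table map directly onto the 32-bit MRC/MCR encoding, so the machine word for a given access can be computed by hand or with a short script. The Python sketch below assembles the word from its fields; the function name `encode_mrc_mcr` and its parameter names are illustrative (`opc1` corresponds to `<op>`, `crn` and `crm` to `<lhs>` and `<rhs>`, and `opc2` to `{info}`), and the bit positions used are those of the ARM-state encoding as the author of this sketch understands it, so treat the results as something to cross-check against a disassembler.

```python
def encode_mrc_mcr(is_read, coproc, opc1, rt, crn, crm, opc2=0, cond=0b1110):
    """Build the 32-bit ARM encoding of MRC (is_read=True) or MCR.

    Layout assumed: cond[31:28] 1110[27:24] opc1[23:21] L[20] CRn[19:16]
    Rt[15:12] coproc[11:8] opc2[7:5] 1[4] CRm[3:0].
    """
    limits = (("coproc", coproc, 15), ("opc1", opc1, 7), ("rt", rt, 15),
              ("crn", crn, 15), ("crm", crm, 15), ("opc2", opc2, 7),
              ("cond", cond, 15))
    for name, value, limit in limits:
        if not 0 <= value <= limit:
            raise ValueError(f"{name} out of range: {value}")
    load_bit = 1 if is_read else 0
    return ((cond << 28) | (0b1110 << 24) | (opc1 << 21) | (load_bit << 20)
            | (crn << 16) | (rt << 12) | (coproc << 8) | (opc2 << 5)
            | (1 << 4) | crm)


if __name__ == "__main__":
    # MRC p15,0,r0,c5,c0,1 : read the Instruction Fault Status Register into r0
    print(hex(encode_mrc_mcr(True, coproc=15, opc1=0, rt=0, crn=5, crm=0, opc2=1)))
    # MCR p15,0,r0,c6,c0,0 : write r0 to the Data Fault Address Register
    print(hex(encode_mrc_mcr(False, coproc=15, opc1=0, rt=0, crn=6, crm=0, opc2=0)))
```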

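3, The coprocessor accesses external memory for the data it processes

Just as the ARM LDR/STR instructions load memory into an ARM register or store an ARM register to memory, coprocessors have the analogous LDC/STC instructions. They can index into an array with an offset of at most -255 to +255 words (-1020 to +1020 bytes).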

LDC{cond}{L} <cp#>,<dest>,<address>
STC{cond}{L} <cp#>,<srce>,<address>

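{cond}	Condition code (the optional condition code)
{L}	(optional bit meaning 'long transfer')
<cp#>	Coprocessor number (4 bits, 0-15, the co-processor number)
<dest>/<srce>	Coprocessor register that exchanges data with memory (4 bits, 0-15, the co-processor source/destination register)
<address>	(specifies the address at which to start transferring data)

Like the ARM instruction set, a coprocessor can restrict some instructions so that they execute only in SVC mode (Supervisor mode). Through the SPVMD signal the ARM core tells external devices and coprocessors whether the running ARM program is in SVC mode, so that a peripheral or coprocessor can decide whether to go ahead with the operation or instruction. A coprocessor can stop the instruction by raising an Abort.

Take floating point: not every ARM platform has a VFP coprocessor. With the RVCT ARM compiler you can pass "--fpu=vfp" to select the hardware VFP coprocessor, so that the compiler emits machine code using VFP instructions. If the platform has no VFP, choose "--fpu=softvfp" so that the compiler handles floating point in software and emits no VFP coprocessor instructions.

Alternatively, with a library that supports VFP, you can still let the compiler emit VFP coprocessor instructions. When the ARM core reaches them it goes through the coprocessor sequence described above; if the platform has no VFP coprocessor, an Undefined Instruction trap is raised, and the ARM-side software that handles the trap calls the software implementation of the VFP instruction. (See the RealView Compilation Tools Libraries and Floating Point Support Guide.)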

2, Jazelle

參考ARM的網站http://www.arm.com/products/processors/technologies/jazelle.php, Jazelle是ARM在2002年開始的計畫,主要的目的是讓Java的ByteCode可以無須經由一層軟體的JVM,能夠如同一般ARMv32指令集一樣,直接透過處理器執行,加速Java應用ByteCode的執行效率.第一個具備Jazalle指令集的ARM核心是ARMv5TEJ,所產生的第一個處理器產品是ARM926EJ-S.

支援Jazelle的ARM核心,會在Pipeline Fecth到指令後,把所讀取到的Java指令,轉為1個或多個ARM指令集,並由Java Vm把要執行的ByteCode程式預備好後,再透過BXJ 進行Branch與轉態為ByteCode指令集的動作,跳到Java程式中執行,由於Jazelle主要實作常用的Java 指令集,如果遇到不支援的Java指令,就會通知Software Java Vm協助執行(回到ARM或是Thumb Mode),之後再由Java Vm根據需求切回Jazelle模式下. ㄟ…因為,筆者沒實際操刀過Jazelle,但由於Jazelle模式下是可以觸發Undefined Instruction Exception的,個人覺得當Jazelle遇到不支援的指令集,通知Software Java VM的方式應該是透過Undefined Instruction Exception由SPSR判斷前一個狀態是不是Jazelle指令模式,如果是,就再透過Softwaft Java Vm去執行(可以參考LR 知道該ByteCode指令的位置.). 另外一種可能就是,採用類似ThumbEE Handler的作法,如果遇到無法支援的Java ByteCode指令集,就透過該Handler交給Software Java Vm執行.(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Bhhggafj.html and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344c/Chdiciaj.html ),以上僅供參考…:)

可以參考Sun在 CLDC HI(HotSpot Implementation)的Release 文件(http://download.oracle.com/javame/config/cldc/cldc-opt-impl/cldc-hi-2.0-web/doc/release/CLDC_HI-release-notes.html ),CDLD Hi有支援Java Just-In Time的機制,可以即時把Java ByteCode編譯為ARM Code,加速Java應用的執行,如果要開啟Jazelle加速的話,文件中有說明需要取得ARM的授權後,才能使用

『For the avoidance of doubt, distribution of products containing software code to exercise the BXJ instruction and enable the use of the ARM Jazelle architecture extension without a JTEK licensing agreement from ARM is expressly forbidden. 』. 由於需要取得ARM JTEK授權,才能啟用這功能,在實際的應用上,應該都會比較偏向用JIT來做加速(通常會根據ARMv5指令集做優化).

Jazelle usage currently falls into two categories:

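A, Jazelle DBX (Direct Bytecode eXecution): It provides the ability to execute Java bytecode on the ARM processor. Developers can check whether the current CPSR has J (bit 24) set to 1 and T (bit 5) set to 0 to confirm that the processor is in the Jazelle instruction set state. It reduces the cost of having a Java VM, itself running as ARM instructions, interpret bytecode, because the processor's own instruction support speeds execution directly. For Jazelle to work effectively the Java VM must support it as well (a Jazelle-aware JVM), so that loading the JSRs (Java Specification Requests) or JARs a Java application needs fits properly into the Jazelle flow. According to ARM, about 95% of application bytecode can be executed directly through Jazelle.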

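B, Jazelle RCT (Runtime Compilation Target): This technology targets the conversion of Java bytecode into ARM machine code (according to the reference material, the final program grows 4 to 8 times once bytecode is converted to ARM machine code). JIT or DAC (Dynamic Adaptive Compilation) analyzes and compiles bytecode dynamically according to how it actually runs. The problem Jazelle RCT aims to solve is the longer application start-up time, and the impact on power consumption and performance, caused by converting bytecode to machine code in software. In practice, AOT (ahead-of-time) compilation can also produce ARM machine code in advance, at install or download time. Since the compiled Java application takes 4 to 8 times the storage, that cost has to be weighed too.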

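Dalvik in Android 2.2 and later also supports JIT. I believe software JIT will be the mainstream going forward. Because of the licensing restrictions, the Jazelle DBX or RCT mechanisms are unlikely to become mainstream. (And of course Jazelle also needs processor support.)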

3, ARMv32

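ARMv32 is ARM's native 32-bit instruction set environment. Compared with the Thumb (16-bit) and Thumb2 (16/32-bit) environments, it delivers the best execution performance. ARM instructions are aligned to 4-byte addresses in memory. The ARM 32-bit instruction set is typically used for performance-critical code, exception handling, and system initialization. A distinctive feature of the encoding is that the top 4 bits of every instruction hold the condition under which it executes.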
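
To make the "top 4 bits are the condition" point concrete, here is a minimal C sketch that pulls the condition field out of a 32-bit ARM instruction word. The function names and the lookup table are my own illustration, not part of any ARM toolchain, and the 0xF entry is labeled with its historical NV name even though later architectures reuse that space for unconditional instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Mnemonic suffix for each value of the condition field, bits [31:28].
 * 0xF is shown with its historical name NV (assumption for illustration). */
static const char *const kConditionNames[16] = {
    "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
    "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"
};

/* Return the 4-bit condition field of a 32-bit ARM instruction word. */
unsigned arm_condition(uint32_t instruction)
{
    return (unsigned)(instruction >> 28);
}

/* Map a condition field value (0..15) to its mnemonic suffix. */
const char *arm_condition_name(unsigned condition)
{
    assert(condition < 16);
    return kConditionNames[condition];
}
```

For example, 0xE1A00000 (MOV r0, r0) carries condition 0xE, which is AL (always), while a word starting with 0x1A is a branch taken only when the NE condition holds.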

4, Thumb

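The 16-bit Thumb instruction set is designed as a subset of the commonly used 32-bit ARM instructions. The processor converts each 16-bit Thumb instruction it loads into the corresponding 32-bit ARM instruction and executes that. Developers can tell the processor is in Thumb state when T is 1 and J is 0 in the CPSR. Keep in mind that the same C code compiled to 16-bit Thumb runs somewhat slower, and the instruction set is not as rich as 32-bit ARM, but the code density is higher and less memory is needed.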

5, Thumb2

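Thumb2 provides an instruction set with both 16-bit and 32-bit encodings. As with Thumb, the state is identified by T being 1 and J being 0 in the CPSR (so the processor has to support it). According to the ARM website (http://www.arm.com/products/processors/technologies/instruction-set-architectures.php ), Thumb-2 needs 31% less code memory than ARM code and performs 38% better than the original Thumb code. (Results vary with the test code.)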

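The Cortex M3, for example, uses Thumb2 code exclusively and does not support the ARM instruction set. This gives it a smaller code memory footprint than ARM code and a smaller performance loss than Thumb code.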

6, Thumb2 Execution Environment (Thumb-2EE)

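First, the ARM page http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ic/Cjafgdih.html states that "unless stated otherwise, ThumbEE instructions are identical to Thumb instructions". From the instruction set point of view, ThumbEE stays as close as possible to the existing Thumb instructions. Developers first enter Thumb state (CPSR T bit is 1), then use the ENTERX instruction (changes Thumb state to ThumbEE state, with no effect in ThumbEE state) to enter ThumbEE state (CPSR J bit is 1), and use the LEAVEX instruction (changes ThumbEE state to Thumb state, with no effect in Thumb state) to leave it. ThumbEE state also provides the instructions CHKA (check array) and HB, HBL, HBLP, and HBP (handler branch, which branches to a specified handler).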
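
The J and T bit combinations mentioned so far (ARM, Thumb, Jazelle, ThumbEE) can be summarized in a small C sketch. It only decodes a CPSR value that has already been read into a variable. The enum and function names are hypothetical, chosen for this note.

```c
#include <assert.h>
#include <stdint.h>

#define CPSR_T_BIT (1u << 5)   /* T, bit 5  */
#define CPSR_J_BIT (1u << 24)  /* J, bit 24 */

typedef enum {
    ISET_ARM,      /* J=0, T=0 */
    ISET_THUMB,    /* J=0, T=1 (Thumb and Thumb2) */
    ISET_JAZELLE,  /* J=1, T=0 */
    ISET_THUMBEE   /* J=1, T=1 */
} InstructionSet;

/* Decode the instruction set state from the J and T bits of a CPSR value. */
InstructionSet cpsr_instruction_set(uint32_t cpsr)
{
    int j = (cpsr & CPSR_J_BIT) != 0;
    int t = (cpsr & CPSR_T_BIT) != 0;

    if (j && t) return ISET_THUMBEE;
    if (j)      return ISET_JAZELLE;
    if (t)      return ISET_THUMB;
    return ISET_ARM;
}

/* Human-readable name for logging; the index is checked before use. */
const char *instruction_set_name(InstructionSet set)
{
    static const char *const names[] = { "ARM", "Thumb", "Jazelle", "ThumbEE" };
    assert((unsigned)set < 4u);
    return names[(unsigned)set];
}
```

An exception handler could apply the same decoding to the saved SPSR to learn which state the interrupted code was running in.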

In addition, CP14 register c0 holds the ThumbEE Configuration Register:

MRC p14, 6, <Rd>, c0, c0, 0 ; Read ThumbEE Configuration Register

MCR p14, 6, <Rd>, c0, c0, 0 ; Write ThumbEE Configuration Register

The ThumbEE HandlerBase Register can also be set, so that when an exception occurs in ThumbEE state the application gets a chance to correct it:

MRC p14, 6, <Rd>, c1, c0, 0 ; Read ThumbEE HandlerBase Register

MCR p14, 6, <Rd>, c1, c0, 0 ; Write ThumbEE HandlerBase Register

During development, code can be assembled for ThumbEE state by marking a code region with THUMBX (see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ic/CIHBCDGA.html ). I wrote the following reference code:

AREA ToThumbX, CODE, READONLY

ENTRY

ARM

start

MRS R0,CPSR

BIC R0,R0,#0x1F

ORR R0,R0,#0x10 ;Switch to User Mode.

MSR CPSR_c,R0

MOVS R1,#0x7000 ; handler_ThumbEE =0x7000-4

MCR p14, 6, R1, c1, c0, 0 ; Write ThumbEE HandlerBase Register

ADR r0, enter_Thumb + 1

BX r0 ; Enter Thumb Mode

THUMB

enter_Thumb

NOP

ENTERX ; Enter ThumbEE Mode

THUMBX

enter_ThumbX

MOVS R2, #2 ; Load R2 with value 2.

MOV.W R3, #3 ; Load R3 with value 3. => 32-bit Thumb2 instruction

ADDS R2, R2, R3 ; R2 = R2 + R3

MOVS r0,#0 ;char *p; and p=0;

MOVS r1,#2

STRB r1,[r0,#0] ;*p=2; => triggers a Data Abort.

MOVS r0,#1

LEAVEX ; Leave ThumbEE Mode

THUMB

leave_ThumbX

NOP

END

Build it with a compiler version that supports the Cortex-A8 instruction set, as follows:

armcc --cpu=Cortex-A8 thumbee.s -o thumbee.elf

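We set handler_ThumbEE to 0x7000 and write it to the ThumbEE HandlerBase Register through CP14. The code then switches from SVC Mode to User Mode, enters Thumb Mode, and uses ENTERX to enter ThumbEE Mode. It first confirms that both 16-bit and 32-bit Thumb2 instructions work in ThumbEE Mode, then deliberately provokes a NULL-pointer Data Abort. In ThumbEE Mode the PC is pointed directly at memory address 0x7000-4, where the handler we prepared runs. Inside the handler we can see that the processor is still in User Mode (no mode change took place) and that the instruction set state is still ThumbEE.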
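
Two small pieces of arithmetic in the listing are easy to get wrong, so here they are as a C sketch: the BIC/ORR pair that selects User Mode, and the 0x7000-4 address the PC lands on. The function names are hypothetical, and the alignment check assumes the low two bits of the HandlerBase value are zero, as with the 0x7000 used above.

```c
#include <assert.h>
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu  /* M[4:0], cleared by BIC R0,R0,#0x1F */
#define CPSR_MODE_USER 0x10u  /* set by ORR R0,R0,#0x10 */

/* Same effect as the BIC/ORR pair in the listing: keep every CPSR bit
 * except the mode field, then select User Mode. */
uint32_t cpsr_with_user_mode(uint32_t cpsr)
{
    return (cpsr & ~CPSR_MODE_MASK) | CPSR_MODE_USER;
}

/* Address the PC is set to for the null-pointer case described above:
 * HandlerBase - 4. */
uint32_t thumbee_null_check_handler(uint32_t handler_base)
{
    assert((handler_base & 0x3u) == 0u); /* assumed word aligned */
    return handler_base - 4u;
}
```

With the value from the listing, thumbee_null_check_handler(0x7000) gives 0x6FFC, which is where the handler code has to be placed.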

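This basic check shows that when a fault is handled in ThumbEE state, execution is redirected to the ThumbEE handler. Through that handler the application gets a chance to take remedial action, rather than having a processor-level abort raised directly and the application terminated on the spot.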

SEH (Structured Exception Handling) on Windows makes a good comparison. An application can install a Windows SEH handler with the following code:

handler = (DWORD) problem_fixing_seh ;

__asm{

mov eax,handler

push eax

push fs:0

mov fs:0,esp

};

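As long as SEH is installed when the application starts, if a design flaw causes a memory error that would abort the application, control first reaches the handler problem_fixing_seh. That function can record the state at the moment of the fault, including the registers, the stack, and other necessary information, and this information helps the application developer resolve the problem.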

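Applied to consumer-electronics development on a platform with an MMU and a split between User Mode and Kernel Mode, ThumbEE lets an application install a ThumbEE handler at start-up. If a design flaw in the application causes an error, the developer can use the handler to collect information from the scene of the error, and even send it back over a communication channel, to help narrow down the problem. (Of course, it is even better if the application can fix the problem by itself.)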

7, VFP (Vector Floating Point) and Advanced SIMD (NEON)

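ARM processors place vector floating point in a coprocessor, which provides economical single-precision and double-precision floating-point capability and is compatible with the ANSI/IEEE Std 754-1985 binary floating-point arithmetic standard.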

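ARM's Advanced SIMD instruction set, which we call NEON, provides 64-bit or 128-bit SIMD (Single Instruction Multiple Data) instructions that accelerate multimedia applications. NEON's instructions are provided through Coprocessors 10 and 11. The coprocessor has its own registers, its own instruction set, and an independently running execution unit. NEON supports 8/16/32/64-bit integers and 32-bit single-precision floating point, and a single SIMD instruction can perform up to 16 operations.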

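For NEON performance comparisons, I suggest the document ARM Architecture & NEON (http://www.stanford.edu/class/ee282/handouts/lect.10.arm_soc.4pp.pdf), which also includes a comparison with the ATOM N270. NEON has 32 registers of 64 bits (D0-D31), which can also be used as 16 Q registers of 128 bits (Q0-Q15). When handling multimedia data, one register can hold up to 4 32-bit floating-point numbers or 16 signed chars. Taking the FFT (Fast Fourier Transform) used in AAC encoding, an example from the ARM document, ARMv7 NEON instructions are about four times as efficient as ARMv6 SIMD instructions alone, and the FFT in ffmpeg reaches about 12 times (for reference only).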
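
The lane counts quoted above follow directly from dividing the register width by the element width. A minimal C sketch of that arithmetic is below. The macro and function names are my own, and nothing here touches real NEON hardware or intrinsics.

```c
#include <assert.h>

#define NEON_D_REGISTER_BITS 64u   /* D0-D31 */
#define NEON_Q_REGISTER_BITS 128u  /* Q0-Q15 */

/* Number of elements (lanes) of a given width that fit in one register. */
unsigned neon_lane_count(unsigned register_bits, unsigned element_bits)
{
    assert(element_bits != 0u && register_bits % element_bits == 0u);
    return register_bits / element_bits;
}
```

A Q register therefore holds 128 / 32 = 4 single-precision floats or 128 / 8 = 16 signed chars, and the 16 operations per instruction mentioned earlier correspond to the 8-bit case.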

To use NEON, you can add --vectorize when compiling with armcc. For the other options, see the ARM compiler documentation.

8, Security Extensions (TrustZone)

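Speaking of TrustZone, the PC industry has the Trusted Computing Group, TCG (http://www.trustedcomputinggroup.org/), which promotes this kind of platform security and has defined the TPM (Trusted Platform Module) and the software architecture that runs on top of it, the TCG Software Stack, so that applications and kernel drivers on a computer are likewise divided into Trusted and Non-Trusted execution environments. Newer Windows versions support it, as does the Linux kernel from 2.6 onward: in make menuconfig, go to Device Drivers ---> Character devices ---> [*] TPM Hardware Support, then choose National Semiconductor TPM Interface or Atmel TPM Interface according to the TPM interface on the platform (the options differ with the Linux kernel version you have).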

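Microsoft operating systems also support BitLocker, which is built on TPM technology (see http://windows.microsoft.com/zh-TW/windows7/Learn-more-about-BitLocker-Drive-Encryption). If the TPM detects that the boot files on the operating system drive have changed, the system is forced into recovery mode, and the recovery password must be entered before the data can be read correctly again. If the TPM data is found to be inconsistent, the recovery password is likewise required. The Windows TPM mechanism can also be combined with a startup key or a PIN: if a user's laptop is lost, a third party without the USB drive that holds the matching encryption key, or without the correct PIN, cannot open the data on the operating system drive. On a Windows PC, the TPM therefore guards against the risk of data loss when a disk holding important data is deliberately moved to another computer to be read, and it effectively limits that risk even if the computer itself is stolen by a third party.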

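TrustZone, proposed by ARM, divides the processor architecture into two parallel execution environments, Secure and Non-Secure (the Secure World and the Normal World). Switching between the two goes through Secure Monitor Mode. The concept is outlined below: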

Non-Secure User Mode (Application)		Secure User Mode (Application)
Non-Secure Privileged Mode (Kernel/Driver)		Secure Privileged Mode (Kernel/Driver)
Monitor Mode (Exception)

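With TrustZone, applications or kernel code that do not belong to the Secure region cannot access data that belongs to it, which addresses the security problems that downloaded third-party malware brings to products such as smartphones. The software security mechanisms for TrustZone were co-developed with Trusted Logic S.A. On a processor that supports TrustZone, a block of memory is reserved for applications or kernel code that run only in Secure Mode, so the MMU must also support the corresponding field (which is why processors with an MMU, such as the ARM1176 or Cortex A, are a very good fit). Only then can the memory management mechanism partition, in hardware, the memory that Secure and Non-Secure code executes from and uses, preventing Non-Secure applications from maliciously crossing the boundary. According to ARM, a working TrustZone environment must include: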

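A. A processor that supports TrustZone.

B. An on-chip boot ROM that supports the security setup at start-up. (Boot code stored in external flash is at risk of being modified.)

C. On-chip space for storing settings or master keys (maybe OTP (One Time Programmable)).

D. On-chip RAM for storing DRM or other important key material.

E. Peripherals that can be configured for use by trusted application software only.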

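ARM supplies security modules provided by Trusted Logic S.A. that implement security protocols consistent with TrustZone behavior. Through the security checks these software services perform, ARM aims to support SIM lock, IMEI protection, secure boot (ensuring the operating system kernel about to be loaded has not been modified; OMTP, the Open Mobile Terminal Platform, has also defined Secure Boot requirements), DRM (Digital Rights Management) for copyrighted content, digital signatures, and electronic banking. According to ARM's documents, the accompanying software modules include the Trusted Interpreter, TrustZone Access Driver, TrustZone Monitor, Secure Kernel, Secure Key Storage, SIM Lock, E-Wallet, and API Framework.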

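SMI (Software Monitor Instruction, the secure monitor interrupt) and SMC (Secure Monitor Call, the newer name) share the same machine encoding. Much as SVC traps into Supervisor mode, both raise an exception that is taken in Monitor Mode, which gives Non-Secure code a way to switch to the Secure State through Monitor Mode. As shown below, TrustZone lets us configure whether Asynchronous Abort, IRQ, FIQ, DMA, TLB, Coprocessor, and other interrupt sources and management units are brought under Secure State control.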

Secure Configuration Register (SCR)

Bits	Function	Description
31 – 7	UNK/SBZP (Bits [31:7])	UNK/SBZP unknown on reads, Should-Be-Zero-or-Preserved on writes.
6	nET	Not Early Termination. This bit disables early termination
5	AW	A bit writable. This bit controls whether the A bit in the CPSR can be modified in Non-secure state: 0 the CPSR.A bit can be modified only in Secure state. 1 the CPSR.A bit can be modified in any security state.
4	FW	F bit writable. This bit controls whether the F bit in the CPSR can be modified in Non-secure state: 0 the CPSR.F bit can be modified only in Secure state 1 the CPSR.F bit can be modified in any security state.
3	EA	External Abort handler. This bit controls which mode handles external aborts: 0 Abort mode handles external aborts 1 Monitor mode handles external aborts.
2	FIQ	FIQ handler. This bit controls which mode the processor enters when a Fast Interrupt (FIQ) is taken: 0 FIQ mode entered when FIQ is taken 1 Monitor mode entered when FIQ is taken.
1	IRQ	IRQ handler. This bit controls which mode the processor enters when an Interrupt (IRQ) is taken: 0 IRQ mode entered when IRQ is taken 1 Monitor mode entered when IRQ is taken.
0	NS	Non Secure bit. Except when the processor is in Monitor mode, this bit determines the security state of the processor. 0 =Secure state 1 =Non-secure state

The Secure Configuration Register can be read and modified with the following code:

MRC p15,0,<Rt>,c1,c1,0 ; Read CP15 Secure Configuration Register

MCR p15,0,<Rt>,c1,c1,0 ; Write CP15 Secure Configuration Register
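
The SCR bit layout in the table translates naturally into bit masks. The C sketch below works on a plain 32-bit value, such as one obtained with the MRC above and later written back with the MCR. It does not access CP15 itself, and the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the SCR table above. */
#define SCR_NS  (1u << 0)  /* Non Secure bit */
#define SCR_IRQ (1u << 1)  /* IRQ taken in Monitor mode */
#define SCR_FIQ (1u << 2)  /* FIQ taken in Monitor mode */
#define SCR_EA  (1u << 3)  /* external aborts taken in Monitor mode */
#define SCR_FW  (1u << 4)  /* CPSR.F writable in Non-secure state */
#define SCR_AW  (1u << 5)  /* CPSR.A writable in Non-secure state */
#define SCR_NET (1u << 6)  /* Not Early Termination */
#define SCR_DEFINED_BITS 0x7Fu

/* Route FIQ and external aborts to Monitor mode, leaving all other bits
 * (including the Should-Be-Zero-or-Preserved field [31:7]) untouched. */
uint32_t scr_route_fiq_and_aborts_to_monitor(uint32_t scr)
{
    uint32_t updated = scr | SCR_FIQ | SCR_EA;
    assert((updated & ~SCR_DEFINED_BITS) == (scr & ~SCR_DEFINED_BITS));
    return updated;
}

/* Nonzero when FIQ is configured to enter Monitor mode. */
int scr_fiq_taken_in_monitor(uint32_t scr)
{
    return (scr & SCR_FIQ) != 0;
}
```

The read-modify-write shape matters here because bits [31:7] are Should-Be-Zero-or-Preserved on writes, so a helper like this only ORs in the bits it wants.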

Non-Secure Access Control Register

Bits	Function	Description
31 – 19	SBZ	SBZ Should-Be-Zero on writes.
18	DMA	Reserves the DMA channels and registers for the Secure world and determines the page tables, Secure or Non-Secure, to use for DMA transfers. 0 = DMA reserved for the Secure world only and the Secure page tables are used for DMA transfers, reset value 1 = DMA can be used by the Non-Secure world and the Non-Secure page tables are used for DMA transfers.
17	TL	Prevents operations in the Non-Secure world from locking page tables in TLB lockdown entries. The Invalidate Single Entry or Invalidate ASID match operations can match a TLB lockdown entry but an Invalidate All operation only applies to unlocked entries: 0 = Reserve TLB Lockdown registers for Secure operation only, reset value 1 = TLB Lockdown registers available for Secure and Non-Secure operation.
16	CL	Prevents operations in the Non-Secure world from changing cache lockdown entries: 0 = Reserve cache lockdown registers for Secure operation only, reset value 1 = Cache lockdown registers available for Secure and Non-Secure operation.
15 – 14	SBZ	SBZ Should-Be-Zero on writes.
13	CP13	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
12	CP12	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
11	CP11	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
10	CP10	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
9	CP9	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
8	CP8	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
7	CP7	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
6	CP6	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
5	CP5	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
4	CP4	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
3	CP3	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
2	CP2	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
1	CP1	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.
0	CP0	Determines permission to access the given coprocessor in the Non-Secure world: 0 = Secure access only, reset value 1 = Secure or Non-Secure access.

The Non-Secure Access Control Register can be read and modified with the following code:

MRC p15, 0, <Rd>, c1, c1, 2 ; Read Non-Secure Access Control Register data

MCR p15, 0, <Rd>, c1, c1, 2 ; Write Non-Secure Access Control Register data
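
As a sketch of how the per-coprocessor bits in the table might be used, the C code below computes an NSACR value that opens CP10 and CP11 (the coprocessors NEON and VFP sit on, as noted earlier) to the Non-Secure world. It operates on a plain value rather than on CP15, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from the NSACR table above. */
#define NSACR_CL  (1u << 16)  /* cache lockdown registers */
#define NSACR_TL  (1u << 17)  /* TLB lockdown registers */
#define NSACR_DMA (1u << 18)  /* DMA channels and registers */

/* Bit controlling Non-Secure access to coprocessor cp (CP0..CP13). */
uint32_t nsacr_coprocessor_bit(unsigned cp)
{
    assert(cp <= 13u);
    return 1u << cp;
}

/* Allow Non-Secure access to CP10 and CP11, keeping other bits as they are. */
uint32_t nsacr_allow_cp10_cp11(uint32_t nsacr)
{
    return nsacr | nsacr_coprocessor_bit(10u) | nsacr_coprocessor_bit(11u);
}

/* Nonzero when the Non-Secure world may access coprocessor cp. */
int nsacr_nonsecure_may_use(uint32_t nsacr, unsigned cp)
{
    return (nsacr & nsacr_coprocessor_bit(cp)) != 0;
}
```

Starting from the reset value of 0 (Secure access only everywhere), nsacr_allow_cp10_cp11 produces 0x00000C00.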

When the processor is in Monitor Mode (CPSR M[4:0] = b10110), it is in the Secure State. CP15 reads and writes (MRC and MCR) in this mode then depend on the SCR.NS bit: if NS (Non-Secure) is 0, the access goes to the Secure banked registers, and if NS is 1, the access goes to the Non-Secure banked registers.
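
The rule above, together with the NS bit description in the SCR table, can be written down as a short C sketch. It takes CPSR and SCR values as plain integers, and the type and function names are hypothetical ones chosen for this note.

```c
#include <assert.h>
#include <stdint.h>

#define CPSR_MODE_MASK    0x1Fu
#define CPSR_MODE_MONITOR 0x16u      /* M[4:0] = b10110 */
#define SCR_NS            (1u << 0)

typedef enum { BANK_SECURE, BANK_NON_SECURE } Cp15Bank;

/* Monitor Mode is always Secure; otherwise SCR.NS decides. */
int processor_is_secure(uint32_t cpsr, uint32_t scr)
{
    if ((cpsr & CPSR_MODE_MASK) == CPSR_MODE_MONITOR)
        return 1;
    return (scr & SCR_NS) == 0;
}

/* Which banked CP15 copy an MRC/MCR issued from Monitor Mode reaches. */
Cp15Bank cp15_bank_in_monitor(uint32_t cpsr, uint32_t scr)
{
    assert((cpsr & CPSR_MODE_MASK) == CPSR_MODE_MONITOR);
    return (scr & SCR_NS) ? BANK_NON_SECURE : BANK_SECURE;
}
```

This is how monitor code can save or restore the Non-Secure copies of banked registers while itself staying in the Secure State: it sets NS to 1, performs the CP15 accesses, and clears NS again.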

A pause for now.

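Honestly, having written this far, I feel there is far too much left to cover... @_@. Since these are notes, written as the thoughts come, I will stop here for now and write the next installment of the ARM and Cortex notes when I have time.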

Loda's Blog (hlchou@gmail.com)
