ARM Cortex-A8:
An Overview
Andrew Daws, David Franklin, Cole Laidlaw
Abstract
The purpose of this document is to provide a comprehensive overview of the ARM Cortex-A8. It includes background information, a detailed description of the Cortex-A8's RISC-based superscalar design, a brief comparison with other ARM processors, and a description of the NEON SIMD engine and other features. The Cortex-A8 is among the highest-performance ARM processors currently in service and is widely used in mobile devices and other consumer electronics. This analysis covers the basic components and processes of the Cortex-A8 architecture, including its 13-stage instruction pipeline, branch prediction, and other features that deliver high performance with low power needs. The Cortex-A8 includes the NEON SIMD engine, which uses parallel integer and floating-point operations to deliver greater graphics and multimedia capabilities. Additionally, we include a brief description of the ARMv7-A architecture to establish a basis for comparison with the Cortex-A8.
Index Terms
ARM, SIMD, VFP, RISC.
I. INTRODUCTION
THE increasing ubiquity of mobile computing continues to increase the demand for
processors that are versatile and deliver high performance. This demand is driven by the
need for a variety of services, including connectivity and entertainment. The ARM Cortex-A8,
designed by ARM and licensed by manufacturers such as Texas Instruments, combines connectivity,
performance, and multimedia. It achieves versatility while attaining energy efficiency. Its
performance can be measured in instructions per cycle (IPC) and is achieved through a balance
of increased operating frequency and machine efficiency [1]. The increase in performance results
from superscalar execution, improvements in branch prediction, and an efficient memory system.
The Cortex-A8 has a deeper pipeline with less logic depth per stage than previous ARM cores.
It is important to analyze the Cortex-A8's features to highlight their effects on power saving
and improved performance. There is also an important need for graphics and multimedia, which
the Cortex-A8 meets with NEON. NEON achieves greater graphical capability by utilizing 64-bit
integer and floating-point operations. NEON is a Single Instruction Multiple Data (SIMD)
accelerator processor, capable of executing one instruction across up to 16 data elements
simultaneously. This parallelism confers a host of new capabilities. The Cortex-A8 also employs
the Vector Floating Point (VFP) accelerator to speed up floating-point operations.
The Cortex-A8's capabilities can be illustrated by a brief comparison with other architectures
within ARM's Cortex family. The Cortex-A8 belongs to the ARMv7-A family, a group consisting
of seven other processors. The Cortex-A8 has proven to be the more flexible processor when
compared with related architectures. These other designs can be faster and more powerful,
but they lack the Cortex-A8's versatility. Any comparison illustrates the Cortex-A8's success
as a commercial-grade processor. Analyzing these concepts reveals the importance of the
Cortex-A8's RISC-based superscalar design and its versatility.
II. ARM PROCESSORS
The ARM processor family has had a substantial impact on the world of consumer electronics.
ARM's founders established the company in 1990 as Advanced RISC Machines Ltd [6].
The company's purpose was to develop commercial-grade general-purpose processors. ARM
processors can be found on many platforms, including laptops, mobile devices, and other
embedded systems. ARM is a Reduced Instruction Set Computer (RISC) architecture. Typical
RISC architectures include several common features: a large register file, a load/store
architecture, simple addressing modes, and uniform instruction lengths. The ARM architecture
includes several aspects in addition to these basic RISC features:
• Control of both the ALU and the shifter in most data-processing operations, to maximize
the use of each.
• Optimization of program loops through auto-increment and auto-decrement addressing
modes.
• Load and store multiple instructions to maximize data throughput.
• Conditional execution of almost all instructions to maximize execution throughput.
These ARM features enhance the basic RISC architecture to achieve high performance,
reduced code size, and reduced power needs. A typical ARM processor has 16 visible
registers out of 31 total registers. Three of these are special-purpose registers: the stack
pointer, the link register, and the program counter. ARM supports exception handling, which
causes a standard register to be replaced with a register specific to the exception type being
handled. All processor state is contained in status registers. The ARM instruction set includes
branch, data-processing, status-register-transfer, load/store, coprocessor, and
exception-generating instructions [7].
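
As an illustration of the conditional-execution feature listed above, consider the following C fragment. On an ARM compiler the conditional body can typically be compiled to predicated instructions (e.g., SUBGT and SUBLE) rather than a branch; this example is ours, not from the cited sources.

    /* Branch-free absolute difference: an ARM compiler can map this
     * conditional onto predicated instructions instead of a branch,
     * exploiting ARM's conditional execution of almost all
     * instructions. */
    unsigned abs_diff(unsigned a, unsigned b)
    {
        return (a > b) ? a - b : b - a;
    }

Avoiding the branch keeps short, unpredictable decisions out of the branch predictor entirely.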
CEC 470, PROJECT II, DECEMBER 2014 3
III. ARCHITECTURE
The ARM Cortex-A8 is a microprocessor designed for general-purpose consumer electronics.
The ARM architecture is a load/store architecture with an instruction set similar to those of
other RISC processors, but it contains numerous special features: shift and ALU operations may
be carried out in the same instruction, the program counter may be used as a general-purpose
register, both 16-bit and 32-bit instruction opcodes are supported, and the instruction set is
fully conditional [1]. There are sixteen 32-bit registers, thirteen of which are general purpose;
the stack pointer, link register, and program counter comprise the remainder. These registers
can be used for load/store instructions and data processing in addition to their special purposes.
Pipeline. The Cortex-A8 utilizes a sophisticated instruction pipeline. This 13-stage
instruction pipeline implements an in-order, dual-issue, superscalar processor with advanced
branch prediction [1]. The main pipeline is divided into fetch, decode, and execute stages.
The first two fetch stages (F1, F2) are responsible for branch prediction and for placing
instructions into a buffer for decoding. Decoding is implemented in five stages (D0-D4) that
decode, schedule, and issue instructions [1]; complex instruction sequences are sequenced here
and can even be replayed if the memory system stalls. The six execute stages comprise a
load/store pipeline, a multiply pipeline, and two symmetric ALU pipelines. There are additional
pipelines beyond the main 13-stage pipeline: an 8-stage pipeline serves the level-2 memory
system and a 13-stage pipeline handles debug trace execution. The NEON SIMD execution
engine implements its own 10-stage pipeline, with four stages for instruction decode and six
stages for execution.
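
Summarizing the stage layout just described (a reference enumeration only; the execute-stage labels are our inference from the six-stage count and the E5 writeback mentioned in Section V):

    /* Main Cortex-A8 pipeline: 2 fetch + 5 decode + 6 execute = 13
     * stages. F0, the address-generation stage, is not officially
     * counted (see Section IV). */
    enum a8_main_stage {
        A8_F1, A8_F2,                            /* instruction fetch     */
        A8_D0, A8_D1, A8_D2, A8_D3, A8_D4,       /* decode/schedule/issue */
        A8_E0, A8_E1, A8_E2, A8_E3, A8_E4, A8_E5 /* execute; writeback E5 */
    };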
Fig. 1. Full Pipeline [1]
IV. INSTRUCTION FETCH
Instruction Fetch Pipeline. Dynamic branch prediction, instruction-queueing hardware,
and the entire Level-1 instruction-side memory system are located within the instruction fetch
unit; instruction fetching thus includes dynamic branch prediction and instruction queuing [1].
The instruction fetch pipeline runs decoupled from the rest of the processor and may fetch up
to four instructions per cycle along the predicted execution stream. Fetched instructions are
subsequently placed in a queue to await decoding.
Once the fetch pipeline begins, a new virtual address is generated in the F0 stage. This may
be a predicted target address, or the next sequential address calculated from the previous fetch
if no branch is predicted. The F0 stage is not counted as the first stage: by ARM convention,
the instruction cache access is considered the first official pipeline stage. The F1 stage serves
two purposes in parallel, accessing the instruction cache array and the branch prediction arrays.
The F2 stage is the final stage of the instruction fetch pipeline; instruction data returned from
the instruction cache is placed in its respective queue for later use by the decode unit. If a
branch prediction results, the new target address is used as the fetch address, changing the
address calculated in the F0 stage and discarding the instruction fetch made in the F1 stage.
Code sequences containing many branch instructions can therefore lose fetch cycles to such
redirections [1].
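
A minimal sketch of the F0 address selection just described (function and variable names are ours, not from [1]):

    #include <stdint.h>

    /* F0: choose the next fetch address. A prediction made in F2
     * redirects the stream (discarding the in-flight F1 fetch);
     * otherwise fetch continues sequentially past the bytes
     * acquired in the previous cycle. */
    uint32_t next_fetch_address(uint32_t fetch_pc, int predicted_taken,
                                uint32_t predicted_target,
                                uint32_t bytes_fetched)
    {
        return predicted_taken ? predicted_target
                               : fetch_pc + bytes_fetched;
    }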
Fig. 2. Fetch Pipeline [1]
Instruction Cache. An instruction cache is implemented in the instruction fetch unit and is
its largest component [1]. It can be configured as 16KB or 32KB in size and can return 64 bits
of data per access. The instruction cache is a physically addressed, four-way set-associative
cache. A fully associative 32-entry translation lookaside buffer (TLB) is also included.
Instruction and data caches are kept as identical as possible to ensure design efficiency.
Differences are minimized by sharing the same array structures while making only minor
changes to control logic [1]. These elements are consistent with conventional cache designs.
The hashed virtual address buffer (HVAB) is not part of this conventional design strategy.
Traditionally, the RAM arrays for all ways are fired in parallel, and the translated physical
address is then compared against the tag arrays to verify the data read from RAM. The HVAB
avoids firing the arrays in parallel: a 6-bit hash of the virtual address indexes the HVAB to
predict which cache way is likely to contain the required data. The TLB translation and tag
compare then verify whether the hit is genuine [1]. If a hit is invalid, the access is aborted
and the HVAB and cache data are updated. The TLB translation and tag compare are thereby
removed from the critical path of cache access. This process results in power savings but
hinders performance when predictions are inaccurate, which can be mitigated by an efficient
hash function with a low probability of false matches.
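
The way-prediction flow can be sketched as follows (a simplified model: the hash function shown is a stand-in, since [1] does not specify the real one, and all names are ours):

    #include <stdint.h>

    #define NUM_WAYS 4

    /* Hypothetical 6-bit hash of the virtual address; the actual
     * function is chosen for a low false-match probability. */
    static unsigned hvab_hash(uint32_t va)
    {
        return (va ^ (va >> 6) ^ (va >> 12)) & 0x3F;
    }

    typedef struct {
        uint8_t predicted_way[64];   /* indexed by the 6-bit hash */
    } hvab_t;

    /* Predict which way holds the data so that only that way's
     * arrays fire; the TLB translation and tag compare verify the
     * hit off the critical path. */
    unsigned hvab_predict(const hvab_t *h, uint32_t va)
    {
        return h->predicted_way[hvab_hash(va)] % NUM_WAYS;
    }

    /* On an invalid hit, the access is aborted and the predictor
     * (and cache data) are updated before the access replays. */
    void hvab_update(hvab_t *h, uint32_t va, unsigned correct_way)
    {
        h->predicted_way[hvab_hash(va)] = (uint8_t)correct_way;
    }

Firing only the predicted way's arrays is where the power saving comes from; the cost is a replayed access whenever the prediction proves wrong.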
Instruction Queue. The purpose of the instruction queue is to smooth discontinuities between
instruction delivery and instruction consumption. Instructions are placed in the instruction
queue after they are fetched from the instruction cache; if the queue is empty, they are
forwarded directly to the D0 stage [1]. Decoupling allows the instruction fetch unit to prefetch
ahead of the rest of the integer unit and to build a reserve of instructions awaiting decode.
This reserve conceals the latency of prediction changes, and decode-unit stalls are prevented
from spreading back into the prefetch unit during the cycle in which a stall is recognized.
Four parallel FIFOs comprise the instruction queue [1]. Each FIFO consists of six entries
that are 20 bits wide: 16 bits of instruction data and four bits of control state. A single
instruction may occupy up to two entries.
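
The entry layout described above maps onto a small structure (field names are illustrative, not from [1]):

    #include <stdint.h>

    /* One 20-bit instruction-queue entry: 16 bits of instruction
     * data plus 4 bits of control state. A 32-bit instruction
     * occupies two entries. */
    typedef struct {
        uint16_t insn_halfword;     /* 16 bits of instruction data */
        unsigned control : 4;       /* 4 bits of control state     */
    } iq_entry_t;

    /* Four parallel FIFOs of six entries each. */
    typedef struct {
        iq_entry_t fifo[4][6];
    } instruction_queue_t;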
Branch Prediction. The branch predictor comprises a 512-entry branch target buffer (BTB)
and a 4096-entry global history buffer (GHB). The BTB indicates whether the current fetch
address contains a branch; the GHB holds counters indicating which predicted branches should
or should not be taken. The BTB is indexed by the fetch address and contains target addresses
and branch-type information, covering both direct and indirect branch targets [1]. Both arrays
are accessed in parallel with the instruction cache during the F1 stage. A GHB entry is selected
by a 10-bit global branch history together with four lower bits of the PC. The branch history
is generated from the taken/not-taken status of the 10 most recent branches and is saved in
the global history register (GHR). This approach increases accuracy by creating history traces
that lead to better predictions, while the low-order PC bits in the index keep branches with
similar histories from aliasing onto the same entry.
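
[1] states that a 10-bit global history and four low-order PC bits select the GHB entry; one way to fold those 14 bits into a 12-bit index for a 4096-entry table is sketched below (the exact arrangement is our assumption):

    #include <stdint.h>

    /* 4096-entry GHB: 2^12 entries, so the 10 history bits and the
     * 4 low PC bits must be combined into 12 index bits. Here they
     * are folded by XOR; the hardware's scheme may differ. */
    unsigned ghb_index(uint16_t ghr, uint32_t pc)
    {
        unsigned history = ghr & 0x3FF;      /* 10 most recent outcomes */
        unsigned pc_bits = (pc >> 2) & 0xF;  /* 4 low-order PC bits     */
        return ((history << 2) ^ pc_bits) & 0xFFF;
    }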
Return Stack. Subroutine-return predictions are made using a return stack with an
eight-entry depth [1]. When the BTB determines that a branch is a subroutine call, the return
address is pushed onto the stack. A subroutine return then pops the address from the stack
instead of reading it from a BTB entry. Because subroutines can be quite short, it is important
to support multiple pushes and pops in flight at a time. Speculative updates may be harmful,
because updates from an incorrect path cause a loss of synchronization with the return stack
and hence mispredictions. The instruction fetch unit therefore keeps both a speculative and
a non-speculative return stack: the speculative stack is updated immediately, while the
non-speculative stack is not updated until the branch is resolved [1]. After an incorrect
prediction, the speculative stack is overwritten with the non-speculative state.
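
A compact sketch of this dual-stack recovery scheme (types and names are ours):

    #include <stdint.h>

    #define RS_DEPTH 8

    typedef struct {
        uint32_t addr[RS_DEPTH];
        int top;
    } ret_stack_t;

    typedef struct {
        ret_stack_t spec;       /* updated immediately at predict time */
        ret_stack_t committed;  /* updated only when a branch resolves */
    } return_stacks_t;

    /* Predicted subroutine call: push onto the speculative stack
     * right away (circular, eight entries deep). */
    void rs_push_speculative(return_stacks_t *rs, uint32_t ret_addr)
    {
        rs->spec.top = (rs->spec.top + 1) % RS_DEPTH;
        rs->spec.addr[rs->spec.top] = ret_addr;
    }

    /* Predicted subroutine return: pop instead of reading the BTB. */
    uint32_t rs_pop_speculative(return_stacks_t *rs)
    {
        uint32_t a = rs->spec.addr[rs->spec.top];
        rs->spec.top = (rs->spec.top + RS_DEPTH - 1) % RS_DEPTH;
        return a;
    }

    /* Misprediction: restore synchronization by overwriting the
     * speculative stack with the committed (non-speculative) state. */
    void rs_recover(return_stacks_t *rs)
    {
        rs->spec = rs->committed;
    }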
V. INSTRUCTION DECODE
Instruction Decode Pipeline. The instruction decode unit decodes and sequences instructions,
issues them, and provides exception handling [1]. The decode unit occupies the D0-D4 pipeline
stages. The instruction type and the destination and source operands are determined in the
D0 and D1 stages, and multi-cycle instructions are divided into multiple single-cycle
instructions during the D1 stage. Instructions are written into and read from the pending/replay
queue structure during the D2 stage. The D3 stage implements the instruction-scheduling logic:
the scoreboard is consulted for the next two candidate instructions, and the pair is analyzed
for any dependency hazards the scoreboard cannot detect. Instructions cannot be stalled once
they cross the D3/D4 boundary. Final decode of all control signals critical to the
instruction-execute and load/store units occurs in the D4 stage.
Fig. 3. Decode Pipeline [1]
Static Scheduling Scoreboard. The static scheduling scoreboard predicts operand availability
[1]. Each scoreboard value indicates the number of cycles until a valid result becomes
available; this differs from traditional scoreboards, which normally use a single bit to mark
a source operand as available or not. The cycle counts are compared against the source operands
of candidate instructions to detect possible dependency hazards. Each scoreboard entry is
self-updating on a cycle-by-cycle basis: when a register is written, its entry is set, and
the entry then decrements by one each cycle until a new register write occurs or the counter
reaches zero, which indicates availability. The static scheduling scoreboard also tracks each
executing instruction by pipeline stage and result; this information is used to generate the
forwarding-multiplexer control signals that accompany instructions upon issue [1].
There are several advantages to the static scheduling scoreboard. Used together with the
replay queue, it allows a fire-and-forget pipeline with no stalls [1], removing speed paths
that would hinder high-frequency operation. The design also conserves power by establishing
early which execution units are required.
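
A minimal model of the counting scoreboard (register count, counter width, and update rules are simplified; names are ours):

    #include <stdint.h>

    #define NUM_REGS 16

    /* Per-register cycles-until-ready counters: zero means the value
     * is available, nonzero is the producer's remaining latency. */
    typedef struct { uint8_t cycles_left[NUM_REGS]; } scoreboard_t;

    /* On issue, record the producer latency of the destination. */
    void sb_set(scoreboard_t *sb, int reg, uint8_t latency)
    {
        sb->cycles_left[reg] = latency;
    }

    /* Self-update each cycle: every entry decrements toward zero. */
    void sb_tick(scoreboard_t *sb)
    {
        for (int r = 0; r < NUM_REGS; r++)
            if (sb->cycles_left[r] > 0)
                sb->cycles_left[r]--;
    }

    /* A source is usable if it will be ready (directly or via a
     * forwarding path) by the stage that consumes it. */
    int sb_source_ready(const scoreboard_t *sb, int reg, uint8_t needed_in)
    {
        return sb->cycles_left[reg] <= needed_in;
    }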
Instruction Scheduling. The Cortex-A8 is a dual-issue processor with two integer pipelines,
pipe0 and pipe1. Pipe0 carries the older instruction and pipe1 the newer. If the older
instruction cannot issue, the instruction in pipe1 will not issue either, even when it has no
hazard or resource conflict [1]. Pipe0 is the default for single instructions. All instructions
progress through the execution pipeline, and their results are recorded into the register file
during the E5 stage; this prevents write-after-read hazards and simplifies tracking of
write-after-write hazards. The pipe0 instruction is free to issue if no hazards are detected
by the scoreboard. Beyond the scoreboard indicators, additional constraints govern dual issue:
the combination of instruction types must be supported. The supported combinations are:
• Any two data processing instructions
• One load/store instruction followed by one data processing instruction
• Older multiply instruction with a newer load/store or data processing instruction
Only one of the two issued instructions may change the program counter; only branch
instructions, or data-processing and load instructions with the program counter as the
destination register, may change its value [1].
The two instructions must also be cross-referenced to verify data dependencies.
Read-after-write or write-after-write hazards may prevent dual issue: pairing is blocked if
the newer instruction requires a result before the older instruction produces it, or if both
instructions write to the same register. The comparison is between when the data is produced
and when it is needed, so dual issue is not prevented if the data is not needed for one or
more cycles. Examples where this occurs:
• Compare or subtract instruction that sets the flags followed by a flag-dependent conditional
branch instruction
• Any ALU instruction followed by a dependent store of the ALU result to memory
• A move or shift instruction followed by a dependent ALU instruction
These instruction pairs are commonplace in conventional code sequences, so handling them
as dual-issue pairs is critical to the overall performance increase [1]. The pairing rules
are summarized in the sketch below.
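
A predicate combining the constraints above (a simplification; the enum, fields, and latency check are illustrative, not the hardware's encoding):

    #include <stdbool.h>

    typedef enum { DATA_PROC, LOAD_STORE, MULTIPLY, BRANCH } insn_class_t;

    typedef struct {
        insn_class_t cls;
        int dest_reg;            /* -1 if none                   */
        int src_regs[2];         /* -1 if unused                 */
        bool writes_pc;
        int result_latency;      /* cycles until result produced */
    } insn_t;

    /* Instruction-type combinations permitted for dual issue. */
    static bool types_ok(const insn_t *older, const insn_t *newer)
    {
        if (older->cls == DATA_PROC && newer->cls == DATA_PROC)
            return true;
        if (older->cls == LOAD_STORE && newer->cls == DATA_PROC)
            return true;
        if (older->cls == MULTIPLY &&
            (newer->cls == LOAD_STORE || newer->cls == DATA_PROC))
            return true;
        return false;
    }

    static bool dual_issue_ok(const insn_t *older, const insn_t *newer,
                              int newer_needs_src_in /* cycles */)
    {
        if (!types_ok(older, newer))
            return false;
        /* Only one of the pair may change the program counter. */
        if (older->writes_pc && newer->writes_pc)
            return false;
        /* Writing the same register (write-after-write) blocks pairing. */
        if (older->dest_reg >= 0 && older->dest_reg == newer->dest_reg)
            return false;
        /* A source needed before the older result exists blocks pairing,
         * unless the data is not needed for one or more cycles. */
        for (int i = 0; i < 2; i++)
            if (newer->src_regs[i] >= 0 &&
                newer->src_regs[i] == older->dest_reg &&
                older->result_latency > newer_needs_src_in)
                return false;
        return true;
    }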
VI. NEON PIPELINE
The Cortex-A8 has other features that complement its high performance, such as the NEON
hybrid SIMD, which grants the Cortex-A8 increased performance in the field of graphics and
other media [2].
There are numerous advantages to NEON. Efficient SIMD operation is ensured through support
for aligned and unaligned data access. Integer and floating-point operations serve a broad
range of applications, including 3D graphics. A single instruction stream and a unified view
of memory produce a simpler tool flow, and a large register file enables efficient data
handling and memory access [2].
So what is NEON? The NEON engine is a SIMD (Single Instruction Multiple Data) accelerator
processor, also known as a vector processor: during the execution of one instruction, the same
operation is performed on up to 16 data elements in parallel [2]. The purpose of this
parallelism is to obtain more MIPS or FLOPS from the SIMD portion of the processor than a
basic SISD (Single Instruction Single Data) processor could deliver at the same clock rate.
The parallelism also decreases the instruction count needed to accomplish a given task relative
to an SISD implementation, thus reducing the number of clock cycles spent on the same task.
To determine how much of a speed increase the NEON engine will grant a specific loop, it is
necessary to look at the data size of the operation. The largest NEON register is 128 bits,
so an operation on 8-bit values can process up to 16 elements simultaneously; with 32-bit
values, up to 4 operations can proceed simultaneously [2]. However, other factors affecting
execution speed must be taken into account, such as loop overhead, memory speeds, and data
throughput. NEON instructions cover mainly numerical, load/store, and some logical operations,
and NEON operations execute while other instructions proceed in the main ARM pipeline.
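
As an illustration of the 16-way parallelism on 8-bit data, the loop below uses the standard arm_neon.h compiler intrinsics (our example, not one from [2]; n is assumed to be a multiple of 16):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Add two byte arrays 16 elements at a time: each vaddq_u8
     * performs sixteen 8-bit additions in one instruction. */
    void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
    {
        for (int i = 0; i < n; i += 16) {
            uint8x16_t va = vld1q_u8(a + i);      /* load 128 bits   */
            uint8x16_t vb = vld1q_u8(b + i);
            vst1q_u8(dst + i, vaddq_u8(va, vb));  /* 16 adds at once */
        }
    }

Each iteration replaces sixteen scalar additions with one vector add, which is exactly where the reduced instruction count described above comes from.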
NEON has four decode stages, known as M0-M3, similar in design to the D0-D4 decode stages
of the main ARM pipeline. The first two stages decode the instruction's resource and operand
requirements, and the last two stages perform instruction scheduling. NEON also has six
execute stages, N1-N6 [1]. The NEON pipeline uses a fire-and-forget issue mechanism and a
static scoreboard, similar to those used by the ARM integer pipeline, with the primary
difference that there is no replay queue [2].
The NEON decode logic is notably capable in that it can dual-issue any load/store (LS)
permute instruction with any non-LS-permute instruction. This requires fewer register ports
than dual-issuing two data-processing instructions would, since LS data is provided directly
from the load data queue. It is also the most useful pairing of instructions to dual-issue,
since significant load/store bandwidth is required to keep up with the Advanced SIMD
data-processing operations [1].
Access to the 32-entry register file is handled in the M3 stage when instructions are issued
[1]. Once an instruction is issued, it is sent to one of seven execution pipelines: integer
arithmetic logic unit, integer multiply, integer shift, NFP add, NFP multiply, IEEE floating
point, or load/store permute, with all execution datapath pipelines balanced at six stages [1].
Fig. 4. NEON Pipeline Stages [1]
VII. NEON INTEGER EXECUTION PIPELINE
There are three execution pipelines responsible for executing NEON integer instructions:
multiply-accumulate (MAC), shift, and ALU. The integer MAC pipeline contains two 32x16
multiply arrays with two 64-bit accumulate units. The 32x16 multiplier array can perform four
8x8, two 16x16, or one 32x16 multiply operation in each cycle and have dedicated register
read ports for the accumulate operand. The MAC unit is also optimized to support one multiply
accumulate operations per cycle for high performance on a sequence of MAC operations with a
common accumulator.
The integer shift pipeline consists of just three stages. When only the shift result is
required, it is made available to subsequent instructions early, at the end of the N3 stage
[1]. When both a shift and an accumulate operation are required, the result from the shift
pipeline is forwarded directly to the MAC pipeline.
The integer ALU pipeline consists of two parallel 64-bit SIMD ALUs, each accepting four
64-bit inputs. The first stage of the ALU pipeline, N1, formats the operands in preparation
for the next cycle; this includes inverting operands as needed for subtract operations,
multiplexing vector element pairs for folding operations, and sign/zero-extending operands
[1]. The second stage, N2, performs the main ALU operations such as add, subtract, logical,
count leading sign/zero, count set, and sum-of-element-pairs operations [1], and also
calculates the flags used in the following stage. The third stage, N3, performs operations
such as compare, test, and max/min for saturation detection. The N3 stage also contains a
SIMD incrementer for generating two's-complement and rounding operations, and a data
formatter for performing high-half and halving operations. Like the shift pipeline, the ALU
pipeline uses the final stages, N4 and N5, to complete any accumulate operations by
forwarding them to the MAC [1].
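
A common-accumulator MAC sequence of the kind described above maps naturally onto the multiply-accumulate intrinsics (our illustration; n is assumed to be a multiple of 4):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Dot product via NEON multiply-accumulate: each vmlal_s16
     * performs four 16x16 multiplies and accumulates the 32-bit
     * products into a single running accumulator register. */
    int32_t dot16(const int16_t *a, const int16_t *b, int n)
    {
        int32x4_t acc = vdupq_n_s32(0);
        for (int i = 0; i < n; i += 4)
            acc = vmlal_s16(acc, vld1_s16(a + i), vld1_s16(b + i));
        /* Horizontal sum of the four accumulator lanes. */
        int32x2_t s = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));
        return vget_lane_s32(vpadd_s32(s, s), 0);
    }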
VIII. NEON LOAD-STORE/PERMUTE EXECUTION PIPELINE
The permute pipeline is fed by the load data queue (LDQ). The LDQ holds all data associated
with NEON load accesses prior to entering the NEON permute pipeline. It is 12 entries
deep, and each entry is 128 bits wide [1]. Data can be placed into the LDQ from either the L1 cache or the L2
memory system. Accesses that hit in the L1 cache will return and commit the data to the LDQ.
Accesses that miss in the L1 cache will initiate an L2 access. A pointer is attached with the load
request as it proceeds down the L2 memory system pipeline. When the data is returned from
the L2 cache, the pointer is used to update the LDQ entry reserved for this load request. Each
entry in the LDQ has a valid bit to indicate valid data returned from L1 cache or L2. Entries in
the LDQ can be filled by L1 or L2 out-of-order, but valid data within the LDQ must be read in
program order. Entries at the front of the LDQ are read off in-order. If a load instruction reaches
the M2 issue stage before the corresponding data has arrived in the LDQ, then it will stall and
wait for the data [1].
L1 and L2 data that is read out of the LDQ is aligned and formatted to be useful for the NEON
execution units. Aligned/formatted data from the LDQ is multiplexed with NEON register read
operands in the M3 stage, before it is issued to the NEON execute pipeline.
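
The out-of-order fill, in-order drain behavior described above can be sketched as a small structure (field names are ours):

    #include <stdbool.h>
    #include <stdint.h>

    #define LDQ_DEPTH 12

    /* One 128-bit LDQ entry; valid is set when L1 or L2 returns data. */
    typedef struct {
        uint64_t data[2];
        bool     valid;
    } ldq_entry_t;

    typedef struct {
        ldq_entry_t entry[LDQ_DEPTH];
        int head;   /* entries are read off the front in program order */
    } ldq_t;

    /* Fill may happen out of order: the L2 pipeline carries a pointer
     * to the entry reserved for its load request. */
    void ldq_fill(ldq_t *q, int ptr, const uint64_t data[2])
    {
        q->entry[ptr].data[0] = data[0];
        q->entry[ptr].data[1] = data[1];
        q->entry[ptr].valid   = true;
    }

    /* Drain is strictly in order: a load at the M2 issue stage stalls
     * until the entry at the head of the queue is valid. */
    bool ldq_head_ready(const ldq_t *q)
    {
        return q->entry[q->head].valid;
    }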
The NEON LS/permute pipeline is responsible for all NEON loads and stores, data transfers
to and from the integer unit, and data permute operations. One of the more interesting features
of the NEON instruction set is the data permute operations, which can be performed
register-to-register or as part of a load or store operation. These operations allow bytes of
memory to be interleaved into packed values in SIMD registers. For example, when adding two
eight-byte vectors, one may wish to gather all of the odd bytes of memory into register A and
the even bytes into register B [1]. The permute instructions in NEON allow operations like
this to be done natively in the instruction set, often with only a single instruction [1].
This data permute functionality is implemented by the load-store permute pipeline. Any
required data permutation is performed across two stages, N1-N2. In the N3 stage, store data
can be forwarded from the permute pipeline and sent to the NEON store buffer in the memory
system [1].
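
The odd/even-byte example above corresponds to NEON's structure loads; with intrinsics, a single vld2q_u8 performs the deinterleaving load (our illustration):

    #include <arm_neon.h>
    #include <stdint.h>

    /* VLD2 reads 32 bytes and deinterleaves them in one instruction:
     * even-indexed bytes land in pair.val[0] (register A) and
     * odd-indexed bytes in pair.val[1] (register B). */
    void split_even_odd(const uint8_t *src, uint8_t *even, uint8_t *odd)
    {
        uint8x16x2_t pair = vld2q_u8(src);
        vst1q_u8(even, pair.val[0]);
        vst1q_u8(odd,  pair.val[1]);
    }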
IX. NEON FLOATING-POINT EXECUTION PIPELINES
The NEON floating-point (NFP) unit has two main pipelines: a 6-stage multiply pipeline and
a 6-stage add pipeline [1]. The add pipeline adds two single-precision floating-point numbers,
producing a single-precision sum. The multiply pipeline multiplies two single-precision
floating-point numbers, producing a single-precision product. In both cases the pipelines are
2-way SIMD, which means that two 32-bit results are produced in parallel when executing NFP
instructions [1].
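
With intrinsics, that 2-way width corresponds to the 64-bit float32x2_t vector type (our illustration):

    #include <arm_neon.h>

    /* One vadd_f32 produces two single-precision sums in parallel,
     * matching the NFP add pipeline's 2-way SIMD width. */
    float32x2_t add_pairs(float32x2_t a, float32x2_t b)
    {
        return vadd_f32(a, b);
    }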
X. NEON'S IEEE-COMPLIANT FLOATING-POINT ENGINE
The IEEE-compliant floating-point engine is a non-pipelined implementation of the ARM
floating-point instruction set, targeted at medium-performance, IEEE 754-compliant single-
and double-precision floating point [1]. It is designed to provide general-purpose
floating-point capabilities for a Cortex-A8 processor. The engine is not pipelined for most
operations and modes; instead, it iterates on a single instruction until that instruction has
completed. A subsequent operation is stalled until the prior operation has fully completed
execution and written its result to the register file. The IEEE-compliant engine is used for
any floating-point operation that cannot be executed in the NEON floating-point pipeline,
which includes all double-precision operations and any floating-point operation run with full
IEEE precision enabled.
XI. VFP
VFP (Vector Floating Point) is a floating-point hardware accelerator whose primary purpose
is to perform one operation on one set of inputs and return one output, thereby speeding up
floating-point calculations. Considerably slower software math libraries must be used by ARM
processors when dedicated floating-point hardware is not available. The VFP supports both
single- and double-precision floating-point calculations compliant with IEEE 754 [2]. It is
also worth noting that the VFP does not deliver the same performance increase that NEON
grants, because it lacks NEON's highly parallel, fully pipelined architecture.
XII. ARM CORTEX-A8 COMPARED TO ARM CORTEX-A17
The ARM Cortex-A8 is part of the ARMv7-A architecture. Seven cores have been designed
with this architecture, including the Cortex-A8 and the Cortex-A17. The ARM Cortex-A17 is
the most powerful core in the same family as the Cortex-A8, yet the differences between the
two are drastic, from internal specifications to actual use in devices.
The Cortex-A17 provides a 60% increase in performance over the Cortex-A9, and the
Cortex-A9 a 50% increase over the Cortex-A8 [10]; compounding the two, the Cortex-A17
delivers roughly 2.4 times the performance of the Cortex-A8, a 140% increase.
Fig. 5. Cortex-A17 performance comparison to the Cortex-A9 [8]
This supports the initial observation that the Cortex-A17 is far more powerful than the
Cortex-A8, even though both share the same 32-bit ARMv7-A architecture with the NEON SIMD
and the VFP hardware accelerator. Like the Cortex-A8, this core is very popular in mobile
devices, combining high performance with the high efficiency introduced with the Cortex-A8.
The Cortex-A17 scales up to four cores, with a fully out-of-order pipeline delivering optimal
performance for today's premium mobile devices [8]. This is a key difference, since the
Cortex-A8 supports only a single core, hence much of the large speed increase of the
Cortex-A17. The decode width of the Cortex-A17 is only one more than the Cortex-A8's, yet
the ability to decode one more instruction in parallel creates an improvement without
sacrificing efficiency. The pipeline of the Cortex-A8 is 13 stages, in order, while the
Cortex-A17's is 11+ stages, out of order. The NEON (SIMD) datapath of the Cortex-A8 is
64 bits wide, whereas the Cortex-A17's is 128 bits wide, allowing greater parallel processing
of data. The Cortex-A17 plays a major role in the big.LITTLE architecture, whereas the
Cortex-A8 does not use big.LITTLE at all. The Cortex-A8 does not have a pipelined VFP
accelerator, whereas the Cortex-A17 does, which improves floating-point performance.
The Cortex-A8 is used in many commercial applications that affect our daily lives. It serves
in smartphones as an application processor running fully featured mobile operating systems;
the Cortex-A17 is commonly seen in smartphones as well, and also in tablets, unlike the
Cortex-A8. The Cortex-A8 is also used in netbooks as a power-efficient main processor running
a desktop OS. It appears in set-top boxes as the main processor managing a rich OS,
multi-format A/V, and the UI, as does the Cortex-A17. Both are used in digital TV applications
as the processor managing the rich OS, UI, and browser functions. The Cortex-A8 is used in
home networking as a control processor for system management, in storage networking as a
control processor managing traffic flow, and even in printers as a high-performance integrated
processor [8][9]. The Cortex-A17 additionally targets industrial and automotive infotainment,
which the Cortex-A8 did not [8]. These are devices we interact with in our lives, some of them
daily. The small size of the core is advantageous because it fits into small devices such as
smartphones, netbooks, TV receivers, and printers. The Cortex-A8's power efficiency is also a
major advantage: for small battery-powered devices it makes a large difference in usage time
per charge. The performance of the Cortex-A8 is valuable in many of these applications; with
its pipelining and the enhancements from the NEON SIMD and the VFP hardware accelerator,
it gives small devices such as smartphones impressive processing speed. The Cortex-A8 and
the Cortex-A17 are thus very similar, yet separated by large performance differences.
The Cortex-A8 is a high-performance processor designed for complex systems. It features:
• A symmetric, superscalar pipeline for full dual-issue capability
• High frequency through an efficient, deep pipeline
• An advanced branch prediction unit with greater than 95% accuracy
• An integrated Level 2 cache for optimal performance in high-performance systems [9]
The Cortex-A8 is designed to handle media processing in software with NEON technology,
which provides:
• A 128-bit SIMD data engine
• Twice the performance of ARMv6 SIMD
• Power savings through efficient media processing
• The flexibility to handle the media formats of the future
• Easy integration of multiple codecs in software with NEON technology on the Cortex-A8
• Enhanced user interfaces [9]
The Cortex-A8 boasts many features, but how do they compare to the Cortex-A17's? The
Cortex-A8 features NEON, a 128-bit SIMD engine that enables high-performance media
processing. It also features an optimized Level 1 cache, integrated tightly into the processor
with a single-cycle access time, as well as an integrated Level 2 cache built into the core to
provide ease of integration, power efficiency, and optimal performance. The Cortex-A8 also
features Thumb-2 technology, which delivers the peak performance of traditional ARM code
while providing up to a 30% reduction in the memory required to store instructions. It also
has dynamic branch prediction to minimize misprediction penalties; the dynamic branch
predictor achieves 95% accuracy across a wide range of industry benchmarks. The Cortex-A8
also features a memory management unit; having a full MMU enables the Cortex-A8 to run
rich operating systems in a variety of applications. It also features Jazelle-RCT technology, a
Java-acceleration technology that optimizes Just-in-Time (JIT) and Dynamic Adaptive
Compilation (DAC) and reduces memory footprint by up to three times. Its memory system
is optimized for power efficiency and high performance, and its TrustZone technology allows
for secure transactions and Digital Rights Management (DRM) [9]. This list of features comes
from the ARM website and the product specification pages. The Cortex-A17 also has a list
of specifications on the ARM website, but they differ from the Cortex-A8's. The two
processors share features such as Thumb-2 technology, TrustZone technology, NEON, and
optimized Level 1 caches. The Cortex-A17 also has an integrated Level 2 cache controller,
but with a configurable size. The Cortex-A17 adds DSP and SIMD extensions, which increase
the DSP processing capability of ARM solutions in high-performance applications while
offering the low power consumption required by portable, battery-powered devices. It also
includes a floating-point unit: the Cortex-A17 provides a high-performance FPU with hardware
support for half-, single-, and double-precision floating-point arithmetic. The Cortex-A17
further features hardware virtualization, highly efficient hardware support for data management
and arbitration whereby multiple software environments and their applications can
simultaneously access the system's capabilities. It also has the Large Physical Address
Extension (LPAE), which enables the processor to address up to 1TB of memory, and an
AMBA4 CoreLink CCI-400 cache-coherent interconnect, which provides AMBA4 ACE ports
for full coherency between multiple processors, enabling use cases like big.LITTLE [8]. This
lengthy list of Cortex-A17 features was likewise retrieved from the ARM website, in the
Cortex-A17 product specification section. The comparison shows where the ARMv7-A
architecture has evolved: the Cortex-A8 is one of the middle models in the development line,
whereas the Cortex-A17 is the newest and most powerful processor ARM produces in this
architecture set.
The debug tooling for the Cortex-A8 and the Cortex-A17 is the same. The ARM DS-5
Development Studio fully supports all ARM processors and IP, as well as a wide range of
third-party tools, operating systems, and EDA flows. DS-5 represents a comprehensive range
of software tools to create, debug, and optimize systems based on the Cortex-A8 and
Cortex-A17 processors [8]. This description comes from the Cortex-A17 related-products
page but is nearly identical to the Cortex-A8's. Both incorporate the DS-5 Debugger, whose
powerful and intuitive graphical environment enables fast debugging of bare-metal, Linux,
and Android native applications. The DS-5 Debugger provides predefined configurations for
Fixed Virtual Platforms (built on ARM Fast Models technology) and ARM Versatile Express
boards, enabling early software development before silicon availability [8][9]. This material
is the same for both the Cortex-A17 and the Cortex-A8.
Both the Cortex-A17 and the Cortex-A8 use the same family of products for graphics
processing. The Mali family of products combines to provide the complete graphics stack for
all embedded graphics needs, enabling device manufacturers and content developers to deliver
the highest-quality, cutting-edge graphics solutions across the broadest range of consumer
devices [8][9]. An example is the Mali-400 paired with the Cortex-A8, the world's first
OpenGL ES 2.0-conformant multi-core GPU, which provides 2D and 3D acceleration with
performance scalable up to 1080p resolution [9].
For the Cortex-A8, the ARM Physical IP platforms deliver process-optimized IP for
best-in-class implementations of the Cortex-A8 processor at 40nm and below [9]. The
Cortex-A8 uses the standard cell logic libraries, which are available in a variety of
architectures; ARM standard cell libraries support a wide performance range for all types of
SoC designs. It also supports memory compilers and registers, a broad array of silicon-proven
SRAM, register file, and ROM memory compilers for all types of SoC designs, ranging from
performance-critical to cost-sensitive and low-power applications. The Cortex-A8 also
supports interface libraries, a broad portfolio of silicon-proven interface IP designed to meet
varying system architectures and standards [9]. For the Cortex-A17, the ARM Physical IP
platforms deliver process-optimized IP for best-in-class implementations at 28nm and below
[8]; this is similar to the Cortex-A8, except for the move from 40nm to 28nm. A set of
high-performance POP IP containing advanced ARM physical IP for 28nm technologies
supports the Cortex-A17, enabling rapid development of leading physical implementations
[8]. ARM is uniquely able to design the optimization packs in parallel with the Cortex-A17
processor, enabling the processor and physical IP combination to deliver best-in-class
performance within the mobile power envelope while facilitating rapid time to market [8].
The physical IP for the Cortex-A17 thus differs from the Cortex-A8's through its use of
POP IP.
System IP components are essential for building complex systems-on-chip; by utilizing
System IP components, developers can significantly reduce development and validation
cycles, saving cost and reducing time to market [9]. The Cortex-A8 uses a different set of
System IP tools than the Cortex-A17. The differences are as follows:
Cortex-A8
• Advanced AMBA 3 Interconnect IP using the AXI AMBA Bus.
• Dynamic Memory Controller using the AXI AMBA Bus.
• Adaptive Verification IP using the AXI AMBA Bus.
• DMA Controller using the AXI AMBA Bus.
• CoreSight Embedded Debug and Trace using the ATB AMBA Bus. [9]
The set of tools that the Cortex-A17 uses for System IP is as follows:
Cortex-A17
• AMBA 4 Cache Coherent Interconnect
– The CCI-400 provides AMBA 4 AXI Coherency Extensions compliant ports for
full coherency for the Cortex-A17 processor and other Cortex processors, better
utilizing caches and simplifying software development. This feature is essential for high
bandwidth applications including future mobile SoCs that require clusters of coherent
processors or GPUs. Combined with other available ARM CoreLink System IP, the
CCI-400 increases system performance and power efficiency.
– CoreLink CCI-400 Cache Coherent Interconnect provides system coherency with
Cortex processors and an IO Coherent channel with Mali IP and opens up a number
of possibilities for offload and acceleration of tasks. When combined with a Cortex-A7
processor, CCI-400 allows big.LITTLE operation with full L2 cache coherency between
the Cortex-A17 and Cortex-A7 processors.
– Efficient voltage scaling and power management are enabled with the CoreLink
ADB-400, unlocking DVFS control of the Cortex-A17 processor.
• AMBA Generic Interrupt Controller
– AMBA Interrupt Controllers like the GIC-400 provide an efficient implementation of
the ARM Generic Interrupt Specification to work in multi-processor systems. They
are highly configurable to provide the ultimate flexibility in handling a wide range of
interrupt sources that can control a single CPU or multiple CPUs.
• AMBA 4 CoreLink MMU-500
– CoreLink MMU-500 provides a hardware-accelerated common memory view for all
SoC components and minimizes software overhead, letting virtual machines get on with
other system-management functions.
• CoreLink TZC-400
– The Cortex-A17 processor implements a secure, optimized path to memory to further
enhance its market leading performance with the aid of CoreLink TZC-400 TrustZone
address space controller.
• CoreLink DMC-400
– All interconnect components and the ARM DMC guarantee bandwidth and latency
requirements by utilizing in-built dynamic QoS mechanisms.
• CoreSight SoC-400
– ARM CoreSight SoC debug and trace hardware is used to profile and optimize the
system software running throughout, from driver to OS level.
• Artisan POP IP
– The Cortex-A17 processor is supported through advanced physical POP IP for
accelerated time to market [8].
These differences show how much more technology is in the Cortex-A17 versus the
Cortex-A8, even though they belong to the same ARM architecture family (ARMv7-A).
Differences like these make clear how flexible these systems are and what can be done with
them, from media processing to data crunching. They are important to understand because
they lay out where this technology is headed and what changes could be, and are being, made
to create more powerful yet more efficient devices.
XIII. CONCLUSION
The Cortex-A8 is an important example of RISC-based superscalar design. It has many
features that make it a powerful and flexible processor, and the sum of its components results
in increased performance and flexibility. Its instruction pipelining and branch prediction are
critical to ensuring performance efficiency. The NEON SIMD possesses a robust architecture,
including its own instruction pipelines, and introduces a host of capabilities for multimedia
and graphics processing. Examination of other ARM processors further illustrates the
Cortex-A8's evolution; the Cortex-A8 belongs to a family consisting of seven other processors.
A comparison with the faster Cortex-A17 demonstrates the high degree of flexibility of the
Cortex-A8. This flexibility is critical to the Cortex-A8's success in consumer electronics; the
processor is commercially available in a variety of applications, including mobile devices and
other media products. Studying the ARM Cortex-A8 is critical to understanding the role
superscalar architecture plays in embedded systems.
REFERENCES
[1] D. Williamson, "ARM Cortex-A8: A High-Performance Processor for Low-Power
Applications," Unique Chips and Systems (2007): 79.
[2] Texas Instruments Wiki, "Cortex-A8," retrieved from
http://processors.wiki.ti.com/index.php/Cortex-A8
[3] ARM, "NEON," retrieved from
http://arm.com/products/processors/technologies/neon.php
[4] ARM, "Cortex-A8 Processor," retrieved from
http://arm.com/products/processors/cortex-a/cortex-a8.php
[5] ARM, "The ARM Architecture: With a Focus on v7A and Cortex-A8," retrieved from
http://www.arm.com/files/pdf/ARM_Arch_A8.pdf
[6] ARM, Architecture Reference Manual, ARM DDI E (2000),
https://www.scss.tcd.ie/~waldroj/3d1/arm_arm.pdf
[7] "Cortex-A17 Processor," retrieved November 17, 2014, from
http://www.arm.com/products/processors/cortex-a/cortex-a17-processor.php
[8] "Cortex-A8 Processor," retrieved November 17, 2014, from
http://www.arm.com/products/processors/cortex-a/cortex-a8.php
[9] "Cortex-A9 Processor," retrieved November 17, 2014, from
http://www.arm.com/products/processors/cortex-a/cortex-a9.php

Mais conteúdo relacionado

Mais procurados

Motherboard Components
Motherboard ComponentsMotherboard Components
Motherboard Components
stooty s
 
Computer organiztion5
Computer organiztion5Computer organiztion5
Computer organiztion5
Umang Gupta
 

Mais procurados (20)

Chapter 5 a
Chapter 5 aChapter 5 a
Chapter 5 a
 
Associative memory 14208
Associative memory 14208Associative memory 14208
Associative memory 14208
 
80386 microprocessor
80386 microprocessor80386 microprocessor
80386 microprocessor
 
Arm architecture
Arm architectureArm architecture
Arm architecture
 
intel 8086 introduction
intel 8086 introductionintel 8086 introduction
intel 8086 introduction
 
Programming the basic computer
Programming the basic computerProgramming the basic computer
Programming the basic computer
 
General register organization (computer organization)
General register organization  (computer organization)General register organization  (computer organization)
General register organization (computer organization)
 
Input Output Organization
Input Output OrganizationInput Output Organization
Input Output Organization
 
X86 Architecture
X86 Architecture X86 Architecture
X86 Architecture
 
Basic computer organization
Basic computer organizationBasic computer organization
Basic computer organization
 
Computer registers
Computer registersComputer registers
Computer registers
 
Register transfer & microoperations moris mano ch 04
Register transfer & microoperations    moris mano ch 04Register transfer & microoperations    moris mano ch 04
Register transfer & microoperations moris mano ch 04
 
Memory organization (Computer architecture)
Memory organization (Computer architecture)Memory organization (Computer architecture)
Memory organization (Computer architecture)
 
Unit vi (2)
Unit vi (2)Unit vi (2)
Unit vi (2)
 
Register Transfer Language,Bus and Memory Transfer
Register Transfer Language,Bus and Memory TransferRegister Transfer Language,Bus and Memory Transfer
Register Transfer Language,Bus and Memory Transfer
 
Motherboard Components
Motherboard ComponentsMotherboard Components
Motherboard Components
 
The Basic Organization of Computers
The Basic Organization of ComputersThe Basic Organization of Computers
The Basic Organization of Computers
 
Cache memory
Cache memory Cache memory
Cache memory
 
Ca basic computer organization
Ca basic computer organizationCa basic computer organization
Ca basic computer organization
 
Computer organiztion5
Computer organiztion5Computer organiztion5
Computer organiztion5
 

Destaque (11)

Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 
Memory model
Memory modelMemory model
Memory model
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 
FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)FPGA/Reconfigurable computing (HPRC)
FPGA/Reconfigurable computing (HPRC)
 
ARM AAE - Memory Systems
ARM AAE - Memory SystemsARM AAE - Memory Systems
ARM AAE - Memory Systems
 
Review Multicore processing based on ARM architecture
Review Multicore processing based on ARM architectureReview Multicore processing based on ARM architecture
Review Multicore processing based on ARM architecture
 
Hardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ ProcessorsHardware accelerated Virtualization in the ARM Cortex™ Processors
Hardware accelerated Virtualization in the ARM Cortex™ Processors
 
Q4.11: ARM Architecture
Q4.11: ARM ArchitectureQ4.11: ARM Architecture
Q4.11: ARM Architecture
 
cache memory
cache memorycache memory
cache memory
 
Memory hierarchy
Memory hierarchyMemory hierarchy
Memory hierarchy
 
ARM architcture
ARM architcture ARM architcture
ARM architcture
 

Semelhante a arm-cortex-a8

IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD Editor
 
8086 MICROPROCESSOR
8086 MICROPROCESSOR8086 MICROPROCESSOR
8086 MICROPROCESSOR
Alxus Shuvo
 

Semelhante a arm-cortex-a8 (20)

The sunsparc architecture
The sunsparc architectureThe sunsparc architecture
The sunsparc architecture
 
W04505116121
W04505116121W04505116121
W04505116121
 
iPhone Architecture - Review
iPhone Architecture - ReviewiPhone Architecture - Review
iPhone Architecture - Review
 
16-bit Microprocessor Design (2005)
16-bit Microprocessor Design (2005)16-bit Microprocessor Design (2005)
16-bit Microprocessor Design (2005)
 
Module-2 Instruction Set Cpus.pdf
Module-2 Instruction Set Cpus.pdfModule-2 Instruction Set Cpus.pdf
Module-2 Instruction Set Cpus.pdf
 
Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...
 
Design & Simulation of RISC Processor using Hyper Pipelining Technique
Design & Simulation of RISC Processor using Hyper Pipelining TechniqueDesign & Simulation of RISC Processor using Hyper Pipelining Technique
Design & Simulation of RISC Processor using Hyper Pipelining Technique
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
EC8791 ARM Processor and Peripherals.pptx
EC8791 ARM Processor and Peripherals.pptxEC8791 ARM Processor and Peripherals.pptx
EC8791 ARM Processor and Peripherals.pptx
 
Dm25671674
Dm25671674Dm25671674
Dm25671674
 
Architecture Of TMS320C50 DSP Processor
Architecture Of TMS320C50 DSP ProcessorArchitecture Of TMS320C50 DSP Processor
Architecture Of TMS320C50 DSP Processor
 
Cao 2012
Cao 2012Cao 2012
Cao 2012
 
Design and Implementation of Pipelined 8-Bit RISC Processor using Verilog HDL...
Design and Implementation of Pipelined 8-Bit RISC Processor using Verilog HDL...Design and Implementation of Pipelined 8-Bit RISC Processor using Verilog HDL...
Design and Implementation of Pipelined 8-Bit RISC Processor using Verilog HDL...
 
Ef35745749
Ef35745749Ef35745749
Ef35745749
 
Question paper with solution the 8051 microcontroller based embedded systems...
Question paper with solution  the 8051 microcontroller based embedded systems...Question paper with solution  the 8051 microcontroller based embedded systems...
Question paper with solution the 8051 microcontroller based embedded systems...
 
unit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptxunit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptx
 
Doc32002
Doc32002Doc32002
Doc32002
 
8086 MICROPROCESSOR
8086 MICROPROCESSOR8086 MICROPROCESSOR
8086 MICROPROCESSOR
 
Arm11
Arm11Arm11
Arm11
 
ARM11.ppt
ARM11.pptARM11.ppt
ARM11.ppt
 

arm-cortex-a8

  • 1. CEC 470, PROJECT II, DECEMBER 2014 1 ARM Cortex-A8: An Overview Andrew Daws, David Franklin, Cole Laidlaw Abstract The purpose of this document is to provide a comprehensive overview of the ARM Cortex-A8. This will include background information, a detailed description of the Cortex-A8s RISC based superscalar design, provide brief comparison with other ARM processors, and describe the NEON SIMD and other features. The Cortex-A8 is among the highest performance ARM processors currently in service. It is widely used in mobile devices and other consumer electronics. This analysis will include basic descriptions of components and processes of the Cortex-A8 architecture. These features include a 13 stage instruction pipeline, branch prediction, and other components which result in high performance with low power needs. The Cortex-A8 includes the NEON SIMD which uses integer and floating point operations to deliver greater graphics and multimedia capabilities. Additionally, we will include a brief description of the ARMv7-A architecture to establish comparison to the Cortex-A8. Index Terms ARM, SIMD, VFP, RISC. I. INTRODUCTION THE increasing ubiquity of mobile computing continues to increase the demand for processors that are versatile and deliver high performance. This demand is driven by the need for a variety of services to include connectivity and entertainment. The ARM Cortex-A8 - developed by Texas Instruments - combines connectivity, performance, and multimedia. Its achieves versatility while attaining energy efficiency. Its lower power profile can be measured in Instructions per Cycle (IPC) and is measured through a balance of increased operating frequency and machine efficiency [1]. The increase in performance results from superscalar execution, improvements in branch prediction, and efficient memory system. The Cortex-A8 has a pipeline with less instruction depth per stage than previous ARMs. It is important to analyze the Cortex- A8s features to highlight their effects on power saving and improved performance. There is also an important need for graphics and multimedia. The Cortex-A8 meets this demand with the NEON. The NEON achieves greater graphical capabilities by utilizing 64 bit integer and floating point operations. The NEON is a Single Instruction Multiple Data (SIMD) accelerator processor. It is capable of executing one instruction across 16 sets simultaneously. This parallelism confers a host of new capabilities. The Cortex-A8 also employs the Vector Floating Point (VFP) accelerator for the purpose of speeding up floating point operations. The Cortex-A8s capabilities can illustrated by making a brief comparison with other architec- tures within ARMs Cortex family. The Cortex-A8 belongs to the ARMv7-A family. This group consists of seven of other processors. The Cortex-A8 is proven to be the more flexible processor when compared to related architectures. These other designs can be faster and more powerful but lack the Cortex-A8s versatility. Any comparison illustrates the Cortex-A8s success as a commercial grade processor. Analyzing these concepts reveals the importance of the Cortex-A8s RISC based superscalar design and its versatility.
  • 2. CEC 470, PROJECT II, DECEMBER 2014 2 II. ARM PROCESSORS The ARM processor family has had a substantial impact on the world of consumer electronics. ARMs developers founded their company in 1990 as Advanced RISC Machines Ltd [6]. This companys purpose was to develop commercial grade general purpose processors. ARM processors can be found on many platforms including laptops, mobile devices, and other embedded systems. The ARM is a Reduced Instruction Set Computer (RISC). Typical RISC architectures includes several features including: a large register file, load/store architecture, simple address modes, and uniform instruction lengths. The ARM architecture includes several more aspects in addition to these basic RISC features. These features include: • Control of ALU and shifter in most data processing operations to maximize their respective uses. • Optimization of program loops through auto-increment and auto-decrement of address modes. • Load and Store multiple instructions to maximize data throughput. • Maximized execution throughput by conditional execution of almost all instructions These ARM features enhance already existing RISC architecture to reach high performance, reduced code size, and reduced power needs. A typical ARM processor will have 16 visible registers out of 31 total registers. There are three special purpose registers: the stack pointer, link register, and program counter. ARM supports exception handling which will cause a standard register to be replaced with a register specific to its respective exception type. All processor states are are contained in status registers. The ARM instruction set includes branch instructions, data processing, status register transfer, load/store, coprocessor, and exception generating instructions [7].
III. ARCHITECTURE
The ARM Cortex-A8 is a microprocessor aimed at general-purpose consumer electronics. The ARM architecture is load/store, with an instruction set similar to other RISC processors but containing numerous special features. Shift and ALU operations may be carried out in the same instruction. The program counter may be used as a general purpose register. There is support for 16-bit and 32-bit instruction opcodes. Lastly, there is a fully conditional instruction set [1]. There are 16 32-bit registers, 13 of which are general purpose. The stack pointer, link register, and program counter comprise the remaining three. These registers can be used for load/store instructions and data processing in addition to their special purposes.
Pipeline. The Cortex-A8 utilizes a sophisticated instruction pipeline architecture. This 13-stage instruction pipeline implements an in-order, dual-issue, superscalar processor with advanced branch prediction [1]. The main pipeline is divided into fetch, decode, and execute stages. The two fetch stages (F1 and F2) are responsible for branch prediction and for placing instructions into a buffer for decoding. Decoding is implemented in five stages (D0 through D4) that decode, schedule, and issue instructions [1]. Complex instruction sequences are processed here, or even replayed if the memory system stalls. The six execute stages comprise a load-store pipeline, a multiply pipeline, and two symmetric ALU pipelines. There are additional pipelines beyond the main 13-stage pipeline: an 8-stage pipeline is used for the level-2 memory system and a 13-stage pipeline for debug trace execution. The NEON SIMD execution engine implements a 10-stage pipeline, with four stages for instruction decode and six stages for execution.
Fig. 1. Full Pipeline [1]
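For quick reference, the sketch below collects the stage labels used in the remainder of this document into a single C enumeration. The grouping follows the description above; the individual E-stage labels are our assumption from conventional Cortex-A8 diagrams (the text itself names stages up to E5).

```c
/* Stage labels for the 13-stage main integer pipeline: two counted
 * fetch stages, five decode stages, six execute stages. F0 exists
 * but is architecturally uncounted. */
enum a8_pipeline_stage {
    F1, F2,                    /* instruction fetch         */
    D0, D1, D2, D3, D4,        /* decode, schedule, issue   */
    E0, E1, E2, E3, E4, E5     /* execute and load/store    */
};
```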
IV. INSTRUCTION FETCH
Instruction Fetch Pipeline. Dynamic branch prediction, instruction queueing hardware, and the entire level-1 instruction-side memory are located within the instruction fetch unit. Instruction fetching includes dynamic branch prediction and instruction queuing [1]. The instruction fetch pipeline runs decoupled from the rest of the processor and may acquire up to four instructions per cycle in parallel with the predicted execution stream. Instructions are subsequently placed in a queue to be decoded. A new virtual address is created in the F0 stage once the fetch pipeline begins. This may be a predicted target address, or the next sequential address calculated from the previous fetch if no branch is predicted. The F0 stage is not counted as the first pipeline stage; ARM processor pipelines traditionally count the instruction cache access as the first official stage. The F1 stage serves two purposes in parallel: it accesses the instruction cache array and the branch prediction arrays. The F2 stage is the final stage in the instruction fetch pipeline. Instruction data is returned from the instruction cache and placed in its queue for later use by the decode unit. If a branch is predicted, the new target address is used as the fetch address; this replaces the address calculated in the F0 stage and discards the instruction fetch made in the F1 stage. Code sequences containing many branches can therefore lose fetch cycles to this redirection [1].
Fig. 2. Fetch Pipeline [1]
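The F0 address selection described above amounts to a two-way choice each cycle. The sketch below models it in C; the function name and the 8-byte fetch width constant (64 bits per cache access) are our assumptions.

```c
#include <stdint.h>

#define FETCH_BYTES 8u   /* the instruction cache returns 64 bits per access */

/* Next fetch address: either the predictor's target or the
 * sequential successor of the previous fetch. */
uint32_t next_fetch_address(uint32_t prev_fetch, int predicted_taken,
                            uint32_t predicted_target)
{
    return predicted_taken ? predicted_target
                           : prev_fetch + FETCH_BYTES;
}
```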
Instruction Cache. An instruction cache is implemented in the instruction fetch unit, and it is the unit's largest component [1]. It can be configured as 16KB or 32KB in size and returns 64 bits of data per access. The instruction cache is a physically addressed, four-way set associative cache. A fully associative 32-entry translation lookaside buffer (TLB) is also included. The instruction and data caches are kept identical to ensure design efficiency; differences are minimized by allowing access to the same array structures while making only minor changes to control logic [1]. These elements are consistent with conventional cache designs. The hashed virtual address buffer (HVAB) is not part of this conventional design strategy. Traditionally, the RAM arrays are fired in parallel and the physical address is then compared against the tag arrays to verify the data read from RAM. The HVAB avoids firing all of these arrays in parallel: a 6-bit hash of the virtual address indexes the HVAB to predict which cache way is likely to contain the required data. Translation and a tag compare from the TLB then verify whether the hit is accurate [1]. If a predicted hit is invalid, the access is cancelled and the HVAB and cache data are updated. The TLB translation and tag compare are thereby removed from the critical path to cache access. This process saves power but hinders performance when predictions are inaccurate, which can be mitigated by an efficient hash function with a low probability of false matches.
Instruction Queue. The purpose of the instruction queue is to decouple instruction delivery from instruction consumption. Instructions are placed in the instruction queue after they are fetched from the instruction cache, or forwarded directly to the D0 stage if the queue is empty [1]. Decoupling allows the instruction fetch unit to prefetch ahead of the rest of the integer unit and build a reserve of instructions awaiting decode. This reserve conceals the latency of prediction changes, and it prevents decode unit stalls from spreading back into the prefetch unit during the cycle in which a stall is recognized. Four parallel FIFOs comprise the instruction queue [1]. Each FIFO consists of six entries that are 20 bits wide - 16 bits of instruction data and four bits of control state. An instruction may occupy up to two entries.
Branch Prediction. A 512-entry branch target buffer (BTB) and a 4096-entry global history buffer (GHB) are included in the branch predictor. The BTB indicates whether the current fetch address contains a branch, while counters in the GHB indicate whether a predicted branch should or should not be taken. Both arrays are accessed in parallel with the instruction cache during the F1 stage. A 10-bit global branch history and the four lower bits of the PC select a GHB entry. The branch history is generated from the taken/not-taken status of the 10 most recent branches and saved in the global history register (GHR). This approach increases accuracy by creating traces that are used to make better predictions; mixing the low-order PC bits into the GHB index prevents branches with similar histories from aliasing to the same entry. The BTB holds branch target addresses, covering both direct and indirect branch targets, along with branch type information, and is indexed by the fetch address [1].
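The GHB indexing scheme can be sketched as follows. The source states only which bits participate (a 10-bit global history and four low-order PC bits, selecting one of 4096 entries); the exact combining function below is our assumption.

```c
#include <stdint.h>

#define GHB_ENTRIES 4096u                  /* 12-bit index */

unsigned ghb_index(unsigned ghr, uint32_t pc)
{
    unsigned history = ghr & 0x3FFu;       /* 10 most recent taken/not-taken bits */
    unsigned pc_bits = (pc >> 2) & 0xFu;   /* four low-order (word-aligned) PC bits */
    return ((history << 2) ^ pc_bits) & (GHB_ENTRIES - 1u);
}
```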
Return Stack. Subroutine return predictions are made using a return stack with an eight-entry depth [1]. A return address is pushed onto the stack when the BTB determines that a branch is a subroutine call. On a subroutine return, the address is popped from the stack rather than read from a BTB entry. Because subroutines tend to be short, it is important to support multiple push/pop operations at a time.
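A minimal model of the eight-entry return stack, with names of our own invention, is sketched below: push on a predicted call, pop on a predicted return. The speculative and committed copies discussed next would be two instances of this structure.

```c
#include <stdint.h>

#define RS_DEPTH 8

typedef struct {
    uint32_t entry[RS_DEPTH];
    int top;                  /* wraps around; the oldest entry is overwritten */
} return_stack;               /* initialize with top = 0 */

/* On a predicted subroutine call: push the return address. */
void rs_push(return_stack *rs, uint32_t return_addr)
{
    rs->top = (rs->top + 1) % RS_DEPTH;
    rs->entry[rs->top] = return_addr;
}

/* On a predicted subroutine return: pop the predicted target,
 * used in place of a BTB target address. */
uint32_t rs_pop(return_stack *rs)
{
    uint32_t target = rs->entry[rs->top];
    rs->top = (rs->top - 1 + RS_DEPTH) % RS_DEPTH;
    return target;
}
```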
Speculative updates may be harmful because updates from an incorrect path can desynchronize the return stack, which in turn causes mispredictions. The instruction fetch unit therefore keeps both a speculative and a non-speculative return stack. The speculative return stack is updated immediately, while the non-speculative return stack is not updated until the branch is resolved and is no longer speculative [1]. After an inaccurate prediction, the speculative stack is overwritten with the non-speculative state.

V. INSTRUCTION DECODE
Instruction Decode Pipeline. The instruction decode unit decodes, sequences, and issues new instructions, and provides exception handling [1]. The decode unit is contained within the D0-D4 pipeline stages. The instruction type and the destination and source operands are determined in the D0 and D1 stages. Multi-cycle instructions are divided into multiple single-cycle instructions during the D1 stage. Instructions are written into and read from the pending/replay queue structure during the D2 stage. The D3 stage implements the instruction scheduling logic: the scoreboard is referenced for the next two possible instructions, and these two instructions are analyzed for any dependency hazards that the scoreboard cannot detect. Instructions cannot be stalled once they reach the D3/D4 boundary. Final decode of all control signals critical to the instruction execute and load/store units occurs in the D4 stage.
Fig. 3. Decode Pipeline [1]
Static Scheduling Scoreboard. The static scheduling scoreboard predicts operand availability [1]. Each scoreboard value indicates the number of cycles until a valid result is available. This differs from traditional scoreboards, which normally use a single bit to indicate the availability of a source operand. This information is combined with an instruction's source operands to detect possible dependency hazards. Each scoreboard entry is self-updating on a cycle-to-cycle basis to ensure proper operation when a new register is written: each entry decrements by one every cycle until a new register write occurs or the counter reaches zero, which indicates availability.
The static scheduling scoreboard also tracks each instruction in the execution pipeline by its stage and result. This information is used to generate the forwarding multiplexer control signals that accompany instructions upon issue [1]. The static scheduling scoreboard has several advantages. Used together with the replay queue, it allows a fire-and-forget pipeline with no stalls [1], removing speed paths that would hinder high-frequency operation. The design also conserves power by establishing early which execution units are required.
Instruction Scheduling. The Cortex-A8 is a dual-issue processor with two integer pipelines: pipe0 and pipe1. Pipe0 contains the older instruction and pipe1 the newer. If the older instruction cannot issue, the instruction in pipe1 will not issue either, even when it has no hazard or resource conflict [1]. Pipe0 is the default for single instructions. All instructions progress through the execution pipeline, and their results are written to the register file during the E5 stage. This prevents write-after-read hazards and simplifies tracking of write-after-write hazards. A pipe0 instruction is free to issue if no hazards are detected by the scoreboard. Beyond the scoreboard indications, further constraints govern which pairs of instructions can dual-issue. The combination of instruction types must be considered; the following combinations are supported:
• Any two data processing instructions
• One load/store instruction followed by one data processing instruction
• An older multiply instruction with a newer load/store or data processing instruction
The program counter can only be changed by one of the two issued instructions: only branch instructions, or data processing and load instructions with the program counter as the destination register, may change its value [1]. The two instructions must also be cross-referenced for data dependencies. Read-after-write or write-after-write hazards may prevent dual issue: it is prevented if the newer instruction requires a result before the older instruction produces it, or if both instructions write to the same register. The comparison considers when the data is produced and when it is needed, so dual issue is not prevented if the data is not needed for one or more cycles. Examples of when this occurs:
• A compare or subtract instruction that sets the flags, followed by a flag-dependent conditional branch instruction
• Any ALU instruction followed by a dependent store of the ALU result to memory
• A move or shift instruction followed by a dependent ALU instruction
These patterns are commonplace in conventional code sequences, so issuing such pairs together is critical to the overall performance increase [1].
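The counting scoreboard lends itself to a compact model. The sketch below (all names ours) captures the two behaviors described above: entries that self-decrement each cycle, and a readiness test that accounts for forwarding by asking whether the producer will be done by the time the consumer needs the value.

```c
#include <stdbool.h>

#define NUM_REGS 16

static int cycles_until_ready[NUM_REGS];   /* 0 means available now */

/* When an instruction writing register rd issues, record its result latency. */
void scoreboard_write(int rd, int result_latency)
{
    cycles_until_ready[rd] = result_latency;
}

/* Self-updating entries: every cycle, each nonzero counter decrements. */
void scoreboard_tick(void)
{
    for (int r = 0; r < NUM_REGS; r++)
        if (cycles_until_ready[r] > 0)
            cycles_until_ready[r]--;
}

/* A source register poses no hazard if its value will be ready, possibly
 * via a forwarding path, by the cycle the consuming stage needs it. */
bool source_ready(int rs, int cycles_until_needed)
{
    return cycles_until_ready[rs] <= cycles_until_needed;
}
```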
VI. NEON PIPELINE
The Cortex-A8 has other features that complement its high performance, such as the NEON hybrid SIMD, which grants the Cortex-A8 increased performance in graphics and other media [2]. NEON offers numerous advantages. Efficiency of SIMD operations is ensured through aligned and unaligned data access. Integer and floating-point operations serve a broad range of applications, including 3D graphics. A single instruction stream and a unified view of memory produce a simpler tool flow, and the large register file enables efficient data handling and memory access [2]. So what is NEON? The NEON engine is a SIMD (Single Instruction Multiple Data) accelerator processor, also known as a vector processor, meaning that one instruction performs the same operation on up to 16 data sets in parallel [2]. The purpose of this parallelism is to obtain more MIPS or FLOPS from the SIMD portion of the processor than a basic SISD (Single Instruction Single Data) processor could deliver at the same clock rate. This parallelism also decreases the instruction count needed to accomplish a given task relative to a SISD design, and thus the number of clock cycles spent on it. To determine how much speedup the NEON engine grants a given loop, it is necessary to look at the data size of the operation. The largest NEON register is 128 bits, so an operation on 8-bit values can be performed on up to 16 elements simultaneously; with 32-bit values, up to four operations proceed simultaneously [2]. Other factors such as loop overhead, memory speeds, and data throughput also affect execution speed. NEON instructions are mainly numerical, load/store, and some logical operations, and NEON operations execute while other instructions proceed in the main ARM pipeline. NEON has four decode stages, known as M0-M3, similar in design to the D0-D4 decode stages of the main ARM pipeline. The first two stages decode the instruction's resource and operand requirements; the last two handle instruction scheduling. NEON also has six execute stages, N1-N6 [1]. The NEON pipeline uses a fire-and-forget issue mechanism and a static scoreboard, similar to those used by the ARM integer pipeline, with the primary difference being that there is no replay queue [2]. The NEON decode logic is notably capable in that it can dual-issue any load/store-permute (LS-permute) instruction with any non-LS-permute instruction. This requires fewer register ports than dual-issuing two data processing instructions, since LS data is provided directly from the load data queue. It is also the most useful pairing to dual-issue, because significant load/store bandwidth is required to keep up with the Advanced SIMD data processing operations [1]. Access to the 32-entry register file is handled in the M3 stage when instructions are issued [1]. Once issued, an instruction is sent to one of seven execution pipelines: integer arithmetic logic unit, integer multiply, integer shift, NFP add, NFP multiply, IEEE floating point, or load/store permute, with all execution datapath pipelines balanced at six stages [1].
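The 16-lane parallelism is easiest to see with compiler intrinsics. The sketch below uses the standard <arm_neon.h> intrinsics available in GCC and Clang; the function itself is our illustration, not taken from the source. A single vaddq_u8 performs sixteen 8-bit additions on 128-bit Q registers.

```c
#include <arm_neon.h>
#include <stdint.h>

void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);       /* load 16 bytes */
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vaddq_u8(va, vb));   /* 16 additions in one instruction */
    }
    for (; i < n; i++)                         /* scalar tail */
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```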
Fig. 4. NEON Pipeline Stages [1]

VII. NEON INTEGER EXECUTION PIPELINE
Three execution pipelines are responsible for executing NEON integer instructions: multiply-accumulate (MAC), shift, and ALU. The integer MAC pipeline contains two 32x16 multiply arrays with two 64-bit accumulate units. Each 32x16 multiplier array can perform four 8x8, two 16x16, or one 32x16 multiply operation per cycle, and has dedicated register read ports for the accumulate operand. The MAC unit is optimized to sustain one multiply-accumulate operation per cycle for high performance on a sequence of MAC operations with a common accumulator, as illustrated in the sketch after this section. The integer shift pipeline consists of just three stages. When only the shift result is required, it is made available to subsequent instructions early, at the end of the N3 stage [1]. When both a shift and an accumulate operation are required, the result from the shift pipeline is forwarded directly to the MAC pipeline. The integer ALU pipeline consists of two parallel 64-bit SIMD ALUs, each accepting four 64-bit inputs. The first stage of the ALU pipeline, N1, formats the operands in preparation for the next cycle; this includes inverting operands as needed for subtract operations, multiplexing vector element pairs for folding operations, and sign/zero-extension of operands [1]. The second stage, N2, performs the main ALU operations - add, subtract, logical, count leading sign/zero, count set, and sum-of-element-pairs operations [1] - and also calculates the flags used in the following stage. The third stage, N3, performs operations such as compare, test, and max/min for saturation detection. The N3 stage also contains a SIMD incrementer for generating two's complement and rounding operations, and a data formatter for performing high-half and halving operations. Like the shift pipeline, the ALU pipeline uses the final stages, N4 and N5, to complete any accumulate operations by forwarding them to the MAC pipeline [1].
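The sketch referenced above: a dot-product kernel of the MAC-heavy, common-accumulator kind the pipeline is optimized for. The intrinsics are standard ARMv7 NEON; the kernel itself is our illustration.

```c
#include <arm_neon.h>
#include <stdint.h>

uint32_t dot_u8(const uint8_t *a, const uint8_t *b, int n)
{
    uint32x4_t acc = vdupq_n_u32(0);           /* common vector accumulator */
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        /* widening 8x8 multiply: eight products per instruction */
        uint16x8_t prod = vmull_u8(vld1_u8(a + i), vld1_u8(b + i));
        acc = vpadalq_u16(acc, prod);          /* pairwise add-accumulate */
    }
    uint32_t sum = vgetq_lane_u32(acc, 0) + vgetq_lane_u32(acc, 1)
                 + vgetq_lane_u32(acc, 2) + vgetq_lane_u32(acc, 3);
    for (; i < n; i++)                         /* scalar tail */
        sum += (uint32_t)a[i] * b[i];
    return sum;
}
```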
VIII. NEON LOAD-STORE/PERMUTE EXECUTION PIPELINE
The permute pipeline is fed by the load data queue (LDQ). The LDQ holds all data associated with NEON load accesses prior to entering the NEON permute pipeline. It is 12 entries deep, and each entry is 128 bits wide [1]. Data can be placed into the LDQ from either the L1 cache or the L2 memory system. Accesses that hit in the L1 cache return and commit their data to the LDQ. Accesses that miss in the L1 cache initiate an L2 access; a pointer is attached to the load request as it proceeds down the L2 memory system pipeline, and when the data returns from the L2 cache, the pointer is used to update the LDQ entry reserved for that load request. Each entry in the LDQ has a valid bit to indicate valid data returned from the L1 cache or L2. Entries in the LDQ can be filled by L1 or L2 out of order, but valid data within the LDQ must be read in program order, so entries at the front of the LDQ are read off in order. If a load instruction reaches the M2 issue stage before the corresponding data has arrived in the LDQ, it stalls and waits for the data [1]. L1 and L2 data read out of the LDQ is aligned and formatted for the NEON execution units. Aligned and formatted data from the LDQ is multiplexed with NEON register read operands in the M3 stage, before being issued to the NEON execute pipeline. The NEON LS/permute pipeline is responsible for all NEON loads/stores, data transfers to and from the integer unit, and data permute operations. One of the more interesting features of the NEON instruction set is the data permute operations, which can be performed register to register or as part of a load or store operation. These operations allow bytes of memory to be interleaved into packed values in SIMD registers. For example, when adding two eight-byte vectors, you may wish to interleave all of the odd bytes of memory into register A and the even bytes into register B [1]. The permute instructions in NEON allow such operations to be done natively in the instruction set, often using only a single instruction [1], as the sketch below shows. This data permute functionality is implemented by the load-store permute pipeline. Any required data permutation is done across two stages, N1 and N2. In the N3 stage, store data can be forwarded from the permute pipeline and sent to the NEON store buffer in the memory system [1].
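The sketch below shows the even/odd interleaving example from above using the standard <arm_neon.h> intrinsics: a single VLD2 de-interleaves sixteen bytes of memory into two registers. The function is our illustration.

```c
#include <arm_neon.h>
#include <stdint.h>

void split_even_odd(const uint8_t *src, uint8_t *even, uint8_t *odd)
{
    uint8x8x2_t v = vld2_u8(src);   /* one VLD2: reads 16 bytes, de-interleaved */
    vst1_u8(even, v.val[0]);        /* bytes 0, 2, 4, ..., 14 */
    vst1_u8(odd,  v.val[1]);        /* bytes 1, 3, 5, ..., 15 */
}
```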
IX. NEON FLOATING-POINT EXECUTION PIPELINES
The NEON floating-point (NFP) unit has two main pipelines: a 6-stage multiply pipeline and a 6-stage add pipeline [1]. The add pipeline adds two single-precision floating-point numbers, producing a single-precision sum; the multiply pipeline multiplies two single-precision floating-point numbers, producing a single-precision product. In both cases the pipelines are 2-way SIMD, meaning that two 32-bit results are produced in parallel when executing NFP instructions [1].

X. NEON'S IEEE COMPLIANT FLOATING POINT ENGINE
The IEEE compliant floating point engine is a non-pipelined implementation of the ARM floating-point instruction set, targeted at medium-performance, IEEE 754-compliant single and double precision floating point [1]. It is designed to provide general-purpose floating-point capability for the Cortex-A8 processor. This engine is not pipelined for most operations and modes; instead it iterates over a single instruction until it has completed. A subsequent operation is stalled until the prior operation has fully completed execution and written its result to the register file. The IEEE compliant engine is used for any floating point operation that cannot be executed in the NEON floating point pipeline, which includes all double precision operations and any floating point operations run with full IEEE precision enabled.
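The practical consequence, sketched below with two deliberately identical loops (our illustration): the single-precision version is eligible for the pipelined NFP units, while the double-precision version must iterate through the non-pipelined IEEE engine, serializing one operation at a time. Whether a compiler actually vectorizes the float loop depends on its NEON and floating-point settings.

```c
/* Single precision: eligible for the pipelined 2-way SIMD NFP units. */
void scale_f32(float *x, float s, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= s;
}

/* Double precision: falls back to the non-pipelined IEEE engine on
 * the Cortex-A8, where each operation stalls the next. */
void scale_f64(double *x, double s, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= s;
}
```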
XI. VFP
VFP (Vector Floating Point) is a floating point hardware accelerator whose primary purpose is to perform one operation on one set of inputs and return one output, thereby speeding up floating point calculations. ARM processors without dedicated floating point hardware must fall back on considerably slower software math libraries. The VFP supports both single and double precision floating point calculations compliant with IEEE 754 [2]. It is worth noting that the VFP does not deliver the same performance increase that NEON grants, because it lacks a comparably parallel and fully pipelined architecture.

XII. ARM CORTEX-A8 COMPARED TO ARM CORTEX-A17
The ARM Cortex-A8 is part of the ARMv7-A architecture. Seven cores have been designed with this architecture, including the Cortex-A8 and the Cortex-A17. The ARM Cortex-A17 is the most powerful core within the same family as the Cortex-A8, yet the differences between the two are drastic, ranging from internal specifications to the devices in which they are actually used. The Cortex-A17 provides a 60% increase in performance over the Cortex-A9, and the Cortex-A9 a 50% increase over the Cortex-A8 [10]; compounding the two (1.5 × 1.6 = 2.4) puts the Cortex-A17 at roughly 2.4 times the performance of the Cortex-A8, a 140% increase.
Fig. 5. Cortex-A17 performance comparison to the Cortex-A9 [8]
This leads to the initial observation that the Cortex-A17 is far more powerful than the Cortex-A8, even though both share the same 32-bit ARMv7-A architecture, the NEON SIMD, and the VFP hardware accelerator. Like the Cortex-A8, the Cortex-A17 is very popular in mobile devices, combining high performance with the high efficiency the Cortex-A8 introduced. The Cortex-A17 scales up to four cores, with a fully out-of-order pipeline delivering optimal performance for today's premium mobile devices [8]. This is a key difference, since the Cortex-A8 supports only a single core; hence much of the Cortex-A17's speed increase. The decode width of the Cortex-A17 is only one more than the Cortex-A8's, yet the ability to decode one more instruction in parallel creates an improvement without sacrificing efficiency. The pipeline of the Cortex-A8 is 13 stages, in-order, while the Cortex-A17's is 11+ stages, out-of-order. The NEON SIMD datapath of the Cortex-A8 is 64 bits wide, whereas the Cortex-A17's is 128 bits wide, allowing greater parallel processing of data. The Cortex-A17 plays a major role in the big.LITTLE architecture, whereas the Cortex-A8 does not use big.LITTLE at all. The Cortex-A8 does not have a pipelined VFP accelerator; the Cortex-A17 does, which improves performance. The Cortex-A8 is used in many commercial applications that affect our daily lives. It is used in smartphones as an application processor running a fully featured mobile OS; the Cortex-A17 is commonly seen in smartphones as well, and also in tablets, unlike the Cortex-A8. The Cortex-A8 is also used in netbooks as a power-efficient main processor running a desktop OS. It serves in set-top boxes as the main processor managing a rich OS, multi-format A/V, and the UI, as does the Cortex-A17, and in digital TV applications as the processor managing rich OS, UI, and browser functions, again like the Cortex-A17. The Cortex-A8 is used in home networking as a control processor for system management, in storage networking as a control processor managing traffic flow, and even in printers as a high-performance integrated processor [8][9]. The Cortex-A17 additionally targets industrial and automotive infotainment, which the Cortex-A8 did not [8]. These are devices we interact with regularly, some daily. The small size of the core is advantageous because it fits into small devices such as smartphones, netbooks, TV receivers, and printers. The Cortex-A8's power efficiency is also a major advantage: for small devices with small batteries it makes a large difference in usable life per charge. The power of the Cortex-A8 is useful in many of these applications; with its pipelining abilities and the enhancements of the NEON SIMD and the VFP hardware accelerator, it gives small devices such as smartphones impressive processing speed. The Cortex-A8 and the Cortex-A17 are thus very similar, yet with large performance differences.
The Cortex-A8 is a high-performance processor intended for complex systems. It offers:
• A symmetric, superscalar pipeline for full dual-issue capability
• High frequency through an efficient, deep pipeline
• An advanced branch prediction unit with >95% accuracy
• An integrated level 2 cache for optimal performance in high-performance systems [9]
The Cortex-A8 is designed to handle media processing in software with NEON technology, which provides:
• A 128-bit SIMD data engine
• Twice the performance of ARMv6 SIMD
• Power savings through efficient media processing
• The flexibility to handle the media formats of the future
• Easy integration of multiple codecs in software with NEON technology on the Cortex-A8
• Enhanced user interfaces [9]
The Cortex-A8 boasts many features, but how do they compare to the Cortex-A17? The Cortex-A8 features NEON, a 128-bit SIMD engine that enables high performance media processing. It also features an optimized level 1 cache, integrated tightly into the processor with a single-cycle access time, as well as an integrated level 2 cache that is built into the core and provides ease of integration, power efficiency, and optimal performance. The Cortex-A8 also features Thumb-2 technology, which delivers the peak performance of traditional ARM code while providing up to a 30% reduction in the memory required to store instructions. It has dynamic branch prediction to minimize wrong-prediction penalties; the dynamic branch predictor achieves 95% accuracy across a wide range of industry benchmarks. The Cortex-A8 also features a memory management unit; a full MMU enables it to run rich operating systems in a variety of applications. It features Jazelle-RCT technology, a Java-acceleration technology that optimizes Just in Time (JIT) and Dynamic Adaptive Compilation (DAC) and reduces memory footprint by up to three times. The Cortex-A8 also features a memory system optimized for power efficiency and high performance, along with TrustZone technology, which allows for secure transactions and Digital Rights Management (DRM) [9]. This list of features comes from the ARM website and the product specification pages. The Cortex-A17 also has a list of specifications on the ARM website, but it differs from the Cortex-A8's. The Cortex-A17 and the Cortex-A8 share some features, such as Thumb-2 technology, TrustZone technology, NEON, and optimized level 1 caches. The Cortex-A17 also has an integrated level 2 cache controller, the difference being that its size is configurable. The Cortex-A17 adds DSP and SIMD extensions, which increase the DSP processing capability of ARM solutions in high-performance applications while offering the low power consumption required by portable, battery-powered devices. It also includes a floating point unit: the Cortex-A17 processor provides a high-performance FPU with hardware support for floating point operations in half-, single-, and double-precision arithmetic. The Cortex-A17 also features hardware virtualization - highly efficient hardware support for data management and arbitration whereby multiple software environments and their applications can simultaneously access the system's capabilities. It has a Large Physical Address Extension (LPAE), which enables the processor to access up to 1TB of memory. The Cortex-A17 also features the AMBA4 CoreLink CCI-400 Cache Coherent Interconnect, which provides AMBA4 ACE ports for full coherency between multiple processors, enabling use cases like big.LITTLE [8]. This lengthy list of Cortex-A17 features for comparison with the Cortex-A8 was also retrieved from the ARM website, in the Cortex-A17 product specifications section. This comparison shows where the ARMv7-A architecture has evolved to: the Cortex-A8 is one of the middle models in the development line, whereas the Cortex-A17 is the newest and most powerful that ARM produces in this architecture set. The debugger for the Cortex-A8 and the Cortex-A17 is the same.
The ARM DS-5 Development Studio fully supports all ARM processors and IP, as well as a wide range of third-party tools, operating systems, and EDA flows. DS-5 represents a comprehensive range of software tools to create, debug, and optimize systems based on the Cortex-A8 and Cortex-A17 processors [8]. This statement comes from the Cortex-A17 related-products page but is nearly identical on the Cortex-A8 page. Both incorporate the DS-5 Debugger, whose powerful and intuitive graphical environment enables fast debugging of bare-metal, Linux, and Android native applications. The DS-5 Debugger provides pre-defined configurations for Fixed Virtual Platforms
(built on ARM Fast Models technology) and ARM Versatile Express boards, enabling early software development before silicon availability [8][9]. This passage is the same for both the Cortex-A17 and the Cortex-A8. Both the Cortex-A17 and the Cortex-A8 use the same family of products for graphics processing: the Mali family of products combine to provide the complete graphics stack for all embedded graphics needs, enabling device manufacturers and content developers to deliver the highest quality, cutting edge graphics solutions across the broadest range of consumer devices [8][9]. An example is the Mali-400 paired with the Cortex-A8, the world's first OpenGL ES 2.0 conformant multi-core GPU, which provides 2D and 3D acceleration with performance scalable up to 1080p resolution [9]. For the Cortex-A8, the ARM Physical IP Platforms deliver process-optimized IP for best-in-class implementations of the Cortex-A8 processor at 40nm and below [9]. The Cortex-A8 uses the Standard Cell Logic Libraries, which are available in a variety of architectures; ARM Standard Cell Libraries support a wide performance range for all types of SoC designs. It also supports Memory Compilers and Registers - a broad array of silicon-proven SRAM, register file, and ROM memory compilers for all types of SoC designs, ranging from performance-critical to cost-sensitive and low-power applications. The Cortex-A8 also supports Interface Libraries, a broad portfolio of silicon-proven interface IP designed to meet varying system architectures and standards [9]. The ARM Physical IP Platforms likewise deliver process-optimized IP for best-in-class implementations of the Cortex-A17 processor at 28nm and below [8]; this mirrors the Cortex-A8 except for the move from 40nm to 28nm. A set of high-performance POP IP containing advanced ARM Physical IP for 28nm technologies supports the Cortex-A17, enabling rapid development of leading physical implementations [8]. ARM is uniquely able to design the optimization packs in parallel with the Cortex-A17 processor, enabling the processor and physical IP combination to deliver best-in-class performance in the mobile power envelope while facilitating rapid time-to-market [8]. The Physical IP for the Cortex-A17 thus differs from the Cortex-A8's through its use of POP IP. System IP components are essential for building complex systems on chip, and by utilizing them developers can significantly reduce development and validation cycles, saving cost and reducing time to market [9]. The Cortex-A8 uses a different set of System IP tools than the Cortex-A17; the differences are as follows:
Cortex-A8
• Advanced AMBA 3 Interconnect IP using the AXI AMBA bus.
• Dynamic Memory Controller using the AXI AMBA bus.
• Adaptive Verification IP using the AXI AMBA bus.
• DMA Controller using the AXI AMBA bus.
• CoreSight Embedded Debug and Trace using the ATB AMBA bus. [9]
The set of System IP tools that the Cortex-A17 uses is as follows:
Cortex-A17
• AMBA 4 Cache Coherent Interconnect
– The CCI-400 provides AMBA 4 AXI Coherency Extensions compliant ports for full coherency for the Cortex-A17 processor and other Cortex processors, better utilizing caches and simplifying software development. This feature is essential for high-bandwidth applications, including future mobile SoCs that require clusters of coherent processors or GPUs.
Combined with other available ARM CoreLink System IP, the CCI-400 increases system performance and power efficiency.
– The CoreLink CCI-400 Cache Coherent Interconnect provides system coherency with Cortex processors and an I/O-coherent channel with Mali IP, opening up a number of possibilities for offload and acceleration of tasks. When combined with a Cortex-A7 processor, the CCI-400 allows big.LITTLE operation with full L2 cache coherency between the Cortex-A17 and Cortex-A7 processors.
– Efficient voltage scaling and power management are enabled with the CoreLink ADB-400, unlocking DVFS control of the Cortex-A17 processor.
• AMBA Generic Interrupt Controller
– AMBA interrupt controllers like the GIC-400 provide an efficient implementation of the ARM Generic Interrupt Specification for multi-processor systems. They are highly configurable, providing the ultimate flexibility in handling a wide range of interrupt sources that can control a single CPU or multiple CPUs.
• AMBA 4 CoreLink MMU-500
– The CoreLink MMU-500 provides a hardware-accelerated, common memory view for all SoC components and minimizes software overhead, freeing virtual machines to get on with other system management functions.
• CoreLink TZC-400
– The Cortex-A17 processor implements a secure, optimized path to memory to further enhance its market-leading performance with the aid of the CoreLink TZC-400 TrustZone address space controller.
• CoreLink DMC-400
– All interconnect components and the ARM DMC guarantee bandwidth and latency requirements by utilizing built-in dynamic QoS mechanisms.
• CoreSight SoC-400
– ARM CoreSight SoC debug and trace hardware is used to profile and optimize system software throughout, from driver to OS level.
• Artisan POP IP
– The Cortex-A17 processor is supported through advanced physical POP IP for accelerated time to market [8].
These differences show how much more technology is in the Cortex-A17 than in the Cortex-A8, even though they belong to the same ARM architecture family (ARMv7-A). Differences like these make clear how flexible these systems are and what can be done with them, from media processing to data crunching. They are important to understand because they lay out where this technology is headed and what changes can be, and are being, made to create more powerful yet more efficient devices.

XIII. CONCLUSION
The Cortex-A8 is an important example of RISC-based superscalar design. It has many features that make it a powerful and flexible processor, and the sum of its components results in increased performance and flexibility. Its instruction pipelining and branch prediction are critical to ensuring performance efficiency. The NEON SIMD possesses a robust architecture, including its own instruction pipelines, and introduces a host of new capabilities, including multimedia and graphics processing. Examination of other ARM processors further illustrates the Cortex-A8's evolution. The Cortex-A8 belongs to a family consisting of seven other processors.
A comparison with the faster Cortex-A17 demonstrates the Cortex-A8's higher degree of flexibility. This flexibility is critical to the Cortex-A8's success in consumer electronics. The processor is commercially available in a variety of applications, including mobile devices and other media. Studying the ARM Cortex-A8 is critical to understanding the role superscalar architecture plays in embedded systems.
REFERENCES
[1] D. Williamson, "ARM Cortex-A8: A High-Performance Processor for Low-Power Applications," Unique Chips and Systems (2007): 79.
[2] Texas Instruments Wiki, "Cortex-A8," retrieved from http://processors.wiki.ti.com/index.php/Cortex-A8.
[3] ARM, "NEON," retrieved from http://arm.com/products/processors/technologies/neon.php.
[4] ARM, "Cortex-A8 Processor," retrieved from http://arm.com/products/processors/cortex-a/cortex-a8.php.
[5] ARM, "The ARM Architecture: With a focus on v7A and Cortex-A8," retrieved from http://www.arm.com/files/pdf/ARM_Arch_A8.pdf.
[6] ARM, Architecture Reference Manual, ARM DDI E, 100, 6. https://www.scss.tcd.ie/~waldroj/3d1/arm_arm.pdf
[7] "Cortex-A17 Processor," retrieved November 17, 2014, from http://www.arm.com/products/processors/cortex-a/cortex-a17-processor.php
[8] "Cortex-A8 Processor," retrieved November 17, 2014, from http://www.arm.com/products/processors/cortex-a/cortex-a8.php
[9] "Cortex-A9 Processor," retrieved November 17, 2014, from http://www.arm.com/products/processors/cortex-a/cortex-a9.php