2. TRENDS in EMBEDDED SYSTEMS
The need for high-performance functionality is increasing day by day, and so is
the number of chips in a system. As a result, designers are moving from
Systems-on-Chip (SoCs) to Multiprocessor Systems-on-Chip (MPSoCs).
http://www.lirmm.fr/~robert/presentation/IntroMPSOCV.pdf
3. An SoC integrates all the components of a computing system onto a single
chip (IC).
4. Issues with traditional bus and crossbar networks
• Latency
• Priority
• Length of the wires
• Bandwidth limitation
• Cost
5. Routing-congestion-aware and communication-power-aware
mapping in MPSoCs
Given: a core graph in which each communication flow has a latency constraint (in
units of time). Because the bandwidth of each communication link is finite, latencies
depend on network congestion. Latency-constraint violations can therefore sometimes
be fixed by mapping cores farther apart, so that less congested routing paths can
be found. Assume that multiple link insertions are not allowed.
6. 1. Arrange the task graph in decreasing order of the severity of the latency
constraint (tightest constraint first).
2. Divide the entire list into 'm' child search windows.
3. Search each window for cores that occur more than 4 times (4 is the maximum
number of adjacent neighbors a core can have).
4. Name such cores prime nodes.
5. For each prime node, choose a tile with the maximum number of neighbors and
place the node on that tile.
6. The four neighbors of the prime node are the cores with tight latency
constraints to it, so they sit one hop away from the node.
7. Repeat the steps for the rest of the cores in decreasing order of
occurrences.
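Steps 1 to 4 can be sketched in Python as follows. This is only an illustrative sketch: the function name `find_prime_nodes`, the tuple layout `(src, dst, latency)`, and the example flows are assumptions, not taken from the slides, and "most severe" is read as "smallest allowed latency".

```python
from collections import Counter

def find_prime_nodes(flows, m, threshold=4):
    """flows: list of (src, dst, latency_constraint); m: number of windows."""
    if not flows:
        return []
    # Step 1: tightest (smallest) latency constraint first.
    ordered = sorted(flows, key=lambda f: f[2])
    # Step 2: split the ordered list into m roughly equal windows.
    size = -(-len(ordered) // m)  # ceiling division
    windows = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    prime = []
    for window in windows:
        # Step 3: count how often each core appears as source or destination.
        counts = Counter()
        for src, dst, _ in window:
            counts[src] += 1
            counts[dst] += 1
        # Step 4: cores seen more than `threshold` times become prime nodes.
        for core, count in counts.most_common():
            if count > threshold and core not in prime:
                prime.append(core)
    return prime

# Hypothetical flows: core 0 talks to five other cores.
flows = [(0, 1, 1), (0, 2, 2), (0, 3, 3), (0, 4, 4), (0, 5, 5),
         (6, 7, 10), (7, 8, 11)]
print(find_prime_nodes(flows, m=1))
```

With a single window, core 0 appears five times and is reported as a prime node; splitting the same list into two windows spreads its occurrences, so the choice of 'm' affects which cores qualify.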
7. Fig 1. Given CTG for 9 cores
Fig 2. Sorted latency table in decreasing order
Fig 3. Initial mapping
8. • A congestion detection (CD) mechanism is used to detect whether a route is
congested.
• A CD packet is sent from the source to the router. A threshold 'k' is set; if this
threshold is exceeded, the router is assumed to be congested and is marked in the CD
matrix.
• The new path allocation is explained in the following slide.
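A minimal sketch of how a CD matrix could be filled in is shown here. The slides do not say which quantity is compared against 'k'; this sketch assumes it is a per-router delay measured by the CD packet, and the name `build_cd_matrix` and the sample values are hypothetical.

```python
def build_cd_matrix(cd_delay, k):
    """cd_delay[y][x]: assumed CD-packet delay measured at router (x, y).

    Returns a matrix of the same shape with 1 for congested routers
    (delay above the threshold k) and 0 otherwise.
    """
    return [[1 if d > k else 0 for d in row] for row in cd_delay]

# Hypothetical measurements on a 3x2 mesh, threshold k = 5.
delays = [[2, 9, 1],
          [3, 4, 12]]
cd = build_cd_matrix(delays, k=5)
print(cd)
```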
9. New Path Allocation Algorithm:
1. Determine whether the destination router lies east or west of the source,
based on the 'x' position of the source.
2. Traverse the CD matrix in that direction, i.e. keep the y coordinate fixed and
vary the x coordinate until it reaches the x coordinate of the destination router.
3. If a congested router ('1' in the matrix) is found on this path, shift the
search to the north or south direction.
4. Continue the search until it reaches the y coordinate of the destination router.
5. If the path is clear and the y coordinate is reached, continue in the east or
west direction to reach the destination.
6. If the path is not clear, i.e. a congested router is reached again, repeat
steps 2-5 to find a new path.
7. Calculate the latency of the new path and determine whether the constraint is
satisfied or violated.
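The search can be approximated with a simplified greedy walk over the CD matrix, sketched below. It is not the exact procedure from the slide: it always tries the east/west step first and sidesteps north or south when that router is congested, and it gives up (returns None) instead of backtracking. The names `find_path` and `path_latency`, the fixed per-hop latency, and the rule that the destination is always accepted are assumptions.

```python
def find_path(cd, src, dst):
    """Greedy route on a 2D mesh; cd[y][x] == 1 marks a congested router.

    src and dst are (x, y) tuples. Returns the list of visited routers,
    or None when the greedy walk gets stuck.
    """
    rows, cols = len(cd), len(cd[0])
    x, y = src
    path, visited = [src], {src}
    while (x, y) != dst:
        step_x = (dst[0] > x) - (dst[0] < x)  # +1/-1 toward dst, 0 if aligned
        step_y = (dst[1] > y) - (dst[1] < y)
        candidates = []
        if step_x:
            candidates.append((x + step_x, y))
        if step_y:
            candidates.append((x, y + step_y))
        else:
            # Already on the destination row: allow a detour either way.
            candidates += [(x, y + 1), (x, y - 1)]
        for nx, ny in candidates:
            inside = 0 <= nx < cols and 0 <= ny < rows
            free = inside and ((nx, ny) == dst or cd[ny][nx] == 0)
            if free and (nx, ny) not in visited:
                x, y = nx, ny
                path.append((x, y))
                visited.add((x, y))
                break
        else:
            return None  # no uncongested, unvisited neighbor to move to
    return path

def path_latency(path, hop_latency):
    """Assumed latency model: a fixed cost per hop."""
    return (len(path) - 1) * hop_latency

# Hypothetical 3x3 mesh with one congested router at (1, 0).
cd = [[0, 1, 0],
      [0, 0, 0],
      [0, 0, 0]]
route = find_path(cd, (0, 0), (2, 0))
print(route, path_latency(route, hop_latency=2))
```

The resulting latency would then be compared against the flow's latency constraint, as in step 7.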
11. Given an MPSoC of heterogeneous cores (with 2 types of cores) and a task
graph, assign each task to a core type so that a chip-wide power
constraint and a chip-wide performance constraint (in terms of total delay)
are both met. The assignment also gives the number of cores needed of each type.
Assume: each task is suited to run on one core type for better
performance and on the other core type for better power.
12. 1. For each task there are 4 parameters (power and delay on each of the two core
types), so the program uses 4 arrays of length "n", where n is the number of tasks.
2. The chip-wide power constraint (Pcons) is taken to be the average of P1 and P2,
where P1 is the total power of all tasks when run on a Type 1 core and P2 is the
total power of all tasks when run on a Type 2 core.
3. The chip-wide delay constraint (Tcons) is taken to be the average of T1 and T2,
where T1 is the total delay of all tasks when run on a Type 1 core and T2 is the
total delay of all tasks when run on a Type 2 core.
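As a quick check of these definitions, the sketch below computes Pcons and Tcons for the five-task data of the first test case in this deck. The helper name `chip_constraints` and the dictionary layout are illustrative choices, not part of the original program.

```python
# Per-task power (mW) and delay (ms) on each core type, from the first test case.
power = {1: [5, 15, 20, 13, 2], 2: [48, 80, 97, 78, 23]}
delay = {1: [50, 145, 187, 108, 20], 2: [8, 15, 23, 12, 4]}

def chip_constraints(power, delay):
    """Return (Pcons, Tcons) as the averages of the per-type totals."""
    p1, p2 = sum(power[1]), sum(power[2])  # P1, P2
    t1, t2 = sum(delay[1]), sum(delay[2])  # T1, T2
    return (p1 + p2) / 2, (t1 + t2) / 2

p_cons, t_cons = chip_constraints(power, delay)
print(p_cons, t_cons)  # 190.5 286.0
```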
13. 1. The power and delay of each task should each be less than or equal to 1/n
of the corresponding fixed constraint. If a core type satisfies this, the task is
mapped to that core type.
2. If case 1 is not satisfied, a test is run to satisfy either one of the
constraints, and the mapping is done accordingly.
3. If neither constraint is satisfied, i.e. neither the power nor the delay of the
task is within 1/n of its constraint, the task is assigned to the core type whose
parameters are nearer to the constraints.
14. From the above test cases, a factor called the Violation Factor (VF) is
derived as the sum of the task's power and delay, each divided by the
respective average chip-wide power and delay. This factor is calculated
for both core types for a given task. If the value is less than 1, the
mapping is good. If not, the mapping may violate the imposed constraints.
Let Pavg and Tavg be the average power and delay, defined as
Pavg = Pcons / n
Tavg = Tcons / n
VF[j][i] = (P[j][i] / Pavg) + (T[j][i] / Tavg)
j = 1 or 2, corresponding to the core type.
i = 1 to n, corresponding to the task.
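A small worked example of the VF formula, using task 1 of the first test case in this deck (Pcons = 190.5 mW, Tcons = 286 ms, n = 5). The function name `violation_factor` is an illustrative choice.

```python
def violation_factor(p, t, p_avg, t_avg):
    """VF = P / Pavg + T / Tavg for one task on one core type."""
    return p / p_avg + t / t_avg

n = 5
p_avg = 190.5 / n  # Pavg = 38.1 mW
t_avg = 286 / n    # Tavg = 57.2 ms

vf_type1 = violation_factor(5, 50, p_avg, t_avg)  # task 1 on a Type 1 core
vf_type2 = violation_factor(48, 8, p_avg, t_avg)  # task 1 on a Type 2 core
print(round(vf_type1, 3), round(vf_type2, 3))     # 1.005 1.4
```

Type 1 gives the smaller VF for this task, but the value is still slightly above 1, so by the rule above this mapping is not guaranteed to stay within the constraints.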
15. 1. Store the given parameters from the task graph in the 4 arrays.
2. Calculate the chip-wide power and performance constraints, Pcons and Tcons.
3. Calculate the VF of each task for both core types.
4. Compare the VFs corresponding to the two core types and select the core type for
which the VF is the least.
5. Count the number of tasks mapped to each core type.
6. Decide whether the constraints can be met or not.
7. Print the results.
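The seven steps can be sketched end to end as below. This is a reconstruction under stated assumptions, not the original program: the name `map_tasks`, the tie-break in favor of Type 1, and the use of summed task delays as the total delay are all assumptions. The input values are those of the first test case in this deck.

```python
def map_tasks(p1, t1, p2, t2):
    """Assign each task to the core type with the smaller violation factor."""
    n = len(p1)
    # Step 2: chip-wide constraints as averages of the per-type totals.
    p_cons = (sum(p1) + sum(p2)) / 2
    t_cons = (sum(t1) + sum(t2)) / 2
    p_avg, t_avg = p_cons / n, t_cons / n
    mapping, total_p, total_t = [], 0, 0
    for i in range(n):
        # Steps 3-4: VF per core type, keep the smaller one (ties go to Type 1).
        vf1 = p1[i] / p_avg + t1[i] / t_avg
        vf2 = p2[i] / p_avg + t2[i] / t_avg
        if vf1 <= vf2:
            mapping.append(1)
            total_p, total_t = total_p + p1[i], total_t + t1[i]
        else:
            mapping.append(2)
            total_p, total_t = total_p + p2[i], total_t + t2[i]
    return {
        "mapping": mapping,
        "type1": mapping.count(1),  # Step 5: tasks (cores) per type
        "type2": mapping.count(2),
        "total_power": total_p,
        "total_delay": total_t,
        # Step 6: both chip-wide constraints must hold.
        "met": total_p <= p_cons and total_t <= t_cons,
    }

# Step 1: the four parameter arrays (power in mW, delay in ms).
result = map_tasks([5, 15, 20, 13, 2], [50, 145, 187, 108, 20],
                   [48, 80, 97, 78, 23], [8, 15, 23, 12, 4])
print(result)  # Step 7
```

For this input the sketch assigns three tasks to Type 1 and two to Type 2, with a total power of 197 mW and a total delay of 216 ms, so the power constraint of 190.5 mW is exceeded.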
16.
TASK   POWER 1   TIME 1    POWER 2   TIME 2
1      5 mW      50 ms     48 mW     8 ms
2      15 mW     145 ms    80 mW     15 ms
3      20 mW     187 ms    97 mW     23 ms
4      13 mW     108 ms    78 mW     12 ms
5      2 mW      20 ms     23 mW     4 ms

P1 = 55 mW     P2 = 326 mW    Pcons = 190.5 mW    Pavg = 38.1 mW
T1 = 510 ms    T2 = 62 ms     Tcons = 286 ms      Tavg = 57.2 ms

Results:
# of type 1 cores: 3
# of type 2 cores: 2
Delay constraint: 286.000000
Total delay: 216.000000
Power constraint: 190.500000
Total power: 197.000000
Constraints not met
17.
TASK   POWER 1   TIME 1    POWER 2   TIME 2
1      5 mW      30 ms     48 mW     8 ms
2      15 mW     105 ms    55 mW     15 ms
3      20 mW     117 ms    97 mW     23 ms
4      13 mW     78 ms     40 mW     12 ms
5      2 mW      20 ms     23 mW     4 ms

Results:
# of type 1 cores: 3
# of type 2 cores: 2
Delay constraint: 206.000000
Total delay: 194.000000
Power constraint: 159.000000
Total power: 122.000000
Constraints met
18. • We would like to extend the mapping and routing algorithm to an NoC with
voltage islands.
• Here we have implemented a sequential search; in the future we would like to
implement a parallel search, thereby improving the performance of the chip.
• For mapping, we would like to consider incremental swapping to reduce
latency for the least-prioritized cores.
19. Things learnt:
• NoC router architecture
• Routing algorithms
• Mapping techniques
• Congestion detection algorithms
• Current trends in NoC architecture