The audio starts at 9:00 due to a recording glitch.
The topic could equally be stated as “All Processor Accesses Are Not Created Equal” – H/W and S/W Synergy – Methods and Mechanisms:
- Throughout the class, developers in the audience will wear different hats – first the hat of a hardware design engineer, looking at platform design choices.
- Then they will wear the hat of a driver developer and examine the driver's requirements and assumptions about hardware behavior.
- Then they will wear the hat of a system architect and explore how software and hardware can communicate and work in synergy, so that a near-optimal implementation can be built.
About the presenter: M Jay (Muthurajan Jayakumar) has worked with the DPDK team since 2009. He joined Intel in 1991 and served in various roles and divisions – 64-bit CPU front-side bus architect and 64-bit HAL developer, among others – before joining the DPDK team. M Jay holds 21 US patents, both individual and joint, all issued while working at Intel. He received the Intel Achievement Award in 2016, Intel's highest honor for innovation and results.
Slide 3
Agenda
• Cache Coherency – is it really needed? – Message Passing vs Shared Memory
• Read access & cache – benefits we all know
• What about Write & Cache?
• Write-Through vs Write-Back Cache
• DPDK PMD and Cache Coherency
• Snoop Protocol
• NUMA
• LIFO
• Dynamic vs Static
• DDIO & Cache Size
Slide 6
Why share data?
Why don't developers use the Message Passing Paradigm?
Can we visualize having no shared address space?
Slide 7
Why shared data?
Why don't developers use the Message Passing Paradigm?
[Diagram: each core with its own private scratch memory.]
What if developers did so?
Slide 8
No need for a coherency protocol!
Slide 9
No need for cache coherency?
Message Passing – no need for coherency.
Shared Memory Paradigm – the hardware must manage coherency.
Slide 10
So, really, what is the root cause of the cache coherency requirement?
Where does the cache coherency requirement come from?
Is it the software developers' problem of not doing truly parallel programming?
Or is it the hardware designers' problem of overdoing things?
Slide 11
Well! But ... message passing needs moving data around. Moving data around ... won't that be a lot of overhead?
Shared memory means just read/write – no moving data around! Right?
Yeah! Right! Bring it on, Shared Memory!
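To make the contrast concrete before we poke holes in it, here is a minimal sketch of the two paradigms in DPDK-flavored C. The names and sizes are illustrative assumptions, not from the talk: a shared counter that the coherency protocol must keep consistent across cores, versus a pointer handed through an rte_ring in message-passing style.

    #include <rte_lcore.h>
    #include <rte_ring.h>

    /* Shared-memory style: every core reads/writes the same variable and
     * the hardware coherency protocol keeps all the caches consistent. */
    static volatile int shared_counter;

    static void shared_mem_update(void)
    {
        shared_counter++;            /* "just a read/write"... or is it? */
    }

    /* Message-passing style: the producer hands a pointer through a ring;
     * only one thread at a time ever touches the payload. */
    static struct rte_ring *msg_ring;

    static void msg_setup(void)
    {
        /* single-producer / single-consumer ring with 1024 slots */
        msg_ring = rte_ring_create("msgs", 1024, rte_socket_id(),
                                   RING_F_SP_ENQ | RING_F_SC_DEQ);
    }

    static void msg_send(void *pkt)
    {
        rte_ring_enqueue(msg_ring, pkt);        /* "send" */
    }

    static void *msg_receive(void)
    {
        void *pkt = NULL;
        return rte_ring_dequeue(msg_ring, &pkt) == 0 ? pkt : NULL;
    }

Note the irony the later slides build on: the ring itself lives in shared memory, so even the "message passing" here ultimately rides on the coherency machinery.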
Slide 13
What is the task at hand?
Receive → Process → Transmit
rx cost, tx cost – and a chain is only as strong as its weakest link.
Slide 14
Benefits – Eliminating / Hiding Overheads

Overhead                               Eliminating / Hiding How?
Interrupt overhead                     Polling
Kernel/user context-switch overhead    User Mode Driver
Core-to-thread scheduling overhead     Pthread Affinity
4K paging overhead                     Huge Pages
PCI bridge I/O overhead                High-throughput bulk-mode I/O calls
Locking overhead                       Lockless inter-core communication
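As a rough sketch of how several of these remedies surface at the API level (pool sizes and names below are illustrative assumptions, not from the talk): EAL initialization maps hugepages and pins lcores, and packet buffers come from a NUMA-local, per-lcore-cached mempool.

    #include <rte_eal.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    int main(int argc, char **argv)
    {
        /* EAL init maps hugepages and pins lcores per the -l/--lcores
         * arguments, taking the 4K-paging and scheduling overheads off
         * the table up front. */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        /* Packet buffers carved out of hugepage memory on the local NUMA
         * node; the 256-entry per-lcore cache keeps cores from contending
         * on the pool itself. */
        struct rte_mempool *pool = rte_pktmbuf_pool_create(
                "mbuf_pool", 8192, 256, 0,
                RTE_MBUF_DEFAULT_BUF_SIZE, (int)rte_socket_id());

        return pool == NULL ? -1 : 0;
    }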
To tackle this challenge, what kinds of devices and latencies do we have at our disposal?
Slide 15
PCIe* Connectivity and Core Usage
Using run-to-completion or pipeline software models

[Diagram: two processors linked by QPI, each with 10 GbE ports attached over PCIe. On Processor 0, physical core 0 runs the Linux* control plane while cores 1-5 run Intel® DPDK PMD packet I/O plus packet work, flow work, and flow classification for apps A, B, C, backed by NUMA pool caches, queues/rings, and buffers. On Processor 1, an RSS-mode NIC feeds an Rx/Tx core that hashes and disperses packets to cores running apps A, B, C.]

Run to Completion Model
• I/O and application workload can be handled on a single core
• I/O can be scaled over multiple cores

Pipeline Model
• I/O application disperses packets to other cores
• Application work performed on other cores

Can handle more I/O on fewer cores with vectorization.
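A minimal sketch of the run-to-completion model above (port and queue numbers, burst size, and the loop body are illustrative): one lcore polls RX, does the packet work, and transmits, so a packet never migrates between cores or caches.

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    /* Run-to-completion: the same lcore receives, processes, and
     * transmits each packet. */
    static void lcore_main(uint16_t port)
    {
        struct rte_mbuf *pkts[BURST];

        for (;;) {
            /* bulk-mode polling: one call drains up to BURST descriptors */
            uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);

            for (uint16_t i = 0; i < n; i++) {
                /* ... packet work: parse, classify, rewrite pkts[i] ... */
            }

            uint16_t sent = rte_eth_tx_burst(port, 0, pkts, n);
            for (uint16_t i = sent; i < n; i++)
                rte_pktmbuf_free(pkts[i]);  /* drop what TX could not take */
        }
    }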
Slide 16
Why do you need to share data with another thread?
So tell me: why do you need to share data with another thread?
It looks like it is the Pipeline Model that needs sharing!
Let us go with that for now!
Slide 17
How can we map our s/w variables to h/w infrastructure?
Slide 19
Individual memory => for thread-local storage?
Shared memory => for global data?

int shared;        /* global – lives in shared memory, visible to all threads */

void function(void)
{
    int private;   /* local – lives on this thread's stack */
}
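DPDK has a direct idiom for the "individual memory" half: per-lcore variables, which instantiate one copy per lcore so no two cores ever contend on the same line. A minimal sketch (the counter name is illustrative):

    #include <stdint.h>
    #include <rte_per_lcore.h>

    /* one instance per lcore: no two cores ever touch the same copy,
     * so this counter generates no coherency traffic at all */
    static RTE_DEFINE_PER_LCORE(uint64_t, rx_packets);

    static void count_rx(uint16_t n)
    {
        RTE_PER_LCORE(rx_packets) += n;
    }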
Slide 21
What do you wish for? A bigger shared memory, or bigger individual memories?
What about locality?
Slide 22
You look at the header once and forward the packet.
Right away you sprint to the next packet.
So what do you wish for? Bigger which one?
Slide 23
You look at the header once, forward the packet, and right away sprint to the next packet – not the same packet. At a fast line rate, you sprint from one packet to the next very quickly.
Temporal locality in packet processing? How are we doing? How much locality?
Smaller individual caches with little locality mean more individual-cache misses, so you often end up going out to the far shared cache / memory. It is as if you don't even have the individual cache and are stuck with the slower memory all the time.
So what do you wish for? Bigger which one?
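Since temporal locality is poor, DPDK applications manufacture lookahead locality instead: while working on one packet, they prefetch the next packet's header toward L1. A minimal sketch of that pattern (burst handling is illustrative; the same idea appears in DPDK's l3fwd sample):

    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    /* While the header of pkts[i] is being examined, start pulling the
     * header of pkts[i + 1] toward L1 so its first read is a hit. */
    static void process_burst(struct rte_mbuf **pkts, uint16_t n)
    {
        for (uint16_t i = 0; i < n; i++) {
            if (i + 1 < n)
                rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 1], void *));
            /* ... look at pkts[i]'s header once, forward, sprint on ... */
        }
    }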
Slide 24
Last Level Cache
Challenge: what if there is an L1 cache miss but an LLC hit?
[Diagram: Core 0 with its L1 cache, the L2 cache, and the LLC; an L1 miss that hits in the LLC costs about 40 cycles.]
With a 40-cycle LLC hit, how will you achieve an Rx budget of 19 cycles?
So what do you wish for? Bigger which one?
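The slide does not derive the 19-cycle figure, but a plausible reconstruction (an assumption, not from the talk): a minimum-size 64-byte frame plus 20 bytes of preamble and inter-frame gap occupies 672 bits on the wire; at 100 Gb/s that is one packet every 6.72 ns (about 148.8 Mpps), and at a ~2.8 GHz core clock, 6.72 ns × 2.8 cycles/ns ≈ 19 cycles per packet.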
Slide 26
L1 Cache With 4-Cycle Latency
[Diagram: Core 0 with its L1 cache; an L1 hit costs 4 cycles.]
Caching benefits on read – excellent! Right?
Reading packet descriptor after packet descriptor with 4-cycle hit latency, achieving the Rx budget of 19 cycles is within reach.
What? Now what? What about the first read, which may cause a miss?
Slide 27
Cache is actually hashing!
[Diagram: the first line of many different memory regions all mapping to the same cache slot; the cache tag / directory indicates which one is currently occupying the cache.]
What about locality?
Read packet descriptor, read packet descriptor, read packet descriptor ...
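The "hashing" is just address arithmetic. A minimal sketch, assuming a typical L1 geometry (64-byte lines and 64 sets, as in a 32 KiB 8-way L1 – illustrative numbers, not from the talk): the middle address bits select the set, and the upper bits form the tag that the directory compares on lookup.

    #include <stdint.h>

    #define LINE_SIZE 64u   /* bytes per cache line */
    #define NUM_SETS  64u   /* sets in the cache */

    /* the "hash": which set a given address lands in */
    static inline uintptr_t cache_set(uintptr_t addr)
    {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    /* what the tag array stores to tell competing lines apart */
    static inline uintptr_t cache_tag(uintptr_t addr)
    {
        return addr / (LINE_SIZE * NUM_SETS);
    }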
Slide 34
Let us look at Write-Through first.
For P2, where will the data be coming from? On a hit? On a miss?
Slide 36
So, with a write-through cache, writes happen at what speed?
What happens if you write repeatedly?
Slide 37
Let us look at Write-Back next.
For P2, where will the data be coming from? If a hit, from the cache. If a miss, from where?
Slide 38
At what speed do writes happen with Write-Back?
How do we improve as the writes pile up, compared to Write-Through?
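The answer in one line (standard cache behavior, not spelled out on the slide): if a core writes the same 64-byte line N times, write-through issues N writes at memory speed, while write-back absorbs all N writes at cache speed and writes the line back to memory once, on eviction – so the advantage grows with every repeated write.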
Slide 40
Where else? Cache to cache ...
So, the data can come from:
1) its own cache, or
2) shared memory, or
3) even from ANY of the other individual caches (write-back).

Requesting CPU    CPUs that can offer the data
P0                P1 to Pn
P1                P0 and P2 to Pn
P2                P0, P1, and P3 to Pn
...               ...
Pn                P0 to Pn-1

Total paths: N × N??
Looks like we have the complexity of message passing after all.
Remember me? You thought there was no movement of data in "shared memory"?
Slide 46
L1 Cache With 4-Cycle Latency
[Diagram: Core 0 with its L1 cache.]
Post it! POSTED WRITE!!
Write packet descriptor: why should I wait 4 cycles in the case of a write?
Slide 47
How is the complexity now?
The data source is now the posted write buffer too: the posted buffer participates in data sourcing as well as in MESI cache coherency.
Slide 48
Shared Memory – Data Sources
• From the local write buffer
• From another write buffer
• From the local cache
• From another cache
• From the shared cache
• From shared memory
Slide 60
With thread pinning, we avoid sharing!
If sharing is not needed, then why put the data in memory at all?
Why go through shared memory? Why not take it directly into the private cache?
Why not bypass shared memory?
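A minimal pinning sketch with plain pthreads (DPDK's EAL performs the equivalent for its lcores based on the -l/--lcores arguments; the CPU number is illustrative):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <pthread.h>

    /* Pin the calling thread to one CPU so its working set stays warm in
     * that core's private caches. */
    static int pin_self_to_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }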
Slide 61
Familiar with a bypass road?
Why go through congested inner cities? Why not bypass? Use the bypass road!
Slide 70
What about the router table? Is it a shared resource or a private, per-core resource?
Slide 71
What about the router table? Is it a shared resource or a private, per-core resource? Collective or individual?
Router table – is it one table per system? If so:
Who are the writers? Who are the readers?
How many writers? How many readers?
What about 2-socket and 4-socket systems? One table per socket?
What about coherency between the 2 or 4 tables in a multi-socket system?
Collective responsibility or individual responsibility?
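One common answer to the many-readers / few-writers shape of a route table, sketched with plain pthreads (the table layout and function names are illustrative; DPDK offers rte_rwlock_t with the same structure, and RCU-style schemes remove even the read-side lock):

    #include <pthread.h>
    #include <stdint.h>

    struct route_table {
        pthread_rwlock_t lock;
        /* ... routes ... */
    };

    /* many data-plane readers may hold the read lock simultaneously */
    static uint32_t route_lookup(struct route_table *t, uint32_t dst_ip)
    {
        uint32_t next_hop = 0;

        pthread_rwlock_rdlock(&t->lock);
        /* ... longest-prefix match on dst_ip ... */
        (void)dst_ip;
        pthread_rwlock_unlock(&t->lock);
        return next_hop;
    }

    /* the rare control-plane writer takes exclusive access */
    static void route_update(struct route_table *t)
    {
        pthread_rwlock_wrlock(&t->lock);
        /* ... modify the table ... */
        pthread_rwlock_unlock(&t->lock);
    }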