The audio starts at 9:00 due to a recording glitch.
The topic could equally be stated as “All Processor Accesses Are Not Created Equal” – H/W and S/W Synergy – Methods and Mechanisms:
- Throughout the class, developers in the audience will wear different hats – first the hat of a hardware design engineer, looking at platform design choices.
- Then they will wear the hat of a driver developer and examine the driver's requirements and assumptions about hardware behavior.
- Then they will wear the hat of a system architect and explore how software and hardware can communicate and work in synergy, so that a near-optimal implementation can be built.
About the presenter: M Jay (Muthurajan Jayakumar) has worked with the DPDK team since 2009. He joined Intel in 1991 and served in various roles and divisions – 64-bit CPU front-side bus architect and 64-bit HAL developer, among others – before joining the DPDK team. M Jay holds 21 US patents, both individual and joint, all issued while working at Intel. He received the Intel Achievement Award in 2016, Intel's highest honor for innovation and results.
Slide 3
Agenda
• Cache Coherency – is it really needed? – Message Passing vs Shared Memory
• Read access & cache – benefits we all know
• What about Write & Cache?
• Write-Through vs Write-Back Cache
• DPDK PMD and Cache Coherency
• Snoop Protocol
• NUMA
• LIFO
• Dynamic vs Static
• DDIO & Cache Size
Slide 6
Why share data?
Why don't developers use the Message Passing Paradigm?
Can we visualize having no shared address space?
Slide 7
Why shared data?
Why don't developers use the Message Passing Paradigm?
[Diagram: each core with its own private scratch memory.]
What if developers did so?
Slide 8
No need for a coherency protocol!
Slide 9
No need for cache coherency?
Message Passing – no need for coherency.
Shared Memory Paradigm – the hardware must manage coherency.
Slide 10
So, really, what is the root cause of the cache coherency requirement?
Where does the cache coherency requirement come from?
Is it the software developers' problem of not doing truly parallel programming?
Or is it the hardware designers' problem of overdoing things?
Slide 11
Well! But ... message passing needs moving data around. Moving data around ... won't that be a lot of overhead?
Shared memory means just read/write – no moving data around! Right?
Yeah! Right! Bring it on, Shared Memory!
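To make the contrast concrete before we poke holes in it, here is a minimal sketch of the two paradigms in DPDK-flavored C. The names and sizes are illustrative assumptions, not from the talk: a shared counter that the coherency protocol must keep consistent across cores, versus a pointer handed through an rte_ring in message-passing style.

    #include <rte_lcore.h>
    #include <rte_ring.h>

    /* Shared-memory style: every core reads/writes the same variable and
     * the hardware coherency protocol keeps all the caches consistent. */
    static volatile int shared_counter;

    static void shared_mem_update(void)
    {
        shared_counter++;            /* "just a read/write"... or is it? */
    }

    /* Message-passing style: the producer hands a pointer through a ring;
     * only one thread at a time ever touches the payload. */
    static struct rte_ring *msg_ring;

    static void msg_setup(void)
    {
        /* single-producer / single-consumer ring with 1024 slots */
        msg_ring = rte_ring_create("msgs", 1024, rte_socket_id(),
                                   RING_F_SP_ENQ | RING_F_SC_DEQ);
    }

    static void msg_send(void *pkt)
    {
        rte_ring_enqueue(msg_ring, pkt);        /* "send" */
    }

    static void *msg_receive(void)
    {
        void *pkt = NULL;
        return rte_ring_dequeue(msg_ring, &pkt) == 0 ? pkt : NULL;
    }

Note the irony the later slides build on: the ring itself lives in shared memory, so even the "message passing" here ultimately rides on the coherency machinery.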
Slide 13
What is the task at hand?
Receive → Process → Transmit
rx cost, tx cost – and a chain is only as strong as its weakest link.
Slide 14
Benefits – Eliminating / Hiding Overheads

Overhead                               Eliminating / Hiding How?
Interrupt overhead                     Polling
Kernel/user context-switch overhead    User Mode Driver
Core-to-thread scheduling overhead     Pthread Affinity
4K paging overhead                     Huge Pages
PCI bridge I/O overhead                High-throughput bulk-mode I/O calls
Locking overhead                       Lockless inter-core communication
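As a rough sketch of how several of these remedies surface at the API level (pool sizes and names below are illustrative assumptions, not from the talk): EAL initialization maps hugepages and pins lcores, and packet buffers come from a NUMA-local, per-lcore-cached mempool.

    #include <rte_eal.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    int main(int argc, char **argv)
    {
        /* EAL init maps hugepages and pins lcores per the -l/--lcores
         * arguments, taking the 4K-paging and scheduling overheads off
         * the table up front. */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        /* Packet buffers carved out of hugepage memory on the local NUMA
         * node; the 256-entry per-lcore cache keeps cores from contending
         * on the pool itself. */
        struct rte_mempool *pool = rte_pktmbuf_pool_create(
                "mbuf_pool", 8192, 256, 0,
                RTE_MBUF_DEFAULT_BUF_SIZE, (int)rte_socket_id());

        return pool == NULL ? -1 : 0;
    }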
To tackle this challenge, what kinds of devices and latencies do we have at our disposal?
Slide 15
PCIe* Connectivity and Core Usage
Using run-to-completion or pipeline software models

[Diagram: two processors linked by QPI, each with 10 GbE ports attached over PCIe. On Processor 0, physical core 0 runs the Linux* control plane while cores 1-5 run Intel® DPDK PMD packet I/O plus packet work, flow work, and flow classification for apps A, B, C, backed by NUMA pool caches, queues/rings, and buffers. On Processor 1, an RSS-mode NIC feeds an Rx/Tx core that hashes and disperses packets to cores running apps A, B, C.]

Run to Completion Model
• I/O and application workload can be handled on a single core
• I/O can be scaled over multiple cores

Pipeline Model
• I/O application disperses packets to other cores
• Application work performed on other cores

Can handle more I/O on fewer cores with vectorization.
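A minimal sketch of the run-to-completion model above (port and queue numbers, burst size, and the loop body are illustrative): one lcore polls RX, does the packet work, and transmits, so a packet never migrates between cores or caches.

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    /* Run-to-completion: the same lcore receives, processes, and
     * transmits each packet. */
    static void lcore_main(uint16_t port)
    {
        struct rte_mbuf *pkts[BURST];

        for (;;) {
            /* bulk-mode polling: one call drains up to BURST descriptors */
            uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);

            for (uint16_t i = 0; i < n; i++) {
                /* ... packet work: parse, classify, rewrite pkts[i] ... */
            }

            uint16_t sent = rte_eth_tx_burst(port, 0, pkts, n);
            for (uint16_t i = sent; i < n; i++)
                rte_pktmbuf_free(pkts[i]);  /* drop what TX could not take */
        }
    }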
Slide 16
Why do you need to share data with another thread?
So tell me: why do you need to share data with another thread?
It looks like it is the Pipeline Model that needs sharing!
Let us go with that for now!
Slide 17
How can we map our s/w variables to h/w infrastructure?
Slide 19
Individual memory => for thread-local storage?
Shared memory => for global data?

int shared;        /* global – lives in shared memory, visible to all threads */

void function(void)
{
    int private;   /* local – lives on this thread's stack */
}
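DPDK has a direct idiom for the "individual memory" half: per-lcore variables, which instantiate one copy per lcore so no two cores ever contend on the same line. A minimal sketch (the counter name is illustrative):

    #include <stdint.h>
    #include <rte_per_lcore.h>

    /* one instance per lcore: no two cores ever touch the same copy,
     * so this counter generates no coherency traffic at all */
    static RTE_DEFINE_PER_LCORE(uint64_t, rx_packets);

    static void count_rx(uint16_t n)
    {
        RTE_PER_LCORE(rx_packets) += n;
    }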
Slide 21
What do you wish for? A bigger shared memory, or bigger individual memories?
What about locality?
Slide 22
You look at the header once and forward the packet.
Right away you sprint to the next packet.
So what do you wish for? Bigger which one?
Slide 23
You look at the header once, forward the packet, and right away sprint to the next packet – not the same packet. At a fast line rate, you sprint from one packet to the next very quickly.
Temporal locality in packet processing? How are we doing? How much locality?
Smaller individual caches with little locality mean more individual-cache misses, so you often end up going out to the far shared cache / memory. It is as if you don't even have the individual cache and are stuck with the slower memory all the time.
So what do you wish for? Bigger which one?
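Since temporal locality is poor, DPDK applications manufacture lookahead locality instead: while working on one packet, they prefetch the next packet's header toward L1. A minimal sketch of that pattern (burst handling is illustrative; the same idea appears in DPDK's l3fwd sample):

    #include <rte_mbuf.h>
    #include <rte_prefetch.h>

    /* While the header of pkts[i] is being examined, start pulling the
     * header of pkts[i + 1] toward L1 so its first read is a hit. */
    static void process_burst(struct rte_mbuf **pkts, uint16_t n)
    {
        for (uint16_t i = 0; i < n; i++) {
            if (i + 1 < n)
                rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 1], void *));
            /* ... look at pkts[i]'s header once, forward, sprint on ... */
        }
    }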
Slide 24
Last Level Cache
Challenge: what if there is an L1 cache miss but an LLC hit?
[Diagram: Core 0 with its L1 cache, the L2 cache, and the LLC; an L1 miss that hits in the LLC costs about 40 cycles.]
With a 40-cycle LLC hit, how will you achieve an Rx budget of 19 cycles?
So what do you wish for? Bigger which one?
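The slide does not derive the 19-cycle figure, but a plausible reconstruction (an assumption, not from the talk): a minimum-size 64-byte frame plus 20 bytes of preamble and inter-frame gap occupies 672 bits on the wire; at 100 Gb/s that is one packet every 6.72 ns (about 148.8 Mpps), and at a ~2.8 GHz core clock, 6.72 ns × 2.8 cycles/ns ≈ 19 cycles per packet.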
Slide 26
L1 Cache With 4-Cycle Latency
[Diagram: Core 0 with its L1 cache; an L1 hit costs 4 cycles.]
Caching benefits on read – excellent! Right?
Reading packet descriptor after packet descriptor with 4-cycle hit latency, achieving the Rx budget of 19 cycles is within reach.
What? Now what? What about the first read, which may cause a miss?
Slide 27
Cache is actually hashing!
[Diagram: the first line of many different memory regions all mapping to the same cache slot; the cache tag / directory indicates which one is currently occupying the cache.]
What about locality?
Read packet descriptor, read packet descriptor, read packet descriptor ...
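The "hashing" is just address arithmetic. A minimal sketch, assuming a typical L1 geometry (64-byte lines and 64 sets, as in a 32 KiB 8-way L1 – illustrative numbers, not from the talk): the middle address bits select the set, and the upper bits form the tag that the directory compares on lookup.

    #include <stdint.h>

    #define LINE_SIZE 64u   /* bytes per cache line */
    #define NUM_SETS  64u   /* sets in the cache */

    /* the "hash": which set a given address lands in */
    static inline uintptr_t cache_set(uintptr_t addr)
    {
        return (addr / LINE_SIZE) % NUM_SETS;
    }

    /* what the tag array stores to tell competing lines apart */
    static inline uintptr_t cache_tag(uintptr_t addr)
    {
        return addr / (LINE_SIZE * NUM_SETS);
    }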
Slide 34
Let us look at Write-Through first.
For P2, where will the data be coming from? On a hit? On a miss?
Slide 36
So, with a write-through cache, writes happen at what speed?
What happens if you write repeatedly?
Slide 37
Let us look at Write-Back next.
For P2, where will the data be coming from? If a hit, from the cache. If a miss, from where?
Slide 38
At what speed do writes happen with Write-Back?
How do we improve as the writes pile up, compared to Write-Through?
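The answer in one line (standard cache behavior, not spelled out on the slide): if a core writes the same 64-byte line N times, write-through issues N writes at memory speed, while write-back absorbs all N writes at cache speed and writes the line back to memory once, on eviction – so the advantage grows with every repeated write.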
Slide 40
Where else? Cache to cache ...
So, the data can come from:
1) its own cache, or
2) shared memory, or
3) even from ANY of the other individual caches (write-back).

Requesting CPU    CPUs that can offer the data
P0                P1 to Pn
P1                P0 and P2 to Pn
P2                P0, P1, and P3 to Pn
...               ...
Pn                P0 to Pn-1

Total paths: N × N??
Looks like we have the complexity of message passing after all.
Remember me? You thought there was no movement of data in "shared memory"?
Slide 46
L1 Cache With 4-Cycle Latency
[Diagram: Core 0 with its L1 cache.]
Post it! POSTED WRITE!!
Write packet descriptor: why should I wait 4 cycles in the case of a write?
Slide 47
How is the complexity now?
The data source is now the posted write buffer too: the posted buffer participates in data sourcing as well as in MESI cache coherency.
Slide 48
Shared Memory – Data Sources
• From the local write buffer
• From another write buffer
• From the local cache
• From another cache
• From the shared cache
• From shared memory
Slide 60
With thread pinning, we avoid sharing!
If sharing is not needed, then why put the data in memory at all?
Why go through shared memory? Why not take it directly into the private cache?
Why not bypass shared memory?
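A minimal pinning sketch with plain pthreads (DPDK's EAL performs the equivalent for its lcores based on the -l/--lcores arguments; the CPU number is illustrative):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <pthread.h>

    /* Pin the calling thread to one CPU so its working set stays warm in
     * that core's private caches. */
    static int pin_self_to_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }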
Slide 61
Familiar with a bypass road?
Why go through congested inner cities? Why not bypass? Use the bypass road!
Slide 70
What about the router table? Is it a shared resource or a private, per-core resource?
Slide 71
What about the router table? Is it a shared resource or a private, per-core resource? Collective or individual?
Router table – is it one table per system? If so:
Who are the writers? Who are the readers?
How many writers? How many readers?
What about 2-socket and 4-socket systems? One table per socket?
What about coherency between the 2 or 4 tables in a multi-socket system?
Collective responsibility or individual responsibility?
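One common answer to the many-readers / few-writers shape of a route table, sketched with plain pthreads (the table layout and function names are illustrative; DPDK offers rte_rwlock_t with the same structure, and RCU-style schemes remove even the read-side lock):

    #include <pthread.h>
    #include <stdint.h>

    struct route_table {
        pthread_rwlock_t lock;
        /* ... routes ... */
    };

    /* many data-plane readers may hold the read lock simultaneously */
    static uint32_t route_lookup(struct route_table *t, uint32_t dst_ip)
    {
        uint32_t next_hop = 0;

        pthread_rwlock_rdlock(&t->lock);
        /* ... longest-prefix match on dst_ip ... */
        (void)dst_ip;
        pthread_rwlock_unlock(&t->lock);
        return next_hop;
    }

    /* the rare control-plane writer takes exclusive access */
    static void route_update(struct route_table *t)
    {
        pthread_rwlock_wrlock(&t->lock);
        /* ... modify the table ... */
        pthread_rwlock_unlock(&t->lock);
    }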