1. Database Research on
Modern Computing
Architecture
September 10, 2010
Kyong-Ha Lee (bart7449@gmail.com)
Department of Computer Science
KAIST, Daejeon, Korea
2. Brief Overview of This Talk
Basic theories and principles about database
technology on modern HW
• Not much discussion on implementation or tools,
but will be happy to discuss them if there are any
questions
Topics
• The immense changes in computer architecture
• A variety of computing sources
• Intra-node parallelism
• The DB technology that facilitates modern HW
features
Invited talk @ ETRI, © Kyong-Ha Lee 2
3. Things we have now in our PC
[Block diagram of the presenter's 2010 desktop PC]
Intel Core 2 Duo E8400 (Wolfdale), 3.0GHz, two cores
• Throughput ~1 instruction per cycle; one cycle takes ~0.33 ns
(the exact # of cycles depends on the instruction)
• Per core: 16 integer registers, 16 double-precision FP registers
• Per core: 32KB L1 D-cache + 32KB L1 I-cache, latency 1ns (3 cycles)
• L1 TLB: 128 entries for instructions, 256 entries for data
• L2 unified cache: 6MB shared, latency 4.7ns (14 cycles); L2 TLB
Front Side Bus: 1,333MHz, bandwidth ~10GB/s
Intel X48 Northbridge chip
• DDR3 RAM modules: 4GB, latency ~83ns (~250 cycles)
• PCI Express 2.0 x16: 8GB/s (each way)
Invited talk @ ETRI, © Kyong-Ha Lee 3
4. DMI Interface
[Block diagram continued: Intel ICH9R Southbridge chip and attached devices]
• DMI interface to the Northbridge: bandwidth 1GB/s (each way)
• USB 2.0: ~30MB/s
• Serial ATA port: 300MB/s
• FireWire 800: ~55MB/s
• PCIe 2.0 x1: 500MB/s (each way)
• Gigabit wired Ethernet: ~100MB/s
• Wireless 802.11g: ~2.5MB/s
• Seagate 1TB 7,200 RPM HDD: 32MB cache (operates at SATA rate),
sustained disk I/O ~138MB/s, random seek time 8.5ms (read)/9.5ms (write)
= 25.7 million/28.8 million cycles
• My LGU+ cable Internet line: 100Mb/s up/down
original source:
http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
Invited talk @ ETRI, © Kyong-Ha Lee 4
5. So what‘s happening now?
Changes in memory hierarchy
• Higher capacity and the emergence of Non-
Volatile RAM(NVRAM)
Memory wall and multi-level caches
• Latency lags bandwidth
Increasing number of cores in a single die
• Multicore or CMP
A variety of computing sources
• CPU, GPU, NP and FPGA
Intra-/Inter-node parallelism
• CMP vs. MPP
Invited talk @ ETRI, © Kyong-Ha Lee 5
6. Now In Memory Hierarchy
Very cheap HDD with VERY high capacity
• Seagate 1TB Barracuda 3.5" HDD (7,200rpm, 32MB) for
74,320 won ($61.94) in Aug 2010
• 1GB for 74.3 won
Write-once storage
• Tape drive is dead, ODD is waning
• Due to the poor latency and seek time
• Seek time >= 100ms
• although a 22X DVD writer can sequentially write 4.7GBytes
within 3 minutes (29.7MB/s in theory)
1GB for 53.82 won
Price of RAM has fallen enough to keep much more data in
memory than before.
• A 4GB DDR3 Memory(1,333MHz) for 108,000 won
• 1 GB for 27,000 won ($22.5)
• but, still cost_m >> cost_d
Invited talk @ ETRI, © Kyong-Ha Lee 6
7. The Five-Minute Rule
Cache randomly accessed disk pages that are reused
every 5 minutes[1].
• BreakEvenIntervalInSeconds =
(PagesPerMBofRAM / AccessesPerSecondPerDisk) × (PricePerDiskDrive / PricePerMBofRAM)
In 1987, breakeven interval was ~2 minutes
After that, ~5 minutes in 1997, ~88 minutes in 2007.
“Memory becomes HDD, HDD becomes Tape,
and Tape is dead”, by Jim Gray
Today‘s memory is ~102,400 times faster than HDD
• Memory : 83 ns(250 cycles)
• HDD : 8.5ms (25.7 million cycles)
(256 pages/MB ÷ 116 accesses/s) × ($61.94 per disk ÷ $0.0225 per MB of RAM) ≈ 6,080 seconds ≈ 101 minutes
=> Cache your data in memory whenever possible.
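A minimal sketch of the break-even computation above, plugging in the slide's 2010 numbers (4KB pages, ~116 random accesses/s per disk, a $61.94 disk, RAM at $22.5/GB); the figures are illustrative only:

    /* Five-minute-rule break-even interval, using the 2010 numbers above. */
    #include <stdio.h>

    int main(void) {
        double pages_per_mb_ram = 256.0;    /* 1MB / 4KB pages              */
        double accesses_per_sec = 116.0;    /* ~1 / 8.5ms random access     */
        double price_per_disk   = 61.94;    /* USD per 1TB drive (Aug 2010) */
        double price_per_mb_ram = 0.0225;   /* USD, i.e., $22.5 per GB      */

        double breakeven = (pages_per_mb_ram / accesses_per_sec)
                         * (price_per_disk / price_per_mb_ram);
        printf("break-even interval: %.0f s (~%.0f minutes)\n",
               breakeven, breakeven / 60.0);  /* ~6,080 s, ~101 minutes     */
        return 0;
    }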
Invited talk @ ETRI, © Kyong-Ha Lee 7
8. Latency lags bandwidth
From 1983 to 2003[2]
• Capacity increased ~ 2,500 times
(0.03GB -> 73.4GB)
• Bandwidth improved 143.3 times
(0.6 MB/s -> 86 MB/s)
• Latency improved 8.5 times (48.3 ->
5.7 ms)
Why?
• Moore‘s law helps bandwidth more
than latency
• Distance limits latency
• Bandwidth is generally easier to sell
• Latency helps bandwidth but not
vice versa (e.g., spinning the disk faster)
• Bandwidth hurts latency (e.g., buffering)
• OS overhead hurts latency
Invited talk @ ETRI, © Kyong-Ha Lee 8
9. Latency vs. Bandwidth
Latency can be handled by
• Hiding (or tolerating) it – out of order issue, non blocking cache,
prefetching
• Reducing it –better cache
Parallelism sometimes helps to hide latency
• MLP(Memory Level Parallelism) - multiple outstanding cache misses
overlapped
• But increased bandwidth demand
Latency ultimately limited by physics
Bandwidth can be handled by "spending" more (HW cost)
• Wider buses and interfaces, interleaving
Bandwidth improvement usually increases latency
• No free lunch
Hierarchies decrease bandwidth demand to lower levels
• Serve as traffic filters: a hit in L1 is filtered from L2
• If average bandwidth is not met -> infinite queues
Invited talk @ ETRI, © Kyong-Ha Lee 9
10. NVRAM Storage: Solid State Disk
Intel X25-M Mainstream(50nm) 160GB
• Read/write latency 85/115 us
• Random 4KB read/write: 35K/3.3K IOPS
• Sustained sequential read/write: 250/70MB/s
• 1GB for 3,619 won in Aug 2010
SSD has successfully occupied the position
between memory and HDD
• best suited for sequential read/write
• e.g., logging device
Invited talk @ ETRI, © Kyong-Ha Lee 10
11. Features of SSD
No mechanical latency
• Flash memory is an electronic device with no moving parts
• Provides uniform random access speed without seek/rotational latency
No in-place update
• Data on a page cannot be updated in place; the containing block must be erased first
• An erase unit (or block) is much larger than a page
Limited lifetime
• MLC : 0.1M times of writes, SLC : 1M times of writes
• Wear-leveling
Asymmetric read & write speed
• Read speed is typically at least 3X faster than write speed
• Write (and erase) optimization is critical
Asymmetric seq. vs. random I/O performance
• Random 4KB read/write: 35K/3.3K IOPS
• i.e., 140MB/s vs. 13.2MB/s in aggregate throughput
• Sustained sequential read/write: 250/70MB/s
"Disk" Abstraction
• LBA(or LPA) -> (channel#, plane#, … ) or just PBA(or PPA)
• This mapping changes each time a page write is performed
• The controller must maintain a mapping table in RAM or Flash
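To make the LBA -> PBA mapping above concrete, here is a deliberately simplified page-level FTL sketch; all names are hypothetical, and a real controller additionally manages erase blocks, garbage collection and wear-leveling:

    /* Simplified page-level FTL: every write goes to a fresh physical page
     * and the in-RAM mapping table is updated; nothing is updated in place. */
    #include <stdint.h>

    #define NUM_LOGICAL_PAGES (1u << 20)

    static uint32_t l2p[NUM_LOGICAL_PAGES];   /* logical -> physical page   */
    static uint32_t next_free_ppa;            /* naive log-structured alloc */

    uint32_t ftl_read(uint32_t lpa) { return l2p[lpa]; }

    uint32_t ftl_write(uint32_t lpa) {
        uint32_t ppa = next_free_ppa++;       /* allocate a fresh page      */
        l2p[lpa] = ppa;                       /* remap; the old physical    */
        return ppa;                           /* page becomes invalid and   */
    }                                         /* is reclaimed later by GC   */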
Invited talk @ ETRI, © Kyong-Ha Lee 11
12. Memory wall
Latencies
• CPU stalls because of time spent waiting on memory accesses
• latency for memory access: ~250 cycles
• ~249 instruction slots are wasted waiting for data from memory
Solution: CPU Caching!!
Invited talk @ ETRI, © Kyong-Ha Lee 12
13. Why Caching?
Processor speeds are projected to increase about 70% per year
for many years to come. This trend will widen the speed gap
between memory and processors. Caches will get larger, but
memory speed will not keep pace with processor speeds
Low-latency memory that hides memory access latency
• Static RAM vs. Dynamic RAM
• 3ns(L1)~14ns(L2) vs. 83ns
Small capacity with support of locality
• Temporal locality
• Recently referenced items are likely to be referenced in the near
future
• Capacity limits the # of items to be kept in the cache at a time
• L1$ in Intel Core i7 is 32KB
• Spatial locality
• Items with nearby addresses tend to be referenced close together in
time
• the size of one cache line
• e.g., a cache line size in Intel Core i7 is 64B
• So 32K/64B = 512 cache lines
Invited talk @ ETRI, © Kyong-Ha Lee 13
14. An Example Memory Hierarchy
Source: Computer Systems,
A Programmer‘s Perspective, 2003
Invited talk @ ETRI, © Kyong-Ha Lee 14
15. Memory Mountain in 2000
32B cache line size
Source: Computer Systems,
A Programmer‘s Perspective, 2003
Invited talk @ ETRI, © Kyong-Ha Lee 15
16. Memory Mountain in 2010
Intel Core i7
2.67GHz
32KB L1 d-cache
256KB L2 cache
8MB L3 cache
64B cache line size
source: http://csapp.cs.cmu.edu/public/perspective.html
Invited talk @ ETRI, © Kyong-Ha Lee 16
17. CPU Cache Structure
Source: Computer Systems,
A Programmer‘s Perspective, 2003
Invited talk @ ETRI, © Kyong-Ha Lee 17
18. Addressing Caches
Source: Computer Systems,
A Programmer‘s Perspective, 2003
Invited talk @ ETRI, © Kyong-Ha Lee 18
19. Types of Cache Misses
Cold miss (or compulsory miss)
• Occurs on the first reference, when the data has not been loaded yet
Capacity miss
• Because of limited capacity
• must evict a victim to make space for replacement block
• LRU or LFU
Conflict miss
• involves cache thrashing
• can be alleviated by associative cache
• e.g., 8-way set associative cache in Core2 Duo
Coherence miss
• Data consistency between caches
Invited talk @ ETRI, © Kyong-Ha Lee 19
20. Cache Performance
Metrics
• Miss rate: # of misses/# of references
• The fraction of memory references that miss during the execution of a
program
• Hit rate: # of hits/# of references
• Hit time: the time to deliver a word in the cache to the CPU
• Miss penalty: any additional time required because of a miss.
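These metrics combine into the usual average memory access time, AMAT = hit time + miss rate × miss penalty; a minimal sketch, with numbers that roughly echo slide 3's L1 and memory latencies (the 5% miss rate is made up):

    /* AMAT = hit_time + miss_rate * miss_penalty */
    double amat(double hit_time_ns, double miss_rate, double miss_penalty_ns) {
        return hit_time_ns + miss_rate * miss_penalty_ns;
    }
    /* e.g., amat(1.0, 0.05, 83.0) = 1 + 4.15 = 5.15 ns per reference */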
Impact of :
• Cache size: reduces capacity misses and increases both hit rate
and hit time
• Cache line size: larger lines exploit spatial locality but leave fewer
lines, hurting temporal locality
• Associativity
• Fully associative: no conflict misses, but requires searching all cache
lines on a lookup
• Direct-mapping: conflict miss
Invited talk @ ETRI, © Kyong-Ha Lee 20
21. Writing Cache-Friendly Codes
Maximize the two localities in your program
• Remove as many pointers as possible
• Increases spatial locality (at the price of higher update cost)
• Fit the working data into a cache line and into the
capacity of the cache
• Increases spatial and temporal locality
• Use working data as often as possible once it has
been read from memory (see the sketch below)
Software prefetching
• Reduces cold misses
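A minimal sketch of the advice above: with a row-major array, traversing row by row touches consecutive addresses (one miss per cache line), while traversing column by column touches a new line on almost every access.

    #define N 1024
    double a[N][N];                          /* row-major 2D array          */

    double sum_rowwise(void) {               /* cache-friendly: 8B stride   */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_colwise(void) {               /* cache-hostile: 8KB stride   */
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }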
Invited talk @ ETRI, © Kyong-Ha Lee 21
22. Example: Matrix Multiplication
*Assumptions:
• Row-major order
• Cache block = 8 doubles
• Cache size C << n
• Three B×B blocks fit into the cache: 3B^2 < C
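A hedged sketch of the cache-blocked multiplication the slide refers to, under the assumptions above (row-major arrays, block size B chosen so that three BxB tiles of doubles fit in the cache; B = 32 is an illustrative choice):

    #define N 1024
    #define B 32                               /* 3*B*B*8 bytes = 24KB < C  */
    double A[N][N], Bmat[N][N], Cmat[N][N];    /* Cmat assumed zeroed       */

    void matmul_blocked(void) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    /* one BxB tile product; operands stay cache-resident   */
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++) {
                            double aik = A[i][k];
                            for (int j = jj; j < jj + B; j++)
                                Cmat[i][j] += aik * Bmat[k][j];
                        }
    }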
Invited talk @ ETRI, © Kyong-Ha Lee 22
23. SW Prefetching
Loop unrolling
• for (int i = 0; i < N-4; i += 4) {   // inner product of double a[] and b[]
      prefetch(&a[i+4]);               // 32-bit machine with 32B cache line size
      prefetch(&b[i+4]);               // (prefetch() stands for a cache-prefetch hint)
      ip = ip + a[i]   * b[i];
      ip = ip + a[i+1] * b[i+1];
      ip = ip + a[i+2] * b[i+2];
      ip = ip + a[i+3] * b[i+3]; }
  [diagram: one 32B cache line holds a[i] a[i+1] a[i+2] a[i+3], another holds b[i] b[i+1] b[i+2] b[i+3]]
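A compilable variant of the loop above, assuming GCC/Clang's __builtin_prefetch intrinsic and the slide's one-iteration (32B) prefetch distance; the remainder loop is added so any n works:

    double dot_prefetch(const double *a, const double *b, int n) {
        double ip = 0.0;
        int i;
        for (i = 0; i + 7 < n; i += 4) {       /* keeps i+4..i+7 in bounds  */
            __builtin_prefetch(&a[i + 4]);     /* fetch next 32B of a[]     */
            __builtin_prefetch(&b[i + 4]);     /* fetch next 32B of b[]     */
            ip += a[i]     * b[i];
            ip += a[i + 1] * b[i + 1];
            ip += a[i + 2] * b[i + 2];
            ip += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)                     /* remainder elements        */
            ip += a[i] * b[i];
        return ip;
    }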
Data linearization
• e.g., a preorder traversal of a 7-node binary tree (root 1; children 2, 3;
leaves 4, 5, 6, 7) lays the nodes out contiguously as 1 2 4 5 3 6 7
Invited talk @ ETRI, © Kyong-Ha Lee 23
24. Optimizations in Modern
Microprocessor
Pipelining (Intel i486)
• utilizes ILP(Instruction Level
Parallelism)
• increases throughput but not
latency.
Out-of-order execution(Intel P6)
• 96-sized inst. window(Core2), 128-
sized inst. window(Nehalem)
• in-order processor(Intel Atom,
GPU)
Superscalar(Intel P5)
• 3-wide(Core2) , 4-wide(Nehalem)
Invited talk @ ETRI, © Kyong-Ha Lee 24
25. Simultaneous Multi-threading(from Intel Pentium4)
• TLP(Thread-Level Parallelism)
• Hardware multi-threading
• Support of HW-level context switching
• issues multiple instructions from multiple threads in one cycle.
• HT(Hyper Threading) is Intel‘s term for SMT
SIMD(Single Instruction Multiple Data)(Intel Pentium III)
• DLP(Data-Level Parallelism)
• 128bit SSE(Streaming SIMD Extensions) for x86 architecture
Invited talk @ ETRI, © Kyong-Ha Lee 25
26. Branch prediction and speculative execution
• guess which way a branch will go before this is known for
sure
• To improve the flow in the ILP
Hardware prefetching
• Hiding latency by fetching data from memory in advance
• Advantage
• No need to add any instruction overhead to issue prefetches
• No SW cost
• Disadvantage
• Cache pollution
• Bandwidth can be wasted
• H/W cost and compatibility
Invited talk @ ETRI, © Kyong-Ha Lee 26
27. Speed of a Program
CPI (Cycles Per Instruction) vs.
IPC (Instructions Per Cycle)
MIPS (Million Instructions Per Second)
• FLOPS (Floating-point Operations Per Second)
• GFLOPS, TFLOPS
T = N x CPI x T_cycle
Improvement
• Reduce the # of instructions
• Reduce CPI
• Increase clock speed
Invited talk @ ETRI, © Kyong-Ha Lee 27
28. Virtuous Cycle, circa 1950 – 2005
[Cycle diagram: increased processor performance -> larger, more
feature-full software -> slower programs -> demand for still more
processor performance; enabled by higher-level languages & abstractions
and larger development teams]
World-Wide Software Market (per IDC):
$212b (2005) -> $310b (2010)
Invited talk @ ETRI, © Kyong-Ha Lee 28
29. Virtuous Cycle, circa 2005-??
[Same cycle diagram, but the "Increased processor performance" node is
crossed out: slower programs no longer get a free speedup]
GAME OVER — NEXT LEVEL?
Thread-Level Parallelism & Multicore Chips
(higher-level languages & abstractions and larger development teams
remain in the loop)
Invited talk @ ETRI, © Kyong-Ha Lee 29
30. CMP(Chip Level Multiprocessor)
Apple Inc. starts to sell a 12-core Mac Pro (in Aug 2010)
• "The new Mac Pro offers two advanced processor options from Intel.
The Quad-Core Intel Xeon 'Nehalem' processor is available in a
single-processor, quad-core configuration at speeds up to 3.2GHz.
For even greater speed and power, choose the 'Westmere' series,
Intel's next-generation processor based on its latest 32-nm process
technology. 'Westmere' is available in both quad-core and 6-core
versions, and the Mac Pro comes with either one or two processors.
Which means that you can have a 6-core Mac Pro at 3.33GHz, an 8-
core system at 2.4GHz, or, to max out your performance, a 12-core
system at up to 2.93GHz." from the Apple homepage
Invited talk @ ETRI, © Kyong-Ha Lee 30
31. Multicore
Moore's law is still valid
• "The # of transistors on an integrated circuit has doubled
approximately every other year." - Gordon E. Moore, 1965
Obstacles to increasing clock speed
• Power density problem
• "Can soon put more transistors on a chip than can afford to
turn on" - Patterson '07
• Heat problem
• e.g., Intel Pentium IV Prescott (3.7GHz) in 2004
Limits in Instruction-Level Parallelism (ILP)
=> The emergence of Multicore !!
[Die photos: Intel Core 2 Duo (two cores sharing an L2 cache) and
Intel Core i7]
Invited talk @ ETRI, © Kyong-Ha Lee 31
32. Chip density is continuing to increase ~2x every 2 years
Clock speed is not increasing
Number of processor cores may double instead
There is little or no hidden parallelism (ILP) to be found
Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Invited talk @ ETRI, © Kyong-Ha Lee 32
33. Can soon put more transistors on a chip than can afford to turn on.
-- Patterson '07
Scaling clock speed (business as usual) will not work
[Chart: power density (W/cm^2, log scale from 1 to 10,000) vs. year
(1970-2010) for Intel chips 4004, 8008, 8080, 8085, 8086, 286, 386, 486,
Pentium and P6, with reference levels for a hot plate, a nuclear reactor,
a rocket nozzle and the Sun's surface. Source: Patrick Gelsinger, Intel]
Invited talk @ ETRI, © Kyong-Ha Lee 33
34. Parallelism Saves Power
Exploit explicit parallelism for reducing power
Power = C * V^2 * F (C: capacitance, V: voltage, F: frequency)
Performance = Cores * F
With 2x cores at half frequency (and hence half voltage):
Power = (2C) * (V/2)^2 * (F/2) = (C * V^2 * F) / 4
Performance = 2 * Cores * (F/2) = Cores * F
• Using additional cores
– Increases density (= more transistors = more capacitance)
– Can increase cores (2x) and performance (2x)
– Or increase cores (2x), but decrease frequency (1/2):
same performance at ¼ the power
• Additional benefits
– Small/simple cores -> more predictable performance
Invited talk @ ETRI, © Kyong-Ha Lee 34
35. Amdahl‘s law
Two basic metrics
• Speedup(N) = T(1) / T(N)
• Efficiency(N) = Speedup(N) / N
Recall Amdahl's law [1967]
• Speedup = 1 / ((1 - F) + F/N), where F is the parallel fraction and N
the number of processors
• Simple SW assumption
• No overhead for
• Scheduling, communication, synchronization, and so on
• e.g., with F = 0.9 and N = 16, speedup = 1 / (0.1 + 0.9/16) ≈ 6.4
Invited talk @ ETRI, © Kyong-Ha Lee 35
36. Types of multicore
Symmetric multicore
• e.g., Core 2 Duo, i5, i7, Xeon octo-core
Assume that
• Each Chip is Bounded to N BCEs (Base Core Equivalents) for all cores
• Each Core consumes R BCEs
• Assume Symmetric Multicore = All Cores Identical
• Therefore, N/R Cores per Chip — (N/R)*R = N
• For an N = 16 BCE Chip:
Sixteen 1-BCE cores Four 4-BCE cores One 16-BCE core
Invited talk @ ETRI, © Kyong-Ha Lee 36
37. Performance of Symmetric
Multicore Chips
Serial Fraction 1-F uses 1 core at rate Perf(R)
Serial time = (1 – F) / Perf(R)
Parallel Fraction uses N/R cores at rate Perf(R) each
Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
Therefore, w.r.t. one base core:
Symmetric Speedup = 1 / ( (1-F)/Perf(R) + F*R/(Perf(R)*N) )
Implications?
Enhanced Cores speed Serial & Parallel
Invited talk @ ETRI, © Kyong-Ha Lee 37
38. Symmetric Multicore Chip, N = 16
BCEs
[Chart: symmetric speedup (0-16) vs. R BCEs per core (1, 2, 4, 8, 16,
i.e., 16, 8, 4, 2, 1 cores) for F = 0.5, 0.9, 0.975, 0.99 and 0.999.
E.g., F=1, R=1 gives 16 cores and speedup 16; F=0.9, R=2 gives 8 cores
and speedup 6.7]
F matters: Amdahl‘s Law applies to multicore chips
Many researchers should target parallelism (F) first
As Moore's Law increases N, often need enhanced core designs
Some architecture researchers target single-core performance
Invited talk @ ETRI, © Kyong-Ha Lee 38
39. Asymmetric multicore
• Cell Broadband Engine in PS3
• 1 PPE (Power Processor Element) and 8 SPEs (Synergistic
Processor Elements)
Each Chip Bounded to N BCEs (for all cores)
One R-BCE Core leaves N-R BCEs
Use N-R BCEs for N-R Base Cores
Therefore, 1 + N - R Cores per Chip
For an N = 16 BCE Chip:
Symmetric: Four 4-BCE cores Asymmetric: One 4-BCE core
& Twelve 1-BCE base cores
Invited talk @ ETRI, © Kyong-Ha Lee 39
40. Performance of Asymmetric
Multicore Chips
Serial Fraction 1-F same, so time = (1 – F) /
Perf(R)
Parallel Fraction F
• One core at rate Perf(R)
• N-R cores at rate 1
• Parallel time = F / (Perf(R) + N - R)
Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ( (1-F)/Perf(R) + F/(Perf(R) + N - R) )
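The two speedup formulas can be sketched directly in code; the perf(R) = sqrt(R) model below is the illustrative assumption used in [3], not something fixed by the slides:

    #include <math.h>

    double perf(double r) { return sqrt(r); }          /* assumed model     */

    double speedup_symmetric(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
    }

    double speedup_asymmetric(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
    }
    /* e.g., speedup_symmetric(0.9, 16, 2) ~= 6.7, matching slide 38 */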
Invited talk @ ETRI, © Kyong-Ha Lee 40
41. Asymmetric Multicore Chip, N =
256 BCEs
[Chart: asymmetric speedup (0-250) vs. R BCEs for the single enhanced
core (1, 2, 4, ..., 256) for F = 0.5, 0.9, 0.975, 0.99 and 0.999; the
chip has 1 enhanced core plus 256 - R base cores, from 256 cores at
R=1 down to 1 core at R=256]
Number of Cores = 1 (Enhanced) + 256 – R (Base)
How do Asymmetric & Symmetric speedups compare?
Invited talk @ ETRI, © Kyong-Ha Lee 41
42. Other laws
Gustafson's law
• Scaled speedup = N - α(N - 1), where α = 1 - f is the serial fraction
Karp-Flatt metric
• an efficient way to estimate the serial fraction from measurements of
real code
• e = (1/ψ - 1/N) / (1 - 1/N), where ψ is the speedup measured on N processors
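A one-function sketch of the Karp-Flatt estimate above (psi is the measured speedup on N processors):

    double karp_flatt(double psi, double n) {
        return (1.0 / psi - 1.0 / n) / (1.0 - 1.0 / n);
    }
    /* e.g., a measured speedup of 6.7 on 8 processors gives e ~= 0.028 */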
Invited talk @ ETRI, © Kyong-Ha Lee 42
43. Multicore Makes The Memory Wall Problem Worse
Assume that each core requires 2GB/s of bandwidth to access memory
What if 6 cores access the memory at a time? => 12GB/s >> FSB bandwidth
A prefetching scheme that is appropriate for a uniprocessor may be entirely
inappropriate for a multiprocessor [22].
[Chart: total CPU cycles of a query (0 to 5.0E+09) on 1 core vs. 8 cores,
broken down into Memory, DTLB miss, L2 hit, Branch misprediction and
Computation]
http://spectrum.ieee.org/computing/hardware/multicore-is-bad-news-for-supercomputers
Solution: Sharing memory access
Invited talk @ ETRI, © Kyong-Ha Lee 43
44. GPU
DLP(Data-Level Parallelism)
GPU has become a powerful computing
engine behind scientific computing and data-
intensive applications
Many lightweight in-order cores
Separate caches and memories from the host CPU
GPGPU applications are data-intensive,
handling long-running kernel executions (10-
1,000s of ms) and large data units (1-100s of
MB)
Invited talk @ ETRI, © Kyong-Ha Lee 44
45. GPU Architecture
NVIDIA GTX 512
16 Streaming Multiprocessors(SM), each of which
consists of 32 Stream Processors(SPs), resulting in 512
cores in total.
All threads running on SPs share the same program
called kernel
An SM works as an independent SIMT processor.
Invited talk @ ETRI, © Kyong-Ha Lee 45
46. Levels of Parallel Granularity and
Memory sharing
A thread block is a batch of
threads that can cooperate
with each other by:
• Synchronizing their execution
• For hazard-free shared
memory accesses
• Efficiently sharing data
through a low latency shared-
memory
Two threads from two
different blocks cannot
cooperate
Invited talk @ ETRI, © Kyong-Ha Lee 46
47. Four Execution Steps
The DMA controller transfers data from
host(CPU) memory to device(GPU) memory
A host program instructs the GPU to launch
the kernel
The GPU executes threads in parallel
The DMA controller transfers result from
device memory to host memory
Warp; a basic execution(or scheduling) unit
of SM, a group of 32 threads sharing the
same instruction pointer; all threads in a
warp take the same code path.
Invited talk @ ETRI, © Kyong-Ha Lee 47
48. Comparison with CPU
CPU
• Maximizes ILP to accelerate a small # of threads
• Devotes large caches and sophisticated control planes to advanced
features, e.g., superscalar, OoO execution, branch prediction, and
speculative loads
• Latency hiding is limited by CPU resources
• Limited memory bandwidth (32GB/s for X5550)
GPU
• Maximizes thread-level parallelism
• Devotes most of its die area to a large array of ALUs
• Memory stalls can be effectively hidden with a large enough number
of threads
• Large memory bandwidth (177.4GB/s for GTX480)
Invited talk @ ETRI, © Kyong-Ha Lee 48
49. GPU Programming Considerations
What to offload
• Computation- and memory-intensive algorithms with
high regularity are well suited to GPU acceleration
How to parallelize
Data structure usage
• Simple data structures such as arrays are
recommended
Divergence in GPU code
• SIMT demands minimal code-path divergence
caused by data-dependent conditional branches
within a warp
Expensive host-device memory transfer cost
Invited talk @ ETRI, © Kyong-Ha Lee 49
50. FPGA(Field Programmable Gate
Array)
Von Neumann
architecture vs.
Hardware architecture
An integrated circuit
designed to be configured
by the customer
The configuration is specified
using an HDL (Hardware
Description Language)
Invited talk @ ETRI, © Kyong-Ha Lee 50
51. Limitations of FPGA
Area/speed tradeoff
• Finite number of CLBs on a single die
• Becomes slower and more power-
hungry as the logic becomes more complex
Acts as hard-wired logic once it is "cooked" (configured)
No support for recursive calls
Asynchronous design
Less power-efficient
Invited talk @ ETRI, © Kyong-Ha Lee 51
52. Is DB execution cache-friendly?
DB Execution Time Breakdown (in 2005)
At least 50% of cycles are spent on stalls
Memory access is major bottleneck
Branch mispredictions increase cache misses
Invited talk @ ETRI, © Kyong-Ha Lee 52
53. Modern DB techniques
• Cache-conscious
• Cache-friendly data placement
• Data cache
• Cache-conscious data structures
• Buffering index structures
• Hiding latency using prefetching
• Cache-conscious join
• Instruction cache
• Buffering
• Staged database execution
• Branch prediction
• Reduce branches and SIMD
• CMP and multithreading
• Memory scan sharing
• Staged DB execution
• GPGPU
• SIMT
• FPGA
• Von-Neumann vs. HW circuit
Invited talk @ ETRI, © Kyong-Ha Lee 53
54. Record Layout Schemes
Example query: SELECT name FROM R WHERE age > 50
PAX optimizes cache-to-memory communication but
retains NSM's I/O behavior (page contents do not change)
(a) NSM (N-ary Storage Model) (b) DSM (Decomposed Storage Model) or
column-based (c) PAX (Partition Attributes Across)
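An illustrative sketch of the two extremes for a relation R(name, age): NSM keeps whole tuples together, DSM keeps one array per attribute, so the example predicate on age scans only the age column (names and sizes are hypothetical):

    #define MAX_TUPLES 100000

    struct tuple_nsm { char name[32]; int age; };  /* row store (NSM)       */
    struct tuple_nsm r_nsm[MAX_TUPLES];

    struct relation_dsm {                           /* column store (DSM)   */
        char name[MAX_TUPLES][32];
        int  age[MAX_TUPLES];
    };

    int count_older(const struct relation_dsm *r, int n, int limit) {
        int cnt = 0;
        for (int i = 0; i < n; i++)   /* touches only the age column        */
            if (r->age[i] > limit) cnt++;
        return cnt;
    }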
Invited talk @ ETRI, © Kyong-Ha Lee 54
55. Main-Memory Tree Indexes
T-tree: Balanced-binary tree proposed in 1986 for
MMDB
• Aim: balance space overhead with searching time.
Main-memory B+-trees: better cache performance [4]
Node width = cache line size (32-128B)
Minimizes the number of cache misses per search
But the tree becomes much taller than a traditional disk-based B+-tree
=> more cache misses
How to make the B+-tree shallow?
Invited talk @ ETRI, © Kyong-Ha Lee 55
56. Cache Sensitive B +-tree
Lay out child nodes contiguously
Eliminate all but one child pointer (see the sketch below)
• keys in one node fit in one cache line
• Removing pointers increases the fanout of the tree, which
results in a reduced tree height
• 35% faster tree lookups
• Update performance is 30% worse (splits)
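A rough sketch of a CSB+-tree internal node as described above, sized to one 64B cache line; the field names and widths are illustrative, not the paper's exact layout:

    #include <stdint.h>

    struct csb_node {
        uint16_t nkeys;                  /* number of valid keys            */
        struct csb_node *first_child;    /* start of contiguous child group */
        int32_t keys[12];                /* keys fill the rest of the line  */
    };                                   /* ~64 bytes in total              */
    /* descend: i = number of keys <= search key; next = node->first_child + i */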
Invited talk @ ETRI, © Kyong-Ha Lee 56
57. Buffering Index Structures
buffering accesses to the index structure to avoid cache
thrashing
Nodes in the index tree are grouped together into
pieces that fit within the cache
Increase temporal locality but accesses can be delayed
Invited talk @ ETRI, © Kyong-Ha Lee 57
58. Prefetching B+-tree
Idea: Larger nodes + prefetching
Node size = multiple cache lines (e.g., 8 lines)
Prefetch all lines of a node before searching it
Cost to access a node only increases slightly
Much shallower tree, no changes required
Improves both search and update performance
Invited talk @ ETRI, © Kyong-Ha Lee 58
59. Fractal pB+-tree
For faster range scan
• Leaf parent nodes contain addresses of all leaves
• Link leaf parent nodes together
• Use this structure for prefetching leaf nodes
* A prefetching scheme that is appropriate for a
uniprocessor may be entirely inappropriate for a
multiprocessor [22].
Invited talk @ ETRI, © Kyong-Ha Lee 59
60. Cache-Conscious Hash Join
For good temporal
locality, two relations to
be joined are partitioned
into partitions that fit in
the data cache.
To reduce TLB misses
caused by a large fanout (big H), use
radix hashing
• Within a cluster, the # of
random accesses is low
• a large number of
clusters can be created by
making multiple passes
through the data (see the sketch below)
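A sketch of one radix-partitioning pass in the spirit of the description above: B bits of the key select one of 2^B clusters, and keeping 2^B small per pass keeps the set of output positions TLB- and cache-resident; multiple passes refine the clusters.

    #include <stdint.h>
    #include <stddef.h>

    /* Partition (key, payload) pairs into 2^bits clusters on key bits
     * [shift, shift+bits). hist must hold 2^bits counters. */
    void radix_partition(const uint32_t *key, const uint32_t *payload, size_t n,
                         uint32_t *out_key, uint32_t *out_payload,
                         size_t *hist, int bits, int shift) {
        size_t fanout = (size_t)1 << bits;
        size_t offsets[1 << 8];                /* assumes bits <= 8 per pass */
        size_t mask = fanout - 1;

        for (size_t c = 0; c < fanout; c++) hist[c] = 0;
        for (size_t i = 0; i < n; i++)         /* pass 1: histogram          */
            hist[(key[i] >> shift) & mask]++;

        size_t sum = 0;                        /* prefix sum = start offsets */
        for (size_t c = 0; c < fanout; c++) { offsets[c] = sum; sum += hist[c]; }

        for (size_t i = 0; i < n; i++) {       /* pass 2: scatter            */
            size_t c = (key[i] >> shift) & mask;
            out_key[offsets[c]]     = key[i];
            out_payload[offsets[c]] = payload[i];
            offsets[c]++;
        }
    }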
Invited talk @ ETRI, © Kyong-Ha Lee 60
62. a group
Invited talk @ ETRI, © Kyong-Ha Lee 62
63. Buffering tuples btw. operators
Group consecutive operators into execution groups
whose operators' code fits into the L1 I-cache
Buffer the output of each execution group
I-cache misses are amortized over multiple tuples
and I-cache thrashing is avoided
Invited talk @ ETRI, © Kyong-Ha Lee 63
64. How SMTs can help DB
performance
Bi-threaded: partition input, cooperative
threads
Work-ahead-set: main thread + helper thread
• Main thread posts ―work-ahead set‖ to a queue
• Helper thread issues load instructions for the
requests
Invited talk @ ETRI, © Kyong-Ha Lee 64
65. Staged Database Execution Model
A TX may be divided into stages that fit in the L1 I-cache
When one tx reaches the end of a stage, the system switches
context to a different thread that needs to execute the
same stage.
[Diagram: two transactions broken into the same stages --
Stage S0: LOAD X, STORE Y; Stage S1: LOAD Y, ..., STORE Z;
Stage S2: LOAD Z, ...]
Invited talk @ ETRI, © Kyong-Ha Lee 65
66. Stage Spawning
[Diagram: stage S0 (LOAD X, STORE Y) runs on Core 0, stage S1
(LOAD Y, ..., STORE Z) on Core 1, and stage S2 (LOAD Z, ...) on Core 2;
each core has a work-queue holding instances of its stage]
Invited talk @ ETRI, © Kyong-Ha Lee 66
67. Main-Memory Scan Sharing
• Memory scan sharing also increases temporal locality
• Too much sharing can cause cache thrashing
Invited talk @ ETRI, © Kyong-Ha Lee 67
68. Summary
Latency is a major problem
Cache-friendly programming is
indispensable
Chip-level multiprocessors must be
exploited for TLP
Facilitating diverse computing sources is a
challenge
Invited talk @ ETRI, © Kyong-Ha Lee 68
69. Further readings
1. Jim Gray, Gianfranco R. Putzolu, The 5 Minute Rule for Trading Memory for Disk Accesses and
The 10 Byte Rule for Trading Memory for CPU Time, SIGMOD 1987: 395-398
2. David A. Patterson, Latency lags bandwidth, CACM, Vol. 47, No. 10 pp. 71—75, 2004
3. Mark Hill et al., Amdahl's law in the multicore era, IEEE Computer, Vol. 41, No. 7 pp. 33-38,
2008
4. J. Rao and et al., Cache Conscious Indexing for Decision-Support in Main Memory
5. P. A. Boncz et al., Breaking the memory wall in MonetDB, CACM, Dec 2008
6. Shimin Chen and et al., Improving Hash Join Performance through Prefetching, ICDE 2004
7. Jingren Zhou and et al., Implementing Database Operations Using SIMD instructions, SIGMOD
2002
8. J. Cieslewicz and K.A. Ross, Database Optimizations for Modern Hardware, Proceedings of the
IEEE 96(5), 2009
9. Lawrence Spracklen et al., Chip Multithreading: Opportunities and Challenges
10. Nikos Hardavellas and et al., Database Servers on Chip Multiprocessors: Limitations and
Opportunities, CIDR 2007
11. Lin Qiao and et al., Main-Memory Scan Sharing For Multi-Core CPUs, PVLDB 2008
12. Ryan Johnson and et al., To Share or Not to Share?, VLDB 2007
13. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs, VLDB 2009
14. Database Architectures for New Hardware, Tutorial, in the 30th VLDB, 2004 and in the 21st
ICDE 2005
15. Query Co-processing on Commodity Processors, Tutorial in the 22nd ICDE 2006.
16. John Nickolls and et al., GPU Computing Era, IEEE Micro March/April 2010
17. Kayvon Fatahalian and et al., A Closer Look at GPUs, CACM Vol. 51, No.10, 2008
Invited talk @ ETRI, © Kyong-Ha Lee 69
70. 18. John Nickolls and et al., Scalable Parallel Programming, March/April ACM Queue, 2008
19. N.K. Govindaraju et al., GPUTeraSort: high performance graphics co-processor sorting for large
database management, SIGMOD 2006
20. A. Mitra and et al., Boosting XML Filtering with a Scalable FPGA-based Architecture, CIDR 2009
21. S. Harizopoulos and A. Ailamaki and et al., Improving instruction cache performance in OLTP, ACM
TODS, vol. 31, pp. 887-920
22. T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory
multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87-106, 1991
*Courses available on Internet
Introduction to Computer Systems @CMU, 2000~2010
• http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f10/www/index.html
Multicore Programming Primer @MIT, 2007 (with video)
• http://groups.csail.mit.edu/cag/ps3/index.shtml
Introduction to Multiprocessor Synchronization @Brown
• http://www.cs.brown.edu/courses/cs176
Parallel Programming for Multicore @Berkeley, Spring 2007
• http://www.cs.berkeley.edu/~yelick/cs194f07/
Applications of Parallel Computing @Berkeley, Spring 2007
• http://www.cs.berkeley.edu/~yelick/cs267_sp07/
High-Performance Computing for Applications in Engineering @Wisc, Autumn 2008
• http://sbel.wisc.edu/Courses/ME964/2008/index.htm
High Performance Computing Training @Lawrence Livermore National Laboratory
• https://computing.llnl.gov/?set=training&page=index
Programming Massively Parallel Processors with CUDA @Stanford, Spring 2010 (with video)
• on Itunes U and Youtube.com
Invited talk @ ETRI, © Kyong-Ha Lee 70