1. Database Research on
Modern Computing
Architecture
September 10, 2010
Kyong-Ha Lee (bart7449@gmail.com)
Department of Computer Science
KAIST, Daejeon, Korea
2. Brief Overview of This Talk
Basic theories and principles about database
technology on modern HW
• Not much discussion on implementation or tools,
but will be happy to discuss them if there are any
questions
Topics
• The immense changes in computer architecture
• A variety of computing sources
• Intra-node parallelism
• The DB technology that facilitates modern HW
features
Invited talk @ ETRI, © Kyong-Ha Lee 2
3. Things we have now in our PC
[Block diagram of the presenter's 2010 desktop PC]
Intel Core 2 Duo E8400 (Wolfdale), 3.0GHz, two cores
• Throughput ~1 instruction per cycle; one cycle takes ~0.33 ns
(the exact # of cycles depends on the instruction)
• Per core: 16 integer registers, 16 double-precision FP registers
• Per core: 32KB L1 D-cache + 32KB L1 I-cache, latency 1ns (3 cycles)
• L1 TLB: 128 entries for instructions, 256 entries for data
• L2 unified cache: 6MB shared, latency 4.7ns (14 cycles); L2 TLB
Front Side Bus: 1,333MHz, bandwidth ~10GB/s
Intel X48 Northbridge chip
• DDR3 RAM modules: 4GB, latency ~83ns (~250 cycles)
• PCI Express 2.0 x16: 8GB/s (each way)
Invited talk @ ETRI, © Kyong-Ha Lee 3
4. DMI Interface
[Block diagram continued: Intel ICH9R Southbridge chip and attached devices]
• DMI interface to the Northbridge: bandwidth 1GB/s (each way)
• USB 2.0: ~30MB/s
• Serial ATA port: 300MB/s
• FireWire 800: ~55MB/s
• PCIe 2.0 x1: 500MB/s (each way)
• Gigabit wired Ethernet: ~100MB/s
• Wireless 802.11g: ~2.5MB/s
• Seagate 1TB 7,200 RPM HDD: 32MB cache (operates at SATA rate),
sustained disk I/O ~138MB/s, random seek time 8.5ms (read)/9.5ms (write)
= 25.7 million/28.8 million cycles
• My LGU+ cable Internet line: 100Mb/s up/down
original source:
http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
Invited talk @ ETRI, © Kyong-Ha Lee 4
5. So what‘s happening now?
Changes in memory hierarchy
• Higher capacity and the emergence of Non-
Volatile RAM(NVRAM)
Memory wall and multi-level caches
• Latency lags bandwidth
Increasing number of cores in a single die
• Multicore or CMP
A variety of computing sources
• CPU, GPU, NP and FPGA
Intra-/Inter-node parallelism
• CMP vs. MPP
Invited talk @ ETRI, © Kyong-Ha Lee 5
6. Now In Memory Hierarchy
Very cheap HDD with VERY high capacity
• Seagate 1TB Barracuda 3.5" HDD (7,200rpm, 32MB) for
74,320 won ($61.94) in Aug 2010
• 1GB for 74.3 won
Write-once storage
• Tape drive is dead, ODD is waning
• Due to the poor latency and seek time
• Seek time >= 100ms
• although a 22X DVD writer can sequentially write 4.7GBytes
within 3 minutes (29.7MB/s in theory)
1GB for 53.82 won
Price of RAM has fallen enough to keep much more data in
memory than before.
• A 4GB DDR3 Memory(1,333MHz) for 108,000 won
• 1 GB for 27,000 won ($22.5)
• but, still cost_m >> cost_d
Invited talk @ ETRI, © Kyong-Ha Lee 6
7. The Five-Minute Rule
Cache randomly accessed disk pages that are reused
every 5 minutes[1].
• BreakEvenIntervalInSeconds =
(PagesPerMBofRAM / AccessesPerSecondPerDisk) × (PricePerDiskDrive / PricePerMBofRAM)
In 1987, breakeven interval was ~2 minutes
After that, ~5 minutes in 1997, ~88 minutes in 2007.
“Memory becomes HDD, HDD becomes Tape,
and Tape is dead”, by Jim Gray
Today‘s memory is ~102,400 times faster than HDD
• Memory : 83 ns(250 cycles)
• HDD : 8.5ms (25.7 million cycles)
(256 pages/MB ÷ 116 accesses/s) × ($61.94 per disk ÷ $0.0225 per MB of RAM) ≈ 6,080 seconds ≈ 101 minutes
=> Cache your data in memory whenever possible.
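A minimal sketch of the break-even computation above, plugging in the slide's 2010 numbers (4KB pages, ~116 random accesses/s per disk, a $61.94 disk, RAM at $22.5/GB); the figures are illustrative only:

    /* Five-minute-rule break-even interval, using the 2010 numbers above. */
    #include <stdio.h>

    int main(void) {
        double pages_per_mb_ram = 256.0;    /* 1MB / 4KB pages              */
        double accesses_per_sec = 116.0;    /* ~1 / 8.5ms random access     */
        double price_per_disk   = 61.94;    /* USD per 1TB drive (Aug 2010) */
        double price_per_mb_ram = 0.0225;   /* USD, i.e., $22.5 per GB      */

        double breakeven = (pages_per_mb_ram / accesses_per_sec)
                         * (price_per_disk / price_per_mb_ram);
        printf("break-even interval: %.0f s (~%.0f minutes)\n",
               breakeven, breakeven / 60.0);  /* ~6,080 s, ~101 minutes     */
        return 0;
    }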
Invited talk @ ETRI, © Kyong-Ha Lee 7
8. Latency lags bandwidth
From 1983 to 2003[2]
• Capacity increased ~ 2,500 times
(0.03GB -> 73.4GB)
• Bandwidth improved 143.3 times
(0.6 MB/s -> 86 MB/s)
• Latency improved 8.5 times (48.3 ->
5.7 ms)
Why?
• Moore‘s law helps bandwidth more
than latency
• Distance limits latency
• Bandwidth is generally easier to sell
• Latency helps bandwidth but not
vice versa (e.g., spinning the disk faster)
• Bandwidth hurts latency (e.g., buffering)
• OS overhead hurts latency
Invited talk @ ETRI, © Kyong-Ha Lee 8
9. Latency vs. Bandwidth
Latency can be handled by
• Hiding (or tolerating) it – out of order issue, non blocking cache,
prefetching
• Reducing it –better cache
Parallelism sometimes helps to hide latency
• MLP(Memory Level Parallelism) - multiple outstanding cache misses
overlapped
• But increased bandwidth demand
Latency ultimately limited by physics
Bandwidth can be handled by "spending" more (HW cost)
• Wider buses and interfaces, interleaving
Bandwidth improvement usually increases latency
• No free lunch
Hierarchies decrease bandwidth demand to lower levels
• Serve as traffic filters: a hit in L1 is filtered from L2
• If average bandwidth is not met -> infinite queues
Invited talk @ ETRI, © Kyong-Ha Lee 9
10. NVRAM Storage: Solid State Disk
Intel X25-M Mainstream(50nm) 160GB
• Read/write latency 85/115 us
• Random 4KB read/write: 35K/3.3K IOPS
• Sustained sequential read/write: 250/70MB/s
• 1GB for 3,619 won in Aug 2010
SSD has successfully occupied the position
between memory and HDD
• best suited for sequential read/write
• e.g., logging device
Invited talk @ ETRI, © Kyong-Ha Lee 10
11. Features of SSD
No mechanical latency
• Flash memory is an electronic device with no moving parts
• Provides uniform random access speed without seek/rotational latency
No in-place update
• Data on a page cannot be updated in place; the containing block must be erased first
• An erase unit (or block) is much larger than a page
Limited lifetime
• MLC : 0.1M times of writes, SLC : 1M times of writes
• Wear-leveling
Asymmetric read & write speed
• Read speed is typically at least 3X faster than write speed
• Write (and erase) optimization is critical
Asymmetric seq. vs. random I/O performance
• Random 4KB read/write: 35K/3.3K IOPS
• i.e., 140MB/s vs. 13.2MB/s in aggregate throughput
• Sustained sequential read/write: 250/70MB/s
"Disk" Abstraction
• LBA(or LPA) -> (channel#, plane#, … ) or just PBA(or PPA)
• This mapping changes each time a page write is performed
• The controller must maintain a mapping table in RAM or Flash
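To make the LBA -> PBA mapping above concrete, here is a deliberately simplified page-level FTL sketch; all names are hypothetical, and a real controller additionally manages erase blocks, garbage collection and wear-leveling:

    /* Simplified page-level FTL: every write goes to a fresh physical page
     * and the in-RAM mapping table is updated; nothing is updated in place. */
    #include <stdint.h>

    #define NUM_LOGICAL_PAGES (1u << 20)

    static uint32_t l2p[NUM_LOGICAL_PAGES];   /* logical -> physical page   */
    static uint32_t next_free_ppa;            /* naive log-structured alloc */

    uint32_t ftl_read(uint32_t lpa) { return l2p[lpa]; }

    uint32_t ftl_write(uint32_t lpa) {
        uint32_t ppa = next_free_ppa++;       /* allocate a fresh page      */
        l2p[lpa] = ppa;                       /* remap; the old physical    */
        return ppa;                           /* page becomes invalid and   */
    }                                         /* is reclaimed later by GC   */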
Invited talk @ ETRI, © Kyong-Ha Lee 11
12. Memory wall
Latencies
• CPU stalls because of time spent waiting on memory accesses
• latency for memory access: ~250 cycles
• ~249 instruction slots are wasted waiting for data from memory
Solution: CPU Caching!!
Invited talk @ ETRI, © Kyong-Ha Lee 12
13. Why Caching?
Processor speeds are projected to increase about 70% per year
for many years to come. This trend will widen the speed gap
between memory and processors. Caches will get larger, but
memory speed will not keep pace with processor speeds
Low-latency memory that hides memory access latency
• Static RAM vs. Dynamic RAM
• 3ns(L1)~14ns(L2) vs. 83ns
Small capacity with support of locality
• Temporal locality
• Recently referenced items are likely to be referenced in the near
future
• Capacity limits the # of items to be kept in the cache at a time
• L1$ in Intel Core i7 is 32KB
• Spatial locality
• Items with nearby addresses tend to be referenced close together in
time
• the size of one cache line
• e.g., a cache line size in Intel Core i7 is 64B
• So 32K/64B = 512 cache lines
Invited talk @ ETRI, © Kyong-Ha Lee 13
14. An Example Memory Hierarchy
Source: Computer Systems,
A Programmer‘s Perspective, 2003
Invited talk @ ETRI, © Kyong-Ha Lee 14
15. Memory Mountain in 2000
32B cache line size
Source: Computer Systems,
A Programmer‘s Perspective, 2003
Invited talk @ ETRI, © Kyong-Ha Lee 15
16. Memory Mountain in 2010
Intel Core i7
2.67GHz
32KB L1 d-cache
256KB L2 cache
8MB L3 cache
64B cache line size
source: http://csapp.cs.cmu.edu/public/perspective.html
Invited talk @ ETRI, © Kyong-Ha Lee 16
17. CPU Cache Structure
Source: Computer Systems,
A Programmer‘s Perspective, 2003
Invited talk @ ETRI, © Kyong-Ha Lee 17
18. Addressing Caches
Source: Computer Systems,
A Programmer‘s Perspective, 2003
Invited talk @ ETRI, © Kyong-Ha Lee 18
19. Types of Cache Misses
Cold miss (or compulsory miss)
• Occurs on the first reference, when the data has not been loaded yet
Capacity miss
• Because of limited capacity
• must evict a victim to make space for replacement block
• LRU or LFU
Conflict miss
• involves cache thrashing
• can be alleviated by associative cache
• e.g., 8-way set associative cache in Core2 Duo
Coherence miss
• Data consistency between caches
Invited talk @ ETRI, © Kyong-Ha Lee 19
20. Cache Performance
Metrics
• Miss rate: # of misses/# of references
• The fraction of memory references that miss during the execution of a
program
• Hit rate: # of hits/# of references
• Hit time: the time to deliver a word in the cache to the CPU
• Miss penalty: any additional time required because of a miss.
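These metrics combine into the usual average memory access time, AMAT = hit time + miss rate × miss penalty; a minimal sketch, with numbers that roughly echo slide 3's L1 and memory latencies (the 5% miss rate is made up):

    /* AMAT = hit_time + miss_rate * miss_penalty */
    double amat(double hit_time_ns, double miss_rate, double miss_penalty_ns) {
        return hit_time_ns + miss_rate * miss_penalty_ns;
    }
    /* e.g., amat(1.0, 0.05, 83.0) = 1 + 4.15 = 5.15 ns per reference */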
Impact of :
• Cache size: reduces capacity misses and increases both hit rate
and hit time
• Cache line size: larger lines exploit spatial locality but leave fewer
lines, hurting temporal locality
• Associativity
• Fully associative: no conflict misses, but requires searching all cache
lines on a lookup
• Direct-mapping: conflict miss
Invited talk @ ETRI, © Kyong-Ha Lee 20
21. Writing Cache-Friendly Codes
Maximize the two localities in your program
• Remove as many pointers as possible
• Increases spatial locality (at the price of higher update cost)
• Fit the working data into a cache line and into the
capacity of the cache
• Increases spatial and temporal locality
• Use working data as often as possible once it has
been read from memory (see the sketch below)
Software prefetching
• Reduces cold misses
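A minimal sketch of the advice above: with a row-major array, traversing row by row touches consecutive addresses (one miss per cache line), while traversing column by column touches a new line on almost every access.

    #define N 1024
    double a[N][N];                          /* row-major 2D array          */

    double sum_rowwise(void) {               /* cache-friendly: 8B stride   */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_colwise(void) {               /* cache-hostile: 8KB stride   */
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }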
Invited talk @ ETRI, © Kyong-Ha Lee 21
22. Example: Matrix Multiplication
*Assumptions:
• Row-major order
• Cache block = 8 doubles
• Cache size C << n
• Three B×B blocks fit into the cache: 3B^2 < C
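A hedged sketch of the cache-blocked multiplication the slide refers to, under the assumptions above (row-major arrays, block size B chosen so that three BxB tiles of doubles fit in the cache; B = 32 is an illustrative choice):

    #define N 1024
    #define B 32                               /* 3*B*B*8 bytes = 24KB < C  */
    double A[N][N], Bmat[N][N], Cmat[N][N];    /* Cmat assumed zeroed       */

    void matmul_blocked(void) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    /* one BxB tile product; operands stay cache-resident   */
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++) {
                            double aik = A[i][k];
                            for (int j = jj; j < jj + B; j++)
                                Cmat[i][j] += aik * Bmat[k][j];
                        }
    }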
Invited talk @ ETRI, © Kyong-Ha Lee 22
23. SW Prefetching
Loop unrolling
• for (int i = 0; i < N-4; i += 4) {   // inner product of double a[] and b[]
      prefetch(&a[i+4]);               // 32-bit machine with 32B cache line size
      prefetch(&b[i+4]);               // (prefetch() stands for a cache-prefetch hint)
      ip = ip + a[i]   * b[i];
      ip = ip + a[i+1] * b[i+1];
      ip = ip + a[i+2] * b[i+2];
      ip = ip + a[i+3] * b[i+3]; }
  [diagram: one 32B cache line holds a[i] a[i+1] a[i+2] a[i+3], another holds b[i] b[i+1] b[i+2] b[i+3]]
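A compilable variant of the loop above, assuming GCC/Clang's __builtin_prefetch intrinsic and the slide's one-iteration (32B) prefetch distance; the remainder loop is added so any n works:

    double dot_prefetch(const double *a, const double *b, int n) {
        double ip = 0.0;
        int i;
        for (i = 0; i + 7 < n; i += 4) {       /* keeps i+4..i+7 in bounds  */
            __builtin_prefetch(&a[i + 4]);     /* fetch next 32B of a[]     */
            __builtin_prefetch(&b[i + 4]);     /* fetch next 32B of b[]     */
            ip += a[i]     * b[i];
            ip += a[i + 1] * b[i + 1];
            ip += a[i + 2] * b[i + 2];
            ip += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)                     /* remainder elements        */
            ip += a[i] * b[i];
        return ip;
    }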
Data linearization
• e.g., a preorder traversal of a 7-node binary tree (root 1; children 2, 3;
leaves 4, 5, 6, 7) lays the nodes out contiguously as 1 2 4 5 3 6 7
Invited talk @ ETRI, © Kyong-Ha Lee 23
24. Optimizations in Modern
Microprocessor
Pipelining (Intel i486)
• utilizes ILP(Instruction Level
Parallelism)
• increases throughput but not
latency.
Out-of-order execution(Intel P6)
• 96-sized inst. window(Core2), 128-
sized inst. window(Nehalem)
• in-order processor(Intel Atom,
GPU)
Superscalar(Intel P5)
• 3-wide(Core2) , 4-wide(Nehalem)
Invited talk @ ETRI, © Kyong-Ha Lee 24
25. Simultaneous Multi-threading(from Intel Pentium4)
• TLP(Thread-Level Parallelism)
• Hardware multi-threading
• Support of HW-level context switching
• issues multiple instructions from multiple threads in one cycle.
• HT(Hyper Threading) is Intel‘s term for SMT
SIMD(Single Instruction Multiple Data)(Intel Pentium III)
• DLP(Data-Level Parallelism)
• 128bit SSE(Streaming SIMD Extensions) for x86 architecture
Invited talk @ ETRI, © Kyong-Ha Lee 25
26. Branch prediction and speculative execution
• guess which way a branch will go before this is known for
sure
• To improve the flow in the ILP
Hardware prefetching
• Hiding latency by fetching data from memory in advance
• Advantage
• No need to add any instruction overhead to issue prefetches
• No SW cost
• Disadvantage
• Cache pollution
• Bandwidth can be wasted
• H/W cost and compatibility
Invited talk @ ETRI, © Kyong-Ha Lee 26
27. Speed of a Program
CPI (Cycles Per Instruction) vs.
IPC (Instructions Per Cycle)
MIPS (Million Instructions Per Second)
• FLOPS (Floating-point Operations Per Second)
• GFLOPS, TFLOPS
T = N x CPI x T_cycle
Improvement
• Reduce the # of instructions
• Reduce CPI
• Increase clock speed
Invited talk @ ETRI, © Kyong-Ha Lee 27
28. Virtuous Cycle, circa 1950 – 2005
[Cycle diagram: increased processor performance -> larger, more
feature-full software -> slower programs -> demand for still more
processor performance; enabled by higher-level languages & abstractions
and larger development teams]
World-Wide Software Market (per IDC):
$212b (2005) -> $310b (2010)
Invited talk @ ETRI, © Kyong-Ha Lee 28
29. Virtuous Cycle, circa 2005-??
[Same cycle diagram, but the "Increased processor performance" node is
crossed out: slower programs no longer get a free speedup]
GAME OVER — NEXT LEVEL?
Thread-Level Parallelism & Multicore Chips
(higher-level languages & abstractions and larger development teams
remain in the loop)
Invited talk @ ETRI, © Kyong-Ha Lee 29
30. CMP(Chip Level Multiprocessor)
Apple Inc. starts to sell a 12-core Mac Pro (in Aug 2010)
• "The new Mac Pro offers two advanced processor options from Intel.
The Quad-Core Intel Xeon 'Nehalem' processor is available in a
single-processor, quad-core configuration at speeds up to 3.2GHz.
For even greater speed and power, choose the 'Westmere' series,
Intel's next-generation processor based on its latest 32-nm process
technology. 'Westmere' is available in both quad-core and 6-core
versions, and the Mac Pro comes with either one or two processors.
Which means that you can have a 6-core Mac Pro at 3.33GHz, an 8-
core system at 2.4GHz, or, to max out your performance, a 12-core
system at up to 2.93GHz." from the Apple homepage
Invited talk @ ETRI, © Kyong-Ha Lee 30
31. Multicore
Moore's law is still valid
• "The # of transistors on an integrated circuit has doubled
approximately every other year." - Gordon E. Moore, 1965
Obstacles to increasing clock speed
• Power density problem
• "Can soon put more transistors on a chip than can afford to
turn on" - Patterson '07
• Heat problem
• e.g., Intel Pentium IV Prescott (3.7GHz) in 2004
Limits in Instruction-Level Parallelism (ILP)
=> The emergence of Multicore !!
[Die photos: Intel Core 2 Duo (two cores sharing an L2 cache) and
Intel Core i7]
Invited talk @ ETRI, © Kyong-Ha Lee 31
32. Chip density is continuing to increase ~2x every 2 years
Clock speed is not increasing
Number of processor cores may double instead
There is little or no hidden parallelism (ILP) to be found
Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Invited talk @ ETRI, © Kyong-Ha Lee 32
33. Can soon put more transistors on a chip than can afford to turn on.
-- Patterson '07
Scaling clock speed (business as usual) will not work
[Chart: power density (W/cm^2, log scale from 1 to 10,000) vs. year
(1970-2010) for Intel chips 4004, 8008, 8080, 8085, 8086, 286, 386, 486,
Pentium and P6, with reference levels for a hot plate, a nuclear reactor,
a rocket nozzle and the Sun's surface. Source: Patrick Gelsinger, Intel]
Invited talk @ ETRI, © Kyong-Ha Lee 33
34. Parallelism Saves Power
Exploit explicit parallelism for reducing power
Power = C * V^2 * F (C: capacitance, V: voltage, F: frequency)
Performance = Cores * F
With 2x cores at half frequency (and hence half voltage):
Power = (2C) * (V/2)^2 * (F/2) = (C * V^2 * F) / 4
Performance = 2 * Cores * (F/2) = Cores * F
• Using additional cores
– Increases density (= more transistors = more capacitance)
– Can increase cores (2x) and performance (2x)
– Or increase cores (2x), but decrease frequency (1/2):
same performance at ¼ the power
• Additional benefits
– Small/simple cores -> more predictable performance
Invited talk @ ETRI, © Kyong-Ha Lee 34
35. Amdahl‘s law
Two basic metrics
• Speedup(N) = T(1) / T(N)
• Efficiency(N) = Speedup(N) / N
Recall Amdahl's law [1967]
• Speedup = 1 / ((1 - F) + F/N), where F is the parallel fraction and N
the number of processors
• Simple SW assumption
• No overhead for
• Scheduling, communication, synchronization, and so on
• e.g., with F = 0.9 and N = 16, speedup = 1 / (0.1 + 0.9/16) ≈ 6.4
Invited talk @ ETRI, © Kyong-Ha Lee 35
36. Types of multicore
Symmetric multicore
• e.g., Core 2 Duo, i5, i7, Xeon octo-core
Assume that
• Each Chip is Bounded to N BCEs (Base Core Equivalents) for all cores
• Each Core consumes R BCEs
• Assume Symmetric Multicore = All Cores Identical
• Therefore, N/R Cores per Chip — (N/R)*R = N
• For an N = 16 BCE Chip:
Sixteen 1-BCE cores Four 4-BCE cores One 16-BCE core
Invited talk @ ETRI, © Kyong-Ha Lee 36
37. Performance of Symmetric
Multicore Chips
Serial Fraction 1-F uses 1 core at rate Perf(R)
Serial time = (1 – F) / Perf(R)
Parallel Fraction uses N/R cores at rate Perf(R) each
Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
Therefore, w.r.t. one base core:
Symmetric Speedup = 1 / ( (1-F)/Perf(R) + F*R/(Perf(R)*N) )
Implications?
Enhanced Cores speed Serial & Parallel
Invited talk @ ETRI, © Kyong-Ha Lee 37
38. Symmetric Multicore Chip, N = 16
BCEs
[Chart: symmetric speedup (0-16) vs. R BCEs per core (1, 2, 4, 8, 16,
i.e., 16, 8, 4, 2, 1 cores) for F = 0.5, 0.9, 0.975, 0.99 and 0.999.
E.g., F=1, R=1 gives 16 cores and speedup 16; F=0.9, R=2 gives 8 cores
and speedup 6.7]
F matters: Amdahl‘s Law applies to multicore chips
Many researchers should target parallelism (F) first
As Moore's Law increases N, often need enhanced core designs
Some architecture researchers target single-core performance
Invited talk @ ETRI, © Kyong-Ha Lee 38
39. Asymmetric multicore
• Cell Broadband Engine in PS3
• 1 PPE (Power Processor Element) and 8 SPEs (Synergistic
Processor Elements)
Each Chip Bounded to N BCEs (for all cores)
One R-BCE Core leaves N-R BCEs
Use N-R BCEs for N-R Base Cores
Therefore, 1 + N - R Cores per Chip
For an N = 16 BCE Chip:
Symmetric: Four 4-BCE cores Asymmetric: One 4-BCE core
& Twelve 1-BCE base cores
Invited talk @ ETRI, © Kyong-Ha Lee 39
40. Performance of Asymmetric
Multicore Chips
Serial Fraction 1-F same, so time = (1 – F) /
Perf(R)
Parallel Fraction F
• One core at rate Perf(R)
• N-R cores at rate 1
• Parallel time = F / (Perf(R) + N - R)
Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ( (1-F)/Perf(R) + F/(Perf(R) + N - R) )
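The two speedup formulas can be sketched directly in code; the perf(R) = sqrt(R) model below is the illustrative assumption used in [3], not something fixed by the slides:

    #include <math.h>

    double perf(double r) { return sqrt(r); }          /* assumed model     */

    double speedup_symmetric(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
    }

    double speedup_asymmetric(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
    }
    /* e.g., speedup_symmetric(0.9, 16, 2) ~= 6.7, matching slide 38 */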
Invited talk @ ETRI, © Kyong-Ha Lee 40
41. Asymmetric Multicore Chip, N =
256 BCEs
[Chart: asymmetric speedup (0-250) vs. R BCEs for the single enhanced
core (1, 2, 4, ..., 256) for F = 0.5, 0.9, 0.975, 0.99 and 0.999; the
chip has 1 enhanced core plus 256 - R base cores, from 256 cores at
R=1 down to 1 core at R=256]
Number of Cores = 1 (Enhanced) + 256 – R (Base)
How do Asymmetric & Symmetric speedups compare?
Invited talk @ ETRI, © Kyong-Ha Lee 41
42. Other laws
Gustafson's law
• Scaled speedup = N - α(N - 1), where α = 1 - f is the serial fraction
Karp-Flatt metric
• an efficient way to estimate the serial fraction from measurements of
real code
• e = (1/ψ - 1/N) / (1 - 1/N), where ψ is the speedup measured on N processors
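A one-function sketch of the Karp-Flatt estimate above (psi is the measured speedup on N processors):

    double karp_flatt(double psi, double n) {
        return (1.0 / psi - 1.0 / n) / (1.0 - 1.0 / n);
    }
    /* e.g., a measured speedup of 6.7 on 8 processors gives e ~= 0.028 */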
Invited talk @ ETRI, © Kyong-Ha Lee 42
43. Multicore Makes The Memory Wall Problem Worse
Assume that each core requires 2GB/s of bandwidth to access memory
What if 6 cores access the memory at a time? => 12GB/s >> FSB bandwidth
A prefetching scheme that is appropriate for a uniprocessor may be entirely
inappropriate for a multiprocessor [22].
[Chart: total CPU cycles of a query (0 to 5.0E+09) on 1 core vs. 8 cores,
broken down into Memory, DTLB miss, L2 hit, Branch misprediction and
Computation]
http://spectrum.ieee.org/computing/hardware/multicore-is-bad-news-for-supercomputers
Solution: Sharing memory access
Invited talk @ ETRI, © Kyong-Ha Lee 43
44. GPU
DLP(Data-Level Parallelism)
GPU has become a powerful computing
engine behind scientific computing and data-
intensive applications
Many lightweight in-order cores
Separate caches and memories from the host CPU
GPGPU applications are data-intensive,
handling long-running kernel executions (10-
1,000s of ms) and large data units (1-100s of
MB)
Invited talk @ ETRI, © Kyong-Ha Lee 44
45. GPU Architecture
NVIDIA GTX 512
16 Streaming Multiprocessors(SM), each of which
consists of 32 Stream Processors(SPs), resulting in 512
cores in total.
All threads running on SPs share the same program
called kernel
An SM works as an independent SIMT processor.
Invited talk @ ETRI, © Kyong-Ha Lee 45
46. Levels of Parallel Granularity and
Memory sharing
A thread block is a batch of
threads that can cooperate
with each other by:
• Synchronizing their execution
• For hazard-free shared
memory accesses
• Efficiently sharing data
through a low latency shared-
memory
Two threads from two
different blocks cannot
cooperate
Invited talk @ ETRI, © Kyong-Ha Lee 46
47. Four Execution Steps
The DMA controller transfers data from
host(CPU) memory to device(GPU) memory
A host program instructs the GPU to launch
the kernel
The GPU executes threads in parallel
The DMA controller transfers result from
device memory to host memory
Warp; a basic execution(or scheduling) unit
of SM, a group of 32 threads sharing the
same instruction pointer; all threads in a
warp take the same code path.
Invited talk @ ETRI, © Kyong-Ha Lee 47
48. Comparison with CPU
CPU
• Maximizes ILP to accelerate a small # of threads
• Devotes large caches and sophisticated control planes to advanced
features, e.g., superscalar, OoO execution, branch prediction, and
speculative loads
• Latency hiding is limited by CPU resources
• Limited memory bandwidth (32GB/s for X5550)
GPU
• Maximizes thread-level parallelism
• Devotes most of its die area to a large array of ALUs
• Memory stalls can be effectively hidden with a large enough number
of threads
• Large memory bandwidth (177.4GB/s for GTX480)
Invited talk @ ETRI, © Kyong-Ha Lee 48
49. GPU Programming Considerations
What to offload
• Computation- and memory-intensive algorithms with
high regularity are well suited to GPU acceleration
How to parallelize
Data structure usage
• Simple data structures such as arrays are
recommended
Divergence in GPU code
• SIMT demands minimal code-path divergence
caused by data-dependent conditional branches
within a warp
Expensive host-device memory transfer cost
Invited talk @ ETRI, © Kyong-Ha Lee 49
50. FPGA(Field Programmable Gate
Array)
Von Neumann
architecture vs.
Hardware architecture
An integrated circuit
designed to be configured
by the customer
The configuration is specified
using an HDL (Hardware
Description Language)
Invited talk @ ETRI, © Kyong-Ha Lee 50
51. Limitations of FPGA
Area/speed tradeoff
• Finite number of CLBs on a single die
• Becomes slower and more power-
hungry as the logic becomes more complex
Acts as hard-wired logic once it is "cooked" (configured)
No support for recursive calls
Asynchronous design
Less power-efficient
Invited talk @ ETRI, © Kyong-Ha Lee 51
52. Is DB execution cache-friendly?
DB Execution Time Breakdown (in 2005)
At least 50% of cycles are spent on stalls
Memory access is major bottleneck
Branch mispredictions increase cache misses
Invited talk @ ETRI, © Kyong-Ha Lee 52
53. Modern DB techniques
• Cache-conscious
• Cache-friendly data placement
• Data cache
• Cache-conscious data structures
• Buffering index structures
• Hiding latency using prefetching
• Cache-conscious join
• Instruction cache
• Buffering
• Staged database execution
• Branch prediction
• Reduce branches and SIMD
• CMP and multithreading
• Memory scan sharing
• Staged DB execution
• GPGPU
• SIMT
• FPGA
• Von-Neumann vs. HW circuit
Invited talk @ ETRI, © Kyong-Ha Lee 53
54. Record Layout Schemes
Example query: SELECT name FROM R WHERE age > 50
PAX optimizes cache-to-memory communication but
retains NSM's I/O behavior (page contents do not change)
(a) NSM (N-ary Storage Model) (b) DSM (Decomposed Storage Model) or
column-based (c) PAX (Partition Attributes Across)
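An illustrative sketch of the two extremes for a relation R(name, age): NSM keeps whole tuples together, DSM keeps one array per attribute, so the example predicate on age scans only the age column (names and sizes are hypothetical):

    #define MAX_TUPLES 100000

    struct tuple_nsm { char name[32]; int age; };  /* row store (NSM)       */
    struct tuple_nsm r_nsm[MAX_TUPLES];

    struct relation_dsm {                           /* column store (DSM)   */
        char name[MAX_TUPLES][32];
        int  age[MAX_TUPLES];
    };

    int count_older(const struct relation_dsm *r, int n, int limit) {
        int cnt = 0;
        for (int i = 0; i < n; i++)   /* touches only the age column        */
            if (r->age[i] > limit) cnt++;
        return cnt;
    }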
Invited talk @ ETRI, © Kyong-Ha Lee 54
55. Main-Memory Tree Indexes
T-tree: Balanced-binary tree proposed in 1986 for
MMDB
• Aim: balance space overhead with searching time.
Main-memory B+-trees: better cache performance [4]
Node width = cache line size (32-128B)
Minimizes the number of cache misses per search
But the tree becomes much taller than a traditional disk-based B+-tree
=> more cache misses
How to make the B+-tree shallow?
Invited talk @ ETRI, © Kyong-Ha Lee 55
56. Cache Sensitive B +-tree
Lay out child nodes contiguously
Eliminate all but one child pointer (see the sketch below)
• keys in one node fit in one cache line
• Removing pointers increases the fanout of the tree, which
results in a reduced tree height
• 35% faster tree lookups
• Update performance is 30% worse (splits)
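A rough sketch of a CSB+-tree internal node as described above, sized to one 64B cache line; the field names and widths are illustrative, not the paper's exact layout:

    #include <stdint.h>

    struct csb_node {
        uint16_t nkeys;                  /* number of valid keys            */
        struct csb_node *first_child;    /* start of contiguous child group */
        int32_t keys[12];                /* keys fill the rest of the line  */
    };                                   /* ~64 bytes in total              */
    /* descend: i = number of keys <= search key; next = node->first_child + i */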
Invited talk @ ETRI, © Kyong-Ha Lee 56
57. Buffering Index Structures
buffering accesses to the index structure to avoid cache
thrashing
Nodes in the index tree are grouped together into
pieces that fit within the cache
Increase temporal locality but accesses can be delayed
Invited talk @ ETRI, © Kyong-Ha Lee 57
58. Prefetching B+-tree
Idea: Larger nodes + prefetching
Node size = multiple cache lines (e.g., 8 lines)
Prefetch all lines of a node before searching it
Cost to access a node only increases slightly
Much shallower tree, no changes required
Improves both search and update performance
Invited talk @ ETRI, © Kyong-Ha Lee 58
59. Fractal pB+-tree
For faster range scan
• Leaf parent nodes contain addresses of all leaves
• Link leaf parent nodes together
• Use this structure for prefetching leaf nodes
* A prefetching scheme that is appropriate for a
uniprocessor may be entirely inappropriate for a
multiprocessor [22].
Invited talk @ ETRI, © Kyong-Ha Lee 59
60. Cache-Conscious Hash Join
For good temporal
locality, two relations to
be joined are partitioned
into partitions that fit in
the data cache.
To reduce TLB misses
caused by a large fanout (big H), use
radix hashing
• Within a cluster, the # of
random accesses is low
• a large number of
clusters can be created by
making multiple passes
through the data (see the sketch below)
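A sketch of one radix-partitioning pass in the spirit of the description above: B bits of the key select one of 2^B clusters, and keeping 2^B small per pass keeps the set of output positions TLB- and cache-resident; multiple passes refine the clusters.

    #include <stdint.h>
    #include <stddef.h>

    /* Partition (key, payload) pairs into 2^bits clusters on key bits
     * [shift, shift+bits). hist must hold 2^bits counters. */
    void radix_partition(const uint32_t *key, const uint32_t *payload, size_t n,
                         uint32_t *out_key, uint32_t *out_payload,
                         size_t *hist, int bits, int shift) {
        size_t fanout = (size_t)1 << bits;
        size_t offsets[1 << 8];                /* assumes bits <= 8 per pass */
        size_t mask = fanout - 1;

        for (size_t c = 0; c < fanout; c++) hist[c] = 0;
        for (size_t i = 0; i < n; i++)         /* pass 1: histogram          */
            hist[(key[i] >> shift) & mask]++;

        size_t sum = 0;                        /* prefix sum = start offsets */
        for (size_t c = 0; c < fanout; c++) { offsets[c] = sum; sum += hist[c]; }

        for (size_t i = 0; i < n; i++) {       /* pass 2: scatter            */
            size_t c = (key[i] >> shift) & mask;
            out_key[offsets[c]]     = key[i];
            out_payload[offsets[c]] = payload[i];
            offsets[c]++;
        }
    }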
Invited talk @ ETRI, © Kyong-Ha Lee 60
62. a group
Invited talk @ ETRI, © Kyong-Ha Lee 62
63. Buffering tuples btw. operators
Group consecutive operators into execution groups
whose operators' code fits into the L1 I-cache
Buffer the output of each execution group
I-cache misses are amortized over multiple tuples
and I-cache thrashing is avoided
Invited talk @ ETRI, © Kyong-Ha Lee 63
64. How SMTs can help DB
performance
Bi-threaded: partition input, cooperative
threads
Work-ahead-set: main thread + helper thread
• Main thread posts ―work-ahead set‖ to a queue
• Helper thread issues load instructions for the
requests
Invited talk @ ETRI, © Kyong-Ha Lee 64
65. Staged Database Execution Model
A TX may be divided into stages that fit in the L1 I-cache
When one tx reaches the end of a stage, the system switches
context to a different thread that needs to execute the
same stage.
[Diagram: two transactions broken into the same stages --
Stage S0: LOAD X, STORE Y; Stage S1: LOAD Y, ..., STORE Z;
Stage S2: LOAD Z, ...]
Invited talk @ ETRI, © Kyong-Ha Lee 65
66. Stage Spawning
[Diagram: stage S0 (LOAD X, STORE Y) runs on Core 0, stage S1
(LOAD Y, ..., STORE Z) on Core 1, and stage S2 (LOAD Z, ...) on Core 2;
each core has a work-queue holding instances of its stage]
Invited talk @ ETRI, © Kyong-Ha Lee 66
67. Main-Memory Scan Sharing
• Memory scan sharing also increases temporal locality
• Too much sharing can cause cache thrashing
Invited talk @ ETRI, © Kyong-Ha Lee 67
68. Summary
Latency is a major problem
Cache-friendly programming is
indispensable
Chip-level multiprocessors must be
exploited for TLP
Facilitating diverse computing sources is a
challenge
Invited talk @ ETRI, © Kyong-Ha Lee 68
69. Further readings
1. Jim Gray, Gianfranco R. Putzolu, The 5 Minute Rule for Trading Memory for Disk Accesses and
The 10 Byte Rule for Trading Memory for CPU Time, SIGMOD 1987: 395-398
2. David A. Patterson, Latency lags bandwidth, CACM, Vol. 47, No. 10 pp. 71—75, 2004
3. Mark Hill et al., Amdahl's law in the multicore era, IEEE Computer, Vol. 41, No. 7 pp. 33-38,
2008
4. J. Rao and et al., Cache Conscious Indexing for Decision-Support in Main Memory
5. P. A. Boncz et al., Breaking the memory wall in MonetDB, CACM, Dec 2008
6. Shimin Chen and et al., Improving Hash Join Performance through Prefetching, ICDE 2004
7. Jingren Zhou and et al., Implementing Database Operations Using SIMD instructions, SIGMOD
2002
8. J. Cieslewicz and K.A. Ross, Database Optimizations for Modern Hardware, Proceedings of the
IEEE 96(5), 2009
9. Lawrence Spracklen et al., Chip Multithreading: Opportunities and Challenges
10. Nikos Hardavellas and et al., Database Servers on Chip Multiprocessors: Limitations and
Opportunities, CIDR 2007
11. Lin Qiao and et al., Main-Memory Scan Sharing For Multi-Core CPUs, PVLDB 2008
12. Ryan Johnson and et al., To Share or Not to Share?, VLDB 2007
13. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs, VLDB 2009
14. Database Architectures for New Hardware, Tutorial, in the 30th VLDB, 2004 and in the 21st
ICDE 2005
15. Query Co-processing on Commodity Processors, Tutorial in the 22nd ICDE 2006.
16. John Nickolls and et al., GPU Computing Era, IEEE Micro March/April 2010
17. Kayvon Fatahalian and et al., A Closer Look at GPUs, CACM Vol. 51, No.10, 2008
Invited talk @ ETRI, © Kyong-Ha Lee 69
70. 18. John Nickolls and et al., Scalable Parallel Programming, March/April ACM Queue, 2008
19. N.K. Govindaraju et al., GPUTeraSort: high performance graphics co-processor sorting for large
database management, SIGMOD 2006
20. A. Mitra and et al., Boosting XML Filtering with a Scalable FPGA-based Architecture, CIDR 2009
21. S. Harizopoulos and A. Ailamaki and et al., Improving instruction cache performance in OLTP, ACM
TODS, vol. 31, pp. 887-920
22. T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory
multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87-106, 1991
*Courses available on Internet
Introduction to Computer Systems @CMU, 2000~2010
• http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f10/www/index.html
Multicore Programming Primer @MIT, 2007 (with video)
• http://groups.csail.mit.edu/cag/ps3/index.shtml
Introduction to Multiprocessor Synchronization @Brown
• http://www.cs.brown.edu/courses/cs176
Parallel Programming for Multicore @Berkeley, Spring 2007
• http://www.cs.berkeley.edu/~yelick/cs194f07/
Applications of Parallel Computing @Berkeley, Spring 2007
• http://www.cs.berkeley.edu/~yelick/cs267_sp07/
High-Performance Computing for Applications in Engineering @Wisc, Autumn 2008
• http://sbel.wisc.edu/Courses/ME964/2008/index.htm
High Performance Computing Training @Lawrence Livermore National Laboratory
• https://computing.llnl.gov/?set=training&page=index
Programming Massively Parallel Processors with CUDA @Stanford, Spring 2010 (with video)
• on Itunes U and Youtube.com
Invited talk @ ETRI, © Kyong-Ha Lee 70