1. DBA Level 400
( Title slide diagram: a CPU with two cores, each with its own L1 and L2 cache, sharing an L3 cache )
2. About me
An independent SQL consultant.
A user of SQL Server from version 2000 onwards, with 12+ years of experience.
Speaker, both at UK user group events and at conferences.
I have a passion for understanding how the database engine works at a deep level.
3. “Everything fits in memory, so performance is as good as it will get. It fits in memory, therefore end of story.”
5. Demonstration #1 Repeated With ‘Bigger’ Hardware
2 x CPU: 6 core 2.0 Ghz (Sandybridge).
48 Gb quad channel 1333 Mhz DDR3 memory.
Hyper-threading enabled, unless specified otherwise.
A warm large object cache is used in all tests to remove storage as a factor.
6. Which SELECT Statement Has The Lowest Elapsed Time ?
17.41Mb column store Vs. 51.7Mb column store
WITH generator AS (
SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a)
FROM (SELECT a = 1
FROM master.dbo.syscolumns) c1
CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
,CAST(((id - 1) % 1048576) AS money ) AS Price1
,CAST(((id - 1) % 1048576) AS money ) AS Price2
,CAST(((id - 1) % 1048576) AS money ) AS Price3
INTO FactInternetSalesBigNoSort
FROM generator
CROSS JOIN [dbo].[DimDate] d
CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON
FactInternetSalesBigNoSort
SELECT CalendarQuarter
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [dbo].[FactInternetSalesBigNoSort] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
WITH generator AS (
SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a)
FROM (SELECT a = 1
FROM master.dbo.syscolumns) c1
CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
,CAST(((id - 1) % 1048576) AS money ) AS Price1
,CAST(((id - 1) % 1048576) AS money ) AS Price2
,CAST(((id - 1) % 1048576) AS money ) AS Price3
INTO FactInternetSalesBigSorted
FROM generator
CROSS JOIN [dbo].[DimDate] d
CREATE CLUSTERED INDEX ccsi
ON FactInternetSalesBigSorted ( OrderDateKey )
CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON
FactInternetSalesBigSorted
WITH (DROP_EXISTING = ON)
SELECT CalendarQuarter
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [dbo].[FactInternetSalesBigSorted] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
7. The Case of The Two Column Store Index Sizes
SQL Server query tuning 101
The optimizer will always use the
smaller data structures it can
find to satisfy the query, right ?
8. How Well Do The Queries Using The Two Column Stores Scale ?
( Chart: elapsed time in ms ( 0 - 80,000 ) against degree of parallelism 2 - 24, comparing the non-sorted and sorted column stores )
Data creation statement scaled using TOP 300,000 to create 1,095,600,000 rows.
9. Can We Use All Available CPU Resource ?
( Chart: percentage CPU utilisation ( 0 - 100 ) against degree of parallelism 2 - 24, comparing the non-sorted and sorted column stores )
Memory access should be 100% CPU intensive ?!?
10. Wait Statistics Do Not Help Here !
Stats are for the query run with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data.
11. Spin Locks Do Not Provide Any Clues Either
Executes in 775 ms for a warm column store object pool.
12 cores x 2.0 Ghz x 0.775 s = 18,600,000,000 CPU cycles
Total spins 293,491
SELECT [CalendarQuarter]
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [FactInternetSalesBig] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)
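The cycle budget above can be sanity checked with a quick back-of-envelope calculation; a sketch only, assuming a flat 2.0 Ghz clock with no turbo, using the figures quoted on the slide:

```python
# Back-of-envelope check: total CPU cycles available to the query,
# versus the total spin count observed on the spin locks.
cores = 12
clock_hz = 2_000_000_000   # 2.0 Ghz
elapsed_s = 0.775          # 775 ms

total_cycles = round(cores * clock_hz * elapsed_s)
total_spins = 293_491

print(total_cycles)                # 18600000000
print(total_spins / total_cycles)  # ~1.6e-05, a vanishing fraction
```

At roughly 1.6 spins per 100,000 cycles, spin lock activity cannot account for where the CPU time is going.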
12. Well Documented Tools
Dynamic management views and functions
Performance counters
Extended events
Not all wait events, latches and spin locks are well documented, if documented at all.
Takeaway: These tools are not going to tell us definitively
where our CPU time is going !!!
13. The Problem With Query Plan Costs
Assumptions:
The buffer cache is cold.
IO cannot be performed in parallel.
Data in different columns is never correlated ( improved on in SQL Server 2014 ).
Hash distribution is always uniform.
Etc. . . .
Costs are based on the amount of time it took a developer's machine ( a Dell OptiPlex, according to legend ) to complete certain operations.
16. Introducing the Windows Performance Toolkit
Comes as part of the Windows Assessment and Deployment Kit.
Traces are created via xperf and Windows Performance Recorder.
It utilises Event Tracing for Windows (ETW).
Visualise traces using Windows Performance Analyzer.
The real power is being able to stack walk the database engine.
17. Public Symbols
These are labels contained in .pdb files that provide information on which programming construct generated a piece of machine code; also known as debug symbols.
Caution: public symbols are version specific, down to CU level !
18. Obtaining An ETW Trace Stack Walking The Database Engine
xperf -on base -stackwalk profile
( run the SQL statement under investigation )
xperf -d stackwalk.etl
( load the resulting trace into WPA )
19. Basic xperf Command Line Syntax
xperf -on <kernel flags | kernel groups> [ -stackwalk <stackwalk kernel providers> ]
Kernel groups are groups of flags, not to be confused with Windows kernel groups.
Takeaway: kernel groups make life easier.
20. ETW Tracing Is Generally Light Weight
This comes from the Premier Field Engineer blog post “Identifying the cause of SQL Server IO bottlenecks using XPerf”: it reduced my IO throughput from 3300 Mb/s to 300 Mb/s.
XPERF -on PROC_THREAD+LOADER+
FLT_IO_INIT+FLT_IO+FLT_FASTIO+FLT_IO_FAILURE+
FILENAME+FILE_IO+FILE_IO_INIT+
DISK_IO+HARD_FAULTS+DPC+
INTERRUPT+CSWITCH+PROFILE+DRIVERS+DISPATCHER
-stackwalk MiniFilterPreOpInit+MiniFilterPostOpInit+
CSWITCH+PROFILE+ThreadCreate+ReadyThread+
DiskReadInit+DiskWriteInit+DiskFlushInit+
FileCreate+FileCleanup+FileClose+FileRead+FileWrite
-BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular
Takeaway: start off with the smallest set of kernel providers you
can get away with and add incrementally.
21. The Database Engine Composition ( SQL Server 2012 onwards )
Database Engine
Language Processing: sqllang.dll
Runtime: sqlmin.dll, sqltst.dll, qds.dll
SQLOS: sqldk.dll, sqlos.dll
25. Core 2 Architecture
( Diagram: two cores, each with a 32KB L1 instruction cache, a 32KB L1 data cache and a 256K unified L2 cache, plus an L3 cache; a front side bus connects the CPU to the north bridge and south bridge )
26. Core 2 Architecture
Latency when talking to the IO and memory controllers, even before accessing memory or IO.
Single level TLB.
Hyper-threading via NetBurst, which delivered poor performance.
The design was not modular: a 4 core CPU was 2 x 2 core CPUs “bolted together”.
Only the last generation, “Dunnington”, had an L3 cache.
27. Core i Series Generation 1 ‘Nehalem’
( Diagram: cores, each with 32KB L1 instruction and data caches and a 256K unified L2 cache, share an L3 cache and a two-level TLB; the un-core houses the power and clock logic, QPI links and the memory controller, with the memory bus attached directly to the CPU )
Un-core:
Integrated memory controller.
NetBurst replaced with a new hyper-threading technology.
The Translation Lookaside Buffer ( TLB ), for caching logical to physical memory mappings, has two levels.
Front side bus replaced by the QuickPath Interconnect ( QPI ).
Genuine modular design.
28. Core i Series Generation 2 ‘Sandybridge’
( Diagram: as Nehalem, with an L0 uop cache added to each core, a bi-directional ring bus connecting the cores to the L3 cache, and a PCIe 2.0 controller integrated alongside the memory controller and QPI links )
Integrated PCIe 2.0 controller.
L0 uop cache.
L3 cache connected to the cores via a bi-directional ring bus.
Advanced Vector eXtensions v1.
Data Direct IO.
29. Cache Lines
The unit of transfer between memory and the CPU.
Cache lines are used to create “cache entries”; these are tagged with the requested memory location.
Data structures should be cache line aligned in order to avoid split register operations.
If different cores access the same cache line for writes, this effectively ‘bounces’ the line between the cores' caches
=> try to avoid threads sharing cache lines.
( Diagram: CPU and a 64 byte cache line )
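The cache line sharing point can be illustrated with a toy model; this is plain arithmetic, not a real profiler, and the 64 byte line size is the typical value for modern x86 CPUs:

```python
# Map byte offsets of per-thread counters onto 64 byte cache lines
# to see which threads would end up sharing a line.
CACHE_LINE = 64  # bytes

def cache_line(offset: int) -> int:
    """Index of the cache line a given byte offset falls in."""
    return offset // CACHE_LINE

# Two 8 byte counters packed next to each other land on the same line,
# so writes from different cores bounce that line between their caches.
packed = [cache_line(i * 8) for i in range(2)]

# Padding each counter out to 64 bytes gives every thread its own line.
padded = [cache_line(i * CACHE_LINE) for i in range(2)]

print(packed)  # [0, 0] -> false sharing
print(padded)  # [0, 1] -> no sharing
```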
30. CPU Cache, Memory and IO Sub System, Latency
( Diagram: four cores with L1 and L2 caches and a shared L3 cache; latencies run from ~1 ns for the L1 cache, through ~10 ns and ~100 ns for the lower cache levels and memory, to 10 us - 10 ms for the IO sub system )
but, memory is the new flash storage and CPU cache is the new RAM . . .
Slide borrowed from Thomas Kejser with his kind permission.
31. The CPU Cache Hierarchy Latencies In CPU Cycles
L1 cache sequential access: 4
L1 cache in-page random access: 4
L1 cache full random access: 4
L2 cache sequential access: 11
L2 cache in-page random access: 11
L2 cache full random access: 11
L3 cache sequential access: 14
L3 cache in-page random access: 18
L3 cache full random access: 38
Main memory: 167
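A rough model built from the cycle counts above shows why the access pattern matters so much; it is an illustration only, ignoring pre-fetching and the CPU's ability to overlap outstanding misses:

```python
# Cost of one million accesses served from L1 sequentially, versus the
# same accesses going to main memory, using the per-access latencies
# from the table (in CPU cycles).
latency_cycles = {
    "L1 sequential": 4,
    "L2 sequential": 11,
    "L3 sequential": 14,
    "main memory": 167,
}

accesses = 1_000_000
l1_cost = accesses * latency_cycles["L1 sequential"]
dram_cost = accesses * latency_cycles["main memory"]

print(dram_cost / l1_cost)  # 41.75x more cycles for the same work
```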
32. Main Memory Is Holding The CPU Back, Solutions . . .
Leverage the pre-fetcher as much as possible.
Larger CPU caches:
L4 cache => Crystalwell eDRAM.
DDR4 memory.
Bypass main memory.
Stacked memory:
Hybrid memory cubes ( Micron, Intel etc ).
High bandwidth memory ( AMD, Hynix ).
35. By Passing Main Memory With Data Direct IO
( Diagram: in the old world, inbound IO is written to main memory and then fetched into the CPU caches; in the new world with Data Direct IO, inbound IO is delivered directly into the L3 cache )
36. Transactions/s With and Without Data Direct IO
( Chart: single socket IO performance, transactions/sec for 2 x 10 GbE through 8 x 10 GbE, comparing the Xeon 5600 with the Xeon E5 )
37. The Case Of The Two Column Store Index Sizes
Could the CPU cycles required to
access the CPU cache versus main
memory be a factor ?
. . . let's dig deeper
38. Call Stack For Query Against Column Store On Non-Pre-Sorted Data
Hash agg lookup
weight 65,329.87
Column Store scan
weight 28,488.73
39. Where Is The Bottleneck In The Plan ?
( Query plan diagram with control flow and data flow arrows )
The stack trace indicates that the bottleneck is right here.
40. Does The Stream Aggregate Perform Any Better ?
Query takes seven minutes and nine seconds
to run.
Performance is killed by a huge row mode sort
prior to the stream aggregate !!!
( sorted column store )
41. Call Stack For Query Against Column Store On Pre-Sorted Data
Hash agg lookup weight: now 275.00, before 65,329.87.
Column store scan weight: now 45,764.07, before 28,488.73.
42. The Case Of The Two Column Stores
The hash aggregate using the column store created on pre-sorted data is very CPU efficient.
Why ?
43. SQL Server and Large Memory Pages, How Does It Work ?
( Diagram: on the CPU, consulting the DTLB ( 1st level ) and STLB ( 2nd level ) costs ~10s of CPU cycles; a miss that has to walk the page translation table in memory costs 160+ CPU cycles )
44. The TLB Without Large Page Support
With 32 TLB entries x 4Kb pages, 128Kb of logical to physical memory mappings are covered.*
* Nehalem through to Ivybridge architectures
45. The TLB With Large Page Support
With 32 TLB entries x 2Mb ( large ) pages, mapping coverage increases from 128Kb to 64Mb !!!
Fewer trips off the CPU to the page table.
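The coverage figures on the last two slides fall straight out of TLB reach arithmetic ( entries x page size ); 32 entries is the figure quoted for the Nehalem through Ivybridge architectures:

```python
# TLB reach = number of entries x page size.
ENTRIES = 32
KB = 1024
MB = 1024 * KB

small_page_reach = ENTRIES * 4 * KB   # 4Kb pages
large_page_reach = ENTRIES * 2 * MB   # 2Mb large pages

print(small_page_reach // KB)               # 128 (Kb)
print(large_page_reach // MB)               # 64 (Mb)
print(large_page_reach // small_page_reach) # 512x more coverage
```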
46. What Difference Do Large Pages Make ?
Large page support off: 130s elapsed time, 222,220 page lookups/s.
Large page support on: 93s elapsed time, 263,766 page lookups/s.
A 28% saving. ( Single 6 core socket )
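The quoted saving can be recomputed directly from the two elapsed times:

```python
# Elapsed times from the test: large page support off versus on.
without_large_pages = 130.0  # seconds
with_large_pages = 93.0      # seconds

saving_pct = (without_large_pages - with_large_pages) / without_large_pages * 100
print(round(saving_pct))  # 28 (% saving)
```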
47. CPU Pipeline Architecture
A ‘pipeline’ of logical slots runs through the processor.
The front end can issue four micro-ops per clock cycle.
The back end can retire up to four micro-ops per clock cycle.
( Diagram: allocation at the front end, retirement at the back end )
48. Pipeline ‘Bubbles’ Are Bad !
Empty slots are referred to as ‘bubbles’.
Causes of front end bubbles:
Bad speculation.
CPU stalls.
Data dependencies, e.g.
A = B + C
E = A + D
Back end bubbles can be due to excessive demand for specific types of execution unit.
49. A Basic NUMA Architecture
( Diagram: NUMA node 0 and NUMA node 1, each a four core CPU with per-core L1 and L2 caches and a shared L3 cache; each node has a local memory access path to its own memory and a remote memory access path to the other node's memory )
50. Four and Eight Node NUMA QPI Topologies ( Nehalem i7 onwards )
( Diagram: four socket and eight socket QPI interconnect topologies, with the CPUs cross-linked to each other and to IO hubs )
With 18 core Xeons in the offing, these topologies will become increasingly rare.
51. Remote Node Memory Access Comes At A Cost
An additional 20% overhead when
accessing ‘Foreign’ memory !
( from coreinfo )
52. Local Vs Remote Memory Access and Thread Locality
How does SQLOS schedule hyper-threads in relation to physical cores ?
( Diagram: CPU sockets 0 and 1, 6 cores per socket )
53. Making Use Of CPU Stalls With Hyper-Threading ( Nehalem i7 onwards )
1. Session 1 performs an index seek; the B-tree pages are not in the CPU cache.
2. A last level cache miss causes a CPU stall ( 160+ clock cycles ) whilst the page is retrieved from memory.
3. The ‘dead’ CPU stall cycles give the physical core the opportunity to run a 2nd thread.
54. Hyper-Threading Scalability For The ‘Sorted’ Column Store Index
( Chart: CPU utilisation against degree of parallelism 2 - 24; utilisation rises roughly linearly from 5% at DOP 2 to 59% at DOP 24 )
5% CPU utilisation per core.
60% of each core utilised by the first hyper thread.
55. The Rationale Behind SQL OS
One scheduler per logical processor.
The OS has no concept of SQL resources such as latches.
‘Hard’ context switches become soft user mode context switches.
SQL OS schedules threads by prioritizing L2 cache hits and reuse over ‘fairness’.
56. The LOGCACHE_ACCESS Spin Lock
( Diagram: sessions T0 .. Tn acquire the LOGCACHE_ACCESS spin lock to allocate a slot ( 1 .. 127 ) in the current log buffer and memcpy their log record content in; filled buffers go onto the log flush queue, the log writer issues async I/O against a completion port, and the thread which issued the commit is signalled on completion; LOGBUFFER and WRITELOG waits arise along this path )
This spin lock is the ultimate bottleneck for ACID transactions in SQL Server.
Slide borrowed from Thomas Kejser with his kind permission.
57. Spinlocks and Context Switches
The log writer needs to be able to free the spin lock cache line as fast as possible.
( Diagram: two cores passing the spin lock's cache line to the log writer thread )
58. Keeping The Rest Of The Database Engine Away From The Log Writer Thread
The log writer thread is always assigned to the first free CPU core.
59. A Simple Test To Create LOGCACHE_ACCESS Spin Lock Pressure
Create a table that ensures there is one row per page, in order to eliminate PAGELATCH_EX latch activity.
Insert 1000 rows into the table; the SPIDs for 22 sessions should fall in the range 1 -> 1000.
Create a procedure that performs a single row update 10,000 times for the SPID of the session it is running in.
61. How Has This Drop In Spins Been Achieved ?
The log writer thread is always assigned to core 0.
By isolating the rest of the database engine from core 0, the log writer does not have to contend with so many context switches when handing off and receiving back the log cache access spin lock.
63. Hypothesis
Column store on non-pre-sorted data: the segment scan probes the hash table in random order, so the hash table is likely to be at the high latency end of the cache hierarchy.
Column store on pre-sorted data: the segment scan probes the hash table sequentially, so the hash table is likely to be at the low latency end of the cache hierarchy.
64. The Case Of The 60% CPU Utilisation Ceiling
Column store scans are pre-fetcher friendly; hash joins and hash aggregates are not.
Can a hash join or aggregate keep up with a column store scan ?
65. Introducing Intel VTune Amplifier XE
Investigating events at the CPU cache,
clock cycle and instruction level requires
software outside the standard Windows
and SQL Server tool set.
Refer to Appendix D for an overview of what “General exploration”
provides.
66. VTune Can Go One Better Than WPT, Down To CPU Cycle Level !
181,578,272,367 versus 466,000,699 clock cycles !!!
67. This Is What The CPU Stall Picture Looks Like Against DOP
( Chart: last level cache ( LLC ) misses against degree of parallelism 2 - 24; the non-sorted column store climbs from 13,200,924 misses at DOP 2 to 5,311,271,763 at DOP 24, whilst the sorted column store stays between 1,200,084 and 85,205,964 across the same range )
68. Does The OrderDateKey Column Fit In The L3 Cache ?
FactInternetSalesBigNoSort, size (Mb): OrderDateKey 1786182, Price1 3871, Price2 3871, Price3 3871.
FactInternetSalesBigSorted, size (Mb): OrderDateKey 738, Price1 2965127, Price2 2965127, Price3 2965127.
No, the L3 cache is 20Mb in size.
69. The Case Of The Two Column Store Index Sizes: Conclusion
Turning the memory access on
the hash aggregate table from random
to sequential probes
=
CPU savings > cost of scanning an
enlarged column store
70. Skew Is Also A Factor In Batch Mode Hash Join Performance
Row mode hash joins: the build and probe inputs are repartitioned across threads via exchanges; expensive to repartition inputs; data skew reduces parallelism.
Batch mode hash joins: all threads build a shared hash table; no repartitioning; data skew speeds up processing.
Slide borrowed from Thomas Kejser with his kind permission.
71. The Case Of The 60% CPU Utilisation Ceiling
Still not solved. Tuning 101, if you
cannot max out the CPU capacity or
IOPS bandwidth, there must be
some form of contention at play . . .
73. The Case Of The 60% CPU Utilisation Ceiling: Conclusion
The hash aggregate cannot keep up with the column store scan.
The batch engine therefore throttles the column store scan by issuing sleep system calls !!!
The integration services engine does something very similar; refer to “back pressure”.
74. The Case of The CPU Pressure Point
Where are the pressure points on the CPU, and what can be done to resolve them ?
75. Making Efficient Use Of The CPU In The “In Memory” World
Front end pressure: the front end issues < 4 uops per cycle whilst the back end is ready to accept uops ( CPU stalls, bad speculation, data dependencies ).
Back end pressure: retirement is throttled due to pressure on back end resources ( port saturation ).
76. Lots Of KPIs To Choose From, Which To Select ?
CPU cycles per retired instruction (CPI): this should ideally be 0.25; anything approaching 1.0 is bad.
Front end bound: reflects the front end under-supplying the back end with work; this figure should be as close to zero as possible.
Back end bound: the back end cannot accept work from the front end because there is excessive demand for specific execution units.
77. These Are The Pressure Point Statistics For The ‘Sorted’ Column Store
( Chart: CPI, front end bound and back end bound KPI values ( 0 - 0.8 ) against degree of parallelism 2 - 24 )
Refer to Appendix C for the formulae from which these metrics are derived.
78. Which Parts Of The Database Engine Are Suffering Backend Pressure ?
Results obtained for a degree of parallelism of 24.
79. Back end Pressure Can Manifest As “Port Saturation”
( Diagram: the 256-bit FMUL and blend execution units on a Sandybridge* core; the hot function in the call stack is CBagAggregateExpression::TryAggregateUsingQE_Pure )
SELECT [CalendarQuarter]
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [FactInternetSalesBig] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)
80. Port Saturation Analysis
Port 0 hit ratio: 0.46
Port 1 hit ratio: 0.45
Port 2 hit ratio: 0.16
Port 3 hit ratio: 0.17
Port 4 hit ratio: 0.1
Port 5 hit ratio: 0.55
A hit ratio of 0.7 and above is deemed port saturation; if we can drive CPU utilisation above 60% we may start to see this on ports 0, 1 and 5.
82. Single Instruction Multiple Data ( SIMD )
A class of CPU instruction that can process multiple data points simultaneously.
A form of vectorised processing.
Once CPU stalls are minimised, the challenge becomes processing data on the CPU ( rate of instruction retirement ) as fast as possible.
83. Lowering Clock Cycles Per Instruction By Leveraging SIMD
Using conventional processing, adding together two arrays, each comprising four elements, requires four instructions:
A(1) + B(1) = C(1)
A(2) + B(2) = C(2)
A(3) + B(3) = C(3)
A(4) + B(4) = C(4)
84. Lowering Clock Cycles Per Instruction By Leveraging SIMD
Using SIMD ( “single instruction multiple data” ) commands, the addition can be performed using a single instruction:
A(1) A(2) A(3) A(4) + B(1) B(2) B(3) B(4) = C(1) C(2) C(3) C(4)
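The two slides above can be sketched as code. Pure Python cannot issue real SIMD instructions, so this is a conceptual model only; it simply counts "operations" to show the four-to-one ratio:

```python
# Scalar versus 4-lane vector addition of two arrays.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# Conventional processing: one add instruction per element.
scalar_result, scalar_ops = [], 0
for x, y in zip(a, b):
    scalar_result.append(x + y)
    scalar_ops += 1

def vadd4(u, v):
    """Model of a single 4-wide SIMD add over all lanes at once."""
    return [u[i] + v[i] for i in range(4)]

vector_result, vector_ops = vadd4(a, b), 1  # one "instruction"

print(scalar_result == vector_result)  # True, same answer
print(scalar_ops, vector_ops)          # 4 1
```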
86. Does The SQL Server Database Engine Leverage SIMD Instructions ?
VTune Amplifier does not provide the option to pick out Streaming SIMD Extensions (SSE) integer events.
However, for a floating point hash aggregate we would hope to see floating point AVX instructions in use.
87. The Case Of The CPU Pressure Point: Conclusion
The batch mode engine needs to
leverage SIMD.
If the hash aggregate could ‘crunch’ more data per clock cycle, it would stand a better chance of keeping up with the column store scan !
Just about all other column store database
engines use SIMD, please can the batch engine
leverage SIMD Microsoft . . .
88. What About AMD ?
AMD came up with the 3DNow! SIMD instruction set.
It is losing the single threaded performance race to Intel: will this change with Excavator ?
AMD innovations:
1st 64 bit implementation of the x86 architecture.
1st to break the 1Ghz barrier for an x86 CPU.
1st to release native dual and quad core x86 processors.
89. A Look Into The Future: CPU Technology
Moore's law runs out of steam => transition away from CMOS.
L4 caches becoming more prevalent.
Aggressive vectorisation by leveraging GPU cores on the CPU die.
On CPU package NIC functionality.
Rise of the ARM processor in the data centre.
90. A Look Into The Future: Storage and Memory
Rise of:
NVMe and NVMe over fabrics.
Stacked memory technology.
Increase in NAND flash density
through TLC and 3D NAND.
Mass storage and main memory
converge.
91. A Look Into The Future: Database Engines
Column store technology becomes
the de facto standard for data
warehouses and marts.
Adoption of more
MonetDB innovations
into popular database engines:
Database cracking
Query recycling
93. Which Memory Your Data Is In Matters ! Locality Matters !
( Diagram: where is my data ? In the L1 data cache ? The L2 unified cache ? The L3 cache ? Hopefully not out over the memory bus ?!? )
94. Going Off CPU Results In Hitting A Performance Brick Wall
Affects data and logical memory mapping information.
Cache lines associated with singleton spinlocks need to be freed up as fast as possible.
The impact of the above has been demonstrated.
( Chart: throughput per thread against cache size )
95. The Back End Of The CPU Is Now The Bottleneck For The Batch Mode Engine
Everything is good from a CPU cache hit, fetch and decode perspective.
The back end is struggling to keep up with the front end.
The emphatic answer is to process multiple data items per CPU clock cycle => leverage SIMD.
99. Appendix A: Instruction Execution And The CPU Front / Back Ends
( Diagram: the front end comprises branch prediction, the instruction cache, fetch and decode, feeding the decoded instruction buffer; the back end executes, then reorders and retires instructions )
100. Appendix B - The CPU Front / Back Ends In Detail
Front end Back end
101. Appendix C - CPU Pressure Points, Important Calculations
Front end bound ( smaller is better ):
IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clock ticks)
Bad speculation:
(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRED_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * Clock ticks)
Retiring:
UOPS_RETIRED.RETIRED_SLOTS / (4 * Clock ticks)
Back end bound ( ideally, should = 1 - Retiring ):
1 - (Front end bound + Bad speculation + Retiring)
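The four formulae can be expressed as a small function; the counter values passed in below are made-up sample numbers for illustration, not measurements:

```python
def top_down(clock_ticks, idq_uops_not_delivered, uops_issued,
             uops_retired_slots, recovery_cycles):
    """Top-down pipeline slot accounting: four issue slots per cycle."""
    slots = 4 * clock_ticks
    front_end_bound = idq_uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + 4 * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    back_end_bound = 1 - (front_end_bound + bad_speculation + retiring)
    return front_end_bound, bad_speculation, retiring, back_end_bound

fe, bs, ret, be = top_down(1_000_000, 400_000, 2_000_000,
                           1_900_000, 25_000)
print(fe, bs, ret, be)    # roughly 0.1, 0.05, 0.475, 0.375
print(fe + bs + ret + be) # the four buckets sum to 1.0 by construction
```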
102. Appendix D - VTune Amplifier General Exploration
An illustration of what the “General exploration” analysis capability of the tool provides.
Editor's Notes
Signal wait time = total wait time is to be expected for short waits on uncontended spin locks. CXPACKET waits can be discounted 99.9 % of the time.
Xperf can provide deep insights into the database engine that other tools cannot, in this case we can walk the stack associated with query execution and observe the total CPU consumption up to any point in the stack in milliseconds
This slide is the “Big reveal”, for the column store created on the pre-sorted data, there are four points of interest when comparing the two call stacks:
The hash table lookup is more expensive for the query using the column store created on the non-sorted data, by an order of magnitude.
We see a line below the Flookup for the query against the column store created on non-sorted data which we do not see in the other call stack.
The hash aggregate lookup for the query on the “Non-sorted” column store is throttling back the column store scan more than its counterpart in the other call stack, this is based on the number of times this is called.
The scan for the sorted column store is twice as expensive as that for the non-sorted column store.
These results are for the scan of a 118Gb table against a cold buffer cache, with max memory for the instance set to 40Gb.
Note the difference in latency between accessing the on CPU cache and main memory, accessing main memory incurs a large penalty in terms of lost CPU cycles, this is important and one of the drivers behind the new optimizer mode that was introduced in SQL Server 2012 in order to support column store indexes.
Keeping the rest of the database engine away from core 0 allows the spin lock cache line to be freed up as fast as possible.
Thread numbers taken away… Removed bottom
The results for this graph have been obtained using a single instance with large pages, LPIM, 40Gb max memory, the ‘Sorted’ column store index and a warm large object cache.