SQL Server Engine: CPU Cache as the New RAM
DBA Level 400
[Title slide diagram: a CPU with two cores, each with its own L1 and L2 caches, sharing an L3 cache]
About me
An independent SQL Consultant
A user of SQL Server from version 2000 onwards with 12+ years’
experience.
Speaker, both at UK user group events and at conferences.
I have a passion for understanding how the database engine works
at a deep level.
“Everything fits in memory, so
performance is as good as it will
get. It fits in memory therefore
end of story”
Demonstration #1
A Simple Query That Defies
Conventional Wisdom
Demonstration #1 Repeated With ‘Bigger’ Hardware
2 x 6 core 2.0 GHz CPUs
(Sandybridge)
48 GB quad channel 1333 MHz DDR3 memory
Warm large object cache used in all tests to remove storage as a factor.
Hyper-threading enabled, unless specified otherwise.
Which SELECT Statement Has The Lowest Elapsed Time ?
17.41 MB column store Vs. 51.7 MB column store
WITH generator AS (
SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a)
FROM (SELECT a = 1
FROM master.dbo.syscolumns) c1
CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
,CAST(((id - 1) % 1048576) AS money ) AS Price1
,CAST(((id - 1) % 1048576) AS money ) AS Price2
,CAST(((id - 1) % 1048576) AS money ) AS Price3
INTO FactInternetSalesBigNoSort
FROM generator
CROSS JOIN [dbo].[DimDate] d
CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON
FactInternetSalesBigNoSort
SELECT CalendarQuarter
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [dbo].[FactInternetSalesBigNoSort] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
WITH generator AS (
SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a)
FROM (SELECT a = 1
FROM master.dbo.syscolumns) c1
CROSS JOIN master.dbo.syscolumns c2
)
SELECT d.DateKey AS OrderDateKey
,CAST(((id - 1) % 1048576) AS money ) AS Price1
,CAST(((id - 1) % 1048576) AS money ) AS Price2
,CAST(((id - 1) % 1048576) AS money ) AS Price3
INTO FactInternetSalesBigSorted
FROM generator
CROSS JOIN [dbo].[DimDate] d
CREATE CLUSTERED INDEX ccsi
ON FactInternetSalesBigSorted ( OrderDateKey )
CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON
FactInternetSalesBigSorted
WITH (DROP_EXISTING = ON)
SELECT CalendarQuarter
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [dbo].[FactInternetSalesBigSorted] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
Which of the two is the fastest ?
The Case of The Two Column Store Index Sizes
SQL Server query tuning 101
The optimizer will always use the
smallest data structures it can
find to satisfy the query, right ?
How Well Do The Queries Using The Two Column Stores Scale ?
[Chart: elapsed time (ms) against degree of parallelism 2 to 24 for the non-sorted and sorted column stores]
Data creation statement scaled using top 300,000 to
create 1,095,600,000 rows.
Can We Use All Available CPU Resource ?
[Chart: percentage CPU utilization against degree of parallelism 2 to 24 for the non-sorted and sorted column stores]
Memory access
should be 100%
CPU intensive ?!?
Wait Statistics Do Not Help Here !
Stats are for the query run with a DOP of 24, a warm column store object pool and the
column store created on pre-sorted data.
Spin Locks Do Not Provide Any Clues Either
Executes in 775 ms for a warm column store object pool
12 cores x 2.0 GHz x 0.775 s = 18,600,000,000 CPU cycles
Total spins 293,491
SELECT [CalendarQuarter]
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [FactInternetSalesBig] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)
Well Documented Tools
Dynamic management views and functions
Performance counters
Extended events
Not all wait events, latches and spin locks are well documented, if
documented at all.
Takeaway: These tools are not going to tell us definitively
where our CPU time is going !!!
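They remain the first port of call though. A minimal sketch of how the deltas behind slides like the spin lock one can be captured, assuming nothing beyond the standard DMVs: snapshot sys.dm_os_spinlock_stats ( and, in the same way, sys.dm_os_wait_stats ) either side of the test query and diff the two:

SELECT name, spins, collisions, backoffs
INTO #spinlock_stats_before
FROM sys.dm_os_spinlock_stats;

-- . . . run the test query here . . .

SELECT s.name
      ,s.spins - b.spins AS spins_delta
FROM sys.dm_os_spinlock_stats s
JOIN #spinlock_stats_before b
ON b.name = s.name
WHERE s.spins - b.spins > 0
ORDER BY spins_delta DESC;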
The Problem With Query Plan Costs
Assumptions:
The buffer cache is cold.
IO cannot be performed in parallel.
Data in different columns is never correlated ( improved on in SQL 2014 ).
Hash distribution is always uniform.
Etc. . . .
Based on the amount of time it took a
developer’s machine to complete certain
operations, a Dell OptiPlex ( according to legend ).
Response Time = Wait Time + Service Time
We need something to help
find out what is happening
here.
A Not So Well Documented Tool
Introducing the Windows Performance Toolkit
Comes as part of the Windows Assessment and Deployment Kit.
Traces are created via xperf and Windows Performance Recorder.
It utilises Event Tracing for Windows (ETW).
Visualise traces using Windows Performance Analyzer.
The real power is being able to stack walk the database engine.
Public Symbols
These are labels contained in .pdb files that provide information on what programming
construct generated a piece of machine code, also known as debug symbols.
Caution: public symbols are version specific, down to CU level !
Obtaining An ETW Trace Stack Walking The Database Engine
xperf -on base -stackwalk profile
( run the SQL statement )
xperf -d stackwalk.etl
( open the resulting trace in WPA )
Basic xperf Command Line Syntax
xperf -on <kernel flags | kernel groups> [ -stackwalk <stackwalk kernel providers> ]
Kernel groups are groups of flags and are not to be confused with Windows kernel groups.
Takeaway: kernel groups make life easier.
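For example, a hedged sketch using the built-in ‘Latency’ kernel group in place of the individual flags it expands to ( run ‘xperf -providers’ to list the flags and groups your install supports ):

xperf -on Latency -stackwalk Profile
( run the workload )
xperf -d latency_trace.etl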
ETW Tracing Is Generally Lightweight
This command, from the Premier Field Engineer blog post “Identifying the cause of SQL Server IO
bottlenecks using XPerf”, reduced my IO throughput from 3,300 MB/s to 300 MB/s:
XPERF -on PROC_THREAD+LOADER+
FLT_IO_INIT+FLT_IO+FLT_FASTIO+FLT_IO_FAILURE+
FILENAME+FILE_IO+FILE_IO_INIT+
DISK_IO+HARD_FAULTS+DPC+
INTERRUPT+CSWITCH+PROFILE+DRIVERS+DISPATCHER
-stackwalk MiniFilterPreOpInit+MiniFilterPostOpInit+
CSWITCH+PROFILE+ThreadCreate+ReadyThread+
DiskReadInit+DiskWriteInit+DiskFlushInit+
FileCreate+FileCleanup+FileClose+FileRead+FileWrite
-BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular
Takeaway: start off with the smallest set of kernel providers you
can get away with and add incrementally.
The Database Engine Composition ( SQL Server 2012 onwards )
Database Engine
Language Processing: sqllang.dll
Runtime: sqlmin.dll, sqltst.dll, qds.dll
SQLOS: sqldk.dll, sqlos.dll
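The DLLs making up the engine can be confirmed from within SQL Server itself; a quick sanity check against the standard sys.dm_os_loaded_modules DMV:

SELECT name, description, file_version
FROM sys.dm_os_loaded_modules
WHERE name LIKE '%sqllang.dll'
   OR name LIKE '%sqlmin.dll'
   OR name LIKE '%sqldk.dll'
   OR name LIKE '%sqlos.dll'
   OR name LIKE '%qds.dll';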
Demonstration #2
Stack Walking
an ETW Trace
Of Our Simple Query
Demonstration #3
Investigating
IO Sizes
Via an ETW Trace File
Demonstration #4
Quantifying The Cost Of
Extracting Row Values
During A Table Scan
Core 2 Architecture
[Diagram: CPU with cores, each with 32KB L1 instruction and data caches and an L2 unified cache, connected over the front side bus to the north bridge and south bridge]
Core 2 Architecture
 Latency when talking to IO and memory controllers even before accessing
memory or IO.
 Single level TLB.
 Hyper-threading via NetBurst, this delivered poor performance.
 Design was not modular, the 4 core CPU was 2 x 2 core CPUs “Bolted together”.
 Only the last generation “Dunnington” had an L3 cache.
Core i Series Generation 1 ‘Nehalem’
[Diagram: CPU with cores, each with 32KB L1 instruction and data caches and a 256K L2 unified cache, a shared L3 cache, and an un-core housing the TLB, power and clock logic, the integrated memory controller and the QPI links]
 Integrated memory controller.
 NetBurst replaced with a new
hyper threading technology.
 The Translation Lookaside Buffer ( TLB ) for
caching logical to physical memory
mappings has two levels.
 Front side bus replaced by the
QuickPath Interconnect (QPI).
 Genuine modular design.
Core i Series Generation 2 ‘Sandybridge’
[Diagram: as Nehalem, with each core gaining an L0 UOP cache, the L3 cache connected to the cores via a bi-directional ring bus, and the un-core gaining an integrated PCIe 2.0 controller]
 Integrated PCIe 2.0 controller.
 Level 0 uop cache.
 L3 cache connected to the cores
via a bi-directional ring bus.
 Advanced Vector eXtensions v1.
 Data Direct IO.
Cache Lines
 Unit of transfer between memory and the CPU.
 Cache lines are used to create “Cache entries”, these
are tagged with the requested memory location.
 Data structures should be cache line aligned in
order to avoid split register operations.
 If different cores access the same cache line for
writes, this effectively ‘Bounces’ the L2 cache
=> Try to avoid threads sharing cache lines.
CPU
Cache Line – 64 bytes
CPU Cache, Memory and IO Sub System, Latency
[Diagram: four cores, each with L1 and L2 caches, sharing an L3 cache; the latency scale runs from ~1ns at the L1 cache, through ~100ns at main memory, out to ~10ms at the IO sub system]
but, memory is the new flash storage and CPU cache is the new RAM . . .
Slide borrowed from
Thomas Kejser with his
kind permission.
The CPU Cache Hierarchy Latencies In CPU Cycles
L1 Cache sequential access:      4
L1 Cache in page random access:  4
L1 Cache full random access:     4
L2 Cache sequential access:      11
L2 Cache in page random access:  11
L2 Cache full random access:     11
L3 Cache sequential access:      14
L3 Cache in page random access:  18
L3 Cache full random access:     38
Main memory:                     167
Main Memory Is Holding The CPU Back, Solutions . . .
 Leverage the pre-fetcher as much as possible.
 Larger CPU caches:
 L4 Cache => Crystalwell eDram
 DDR4 memory
 Bypass main memory
 Stacked memory:
 Hybrid memory cubes ( Micron, Intel etc ).
 High bandwidth memory ( AMD, Hynix ).
What The Pre-Fetcher Loves
Column Store index
Sequential scan
What The Pre-Fetcher Hates !
Hash Table
Can this be
improved ?
Bypassing Main Memory With Data Direct IO
[Diagram: the old world, where IO is DMA'd into main memory before the CPU reads it; the new world with Data Direct IO, where IO is delivered straight into the CPU's L3 cache]
Transactions/s With and Without Data Direct IO
[Chart: transactions/sec against single socket IO bandwidth ( 2, 4, 6 and 8 x 10GbE ) for the Xeon 5600 ( without Data Direct IO ) and the Xeon E5 ( with Data Direct IO )]
The Case Of The Two Column Store Index Sizes
Could the CPU cycles required to
access the CPU cache versus main
memory be a factor ?
. . . let’s dig deeper
Call Stack For Query Against Column Store On Non-Pre-Sorted Data
Hash agg lookup
weight 65,329.87
Column Store scan
weight 28,488.73
Control flow
Data flow
Where Is The Bottleneck In The Plan ?
The stack trace indicates that the
bottleneck is right here
Does The Stream Aggregate Perform Any Better ?
 Query takes seven minutes and nine seconds
to run.
 Performance is killed by a huge row mode sort
prior to the stream aggregate !!!
( sorted column store )
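For reference, this is how the stream aggregate variant of the test can be forced; a sketch using the documented ORDER GROUP query hint, which asks the optimizer to use sort plus stream aggregation instead of hashing:

SELECT CalendarQuarter
      ,SUM([Price1])
      ,SUM([Price2])
      ,SUM([Price3])
FROM [dbo].[FactInternetSalesBigSorted] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (ORDER GROUP);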
Call Stack For Query Against Column Store On Pre-Sorted Data
Hash agg lookup
weight:
 now 275.00
 before 65,329.87
Column Store scan
weight
 now 45,764.07
 before 28,488.73
The Case Of The Two Column Stores
The hash aggregate using the column
store created on pre-sorted data is
very CPU efficient.
Why ?
SQL Server and Large Memory Pages, How Does It Work ?
[Diagram: each core has a two level TLB, a 1st level DTLB and a 2nd level STLB, which cost ~10s of CPU cycles to probe; a miss in both levels means a walk of the page translation table in memory at a cost of 160+ CPU cycles]
The TLB Without Large Page Support
With 4Kb pages, a 32 entry TLB covers 128Kb of logical to physical memory mappings*
* Nehalem through to Ivybridge architectures
The TLB With Large Page Support
With 2Mb large pages, the same 32 entry TLB's 128Kb of logical to physical memory mapping coverage is increased to 64Mb !!!
Fewer trips off the CPU to the page table.
What Difference Do Large Pages Make ?
( single 6 core socket )
Large page support off: 130s elapsed time, 222,220 page lookups/s
Large page support on: 93s elapsed time, 263,766 page lookups/s
A 28% saving in elapsed time.
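Whether large pages are actually in use can be verified from the DMVs; a sketch, assuming an edition and version where large pages are available, the ‘locked pages in memory’ privilege granted and trace flag 834 enabled:

SELECT large_page_allocations_kb
      ,locked_page_allocations_kb
FROM sys.dm_os_process_memory;

A non-zero large_page_allocations_kb indicates large page allocations have been made.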
CPU Pipeline Architecture
 A ‘Pipeline’ of logical slots runs through the processor, from allocation at the front end to retirement at the back end.
 The front end can issue four micro ops per clock cycle.
 The back end can retire up to four micro operations per clock cycle.
Pipeline ‘Bubbles’ Are Bad !
 Empty pipeline slots are referred to as ‘Bubbles’.
 Causes of front end bubbles:
 Bad speculation.
 CPU stalls.
 Data dependencies, e.g.
A = B + C
E = A + D
 Back end bubbles can be due to excessive demand for specific types of execution unit.
A Basic NUMA Architecture
[Diagram: two NUMA nodes, each with four cores ( with L1 and L2 caches ), a shared L3 cache and local memory; memory on a core's own node is a local access, memory on the other node is a remote access]
Four and Eight Node NUMA QPI Topologies ( Nehalem i7 onwards )
[Diagram: four socket and eight socket QPI interconnect topologies and their IO hubs]
With 18 core Xeons in the offing, these
topologies will become increasingly rare.
Remote Node Memory Access Comes At A Cost
An additional 20% overhead when
accessing ‘Foreign’ memory !
( from coreinfo )
Local Vs Remote Memory Access and Thread Locality
How does SQLOS schedule hyper-threads in relation to physical cores ?
[Diagram: two CPU sockets with 6 cores per socket]
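The scheduler to logical processor mapping SQLOS has created can be inspected directly; a sketch against the standard sys.dm_os_schedulers DMV, where cpu_id is the logical processor a scheduler is associated with:

SELECT scheduler_id, cpu_id, status, is_online, current_tasks_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';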
Making Use Of CPU Stalls With Hyper Threading ( Nehalem i7 onwards )
1. Session 1 performs an index seek; the B-tree pages are not in the CPU cache.
2. A CPU stall takes place ( 160 clock cycles+ ) whilst the page is retrieved from memory following the last level cache miss.
3. The ‘Dead’ CPU stall cycles give the physical core the opportunity to run a 2nd thread.
Hyper-Threading Scalability For The ‘Sorted’ Column Store Index
Degree of Parallelism: 2 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
CPU Utilisation (%):   5 8 10 13 15 18 20 23 25 28 30 33 35 38 40 43 46 48 50 53 54 58 59
 5 % CPU utilisation per core
 60% of each core utilised by first hyper thread
The Rationale Behind SQL OS
SQL OS Layer: One Scheduler Per Logical Processor
 The OS has no concept of SQL resources such as latches.
 ‘Hard’ kernel context switches become soft user mode context switches.
 SQL OS schedules threads by prioritizing L2 cache hits and reuse over ‘Fairness’.
The LOGCACHE_ACCESS Spin Lock
[Diagram: threads T0..Tn acquire the LOGCACHE_ACCESS spin lock to obtain a buffer offset ( cache line ) and allocate a slot in the LOGBUFFER ( slots 1..127 ), memcpy their content in, then the buffer passes via the log flush queue to the log writer's writer queue; the async I/O completion port signals the thread which issued the commit ( WRITELOG )]
Slide borrowed from
Thomas Kejser with his
kind permission.
This spin lock is the
ultimate bottleneck for
ACID transactions in SQL
Server
Spinlocks and Context Switches
[Diagram: the spin lock cache line being handed back and forth between the log writer thread's core and a second core]
The log writer needs to be able to free the spin lock cache line as fast as possible.
Keeping The Rest Of The Database Engine Away From The Log Writer Thread
The log writer thread is always assigned to the first free CPU core.
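Which core the log writer has landed on can be checked by locating its request; a sketch joining the LOG WRITER system task to its scheduler:

SELECT r.command, r.scheduler_id, s.cpu_id
FROM sys.dm_exec_requests r
JOIN sys.dm_os_schedulers s
ON s.scheduler_id = r.scheduler_id
WHERE r.command = 'LOG WRITER';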
A Simple Test To Create LOGCACHE_ACCESS Spin Lock Pressure
 Create a table that ensures there is one row per page, in order to eliminate PAGELATCH_EX latch activity.
 Insert 1000 rows into the table; the SPIDs for the 22 test sessions should fall in the range 1 -> 1000.
 Create a procedure to perform a single row update 10,000 times for the SPID of the session it is running in.
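A minimal sketch of such a harness; the object names are assumptions, and the char(8000) filler is what forces one row per 8Kb page:

CREATE TABLE dbo.SpinLockTest
( spid    int        NOT NULL PRIMARY KEY
 ,counter int        NOT NULL
 ,filler  char(8000) NOT NULL DEFAULT ' ' ); -- pads each row to fill a page

INSERT INTO dbo.SpinLockTest ( spid, counter )
SELECT TOP 1000 ROW_NUMBER() OVER ( ORDER BY (SELECT NULL) ), 0
FROM master.dbo.syscolumns;
GO

CREATE PROCEDURE dbo.SpinLockPressure
AS
DECLARE @i int = 0;
WHILE @i < 10000
BEGIN
    -- Each auto-commit update forces a log flush hand-off with the log writer.
    UPDATE dbo.SpinLockTest SET counter += 1 WHERE spid = @@SPID;
    SET @i += 1;
END;
GO

Each of the 22 test sessions then runs EXEC dbo.SpinLockPressure concurrently.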
Results
Default CPU affinity mask: 234,915,993 spins
Core 0 removed from affinity mask: 10,554,325 spins
Order of magnitude saving !!!
How Has This Drop In Spins Been Achieved ?
 The log writer thread is always assigned to core 0.
 By isolating the rest of the database engine from core 0, the log
writer does not have to contend with so many context switches
when handing off and receiving back the log cache access spin
lock.
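A sketch of the affinity change behind these numbers, assuming a 24 logical processor box; adjust the range to suit:

-- Leave logical processor 0, and hence the log writer, to itself.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = 1 TO 23;

-- Revert to the default of using all logical processors.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY CPU = AUTO;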
Hypothesis
Column Store on Non-Pre-Sorted Data: the scan probes the hash table randomly ( random access ), so the hash table is likely to be at the high latency end of the cache hierarchy.
Column Store on Pre-Sorted Data: the scan probes the hash table sequentially ( sequential access ), so the hash table is likely to be at the low latency end of the cache hierarchy.
The Case Of The 60% CPU Utilisation Ceiling
Column store scans are pre-fetcher
friendly, hash joins and hash
aggregates are not
Can a hash join or aggregate keep up
with a column store scan ?
Introducing Intel VTune Amplifier XE
Investigating events at the CPU cache,
clock cycle and instruction level requires
software outside the standard Windows
and SQL Server tool set.
Refer to Appendix D for an overview of what “General exploration”
provides.
VTune Can Go One Better Than WPT, Down To CPU Cycle Level !
181,578,272,367 versus 466,000,699 clock cycles !!!
This Is What The CPU Stall Picture Looks Like Against DOP ( LLC misses )
DOP:        2          4          6           8             10            12            14            16            18            20            22            24
Non-sorted: 13,200,924 30,902,163 161,411,298 1,835,828,499 2,069,544,858 4,580,720,628 2,796,495,741 3,080,615,628 3,950,376,507 4,419,593,391 4,952,446,647 5,311,271,763
Sorted:     3,000,210  1,200,084  16,203,164  29,102,037    34,802,436    35,102,457    48,903,413    64,204,494    63,004,410    85,205,964    68,404,788    72,605,082
Does The OrderDateKey Column Fit In The L3 Cache ?
Table Name                  Column Name   Size (Mb)
FactInternetSalesBigNoSort  OrderDateKey  1786182
FactInternetSalesBigNoSort  Price1        3871
FactInternetSalesBigNoSort  Price2        3871
FactInternetSalesBigNoSort  Price3        3871
FactInternetSalesBigSorted  OrderDateKey  738
FactInternetSalesBigSorted  Price1        2965127
FactInternetSalesBigSorted  Price2        2965127
FactInternetSalesBigSorted  Price3        2965127
No, the L3 cache is 20Mb in size.
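Per-column sizes like those above can be derived from the column store catalog views; a sketch, with the caveat that column_id in sys.column_store_segments reflects the column's position within the column store:

SELECT t.name AS table_name
      ,c.name AS column_name
      ,SUM(s.on_disk_size) / 1048576.0 AS size_mb
FROM sys.column_store_segments s
JOIN sys.partitions p
ON p.hobt_id = s.hobt_id
JOIN sys.tables t
ON t.object_id = p.object_id
JOIN sys.columns c
ON c.object_id = t.object_id
AND c.column_id = s.column_id
GROUP BY t.name, c.name
ORDER BY t.name, c.name;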
The Case Of The Two Column Store Index Sizes: Conclusion
Turning the memory access on
the hash aggregate table from random
to sequential probes
=
CPU savings > cost of scanning an
enlarged column store
Skew Is Also A Factor In Batch Mode Hash Join Performance
Row Mode: the build and probe inputs are repartitioned across threads via exchanges; repartitioning the inputs is expensive and data skew reduces parallelism.
Batch Mode: all threads build into a shared hash table; there is no repartitioning and data skew actually speeds up processing.
Slide borrowed
from Thomas
Kejser with his
kind permission.
The Case Of The 60% CPU Utilisation Ceiling
Still not solved. Tuning 101, if you
cannot max out the CPU capacity or
IOPS bandwidth, there must be
some form of contention at play . . .
What Does Digging Deep Into The Call Stack Tell Us ?
Throttling !
The Case Of The 60% CPU Utilisation Ceiling: Conclusion
The hash aggregate cannot
keep up with the column store
scan.
The batch engine therefore
throttles the column store scan
by calling sleep system calls !!!
The integration services engine does
something very similar, refer to
“Back pressure”.
The Case of The CPU Pressure Point
Where are the pressure points on
the CPU and what can be done to
resolve them ?
Making Efficient Use Of The CPU In The “In Memory” World
Frontend Pressure: the front end issues < 4 uops per cycle whilst the back end is ready to accept uops ( CPU stalls, bad speculation, data dependencies ).
Backend Pressure: retirement is throttled due to pressure on back end resources ( port saturation ).
Lots Of KPIs To Choose From, Which To Select ?
 CPU Cycles Per Retired Instruction (CPI)
This should ideally be 0.25, anything
approaching 1.0 is bad.
 Front end bound
Reflects the front end under supplying the back
end with work, this figure should be as close to
zero as possible.
 Back end bound
The Back end cannot accept work from the front
end because there is excessive demand for
specific execution units.
These Are The Pressure Point Statistics For The ‘Sorted’ Column Store
[Chart: CPI, front end bound and back end bound KPI values ( 0 to 0.8 ) against degree of parallelism 2 to 24]
Refer to Appendix C for the formulae from
which these metrics are derived.
Which Parts Of The Database Engine Are Suffering Backend Pressure ?
Results obtained for a degree of parallelism of 24 point at CBagAggregateExpression::TryAggregateUsingQE_Pure.
Back end Pressure Can Manifest As “Port Saturation” ( *Sandybridge core )
[Diagram: the execution ports of a *Sandybridge core, including its 256-bit FMUL and Blend units]
SELECT [CalendarQuarter]
,SUM([Price1])
,SUM([Price2])
,SUM([Price3])
FROM [FactInternetSalesBig] f
JOIN [DimDate] d
ON f.OrderDateKey = d.DateKey
GROUP BY CalendarQuarter
OPTION (MAXDOP 24)
Ports Saturation Analysis
Port 0 hit ratio: 0.46
Port 1 hit ratio: 0.45
Port 2 hit ratio: 0.16
Port 3 hit ratio: 0.17
Port 4 hit ratio: 0.10
Port 5 hit ratio: 0.55
0.7 and above is deemed port saturation;
if we can drive CPU utilisation above 60% we
may start to see this on ports 0, 1 and 5.
Back end Pressure
Can we help the back end
keep up with the front end ?
Single Instruction Multiple Data ( SIMD )
 A class of CPU instruction that
processes multiple data points
simultaneously.
 A form of vectorised processing.
 Once CPU stalls are minimised, the
challenge becomes processing data
on the CPU ( rate of instruction
retirement ) as fast as possible.
Lowering Clock Cycles Per Instruction By Leveraging SIMD
A(1) + B(1) = C(1)
A(2) + B(2) = C(2)
A(3) + B(3) = C(3)
A(4) + B(4) = C(4)
Using conventional processing,
adding two arrays together,
each comprising four
elements, requires four
instructions.
Lowering Clock Cycles Per Instruction By Leveraging SIMD
A(1) A(2) A(3) A(4) + B(1) B(2) B(3) B(4) = C(1) C(2) C(3) C(4)
Using SIMD –
“Single instruction multiple data”
commands, the addition can be
performed using a single instruction.
Intel Advanced Vector eXtensions
AVX registers are getting wider and the instruction set richer with each generation:
Westmere ( 2011, 32 nm ): SSE 4.2, DDR3, PCIe2, 87 GFLOPS
Sandy Bridge ( 2012, 32 nm ): AVX ( 256 bit registers ), DDR3, PCIe3, 185 GFLOPS
Ivy Bridge ( 2013, 22 nm ): ~225 GFLOPS
Haswell ( 2014, 22 nm ): AVX2 ( new instructions ), DDR4, PCIe3, ~500 GFLOPS
Broadwell ( 2015, 14 nm ): tbd GFLOPS
Skylake ( future, 14 nm ): AVX 3.2 ( 512 bit registers ), DDR4, PCIe4, tbd GFLOPS
Does The SQL Server Database Engine Leverage SIMD Instructions ?
 Seemingly not.
 VTune Amplifier does not provide the option to pick out
Streaming SIMD Extension (SSE) integer events.
 However, for a floating point hash aggregate we would hope to
see floating point AVX instructions in use.
The Case Of The CPU Pressure Point: Conclusion
The batch mode engine needs to
leverage SIMD.
If the hash aggregate could ‘Crunch’
more data per clock cycle, it would
stand a better chance of keeping up
with the column store scan !
Just about all other column store database
engines use SIMD, please can the batch engine
leverage SIMD, Microsoft . . .
What About AMD ?
 AMD came up with the 3DNow! SIMD instruction set.
 It is losing the single threaded performance race to Intel:
 Will this change with Excavator ?
 AMD innovations:
 1st 64 bit implementation of the x86 architecture
 1st to break the 1GHz barrier for an x86 CPU
 1st to release native dual and quad core x86 processors
A Look Into The Future: CPU Technology
 Moore’s law runs out of steam
=> transition away from CMOS
 L4 caches becoming more prevalent.
 Aggressive vectorisation by leveraging GPU
cores on the CPU die.
 On CPU package NIC functionality.
 Rise of the ARM processor in the data centre.
CPU
A Look Into The Future: Storage and Memory
 Rise of:
 NVMe and NVMe over fabrics.
 Stacked memory technology.
 Increase in NAND flash density
through TLC and 3D NAND.
 Mass storage and main memory
converge.
A Look Into The Future: Database Engines
 Column store technology becomes
the de facto standard for data
warehouses and marts.
 Adoption of more
MonetDB innovations
into popular database engines:
 Database cracking
 Query recycling
Which Memory Your Data Is In Matters ! Locality Matters !
[Diagram: where is my data ? Here in the L1 data cache ? The L2 unified cache ? The L3 cache ? Hopefully not all the way out over the memory bus in main memory ?!?]
Going Off CPU Results In Hitting A Performance Brick Wall
 Affects data and logical memory mapping information.
 Cache lines associated with singleton spinlocks need to be freed up as fast as possible.
 The impact of the above has been demonstrated.
[Chart: throughput per thread drops off sharply once the working set size exceeds the cache size]
The Backend Of The CPU Is Now The Bottleneck For The Batch mode Engine
[Diagram: the CPU front end ( healthy ) feeding the back end ( the bottleneck ), which now gates performance]
 Everything is good from a
CPU cache hit, fetch and
decode perspective.
 The backend is struggling
to keep up with the front
end.
 The emphatic answer is
to process multiple data
items per CPU clock cycle
=> leverage SIMD.
chris1adkin@yahoo.co.uk
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Appendix A: Instruction Execution And The CPU Front / Back Ends
[Diagram: front end - branch predict, fetch from the instruction cache, decode into the decoded instruction buffer; back end - execute, then reorder and retire]
Appendix B - The CPU Front / Back Ends In Detail
[Diagram: the front end and back end pipeline stages in detail]
Appendix C - CPU Pressure Points, Important Calculations
 Front end bound ( smaller is better )
 IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clock ticks)
 Bad speculation
 (UOPS_ISSUED.ANY – UOPS_RETIRED.RETIRED_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / (4 * Clock ticks)
 Retiring
 UOPS_RETIRED.RETIRED_SLOTS / (4 * Clock ticks)
 Back end bound ( ideally, should = 1 - Retiring )
 1 – (Front end bound + Bad speculation + Retiring)
Appendix D - VTune Amplifier General Exploration
An illustration of what the “General exploration” analysis capability of the tool provides.

Mais conteúdo relacionado

Mais procurados

Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
BertrandDrouvot
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
BertrandDrouvot
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky   oracle, memory & linuxChristo kutrovsky   oracle, memory & linux
Christo kutrovsky oracle, memory & linux
Kyle Hailey
 

Mais procurados (20)

Scaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insertScaling sql server 2014 parallel insert
Scaling sql server 2014 parallel insert
 
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contention
 
Troubleshooting Complex Oracle Performance Problems with Tanel Poder
Troubleshooting Complex Oracle Performance Problems with Tanel PoderTroubleshooting Complex Oracle Performance Problems with Tanel Poder
Troubleshooting Complex Oracle Performance Problems with Tanel Poder
 
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
Automatic Storage Management (ASM) metrics are a goldmine: Let's use them!
 
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
Reduce Resource Consumption & Clone in Seconds your Oracle Virtual Environmen...
 
Oracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BIOracle in-Memory Column Store for BI
Oracle in-Memory Column Store for BI
 
Christo kutrovsky oracle, memory & linux
Christo kutrovsky   oracle, memory & linuxChristo kutrovsky   oracle, memory & linux
Christo kutrovsky oracle, memory & linux
 
Testing Delphix: easy data virtualization
Testing Delphix: easy data virtualizationTesting Delphix: easy data virtualization
Testing Delphix: easy data virtualization
 
Tanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata MigrationsTanel Poder - Performance stories from Exadata Migrations
Tanel Poder - Performance stories from Exadata Migrations
 
PostgreSQL and RAM usage
PostgreSQL and RAM usagePostgreSQL and RAM usage
PostgreSQL and RAM usage
 
PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs PostgreSQL WAL for DBAs
PostgreSQL WAL for DBAs
 
In Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry OsborneIn Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry Osborne
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
 
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
PGConf.ASIA 2019 Bali - Building PostgreSQL as a Service with Kubernetes - Ta...
 
HeroLympics Eng V03 Henk Vd Valk
HeroLympics  Eng V03 Henk Vd ValkHeroLympics  Eng V03 Henk Vd Valk
HeroLympics Eng V03 Henk Vd Valk
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar AhmedPGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
 
Как PostgreSQL работает с диском
Как PostgreSQL работает с дискомКак PostgreSQL работает с диском
Как PostgreSQL работает с диском
 
Dbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easyDbvisit replicate: logical replication made easy
Dbvisit replicate: logical replication made easy
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DWReally Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
 

Destaque

chap 18 multicore computers
chap 18 multicore computers chap 18 multicore computers
chap 18 multicore computers
Sher Shah Merkhel
 
Red hat enterprise_virtualization_load
Red hat enterprise_virtualization_loadRed hat enterprise_virtualization_load
Red hat enterprise_virtualization_load
silviucojocaru
 

Destaque (20)

Sgi Hpc Day Kiev 2009 10 Uv
Sgi Hpc Day Kiev 2009 10 UvSgi Hpc Day Kiev 2009 10 Uv
Sgi Hpc Day Kiev 2009 10 Uv
 
Enabling ARM® Server Technology for the Datacenter
Enabling ARM® Server Technology for the DatacenterEnabling ARM® Server Technology for the Datacenter
Enabling ARM® Server Technology for the Datacenter
 
Red Hat Virtualization Where Performance Takes Off!
Red Hat Virtualization Where Performance Takes Off!Red Hat Virtualization Where Performance Takes Off!
Red Hat Virtualization Where Performance Takes Off!
 
An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014
An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014
An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014
 
Multicore computers
Multicore computersMulticore computers
Multicore computers
 
chap 18 multicore computers
chap 18 multicore computers chap 18 multicore computers
chap 18 multicore computers
 
Frequency modulation
Frequency modulationFrequency modulation
Frequency modulation
 
Red hat enterprise_virtualization_load
Red hat enterprise_virtualization_loadRed hat enterprise_virtualization_load
Red hat enterprise_virtualization_load
 
6th Generation Processor Announcement
6th Generation Processor Announcement6th Generation Processor Announcement
6th Generation Processor Announcement
 
Futex Scaling for Multi-core Systems
Futex Scaling for Multi-core SystemsFutex Scaling for Multi-core Systems
Futex Scaling for Multi-core Systems
 
Performance Benchmarking of Clouds Evaluating OpenStack
Performance Benchmarking of Clouds                Evaluating OpenStackPerformance Benchmarking of Clouds                Evaluating OpenStack
Performance Benchmarking of Clouds Evaluating OpenStack
 
Multicore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash PrajapatiMulticore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash Prajapati
 
Memory Barriers in the Linux Kernel
Memory Barriers in the Linux KernelMemory Barriers in the Linux Kernel
Memory Barriers in the Linux Kernel
 
Mastering kvm virtualization- A complete guide of KVM virtualization
Mastering kvm virtualization- A complete guide of KVM virtualizationMastering kvm virtualization- A complete guide of KVM virtualization
Mastering kvm virtualization- A complete guide of KVM virtualization
 
Multiprocessor system
Multiprocessor systemMultiprocessor system
Multiprocessor system
 
Processors
ProcessorsProcessors
Processors
 
Multicore Processsors
Multicore ProcesssorsMulticore Processsors
Multicore Processsors
 
Race to Reality: The Next Billion-People Market Opportunity
Race to Reality: The Next Billion-People Market OpportunityRace to Reality: The Next Billion-People Market Opportunity
Race to Reality: The Next Billion-People Market Opportunity
 
Multicore Processors
Multicore ProcessorsMulticore Processors
Multicore Processors
 
Pipelining and vector processing
Pipelining and vector processingPipelining and vector processing
Pipelining and vector processing
 

Semelhante a Sql server engine cpu cache as the new ram

SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
Chester Chen
 
in-memory database system and low latency
in-memory database system and low latencyin-memory database system and low latency
in-memory database system and low latency
hyeongchae lee
 

Semelhante a Sql server engine cpu cache as the new ram (20)

SQL Server It Just Runs Faster
SQL Server It Just Runs FasterSQL Server It Just Runs Faster
SQL Server It Just Runs Faster
 
Deep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance PerformanceDeep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance Performance
 
Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...
Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...
Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...
 
Deep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance PerformanceDeep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance Performance
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
 
SQL Server In-Memory OLTP introduction (Hekaton)
SQL Server In-Memory OLTP introduction (Hekaton)SQL Server In-Memory OLTP introduction (Hekaton)
SQL Server In-Memory OLTP introduction (Hekaton)
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
In-Memory and TimeSeries Technology to Accelerate NoSQL Analytics
In-Memory and TimeSeries Technology to Accelerate NoSQL AnalyticsIn-Memory and TimeSeries Technology to Accelerate NoSQL Analytics
In-Memory and TimeSeries Technology to Accelerate NoSQL Analytics
 
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
IMCSummit 2015 - Day 1 Developer Track - Evolution of non-volatile memory exp...
 
Sql server 2016 it just runs faster sql bits 2017 edition
Sql server 2016 it just runs faster   sql bits 2017 editionSql server 2016 it just runs faster   sql bits 2017 edition
Sql server 2016 it just runs faster sql bits 2017 edition
 
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
IMCSummit 2015 - Day 2 IT Business Track - 4 Myths about In-Memory Databases ...
 
in-memory database system and low latency
in-memory database system and low latencyin-memory database system and low latency
in-memory database system and low latency
 
The Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage EngineThe Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage Engine
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Ms sql server architecture
Ms sql server architectureMs sql server architecture
Ms sql server architecture
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
 
SQL Track: In Memory OLTP in SQL Server
SQL Track: In Memory OLTP in SQL ServerSQL Track: In Memory OLTP in SQL Server
SQL Track: In Memory OLTP in SQL Server
 
Storage 101 for VMware admins 2015
Storage 101 for VMware admins 2015Storage 101 for VMware admins 2015
Storage 101 for VMware admins 2015
 
Deep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instancesDeep Dive on Amazon EC2 instances
Deep Dive on Amazon EC2 instances
 

Mais de Chris Adkin

Mais de Chris Adkin (9)

Bdc from bare metal to k8s
Bdc   from bare metal to k8sBdc   from bare metal to k8s
Bdc from bare metal to k8s
 
Data weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clustersData weekender deploying prod grade sql 2019 big data clusters
Data weekender deploying prod grade sql 2019 big data clusters
 
Data relay introduction to big data clusters
Data relay introduction to big data clustersData relay introduction to big data clusters
Data relay introduction to big data clusters
 
Ci with jenkins docker and mssql belgium
Ci with jenkins docker and mssql belgiumCi with jenkins docker and mssql belgium
Ci with jenkins docker and mssql belgium
 
Continuous Integration With Jenkins Docker SQL Server
Continuous Integration With Jenkins Docker SQL ServerContinuous Integration With Jenkins Docker SQL Server
Continuous Integration With Jenkins Docker SQL Server
 
TSQL Coding Guidelines
TSQL Coding GuidelinesTSQL Coding Guidelines
TSQL Coding Guidelines
 
J2EE Performance And Scalability Bp
J2EE Performance And Scalability BpJ2EE Performance And Scalability Bp
J2EE Performance And Scalability Bp
 
J2EE Batch Processing
J2EE Batch ProcessingJ2EE Batch Processing
J2EE Batch Processing
 
Oracle Sql Tuning
Oracle Sql TuningOracle Sql Tuning
Oracle Sql Tuning
 

Sql server engine cpu cache as the new ram

  • 1. L3 Cache DBA Level 400 C P U L2 Cache L1 Cache Core Core L2 Cache L1 Cache
  • 2. About me An independent SQL Consultant A user of SQL Server from version 2000 onwards with 12+ years experience. Speaker, both at UK user group events and at conferences. I have a passion for understanding how the database engine works at a deep level.
  • 3. “Everything fits in memory, so performance is as good as it will get. It fits in memory therefore end of story”
  • 4. Demonstration #1 A Simple Query That Defies Conventional Wisdom
  • 5. Demonstration #1 Repeated With ‘Bigger’ Hardware CPU 6 core 2.0 Ghz (Sandybridge) Warm large object cache used in all tests to remove storage as a factor. CPU 6 core 2.0 Ghz (Sandybridge) 48 Gb quad channel 1333 Mhz DDR3 memory Hyper-threading enabled, unless specified otherwise.
  • 6. Which SELECT Statement Has The Lowest Elapsed Time ? 17.41Mb column store Vs. 51.7Mb column store WITH generator AS ( SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a) FROM (SELECT a = 1 FROM master.dbo.syscolumns) c1 CROSS JOIN master.dbo.syscolumns c2 ) SELECT d.DateKey AS OrderDateKey ,CAST(((id - 1) % 1048576) AS money ) AS Price1 ,CAST(((id - 1) % 1048576) AS money ) AS Price2 ,CAST(((id - 1) % 1048576) AS money ) AS Price3 INTO FactInternetSalesBigNoSort FROM generator CROSS JOIN [dbo].[DimDate] d CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON FactInternetSalesBigNoSort SELECT CalendarQuarter ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [dbo].[FactInternetSalesBigNoSort] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter WITH generator AS ( SELECT TOP 3000 id = Row_Number() OVER (ORDER BY a) FROM (SELECT a = 1 FROM master.dbo.syscolumns) c1 CROSS JOIN master.dbo.syscolumns c2 ) SELECT d.DateKey AS OrderDateKey ,CAST(((id - 1) % 1048576) AS money ) AS Price1 ,CAST(((id - 1) % 1048576) AS money ) AS Price2 ,CAST(((id - 1) % 1048576) AS money ) AS Price3 INTO FactInternetSalesBigSorted FROM generator CROSS JOIN [dbo].[DimDate] d CREATE CLUSTERED INDEX ccsi ON FactInternetSalesBigNoSorted ( OrderDateKey ) CREATE CLUSTERED COLUMNSTORE INDEX ccsi ON FactInternetSalesBigNoSorted WITH (DROP_EXISTING = ON) SELECT CalendarQuarter ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [dbo].[FactInternetSalesBigSorted] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter The fastest ? The fastest ?
  • 7. The Case of The Two Column Store Index Sizes SQL Server query tuning 101 The optimizer will always use the smaller data structures it can find to satisfy the query, right ?
  • 8. How Well Do The Queries Using The Two Column Stores Scale ? 0 10000 20000 30000 40000 50000 60000 70000 80000 2 4 6 8 10 12 14 16 18 20 22 24 Time(ms) Degree of Parallelism Non-sorted column store Sorted column store Data creation statement scaled using top 300,000 to create 1,095,600,000 rows.
  • 9. Can We Use All Available CPU Resource ? 0 10 20 30 40 50 60 70 80 90 100 2 4 6 8 10 12 14 16 18 20 22 24 PercentageCPUUtilization Degree of Parallelism Non-sorted Sorted Memory access should be 100% CPU intensive ?!?
  • 10. Wait Statistics Do Not Help Here ! Stats are for the query ran with a DOP of 24, a warm column store object pool and the column store created on pre-sorted data.
  • 11. Spin Locks Do Not Provide Any Clues Either Executes in 775 ms for a warm column store object pool 12 cores x 2.0 Ghz x 0.775 = 1,860,000,000 CPU cycles Total spins 293,491 SELECT [CalendarQuarter] ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [FactInternetSalesBig] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter OPTION (MAXDOP 24)
  • 12. Well Documented Tools Dynamic management views and functions Performance counters Extended events Not all wait events, latches and spin locks are well documented if documented at all. Takeaway: These tools are not going to tell us definitively where our CPU time is going !!!
  • 13. The Problem With Query Plan Costs Assumptions: The buffer cache is cold. IO cannot be performed in parallel. Data in different columns is never correlated ( improved on in SQL 2014 ). Hash distribution is always uniform. Etc. . . . Based on the amount of time it took a developers machine to complete certain operations, a Dell OptiPlex ( according to legend ).
  • 14. Response Time Wait Time + Service Time We need something to help find out what is happening here.
  • 15. A Not So Well Documented Tool
  • 16. Introducing the Windows Performance Toolkit Comes as part of the Windows Assessment and Deployment Kit. Traces are created via xperf and windows performance recorder. It utilises Event Tracing for Windows (ETW) Visualise traces using windows performance analyser. The real power is being able to stack walk the database engine
  • 17. Public Symbols These are labels contained in .pdb files that provide information on what programming construct generated a piece of machine code, also known as debug symbols. Caution: public symbols are version specific, down to CU level !
  • 18. Obtaining An ETW Trace Stack Walking The Database Engine xperf –on base –stackwalk profile xperf –d stackwalk.etl WPA SQL Statement
  • 19. Basic xperf Command Line Syntax xperf –on < kernel flags | kernel groups > [ -stackwalk < stackwalk kernel providers ] Kernel groups are groups of flags and not to be confused with Windows kernel groups Takeaway: kernel groups make life easier
  • 20. ETW Tracing Is Generally Light Weight This comes from the Premier Field Engineer Blog: Identifying the cause of SQL Server IO bottlenecks using XPerf, it reduced my IO throughput from 3300 Mb/s to 300 Mb/s XPERF -on PROC_THREAD+LOADER+ FLT_IO_INIT+FLT_IO+FLT_FASTIO+FLT_IO_FAILURE+ FILENAME+FILE_IO+FILE_IO_INIT+ DISK_IO+HARD_FAULTS+DPC+ INTERRUPT+CSWITCH+PROFILE+DRIVERS+DISPATCHER -stackwalk MiniFilterPreOpInit+MiniFilterPostOpInit+ CSWITCH+PROFILE+ThreadCreate+ReadyThread+ DiskReadInit+DiskWriteInit+DiskFlushInit+ FileCreate+FileCleanup+FileClose+FileRead+FileWrite -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular Takeaway: start off with the smallest set of kernel providers you can get away with and add incrementally.
  • 21. The Database Engine Composition ( SQL Server 2012 onwards ) Database Engine Language Processing: sqllang.dll Runtime: sqlmin.dll, sqltst.dll, qds.dll SQLOS: sqldk.dll, sqlos.dll
  • 22. Demonstration #2 Stack Walking an ETW Trace Of Our Simple Query
  • 24. Demonstration #4 Quantifying The Cost Of Extracting Row Values During A Table Scan
  • 25. North bridge Core Core 2 Architecture L3 Cache L1 Instruction Cache 32KB L2 Unified Cache 256K L1 Data Cache 32KB Core Core L1 Instruction Cache 32KB L2 Unified Cache 256K L1 Data Cache 32KB Core C P U South bridge Front side bus
  • 26. Core 2 Architecture  Latency when talking to IO and memory controllers even before accessing memory or IO.  Single level TLB.  Hyper-threading via NetBurst, this delivered poor performance.  Design was not module, 4 CPU was 2 x 2 core CPUs “Bolted together”.  Only last generation “Dunnington” had a L3 cache.
  • 27. Core Core i Series Generation 1 ‘Nehalem’ L3 Cache L1 Instruction Cache 32KB L2 Unified Cache 256K Power and Clock QPI Memory Controller L1 Data Cache 32KB Core CoreL1 Instruction Cache 32KB L2 Unified Cache 256K L1 Data Cache 32KB Core TLB Memory bus C P U QPI. . . Un-core  Integrated memory controller.  NetBurst replaced with a new hyper threading technology.  The Lookaside Buffer ( TLB ) for caching logical to physical memory mappings has two levels.  Front side bus replaced by the Quick Path Inter connector (QPI).  Genuine modular design.
  • 28. Core Core i Series Generation 2 ‘Sandybridge’ L3 Cache L1 Instruction Cache 32KB L0 UOP cache L2 Unified Cache 256K Power and Clock QPI Memory Controller L1 Data Cache 32KB Core CoreL1 Instruction Cache 32KB L0 UOP cache L2 Unified Cache 256K L1 Data Cache 32KB Core Bi-directional ring bus PCIe 2.0 controllerTLBMemory bus C P U QPI… Un-core  Integrated PCIe 2.0 controller.  Level 0 uop cache.  L3 cache connected to core via bi-directional ring bus.  Advanced Vector eXtensions v1.  Data Direct IO.
  • 29. Cache Lines  Unit of transfer between memory and the CPU.  Cache lines are used to create “Cache entries”, these are tagged with the requested memory location.  Data structures should be cache line aligned in order to avoid split register operations.  If different cores access the same cache line for writes, this effectively ‘Bounces’ the L2 cache => Try to avoid threads sharing cache lines. CPU Cache Line – 64 bits
  • 30. CPU Cache, Memory and IO Sub System, Latency Core Core Core Core L1 L1 L1 L1 L3 L2 L2 L2 L2 1ns 10ns 100ns 100us 10ms10us but, memory is the new flash storage and CPU cache is the new RAM . . . Slide borrowed from Thomas Kejser with his kind permission.
  • 31. 4 4 4 11 11 11 14 18 38 167 0 50 100 150 200 L1 Cache sequential access L1 Cache In Page Random access L1 Cache In Full Random access L2 Cache sequential access L2 Cache In Page Random access L2 Cache Full Random access L3 Cache sequential access L3 Cache In Page Random access L3 Cache Full Random access Main memory The CPU Cache Hierarchy Latencies In CPU Cycles Memory
  • 32.  Leverage the pre-fetcher as much as possible.  Larger CPU caches  L4 Cache => Crystalwell eDram  DDR4 memory  By pass main memory  Stacked memory:  Hybrid memory cubes ( Micron, Intel etc ).  High bandwidth memory ( AMD, Hynix ). Main Memory Is Holding The CPU Back, Solutions . . . Main memory CPU
  • 33. What The Pre-Fetcher Loves Column Store index Sequential scan
  • 34. What The Pre-Fetcher Hates ! Hash Table Can this be improved ?
  • 35. By Passing Main Memory With Data Direct IO C P U Core L3 Cache The Old World The New World With Data Direct IO Core C P U Core L3 Cache Core
  • 36. Transactions/s With and Without Data Direct IO 0 10 20 30 40 50 60 70 80 90 2 x 10 GBe 2 x 10 GBe 4 x 10 GBe 4 x 10 GBe 6 x 10 GBe 6 x 10 GBe 8 x 10 GBe 8 x 10 GBe Transactions/Sec(Mu) Single Socket IO Performance Xeon 5600 Xeon E5
  • 37. The Case Of The Two Column Store Index Sizes Could the CPU cycles required to access the CPU cache versus main memory be a factor ? . . . lets dig deeper
  • 38. Call Stack For Query Against Column Store On Non-Pre-Sorted Data Hash agg lookup weight 65,329.87 Column Store scan weight 28,488.73
  • 39. Control flow Data flow Where Is The Bottleneck In The Plan ? The stack trace is indicating that the Bottleneck is right here
  • 40. Does The Stream Aggregate Perform Any Better ?  Query takes seven minutes and nine seconds to run.  Performance is killed by a huge row mode sort prior to the stream aggregate !!! ( sorted column store )
  • 41. Call Stack For Query Against Column Store On Pre-Sorted Data Hash agg lookup weight:  now 275.00  before 65,329.87 Column Store scan weight  now 45,764.07  before 28,488.73
  • 42. The Case Of The Two Column Stores The hash aggregate using the column stored created on pre-sorted data is very CPU efficient. Why ?
  • 43. SQL Server and Large Memory Pages, How Does It Work ? L3 Cache Power and Clock Core Bi-directional ring bus PCI TLB C P U QPI Un-core Core Page Translation Table Memory Controller DTLB ( 1 st level ) STLB ( 2 nd level ) ~10s of CPU cycles 160+ CPU cycles
  • 44. The TLB Without Large Page Support L3 Cache Power and Clock Core Bi-directional ring bus PCITLB: 32 x 4Kb pages C P U QPI Un-core Core Memory Controller 128Kb of logical to physical memory mappings covered* * Nehalem through to Ivybridge architectures
  • 45. The TLB With Large Page Support L3 Cache Power and Clock Core Bi-directional ring bus PCITLB: 32 x 2Mb pages C P U QPI Un-core Core Memory Controller 128Kb of logical to physical memory mapping coverage is increased to 64Mb !!! Fewer trips off the CPU to the page table
  • 46. What Difference Do Large Pages Make ? 130s elapsed time 28% saving Large page support off Large page support on Page Lookups/s 222220 263766 0 50000 100000 150000 200000 250000 300000 93s elapsed time Single 6 core socket
  • 47. CPU Pipeline Architecture C P U Front end Back end  A ‘Pipeline’ of logical slots runs through the processor.  Front end can issue four micro ops per clock cycle.  Back end can retire up to four micro operations per clock cycle. AllocationRetirement
  • 48. Pipeline ‘Bubbles’ Are Bad ! C P U Front end Back end  Empty slots are referred to as ‘Bubbles’.  Causes of front end bubbles:  Bad speculation.  CPU stalls.  Data dependencies A = B + C E = A + D  Back end bubbles can be due to excessive demand for specific types of execution unit. AllocationRetirement Bubble Bubble Bubble Bubble
  • 49. A Basic NUMA Architecture Core Core Core Core L1 L1 L1 L1 L3 L2 L2 L2 L2 Core Core Core Core L1 L1 L1 L1 L3 L2 L2 L2 L2 Remote memory access Local memory access Local memory access NUMA Node 0 NUMA Node 1
  • 50. IOHub Four and Eight Node NUMA QPI Topologies ( Nehalem i7 onwards ) CPU 1 CPU 3 CPU 0 CPU 2 CPU 6 CPU 7 CPU 4 CPU 5 CPU 2 CPU 3 CPU 0 CPU 1 IO Hub IO Hub IOHubIOHub IOHub With 18 core Xeon’s in the offing, these topologies will become increasingly rare.
  • 51. Remote Node Memory Access Comes At A Cost An additional 20% overhead when accessing ‘Foreign’ memory ! ( from coreinfo )
  • 52. Local Vs Remote Memory Access and Thread Locality How does SQLOS schedule hyper- threads in relation to physical cores ? ( 6 cores per socket ) CPU socket 0 CPU socket 1 Core 0 Core 1 Core 2 Core 3 Core 4 Core 5
  • 53. Making Use Of CPU Stalls With Hyper Threading ( Nehalem i7 onwards ) Access n row B-tree L2 L3 Last level cache miss n row B-tree 1. Session 1 performs an index seek, pages not in CPU cache 2. A CPU stall takes place ( 160 clock cycles+ ) whilst the page is retrieved from memory. 3. The ‘Dead’ CPU stall cycles gives the physical core the opportunity to run a 2nd thread. Core L1
  • 54. Hyper-Threading Scalability For The ‘Sorted’ Column Store Index 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 CPU Utilisation 5 8 10 13 15 18 20 23 25 28 30 33 35 38 40 43 46 48 50 53 54 58 59 0 10 20 30 40 50 60 70 80 90 100 CPUUtilisation Degree of Parallelism  5 % CPU utilisation per core  60% of each core utilised by first hyper thread
  • 55. L1 L1 L1 L1 L3 L2 L2 L2 L2 Core Core Core Core SQL OS Layer One Scheduler Per Logical Processor  The OS has no concept of SQL resources such as latches.  ‘Hard’ context switches into soft user mode context switches.  SQL OS scheduler threads by prioritizing L2 cache hits and reuse over ‘Fairness’. The Rationale Behind SQL OS
  • 56. The LOGCACHE_ACCESS Spin Lock Buffer Offset (cache line) Alloc Slot in Buffer MemCpy Slot Content Log Writer Writer Queue Async I/O Completion Port Slot 1 LOGBUFFER WRITELOG Signal thread which issued commit T0 Tn Slot 127 Slot 126 LOG FLUSHQ Slide borrowed from Thomas Kejser with his kind permission. This spin lock is the ultimate bottleneck for ACID transactions in SQL Server
  • 57. Core Core Log writer thread The log writer needs to be able to free the spin lock cache line as fast as possible. Spinlocks and Context Switches Cache line
  • 58. Keeping The Rest Of The Database Engine Away From The Log Writer Thread The log writer thread is always be assigned to the first free CPU core.
  • 59. A Simple Test Create LOGCACHE_ACCESS Spin Lock Pressure  Create a table that ensures there is one row per page, in order to eliminate PAGELATCH_EX latch activity.  Insert 1000 rows into the table, the SPIDS for 22 sessions should fall in the range of 1 -> 1000.  Procedure to perform a single row update 10,000 times for the SPID of the session it is running in.
  • 60. Results 234,915,993 10,554,325 0 50000000 100000000 150000000 200000000 250000000 Default CPU affinity mask Core 0 removed from affinity mask Order of magnitude saving !!!
  • 61.  The log writer thread is always assigned to core 0.  By isolating the rest of the database engine from core 0, the log writer does not have to contend with so many context switches when handing off and receiving back the log cache access spin lock. How Has This Drop In Spins Been Achieved ?
  • 62. CPU +
  • 63. Segment Hash Key Value Scan Segment Hash Key Value Scan Column Store on Non-Pre-Sorted Data Hash table is likely to be at the high latency end of the cache hierarchy. Hypothesis Sequential access  Random access  Column Store on Pre-Sorted Data Hash table is likely to be at the low latency end of the cache hierarchy.
  • 64. The Case Of The 60% CPU Utilisation Ceiling Column store scans are pre-fetcher friendly, hash joins and hash aggregates are not Can a hash join or aggregate keep up with a column store scan ?
  • 65. Introducing Intel VTune Amplifier XE Investigating events at the CPU cache, clock cycle and instruction level requires software outside the standard Windows and SQL Server tool set. Refer to Appendix D for an overview of what “General exploration” provides.
  • 66. VTune Can Go One Better Than WPT, Down To CPU Cycle Level ! 181,578,272,367 versus 466,000,699 clock cycles !!!
  • 67. This Is What The CPU Stall Picture Looks Like Against DOP (last level cache misses against degree of parallelism)
DOP   Non-sorted       Sorted
2     13,200,924       3,000,210
4     30,902,163       1,200,084
6     161,411,298      16,203,164
8     1,835,828,499    29,102,037
10    2,069,544,858    34,802,436
12    4,580,720,628    35,102,457
14    2,796,495,741    48,903,413
16    3,080,615,628    64,204,494
18    3,950,376,507    63,004,410
20    4,419,593,391    85,205,964
22    4,952,446,647    68,404,788
24    5,311,271,763    72,605,082
  • 68. Does The OrderDateKey Column Fit In The L3 Cache ?
Table Name                   Column Name    Size (Mb)
FactInternetSalesBigNoSort   OrderDateKey   1786182
FactInternetSalesBigNoSort   Price1         3871
FactInternetSalesBigNoSort   Price2         3871
FactInternetSalesBigNoSort   Price3         3871
FactInternetSalesBigSorted   OrderDateKey   738
FactInternetSalesBigSorted   Price1         2965127
FactInternetSalesBigSorted   Price2         2965127
FactInternetSalesBigSorted   Price3         2965127
No, the L3 cache is 20Mb in size.
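Per-column sizes like these can be derived from the catalog views; a sketch assuming the SQL Server 2012/2014 sys.column_store_segments view, in which on_disk_size is reported per segment in bytes ( the column_id exposed by the view is the column's position within the columnstore, which for these tables lines up with the table's column_id ):

SELECT o.name                          AS table_name
,      c.name                          AS column_name
,      SUM(s.on_disk_size) / 1048576.0 AS size_mb
FROM   sys.column_store_segments s
JOIN   sys.partitions p ON p.hobt_id   = s.hobt_id
JOIN   sys.objects    o ON o.object_id = p.object_id
JOIN   sys.columns    c ON c.object_id = p.object_id
                       AND c.column_id = s.column_id
WHERE  o.name LIKE 'FactInternetSalesBig%'
GROUP  BY o.name, c.name
ORDER  BY o.name, c.name;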
  • 69. The Case Of The Two Column Store Index Sizes: Conclusion Turning the memory access pattern on the hash aggregate's hash table from random to sequential yields CPU savings greater than the cost of scanning an enlarged column store.
  • 70. Skew Is Also A Factor In Batch Mode Hash Join Performance Row mode: exchange operators repartition the build and probe inputs across threads – expensive, and data skew reduces parallelism. Batch mode: all threads build and probe a shared hash table – no repartitioning, and data skew speeds up processing. Slide borrowed from Thomas Kejser with his kind permission.
  • 71. The Case Of The 60% CPU Utilisation Ceiling Still not solved. Tuning 101: if you cannot max out the CPU capacity or IOPS bandwidth, there must be some form of contention at play . . .
  • 72. What Does Digging Deep Into The Call Stack Tell Us ? Throttling !
  • 73. The Case Of The 60% CPU Utilisation Ceiling: Conclusion The hash aggregate cannot keep up with the column store scan, therefore the batch engine throttles the scan by issuing sleep system calls !!! The integration services engine does something very similar; refer to 'back pressure'.
  • 74. The Case of The CPU Pressure Point Where are the pressure points on the CPU, and what can be done to resolve them ?
  • 75. Making Efficient Use Of The CPU In The "In Memory" World (diagram: data flows into the CPU front end and out of the back end) Front end pressure: the front end issues < 4 uops per cycle whilst the back end is ready to accept uops ( CPU stalls, bad speculation, data dependencies ). Back end pressure: retirement is throttled due to pressure on back end resources ( port saturation ).
  • 76. Lots Of KPIs To Choose From, Which To Select ? – CPU cycles per retired instruction (CPI): this should ideally be 0.25; anything approaching 1.0 is bad. – Front end bound: reflects the front end under-supplying the back end with work; this figure should be as close to zero as possible. – Back end bound: the back end cannot accept work from the front end because there is excessive demand for specific execution units.
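A worked example with hypothetical counts: a query that burns 1,860,000,000 clock cycles while retiring 7,440,000,000 instructions achieves a CPI of 1,860,000,000 / 7,440,000,000 = 0.25, i.e. the ideal rate of four retired instructions per cycle; the same cycle count against only 1,860,000,000 retired instructions gives a CPI of 1.0, the bad end of the scale.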
  • 77. These Are The Pressure Point Statistics For The 'Sorted' Column Store (chart: CPI, front end bound and back end bound KPI values plotted against degrees of parallelism 2 to 24) Refer to Appendix C for the formulae from which these metrics are derived.
  • 78. Which Parts Of The Database Engine Are Suffering Backend Pressure ? Results obtained for a degree of parallelism of 24.
  • 79. Back end Pressure Can Manifest As "Port Saturation" ( *Sandybridge core ) (diagram: the 256-bit FMUL and Blend execution units, exercised via CBagAggregateExpression::TryAggregateUsingQE_Pure*) SELECT [CalendarQuarter] ,SUM([Price1]) ,SUM([Price2]) ,SUM([Price3]) FROM [FactInternetSalesBig] f JOIN [DimDate] d ON f.OrderDateKey = d.DateKey GROUP BY CalendarQuarter OPTION (MAXDOP 24)
  • 80. Port Saturation Analysis Hit ratio by execution port: port 0 = 0.46, port 1 = 0.45, port 2 = 0.16, port 3 = 0.17, port 4 = 0.1, port 5 = 0.55. A hit ratio of 0.7 and above is deemed port saturation; if we can drive CPU utilisation above 60% we may start to see this on ports 0, 1 and 5.
  • 81. Back end Pressure Can we help the back end keep up with the front end ?
  • 82. Single Instruction Multiple Data ( SIMD ) – A class of CPU instructions that process multiple data points simultaneously. – A form of vectorised processing. – Once CPU stalls are minimised, the challenge becomes processing data on the CPU ( rate of instruction retirement ) as fast as possible.
  • 83. Lowering Clock Cycles Per Instruction By Leveraging SIMD Using conventional processing, adding two four element arrays together requires four instructions: A(1) + B(1) = C(1), A(2) + B(2) = C(2), A(3) + B(3) = C(3), A(4) + B(4) = C(4).
  • 84. Lowering Clock Cycles Per Instruction By Leveraging SIMD Using SIMD – "single instruction multiple data" – instructions, the whole addition A(1..4) + B(1..4) = C(1..4) can be performed using a single instruction.
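A sketch of what this could mean for the test query: the money type occupies 8 bytes ( internally a scaled 64 bit integer ), so a 256 bit register can hold four money values; a single vector addition across such a register ( AVX2 for integer data ) would retire four of the additions behind SUM([Price1]), SUM([Price2]) and SUM([Price3]) in one instruction, in principle cutting the instruction count for the aggregation arithmetic by a factor of four.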
  • 85. Intel Advanced Vector eXtensions Registers getting wider, instruction set getting richer:
Year    Microarchitecture  Process  Features                                      GFLOPS
2011    Westmere           32 nm    SSE 4.2, DDR3, PCIe2                          87
2012    Sandy Bridge       32 nm    AVX ( 256 bit registers ), DDR3, PCIe3        185
2013    Ivy Bridge         22 nm                                                  ~225
2014    Haswell            22 nm    AVX2 ( new instructions ), DDR4, PCIe3        ~500
2015    Broadwell          14 nm                                                  tbd
Future  Skylake            14 nm    AVX 3.2 ( 512 bit registers ), DDR4, PCIe4    tbd
  • 86. Does The SQL Server Database Engine Leverage SIMD Instructions ? – VTune Amplifier does not provide the option to pick out Streaming SIMD Extensions (SSE) integer events. – However, for a floating point hash aggregate we would hope to see floating point AVX instructions in use.
  • 87. The Case Of The CPU Pressure Point: Conclusion The batch mode engine needs to leverage SIMD. If the hash aggregate could 'crunch' more data per clock cycle, it would stand a better chance of keeping up with the column store scan ! Just about all other column store database engines use SIMD; please can the batch engine leverage SIMD, Microsoft . . .
  • 88. What About AMD ? – AMD came up with the 3DNow! SIMD instruction set. – It is losing the single threaded performance race to Intel: will this change with Excavator ? – AMD innovations: 1st 64 bit implementation of the x86 architecture; 1st to break the 1GHz barrier for an x86 CPU; 1st to release native dual and quad core x86 processors.
  • 89. A Look Into The Future: CPU Technology – Moore's law runs out of steam => transition away from CMOS. – L4 caches becoming more prevalent. – Aggressive vectorisation by leveraging GPU cores on the CPU die. – On CPU package NIC functionality. – Rise of the ARM processor in the data centre.
  • 90. A Look Into The Future: Storage and Memory – Rise of NVMe and NVMe over fabrics. – Rise of stacked memory technology. – Increase in NAND flash density through TLC and 3D NAND. – Mass storage and main memory converge.
  • 91. A Look Into The Future: Database Engines – Column store technology becomes the de facto standard for data warehouses and marts. – Adoption of more MonetDB innovations into popular database engines: database cracking, query recycling.
  • 92.
  • 93. Which Memory Your Data Is In Matters ! Locality Matters ! (diagram: where is my data ? In the L1 data cache, the L2 unified cache, the L3 cache . . . or, hopefully not, out across the memory bus ?!?)
  • 94. Going Off CPU Results In Hitting A Performance Brick Wall – Affects both data and logical memory mapping information. – Cache lines associated with singleton spinlocks need to be freed up as fast as possible. – The impact of the above has been demonstrated. (chart: throughput per thread against cache size)
  • 95. The Back End Of The CPU Is Now The Bottleneck For The Batch Mode Engine – Everything is good from a CPU cache hit, fetch and decode perspective. – The back end is struggling to keep up with the front end. – The emphatic answer is to process multiple data items per CPU clock cycle => leverage SIMD.
  • 96.
  • 98.
  • 99. Appendix A: Instruction Execution And The CPU Front / Back Ends (diagram: front end = branch predict, fetch from cache, decode into the decoded instruction buffer; back end = execute, reorder and retire)
  • 100. Appendix B - The CPU Front / Back Ends In Detail (diagram: detailed view of the front end and back end pipeline stages)
  • 101. Appendix C - CPU Pressure Points, Important Calculations – Front end bound ( smaller is better ) = IDQ_UOPS_NOT_DELIVERED.CORE / ( 4 * Clock ticks ) – Bad speculation = ( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / ( 4 * Clock ticks ) – Retiring = UOPS_RETIRED.RETIRE_SLOTS / ( 4 * Clock ticks ) – Back end bound = 1 - ( Front end bound + Bad speculation + Retiring ) ( ideally front end bound and bad speculation are zero, so this approaches 1 - Retiring )
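A worked example with hypothetical event counts, assuming 1,000,000,000 clock ticks: if IDQ_UOPS_NOT_DELIVERED.CORE = 400,000,000 then front end bound = 400,000,000 / ( 4 * 1,000,000,000 ) = 0.10; with bad speculation calculated as 0.05 and retiring as 0.45, back end bound = 1 - ( 0.10 + 0.05 + 0.45 ) = 0.40.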
  • 102. Appendix D - VTune Amplifier General Exploration An illustration of what the "General exploration" analysis capability of the tool provides.

Editor's Notes

  1. Signal wait time equalling total wait time is to be expected for short waits on uncontended spin locks. CXPACKET waits can be discounted 99.9% of the time.
  2. Xperf can provide deep insights into the database engine that other tools cannot; in this case we can walk the stack associated with query execution and observe the total CPU consumption, in milliseconds, up to any point in the stack.
  3. This slide is the "big reveal"; when comparing the two call stacks there are four points of interest: (1) the hash table lookup is more expensive for the query using the column store created on the non-sorted data, by an order of magnitude; (2) we see a line below the Flookup for the query against the column store created on non-sorted data which we do not see in the other call stack; (3) based on the number of times it is called, the hash aggregate lookup for the query on the non-sorted column store is throttling back the column store scan more than its counterpart in the other call stack; (4) the scan for the sorted column store is twice as expensive as that for the non-sorted column store.
  5. These results are for the scan of a 118Gb table against a cold buffer cache, with max memory for the instance set to 40Gb.
  6. Note the difference in latency between accessing the on CPU cache and main memory; accessing main memory incurs a large penalty in terms of lost CPU cycles. This is important, and one of the drivers behind the new optimizer mode introduced in SQL Server 2012 to support column store indexes.
  8. Keeping the rest of the database engine away from core 0 allows the spin lock cache line to be freed up as fast as possible.
  10. The results for this graph have been obtained using a single instance with large pages, LPIM, 40Gb max memory, the ‘Sorted’ column store index and a warm large object cache.