Super scaling singleton inserts
DBA Level 400
About Me
The 'standard' stuff:
 An independent SQL consultant.
 A user of SQL Server since 2000.
 14+ years of SQL Server experience.
What I'm passionate about: "I'm pushing the database engine as hard as I can captain, she's going to blow!"
The Exercise
Squeeze every last drop of performance out of the hardware!
ostress -E -dSingletonInsert -Q"exec usp_insert" -n40
Test Environment
 SQL Server 2016 CTP 2.3
 Windows server 2012 R2
 2 x 10-core Xeon V3 CPUs at 2.2 GHz with hyper-threading enabled
 64 GB DDR4 quad channel memory
 4 x SanDisk Extreme Pro 480 GB in RAID 1 (64K allocation unit size)
 ostress used for generating concurrent workload
 Use the conventional database engine to begin with . . .
I Will Be Using Windows Performance Toolkit . . . A Lot !
 It allows CPU time to be
quantified across the whole
database engine.
 Not just what Microsoft deems we should see, but everything!
 The database engine
equivalent of seeing the Matrix
in code form ;-)
Where Everyone Starts From . . . A Monotonically Increasing Key
CREATE TABLE [dbo].[MyBigTable] (
    [c1] [bigint] IDENTITY(1, 1) NOT NULL,
    [c2] [datetime] NULL,
    [c3] [char](111) NULL,
    [c4] [int] NULL,
    [c5] [int] NULL,
    [c6] [bigint] NULL,
    CONSTRAINT [PK_BigTableSeq] PRIMARY KEY CLUSTERED ([c1] ASC)
)
[Screenshots: CPU utilization (02:12:26) and wait stats for the baseline run.]
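The deck does not show the body of usp_insert, the procedure the ostress command drives; a minimal hypothetical sketch of a singleton-insert procedure against this table might look like the following.

-- Hypothetical sketch only: the real usp_insert is not shown in the deck.
-- Each iteration is its own auto-commit transaction, i.e. a batch size of 1.
CREATE PROCEDURE [dbo].[usp_insert]
    @Iterations int = 100000   -- assumed iteration count per ostress thread
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @i int = 0;
    WHILE @i < @Iterations
    BEGIN
        -- c1 is populated by the IDENTITY property, giving the
        -- monotonically increasing key discussed on the next slide.
        INSERT INTO [dbo].[MyBigTable] ([c2], [c3], [c4], [c5], [c6])
        VALUES (GETDATE(), REPLICATE('x', 111), 1, 2, 3);
        SET @i += 1;
    END;
END;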
The “Last Page Problem”
[Diagram: a clustered index B-tree rooted at HOBT_ROOT. With a monotonically increasing key every insert lands on the right-most (Max) leaf page while the remaining (Min) pages see none, so all sessions contend for the same last page.]
Overcoming The “Last Page” Problem
Key Type                    Elapsed Time (s)
SPID Offset                 600
Partition + SPID Offset     616
NEWID()                     982
IDENTITY                    7946
NEWSEQUENTIALID             8170

What are we waiting on?
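The DDL behind the SPID-offset key is not shown in the deck; a hypothetical sketch of the idea, assuming each session writes into its own key range derived from @@SPID plus a shared sequence, might look like this.

-- Hypothetical sketch of a SPID-offset key: interleave sessions across the
-- key space so concurrent inserts do not all hit the same last page.
CREATE SEQUENCE dbo.seq_MyBigTable AS bigint START WITH 1 CACHE 1000;
GO
CREATE TABLE dbo.MyBigTableSpid (
    c1 bigint NOT NULL,
    c2 datetime NULL,
    c3 char(111) NULL,
    CONSTRAINT PK_BigTableSpid PRIMARY KEY CLUSTERED (c1 ASC)
);
GO
CREATE PROCEDURE dbo.usp_insert_spid_offset
AS
BEGIN
    SET NOCOUNT ON;
    -- High-order bits come from the session id, low-order bits from the
    -- sequence; the multiplier is an assumed per-session range size.
    DECLARE @seq bigint = NEXT VALUE FOR dbo.seq_MyBigTable;
    DECLARE @key bigint = (CAST(@@SPID AS bigint) * 10000000000) + @seq;

    INSERT INTO dbo.MyBigTableSpid (c1, c2, c3)
    VALUES (@key, GETDATE(), REPLICATE('x', 111));
END;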
Can Delayed Durability Help ?
Logging Type         Elapsed Time (s)
Delayed durability   265
Conventional         600
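Delayed durability is enabled at the database level and can then be requested per commit; a minimal sketch, with the database name assumed:

-- Allow delayed durable commits in the test database (name assumed).
ALTER DATABASE SingletonInsert SET DELAYED_DURABILITY = ALLOWED;
GO
-- Opt in per transaction: the commit returns before the log block is flushed
-- to disk, trading a small window of potential data loss for lower latency.
BEGIN TRANSACTION;
INSERT INTO dbo.MyBigTable (c2, c3, c4, c5, c6)
VALUES (GETDATE(), REPLICATE('x', 111), 1, 2, 3);
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);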
What Is Wrong In Task Manager ?
Fixing CPU Core Starvation With Trace Flag 8008
 The scheduler with the least load is now favoured over the 'preferred' scheduler.
 Documented in a CSS engineer's note.
 Elapsed time has gone backwards: it is now 453 seconds! Why?
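Trace flag 8008 is undocumented; as a sketch, it can be enabled globally for testing with DBCC TRACEON (or as a -T startup parameter):

-- Undocumented trace flag: favour the least-loaded scheduler over the
-- 'preferred' scheduler when assigning tasks. Test environments only.
DBCC TRACEON (8008, -1);

-- Confirm which global trace flags are active.
DBCC TRACESTATUS (-1);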
Where Are Our CPU Cycles Going ?
How Spinlocks Work
A task on a scheduler will spin until it can acquire the spinlock it is after.
For short-lived waits this uses fewer CPU cycles than yielding and then waiting for the task's thread to reach the head of the runnable queue.
Spinlock Backoff
We have to yield the scheduler at some stage !
Introducing The LOGCACHE_ACCESS Spinlock
[Diagram: threads T0..Tn acquire the LOGCACHE_ACCESS spinlock to allocate a slot (1..127) at the current buffer offset (a cache line) in the log buffer, then memcpy their slot content in. The log writer drains the writer queue via async I/O and a completion port, then signals the thread which issued the commit. The associated wait types are LOGBUFFER, WRITELOG and LOGFLUSHQ. The LOGCACHE_ACCESS spinlock is the bit we are interested in.]
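Spin activity on this spinlock (and on XDESMGR, which appears later) can be sampled from sys.dm_os_spinlock_stats; comparing snapshots taken before and after the workload gives the per-run spin counts:

-- Snapshot spin statistics for the spinlocks discussed in this deck;
-- run before and after the workload and diff the numbers.
SELECT name, collisions, spins, spins_per_collision, sleep_time, backoffs
FROM sys.dm_os_spinlock_stats
WHERE name IN (N'LOGCACHE_ACCESS', N'XDESMGR')
ORDER BY spins DESC;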
Anatomy of A Modern CPU
[Diagram: a modern two-socket CPU. Each core has an L0 uop cache, 32KB L1 instruction and data caches and a 256KB unified L2 cache; the cores share an L3 cache. The 'un-core' holds the memory controller, TLB/memory bus interface, power and clock logic and the QPI links to the other socket.]
Memory, Cache Lines and The CPU Cache
[Diagram: objects allocated in memory (e.g. new OperationData()) are pulled into the CPU cache as 64-byte cache lines, each tracked by a tag in the cache.]
Spinlocks and Memory
[Diagram: every spin_acquire on the integer backing a spinlock forces the cache line holding it to be transferred between the cores, and across sockets, that are competing for it.]
What Happens If We Give The Log Writer Its Own CPU Core ?
Configuration                                         Elapsed Time (s)
Conventional logging                                  600
Delayed durability                                    265
TF8008, delayed durability                            1193
TF8008, delayed durability, affinity mask change      231
We Get The Lowest Elapsed Time So Far
* With 38 threads, all other tests with 40.
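The affinity mask change is a sketch like the one below, assuming the log writer's 'home' scheduler sits on the first core of socket 0 and, with hyper-threading enabled, occupies logical processors 0 and 1 on this 40-logical-processor box:

-- Keep the rest of the database engine off logical processors 0 and 1 so the
-- log writer (assumed to live on the first core of socket 0) owns that core.
-- The CPU numbers are specific to this 2-socket, 40-logical-processor server.
ALTER SERVER CONFIGURATION
SET PROCESS AFFINITY CPU = 2 TO 39;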
Scalability With and Without A CPU Core Dedicated To The Log Writer
[Chart: insert rate (inserts/s, y-axis up to 600,000) versus insert thread count (2-38); series: Baseline (Batch Size=1) and Log Writer With Dedicated Core, Batch Size=1.]
. . . and What About LOGCACHE_ACCESS Spins ?
[Chart: LOGCACHE_ACCESS spins (y-axis up to 12,000,000,000) versus thread count (2-34); series: Baseline and Log Writer with Dedicated CPU Core.]
What Difference Has This Made To Where CPU Time Is Going ?
With the default CPU affinity mask: 63,166,836 ms (40 threads) vs. log writer with a dedicated CPU core: 220,168 ms (38 threads).
Optimizations That Failed To Make The Grade
 Large memory pages
Allows the translation lookaside buffer (TLB) to cover more memory for logical-to-physical memory mapping.
 Trace flag 2330
Stops spins on OPT_IDX_STATS.
 Trace flag 1118
Prevents mixed allocation extents (enabled by default in SQL Server 2016).
A Different Spinlock Is Now The Most Spin Intensive
A new spinlock is now the most spin intensive: XDESMGR (probably spinlock<109,9,1> in the call stack). What does it do?
Digging Into The Call Stack To Understand Undocumented Spinlocks
1. Start the trace:  xperf -on PROC_THREAD+LOADER+PROFILE -StackWalk Profile
2. Run the workload
3. Stop the trace:   xperf -d stackwalk.etl
4. Load the trace into WPA
5. Locate the spinlock in the call stack
6. 'Invert' the call stack
Examining The XDESMGR Spinlock By Digging Into The Call Stack
 This serialises access to the part of the database engine that allocates and destroys transaction ids.
 How do you relieve pressure on this spinlock?
 Have multiple insert statements per transaction, as in the sketch below.
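A hypothetical 'batch size = 2' variant of the insert procedure simply wraps two inserts in one explicit transaction, so a single transaction id covers both rows:

-- Hypothetical batch-size-2 insert: one transaction id is allocated for two
-- rows, halving XDESMGR pressure relative to singleton commits.
CREATE PROCEDURE dbo.usp_insert_batch2
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;
        INSERT INTO dbo.MyBigTable (c2, c3, c4, c5, c6)
        VALUES (GETDATE(), REPLICATE('x', 111), 1, 2, 3);

        INSERT INTO dbo.MyBigTable (c2, c3, c4, c5, c6)
        VALUES (GETDATE(), REPLICATE('y', 111), 4, 5, 6);
    COMMIT TRANSACTION;
END;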
Options For Dealing With The XDESMGR Spinlock
 Relieving pressure on the LOGCACHE_ACCESS spinlock makes the
XDESMGR spinlock the bottleneck.
 There are three places to go at this point:
 Increase the number of DML statements per transaction.
 Shard the table across databases and instances.
 Use in-memory OLTP native transactions.
Increasing The Batch Size By Just One Makes A Big Difference !
[Chart: insert rate (inserts/s, y-axis up to 900,000) versus thread count (2-36); series: Baseline (Batch Size=1), Log Writer With Dedicated Core Batch Size=1, Log Writer With Dedicated Core Batch Size=2.]
. . . and The Difference This Makes To XDESMGR Spins
[Chart: XDESMGR spins (y-axis up to 200,000,000,000) versus thread count (2-36); series: Baseline (Batch Size=1), Log Writer With Dedicated Core Batch Size=1, Log Writer With Dedicated Core Batch Size=2.]
Does It Matter Which NUMA Node The Insert Runs On ?
[Diagram: two 10-core CPU sockets, NUMA node 0 and NUMA node 1.]
"What's really going to bake your noodle . . .": 8 insert threads on NUMA node 0 (the log writer's socket) finish in 73 s, while 8 insert threads on NUMA node 1 take 125 s.
What Does Windows Performance Toolkit Have To Tell Us ?
18 insert threads co-located on the same CPU socket as the log writer: 84,697 ms, vs. 18 insert threads not co-located on the same socket as the log writer: 11,281,235 ms.
So I Should Look At Tuning The CPU Affinity Mask ?
 Get the basics right first:
 Minimize transaction log fragmentation (both internal and external).
 Use low latency storage.
 Avoid log intensive operations, page splits etc.
 Use minimally logged operations where appropriate.
 Only when:
 All of the above has been done,
 The disk-based row store engine is being used, and
 The workload is OLTP heavy and uses more than 12 CPU cores (6 per socket),
should you look at giving the log writer a CPU core to itself.
Hard To Solve Logging Issues
 I have to use the disk-based row store engine.
 My single-threaded app cannot easily be made multi-threaded.
 How do I get the best possible log write performance?
 Use NUMA connection affinity to connect to the same socket as the log writer.
 Disable hyper-threading; whole cores are always faster than hyper-threads.
 'Affinitize' the rest of the database engine away from the log writer thread's 'home' CPU core.
 Go for a CPU with the best single-threaded performance available.
The CPU Cycle Cost Of Spinlock Cache Line Transfer
[Diagram: each spin_acquire forces the spinlock's cache line to be transferred between the cores competing for it. Core-to-core transfer on the same socket costs roughly 34 CPU cycles; core-to-core transfer between different sockets costs roughly 100 CPU cycles.]
Remember, All Memory Access Is CPU Intensive
This Man Seriously Knows A Lot About Memory
 Ulrich Drepper, author of:
What Every Programmer Should Know About Memory
 From Understanding CPU Caches
“Use per CPU memory; lock thread to specific CPU”
This is our CPU affinity mask trick 
Cache Line Ping Pong
[Diagram: an eight-socket topology (CPU 0-7) connected through I/O hubs.]
"Cache line ping pong is deadly for performance."
The more CPU sockets and cores you have, the greater the ramifications this has for SQL Server scalability on "big boxes".
‘Sharding’ The Database Across Instances
[Diagram: two 10-core CPU sockets; Instance A 'affinitized' to NUMA node 0, Instance B 'affinitized' to NUMA node 1.]
 'Shard' databases across instances.
 2 x LOGCACHE_ACCESS and XDESMGR spinlocks, one pair per instance.
 Spinlock cache line transfers are bound by the latency of the L3 cache, not the QuickPath Interconnect.
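Binding each instance to its own NUMA node can be done with the NUMANODE form of the process affinity syntax, run once per instance (node numbers assumed):

-- Run on instance A: bind the whole engine to NUMA node 0.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0;

-- Run on instance B: bind the whole engine to NUMA node 1.
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 1;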
What Can We Get From An Instance ‘Affinitized’ To One CPU Socket ?
[Chart: insert rate (inserts/s, y-axis up to 500,000) versus thread count (1-18) for an instance 'affinitized' to one CPU socket.]
With a Batch Size of 2, 32 Threads Achieve The Best Throughput
[Screenshot of wait statistics, annotated: logging-related activity and latching!]
Where to now?
In Memory OLTP To The Rescue, But What Will It Give Us ?
 Only redo is written to the transaction log (durability = SCHEMA_AND_DATA). Does this relieve pressure on the LOGCACHE_ACCESS spinlock?
 Zero latching and locking.
 Native procedure compilation.
 No "last page" problem, thanks to IMOLTP's use of hash buckets.
 Spinlocks will still be in play, though.
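A minimal sketch of a memory-optimized equivalent of MyBigTable, with a hash-indexed key and full durability; the table name and bucket count are assumptions matching the tests that follow:

-- Hypothetical in-memory OLTP equivalent of MyBigTable. The hash index removes
-- the last-page hot spot; DURABILITY = SCHEMA_AND_DATA means only redo is logged.
CREATE TABLE dbo.MyBigTableIM (
    c1 bigint IDENTITY(1, 1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 8388608),
    c2 datetime NULL,
    c3 char(111) NULL,
    c4 int NULL,
    c5 int NULL,
    c6 bigint NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);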
Insert Scalability with A Non Natively Compiled Stored Procedure
[Chart: insert rate (inserts/s, y-axis up to 600,000) versus thread count (1-18); series: Default Engine, IMOLTP Range Index, IMOLTP Hash Index bc=8388608, IMOLTP Hash Index bc=16777216.]
What Does The BLOCKER_ENUM Spinlock Protect ?
Transaction synchronization between the default and in-memory OLTP engines ?
Where Are Our CPU Cycles Going, The Overhead Of Language Processing
Time to try native in-memory OLTP transactions and natively compiled stored procedures.
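A sketch of a natively compiled insert against the memory-optimized table sketched earlier; the atomic block options are required by the native compilation syntax:

-- Hypothetical natively compiled insert: the T-SQL is compiled to machine code,
-- removing the interpreted language-processing overhead seen in the CPU profile.
CREATE PROCEDURE dbo.usp_insert_native
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH
    (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')

    -- One row per call; the atomic block is the transaction.
    INSERT INTO dbo.MyBigTableIM (c2, c3, c4, c5, c6)
    VALUES (GETDATE(), 'x', 1, 2, 3);
END;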
Insert Scalability with A Natively Compiled Stored Procedure
[Chart: insert rate (inserts/s, y-axis up to 9,000,000) versus thread count (1-40); series: bucket count=8388608, bucket count=16777216, bucket count=33554432, range index.]
Hash Indexes Bucket Count and Balancing The Equation
Smaller bucket counts = better cache line reuse + reduced TLB thrashing + reduced hash table cache-out.
Larger bucket counts = reduced cache line reuse + increased TLB thrashing + less hash bucket scanning for lookups.
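Whether a bucket count sits in the sweet spot can be checked with sys.dm_db_xtp_hash_index_stats: long average chains suggest too few buckets, while a very high empty-bucket count suggests wasted memory and poorer cache behaviour.

-- Inspect bucket usage and chain lengths for the in-memory hash indexes.
SELECT OBJECT_NAME(hs.object_id) AS table_name,
       i.name                    AS index_name,
       hs.total_bucket_count,
       hs.empty_bucket_count,
       hs.avg_chain_length,
       hs.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats AS hs
JOIN sys.indexes AS i
  ON i.object_id = hs.object_id
 AND i.index_id  = hs.index_id;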
Is Our CPU Affinity Mask Trick Relevant To In Memory OLTP ?
 Default CPU affinity mask and 18 insert threads.
 A CPU core dedicated to the log writer and 18 insert threads.
Optimizations That Failed To Make The Grade
 Large memory pages
As per the default database engine, this made no difference to performance.
 Turning off adjacent cache line pre-fetching
This can degrade performance by saturating the memory bus when hyper-threading is in use, and can cause cache pollution when the pre-fetched line is not used.
Takeaways
 Monotonically increasing keys do not scale with the default database engine.
 Dedicate a CPU core to the log writer to relieve pressure on the LOGCACHE_ACCESS spinlock.
 Batch DML statements together per transaction to relieve XDESMGR spinlock pressure.
 The further the LOGCACHE_ACCESS spinlock cache line has to travel, the more
performance is degraded.
 Native compilation results in a performance increase of over an order of magnitude
(at least) over non natively compiled stored procedures.
 There is a bucket count “Sweet spot” for IMOLTP hash indexes which is influenced by
hash collisions, bucket scans and hash lookup table cache out.
Further Reading
 Super scaling singleton inserts blog post
 Tuning The LOGCACHE_ACCESS Spinlock On A “Big Box” blog post
 Tuning The XDESMGR Spinlock On A “Big Box” blog post
chris1adkin@yahoo.co.uk
http://uk.linkedin.com/in/wollatondba
ChrisAdkin8
Editor's Notes

  1. SQL Server 2008 R2 introduced the concept of “Exponential back off”.
  2. The log writer is always assigned to the first CPU core of one of the CPU sockets, which is usually socket 0 (NUMA node 0). Because hyper-threading is enabled, each physical CPU core appears in the affinity mask as two logical processors, which is why two logical processors are being removed from the affinity mask. Were hyper-threading disabled, there would be a 1:1 relationship between logical processors and physical CPU cores, in which case only one logical processor would be removed from the affinity mask.
  3. LOGBUFFER waits => Occurs when a task is waiting for space in the log buffer to store a log record. Consistently high values may indicate that the log devices cannot keep up with the amount of log being generated by the server. Essentially, 30 threads saturate the write bandwidth of our storage.
  4. The LOGCACHE_ACCESS spins for both tests are very similar, the key difference is that with the “CPU affinity mask trick” we are getting the same number of spins as we do with the baseline with superior insert throughput.
  5. Changing the CPU affinity mask has ensured that when log writer needs to release the cache line associated with the LOGCACHE_ACCESS spinlock, no SQL OS scheduler level swap in of the log writer is required first. Not only does this cost us CPU time but the sharing of a CPU core by the log writer and any other task means that data and instructions in the L1/2 cache of the core may be wiped out when the other task is running.
  6. As is invariably the case with performance tuning, you remove one bottleneck only for a new one to appear somewhere else.
  7. I am assuming you are already using the lowest latency storage available PCIe based flash with a NVMe driver. The term “Affinitizing” the rest of the database engine away from the log writer thread is a grandiose way of referring to the CPU affinity mask trick.
  8. 134217728 corresponds to the
  9. 134217728 corresponds to the
  10. Using a natively compiled stored procedure for the insert into an in-memory table makes a tremendous difference; we can see that even with two threads and a compiled procedure, the in-memory OLTP engine is beating its disk-based row store counterpart. Other takeaways include the fact that a hash index beats a range index for insert throughput and that there is a bucket count sweet spot for the best performance.