Fast Multi-Column Sorting in
Main-Memory Column-Stores
Wenjian Xu†, Ziqiang Feng†, Eric Lo‡
†The Hong Kong Polytechnic University
‡The Chinese University of Hong Kong
Background
• Analytic database
• Read-mostly queries
• Main memory
• Column store
• Column compression
• De-normalization
Sort
• Implementing SQL operators like
• GROUP BY
• ORDER BY
• PARTITION BY
SIMD-Sort
[Figure: values of a 44-bit column (0xBBB0000000F, 0x22200000001, 0x333000F0009, 0x8000000000E, 0x11000300001, 0x1020FF00000, 0x10800000090, 0x1000200000E, …) are loaded four at a time into a 256-bit SIMD register.]
• 44-bit column, 64-bit bank: 4x data parallelism
• Bank size could be 8-bit, 16-bit, 32-bit, or 64-bit
SIMD-Sort
[Figure: values of a 16-bit column (0xA000, 0x1000, 0x2000, 0x7200, 0x0000, 0x020F, 0x0800, 0x0002, …) are loaded sixteen at a time into a 256-bit SIMD register.]
• 16-bit column, 16-bit bank: 16x data parallelism
• Parallelism degree depends on the code width of the column
• Bank size could be 8-bit, 16-bit, 32-bit, or 64-bit
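The relationship between code width, bank size, and parallelism can be sketched in a few lines of Python. This is a minimal illustration, assuming the smallest bank that fits the code width is always chosen; the names `bank_size` and `parallelism` are hypothetical and do not come from the slides.

```python
SIMD_REGISTER_BITS = 256      # AVX2 register width, as stated in the slides
BANK_SIZES = (8, 16, 32, 64)  # bank sizes listed in the slides


def bank_size(code_width: int) -> int:
    """Return the smallest bank (in bits) that can hold one encoded value.

    Assumption: the narrowest fitting bank is always preferred.
    """
    for bank in BANK_SIZES:
        if code_width <= bank:
            return bank
    raise ValueError(f"code width {code_width} does not fit in a 64-bit bank")


def parallelism(code_width: int) -> int:
    """Number of column values processed by one SIMD instruction."""
    return SIMD_REGISTER_BITS // bank_size(code_width)


if __name__ == "__main__":
    for width in (12, 16, 20, 44):
        print(f"{width}-bit column: {bank_size(width)}-bit bank, "
              f"{parallelism(width)}x parallelism")
```

A 44-bit column does not fit a 32-bit bank, so it falls back to 64-bit banks and 4x parallelism, while a 16-bit column reaches 16x.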
Multi-Column Sorting
Example Query:
SELECT *
FROM orders
ORDER BY order_date, retail_price
(ORDER BY on multiple attributes)
[Figure: execution time of TPC-H queries Q1, Q2, Q3, Q7, Q9, Q10, Q16, and Q18, split into Multi-Column Sorting and Scan+Lookup+Aggregation.]
• Multi-Column Sorting becomes the bottleneck
• Widespread in workloads: 45% TPC-H queries, 72% TPC-DS queries
• Our work: Optimizing Multi-Column Sorting
State-of-the-Art Implementation: Column-at-a-Time
Order by X, Y

Input:
oid  X (20-bit)  Y (12-bit)
1    0xEEEEE     0xAAA
2    0x00000     0xCCC
3    0xEEEEE     0xBBB
4    0x00000     0xAAA
5    0xEEEEE     0xAAA
6    0x00000     0xFFF
7    0xEEEEE     0xCCC

Step 1: SIMD-sort X (32-bit bank, 8x parallelism)
X:    0x00000  0x00000  0x00000  0xEEEEE  0xEEEEE  0xEEEEE  0xEEEEE
oid:  2        4        6        1        3        5        7

Step 2: LOOKUP Y through the oids of the sorted X
Y:    0xCCC  0xAAA  0xFFF  0xAAA  0xBBB  0xAAA  0xCCC
oid:  2      4      6      1      3      5      7

Step 3: SIMD-sort Y within each group of tied X values (16-bit bank, 16x parallelism per group)
Y:    0xAAA  0xCCC  0xFFF  0xAAA  0xAAA  0xBBB  0xCCC
oid:  4      2      6      1      5      3      7

Can we do better?
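The column-at-a-time procedure can be sketched in scalar Python using the example values from the slide. This is only an illustration of the three steps (sort X, look up Y, sort Y within tied groups); Python's built-in `sorted` stands in for the SIMD sort, and the function name `column_at_a_time` is an assumption made for this sketch.

```python
from itertools import groupby

# Example data from the slide; oids are 1-based positions.
X = [0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE]  # 20-bit
Y = [0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC]                # 12-bit


def column_at_a_time(x, y):
    """Return the oids ordered by (x, y), sorting one column per round."""
    # Round 1: sort X and carry the oids along (sorted() is stable).
    oids = sorted(range(1, len(x) + 1), key=lambda oid: x[oid - 1])
    result = []
    # Walk over each group of tied X values.
    for _, group in groupby(oids, key=lambda oid: x[oid - 1]):
        # LOOKUP: fetch Y through the oid, then sort Y inside the group.
        result.extend(sorted(group, key=lambda oid: y[oid - 1]))
    return result


if __name__ == "__main__":
    print(column_at_a_time(X, Y))  # [4, 2, 6, 1, 5, 3, 7]
```

The printed oid order matches the final state shown on the slide.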
Option 1: Stitch Together
Order by X, Y, with X (20-bit) and Y (12-bit) as in the Column-at-a-Time example

Column-at-a-Time: SIMD-sort X (32-bit bank, 8x parallelism), LOOKUP, then SIMD-sort Y per group (16-bit bank, 16x parallelism)
Stitch X and Y: Stitch, then a single SIMD-sort, then LOOKUP

Stitch: Supercolumn (32-bit)
oid  Supercolumn
1    0xEEEEEAAA
2    0x00000CCC
3    0xEEEEEBBB
4    0x00000AAA
5    0xEEEEEAAA
6    0x00000FFF
7    0xEEEEECCC

SIMD-sort the supercolumn (32-bit bank, 8x parallelism)
Supercolumn:  0x00000AAA  0x00000CCC  0x00000FFF  0xEEEEEAAA  0xEEEEEAAA  0xEEEEEBBB  0xEEEEECCC
oid:          4           2           6           1           5           3           7

• Correctness proved!
• Advantage: save one LOOKUP operation
• Advantage: save one round of sorting
• Drawback: stitch overhead
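The stitching step can be sketched as a shift-and-or over the two columns, followed by a single sort. This is a minimal scalar illustration with the slide's example values; `stitch_sort` is a hypothetical name, and `sorted` again stands in for the SIMD sort.

```python
# Example data from the slide; oids are 1-based positions.
X = [0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE]  # 20-bit
Y = [0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC]                # 12-bit
Y_BITS = 12


def stitch_sort(x, y, y_bits):
    """Stitch x and y into one supercolumn and sort it in a single round."""
    # X occupies the high bits, Y the low bits, so integer order on the
    # supercolumn equals lexicographic order on (x, y).
    supercolumn = [(xv << y_bits) | yv for xv, yv in zip(x, y)]
    oids = sorted(range(1, len(supercolumn) + 1),
                  key=lambda oid: supercolumn[oid - 1])
    return [supercolumn[oid - 1] for oid in oids], oids


if __name__ == "__main__":
    values, oids = stitch_sort(X, Y, Y_BITS)
    print([hex(v) for v in values])
    print(oids)  # [4, 2, 6, 1, 5, 3, 7]
```

The oid order is the same as the column-at-a-time result, obtained with one sorting round instead of two.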
Is stitching together always good?
Let’s consider another example.
Option 1: Stitch Together
Order by X, Y, now with wider columns

Input:
X (24-bit)  Y (20-bit)
0xEEEEEE    0xAAAAA
0x000000    0xCCCCC
0xEEEEEE    0xAAAAA
0x000000    0xCCCCC
0xEEEEEE    0xCCCCC
0x000000    0xAAAAA
0xEEEEEE    0xCCCCC

Column-at-a-Time:
• X (24-bit): 32-bit bank SIMD-sort, 8x parallelism
• LOOKUP
• Y (20-bit): 32-bit bank SIMD-sort per group, 8x parallelism

Stitch X and Y:
• Supercolumn (44-bit): 0xEEEEEEAAAAA, 0x000000CCCCC, 0xEEEEEEAAAAA, 0x000000CCCCC, 0xEEEEEECCCCC, 0x000000AAAAA, 0xEEEEEECCCCC
• 64-bit bank SIMD-sort, 4x parallelism
• Drawback: lower data parallelism

Any alternatives other than stitching X and Y in this example?
Option 2: Bit Borrowing
Order by X, Y, with X (24-bit) and Y (20-bit) as in the previous example

Borrowing bits from Y to X (<< 4 bits): X is shifted left by 4 bits and takes the 4 most significant bits of Y, giving a 28-bit first column and a 16-bit second column.

X (24-bit)  Y (20-bit)  Borrowed bits  First column (28-bit)
0xEEEEEE    0xAAAAA     A              0xEEEEEEA
0x000000    0xCCCCC     C              0x000000C
0xEEEEEE    0xAAAAA     A              0xEEEEEEA
0x000000    0xCCCCC     C              0x000000C
0xEEEEEE    0xCCCCC     C              0xEEEEEEC
0x000000    0xAAAAA     A              0x000000A
0xEEEEEE    0xCCCCC     C              0xEEEEEEC

• First column (28-bit): 32-bit bank SIMD-sort, 8x parallelism
  Sorted: 0x000000A, 0x000000C, 0x000000C, 0xEEEEEEA, 0xEEEEEEA, 0xEEEEEEC, 0xEEEEEEC
• LOOKUP
• Second column (16-bit): 16-bit bank SIMD-sort per group, 16x parallelism

Comparison:
• Column-at-a-Time: 32-bit bank (8x) for X, 32-bit bank (8x) for Y
• Option 1: Stitch Together: 64-bit bank (4x)
• Borrowing bits from Y to X: 32-bit bank (8x), then 16-bit bank (16x). Improved parallelism.
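Bit borrowing can be sketched as moving the top bits of Y into the low end of X and then running the usual two-round sort on the reshaped columns. The sketch below uses the slide's 24-bit and 20-bit example values; `borrow_bits` and `two_round_sort` are hypothetical helper names, and `sorted` stands in for the SIMD sort.

```python
from itertools import groupby

X = [0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE]  # 24-bit
Y = [0xAAAAA, 0xCCCCC, 0xAAAAA, 0xCCCCC, 0xCCCCC, 0xAAAAA, 0xCCCCC]         # 20-bit
Y_BITS = 20
BORROW = 4  # number of bits moved from Y to X, as on the slide


def borrow_bits(x, y, y_bits, borrow):
    """Move the `borrow` most significant bits of y to the low end of x."""
    rest = y_bits - borrow
    first = [(xv << borrow) | (yv >> rest) for xv, yv in zip(x, y)]  # 28-bit
    second = [yv & ((1 << rest) - 1) for yv in y]                    # 16-bit
    return first, second


def two_round_sort(first, second):
    """Sort by `first`, then by `second` within groups of tied `first`."""
    oids = sorted(range(1, len(first) + 1), key=lambda oid: first[oid - 1])
    result = []
    for _, group in groupby(oids, key=lambda oid: first[oid - 1]):
        result.extend(sorted(group, key=lambda oid: second[oid - 1]))
    return result


if __name__ == "__main__":
    first, second = borrow_bits(X, Y, Y_BITS, BORROW)
    print([hex(v) for v in sorted(first)])
    print(two_round_sort(first, second))
```

Because the concatenation of the two reshaped columns is bit-for-bit the same as the concatenation of X and Y, the resulting order is the same as ordering by (X, Y) directly, while the second round now fits 16-bit banks.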
Optimal Plan
• Given 3 columns with 11-bit, 14-bit, and 21-bit to be sorted:
  • Stitch together?
  • Bit borrowing?
  • Split into more rounds?
• Num. of possible plans: 2^(11+14+21)
• In the paper:
  • Cost model
  • Plan enumeration and search
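To make the choice concrete, the sketch below compares a few hand-picked candidate plans for the 11-bit, 14-bit, and 21-bit columns by the bank size and parallelism of each sorting round. The plan list and the helper names are assumptions made for illustration; this is not the cost model or the search procedure from the paper, and it ignores lookup and stitching costs.

```python
SIMD_REGISTER_BITS = 256
BANK_SIZES = (8, 16, 32, 64)
WIDTHS = (11, 14, 21)  # code widths of the three columns on the slide


def bank_size(width):
    """Smallest bank that holds `width` bits (assumed selection rule)."""
    for bank in BANK_SIZES:
        if width <= bank:
            return bank
    raise ValueError(f"{width} bits do not fit in a 64-bit bank")


def describe(rounds):
    """Return (round width, bank size, parallelism) for each sorting round."""
    return [(w, bank_size(w), SIMD_REGISTER_BITS // bank_size(w)) for w in rounds]


# Hypothetical candidate plans: each entry lists the bit width of every round.
PLANS = {
    "column-at-a-time": [11, 14, 21],
    "stitch all three": [46],
    "stitch first two": [25, 21],
    "borrow 7 bits from the third column": [32, 14],
}

if __name__ == "__main__":
    for name, rounds in PLANS.items():
        # Every plan must cover all 46 bits exactly once.
        assert sum(rounds) == sum(WIDTHS)
        print(name, describe(rounds))
```

Stitching everything forces 64-bit banks (4x), while borrowing 7 bits fills a 32-bit bank exactly and leaves a 14-bit remainder that fits 16-bit banks. Which plan is actually fastest depends on costs this sketch does not model.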
Experiments
• Setup
  • Intel Xeon E5 10-core & Intel i7 quad-core
  • AVX2 instruction set (256 bits)
• Data sets
  • TPC-H
  • TPC-H Skew
  • TPC-DS
  • Real data (Airline Origin and Destination Survey)
Speedup over Column-at-a-Time
[Figure: speedup over Column-at-a-Time on TPC-H, TPC-H Skew, TPC-DS, and Real Data.]
• 1.8X ~ 5.5X speedup
Data Size Scalability
Linear data size scalability
Our solution for Multi-Column Sorting
Core/thread Scalability
Linear core/thread scalability
Our solution for Multi-Column Sorting
Summary
• First work to pinpoint and tackle the issue of multi-column
sorting
• Our technique: manipulate the bits across input columns
• Up to 5.5X speedup in query execution.

Editor's Notes

  1. Joint work with Ziqiang Feng from PolyU and Eric Lo from CU.
  2. This work is all about analytic databases. Such databases deal with read-mostly queries. To support real-time query processing, we try to keep all the data in memory and use a column-oriented store. Furthermore, data columns are encoded for memory efficiency, and de-normalization techniques are used to eliminate joins.
  3. Sorting is a crucial operation in main-memory column-stores, as it can be used to implement SQL operators…
  4. Sorting especially benefits from the SIMD features offered by modern CPUs. Column values encoded with 44 bits are loaded into SIMD registers for sorting. A current SIMD register is usually 256 bits long, much wider than a normal CPU register. In such a register, each operand, or bank, can be 8, 16, 32, or 64 bits wide. Here we have to use 64-bit banks (need to mention that 32 bits are not enough!). During the sorting process, each SIMD instruction can process 4 column values in parallel. Compared to scalar sorting, SIMD sort could achieve…
  5. The audience may not be familiar with *code width*, so mention it with *a column encoded with 36 bits* and *another column with 16 bits as its code width*. Floating-point numbers with limited precision can be scaled to integers by multiplying with a certain factor. Explain in more detail why a 16-bit bank is used: 8 bits are not enough, 32 bits are too wasteful.
  6. How to bring out multi-column sorting: in a column store, the two columns are stored separately. A traditional column-store first sorts column order_date, then sorts column retail_price within each group of tied order_date values. The red part represents the time spent on multi-column sorting; the blue part refers to the time spent on other operations.
  7. Next, we turn to column Y. According to the ORDER BY clause, the ordering of column Y must be based on the ordering of column X, so before sorting column Y we have to re-order it according to the sorted column X. This is achieved by a sequence of lookup operations through the object ids of column X. Now we have column Y ordered by X. Next, we identify the two groups in column X, where each group contains tied values of X; correspondingly, the second round of sorting is performed within each group of column Y.
  8. We obtain the same result as the column-at-a-time solution. Essentially, the stitch strategy sorts the two columns in one go, thus eliminating one round of sorting.
  9. The reduced data parallelism may offset that benefit and make the stitching strategy inferior to the column-at-a-time solution.
  10. Our cost model can accurately quantify the cost of each plan. In the model, we divide the process of multi-column sorting (MCS) into detailed steps and run calibration experiments to improve the accuracy of the plan cost on specific platforms. As for plan search, we devise pruning rules to make sure that the search process itself does not become a bottleneck.