Variations of the Star
Schema Benchmark to Test
the Effects of Data Skew on
Query Performance
TILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN,
PATRICK AND ELIZABETH O’NEIL

ICPE 2013, PRAGUE, 24/04/2013
MIDDLEWARE SYSTEMS
RESEARCH GROUP
MSRG.ORG
Real Life Data is
Distributed Uniformly…
Well, Not Really
◦ Customer zip codes are typically clustered around metropolitan areas
◦ Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific
periods
◦ US retail sales:
◦ peak during Holiday Season
◦ December sales are 2x January sales

Source: US Census Data
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

Student Seminar Signup
Distribution

How Can Skew Affect
Database Systems?
Data placement
◦ Partitioning
◦ Indexing

Data structures
◦ Tree balance
◦ Bucket fill ratio
◦ Histograms

Optimizer → finding the optimal query plan
◦ Index vs. non-index driven plans
◦ Hash join vs. merge join
◦ Hash group by vs. sort group by
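As a rough illustration of the data placement point, the sketch below compares how evenly rows spread over partitions for uniform and skewed key values. Everything in it is an assumption made for illustration: the modulo partitioning scheme, the 1/rank (Zipf-like) weights, and the sizes are not taken from the slides.

```python
import random
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition (simple modulo scheme)."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

def imbalance(sizes):
    """Ratio of the largest partition to the mean partition size."""
    return max(sizes) / (sum(sizes) / len(sizes))

rng = random.Random(42)  # fixed seed so the run is repeatable
num_keys, num_rows, parts = 1000, 100_000, 8  # hypothetical sizes

uniform_keys = [rng.randrange(num_keys) for _ in range(num_rows)]
# Zipf-like weights 1/rank: low key values are far more frequent.
weights = [1.0 / (rank + 1) for rank in range(num_keys)]
skewed_keys = rng.choices(range(num_keys), weights=weights, k=num_rows)

uniform_imbalance = imbalance(partition_sizes(uniform_keys, parts))
skewed_imbalance = imbalance(partition_sizes(skewed_keys, parts))
print(f"uniform: {uniform_imbalance:.2f}, skewed: {skewed_imbalance:.2f}")
```

An imbalance close to 1 means evenly filled partitions; under skew the partition holding the most frequent keys grows well beyond the mean.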

Agenda
Data Skew in Current Benchmarks

Star Schema Benchmark (SSB)
Parallel Data Generation Framework (PDGF)

Introducing Skew in SSB

Data Skew in Benchmarks
TPC-D (1994-1999): only uniform data
◦ SIGMOD 1997 - “Successor of TPC-D
should include data skew”
◦ No effect until …

TPC-DS (released 2012)
◦ Contains comparability zones
◦ Not fully utilized

TPC-D/H variations
◦ Chaudhuri and Narasayya: Zipfian distribution on all columns
◦ Crolotte and Ghazal: comparability zones

Still lots of open potential

Star Schema Benchmark I

Star schema version of TPC-H
◦ Merged Order and Lineitem
◦ Date dimension
◦ Dropped Partsupp
◦ Selectivity hierarchies
◦ C_City → C_Nation → C_Region
◦ …
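A selectivity hierarchy means that an equality predicate becomes more selective at each lower level. The sketch below works this out for uniformly distributed values. The cardinalities (5 regions, 25 nations, 250 cities) are an assumption based on the usual SSB definition and are not stated on the slide.

```python
from fractions import Fraction

# Assumed number of distinct values per hierarchy level (not from the slide).
LEVELS = {"c_region": 5, "c_nation": 25, "c_city": 250}

def uniform_selectivity(distinct_values):
    """Fraction of rows matched by an equality predicate on one value,
    assuming all values are equally likely."""
    return Fraction(1, distinct_values)

for column, distinct in LEVELS.items():
    sel = uniform_selectivity(distinct)
    print(f"{column} = <one value>: selectivity {sel} ({float(sel):.4f})")
```

Under skew these fixed ratios no longer hold, which is what the later experiments exploit.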

Star Schema Benchmark II
Completely new set of queries
4 flights of 3-4 queries
◦ Designed for functional coverage and selectivity coverage
◦ Drill down in dimension hierarchies
◦ Predefined selectivity

Q1.1

select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
and d_year = 1993
and lo_discount between 1 and 3
and lo_quantity < 25;

Q1.2

select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, date
where lo_orderdate = d_datekey
and d_yearmonthnum = 199301
and lo_discount between 1 and 3
and lo_quantity between 26 and 35;
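The "predefined selectivity" of these queries can be estimated by multiplying the per-predicate fractions, assuming uniform and independent columns. The value domains below (7 years, 84 months, discounts 0..10, quantities 1..50) are assumptions based on the usual SSB definition, not values given on the slide.

```python
from fractions import Fraction

# Assumed value domains (not stated on the slide).
YEARS = 7              # d_year: 1992..1998
MONTHS = YEARS * 12    # distinct d_yearmonthnum values
DISCOUNT_VALUES = 11   # lo_discount: 0..10
QUANTITY_VALUES = 50   # lo_quantity: 1..50

def q11_filter_factor():
    """Q1.1 as shown above: one year, 3 discount values, quantities 1..24."""
    return (Fraction(1, YEARS)
            * Fraction(3, DISCOUNT_VALUES)
            * Fraction(24, QUANTITY_VALUES))

def q12_filter_factor():
    """Q1.2 as shown above: one month, 3 discount values, quantities 26..35."""
    return (Fraction(1, MONTHS)
            * Fraction(3, DISCOUNT_VALUES)
            * Fraction(10, QUANTITY_VALUES))

print(float(q11_filter_factor()), float(q12_filter_factor()))
```

The drop from Q1.1 to Q1.2 reflects the drill-down from a year to a single month.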

Parallel Data Generation
Framework
Generic data generation framework
Relational model
◦ Schema specified in configuration file
◦ Post-processing stage for alternative representations

Repeatable computation
◦ Based on XORSHIFT random number generators
◦ Hierarchical seeding strategy
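The idea behind repeatable computation can be sketched as follows: each field value is derived from a chain of seeds (project, table, column, row), so any value can be recomputed on its own and in any order. This is only an illustration of the principle. The shift triple, the mixing constant, and all function names are assumptions, not PDGF's actual implementation.

```python
MASK64 = (1 << 64) - 1

def xorshift64(state):
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state & MASK64

def derive_seed(parent_seed, child_id):
    """Derive a child seed from a parent seed and a child identifier."""
    # Mix the identifier into the parent seed; "| 1" keeps the state non-zero.
    mixed = (parent_seed ^ (child_id * 0x9E3779B97F4A7C15)) & MASK64 | 1
    return xorshift64(mixed)

def field_value(project_seed, table_id, column_id, row, modulus):
    """Value of one field, computed without generating any other field."""
    table_seed = derive_seed(project_seed, table_id)
    column_seed = derive_seed(table_seed, column_id)
    row_seed = derive_seed(column_seed, row)
    return xorshift64(row_seed) % modulus

# The same coordinates always give the same value, in any generation order.
print(field_value(12345, 1, 2, 1000, 50))
```

Because a value depends only on its coordinates, independent workers can generate disjoint row ranges in parallel and still produce the same data set.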

Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12.
Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11.
Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11.
Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10.
Configuring PDGF
[Diagram: XML → PDGF → DB]
Schema configuration
Relational model
◦ Tables, fields

Properties
◦ Table size, characters, …

Generators
◦ Simple generators
◦ Metagenerators

Update definition
◦ Insert, update, delete
◦ Generated as change data capture

<table name="SUPPLIER">
  <size>${S}</size>
  <field name="S_SUPPKEY" size="" type="NUMERIC"
         primary="true" unique="true">
    <gen_IdGenerator />
  </field>
  <field name="S_NAME" size="25" type="VARCHAR">
    <gen_PrePostfixGenerator>
      <gen_PaddingGenerator>
        <gen_OtherFieldValueGenerator>
          <reference field="S_SUPPKEY" />
        </gen_OtherFieldValueGenerator>
        <character>0</character>
        <padToLeft>true</padToLeft>
        <size>9</size>
      </gen_PaddingGenerator>
      <prefix>Supplier </prefix>
    </gen_PrePostfixGenerator>
  </field>
  [..]
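Read from the inside out, the nested generators for S_NAME take the value of S_SUPPKEY, left-pad it with "0" to 9 characters, and prepend the prefix. A minimal sketch of that chain is shown below. The function name and defaults are illustrative and mirror the XML values; this is not PDGF code.

```python
def supplier_name(suppkey, prefix="Supplier ", width=9, pad_char="0"):
    """Mimic the generator chain: other-field value -> left padding -> prefix."""
    padded = str(suppkey).rjust(width, pad_char)  # padToLeft = true
    return prefix + padded

print(supplier_name(1))
print(supplier_name(4711))
```

Composing small generators in this way is what the slide calls metagenerators: the outer generators transform the output of the inner ones.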

Opportunities to Inject Data
Skew in
Foreign key relations
◦ E.g., L_PARTKEY

One fact table measures
◦ E.g., L_Quantity

Single dimension hierarchy

◦ E.g., P_Brand → P_Category → P_Mfgr

Multiple dimension hierarchies

◦ E.g., City → Nation in Supplier and Customer

Experimental methodology
◦ One experiment series for each of the above
◦ Comparison to original SSB
◦ Comparison of index-forced, non-index, and automatic optimizer mode
◦ SSB scale factor 100 (100 GB), x86 server

Skew in Foreign Key
Relations
Very realistic
Easy to implement in PDGF

◦ Just add a distribution to the reference

<distribution name="Exponential" lambda="0.26235" />
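To make the effect of such a distribution concrete, the sketch below draws foreign key values with exponentially decaying frequency using the same lambda. How PDGF actually maps the distribution onto key values is not described on the slide. The bucketing into integer keys and the rejection of out-of-range values are assumptions.

```python
import math
import random

def skewed_reference(rng, num_keys, lam=0.26235):
    """Draw a key in 1..num_keys with exponentially decaying probability."""
    while True:
        # Inverse-CDF sample of an exponential variate with rate lam.
        value = -math.log(1.0 - rng.random()) / lam
        key = 1 + int(value)   # bucket the variate into integer keys
        if key <= num_keys:    # reject the tail beyond the key range
            return key

rng = random.Random(1)  # fixed seed for repeatability
sample = [skewed_reference(rng, 50) for _ in range(20_000)]
print("share of key 1:", sample.count(1) / len(sample))
```

With this bucketing, key k is drawn with probability exp(-lam*(k-1)) - exp(-lam*k). Since exp(0.26235) is about 1.3, that is roughly 0.3 / 1.3^k, so the lowest key receives about 23% of all references.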

But!
Dimension attributes uniformly distributed

Dimension keys uncorrelated to dimension attributes
→ Very limited effect on selectivity
Focus on attributes in selectivity predicates

Skew in Fact Table Measure
– Lo_Quantity
Lo_Quantity distribution
◦ Values range between 0 and 50
◦ Originally uniform distribution with:
◦ P(X=x)=0.02
◦ Coefficient of variation of 0.00000557

◦ Proposed skewed distribution with:
◦ $P(X=x) = \frac{0.3}{1.3^{x}}$

Query 1.1
◦ lo_quantity < x, x ∈ [2, 51]
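The predicate selectivity under both distributions can be computed directly. The sketch below assumes 50 integer values 1..50 (which is what P(X=x)=0.02 implies) and a skewed distribution of P(X=x) = 0.3 / 1.3^x, renormalised over that range. Both the domain and the renormalisation are assumptions.

```python
VALUES = range(1, 51)  # assumed lo_quantity domain

def skewed_pmf(x):
    """Skewed probability of one quantity value, before renormalisation."""
    return 0.3 / 1.3 ** x

TOTAL = sum(skewed_pmf(v) for v in VALUES)  # very close to 1

def selectivity_uniform(x):
    """Fraction of rows with lo_quantity < x under the uniform distribution."""
    return sum(1 for v in VALUES if v < x) / len(VALUES)

def selectivity_skewed(x):
    """Fraction of rows with lo_quantity < x under the skewed distribution."""
    return sum(skewed_pmf(v) for v in VALUES if v < x) / TOTAL

for x in (2, 5, 10, 25, 51):
    print(x, round(selectivity_uniform(x), 4), round(selectivity_skewed(x), 4))
```

An optimizer that assumes uniformity estimates the first column of numbers while the data follows the second, so its estimate is far too low for small x under skew.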

Results
◦ Switches too early to non-index plan
◦ Switches too late to non-index plan
◦ Optimizer agnostic to distribution
Skew in Single Dimension
Hierarchy - Part
P_Category distribution
◦ Uniform P(X=x)=0.04
◦ Skewed P(X=x) = 0.01% to 48.36%
◦ Probabilities explicitly defined

Query 2.1
◦ Restrictions on two dimensions

Results uniform case
◦ Index driven superior
◦ Optimizer chooses non-index driven

Results skewed case
◦ Switches too early to non-index
plan

Skew in Multiple Dimension
Hierarchies – S_City &
C_City
Skewed S_City & C_City
◦ Probabilities exponentially distributed

Query 3.3
◦ Restrictions on 3 dimensions
◦ Variation on Supplier and Customer city
[Figures: Join Cardinality; Elapsed Time]
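When both city attributes are skewed, the join cardinality for a given pair of cities can vary by many orders of magnitude. The sketch below illustrates this for one supplier city and one customer city, assuming the two dimensions are independent. The number of cities (250) and the decay factor (0.9) are hypothetical and are not the parameters used in the study.

```python
def exponential_probabilities(n, decay):
    """Probabilities proportional to decay**rank, normalised to sum to 1."""
    weights = [decay ** rank for rank in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def join_fraction(p_supplier_city, p_customer_city):
    """Fraction of fact rows matching one supplier city and one customer
    city, assuming the two dimensions are independent."""
    return p_supplier_city * p_customer_city

NUM_CITIES = 250  # assumed number of distinct cities
DECAY = 0.9       # hypothetical decay factor, not from the slides

uniform = [1 / NUM_CITIES] * NUM_CITIES
skewed = exponential_probabilities(NUM_CITIES, DECAY)

print("uniform pair:       ", join_fraction(uniform[0], uniform[0]))
print("most frequent pair: ", join_fraction(skewed[0], skewed[0]))
print("least frequent pair:", join_fraction(skewed[-1], skewed[-1]))
```

Varying which cities appear in the predicate therefore sweeps the join cardinality across a wide range while the query text stays structurally the same.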

Results uniform and skewed cases
◦ Automatic plan performs best
◦ Cross over between automatic
uniform and skewed too late

Conclusion & Future Work
PDGF implementation of SSB
Introduction of skew in SSB
Extensive performance analysis
◦ Several interesting optimizer effects
◦ Performance impact of skew

Future Work
Further analysis on impact of skew
Skew in query generation
Complete suite for testing skew effects

Thanks
Questions?
Download and try PDGF:
http://www.paralleldatageneration.org
(scripts used in the study available on website above)

Back-up Slides
Configuring PDGF
Generation
Generation configuration
Defines the output
◦ Scheduling
◦ Data format
◦ Sorting
◦ File name and location

Post processing
◦ Filtering of values
◦ Merging of tables
◦ Splitting of tables
◦ Templates (e.g. XML / queries)

<table name="QUERY_PARAMETERS" exclude="false">
  <output name="CompiledTemplateOutput">
    [..]
    <template><!--
      // Read the generated query parameters: year, discount, quantity.
      int y = (fields[0].getPlainValue()).intValue();
      int d = (fields[1].getPlainValue()).intValue();
      int q = (fields[2].getPlainValue()).intValue();
      String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR;
      buffer.append("-- Q1.1" + n);
      buffer.append("select sum(lo_extendedprice *");
      buffer.append(" lo_discount) as revenue" + n);
      buffer.append("  from lineorder, date" + n);
      buffer.append(" where lo_orderdate = d_datekey" + n);
      buffer.append("   and d_year = " + y + n);
      buffer.append("   and lo_discount between " + (d - 1));
      buffer.append(" and " + (d + 1) + n);
      buffer.append("   and lo_quantity < " + q + ";" + n);
    --></template>
  </output>
</table>
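The template logic amounts to substituting three generated parameters into the text of Q1.1. A minimal stand-alone sketch of the same substitution is shown below. The function name is illustrative, and the discount column is written as lo_discount to match the query definition of Q1.1.

```python
def render_q11(year, discount, quantity):
    """Render the text of Q1.1 for one set of generated parameters."""
    lines = [
        "-- Q1.1",
        "select sum(lo_extendedprice * lo_discount) as revenue",
        "  from lineorder, date",
        " where lo_orderdate = d_datekey",
        f"   and d_year = {year}",
        f"   and lo_discount between {discount - 1} and {discount + 1}",
        f"   and lo_quantity < {quantity};",
    ]
    return "\n".join(lines) + "\n"

# Parameters 1993, 2, 25 give the standard Q1.1 predicates.
print(render_q11(1993, 2, 25))
```

Generating the parameters with the same framework as the data means that skewed parameter distributions could be applied to query generation as well.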



Editor's Notes

  1. Data skew is naturally occurring