Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Variations of the Star
Schema Benchmark to Test
the Effects of Data Skew on
Query Performance
T IL M AN N R ABL , M EIKEL PO ESS, H AN S - AR N O
J AC O BSEN , PAT R IC K AN D EL IZABETH O’N EIL

ICPE 2013, PRAGUE, 24/04/2013
MIDDLEWARE SYSTEMS
RESEARCH GROUP
MSRG.ORG

Real Life Data is
Distributed Uniformly…
Well, Not Really
◦ Customers zip codes typically clustered around metropolitan areas
◦ Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific
periods
◦ US retail sales:
◦ peak during Holiday Season
◦ December sales are 2x of
January sales

Source: US Census Data
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS

2

Student Seminar Signup
Distribution


3

How Can Skew Effect
Database Systems?
Data placement
◦ Partitioning
◦ Indexing

Data structures
◦ Tree balance
◦ Bucket fill ratio
◦ Histograms

Optimizer  finding the optimal query plan
◦ Index vs. non-index driven plans
◦ Hash join vs. merge join
◦ Hash group by vs. sort group by


4

Agenda
Data Skew in Current Benchmarks

Star Schema Benchmark (SSB)
Parallel Data Generation Framework (PDGF)

Introducing Skew in SSB


5

Data Skew in Benchmarks
TPC-D (1994-1999): only uniform data
◦ SIGMOD 1997 - “Successor of TPC-D
should include data skew”
◦ No effect until …

TPC-DS (released 2012)
◦ Contains comparability zones
◦ Not fully utilized

TPC-D/H variations
◦ Chaudhuri and Narayasa: Zipfian distribution on all columns
◦ Crolotte and Ghazal: comparability zones

Still lots of open potential


6

Star Schema Benchmark I

Star schema version of TPC-H
◦
◦
◦
◦

Merged Order and Lineitem
Date dimension
Dropped Partsupp
Selectivity hierarchies
◦ C_City  C_Nation  C_Region
◦ …


7

Star Schema Benchmark II
Completely new set of queries
4 flights of 3-4 queries
◦ Designed for functional coverage and selectivity coverage
◦ Drill down in dimension hierarchies
◦ Predefined selectivity

Q1.1

select
from
where
and
and
and

sum(lo_extendedprice*lo_discount) as revenue
lineorder, date
lo_orderdate = d_datekey
d_year = 1993
lo_discount between 1 and 3
lo_quantity < 25;

Q1.2

select
from
where
and
and
and

sum(lo_extendedprice*lo_discount) as revenue
lineorder, date
lo_orderdate = d_datekey
d_yearmonthnum = 199301
lo_discount between 1 and 3
lo_quantity between 26 and 35;


8

Parallel Data Generation
Framework
Generic data generation framework
Relational model
◦ Schema specified in configuration file
◦ Post-processing stage for alternative representations

Repeatable computation
◦ Based on XORSHIFT random number generators
◦ Hierarchical seeding strategy

Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12.
Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11.
Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11.
Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10.

9

XML

PDGF

DB

Configuring PDGF
Schema configuration
Relational model
◦ Tables, fields

Properties
◦ Table size, characters, …

Generators
◦ Simple generators
◦ Metagenerators

Update definition
◦ Insert, update, delete
◦ Generated as change data capture

<table name="SUPPLIER">
<size>${S}</size>
<field name="S_SUPPKEY" size="" type="NUMERIC“
primary="true" unique="true">
<gen_IdGenerator />
</field>
<field name="S_NAME" size="25" type="VARCHAR">
<gen_PrePostfixGenerator>
<gen_PaddingGenerator>
<gen_OtherFieldValueGenerator>
<reference field="S_SUPPKEY" />
</gen_OtherFieldValueGenerator >
<character>0</character>
<padToLeft>true</padToLeft>
<size>9</size>
</gen_PaddingGenerator >
<prefix>Supplier </prefix>
</gen_PrePostfixGenerator>
</field>
[..]


10

Opportunities to Inject Data
Skew in
Foreign key relations
◦ E.g., L_PARTKEY

One fact table measures
◦ E.g., L_Quantity

Single dimension hierarchy

◦ E.g., P_Brand → P_Category → P_Mfgr

Multiple dimension hierarchies

◦ E.g., City → Nation in Supplier and Customer

Experimental methodology
◦
◦
◦
◦

One experiment series for each of the above
Comparison to original SSB
Comparison of index-forced, non-index, and automatic optimizer mode
SSB scale factor 100 (100 GB), x86 server


11

Skew in Foreign Key
Relations
Very realistic
Easy to implement in PDGF

◦ Just add a distribution to the reference

<distribution name="Exponential“ lambda="0.26235" />

But!
Dimension attributes uniformly distributed

Dimension keys uncorrelated to dimension attributes
 Very limited effect on selectivity
Focus on attributes in selectivity predicates


12

Skew in Fact Table Measure
– Lo_Quantity
Lo_Quantity distribution
◦ Values range between 0 and 50
◦ Originally uniform distribution with:
◦ P(X=x)=0.02
◦ Coefficient of variation of 0.00000557

◦ Proposed skewed distribution with:
◦

P(X

x)

0 .3
1 .3

x

Query 1.1
◦ lo_quantity < x, x ∈ [2, 51]

Results
◦ Switches too early to non-index plan
◦ Switches too late to non-index plan
◦ Optimizer agnostic to distribution

13

Skew in Single Dimension
Hierarchy - Part
P_Category distribution
◦ Uniform P(X=x)=0.04
◦ Skewed P(X=x)= 0.01 - 48.36
◦ Probabilities explicitly defined

Query 2.1
◦ Restrictions on two dimensions

Results uniform case
◦ Index driven superior
◦ Optimizer chooses non-index driven

Results skewed case
◦ Switches too early to non-index
plan


14

Skew in Multiple Dimension
Hierarchies – S_City &
C_City
Skewed S_City & C_City
◦ Probabilites exponentially
distributed

Query 3.3

Join Cardinality

Elapsed Time

◦ Restrictions on 3 dimensions
◦ Variation on Supplier and Customer
city

Results uniform and skewed cases
◦ Automatic plan performs best
◦ Cross over between automatic
uniform and skewed too late


15

Conclusion & Future Work
PDGF implementation of SSB
Introduction of skew in SSB
Extensive performance analysis
◦ Several interesting optimizer effects
◦ Performance impact of skew

Future Work
Further analysis on impact of skew
Skew in query generation
Complete suite for testing skew effects


16

Thanks
Questions?
Download and try PDGF:
http://www.paralleldatageneration.org
(scripts used in the study available on website above)


17

Back-up Slides

18

Configuring PDGF
Generation
Generation configuration
Defines the output
◦
◦
◦
◦

Scheduling
Data format
Sorting
File name and location

Post processing
◦
◦
◦
◦

Filtering of values
Merging of tables
Splitting of tables
Templates (e.g. XML / queries)

<table name="QUERY_PARAMETERS" exclude="false" >
<output name="CompiledTemplateOutput" >
[..]
<template ><!-int y = (fields [0]. getPlainValue ()).intValue ();
int d = (fields [1]. getPlainValue ()).intValue ();
int q = (fields [2]. getPlainValue ()).intValue ();
String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR;
buffer.append("-- Q1.1" + n);
buffer.append("select sum(lo_extendedprice *");
buffer.append("
lo_discount) as revenue" + n);
buffer.append(“ from lineorder , date" + n);
buffer.append(“ where lo_orderdate = d_datekey" + n);
buffer.append(“
and d_year = " + y + n);
buffer.append(“
and lo_disc between " + (d - 1));
buffer.append(“
and " + (d + 1) + n);
buffer.append(“
and lo_quantity < " + q + ";" + n);
--></template >
</output >
</table >


19

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Recomendados

Recomendados

Mais conteúdo relacionado

Último

Último (20)

Destaque

Destaque (20)

Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance

Notas do Editor