Big Data Provenance - Challenges and
Implications for Benchmarking
Boris Glavic
IIT DBGroup
December 17, 2012
Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
4 Conclusions
Disclaimer
• My previous benchmark experience
  • Limited to being a user
• My main background is in provenance
  • . . . mostly for RDBMS
• The main point of this talk (visions)
  • Provenance as a benchmark use-case
  • Provenance as a supporting technology for benchmarking
• I will make some outrageous claims to get my point across
Slide 1 of 23 Boris Glavic Big Data Provenance - Provenance
What is Provenance?
• Information about the creation process and origin of data
• Data items and collections
  • Data item = atomic unit of data
• Transformation
  • Abstraction for processing
  • Inputs and outputs are data items (collections)
Example: [figure omitted]
Types of Provenance
• Given a data item d
Data Provenance
• Which data items were used in the generation of d
• Representation: e.g., a set of data items
Transformation Provenance
• Which transformations generated d
  • transitively
Additional Information
• Execution Environment
• User
• Time
• . . .
Data and Transformation Provenance
Example (Data Provenance): [figure omitted]
Example (Transformation Provenance): [figure omitted]
Coarse-grained vs. Fine-grained Data Provenance
Coarse-grained Provenance
• Transformations are handled as black-boxes
• ⇒ Each output of a transformation depends on all of its inputs
Fine-grained Provenance
• Consider the processing logic of a transformation to determine data-flow
• ⇒ All data items that contributed to a result
  • Sufficiency: sufficient to derive d through the transformation
  • Necessity: necessary in deriving d through the transformation
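The difference between the two granularities can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the talk: the transformation is a hypothetical group-by sum (`group_sum`), and provenance is represented as a set of input row positions.

```python
from collections import defaultdict

# Hypothetical transformation: sum the values per key.
def group_sum(rows):
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

def coarse_provenance(rows):
    # Black-box view: every output depends on every input row.
    all_inputs = set(range(len(rows)))
    return {key: set(all_inputs) for key in group_sum(rows)}

def fine_provenance(rows):
    # Uses the grouping logic: an output depends only on rows with its key.
    prov = defaultdict(set)
    for position, (key, _) in enumerate(rows):
        prov[key].add(position)
    return dict(prov)

rows = [("a", 1), ("b", 2), ("a", 3)]
coarse = coarse_provenance(rows)
fine = fine_provenance(rows)
```

Here the fine-grained provenance of "a" contains rows 0 and 2 only, while the coarse-grained provenance of every output contains all three rows.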
Granularity
Example (Coarse-grained): [figure omitted]
Example (Fine-grained): [figure omitted]
Use-cases
• Data debugging
  • E.g., tracing an erroneous data item back to erroneous inputs
• Trust and Quality
  • E.g., computing trust based on trust in data items in provenance
• Probabilistic data
  • Computing probability of result based on probabilities in provenance
• Security
  • Enforce access-control on query results based on provenance
• Understanding misbehaviour of systems
  • E.g., detecting security breaches
Annotations
Provenance as Annotations
• Model provenance as annotations on the data
• ⇒ Provenance management =
  • Efficient storage of annotations
  • Propagating annotations through transformations
Example (Systems applying this approach)
• Perm
• Orchestra
• DBNotes
• Trio
• . . . and many more
References
Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya.
An Annotation Management System for Relational Databases.
VLDB Journal, 14(4):373–396, 2005.
Jennifer Widom.
Trio: A System for Managing Data, Uncertainty, and Lineage.
Managing and Mining Uncertain Data, 2008.
Boris Glavic and Gustavo Alonso.
Perm: Processing Provenance and Data on the same Data Model through Query Rewriting.
ICDE, 174–185, 2009.
T.J. Green, G. Karvounarakis, Z.G. Ives, and V. Tannen.
Provenance in ORCHESTRA.
IEEE Data Eng. Bull., 33(3):9–16, 2010.
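Propagating annotations through transformations can be illustrated with a small Python sketch. It is not how Perm, Orchestra, DBNotes, or Trio work internally; it assumes, purely for illustration, that an annotation is a set of source-row identifiers, that selection leaves annotations unchanged, and that a join unions the annotations of the joined rows. The helper names (`annotate`, `select`, `join`) and the sample tables are hypothetical.

```python
# Each annotated row is (values, annotation); the annotation is a frozenset
# of source-row identifiers (an illustrative choice, not a system's format).
def annotate(table_name, rows):
    return [(row, frozenset({f"{table_name}:{i}"})) for i, row in enumerate(rows)]

def select(annotated, predicate):
    # Selection keeps a surviving row's annotation unchanged.
    return [(row, ann) for row, ann in annotated if predicate(row)]

def join(left, right, condition):
    # A joined row carries the union of both input annotations.
    return [(l + r, l_ann | r_ann)
            for l, l_ann in left
            for r, r_ann in right
            if condition(l, r)]

orders = annotate("orders", [(1, "alice"), (2, "bob")])
items = annotate("items", [(1, "book"), (1, "pen"), (2, "ink")])

result = join(select(orders, lambda o: o[1] == "alice"),
              items,
              lambda o, i: o[0] == i[0])
```

Each row in `result` is annotated with exactly the one order row and the one item row it was built from, so its provenance can be read off without re-running the query.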
Size Considerations
• Provenance models dependencies between inputs and outputs of a transformation
  • ⇒ A subset of Inputs × Outputs
  • ⇒ Can be quadratic in size
• DAG of transformations
  • multiply by maximal path length
• Provenance of two data items often overlaps
  • Reuse of intermediate results
  • Coarse-grained provenance
Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
4 Conclusions
Big Provenance Challenges
• Big data analytics is all about agility
  • ⇒ un/semi-structured data and no extensive meta-data available
  • ⇒ hard to define provenance model
• Transparency of distribution
  • Provenance use-case may require location information
• Analytics that cross system boundaries
  • Inter-operability of storage formats and systems
Example - Provenance for Map-Reduce Workflows
• Fine-grained provenance for workflows that are DAGs of map and reduce functions
• Add provenance to values: (key, value) → (key, (value, p))
• Provenance determined by map/reduce function semantics
• Wrap map and reduce functions to handle provenance
  • Strip off and cache provenance
  • Call original function
  • Re-attach provenance according to function semantics
References
R. Ikeda, H. Park, and J. Widom.
Provenance for generalized map and reduce workflows.
CIDR, 273–283, 2011.
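The wrapping steps listed above can be sketched as follows. This is a minimal single-process illustration, not the implementation from Ikeda et al.: it assumes that p is a frozenset of input identifiers, that each map output inherits the provenance of its one input, and that each reduce output gets the union of the provenance of its inputs. The tiny `run` driver and the word-count functions are hypothetical stand-ins for a real Map-Reduce framework.

```python
from collections import defaultdict

def wrap_map(map_fn):
    def wrapped(key, annotated_value):
        value, prov = annotated_value                   # strip off and cache provenance
        for out_key, out_value in map_fn(key, value):   # call original function
            yield out_key, (out_value, prov)            # re-attach (map: one input)
    return wrapped

def wrap_reduce(reduce_fn):
    def wrapped(key, annotated_values):
        values = [v for v, _ in annotated_values]       # strip off provenance
        prov = frozenset().union(*(p for _, p in annotated_values))
        for out_key, out_value in reduce_fn(key, values):
            yield out_key, (out_value, prov)            # re-attach (reduce: union)
    return wrapped

# Unmodified user functions: a word count.
def tokenize(doc_id, text):
    for word in text.split():
        yield word, 1

def count(word, ones):
    yield word, sum(ones)

def run(inputs, map_fn, reduce_fn):
    # Toy driver: map, group by key, reduce.
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, values):
            output[out_key] = out_value
    return output

inputs = [("d1", ("a b", frozenset({"d1"}))),
          ("d2", ("b c", frozenset({"d2"})))]
result = run(inputs, wrap_map(tokenize), wrap_reduce(count))
```

The user functions `tokenize` and `count` never see provenance; only the wrappers do, which is the point of the wrapping approach.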
Example - OS Provenance for the Cloud
PASS with Cloud Storage Backend
• File and process level provenance
• Each node runs a modified Linux kernel (PASS)
  • Intercept system-calls for file and process creation
• Store provenance in Amazon S3, SimpleDB, and/or SQS
Instrumenting the Xen Hypervisor
• File and process level provenance for virtual machines
• Intercept “hyper-calls” for file and process creation
• Store provenance in DB
References
M.I. Seltzer, P. Macko, and M.A. Chiarini.
Collecting provenance via the Xen hypervisor.
TaPP, 2011.
K.K. Muniswamy-Reddy, P. Macko, and M. Seltzer.
Provenance for the cloud.
FAST, 15–14, 2010.
Example - Sketching Distributed Provenance
• File + process level provenance modelled as DAG
  • intra- and inter-host dependencies
  • inter = socket communication
• Intercept system calls
• Distributed provenance graph storage
  • Each node stores the part of the provenance graph corresponding to its local processing
  • Links to other hosts for inter-host dependencies
• Query provenance
  • Nodes exchange summaries of provenance graphs using Bloom filters
References
T. Malik, L. Nistor, and A. Gehani.
Tracking and sketching distributed data provenance.
eScience, 190–197, 2010.
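The idea of exchanging Bloom-filter summaries can be sketched as below. This is a generic Bloom filter, not the data structure or parameters used by Malik et al.; the filter size, the SHA-256-based hashing, the host names, and the vertex identifiers are all assumptions made for the example.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def summarize(local_vertices):
    summary = BloomFilter()
    for vertex in local_vertices:
        summary.add(vertex)
    return summary

# Each host summarizes the provenance-graph vertices it stores locally.
summaries = {
    "host1": summarize(["file:/data/in.csv", "proc:1234"]),
    "host2": summarize(["file:/data/out.csv", "proc:5678"]),
}

def candidate_hosts(vertex):
    # Hosts whose summary does not rule out the vertex.
    return [host for host, summary in summaries.items()
            if summary.might_contain(vertex)]

hosts_for_input = candidate_hosts("file:/data/in.csv")
```

A query only needs to contact the candidate hosts. False positives are possible, so a candidate may turn out not to hold the vertex, but a host that stores it is never ruled out.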
Take-away Message
• Challenging problems
• Approaches that address distribution directly
• Approaches that map relational techniques to Big Data
• ⇒ more work needed!
Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
  Provenance as Data and Workloads
  Provenance-based Performance Metrics
  Profiling and Monitoring with Provenance
4 Conclusions
Why Benchmark?
Impress the Customer - Competition
• Competition should be fair
• Stable benchmark
• Similar to real world workloads
• Precise definitions
• ⇒ probably several iterations of design + test
Understand (and Improve) System Performance
• Stable benchmark
• Similar to real world workloads
• Profiling and monitoring
Benchmarking 101
A Comprehensive Benchmark . . . | TPC-H
• Define Parameters | e.g., scale factor (SF)
• Data-set specifications | Standard Specification
  • Structure and interrelationships | Schema provided
  • Value ranges and distributions | e.g., L_DISCOUNT column values between 0.00 and 1.00
  • Data type definitions | e.g., date is YYYY-MM-DD, at least 14 years range
  • Provide data generator and/or validator | dbgen
• Workload specification | Standard Specification
  • Specify jobs | Query and update workloads
  • Restrictions for running the jobs |
  • Provide workload generator and/or validator | qgen
• Benchmark metrics specification | Standard Specification
  • What to measure | e.g., Composite Query-per-Hour Metric
  • How to measure | e.g., first query char to last output char
Big Data Benchmarking
• Data generation is a necessity
  • Shipping pre-generated datasets of Big Data dimensions is infeasible
• Complex and mixed workloads better match real-world use of Big Data analytics
  • Hard to understand why a system performs badly or well on a complex workload
• Robustness of performance and scalability is important ⇒ Measure it!
  • What is a good metric for robustness and how to compute it?
Generating Large Datasets and Workloads
• Provenance data is large
• Compute provenance to generate large datasets
  • Run provenance-aware system to collect provenance for a small workload
  • The resulting data-set can be used in the benchmark
• ⇒ Generate large data-sets from simple “generators”
Example
• Simple task: sharpen an image
• Apply approach that instruments the program to detect data-dependencies as provenance
• Huge and very fine-grained provenance
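A rough sense of how much provenance the image example produces can be had from the sketch below. It is an assumption-laden illustration, not the instrumentation approach the talk refers to: the image is a plain list of lists, sharpening uses a common 3×3 kernel, pixels outside the image are skipped, and the provenance of an output pixel is simply the set of input pixels read to compute it.

```python
# A common 3x3 sharpening kernel (chosen for illustration).
KERNEL = [[-1, -1, -1],
          [-1,  9, -1],
          [-1, -1, -1]]

def sharpen_with_provenance(image):
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    provenance = {}  # output pixel -> set of input pixels it read
    for y in range(height):
        for x in range(width):
            total, deps = 0, set()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    sy, sx = y + dy, x + dx
                    if 0 <= sy < height and 0 <= sx < width:
                        total += KERNEL[dy + 1][dx + 1] * image[sy][sx]
                        deps.add((sy, sx))
            output[y][x] = total
            provenance[(y, x)] = deps
    return output, provenance

image = [[10] * 4 for _ in range(4)]
_, prov = sharpen_with_provenance(image)
num_edges = sum(len(deps) for deps in prov.values())
```

For this 4×4 image, 16 input pixels already yield 100 dependency pairs. For large images the count approaches nine pairs per pixel, so the provenance is several times larger than the data it describes.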
Stress-Testing Exploitation of Data Commonalities
• Provenance data large, but large overlap
• ⇒ Use provenance to test how well a system is able to exploit data commonalities
Example
• Streaming data with multi-step windowed aggregation
  • ⇒ Can predict overlap in provenance
• Generate provenance
• Queries over provenance as workload
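Why the overlap is predictable can be seen in a short sketch. It assumes count-based sliding windows over tuple positions and a second aggregation step that windows over the results of the first; the function names and parameters are hypothetical.

```python
def window_provenance(num_tuples, size, slide):
    """Provenance of each sliding-window result as a set of tuple positions."""
    return [set(range(start, start + size))
            for start in range(0, num_tuples - size + 1, slide)]

def two_step_provenance(num_tuples, size1, slide1, size2, slide2):
    # The second aggregation consumes first-step results; its provenance is
    # the union of the provenance of the first-step results in its window.
    step1 = window_provenance(num_tuples, size1, slide1)
    step2_windows = window_provenance(len(step1), size2, slide2)
    return [set().union(*(step1[i] for i in window)) for window in step2_windows]

step1 = window_provenance(num_tuples=10, size=4, slide=2)
step1_overlap = len(step1[0] & step1[1])

step2 = two_step_provenance(10, 4, 2, 2, 1)
step2_overlap = len(step2[0] & step2[1])
```

With window size 4 and slide 2, consecutive first-step results share size minus slide, that is 2, input tuples. After the second step, consecutive results share 4 of their 6 input tuples, so the overlap grows with each aggregation step.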
Data-centric Performance Information as Provenance
• Assume the hypothetical existence of god
  • Big provenance system
  • Efficiently computes all types of provenance
  • For any Big Data system
  • Also evaluates any types of queries
Data-centric performance information as provenance
• For each data item use god to record
  • execution times of each transformation in provenance
  • history of data movements for each data item
  • . . .
Example (Measure Robustness)
• Benchmark that requires several runs of mixed job workloads
• Workloads between runs overlap
• Performance metric: Resource consumption variation per job
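One possible reading of "resource consumption variation per job" is sketched below. The talk does not define the metric, so the coefficient of variation (standard deviation divided by mean) is an assumption, as are the job names and the numbers; only jobs that occur in every run, that is the overlap between workloads, are compared.

```python
from statistics import mean, pstdev

def variation_per_job(runs):
    """runs: one {job_name: resource_consumption} dict per benchmark run."""
    # Only jobs present in every run (the overlap) can be compared.
    shared_jobs = set.intersection(*(set(run) for run in runs))
    variation = {}
    for job in shared_jobs:
        samples = [run[job] for run in runs]
        variation[job] = pstdev(samples) / mean(samples)
    return variation

runs = [
    {"etl": 100.0, "report": 40.0},
    {"etl": 100.0, "report": 60.0, "adhoc": 5.0},
]
cv = variation_per_job(runs)
```

A value near zero means a job consumed about the same resources regardless of what else ran alongside it. Here "etl" is perfectly stable, while "report" varies by 20% of its mean.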
Using Provenance in Profiling
Profiling and Monitoring
• Identify causes for (benchmark) results
• Even more important for Big Data
  • Diverse workloads
  • Distribution
  • . . .
Example (How can Provenance help?)
• How do I profile a program on one node
  • e.g., instrument and collect statistics about execution
• Provenance provides data-centric view
• Identify repeated computations
  • Overlaps in provenance (same data, different nodes/transformations)
• Identify unnecessary computations
  • Computations that are not in the fine-grained provenance of anything
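Both checks can be sketched over a small provenance graph. The execution log format, the item and node names, and the choice of "report" as the only final output are hypothetical. The sketch treats a computation as unnecessary if its result is not in the transitive provenance of any final output, and as repeated if the same transformation ran on the same inputs more than once.

```python
# Hypothetical execution log: produced item -> (transformation, node, inputs).
executions = {
    "clean":  ("parse",     "n1", ("raw",)),
    "clean2": ("parse",     "n2", ("raw",)),    # same work repeated on n2
    "stats":  ("profile",   "n1", ("clean",)),  # result never used downstream
    "report": ("aggregate", "n1", ("clean",)),
}
final_outputs = ["report"]

def needed_items(outputs):
    # Walk the provenance graph backwards from the final outputs.
    needed, stack = set(), list(outputs)
    while stack:
        item = stack.pop()
        if item in needed or item not in executions:
            continue  # already visited, or a base input such as "raw"
        needed.add(item)
        stack.extend(executions[item][2])
    return needed

def unnecessary():
    return sorted(set(executions) - needed_items(final_outputs))

def repeated():
    seen, duplicates = {}, []
    for item, (transformation, _node, inputs) in executions.items():
        signature = (transformation, inputs)
        if signature in seen:
            duplicates.append((seen[signature], item))
        else:
            seen[signature] = item
    return duplicates

unused = unnecessary()
dupes = repeated()
```

In this toy log, "clean2" and "stats" contribute to no final output, and "clean2" repeats the work that produced "clean" on a different node.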
Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
4 Conclusions
Conclusions
• Big Provenance: Many interesting research problems
• Provenance as benchmark use-case
  • Provenance for benchmark data generation
  • Provenance-based performance metrics
• Supporting profiling and monitoring
Questions?
Info
• Boris Glavic: http://www.cs.iit.edu/~glavic/
• IIT DBGroup: http://www.cs.iit.edu/~dbgroup/
• Perm: http://permdbms.sourceforge.net/

More Related Content

Viewers also liked

OSCON 2012 MongoDB Tutorial
OSCON 2012 MongoDB TutorialOSCON 2012 MongoDB Tutorial
OSCON 2012 MongoDB Tutorial
Steven Francia
Ā 

Viewers also liked (19)

Making sense of (big) data - visual analytics and provenance
Making sense of (big) data - visual analytics and provenanceMaking sense of (big) data - visual analytics and provenance
Making sense of (big) data - visual analytics and provenance
Ā 
Intro Course "Big data in Agriculture" Agenda
Intro Course "Big data in Agriculture" AgendaIntro Course "Big data in Agriculture" Agenda
Intro Course "Big data in Agriculture" Agenda
Ā 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
Ā 
HDF5 Life cycle of data
HDF5 Life cycle of dataHDF5 Life cycle of data
HDF5 Life cycle of data
Ā 
Big Data and Agriculture
Big Data and AgricultureBig Data and Agriculture
Big Data and Agriculture
Ā 
Mphasi s agil_analytics_life_cycle_business_style_for_big_data_services[1]
Mphasi s agil_analytics_life_cycle_business_style_for_big_data_services[1]Mphasi s agil_analytics_life_cycle_business_style_for_big_data_services[1]
Mphasi s agil_analytics_life_cycle_business_style_for_big_data_services[1]
Ā 
Big Data in Agriculture - Setting the scene for the CGIAR
Big Data in Agriculture - Setting the scene for the CGIARBig Data in Agriculture - Setting the scene for the CGIAR
Big Data in Agriculture - Setting the scene for the CGIAR
Ā 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
Ā 
Big Data in Agriculture, the SemaGrow and agINFRA experience
Big Data in Agriculture, the SemaGrow and agINFRA experienceBig Data in Agriculture, the SemaGrow and agINFRA experience
Big Data in Agriculture, the SemaGrow and agINFRA experience
Ā 
Big Data in Agriculture : Opportunities for data driven agronomy
Big Data in Agriculture : Opportunities for data driven agronomyBig Data in Agriculture : Opportunities for data driven agronomy
Big Data in Agriculture : Opportunities for data driven agronomy
Ā 
Workshop: Big Data Visualization for Security
Workshop: Big Data Visualization for SecurityWorkshop: Big Data Visualization for Security
Workshop: Big Data Visualization for Security
Ā 
Agriculture and Big Data
Agriculture and Big DataAgriculture and Big Data
Agriculture and Big Data
Ā 
OSCON 2012 MongoDB Tutorial
OSCON 2012 MongoDB TutorialOSCON 2012 MongoDB Tutorial
OSCON 2012 MongoDB Tutorial
Ā 
Big Data Trends
Big Data TrendsBig Data Trends
Big Data Trends
Ā 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
Ā 
What is big data?
What is big data?What is big data?
What is big data?
Ā 
Technology in Ministry
Technology in MinistryTechnology in Ministry
Technology in Ministry
Ā 
Keystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenanceKeystone summer school 2015 paolo-missier-provenance
Keystone summer school 2015 paolo-missier-provenance
Ā 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
Ā 

Similar to WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"

351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
XanGwaps
Ā 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Sharding
inside-BigData.com
Ā 
[ē³»åˆ—ę“»å‹•] č³‡ę–™ęŽ¢å‹˜é€ŸéŠ
[ē³»åˆ—ę“»å‹•] č³‡ę–™ęŽ¢å‹˜é€ŸéŠ[ē³»åˆ—ę“»å‹•] č³‡ę–™ęŽ¢å‹˜é€ŸéŠ
[ē³»åˆ—ę“»å‹•] č³‡ę–™ęŽ¢å‹˜é€ŸéŠ
台ē£č³‡ę–™ē§‘å­øå¹“ęœƒ
Ā 

Similar to WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking" (20)

351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
Ā 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
Ā 
Montali - DB-Nets: On The Marriage of Colored Petri Nets ā€Øand Relational Data...
Montali - DB-Nets: On The Marriage of Colored Petri Nets ā€Øand Relational Data...Montali - DB-Nets: On The Marriage of Colored Petri Nets ā€Øand Relational Data...
Montali - DB-Nets: On The Marriage of Colored Petri Nets ā€Øand Relational Data...
Ā 
Pragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in EnterprisesPragmatics Driven Issues in Data and Process Integrity in Enterprises
Pragmatics Driven Issues in Data and Process Integrity in Enterprises
Ā 
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Ā 
Can data virtualization uphold performance with complex queries?
Can data virtualization uphold performance with complex queries?Can data virtualization uphold performance with complex queries?
Can data virtualization uphold performance with complex queries?
Ā 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
Ā 
Convergence of Machine Learning, Big Data and Supercomputing
Convergence of Machine Learning, Big Data and SupercomputingConvergence of Machine Learning, Big Data and Supercomputing
Convergence of Machine Learning, Big Data and Supercomputing
Ā 
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
Ā 
Big Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is EssentialBig Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is Essential
Ā 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best Practices
Ā 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Sharding
Ā 
Intro to Data Management
Intro to Data ManagementIntro to Data Management
Intro to Data Management
Ā 
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
2016 QDB VLDB Workshop - Towards Rigorous Evaluation of Data Integration Syst...
Ā 
Seminaire bigdata23102014
Seminaire bigdata23102014Seminaire bigdata23102014
Seminaire bigdata23102014
Ā 
Provinance in scientific workflows in e science
Provinance in scientific workflows in e scienceProvinance in scientific workflows in e science
Provinance in scientific workflows in e science
Ā 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
Ā 
Don't Be Scared. Data Don't Bite. Introduction to Big Data.
Don't Be Scared. Data Don't Bite. Introduction to Big Data.Don't Be Scared. Data Don't Bite. Introduction to Big Data.
Don't Be Scared. Data Don't Bite. Introduction to Big Data.
Ā 
[ē³»åˆ—ę“»å‹•] č³‡ę–™ęŽ¢å‹˜é€ŸéŠ
[ē³»åˆ—ę“»å‹•] č³‡ę–™ęŽ¢å‹˜é€ŸéŠ[ē³»åˆ—ę“»å‹•] č³‡ę–™ęŽ¢å‹˜é€ŸéŠ
[ē³»åˆ—ę“»å‹•] č³‡ę–™ęŽ¢å‹˜é€ŸéŠ
Ā 
Advanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITIAdvanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITI
Ā 

More from Boris Glavic

2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
Boris Glavic
Ā 
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-AnswersTaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
Boris Glavic
Ā 
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
Boris Glavic
Ā 

More from Boris Glavic (17)

2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for ...
Ā 
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Patter...
Ā 
2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata Generator2016 VLDB - The iBench Integration Metadata Generator
2016 VLDB - The iBench Integration Metadata Generator
Ā 
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
2016 VLDB - Messing Up with Bart: Error Generation for Evaluating Data-Cleani...
Ā 
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
2015 TaPP - Towards Constraint-based Explanations for Answers and Non-Answers
Ā 
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON
Ā 
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-AnswersTaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
TaPP 2015 - Towards Constraint-based Explanations for Answers and Non-Answers
Ā 
ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database Virtualization
Ā 
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of ProvenanceTaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
TaPP 2011 Talk Boris - Reexamining some Holy Grails of Provenance
Ā 
EDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested SubqueriesEDBT 2009 - Provenance for Nested Subqueries
EDBT 2009 - Provenance for Nested Subqueries
Ā 
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
ICDE 2009 - Perm: Processing Provenance and Data on the same Data Model throu...
Ā 
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
2010 VLDB - TRAMP: Understanding the Behavior of Schema Mappings through Prov...
Ā 
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"
DEBS 2013 - "Ariadne: Managing Fine-Grained Provenance on Data Streams"
Ā 
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
Ā 
TaPP 2013 - Provenance for Data Mining
TaPP 2013 - Provenance for Data MiningTaPP 2013 - Provenance for Data Mining
TaPP 2013 - Provenance for Data Mining
Ā 
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
Ā 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, Ian
Ā 

Recently uploaded

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
SĆ©rgio Sacani
Ā 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET
Ā 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
SĆ©rgio Sacani
Ā 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
Ā 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
levieagacer
Ā 

Recently uploaded (20)

9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
Ā 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
Ā 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
Ā 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Ā 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Ā 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
Ā 
Kochi ā¤CALL GIRL 84099*07087 ā¤CALL GIRLS IN Kochi ESCORT SERVICEā¤CALL GIRL
Kochi ā¤CALL GIRL 84099*07087 ā¤CALL GIRLS IN Kochi ESCORT SERVICEā¤CALL GIRLKochi ā¤CALL GIRL 84099*07087 ā¤CALL GIRLS IN Kochi ESCORT SERVICEā¤CALL GIRL
Kochi ā¤CALL GIRL 84099*07087 ā¤CALL GIRLS IN Kochi ESCORT SERVICEā¤CALL GIRL
Ā 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Ā 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000

WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"

  • 1. Big Data Provenance - Challenges and Implications for Benchmarking Boris Glavic IIT DBGroup December 17, 2012
  • 2. Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking 4 Conclusions
  • 3. Disclaimer • My previous benchmark experience • Limited to being a user • My main background is in provenance • . . . mostly for RDBMS • The main point of this talk (visions) • Provenance as a benchmark use-case • Provenance as a supporting technology for benchmarking • I will make some outrageous claims to get my point across Slide 1 of 23 Boris Glavic Big Data Provenance - Provenance
  • 4. What is Provenance? • Information about the creation process and origin of data • Data items and collections • Data item = atomic unit of data • Transformation • Abstraction for processing • Inputs and outputs are data items (collections) • Example [Figure: example]
  • 5. Types of Provenance • Given a data item d • Data Provenance • Which data items were used in the generation of d • Representation: e.g., a set of data items • Transformation Provenance • Which transformations generated d • transitively • Additional Information • Execution Environment • User • Time • . . .
  • 6. Data and Transformation Provenance • Example (Data Provenance) [Figure: data provenance]
  • 7. Data and Transformation Provenance • Example (Transformation Provenance) [Figure: transformation provenance]
  • 8. Coarse-grained vs. Fine-grained Data Provenance • Coarse-grained Provenance • Transformations are handled as black-boxes • ⇒ Each output of a transformation • . . . depends on all inputs • Fine-grained Provenance • Consider processing logic of transformation to determine data-flow • ⇒ All data items that contributed to a result • Sufficiency: sufficient to derive d through transformation • Necessity: necessary in deriving d through transformation
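The distinction on slide 8 can be illustrated with a small sketch. It assumes a hypothetical transformation, `sum_by_group`, that sums amounts per group; the function names and the row-id representation of provenance are illustrative choices and do not come from the talk.

```python
from collections import defaultdict

def sum_by_group(rows):
    """Hypothetical transformation: total amount per group."""
    totals = defaultdict(int)
    for group, amount in rows.values():
        totals[group] += amount
    return dict(totals)

def coarse_provenance(rows):
    """Black-box view: every output depends on all inputs."""
    return {group: set(rows) for group in sum_by_group(rows)}

def fine_provenance(rows):
    """Uses the processing logic: an output depends only on the rows of its group."""
    prov = defaultdict(set)
    for row_id, (group, _amount) in rows.items():
        prov[group].add(row_id)
    return dict(prov)

rows = {"r1": ("a", 5), "r2": ("a", 7), "r3": ("b", 1)}
coarse = coarse_provenance(rows)
fine = fine_provenance(rows)
```

For the output group `"b"`, the coarse-grained provenance lists all three rows, while the fine-grained provenance contains only `r3`, which is both sufficient and necessary to derive that result.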
  • 9. Granularity • Example (Coarse-grained) [Figure: coarse-grained provenance]
  • 10. Granularity • Example (Fine-grained) [Figure: fine-grained provenance]
  • 11. Use-cases • Data debugging • E.g., tracing an erroneous data item back to erroneous inputs • Trust and Quality • E.g., computing trust based on trust in data items in provenance • Probabilistic data • Computing probability of result based on probabilities in provenance • Security • Enforce access-control on query results based on provenance • Understanding misbehaviour of systems • E.g., detecting security breaches
  • 12. Annotations • Provenance as Annotations • Model provenance as annotations on the data • ⇒ Provenance management = • Efficient storage of annotations • Propagating annotations through transformations • Example (Systems applying this approach) • Perm • Orchestra • DBNotes • Trio • . . . and many more
  • 13. Annotations • References • Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005. • Jennifer Widom. Trio: A System for Managing Data, Uncertainty, and Lineage. Managing and Mining Uncertain Data, 2008. • Boris Glavic and Gustavo Alonso. Perm: Processing Provenance and Data on the same Data Model through Query Rewriting. ICDE, 174–185, 2009. • T.J. Green, G. Karvounarakis, Z.G. Ives, and V. Tannen. Provenance in ORCHESTRA. IEEE Data Eng. Bull., 33(3):9–16, 2010.
  • 14. Size Considerations • Provenance models dependencies between inputs and outputs of a transformation • ⇒ A subset of Inputs × Outputs • ⇒ Can be quadratic in size • DAG of transformations • multiply by maximal path length • Provenance of two data items often overlaps • Reuse of intermediate results • Coarse-grained provenance
  • 15. Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking 4 Conclusions
  • 16. Big Provenance Challenges • Big data analytics is all about agility • ⇒ un/semi-structured data and no extensive meta-data available • ⇒ hard to define provenance model • Transparency of distribution • Provenance use-case may require location information • Analytics that cross system boundaries • Inter-operability of storage formats and systems
  • 17. Example - Provenance for Map-Reduce Workflows • Fine-grained provenance for workflows that are DAGs of map and reduce functions • Add provenance to values: (key, value) → (key, (value, p)) • Provenance determined by map/reduce function semantics • Wrap map and reduce functions to handle provenance • Strip off and cache provenance • Call original function • Re-attach provenance according to function semantics • References • R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. CIDR, 273–283, 2011.
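The wrapping steps on slide 17 can be sketched in a few lines. This is a minimal in-memory illustration and not the implementation from Ikeda et al.; the helpers `wrap_map`, `wrap_reduce`, and `run_word_count`, and the choice to represent provenance `p` as a set of input document ids, are assumptions made for the example.

```python
from collections import defaultdict

def wrap_map(map_fn):
    """Wrap a map function so it consumes and emits (value, provenance) pairs."""
    def wrapped(key, annotated_value):
        value, prov = annotated_value                  # strip off and cache provenance
        for out_key, out_value in map_fn(key, value):  # call the original function
            yield out_key, (out_value, prov)           # map output depends on its one input
    return wrapped

def wrap_reduce(reduce_fn):
    """Wrap a reduce function; each output depends on every value in its group."""
    def wrapped(key, annotated_values):
        pairs = list(annotated_values)
        values = [value for value, _ in pairs]
        prov = frozenset().union(*(p for _, p in pairs))
        for out_key, out_value in reduce_fn(key, values):
            yield out_key, (out_value, prov)
    return wrapped

# Unmodified user functions (word count).
def tokenize(doc_id, line):
    for word in line.split():
        yield word, 1

def count(word, ones):
    yield word, sum(ones)

def run_word_count(docs):
    """Tiny single-process stand-in for the map, shuffle, and reduce phases."""
    mapper, reducer = wrap_map(tokenize), wrap_reduce(count)
    groups = defaultdict(list)
    for doc_id, line in docs.items():
        for word, annotated in mapper(doc_id, (line, frozenset({doc_id}))):
            groups[word].append(annotated)
    result = {}
    for word, annotated_values in groups.items():
        for out_key, annotated in reducer(word, annotated_values):
            result[out_key] = annotated
    return result

result = run_word_count({"d1": "a b", "d2": "b"})
```

Here `result["b"]` carries the count together with the ids of both documents that contributed to it, while `tokenize` and `count` remain unaware of provenance.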
  • 18. Example - OS Provenance for the Cloud • PASS with Cloud Storage Backend • File and process level provenance • Each node runs a modified Linux kernel (PASS) • Intercept system-calls for file and process creation • Store provenance in Amazon S3, SimpleDB, and/or SQS • Instrumenting the Xen Hypervisor • File and process level provenance for virtual machines • Intercept "hyper-calls" for file and process creation • Store provenance in DB
  • 19. Example - OS Provenance for the Cloud • References • M.I. Seltzer, P. Macko, and M.A. Chiarini. Collecting provenance via the Xen hypervisor. TaPP, 2011. • K.K. Muniswamy-Reddy, P. Macko, and M. Seltzer. Provenance for the cloud. FAST, 15–14, 2010.
  • 20. Example - Sketching Distributed Provenance • File + process level provenance modelled as DAG • intra- and inter-host dependencies • inter = socket communication • Intercept system calls • Distributed provenance graph storage • Each node stores the part of the provenance graph corresponding to its local processing • Links to other hosts for inter-host dependencies • Query provenance • Nodes exchange summaries of provenance graphs using Bloom filters • References • T. Malik, L. Nistor, and A. Gehani. Tracking and sketching distributed data provenance. eScience, 190–197, 2010.
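The idea of exchanging Bloom-filter summaries on slide 20 can be sketched as follows. This is an illustrative toy, not the design from Malik et al.: the `BloomFilter` class, its parameters, and the `hosts_to_query` helper are assumptions, and vertex names such as `file:/tmp/out` are made up.

```python
import hashlib

class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array stored in a Python int

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def hosts_to_query(summaries, vertex):
    """Hosts whose summary says they may store provenance for `vertex`."""
    return [host for host, bf in summaries.items() if bf.might_contain(vertex)]

# Each host summarizes the vertices of its local provenance graph.
summaries = {"host1": BloomFilter(), "host2": BloomFilter()}
summaries["host1"].add("file:/tmp/out")
candidates = hosts_to_query(summaries, "file:/tmp/out")
```

A provenance query then contacts only the candidate hosts. Because a Bloom filter can report false positives, a candidate host may still turn out not to hold the vertex, but a host that does hold it is never skipped.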
  • 21. Take-away Message • Challenging problems • Approaches that address distribution directly • Approaches that map relational techniques to Big Data • ⇒ more work needed!
  • 22. Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking • Provenance as Data and Workloads • Provenance-based Performance Metrics • Profiling and Monitoring with Provenance 4 Conclusions
  • 23. Why Benchmark? • Impress the Customer - Competition • Competition should be fair • Stable benchmark • Similar to real world workloads • Precise definitions • ⇒ probably several iterations of design + test • Understand (and Improve) System Performance • Stable benchmark • Similar to real world workloads • Profiling and monitoring
  • 24. Benchmarking 101 • A Comprehensive Benchmark . . . • Define Parameters • Data-set specifications • Structure and interrelationships • Value ranges and distributions • Data type definitions • Provide data generator and/or validator • Workload specification • Specify jobs • Restrictions for running the jobs • Provide workload generator and/or validator • Benchmark metrics specification • What to measure • How to measure
  • 25. Benchmarking 101 • A Comprehensive Benchmark . . . (TPC-H equivalent after the colon) • Define Parameters: e.g., SF • Data-set specifications: Standard Specification • Structure and interrelationships: Schema provided • Value ranges and distributions: e.g., L_DISCOUNT column values between 0.00 and 1.00 • Data type definitions: e.g., date is YYYY-MM-DD, at least 14 years range • Provide data generator and/or validator: dbgen • Workload specification: Standard Specification • Specify jobs: Query and update workloads • Restrictions for running the jobs • Provide workload generator and/or validator: qgen • Benchmark metrics specification: Standard Specification • What to measure: e.g., Composite Query-per-Hour Metric • How to measure: e.g., first query char to last output char
  • 26. Big Data Benchmarking • Data generation is a necessity • Shipping pre-generated datasets of Big Data dimensions is infeasible • Complex and mixed workloads better match real-world use of Big Data analytics • Hard to understand why a system performs badly or well on a complex workload • Robustness of performance and scalability is important ⇒ Measure it! • What is a good metric for robustness and how to compute it?
  • 27. Generating Large Datasets and Workloads • Provenance data is large • Compute provenance to generate large datasets • Run provenance-aware system to collect provenance for a small workload • The resulting data-set can be used in the benchmark • ⇒ Generate large data-sets from simple "generators" • Example • Simple task: sharpen an image • Apply approach that instruments the program to detect data-dependencies as provenance • Huge and very fine-grained provenance
  • 28. Stress-Testing Exploitation of Data Commonalities • Provenance data is large, but overlaps heavily • ⇒ Use provenance to test how well a system is able to exploit data commonalities • Example • Streaming data with multi-step windowed aggregation • ⇒ Can predict overlap in provenance • Generate provenance • Queries over provenance as workload
  • 29. Data-centric Performance Information as Provenance • Assume the hypothetical existence of god • Big provenance system • Efficiently computes all types of provenance • For any Big Data system • Also evaluates any types of queries • Data-centric performance information as provenance • For each data item use god to record • execution times of each transformation in provenance • history of data movements for each data item • . . . • Example (Measure Robustness) • Benchmark that requires several runs of mixed job workloads • Workloads between runs overlap • Performance metric: Resource consumption variation per job
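One way to make the "resource consumption variation per job" metric from slide 29 concrete is the coefficient of variation of a job's cost across runs. The talk does not fix a formula, so this choice, the `variation_per_job` name, and the sample numbers are assumptions; costs are assumed to be positive.

```python
from statistics import mean, pstdev

def variation_per_job(runs):
    """Coefficient of variation (std / mean) of each job's cost across runs.

    `runs` is a list of dicts mapping job name to a positive resource cost
    (for example CPU seconds). Jobs seen in fewer than two runs are skipped.
    """
    per_job = {}
    for run in runs:
        for job, cost in run.items():
            per_job.setdefault(job, []).append(cost)
    return {
        job: pstdev(costs) / mean(costs)
        for job, costs in per_job.items()
        if len(costs) > 1
    }

# Three runs of overlapping mixed workloads (made-up numbers).
runs = [
    {"sort": 10.0, "join": 40.0},
    {"sort": 10.0, "join": 60.0},
    {"sort": 10.0, "scan": 5.0},
]
variation = variation_per_job(runs)
```

A value of 0 means the job cost the same in every run regardless of what else was running; larger values indicate less robust performance. In the sample data `sort` is perfectly stable, `join` varies, and `scan` is skipped because it appears in only one run.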
  • 30. Using Provenance in Profiling • Profiling and Monitoring • Identify causes for (benchmark) results • Even more important for Big Data • Diverse workloads • Distribution • . . . • Example (How can Provenance help?) • How do I profile a program on one node • e.g., instrument and collect statistics about execution • Provenance provides data-centric view • Identify repeated computations • Overlaps in provenance (same data, different nodes/transformations) • Identify unnecessary computations • Computations that are not in the fine-grained provenance of anything
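The last point on slide 30, finding computations that are in the provenance of nothing, amounts to a backward reachability check. The sketch below assumes provenance is available as a mapping from each computed item to the items it was derived from; the function names and the sample pipeline are hypothetical.

```python
def provenance_closure(targets, inputs_of):
    """All items that transitively contributed to any of `targets`."""
    seen = set()
    stack = list(targets)
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        stack.extend(inputs_of.get(item, ()))
    return seen

def unnecessary_computations(final_outputs, inputs_of):
    """Computed items that appear in the provenance of no final output."""
    needed = provenance_closure(final_outputs, inputs_of)
    return set(inputs_of) - needed

# Hypothetical pipeline: "thumbnail" is computed but never used by the report.
inputs_of = {
    "clean": ["raw"],
    "stats": ["clean"],
    "thumbnail": ["raw"],
    "report": ["stats"],
}
wasted = unnecessary_computations(["report"], inputs_of)
```

With fine-grained provenance the same check works at the level of individual data items, so it can flag partial waste inside a job as well as whole unused jobs.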
  • 31. Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking 4 Conclusions
  • 32. Conclusions • Big Provenance: Many interesting research problems • Provenance as benchmark use-case • Provenance for benchmark data generation • Provenance-based performance metrics • Supporting profiling and monitoring
  • 33. Questions? • Info • Boris Glavic: http://www.cs.iit.edu/~glavic/ • IIT DBGroup: http://www.cs.iit.edu/~dbgroup/ • Perm: http://permdbms.sourceforge.net/