WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"
1. Big Data Provenance - Challenges and Implications for Benchmarking
Boris Glavic
IIT DBGroup
December 17, 2012
2. Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
4 Conclusions
3. Disclaimer
• My previous benchmark experience
  • Limited to being a user
• My main background is in provenance
  • ... mostly for RDBMS
• The main point of this talk (visions)
  • Provenance as a benchmark use-case
  • Provenance as a supporting technology for benchmarking
• I will make some outrageous claims to get my point across
Slide 1 of 23 Boris Glavic Big Data Provenance - Provenance
4. What is Provenance?
• Information about the creation process and origin of data
• Data items and collections
  • Data item = atomic unit of data
• Transformation
  • Abstraction for processing
  • Inputs and outputs are data items (collections)
Example
5. Types of Provenance
• Given a data item d
Data Provenance
• Which data items were used in the generation of d?
• Representation: e.g., a set of data items
Transformation Provenance
• Which transformations generated d?
  • Transitively
Additional Information
• Execution environment
• User
• Time
• ...
6. Data and Transformation Provenance
Example (Data Provenance)
7. Data and Transformation Provenance
Example (Transformation Provenance)
8. Coarse-grained vs. Fine-grained Data Provenance
Coarse-grained Provenance
• Transformations are handled as black-boxes
• ⇒ Each output of a transformation depends on all inputs
Fine-grained Provenance
• Consider the processing logic of a transformation to determine data-flow
• ⇒ All data items that contributed to a result
  • Sufficiency: sufficient to derive d through the transformation
  • Necessity: necessary for deriving d through the transformation
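The distinction can be sketched in a few lines of Python (a made-up toy transformation, purely illustrative and not from the talk):

```python
# Coarse- vs. fine-grained data provenance for a toy transformation
# that sums up only the positive inputs.

def transform(inputs):
    return sum(x for x in inputs if x > 0)

def coarse_grained(inputs):
    # Black-box view: every output depends on all inputs.
    return set(inputs)

def fine_grained(inputs):
    # Inspect the transformation's logic: only the positive inputs
    # are sufficient and necessary for the result.
    return {x for x in inputs if x > 0}

data = [3, -1, 4, -5]
print(transform(data))       # 7
print(coarse_grained(data))  # all four inputs
print(fine_grained(data))    # only 3 and 4
```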
11. Use-cases
• Data debugging
  • E.g., tracing an erroneous data item back to erroneous inputs
• Trust and quality
  • E.g., computing trust based on trust in the data items in the provenance
• Probabilistic data
  • E.g., computing the probability of a result based on probabilities in the provenance
• Security
  • E.g., enforcing access-control on query results based on their provenance
• Understanding misbehaviour of systems
  • E.g., detecting security breaches
12. Annotations
Provenance as Annotations
• Model provenance as annotations on the data
• ⇒ Provenance management =
  • Efficient storage of annotations
  • Propagating annotations through transformations
Example (Systems applying this approach)
• Perm
• Orchestra
• DBNotes
• Trio
• ... and many more
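As a minimal illustration of the annotation view (hypothetical identifiers and a made-up transformation, not any particular system's API):

```python
# Provenance as annotations: each value carries a set of source-item
# identifiers, and a transformation propagates the union of the
# annotations of the inputs it used.

def annotate(value, source_id):
    return (value, {source_id})

def add(a, b):
    # Binary transformation: the output's annotation is the union of
    # the inputs' annotations.
    (va, pa), (vb, pb) = a, b
    return (va + vb, pa | pb)

x = annotate(1, "r1")
y = annotate(2, "r2")
total = add(x, y)
# total is (3, {'r1', 'r2'}): the result value plus its provenance.
```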
13. Annotations
References
Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya.
An Annotation Management System for Relational Databases.
VLDB Journal, 14(4):373–396, 2005.
Jennifer Widom.
Trio: A System for Managing Data, Uncertainty, and Lineage.
Managing and Mining Uncertain Data, 2008.
Boris Glavic and Gustavo Alonso.
Perm: Processing Provenance and Data on the same Data Model through Query Rewriting.
ICDE, 174–185, 2009.
T.J. Green, G. Karvounarakis, Z.G. Ives, and V. Tannen.
Provenance in ORCHESTRA.
IEEE Data Eng. Bull., 33(3):9–16, 2010.
14. Size Considerations
• Provenance models dependencies between the inputs and outputs of a transformation
• ⇒ A subset of Inputs × Outputs
• ⇒ Can be quadratic in size
• DAG of transformations
  • Multiply by maximal path length
• Provenance of two data items often overlaps
  • Reuse of intermediate results
  • Coarse-grained provenance
15. Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
4 Conclusions
16. Big Provenance Challenges
• Big data analytics is all about agility
  • ⇒ Un-/semi-structured data and no extensive meta-data available
  • ⇒ Hard to define a provenance model
• Transparency of distribution
  • Provenance use-cases may require location information
• Analytics that cross system boundaries
  • Inter-operability of storage formats and systems
17. Example - Provenance for Map-Reduce Workflows
• Fine-grained provenance for workflows that are DAGs of map and reduce functions
• Add provenance to values: (key, value) → (key, (value, p))
• Provenance determined by map/reduce function semantics
• Wrap map and reduce functions to handle provenance
  • Strip off and cache the provenance
  • Call the original function
  • Re-attach provenance according to the function's semantics
References
R. Ikeda, H. Park, and J. Widom.
Provenance for Generalized Map and Reduce Workflows.
CIDR, 273–283, 2011.
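The wrapping scheme above can be sketched as follows (a simplified Python sketch of the idea with a made-up word-count mapper; the actual system is described in the reference):

```python
# Wrap a map function so that values become (value, provenance) pairs:
# strip off the provenance, call the original function, and re-attach
# the provenance to every output (here: each output depends on the
# input record it came from).

def wrap_map(map_fn):
    def wrapped(key, value_and_prov):
        value, prov = value_and_prov                   # strip off provenance
        for out_key, out_value in map_fn(key, value):  # original function
            yield out_key, (out_value, prov)           # re-attach provenance
    return wrapped

# Original mapper, entirely unaware of provenance.
def tokenize(key, line):
    for word in line.split():
        yield word, 1

mapper = wrap_map(tokenize)
out = list(mapper("doc1", ("big data provenance", {"doc1"})))
# [('big', (1, {'doc1'})), ('data', (1, {'doc1'})),
#  ('provenance', (1, {'doc1'}))]
```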
18. Example - OS Provenance for the Cloud
PASS with Cloud Storage Backend
• File- and process-level provenance
• Each node runs a modified Linux kernel (PASS)
• Intercept system calls for file and process creation
• Store provenance in Amazon S3, SimpleDB, and/or SQS
Instrumenting the Xen Hypervisor
• File- and process-level provenance for virtual machines
• Intercept "hyper-calls" for file and process creation
• Store provenance in a DB
19. Example - OS Provenance for the Cloud
References
M.I. Seltzer, P. Macko, and M.A. Chiarini.
Collecting Provenance via the Xen Hypervisor.
TaPP, 2011.
K.K. Muniswamy-Reddy, P. Macko, and M. Seltzer.
Provenance for the Cloud.
FAST, 2010.
20. Example - Sketching Distributed Provenance
• File- and process-level provenance modelled as a DAG
  • Intra- and inter-host dependencies
  • Inter-host = socket communication
• Intercept system calls
• Distributed provenance graph storage
  • Each node stores the part of the provenance graph corresponding to its local processing
  • Links to other hosts for inter-host dependencies
• Query provenance
  • Nodes exchange summaries of provenance graphs using Bloom filters
References
T. Malik, L. Nistor, and A. Gehani.
Tracking and Sketching Distributed Data Provenance.
eScience, 190–197, 2010.
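A toy version of such a summary (a from-scratch Bloom filter over made-up vertex ids; the parameters are illustrative, not taken from the cited system):

```python
import hashlib

# Each node summarizes the vertex ids of its local provenance
# subgraph in a Bloom filter; a remote node can then check the
# summary before requesting details over the network.

class BloomFilter:
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # k independent hash positions derived from SHA-256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # No false negatives; false positives possible.
        return all(self.bits >> p & 1 for p in self._positions(item))

summary = BloomFilter()
for vertex in ["procA:1", "fileA:2", "sock:443"]:
    summary.add(vertex)

print(summary.might_contain("fileA:2"))  # True
print(summary.might_contain("fileZ:9"))  # False, with high probability
```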
21. Take-away Message
• Challenging problems
• Approaches that address distribution directly
• Approaches that map relational techniques to Big Data
• ⇒ More work needed!
22. Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
Provenance as Data and Workloads
Provenance-based Performance Metrics
Profiling and Monitoring with Provenance
4 Conclusions
23. Why Benchmark?
Impress the Customer - Competition
• Competition should be fair
  • Stable benchmark
  • Similar to real-world workloads
  • Precise definitions
• ⇒ Probably several iterations of design + test
Understand (and Improve) System Performance
• Stable benchmark
• Similar to real-world workloads
• Profiling and monitoring
24. Benchmarking 101
A Comprehensive Benchmark ...
• Define parameters
• Data-set specification
  • Structure and interrelationships
  • Value ranges and distributions
  • Data type definitions
  • Provide a data generator and/or validator
• Workload specification
  • Specify jobs
  • Restrictions for running the jobs
  • Provide a workload generator and/or validator
• Benchmark metrics specification
  • What to measure
  • How to measure
25. Benchmarking 101
A Comprehensive Benchmark ... TPC-H
• Define parameters - e.g., the scale factor SF
• Data-set specification - standard specification
  • Structure and interrelationships - schema provided
  • Value ranges and distributions - e.g., L_DISCOUNT column values between 0.00 and 1.00
  • Data type definitions - e.g., dates are YYYY-MM-DD, spanning at least 14 years
  • Provide a data generator and/or validator - dbgen
• Workload specification - standard specification
  • Specify jobs - query and update workloads
  • Restrictions for running the jobs
  • Provide a workload generator and/or validator - qgen
• Benchmark metrics specification - standard specification
  • What to measure - e.g., the Composite Query-per-Hour Metric
  • How to measure - e.g., from first query character to last output character
26. Big Data Benchmarking
• Data generation is a necessity
  • Shipping pre-generated datasets of Big Data dimensions is infeasible
• Complex and mixed workloads better match real-world use of Big Data analytics
  • Hard to understand why a system performs badly/well on a complex workload
• Robustness of performance and scalability are important
  • ⇒ Measure it!
  • What is a good metric for robustness, and how do we compute it?
27. Generating Large Datasets and Workloads
• Provenance data is large
  • Compute provenance to generate large datasets
• Run a provenance-aware system to collect provenance for a small workload
  • The resulting data-set can be used in the benchmark
• ⇒ Generate large data-sets from simple "generators"
Example
• Simple task: sharpen an image
• Apply an approach that instruments the program to detect data-dependencies as provenance
• ⇒ Huge and very fine-grained provenance
28. Stress-Testing Exploitation of Data Commonalities
• Provenance data is large, but with large overlap
• ⇒ Use provenance to test how well a system is able to exploit data commonalities
Example
• Streaming data with multi-step windowed aggregation
• ⇒ Can predict the overlap in provenance
• Generate provenance
• Queries over the provenance as workload
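For a sliding window, the overlap between consecutive results is exactly predictable (a minimal sketch with made-up parameters):

```python
# Provenance of sliding-window aggregates: the provenance of each
# window result is the set of input positions in its window, so the
# overlap between consecutive results is known in advance.

def window_provenance(stream_len, size, slide):
    windows, start = [], 0
    while start + size <= stream_len:
        windows.append(set(range(start, start + size)))
        start += slide
    return windows

wins = window_provenance(stream_len=10, size=4, slide=1)
overlap = len(wins[0] & wins[1]) / len(wins[0])
# overlap == 0.75: consecutive windows share 3 of their 4 inputs.
```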
29. Data-centric Performance Information as Provenance
• Assume the hypothetical existence of "god"
  • A Big Provenance system that ...
  • ... efficiently computes all types of provenance
  • ... for any Big Data system
  • ... and also evaluates any type of query
Data-centric performance information as provenance
• For each data item, use "god" to record
  • the execution time of each transformation in its provenance
  • the history of data movements for the data item
  • ...
Example (Measure Robustness)
• A benchmark that requires several runs of mixed job workloads
• Workloads between runs overlap
• Performance metric: resource-consumption variation per job
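One way to compute such a metric (a sketch with made-up numbers; the coefficient of variation is just one plausible choice of variation measure):

```python
import statistics

# Robustness as variation of per-job resource consumption across
# overlapping workload runs: a robust system shows little variation.

runs = {                       # job -> resource consumption per run
    "jobA": [10.0, 10.5, 9.8],
    "jobB": [20.0, 35.0, 12.0],
}

def variation(samples):
    # Coefficient of variation: standard deviation relative to the mean.
    return statistics.stdev(samples) / statistics.mean(samples)

scores = {job: variation(samples) for job, samples in runs.items()}
# jobA varies little across runs; jobB's performance is not robust.
```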
30. Using Provenance in Profiling
Profiling and Monitoring
• Identify causes for (benchmark) results
• Even more important for Big Data
  • Diverse workloads
  • Distribution
  • ...
Example (How can Provenance help?)
• How do I profile a program on a single node?
  • E.g., instrument it and collect statistics about the execution
• Provenance provides a data-centric view
• Identify repeated computations
  • Overlaps in provenance (same data, different nodes/transformations)
• Identify unnecessary computations
  • Computations that are not in the fine-grained provenance of anything
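The last point can be sketched as a reachability check over the provenance graph (hypothetical item and run names, illustration only):

```python
# Flag transformation runs whose outputs never appear in the
# fine-grained provenance of any final result.

def useful_runs(final_results, produced_by, provenance):
    """produced_by: data item -> run that produced it;
    provenance: data item -> set of items it was derived from."""
    useful, seen, stack = set(), set(), list(final_results)
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        if item in produced_by:
            useful.add(produced_by[item])
        stack.extend(provenance.get(item, ()))
    return useful

produced_by = {"tmp1": "run1", "tmp2": "run2", "out": "run3"}
provenance = {"out": {"tmp1"}, "tmp1": {"in1"}, "tmp2": {"in2"}}

used = useful_runs(["out"], produced_by, provenance)
unnecessary = set(produced_by.values()) - used
# unnecessary == {'run2'}: run2's output contributed to nothing.
```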
31. Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
4 Conclusions
32. Conclusions
• Big Provenance: many interesting research problems
• Provenance as a benchmark use-case
  • Provenance for benchmark data generation
  • Provenance-based performance metrics
  • Supporting profiling and monitoring
33. Questions?
Info
• Boris Glavic: http://www.cs.iit.edu/~glavic/
• IIT DBGroup: http://www.cs.iit.edu/~dbgroup/
• Perm: http://permdbms.sourceforge.net/