A number of systems have been released recently for use in interactive and real-time analytics. Examples include Drill, Druid, Impala, Muppet, Shark/Spark, Storm, and Tez. It can be confusing for a practitioner to pick the best system for her specific needs. Statements like “this system is 10x better than Hive” can be misleading without understanding factors like: (i) the workload and environment where the improvement can be repeatably obtained, (ii) whether proper system tuning can change the result, and (iii) whether the results would differ under other workloads. Duke and two other research institutions are jointly conducting a large-scale experimental study with multiple systems and workloads in order to answer these questions of broad interest. The workloads used in the study represent new-generation analytics needs that cover a diverse spectrum including SQL-like queries, machine-learning analysis, graph and matrix processing, and queries running continuously over rapid data streams. The talk will use the results from this study to present the strengths and weaknesses of each system, and rigorously characterize the scenarios where each system is the right choice. Opportunities to improve the systems with new features or by cross-pollination of features from multiple systems will also be presented.
2. Introduction
• Who am I: Shivnath Babu
• Associate Prof. of Computer Science at Duke University
• Chief Scientist at Unravel Data Systems
• Build tools for easy system management
• What is this talk about: BigFrame
• BigFrame helps you benchmark big data analytics systems …
• … with a benchmark created automatically by BigFrame …
• … for your custom application and workload needs
• First open-source release planned for August 2013
14. Challenges for Practitioners
• App Developers, Data Scientists: Which system to use for the app that I am developing?
• Features (e.g., graph data)
• Performance (e.g., claims like “System A is 50x faster than B”)
• Resource efficiency
• Growth and scalability
• Multi-tenancy
15. Challenges for Practitioners
• System Admins: Different parts of my app have different requirements
• Compose “best of breed” systems OR use a “one size fits all” system?
• Managing many systems is hard!
16. Challenges for Practitioners
• CIO: Total Cost of Ownership (TCO)?
37. Use Case I: Exploratory BI
• Large volumes of relational data
• Mostly aggregation and few joins
• Can Spark’s performance match that of an MPP DB?
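To make the “mostly aggregation and few joins” shape of this use case concrete, here is a minimal sketch of such a query, using an in-memory SQLite database and a made-up star-schema fragment (the tables `sales` and `stores` are illustrative, not BigFrame’s actual schema):

```python
import sqlite3

# Hypothetical star-schema fragment: one large fact table, one small
# dimension table (NOT BigFrame's real schema -- purely illustrative).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (item_id INT, store_id INT, amount REAL)")
cur.execute("CREATE TABLE stores (store_id INT, region TEXT)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, 1, 10.0), (2, 1, 5.0), (1, 2, 7.5)])
cur.executemany("INSERT INTO stores VALUES (?, ?)",
                [(1, "east"), (2, "west")])

# Canonical exploratory-BI query shape: group-by aggregation plus a
# single join -- the pattern where an MPP DB is traditionally strong.
cur.execute("""
    SELECT st.region, SUM(s.amount) AS total
    FROM sales s JOIN stores st ON s.store_id = st.store_id
    GROUP BY st.region
    ORDER BY st.region
""")
result = cur.fetchall()
print(result)  # [('east', 15.0), ('west', 7.5)]
```

The same query shape can be expressed against Spark (e.g., via Spark SQL) or an MPP database, which is what makes this workload a natural head-to-head comparison.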
38. Use Case II: Complex BI
• Large volumes of relational data
• Even larger volumes of text data
• Combined analytics
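“Combined analytics” here means answering one question that spans both the relational and the text side. A tiny sketch, with made-up product/review data and a naive keyword count standing in for real text analytics:

```python
from collections import Counter

# Illustrative data only: structured revenue per product plus
# unstructured review text for the same products.
sales = {101: 250.0, 102: 90.0}          # product_id -> revenue
reviews = [
    (101, "great battery, great screen"),
    (101, "terrible support"),
    (102, "great value"),
]

def keyword_hits(text, keyword="great"):
    # Naive stand-in for text analytics: count keyword occurrences.
    return Counter(text.split())[keyword]

# Combine the relational and text sides into one answer per product.
combined = {
    pid: {"revenue": rev,
          "positive_mentions": sum(keyword_hits(t)
                                   for p, t in reviews if p == pid)}
    for pid, rev in sales.items()
}
print(combined)
```

The point of the use case is that the relational part and the text part may each favor a different system, yet the final answer needs both.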
39. Use Case III: Dashboards
• Large volume and velocity of relational and text data
• Continuously-updated dashboards
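A continuously-updated dashboard metric is, at its core, an aggregate maintained incrementally over a window of recent events. A minimal sketch (window size and event values are made up; real systems like Storm or Muppet do this distributed and fault-tolerantly):

```python
from collections import deque

class SlidingSum:
    """Keep a running sum over the last `window` stream events --
    the simplest form of a continuously-updated dashboard value."""
    def __init__(self, window=3):
        self.window = deque(maxlen=window)  # old events fall off

    def update(self, value):
        self.window.append(value)
        return sum(self.window)  # current dashboard reading

dash = SlidingSum(window=3)
snapshots = [dash.update(v) for v in [4, 1, 2, 10]]
print(snapshots)  # [4, 5, 7, 13]
```

Each incoming event updates the displayed value immediately, which is what distinguishes this workload from batch-style BI queries.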
40. Use Case IV: Does One Size Fit All?
• Growing set of applications have to process relational, text, & graph data
• Compose “best of breed” systems or use a “one size fits all” system?
41. Use Case V: Multi-tenancy and SLAs
• Big data deployments are increasingly multi-tenant and need to meet SLAs
42. Working with the Community
• First release of BigFrame planned for August 2013
• With feedback from benchmark developers (BigBench)
• Open-source with extensibility APIs
• Benchmark Drivers for more systems
• Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking)
• Instantiate the BigFrame pipeline for more app domains
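As a rough illustration of what a pluggable Benchmark Driver could look like, here is a hypothetical sketch; BigFrame’s actual extensibility APIs may differ, and `BenchmarkDriver`, `NoOpDriver`, and `timed_run` are names invented for this example:

```python
import abc
import time

class BenchmarkDriver(abc.ABC):
    """Hypothetical driver interface: one subclass per system under
    test (e.g., Hive, Shark, Impala), each reporting query timings."""

    @abc.abstractmethod
    def run_query(self, query: str) -> None:
        """Execute one benchmark query on the system under test."""

    def timed_run(self, query: str) -> float:
        # Shared timing logic so every driver measures the same way.
        start = time.perf_counter()
        self.run_query(query)
        return time.perf_counter() - start

class NoOpDriver(BenchmarkDriver):
    """Stand-in driver used here only to exercise the interface."""
    def run_query(self, query: str) -> None:
        pass

elapsed = NoOpDriver().timed_run("SELECT 1")
print(f"elapsed: {elapsed:.6f}s")
```

Adding support for a new system would then mean implementing one subclass, while the timing and reporting machinery stays shared.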
43. • “Benchmarks shape a field (for better or worse) …” -- David Patterson, Univ. of California, Berkeley
• Benchmarks meet different needs for different people: end customers, application developers, system designers, system administrators, researchers, CIOs
• BigFrame helps users generate benchmarks that best meet their needs