A number of systems have been released recently for use in interactive and real-time analytics. Examples include Drill, Druid, Impala, Muppet, Shark/Spark, Storm, and Tez. It can be confusing for a practitioner to pick the best system for her specific needs. Statements like “this system is 10x better than Hive” can be misleading without understanding factors like: (i) the workload and environment where the improvement can be repeatably obtained, (ii) whether proper system tuning can change the result, and (iii) whether the results would differ under other workloads. Duke and two other research institutions are jointly conducting a large-scale experimental study with multiple systems and workloads in order to answer these questions of broad interest. The workloads used in the study represent new-generation analytics needs that cover a diverse spectrum including SQL-like queries, machine-learning analysis, graph and matrix processing, and queries running continuously over rapid data streams. The talk will use the results from this study to present the strengths and weaknesses of each system, and rigorously characterize the scenarios where each system is the right choice. Opportunities to improve the systems with new features or by cross-pollination of features from multiple systems will also be presented.
2. Introduction
• Who am I: Shivnath Babu
• Associate Prof. of Computer Science at Duke University
• Chief Scientist at Unravel Data Systems
• Build tools for easy system management
• What is this talk about: BigFrame
• BigFrame helps you benchmark big data analytics systems …
• … with a benchmark created automatically by BigFrame …
• … for your custom application and workload needs
• First open-source release planned for August 2013
14. Challenges for Practitioners
• App Developers, Data Scientists: Which system to use for the app that I am developing?
• Features (e.g., graph data)
• Performance (e.g., claims like “System A is 50x faster than B”)
• Resource efficiency
• Growth and scalability
• Multi-tenancy
15. Challenges for Practitioners
• System Admins: Different parts of my app have different requirements
• Compose “best of breed” systems OR use a “one size fits all” system?
• Managing many systems is hard!
16. Challenges for Practitioners
• CIO: Total Cost of Ownership (TCO)?
37. Use Case I: Exploratory BI
• Large volumes of relational data
• Mostly aggregation and few joins
• Can Spark’s performance match that of an MPP DB?
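To make the “mostly aggregation and few joins” shape of this use case concrete, here is a minimal sketch of such a query, using an in-memory SQLite database and a made-up star-schema fragment (the tables `sales` and `stores` are illustrative, not BigFrame’s actual schema):

```python
import sqlite3

# Hypothetical star-schema fragment: one large fact table, one small
# dimension table (NOT BigFrame's real schema -- purely illustrative).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (item_id INT, store_id INT, amount REAL)")
cur.execute("CREATE TABLE stores (store_id INT, region TEXT)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, 1, 10.0), (2, 1, 5.0), (1, 2, 7.5)])
cur.executemany("INSERT INTO stores VALUES (?, ?)",
                [(1, "east"), (2, "west")])

# Canonical exploratory-BI query shape: group-by aggregation plus a
# single join -- the pattern where an MPP DB is traditionally strong.
cur.execute("""
    SELECT st.region, SUM(s.amount) AS total
    FROM sales s JOIN stores st ON s.store_id = st.store_id
    GROUP BY st.region
    ORDER BY st.region
""")
result = cur.fetchall()
print(result)  # [('east', 15.0), ('west', 7.5)]
```

The same query shape can be expressed against Spark (e.g., via Spark SQL) or an MPP database, which is what makes this workload a natural head-to-head comparison.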
38. Use Case II: Complex BI
• Large volumes of relational data
• Even larger volumes of text data
• Combined analytics
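“Combined analytics” here means answering one question that spans both the relational and the text side. A tiny sketch, with made-up product/review data and a naive keyword count standing in for real text analytics:

```python
from collections import Counter

# Illustrative data only: structured revenue per product plus
# unstructured review text for the same products.
sales = {101: 250.0, 102: 90.0}          # product_id -> revenue
reviews = [
    (101, "great battery, great screen"),
    (101, "terrible support"),
    (102, "great value"),
]

def keyword_hits(text, keyword="great"):
    # Naive stand-in for text analytics: count keyword occurrences.
    return Counter(text.split())[keyword]

# Combine the relational and text sides into one answer per product.
combined = {
    pid: {"revenue": rev,
          "positive_mentions": sum(keyword_hits(t)
                                   for p, t in reviews if p == pid)}
    for pid, rev in sales.items()
}
print(combined)
```

The point of the use case is that the relational part and the text part may each favor a different system, yet the final answer needs both.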
39. Use Case III: Dashboards
• Large volume and velocity of relational and text data
• Continuously-updated dashboards
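A continuously-updated dashboard metric is, at its core, an aggregate maintained incrementally over a window of recent events. A minimal sketch (window size and event values are made up; real systems like Storm or Muppet do this distributed and fault-tolerantly):

```python
from collections import deque

class SlidingSum:
    """Keep a running sum over the last `window` stream events --
    the simplest form of a continuously-updated dashboard value."""
    def __init__(self, window=3):
        self.window = deque(maxlen=window)  # old events fall off

    def update(self, value):
        self.window.append(value)
        return sum(self.window)  # current dashboard reading

dash = SlidingSum(window=3)
snapshots = [dash.update(v) for v in [4, 1, 2, 10]]
print(snapshots)  # [4, 5, 7, 13]
```

Each incoming event updates the displayed value immediately, which is what distinguishes this workload from batch-style BI queries.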
40. Use Case IV: Does One Size Fit All?
• Growing set of applications have to process relational, text, & graph data
• Compose “best of breed” systems or use a “one size fits all” system?
41. Use Case V: Multi-tenancy and SLAs
• Big data deployments are increasingly multi-tenant and need to meet SLAs
42. Working with the Community
• First release of BigFrame planned for August 2013
• With feedback from benchmark developers (BigBench)
• Open-source with extensibility APIs
• Benchmark Drivers for more systems
• Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking)
• Instantiate the BigFrame pipeline for more app domains
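As a rough illustration of what a pluggable Benchmark Driver could look like, here is a hypothetical sketch; BigFrame’s actual extensibility APIs may differ, and `BenchmarkDriver`, `NoOpDriver`, and `timed_run` are names invented for this example:

```python
import abc
import time

class BenchmarkDriver(abc.ABC):
    """Hypothetical driver interface: one subclass per system under
    test (e.g., Hive, Shark, Impala), each reporting query timings."""

    @abc.abstractmethod
    def run_query(self, query: str) -> None:
        """Execute one benchmark query on the system under test."""

    def timed_run(self, query: str) -> float:
        # Shared timing logic so every driver measures the same way.
        start = time.perf_counter()
        self.run_query(query)
        return time.perf_counter() - start

class NoOpDriver(BenchmarkDriver):
    """Stand-in driver used here only to exercise the interface."""
    def run_query(self, query: str) -> None:
        pass

elapsed = NoOpDriver().timed_run("SELECT 1")
print(f"elapsed: {elapsed:.6f}s")
```

Adding support for a new system would then mean implementing one subclass, while the timing and reporting machinery stays shared.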
43. • “Benchmarks shape a field (for better or worse) …” -- David Patterson, Univ. of California, Berkeley
• Benchmarks meet different needs for different people: end customers, application developers, system designers, system administrators, researchers, CIOs
• BigFrame helps users generate benchmarks that best meet their needs