A REAL TIME DATA QUERY ENGINE
Michael Natkovich & Nate Speidel
Allow Myself to Introduce . . . Myself
■ Nate Speidel
● nspeidel@oath.com
● Software Engineer
● 2+ years of solving data problems at Yahoo
Allow Myself to Introduce . . . Myself
■ Michael Natkovich
● mln@oath.com
● Director, Software Dev Engineering
● 10+ years of causing data problems at Yahoo
Motivation: Cycle of Sadness
■ Instrumentation validation is unbearably slow
● Needs to be seconds not hours
● Needs to be easy to query
● Needs programmatic access
Typical Query Engine
[Diagram labels: Data Flow, Persistence, Queries]
Look Forward Query Engine
[Diagram labels: Data Flow, Query Engine, Query Results; the stream is marked as Old Un-Queryable Data, Current Queryable Data, and Future Queryable Data]
Typical Streaming Query Cost
[Diagram labels: Storm Query 1, Storm Query 2, Storm Query 3, Spark Query 1]
Input: 2MM events/sec
Throughput: 1K events/sec/core
Resources: 2K cores/query
Total: 8K cores (4 queries x 2K cores)
Bullet Query Cost
[Diagram labels: Bullet Query 1, Bullet Query 2, Bullet Query 3, Bullet Query 4]
Input: 2MM events/sec
Throughput: 1K events/sec/core
Resources: 2K cores (shared by all queries)
Total: 2K cores
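The arithmetic behind the two cost slides can be written out as a small calculation. This is only a sketch built from the slide's numbers; the function names are illustrative, and the Bullet figure assumes that per-query overhead is negligible, as the deck claims.

```python
import math

INPUT_EVENTS_PER_SEC = 2_000_000      # "2MM events/sec" from the slides
EVENTS_PER_SEC_PER_CORE = 1_000       # "1K events/sec/core" from the slides


def cores_to_read_stream(input_rate: int, per_core_rate: int) -> int:
    """Cores needed for one full pass over the stream."""
    return math.ceil(input_rate / per_core_rate)


def typical_cost(num_queries: int) -> int:
    """Each streaming job reads the whole stream on its own."""
    return num_queries * cores_to_read_stream(
        INPUT_EVENTS_PER_SEC, EVENTS_PER_SEC_PER_CORE)


def bullet_cost(num_queries: int) -> int:
    """One shared pass over the stream serves every query.

    Assumes per-query overhead is negligible, so the query count
    does not change the result.
    """
    return cores_to_read_stream(INPUT_EVENTS_PER_SEC, EVENTS_PER_SEC_PER_CORE)


print(typical_cost(4))  # 8000
print(bullet_cost(4))   # 2000
```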
Bullet
■ Retrieves data that arrives after query submission
● Look Forward!
■ No persistence layer
■ Lightweight, fast, and scalable
■ UI for ad-hoc queries
■ API for programmatic querying
■ Pluggable interface to integrate with streaming data
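A toy model may help make "look forward" concrete. The class and method names below (`LookForwardEngine`, `submit`, `ingest`) are hypothetical and are not Bullet's API; the sketch only shows that nothing is persisted and that a query sees only records arriving after it was submitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


@dataclass
class Query:
    predicate: Callable[[Record], bool]   # filter applied to each record
    submitted_at: int                     # stream position when submitted
    results: List[Record] = field(default_factory=list)


class LookForwardEngine:
    """Hypothetical engine: nothing is stored except the live queries."""

    def __init__(self) -> None:
        self.position = 0                 # number of records seen so far
        self.queries: Dict[str, Query] = {}

    def submit(self, query_id: str, predicate: Callable[[Record], bool]) -> None:
        self.queries[query_id] = Query(predicate, self.position)

    def ingest(self, records: Iterable[Record]) -> None:
        # One pass over the stream serves every registered query.
        for record in records:
            self.position += 1
            for query in self.queries.values():
                if query.predicate(record):
                    query.results.append(record)


engine = LookForwardEngine()
engine.ingest([{"page": "home"}, {"page": "mail"}])   # arrives before the query
engine.submit("q1", lambda r: r["page"] == "home")
engine.ingest([{"page": "home"}, {"page": "news"}])   # arrives after the query
print(len(engine.queries["q1"].results))  # 1: only the later "home" record
```

The earlier "home" record is never returned because it was not kept anywhere, which is the trade the deck describes: no storage cost, no repeatability.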
What It’s For
■ Single stream, multiple consumers
■ Ad-hoc interactive usage
■ Programmatic short-lived queries
What It’s Not For
■ Repeatable queries
■ Currently no joins
■ Not meant for ETL
Querying in Bullet
■ Supports filtering and logical operators on typed data
■ Supports aggregations
● Group By, Count Distincts, Top K, Distributions
● DataSketches based
■ Queries have life spans
● All queries run for a specified duration (or infinitely)
■ Results are Windowed
● Windows can be time or record based
● Raw record or aggregate based
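Filtering with logical operators can be pictured as evaluating a small clause tree against each record. The dictionary layout below is a made-up format for illustration and is not Bullet's query syntax; the field names (`country`, `latency_ms`, `status`) are also assumptions.

```python
from typing import Any, Dict

Record = Dict[str, Any]

# Hypothetical filter format for illustration; not Bullet's query syntax.
OPERATIONS = {
    "==": lambda a, b: a == b,
    "!=": lambda a, b: a != b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}


def matches(clause: Dict[str, Any], record: Record) -> bool:
    """Evaluate a nested AND/OR/NOT filter clause against one record."""
    op = clause["op"]
    if op == "AND":
        return all(matches(c, record) for c in clause["clauses"])
    if op == "OR":
        return any(matches(c, record) for c in clause["clauses"])
    if op == "NOT":
        return not matches(clause["clause"], record)
    # Leaf clause: compare a field to a value; a missing field never matches.
    if clause["field"] not in record:
        return False
    return OPERATIONS[op](record[clause["field"]], clause["value"])


query_filter = {
    "op": "AND",
    "clauses": [
        {"op": "==", "field": "country", "value": "US"},
        {"op": "OR", "clauses": [
            {"op": ">", "field": "latency_ms", "value": 500},
            {"op": "==", "field": "status", "value": 500},
        ]},
    ],
}

print(matches(query_filter, {"country": "US", "latency_ms": 750, "status": 200}))  # True
print(matches(query_filter, {"country": "CA", "latency_ms": 750, "status": 200}))  # False
```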
Streaming Aggregations
■ Motivation
● Calculating cardinality
● Getting live latency distributions
● Validating experiment bucket sizes
■ Aggregations are Hard
● Data skew
● Intermediate results are large and expensive to move
● The longer you run, the more memory you need
● Incremental results for non-additive operations (such as Count Distinct) can't simply be combined
Count Distinct: Naive
Problem: overwhelms the single combiner
1. Read Input
2. Round Robin
3. Extract Field
4. Send to Combiner
5. Count Distincts
Count Distinct: Typical
Problem: vulnerable to data skew
1. Read Input
2. Round Robin
3. Extract Field
4. Hash Partition
5. Count Distincts
6. Send Count
7. Combine Counts
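The hash-partitioned approach can be sketched in a few lines. This is a single-process illustration, not a distributed job: the round-robin step is omitted, `zlib.crc32` stands in for whatever partitioner a real stream processor uses, and the input data is invented to show the skew problem.

```python
import zlib
from collections import Counter
from typing import Iterable, List, Set


def partition_of(value: str, num_partitions: int) -> int:
    """Stable hash partitioning: equal values always go to the same partition."""
    return zlib.crc32(value.encode("utf-8")) % num_partitions


def count_distinct_partitioned(values: Iterable[str], num_partitions: int) -> int:
    partitions: List[Set[str]] = [set() for _ in range(num_partitions)]
    for value in values:                       # read, extract, hash partition
        partitions[partition_of(value, num_partitions)].add(value)
    counts = [len(p) for p in partitions]      # count locally, send counts
    return sum(counts)                         # combine counts


def partition_load(values: Iterable[str], num_partitions: int) -> Counter:
    """Records routed to each partition; skewed data piles onto a few."""
    return Counter(partition_of(v, num_partitions) for v in values)


# Skewed input: one hot user produces most of the events.
events = ["hot_user"] * 9_000 + [f"user_{i}" for i in range(1_000)]
print(count_distinct_partitioned(events, 4))     # 1001, exact
print(max(partition_load(events, 4).values()))   # at least 9000 on one partition
```

Summing the per-partition counts is exact because each value lands in exactly one partition, but the partition holding the hot key receives at least 90% of the records, and each partition's memory still grows with the number of distinct values it sees.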
Count Distinct: Sketches
1. Read Input
2. Round Robin
3. Build Sketch
4. Send to Combiner
5. Merge Sketches
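The build-then-merge flow can be illustrated with a K-Minimum-Values (KMV) sketch, a simple relative of the Theta Sketch idea. This is a teaching sketch under stated assumptions, not the DataSketches library implementation: the class name, the SHA-256 hash, and k=1024 are all choices made here.

```python
import hashlib


def hash01(value: str) -> float:
    """Map a value to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps only the k smallest distinct hash values seen."""

    def __init__(self, k: int = 1024) -> None:
        self.k = k
        self.hashes: set = set()

    def update(self, value: str) -> None:
        h = hash01(value)
        if h in self.hashes:
            return                                  # duplicates never change the sketch
        if len(self.hashes) < self.k:
            self.hashes.add(h)
        elif h < max(self.hashes):
            self.hashes.remove(max(self.hashes))    # evict the largest, stay at k
            self.hashes.add(h)

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged

    def estimate(self) -> float:
        if len(self.hashes) < self.k:
            return float(len(self.hashes))          # below capacity: exact count
        return (self.k - 1) / max(self.hashes)      # standard KMV estimator


# Round-robin the stream across three workers; each builds its own sketch.
stream = [f"user_{i % 10_000}" for i in range(30_000)]
workers = [KMVSketch(k=1024) for _ in range(3)]
for i, value in enumerate(stream):
    workers[i % 3].update(value)

combined = workers[0].merge(workers[1]).merge(workers[2])
print(round(combined.estimate()))   # close to the true 10,000 distinct users
```

Each worker ships at most k hash values to the combiner regardless of how much data it read, which is why the combiner is not overwhelmed and why data skew does not matter.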
Data Sketches
■ Sketches are a class of stochastic streaming algorithms
■ Provide approximate results (only when the data is too large to hold exactly)
■ Provable error bounds
■ Fixed memory footprint
■ Mergeable, allowing for parallel processing
Data Sketches in Streams
■ Accurate to a Point
● A Sketch large enough to hold all of its input gives exact (100% accurate) results
● Beyond that point, error falls as Sketch size grows (roughly with the square root of the size for Theta-style Sketches)
■ Fixed Memory Ceiling
● Maximum Sketch size is configured in advance
● Memory cost of a query is thus known in advance
■ Allows Non-additive Operations to be Additive
● Sketches can be merged into a single Sketch without overcounting
● Allows tasks to be parallelized and cheaply combined later
● Allows results to be combined across windows of execution
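A tiny example shows why Count Distinct is non-additive across windows and what merging buys. Here a plain Python `set` stands in for a Sketch: it merges the same way but is exact and unbounded in memory, whereas a real Sketch would be approximate and fixed in size. The user names are invented.

```python
from typing import Iterable, List


def naive_total(windows: Iterable[List[str]]) -> int:
    """Add up per-window distinct counts (overcounts repeat visitors)."""
    return sum(len(set(window)) for window in windows)


def merged_total(windows: Iterable[List[str]]) -> int:
    """Merge the per-window summaries first, then count once."""
    merged: set = set()
    for window in windows:
        merged |= set(window)       # union plays the role of a Sketch merge
    return len(merged)


windows = [
    ["alice", "bob", "carol"],      # window 1
    ["bob", "carol", "dave"],       # window 2
    ["alice", "dave", "erin"],      # window 3
]
print(naive_total(windows))    # 9: users seen in several windows are counted again
print(merged_total(windows))   # 5: the true number of distinct users
```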
Bullet’s Use of Data Sketches
Data Sketch            | Query Type
Theta Sketch           | Count Distinct
Tuple Sketch           | Group By
Quantile Sketch        | Distributions
Frequent Items Sketch  | Top K
Windowing
■ A way of breaking up an endless stream into digestible components
■ Typically split by time or by number of records
■ Needed for incremental results
■ A window is the unit in which incremental results are emitted
Windowing
■ Tumbling Windows*
● Contiguous non-overlapping windows at regular intervals
■ Hopping Windows
● Contiguous (possibly) overlapping windows at regular intervals
■ Sliding Windows*
● Event based windows looking back at regular event intervals
■ Cascading Windows
● Sliding windows that also reset at regular intervals
■ Session Windows
● Sliding windows that reset if the gap between events exceeds a threshold
Why Windowing
■ Example: Number of distinct users in the next 60 seconds
■ Option 1: Wait 60 secs to get results
● No feedback :(
■ Option 2: Every 5 secs, get current state until end
● Continuous feedback with same final results
● Stop queries early (sufficient information gleaned, query bad, etc.)
● Quickly iterate queries
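Option 2 can be sketched as a generator that reports the running distinct count every 5 seconds of a 60 second query. The event format and function name are assumptions, events are assumed to arrive in timestamp order, and an exact `set` is used where a real engine would hold a Sketch.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[float, str]   # (seconds since query start, user id)


def distinct_users_incremental(
    events: Iterable[Event], duration: float = 60.0, emit_every: float = 5.0
) -> Iterator[Tuple[float, int]]:
    """Yield (elapsed seconds, distinct users so far) every `emit_every` seconds.

    Assumes events arrive in timestamp order.
    """
    seen = set()
    next_emit = emit_every
    for timestamp, user in events:
        if timestamp >= duration:
            break                              # query life span is over
        while timestamp >= next_emit:          # close any intervals that have passed
            yield next_emit, len(seen)
            next_emit += emit_every
        seen.add(user)
    while next_emit <= duration:               # flush remaining intervals up to the end
        yield next_emit, len(seen)
        next_emit += emit_every


events = [(1, "a"), (3, "b"), (7, "a"), (12, "c"), (58, "d"), (61, "e")]
results = list(distinct_users_incremental(events))
print(results[0])    # (5.0, 2)
print(results[-1])   # (60.0, 4)
```

The last emitted value is the same answer Option 1 would give after 60 seconds; the earlier values are the feedback that lets a user stop or adjust a query early.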
Tumbling Window
10 second window
[Figure: timeline from 0 to 30 seconds with records 1 to 9; windows hold records (1 2 3 4 5), (6 7), (8 9)]
Tumbling Window
3 record window
[Figure: timeline from 0 to 30 with records 1 to 9; windows hold records (1 2 3), (4 5 6), (7 8 9)]
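Both tumbling variants can be sketched in a few lines. The function names are illustrative, and the timestamps below are invented so that the grouping matches the two slides; the time-based version simply skips windows that received no records.

```python
from typing import Dict, Iterable, List, Tuple


def tumbling_by_time(events: Iterable[Tuple[float, int]], size: float) -> List[List[int]]:
    """Group (timestamp, record) pairs into back-to-back windows of `size` seconds."""
    windows: Dict[int, List[int]] = {}
    for timestamp, record in events:
        windows.setdefault(int(timestamp // size), []).append(record)
    return [windows[index] for index in sorted(windows)]


def tumbling_by_count(records: Iterable[int], size: int) -> List[List[int]]:
    """Group records into back-to-back windows of `size` records."""
    windows: List[List[int]] = []
    for record in records:
        if not windows or len(windows[-1]) == size:
            windows.append([])                 # previous window is full: start a new one
        windows[-1].append(record)
    return windows


# Timestamps are made up so that the grouping matches the slides.
events = [(1, 1), (3, 2), (5, 3), (7, 4), (9, 5), (13, 6), (18, 7), (22, 8), (27, 9)]
print(tumbling_by_time(events, 10))                   # [[1, 2, 3, 4, 5], [6, 7], [8, 9]]
print(tumbling_by_count([r for _, r in events], 3))   # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```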
Sliding Window
3 record window, 1 record slide
[Figure: timeline from 0 to 10 with records 1 to 5; successive windows hold records (1), (1 2), (1 2 3), (2 3 4), (3 4 5)]
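The sliding window on this slide can be reproduced with a bounded deque. This is a minimal sketch with an illustrative function name; it fixes the slide at 1 record, as in the figure.

```python
from collections import deque
from typing import Iterable, Iterator, List


def sliding_by_count(records: Iterable[int], size: int) -> Iterator[List[int]]:
    """Emit the last `size` records after every new record (slide of 1)."""
    window: deque = deque(maxlen=size)   # oldest record falls out automatically
    for record in records:
        window.append(record)
        yield list(window)


print(list(sliding_by_count([1, 2, 3, 4, 5], 3)))
# [[1], [1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
```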
[Architecture diagram. Components: Bullet WS, Request Processor, Data Processor, Combiner. Input: Bullet Data Stream (Performance Stats, Sensor Data, User Activity, IoT Data). Edge labels: Query, Query & ID, Data Records, Matching Events & ID, Results.]
Core Design Principles
■ No persistence
● Tradeoff: Query Speed, Low Storage Cost > Repeatability
■ Scale for data and queries
● Each query cost is fixed and negligible, relative to data ingestion
■ Pluggable everything
● Run on top of any stream processor (Spark, Storm, etc.)
● Read from any data source (Kafka, Kinesis, etc.)
● Choose an implementation of the PubSub (Kafka, REST, etc.)
■ Tune everything
● Example: Sketch size vs Sketch accuracy
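The size-versus-accuracy trade can be made concrete with a rule of thumb. The sketch below assumes that the relative standard error of a Theta-style sketch with k entries is about 1/sqrt(k); that approximation, the function names, and the power-of-two sizing are assumptions made here, not Bullet configuration settings.

```python
import math


def approx_relative_error(k: int) -> float:
    """Rough relative standard error for a KMV/Theta-style sketch of k entries."""
    return 1.0 / math.sqrt(k)


def entries_for_error(target_error: float) -> int:
    """Smallest power-of-two k (starting at 16) whose rough error meets the target."""
    k = 16
    while approx_relative_error(k) > target_error:
        k *= 2
    return k


for k in (1024, 4096, 16384):
    print(k, f"{approx_relative_error(k):.1%}")
# 1024 3.1%
# 4096 1.6%
# 16384 0.8%
print(entries_for_error(0.01))  # 16384
```

Halving the error costs roughly four times the entries, so the setting is a direct trade between memory per query and accuracy.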
Overall Architecture
Backend Layer Detailed Architecture: Storm
Backend Layer Detailed Architecture: Spark
Performance: Linearly Scales for Data
Performance: Linearly Scales for Queries
Demos
■ Bullet Reddit
● https://youtu.be/p6rOy9F7K8U
■ Bullet Finance
● https://youtu.be/RMMT4Phdhr8
In Summary
■ Bullet is a lightweight and cheap stream query engine
■ It offers raw record and OLAP style queries
■ Leverages the power of Data Sketches
■ Only need enough hardware to read the data
● Queries are basically free!
■ Abstraction layer that can sit on any Stream Framework
● Implementations available for Storm and Spark
■ Pluggable, allowing consumption from any data source
■ Fully open sourced!!
Future Work
■ BQL: SQL-like interface support (already supported in WS)
■ More stream processor support (Flink)
■ All the Windows!
■ More aggregations (Group By Count Distinct)
Links
■ Documentation: https://bullet-db.github.io/
■ GitHub: https://github.com/bullet-db
■ Contact Us
● Developers: bullet-dev@googlegroups.com
● Users: bullet-users@googlegroups.com
■ Data Sketches: https://datasketches.github.io/
■ Reddit API: https://www.reddit.com/dev/api/
QUESTIONS

Bullet - Open Source Real-Time Data Query Engine, Michael Natkovich, Director Software Dev Engineering & Nate Speidel, Software Engineer, Oath
