Bullet - Open Source Real-Time Data Query Engine, Michael Natkovich, Director Software Dev Engineering & Nate Speidel, Software Engineer, Oath

A REAL TIME DATA QUERY ENGINE
Michael Natkovich & Nate Speidel

Allow Myself to Introduce . . . Myself
■ Nate Speidel
● nspeidel@oath.com
● Software Engineer
● 2+ years of solving data problems at Yahoo

Allow Myself to Introduce . . . Myself
■ Michael Natkovich
● mln@oath.com
● Director, Software Dev Engineering
● 10+ years of causing data problems at Yahoo

Motivation: Cycle of Sadness
■ Instrumentation validation is unbearably slow
● Needs to be seconds not hours
● Needs to be easy to query
● Needs programmatic access

Typical Query Engine
[Diagram: Data Flow -> Persistence <- Queries]

Look Forward Query Engine
[Diagram: Data Flow -> Query Engine -> Query Results]
■ Old Un-Queryable Data
■ Current Queryable Data
■ Future Queryable Data

Typical Streaming Query Cost
Queries: Storm Query 1, Storm Query 2, Storm Query 3, Spark Query 1
Input: 2MM events/sec
Throughput: 1K events/sec/core
Resources: 2K cores/query
Total: 8K cores

Bullet Query Cost
Queries: Bullet Query 1, Bullet Query 2, Bullet Query 3, Bullet Query 4
Input: 2MM events/sec
Throughput: 1K events/sec/core
Resources: 2K cores
Total: 2K cores
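
The arithmetic behind the two cost slides can be checked with a short sketch. The figures come from the slides; the variable names are illustrative, and the query count of 4 is taken from the four queries pictured.

```python
# Figures from the slides; names are illustrative only.
INPUT_EVENTS_PER_SEC = 2_000_000
EVENTS_PER_SEC_PER_CORE = 1_000
NUM_QUERIES = 4

# Cores needed just to keep up with the input stream.
cores_to_read_stream = INPUT_EVENTS_PER_SEC // EVENTS_PER_SEC_PER_CORE

# Typical: every query is its own streaming job and re-reads the stream.
typical_total = cores_to_read_stream * NUM_QUERIES

# Bullet: the stream is read once and shared by all running queries.
bullet_total = cores_to_read_stream

print(cores_to_read_stream, typical_total, bullet_total)  # 2000 8000 2000
```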

Bullet
■ Retrieves data that arrives after query submission
● Look Forward!
■ No persistence layer
■ Light-weight, fast, and scalable
■ UI for Ad-Hoc queries
■ API for programmatic querying
■ Pluggable interface to integrate with streaming data
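
A minimal sketch of the look-forward idea, assuming a toy in-process engine: records are checked against the queries running at that moment and then dropped, so a query only sees data that arrives after it was submitted. `LookForwardEngine`, `Query` and `max_records` are hypothetical names for illustration, not Bullet's API, and the record limit stands in for a query duration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Query:
    predicate: Callable[[Record], bool]  # filter applied to each record
    max_records: int                     # stop condition (stand-in for a duration)
    results: List[Record] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.results) >= self.max_records


class LookForwardEngine:
    """Hypothetical toy engine: no storage, only live queries."""

    def __init__(self) -> None:
        self.queries: Dict[str, Query] = {}

    def submit(self, query_id: str, query: Query) -> None:
        self.queries[query_id] = query

    def on_record(self, record: Record) -> None:
        # Each record is checked against every running query, then dropped.
        for query in self.queries.values():
            if not query.done and query.predicate(record):
                query.results.append(record)


engine = LookForwardEngine()
engine.on_record({"user": "a", "page": "home"})  # arrives before the query: never seen
engine.submit("q1", Query(lambda r: r["page"] == "home", max_records=2))
engine.on_record({"user": "b", "page": "home"})
engine.on_record({"user": "c", "page": "mail"})
engine.on_record({"user": "d", "page": "home"})
engine.on_record({"user": "e", "page": "home"})  # query already finished

print([r["user"] for r in engine.queries["q1"].results])  # ['b', 'd']
```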

What It’s For
■ Single stream, multiple consumers
■ Ad-hoc interactive usage
■ Programmatic short-lived queries

What It’s Not For
■ Repeatable queries
■ Currently no joins
■ Not meant for ETL

Querying in Bullet
■ Supports filtering and logical operators on typed data
■ Supports aggregations
● Group By, Count Distincts, Top K, Distributions
● DataSketches based
■ Queries have life spans
● All queries run for a specified duration (or infinitely)
■ Results are Windowed
● Windows can be time or record based
● Raw record or aggregate based

Streaming Aggregations
■ Motivation
● Calculating cardinality
● Getting live latency distributions
● Validate experimentation bucket sizes
■ Aggregations are Hard
● Data skew
● Intermediate results are large and expensive to move
● The longer you run, the more memory you need
● Incremental results can’t be combined
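
A small illustration of why incremental results for a non-additive aggregation cannot simply be combined. The two partitions and user IDs are made up for the example: summing per-partition distinct counts overcounts users seen in both.

```python
# Two partitions (or two windows) of the same stream; values are hypothetical.
partition_a = ["u1", "u2", "u3"]
partition_b = ["u2", "u3", "u4"]

count_a = len(set(partition_a))  # 3
count_b = len(set(partition_b))  # 3

# Adding the partial counts double-counts u2 and u3.
summed = count_a + count_b

# The exact answer needs the underlying values, not just the counts.
exact = len(set(partition_a) | set(partition_b))

print(summed, exact)  # 6 4
```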

Count Distinct: Naive
Problem: Overwhelms the single combiner
1. Read Input
2. Round Robin
3. Extract Field
4. Send to Combiner
5. Count Distincts
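
The naive steps above can be sketched in a single process, assuming made-up events and three workers. The point to notice is that the combiner receives one value per input event, so its load grows with the stream.

```python
from itertools import cycle

events = [{"user": f"u{i % 5}"} for i in range(20)]  # 20 events, 5 distinct users
NUM_WORKERS = 3

# 1-2. Read input and round-robin events across workers.
workers = [[] for _ in range(NUM_WORKERS)]
for worker, event in zip(cycle(workers), events):
    worker.append(event)

# 3-4. Each worker extracts the field and forwards every value to one combiner.
combiner_inbox = [event["user"] for worker in workers for event in worker]

# 5. The combiner alone deduplicates everything it received.
distinct_count = len(set(combiner_inbox))

print(len(combiner_inbox), distinct_count)  # 20 5
```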

Count Distinct: Typical
Problem: Vulnerable to data skew
1. Read Input
2. Round Robin
3. Extract Field
4. Hash Partition
5. Count Distincts
6. Send Count
7. Combine Counts
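
A sketch of the hash-partitioned approach, starting from step 4 with the field already extracted. The stream is hypothetical and deliberately skewed: one value makes up 90% of events, so whichever counter owns it does almost all the work.

```python
import zlib

events = ["hot"] * 90 + [f"u{i}" for i in range(10)]  # one value dominates the stream
NUM_COUNTERS = 4

# 4. Hash-partition: the same value always goes to the same counter.
counters = [set() for _ in range(NUM_COUNTERS)]
load = [0] * NUM_COUNTERS
for value in events:
    idx = zlib.crc32(value.encode()) % NUM_COUNTERS
    counters[idx].add(value)
    load[idx] += 1

# 5-7. Each counter counts its own distincts; partitions are disjoint,
# so the partial counts can be summed safely.
distinct_count = sum(len(c) for c in counters)

print(distinct_count)  # 11
print(load)            # one entry is at least 90: the skewed counter
```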

Count Distinct: Sketches
1. Read Input
2. Round Robin
3. Build Sketch
4. Send to Combiner
5. Merge Sketches
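
To make the sketch-based steps concrete, here is a toy K-Minimum-Values (KMV) sketch, a simple relative of the Theta Sketch. It is a stand-in written for this example, not the DataSketches implementation, and `KMVSketch` and its methods are hypothetical names. Each worker builds a small fixed-size sketch, and the combiner only merges sketches.

```python
import hashlib


class KMVSketch:
    """Toy K-Minimum-Values sketch; not the DataSketches implementation."""

    def __init__(self, k=512):
        self.k = k
        self.hashes = set()  # holds at most the k smallest hash values seen

    @staticmethod
    def _hash(item):
        digest = hashlib.sha256(item.encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)

    def update(self, item):
        self.hashes.add(self._hash(item))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))  # keep only the k smallest

    def merge(self, other):
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))      # exact while under capacity
        return (self.k - 1) / max(self.hashes)  # uses the k-th smallest hash


# 1-3. Round-robin events to workers; each builds its own sketch.
workers = [KMVSketch() for _ in range(3)]
for i in range(10_000):
    workers[i % 3].update(f"user-{i % 5_000}")  # 5,000 distinct users, each seen twice

# 4-5. The combiner merges fixed-size sketches instead of raw values.
combined = workers[0]
for sketch in workers[1:]:
    combined = combined.merge(sketch)

print(round(combined.estimate()))  # approximately 5000
```

Duplicates that land on different workers hash to the same value, so merging does not count them twice.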

Data Sketches
■ Sketches are a class of stochastic streaming algorithms
■ Provide approximate results (if data is too large)
■ Provable error bounds
■ Fixed memory footprint
■ Mergeable, allowing for parallel processing

Data Sketches in Streams
■ Accurate to a Point
● A Sketch large enough to hold all of the data is 100% accurate
● Error rate is inversely proportional to size of a Sketch
■ Fixed Memory Ceiling
● Maximum Sketch size is configured in advance
● Memory cost of a query is thus known in advance
■ Allows Non-additive Operations to be Additive
● Sketches can be merged into a single Sketch without overcounting
● Allows tasks to be parallelized and cheaply combined later
● Allows results to be combined across windows of execution

Bullet’s Use of Data Sketches
Data Sketch              Query Type
Theta Sketch             Count Distinct
Tuple Sketch             Group By
Quantile Sketch          Distributions
Frequent Items Sketch    Top K
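
As an illustration of how a fixed-size summary can answer Top K, here is the classic Misra-Gries frequent-items algorithm. It is a textbook technique chosen for this example and is not presented as what Bullet or DataSketches runs internally; `misra_gries` and `capacity` are hypothetical names.

```python
import random


def misra_gries(stream, capacity):
    """Keep at most `capacity` counters, whatever the stream length."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream: two heavy hitters plus 20 one-off values.
stream = ["a"] * 50 + ["b"] * 30 + [f"rare-{i}" for i in range(20)]
random.Random(0).shuffle(stream)

counters = misra_gries(stream, capacity=4)
top = sorted(counters, key=counters.get, reverse=True)[:2]
print(top)
```

With 4 counters and 100 events, any value occurring more than 100 / 5 = 20 times is guaranteed to survive, so both heavy hitters are retained regardless of event order. The stored counts are lower bounds on the true counts.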

Windowing
■ A way of breaking up an endless stream into digestible components
■ Typically broken using time or records
■ Needed for incremental results
■ A window is the unit of incrementation

Windowing
■ Tumbling Windows*
● Contiguous non-overlapping windows at regular intervals
■ Hopping Windows
● Contiguous (possibly) overlapping windows at regular intervals
■ Sliding Windows*
● Event based windows looking back at regular event intervals
■ Cascading Windows
● Sliding windows that also reset at regular intervals
■ Session Windows
● Sliding windows that reset if the gap between events exceeds a threshold

Why Windowing
■ Example: Number of distinct users in the next 60 seconds
■ Option 1: Wait 60 secs to get results
● No feedback :(
■ Option 2: Every 5 secs, get current state until end
● Continuous feedback with same final results
● Stop queries early (sufficient information gleaned, query bad, etc.)
● Quickly iterate queries
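
Option 2 can be sketched as a generator that emits the running distinct count every 5 seconds of a 60-second query. The events and the function name `running_distinct` are made up for illustration. The last snapshot equals what Option 1 would return after the full wait.

```python
def running_distinct(events, duration=60, emit_every=5):
    """Yield (elapsed_seconds, distinct_count) at each emit boundary.

    `events` is a time-ordered list of (seconds_since_submit, user) pairs.
    """
    seen = set()
    pending = iter(events)
    event = next(pending, None)
    for boundary in range(emit_every, duration + 1, emit_every):
        # Fold in every event that arrived before this boundary.
        while event is not None and event[0] < boundary:
            seen.add(event[1])
            event = next(pending, None)
        yield boundary, len(seen)


events = [(1, "a"), (3, "b"), (7, "a"), (12, "c"), (58, "d")]
snapshots = list(running_distinct(events))

print(snapshots[:3])  # [(5, 2), (10, 2), (15, 3)]
print(snapshots[-1])  # (60, 4)
```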

Tumbling Window
10 second window
Time axis (seconds): 0 5 10 15 20 25 30
Records: 1 2 3 4 5 6 7 8 9
■ Window 1: records 1 2 3 4 5
■ Window 2: records 6 7
■ Window 3: records 8 9
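
A minimal sketch of the time-based tumbling window pictured above. The slide does not give exact arrival times, so the timestamps below are assumptions chosen to reproduce its grouping. `tumbling_by_time` is a hypothetical helper.

```python
from collections import defaultdict


def tumbling_by_time(events, window_seconds):
    """Group (timestamp, record) pairs into non-overlapping time windows."""
    windows = defaultdict(list)
    for timestamp, record in events:
        windows[int(timestamp // window_seconds)].append(record)
    # Windows that received no records are simply absent from the output.
    return [windows[index] for index in sorted(windows)]


# Assumed arrival times (seconds) for records 1 through 9.
events = [(1, 1), (3, 2), (5, 3), (7, 4), (9, 5), (13, 6), (17, 7), (22, 8), (27, 9)]

print(tumbling_by_time(events, 10))  # [[1, 2, 3, 4, 5], [6, 7], [8, 9]]
```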

Tumbling Window
3 record window
Time axis: 0 5 10 15 20 25 30
Records: 1 2 3 4 5 6 7 8 9
■ Window 1: records 1 2 3
■ Window 2: records 4 5 6
■ Window 3: records 7 8 9
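
The record-based variant closes a window after a fixed number of records, regardless of arrival time. A minimal sketch, with `tumbling_by_count` as a hypothetical helper:

```python
def tumbling_by_count(records, size):
    """Split records into consecutive, non-overlapping windows of `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


print(tumbling_by_count(list(range(1, 10)), 3))  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```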

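Sliding Window
3 record window, 1 record slide
Time axis: 0 5 10
Records: 1 2 3 4 5
■ Window 1: record 1
■ Window 2: records 1 2
■ Window 3: records 1 2 3
■ Window 4: records 2 3 4
■ Window 5: records 3 4 5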
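
A sketch of the sliding window pictured above: a result is emitted after every record, looking back at up to the last 3 records. `sliding_by_count` is a hypothetical helper; a bounded `deque` drops the oldest record automatically.

```python
from collections import deque


def sliding_by_count(records, size, slide=1):
    """Emit the last `size` records after every `slide` records."""
    window = deque(maxlen=size)  # oldest record falls out when full
    for index, record in enumerate(records, start=1):
        window.append(record)
        if index % slide == 0:
            yield list(window)


print(list(sliding_by_count([1, 2, 3, 4, 5], size=3)))
# [[1], [1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
```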

[Architecture diagram]
■ Components: Bullet WS, Request Processor, Data Processor, Combiner
■ Bullet Data Stream sources: Performance Stats, Sensor Data, User Activity, IoT Data
■ Message labels: Query, Query & ID, Data Records, Matching Events & ID, Results

Core Design Principles
■ No persistence
● Tradeoff: Query Speed, Low Storage Cost > Repeatability
■ Scale for data and queries
● Each query cost is fixed and negligible, relative to data ingestion
■ Pluggable everything
● Run on top of any stream processor (Spark, Storm, etc.)
● Read from any data source (Kafka, Kinesis, etc.)
● Choose an implementation of the PubSub (Kafka, REST, etc.)
■ Tune everything
● Example: Sketch size vs Sketch accuracy

Backend Layer Detailed Architecture: Storm

Backend Layer Detailed Architecture: Spark

Performance: Linearly Scales for Data

Performance: Linearly Scales for Queries

Demos
■ Bullet Reddit
● https://youtu.be/p6rOy9F7K8U
■ Bullet Finance
● https://youtu.be/RMMT4Phdhr8

In Summary
■ Bullet is a lightweight and cheap stream query engine
■ It offers raw record and OLAP style queries
■ Leverages the power of Data Sketches
■ Only need enough hardware to read the data
● Queries are basically free!
■ Abstraction layer that can sit on any Stream Framework
● Implementations available for Storm and Spark
■ Pluggable allowing for consumption from any data source
■ Fully open sourced!!

Future Work
■ BQL: SQL-like interface support (already supported in WS)
■ More stream processor support (Flink)
■ All the Windows!
■ More aggregations (Group By Count Distinct)

Links
■ Documentation: https://bullet-db.github.io/
■ GitHub: https://github.com/bullet-db
■ Contact Us
● Developers: bullet-dev@googlegroups.com
● Users: bullet-users@googlegroups.com
■ Data Sketches: https://datasketches.github.io/
■ Reddit API: https://www.reddit.com/dev/api/
