Systems
• Real-time
raise alerts
• Real-time with historical
• Correlate
• Offline
• Develop initial monitoring query
• Back-test
• Progressive
Non-temporal analysis
Engine
+ Fabric
Interactive Query Authoring
Real-Time
Dashboard
• Performance
• Fabric & language integration
• Query model
Scenarios
• monitor telemetry & raise alerts
• correlate real-time with logs
• develop initial monitoring query
• back-test over historical logs
• offline analysis (BI) with early results
[Figure: query Q = COUNT(*) over a 5min window, showing input and output snapshots (T-1, T-2, T-3) along logical time]
[Figure: Relational Model vs. Tempo-Relational Model, with query Q and deltas 𝜹]
Supports broad & rich analytics scenarios (relational, progressive, time-based)
• Key enabler: performance + fabric & language integration + query model
struct ClickEvent { long ClickTime; long User; long AdId; }

// Ingest events ordered by ClickTime, tolerating up to 10 seconds of latency
var str = Network.ToStream(e => e.ClickTime, Latency(10secs));

// Per-ad click counts over 5-minute windows for a 5% sample of users
var query =
    str.Where(e => e.User % 100 < 5)
       .Select(e => new { e.AdId })
       .GroupApply(e => e.AdId,
                   s => s.Window(5min).Aggregate(w => w.Count()));

query.Subscribe(e => Console.Write(e)); // write results to console
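To make the semantics of the query above concrete, here is a minimal Python model of it. It is not the Trill API: the function name, the tuple layout, and the choice of a tumbling (non-overlapping) five-minute window measured in seconds are all assumptions made for illustration.

```python
from collections import Counter

WINDOW = 5 * 60  # assumed: tumbling five-minute window, timestamps in seconds

def count_clicks_per_ad(events):
    """events: iterable of (click_time, user, ad_id) tuples.

    Returns {(window_start, ad_id): count} for the sampled users.
    """
    counts = Counter()
    for click_time, user, ad_id in events:
        if user % 100 < 5:                        # Where: keep a ~5% sample of users
            window_start = click_time - click_time % WINDOW
            counts[(window_start, ad_id)] += 1    # group by AdId, count per window
    return dict(counts)

events = [
    (10, 3, 7),     # user 3 is in the sample
    (20, 103, 7),   # 103 % 100 == 3, also in the sample
    (30, 50, 7),    # user 50 is filtered out
    (400, 4, 9),    # falls into the second window
]
result = count_clicks_per_ad(events)
print(result)
```

A real stream engine would emit each window's count incrementally as time advances rather than returning a dictionary at the end.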
stream of batches
• More load → larger batches → better throughput
[Figure: operators op1 and op2 exchanging batches]
class DataBatch {
    long[] SyncTime;   // timestamp column
    ...
    Bitvector BV;      // one bit per row
}

// One array (column) per payload field
class UserData_Gen : DataBatch {
    long[] c_ClickTime;
    long[] c_User;
    long[] c_AdId;
}
[Figure: batch layout between op1 and op2: timestamp columns, payload columns, bitvector]
str.Where(e => e.User % 100<5);
[Figure: the application calls Send(events) and Receive(results)]
On(Batch b) {
    for i = 0 to b.Size - 1 {
        if !(b.c_User[i] % 100 < 5)
            set b.bitvector[i]   // a set bit marks row i as filtered out
    }
    next-operator.On(b)          // pass the same batch downstream
}
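The filter pseudocode above can be mimicked in a few lines of Python. This is only a sketch of the idea (columnar batch plus a bitvector that hides rows instead of copying them); the `Batch` class, its field names, and the helper functions are hypothetical and do not correspond to Trill's generated code.

```python
class Batch:
    """A columnar batch: one list per column plus a per-row bitvector."""

    def __init__(self, sync_time, user, ad_id):
        self.sync_time = sync_time            # timestamp column
        self.c_user = user                    # payload columns
        self.c_ad_id = ad_id
        self.bitvector = [0] * len(user)      # 1 = row is filtered out

def where_sampled_users(batch):
    """Filter in place by setting bits; no rows are moved or copied."""
    for i in range(len(batch.c_user)):
        if not (batch.c_user[i] % 100 < 5):
            batch.bitvector[i] = 1
    return batch                              # the same batch goes downstream

def visible_rows(batch):
    """Materialize only the rows whose bit is still clear."""
    return [
        (batch.sync_time[i], batch.c_user[i], batch.c_ad_id[i])
        for i in range(len(batch.c_user))
        if not batch.bitvector[i]
    ]

batch = where_sampled_users(Batch([1, 2, 3], [3, 50, 104], [7, 7, 9]))
rows = visible_rows(batch)
print(batch.bitvector, rows)
```

Because the filter only flips bits, the column arrays can be handed to the next operator unchanged.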
Trill
session windows,
http://aka.ms/trill
• Lots of “signals” in stream data
• IoT workflows combine relational & signal logic
M = Group-by ID
U = Union
ID Time Value
0 0:42:19 67
1 0:42:22 80
2 0:42:22 85
0 0:42:23 69
2 0:42:24 85
Remove noise
Interpolate missing data
Find periodicity
Discard invalid data
Correlate live data w/ history
[Figure: query plans mixing relational operators (σ, ⋈) with DSP operators]
Which tools to use
to build such apps?
Data Processing expert
• Engines: stream engines, DBMS, MPP systems
• Data model: (tempo)-relational
• Language: declarative (SQL, LINQ, functional)
• Scenarios: real-time, offline, progressive
Digital Signal Processing expert
• Engines: MATLAB, R
• Data model: array
• Language: imperative (array languages, C)
• Scenarios: mostly offline, real-time
How to reconcile
two worlds?
Our solution:
• high-performance (up to 2 orders of magnitude faster)
• one query language
• familiar abstractions to both worlds
1. Window
2. Per window: pipeline DSP ops
3. Unwindow
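The three steps above can be sketched in plain Python. This is an illustrative model only: the function names are hypothetical, and summing the overlapping per-window outputs is an assumed choice of unwindowing rule.

```python
def windows(signal, size, hop):
    """Yield (start, samples) for each full window, advancing by `hop` samples."""
    for start in range(0, len(signal) - size + 1, hop):
        yield start, signal[start:start + size]

def process(signal, size, hop, op):
    """Window the signal, apply `op` to each window, then unwindow by summing overlaps."""
    out = [0.0] * len(signal)
    for start, w in windows(signal, size, hop):
        y = op(w)                       # per-window pipeline (array in, array out)
        for k, v in enumerate(y):
            out[start + k] += v         # unwindow: overlapping outputs are added
    return out

def halve(w):
    return [0.5 * v for v in w]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = process(x, size=4, hop=2, op=halve)
print(y)
```

With a hop of half the window size, every interior sample is covered by two windows, so halving each window and summing reproduces the interior of the input; the first and last two samples are covered only once.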
[Figure: per device, input signal x[n] is split into windows x0, x1, x2, processed into y0, y1, y2, and recombined (+) into y[n]]
• Stream engine for relational queries
• R for highly-optimized DSP operations
• Problem: impedance mismatch
[Figure: windows x0, x1, x2 passed from the stream processing system to R, with results y0, y1, y2 returned and recombined (+)]
• Unified query model
• Non-uniform & uniform signals
• Type-safe mix of stream & signal operators
• Array-based extensibility framework
• DSP operator writer sees arrays
• Supports incremental computation
• “Walled garden” on top of Trill
• No changes in data model
• Inherits Trill’s efficient processing capability (e.g., grouped computation)
TRILL DSP
[Figure: input events e1 to e5 over time and the resulting aggregated events (counts) over time; Streamable → SignalStreamable]
var signal = stream.Where(e => e.Value < 100).Count();
• Transition to signal domain
• E.g., result of an aggregate query
• Using stream operators to build signal operators
• E.g., adding two signals as a temporal join of two streams
left.Join(right, (l, r) => l + r)
Type-safe operations
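The idea of adding two signals through a temporal join can be illustrated with a small Python model. It assumes each signal is a list of (start, end, value) events with half-open lifetimes; the function name and this representation are hypothetical, and the nested loop is chosen for clarity rather than speed.

```python
def temporal_join(left, right, combine):
    """Join events whose lifetimes [start, end) overlap and combine their values."""
    out = []
    for l_start, l_end, l_val in left:
        for r_start, r_end, r_val in right:
            start, end = max(l_start, r_start), min(l_end, r_end)
            if start < end:                          # lifetimes intersect
                out.append((start, end, combine(l_val, r_val)))
    return sorted(out)

left = [(0, 10, 1.0), (10, 20, 2.0)]
right = [(5, 15, 10.0)]
total = temporal_join(left, right, lambda l, r: l + r)
print(total)
```

The sum is defined only where both inputs are defined, which is the behavior a join gives for free.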
• Sampling with interpolation
[Figure: input events over time (some misaligned, some missing) and output events at times 30, 60, 90, 120, 150, 180, 210, with interpolated values]
var uniformSignal = signal.Sample(30, 0, ip => ip.Linear(60));
Interpolation window
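A rough Python sketch of sampling with linear interpolation follows. The parameter meanings (sampling period, offset, and an interpolation window that limits how far apart two events may be before the gap is left empty) are assumptions read off the slide, not confirmed TrillDSP semantics, and `sample_linear` is a hypothetical name. Event times are assumed to be strictly increasing integers.

```python
def sample_linear(events, period, offset, window):
    """events: sorted list of (time, value). Returns samples at offset + k * period.

    A sample between two events is interpolated only if they are at most
    `window` time units apart; otherwise that sample is omitted.
    """
    out = []
    if not events:
        return out
    first, last = events[0][0], events[-1][0]
    k = -(-(first - offset) // period)        # ceiling division: first grid point >= first
    t = offset + k * period
    i = 0
    while t <= last:
        while i + 1 < len(events) and events[i + 1][0] <= t:
            i += 1                            # events[i] is the latest event at or before t
        t0, v0 = events[i]
        if t0 == t:
            out.append((t, v0))               # already aligned with the grid
        else:
            t1, v1 = events[i + 1]
            if t1 - t0 <= window:
                out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += period
    return out

events = [(30, 10.0), (70, 50.0), (90, 70.0), (210, 100.0)]
uniform = sample_linear(events, period=30, offset=0, window=60)
print(uniform)
```

In this run the misaligned event at time 70 yields an interpolated sample at 60, while the long gap between 90 and 210 exceeds the window, so no samples are produced for 120, 150, and 180.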
• Expose arrays only inside the windowing operator
var query = uniformSignal
    .Window(512, 256,
        w => w.FFT().Select(a => f(a)).IFFT(),
        a => a.Sum());
[Figure: uniform signal → WIN → FFT → f → IFFT → UNWIN (AGG) → uniform signal]
• DSP pipeline & arrays instantiated only once ➞ better data management
• DSP experts write array-array operators
• Incremental DSP operators
• Leverage Trill’s grouping power!
[Figure: old and new windows separated by one hop, feeding FFT → f → IFFT]
[Figure: normalized time relative to TrillDSP on 16 cores vs. hop size (256, 230, 179, 128, 76, 25); series: TrillDSP (1 core), MATLAB, SparkR (16 cores), SciDB-R (16 cores)]
Per sensor: Windowed FFT ➞ Function ➞ Inverse FFT ➞ Unwindow
• Pre-loaded datasets in memory
• 100 groups in stream
Up to 2 orders of magnitude faster than others
Performance benefits from:
• Efficient group processing, group-aware DSP windowing
• Using circular arrays to manage overlapping windows
• TrillDSP uses FFTW library
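As an illustration of the circular-array point above, the following Python sketch keeps one fixed-size buffer and emits a window every `hop` samples, so overlapping windows reuse the same storage. The class and method names are hypothetical; this is not TrillDSP's implementation.

```python
class CircularWindow:
    """Fixed-size ring buffer that yields overlapping windows every `hop` samples."""

    def __init__(self, size, hop):
        self.size, self.hop = size, hop
        self.buf = [0.0] * size
        self.count = 0                                  # total samples seen so far

    def push(self, sample):
        """Store a sample; return the current window (oldest first) when one is due."""
        self.buf[self.count % self.size] = sample       # overwrite the oldest slot
        self.count += 1
        if self.count >= self.size and (self.count - self.size) % self.hop == 0:
            head = self.count % self.size               # index of the oldest sample
            return self.buf[head:] + self.buf[:head]
        return None

ring = CircularWindow(size=4, hop=2)
emitted = []
for sample in [1, 2, 3, 4, 5, 6, 7, 8]:
    window = ring.push(sample)
    if window is not None:
        emitted.append(window)
print(emitted)
```

Only the samples that changed since the previous window are written; the overlap stays in place.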
[Figure: running time (secs) for queries Q1 to Q22; series: Spark + Parquet, Spark + JSON (Jackson)]
JSON: more than 80% of the time is spent on parsing!
Speculation Level
Structural Index Level
Fields:
“id”
Logical positions:
“id” is the 3rd attribute
Physical positions:
“id” is at the 20th byte
Speculation
Fields:
“id”
Physical positions:
“id” is at the 20th byte
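The mapping from a field name to its logical and physical position can be illustrated with a small scalar Python scan. This is a heavily simplified stand-in, not Mison's bitmap-based structural index: the function name is hypothetical, offsets are character offsets (equal to byte offsets for ASCII input), and the input is assumed to be one well-formed JSON object.

```python
def top_level_field_positions(record):
    """Return {field_name: offset of its value} for the top-level fields of one object."""
    positions = {}
    depth = 0
    in_string = False
    string_start = 0
    last_string = None                     # most recently closed string at depth 1
    i = 0
    while i < len(record):
        ch = record[i]
        if in_string:
            if ch == "\\":
                i += 1                     # skip the escaped character
            elif ch == '"':
                in_string = False
                if depth == 1:
                    last_string = record[string_start + 1:i]
        elif ch == '"':
            in_string = True
            string_start = i
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        elif ch == ":" and depth == 1:     # a top-level colon follows a field name
            j = i + 1
            while record[j] == " ":
                j += 1
            positions[last_string] = j     # physical position of the value
        i += 1
    return positions

record = '{"a": 1, "b": {"id": 2}, "id": 42}'
positions = top_level_field_positions(record)
logical = list(positions).index("id") + 1  # 1-based attribute position
print(positions, logical)
```

Here the top-level "id" is the 3rd attribute (logical position) and its value starts at offset 31 (physical position); the nested "id" inside "b" is ignored because it is not at the top level.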
[Figure: parsing speed (GB/s) for Gson, Jackson, and Mison]
[Figure: running time (secs) for queries Q1 to Q22; series: Spark + Parquet, Spark + JSON (Jackson), Spark + JSON (Mison)]
Spark+Mison is ~10X faster than Spark+Jackson
Spark+Mison performs comparably to Spark+Parquet in most cases
rich space
temporal logic
• Transfer
ShardedStreamable
shards
• querying
• data movement
• keying
Operation     Description
Query         Applies unmodified query on each (keyed) shard
Broadcast     Duplicates each shard’s contents on all shards
Multicast     Copies tuples from each input shard to zero or more specific result shards
ReShard       Load-balances across shards
ReDistribute  Moves tuples so that the same key resides in the same result shard
ReKey         Changes the key associated with each row in each shard
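A toy Python model can make some of these operations concrete. It treats a sharded dataset as a list of shards, each a list of (key, value) rows; the function names mirror the table but the signatures are invented for illustration and are not Quill's API. Integer keys are assumed so that hashing is deterministic; Multicast and ReShard are omitted.

```python
from collections import Counter

def query(shards, q):
    """Query: apply the same unmodified query to every shard independently."""
    return [q(shard) for shard in shards]

def broadcast(shards):
    """Broadcast: every result shard receives the contents of all input shards."""
    everything = [row for shard in shards for row in shard]
    return [list(everything) for _ in shards]

def rekey(shards, key_fn):
    """ReKey: change the key attached to each row without moving any data."""
    return [[(key_fn(value), value) for _, value in shard] for shard in shards]

def redistribute(shards):
    """ReDistribute: move rows so that equal keys land in the same result shard."""
    out = [[] for _ in shards]
    for shard in shards:
        for key, value in shard:
            out[hash(key) % len(out)].append((key, value))
    return out

shards = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
moved = redistribute(shards)
counts = query(moved, lambda shard: Counter(key for key, _ in shard))
print(moved, counts)
```

After the re-distribution, a per-key count can run on each shard with no further data movement, because no key is split across shards.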
[Figure: example sharded query plans, with labels:]
e => e.Count()
Flat re-distribute
e => e.Count()
e => e.Sum()
(l,r) => l.Join(r, …)
(l,r) => l.Join(r, …)
Flat re-distribute
Flat broadcast
No data movement
str => str.SlidingWindow(Y).Count().Where(c => c > threshold)
(l, r) => l.WhereNotExists(y)
str => str.HoppingWindow(Z).Count()
[Figures: Scan (Quill vs. SparkSQL); Time taken & scheduling overhead; Grouped agg with 40M groups; Hopping window (Github data)]
http://github.com/Microsoft/CRA
https://www.microsoft.com/en-us/research/people/badrishc/
From Trill to Quill and Beyond

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

From Trill to Quill and Beyond

  • 7. Systems
      • Real-time: raise alerts
      • Real-time with historical: correlate
      • Offline: develop initial monitoring query; back-test
      • Progressive: non-temporal analysis
      Components: Engine + Fabric, Interactive Query Authoring, Real-Time Dashboard
  • 8. • Performance • Fabric & language integration • Query model
      Scenarios:
      • monitor telemetry & raise alerts
      • correlate real-time with logs
      • develop initial monitoring query
      • back-test over historical logs
      • offline analysis (BI) with early results
  • 9. • Performance • Fabric & language integration • Query model
  • 10. [Figure: Relational Model (input T-1, T-2, T-3; Q = COUNT(*); output 3) vs. Tempo-Relational Model (Q applied to snapshots along logical time with a 5min window; outputs 1, 2, 3, 2, 1)]
      Supports broad & rich analytics scenarios (relational, progressive, time-based)
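The snapshot idea on this slide can be illustrated with a small sketch. This is plain Python, not Trill; the function name `snapshot_counts`, the integer time units, and the assumption that each event stays alive for exactly one window length are all illustrative choices, not something the slide states.

```python
from collections import Counter


def snapshot_counts(event_times, window):
    """Return {t: COUNT(*)} for every integer time t at which an event is alive.

    Assumption: an event arriving at time s is alive during [s, s + window).
    """
    counts = Counter()
    for s in event_times:
        for t in range(s, s + window):
            counts[t] += 1
    return dict(sorted(counts.items()))


# Relational view: one answer over the whole input.
events = [0, 1, 2]
relational_count = len(events)

# Tempo-relational view: the same query evaluated at each snapshot.
temporal_counts = snapshot_counts(events, window=3)
print(relational_count, list(temporal_counts.values()))
```

With three events one time unit apart and a window of three units, the per-snapshot counts rise and fall in the same 1, 2, 3, 2, 1 pattern as the numbers on the slide.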
  • 11. • Key enabler: performance + fabric & language integration + query model
  • 12. struct ClickEvent { long ClickTime; long User; long AdId; }

      var str = Network.ToStream(e => e.ClickTime, Latency(10secs));
      var query =
          str.Where(e => e.User % 100 < 5)
             .Select(e => { e.AdId })
             .GroupApply( e => e.AdId,
                          s => s.Window(5min).Aggregate(w => w.Count()));
      query.Subscribe(e => Console.Write(e)); // write results to console
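To make the shape of this query concrete, here is a batch-style Python sketch of the same logic: sample 5% of users, group by ad, and count clicks per 5-minute window. It is not Trill code. The names `ClickEvent`, `count_clicks_per_ad`, and `WINDOW_SECS` are hypothetical, and treating `Window(5min)` as a tumbling (non-overlapping) window is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class ClickEvent(NamedTuple):
    click_time: int  # seconds since some epoch (assumed unit)
    user: int
    ad_id: int


WINDOW_SECS = 5 * 60  # assumed tumbling window length


def count_clicks_per_ad(events, window=WINDOW_SECS):
    """Count sampled clicks per (window start, ad id)."""
    counts = defaultdict(int)
    for e in events:
        if e.user % 100 < 5:  # Where: keep roughly 5% of users
            window_start = (e.click_time // window) * window
            counts[(window_start, e.ad_id)] += 1  # GroupApply + Count
    return dict(counts)


clicks = [
    ClickEvent(10, 3, 7),
    ClickEvent(20, 103, 7),
    ClickEvent(30, 50, 7),   # user 50 is not in the sample
    ClickEvent(400, 4, 7),
    ClickEvent(410, 1, 9),
]
for key, count in sorted(count_clicks_per_ad(clicks).items()):
    print(key, count)
```

A streaming engine would emit these counts incrementally as time advances; the sketch only shows which events end up in which group and window.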
  • 13. Stream of batches between operators (op1 → … → op2) • More load → larger batches → better throughput
  • 14. class DataBatch {
          long[] SyncTime;
          ...
          Bitvector BV;
      }
      class UserData_Gen : DataBatch {
          long[] c_ClickTime;
          long[] c_User;
          long[] c_AdId;
      }
      Batch layout between operators (op1 → … → op2): timestamp, payload columns, bitvector
  • 15. str.Where(e => e.User % 100<5);

      Application: Send(events) ... Receive(results)

      Trill:
      On(Batch b) {
          for i = 0 to b.Size {
              if !(b.c_User[i]%100 < 5)
                  set b.bitvector[i]
          }
          next-operator.On(b)
      }
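The columnar batch and the bitvector-based filter from the last two slides can be sketched as follows. This is an illustrative Python model, not Trill's generated code: the class `ClickBatch` and the helpers `where_user_sampled` and `visible_rows` are hypothetical names, and a plain list of 0/1 flags stands in for the bitvector. As in the slide pseudocode, a set bit marks a row as filtered out, so no data is copied or moved.

```python
from dataclasses import dataclass, field


@dataclass
class ClickBatch:
    """One batch: a column per field plus one flag per row."""
    sync_time: list
    c_user: list
    c_ad_id: list
    filtered: list = field(default_factory=list)  # 1 = row is filtered out

    def __post_init__(self):
        if not self.filtered:
            self.filtered = [0] * len(self.sync_time)


def where_user_sampled(batch):
    """Filter in place: mark rows that fail `user % 100 < 5`."""
    for i in range(len(batch.sync_time)):
        if not (batch.c_user[i] % 100 < 5):
            batch.filtered[i] = 1
    return batch  # the same batch object is handed to the next operator


def visible_rows(batch):
    """Materialize only the rows whose flag is still clear."""
    return [
        (batch.sync_time[i], batch.c_user[i], batch.c_ad_id[i])
        for i in range(len(batch.sync_time))
        if not batch.filtered[i]
    ]


batch = where_user_sampled(ClickBatch([1, 2, 3], [3, 50, 104], [7, 7, 9]))
print(batch.filtered, visible_rows(batch))
```

The filter touches a single column and a flag array, which is the property that makes tight per-batch loops cheap.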
  • 19. • Lots of “signals” in stream data • IoT workflows combine relational & signal logic
      Sample input:
      ID  Time     Value
      0   0:42:19  67
      1   0:42:22  80
      2   0:42:22  85
      0   0:42:23  69
      2   0:42:24  85
      Workflow steps: Group-by ID, Union; remove noise, interpolate missing data, find periodicity, discard invalid data, correlate live data w/ history (a mix of relational operators σ, ⋈ and DSP operators)
      Which tools to use to build such apps?
  • 20. How to reconcile two worlds?
      Data processing expert:
        • Engines: stream engines, DBMS, MPP systems
        • Data model: (tempo)-relational
        • Language: declarative (SQL, LINQ, functional)
        • Scenarios: real-time, offline, progressive
      Digital signal processing expert:
        • Engines: MATLAB, R
        • Data model: array
        • Language: imperative (array languages, C)
        • Scenarios: mostly offline, real-time
      Our solution:
        • high-performance (2 orders of magnitude (OOM) faster)
        • one query language
        • familiar abstractions to both worlds
  • 21. Per device:
      1. Window
      2. Per window: pipeline DSP ops
      3. Unwindow
      [Figure: input signal x[n] split into windows x0, x1, x2; per-window outputs y0, y1, y2 combined (+) into output signal y[n]]
  • 22. • Stream engine for relational queries • R for highly-optimized DSP operations • Problem: impedance mismatch
      [Figure: windows x0, x1, x2 passed from the stream processing system to R; results y0, y1, y2 passed back and combined (+)]
  • 23. TrillDSP
      • Unified query model
          • Non-uniform & uniform signals
          • Type-safe mix of stream & signal operators
      • Array-based extensibility framework
          • DSP operator writer sees arrays
          • Supports incremental computation
      • “Walled garden” on top of Trill
          • No changes in data model
          • Inherits Trill’s efficient processing capability (e.g., grouped computation)
  • 25. Streams → Signals (Streamable → SignalStreamable)
      [Figure: input events e1 … e5 over time; aggregated events (counts) over time]
      var signal = stream.Where(e => e.Value < 100).Count()
      • Transition to signal domain
          • E.g., result of an aggregate query
      • Using stream operators to build signal operators
          • E.g., adding two signals as a temporal join of two streams:
            left.Join(right, (l, r) => l + r)
      • Type-safe operations
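The idea of adding two signals via a temporal join can be sketched outside Trill as follows. This is an assumed model, not the library's implementation: each signal is a list of `(start, end, value)` intervals with an exclusive end, and `add_signals` is a hypothetical helper that emits the sum wherever the validity intervals of the two inputs overlap.

```python
def add_signals(left, right):
    """Temporal join of two interval signals, adding the joined values.

    Each signal is a list of (start, end, value) with `end` exclusive.
    An output interval exists only where both inputs are defined.
    """
    out = []
    for l_start, l_end, l_val in left:
        for r_start, r_end, r_val in right:
            start, end = max(l_start, r_start), min(l_end, r_end)
            if start < end:  # the two intervals overlap
                out.append((start, end, l_val + r_val))
    return sorted(out)


left = [(0, 10, 1), (10, 20, 2)]
right = [(5, 15, 10)]
print(add_signals(left, right))
```

The nested loop is quadratic and only meant to show the semantics; a streaming join would process both inputs in time order.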
  • 26. Streams → Signals → Uniform signals
      • Sampling with interpolation
      [Figure: input events over time (ticks 30 … 210), some misaligned and some missing; output events on the 30-unit grid, with interpolated values filling the gaps inside the interpolation window]
      var uniformSignal = signal.Sample(30, 0, ip => ip.Linear(60));
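A sketch of sampling with linear interpolation, in plain Python rather than Trill. The reading of `Sample(30, 0, ip => ip.Linear(60))` used here is an assumption: sample on a grid with period 30 and offset 0, and interpolate between the two neighbouring events only when they are at most 60 time units apart. The function name `sample_linear` is hypothetical.

```python
from bisect import bisect_left


def sample_linear(events, period, offset, interp_window):
    """Resample sorted (time, value) events onto a uniform grid.

    Grid points that coincide with an event take its value. Other grid
    points are linearly interpolated if the surrounding events are no more
    than `interp_window` apart; otherwise the grid point is left out.
    """
    if not events:
        return []
    times = [t for t, _ in events]
    # First grid point at or after the first event (integer ceiling division).
    t = offset + -(-(times[0] - offset) // period) * period
    out = []
    while t <= times[-1]:
        i = bisect_left(times, t)
        if times[i] == t:
            out.append((t, float(events[i][1])))
        else:
            (t0, v0), (t1, v1) = events[i - 1], events[i]
            if t1 - t0 <= interp_window:
                out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += period
    return out


# 62 is misaligned (should be 60); the sample at 120 is missing.
readings = [(30, 1.0), (62, 3.0), (90, 5.0), (150, 8.0)]
print(sample_linear(readings, period=30, offset=0, interp_window=60))
```

Shrinking the interpolation window below the gap between neighbours makes the corresponding grid points disappear from the output instead of being filled in.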
  • 27. • Expose arrays only inside the windowing operator
      var query = uniformSignal
          .Window(512, 256,
                  w => w.FFT().Select(a => f(a)).IFFT(),
                  a => a.Sum())
      Pipeline: uniform signal → WIN → FFT → f → IFFT → AGG → UNWIN → uniform signal
      • DSP pipeline & arrays instantiated only once ➞ better data management
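The window, transform, unwindow pattern can be sketched in plain Python. This is not TrillDSP: `dft` is a naive O(n²) stand-in for an FFT, `windowed_pipeline` is a hypothetical helper, and the assumption that overlapping window outputs are combined by summation follows from reading `a => a.Sum()` as the unwindow aggregate.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for a real FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]
    return [v / n for v in out] if inverse else out


def windowed_pipeline(samples, size, hop, f):
    """Window -> DFT -> f -> inverse DFT -> sum overlapping outputs."""
    out = [0.0] * len(samples)
    for start in range(0, len(samples) - size + 1, hop):
        spectrum = f(dft(samples[start:start + size]))
        restored = dft(spectrum, inverse=True)
        for i, v in enumerate(restored):
            out[start + i] += v.real  # overlapping windows are summed
    return out


def keep_dc_only(spectrum):
    """Example f: zero every frequency bin except the mean (bin 0)."""
    return [spectrum[0]] + [0] * (len(spectrum) - 1)


signal = [1, 3, 1, 3, 1, 3, 1, 3]
print([round(v, 6) for v in windowed_pipeline(signal, size=4, hop=2, f=keep_dc_only)])
```

With a hop of half the window size, interior samples are covered by two windows, so their outputs are roughly twice those at the edges; a real pipeline would typically apply a window function or normalization to compensate.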
  • 28. • DSP experts write array-array operators • Incremental DSP operators • Leverage Trill’s grouping power!
      [Figure: Window → FFT → f → IFFT pipeline; each hop splits the window into OLD and NEW data]
  • 29. Per sensor: Windowed FFT ➞ Function ➞ Inverse FFT ➞ Unwindow
      [Chart: time normalized to TrillDSP on 16 cores vs. hop size (4, 8, 16, 32, 64, 128, 256); series: TrillDSP (1 core), MATLAB, SparkR (16 cores), SciDB-R (16 cores)]
      • Pre-loaded datasets in memory
      • 100 groups in stream
      • Up to 2 OOM faster than others
      Performance benefits from:
      • Efficient group processing, group-aware DSP windowing
      • Using circular arrays to manage overlapping windows
      • TrillDSP uses FFTW library
  • 33. [Chart: running time (secs) for queries Q1 to Q22; series: Spark + Parquet, Spark + JSON (Jackson); one off-axis value labeled 152]
      JSON: >80% of the time is spent on parsing!
  • 35. Two levels:
      • Speculation level
          • Fields: “id”
          • Logical positions: “id” is the 3rd attribute
      • Structural index level
          • Physical positions: “id” is at the 20th byte
      Combined via speculation:
          • Fields: “id”
          • Physical positions: “id” is at the 20th byte
  • 37. [Chart: running time (secs) for queries Q1 to Q22; series: Spark + Parquet, Spark + JSON (Jackson), Spark + JSON (Mison)]
      • Spark+Mison is ~10X faster than Spark+Jackson
      • Spark+Mison has comparable performance to Spark+Parquet in most cases
  • 40. rich space temporal logic • Transfer ShardedStreamable
  • 41. shards • querying • data movement • keying
      Operation      Description
      Query          Applies unmodified query on each (keyed) shard
      Broadcast      Duplicate each shard’s contents on all shards
      Multicast      Copy tuples from each input shard to zero or more specific result shards
      ReShard        Load balance across shards
      ReDistribute   Move tuples so that same key resides in same result shard
      ReKey          Changes key associated with each row in each shard
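A few of the operations in this table can be modelled in plain Python to show what each one does to the data. This is not Quill's API: a sharded dataset is assumed to be a list of shards, each a list of `(key, value)` rows, and the lowercase function names are hypothetical stand-ins for Query, Broadcast, ReKey, and ReDistribute. Integer keys are assumed so that the placement is deterministic.

```python
def query(shards, fn):
    """Query: apply the same unmodified function to every shard."""
    return [fn(shard) for shard in shards]


def broadcast(shards):
    """Broadcast: every result shard holds the contents of all shards."""
    everything = [row for shard in shards for row in shard]
    return [list(everything) for _ in shards]


def rekey(shards, key_fn):
    """ReKey: change the key of each row without moving any data."""
    return [[(key_fn(value), value) for _, value in shard] for shard in shards]


def redistribute(shards, n=None):
    """ReDistribute: move rows so equal keys land in the same shard."""
    n = n or len(shards)
    out = [[] for _ in range(n)]
    for shard in shards:
        for key, value in shard:
            out[key % n].append((key, value))  # assumes integer keys
    return out


shards = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
print(redistribute(shards))
print(query(shards, len))
```

Keeping keying (`rekey`) separate from data movement (`redistribute`) mirrors the split between "keying" and "data movement" named on the slide.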
  • 43. [Diagram labels: e => e.Count() • Flat re-distribute • e => e.Count() • e => e.Sum()]
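One plausible reading of this diagram is a two-phase grouped count: count locally on each shard, redistribute the partial counts by key, then sum them. The sketch below illustrates that reading in plain Python. The interpretation itself is an assumption, as are the function names and the use of integer keys with `key % n` placement.

```python
from collections import Counter


def local_counts(shard):
    """Phase 1 (e => e.Count()): count rows per key within one shard."""
    return Counter(key for key, _ in shard)


def redistribute_partials(partials, n):
    """Move (key, partial count) pairs so equal keys share a shard."""
    out = [[] for _ in range(n)]
    for partial in partials:
        for key, count in partial.items():
            out[key % n].append((key, count))  # assumes integer keys
    return out


def sum_partials(shard):
    """Phase 2 (e => e.Sum()): add up the partial counts per key."""
    total = Counter()
    for key, count in shard:
        total[key] += count
    return dict(total)


def grouped_count(shards):
    partials = [local_counts(shard) for shard in shards]
    return [sum_partials(s) for s in redistribute_partials(partials, len(shards))]


shards = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
print(grouped_count(shards))
```

Counting before redistributing means only one small record per key and shard crosses the network, instead of every input row.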
  • 44. [Diagram labels: (l,r) => l.Join(r, …) • (l,r) => l.Join(r, …) • Flat re-distribute • Flat broadcast • No data movement]
  • 45. str => str.SlidingWindow(Y).Count().Where(c => c > threshold)
      (l, r) => l.WhereNotExists(y)
      str => str.HoppingWindow(Z).Count()
  • 47. Scan (Quill vs. SparkSQL) Time taken & scheduling overhead
  • 48. Grouped agg with 40M groups Hopping window (Github data)