This talk, given by Badrish Chandramouli at Portland State University on May 30, 2017, gives an overview of his recent and ongoing research directions in stream processing and big data analytics.
15. str.Where(e => e.User % 100 < 5);
Application: Send(events) ... Receive(results)
Trill:
On(Batch b) {
    for i = 0 to b.Size {
        if !(b.c_User[i] % 100 < 5)
            set b.bitvector[i]
    }
    next-operator.On(b)
}
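The batch operator above can be mimicked in plain Python. This is a minimal sketch, not Trill code: the `Batch` class, its `c_user` column, and the function names are hypothetical. As in the slide's pseudocode, a set bit marks a row as filtered out, the column data is never copied, and the same batch is handed to the next operator.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Batch:
    """Hypothetical columnar batch: one list per column plus a filter bitvector."""
    c_user: List[int]
    bitvector: List[bool] = field(default_factory=list)  # True = row filtered out

    def __post_init__(self) -> None:
        if not self.bitvector:
            self.bitvector = [False] * len(self.c_user)

    @property
    def size(self) -> int:
        return len(self.c_user)


def where_user_sample(batch: Batch, next_operator: Callable[[Batch], None]) -> None:
    """Mark rows failing `User % 100 < 5`, then pass the same batch downstream."""
    for i in range(batch.size):
        if not (batch.c_user[i] % 100 < 5):
            batch.bitvector[i] = True
    next_operator(batch)


def visible_users(batch: Batch) -> List[int]:
    """Rows that survive the filter (bit not set)."""
    return [u for u, dropped in zip(batch.c_user, batch.bitvector) if not dropped]
```

Marking rows instead of removing them keeps the column arrays intact, so a downstream operator only has to consult the bitvector.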
19. • Lots of “signals” in stream data
• IoT workflows combine relational & signal logic
M: Group-by ID
U: Union
ID Time Value
0 0:42:19 67
1 0:42:22 80
2 0:42:22 85
0 0:42:23 69
2 0:42:24 85
Remove noise
Interpolate missing data
Find periodicity
Discard invalid data
Correlate live data w/ history
[Diagram: workflows that interleave relational operators (σ, ⋈) with DSP operators]
Which tools to use to build such apps?
20. Data processing expert
Engines: stream engines, DBMS, MPP systems
Data model: (tempo)-relational
Language: declarative (SQL, LINQ, functional)
Scenarios: real-time, offline, progressive
Digital signal processing expert
Engines: MATLAB, R
Data model: array
Language: imperative (array languages, C)
Scenarios: mostly offline, real-time
How to reconcile the two worlds?
Our solution:
• high performance (2 orders of magnitude (OOM) faster)
• one query language
• familiar abstractions to both worlds
22. • Stream engine for relational queries
• R for highly optimized DSP operations
• Problem: impedance mismatch
[Figure: values x0, x1, x2 and y0, y1, y2 passed between the STREAM PROCESSING SYSTEM and R]
23. • Unified query model
• Non-uniform & uniform signals
• Type-safe mix of stream & signal operators
• Array-based extensibility framework
• DSP operator writer sees arrays
• Supports incremental computation
• “Walled garden” on top of Trill
• No changes in data model
• Inherits Trill’s efficient processing capability (e.g., grouped computation)
TRILL DSP
24.
25. [Figure: input events e1 to e5 on a time axis and the resulting aggregated events (counts) on a second time axis; type labels: STREAMABLE, SIGNAL, STREAMABLE]
var signal = stream.Where(e => e.Value < 100).Count();
STREAMS
SIGNALS
• Transition to signal domain
• E.g., result of an aggregate query
• Using stream operators to build signal operators
• E.g., adding two signals as a temporal join of two streams
left.Join(right, (l, r) => l + r)
Type-safe operations
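To make the "adding two signals as a temporal join" idea concrete, here is a minimal plain-Python sketch. It is not the Trill API: events are modelled as hypothetical `(start, end, value)` tuples, and the join pairs events whose lifetimes overlap and emits the overlap with the combined value, which is the usual meaning of a temporal join.

```python
from typing import Callable, List, Tuple

Event = Tuple[int, int, float]  # (start, end, value): value is valid on [start, end)


def temporal_join(left: List[Event], right: List[Event],
                  combine: Callable[[float, float], float]) -> List[Event]:
    """Pair events whose lifetimes overlap; emit the overlap with the combined value."""
    out = []
    for l_start, l_end, l_val in left:
        for r_start, r_end, r_val in right:
            start, end = max(l_start, r_start), min(l_end, r_end)
            if start < end:  # non-empty intersection of the two lifetimes
                out.append((start, end, combine(l_val, r_val)))
    return sorted(out)


def add_signals(left: List[Event], right: List[Event]) -> List[Event]:
    """Signal addition expressed as a join, mirroring left.Join(right, (l, r) => l + r)."""
    return temporal_join(left, right, lambda a, b: a + b)
```

The nested loop is quadratic and only meant to show the semantics; a real engine would exploit time order.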
26. • Sampling with interpolation
[Figure: input events on a time axis with ticks 30 to 210, some misaligned and some missing; output events on the same ticks, with the missing ones interpolated]
var uniformSignal = signal.Sample(30, 0, ip => ip.Linear(60));
Interpolation window
STREAMS
SIGNALS
UNIFORM
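Sampling with linear interpolation can be sketched in plain Python as follows. The function `sample_linear` and its parameter names are hypothetical, and the handling of the interpolation window is an assumption rather than TrillDSP's documented behaviour: a grid point receives a value only if it coincides with an event or lies between two events that are at most `window` apart.

```python
from bisect import bisect_left
from typing import List, Tuple


def sample_linear(events: List[Tuple[float, float]], period: float, offset: float,
                  window: float) -> List[Tuple[float, float]]:
    """Resample time-sorted (time, value) events onto the grid offset + k * period."""
    if not events:
        return []
    times = [t for t, _ in events]
    out = []
    # First grid point at or after the first event (ceiling division).
    k = -(-(times[0] - offset) // period)
    t = offset + k * period
    while t <= times[-1]:
        i = bisect_left(times, t)  # first event with time >= t
        t1, v1 = events[i]
        if t1 == t:
            out.append((t, v1))  # aligned event: keep as is
        else:
            t0, v0 = events[i - 1]  # nearest event before t
            if t1 - t0 <= window:  # assumed meaning of the interpolation window
                out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += period
    return out
```

With `period=30`, `offset=0`, and `window=60`, a misaligned event at time 62 yields an interpolated sample at 60, while a 90-unit gap between events is left empty.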
27. • Expose arrays only inside the windowing operator
var query = uniformSignal
    .Window(512, 256,
        w => w.FFT().Select(a => f(a)).IFFT(),
        a => a.Sum());
[Diagram: Uniform signal ➞ WIN ➞ FFT ➞ f ➞ IFFT ➞ UNWIN (AGG) ➞ Uniform signal]
• DSP pipeline & arrays instantiated only once ➞ better data management
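The window ➞ FFT ➞ f ➞ IFFT ➞ unwindow pipeline can be illustrated with a small plain-Python sketch. It is not TrillDSP code: `windowed_pipeline` is a hypothetical name, a naive O(n²) DFT stands in for a real FFT library, and the unwindow step is assumed to sum the outputs of overlapping windows, following the `a => a.Sum()` argument in the query.

```python
import cmath
from typing import Callable, List


def dft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, standing in for a real FFT library."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def windowed_pipeline(signal: List[float], size: int, hop: int,
                      f: Callable[[List[complex]], List[complex]]) -> List[float]:
    """Window -> DFT -> f -> inverse DFT -> unwindow, summing overlapping outputs."""
    result = [0.0] * len(signal)
    for start in range(0, len(signal) - size + 1, hop):
        window = [complex(v) for v in signal[start:start + size]]
        processed = dft(f(dft(window)), inverse=True)
        for i, v in enumerate(processed):
            result[start + i] += v.real  # the Sum aggregate of the unwindow step
    return result
```

With an identity `f`, a window of 4, and a hop of 2, samples covered by two windows come out doubled, which shows the overlap being aggregated.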
28. • DSP experts write array-array operators
• Incremental DSP operators
• Leverage Trill’s grouping power!
[Diagram: a window divided by the hop into OLD and NEW parts, feeding FFT ➞ f ➞ IFFT]
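As an illustration of incremental computation over hopping windows, the sketch below keeps a running sum that is updated only from the samples that leave (OLD) and enter (NEW) the window, using a circular array for storage. The `HoppingSum` class is hypothetical and is not TrillDSP's operator interface; a sum is used because it is the simplest operator that can be updated this way.

```python
from typing import Optional


class HoppingSum:
    """Sum over a hopping window, updated from only the OLD and NEW samples."""

    def __init__(self, size: int, hop: int) -> None:
        self.size, self.hop = size, hop
        self.buffer = [0.0] * size  # circular array holding the current window
        self.head = 0               # index of the oldest sample
        self.count = 0              # samples seen so far
        self.total = 0.0

    def push(self, value: float) -> Optional[float]:
        """Add one sample; return the window sum whenever a full window is due."""
        self.total += value - self.buffer[self.head]  # add NEW, retire OLD
        self.buffer[self.head] = value
        self.head = (self.head + 1) % self.size
        self.count += 1
        if self.count >= self.size and (self.count - self.size) % self.hop == 0:
            return self.total
        return None
```

Each sample costs constant work regardless of the window size, and the overlapping part of consecutive windows is never copied.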
29. Per sensor: Windowed FFT ➞ Function ➞ Inverse FFT ➞ Unwindow
[Chart: NORMALIZED TIME TO TRILLDSP ON 16 CORES vs. HOP SIZE; tick labels 4, 8, 16, 32, 64, 128, 256 and 230, 179, 128, 76, 25; series: TrillDSP (1 core), MATLAB, SparkR (16 cores), SciDB-R (16 cores)]
• Pre-loaded datasets in memory
• 100 groups in stream
Up to 2 OOM faster than others
Performance benefits from:
• Efficient group processing, group-aware DSP windowing
• Using circular arrays to manage overlapping windows
• TrillDSP uses FFTW library
35. Speculation Level and Structural Index Level:
Fields: “id”
Logical positions: “id” is the 3rd attribute
Physical positions: “id” is at the 20th byte
Speculation:
Fields: “id”
Physical positions: “id” is at the 20th byte
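One way to read the two levels is sketched below in plain Python, under stated assumptions: the speculation level guesses the logical position of a field (for example, "id" is the 3rd attribute) from earlier records, and a structural index maps logical positions to physical byte offsets. All names are hypothetical, records are assumed to be flat JSON objects, and the index builder is a simplified stand-in for a real structural index.

```python
from typing import Dict, List, Optional


def build_index(record: str) -> List[int]:
    """Simplified structural index: offsets of the keys of a flat JSON object."""
    offsets, in_string, start, i = [], False, 0, 0
    while i < len(record):
        ch = record[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_string = False
                j = i + 1
                while j < len(record) and record[j].isspace():
                    j += 1
                if j < len(record) and record[j] == ":":
                    offsets.append(start)  # a string followed by ':' is a key
        elif ch == '"':
            in_string, start = True, i
        i += 1
    return offsets


class SpeculativeLocator:
    """Guess a field's logical position from earlier records, then verify it."""

    def __init__(self) -> None:
        self.logical: Dict[str, int] = {}  # field name -> attribute index
        self.hits = 0
        self.misses = 0

    def locate(self, record: str, name: str) -> Optional[int]:
        index, token = build_index(record), f'"{name}"'
        k = self.logical.get(name)
        if k is not None and k < len(index) and record.startswith(token, index[k]):
            self.hits += 1  # speculation verified with a single comparison
            return index[k]
        self.misses += 1    # speculation failed: check every key
        for k, offset in enumerate(index):
            if record.startswith(token, offset):
                self.logical[name] = k
                return offset
        return None
```

For the record `{"a": 1, "bb": "x", "id": 7}`, "id" is the 3rd attribute and starts at offset 20; a later record with longer values still hits the same logical position even though the byte offset differs.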
41. shards
• querying
• data movement
• keying
Operation     Description
Query         Applies unmodified query on each (keyed) shard
Broadcast     Duplicate each shard’s contents on all shards
Multicast     Copy tuples from each input shard to zero or more specific result shards
ReShard       Load balance across shards
ReDistribute  Move tuples so that same key resides in same result shard
ReKey         Changes key associated with each row in each shard
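A few of the operations in the table can be sketched over in-memory lists. This is an illustrative model only, not the system's API: a shard is a hypothetical list of `(key, payload)` rows, and `redistribute` picks the target shard by hashing the key, which is one possible placement policy.

```python
from typing import Callable, Hashable, List, Tuple

Row = Tuple[Hashable, object]  # (key, payload)
Shards = List[List[Row]]


def query(shards: Shards, q: Callable[[List[Row]], object]) -> List[object]:
    """Query: apply the same unmodified query to every shard."""
    return [q(shard) for shard in shards]


def broadcast(shards: Shards) -> Shards:
    """Broadcast: every result shard holds the contents of all input shards."""
    everything = [row for shard in shards for row in shard]
    return [list(everything) for _ in shards]


def rekey(shards: Shards, key_of: Callable[[object], Hashable]) -> Shards:
    """ReKey: change the key of each row in place; no data moves between shards."""
    return [[(key_of(payload), payload) for _, payload in shard] for shard in shards]


def redistribute(shards: Shards, n: int) -> Shards:
    """ReDistribute: move rows so that equal keys land in the same result shard."""
    out: Shards = [[] for _ in range(n)]
    for shard in shards:
        for key, payload in shard:
            out[hash(key) % n].append((key, payload))
    return out


def reshard(shards: Shards) -> Shards:
    """ReShard: spread rows round-robin to balance shard sizes."""
    out: Shards = [[] for _ in shards]
    rows = [row for shard in shards for row in shard]
    for i, row in enumerate(rows):
        out[i % len(out)].append(row)
    return out
```

A grouped aggregate then becomes `rekey`, followed by `redistribute`, followed by `query`, with each step touching only keys, only placement, or only shard-local data.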