This talk covers the reasons behind the new stream processing model, how it compares to the old batch model, their respective pros and cons, and a list of existing technologies implementing stream processing with their most prominent characteristics. It details one use case of data streaming that is not possible with batches: displaying all trains in Switzerland and their positions on a map in (near) real time, starting with an overview of the requirements and the design. Finally, it shows a working demo implementation built with an Open Data endpoint and the Hazelcast platform.
2. @nicolas_frankel
Me, myself and I
Former developer, team lead, architect,
blah-blah
Developer Advocate
Lots of crazy creative ideas
3. Hazelcast
HAZELCAST IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage and performs parallel execution for breakthrough application speed and scale.
HAZELCAST JET is the ultra-fast, application-embeddable, 3rd-generation stream processing engine for low-latency batch and stream processing.
13. Oops
When the execution time overlaps the next execution schedule
When the space taken by the data exceeds the storage capacity
When the batch fails mid-execution
etc.
17. Event Sourcing
“Event sourcing persists the state of a business entity such as an Order or a Customer as a sequence of state-changing events. Whenever the state of a business entity changes, a new event is appended to the list of events. Since saving an event is a single operation, it is inherently atomic. The application reconstructs an entity’s current state by replaying the events.”
-- https://microservices.io/patterns/data/event-sourcing.html
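The quoted definition can be sketched in a few lines of Java. The entity and the event names below are illustrative, not from the talk:

```java
import java.util.List;

// Illustrative event-sourcing sketch: the entity's state is never stored
// directly; it is rebuilt by replaying the append-only event list.
public class EventSourcingSketch {

    // One state-changing event (type and fields are hypothetical).
    public record MoneyEvent(String type, long amount) {}

    // Reconstruct the current balance by replaying all events in order.
    public static long replay(List<MoneyEvent> events) {
        long balance = 0;
        for (MoneyEvent e : events) {
            balance += "DEPOSIT".equals(e.type()) ? e.amount() : -e.amount();
        }
        return balance;
    }

    public static void main(String[] args) {
        List<MoneyEvent> log = List.of(
                new MoneyEvent("DEPOSIT", 100),
                new MoneyEvent("WITHDRAW", 30),
                new MoneyEvent("DEPOSIT", 5));
        System.out.println(replay(log)); // prints 75
    }
}
```

Appending one event to the log is the single, inherently atomic write; no separate "current state" row ever needs updating.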
22. Streaming is smart ETL
Ingest: in-memory operational storage
Combine: join, enrich, group, aggregate
Stream: windowing, event-time processing
Compute: distributed and parallel computation
Transform: filter, clean, convert
Publish: in-memory, subscriber notifications
Example: notify if response time is 10% over the 24-hour average, second by second
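The closing example on this slide (alert when response time is 10% over the average) can be sketched as a tiny sliding-window check. The class name and the bounded-window approach are assumptions for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the rule "notify if response time is 10% over
// the average": keeps a bounded window of past measurements and compares
// each new one against the window's mean.
public class ResponseTimeAlert {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int capacity;
    private double sum;

    public ResponseTimeAlert(int capacity) {
        this.capacity = capacity;
    }

    // Feed one measurement; returns true when it exceeds the average by 10%.
    public boolean offer(double responseTimeMs) {
        boolean alert = !window.isEmpty()
                && responseTimeMs > (sum / window.size()) * 1.10;
        window.addLast(responseTimeMs);
        sum += responseTimeMs;
        if (window.size() > capacity) {
            sum -= window.removeFirst();
        }
        return alert;
    }
}
```

In a real deployment the 24-hour average would be computed with event-time windowing inside the stream processing engine, not in application code like this.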
23. Use Case: Analytics and Decision Making
Real-time dashboards
• Decision making
• Recommendations
Stats (gaming, infrastructure monitoring)
Prediction, often algorithmic
• Push the stream through an ML model
Complex Event Processing
29. Concept: Pipeline
Declaration (code) that defines and links sources, transforms, and sinks
Platform-specific SDK (Pipeline API in Jet)
Client submits the pipeline to the Stream Processing Engine (SPE)
30. Concept: Job
Running instance of a pipeline in the SPE
The SPE executes the pipeline:
• Code execution
• Data routing
• Flow control
Parallel and distributed execution
31. Imperative model
final String text = "...";
final Map<String, Long> counts = new HashMap<>();
for (String word : text.split("\\W+")) {
    Long count = counts.get(word);
    counts.put(word, count == null ? 1L : count + 1);
}
32. Declarative model
Map<String, Long> counts = lines.stream()
    .map(String::toLowerCase)
    .flatMap(line -> Arrays.stream(line.split("\\W+")))
    .filter(word -> !word.isEmpty())
    .collect(Collectors.groupingBy(
        word -> word, Collectors.counting())
    );
33. What Distributed Means to Hazelcast
Multiple nodes
Scalable storage and performance
Elasticity
Data stored, partitioned and replicated
No single point of failure
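How data ends up partitioned can be illustrated with a simplified hash scheme. 271 is Hazelcast's default partition count; the real implementation hashes the serialized key and maintains a partition table with backup replicas, so the modulo arithmetic below is only a sketch:

```java
public class PartitionSketch {
    // Hazelcast's default partition count.
    static final int PARTITION_COUNT = 271;

    // Simplified: map a key to one of the fixed partitions.
    // (Hazelcast actually hashes the serialized form of the key.)
    static int partitionId(Object key) {
        return Math.floorMod(key.hashCode(), PARTITION_COUNT);
    }

    // Simplified ownership: a real cluster keeps a partition table that is
    // rebalanced on membership changes, plus backups on other members.
    static int ownerMember(int partitionId, int memberCount) {
        return partitionId % memberCount;
    }
}
```

The point of the fixed partition count is elasticity: when members join or leave, whole partitions migrate, but a key always maps to the same partition.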
34. Distributed Parallel Processing
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<Long, String>map(BOOK_LINES))
 .flatMap(line -> traverseArray(line.getValue().split("\\W+")))
 .filter(word -> !word.isEmpty())
 .groupingKey(wholeItem())
 .aggregate(counting())
 .drainTo(Sinks.map(COUNTS));
[Diagram: Data Source → from → map → filter → aggr → to → Data Sink]
Translate declarative code to a Directed Acyclic Graph
36. Open Data
“Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.”
-- https://en.wikipedia.org/wiki/Open_data
37. Some Open Data initiatives
France:
• https://www.data.gouv.fr/fr/
Switzerland:
• https://opendata.swiss/en/
European Union:
• https://data.europa.eu/euodp/en/data/
43. General Transit Feed Specification
“The General Transit Feed Specification (GTFS) […] defines a common format for public transportation schedules and associated geographic information. GTFS feeds let public transit agencies publish their transit data and developers write applications that consume that data in an interoperable way.”
44. GTFS static model
Filename | Required | Defines
agency.txt | Required | Transit agencies with service represented in this dataset.
stops.txt | Required | Stops where vehicles pick up or drop off riders. Also defines stations and station entrances.
routes.txt | Required | Transit routes. A route is a group of trips that are displayed to riders as a single service.
trips.txt | Required | Trips for each route. A trip is a sequence of two or more stops that occur during a specific time period.
stop_times.txt | Required | Times that a vehicle arrives at and departs from stops for each trip.
calendar.txt | Conditionally required | Service dates specified using a weekly schedule with start and end dates. This file is required unless all dates of service are defined in calendar_dates.txt.
calendar_dates.txt | Conditionally required | Exceptions for the services defined in calendar.txt. If calendar.txt is omitted, then calendar_dates.txt is required and must contain all dates of service.
fare_attributes.txt | Optional | Fare information for a transit agency's routes.
45. GTFS static model (cont.)
Filename | Required | Defines
fare_rules.txt | Optional | Rules to apply fares for itineraries.
shapes.txt | Optional | Rules for mapping vehicle travel paths, sometimes referred to as route alignments.
frequencies.txt | Optional | Headway (time between trips) for headway-based service or a compressed representation of fixed-schedule service.
transfers.txt | Optional | Rules for making connections at transfer points between routes.
pathways.txt | Optional | Pathways linking together locations within stations.
levels.txt | Optional | Levels within stations.
feed_info.txt | Optional | Dataset metadata, including publisher, version, and expiration information.
translations.txt | Optional | Translated information of a transit agency.
attributions.txt | Optional | Specifies the attributions that are applied to the dataset.
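To make the static model concrete, here is a minimal sketch of parsing one stops.txt row. It assumes a simplified, fixed column order of stop_id,stop_name,stop_lat,stop_lon and unquoted fields; real GTFS files declare their columns in a header row and may quote fields, so a proper CSV parser is advisable in practice. The sample values are illustrative:

```java
// Hypothetical minimal parser for a simplified stops.txt row.
public class GtfsStopParser {

    public record Stop(String id, String name, double lat, double lon) {}

    // Assumes the fixed column order stop_id,stop_name,stop_lat,stop_lon
    // and no quoted or escaped fields.
    public static Stop parse(String csvLine) {
        String[] f = csvLine.split(",");
        return new Stop(f[0], f[1],
                Double.parseDouble(f[2]),
                Double.parseDouble(f[3]));
    }

    public static void main(String[] args) {
        Stop s = parse("8507000,Bern,46.9489,7.4391");
        System.out.println(s.name() + " @ " + s.lat() + "," + s.lon());
    }
}
```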
47. Use-case: Swiss Public Transport Open Data
GTFS static available as downloadable .txt files
GTFS dynamic available as a REST endpoint
49. The dynamic data pipeline
1. Source: web service
2. Split into trip updates
3. Enrich with trip data
4. Enrich with stop times data
5. Transform hours into timestamp
6. Enrich with location data
7. Sink: Hazelcast IMDG
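Step 5 deserves a closer look: GTFS stop times are HH:MM:SS strings whose hour field may exceed 23 for trips running past midnight (e.g. 25:10:00 means 01:10 on the next day). A sketch of the conversion, simplified to use midnight of the service day as the reference (the GTFS spec actually defines it as "noon minus 12 hours", which differs on DST-change days):

```java
import java.time.LocalDate;
import java.time.ZoneId;

// Sketch of step 5: turn a GTFS time string into an epoch timestamp.
public class GtfsTime {

    // GTFS hours may exceed 23 (service past midnight), so LocalTime.parse
    // cannot be used; split the string and count seconds manually.
    public static long toEpochSeconds(LocalDate serviceDay, String gtfsTime, ZoneId zone) {
        String[] p = gtfsTime.split(":");
        long secondsIntoDay = Integer.parseInt(p[0]) * 3600L
                + Integer.parseInt(p[1]) * 60L
                + Integer.parseInt(p[2]);
        // Simplification: midnight of the service day as the reference point.
        return serviceDay.atStartOfDay(zone).toEpochSecond() + secondsIntoDay;
    }
}
```

With normalized epoch timestamps in hand, the enriched trip updates can be keyed, windowed, and sunk into the IMDG map for the live train display.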
52. Recap
Streaming has a lot of benefits
Leverage Open Data
It’s the Wild West out there
• No standards
• Real-world data sucks!
But you can get cool stuff done
Real-time (latency-sensitive) operations combined with analytics:
• Count usages per credit card in the last 10 seconds; fraud if > 10
• Real-time querying
• Prediction based on analytics (fraud detection run overnight has low value)
• Complex event processing: pattern detection (if A and B -> C)
The SPE runs this at scale. Valuable: IoT support, machine analytics/predictions, which fits into AI.
Without streaming? Stream processing makes use of a multi-processor, multi-node runtime and minimizes costs: data shuffling, context switching.
Jet does distributed parallel processing:
1. Build an execution plan
2. Execute it in parallel
How can computation be parallelized?
• Task parallelism: make use of multiprocessor machines; continuous tasks run in parallel and exchange data (MapReduce has just two steps)
• Data parallelism: distribute data partitions among the available resources; the DAG is deployed to all cluster members = more cores
What can be parallelized?
• Source: if partitioned, it can be read in parallel
• Map/filter: can run in parallel
• Extend the edges
Shuffling / moving data is expensive; Jet keeps data local, and even reads it locally when co-located with the data source.
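The "count usages per credit card" rule mentioned above can be sketched without an SPE, to show exactly what the engine automates at scale; the class name, window size, and threshold are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the fraud rule: more than 10 uses of the same credit card
// within the last 10 seconds. A real SPE would do this with a keyed
// sliding window, distributed across the cluster.
public class FraudDetector {
    private static final long WINDOW_MS = 10_000;
    private static final int THRESHOLD = 10;
    private final Map<String, Deque<Long>> usages = new HashMap<>();

    // Record one card usage at the given timestamp; true means "flag as fraud".
    public boolean record(String cardId, long timestampMs) {
        Deque<Long> times = usages.computeIfAbsent(cardId, k -> new ArrayDeque<>());
        times.addLast(timestampMs);
        // Evict usages that fell out of the 10-second window.
        while (!times.isEmpty() && times.peekFirst() < timestampMs - WINDOW_MS) {
            times.removeFirst();
        }
        return times.size() > THRESHOLD;
    }
}
```

Keyed state like this map is exactly what a distributed SPE partitions across members, which is why the per-key grouping shown in the pipeline examples matters.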