Spark Streaming the Industrial IoT
Washington DC Area Spark Interactive
Jim Haughwout, Chief Architect & VP of Software
May 24, 2016
© 2016 Savi Technology • May 25, 2016 • Page 2
Today’s Talk
• Discuss challenges of streaming in general with tips on doing this with Spark
• Special focus: IoT’s complexity of immediately tying together the physical and data worlds
• Our talks are in three parts:
- Part I: Top-level POV of using Spark Streaming for Industrial IoT (Jim)
- Part II: Spark Streaming and Expert Systems – Spark + Drools (James)
- Part III: Overcoming Deficiencies in Streams (Anderson of MetiStream)
About Us
Savi Technology
• Sensor analytics solutions for Industrial IoT
• Focus areas are risk and performance
• Customers are Fortune-1000 and government
• Real-time visibility using complex event
processing and machine learning algorithms
• Strategic insights using batch analytics
• Hardware Engineers, Data Engineers,
Software Engineers, and Data Scientists
• HQ in Alexandria; offices across the world
HARDWARE, APPLICATIONS, SERVICES, ANALYTICS
Some examples of what we do…
Our version of Google Now: Parking -> Stationary
Progressive streaming analysis of IoT data: Rules + ML
Times in UTC
Alerting with predictive analytics: Commercial ETA
• 22 hours out, we predicted the driver would be late (giving advance notice)
• That prediction was within 5 minutes of the actual arrival (on a 68-hour trip)
Times in America/New_York
Batch discovery and prescriptive analytics: reducing theft
Third largest transport firm had
2x the median suspect issues
Use of Spark @ Savi
We have fully embraced Apache Spark
Spark is the core of our tech stack:
• Using Spark for batch processing since Spark 1.0, for streaming since Spark 1.2.1
- We use discretized streams (DStreams); our fastest batch interval is 1 second
• 24x7 production operation, with full monitoring and high levels of test coverage
• Supporting Fortune-500 customers, managing billions of dollars of stuff in near real-time
• Fully-automated CI & CD with SOC II certification
• We launch new Spark software several times every week—
Push-button with no visible downtime to customers
• Gives us enormous scale and cost advantages vs. traditional enterprise technologies
• Uptime in last 12 months has been 100%—knock on wood
13 months ago we had a brief outage due to a DNS outage in AWS US-West-2
Spark is at the core of our “hybrid” Lambda architecture
[Architecture diagram]
• External systems and data: in-house analytic tools, sensor readers, mobile apps, enterprise data, open data, partner data, sensor meshes
• Integration layer protocols: AMQP, CoAP, FTP, HTTP, MQTT, SOAP, TCP, UDP, XMPP
• Sensor and network links: RS-232, USB, pRFID, Bluetooth, ZigBee, 802.11, 6LoWPAN, aRFID, GSM, GPRS, 3G, 4G/LTE, SATCOM
• Layers: integration layer, speed layer, batch layer, serving layer
• Components: Savi IoT Adapter, Sensor Agnostic CEP, Domain Specific CEP, Batch Processing, Modeling and Machine Learning, Immutable Data Store, Data Serving Layer, Notifications, Savi Apps, Customer Export, REST APIs
The Details: Tech stack distributions and versions
Data
• Jetty 9.3.9
• Kafka (0.8.2) via CDH 5.3.3
• Spark (1.4.1 -> 1.6.1)
• Scikit-learn 0.15.2
• Cassandra 2.1.8 via DSE 4.7
• GlusterFS 3.7
• PostgreSQL 9.3.3 with PostGIS
• Hadoop 2.5.0 via CDH 5.3.3
• Hive 0.13.1 via CDH 5.3.3
• Hue 3.7.0 via CDH 5.3.3
• Parquet-format 2.2.0 via Spark
• Parquet-mr 1.6.0 via Spark
• Gobblin 0.7.0
• Drools 6.3.0
• ZooKeeper 3.4.5 via CDH 5.3.3
Applications
• Nginx
• Bootstrap
• D3.js, AmCharts, Flot
• WildFly
• Flask
• Shibboleth
• PostgreSQL
• DSE Cassandra
• DSE Solr
• Also mobile on iOS, Android
Tools
• GitHub (GitHub Flow)
• Ansible
• Docker
• Jenkins
• Maven
• Bower
• Slack
• Fluentd
• Graylog
• Sentry
• Jupyter (PySpark, Folium, Pandas, Matplotlib, Scikit-learn, etc.)
We program in Scala 2.10, Java 8, Python 2.7, HTML5, LESS.css, and JavaScript
We are hosted in AWS but are not using any AWS-specific solutions (e.g., EMR)
Why we chose Spark
• We started on Apache Storm and MapReduce (we use a Lambda architecture)
• Moved to 100% Spark over the last 18 months (finished last Summer)
• Spark is NOT the best at everything
• However, it is advancing quickly
• We are an analytics company: Spark provides a single unified framework
- Speed layer and Batch Layer
- Use by Engineering and Data Science
- Product apps and ad-hoc analytics
• Ultimately this gives us better agility and cost (development + operations)
For more on our journey see: http://bit.do/savi-spark
Spark Streaming @ Savi
Tips and lessons streaming data 24x7
Spark Streaming is a different animal
20 seconds
Spark Streaming is a different animal
• Time is much more precious (and important) in the streaming world
- Seconds vs. minutes or hours
- Downtime or interruption is immediately visible to end users; in IoT this can lead to missing key events
- Need to avoid breakdowns in the stream due to surges or failures, both of which are more common in IoT
• Streaming resource utilization is different from batch
- CPU is rarely the limiting factor
- Memory is less of a limitation than is typical for Spark
- I/O is a much more common limiting factor
Some tips and lessons learned managing these differences…
Tips to defensively architect Spark Streaming
• Tip 1: Leverage Kafka
- Faster than HDFS, more durable than in-memory
- Supports parallel, independent consumption from multiple processing streams
- Supports FIFO within partitions
• Tip 2: DAG of DAGs (DAG of Streaming Apps and Kafka topics)
- Break down the process graph (even when near real-time) into critical and non-critical paths
- Route non-critical processing to separate streams, with their own persisted queues
- Do the same for interactions with lower-durability sources and targets
• Tip 3: (Caveat to Tip 2) Avoid over-complicating your DAG
- Every time you re-queue, you create opportunities to get data out of order
- Instead rely on at-least-once processing and add no-more-than-once protection to non-idempotent processing
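One way to picture the "no-more-than-once" protection from Tip 3 is a guard keyed by transaction ID. This is a minimal sketch, not Savi's implementation: the `OnceOnlyGuard` class, the in-memory set, and the sample alerts are all assumptions made for illustration.

```python
from typing import Callable, List, Set


class OnceOnlyGuard:
    """Runs a non-idempotent action at most once per transaction ID."""

    def __init__(self) -> None:
        # Hypothetical in-memory store; a real deployment would need a
        # durable store so the guard survives restarts.
        self._seen: Set[str] = set()

    def run(self, txn_id: str, action: Callable[[], None]) -> bool:
        """Run `action` unless `txn_id` was already handled. Returns True if it ran."""
        if txn_id in self._seen:
            return False
        action()
        # Mark only after success, so a failed action can be retried.
        self._seen.add(txn_id)
        return True


sent: List[str] = []
guard = OnceOnlyGuard()

# At-least-once delivery: message "t-1" is redelivered after a re-queue.
for txn_id, body in [("t-1", "late alert"), ("t-2", "theft alert"), ("t-1", "late alert")]:
    guard.run(txn_id, lambda body=body: sent.append(body))

print(sent)  # ['late alert', 'theft alert']
```

Idempotent steps (for example, overwriting a record by key) do not need the guard; it only matters for side effects such as sending a notification.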
Tips to defensively architect Spark Streaming (cont.)
• Tip 4: Offload bad data to non-blocking paths
- Bad data will happen
- Design your apps to offload this to non-blocking paths (vs. failing); this keeps the stream alive
• Tip 5: (Caveat to Tip 4) Wind down if infrastructure fails
- Running a streaming process on broken infrastructure will create lots of problems
- Instead wind down (and alert) and allow Kafka to help you recover
- Wind-down and restart will often “clear up” network or memory bottlenecks
• Tip 6: Preserve data lineage (and immutability)
- Preserve full data lineage of each stage of processing; it will save you when dealing with real-world issues
- Keep everything, even failures; this allows you to replay data for analysis and recovery (you will need it)
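Tip 4 and Tip 6 together can be sketched as a batch function that never raises on bad input and keeps the rejected records for replay. This is an illustrative sketch only: the JSON encoding, the `device_id` field, and the `process_batch` name are assumptions, not part of the talk.

```python
import json
from typing import Any, Dict, List, Tuple


def process_batch(raw_messages: List[str]) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Split a batch into parsed records and rejected records."""
    good: List[Dict[str, Any]] = []
    rejected: List[Dict[str, str]] = []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict) or "device_id" not in record:
                raise ValueError("missing device_id")
            good.append(record)
        except ValueError as err:
            # json.JSONDecodeError is a subclass of ValueError.
            # Keep the original text and the reason so the data can be replayed later.
            rejected.append({"raw": raw, "error": str(err)})
    return good, rejected


good, rejected = process_batch([
    '{"device_id": "tag-7", "temp_c": 4.5}',
    '{"temp_c": 9.1}',
    'not json at all',
])
```

In a streaming app the `rejected` list would be written to its own persisted queue (for example, a separate Kafka topic) rather than held in memory.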
Tips: performance tuning Spark Streaming
• Tip 1: Over-subscribe your cores
- Minimum core count needed is N_sourceTopics + 2 (the number of source topics plus two)
- For efficiency, over-subscribe your cores
- High multiples are fine
• Tip 2: Use broadcast variables to persist shared ephemeral rules
• Tip 3: Limit Kafka topics per app
- Counter-intuitive for defensive programming
- Avoids starvation due to imbalanced loads
• Tip 4: Avoid the shuffle
- Shuffle is tough on I/O; with streaming it is worse
- Instead rely on Kafka partitioning
- However, Kafka offset partitioning is still a work in progress
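The idea behind Tip 4 can be sketched as follows, assuming records are keyed by device ID when they are produced. The `partition_for` function uses `zlib.crc32` as a stand-in hash for illustration; it is not the hash Kafka's own partitioner uses. The `min_cores` helper simply restates the Tip 1 rule of thumb.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (illustrative stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def min_cores(num_source_topics: int) -> int:
    """Minimum core count quoted on the slide: source topics + 2."""
    return num_source_topics + 2


readings: List[Tuple[str, float]] = [
    ("tag-1", 4.0), ("tag-2", 7.5), ("tag-1", 4.2), ("tag-3", 1.1), ("tag-2", 7.7),
]

partitions: Dict[int, List[Tuple[str, float]]] = defaultdict(list)
for device_id, value in readings:
    partitions[partition_for(device_id, 4)].append((device_id, value))

# Each device's readings sit in exactly one partition, in arrival order,
# so per-device logic can run partition-locally without a shuffle.
```

Because all readings for a device land in the same partition, per-device rules can be applied within a partition instead of regrouping data across the cluster.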
Streaming Real-world Industrial IoT Data:
“It’s very different from the canonical Twitter stream analysis teaching example”
All the normal “dirty data” issues plus
Streaming means you have to handle much of this in near-real time
IoT + Spark Streaming = Physical + Data (in near real-time)
Challenge 1: Ingesting IoT data
• The “IoT Menagerie”
- Millions of source IPs: whitelisting is impossible
- Many transport protocols and standards: HTTP, FTP, CoAP, MQTT, TCP, UDP, X12, GPRS
• Several tools are available to ingest IoT transactions into your platform
- Some even go directly to Kafka for processing by Spark, Storm and Flink
• However, not everything is a simple transaction; most is not
• The “obvious” solution, increasing Kafka’s maximum message size, does not work:
- Bottlenecking and serialization issues
- Ultimately you will not be able to increase it enough
• Lessons Learned: Use hybrid ingestion
- Append critical metadata immediately at the point of ingestion
- Includes transaction ID and digital signature
- Split metadata from payload for complex and large data types
- Keeps memory low and is fully scalable
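[Figure: streaming data types by share of transactions (xacs) and share of data]
Types: micro-batches, simple transactions, loggers, sensor constellations, MIME media transactions
Shares (in the order extracted): 30% of xacs / 10% of data; 20% of xacs / 35% of data; 5% of xacs / 10% of data; 45% of xacs / 15% of data
MIME media transactions: <1% of xacs, 30% of data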
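The hybrid ingestion lesson can be sketched like this: the large payload goes to a separate store, and only a small metadata envelope travels through the stream. Everything named here is an assumption for illustration: the `ingest` function, the `payload_store` dictionary, the envelope fields, and the use of a SHA-256 digest in place of a real digital signature.

```python
import hashlib
import uuid
from typing import Dict

# Hypothetical stand-in for an external payload store (object store, file system, ...).
payload_store: Dict[str, bytes] = {}


def ingest(payload: bytes, source_ip: str, protocol: str) -> Dict[str, object]:
    """Store the payload separately and return a small metadata envelope."""
    txn_id = str(uuid.uuid4())
    # A SHA-256 content digest stands in for the "digital signature" here;
    # a real signature would also involve a key.
    digest = hashlib.sha256(payload).hexdigest()
    payload_store[txn_id] = payload
    return {
        "txn_id": txn_id,
        "sha256": digest,
        "source_ip": source_ip,
        "protocol": protocol,
        "payload_bytes": len(payload),
    }


# A 5 MB media payload produces an envelope of a few hundred bytes.
envelope = ingest(b"\x00" * 5_000_000, source_ip="203.0.113.9", protocol="MQTT")
```

The envelope is what would be published to Kafka; downstream stages fetch the payload by `txn_id` only when they actually need it.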
Challenge 2: Handling stream interruptions and surges
• Massive increase in stream interruptions
(vs. normal server flows)
- Loss of power
- Movement in and out of coverage
- Bad OTA updates (can cause false DDoS events)
• Often undetected by anyone but Spark
• Overcoming these
- Monitor and alarm on anomalous values
- Tune your fetch rates to avoid overwhelming I/O
- Our Hope: New Spark back-pressure (still in beta)
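To make "tune your fetch rates" concrete, here is a small sketch of the arithmetic behind a per-partition rate cap for a direct Kafka stream. The two property names are real Spark settings; the values, the partition count, and the `max_records_per_batch` helper are illustrative assumptions, not the settings used at Savi.

```python
from typing import Dict

# Real Spark property names; the values are illustrative assumptions.
spark_conf: Dict[str, str] = {
    "spark.streaming.backpressure.enabled": "true",
    "spark.streaming.kafka.maxRatePerPartition": "500",  # records/sec per Kafka partition
}


def max_records_per_batch(conf: Dict[str, str], kafka_partitions: int, batch_seconds: float) -> int:
    """Upper bound on records a direct Kafka stream reads in one batch."""
    per_partition = int(conf["spark.streaming.kafka.maxRatePerPartition"])
    return int(per_partition * kafka_partitions * batch_seconds)


cap = max_records_per_batch(spark_conf, kafka_partitions=12, batch_seconds=1.0)
print(cap)  # 6000
```

After an outage, a backlog is then drained at no more than `cap` records per batch instead of arriving as one oversized batch.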
Challenge 3: Cleansing and transforming IoT data
• Transmission, authentication, and formatting errors are much more frequent in IoT
- Ever had a cellphone call dropped or received a duplicate text?
- Data is rarely self-describing
- Firmware configuration management issues
- Standards non-compliance
• Duplication is much more common (and complex) than in traditional Lambda
- Duplicate data can “hide” in unique wrappers
- Duplicate data can be obscured by transaction IDs
- Duplicates can arrive after any window duration you can viably sustain
• Lessons Learned:
- Accept everything, even authentication errors
- Capture the entire lineage of processing (metadata and payload)
- Route failures away from the DAG, but preserve them to replay and recover
- Map data to its base atomic unit, THEN digitally sign and de-duplicate the data
[Figure: a unique transaction set. Each transaction has a unique header, but its facts may be duplicates (from a prior set), unique (to this set), or incomplete (in this set)]
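The last lesson (map to the atomic unit, then sign and de-duplicate) can be sketched by fingerprinting each fact on its content alone, so a duplicate hiding under a new header is still caught. The record layout (`header`, `facts`, `tag`, `ts`, `temp_c`) and function names are assumptions for illustration, and a SHA-256 digest stands in for the digital signature.

```python
import hashlib
import json
from typing import Any, Dict, List, Set


def fact_fingerprint(fact: Dict[str, Any]) -> str:
    """Hash a fact by content only, using a canonical JSON encoding."""
    canonical = json.dumps(fact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def new_facts(transaction: Dict[str, Any], seen: Set[str]) -> List[Dict[str, Any]]:
    """Return the facts in `transaction` not seen before; the header is ignored."""
    fresh = []
    for fact in transaction["facts"]:
        fp = fact_fingerprint(fact)
        if fp not in seen:
            seen.add(fp)
            fresh.append(fact)
    return fresh


# In-memory for the sketch; duplicates can arrive very late, so a real
# system would need a durable fingerprint store.
seen: Set[str] = set()

first = {"header": {"txn_id": "a1"}, "facts": [{"tag": "t-7", "ts": 100, "temp_c": 4.5}]}
# A later transaction with a new header that repeats the old fact and adds one.
second = {"header": {"txn_id": "b2"}, "facts": [
    {"temp_c": 4.5, "ts": 100, "tag": "t-7"},
    {"tag": "t-7", "ts": 160, "temp_c": 4.7},
]}

out1 = new_facts(first, seen)
out2 = new_facts(second, seen)
```

Here `out2` contains only the reading at `ts` 160: the repeated fact is dropped even though its transaction ID and key order differ.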
Transformation and Cleansing example: “Simple” raw IoT data
$690300SR86506702256878020160321155058-
16060a34ST-663-
000p00105300090008000030b74a67db4333102660000470
00a67fb4333102660000330009cd9b433310266000025000
10g81-
00077.09254000038.8064970066.7000002016032115514
2000010006.926480003.01000020e210000000000000001
00e21000000000000000200e2200000000000000055246a8
Transformation and Cleansing: Canonical format for analytics
Turning machine data into cleaned,
self-describing, agnostic data that can
be readily used for analytics and
machine learning
Sensor Message
Universal Read Format
Challenge 4: The small files problem
• Streaming data is many, many small files
- 100s or 1000s per second
• Adding these to HDFS creates the small files problem
- Many files (NameNode swamping)
- Much smaller than the HDFS block size (inefficient)
• Delaying too long makes batch analysis stale
- Kafka does not support complex queries
• Lots of back and forth on this; current best practice:
- Organize streams by volume and type into Kafka topics
- Batch extract by topic based on volume AND time
- Ultimately convert to Parquet format for batch analytics
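"Batch extract by topic based on volume AND time" can be sketched as a per-topic buffer that flushes when it is either large enough or old enough. The `TopicBuffer` class, its thresholds, and the explicit `now` argument (used instead of a wall clock to keep the example deterministic) are assumptions for illustration.

```python
from typing import Callable, List


class TopicBuffer:
    """Collects small records and flushes them as one larger batch."""

    def __init__(self, max_bytes: int, max_age_s: float, sink: Callable[[List[bytes]], None]) -> None:
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.sink = sink
        self._records: List[bytes] = []
        self._bytes = 0
        self._oldest = 0.0  # arrival time of the oldest buffered record

    def add(self, record: bytes, now: float) -> None:
        if not self._records:
            self._oldest = now
        self._records.append(record)
        self._bytes += len(record)
        self.maybe_flush(now)

    def maybe_flush(self, now: float) -> None:
        """Flush when the buffer is big enough OR old enough."""
        if not self._records:
            return
        if self._bytes >= self.max_bytes or now - self._oldest >= self.max_age_s:
            self.sink(self._records)
            self._records, self._bytes = [], 0


files: List[List[bytes]] = []
buf = TopicBuffer(max_bytes=10, max_age_s=60.0, sink=files.append)

buf.add(b"aaaa", now=0.0)
buf.add(b"bbbbbbb", now=1.0)   # 11 bytes buffered: volume threshold reached
buf.add(b"cc", now=2.0)
buf.maybe_flush(now=70.0)      # 68 s since the oldest record: time threshold reached
```

The volume threshold keeps output files near a useful size on busy topics, while the time threshold stops quiet topics from going stale.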
And now, the hardest challenge: Streaming CEP…
Challenge 5: CEP processing messy physical realities
Spark Streaming needs to make decisions quickly enough to matter…
In the physical world, real-time gets stale very quickly
Sometimes, data just gets lost (or significantly delayed)
When streaming the IoT, the time lag of information is ever-present
Also, people have been known to “contradict” sensors
• Sometimes legitimate
• Sometimes mistaken
• Sometimes malicious
People will argue that the
sensors (or rules) are wrong
Finally, once I alert you to something, I cannot undo it
Human memory is not a batch layer: it’s hard to forget Type I errors
CEP in IoT: (Timeliness + Good Enough) > (Late + Perfect)
• Prioritize Type I bias vs. Type II based on context
• Windowing can be helpful, but not always
- Data can be delayed hours or days (windowing is not cost-effective)
• Use self-healing rule sets (and algorithms)
- Immutable journal data models for state management
- Keep track of multiple time dimensions: latest, most recent
- Keep track of multiple signal dimensions: detected, reported
• Use the batch layer to assist with self-healing
- Re-order on review
- Auto-resolve based on new data
• Add human signals (to build trust)
- Do not hide corrections; make them clear
- Show full time lineage
- Allow humans to re-order to understand the effects of outages and delays
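One reading of the "immutable journal" and "multiple time dimensions" points is an append-only log in which each entry carries both the time it was detected and the time it was reported, with current state derived on demand. This is an interpretive sketch: the `Entry` and `Journal` classes, the field names, and the sample data are assumptions, not the model used in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    asset: str
    status: str
    detected_at: int   # when the sensor observed it (event time)
    reported_at: int   # when the platform received it (arrival time)


class Journal:
    """Append-only log; state is derived, never overwritten."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, entry: Entry) -> None:
        self._entries.append(entry)

    def latest_reported(self, asset: str) -> Optional[Entry]:
        """What arrived last (what a naive stream would believe)."""
        rows = [e for e in self._entries if e.asset == asset]
        return max(rows, key=lambda e: e.reported_at, default=None)

    def most_recent_detected(self, asset: str) -> Optional[Entry]:
        """What happened last in the physical world."""
        rows = [e for e in self._entries if e.asset == asset]
        return max(rows, key=lambda e: e.detected_at, default=None)

    def timeline(self, asset: str) -> List[Entry]:
        """Re-order on review: entries sorted by event time."""
        return sorted((e for e in self._entries if e.asset == asset), key=lambda e: e.detected_at)


journal = Journal()
journal.append(Entry("trailer-9", "moving", detected_at=100, reported_at=105))
journal.append(Entry("trailer-9", "stationary", detected_at=300, reported_at=310))
# Delayed message: observed at t=200 but only delivered at t=900.
journal.append(Entry("trailer-9", "door-open", detected_at=200, reported_at=900))
```

The late "door-open" entry is the latest to be reported, yet the asset's most recent detected status is still "stationary"; because nothing was overwritten, the timeline can be re-ordered and shown to a person with its full lineage.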
James will dive into this deeper…
Thank you!
• There are some challenges to overcome when streaming IoT data
• Once you overcome these, and share insights with customers, the real fun begins. There is a lot you can do with Spark
• Questions, ideas, comments: jhaughwout@savi.com
• Starting to open source some tools at: https://github.com/sensoranalytics/
• Visit us at 3601 Eisenhower Avenue
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Spark Streaming the Industrial IoT

  • 1. Spark Streaming the Industrial IoT Washington DC Area Spark Interactive Jim Haughwout, Chief Architect & VP of Software May 24, 2016
  • 2. © 2016 Savi Technology • May 25, 2016 • Page 2 Today’s Talk • Discuss challenges of streaming in general with tips on doing this with Spark • Special focus: IoT’s complexities of immediately tying together physical and data world • Our talks are in three parts: - Part I: Top-level POV of using Spark Streaming for Industrial IoT (Jim) - Part II: Spark Streaming and Expert Systems – Spark + Drools (James) - Part III: Overcoming Deficiencies in Streams (Anderson of MetiStream)
  • 4. Savi Technology • Sensor analytics solutions for Industrial IoT • Focus areas are risk and performance • Customers are Fortune-1000 and government • Real-time visibility using complex event processing and machine learning algorithms • Strategic insights using batch analytics • Hardware Engineers, Data Engineers, Software Engineers, and Data Scientists • HQ in Alexandria; offices across the world. Some examples of what we do… (Hardware, Applications, Services, Analytics)
  • 5. Our version of Google Now: Parking -> Stationary
  • 6. Progressive streaming analysis of IoT data: Rules + ML (times in UTC)
  • 7. Alerting with predictive analytics: Commercial ETA • 22 hours out, we predicted the driver would be late (giving advance notice) • That prediction was off by < 5 minutes vs. actual (on a 68-hour trip) (times in America/New_York)
  • 8. Batch discovery and prescriptive analytics: reducing theft. Third largest transport firm had 2x the median suspect issues
  • 9. Use of Spark @ Savi
  • 10. We have fully embraced Apache Spark. Spark is the core of our tech stack: • Using Spark for batch processing since Spark 1.0, for streaming since Spark 1.2.1 - We use discretized streams (DStreams); our fastest batch interval is 1 second • 24x7 production operation, with full monitoring and high levels of test coverage • Supporting Fortune-500 customers, managing billions of dollars of stuff in near real-time • Fully-automated CI & CD with SOC II certification • We launch new Spark software several times every week: push-button, with no visible downtime to customers • Gives us enormous scale and cost advantages vs. traditional enterprise technologies • Uptime in the last 12 months has been 100% (knock on wood). 13 months ago we had a brief outage due to a DNS outage in AWS US-West-2
  • 11. Spark is at the core of our “hybrid” Lambda architecture [Architecture diagram. Labels: In-house Analytic Tools, Sensor Readers, Mobile Apps, Enterprise Data, Open Data, Partner Data, Sensor Meshes; Integration Layer (AMQP, CoAP, FTP, HTTP, MQTT, SOAP, TCP, UDP, XMPP); Serving Layer; Batch Layer; Speed Layer; Savi IoT Adapter; Batch Processing; Domain Specific CEP; Sensor Agnostic CEP; Modeling, Machine Learning; RS-232, USB, pRFID, Bluetooth, ZigBee, 802.11, 6LoWPAN, aRFID, GSM, GPRS, 3G, 4G/LTE, SATCOM; Data Serving Layer; Notifications; Savi Apps; Immutable Data Store; Customer Export; REST APIs]
  • 12. The Details: Tech stack distributions and versions (Data, Applications, Tools) • Jetty 9.3.9 • Kafka (0.8.2) via CDH 5.3.3 • Spark (1.4.1 -> 1.6.1) • Scikit-learn 0.15.2 • Cassandra 2.1.8 via DSE 4.7 • GlusterFS 3.7 • PostgreSQL 9.3.3 with PostGIS • Hadoop 2.5.0 via CDH 5.3.3 • Hive 0.13.1 via CDH 5.3.3 • Hue 3.7.0 via CDH 5.3.3 • Parquet-format 2.2.0 via Spark • Parquet-mr 1.6.0 via Spark • Gobblin 0.7.0 • Drools 6.3.0 • ZooKeeper 3.4.5 via CDH 5.3.3 • Nginx • Bootstrap • D3.js, AmCharts, Flot • WildFly • Flask • Shibboleth • PostgreSQL • DSE Cassandra • DSE Solr • Also mobile on iOS, Android • Github (Github Flow) • Ansible • Docker • Jenkins • Maven • Bower • Slack • Fluentd • Graylog • Sentry • Jupyter (PySpark, Folium, Pandas, Matplotlib, Scikit-learn, etc.). We program in Scala 2.10, Java 8, Python 2.7, HTML5, LESS.css, and JavaScript. We are hosted in AWS but are not using any AWS-specific solutions (e.g., EMR)
  • 13. Why we chose Spark • We started on Apache Storm and MapReduce (we use a Lambda architecture) • Moved to 100% Spark over the last 18 months (finished last summer) • Spark is NOT the best at everything • However, it is advancing quickly • We are an analytics company: Spark provides a single unified framework - Speed layer and Batch Layer - Use by Engineering and Data Science - Product apps and ad-hoc analytics • Ultimately this gives us better agility and cost (development + operations). For more on our journey see: http://bit.do/savi-spark
  • 14. Spark Streaming @ Savi: Tips and lessons from streaming data 24x7
  • 15. Spark Streaming is a different animal 20 seconds
  • 16. Spark Streaming is a different animal • Time is much more precious (and important) in the streaming world - Seconds vs. minutes or hours - Downtime or interruption is immediately visible to end users; in IoT this can lead to missing key events - Need to avoid breakdown in the stream due to surges or failures, both of which are more common in IoT • Streaming resource utilization is different than batch - CPU is rarely the limiting factor - Memory is less of a limitation than is typical for Spark - I/O is a much more common limiting factor. Some tips and lessons learned managing these differences…
  • 17. Tips to defensively architect Spark Streaming • Tip 1: Leverage Kafka - Faster than HDFS, more durable than in-memory - Supports parallel, independent consumption from multiple processing streams - Supports FIFO within partitions • Tip 2: DAG of DAGs (DAG of Streaming Apps and Kafka topics) - Break down the process graph, even near real-time, into critical and non-critical paths - Route non-critical processing to separate streams, with their own persisted queues - Do the same for interactions with lower-durability sources and targets • Tip 3: (Caveat to Tip 2) Avoid over-complicating your DAG - Every time you re-queue, you create opportunities to get data out of order - Instead rely on at-least-once processing and add no-more-than-once protection to non-idempotent processing
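Tip 3 pairs at-least-once delivery with a guard around side effects that must not repeat. The slides show no code for this, so the following is only a minimal sketch under stated assumptions: the class name `OnceOnlyGuard`, the message IDs, and the in-memory set are all hypothetical, and a production guard would need a durable, shared store.

```python
from typing import Callable, List, Set


class OnceOnlyGuard:
    """Runs a non-idempotent action at most once per message ID."""

    def __init__(self) -> None:
        # Assumption: an in-memory set. A real deployment would need a
        # durable store that survives restarts of the streaming app.
        self._seen: Set[str] = set()

    def run(self, message_id: str, action: Callable[[], None]) -> bool:
        """Return True if the action ran, False if the ID was a redelivery."""
        if message_id in self._seen:
            return False
        action()
        # Record the ID only after the action succeeds, so a failed
        # action can still be retried on redelivery.
        self._seen.add(message_id)
        return True


# Simulated at-least-once delivery: message "m1" arrives twice.
alerts_sent: List[str] = []
guard = OnceOnlyGuard()
deliveries = [("m1", "door opened"), ("m2", "temp high"), ("m1", "door opened")]
for msg_id, text in deliveries:
    guard.run(msg_id, lambda t=text: alerts_sent.append(t))
```

After the loop, `alerts_sent` holds one entry per distinct message ID, even though three deliveries were processed.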
  • 18. Tips to defensively architect Spark Streaming (cont.) • Tip 4: Offload bad data to non-blocking paths - Bad data will happen - Design your apps to offload this to non-blocking paths (vs. failing); this keeps the stream alive • Tip 5: (Caveat to Tip 4) Wind down if infrastructure fails - Running a streaming process with broken infrastructure will create lots of issues - Instead wind down (and alert) and allow Kafka to help you recover - Wind-down and restart will often “clear up” network or memory bottlenecks • Tip 6: Preserve data lineage (and immutability) - Preserve full data lineage of each stage of processing; it will save you when dealing with real-world issues - Keep everything, even failures; this allows you to replay data for analysis and recovery (you will need it)
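Tips 4 and 5 draw a line between bad data (offload it and keep going) and broken infrastructure (stop and alert). A rough sketch of that split in plain Python follows. The JSON message format, the `device_id` field, and the `InfrastructureError` class are illustrative assumptions, not details from the talk.

```python
import json
from typing import Iterable, List, Tuple


class InfrastructureError(Exception):
    """Hypothetical marker for platform failures (for example, a store being down)."""


def parse_reading(raw: str) -> dict:
    """Parse one message; raise ValueError if the data itself is bad."""
    record = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(record, dict) or "device_id" not in record:
        raise ValueError("not a reading with a device_id")
    return record


def process_batch(raw_messages: Iterable[str]) -> Tuple[List[dict], List[Tuple[str, str]]]:
    """Split a batch into parsed readings and dead letters (raw text plus reason)."""
    good: List[dict] = []
    dead_letters: List[Tuple[str, str]] = []
    for raw in raw_messages:
        try:
            good.append(parse_reading(raw))
        except ValueError as err:
            # Bad data: keep the original message and the reason for replay.
            dead_letters.append((raw, str(err)))
        # InfrastructureError is deliberately not caught here, so it
        # propagates and lets the caller wind the stream down.
    return good, dead_letters


good, dead_letters = process_batch(
    ['{"device_id": "a1", "temp": 3}', "not json", '{"temp": 5}', "42"]
)
```

Here one message parses and three land on the dead-letter list with their raw text intact, which is what makes later replay possible.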
  • 19. Tips: performance tuning Spark Streaming • Tip 1: Over-subscribe your cores - Minimum core count needed is (number of source topics) + 2 - For efficiency, over-subscribe your cores - High multiples are fine • Tip 2: Use broadcast variables to persist shared ephemeral rules • Tip 3: Limit Kafka topics per App - Counter-intuitive for defensive programming - Avoids starvation due to imbalanced loads • Tip 4: Avoid the shuffle - Shuffle is tough on I/O; with streaming it is worse - Instead rely on Kafka partitioning - However, Kafka offset partitioning is still a work in progress
  • 20. Streaming Real-world Industrial IoT Data: “It’s very different than the Canonical Twitter stream analysis teaching example”
  • 21. IoT + Spark Streaming = Physical + Data (in near real-time). All the normal “dirty data” issues, plus: streaming means you have to handle much of this in near-real time
  • 22. Challenge 1: Ingesting IoT data • The “IoT Menagerie” - Millions of source IPs: whitelisting is impossible - Many transport protocols and standards: HTTP, FTP, CoAP, MQTT, TCP, UDP, X12, GPRS • Several tools are available to ingest IoT transactions into your platform - Some even write directly to Kafka for processing by Spark, Storm and Flink • However, not everything is a simple transaction; most is not • The “obvious” solution, increasing MAX KAFKA SIZE, does not work: - Bottlenecking and serialization issues - Ultimately you will not be able to increase it enough • Lessons Learned: Use hybrid ingestion - Append critical metadata immediately at point of ingestion - Includes transaction ID and digital signature - Split metadata from payload for complex and large data types - Keeps memory low and is fully scalable [Chart: streaming data types, as share of transactions (xacs) and of data. Types: Micro-batches, Simple transactions, Loggers, Sensor constellations, MIME media transactions. Shares in chart order: 30% of xacs, 10% of data; 20% of xacs, 35% of data; 5% of xacs, 10% of data; 45% of xacs, 15% of data; MIME media transactions: <1% of xacs, 30% of data]
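The hybrid ingestion lesson (stamp metadata at the point of ingestion, then split large payloads from their metadata) could look roughly like the sketch below. Everything named here is an assumption made for illustration: `ingest`, `PAYLOAD_STORE`, the envelope fields, and the 1024-byte inline limit. A SHA-256 content digest stands in for the digital signature; a real signature would involve a key.

```python
import hashlib
import time
import uuid
from typing import Dict

# Hypothetical stand-in for a blob store that holds large payloads.
PAYLOAD_STORE: Dict[str, bytes] = {}


def ingest(payload: bytes, protocol: str, inline_limit: int = 1024) -> dict:
    """Build a small envelope for the queue; park large payloads elsewhere."""
    digest = hashlib.sha256(payload).hexdigest()  # stand-in for a signature
    envelope = {
        "transaction_id": str(uuid.uuid4()),
        "received_at": time.time(),
        "protocol": protocol,
        "sha256": digest,
        "size": len(payload),
    }
    if len(payload) <= inline_limit:
        # Simple transactions travel inline with their metadata.
        envelope["payload"] = payload
    else:
        # Large or complex data: only a reference goes through the queue.
        PAYLOAD_STORE[digest] = payload
        envelope["payload_ref"] = digest
    return envelope


small = ingest(b"t=21.5", "MQTT")
large = ingest(b"x" * 5000, "HTTP")
```

With this split, every message on the queue stays small no matter how large the original media or logger upload was, which is the property the slide attributes to hybrid ingestion.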
  • 23. Challenge 2: Handling stream interruptions and surges • Massive increase in stream interruptions (vs. normal server flows) - Loss of power - Movement in and out of coverage - Bad OTA updates (can cause false DDoS events) • Often undetected by anyone but Spark • Overcoming these - Monitor and alarm on anomalous values - Tune your fetch rates to avoid overwhelming I/O - Our Hope: New Spark back-pressure (still in beta)
  • 24. Challenge 3: Cleansing and transforming IoT data • Transmission, authentication, and formatting errors are much more frequent in IoT - Ever had a cellphone call dropped or received a duplicate text? - Data is rarely self-describing - Firmware configuration management issues - Standards non-compliance • Duplication is much more common (and complex) than in traditional Lambda - Duplicate data can “hide” in unique wrappers - Duplicate data can be obscured by transaction IDs - Duplicates can arrive beyond viably sustainable window durations • Lessons Learned: - Accept everything, even authentication errors - Capture entire lineage of processing (metadata and payload) - Route failures away from the DAG, but preserve them to replay and recover - Map data to its base atomic unit, THEN digitally sign and de-duplicate data [Diagram: a unique transaction set contains duplicate facts (from a prior set), unique facts (to this set), and incomplete facts (in this set), each under a unique header]
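The last lesson, reducing data to an atomic unit before signing and de-duplicating, can be illustrated with a small sketch. The `header`/`facts` layout and the function names are assumptions, and a SHA-256 digest of a canonical JSON form stands in for the digital signature. Because duplicates can arrive long after any practical window, the `seen` set would need to be a durable store rather than the in-memory set used here.

```python
import hashlib
import json
from typing import List, Set


def fact_signature(fact: dict) -> str:
    """Digest of a fact's canonical form, ignoring any transaction wrapper."""
    canonical = json.dumps(fact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def new_facts(transaction: dict, seen: Set[str]) -> List[dict]:
    """Return only the facts whose signature has not been seen before."""
    fresh: List[dict] = []
    for fact in transaction["facts"]:
        signature = fact_signature(fact)
        if signature not in seen:
            seen.add(signature)
            fresh.append(fact)
    return fresh


reading_a = {"tag": "e210", "ts": "2016-03-21T15:50:58Z", "temp": 66.7}
reading_b = {"tag": "e211", "ts": "2016-03-21T15:51:42Z", "temp": 66.9}
reading_c = {"tag": "e220", "ts": "2016-03-21T15:52:10Z", "temp": 67.0}

# Two transactions with different headers that overlap on reading_b.
first = {"header": {"txn": "A1"}, "facts": [reading_a, reading_b]}
second = {"header": {"txn": "B7"}, "facts": [reading_b, reading_c]}

seen: Set[str] = set()
from_first = new_facts(first, seen)
from_second = new_facts(second, seen)
```

Signing the fact rather than the transaction is what lets the duplicate of `reading_b` be caught even though it arrives under a different header.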
  • 25. Transformation and Cleansing example: “Simple” raw IoT data $690300SR86506702256878020160321155058- 16060a34ST-663- 000p00105300090008000030b74a67db4333102660000470 00a67fb4333102660000330009cd9b433310266000025000 10g81- 00077.09254000038.8064970066.7000002016032115514 2000010006.926480003.01000020e210000000000000001 00e21000000000000000200e2200000000000000055246a8
  • 26. Transformation and Cleansing: Canonical format for analytics. Turning machine data into cleaned, self-describing, agnostic data that can be readily used for analytics and machine learning [Diagram labels: Sensor Message, Universal Read Format]
  • 27. Challenge 4: The small files problem • Streaming data is many, many small files - 100s or 1000s per second • Adding to HDFS creates the small file problem - Many files (NameNode swamping) - Much smaller than HDFS block size (inefficient) • Delaying too long makes batch analysis stale - Kafka does not support complex queries • Lots of back and forth on this; current best practice: - Organize streams by volume and type into Kafka topics - Batch extract by topic based on volume AND time - Ultimately convert to parquet-format for batch analytics. And now, the hardest challenge: Streaming CEP…
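Extracting by volume and time can be sketched as a per-topic buffer that is written out as one larger file when either limit is reached. This is one reading of the slide, not the deck's implementation (the tech stack lists Gobblin for this kind of job). `TopicBatcher`, its thresholds, and the injectable clock are hypothetical names chosen for the example.

```python
import time
from typing import Callable, List, Optional


class TopicBatcher:
    """Buffers records for one topic; emits a batch on volume OR age."""

    def __init__(self, max_records: int, max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_records = max_records
        self.max_age_s = max_age_s
        self._clock = clock
        self._buffer: List[str] = []
        self._opened_at = 0.0

    def add(self, record: str) -> Optional[List[str]]:
        """Add a record; return a full batch if a threshold was crossed."""
        if not self._buffer:
            self._opened_at = self._clock()
        self._buffer.append(record)
        too_big = len(self._buffer) >= self.max_records
        too_old = self._clock() - self._opened_at >= self.max_age_s
        return self.flush() if (too_big or too_old) else None

    def flush(self) -> List[str]:
        """Hand back the buffered records and start a new, empty batch."""
        batch, self._buffer = self._buffer, []
        return batch


# A fake clock keeps the example deterministic.
now = [0.0]
batcher = TopicBatcher(max_records=3, max_age_s=60.0, clock=lambda: now[0])
results = [batcher.add("r1"), batcher.add("r2"), batcher.add("r3")]
results.append(batcher.add("r4"))
now[0] = 61.0  # more than 60 seconds after "r4" opened the new batch
results.append(batcher.add("r5"))
```

Age is only checked when a record arrives in this sketch, so a quiet topic would also need a timer that calls `flush()`; otherwise its last records could sit in the buffer and make batch analysis stale.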
  • 28. Challenge 5: CEP processing messy physical realities. Spark Streaming needs to make decisions quickly enough to matter… In the physical world, real-time gets stale very quickly
  • 29. Sometimes, data just gets lost (or significantly delayed). When streaming the IoT, the time lag of information is ever-present
  • 30. Also, people have been known to “contradict” sensors • Sometimes legitimate • Sometimes mistaken • Sometimes malicious. People will argue that the sensors (or rules) are wrong
  • 31. Finally, once I alert you to something, I cannot undo it. Human memory is not a batch layer: it’s hard to forget Type I errors
  • 32. CEP in IoT: (Timeliness + Good Enough) > (Late + Perfect) • Prioritize Type I bias vs. Type II based on context • Windowing can be helpful, but not always - Data can be delayed hours or days (windowing is not cost-effective) • Use self-healing rule sets (and algorithms) - Immutable journal data models for state management - Keep track of multiple time dimensions: latest, most recent - Keep track of multiple signal dimensions: detected, reported • Use batch layer to assist with self-healing - Re-order on review - Auto-resolve based on new data • Add human signals (to build trust) - Do not hide corrections, make them clear - Show full time lineage - Allow humans to re-order to understand effects of outages and delays. James will dive into this deeper…
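An immutable journal that tracks more than one time dimension might be sketched as follows. The slide does not define "latest" and "most recent", so this example assumes one plausible reading: "latest" means newest by when the sensor observed the event, and "most recent" means newest by when the platform received it. All class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JournalEntry:
    """One immutable observation; entries are appended, never updated."""
    asset_id: str
    state: str
    event_time: float     # when the sensor observed it (assumed meaning)
    received_time: float  # when the platform received it (assumed meaning)


class Journal:
    def __init__(self) -> None:
        self._entries: List[JournalEntry] = []

    def append(self, entry: JournalEntry) -> None:
        self._entries.append(entry)

    def latest_by_event_time(self, asset_id: str) -> Optional[JournalEntry]:
        """Current state of the asset according to the physical timeline."""
        mine = [e for e in self._entries if e.asset_id == asset_id]
        return max(mine, key=lambda e: e.event_time, default=None)

    def most_recently_received(self, asset_id: str) -> Optional[JournalEntry]:
        """Newest information to arrive, which may describe an older event."""
        mine = [e for e in self._entries if e.asset_id == asset_id]
        return max(mine, key=lambda e: e.received_time, default=None)


journal = Journal()
journal.append(JournalEntry("trailer-7", "parked", event_time=100.0, received_time=105.0))
journal.append(JournalEntry("trailer-7", "moving", event_time=200.0, received_time=205.0))
# A delayed report: observed before "moving" but delivered much later.
journal.append(JournalEntry("trailer-7", "door-open", event_time=150.0, received_time=900.0))

current = journal.latest_by_event_time("trailer-7")
newest_arrival = journal.most_recently_received("trailer-7")
```

In this run the delayed `door-open` report is the newest arrival but does not overwrite the `moving` state, and because nothing is mutated the full timeline remains available for re-ordering on review.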
  • 33. Thank you! • Some challenges to overcome streaming IoT • Once you overcome these, and share insights with customers, the real fun begins. There is a lot you can do with Spark • Questions, ideas, comments: jhaughwout@savi.com • Starting to open source some tools at: https://github.com/sensoranalytics/ • Visit us at 3601 Eisenhower Avenue