3. Topics
Business use case
Training phase of the algorithm
Tech stack
Real time implementation
Demonstration on a force sensor
4. Data Model
We are currently working on these data models:
Unstructured data
Structured data
Time series data
For this talk we are going to concentrate on time series data
5. Problem Statement
To build a reactive application that trains on a limited amount of data.
6. Business use case
The main use case is in preventive maintenance systems.
Calendar-based maintenance schedules and holding excessive inventory to reduce
downtime both lead to inefficiencies and increased costs.
Recent machinery failures on oil rigs and in car manufacturing plants have cost their
respective industries millions of dollars in downtime and repairs.
Condition Based Monitoring systems are implemented with the goal of
eliminating unplanned downtime and reducing operating costs by maintaining the
right equipment at the right time.
As they say, a stitch in time saves nine.
11. Time series analytics
Any analytics algorithm should be a mathematical model that supports:
Data compression: a compact representation of the data
Signal processing: extracting signals (sequences) even in the presence of noise
Prediction: using the model to predict future values of the time series
12. Terminology
Patterns
A block of the graph whose values lie within a range.
Patterns are grown from pairs of sequential points until the block conforms to the given thresholds (sketched below).
Clusters
Groups of similar patterns.
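As a rough illustration, here is a minimal sketch of the pattern-growing step, assuming a plain array of values and a single range threshold; the actual algorithm's parameters and types are not given in the slides:

    // Illustrative only: grow blocks left to right while the value range
    // stays within a threshold; otherwise start a new pattern.
    case class Pattern(start: Int, end: Int, min: Double, max: Double)

    def growPatterns(values: Array[Double], rangeThreshold: Double): List[Pattern] =
      values.indices.foldLeft(List.empty[Pattern]) { (acc, i) =>
        acc match {
          case head :: tail
              if (head.max max values(i)) - (head.min min values(i)) <= rangeThreshold =>
            // Point fits the current block: extend it.
            head.copy(end = i,
                      min = head.min min values(i),
                      max = head.max max values(i)) :: tail
          case _ =>
            // Threshold would be violated (or no block yet): start a new pattern.
            Pattern(i, i, values(i), values(i)) :: acc
        }
      }.reverse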
13. Terminology
Sequences
A recurring series of patterns belonging to a set of clusters.
Concepts
Sequences which are tagged as relevant to the user.
Knowledge Base
Inferences drawn from concepts.
This is the compressed representation of the time series.
14. Phases
Training phase
Objective is to build a Knowledge Base.
Bulk historical data is given as input.
Parameters of the algorithm are fine-tuned to match the use case.
Concepts are identified and assigned an action.
Validation phase (bulk)
Bulk data is given.
Patterns are found and classified according to the Knowledge Base.
Used to identify and tag scenarios over a known timeline.
15. Phases
Decision phase (real time)
For example, a Kafka source is provided.
Received data is processed in batches.
Patterns spanning multiple batches are stitched together.
If a sequence is identified as a concept, the specified action is triggered (see the sketch below).
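A minimal sketch of this decision step, assuming the stream is a DStream of (sensorId, value) pairs and a hypothetical knowledgeBase object with a matchConcept lookup; the real matching API is not shown in the talk:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { case (sensorId, value) =>
          // Classify the incoming point against the trained Knowledge Base
          // (knowledgeBase and matchConcept are illustrative names).
          knowledgeBase.matchConcept(sensorId, value).foreach { concept =>
            // A sequence completed a known concept: fire its assigned action.
            concept.action.trigger()
          }
        }
      }
    }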
20. Training phase output
Knowledge Base properties:
Data compression: a compact representation of the data
Signal processing: extracting signals (sequences) even in the presence of noise
Prediction: using the model to predict future values of the time series
21. Real time system
Lightweight computation framework
Ability to handle the 3 V's (Volume, Velocity, and Variety) of Big Data
Computation framework with a micro-batch processing architecture
23. Data Source
A data source that can retain the data from the source and ingest it into the
computation framework, and that can
Take advantage of the distributed computation framework
Store data in a fault-tolerant manner
26. Connecting with IoT
Connect a mobile accelerometer to AWS IoT and stream data.
Train the system to predict a user's behavior using the accelerometer data (see the sketch below).
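As a hedged sketch of the publishing side, here is one accelerometer reading sent over MQTT with the Eclipse Paho client; the endpoint, topic, and payload shape are placeholders, and AWS IoT additionally requires TLS certificates (configured via MqttConnectOptions) that are omitted here:

    import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}

    // Placeholder endpoint; AWS IoT uses MQTT over TLS on port 8883.
    val client = new MqttClient(
      "ssl://YOUR_ENDPOINT.iot.us-east-1.amazonaws.com:8883",
      MqttClient.generateClientId())
    client.connect()

    // Publish one accelerometer reading as a JSON payload.
    val payload = """{"x": 0.12, "y": -0.98, "z": 9.81, "ts": 1510000000}"""
    client.publish("sensors/accelerometer", new MqttMessage(payload.getBytes("UTF-8")))
    client.disconnect()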
28. Bottlenecks
Small files issue: writing and reading a huge number of small files.
Sharing data between batches.
29. Fix: Small Files Problem
Implemented an in-memory queue to hold data for several batches, then
compile everything into a single file and write it to the storage system (sketched below).
Can also serve UI requests from the in-memory queue.
This eliminates the extra read calls to the storage system when serving UI requests.
Allows the writes to be asynchronous in the first place.
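A minimal sketch of this fix, assuming records are buffered in a driver-side queue and flushed once a threshold is reached; the record shape, threshold, and output path are illustrative, not the actual implementation:

    import scala.collection.mutable
    import org.apache.spark.sql.SparkSession

    case class Record(sensorId: String, ts: Long, value: Double)  // assumed shape

    val buffer = mutable.Queue.empty[Record]   // filled by each micro-batch
    val flushThreshold = 10000                 // assumed tuning parameter

    def maybeFlush(spark: SparkSession): Unit =
      if (buffer.size >= flushThreshold) {
        val rows = buffer.dequeueAll(_ => true)        // drain the queue
        spark.createDataFrame(rows)                    // one DataFrame for many batches
          .coalesce(1)                                 // one output file, not many
          .write.mode("append").parquet("hdfs:///data/series")
      }

UI requests can be answered from the same buffer before the data ever reaches storage, which is what removes the extra read calls.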
30. Why Share data between batches
In real-time data ingestion, data is broken into batches depending
on the batch size we choose.
We need to take care of signals overflowing across batch boundaries.
31. Sharing Data between batches
updateStateByKey (sketched below)
ssc.remember()
Spark accumulators
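For illustration, a sketch of carrying an unfinished pattern across batch boundaries with updateStateByKey, assuming a DStream of (sensorId, points) pairs named pointsBySensor; PartialPattern and the merge logic are assumptions, not the actual model:

    case class PartialPattern(points: Vector[Double])   // assumed carry-over state

    def stitch(newPoints: Seq[Double],
               state: Option[PartialPattern]): Option[PartialPattern] = {
      val carried = state.getOrElse(PartialPattern(Vector.empty))
      // Append this batch's points to whatever overflowed from the last batch;
      // a real implementation would also emit completed patterns here.
      Some(PartialPattern(carried.points ++ newPoints))
    }

    // Keyed by sensor id; updateStateByKey requires ssc.checkpoint(...) to be set.
    val stitched = pointsBySensor.updateStateByKey(stitch _)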
In this section I am going to give a brief introduction to the architecture and business use case of our system.
Our goal is to make sense of any given sensor data, be it from a pressure sensor in a valve or a camera on a self-driving car, so that we can make smart decisions or predictions about the future.
Unstructured data doesn't have relations between columns.
Whatever the data source may be, we want to build a generalized solution that can handle any type of variation and enable the user to get a specialized system for their own use case.
Assume there is an oil rig with 10 machines carrying 100 sensors each. Say we know that a component in a machine needs maintenance every 3 months, but in many real-life situations the component may break down prematurely, which may cost the company millions in downtime. Having a person monitor all the sensor outputs and determine whether any component needs maintenance is not a viable solution. Our system is built to handle this use case.
Please have a look at this data from a pressure sensor in a valve. Say, as a user, we know that the first anomaly is caused by a misoriented spring and the second occurs when the seal of the valve is broken. Can you suggest any methods to isolate these two phenomena?
Most traditional approaches wouldn't take into consideration a new type of pattern emerging.
Next slide: our solution.
loss of similarity
The Knowledge Base is the set of inferences drawn from the given data.
This is the pipeline all the phases of our application go through.
First you provide a data source; currently you can upload a local file from your computer or select one through your browser (Google Chrome). The supported formats are CSV and TSV.
Then the data is ingested using a provided schema. You can type-cast variables, join columns from multiple files, etc. (a sketch of this kind of ingestion follows).
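A small sketch of this kind of schema-driven ingestion in Spark; the column names and types are made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("ingest").getOrCreate()

    // Illustrative schema; the real application builds it from user input.
    val schema = StructType(Seq(
      StructField("ts", TimestampType),
      StructField("pressure", DoubleType)))

    val df = spark.read
      .option("header", "true")
      .option("delimiter", ",")      // use "\t" for TSV
      .schema(schema)                // type-casts columns on read
      .csv("sensor.csv")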
Using this ingested data as our time series, we can compute the PCSC (patterns, clusters, sequences, and concepts).
A simple example: say, a y = sin(x) time series model.
Prediction on time series data is one of the use cases for real-time time series analytics.
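For instance, a noisy y = sin(x) series can be generated as below; with a periodic signal like this, the Knowledge Base would compress the repeating pattern and predict upcoming values from the learned sequence:

    // Toy example: a noisy y = sin(x) series sampled at fixed intervals.
    val rng = new scala.util.Random(42)
    val series = (0 until 1000).map { i =>
      val x = i * 0.01
      (x, math.sin(x) + 0.05 * rng.nextGaussian())   // signal + noise
    }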
Talk 1 explains how we trained the system and taught it to make decisions.
We use the trained system for real-time analytics.
We need a streaming or live computation framework.
Spark Streaming is a micro-batch processing architecture.
It collects stream data into small batches and processes them.
Job creation and scheduling overhead is on the order of milliseconds.
The batch interval can be as small as 1 second.
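A minimal sketch of such a micro-batch context with a 1-second batch interval; the app name and master URL are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Micro-batch context at the 1-second minimum interval mentioned above.
    val conf = new SparkConf().setAppName("timeseries").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(1))
    // ... define the DStream pipeline here ...
    ssc.start()
    ssc.awaitTermination()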
We can't rely on the original data source, as it cannot provide recent data again once it is lost.
Apache Kafka is a distributed streaming platform:
Publish and subscribe to streams of records.
Store streams of records in a fault-tolerant way.
Streams of time series data will be pumping data into Kafka.
Spark will connect to the Kafka brokers and consume the data,
process it, and store it to the database.
The UI server will pull data for visualization.
Spark is the computation layer, while Kafka acts as the data source for streaming data (see the sketch below).
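A sketch of the Kafka side of this wiring with the spark-streaming-kafka-0-10 integration, reusing the ssc context from the earlier sketch; the broker address, group id, and topic name are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // Consumer configuration; values here are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "timeseries-consumers")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("sensor-data"), kafkaParams))

    // Each record's value is one time series point to feed the pipeline.
    val points = stream.map(record => record.value)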
We now have an end-to-end streaming application: a data source to stream from, a compute framework, and a storage system.
We can connect any real-time streaming source.
One such demo: a simple AWS IoT setup with mobile sensors.
Every batch is writing a lot of small files into the storage system (HDFS).
We use Parquet, as it is one of the best compressed data formats available,
but writing many small Parquet files from Spark adds extra overhead.
Reading several small files from storage to serve UI requests also adds delay.
Sharing data between batches means maintaining state across them.
The basic problem with live streaming is that the data is broken into batches.
Our mathematical model can't rely on a single batch; it needs to wait for the next batch to see whether the data overflows the boundary.