Real-Time IoT Analytics
with Apache Pulsar
What makes IoT analytics different?
• The business value of data decreases rapidly after it is created, particularly in use cases such
as IoT Analytics, Industrial Automation, and Real-Time Event Monitoring and Anomaly Detection.
• The high-volume, high-velocity datasets used to feed these use cases often contain valuable,
but perishable, insights that must be acted upon immediately.
• To maximize the value of their data, enterprises must fundamentally change their
approach to processing real-time data and focus on reducing decision latency for the
perishable insights that exist within their real-time data streams.
• In this talk, we will present a solution based on Apache Pulsar Functions that significantly
reduces decision latency by using probabilistic algorithms to perform analytic calculations on
the edge.
Maximizing business value
• Capture Latency: The amount of time
between when an event occurs and
when the event data arrives in the
system
• Analysis Latency: Time required to
perform your analysis
• Decision Latency: Time required to act
on your analysis.
• As latency increases, business value
decreases
The time value of data
• Currently, most streaming platforms use a client-server architecture that funnels data
from connected edge devices back to a cloud-based processing framework.
• This long-distance communication between billions of end devices and the cloud suffers from
major issues:
  • Latency: The end-to-end delay may not meet the requirements of many data streaming applications.
  • Capacity: Transmitting the volume of these incoming data streams may not be cost-effective on
  today’s network infrastructure.
  • Processing lag: The time required to process the incoming data streams may exceed the time
  value of the data, or the cloud-based processing system may not be able to keep pace with the
  incoming data stream.
Existing streaming architectures
Modern streaming architecture data flow
Apache Pulsar: Edge-to-cloud streaming platform
Apache Pulsar Functions
Flexible, serverless-inspired framework for executing
user-defined functions to process and transform data
• Implemented as simple methods that can leverage existing
Java or Python libraries and code.
• Functions execute against every event published to a
specified topic and write their results to another topic,
forming a logical directed acyclic graph.
• Enables dynamic filtering, transformation, routing and
analytics.
• Can run anywhere a JVM can, including edge devices.
• Supports parallel execution of instances.
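As a rough illustration of the model described above, and not of the Pulsar API itself, the sketch below uses plain Java lists as stand-ins for topics and `java.util.function.Function` as a stand-in for a Pulsar Function. The names `TopicPipelineSketch`, `runFunction`, and `demo` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the Pulsar Functions model: lists play the role of topics.
final class TopicPipelineSketch {

    // Applies fn to every event on the input "topic" and appends each
    // non-null result to a new output "topic".
    static <I, O> List<O> runFunction(List<I> inputTopic, Function<I, O> fn) {
        List<O> outputTopic = new ArrayList<>();
        for (I event : inputTopic) {
            O result = fn.apply(event);
            if (result != null) {            // returning null filters the event out
                outputTopic.add(result);
            }
        }
        return outputTopic;
    }

    // Chains two functions through an intermediate topic, forming a small DAG.
    static List<Double> demo() {
        List<String> rawReadings = Arrays.asList("21.5", "bad", "19.0");

        // Stage 1: parse readings, dropping anything that is not a number.
        Function<String, Double> parse = s -> {
            try {
                return Double.parseDouble(s);
            } catch (NumberFormatException e) {
                return null;
            }
        };
        // Stage 2: convert Celsius to Fahrenheit.
        Function<Double, Double> toFahrenheit = c -> c * 9.0 / 5.0 + 32.0;

        List<Double> parsed = runFunction(rawReadings, parse);
        return runFunction(parsed, toFahrenheit);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real deployment the platform, not the caller, would deliver events and route results between topics; the loop in `runFunction` only mimics that behavior.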
Pulsar Functions: Stream-native processing
[Diagram: one or more input topics feed a function f(x), which writes to one or more output topics]
Building blocks for IoT analytics
• Record-based filtering, enrichment, processing: incoming record → processor → output record(s),
e.g. lookups, range normalization, field extraction, scoring
• Cumulative aggregation, filtering, analytics: incoming records → state → output,
e.g. counts, max, min, cumulative average
• Window-based aggregation, filtering, analytics: window of incoming records → processor → output,
e.g. moving averages, pattern detection
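To make the difference between cumulative and window-based processing concrete, here is a minimal sketch of each in plain Java. The class names `CumulativeAverage` and `MovingAverage` are hypothetical helpers for illustration and are not part of Pulsar.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helpers illustrating two of the processing styles listed above.
final class BuildingBlocksSketch {

    // Cumulative aggregation: the state is a running count and mean, updated per record.
    static final class CumulativeAverage {
        private long count;
        private double mean;

        double add(double value) {
            count++;
            mean += (value - mean) / count;   // incremental mean, no history kept
            return mean;
        }
    }

    // Window-based aggregation: average over the most recent `size` records.
    static final class MovingAverage {
        private final int size;
        private final Deque<Double> window = new ArrayDeque<>();
        private double sum;

        MovingAverage(int size) {
            this.size = size;
        }

        double add(double value) {
            window.addLast(value);
            sum += value;
            if (window.size() > size) {
                sum -= window.removeFirst();  // evict the oldest record
            }
            return sum / window.size();
        }
    }

    public static void main(String[] args) {
        CumulativeAverage cumulative = new CumulativeAverage();
        MovingAverage moving = new MovingAverage(3);
        for (double reading : new double[] {10, 20, 30, 40}) {
            System.out.printf("reading=%.0f cumulative=%.2f moving=%.2f%n",
                    reading, cumulative.add(reading), moving.add(reading));
        }
    }
}
```

The cumulative version needs constant memory regardless of stream length, while the windowed version needs memory proportional to the window size.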
Distributed Probabilistic Analytics
with Apache Pulsar Functions
Real-time IoT analytics using Apache Pulsar
• We leverage Pulsar Functions to perform distributed analytics on the edge.
This reduces the volume of data transmitted back to the datacenter by
performing the calculations on the edge devices and only sending the results.
• The reduction in analysis latency comes from the use of probabilistic analytics
techniques that can achieve a high degree of accuracy while processing and storing
only a few kilobytes of data.
• These algorithms are not 100% accurate, but if you are willing to trade a small
amount of accuracy (often less than 0.01%), you can achieve a significant
increase in speed to insight, which is the key metric you are looking to improve.
Probabilistic analysis
• Minimum Analytic Performance (MAP): The minimum level of accuracy required
within your application, e.g. do you need to know the temperature reading of a
sensor to 10 decimal places, or will 1 suffice?
• Often it is sufficient to provide an approximate value when it is impossible
or impractical to provide a precise one. In many cases, having an approximate
answer within a given time frame is better than waiting for an exact answer.
• If your use case does not require precise results and an approximate answer is
acceptable, then the following techniques and algorithms will provide
accurate approximations orders of magnitude faster, while requiring orders of
magnitude less memory.
Probabilistic algorithms & sketches
• Computing certain analytic queries, such as user counts or web page view time,
requires us to keep copies of every unique value encountered.
• Computing the exact number of unique visitors per day requires you to keep on
hand all the unique visitor records you have seen. Unique identifier counts are
not additive either, so parallelism alone will not help you.
• Probabilistic algorithms can provide approximate values, estimates, and random
data samples for statistical analysis when the event stream is either too large to
store in memory or moving too fast to process. Instead of keeping such enormous
data on hand, we leverage algorithms that use small data structures known as
data sketches, which are usually kilobytes in size.
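A small worked example may help show why unique counts are not additive. The sketch below uses exact `HashSet` counting on two made-up visitor lists; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DistinctCountSketch {

    // Exact distinct count: must remember every unique value seen.
    static int exactDistinct(List<String> visitors) {
        return new HashSet<>(visitors).size();
    }

    // Exact distinct count across two partitions: requires the full sets, not just their sizes.
    static int distinctAcross(List<String> first, List<String> second) {
        Set<String> all = new HashSet<>(first);
        all.addAll(second);
        return all.size();
    }

    public static void main(String[] args) {
        List<String> morning = Arrays.asList("alice", "bob", "carol");
        List<String> evening = Arrays.asList("bob", "carol", "dave");

        // Adding the per-partition counts double counts bob and carol.
        int naiveSum = exactDistinct(morning) + exactDistinct(evening);
        int trueCount = distinctAcross(morning, evening);

        System.out.println("sum of partial counts = " + naiveSum);
        System.out.println("true distinct count   = " + trueCount);
    }
}
```

Because the partial counts cannot simply be added, an exact answer forces every partition to ship its full set of identifiers, which is the cost that sketches are meant to avoid.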
Data sketches
• A central theme throughout most of these probabilistic data structures is
the concept of data sketches, which are designed to retain only as much
of the data as is necessary to make an accurate estimate of the correct
answer.
• Typically, sketches are implemented as bit arrays or maps, thereby requiring
memory on the order of kilobytes, making them ideal for the
resource-constrained computing environments typically found on the edge.
• Sketching algorithms need to see each incoming item only once, and
are therefore ideal for processing infinite streams of data.
• Let’s walk through a demonstration
to show exactly what I mean by
sketches and show you that we do
not need 100% of the data in order
to make an accurate prediction of
what the picture contains.
• How much of the data did you
require to identify the main item in
the picture?
Sketch Example
• Configurable Accuracy
• Sketches sized correctly can be 100% accurate
• Error rate is inversely proportional to size of a Sketch
• Fixed Memory Utilization
• Maximum Sketch size is configured in advance
• Memory cost of a query is thus known in advance
• Allows Non-additive Operations to be Additive
• Sketches can be merged into a single Sketch without over counting
• Allows tasks to be parallelized and combined later
• Allows results to be combined across windows of execution
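The fixed-size and mergeable properties listed above can be illustrated with a K-Minimum-Values (KMV) distinct-count sketch. This is a deliberately small sketch and not the implementation used by any particular library; the class name `KmvSketch`, the choice of a SplitMix64-style mixing function as the hash, and the sizes used in `main` are all assumptions made for the example.

```java
import java.util.TreeSet;

// Minimal K-Minimum-Values (KMV) distinct-count sketch; illustrative only.
final class KmvSketch {
    private final int k;                                     // maximum number of retained hashes
    private final TreeSet<Long> smallest = new TreeSet<>();  // the k smallest hash values seen

    KmvSketch(int k) {
        this.k = k;
    }

    // SplitMix64-style finalizer used as a stand-in hash for numeric ids.
    private static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    void add(long id) {
        addHash(mix(id) >>> 1);                 // drop the sign bit: value in [0, 2^63)
    }

    private void addHash(long h) {
        if (smallest.size() < k) {
            smallest.add(h);                    // duplicates are ignored by the set
        } else if (h < smallest.last() && smallest.add(h)) {
            smallest.pollLast();                // keep only the k smallest
        }
    }

    // Merging is a union of retained hashes, so items seen by both sketches are counted once.
    void merge(KmvSketch other) {
        for (long h : other.smallest) {
            addHash(h);
        }
    }

    double estimate() {
        if (smallest.size() < k) {
            return smallest.size();             // exact while under capacity
        }
        double kthFraction = smallest.last() / (double) Long.MAX_VALUE;
        return (k - 1) / kthFraction;           // standard KMV estimator
    }

    public static void main(String[] args) {
        KmvSketch siteA = new KmvSketch(1024);
        KmvSketch siteB = new KmvSketch(1024);
        for (long id = 0; id < 60_000; id++) siteA.add(id);
        for (long id = 40_000; id < 100_000; id++) siteB.add(id);
        siteA.merge(siteB);                     // 100,000 distinct ids overall
        System.out.println("estimated distinct ids: " + Math.round(siteA.estimate()));
    }
}
```

Each sketch holds at most `k` hash values no matter how many items it sees, and two sketches built on different devices or time windows can be combined later without double counting the overlap.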
Data sketch properties
Operations supported by sketches
Theta Sketch (Count Distinct)
• Example: when you're doing profiling at the router level, you often want to estimate functions of distinct IP addresses, since you can't simply maintain a counter for each possible address.
• Theta Sketches enable us to answer questions about the number of unique users (set union), the number of users who did X and Y (set intersection), and the number of users who did X and did not do Y (set difference).
Tuple Sketch (Group By)
• Tuple Sketches are ideal for summarizing attributes such as impressions or clicks.
• Tuple Sketches also provide sufficient methods for a user to develop a wrapper class that facilitates approximate joins or other common database operations.
Quantile Sketches (Distribution, Anomaly Detection)
• Consider this real data example of a stream of 230 million time-spent events collected from one of our systems over a period of just 30 minutes. Each event records the amount of time in milliseconds that a user spends on a web page before moving to a different web page by taking some action, such as a click. Calculate the distribution of this dataset, then determine where a given value lies within the distribution. Anything above the 99th percentile would be considered anomalous and flagged for action.
Frequent Items Sketches (Top-K)
• Frequency estimation of Internet packet streams.
• Top-10 Tweets, queries, items sold, etc.
Sampling (Approximate Query Processing)
• What is the ratio of …? What percentage of …? What is the average of …?
• Approximate query processing is a viable technique in these cases. A slightly less accurate result that is computed instantly is desirable, because most analysts are performing exploratory operations on the database and do not need precise answers. An approximate answer along with a confidence interval would suit most of the use cases.
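The sampling use case above depends on keeping a representative sample of a stream whose length is not known in advance. One well-known way to do that is reservoir sampling; the sketch below is a minimal version, and the class name `ReservoirSampler` and its seed parameter are illustrative assumptions rather than part of any library mentioned in this deck.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling (Algorithm R): keeps a uniform random sample of fixed size
// from a stream whose length is not known in advance.
final class ReservoirSampler<T> {
    private final int capacity;
    private final List<T> reservoir = new ArrayList<>();
    private final Random random;
    private long seen;

    ReservoirSampler(int capacity, long seed) {
        this.capacity = capacity;
        this.random = new Random(seed);
    }

    void offer(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);                              // fill the reservoir first
        } else {
            long slot = (long) (random.nextDouble() * seen);  // roughly uniform in [0, seen)
            if (slot < capacity) {
                reservoir.set((int) slot, item);              // replace with probability capacity/seen
            }
        }
    }

    // Returns a copy of the current sample.
    List<T> snapshot() {
        return new ArrayList<>(reservoir);
    }

    long seen() {
        return seen;
    }

    public static void main(String[] args) {
        ReservoirSampler<Integer> sampler = new ReservoirSampler<>(100, 42L);
        for (int reading = 0; reading < 10_000; reading++) {
            sampler.offer(reading);
        }
        double sum = 0;
        for (int value : sampler.snapshot()) {
            sum += value;
        }
        System.out.println("approximate average: " + sum / sampler.snapshot().size());
    }
}
```

Averages, ratios, and percentages computed over the sample then serve as approximate answers for the full stream, with accuracy governed by the sample size.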
Some Sketchy Functions
• Another commonly computed statistic is the frequency at which a specific element occurs within
an endless data stream with repeated elements, which enables us to answer questions such as
“How many times has element X occurred in the data stream?” These types of answers are
particularly useful in real-time event monitoring and analysis.
• Consider trying to analyze and sample the IoT sensor data for just a single industrial plant that
can produce millions of readings per second. There isn’t enough time to perform the calculations
or store the data.
• In such a scenario you can choose to forgo an exact answer, which we will never be able to
compute in time, for an approximate answer that is within an acceptable range of accuracy. The
most popular algorithm for estimating sample frequency is Count-Min Sketch, which, as the name
suggests, provides a sketch (approximation) of your data without actually storing the data itself.
Event frequency
• The Count-Min Sketch algorithm uses two elements:
  • A K-by-M matrix of counters, each initialized to 0, where each row
  corresponds to a hash function.
  • A collection of K independent hash functions h(x).
• When an element is added to the sketch, each of the hash
functions is applied to the element. Each hash value is
treated as an index into the corresponding row of counters, and that
counter is incremented by 1.
• Now that we have an approximate count for each element we
have seen stored in the K-by-M matrix, we can quickly
determine how many times an element X has occurred previously
in the stream by applying each of the hash functions to the
element, retrieving the corresponding counters,
and using the SMALLEST value in the list as the approximate
event count.
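The update and query steps described above can be sketched in plain Java as follows. This is a minimal illustration and not the stream-lib implementation; the class name `SimpleCountMin` and the way the row hashes are derived from `String.hashCode()` are assumptions chosen to keep the example short.

```java
// Minimal Count-Min sketch; illustrative, not the stream-lib implementation.
final class SimpleCountMin {
    private final int depth;        // K: number of hash functions (rows)
    private final int width;        // M: counters per row (columns)
    private final long[][] counts;  // all counters start at 0

    SimpleCountMin(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
    }

    // Row-specific hash: mixes the item's hashCode with the row number (an assumed, simple choice).
    private int index(String item, int row) {
        long h = item.hashCode() * 0x9E3779B97F4A7C15L + (row + 1) * 0xBF58476D1CE4E5B9L;
        h ^= (h >>> 32);
        h *= 0x94D049BB133111EBL;
        h ^= (h >>> 29);
        return (int) Math.floorMod(h, (long) width);
    }

    // Adding an item increments exactly one counter in every row.
    void add(String item) {
        for (int row = 0; row < depth; row++) {
            counts[row][index(item, row)]++;
        }
    }

    // The smallest of the item's counters is the one least inflated by collisions.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][index(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        SimpleCountMin sketch = new SimpleCountMin(4, 256);
        for (int i = 0; i < 5; i++) sketch.add("sensor-1");
        for (int i = 0; i < 2; i++) sketch.add("sensor-2");
        System.out.println("sensor-1 seen about " + sketch.estimate("sensor-1") + " times");
    }
}
```

Because collisions can only add to a counter, the estimate never falls below the true count; widening the rows or adding rows reduces how far above it the estimate can drift.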
Count-min sketch
Pulsar Function: Event frequency
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import com.clearspring.analytics.stream.frequency.CountMinSketch;

public class CountMinFunction implements Function<String, Void> {
    // Constructor arguments are depth, width, and seed
    private final CountMinSketch sketch = new CountMinSketch(20, 20, 128);

    @Override
    public Void process(String input, Context context) throws Exception {
        sketch.add(input, 1); // Hashes the input and increments its counters by 1
        long count = sketch.estimateCount(input);
        // React to the updated count
        return null;
    }
}
• Another common use of the Count-Min algorithm is maintaining lists of
frequent items, commonly referred to as “Heavy Hitters”. This
design pattern retains a list of items that occur more frequently than some
predefined value, e.g. the top-K list.
• The K-Frequency-Estimation problem can also be solved by using the Count-Min
Sketch algorithm. The logic for updating the counts is exactly the same
as in the Event Frequency use case.
• However, an additional list of length K holds the top-K
elements seen so far and is updated as counts change.
K-Frequency-estimation, aka “Heavy Hitters”
• Each of the hash functions is applied to the element. Each
hash value is treated as an index into the counter array, and the
corresponding counter is incremented by 1.
• Calculate the event frequency for the element as we did in
the event frequency use case by applying each of the hash
functions to the element and retrieving all of the
corresponding counters, as we did upon insertion.
However, this time, rather than incrementing each of these
counters, we take the SMALLEST value in the list and
use that as the approximate event count.
• Compare the calculated event frequency of this element
against the smallest value in the top-K elements array, and if
it is LARGER, remove the smallest value and replace it with
the new element.
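The three steps above can be sketched in plain Java as follows. This is an illustrative version under simple assumptions: the class name `TopKTracker`, the hash derivation, and the use of a small map for the top-K list are hypothetical choices, and a production implementation would likely use a heap or a library such as stream-lib instead.

```java
import java.util.HashMap;
import java.util.Map;

// Heavy hitters: a small Count-Min table supplies estimates, and a map of at most K
// entries tracks the current top-K candidates. Illustrative sketch only.
final class TopKTracker {
    private final long[][] counts;
    private final int k;
    private final Map<String, Long> topK = new HashMap<>();

    TopKTracker(int depth, int width, int k) {
        this.counts = new long[depth][width];
        this.k = k;
    }

    // Row-specific hash derived from hashCode (an assumed, simple choice).
    private int index(String item, int row) {
        long h = item.hashCode() * 0x9E3779B97F4A7C15L + (row + 1) * 0xBF58476D1CE4E5B9L;
        h ^= (h >>> 32);
        h *= 0x94D049BB133111EBL;
        h ^= (h >>> 29);
        return (int) Math.floorMod(h, (long) counts[row].length);
    }

    void offer(String item) {
        long estimate = Long.MAX_VALUE;
        for (int row = 0; row < counts.length; row++) {
            long updated = ++counts[row][index(item, row)];   // step 1: increment
            estimate = Math.min(estimate, updated);           // step 2: take the smallest
        }
        if (topK.containsKey(item) || topK.size() < k) {
            topK.put(item, estimate);                         // already tracked, or room left
            return;
        }
        // Step 3: compare against the weakest tracked element and replace it if larger.
        Map.Entry<String, Long> weakest = null;
        for (Map.Entry<String, Long> entry : topK.entrySet()) {
            if (weakest == null || entry.getValue() < weakest.getValue()) {
                weakest = entry;
            }
        }
        if (estimate > weakest.getValue()) {
            topK.remove(weakest.getKey());
            topK.put(item, estimate);
        }
    }

    // Returns a copy of the tracked items and their estimated counts.
    Map<String, Long> topK() {
        return new HashMap<>(topK);
    }
}
```

Scanning the map for the weakest entry costs O(K) per event, which is acceptable for small K; a min-heap keyed on the estimate would reduce that cost for larger lists.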
Pulsar Function: Top-K
import java.util.List;
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import com.clearspring.analytics.stream.Counter;
import com.clearspring.analytics.stream.StreamSummary;

public class TopKFunction implements Function<String, Void> {
    // Capacity of 256 monitored items
    private final StreamSummary<String> summary = new StreamSummary<>(256);

    @Override
    public Void process(String input, Context context) throws Exception {
        // Add the element to the summary
        summary.offer(input, 1);
        // Grab the updated top 10
        List<Counter<String>> topK = summary.topK(10);
        // React to the updated top-K list
        return null;
    }
}
IoT Analytics Pipeline Using
Apache Pulsar Functions
• A network of smart meters enables utility companies to gain greater
visibility into their customers’ energy consumption. With a network of
smart meters, utility companies can monitor demand in real time and:
  • Increase or decrease energy generation to meet demand.
  • Implement dynamic notifications to encourage consumers to use less
  energy during peak demand periods.
  • Provide real-time revenue forecasts to senior business leaders.
  • Identify faulty meters and schedule maintenance calls to repair them.
Identifying real-time energy consumption patterns
Smart meter analytics flow
• Decode data: decodes from binary format
• Validate Meter Reading: sanity check on reading
• Cumulative Reading Aggregation: sum of all usage in rolling 5 minute window (Total Energy Demand)
• Top-K Users: identify Top-K users by Meter ID
• Event Frequency: count # of occurrences by meter ID
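As one possible sketch of the rolling 5 minute usage sum in this flow, the class below keeps only the readings that fall inside the window. The class name `RollingUsageSum`, the kWh unit, and the assumption that readings arrive in timestamp order are illustrative choices and are not taken from the deck.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Rolling time-window sum; timestamps are epoch milliseconds supplied by the caller.
final class RollingUsageSum {
    private static final class Reading {
        final long timestampMs;
        final double kwh;

        Reading(long timestampMs, double kwh) {
            this.timestampMs = timestampMs;
            this.kwh = kwh;
        }
    }

    private final long windowMs;
    private final Deque<Reading> window = new ArrayDeque<>();
    private double total;

    RollingUsageSum(long windowMs) {
        this.windowMs = windowMs;
    }

    // Adds a reading and returns total usage within the window ending at its timestamp.
    // Assumes readings arrive in non-decreasing timestamp order and windowMs is positive.
    double add(long timestampMs, double kwh) {
        window.addLast(new Reading(timestampMs, kwh));
        total += kwh;
        while (window.peekFirst().timestampMs <= timestampMs - windowMs) {
            total -= window.removeFirst().kwh;   // expire readings older than the window
        }
        return total;
    }

    public static void main(String[] args) {
        RollingUsageSum demand = new RollingUsageSum(5 * 60 * 1000L);
        System.out.println(demand.add(0L, 1.0));
        System.out.println(demand.add(60_000L, 2.0));
        System.out.println(demand.add(300_000L, 4.0));   // the reading at t=0 has expired
    }
}
```

Memory use is bounded by the number of readings that arrive within one window, which keeps the aggregation practical on an edge device.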
IoT analytics using Pulsar Functions
Summary & Review
• IoT analytics is an extremely complex problem, and
modern streaming platforms are not well suited to
solving it.
• Apache Pulsar Edge provides a platform for
implementing distributed analytics on the edge to
decrease the data capture time.
• Apache Pulsar Functions allow you to leverage
existing probabilistic analysis techniques to
provide approximate values within an acceptable
degree of accuracy, thereby reducing the analysis
time.
• Both techniques allow you to act upon your data
while the business value is still high.
[Diagram: Probabilistic Algorithms; Pulsar Edge Deployment]
Streamlio and IoT analytics with Apache Pulsar

Mais conteúdo relacionado

Mais procurados

Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseBig Data Spain
 
Designing and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open SourceDesigning and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open SourceDataWorks Summit/Hadoop Summit
 
Real time, streaming advanced analytics, approximations, and recommendations ...
Real time, streaming advanced analytics, approximations, and recommendations ...Real time, streaming advanced analytics, approximations, and recommendations ...
Real time, streaming advanced analytics, approximations, and recommendations ...DataWorks Summit/Hadoop Summit
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit
 
Streaming Analytics for Financial Enterprises
Streaming Analytics for Financial EnterprisesStreaming Analytics for Financial Enterprises
Streaming Analytics for Financial EnterprisesDatabricks
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Kevin Mao
 
Modern real-time streaming architectures
Modern real-time streaming architecturesModern real-time streaming architectures
Modern real-time streaming architecturesArun Kejariwal
 
Time series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_finalTime series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_finalconfluent
 
Shared time-series-analysis-using-an-event-streaming-platform -_v2
Shared   time-series-analysis-using-an-event-streaming-platform -_v2Shared   time-series-analysis-using-an-event-streaming-platform -_v2
Shared time-series-analysis-using-an-event-streaming-platform -_v2confluent
 
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Karthik Ramasamy
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningDatabricks
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of dataconfluent
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
 
Anomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronAnomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronArun Kejariwal
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
ksqlDB: Building Consciousness on Real Time Events
ksqlDB: Building Consciousness on Real Time EventsksqlDB: Building Consciousness on Real Time Events
ksqlDB: Building Consciousness on Real Time Eventsconfluent
 

Mais procurados (20)

Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
The Evolution of Big Data Pipelines at Intuit
The Evolution of Big Data Pipelines at Intuit The Evolution of Big Data Pipelines at Intuit
The Evolution of Big Data Pipelines at Intuit
 
Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas Weise
 
Self-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons LearnedSelf-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons Learned
 
Designing and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open SourceDesigning and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open Source
 
Real time, streaming advanced analytics, approximations, and recommendations ...
Real time, streaming advanced analytics, approximations, and recommendations ...Real time, streaming advanced analytics, approximations, and recommendations ...
Real time, streaming advanced analytics, approximations, and recommendations ...
 
Spark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan ZvaraSpark Summit EU talk by Zoltan Zvara
Spark Summit EU talk by Zoltan Zvara
 
Streaming Analytics for Financial Enterprises
Streaming Analytics for Financial EnterprisesStreaming Analytics for Financial Enterprises
Streaming Analytics for Financial Enterprises
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
 
Modern real-time streaming architectures
Modern real-time streaming architecturesModern real-time streaming architectures
Modern real-time streaming architectures
 
Time series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_finalTime series-analysis-using-an-event-streaming-platform -_v3_final
Time series-analysis-using-an-event-streaming-platform -_v3_final
 
Shared time-series-analysis-using-an-event-streaming-platform -_v2
Shared   time-series-analysis-using-an-event-streaming-platform -_v2Shared   time-series-analysis-using-an-event-streaming-platform -_v2
Shared time-series-analysis-using-an-event-streaming-platform -_v2
 
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of data
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
Anomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using HeronAnomaly detection in real-time data streams using Heron
Anomaly detection in real-time data streams using Heron
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
ksqlDB: Building Consciousness on Real Time Events
ksqlDB: Building Consciousness on Real Time EventsksqlDB: Building Consciousness on Real Time Events
ksqlDB: Building Consciousness on Real Time Events
 

Semelhante a Streamlio and IoT analytics with Apache Pulsar

Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidStreamNative
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesEd Hunter
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...Big Data Spain
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Real-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfReal-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfAlbert Wong
 
Deep dive time series anomaly detection with different Azure Data Services
Deep dive time series anomaly detection with different Azure Data ServicesDeep dive time series anomaly detection with different Azure Data Services
Deep dive time series anomaly detection with different Azure Data ServicesMarco Parenzan
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!TigerGraph
 
Levelling up your data infrastructure
Levelling up your data infrastructureLevelling up your data infrastructure
Levelling up your data infrastructureSimon Belak
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
Apache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsApache Spark Streaming -Real time web server log analytics
Apache Spark Streaming -Real time web server log analyticsANKIT GUPTA
 
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareData Con LA
 
[WSO2Con EU 2017] Deriving Insights for Your Digital Business with Analytics
[WSO2Con EU 2017] Deriving Insights for Your Digital Business with Analytics[WSO2Con EU 2017] Deriving Insights for Your Digital Business with Analytics
[WSO2Con EU 2017] Deriving Insights for Your Digital Business with AnalyticsWSO2
 
Using Elasticsearch for Analytics
Using Elasticsearch for AnalyticsUsing Elasticsearch for Analytics
Using Elasticsearch for AnalyticsVaidik Kapoor
 
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfBuild User-Facing Analytics Application That Scales Using StarRocks (DLH).pdf
Build User-Facing Analytics Application That Scales Using StarRocks (DLH).pdfAlbert Wong
 
Rise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetupRise of the machines -- Owasp israel -- June 2014 meetup
Rise of the machines -- Owasp israel -- June 2014 meetupShlomo Yona
 

Semelhante a Streamlio and IoT analytics with Apache Pulsar (20)

Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_DavidUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Real-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdfReal-Time Analytics With StarRocks (DWH+DL).pdf
Real-Time Analytics With StarRocks (DWH+DL).pdf
 
Deep dive time series anomaly detection with different Azure Data Services
Deep dive time series anomaly detection with different Azure Data ServicesDeep dive time series anomaly detection with different Azure Data Services
Deep dive time series anomaly detection with different Azure Data Services
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Streamlio and IoT analytics with Apache Pulsar

  • 2. What makes IoT analytics different?
  • 3. Maximizing business value
      • The business value of data decreases rapidly after it is created, particularly in use cases such as IoT analytics, industrial automation, and real-time event monitoring and anomaly detection.
      • The high-volume, high-velocity datasets that feed these use cases often contain valuable but perishable insights that must be acted upon immediately.
      • To maximize the value of their data, enterprises must fundamentally change their approach to processing real-time data and focus on reducing the decision latency for the perishable insights that exist within their real-time data streams.
      • In this talk, we will present a solution based on Apache Pulsar Functions that significantly reduces decision latency by using probabilistic algorithms to perform analytic calculations on the edge.
  • 4. The time value of data
      • Capture latency: the amount of time between when an event occurs and when the event data arrives in the system.
      • Analysis latency: the time required to perform your analysis.
      • Decision latency: the time required to act on your analysis.
      • As latency increases, business value decreases.
  • 5. Existing streaming architectures
      • Currently, most streaming platforms use a client-server architecture that funnels data from connected edge devices back to a cloud-based processing framework.
      • This long-distance communication between billions of end devices and the cloud suffers from major issues:
          • Latency: the end-to-end delay may not meet the requirements of many data streaming applications.
          • Capacity: transmitting the volume of these incoming data streams may not be cost-effective on today's network infrastructure.
          • Processing lag: the time required to process the incoming data streams may exceed the time value of the data, or the cloud-based processing system may not be able to keep pace with the incoming data stream.
  • 7. Apache Pulsar: Edge-to-cloud streaming platform
  • 9. Pulsar Functions: Stream-native processing
      • A flexible, serverless-inspired framework for executing user-defined functions to process and transform data.
      • Functions are implemented as simple methods, but can leverage existing libraries and code within Java or Python.
      • Functions execute against every event published to a specified topic and write their results to another topic, forming a logical directed acyclic graph.
      • Enables dynamic filtering, transformation, routing, and analytics.
      • Can run anywhere a JVM can, including edge devices.
      • Supports parallel execution of instances.
      • [Diagram: input topics → function f(x) → output topics]
  • 10. Building blocks for IoT analytics
      • Record-based filtering, enrichment, and processing (e.g. lookups, range normalization, field extraction, scoring): incoming record → processor → output record(s).
      • Cumulative aggregation, filtering, and analytics (e.g. counts, max, min, cumulative average): incoming records → state → output.
      • Window-based aggregation, filtering, and analytics (e.g. moving averages, pattern detection): incoming records → processor → output.
  • 11. Distributed Probabilistic Analytics with Apache Pulsar Functions
  • 12. Real-time IoT analytics using Apache Pulsar
      • We leverage Pulsar Functions to perform distributed analytics on the edge. This reduces the volume of data transmitted back to the datacenter, because the calculations run on the edge devices and only the results are sent.
      • The reduction in analysis latency comes from probabilistic analytics techniques, which can achieve a high degree of accuracy while processing and storing only a few kilobytes of data.
      • These algorithms are not 100% accurate, but if you are willing to trade a small amount of accuracy (often less than 0.01%), you can achieve a significant increase in speed to insight, which is the key metric you are looking to improve.
  • 13. Probabilistic analysis
      • Minimum Analytic Performance (MAP): the minimum level of accuracy required by your application. For example, do you need to know a sensor's temperature reading to 10 decimal places, or will 1 suffice?
      • It is often sufficient to provide an approximate value when a precise value is impossible or impractical to obtain. In many cases, an approximate answer within a given time frame is better than waiting for an exact answer.
      • If your use case does not require precise results and an approximate answer is acceptable, then the following techniques and algorithms will provide accurate approximations orders of magnitude faster, while requiring orders of magnitude less memory.
  • 14. Probabilistic algorithms & sketches
      • Computing certain analytic queries exactly, such as user counts or web page view time, requires us to keep copies of every unique value encountered.
      • Computing the exact number of unique visitors per day requires you to keep on hand all the unique visitor records you have seen. Unique identifier counts are not additive either, so no amount of parallelism will help you.
      • Probabilistic algorithms can provide approximate values, estimates, and random data samples for statistical analysis when the event stream is too large to store in memory or the data is moving too fast to process. Instead of keeping such enormous data on hand, we leverage algorithms that use small data structures known as data sketches, which are usually kilobytes in size.
  • 15. Data sketches
      • A central theme throughout most of these probabilistic data structures is the concept of data sketches, which are designed to retain only as much of the data as is necessary to make an accurate estimate of the correct answer.
      • Typically, sketches are implemented as bit arrays or maps, thereby requiring memory on the order of kilobytes. This makes them ideal for the resource-constrained computing environments typically found on the edge.
      • Sketching algorithms need to see each incoming item only once, and are therefore ideal for processing infinite streams of data.
  • 16. Sketch Example
      • Let's walk through a demonstration to show exactly what I mean by sketches, and to show that we do not need 100% of the data in order to make an accurate prediction of what the picture contains.
      • How much of the data did you require to identify the main item in the picture?
  • 17. Data sketch properties
      • Configurable accuracy
          • Sketches sized correctly can be 100% accurate.
          • Error rate is inversely proportional to the size of a sketch.
      • Fixed memory utilization
          • Maximum sketch size is configured in advance.
          • The memory cost of a query is thus known in advance.
      • Allows non-additive operations to be additive
          • Sketches can be merged into a single sketch without overcounting.
          • Allows tasks to be parallelized and combined later.
          • Allows results to be combined across windows of execution.
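The fixed-memory and mergeability properties above can be illustrated with a small distinct-count sketch. The following is a minimal sketch only, not the talk's code: it uses a K-minimum-values (KMV) estimator as a simple stand-in for production sketch libraries, and the class name `KmvSketch`, the parameter `k`, and the choice of hash function are all assumptions made for illustration.

```java
import java.util.TreeSet;

/** Illustrative K-minimum-values (KMV) distinct-count sketch; all names are hypothetical. */
final class KmvSketch {
    private final int k;                                    // maximum number of hashes retained (fixed memory)
    private final TreeSet<Long> smallest = new TreeSet<>(); // the k smallest hash values seen so far

    KmvSketch(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        this.k = k;
    }

    /** SplitMix64-style mixer, shifted to a non-negative 63-bit value (an assumed hash choice). */
    private static long hash(long id) {
        long z = id + 0x9e3779b97f4a7c15L;
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return (z ^ (z >>> 31)) >>> 1;
    }

    /** Records one occurrence of an identifier; duplicates hash to the same value and are ignored. */
    void add(long id) {
        insert(hash(id));
    }

    private void insert(long h) {
        smallest.add(h);
        if (smallest.size() > k) smallest.pollLast(); // drop the largest so at most k values remain
    }

    /** Folds another sketch (built with the same k) into this one without double counting. */
    void merge(KmvSketch other) {
        for (long h : other.smallest) insert(h);
    }

    /** Estimated number of distinct identifiers seen. */
    double estimate() {
        if (smallest.size() < k) return smallest.size();        // fewer than k distinct hashes retained
        double kthFraction = smallest.last() / Math.pow(2, 63); // k-th smallest hash scaled to (0, 1)
        return (k - 1) / kthFraction;
    }

    public static void main(String[] args) {
        KmvSketch edgeA = new KmvSketch(256);
        KmvSketch edgeB = new KmvSketch(256);
        for (long id = 0; id < 6000; id++) edgeA.add(id);
        for (long id = 4000; id < 10000; id++) edgeB.add(id); // overlaps edgeA on 4000..5999
        edgeA.merge(edgeB);
        System.out.println("Estimated distinct ids: " + edgeA.estimate());
    }
}
```

Because each sketch keeps only the k smallest hash values, merging two sketches yields the same state as a single sketch that saw both streams, which is what lets partial results from separate edge devices or time windows be combined later.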
  • 18. Operations supported by sketches
      • Theta Sketch (Count Distinct): For example, when profiling at the router level, you often want to estimate functions of distinct IP addresses, but you cannot simply maintain a counter for each possible address. Theta Sketches enable us to answer questions about the number of unique users (set union), the number of users who did X and Y (set intersection), and the number of users who did X and did not do Y (set difference).
      • Tuple Sketch (Group By): Tuple Sketches are ideal for summarizing attributes such as impressions or clicks. They also provide sufficient methods for a user to develop a wrapper class that facilitates approximate joins or other common database operations.
      • Quantile Sketches (Distribution, Anomaly Detection): Consider this real data example of a stream of 230 million time-spent events collected from one of our systems over a period of just 30 minutes. Each event records the amount of time in milliseconds that a user spends on a web page before moving to a different web page by taking some action, such as a click. Calculate the distribution of this dataset, then determine where a given value lies within the distribution. Anything beyond the 99th percentile would be considered anomalous and flagged for action.
      • Frequent Items Sketches (Top-K): Frequency estimation of Internet packet streams; top-10 tweets, queries, items sold, etc.
      • Sampling (Approximate Query Processing): What is the ratio of ? What percentage of ? What is the average of ? Approximate query processing is a viable technique in these cases, where a slightly less accurate result that is computed instantly is desirable. Most analysts are performing exploratory operations on the database and do not need precise answers, so an approximate answer along with a confidence interval would suit most of the use cases.
  • 20. Event frequency
      • Another commonly computed statistic is the frequency at which a specific element occurs within an endless data stream with repeated elements, which lets us answer questions such as "How many times has element X occurred in the data stream?" These types of answers are particularly useful in real-time event monitoring and analysis.
      • Consider trying to analyze and sample the IoT sensor data for just a single industrial plant that can produce millions of readings per second. There isn't enough time to perform the calculations or store the data.
      • In such a scenario you can choose to forgo an exact answer, which you will never be able to compute in time, in favor of an approximate answer that is within an acceptable range of accuracy. The most popular algorithm for estimating event frequency is Count-Min Sketch, which, as the name suggests, provides a sketch (approximation) of your data without actually storing the data itself.
  • 21. Count-min sketch
      • The Count-Min Sketch algorithm uses two elements:
          • A K-by-M matrix of counters (K rows of M counters), each initialized to 0, where each row corresponds to a hash function.
          • A collection of K independent hash functions h(x).
      • When an element is added to the sketch, each of the hash functions is applied to the element. Each hash value is treated as an index into the corresponding row of the matrix, and the counter at that index is incremented by 1.
      • Now that an approximate count for each element we have seen is stored in the K-by-M matrix, we can quickly determine how many times an element X has occurred previously in the stream: apply each of the hash functions to the element, retrieve all of the corresponding counters, and use the SMALLEST value in the list as the approximate event count.
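The add and query steps described above can be written out directly. This is a minimal, self-contained sketch under stated assumptions rather than the library implementation used in the talk: the class name `SimpleCountMin`, the seeded integer mixing used to derive one hash per row, and the sizes in `main` are all illustrative choices.

```java
import java.util.Random;

/** Illustrative Count-Min Sketch; names and hashing scheme are assumptions, not from the talk. */
final class SimpleCountMin {
    private final long[][] counters; // one row of counters per hash function
    private final int[] seeds;       // one seed per row, used to derive that row's hash
    private final int width;         // number of counters in each row

    SimpleCountMin(int depth, int width, long seed) {
        this.counters = new long[depth][width];
        this.seeds = new int[depth];
        this.width = width;
        Random rng = new Random(seed);
        for (int row = 0; row < depth; row++) seeds[row] = rng.nextInt();
    }

    /** Maps an item to a counter index within the given row. */
    private int index(String item, int row) {
        int h = item.hashCode() ^ seeds[row];
        // Integer mixing steps to spread the bits before reducing to the row width.
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    /** Adds `count` occurrences of the item by incrementing one counter in every row. */
    void add(String item, long count) {
        for (int row = 0; row < counters.length; row++) {
            counters[row][index(item, row)] += count;
        }
    }

    /** Returns the smallest of the item's counters, which never undercounts. */
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < counters.length; row++) {
            min = Math.min(min, counters[row][index(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        SimpleCountMin sketch = new SimpleCountMin(4, 64, 42L);
        for (int i = 0; i < 3; i++) sketch.add("sensor-17", 1);
        sketch.add("sensor-42", 1);
        System.out.println("sensor-17 seen about " + sketch.estimate("sensor-17") + " times");
    }
}
```

Collisions can only inflate a counter, so taking the minimum across rows gives an estimate that is at least the true count; widening the rows or adding rows reduces how often, and by how much, it overshoots.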
  • 22. Pulsar Function: Event frequency

        import org.apache.pulsar.functions.api.Context;
        import org.apache.pulsar.functions.api.Function;
        import com.clearspring.analytics.stream.frequency.CountMinSketch;

        public class CountMinFunction implements Function<String, Void> {

            // Arguments: depth, width, seed
            private final CountMinSketch sketch = new CountMinSketch(20, 20, 128);

            @Override
            public Void process(String input, Context context) throws Exception {
                // Hash the input into each row of counters and increment by 1
                sketch.add(input, 1);
                // Smallest of the matching counters is the approximate count
                long count = sketch.estimateCount(input);
                // React to the updated count
                return null;
            }
        }
  • 23. K-Frequency-estimation, aka "Heavy Hitters"
      • Another common use of the Count-Min algorithm is maintaining lists of frequent items, commonly referred to as the "Heavy Hitters". This design pattern retains a list of items that occur more frequently than some predefined value, e.g. the top-K list.
      • The K-Frequency-Estimation problem can also be solved by using the Count-Min Sketch algorithm. The logic for updating the counts is exactly the same as in the Event Frequency use case.
      • However, an additional list of length K is maintained to keep the top-K elements seen so far.
  • 24. Pulsar Function: Top-K
      • Each of the hash functions is applied to the element. Each hash value is treated as an index into the corresponding row of counters, and the counter at that index is incremented by 1.
      • Calculate the event frequency for the element as we did in the event frequency use case: apply each of the hash functions to the element and retrieve all of the corresponding counters, as we did upon insertion. This time, however, rather than incrementing each of these counters, we take the SMALLEST value in the list and use that as the approximate event count.
      • Compare the calculated event frequency of this element against the smallest value in the top-K elements array, and if it is LARGER, remove the smallest value and replace it with the new element.
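The three steps above can be sketched in a few dozen lines. This is an illustrative, self-contained version under stated assumptions, not the code from the talk: the class name `TopKTracker`, the per-row hash derivation, and the use of a `HashMap` for the top-K list are all hypothetical choices.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative top-K tracker backed by Count-Min counters; all names are hypothetical. */
final class TopKTracker {
    private final long[][] counters;                       // one row of counters per hash function
    private final int width;                               // counters per row
    private final int k;                                   // how many heavy hitters to retain
    private final Map<String, Long> top = new HashMap<>(); // current top-K items and their estimates

    TopKTracker(int k, int depth, int width) {
        if (k < 1) throw new IllegalArgumentException("k must be at least 1");
        this.k = k;
        this.width = width;
        this.counters = new long[depth][width];
    }

    /** Maps an item to a counter index within the given row (an assumed hashing scheme). */
    private int index(String item, int row) {
        int h = item.hashCode() + row * 0x9e3779b9;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    void offer(String item) {
        // Steps 1 and 2: increment one counter per row, keeping the smallest as the estimate.
        long estimate = Long.MAX_VALUE;
        for (int row = 0; row < counters.length; row++) {
            long updated = ++counters[row][index(item, row)];
            estimate = Math.min(estimate, updated);
        }
        // Items already tracked, or any item while the list has room, are simply recorded.
        if (top.containsKey(item) || top.size() < k) {
            top.put(item, estimate);
            return;
        }
        // Step 3: find the weakest tracked item and replace it if the new estimate is larger.
        String weakest = null;
        for (Map.Entry<String, Long> entry : top.entrySet()) {
            if (weakest == null || entry.getValue() < top.get(weakest)) weakest = entry.getKey();
        }
        if (estimate > top.get(weakest)) {
            top.remove(weakest);
            top.put(item, estimate);
        }
    }

    /** Returns a copy of the current top-K items with their estimated counts. */
    Map<String, Long> topK() {
        return new HashMap<>(top);
    }

    public static void main(String[] args) {
        TopKTracker tracker = new TopKTracker(2, 4, 256);
        for (int i = 0; i < 5; i++) tracker.offer("meter-" + i);
        for (int i = 0; i < 50; i++) tracker.offer("meter-hot");
        System.out.println(tracker.topK());
    }
}
```

The linear scan for the weakest entry is fine for small K; a min-heap keyed on the estimate would be the usual replacement when K is large.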
  • 25. Pulsar Function: Top-K

        import java.util.List;

        import org.apache.pulsar.functions.api.Context;
        import org.apache.pulsar.functions.api.Function;
        import com.clearspring.analytics.stream.Counter;
        import com.clearspring.analytics.stream.StreamSummary;

        public class TopKFunction implements Function<String, Void> {

            // Argument: capacity, the maximum number of items tracked
            private final StreamSummary<String> summary = new StreamSummary<String>(256);

            @Override
            public Void process(String input, Context context) throws Exception {
                // Add the element to the sketch
                summary.offer(input, 1);
                // Grab the updated top 10
                List<Counter<String>> topK = summary.topK(10);
                return null;
            }
        }
  • 26. IoT Analytics Pipeline Using Apache Pulsar Functions
  • 27. Identifying real-time energy consumption patterns
      • A network of smart meters gives utility companies greater visibility into their customers' energy consumption. With a network of smart meters, utility companies can monitor demand in real time and:
          • Increase or decrease energy generation to meet the demand.
          • Implement dynamic notifications to encourage consumers to use less energy during peak demand periods.
          • Provide real-time revenue forecasts to senior business leaders.
          • Identify faulty meters and schedule maintenance calls to repair them.
  • 28. Smart meter analytics flow
      • Stages shown in the flow diagram:
          • Decode data: decodes from binary format.
          • Validate meter reading: sanity check on the reading.
          • Cumulative reading aggregation.
          • Total energy demand: sum of all usage in a rolling 5-minute window.
          • Top-K users: identify the top-K users by meter ID.
          • Event frequency: count the number of occurrences by meter ID.
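The "sum of all usage in rolling 5 minute window" stage can be sketched as a small stateful aggregator. This is a minimal illustration and not the deck's implementation: the class name `RollingUsageWindow`, the kWh unit, and the assumption that readings arrive in timestamp order are all hypothetical.

```java
import java.util.ArrayDeque;

/** Illustrative rolling-window usage sum; names and units are assumptions, not from the talk. */
final class RollingUsageWindow {

    /** One meter reading: when it was taken and how much energy it reported. */
    private static final class Reading {
        final long timestampMillis;
        final double kwh;

        Reading(long timestampMillis, double kwh) {
            this.timestampMillis = timestampMillis;
            this.kwh = kwh;
        }
    }

    private final long windowMillis;
    private final ArrayDeque<Reading> readings = new ArrayDeque<>(); // oldest reading first
    private double total;                                            // sum of readings in the window

    RollingUsageWindow(long windowMillis) {
        if (windowMillis <= 0) throw new IllegalArgumentException("window must be positive");
        this.windowMillis = windowMillis;
    }

    /**
     * Adds a reading and returns the total usage within the window ending at its timestamp.
     * Assumes readings arrive in non-decreasing timestamp order.
     */
    double add(long timestampMillis, double kwh) {
        readings.addLast(new Reading(timestampMillis, kwh));
        total += kwh;
        // Evict readings that have fallen out of the window; the newest reading always stays.
        while (readings.peekFirst().timestampMillis <= timestampMillis - windowMillis) {
            total -= readings.removeFirst().kwh;
        }
        return total;
    }

    public static void main(String[] args) {
        RollingUsageWindow window = new RollingUsageWindow(5 * 60 * 1000L);
        window.add(0L, 1.0);
        window.add(60_000L, 2.0);
        System.out.println("Usage in last 5 minutes: " + window.add(301_000L, 4.0));
    }
}
```

Keeping a running total avoids re-summing the window on every event, at the cost of small floating-point drift over very long streams; periodically recomputing the sum from the retained readings is one way to bound that drift.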
  • 29. IoT analytics using Pulsar Functions
  • 31. Summary & Review
      • IoT analytics is an extremely complex problem, and modern streaming platforms are not well suited to solving it.
      • Apache Pulsar Edge provides a platform for implementing distributed analytics on the edge to decrease the data capture time.
      • Apache Pulsar Functions allows you to leverage existing probabilistic analysis techniques to provide approximate values within an acceptable degree of accuracy, thereby reducing the analysis time.
      • Both techniques allow you to act upon your data while the business value is still high.
      • [Diagram labels: Probabilistic Algorithms; Pulsar Edge Deployment]