To keep up with fast-moving IoT data, you need technology that can collect, process and store data with performance and scalability. This presentation from Data Day Texas looks at the technology requirements and how Apache Pulsar can help to meet them.
3. • The business value of data decreases rapidly after it is created, particularly in use cases such
as IoT Analytics, Industrial Automation, and Real-Time Event Monitoring and Anomaly Detection.
• The high-volume, high-velocity datasets used to feed these use cases often contain valuable,
but perishable, insights that must be acted upon immediately.
• To maximize the value of their data, enterprises must fundamentally change their approach to processing real-time data, focusing on reducing the decision latency on the perishable insights that exist within their real-time data streams.
• In this talk, we will present a solution based on Apache Pulsar Functions that significantly
reduces decision latency by using probabilistic algorithms to perform analytic calculations on
the edge.
Maximizing business value
4. • Capture Latency: The amount of time
between when an event occurs and
when the event data arrives in the
system
• Analysis Latency: Time required to
perform your analysis
• Decision Latency: Time required to act
on your analysis.
• As latency increases, business value
decreases
The time value of data
5. • Currently, most streaming platforms utilize a client-server architecture that funnels the data from connected edge devices back to a cloud-based processing framework.
• This long-distance communication between billions of end devices and the Cloud suffers from major issues:
• Latency: The end-to-end delay may not meet the requirements of many data streaming applications.
• Capacity: Carrying the volume of these incoming data streams may not be cost-effective on today's network infrastructure.
• Processing Lag: The time required to process the incoming data streams may exceed the time value of the data, or the Cloud-based processing system may simply not be able to keep pace with the incoming data stream.
Existing streaming architectures
9. Flexible, serverless-inspired framework for executing
user-defined functions to process and transform data
• Implemented as simple methods, while allowing you to leverage existing libraries and code within Java or Python.
• Functions execute against every single event published to a specified topic and write their results to another topic, forming a logical directed acyclic graph.
• Enables dynamic filtering, transformation, routing and
analytics.
• Can run anywhere a JVM can, including edge devices.
• Supports parallel execution of instances.
Pulsar Functions: Stream-native processing
[Diagram: one or more input topics feed a Function f(x), which writes its results to one or more output topics]
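The per-event transform-and-forward model described above can be sketched with plain java.util.function composition. This is illustrative only, not the Pulsar Functions API: each stage consumes every event from its input and emits to its output, and chaining stages forms a logical directed acyclic graph.

```java
import java.util.function.Function;

// Illustrative only: plain java.util.function composition, not the Pulsar API.
public class EventPipeline {
    // Normalize the raw event payload.
    static Function<String, String> normalize = s -> s.trim().toLowerCase();
    // Score the normalized payload (here: just its length).
    static Function<String, Integer> score = s -> s.length();
    // The composed pipeline: normalize, then score.
    static Function<String, Integer> pipeline = normalize.andThen(score);

    public static void main(String[] args) {
        System.out.println(pipeline.apply("  Hello Pulsar  ")); // prints 12
    }
}
```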
10. Building blocks for IoT analytics
• Record-based filtering, enrichment, and processing: an incoming record passes through a processor, which emits one or more output records (e.g. lookups, range normalization, field extraction, scoring).
• Cumulative aggregation, filtering, and analytics: the processor maintains state across incoming records (e.g. counts, max, min, cumulative average).
• Window-based aggregation, filtering, and analytics: the processor operates over a window of incoming records (e.g. moving averages, pattern detection).
12. Real-time IoT analytics using Apache Pulsar
• We leverage Pulsar Functions to perform distributed analytics on the edge.
This reduces the volume of data transmitted back to the datacenter by
performing the calculations on the edge devices and only sending the results.
• The reduction in analysis latency comes from the use of probabilistic analytics techniques that allow us to calculate results with a high degree of accuracy while processing and storing only a few kilobytes of data.
• While these algorithms are not 100% accurate, if you are willing to trade a small amount of accuracy (often less than 0.01%), you can achieve a significant increase in speed to insight, which is the key metric you are looking to improve.
13. Probabilistic analysis
• Minimum Analytic Performance (MAP): The minimum level of accuracy required
within your application, e.g. do you need to know the temperature reading of a
sensor with 10 decimal points of accuracy, or will 1 suffice?
• Oftentimes it is sufficient to provide an approximate value when it is impossible and/or impractical to provide a precise value. In many cases, having an approximate answer within a given time frame is better than waiting for an exact answer.
• If your use case does not require precise results and an approximate answer is acceptable, then the following techniques and algorithms will provide accurate approximations orders of magnitude faster, while requiring orders of magnitude less memory.
14. Probabilistic algorithms & sketches
• Computing certain analytic queries, such as user counts or web page view time, requires us to keep copies of every unique value encountered.
• To compute the exact number of unique visitors per day, you must keep on hand all the unique visitor records you have seen. Unique identifier counts are not additive either, so no amount of parallelism will help you.
• Probabilistic algorithms can provide approximate values, estimates, and random data samples for statistical analysis when the event stream is either too large to store in memory or moving too fast to process. Instead of keeping such enormous data on hand, we leverage algorithms that utilize small data structures known as data sketches, which are usually kilobytes in size.
15. Data sketches
• A central theme throughout most of these probabilistic data structures is
the concept of data sketches, which are designed to require only enough
of the data necessary to make an accurate estimation of the correct
answer.
• Typically, sketches are implemented as bit arrays or maps, thereby requiring memory on the order of kilobytes, making them ideal for the resource-constrained computing environments typically found on the edge.
• Sketching algorithms need to see each incoming item only once, and are therefore ideal for processing infinite streams of data.
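As a concrete illustration of the bit-array idea, here is a hypothetical, minimal Bloom-filter-style membership sketch (not any particular library; the salted-hashCode hashing is a simplification for the example). A BitSet of a thousand bits answers "have I possibly seen this item?" in constant space:

```java
import java.util.BitSet;

// Illustrative only: a tiny Bloom-filter-style sketch backed by a BitSet.
// k hash functions are simulated by salting the element's hashCode.
public class TinySketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public TinySketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    private int index(String item, int salt) {
        int h = (item.hashCode() * 31 + salt) & 0x7fffffff; // force non-negative
        return h % size;
    }

    public void add(String item) {
        for (int i = 0; i < hashes; i++) bits.set(index(item, i));
    }

    // May return false positives, never false negatives.
    public boolean mightContain(String item) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }
}
```

At 1,024 bits this sketch occupies well under a kilobyte, regardless of how many items flow through it.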
16. • Let's walk through a demonstration to show exactly what I mean by sketches, and to show that we do not need 100% of the data in order to make an accurate prediction of what a picture contains.
• How much of the data did you require to identify the main item in the picture?
Sketch Example
17. • Configurable Accuracy
• Sketches sized correctly can be 100% accurate
• Error rate is inversely proportional to size of a Sketch
• Fixed Memory Utilization
• Maximum Sketch size is configured in advance
• Memory cost of a query is thus known in advance
• Allows Non-additive Operations to be Additive
• Sketches can be merged into a single Sketch without over counting
• Allows tasks to be parallelized and combined later
• Allows results to be combined across windows of execution
Data sketch properties
18. Operations supported by sketches
• Theta Sketch (Count Distinct): When you're doing profiling at the router level, you often want to estimate functions of distinct IP addresses, since you can't simply maintain counters for each possible address. Theta Sketches enable us to answer questions about the number of unique users (set union), the number of users who did X and Y (set intersection), and the number of users who did X and did not do Y (set difference).
• Tuple Sketch (Group By): Tuple Sketches are ideal for summarizing attributes such as impressions or clicks. They also provide sufficient methods that a user could develop a wrapper class to facilitate approximate joins or other common database operations.
• Quantile Sketches (Distribution / Anomaly Detection): Consider a real data example of a stream of 230 million time-spent events collected from one of our systems over a period of just 30 minutes. Each event records the amount of time in milliseconds that a user spends on a web page before moving to a different web page by taking some action, such as a click. Calculate the distribution of this dataset, then determine where a given value lies within the distribution. Anything above the 99th percentile would be considered anomalous and flagged for action.
• Frequent Items Sketches (Top-K): Frequency estimation of Internet packet streams; top-10 tweets, queries, items sold, etc.
• Sampling (Approximate Query Processing): What is the ratio of…? What percentage of…? What is the average of…? Approximate query processing is a viable technique in these cases: a slightly less accurate result that is computed instantly is desirable, because most analysts are performing exploratory operations on the database and do not need precise answers. An approximate answer along with a confidence interval would suit most of these use cases.
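The quantile-based anomaly rule above can be sketched in plain Java. For brevity this computes an exact percentile over an in-memory sample; a real quantile sketch (e.g. from a library such as Apache DataSketches) would estimate the same value in bounded memory:

```java
import java.util.Arrays;

// Illustrative decision rule only: exact percentile on a small sample.
// A real quantile sketch estimates this without holding all values in memory.
public class PercentileRule {
    // Nearest-rank percentile of the given values (p in (0, 100]).
    public static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    // Flag values above the 99th percentile of the observed history.
    public static boolean isAnomalous(double value, double[] history) {
        return value > percentile(history, 99.0);
    }
}
```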
20. • Another common statistic is the frequency at which a specific element occurs within an endless data stream with repeated elements, which enables us to answer questions such as: "How many times has element X occurred in the data stream?" These types of answers are particularly useful in real-time event monitoring and analysis.
• Consider trying to analyze and sample the IoT sensor data for just a single industrial plant, which can produce millions of readings per second. There isn't enough time to perform the calculations or store the data.
• In such a scenario you can choose to forego an exact answer, which we will never be able to compute in time, for an approximate answer that is within an acceptable range of accuracy. The most popular algorithm for estimating event frequency is the Count-Min Sketch, which, as the name suggests, provides a sketch (approximation) of your data without actually storing the data itself.
Event frequency
21. • The Count-Min Sketch algorithm uses two elements:
• An M-by-K matrix of counters, each initialized to 0, where each of the M rows corresponds to one hash function
• A collection of M independent hash functions h(x)
• When an element is added to the sketch, each of the hash functions is applied to the element. These hash values are treated as column indexes, one per row, and the corresponding counters are incremented by 1.
• With an approximate count for every element we have seen stored in the M-by-K matrix, we can quickly determine how many times an element X has occurred previously in the stream: apply each of the hash functions to the element, retrieve the corresponding counters, and use the SMALLEST value in the list as the approximate event count.
Count-min sketch
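As a minimal, self-contained sketch of the algorithm just described (illustrative salted hashing only; production code would use a library implementation), each row of the matrix gets its own hash function, adds increment one counter per row, and the estimate is the smallest counter found:

```java
import java.util.Random;

// Minimal count-min sketch for illustration: a depth-by-width matrix of
// counters, with one salted hash function per row.
public class CountMin {
    private final long[][] table;
    private final int depth, width;
    private final int[] salts;

    public CountMin(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.salts = new int[depth];
        Random r = new Random(seed);
        for (int i = 0; i < depth; i++) salts[i] = r.nextInt();
    }

    private int col(String item, int row) {
        int h = (item.hashCode() ^ salts[row]) & 0x7fffffff; // non-negative
        return h % width;
    }

    // Increment the counter in every row at the item's hashed column.
    public void add(String item) {
        for (int row = 0; row < depth; row++) table[row][col(item, row)]++;
    }

    // The smallest counter across rows over-counts least, so it is the estimate.
    public long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][col(item, row)]);
        }
        return min;
    }
}
```

Note that the estimate can only over-count (hash collisions inflate counters), never under-count.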
22. Pulsar Function: Event frequency
!22
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import com.clearspring.analytics.stream.frequency.CountMinSketch;

public class CountMinFunction implements Function<String, Void> {
    CountMinSketch sketch = new CountMinSketch(20, 20, 128);

    public Void process(String input, Context context) throws Exception {
        sketch.add(input, 1); // Hashes the input and increments the matching counters
        long count = sketch.estimateCount(input);
        // React to the updated count
        return null;
    }
}
23. • Another common use of the Count-Min algorithm is maintaining lists of frequent items, commonly referred to as "Heavy Hitters". This design pattern retains a list of items that occur more frequently than some predefined value, e.g. a top-K list.
• The K-Frequency-Estimation problem can also be solved using the Count-Min Sketch algorithm. The logic for updating the counts is exactly the same as in the Event Frequency use case.
• However, there is an additional list of length K that is updated to retain the top-K elements seen so far.
K-Frequency-estimation, aka “Heavy Hitters”
24. • Each of the hash functions is applied to the element. These hash values are treated as indexes into the counter array, and the corresponding array elements are incremented by 1.
• Calculate the event frequency for the element as we did in the event frequency use case: apply each of the hash functions to the element and retrieve all of the corresponding array elements, as we did upon insertion. However, this time, rather than incrementing each of these array elements, we take the SMALLEST value in the list and use that as the approximate event count.
• Compare the calculated event frequency of this element against the smallest value in the top-K elements array, and if it is LARGER, remove the smallest value and replace it with the new element.
Pulsar Function: Top-K "Heavy Hitters"
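The top-K update rule in the steps above can be sketched in plain Java. For brevity an exact HashMap count stands in for the count-min estimate; the compare-against-the-smallest eviction logic is the same either way:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative top-K ("heavy hitters") maintenance: after each arrival,
// look up the element's estimated frequency and, if it beats the smallest
// entry in the current top-K, evict that entry and insert the new element.
public class TopKTracker {
    private final int k;
    private final Map<String, Long> counts = new HashMap<>(); // stand-in estimator
    private final Map<String, Long> topK = new HashMap<>();

    public TopKTracker(int k) { this.k = k; }

    public void offer(String item) {
        long c = counts.merge(item, 1L, Long::sum); // frequency after this arrival
        if (topK.containsKey(item) || topK.size() < k) {
            topK.put(item, c);
            return;
        }
        // Find the current smallest member of the top-K list.
        String minKey = null;
        long minVal = Long.MAX_VALUE;
        for (Map.Entry<String, Long> e : topK.entrySet()) {
            if (e.getValue() < minVal) { minVal = e.getValue(); minKey = e.getKey(); }
        }
        // Evict it only if the new element's frequency is larger.
        if (c > minVal) {
            topK.remove(minKey);
            topK.put(item, c);
        }
    }

    public Map<String, Long> topK() { return topK; }
}
```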
Pulsar Function: Top-K "Heavy Hitters"
import java.util.List;

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import com.clearspring.analytics.stream.Counter;
import com.clearspring.analytics.stream.StreamSummary;

public class TopKFunction implements Function<String, Void> {
    StreamSummary<String> summary = new StreamSummary<String>(256);

    public Void process(String input, Context context) throws Exception {
        // Add the element to the sketch
        summary.offer(input, 1);
        // Grab the updated top 10
        List<Counter<String>> topK = summary.topK(10);
        return null;
    }
}
27. • A network of smart meters enables utility companies to gain greater visibility into their customers' energy consumption. With a network of smart meters, utility companies can monitor demand in real time and:
• Increase or decrease energy generation to meet the demand.
• Implement dynamic notifications to encourage consumers to use less energy during peak demand periods.
• Provide real-time revenue forecasts to senior business leaders.
• Identify faulty meters and schedule maintenance calls to repair them.
Identifying real-time energy consumption patterns
28. Smart meter analytics flow
• Decode data: decodes readings from their binary format.
• Validate Meter Reading: sanity check on each reading.
• Cumulative Reading Aggregation: sum of all usage in a rolling 5-minute window, feeding Total Energy Demand.
• Event Frequency: count of occurrences by meter ID, feeding Top-K Users (identify the top-K users by meter ID).
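The rolling 5-minute "Total Energy Demand" stage can be sketched as follows. This is an illustrative exact in-memory window, not Pulsar's windowing support: keep (timestamp, value) pairs for the last five minutes and evict expired ones as new readings arrive.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative rolling-window aggregation: a 5-minute sum over a stream of
// (timestampMs, value) readings, assuming readings arrive in time order.
public class RollingWindowSum {
    private static final long WINDOW_MS = 5 * 60 * 1000;
    private final Deque<long[]> readings = new ArrayDeque<>(); // {timestampMs, value}
    private long sum = 0;

    // Add a reading and return the updated 5-minute sum.
    public long add(long timestampMs, long value) {
        readings.addLast(new long[] { timestampMs, value });
        sum += value;
        // Evict readings that have fallen out of the 5-minute window.
        while (!readings.isEmpty() && readings.peekFirst()[0] <= timestampMs - WINDOW_MS) {
            sum -= readings.removeFirst()[1];
        }
        return sum;
    }
}
```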
31. • IoT Analytics is an extremely complex problem, and
modern streaming platforms are not well suited to
solving this problem.
• Apache Pulsar Edge provides a platform for
implementing distributed analytics on the edge to
decrease the data capture time.
• Apache Pulsar Functions allows you to leverage existing probabilistic analysis techniques to provide approximate values within an acceptable degree of accuracy, thereby reducing the analysis time.
• Both techniques allow you to act upon your data
while the business value is still high.
Summary & Review