This document provides an introduction to the WSO2 Analytics Platform. It discusses how the platform allows users to collect data from various sources using a sensor API, then perform analysis on the data through both batch and real-time means. Batch analysis uses technologies like Apache Spark and Hadoop to perform tasks like finding averages, max/min, and building KPIs. Real-time analysis uses complex event processing to run queries over streaming data and detect patterns. The platform also enables predictive analytics using machine learning algorithms and anomaly detection. Results are then communicated through dashboards and alerts.
1. An Introduction to the WSO2 Analytics Platform
Srinath Perera
VP Research WSO2, Apache Member
(@srinath_perera)
srinath@wso2.com
4. Collect Data
One Sensor API to publish
events
- REST, Thrift, Java, JMS,
Kafka
- Java clients, JavaScript clients*
First you define streams
(think of a stream as an infinite
table in a SQL DB)
Then publish events via the
Sensor API
6. Collecting Data: Example
Java example: create and send events
Events are sent asynchronously
See client given in http://goo.gl/vIJzqc for more info
// Initialize stream
Agent agent = new Agent(agentConfiguration);
publisher = new AsyncDataPublisher("tcp://hostname:7612", .. );
// Define stream
StreamDefinition definition = new StreamDefinition(STREAM_NAME, VERSION);
definition.addPayloadData("sid", STRING);
...
publisher.addStreamDefinition(definition);
...
// Send events
Event event = new Event();
event.setPayloadData(eventData);
publisher.publish(STREAM_NAME, VERSION, event);
7. Data Collection Examples
• Collect data from inbuilt agents in
WSO2 products, Tomcat etc.
• Collecting your log data via Logstash
• Collecting JVM and JMX stats via agent
• Ingesting data from message queues
such as JMS or Kafka
• Pulling data from an RSS feed, or
scraping a web page
• Write a custom agent to collect data
from your system and push it to DAS
Photo credit http://www.torange.us/ CC license
8. Analysis: Batch Analytics
• Batch analytics reads data from disk (or some other
storage) and processes it record by record
• “MapReduce” is the most widely used technology for batch
analytics
– Apache Hadoop
– Apache Spark: up to 30X faster than Hadoop MapReduce and much more flexible
• Analytics (min, max, average, correlation, histograms; may
join or group data in many ways)
• Key Performance Indicators (KPIs)
– E.g. profit per square foot for retail
• Presented as a Dashboard
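The kind of record-by-record aggregation described on this slide can be sketched in plain Java. This is only an illustration of the computation, not how Spark or Hadoop execute it; the `Sale` type, its fields, and the sample figures are hypothetical, and the floor area is assumed to be constant per store.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class BatchKpiSketch {
    /** Hypothetical input record: one profit entry for a store. */
    public static final class Sale {
        private final String store;
        private final double profit;
        private final double floorAreaSqFt;

        public Sale(String store, double profit, double floorAreaSqFt) {
            this.store = store;
            this.profit = profit;
            this.floorAreaSqFt = floorAreaSqFt;
        }

        public String store() { return store; }
        public double profit() { return profit; }
        public double floorAreaSqFt() { return floorAreaSqFt; }
    }

    /** Min, max, and average profit per store (a "group by" followed by aggregates). */
    public static Map<String, DoubleSummaryStatistics> profitStats(List<Sale> sales) {
        return sales.stream().collect(Collectors.groupingBy(
                Sale::store, TreeMap::new, Collectors.summarizingDouble(Sale::profit)));
    }

    /** KPI: total profit per store divided by that store's floor area. */
    public static Map<String, Double> profitPerSquareFoot(List<Sale> sales) {
        Map<String, Double> totalProfit = new TreeMap<>();
        Map<String, Double> area = new TreeMap<>();
        for (Sale s : sales) {
            totalProfit.merge(s.store(), s.profit(), Double::sum);
            area.put(s.store(), s.floorAreaSqFt());
        }
        Map<String, Double> kpi = new TreeMap<>();
        for (String store : totalProfit.keySet()) {
            kpi.put(store, totalProfit.get(store) / area.get(store));
        }
        return kpi;
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
                new Sale("A", 1000.0, 500.0),
                new Sale("A", 3000.0, 500.0),
                new Sale("B", 600.0, 200.0));
        System.out.println(profitStats(sales));
        System.out.println(profitPerSquareFoot(sales));
    }
}
```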
9. SQL-like Queries: Spark SQL
Since many people understand SQL, Hive made
large-scale data processing (Big Data)
accessible to many
Expressive, short, and sweet.
Defines core operations that cover 90%
of problems
Lets experts dig in when they like! (via
User Defined Functions)
insert overwrite table BusSpeed
select getHour(ts) as hour, avg(v) as avgV, busID
from BusStream group by busID, getHour(ts);
10. Spark SQL Query
Count the entries where the username is not empty, grouped by
username and ordered by the count
SELECT username, COUNT(*) AS count FROM wikiData WHERE
username <> '' GROUP BY username ORDER BY count DESC
LIMIT 10
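To make the semantics of the query above concrete, here is a rough plain-Java equivalent of its filter, group, count, sort, and limit steps. It does not use Spark; the class and method names are hypothetical, and breaking ties by username is an assumption the SQL does not state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopUsersSketch {
    /** Returns up to `limit` (username, count) pairs, highest count first. */
    public static List<Map.Entry<String, Long>> topUsers(List<String> usernames, int limit) {
        Map<String, Long> counts = usernames.stream()
                .filter(u -> u != null && !u.isEmpty())                          // WHERE username <> ''
                .collect(Collectors.groupingBy(u -> u, Collectors.counting()));  // GROUP BY username
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()    // ORDER BY count DESC
                        .thenComparing(Map.Entry.comparingByKey()))              // assumed tie-break
                .limit(limit)                                                    // LIMIT
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> edits = Arrays.asList("ann", "", "bob", "ann", "cat", "bob", "ann");
        System.out.println(topUsers(edits, 10));
    }
}
```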
11. Use Case: API Usage
• Looks at API calls broken down by country
• Designed to draw attention to which APIs are used and where
12. Value of Some Insights Degrades
Fast!
For some use cases (e.g. stock
markets, traffic, surveillance, patient
monitoring) the value of insights
degrades very quickly with time.
We need technology that can produce
outputs fast
Static queries that need very fast output
(alerts, real-time control)
Dynamic and interactive queries (data
exploration)
14. CEP Queries 1
Calculate the average temperature over a 1-minute sliding window,
grouped by roomNo
define stream TempStream (roomNo string, temp double);
from TempStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
group by roomNo
insert all events into AvgRoomTempStream ;
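What the windowed query above computes can be sketched in plain Java as follows. This is not the CEP engine's implementation: the class name and the explicit timestamp parameter are assumptions, and this sketch only expires old events when a new event arrives for the same room.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SlidingAverageSketch {
    private static final class Reading {
        final long timestampMs;
        final double temp;

        Reading(long timestampMs, double temp) {
            this.timestampMs = timestampMs;
            this.temp = temp;
        }
    }

    private final long windowMs;
    // One time-ordered window of readings per room (the "group by roomNo").
    private final Map<String, Deque<Reading>> windows = new HashMap<>();

    public SlidingAverageSketch(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Adds one event and returns the current average temperature for its room. */
    public double onEvent(String roomNo, double temp, long timestampMs) {
        Deque<Reading> window = windows.computeIfAbsent(roomNo, k -> new ArrayDeque<>());
        window.addLast(new Reading(timestampMs, temp));
        // Drop readings that have fallen out of the time window.
        while (window.peekFirst().timestampMs <= timestampMs - windowMs) {
            window.removeFirst();
        }
        return window.stream().mapToDouble(r -> r.temp).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        SlidingAverageSketch avg = new SlidingAverageSketch(60_000);
        System.out.println(avg.onEvent("r1", 20.0, 0));
        System.out.println(avg.onEvent("r1", 30.0, 30_000));
        System.out.println(avg.onEvent("r1", 40.0, 70_000));
    }
}
```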
15. CEP Queries 2
Using data from a football game
The kick stream shows kicks on the ball by players
Ball possession is a hit by me, followed by any number of hits by me,
followed by a hit by someone else
from every k1 = KickStream,
KickStream[playerid == k1.playerid]*,
KickStream[playerid != k1.playerid]
select ..
insert into BallPosessionStream;
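The possession pattern above can be illustrated with a small plain-Java pass over a list of kicks. This is a simplified sketch, not the CEP engine's matching semantics: it emits one non-overlapping run per possession, whereas `every` in the query can start a match at each kick. The class and method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class BallPossessionSketch {
    /**
     * Takes player ids in kick order and returns (playerId, hits) for each
     * possession that was ended by a kick from a different player.
     */
    public static List<Map.Entry<String, Integer>> detect(List<String> kickerIds) {
        List<Map.Entry<String, Integer>> possessions = new ArrayList<>();
        String current = null;
        int hits = 0;
        for (String id : kickerIds) {
            if (id.equals(current)) {
                hits++;                       // another hit by the same player
            } else {
                if (current != null) {        // hit by someone else closes the run
                    possessions.add(new AbstractMap.SimpleEntry<>(current, hits));
                }
                current = id;
                hits = 1;
            }
        }
        // The last run is still open (no kick by someone else yet), so it is not emitted.
        return possessions;
    }

    public static void main(String[] args) {
        System.out.println(detect(Arrays.asList("p1", "p1", "p2", "p1")));
    }
}
```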
16. People Tracking via BLE
• Track people through BLE via
triangulation
• Higher level logic via Complex
Event Processing
• Traffic Monitoring
• Smart retail
• Airport management
18. Scaling CEP Queries on top of Storm
▪ Accepts CEP queries with hints about how to partition streams
▪ Partition streams, build an Apache Storm topology running CEP nodes as Storm
Spouts, and run it. See http://goo.gl/pP3kdX for more info.
19. CEP Queries on Storm
@dist(parallel='4') asks to run the query on 4 nodes
Use a partition definition to split the data so the queries can run in parallel
define partition on TempStream.region {
@dist(parallel='4')
from TempStream[temp > 33]
insert into HighTempStream;
}
from HighTempStream#window.time(1 hour)
select max(temp) as max
insert into HourlyMaxTempStream;
20. Interactive Analytics
The best way to explore data is by
asking ad-hoc questions
Interactive Analytics (Search)
lets you query the system and
receive fast results (<10s)
Shows data in context (e.g. by
grouping events from the
same transaction together)
Built using Lucene-based
indexes.
SparkSQL> SELECT * FROM TWITTER_DATA
21. Predictive Analytics
Can you “write a program to drive a car”?
Machine learning
Takes in a lot of examples and builds a program
that matches those examples
We call that program a “model”
Lots of tools
- R (statistical language)
- scikit-learn (Python)
- Apache Spark’s MLBase and Apache Mahout
(Java)
22. Predictive Analytics in DAS
• Building models
– With the WSO2 Machine
Learner product via a
wizard (powered by
MLlib)
– Build models using R and
export them as PMML
• Built models can be used
with both WSO2 CEP
and ESB
23. Using the Model
Within CEP
from InputStream#ml:predict('/../diabetes-model', 'double')
select *
insert into PredictionStream;
Within ESB
<predict>
<model storage-location="../downloaded-ml-model"/>
<features>
<feature name="SI2" expression="$body/features/SI2"/>
..
</features>
<predictionOutput property="result"/>
</predict>
24. WSO2 Machine Learner
• Upload or select data
• Explore the data
• Train a Machine learning
model
26. Supported Algorithms
• Deep learning based classification (H2O’s Stacked Autoencoders
Classifier)
• Classification and regression algorithms - Decision Trees, Linear Regression, Lasso
Regression, SVM, Naïve Bayes
• K-Means clustering for unsupervised learning on your data
• Anomaly detection using the K-Means algorithm to identify
fraud, network penetration, and other difficult scenarios
• Recommendation engine (Collaborative Filtering algorithm)
27. Predict Wait Time in the Airport
• Predicting the time
to go through the
airport
• Real-time updates
and events to
passengers
• Lets the airport manage
by allocating resources
28. Predict Promising Customers
• A typical website can get millions of users
• Only a very small fraction converts
• For each user, we know what they access, where they
work, country, browser, OS, etc.
• The problem is to predict which users will convert
• Used Logistic Regression, Random Forest,
Survival Modeling, etc.
29. Predict Super Bowl
• Predicted 7 of the 11
games
• Done with Random
Forest Algorithm
• Even the ones we missed
are instructive
See Yuda’s post: Predicting the Super Bowl with Machine Learning
30. Anomaly Detection: Markov Models
• Can model the probability
of a sequence
• Given a sequence, can
predict likelihood, and
use that to detect
anomalies.
• Implemented with
WSO2 CEP
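As a rough illustration of the idea on this slide, the sketch below learns first-order transition probabilities from example sequences and flags a sequence whose average log-likelihood per transition falls below a threshold. It is plain Java rather than the WSO2 CEP implementation; the class name, the threshold parameter, and the lack of smoothing (unseen transitions get probability 0) are all assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MarkovAnomalySketch {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    /** Counts state-to-state transitions in one training sequence. */
    public void train(List<String> sequence) {
        for (int i = 0; i + 1 < sequence.size(); i++) {
            String from = sequence.get(i);
            String to = sequence.get(i + 1);
            counts.computeIfAbsent(from, k -> new HashMap<>()).merge(to, 1, Integer::sum);
            totals.merge(from, 1, Integer::sum);
        }
    }

    /** Estimated P(to | from); 0 for transitions never seen in training. */
    public double transitionProbability(String from, String to) {
        int total = totals.getOrDefault(from, 0);
        if (total == 0) {
            return 0.0;
        }
        return counts.get(from).getOrDefault(to, 0) / (double) total;
    }

    /** Sum of log transition probabilities (the first state's prior is ignored). */
    public double logLikelihood(List<String> sequence) {
        double logLikelihood = 0.0;
        for (int i = 0; i + 1 < sequence.size(); i++) {
            logLikelihood += Math.log(transitionProbability(sequence.get(i), sequence.get(i + 1)));
        }
        return logLikelihood;
    }

    /** Flags a sequence whose average log-likelihood per transition is below the threshold. */
    public boolean isAnomalous(List<String> sequence, double minAvgLogLikelihood) {
        int transitions = sequence.size() - 1;
        if (transitions <= 0) {
            return false;
        }
        return logLikelihood(sequence) / transitions < minAvgLogLikelihood;
    }

    public static void main(String[] args) {
        MarkovAnomalySketch model = new MarkovAnomalySketch();
        model.train(Arrays.asList("A", "B", "A", "B", "C"));
        System.out.println(model.isAnomalous(Arrays.asList("A", "B", "C"), Math.log(0.4)));
        System.out.println(model.isAnomalous(Arrays.asList("C", "A"), Math.log(0.4)));
    }
}
```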
31. Anomaly Detection: Clustering
• Use clustering to identify
normal behavior as clusters
• Consider points away from
all clusters as anomalies.
• A point is considered away
from a cluster if it is
outside the 99th percentile line
for that cluster
• Included in WSO2 ML
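The percentile rule above can be sketched in plain Java. This is an illustration only, not the WSO2 ML implementation: it assumes the cluster centroids are already known (for example from K-Means), uses Euclidean distance, and computes each cluster's percentile with the nearest-rank method over the training points assigned to it. The class name and constructor parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ClusterAnomalySketch {
    private final double[][] centroids;
    private final double[] thresholds;   // per-cluster distance at the given percentile

    public ClusterAnomalySketch(double[][] centroids, double[][] trainingPoints, double percentile) {
        this.centroids = centroids;
        this.thresholds = new double[centroids.length];
        List<List<Double>> distances = new ArrayList<>();
        for (int c = 0; c < centroids.length; c++) {
            distances.add(new ArrayList<>());
        }
        // Assign each training point to its nearest centroid and record the distance.
        for (double[] point : trainingPoints) {
            int nearest = nearestCentroid(point);
            distances.get(nearest).add(distance(point, centroids[nearest]));
        }
        // Nearest-rank percentile of the distances within each cluster.
        for (int c = 0; c < centroids.length; c++) {
            List<Double> sorted = distances.get(c);
            Collections.sort(sorted);
            if (sorted.isEmpty()) {
                thresholds[c] = 0.0;
            } else {
                int rank = Math.max(1, (int) Math.ceil(percentile * sorted.size()));
                thresholds[c] = sorted.get(rank - 1);
            }
        }
    }

    /** A point is an anomaly if it lies outside the percentile line of every cluster. */
    public boolean isAnomaly(double[] point) {
        for (int c = 0; c < centroids.length; c++) {
            if (distance(point, centroids[c]) <= thresholds[c]) {
                return false;
            }
        }
        return true;
    }

    private int nearestCentroid(double[] point) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (distance(point, centroids[c]) < distance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}};
        double[][] training = {{0, 1}, {1, 0}, {0, 2}, {10, 11}, {12, 10}};
        ClusterAnomalySketch detector = new ClusterAnomalySketch(centroids, training, 0.99);
        System.out.println(detector.isAnomaly(new double[] {5, 5}));
    }
}
```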
32. Communicate: Dashboards
• Dashboards give an “overall idea”
at a glance (e.g. car dashboard)
– Boring when everything is good!!
• Build your own dashboard.
– WSO2 DAS supports a gadget
generation wizard
– You can write your own gadgets
using D3 and JavaScript.
33. Gadget Generation Wizard
• Starts with data in tabular format
• Map each column to a dimension in your plot
such as X, Y, color, point size, etc.
• Create a chart with a few clicks
Powered by the VizGrammar library, which uses Vega underneath
(see https://github.com/wso2/VizGrammar)
34. Communicate: Alerts
▪ Done with CEP Queries
▪ Last Mile
- Email, SMS
- Push notifications to a UI
- Pager
- Trigger physical Alarm
35. Real Life Use Cases
▪ Cisco ( OEM the platform with Cisco
solutions, Health, Smart Parking)
▪ Experian ( Digital Marketing) - see video
▪ Pacific Controls ( Smart City Platform, Vehicle
tracking, building monitoring) - see video
▪ Throttling and Anomaly Detection (by a group
of telco companies)
▪ API Analytics (13+ customers)
No battle plan survives
contact with the enemy
--Helmuth von Moltke
36. Key Differentiators
• Open Source, under Apache 2 license
• Publish data once, analyze it any way you like
• Flexible packaging or as a scalable cluster
• Rich, extensible, SQL-like configuration language
• Compact, easy to learn syntax addressing complex
requirements, such as time windows, patterns,
sequences which would be complex to develop in a
programming language such as Java.
• Rich set of data connectors, which can be easily
extended
37. More Information
▪ Introducing WSO2 Analytics Platform: Note for Architects,
https://iwringer.wordpress.com/2015/03/18/introducing-wso2-analytics-platform-note-for-architects/
▪ WSO2 Data Analytics Server, http://wso2.com/products/data-analytics-server/
▪ WSO2 Complex Event Processor,
http://wso2.com/products/complex-event-processor/
▪ WSO2 Machine Learner, http://wso2.com/products/machine-learner/