Large-scale data processing analyses and makes sense of large amounts of data. Although the field itself is not new, it is finding many use cases under the theme "Big Data", where Google, IBM Watson, and Google's driverless car are some of the success stories. Spanning many fields, large-scale data processing brings together technologies like distributed systems, machine learning, statistics, and the Internet of Things. It is a multi-billion-dollar industry with use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like the Internet of Things (IoT), these use cases are expanding to scenarios like smart cities, smart health, and smart agriculture. Some use cases, like urban planning, can tolerate slow answers and are done in batch mode, while others, like stock markets, need results within milliseconds and are done in a streaming fashion. There are different technologies for each case: MapReduce for batch processing, and Complex Event Processing and Stream Processing for real-time use cases. Furthermore, the types of analysis range from basic statistics like the mean to complicated prediction models based on machine learning. In this talk, we will discuss the data processing landscape: concepts, use cases, technologies, and open questions, while drawing examples from real-world scenarios.
http://icter.org/conference/invited_speeches
ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from Analytics to Predictions
2. e.g. Targeted Marketing
• Assume mass emails to 1M people, with a reaction rate of 1% and a cost of $2 per email => a cost of $2M and a reach of 10k people.
• Let's say that by looking at demographics (e.g. where they live, using decision tables), you can find 250K people with a reaction rate of 6% => a cost of $500K and a reach of 15k people.
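The arithmetic on this slide can be checked with a short script; the numbers come from the slide, while the helper function is only illustrative:

```python
def campaign(audience, reaction_rate, cost_per_email=2.0):
    """Return (total cost in $, number of people reached)."""
    return audience * cost_per_email, round(audience * reaction_rate)

# Mass mailing: 1M people at a 1% reaction rate
mass_cost, mass_reach = campaign(1_000_000, 0.01)   # $2M, 10k people reached

# Targeted mailing: 250K people selected by demographics, 6% reaction rate
tgt_cost, tgt_reach = campaign(250_000, 0.06)       # $500K, 15k people reached
```

The targeted campaign costs a quarter as much yet reaches more responsive people, which is the whole argument for targeting.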
3. A Day in Your Life
Think about a day in your life:
– What is the best road to take?
– Will there be any bad weather?
– How should I invest my money?
– How is my health?
There are many decisions you could make better, if only you could access the data and process them.
http://www.flickr.com/photos/kcolwell/5512461652/ (CC licence)
5. Internet of Things
• Currently the physical world and the software world are detached
• The Internet of Things promises to bridge this gap
– It is about sensors and actuators everywhere
– In your fridge, in your blanket, in your chair, in your carpet... yes, even in your socks
– Umbrellas that light up when rain is expected, and smart medicine cups
6. What Can We Do with Big Data?
• Optimize (the world is inefficient)
– 30% of food is wasted from farm to plate
– GE's "Save 1%" initiative (http://goo.gl/eYC0QE )
• In trains => $2B/year
• In US healthcare => $20B/year
• In contrast, Sri Lanka's total exports are $9B/year
• Save lives
– Weather forecasts, disease identification, personalized treatment
• Advance technology
– Most high-tech research is done via simulations
9. Hindsight: Batch Processing
• The programming model is MapReduce
– Apache Hadoop
– Apache Spark
• Lots of tools are built on top
– Hive and Shark (SQL-style queries), Mahout (ML), Giraph (graph processing)
• Store first, then process
• Slow (> 5 minutes to get results for a reasonable use case)
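As a sketch of the MapReduce model named above, here is word count in plain Python; real Hadoop or Spark jobs distribute the map, shuffle, and reduce phases across machines, and the function names here are only illustrative:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(values) for word, values in groups.items()}

lines = ["big data big deal", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {'big': 3, 'data': 2, 'deal': 1}
```

The appeal of the model is that map and reduce are pure functions over key-value pairs, so the framework can parallelize them freely.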
10. Use Case: Targeted Advertising
• Analytics implemented with MapReduce or queries
– Min, max, average, correlation, histograms
– Might join or group data in many ways
– Heatmaps, temporal trends
• Key Performance Indicators (KPIs)
– Average time for a ticket in customer service interactions
– Profit per square foot in retail
11. Real-time Analytics
• The idea is to process data as they arrive, in a streaming fashion (without storing them first)
• Used when we need
– Very fast output (milliseconds)
– To handle lots of events (a few 100k to millions)
• Two main technologies
– Stream Processing (e.g. Apache Storm, http://storm-project.net/ )
– Complex Event Processing (CEP), e.g. WSO2 CEP
define partition "playerPartition" as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
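The query above keeps a sliding 1-minute window of events per player and emits the average speed. A minimal sketch of the same idea in plain Python, assuming events carry a timestamp in seconds (class and field names are illustrative):

```python
from collections import deque

class WindowedAvg:
    """Average of values seen in the last `window` seconds, kept per key."""
    def __init__(self, window=60.0):
        self.window = window
        self.events = {}          # pid -> deque of (timestamp, speed)

    def add(self, pid, timestamp, speed):
        q = self.events.setdefault(pid, deque())
        q.append((timestamp, speed))
        # evict events that have fallen out of the time window
        while q and timestamp - q[0][0] > self.window:
            q.popleft()
        return sum(s for _, s in q) / len(q)

w = WindowedAvg(window=60.0)
a1 = w.add("p1", 0.0, 10.0)    # only one event: avg 10.0
a2 = w.add("p1", 30.0, 20.0)   # (10 + 20) / 2 = 15.0
a3 = w.add("p1", 90.0, 30.0)   # first event evicted: (20 + 30) / 2 = 25.0
```

A CEP engine does this per partition key automatically and at much higher event rates; the deque simply makes the windowing logic explicit.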
13. Sketch Algorithms
• Data structures that can count millions of entries with a few KB of memory
– Provide approximate answers
– E.g. Count-Min Sketch, Bloom filters
• Use cases
– Counting items
– Point estimates, range sums, heavy hitters, quantiles, number of distinct elements
– Graph summaries
– Linear-algebra problems such as approximating matrix products, least-squares approximation, and SVD
See https://sites.google.com/site/algoresearch/datastreamalgorithms
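As a sketch of one of the structures named above, a Count-Min Sketch keeps a small table of counters and answers frequency queries approximately, only ever overestimating; the width, depth, and hashing scheme here are illustrative choices:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in fixed memory; never underestimates."""
    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # one independent-ish hash per row, derived from md5
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += 1

    def count(self, item):
        # true count <= estimate; hash collisions only inflate counters
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(100):
    cms.add("apple")
cms.add("pear")
apple_est = cms.count("apple")   # at least 100, usually exactly 100
```

Memory is fixed at width × depth counters no matter how many distinct items stream past, which is the point of sketching.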
14. Curious Case of Missing Data
• WW II: aircraft that returned, and data on where they were hit
• Where would you add armour?
http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
15. Challenges: Causality
• Correlation does not imply causality! (see the "send a book home" example [1])
• To establish causality
– Do repeated experiments with identical test subjects
– If you can't, do a randomized (A/B) test
– With big data, we often cannot do either
• Option 1: We can act on a correlation if we can verify the guess, or if correctness is not critical (starting an investigation, checking for a disease, marketing)
• Option 2: We verify correlations using A/B testing or propensity analysis [2]
[1] http://www.freakonomics.com/2008/12/10/the-blagojevich-upside/
[2] https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to/
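The A/B testing in Option 2 can be sketched with a two-proportion z-test; this is a minimal illustration with made-up conversion numbers, not the propensity analysis the slide also mentions:

```python
from math import sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant B converted 120/1000 visitors vs A's 100/1000
z = two_proportion_z(100, 1000, 120, 1000)
significant = abs(z) > 1.96   # ~5% two-sided significance threshold
```

With these numbers z is about 1.43, below the 1.96 cutoff, so the apparent lift could plausibly be noise; a real test would also fix the sample size in advance.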
16. Insight (Understanding Why?)
• Pattern mining – find frequent associations (e.g. market baskets) and frequent sequences
• Clustering
• Graph analysis
• Knowledge discovery
• Correlations between features, and finding principal components
• Simulations, complex-system modeling, matching a statistical distribution
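The market-basket pattern mining mentioned above can be sketched as counting co-occurring item pairs; this is a naive single-pass version (real miners such as Apriori or FP-Growth prune the search space):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support=2):
    """Item pairs that co-occur in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives a canonical order so (a, b) and (b, a) merge
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}

baskets = [
    ["bread", "milk", "butter"],
    ["bread", "milk"],
    ["milk", "diapers"],
    ["bread", "butter"],
]
result = frequent_pairs(baskets)
# {('bread', 'butter'): 2, ('bread', 'milk'): 2}
```

Frequent pairs like these are then turned into association rules ("customers who buy bread often buy butter") with confidence and lift scores.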
17. Use Case: Big Data for Development in Sri Lanka
• Done using CDR (call detail record) data
• People density at 1pm vs midnight (red => increased, blue => decreased)
• Urban planning
– People distribution
– Mobility
– Waste management
– E.g. see http://goo.gl/jPujmM
From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
20. Use Case: Predictive Maintenance
• The idea is to fix the problem before it breaks, avoiding expensive downtime
– Airplanes, turbines, windmills
– Construction equipment
– Cars, golf carts
• How
– Build a model of normal operation and measure deviation from it
– Match against known error patterns
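The "model normal operation and measure deviation" step can be sketched with a simple mean and standard-deviation model plus a 3-sigma threshold; the readings and threshold here are illustrative, and real systems model many signals at once:

```python
from math import sqrt

def fit_normal(readings):
    """Model 'normal operation' as the mean and standard deviation."""
    n = len(readings)
    mean = sum(readings) / n
    var = sum((x - mean) ** 2 for x in readings) / n
    return mean, sqrt(var)

def is_anomaly(x, mean, std, threshold=3.0):
    # flag readings more than `threshold` standard deviations from normal
    return abs(x - mean) > threshold * std

# Vibration readings from a machine known to be healthy (illustrative)
normal = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.0]
mean, std = fit_normal(normal)
is_anomaly(10.1, mean, std)   # False: within the normal band
is_anomaly(13.5, mean, std)   # True: large deviation, schedule maintenance
```

Matching against known error patterns would add a second check: compare the deviating signal against a library of labeled failure signatures.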
21. Challenges: Selecting the Best Algorithm for a Problem
• Types of data: categorical (C), numerical (N)
– N -> N: regression
– C -> C: decision trees
– N -> C: SVM
• Amount of data
• Required accuracy
• Required interpretability
• Kind of underlying function
See Skytree: Choosing The Right Machine Learning Methods, https://www.youtube.com/watch?v=qMUpc10VsmA
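As an instance of the N -> N case, ordinary least-squares regression fits a line to numeric data; a minimal sketch with illustrative data:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx    # slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 7.9]    # roughly y = 2x with noise
a, b = fit_line(xs, ys)      # slope close to 2, small intercept
```

For the C -> C and N -> C cases one would instead reach for a decision-tree or SVM library; the point of the slide is that the data types narrow the menu.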
22. Challenges: Feature Engineering
• In ML, feature engineering is the key [1].
• You need good features to form a kernel; then you can solve the problem with less data.
• Deep learning can learn the best features (and combinations of them) via semi-supervised or unsupervised learning [2]
1. Bekkerman’s talk https://www.youtube.com/watch?v=wjTJVhmu1JM
2. Deep Learning, http://cl.naist.jp/~kevinduh/a/deep2014/
24. Challenges: Updating Models
● Incorporate more data
o We get more data over time
o We get feedback about the effectiveness of decisions (e.g. the accuracy of fraud detection)
o Trends change
● Track and update the model
o Generate models in batch mode and update them periodically
o Streaming (online) ML, which is an active research topic
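Streaming (online) ML updates the model one example at a time instead of re-training in batch. A minimal sketch using stochastic gradient descent on a linear model; the learning rate and data are illustrative:

```python
class OnlineLinearModel:
    """y ≈ w*x + b, updated one example at a time (SGD)."""
    def __init__(self, lr=0.05):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def update(self, x, y):
        # one gradient step on the squared error of this single example
        err = (self.w * x + self.b) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err

    def predict(self, x):
        return self.w * x + self.b

model = OnlineLinearModel(lr=0.05)
for _ in range(200):                     # replay a tiny stream of y = 2x data
    for x, y in [(1, 2), (2, 4), (3, 6)]:
        model.update(x, y)
pred = model.predict(4)                  # close to 8 after enough updates
```

Because each update touches only one example, the model can keep learning as new data and feedback arrive, which is exactly the updating problem the slide describes.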
25. Challenges: Scaling ML Algorithms
• With more data we can
– Build more accurate and detailed models [1]
• Scale => distributed systems
• Need to build new algorithms, adapt existing ones, or use other methods
– Sampling
– Scalable versions of algorithms (e.g. decision trees, neural networks)
[1] P. Domingos, A Few Useful Things to Know about Machine Learning
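The sampling approach mentioned above can be sketched with reservoir sampling, which draws a uniform random sample from a stream of unknown length in one pass and O(k) memory:

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(1_000_000), 10)  # 10 items, one pass
```

A model trained on such a sample is often accurate enough, and the sample fits on one machine even when the full data set does not.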
26. Challenges: Lack of Labeled Data
• Most data is not labeled
• The idea of semi-supervised learning:
• Provide data + examples + an ontology, and the algorithm finds new patterns
– Lots of data
– A few example sentences
• Often uses the Expectation-Maximization (EM) algorithm
Watch Tom Mitchell's lecture: https://www.youtube.com/watch?v=psFnHkIjHA0
Ontology: People, Cities
Relationships: like, dislike, live in
Examples: Bob (People) lives in Colombo (City)
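The EM algorithm mentioned above can be sketched on a toy case: estimating the means of a 50/50 mixture of two unit-variance Gaussians from unlabeled points. The data and initial guesses are illustrative:

```python
from math import exp, sqrt, pi

def gauss(x, mu, sigma=1.0):
    # density of a Gaussian with mean mu and standard deviation sigma
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def em_two_gaussians(data, mu1, mu2, iters=20):
    """EM for a 50/50 mixture of two unit-variance Gaussians (means only)."""
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = [gauss(x, mu1) / (gauss(x, mu1) + gauss(x, mu2)) for x in data]
        # M-step: re-estimate each mean as a responsibility-weighted average
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2

data = [0.1, -0.2, 0.3, 0.0, 5.1, 4.8, 5.2, 4.9]   # two obvious clusters
mu1, mu2 = em_two_gaussians(data, mu1=0.5, mu2=4.0)
# mu1 settles near 0.05, mu2 near 5.0
```

The same E-step/M-step loop underlies the semi-supervised systems the slide refers to, with the handful of labeled examples pinning down what the components mean.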