E.g. Targeted Marketing
• Assume mass emails to
– 1M people, with a reaction rate of 1% and a cost of $2 per email => cost of $2M and a reach of 10K people.
• Let's say that by looking at demographics (e.g. where they live, using decision tables), you can find
– 250K people with a reaction rate of 6% => cost of $500K and a reach of 15K people.
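The arithmetic on this slide can be checked with a short sketch. The `Campaign` class and its field names are hypothetical, introduced only for illustration; the numbers are the ones from the slide.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    """Hypothetical helper: one email campaign from the slide."""
    recipients: int
    reaction_rate: float   # fraction of recipients who react
    cost_per_email: float  # in dollars

    @property
    def cost(self) -> float:
        return self.recipients * self.cost_per_email

    @property
    def reach(self) -> float:
        return self.recipients * self.reaction_rate

    @property
    def cost_per_response(self) -> float:
        return self.cost / self.reach


# Mass mailing vs. the demographically targeted mailing.
mass = Campaign(recipients=1_000_000, reaction_rate=0.01, cost_per_email=2.0)
targeted = Campaign(recipients=250_000, reaction_rate=0.06, cost_per_email=2.0)

for name, c in (("mass", mass), ("targeted", targeted)):
    print(f"{name}: cost=${c.cost:,.0f} reach={c.reach:,.0f} "
          f"cost/response=${c.cost_per_response:,.2f}")
```

The targeted campaign costs a quarter as much and still reaches more people, so its cost per response is roughly one sixth of the mass mailing's.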
A Day in Your Life
• Think about a day in your life:
– What is the best road to take?
– Will there be any bad weather?
– How should I invest my money?
– How is my health?
• There are many decisions you could make better if only you could access the data and process it.
http://www.flickr.com/photos/kcolwell/5512461652/ (CC licence)
Internet of Things
• Currently the physical world and the software world are detached
• The Internet of Things promises to bridge this gap
– It is about sensors and actuators everywhere
– In your fridge, in your blanket, in your chair, in your carpet... Yes, even in your socks
– Umbrellas that light up when there is rain, and medicine cups
What Can We Do with Big Data?
• Optimize (the world is inefficient)
– 30% of food is wasted from farm to plate
– GE Save 1% initiative (http://goo.gl/eYC0QE)
• In trains => 2B/year
• US healthcare => 20B/year
• In contrast, Sri Lanka's total exports are 9B/year.
• Save lives
– Weather, disease identification, personalized treatment
• Technology advancement
– Most high-tech research is done via simulations
Big Data Architecture
Big Data Processing Technologies Landscape
Hindsight: Batch Processing
• Programming model is MapReduce
– Apache Hadoop
– Spark
• Lots of tools built on top
– Hive, Shark (SQL-style queries), Mahout (ML), Giraph (graph processing)
• Store and process
• Slow (> 5 minutes for results for a reasonable use case)
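To make the MapReduce programming model concrete, here is a minimal single-process sketch. It is not Hadoop or Spark code; the `map_reduce` driver and the word-count `mapper` and `reducer` are hypothetical names chosen for illustration. A real framework runs the same three phases (map, shuffle by key, reduce) across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run map, shuffle (group by key) and reduce in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one call per key
    return {key: reducer(key, values) for key, values in groups.items()}


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key: str, values: List[int]) -> int:
    """Sum the counts emitted for one word."""
    return sum(values)


counts = map_reduce(["big data", "big models need data", "data"], mapper, reducer)
print(counts)
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both phases can be parallelized, which is what makes the model suitable for batch processing over stored data.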
Usecase: Targeted Advertising
• Analytics implemented with MapReduce or queries
– Min, max, average, correlation, histograms
– Might join or group data in many ways
– Heatmaps, temporal trends
• Key performance indicators (KPIs)
– Average time for a ticket in customer service interactions
– Profit per square foot for retail
Real-time Analytics
• Idea is to process data as it is received, in a streaming fashion (without storing it)
• Used when we need
– Very fast output (milliseconds)
– Lots of events (a few 100K to millions)
• Two main technologies
– Stream processing (e.g. Apache Storm, http://storm-project.net/)
– Complex Event Processing (CEP), e.g. WSO2 CEP
define partition “playerPartition” as PlayerDataStream.pid;
from PlayerDataStream#win.time(1m)
select pid, avg(speed) as avgSpeed
insert into AvgSpeedStream
using partition playerPartition;
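For readers unfamiliar with CEP query languages, the query above computes, for each player id, the average speed over the last minute of events. A rough Python sketch of the same idea follows. The `SlidingAverage` class is hypothetical, it assumes events arrive in timestamp order, and it is not how WSO2 CEP implements windows.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class SlidingAverage:
    """Per-key average over a sliding time window (assumes ordered timestamps)."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        # pid -> queue of (timestamp, speed) still inside the window
        self.events: Dict[str, Deque[Tuple[float, float]]] = defaultdict(deque)
        self.sums: Dict[str, float] = defaultdict(float)

    def add(self, pid: str, timestamp: float, speed: float) -> float:
        """Add one event and return the current windowed average for pid."""
        queue = self.events[pid]
        queue.append((timestamp, speed))
        self.sums[pid] += speed
        # Expire events that fell out of the window.
        while queue[0][0] <= timestamp - self.window:
            _, old_speed = queue.popleft()
            self.sums[pid] -= old_speed
        return self.sums[pid] / len(queue)


window = SlidingAverage(window_seconds=60.0)
print(window.add("p1", 0.0, 10.0))   # only one event in the window
print(window.add("p1", 30.0, 20.0))  # two events in the window
print(window.add("p1", 70.0, 30.0))  # the event at t=0 has expired
```

Only the events inside the window are kept in memory, which is what allows this style of processing to run without storing the full stream.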
Usecase: DEBS 2013, Football Game
Sketch Algorithms
• Data structures that can count millions of entries using a few KB
– Provide approximate answers
– E.g. Count-Min Sketch, Bloom filters
• Use cases
– Counting items
– Point estimates, range sums, heavy hitters, quantiles, number of distinct elements
– Graph summaries
– Linear algebraic problems such as approximating matrix products, least squares approximation, and SVD
See https://sites.google.com/site/algoresearch/datastreamalgorithms
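As an illustration of the idea, below is a minimal Count-Min Sketch. The class name, the default `width` and `depth`, and the use of SHA-256 with a row prefix as the hash family are assumptions made for this sketch; production implementations normally use faster hash functions.

```python
import hashlib
from typing import Iterator, Tuple


class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width: int = 1000, depth: int = 5) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item: str) -> Iterator[Tuple[int, int]]:
        # One hash per row, derived by prefixing the row number (an assumption).
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate cells, so the minimum is the best estimate.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for _ in range(100):
    sketch.add("page-a")
sketch.add("page-b", 3)
print(sketch.estimate("page-a"), sketch.estimate("page-b"))
```

The memory used is `width * depth` counters regardless of how many distinct items are added; a wider table reduces the overcount, and more rows reduce the chance of a bad estimate.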
Curious Case of Missing Data
http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, picture from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
• WW II: returned aircraft and data on where they were hit
• How would you add armour?
Challenges: Causality
• Correlation does not imply causality! (see the "send a book home" example [1])
• Establishing causality
– Repeat the experiment under identical conditions
– If you can't, do a randomized test (A/B test)
– With Big Data, we often cannot do either
• Option 1: We can act on a correlation if we can verify the guess or if correctness is not critical (starting an investigation, checking for a disease, marketing)
• Option 2: We verify correlations using A/B testing or propensity analysis
[1] http://www.freakonomics.com/2008/12/10/the-blagojevich-upside/
[2] https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to/
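One common way to evaluate an A/B test is a two-proportion z-test. The sketch below is an assumption about how such a check could be done, not a method taken from the talk; the function name and the sample numbers are made up for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    conv_a: int, n_a: int, conv_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: 1000 users per group, 100 vs 150 conversions.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z={z:.2f}, p={p:.4f}")
```

Because users are assigned to A or B at random, a small p-value here supports a causal effect of the change, which an observed correlation alone cannot do.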
Insight (Understanding Why?)
• Pattern mining: find frequent associations (e.g. market basket) and frequent sequences
• Clustering
• Graph analysis
• Knowledge discovery
• Correlations between features and finding principal components
• Simulations, complex system modeling, matching a statistical distribution
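As a small illustration of market basket analysis, the sketch below counts how often pairs of items appear together and keeps the pairs above a support threshold. It is a brute-force pair count, not Apriori or FP-Growth, and the function name and baskets are made up for the example.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def frequent_pairs(
    baskets: List[List[str]], min_support: float
) -> Dict[Tuple[str, str], float]:
    """Return item pairs whose support (fraction of baskets) >= min_support."""
    counts: Counter = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

Counting every pair grows quadratically with basket size, which is why large-scale pattern mining relies on pruning algorithms rather than this direct approach.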
Usecase: Big Data for Development in Sri Lanka?
• Done using CDR (call detail record) data
• People density at 1pm vs midnight (red => increased, blue => decreased)
• Urban planning
– People distribution
– Mobility
– Waste management
– E.g. see http://goo.gl/jPujmM
From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
Foresight (Predict)
• Build a model
– Weather, economic models
• Predict future values
– Electricity load, traffic, demand, sales
• Classification
– Spam detection, grouping users, sentiment analysis
• Find anomalies
– Fraud, predictive maintenance
• Recommendations
– Targeted advertising, product recommendations
Usecase: Predictive Maintenance
• Idea is to fix the problem before it breaks, avoiding expensive downtime
– Airplanes, turbines, windmills
– Construction equipment
– Cars, golf carts
• How
– Build a model of normal operation and detect deviations from it
– Match against known error patterns
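The "model of normal operation" can be as simple as the mean and standard deviation of a sensor during healthy operation. The sketch below assumes that simple model and a 3-sigma threshold; the function names and the temperature readings are made up for illustration, and real systems typically use richer models.

```python
import statistics
from typing import List, Tuple


def fit_normal_model(baseline: List[float]) -> Tuple[float, float]:
    """Summarize healthy readings as (mean, sample standard deviation)."""
    return statistics.fmean(baseline), statistics.stdev(baseline)


def find_deviations(
    readings: List[float], mean: float, std: float, threshold: float = 3.0
) -> List[Tuple[int, float]]:
    """Return (index, value) for readings more than threshold sigmas from mean."""
    return [
        (i, value)
        for i, value in enumerate(readings)
        if abs(value - mean) > threshold * std
    ]


# Made-up bearing temperatures recorded while the machine was healthy.
baseline = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 70.0, 69.7]
mean, std = fit_normal_model(baseline)

# New readings: the third one drifts well outside the normal range.
alerts = find_deviations([70.1, 69.9, 71.5, 70.0], mean, std)
print(alerts)
```

An alert like this would trigger an inspection before the part fails, which is the point of the use case.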
Challenges: Selecting the Best Algorithm for a Problem
• Types of data: categorical (C), numerical (N)
– N -> N = Regression
– C -> C = Decision trees
– N -> C = SVM
• Amount of data
• Required accuracy
• Required interpretability
• Kind of underlying function
See Skytree: Choosing The Right Machine Learning Methods, https://www.youtube.com/watch?v=qMUpc10VsmA
Challenges: Feature Engineering
• In ML, feature engineering is the key [1].
• You need features to form a kernel. Then you can solve the problem with less data.
• Deep learning can learn the best features (or feature combinations) via semi-supervised or unsupervised learning [2]
[1] Bekkerman's talk, https://www.youtube.com/watch?v=wjTJVhmu1JM
[2] Deep Learning, http://cl.naist.jp/~kevinduh/a/deep2014/
Challenges: Taking Decisions (Context)
Challenges: Updating Models
• Incorporate more data
– We get more data over time
– We get feedback about the effectiveness of decisions (e.g. accuracy of fraud detection)
– Trends change
• Track and update the model
– Generate models in batch mode and update
– Streaming (online) ML, which is an active research topic
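To show what updating a model online can look like, here is a minimal sketch of a one-feature linear model trained with stochastic gradient descent, one example at a time. The class name, learning rate, and synthetic stream (y = 2x + 1) are assumptions for illustration; this is one simple form of online learning, not a technique taken from the talk.

```python
class OnlineLinearModel:
    """y ~ w * x + b, updated one example at a time with SGD."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.w * x + self.b

    def update(self, x: float, y: float) -> float:
        """Take one gradient step on the squared error; return the error."""
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error
        return error


model = OnlineLinearModel()
# Synthetic, noise-free stream following y = 2x + 1 (an assumption).
for _ in range(1000):
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        model.update(x, 2.0 * x + 1.0)

print(round(model.w, 3), round(model.b, 3))
```

Each update touches only the current example, so the model can follow the stream without storing it; if the trend changes, later examples pull the weights toward the new relationship.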
Challenges: Scaling ML Algorithms
• With more data we can
– Build more accurate and detailed models [1]
• Scale => distributed systems
• Need to build new algorithms, adapt existing ones, or use other methods
– Sampling
– Scalable versions of algorithms (e.g. decision trees, NN)
[1] P. Domingos, A Few Useful Things to Know about Machine Learning
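For the sampling option, reservoir sampling is one standard way to draw a fixed-size uniform sample from a data set too large to hold in memory. The sketch below is the classic single-pass version (often called Algorithm R); the function name and parameters are chosen for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown size."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = rng.randint(0, i)      # uniform index in [0, i], inclusive
            if j < k:
                reservoir[j] = item    # keep the new item with probability k/(i+1)
    return reservoir


sample = reservoir_sample(range(100_000), k=100, rng=random.Random(42))
print(len(sample))
```

A model can then be trained on the sample with an ordinary single-machine algorithm, trading some accuracy for a much smaller data set.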
Challenges: Lack of Labeled Data
• Most data is not labeled
• Idea of semi-supervised learning
• Provide data + examples + an ontology, and the algorithm finds new patterns
– Lots of data
– A few example sentences
• Often uses the Expectation Maximization (EM) algorithm
Watch Tom Mitchell's lecture: https://www.youtube.com/watch?v=psFnHkIjHA0
Ontology: People, Cities
Relationships: like, dislike, live in
Examples: Bob (People) lives in Colombo (City)
Outline

ICTER 2014 Invited Talk: Large Scale Data Processing in the Real World: from Analytics to Predictions
