Real-Time Machine Learning at
             Industrial scale
                          ... the battle of accuracy vs latency

                                        tumra.com
                                         @tumra
                                                                9th October 2012
TUMRA LTD, Building 3, Chiswick Park,
566 Chiswick High Road, W4 5YA                              Michael Cutler @cotdp
$ whoami
Michael Cutler (@cotdp)
●   Previously at British Sky Broadcasting
    ○   Last 7 years in R&D
    ○   Created several patented systems & algorithms
    ○   Kicked off ‘Big Data’ initiative at Sky in 2008

●   Co-founder & CTO @ TUMRA since March '12
    ○   Real-time big data science platform
    ○   Alpha-testing with selected clients
Agenda
●   Background
●   Real-Time vs Batch processing
●   Accuracy vs Latency
●   Use Cases
     ○ eCommerce
     ○ Financial Services
     ○ Media

●   Questions
Background
Big Data is "in vogue", but what does it mean:
 ● Distributed processing

 ● Massively scalable

 ● Commodity hardware



Apache Hadoop is the "kernel" of the Big Data OS:
 ● Distributed Filesystem (HDFS)

 ● Parallel Processing (Map/Reduce, YARN)
Background (cont'd)
Solving problems with Big Data is hard:
 ● Tools are all low-level (Pig, Hive etc.)

 ● Skills are hard to find



What is "Data Science":
● Understanding data & solving problems

● Applies the following skills:

   ○   Statistical Analysis
   ○   Machine Learning
   ○   Communicating Results
Real-Time vs
Batch processing
Batch - Hoppers, Bins, Buckets




 Credit: http://bit.ly/Q71u4W
Real-Time - Flows & Streams




                          Credit: http://bit.ly/NOslqf
Real-Time vs Batch processing
Similarities to the Industrial Revolution:
 ● From handicraft to Batch & Real-Time

 ● Complexity increases



Need for "Real-Time":
● Wherever the variation can change faster than you can retrain models
● When you can't pre-compute everything ahead of time
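
To make the first point concrete, here is a minimal sketch (not taken from the talk) comparing an estimate that is updated on every event with one that is only refreshed at batch retrain time. The class name `OnlineMean`, the smoothing weight `alpha`, and the event values are all hypothetical, and an exponentially weighted mean stands in for a real model.

```python
class OnlineMean:
    """Exponentially weighted running estimate, updated one event at a time."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha   # weight given to the newest observation (assumed value)
        self.value = None    # current estimate; None until the first event arrives

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = x
        else:
            # Move the estimate a fraction of the way towards the new observation.
            self.value += self.alpha * (x - self.value)
        return self.value


def batch_mean(history):
    """Batch estimate: recomputed from stored history at each retrain."""
    return sum(history) / len(history)


# Made-up signal whose level shifts from 10 to 50 halfway through.
events = [10.0] * 20 + [50.0] * 20

online = OnlineMean(alpha=0.2)
for x in events:
    online.update(x)

# The last batch retrain ran before the shift, so it only saw the first 20 events.
stale = batch_mean(events[:20])
```

After the shift the online estimate sits close to the new level, while the batch estimate still reflects the old one until the next retrain.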
Accuracy vs Latency
Accuracy vs Latency
Netflix Prize winning entry :-
● Ensemble of 100s of models

● Massively compute intensive solution

● Marginally better than much simpler models




IBM won the KDD Cup 2009 (Orange) :-
 ● IBM Research team (T.J. Watson Research Center) won by sheer brute force

 ● Used a "one of everything" approach, generating hundreds of models
Accuracy vs Latency (cont'd)
Mathematical navel-gazing:
● Often the factor we're optimising for isn't the thing we measure improvement in:
   ○ User ratings vs. customer longevity/value

   ○ Overfitting outliers vs. missing clear fraud




Given the choice between a "best guess" now,
and a "marginally better" answer later, I'd take
the "best guess" every time.
However, that doesn't mean...
Accuracy vs Latency (cont'd)
It's a trade-off:
 ● Sometimes "best guess" is good enough,

 ● Other times we can wait for the accuracy,

 ● And of course, occasionally we want both!



Key objective:
 ● Most appropriate solution for the use-case

 ● Hybrid solutions: part batch, part real-time
Use Case
eCommerce
Use Case - eCommerce
Objective - Increase profits
How:
●   Match potential customers to the right products
●   Personalise user experience on web & email
●   Customer lifecycle management

Method:
●   Ensemble of real-time models
●   Collect lots of implicit feedback data
Use Case - eCommerce (cont'd)
Detail:
●   Clustering - behaviour, demographics
●   Simple predictors - keywords to products
●   Bayesian Bandit - blend the output

Requirements:
●   Predictions in < 50 ms
●   Online learning models
●   Occasional batch updates are OK
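
The slides do not show how the Bayesian Bandit is built. As a rough sketch under stated assumptions, Thompson sampling with Beta posteriors is one common way to let an online learner decide which model's output to serve. The class `BayesianBandit`, the arm names, and the click-through rates below are all made up for illustration.

```python
import random


class BayesianBandit:
    """Thompson sampling over a fixed set of arms with Beta(1, 1) priors."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        # Per arm: [successes + 1, failures + 1], the Beta posterior parameters.
        self.params = {arm: [1, 1] for arm in arms}

    def choose(self):
        # Draw one plausible success rate per arm and play the highest draw.
        draws = {arm: self.rng.betavariate(a, b)
                 for arm, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        # reward is True for a click or purchase, False otherwise.
        self.params[arm][0 if reward else 1] += 1


# Hypothetical click-through rates for two predictors (invented numbers).
true_rates = {"clustering": 0.05, "keywords": 0.50}

bandit = BayesianBandit(true_rates, seed=42)
env = random.Random(7)            # simulated user feedback
pulls = {arm: 0 for arm in true_rates}

for _ in range(2000):
    arm = bandit.choose()
    pulls[arm] += 1
    bandit.update(arm, env.random() < true_rates[arm])
```

Each `choose` and `update` call is a handful of arithmetic operations, which is the kind of cost that fits a budget of tens of milliseconds; the posterior counts could equally be reset or re-seeded by an occasional batch job.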
When eCommerce
    #FAILs
I've only ever bought Cat food...
... wait there's more, no Cat food
Even Amazon can #FAIL
Use Case
Financial Services
Use Case - Financial Services
Objective - Reduce Fraud
How:
●   Compute patterns/predictors for individuals
●   Cluster individuals and recompute for clusters
●   Compute baselines across all data

Method:
●   Hybrid and Hierarchical Clustering models
●   Simple predictors for individuals, clusters & baseline
Use Case - Financial Services (cont'd)
Detail:
●   CHEAT!!! ... Cluster to nearest centroid
     ○ will degrade over time (Hunchback Clusters)

●   Use simple metrics to alert (stddev)

Requirements:
●   Ability to alert/intervene in near real-time (< 1 second)
●   Adapt to rapid changes (within baseline & clusters)
●   Periodic batch processing to recompute clusters
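
The "cheat" above can be sketched as follows, assuming centroids produced by an earlier batch run and a simple standard-deviation rule for alerting. The centroid values, the feature layout, and the 3-sigma threshold are hypothetical; Welford's algorithm is used here only as one convenient way to keep a running standard deviation, and is not confirmed by the slides.

```python
import math


class RunningStats:
    """Welford's online mean and variance; constant work per observation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # Sample standard deviation; 0.0 until there are at least two points.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def nearest_centroid(point, centroids):
    """Return the index of the closest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))


def is_anomalous(amount, stats, threshold=3.0):
    """Flag an amount more than `threshold` standard deviations from the mean."""
    sd = stats.stddev()
    return sd > 0 and abs(amount - stats.mean) > threshold * sd


# Hypothetical centroids from the last batch run: (average spend, transactions/day).
centroids = [(20.0, 1.0), (500.0, 10.0)]
cluster_stats = [RunningStats() for _ in centroids]

# Real-time path: assign the individual to a cluster without re-clustering.
cluster = nearest_centroid((25.0, 2.0), centroids)
for amount in [18.0, 22.0, 19.0, 21.0, 20.0]:
    cluster_stats[cluster].update(amount)

alert = is_anomalous(400.0, cluster_stats[cluster])
```

Because points are only ever assigned to fixed centroids, the clusters drift away from the data over time, which is why the periodic batch recompute in the requirements is still needed.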
Use Case - Financial Services
Use Case
 Media
Use Case - Media
Objective - Generating Metadata
Why:
●   Drive second screen applications
●   Create new streams of information for resale

How:
●   Video / Audio analysis
●   Closed caption or subtitle text processing
●   Knowledgebase :- People, Places, Products & Things
Use Case - Media (cont'd)
Method:
●   Natural Language Processing
    ○   Named Entity Recognition
    ○   Topic Extraction & Disambiguation
●   Graph databases & algorithms

Requirements:
●   Responses in < 1 second
●   Ability to learn new 'Things'
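
As a toy illustration of the "learn new 'Things'" requirement, the sketch below tags subtitle text by dictionary lookup against a small in-memory knowledgebase. This is a deliberately simple stand-in for real Named Entity Recognition, not the method used in the talk; the class `Knowledgebase`, its methods, and the entity types are hypothetical.

```python
import re


class Knowledgebase:
    """Toy entity store: maps lower-cased surface forms to (name, type) pairs."""

    def __init__(self):
        self.entities = {}

    def learn(self, name, entity_type):
        # New 'Things' can be added at any time, with no retraining step.
        self.entities[name.lower()] = (name, entity_type)

    def tag(self, text, max_words=3):
        """Greedy longest-match lookup over word n-grams in `text`."""
        words = re.findall(r"[\w']+", text)
        found, i = [], 0
        while i < len(words):
            # Try the longest candidate phrase first, then shorter ones.
            for n in range(min(max_words, len(words) - i), 0, -1):
                key = " ".join(words[i:i + n]).lower()
                if key in self.entities:
                    found.append(self.entities[key])
                    i += n
                    break
            else:
                i += 1   # no entity starts at this word
        return found


kb = Knowledgebase()
kb.learn("London", "Place")
kb.learn("British Sky Broadcasting", "Organisation")

before = kb.tag("Chiswick Park is in London")
kb.learn("Chiswick Park", "Place")
after = kb.tag("Chiswick Park is in London")
```

The second call picks up the newly learned entity immediately. Disambiguating between entities that share a surface form is where the graph algorithms mentioned above would come in, and that part is not attempted here.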
Example of 12,000 entities from our Knowledgebase...
Summary
Summary
Key points:
●   Clear move towards distributed algorithms
●   Low latency is often more valuable than accuracy
●   Trade-offs are dependent on the use-cases

Further reading:
●   Apache Mahout - http://mahout.apache.org/
●   Storm Project - http://storm-project.net/
●   Data Science London - http://datasciencelondon.org/
●   Machine Learning Meetup - http://bit.ly/w8V8f6
Almost finished!
Introducing TUMRA Labs
API access to some of our real-time models:
●   Probabilistic Demographics

Coming Soon:
●   Language detection
●   Sentiment analysis
●   Metadata Generation


      Free to sign up and easy to get started!
              http://labs.tumra.com/
Questions?
  Work          Personal
tumra.com      cotdp.com
 @tumra         @cotdp

Real-Time Machine Learning at Industrial scale (University of Oxford, 9th Oct 2012)
