Streaming Data Mining
Presented by Edo Liberty | April 11, 2014
Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission.
Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.
2 Yahoo Confidential & Proprietary
Single machine data mining
[Diagram: The World, Data, Computation, Result]
Distributed storage
[Diagram: The World, Data (×4), Computation, Result]
Distributed model (map/reduce, message passing, …)
[Diagram: The World, Data + Compute nodes (×8), Computation, Result]
Distributed model (indexes, tables, databases, …)
[Diagram: The World, Data + Compute nodes (×8), Computation, Query, Result]
207 big-data infographics (meta infographic)
The streaming model
[Diagram: The World, Computation, Sketch, Query, Query Algorithm, Result]
The parallel streaming model
[Diagram: The World, Compute + Sketch nodes (×4), Aggregate + Sketch, Query, Query Algorithm, Result]
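The parallel streaming model can be illustrated with a sketch that supports a merge step. The following is a minimal Python sketch under stated assumptions: the `MeanSketch` class, its `update`/`merge`/`query` methods, and the two-way partition of the stream are all hypothetical names chosen for illustration, and a running mean stands in for whatever summary a real system would keep.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class MeanSketch:
    """Hypothetical sketch: just enough state to answer a mean query."""
    count: int = 0
    total: float = 0.0

    def update(self, item: float) -> None:
        # "Compute + Sketch": fold one stream item into the local summary.
        self.count += 1
        self.total += item

    def merge(self, other: "MeanSketch") -> "MeanSketch":
        # "Aggregate + Sketch": combine two summaries without revisiting the data.
        return MeanSketch(self.count + other.count, self.total + other.total)

    def query(self) -> float:
        # "Query Algorithm": answer from the sketch alone.
        return self.total / self.count if self.count else 0.0


def sketch_partition(items: Iterable[float]) -> MeanSketch:
    """Build a sketch from one partition in a single pass."""
    sketch = MeanSketch()
    for item in items:
        sketch.update(item)
    return sketch


def aggregate(sketches: List[MeanSketch]) -> MeanSketch:
    """Merge the per-partition sketches into one."""
    merged = MeanSketch()
    for s in sketches:
        merged = merged.merge(s)
    return merged


# Assumed split of the example stream 1 7 8 1 0 1 7 7 across two workers.
partitions = [[1, 7, 8, 1], [0, 1, 7, 7]]
combined = aggregate([sketch_partition(p) for p in partitions])
print(combined.query())
```

The design point is that each worker only ever ships its small sketch to the aggregator, never its share of the raw data.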
The streaming model (more accurately)
[Diagram: Iterator over the stream 1 7 8 1 0 1 7 7, Computation, Sketch, Result]
O(n) items
O(polylog(n)) space
O(polylog(n)) computation per item
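To make the constraints concrete, here is a small Python sketch of a one-pass algorithm that reads items from an iterator and keeps only a fixed-size summary. Reservoir sampling is not taken from the slides; it is swapped in here as a familiar example that fits the model, and the function name, the `seed` parameter, and the choice of `k = 3` are assumptions for illustration.

```python
import random
from typing import Iterator, List


def reservoir_sample(stream: Iterator[int], k: int, seed: int = 0) -> List[int]:
    """Keep a uniform random sample of k items using O(k) space and one pass."""
    rng = random.Random(seed)
    sample: List[int] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# The iterator can only be consumed once, as in the streaming model.
stream = iter([1, 7, 8, 1, 0, 1, 7, 7])
sample = reservoir_sample(stream, k=3)
print(sample)
```

Once the function returns, the iterator is exhausted, so any later query has to be answered from the sample alone.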
Communication complexity
[Diagram: stream 1 7 8 1 0 1 7 7, two Iterators, Sketch, Result]
Frequent items
Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.
[Figures, slides 13 to 23: a stream of $n$ items of $d$ distinct types, with true count $f(x) = 5$ for one item; a sketch of size $\ell$; approximate counts $f'(x) = 0$ and $f'(x) = 2$.]
The proof (very short)
Assume we do this $t$ times.
First fact: $f'(x) \le f(x)$
Second fact: $f'(x) \ge f(x) - t$
Third (not so obvious) fact: $0 \le \sum_x f'(x) = \sum_x f(x) - t \cdot \ell = n - t \cdot \ell$
Which gives $t \le n/\ell$. In words: we can only delete $\ell$ items $n/\ell$ times!
$|f'(x) - f(x)| \le n/\ell$ ∎
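The three facts can be checked on a small run. The Python sketch below is one reading of the frequent-items algorithm that is consistent with the equality in the third fact: whenever the sketch holds $\ell$ distinct items, one copy of each is deleted. That reading, the function name `frequent_items`, and the choice $\ell = 3$ are assumptions, since the slides that showed the algorithm did not survive extraction.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def frequent_items(stream: Iterable[int], ell: int) -> Tuple[Dict[int, int], int]:
    """Return approximate counts f'(x) and the number of deletion rounds t."""
    counters: Dict[int, int] = {}
    deletions = 0
    for item in stream:
        counters[item] = counters.get(item, 0) + 1
        if len(counters) == ell:
            # Delete one copy of each of the ell distinct items,
            # dropping counters that reach zero.
            counters = {x: c - 1 for x, c in counters.items() if c > 1}
            deletions += 1
    return counters, deletions


stream = [1, 7, 8, 1, 0, 1, 7, 7]
ell = 3
approx, t = frequent_items(stream, ell)
exact = Counter(stream)
n = len(stream)

for x in exact:
    err = exact[x] - approx.get(x, 0)
    assert 0 <= err <= t                      # first and second facts
assert sum(approx.values()) == n - t * ell    # third fact
assert t <= n / ell                           # hence |f'(x) - f(x)| <= n / ell
print(approx, t)
```

Textbook statements of Misra-Gries differ slightly in when the decrement is triggered; the error bound has the same form up to an off-by-one in the number of counters.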
Useful form…
Define $p(x) = f(x)/n$ and $p'(x) = f'(x)/n$.
We get that $|p'(x) - p(x)| \le 1/\ell$.
This is very useful for keeping approximate distributions!
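The normalized bound follows from the frequency bound by dividing through by $n$. A one-line derivation, using only the symbols defined on the slides:

```latex
\[
|p'(x) - p(x)| \;=\; \frac{|f'(x) - f(x)|}{n} \;\le\; \frac{n/\ell}{n} \;=\; \frac{1}{\ell}
\]
```

As an illustrative reading (the number is an assumption, not from the talk): with $\ell = 100$, every estimated probability is within $0.01$ of the true one, regardless of the stream length $n$.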
Threading Machine Generated Email
Email threads
A simple email thread (that’s not very hard to do…)
Ailon, Karnin, Maarek, Liberty, Threading Machine Generated Email, WSDM 2013
What else can we do in the streaming model…
Items (words, IP addresses, events, clicks, ...):
• Item frequencies
• Counting distinct elements
• Moment and entropy estimation
• Approximate set operations
Vectors (text documents, images, example features, ...):
• Dimensionality reduction
• Clustering (k-means, k-median, …)
• Linear regression
• Machine learning (some of it at least)
Matrices (text corpora, user preferences, graphs, ...):
• Covariance matrix estimation
• Low rank approximation
• Sparsification
Thanks!
Yahoo does big data algorithms, software and systems!
Speak to our Talent Team or visit Careers.Yahoo.com and explore our
career opportunities in NYC or Sunnyvale, CA
Seth Tropper
satropper@yahoo-inc.com
Doug DeSimone
desimone@yahoo-inc.com
Keith Daniels
kdnl@yahoo-inc.com
Yahoo is an equal opportunity employer.

MLconf NYC Edo Liberty

  • 1. Streaming Data Mining. Presented by Edo Liberty | April 11, 2014. Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission. Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.
  • 2. Yahoo Confidential & Proprietary. Single machine data mining. Diagram labels: The World, Data, Computation, Result.
  • 3. Distributed storage. Diagram labels: The World, Data (×4), Computation, Result.
  • 4. Distributed model (map/reduce, message passing, …). Diagram labels: The World, Data + Compute (×8), Computation, Result.
  • 5. Distributed model (indexes, tables, databases, …). Diagram labels: The World, Data + Compute (×8), Computation, Query, Result.
  • 6. 207 big-data infographics (meta infographic)
  • 7. [Figure only]
  • 8. The streaming model. Diagram labels: The World, Sketch, Computation, Query Algorithm, Query, Result.
  • 9. The parallel streaming model. Diagram labels: The World, Compute + Sketch (×4), Aggregate + Sketch, Query Algorithm, Query, Result.
  • 10. The streaming model (more accurately): O(n) items, O(polylog(n)) space, O(polylog(n)) computation per item. Diagram labels: stream 1 7 8 1 0 1 7 7, Iterator, Computation, Sketch, Result.
  • 11. Communication complexity. Diagram labels: stream 1 7 8 1 0 1 7 7, Iterator (×2), Sketch, Result.
  • 12. Frequent items. References: Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002. Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003. (The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.) Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.
  • 13. [Figure with labels $d$, $n$, and $f(\cdot) = 5$ for one item]
  • 14. [Figure with labels $d$ and $f(\cdot) = 5$ for one item]
  • 15–21. [Figure slides; the only recoverable label is $\ell$]
  • 22. [Figure with labels $\ell$, $f'(\cdot) = 0$, and $f'(\cdot) = 2$]
  • 23. The proof (very short). Assume we do this $t$ times. First fact: $f'(x) \le f(x)$. Second fact: $f'(x) \ge f(x) - t$.
  • 24. The proof (very short). Third (not so obvious) fact: $0 \le \sum_x f'(x) = \sum_x f(x) - t \cdot \ell = n - t \cdot \ell$, which gives $t \le n/\ell$. In words: we can only delete $\ell$ items $n/\ell$ times! Hence $|f'(x) - f(x)| \le n/\ell$. ∎
  • 25. Useful form… Define $p(x) = f(x)/n$ and $p'(x) = f'(x)/n$. We get that $|p'(x) - p(x)| \le 1/\ell$. This is very useful for keeping approximate distributions!
  • 27. Email threads. A simple email thread (that’s not very hard to do…)
  • 28. Threading Machine Generated Email. Ailon, Karnin, Maarek, Liberty. Threading Machine Generated Email, WSDM 2013.
  • 29. Threading Machine Generated Email
  • 30. Threading Machine Generated Email
  • 31. What else can we do in the streaming model… Items (words, IP addresses, events, clicks, ...): item frequencies; counting distinct elements; moment and entropy estimation; approximate set operations. Vectors (text documents, images, example features, ...): dimensionality reduction; clustering (k-means, k-median, …); linear regression; machine learning (some of it at least). Matrices (text corpora, user preferences, graphs, ...): covariance matrix estimation; low rank approximation; sparsification.
  • 32. Thanks! Yahoo does big data algorithms, software and systems! Speak to our Talent Team or visit Careers.Yahoo.com and explore our career opportunities in NYC or Sunnyvale, CA. Contacts: Seth Tropper (satropper@yahoo-inc.com), Doug DeSimone (desimone@yahoo-inc.com), Keith Daniels (kdnl@yahoo-inc.com). Yahoo is an equal opportunity employer.
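
The frequent-items slides (12 to 25) give the guarantee but the transcript loses the pictures of the algorithm itself. What follows is a minimal Python sketch of one way to realise it, assuming the variant implied by the proof: each deletion step removes exactly one occurrence from each of ℓ distinct items, so the counters sum to n − t·ℓ. The function name `frequent_items_sketch` and the choice ℓ = 3 are illustrative assumptions, not taken from the slides; the input is the example stream shown on slides 10 and 11.

```python
from collections import Counter


def frequent_items_sketch(stream, ell):
    """Misra-Gries style frequency sketch (assumed variant, see lead-in).

    Keeps fewer than `ell` counters. Returns (approx_counts, t), where t is
    the number of deletion steps performed.
    """
    if ell < 2:
        raise ValueError("ell must be at least 2")
    counts = {}
    t = 0
    for x in stream:
        counts[x] = counts.get(x, 0) + 1
        if len(counts) == ell:
            # Delete one occurrence of each of the ell distinct items
            # currently held, dropping counters that reach zero.
            counts = {k: v - 1 for k, v in counts.items() if v > 1}
            t += 1
    return counts, t


stream = [1, 7, 8, 1, 0, 1, 7, 7]  # example stream from the slides
ell = 3                            # illustrative choice
n = len(stream)

approx, t = frequent_items_sketch(stream, ell)
exact = Counter(stream)

# The three facts from the proof, checked on this stream.
for x in exact:
    f, f_approx = exact[x], approx.get(x, 0)
    assert f_approx <= f                      # first fact
    assert f_approx >= f - t                  # second fact
    assert abs(f_approx / n - f / n) <= 1 / ell  # the "useful form"
assert sum(approx.values()) == n - t * ell    # third fact
assert t <= n / ell

print(approx, t)
```

Rebuilding the dictionary on every deletion step keeps the code short but costs O(ℓ) per deletion; since there are at most n/ℓ deletions, the total work stays O(n).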