2. Single machine data mining
Yahoo Confidential & Proprietary
[Figure: diagram with the labels The World, Data, Computation, Result.]
3. Distributed storage
[Figure: diagram with four Data stores and the labels The World, Computation, Result.]
4. Distributed model (map/reduce, message passing, …)
[Figure: diagram with eight "Data + Compute" nodes and the labels The World, Computation, Result.]
5. Distributed model (indexes, tables, databases, …)
[Figure: diagram with eight "Data + Compute" nodes and the labels The World, Computation, Query, Result.]
8. The streaming model
[Figure: diagram with the labels The World, Computation, Sketch, Query, Query Algorithm, Result.]
9. The parallel streaming model
[Figure: diagram with four "Compute + Sketch" nodes, an "Aggregate + Sketch" node, and the labels The World, Query, Query Algorithm, Result.]
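To make the "Compute + Sketch" and "Aggregate + Sketch" boxes concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: `MomentSketch`, `sketch_of`, and `parallel_sketch` are hypothetical names, and the summary used (count, sum, sum of squares) is a deliberately simple stand-in chosen because it merges exactly, not a sketch taken from the slides.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MomentSketch:
    """Hypothetical constant-size summary: item count, sum, and sum of squares."""
    n: int = 0
    s1: int = 0
    s2: int = 0

    def update(self, x: int) -> None:
        # One streaming step: fold a single item into the summary.
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def merge(self, other: "MomentSketch") -> "MomentSketch":
        # Aggregation step: two summaries combine into one of the same size.
        return MomentSketch(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2)

    def mean(self) -> float:
        return self.s1 / self.n

    def variance(self) -> float:
        # Population variance computed from the two stored moments.
        return self.s2 / self.n - self.mean() ** 2


def sketch_of(items: Iterable[int]) -> MomentSketch:
    # Single pass over one partition of the data.
    sk = MomentSketch()
    for x in items:
        sk.update(x)
    return sk


def parallel_sketch(partitions: Iterable[Iterable[int]]) -> MomentSketch:
    # Each partition is sketched independently (sequentially here, for simplicity),
    # then the partial sketches are aggregated into one.
    total = MomentSketch()
    for part in partitions:
        total = total.merge(sketch_of(part))
    return total
```

Splitting a stream into partitions and calling `parallel_sketch` should give the same summary as `sketch_of` on the whole stream, which is the property that lets the aggregation node answer queries without seeing the raw data.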
10. The streaming model (more accurately)
[Figure: the stream 1 7 8 1 0 1 7 7 with the labels Iterator, Computation, Sketch, Result.]
O(n) items
O(polylog(n)) space
O(polylog(n)) computation per item
11. Communication complexity
[Figure: the stream 1 7 8 1 0 1 7 7 with two Iterators and the labels Sketch, Result.]
12. Frequent items
Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient computation of frequent and top-k elements in data streams, 2006.
23. The proof (very short)
Assume we do this $t$ times.
First fact: $f'(x) \le f(x)$
Second fact: $f'(x) \ge f(x) - t$
24. The proof (very short)
Third (not so obvious) fact: $0 \le \sum_x f'(x) = \sum_x f(x) - t \cdot \ell = n - t \cdot \ell$
Which gives $t \le n/\ell$. In words:
We can only delete $\ell$ items $n/\ell$ times!
$|f'(x) - f(x)| \le n/\ell$ ∎
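The three facts can be checked on a small stream. The slides that define the algorithm are not part of this section, so the code below is a sketch that assumes a Misra-Gries-style variant: keep a counter per item, and whenever $\ell$ distinct items are present, delete one copy of each (so every deletion step removes exactly $\ell$ items). The names `frequent_items` and `check_facts` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def frequent_items(stream: Iterable[Hashable], ell: int) -> Tuple[Dict[Hashable, int], int]:
    """Return (approximate counts f', number of deletion steps t)."""
    if ell < 1:
        raise ValueError("ell must be at least 1")
    counters: Dict[Hashable, int] = {}
    t = 0
    for x in stream:
        counters[x] = counters.get(x, 0) + 1
        if len(counters) == ell:
            # ell distinct items are present: delete one copy of each,
            # which removes exactly ell items from the summary.
            counters = {k: v - 1 for k, v in counters.items() if v > 1}
            t += 1
    return counters, t


def check_facts(stream: Iterable[Hashable], ell: int) -> Tuple[Dict[Hashable, int], int]:
    """Run the algorithm and assert the three facts from the proof."""
    items = list(stream)
    n = len(items)
    f = Counter(items)  # exact frequencies, kept only for comparison
    f_approx, t = frequent_items(items, ell)
    for x in f:
        fx_approx = f_approx.get(x, 0)
        assert fx_approx <= f[x]                 # first fact
        assert fx_approx >= f[x] - t             # second fact
        assert abs(fx_approx - f[x]) <= n / ell  # final bound
    assert sum(f_approx.values()) == n - t * ell  # third fact
    assert t <= n / ell
    return f_approx, t
```

For example, `check_facts([1, 7, 8, 1, 0, 1, 7, 7], 3)` runs the stream from the earlier slides through the algorithm with three counters and raises an `AssertionError` if any of the facts fails.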
25. Useful form…
Define $p(x) = f(x)/n$
And $p'(x) = f'(x)/n$
We get that $|p'(x) - p(x)| \le 1/\ell$
This is very useful for keeping approximate distributions!
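The step from frequencies to probabilities is a single division by $n$. Spelled out, using only the definitions on this slide and the bound $|f'(x) - f(x)| \le n/\ell$ from the proof:

```latex
\[
|p'(x) - p(x)|
  = \left| \frac{f'(x)}{n} - \frac{f(x)}{n} \right|
  = \frac{|f'(x) - f(x)|}{n}
  \le \frac{n/\ell}{n}
  = \frac{1}{\ell}
\]
```

As an illustration (the number is chosen here, not taken from the slides), with $\ell = 100$ counters every estimated probability is within $0.01$ of the true one, regardless of the stream length $n$.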
31. What else can we do in the streaming model…
Items (words, IP addresses, events, clicks, ...):
§ Item frequencies
§ Counting distinct elements
§ Moment and entropy estimation
§ Approximate set operations
Vectors (text documents, images, example features, ...):
§ Dimensionality reduction
§ Clustering (k-means, k-median,…)
§ Linear Regression
§ Machine learning (some of it at least)
Matrices (text corpora, user preferences, graphs, ...):
§ Covariance matrix estimation
§ Low rank approximation
§ Sparsification
32. Thanks!
Yahoo does big data algorithms, software and systems!
Speak to our Talent Team or visit Careers.Yahoo.com and explore our
career opportunities in NYC or Sunnyvale, CA
Seth Tropper
satropper@yahoo-inc.com
Doug DeSimone
desimone@yahoo-inc.com
Keith Daniels
kdnl@yahoo-inc.com
Yahoo is an equal opportunity employer.