
# Large-scale real-time analytics for everyone

My slides from Highload Strategy conference in Vilnius.

### Large-scale real-time analytics for everyone

1. 1. Large-scale real-time analytics for everyone: fast, cheap and 98% correct
2. 2. Pavel Kalaidin @facultyofwonder
3. 3. we have a lot of data; memory is limited; one pass would be great; constant update time
4. 4. max, min, mean is trivial
5. 5. median, anyone?
6. 6. Sampling?
7. 7. Probabilistic algorithms
8. 8. An estimate is OK, but it's nice to know how the error is distributed
9. 9. def frugal(stream):
          m = 0
          for val in stream:
              if val > m:
                  m += 1
              elif val < m:
                  m -= 1
          return m
10. 10. Memory used: 1 int!
        def frugal(stream):
            m = 0
            for val in stream:
                if val > m:
                    m += 1
                elif val < m:
                    m -= 1
            return m
        It really works
11. 11. Percentiles?
12. 12. Demo: bit.ly/frugalsketch
        def frugal_1u(stream, m=0, q=0.5):
            for val in stream:
                r = np.random.random()
                if val > m and r > 1 - q:
                    m += 1
                elif val < m and r > q:
                    m -= 1
            return m
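A runnable version of the Frugal-1U quantile sketch from the slide, using the stdlib `random` module in place of numpy (my substitution; the seed and the uniform test data are illustrative, not from the talk):

```python
import random

def frugal_1u(stream, m=0, q=0.5):
    # One-variable quantile estimate: step up with probability q when the
    # value is above the current estimate, down with probability 1 - q
    # when it is below. q = 0.5 tracks the median.
    for val in stream:
        r = random.random()
        if val > m and r > 1 - q:
            m += 1
        elif val < m and r > q:
            m -= 1
    return m

random.seed(0)
data = [random.randint(0, 100) for _ in range(100_000)]
median_est = frugal_1u(data)  # the true median is around 50
```

The estimate random-walks toward the q-th quantile and then fluctuates around it, which is why the talk calls frugal sketching cheap but not very precise.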
13. 13. Streaming + probabilistic = sketch
14. 14. What do we want? Get the number of unique users aka cardinality number
15. 15. What do we want? Get the number of unique users grouped by host, date, segment
16. 16. When do we want it? Well, right now
17. 17. Data: 10^10 elements, 10^9 unique int32 values, 40 GB
18. 18. Straightforward approach: a hash table
19. 19. Hash table: 4 GB
20. 20. HyperLogLog: 1.5 KB, 2% error
21. 21. It all starts with an algorithm called LogLog
22. 22. Imagine I tell you I spent this morning flipping a coin
23. 23. and now I tell you the longest uninterrupted run of heads
24. 24. 2 times or 100 times
25. 25. In which case did I flip the coin for a longer time?
26. 26. We are interested in patterns in hashes (namely the longest runs of leading zeros = heads)
27. 27. Hash, don’t sample!* * need a good hash function
28. 28. Expecting: 0xxxxxx hashes ~50%, 1xxxxxx hashes ~50%, 00xxxxx hashes ~25%
29. 29. estimate: 2^R, where R is the longest run of leading zeros in the hashes
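The single-experiment 2^R estimate can be sketched in Python; sha1 here stands in for the "good hash function" the talk asks for (my choice, not the speaker's):

```python
import hashlib

def leading_zeros(x, bits=32):
    # Leading zero bits of x viewed as a fixed-width (bits-wide) word
    return bits - x.bit_length()

def rough_cardinality(items):
    # Single-experiment estimate: 2**R, where R is the longest run of
    # leading zeros seen across all hashed items
    R = 0
    for x in items:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:4], "big")
        R = max(R, leading_zeros(h))
    return 2 ** R
```

A single experiment is very noisy (the estimate is always a power of two), which is exactly why the next slides introduce stochastic averaging over M buckets.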
30. 30. I can perform several flipping experiments
31. 31. and average the number of zeros
32. 32. This is called stochastic averaging
33. 33. So far the estimate is 2^R, where R is the longest run of leading zeros in the hashes
34. 34. We will be using M buckets
35. 35. estimate = ɑ · M · 2^(average R over the M buckets), where ɑ is a normalization constant
36. 36. LogLog SuperLogLog
37. 37. LogLog SuperLogLog HyperLogLog arithmetic mean -> harmonic mean plus a couple of tweaks
38. 38. Standard error is 1.04/sqrt(M), where M is the number of buckets
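Putting the pieces together (buckets, ranks, the ɑ constant, and HyperLogLog's harmonic mean), a minimal estimator might look like this; sha1, b=8, and the omission of small/large-range corrections are my simplifications, not the talk's:

```python
import hashlib

def hll_estimate(items, b=8):
    # b bits pick one of m = 2^b buckets; each bucket keeps the maximum
    # "rank" (leading zeros + 1 of the remaining hash bits) it has seen.
    m = 1 << b
    registers = [0] * m
    for x in items:
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        j = h & (m - 1)                       # low b bits choose the bucket
        w = h >> b                            # remaining 64 - b bits
        rank = (64 - b) - w.bit_length() + 1  # leading zeros + 1
        registers[j] = max(registers[j], rank)
    alpha = 0.7213 / (1 + 1.079 / m)          # normalization constant (m >= 128)
    # Harmonic mean of 2^register across buckets
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

With b=8 (256 buckets) the expected standard error is about 1.04/sqrt(256) ≈ 6.5%, at roughly 256 bytes of register state.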
39. 39. LogLog SuperLogLog HyperLogLog HyperLogLog++ Google, 2013 32 bit -> 64 bit + fixes for low cardinality bit.ly/HLLGoogle
40. 40. LogLog SuperLogLog HyperLogLog HyperLogLog++ Discrete Max-Count Facebook, 2014 bit.ly/DiscreteMaxCount
41. 41. Large scale?
42. 42. Suppose we have two HLL sketches; let's take the maximum value from corresponding buckets
43. 43. Resulting sketch has no loss in accuracy!
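On raw register lists, the lossless merge the slide describes is just an elementwise maximum (a sketch assuming both HLLs used the same hash function and bucket count):

```python
def merge_hll(regs_a, regs_b):
    # Union of two HLL sketches built with the same hash function and
    # bucket count: take the elementwise maximum of the registers.
    # This is exactly the sketch you'd get from the combined stream.
    if len(regs_a) != len(regs_b):
        raise ValueError("sketches must have the same number of buckets")
    return [max(a, b) for a, b in zip(regs_a, regs_b)]
```

This is what makes HLL work at large scale: per-host or per-day sketches can be merged later with no loss in accuracy.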
44. 44. What do we want? How many unique users belong to both segments?
45. 45. HLL intersection
46. 46. Inclusion-exclusion principle
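Inclusion-exclusion on top of HLL estimates is one line (a hedged sketch; the caveat about compounding error is mine, echoing the Neustar post cited on the next slide):

```python
def hll_intersection(est_a, est_b, est_union):
    # |A ∩ B| = |A| + |B| - |A ∪ B|, where est_union comes from merging
    # the two sketches. The errors of all three estimates compound, so a
    # small intersection of two large sets can be badly off, even negative.
    return est_a + est_b - est_union
```

Example: segments estimated at 1000 and 800 users with a merged-sketch union of 1500 give an intersection estimate of 300.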
47. 47. credits: http://research.neustar.biz/2012/12/17/hll-intersections-2/
48. 48. Python code: bit.ly/hloglog
49. 49. What do we want? Get the churn rate
50. 50. Straightforward approach: feed new data to a new sketch
51. 51. Sliding-window HyperLogLog
52. 52. We maintain a list of tuples (timestamp, R), where R is a possible maximum over future time
53. 53. Values that no longer make sense are automatically discarded from the list
54. 54. One list per bucket
55. 55. Take a maximum R over the given timeframe from the past, then estimate as we do in a regular HLL
56. 56. Extra memory is required
57. 57. All the details: bit.ly/SlidingHLL
58. 58. hash, don't sample; estimate, not precise; save memory; streaming. This slide is the sketch of the talk
59. 59. Lots of sketches for various purposes: percentiles, heavy hitters, similarity, other stream statistics
60. 60. Have we seen this user before?
61. 61. Bloom filter
62. 62. [diagram: k hash functions h1 … hk map an item i to k positions in a bit array, setting those bits to 1]
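A minimal Bloom filter matching the slide's diagram; sha1-based hashing and the default sizes are my choices for illustration:

```python
import hashlib

class BloomFilter:
    # k hash functions set k bits per item. Membership tests can
    # false-positive (hence "98% correct"), but never false-negative.
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive k independent positions by salting one hash function
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))
```

Usage: add every user you see; "have we seen this user before?" becomes an `in` test over 1024 bits instead of a set of full user IDs.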
63. 63. How many times did we see a user?
64. 64. Count-Min sketch is the answer: bit.ly/CountMinSketch
65. 65. [diagram: a d × w table of counters; hash functions h1 … hd each increment one counter in their row for item i] Estimate: take the minimum of the d values
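The d × w table and the take-the-minimum estimate can be sketched directly (sha1 hashing and the default width/depth are illustrative assumptions):

```python
import hashlib

class CountMinSketch:
    # d rows of w counters; each row's hash increments one counter per
    # update. Collisions only inflate counters, so the minimum over the
    # d rows overestimates but never underestimates the true count.
    def __init__(self, width=272, depth=5):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        # One salted hash per row
        for row in range(self.d):
            h = hashlib.sha1(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Estimate: take the minimum from the d values
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))
```

This answers "how many times did we see a user?" in d × w counters of fixed memory, regardless of how many distinct users flow through.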
66. 66. Percentiles
67. 67. Frugal sketching is not precise enough
68. 68. Sorting is pain
69. 69. Distribute incoming values to buckets?
70. 70. Some sort of clustering, maybe
71. 71. T-Digest
72. 72. Size is log(n), error is relative to q(1-q)
73. 73. Code: bit.ly/T-Digest-Java bit.ly/T-Digest-Python
74. 74. This is a growing field of computer science: stay tuned!
75. 75. Thanks and happy sketching!
76. 76. Reading list: Neustar Research blog: bit.ly/NRsketches Sketches overview: bit.ly/SketchesOverview Lecture notes on streaming algorithms: bit.ly/streaming-lectures