This document proposes HistoSketch, a method for sketching streaming histograms that preserves similarity and adapts to concept drift. It works by:
1) Generating weighted samples from histograms such that the probability two sketches match equals histogram similarity.
2) Incrementally updating sketches using a weight decay factor to forget older data and adapt to drift over time.
3) Evaluating HistoSketch on classification tasks involving synthetic and real-world streaming data, finding it approximates histogram similarity well using small, fixed-size sketches while adapting rapidly to drift.
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms with Concept Drift
1. HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms with Concept Drift
Dingqi Yang*, Bin Li†, Laura Rettig*, Philippe Cudré-Mauroux*
*eXascale Infolab, University of Fribourg, Switzerland
†School of Computer Science, Fudan University, Shanghai, China
2. HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms with Concept Drift
What kind of location is this?
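[Figure: a photo of an unknown location is compared with places already visited (Bar, University, Museum, Supermarket). Each place is represented by a numeric vector, e.g. (0.7, 0.6, 0.14, 0.21, 0.41, 0.63) and (0.64, 0.65, 0.21, 0.86, 0.24, 0.82), and the task is to compute the similarity between them.]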
3. Motivation
• Histogram similarity: foundation for many machine learning tasks
• Cardinality of histograms over data streams continuously increases
• Similarity-preserving data sketches
• Compact, fixed size
• Preserve similarity under certain measure
• Are incrementally updateable
• Concept drift: distribution of a histogram changes over time
• If taken into account can improve accuracy of histogram-based similarity
techniques
• Typical method: gradual forgetting
4. Background
Given a data stream of incoming elements $x_t$, each with a weight $w_t$, we compute a histogram $V$ such that $V_i$ is the weighted cumulative count of element $i$.
[Figure: streaming histogram elements ..., $x_{t-2}$, $x_{t-1}$, $x_t$ with weights $w_t$, and the corresponding histogram $V$.]
5. Problem Formulation
• Create and maintain the similarity-preserving sketch $S$ for the full streaming histogram $V$ such that
• each sketch has a fixed size $K$ ($K \ll |\mathcal{E}|$);
• the collision probability between two sketches $S^a$ and $S^b$ is the normalized min-max similarity between the histograms $V^a$ and $V^b$, so $SIM_{NMM}(V^a, V^b)$ can be approximated from the Hamming distance between $S^a$ and $S^b$;
• the sketch $S(t+1)$ can be efficiently computed from the incoming histogram element $x_{t+1}$, $S(t)$, and a weight decay factor $\lambda$.
[Figure: incremental updating. When a new element $x_{t+1}$ is received, $S(t+1)$ is computed from $x_{t+1}$, $S(t)$, and $\lambda$.]
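To make the two quantities in the problem formulation concrete, the following is a minimal sketch, assuming that "normalized min-max similarity" means the min-max similarity of the sum-to-one normalized histograms and that sketches are compared position by position. The function names are hypothetical and are not taken from the slides.

```python
import numpy as np

def normalized_min_max_similarity(va, vb):
    """Min-max similarity of two histograms after sum-to-one normalization.

    Assumes both histograms are indexed by the same elements and are non-empty.
    """
    va = np.asarray(va, dtype=float)
    vb = np.asarray(vb, dtype=float)
    pa = va / va.sum()  # normalize so that uniform scaling has no effect
    pb = vb / vb.sum()
    return float(np.minimum(pa, pb).sum() / np.maximum(pa, pb).sum())

def sketch_similarity(sa, sb):
    """Fraction of positions at which two equal-length sketches agree.

    This equals 1 minus the Hamming distance divided by the sketch length K.
    """
    sa = np.asarray(sa)
    sb = np.asarray(sb)
    return float(np.mean(sa == sb))
```

The first function needs the full histograms, while the second only touches K sketch elements, which is the reason the deck compares sketches instead of histograms.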
6. HistoSketch
• Based on the idea of consistent weighted sampling
• Generate samples such that the probability of drawing identical samples from
two vectors is equal to their min-max similarity.
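• Method draws three random variables $r_{i,j} \sim \mathrm{Gamma}(2,1)$, $c_{i,j} \sim \mathrm{Gamma}(2,1)$, $\beta_{i,j} \sim \mathrm{Uniform}(0,1)$ and then computes
$$y_{i,j} = \exp\left(r_{i,j}\left(\left\lfloor \frac{\log V_i}{r_{i,j}} + \beta_{i,j} \right\rfloor - \beta_{i,j}\right)\right)$$
which is used as input to the random hash value generation.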
7. HistoSketch
• We propose a new method to compute $y_{i,j}$:
$$y_{i,j} = \exp(\log V_i - r_{i,j}\,\beta_{i,j})$$
• and show that this method is 1) correct and 2) scale-invariant.
Sketch creation
$$a_{i,j} = \frac{c_{i,j}}{y_{i,j}\,\exp(r_{i,j})}$$
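[Figure: computing hash values. For a histogram $V$ with elements $i = 1, \dots, 5$, the hash values $a_{i,j}$ are 0.7, 0.6, 0.14, 0.21, 0.41. The minimum, 0.14, is at $i = 3$, so the sketch element is $S_j = 3$ and the corresponding hash value is $A_j = 0.14$.]
1. compute $y_{i,j}$
2. compute hash value $a_{i,j}$
3. set $S_j = \arg\min_{i \in \mathcal{E}} a_{i,j}$
4. set $A_j = \min_{i \in \mathcal{E}} a_{i,j}$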
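The four steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the histogram is assumed to be a dense array over a small, known element set, the random variables are drawn once and shared by all histograms, and the function names are hypothetical.

```python
import numpy as np

def make_random_variables(num_elements, sketch_length, seed=0):
    """Draw r, c ~ Gamma(2, 1) and beta ~ Uniform(0, 1), one per (i, j) pair."""
    rng = np.random.default_rng(seed)
    shape = (num_elements, sketch_length)
    r = rng.gamma(shape=2.0, scale=1.0, size=shape)
    c = rng.gamma(shape=2.0, scale=1.0, size=shape)
    beta = rng.uniform(0.0, 1.0, size=shape)
    return r, c, beta

def create_sketch(v, r, c, beta):
    """Return (S, A): sketch elements and their hash values for histogram v.

    Assumes v has at least one positive entry.
    """
    v = np.asarray(v, dtype=float)
    a = np.full(r.shape, np.inf)  # elements with zero weight can never win
    nz = v > 0
    # Step 1: y_{i,j} = exp(log V_i - r_{i,j} * beta_{i,j})
    y = np.exp(np.log(v[nz])[:, None] - r[nz] * beta[nz])
    # Step 2: a_{i,j} = c_{i,j} / (y_{i,j} * exp(r_{i,j}))
    a[nz] = c[nz] / (y * np.exp(r[nz]))
    # Steps 3 and 4: arg-minimum and minimum over elements i, per position j
    return a.argmin(axis=0), a.min(axis=0)
```

Because two histograms must share the same `r`, `c`, and `beta` for their sketches to be comparable, the random variables are generated once and passed in explicitly. Scaling `v` by a constant rescales every hash value by the same factor and leaves the arg-minimum unchanged, which is the scale-invariance the slide refers to.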
8. HistoSketch
Incremental Sketch Update
Computation of sketch $S(t+1)$ relies only on $S(t)$ (with its corresponding hash values $A(t)$), an incoming element $x_{t+1}$, and the weight decay factor $\lambda$.
[Figure: worked example of adjusting a sketch. The sketch element is $S_j(t) = 3$ with original hash value $A_j(t) = 0.14$. Step I scales $V(t)$ by $e^{-\lambda}$, which adjusts the hash value to $0.14 \times 1/e^{-\lambda} = 0.147$. Step II adds $x_{t+1}$ (element $i = 2$) and computes its hash value $a_{2,j} = 0.142$. Step III updates the sketch with the minimum, giving $S_j(t+1) = 2$ and $A_j(t+1) = 0.142$.]
1. scale existing elements in $A$
2. add $i'$ to histogram
3. recompute $a_{i',j}$
4. update sketch $S_j$ and hash values $A_j$ with minimum $a_j$
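The update steps above can be sketched as a single function. This is a minimal illustration under assumptions: the decayed histogram is kept in a dense array `v` as a stand-in for whatever compact structure stores it in practice, the new element always arrives with weight 1, and the function names are hypothetical.

```python
import numpy as np

def hash_values(v_i, r_i, c_i, beta_i):
    """a_{i,j} for one element i across all K sketch positions."""
    y = np.exp(np.log(v_i) - r_i * beta_i)
    return c_i / (y * np.exp(r_i))

def update_sketch(s, A, v, new_element, decay, r, c, beta):
    """Apply one incoming element; returns the new (S, A, V) without mutating inputs."""
    # Step I: scaling V by e^{-lambda} scales every hash value by e^{lambda}.
    v = np.asarray(v, dtype=float) * np.exp(-decay)
    A = np.asarray(A, dtype=float) * np.exp(decay)
    # Step II: add the incoming element with weight 1 and recompute its hash values.
    v[new_element] += 1.0
    a_new = hash_values(v[new_element], r[new_element],
                        c[new_element], beta[new_element])
    # Step III: keep the minimum per sketch position.
    replace = a_new < A
    s = np.where(replace, new_element, s)
    A = np.where(replace, a_new, A)
    return s, A, v
```

A fresh sketch can start from `s = np.full(K, -1)`, `A = np.full(K, np.inf)`, and `v = np.zeros(n)`, so that the first element wins every position. Only the hash values of the incoming element are recomputed, so each update costs O(K) regardless of how many distinct elements the histogram holds.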
9. Experimental Evaluation
• Classification task
• Given labeled streaming histograms, classify those histogram instances
without label
• KNN classifier takes data in the form of sketches for classification with 𝐾 = 5
• KNN takes most up-to-date training data for classification from continuously
updated sketches
10. Experimental Evaluation
• Synthetic dataset
• Generated from two Gaussian distributions representing two classes
• Simulate data streams with concept drift
• Abrupt: one stream starts to receive all elements from the other distribution
• Gradual: one stream starts to receive elements from the other distribution with
increasing probability, and the labels change
• Criteria:
1. How well is the similarity approximated? (impact of sketch length K)
2. How fast can it adapt to concept drift? (impact of weight decay factor λ)
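The two drift scenarios can be simulated along the following lines. Everything numeric here is an assumption made for illustration (the Gaussian means, the unit variance, the binning of samples into histogram elements, and the linear ramp for gradual drift); the slides only state that two Gaussian distributions are used and that the drift is either abrupt or gradual.

```python
import numpy as np

def drifting_stream(length, drift_start, kind="abrupt", ramp=200, seed=0,
                    mean_a=0.0, mean_b=3.0, num_bins=20, low=-4.0, high=7.0):
    """Return (elements, from_other) for one stream that drifts toward class B.

    elements   : histogram element (bin index) for each time step
    from_other : True where the sample was drawn from the other distribution
    """
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    if kind == "abrupt":
        # After drift_start, every element comes from the other distribution.
        p_other = (t >= drift_start).astype(float)
    elif kind == "gradual":
        # The probability of drawing from the other distribution grows linearly.
        p_other = np.clip((t - drift_start) / ramp, 0.0, 1.0)
    else:
        raise ValueError("kind must be 'abrupt' or 'gradual'")
    from_other = rng.random(length) < p_other
    samples = rng.normal(loc=np.where(from_other, mean_b, mean_a), scale=1.0)
    # Discretize the real-valued samples into histogram elements.
    edges = np.linspace(low, high, num_bins + 1)
    elements = np.clip(np.digitize(samples, edges) - 1, 0, num_bins - 1)
    return elements, from_other
```

Feeding `elements` one at a time into a sketch update and classifying at regular intervals gives the accuracy-over-time curves that the two criteria refer to.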
11. Experimental Evaluation
1. Impact of sketch length K
• Fix $\lambda = 0.02$ and vary $K = [20, 50, 100, 200, 500, 1000]$
• Compare against two methods that retain the full histograms:
• Histogram-Classical with unweighted elements
• Histogram-Forgetting with gradual forgetting weights
• A sketch length of $K = 500$ is sufficient to approximate Histogram-Forgetting
12. Experimental Evaluation
2. Impact of weight decay factor λ
• Fix $K = 100$ and vary $\lambda = [0, 0.005, 0.01, 0.02, 0.05, 0.1]$
• Compare against Histogram-LatestK, which builds a histogram from the latest $K = 100$ elements in the stream (unweighted)
• Similarity computation time:
• HistoSketch: 13 ms
• Histogram-LatestK: 133 ms
13. Experimental Evaluation
• POI dataset
• Infer a place’s category from its customers’ visiting pattern
• Foursquare dataset: user check-ins for two years from NYC, TKY, IST
• Data: user-time visit pairs discretized to the 168 hours in a week
• Compared methods:
• Histogram-Coarse: discretized time slots are considered as histogram elements
• Histogram-Fine-Classical: user-time pairs are considered as histogram elements
• Histogram-Fine-LatestK: only latest K histogram elements
• Histogram-Fine-Forgetting: gradual forgetting weights (𝜆 = 0.01)
• POISketch: unweighted sketching method that approximates Histogram-Fine-Classical
• HistoSketch: approximates Histogram-Fine-Forgetting (𝜆 = 0.01)
• Fix 𝐾 = 100
16. Conclusion
• We introduced HistoSketch, an efficient similarity preserving sketching method
for streaming histograms with concept drift.
• We demonstrated the effectiveness in approximating normalized min-max
similarity.
• We use incremental updates to the sketches with gradual forgetting to adapt to
concept drift.
• We showed on both synthetic and real-world data sets that this method
effectively and efficiently approximates similarity and adapts to concept drift.
• We observed a speed-up of 7500x on classification with a small loss of accuracy
of around 3.5%.
Thank you!
18. Backup: HistoSketch Implementation
• Former histogram $V(t)$ is required to compute $V(t+1)$
• The previous histogram is maintained in a modified count-min sketch $Q$
• We extend the count-min sketch with decay weights by scaling all counters: $Q(t) \cdot e^{-\lambda}$
• Parameter configuration: $d = 10$, $g = 50$ guarantees an error of at most 4% with probability 0.999
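A count-min sketch with decayed counters could look roughly like the following. This is a sketch under assumptions rather than the implementation behind the slides: treating $d$ as the number of hash rows and $g$ as the number of counters per row is an interpretation, and the class name, the salted use of Python's built-in `hash`, and the method names are all hypothetical.

```python
import numpy as np

class DecayingCountMin:
    """Count-min sketch whose counters are all scaled by e^{-lambda} on decay."""

    def __init__(self, depth=10, width=50, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.table = np.zeros((depth, width))
        # One random salt per row gives each row its own hash function.
        self.salts = [int(x) for x in rng.integers(1, 2**31 - 1, size=depth)]

    def _columns(self, item):
        # Python's % always returns a non-negative column index here.
        return [hash((salt, item)) % self.width for salt in self.salts]

    def decay(self, lam):
        """Scale every counter: Q(t) * e^{-lambda}."""
        self.table *= np.exp(-lam)

    def add(self, item, weight=1.0):
        for row, col in enumerate(self._columns(item)):
            self.table[row, col] += weight

    def estimate(self, item):
        """Minimum over rows; hash collisions can only inflate this estimate."""
        return min(self.table[row, col]
                   for row, col in enumerate(self._columns(item)))
```

On each arrival the caller would invoke `decay(lam)` and then `add(item)`, and `estimate(item)` then supplies the decayed weight needed to recompute that element's hash values. Scaling the whole table on every arrival costs O(d·g), which stays small for the configuration quoted above.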
20. Backup: Future Work
• The computation of $a_{i,j}$ can be further simplified
• Applications to other domains: e.g., recommendation, community detection
Editor's Notes
Overall. Highlight some words (lots of monotonous text)
Just from looking at the photo, what location is this? Can’t go there but I know the places I’ve been to.
Check from these which is the most similar.
Analogy: full histogram = going there (potentially high effort, costly)
Sketch = looking at photo to judge similarity to known locations
Talk about concept drift
You may be familiar with common sketching techniques such as the very simple Bloom filter or the count-min sketch; these are not similarity-preserving
V: classical histogram is a vector of ever-growing cardinality
Weights $w_t$ are inversely proportional to the age of the element
K is significantly smaller than the actual cardinality $|\mathcal{E}|$
Refer to paper for the proofs of these properties
Keep both sketch S and its corresponding hash values A
input: $V$, sketch length $K$, random variables $r$, $c$, $\beta$
output: $S$ with hash values $A$
for $j = 1 \dots K$:
    Compute $y_{i,j}$
    Compute $a_{i,j}$ using $y_{i,j}$
    Set sketch element $S_j = \arg\min_{i \in \mathcal{E}} a_{i,j}$
    Set hash value $A_j = \min_{i \in \mathcal{E}} a_{i,j}$
Sum-to-one normalization amounts to uniform scaling, so we can scale only $A$ and still be correct
Upon arrival of $x_{t+1}$:
Weights of existing histogram elements are adjusted by a factor of $e^{-\lambda}$ by scaling $A$.
Add incoming element $i'$ with weight 1 to the scaled histogram.
Recompute $a_{i',j}$, then update:
$S_j(t+1) = i'$ if $a_{i',j} < A_j(t) \cdot e^{\lambda}$, else $S_j(t)$
$A_j(t+1) = a_{i',j}$ if $a_{i',j} < A_j(t) \cdot e^{\lambda}$, else $A_j(t) \cdot e^{\lambda}$
Approximation = overall accuracy
Adaptation to concept drift = accuracy recovery speed after concept drift
Histogram-Classical adapts very slowly
Histogram-Forgetting is fast to adapt
HistoSketch adapts just as quickly as Histogram-Forgetting to concept drift
Sketch length K has no impact on adaptation speed, but it has a positive impact on classification accuracy: longer sketches are more accurate because they better approximate the original histogram
However, there is little difference beyond K = 500, where accuracy is almost equal to that of Histogram-Forgetting
Trade-off between concept drift adaptation speed and accuracy: a high weight decay factor means quicker recovery, as outdated data is quickly forgotten, but overall lower accuracy, as less information from former histograms is used
We observe the same trade-off in Histogram-LatestK: its adaptation is slower than HistoSketch with λ = 0.05 and faster than with λ = 0.02, while its accuracy is higher than with λ = 0.05 and lower than with λ = 0.02
Advantages of HistoSketch: HistoSketch can balance adaptation speed and accuracy, and HistoSketch is much faster for similarity computation (relies only on Hamming distance)
Reminder: what is POI
Different places typically have different temporal visiting patterns
Fine-grained patterns: user+time instead of only time gives more accuracy
POI abrupt change: e.g. change of type of POI – clothing store to art gallery
POI gradual change: e.g. small change in POI – new menu items at a restaurant
Split into 80% train and 20% unlabeled test
Histogram-coarse: (i.e. visitor count per time slot, no information on user)
POISketch = HistoSketch with decay factor 0
Histogram-Coarse: worst accuracy (to be expected)
Histogram-Fine-Forgetting: highest accuracy
HistoSketch outperforms POISketch due to gradual forgetting weights
HistoSketch: small loss in accuracy against Histogram-Fine-Forgetting
HistoSketch achieves a speedup of about 7500x over Histogram-Fine-Forgetting, since the Hamming distance can be computed much more efficiently than the normalized min-max similarity between full histograms
HistoSketch also takes much less memory to maintain
Longer sketches = slightly higher processing time, but still much less than other methods
Can handle real-world scenarios: Foursquare peaks at 7 million check-ins per day, which is about 81 check-ins per second
Method is also parallelizable as sketches for different POIs can be independently maintained
TODO: Add normalization
Maybe remove this slide
Accuracy increases over time with more information
POISketch approximates Classical, HistoSketch approximates Forgetting => higher accuracy
Larger improvement with the presence of abrupt drift: HistoSketch is more accurate at handling concept drift than Classical by approximating Forgetting
No sudden drop in accuracy as POIs don’t change their type simultaneously