Presentation at CIKM 2012 of the CUbRIK research paper "Efficient Jaccard-based Diversity Analysis of Large Document Collections", authored by Fan Deng, Stefan Siersdorfer and Sergej Zerr of L3S Research Center, a partner of the CUbRIK Consortium.
CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections
1. 21st CIKM Conference, Lahaina, Maui, Hawaii, 30/10/12
Fan Deng, Stefan Siersdorfer, Sergej Zerr
21st ACM Conference on Information and Knowledge Management, CIKM 2012, Lahaina, Maui, Hawaii
2. Diversity
Diversity: a health measure of an ecosystem
Biodiversity is the degree of variation of life forms within a given species, ecosystem, biome, or an entire planet. Biodiversity is a measure of the health of ecosystems (Wikipedia).
4. Diversity in Computer Science
Our focus: topic diversity of large text corpora
Social Web Environment – Ecosystem
Group dynamics
“Hot topics”, controversial topics
Diversity of opinions
Topic ambiguity
Temporal topic analysis
Increasing amounts of data are published on the Internet every day, not least due to popular social web environments: YouTube, Flickr, the blogosphere, etc.
6. Diversity Metrics
• Simpson's diversity index [1]
Each object belongs to one of a discrete set of Z categories; p_i is the proportion of objects in category i
$D = \sum_{i=1}^{Z} p_i^2$
• Stirling's index [2]
Depends on the distances d_ij between objects and their relative occurrences p_i, p_j
$D = \sum_{ij\,(i \neq j)} (d_{ij})^{\alpha} (p_i \cdot p_j)^{\beta}$
[1] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.
[2] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007.
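As an illustration of the two metrics on this slide, here is a minimal Python sketch. The function names, the toy labels and the constant distance function are assumptions made for the example, and the Stirling sum is taken over ordered pairs (i, j) with i ≠ j, which is one possible reading of the formula.

```python
from collections import Counter


def simpson_index(labels):
    """Simpson's index: sum of squared category proportions p_i."""
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) ** 2 for c in counts.values())


def stirling_index(proportions, distance, alpha=1.0, beta=1.0):
    """Stirling's index over ordered pairs i != j (an assumed convention).

    proportions: dict mapping category -> relative occurrence p_i
    distance:    hypothetical callable returning d_ij for two categories
    """
    total = 0.0
    for i in proportions:
        for j in proportions:
            if i != j:
                total += (distance(i, j) ** alpha) * (
                    (proportions[i] * proportions[j]) ** beta
                )
    return total


labels = ["A", "A", "B", "C"]          # toy data: p = 0.5, 0.25, 0.25
props = {"A": 0.5, "B": 0.25, "C": 0.25}
print(simpson_index(labels))                     # sum of p_i^2
print(stirling_index(props, lambda i, j: 1.0))   # equals 1 - sum of p_i^2 here
```

With all distances set to 1 and α = β = 1, the Stirling value reduces to one minus the Simpson value, which gives a quick sanity check.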
7. Refined Jaccard Index
• Refined Jaccard Index (RDJ): the average Jaccard similarity between all possible object pairs
$RDJ = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} JS(O_i, O_j)$
• Note: a lower RDJ value corresponds to higher diversity
• Problem: the "All-Pair Problem"
• Solution: estimation algorithms with probabilistic error bound guarantees
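The exact RDJ value can be computed directly from its definition by visiting every pair, which is the quadratic baseline the estimation algorithms try to avoid. The sketch below is a minimal Python version; the function names are assumptions, and the five toy documents are the ones used in the Min-hash slides of this deck.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rdj_all_pairs(objects):
    """Average Jaccard similarity over all n(n-1)/2 unordered pairs."""
    n = len(objects)
    if n < 2:
        raise ValueError("need at least two objects")
    total = sum(jaccard(a, b) for a, b in combinations(objects, 2))
    return 2.0 * total / (n * (n - 1))


docs = [
    {"A", "B", "C"},   # D1
    {"B", "C", "D"},   # D2
    {"C", "E"},        # D3
    {"E", "B", "D"},   # D4
    {"A", "C", "D"},   # D5
]
print(rdj_all_pairs(docs))  # the ten pairwise similarities sum to 3.4, so about 0.34
```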
8. Estimation Algorithms
• Input: relative error ε, accuracy confidence δ
• Output: estimated RDJ value $\widehat{RDJ}$ with
$\Pr\left[ \frac{|\widehat{RDJ} - RDJ|}{RDJ} \le \varepsilon \right] \ge \delta$
• Algorithms: SampleDJ, TrackDJ (claims and proofs in the paper)
10. SampleDJ Overview
• Execution time: on the order of $\frac{1}{\varepsilon^2 \cdot RDJ}$ trials
• Properties:
Execution time (number of trials) does not depend on the data set size, only on the RDJ value
For a dataset with very high diversity (RDJ close to 0), the running time can grow without bound
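The sampling idea behind this slide can be sketched as follows: draw random object pairs and average their Jaccard similarities. This is a simplified illustration, not the SampleDJ algorithm from the paper. In particular, the number of trials is a fixed, caller-supplied parameter here (an assumption), whereas the paper derives it from ε and δ; the names `sample_rdj`, `num_trials` and `seed` are hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sample_rdj(objects, num_trials=10_000, seed=0):
    """Estimate RDJ as the mean Jaccard similarity of uniformly sampled pairs.

    Assumption: num_trials is fixed by the caller; the stopping rule that
    ties it to epsilon and delta is not reproduced here.
    """
    if len(objects) < 2:
        raise ValueError("need at least two objects")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        i, j = rng.sample(range(len(objects)), 2)  # two distinct indices
        total += jaccard(objects[i], objects[j])
    return total / num_trials


docs = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}, {"E", "B", "D"}, {"A", "C", "D"}]
print(sample_rdj(docs, num_trials=20_000))  # close to the exact value 0.34
```

The cost of the loop depends only on the number of trials, not on the number of objects, which matches the property stated on the slide.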
11. Estimation Algorithm TrackDJ
• Broder et al. 2000 proposed min-wise independent hashing (Min-hash)
D1(A,B,C), D2(B,C,D)
π1 = (E,B,A,C,D)
h1(D1) = B
h1(D2) = B
$\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$
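The Min-hash property on this slide, Pr[h(Dx) = h(Dy)] = JS(Dx, Dy), can be checked exhaustively on the toy documents by enumerating every permutation of the five-element universe. The following Python sketch does that with exact fractions; the helper names are assumptions, and the min-hash of a document is taken to be the first element of the permutation that occurs in it, as in the π1 example.

```python
from fractions import Fraction
from itertools import permutations


def min_hash(doc, permutation):
    """Return the first element of `permutation` that occurs in `doc`."""
    for item in permutation:
        if item in doc:
            return item
    raise ValueError("document shares no element with the permutation")


def collision_probability(doc_x, doc_y, universe):
    """Exact fraction of permutations under which both min-hashes agree."""
    perms = list(permutations(universe))
    hits = sum(min_hash(doc_x, p) == min_hash(doc_y, p) for p in perms)
    return Fraction(hits, len(perms))


def jaccard(doc_x, doc_y):
    """Exact Jaccard similarity as a fraction."""
    return Fraction(len(doc_x & doc_y), len(doc_x | doc_y))


D1, D2 = {"A", "B", "C"}, {"B", "C", "D"}
print(collision_probability(D1, D2, "ABCDE"), jaccard(D1, D2))  # both are 1/2
```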
12. Estimation Algorithm TrackDJ
• Broder et al. proposed min-wise independent hashing (Min-hash)
D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)
π1 = (E,B,A,C,D)
h1(D1) = B
h1(D2) = B
h1(D3) = E
h1(D4) = E
h1(D5) = A
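The hash values listed for π1 can be reproduced with a few lines of Python. The sketch also groups the documents by hash value; that grouping step is an assumption about how the hashes are used afterwards, and the helper names are hypothetical.

```python
from collections import defaultdict


def min_hash(doc, permutation):
    """Return the first element of `permutation` that occurs in `doc`."""
    for item in permutation:
        if item in doc:
            return item
    raise ValueError("document shares no element with the permutation")


def buckets(docs, permutation):
    """Group document names by their min-hash value under one permutation."""
    groups = defaultdict(list)
    for name, doc in docs.items():
        groups[min_hash(doc, permutation)].append(name)
    return dict(groups)


docs = {
    "D1": {"A", "B", "C"},
    "D2": {"B", "C", "D"},
    "D3": {"C", "E"},
    "D4": {"E", "B", "D"},
    "D5": {"A", "C", "D"},
}
pi1 = ("E", "B", "A", "C", "D")

for name, doc in docs.items():
    print(name, min_hash(doc, pi1))
print(buckets(docs, pi1))
```

Documents that land in the same bucket (D1 with D2, D3 with D4) are exactly the pairs whose hash values collide under π1.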
13. TrackDJ Overview
• Time complexity: $O(n)$
• Properties:
Execution in linear time (depends on the data set size)
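One way to see how Min-hash can give a linear-time estimate is to count colliding pairs per hash bucket instead of comparing pairs directly: since each pair collides with probability equal to its Jaccard similarity, the expected fraction of colliding pairs equals RDJ. The Python sketch below illustrates this idea only; it is not the TrackDJ algorithm from the paper, and the function name, the number of hash functions and the use of random rank assignments as permutations are all assumptions. Documents are assumed to be non-empty sets of sortable items.

```python
import random
from collections import Counter


def estimate_rdj_minhash(docs, num_hashes=200, seed=0):
    """Average, over random permutations, the fraction of colliding pairs."""
    n = len(docs)
    if n < 2:
        raise ValueError("need at least two documents")
    rng = random.Random(seed)
    universe = sorted(set().union(*docs))
    total_pairs = n * (n - 1) // 2
    estimate = 0.0
    for _ in range(num_hashes):
        # A random permutation, stored as element -> rank.
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)
        rank = dict(zip(universe, ranks))
        # One pass over the documents: bucket sizes by min-hash value.
        bucket_sizes = Counter(min(doc, key=rank.__getitem__) for doc in docs)
        # A bucket of size c contributes c(c-1)/2 colliding pairs.
        colliding = sum(c * (c - 1) // 2 for c in bucket_sizes.values())
        estimate += colliding / total_pairs
    return estimate / num_hashes


docs = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}, {"E", "B", "D"}, {"A", "C", "D"}]
print(estimate_rdj_minhash(docs, num_hashes=2000))  # close to the exact value 0.34
```

Each round touches every document once, so the work per hash function grows linearly with the number of documents when document sizes are bounded.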
14. Experimental Evaluation of the Theoretical Claims (Flickr Dataset)
ε=5%, δ=95%
n         | All Pairs RDJ | All Pairs time (s) | SampleDJ error (%) | SampleDJ time (s) | TrackDJ error (%) | TrackDJ time (s)
1,000     | 0.00206       | 0.08               | 0.017              | 34 (0.57 min)     | 0                 | 40 (0.66 min)
10,000    | 0.001992      | 8.82               | 0.028              | 40 (0.67 min)     | 0.013             | 410 (6.84 min)
100,000   | 0.001992      | 912 (15.21 min)    | 0.019              | 90 (1.50 min)     | 0.043             | 5,253 (1.46 h)
1,000,000 | 0.001993      | 97,215 (27 h)      | 0.08               | 223 (3.72 min)    | 0.041             | 51,730 (14.37 h)

n          | All Pairs time       | SampleDJ RDJ | SampleDJ time (s) | TrackDJ RDJ | TrackDJ time (s)
10,000,000 | 113 days (estimated) | 0.001998     | 350 (5.84 min)    | 0.001997    | 790,016 (9.14 days)
20,000,000 | 450 days (estimated) | 0.002203     | 246 (4.10 min)    | 0.002206    | 1,613,566 (16.68 days)
[Figure: three plots of running time t against dataset size]
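The "estimated" All Pairs times for the two largest datasets are consistent with a simple quadratic extrapolation from the measured 1,000,000-object run. The short Python check below assumes that this is how they were obtained, which the slides do not state; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def extrapolate_all_pairs_seconds(measured_n, measured_seconds, target_n):
    """Scale a measured all-pairs time assuming cost proportional to n^2."""
    return measured_seconds * (target_n / measured_n) ** 2


for n in (10_000_000, 20_000_000):
    seconds = extrapolate_all_pairs_seconds(1_000_000, 97_215, n)
    print(f"{n:>12,}  {seconds / SECONDS_PER_DAY:.1f} days")
```

Under this assumption the script yields roughly 112.5 and 450.1 days, in line with the 113 and 450 days shown in the table.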
16. Applications
Flickr photo tags similarity over the time period 2005-2010
winter, snow, vacation, or house
graduation, wedding, beach
halloween, thanksgiving
christmas
[Figure: tag similarity over time, on a scale from "Similar" to "Diverse"]
17. Applications: Diversity vs. #Clusters
Reuters RCV1 Categories
Size    | News Category        | RDJ
299,612 | Corporate/Industrial | 4.31
204,820 | Markets              | 4.79
66,339  | Economics            | 4.69
35,769  | Government/Social    | 5.73
35,279  | Sports               | 3.45
33,969  | Domestic Politics    | 5.21
31,328  | War, Civil War       | 5.81

Flickr Groups
Size    | Group Title              | RDJ
139,344 | Pictures of England      | 1.63
121,391 | Dark Art                 | 0.57
98,901  | Aircraft Photos          | 1.99
89,606  | Absolutely beautiful     | 0.51
76,265  | Visual Arts!!            | 0.61
73,632  | Lonely Planet: "Leaving" | 0.48
71,158  | Lighthouse Lovers        | 4.56
19. Conclusion & Future Work
• Average similarity of all object pairs can be computed in linear time
• Two novel algorithms with probabilistic guarantees and different properties
SampleDJ: fast for most datasets; running time does not depend on dataset size
TrackDJ: guaranteed to solve the problem in linear time
Future Work:
• Applying other similarity measures
• Studying visual features in multi-media collections
• Experiments with parallelization
20. Data sets and source code: http://www.l3s.de/~deng/
Fan Deng, Stefan Siersdorfer, Sergej Zerr
zerr@L3S.de
Thank you!
Image source: http://en.wikipedia.org/wiki/File:Phanerozoic_Biodiversity.png
21. REFERENCES
[1] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. PODS '02, Madison, Wisconsin.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
[4] A. Z. Broder. Min-wise independent permutations: Theory and practice. ICALP '00, London, UK.
[5] A. Z. Broder. On the resemblance and containment of documents. SEQUENCES '97, Washington, USA.
[6] A. Z. Broder. Identifying and filtering near-duplicate documents. COM '00, London, UK, 2000.
[7] A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60:630–659, June 2000.
[8] M. Charikar. Similarity estimation techniques from rounding algorithms. STOC '02.
[9] P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte Carlo estimation. SIAM J. Comput., 29:1484–1496, March 2000.
[10] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. SCG '04, New York, USA.
[11] J. D. Fearon. Ethnic and cultural diversity by country. Journal of Economic Growth, 8:195–222, 2003.
[12] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. WWW '09, Madrid, Spain.
[13] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. STOC '98, Dallas, Texas, USA.
[14] C. J. Krebs. Ecological Methodology. HarperCollins, 1989.
[15] C. Lévêque and J.-C. Mounolou. Biodiversity. John Wiley & Sons, 2003.
[16] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[17] M. Ley. The dblp computer science bibliography. http://www.informatik.uni-trier.de/~ley/db/.
[18] S. Lieberson. Measuring population diversity. American Sociological Review, 34(6):850–862, 1969.
[19] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[20] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res., 2:397–418, March 2002.
[21] E. Minack, W. Siberski, and W. Nejdl. Incremental diversification for very large sets: a streaming-based approach. In SIGIR ’11, Beijing, China.
[22] F. Olken. Random sampling from databases. Ph.D. dissertation, University of California at Berkeley, 1993.
[23] O. Papapetrou, W. Siberski, and N. Fuhr. Text clustering for peer-to-peer networks with probabilistic guarantees. LNCS, vol. 5993, pages 293–305. Springer Berlin / Heidelberg, 2010.
[24] D. Rafiei, K. Bharat, and A. Shukla. Diversifying web search results. In WWW ’10, Raleigh, USA.
[25] I. Rafols and M. Meyer. Diversity and network coherence as indicators of interdisciplinarity: case studies in bionanoscience. Scientometrics, 82(2):263–287, 2010.
[26] N. Sahoo, J. Callan, R. Krishnan, G. Duncan, and R. Padman. Incremental hierarchical clustering of text documents. In CIKM ’06.
[27] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.
[28] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007.
[29] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. A. Yahia. Efficient computation of diverse query results. In ICDE’08, Washington, DC, USA.
[30] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW ’05, New York, USA.
22. Similarity Measures
• There exists a large number of possible measures:
Cosine similarity, Okapi, inverted distances, etc.
• Jaccard similarity (computationally efficient)
Each object belongs to one of a discrete set of categories
$JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$
Text 1: island maui second largest hawaiian
Text 2: tenerife largest island seven canary
Jaccard similarity: JS = 2/8 = 0.25 (two shared words, "island" and "largest", out of eight distinct words)
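The two-text example can be checked with a minimal Python sketch that treats each text as a set of whitespace-separated words; the function name is an assumption. The texts share two words and contain eight distinct words in total, so the similarity comes out as 2/8.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity of two texts viewed as sets of words."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


text1 = "island maui second largest hawaiian"
text2 = "tenerife largest island seven canary"
print(jaccard_similarity(text1, text2))  # 2 shared words / 8 distinct words = 0.25
```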
23. Estimation Algorithm TrackDJ
• Broder et al. proposed min-wise independent hashing (Min-hash)
D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)
π1 = (E,B,A,C,D)
h1(D1) = B
h1(D2) = B
h1(D3) = E
h1(D4) = E
h1(D5) = A
π2 = (A,C,B,D,E)
h2(D1) = A
h2(D2) = C
h2(D3) = C
h2(D4) = B
h2(D5) = A
$\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$
28. Problem Statement: the "All-Pair" problem
• To measure the diversity of a dataset, similarity computation between all possible pairs is required
$O(n^2)$ complexity
not feasible for large datasets
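To make the quadratic growth concrete, the number of unordered pairs is n(n-1)/2. The small Python sketch below evaluates it for three of the dataset sizes used in the Flickr evaluation; the function name is an assumption.

```python
def num_pairs(n):
    """Number of unordered object pairs among n objects."""
    return n * (n - 1) // 2


for n in (1_000, 1_000_000, 20_000_000):
    print(f"{n:>12,} objects -> {num_pairs(n):>22,} pairs")
```

Going from one million to twenty million objects multiplies the number of pairs by roughly 400, from about 5 × 10^11 to about 2 × 10^14.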
29. Applications: Diversity vs. #Clusters
UCI US-Census Educ. Based Clusters
Size    | Educational Background        | RDJ
562,837 | High School, Diploma, GED     | 49.41
366,116 | Some College w/o Degree       | 49.22
273,281 | 5th, 6th, 7th, or 8th Grade   | 51.63
213,941 | Bachelors Degree              | 51.36
174,653 | 1st, 2nd, 3rd or 4th Grade    | 70.97
108,834 | N/A, Less Than 3 Years Old    | 84.55
107,142 | 10th Grade                    | 47.56
30. Similarity Measures
• There exists a large number of possible measures:
Cosine similarity, Okapi, inverted distances, etc.
• Jaccard similarity has special properties that we make use of in our algorithms
$JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$
Editor's Notes
Genetic diversity serves as a way for populations to adapt to changing environments. With more variation, it is more likely that some individuals in a population will possess variations of alleles that are suited for the environment. Those individuals are more likely to survive to produce offspring bearing that allele. The population will continue for more generations because of the success of these individuals.