21st CIKM Conference, Lahaina, Maui, Hawaii, 30/10/12
Fan Deng, Stefan Siersdorfer, Sergej Zerr
21st ACM Conference on Information and Knowledge Management, CIKM 2012, Lahaina, Maui, Hawaii
Diversity
Diversity: a health measure of an ecosystem
Biodiversity is the degree of variation of life forms within a given species, ecosystem, biome, or an entire planet. Biodiversity is a measure of the health of ecosystems (Wikipedia).
Diversity in Computer Science
Our focus: topic diversity of large text corpora
Social Web Environment – Ecosystem
Group dynamics
“Hot topics”, controversial topics
Diversity of opinions
Topic ambiguity
Temporal topic analysis
Increasing amounts of data are published on the Internet on a daily basis, not least due to popular social web environments: YouTube, Flickr, the blogosphere, etc.
Outline
• Motivation: Document Topic Diversity
• Diversity Metrics
• Proposed Efficient Algorithms: SampleDJ, TrackDJ
• Experiments
• Applications
• Future Work: Ideas & Directions
Diversity Metrics
• Simpson's diversity index [1]
  - Each object belongs to one of a discrete set of categories
  $D = \sum_{i=1}^{Z} p_i^2$, where $p_i$ is the relative frequency of category $i$ and $Z$ is the number of categories
• Stirling's index [2]
  - Depends on distances between objects and their relative occurrences
  $D = \sum_{ij\,(i \neq j)} (d_{ij})^{\alpha} (p_i p_j)^{\beta}$, where $d_{ij}$ is the distance between categories $i$ and $j$
[1] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.
[2] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007.
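As a small illustration of the category-based metric, the sketch below computes Simpson's index as the sum of squared category shares. The function name `simpson_index` and the two label lists are hypothetical and not taken from the talk; a larger value means the objects are concentrated in fewer categories.

```python
from collections import Counter


def simpson_index(labels):
    """Return sum_i p_i^2, where p_i is the share of objects in category i."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())


# Hypothetical category assignments, used only for illustration.
uniform = ["A", "B", "C", "D"]  # four equally frequent categories
skewed = ["A", "A", "A", "B"]   # one dominant category

print(simpson_index(uniform))  # 0.25
print(simpson_index(skewed))   # 0.625
```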
Refined Jaccard Index
• Refined Jaccard Index (RDJ): average Jaccard similarity between all possible object pairs
  $RDJ = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} JS(O_i, O_j)$
• Note: a lower RDJ value corresponds to higher diversity
• Problem: “All-Pair Problem”
• Solution: estimation algorithms with probabilistic error bound guarantees
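For reference, the exact index can be computed by brute force over all pairs, which is the quadratic baseline that the estimation algorithms are meant to avoid. This is a minimal sketch: the names `jaccard` and `rdj_all_pairs` and the toy sets are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rdj_all_pairs(objects):
    """Average Jaccard similarity over all n(n-1)/2 unordered pairs."""
    n = len(objects)
    if n < 2:
        raise ValueError("need at least two objects")
    total = sum(jaccard(a, b) for a, b in combinations(objects, 2))
    return 2.0 * total / (n * (n - 1))


# Toy objects chosen for illustration only.
docs = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}]
print(rdj_all_pairs(docs))
```

The loop visits every pair once, so the running time grows quadratically with the number of objects.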
Estimation Algorithms
• Input: relative error ε, accuracy confidence δ
• Output: estimated RDJ value $\widehat{RDJ}$ with
  $\Pr\left[ \frac{|\widehat{RDJ} - RDJ|}{RDJ} \le \epsilon \right] \ge \delta$
• Algorithms: SampleDJ, TrackDJ (claims and proofs in the paper)
Estimation Algorithm SampleDJ
[Figure: Document Set; Document subsets: Step 1; Median]
SampleDJ Overview
• Execution time: $O\left(\frac{1}{\epsilon^2 \cdot RDJ}\right)$
• Properties:
  - Execution time (number of trials) does not depend on the data set size, only on the RDJ value
  - For a dataset with very high diversity (RDJ close to 0), the algorithm can run for an arbitrarily long time
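The sampling idea can be sketched as follows. This is not the SampleDJ algorithm from the paper: it uses a fixed number of sampled pairs and therefore gives no (ε, δ) guarantee, whereas the slide states that the number of trials depends on the RDJ value. The names `sample_mean` and `estimate_rdj`, the default parameters, and the toy documents are assumptions.

```python
import random
import statistics


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sample_mean(objects, num_pairs, rng):
    """Mean Jaccard similarity of num_pairs uniformly sampled pairs."""
    total = 0.0
    for _ in range(num_pairs):
        a, b = rng.sample(objects, 2)  # two objects at distinct positions
        total += jaccard(a, b)
    return total / num_pairs


def estimate_rdj(objects, num_pairs=2000, num_groups=5, seed=0):
    """Median of several independent sample means (a median-of-means sketch)."""
    rng = random.Random(seed)
    return statistics.median(
        sample_mean(objects, num_pairs, rng) for _ in range(num_groups)
    )


# Toy documents for illustration; their exact RDJ is 0.34.
docs = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}, {"E", "B", "D"}, {"A", "C", "D"}]
print(estimate_rdj(docs))
```

The cost of this sketch depends on the number of sampled pairs rather than on the number of documents, which matches the first property listed above.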
Estimation Algorithm TrackDJ
• Broder et al. (2000) proposed min-wise independent hashing (Min-hash)
D1(A,B,C), D2(B,C,D)
π1 = (E,B,A,C,D)
h1(D1) = B
h1(D2) = B
$\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$
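The min-hash property, that two documents receive the same hash value with probability equal to their Jaccard similarity, can be checked exactly on a tiny universe by enumerating every permutation. The sketch below does this for D1 and D2. The helper names `min_hash` and `collision_probability` are hypothetical, and the universe {A, ..., E} is assumed from the permutation shown on the slide.

```python
from fractions import Fraction
from itertools import permutations


def min_hash(doc, perm):
    """First element of the permutation that occurs in the document."""
    return next(x for x in perm if x in doc)


def collision_probability(d_x, d_y, universe):
    """Exact Pr[h(d_x) = h(d_y)] over all permutations of the universe."""
    perms = list(permutations(universe))
    hits = sum(min_hash(d_x, p) == min_hash(d_y, p) for p in perms)
    return Fraction(hits, len(perms))


universe = "ABCDE"
d1, d2 = {"A", "B", "C"}, {"B", "C", "D"}
jaccard = Fraction(len(d1 & d2), len(d1 | d2))
print(collision_probability(d1, d2, universe), jaccard)  # 1/2 1/2
```

Enumerating all permutations is only feasible for toy universes; it is used here purely to make the equality visible.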
Estimation Algorithm TrackDJ
• Broder et al. proposed min-wise independent hashing (Min-hash)
D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)
π1 = (E,B,A,C,D)
h1(D1) = B
h1(D2) = B
h1(D3) = E
h1(D4) = E
h1(D5) = A
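The hash values listed for π1 can be reproduced with a few lines. This is a sketch in which the min-hash of a document is taken to be the first element of the permutation that the document contains; the function name `min_hash` and the dictionary layout are assumptions.

```python
def min_hash(doc, perm):
    """First element of the permutation that occurs in the document."""
    return next(x for x in perm if x in doc)


pi1 = ("E", "B", "A", "C", "D")
docs = {
    "D1": {"A", "B", "C"},
    "D2": {"B", "C", "D"},
    "D3": {"C", "E"},
    "D4": {"E", "B", "D"},
    "D5": {"A", "C", "D"},
}

for name, doc in docs.items():
    print(f"h1({name}) = {min_hash(doc, pi1)}")
```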
TrackDJ Overview
• Time complexity: $O(n)$
• Properties:
  - Execution in linear time (depends on the data set size)
Experimental Evaluation of the Theoretical Claims (Flickr Dataset)
ε=5%, δ=95%
n | All Pairs RDJ | All Pairs time (s) | SampleDJ error (%) | SampleDJ time (s) | TrackDJ error (%) | TrackDJ time (s)
1,000 | 0.00206 | 0.08 | 0.017 | 34 (0.57 min) | 0 | 40 (0.66 min)
10,000 | 0.001992 | 8.82 | 0.028 | 40 (0.67 min) | 0.013 | 410 (6.84 min)
100,000 | 0.001992 | 912 (15.21 min) | 0.019 | 90 (1.50 min) | 0.043 | 5,253 (1.46 h)
1,000,000 | 0.001993 | 97,215 (27 h) | 0.08 | 223 (3.72 min) | 0.041 | 51,730 (14.37 h)

n | All Pairs time | SampleDJ RDJ | SampleDJ time (s) | TrackDJ RDJ | TrackDJ time (s)
10,000,000 | 113 days (estimated) | 0.001998 | 350 (5.84 min) | 0.001997 | 790,016 (9.14 days)
20,000,000 | 450 days (estimated) | 0.002203 | 246 (4.10 min) | 0.002206 | 1,613,566 (18.68 days)
[Three plots: execution time t vs. dataset size]
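The two "estimated" All Pairs times are consistent with a quadratic extrapolation from the measured 97,215 seconds at n = 1,000,000. The slides do not say how the estimates were obtained, so the sketch below is only a plausibility check under that assumption; `extrapolate_quadratic` is a hypothetical helper.

```python
def extrapolate_quadratic(t_ref, n_ref, n):
    """Scale a measured running time t_ref at size n_ref to size n, assuming O(n^2)."""
    return t_ref * (n / n_ref) ** 2


T_1M_SECONDS = 97_215      # measured All Pairs time for n = 1,000,000
SECONDS_PER_DAY = 86_400

for n in (10_000_000, 20_000_000):
    days = extrapolate_quadratic(T_1M_SECONDS, 1_000_000, n) / SECONDS_PER_DAY
    print(f"n = {n:,}: about {days:.1f} days")
```

Under this assumption the extrapolated values come out at roughly 112.5 and 450.1 days, in line with the 113 and 450 days reported in the table.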
Experimental Evaluation of the Theoretical Claims (Synthetic Dataset)
All-Pair time (5.3 hours) and TrackDJ time (2.5 hours) are each given once on the slide for all rows.
n | RDJ | SampleDJ error (%) | SampleDJ time (s) | TrackDJ error (%)
524,288 | 0.017 | 0.34 | 2 | 2.05
524,288 | 0.0087 | 0.26 | 10 | 1.96
524,288 | 0.00427 | 0.38 | 39 | 2.00
524,288 | 0.00217 | 0.02 | 156 | 1.95
524,288 | 0.00105 | 0.06 | 624 (10 min) | 1.90
524,288 | 0.00052 | 0.13 | 2,502 (42 min) | 1.91
524,288 | 0.00026 | 0.04 | 10,089 (3 h) | 1.91
524,288 | 0.00013 | 0.39 | 40,635 (11 h) | 2.31
[Plot: log(t) vs. RDJ]
Applications
Flickr photo tag similarity over the time period 2005-2010
[Figure: scale from Similar to Diverse; annotated tags: winter, snow, vacation, or house; graduation, wedding, beach; halloween, thanksgiving; christmas]
Applications: Diversity vs. #Clusters
Reuters RCV1 Categories
Size | News Category | RDJ
299,612 | Corporate/Industrial | 4.31
204,820 | Markets | 4.79
66,339 | Economics | 4.69
35,769 | Government/Social | 5.73
35,279 | Sports | 3.45
33,969 | Domestic Politics | 5.21
31,328 | War, Civil War | 5.81

Flickr Groups
Size | Group Title | RDJ
139,344 | Pictures of England | 1.63
121,391 | Dark Art | 0.57
98,901 | Aircraft Photos | 1.99
89,606 | Absolutely beautiful | 0.51
76,265 | Visual Arts!! | 0.61
73,632 | Lonely Planet: "Leaving" | 0.48
71,158 | Lighthouse Lovers | 4.56
Conclusion & Future Work
• The average similarity of all object pairs can be estimated in linear time
• Two novel algorithms with probabilistic guarantees and different properties
  - SampleDJ: fast for most datasets, running time does not depend on dataset size
  - TrackDJ: guaranteed to solve the problem in linear time
Future Work:
• Applying other similarity measures
• Studying visual features in multi-media collections
• Experiments with parallelization
Data sets and source code: http://www.l3s.de/~deng/
Fan Deng, Stefan Siersdorfer, Sergej Zerr
zerr@L3S.de
Thank you!
http://en.wikipedia.org/wiki/File:Phanerozoic_Biodiversity.png
REFERENCES
[1] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. PODS '02, Madison, Wisconsin.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
[4] A. Z. Broder. Min-wise independent permutations: Theory and practice. ICALP '00, London, UK.
[5] A. Z. Broder. On the resemblance and containment of documents. SEQUENCES '97, Washington, USA.
[6] A. Z. Broder. Identifying and filtering near-duplicate documents. COM '00, London, UK, 2000.
[7] A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60:630–659, June 2000.
[8] M. Charikar. Similarity estimation techniques from rounding algorithms. STOC '02.
[9] P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte Carlo estimation. SIAM J. Comput., 29:1484–1496, March 2000.
[10] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. SCG '04, New York, USA.
[11] J. D. Fearon. Ethnic and cultural diversity by country. Journal of Economic Growth, 8:195–222, 2003.
[12] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. WWW '09, Madrid, Spain.
[13] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. STOC '98, Dallas, Texas, USA.
[14] C. C. Krebs. Ecological Methodology. HarperCollins, 1989.
[15] C. Lévêque and J.-C. Mounolou. Biodiversity. John Wiley & Sons, 2003.
[16] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[17] M. Ley. The DBLP computer science bibliography. http://www.informatik.uni-trier.de/~ley/db/.
[18] S. Lieberson. Measuring population diversity. American Sociological Review, 34(6):850–862, 1969.
[19] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[20] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res., 2:397–418, March 2002.
[21] E. Minack, W. Siberski, and W. Nejdl. Incremental diversification for very large sets: a streaming-based approach. SIGIR '11, Beijing, China.
[22] F. Olken. Random sampling from databases. Ph.D. dissertation, University of California at Berkeley, 1993.
[23] O. Papapetrou, W. Siberski, and N. Fuhr. Text clustering for peer-to-peer networks with probabilistic guarantees. LNCS, vol. 5993, pages 293–305. Springer Berlin / Heidelberg, 2010.
[24] D. Rafiei, K. Bharat, and A. Shukla. Diversifying web search results. WWW '10, Raleigh, USA.
[25] I. Rafols and M. Meyer. Diversity and network coherence as indicators of interdisciplinarity: case studies in bionanoscience. Scientometrics, 82(2):263–287, 2010.
[26] N. Sahoo, J. Callan, R. Krishnan, G. Duncan, and R. Padman. Incremental hierarchical clustering of text documents. CIKM '06.
[27] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.
[28] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007.
[29] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. A. Yahia. Efficient computation of diverse query results. ICDE '08, Washington, DC, USA.
[30] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. WWW '05, New York, USA.
Similarity Measures
• There exists a large number of possible measures:
  - Cosine similarity, Okapi, inverted distances, etc.
• Jaccard similarity (computationally efficient)
  - Each object belongs to one of a discrete set of categories
  $JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$
Text 1: island maui second largest hawaiian
Text 2: tenerife largest island seven canary
Jaccard similarity: JS = 2/8 = 0.25 (2 shared terms out of 8 distinct terms)
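The Jaccard similarity of the two example texts can be worked out directly from their term sets. In this sketch the tokens are exactly the ten words listed for Text 1 and Text 2; the function name `jaccard` is an assumption. With these tokens the intersection holds 2 terms and the union 8.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


text1 = set("island maui second largest hawaiian".split())
text2 = set("tenerife largest island seven canary".split())

print(sorted(text1 & text2))  # ['island', 'largest']
print(len(text1 | text2))     # 8
print(jaccard(text1, text2))  # 0.25
```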
Estimation Algorithm TrackDJ
• Broder et al. proposed min-wise independent hashing (Min-hash)
D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)
π1 = (E,B,A,C,D)
h1(D1) = B
h1(D2) = B
h1(D3) = E
h1(D4) = E
h1(D5) = A
π2 = (A,C,B,D,E)
h2(D1) = A
h2(D2) = C
h2(D3) = C
h2(D4) = B
h2(D5) = A
$\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$
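One way such hash values can yield an estimate in a single pass per permutation is to group documents by hash value and count the colliding pairs inside each group. The expected fraction of colliding pairs equals the average Jaccard similarity. The sketch below illustrates that idea with π1 and π2; it is an assumption about how linear time can arise, not a reproduction of TrackDJ, and the names `colliding_pairs` and `estimate_rdj` are hypothetical.

```python
from collections import Counter


def min_hash(doc, perm):
    """First element of the permutation that occurs in the document."""
    return next(x for x in perm if x in doc)


def colliding_pairs(docs, perm):
    """Number of document pairs sharing a min-hash value, found in one pass."""
    buckets = Counter(min_hash(d, perm) for d in docs)
    return sum(c * (c - 1) // 2 for c in buckets.values())


def estimate_rdj(docs, perms):
    """Average fraction of colliding pairs over the given permutations."""
    n = len(docs)
    all_pairs = n * (n - 1) // 2
    return sum(colliding_pairs(docs, p) for p in perms) / (len(perms) * all_pairs)


docs = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}, {"E", "B", "D"}, {"A", "C", "D"}]
pi1 = ("E", "B", "A", "C", "D")
pi2 = ("A", "C", "B", "D", "E")

print(colliding_pairs(docs, pi1), colliding_pairs(docs, pi2))  # 2 2
print(estimate_rdj(docs, [pi1, pi2]))  # 0.2
```

The exact average Jaccard similarity of these five documents is 0.34, so two permutations are far too few for a tight estimate; more hash functions reduce the variance.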
Problem Statement: “All-Pair” Problem
• To measure the diversity of a dataset, similarity computation between all possible pairs is required
  - $O(n^2)$ complexity
  - not feasible for large datasets
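To make the quadratic growth concrete, the number of unordered pairs among n objects is n(n-1)/2. The short sketch below evaluates it for a few sizes; the helper name `num_pairs` and the choice of sizes are assumptions.

```python
def num_pairs(n):
    """Number of unordered pairs among n objects: n(n-1)/2."""
    return n * (n - 1) // 2


for n in (1_000, 1_000_000, 20_000_000):
    print(f"{n:>12,} objects -> {num_pairs(n):,} pairs")
```

Going from one thousand to one million objects multiplies the number of similarity computations by roughly a million.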
Applications: Diversity vs. #Clusters
UCI US-Census Educ. Based Clusters
Size | Educational Background | RDJ
562,837 | High School, Diploma, GED | 49.41
366,116 | Some College w/o Degree | 49.22
273,281 | 5th, 6th, 7th, or 8th Grade | 51.63
213,941 | Bachelors Degree | 51.36
174,653 | 1st, 2nd, 3rd or 4th Grade | 70.97
108,834 | N/A, Less Than 3 Years Old | 84.55
107,142 | 10th Grade | 47.56
Similarity Measures
• There exists a large number of possible measures:
  - Cosine similarity, Okapi, inverted distances, etc.
• Jaccard similarity has special properties that we make use of in our algorithms
  $JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$
CUbRIK Research at CIKM 2012: Efficient Jaccard-based Diversity Analysis of Large Document Collections

  • 1. 121st CIKM Conference Lahaina, Maui Hawaii 30/10/12 Fan Deng, Stefan Siersdorfer, Sergej Zerr 21st ACM Conference on Information and Knowledge Management, CIKM 2012, Lahiana, Maui Hawaii
  • 2. Diversity 221st CIKM Conference Lahaina, Maui Hawaii 30/10/12 Diversity - Healthmeasure of an ecosystem Biodiversity: is the degree of variation of life forms within a given species, ecosystem, biome, or an entire planet. Biodiversity is a measure of the health of ecosystems (Wikipedia).
  • 3. Diversity Biodiversity: is the degree of variation of life forms within a given species, ecosystem, biome, or an entire planet. Biodiversity is a measure of the health of ecosystems (Wikipedia). 330/10/1221st CIKM Conference Lahaina, Maui Hawaii
  • 4. Diversity in Computer Science 430/10/12 Our focus: Topic diversity of the large text corpora Social Web Environment – Ecosystem Group dynamics “Hot topics”, controversial topics Diversity of opinions Topic ambiguity Temporal topic analysis 21st CIKM Conference Lahaina, Maui Hawaii Increasing amounts of data are published on the Internet on a daily basis, not least due to popular social web environments: YouTube, Flickr, blogosphere, … ect.
  • 5. Outline • Motivation: Document Topic Diversity • Diversity Metrics • Proposed Efficient Algorithms: SampleDJ, TrackDJ • Experiments • Applications • Future Work: Ideas&Directions 530/10/1221st CIKM Conference Lahaina, Maui Hawaii
  • 6. Diversity Metrics • Simpson‟s1 diversity index  Each object belongs to one of a discrete sets of categories • Stirling‟s2 index  Depends on distances between objects and their relative occurrences 6 [1] E. H. Simpson. Measurement of diversity. Nature, 163, 1949. 21st CIKM Conference Lahaina, Maui Hawaii 30/10/12 [2] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007. Z i iD 1 2 )()()( jijiij ij ppdD A B C D E B C D E F
  • 7. • Refined Jaccard Index – average Jaccard similarity between all possible object pairs • Note: lower RDJ value corresponds to higher diversity • Problem: “All-Pair Problem” • Solution: Estimation algorithms with probabilistic error bound guarantees Refined Jaccard Index 721st CIKM Conference Lahaina, Maui Hawaii 30/10/12 ji ji OOJS nn RDJ ),( )1( 2 nji1 ∩ UU Jaccard similarity
  • 8. • Input: Relative error ε, accuracy confidence δ • Output: Estimated RDJ value •Algorithms: SampleDJ, TrackDJ (claims and proofs in the paper) Estimation Algorithms 821st CIKM Conference Lahaina, Maui Hawaii 30/10/12 RDJ RDJRDJ || Pr
  • 9. Estimation Algorithm SampleDJ 921st CIKM Conference Lahaina, Maui Hawaii 30/10/12 ... . . . . . .. . . . .. .. . . . . . . . .. . . .. . . . . . .. . . . . .. . ...... Document Set Document sub sets: Step 1 ... . . . .. .. . . . .. .. . . . . . . . .. . . .. . . . . . .. . .. . . . .. . . . . . . .. .. .. .. . ... . . . . . .. .. . . . . . . . .. . . . .. .. . . . . . . .. . . .. ... . . . . . . .. .. . . . . .. .. . . . . . .. . . . .. . . . . . . . . .. . . .. ... . . .. . . .. .. . . . . .. . MedianMedian . . . .. . ......
  • 10. • Execution time: • Properties:  Execution time (number of trials) does not depend on the data set size, but only on RDJ value  For a dataset with a very high diversity value can run infinitely long time. SampleDJ Overview 1021st CIKM Conference Lahaina, Maui Hawaii 30/10/12 ) 1 ( 2 RDJ
  • 11. Estimation Algorithm TrackDJ 1130/10/12 π1 = (E,B,A,C,D) D1(A,B,C), D2(B,C,D) h1(D1) = B h1(D2) = B ),()]()(Pr[ yxyx DDJSDhDh • Broder et al. 2000 proposed Min-wise independent hashing (Min- hash) 21st CIKM Conference Lahaina, Maui Hawaii
  • 12. Estimation Algorithm TrackDJ
      • Broder et al. proposed min-wise independent hashing (Min-hash)
      • Documents: D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)
      • π1 = (E,B,A,C,D): h1(D1) = B, h1(D2) = B, h1(D3) = E, h1(D4) = E, h1(D5) = A
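The Min-hash values on this slide can be reproduced with a few lines. The sketch below represents a permutation as an ordered list and takes the hash of a document to be the first term of the permutation that the document contains. The helper name `minhash` is illustrative, and the second permutation π2 is the one shown in the backup slides.

```python
def minhash(doc, perm):
    """Return the first term of the permutation that occurs in the document."""
    return next(term for term in perm if term in doc)


docs = {
    "D1": {"A", "B", "C"},
    "D2": {"B", "C", "D"},
    "D3": {"C", "E"},
    "D4": {"E", "B", "D"},
    "D5": {"A", "C", "D"},
}
pi1 = ["E", "B", "A", "C", "D"]
pi2 = ["A", "C", "B", "D", "E"]

h1 = {name: minhash(doc, pi1) for name, doc in docs.items()}
h2 = {name: minhash(doc, pi2) for name, doc in docs.items()}
print(h1)  # {'D1': 'B', 'D2': 'B', 'D3': 'E', 'D4': 'E', 'D5': 'A'}
print(h2)  # {'D1': 'A', 'D2': 'C', 'D3': 'C', 'D4': 'B', 'D5': 'A'}
```

Two documents receive the same hash value under a random permutation with probability equal to their Jaccard similarity, which is the property the slide states.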
  • 13. TrackDJ Overview
      • Time complexity: $O(n)$
      • Properties: execution in linear time (depends on the data set size)
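One way Min-hash can yield a linear-time estimate is to group documents by hash value and count the colliding pairs per group, since the expected fraction of colliding pairs equals the average Jaccard similarity. The sketch below follows that idea. It is an assumption about the mechanism rather than the TrackDJ procedure from the paper, and all names are illustrative.

```python
import random
from collections import Counter


def colliding_pair_fraction(docs, perm):
    """Fraction of document pairs sharing a Min-hash value under one permutation.

    Each (non-empty) document is hashed once and pairs are counted per bucket,
    so no pair is ever compared explicitly.
    """
    rank = {term: pos for pos, term in enumerate(perm)}
    buckets = Counter(min(doc, key=rank.__getitem__) for doc in docs)
    n = len(docs)
    colliding = sum(c * (c - 1) // 2 for c in buckets.values())
    return colliding / (n * (n - 1) / 2)


def estimate_rdj(docs, vocabulary, num_perms, rng):
    """Average the colliding-pair fraction over random permutations."""
    perm = list(vocabulary)
    total = 0.0
    for _ in range(num_perms):
        rng.shuffle(perm)
        total += colliding_pair_fraction(docs, perm)
    return total / num_perms


docs = [{"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}, {"E", "B", "D"}, {"A", "C", "D"}]
pi1 = ["E", "B", "A", "C", "D"]
print(colliding_pair_fraction(docs, pi1))   # 0.2
estimate = estimate_rdj(docs, sorted("ABCDE"), 2000, random.Random(7))
print(round(estimate, 3))                   # exact all-pairs value for these sets is 0.34
```

A single permutation gives a coarse value (2 of 10 pairs collide under π1); averaging over many permutations moves the estimate toward the exact average.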
  • 14. Experimental Evaluation of the Theoretical Claims (Flickr Dataset), ε = 5%, δ = 95%
      Data Set Size n | All Pairs: RDJ | All Pairs: Time (seconds) | SampleDJ: Error (%) | SampleDJ: Time (seconds) | TrackDJ: Error (%) | TrackDJ: Time (seconds)
      1,000 | 0.00206 | 0.08 | 0.017 | 34 (0.57 min) | 0 | 40 (0.66 min)
      10,000 | 0.001992 | 8.82 | 0.028 | 40 (0.67 min) | 0.013 | 410 (6.84 min)
      100,000 | 0.001992 | 912 (15.21 min) | 0.019 | 90 (1.50 min) | 0.043 | 5,253 (1.46 h)
      1,000,000 | 0.001993 | 97,215 (27 h) | 0.08 | 223 (3.72 min) | 0.041 | 51,730 (14.37 h)

      Data Set Size n | All Pairs: Time | SampleDJ: RDJ | SampleDJ: Time (seconds) | TrackDJ: RDJ | TrackDJ: Time (seconds)
      10,000,000 | 113 days (estimated) | 0.001998 | 350 (5.84 min) | 0.001997 | 790,016 (9.14 days)
      20,000,000 | 450 days (estimated) | 0.002203 | 246 (4.10 min) | 0.002206 | 1,613,566 (16.68 days)

      [Figure: three plots of time t against dataset size]
  • 15. Experimental Evaluation of the Theoretical Claims (Synthetic Dataset)
      n | All-Pair: RDJ | All-Pair: Time (hours) | SampleDJ: Error (%) | SampleDJ: Time (seconds) | TrackDJ: Error (%) | TrackDJ: Time (hours)
      524,288 | 0.017 | 5.3 | 0.34 | 2 | 2.05 | 2.5
      524,288 | 0.0087 | | 0.26 | 10 | 1.96 |
      524,288 | 0.00427 | | 0.38 | 39 | 2.00 |
      524,288 | 0.00217 | | 0.02 | 156 | 1.95 |
      524,288 | 0.00105 | | 0.06 | 624 (10 min) | 1.90 |
      524,288 | 0.00052 | | 0.13 | 2,502 (42 min) | 1.91 |
      524,288 | 0.00026 | | 0.04 | 10,089 (3 h) | 1.91 |
      524,288 | 0.00013 | | 0.39 | 40,635 (11 h) | 2.31 |

      [Figure: log(t) against RDJ]
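The SampleDJ timings in this table grow roughly fourfold each time RDJ halves. A quick way to check that trend is to fit a straight line to log(time) against log(RDJ). The sketch below does this with a hand-written least-squares fit on the RDJ and SampleDJ time columns as read from the table; the column assignment is an interpretation of the flattened slide text.

```python
import math

# RDJ values and SampleDJ running times (seconds) as read from the synthetic-data table.
rdj = [0.017, 0.0087, 0.00427, 0.00217, 0.00105, 0.00052, 0.00026, 0.00013]
seconds = [2, 10, 39, 156, 624, 2502, 10089, 40635]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


slope = loglog_slope(rdj, seconds)
print(round(slope, 2))
```

A slope close to -2 means the running time scales approximately with the inverse square of RDJ on this dataset.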
  • 16. Applications: Flickr photo tag similarity over the time period 2005-2010 [Figure labels: winter, snow, vacation, or house; graduation, wedding, beach; halloween, thanksgiving; christmas; scale from Similar to Diverse]
  • 17. Applications: Diversity vs. #Clusters
      Reuters RCV1 Categories
      Size | News Category | RDJ
      299,612 | Corporate/Industrial | 4.31
      204,820 | Markets | 4.79
      66,339 | Economics | 4.69
      35,769 | Government/Social | 5.73
      35,279 | Sports | 3.45
      33,969 | Domestic Politics | 5.21
      31,328 | War, Civil War | 5.81

      Flickr Groups
      Size | Group Title | RDJ
      139,344 | Pictures of England | 1.63
      121,391 | Dark Art | 0.57
      98,901 | Aircraft Photos | 1.99
      89,606 | Absolutely beautiful | 0.51
      76,265 | Visual Arts!! | 0.61
      73,632 | Lonely Planet: "Leaving" | 0.48
      71,158 | Lighthouse Lovers | 4.56
  • 18. Outline
      • Motivation: Document Topic Diversity
      • Diversity Metrics
      • Proposed Efficient Algorithms: SampleDJ, TrackDJ
      • Experiments
      • Applications
      • Future Work: Ideas & Directions
  • 19. Conclusion & Future Work
      • The average similarity of all object pairs can be estimated in linear time
      • Two novel algorithms with probabilistic guarantees and different properties: SampleDJ is fast for most datasets and its running time does not depend on the dataset size; TrackDJ is guaranteed to solve the problem in linear time
      • Future work: applying other similarity measures; studying visual features in multimedia collections; experiments with parallelization
  • 20. Thank you!
      • Data sets and source code: http://www.l3s.de/~deng/
      • Fan Deng, Stefan Siersdorfer, Sergej Zerr (zerr@L3S.de)
      • [Figure labels: Jaccard similarity; SampleDJ; TrackDJ; Temporal diversity development in Flickr]
      • http://en.wikipedia.org/wiki/File:Phanerozoic_Biodiversity.png
  • 21. REFERENCES
      [1] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
      [2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. PODS '02, Madison, Wisconsin.
      [3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
      [4] A. Z. Broder. Min-wise independent permutations: Theory and practice. ICALP '00, London, UK.
      [5] A. Z. Broder. On the resemblance and containment of documents. SEQUENCES '97, Washington, USA.
      [6] A. Z. Broder. Identifying and filtering near-duplicate documents. COM '00, London, UK, 2000.
      [7] A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60:630–659, June 2000.
      [8] M. Charikar. Similarity estimation techniques from rounding algorithms. STOC '02.
      [9] P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte Carlo estimation. SIAM J. Comput., 29:1484–1496, March 2000.
      [10] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. SCG '04, New York, USA.
      [11] J. D. Fearon. Ethnic and cultural diversity by country. Journal of Economic Growth, 8:195–222, 2003.
      [12] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. WWW '09, Madrid, Spain.
      [13] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. STOC '98, Dallas, Texas, USA.
      [14] C. C. Krebs. Ecological Methodology. HarperCollins, 1989.
      [15] C. Lévêque and J.-C. Mounolou. Biodiversity. John Wiley & Sons, 2003.
      [16] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
      [17] M. Ley. The DBLP computer science bibliography. http://www.informatik.uni-trier.de/~ley/db/.
      [18] S. Lieberson. Measuring population diversity. American Sociological Review, 34(6):850–862, 1969.
      [19] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
      [20] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res., 2:397–418, March 2002.
      [21] E. Minack, W. Siberski, and W. Nejdl. Incremental diversification for very large sets: a streaming-based approach. SIGIR '11, Beijing, China.
      [22] F. Olken. Random sampling from databases. Ph.D. dissertation, University of California at Berkeley, 1993.
      [23] O. Papapetrou, W. Siberski, and N. Fuhr. Text clustering for peer-to-peer networks with probabilistic guarantees. LNCS, vol. 5993, pages 293–305. Springer Berlin / Heidelberg, 2010.
      [24] D. Rafiei, K. Bharat, and A. Shukla. Diversifying web search results. WWW '10, Raleigh, USA.
      [25] I. Rafols and M. Meyer. Diversity and network coherence as indicators of interdisciplinarity: case studies in bionanoscience. Scientometrics, 82(2):263–287, 2010.
      [26] N. Sahoo, J. Callan, R. Krishnan, G. Duncan, and R. Padman. Incremental hierarchical clustering of text documents. CIKM '06.
      [27] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.
      [28] A. Stirling. A general framework for analysing diversity in science, technology and society. Journal of The Royal Society Interface, 4(15):707–719, 2007.
      [29] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. A. Yahia. Efficient computation of diverse query results. ICDE '08, Washington, DC, USA.
      [30] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. WWW '05, New York, USA.
  • 22. Similarity Measures
      • There exists a large number of possible measures: cosine similarity, Okapi, inverted distances, etc.
      • Jaccard similarity (computationally efficient): each object belongs to one of a discrete set of categories. $JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$
      • Example: Text 1 = {island, maui, second, largest, hawaiian}, Text 2 = {tenerife, largest, island, seven, canary}. The texts share 2 of 8 distinct terms, so JS = 2/8 = 0.25
  • 23. Estimation Algorithm TrackDJ
      • Broder et al. proposed min-wise independent hashing (Min-hash): $\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$
      • Documents: D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)
      • π1 = (E,B,A,C,D): h1(D1) = B, h1(D2) = B, h1(D3) = E, h1(D4) = E, h1(D5) = A
      • π2 = (A,C,B,D,E): h2(D1) = A, h2(D2) = C, h2(D3) = C, h2(D4) = B, h2(D5) = A
  • 24. Estimation Algorithm TrackDJ
      • Broder et al. proposed min-wise independent hashing (Min-hash): $\Pr[h(D_x) = h(D_y)] = JS(D_x, D_y)$
      • Documents: D1(A,B,C), D2(B,C,D), D3(C,E), D4(E,B,D), D5(A,C,D)
      • π1 = (E,B,A,C,D): h1(D1) = B, h1(D2) = B, h1(D3) = E, h1(D4) = E, h1(D5) = A
      • π2 = (A,C,B,D,E): h2(D1) = A, h2(D2) = C, h2(D3) = C, h2(D4) = B, h2(D5) = A
      • D1(A,B,C): h1(D1) = B, …
  • 28. Problem Statement: the "All-Pair" problem
      • To measure the diversity of a dataset, similarity computation between all possible pairs is required
      • $O(n^2)$ complexity: not feasible for large datasets
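To give a feel for the quadratic growth, the short sketch below counts the unordered pairs that an exact all-pairs computation would have to compare for a few dataset sizes. The function name is illustrative and the sizes are example values.

```python
import math


def pair_count(n):
    """Number of unordered pairs among n objects: n(n-1)/2."""
    return math.comb(n, 2)


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} objects -> {pair_count(n):>15,} pairs")
```

Each tenfold increase in the number of objects multiplies the number of pairs by about one hundred.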
  • 29. Applications: Diversity vs. #Clusters
      Reuters RCV1 Categories
      Size | News Category | RDJ
      299,612 | Corporate/Industrial | 4.31
      204,820 | Markets | 4.79
      66,339 | Economics | 4.69
      35,769 | Government/Social | 5.73
      35,279 | Sports | 3.45
      33,969 | Domestic Politics | 5.21
      31,328 | War, Civil War | 5.81

      Flickr Groups
      Size | Group Title | RDJ
      139,344 | Pictures of England | 1.63
      121,391 | Dark Art | 0.57
      98,901 | Aircraft Photos | 1.99
      89,606 | Absolutely beautiful | 0.51
      76,265 | Visual Arts!! | 0.61
      73,632 | Lonely Planet: "Leaving" | 0.48
      71,158 | Lighthouse Lovers | 4.56

      UCI US-Census Education-Based Clusters
      Size | Educational Background | RDJ
      562,837 | High School, Diploma, GED | 49.41
      366,116 | Some College w/o Degree | 49.22
      273,281 | 5th, 6th, 7th, or 8th Grade | 51.63
      213,941 | Bachelors Degree | 51.36
      174,653 | 1st, 2nd, 3rd or 4th Grade | 70.97
      108,834 | N/A, Less Than 3 Years Old | 84.55
      107,142 | 10th Grade | 47.56
  • 30. Similarity Measures
      • There exists a large number of possible measures: cosine similarity, Okapi, inverted distances, etc.
      • Jaccard similarity has special properties that we make use of in our algorithms: $JS(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$

Editor's Notes

  1. Genetic diversity serves as a way for populations to adapt to changing environments. With more variation, it is more likely that some individuals in a population will possess variations of alleles that are suited for the environment. Those individuals are more likely to survive to produce offspring bearing that allele. The population will continue for more generations because of the success of these individuals.
  5. Thanks :)