Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Presentation given by Mark M. Hall, Mark Stevenson and Paul D. Clough (Information School / Department of Computer Science, University of Sheffield, UK) at TPDL 2012, Cyprus, 24-27 September 2012.

Evaluating the Use of Clustering
for Automatically Organising
Digital Library Collections
Mark M. Hall, Mark Stevenson,
Paul D. Clough

TPDL 2012, Cyprus, 24-27 September 2012

Opening Up Digital Cultural Heritage

http://www.flickr.com/photos/brokenthoughts/122096903/
Carl Collins
http://www.flickr.com/photos/carlcollins/199792939/

http://www.flickr.com/photos/usnationalarchives/4069633668/
Exploring Collections
• Exploring / Browsing as an alternative to
Search (where applicable)
• Requires some kind of structuring of the
data
• Manual structuring ideal
– Expensive to generate
– Integration of collections problematic
• Alternative: Automatic structuring via
clustering

Test Collection
• 28133 photographs provided
by the University of St
Andrews Library
– 85% pre 1940
– 89% black and white
– Majority UK
– Title and description tend to be short
[Example photograph: Ottery St Mary Church]

Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
– 300 & 900 topics
– With and without Pointwise Mutual Information
(PMI) filtering
• K-Means
– 900 clusters
– TFIDF vectors & LDA topic vectors
• OPTICS
– 900 clusters
– TFIDF vectors & LDA topic vectors
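The TFIDF vectors used as input to K-Means and OPTICS can be illustrated with a minimal sketch. The slides do not say which TFIDF variant or tokenisation was used, so the weighting below (raw term count times log(N / document frequency)), the whitespace tokeniser and the example titles are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    The weighting variant and whitespace tokenisation are assumptions;
    the slides do not specify them.
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

# Hypothetical titles, not taken from the St Andrews collection.
titles = ["church tower", "church interior", "harbour boats"]
vectors = tfidf_vectors(titles)
```

With short titles and descriptions, as in this collection, most vectors contain only a handful of non-zero terms, and a term shared by several items ("church") receives a lower weight than a term unique to one item ("tower").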

Processing Time
Model            Wall-clock time (hh:mm:ss)
LDA 300          00:21:48
LDA 900          00:42:42
LDA + PMI 300    05:05:13
LDA + PMI 900    17:26:08
K-Means TFIDF    09:37:40
K-Means LDA      03:49:04
OPTICS TFIDF     12:42:13
OPTICS LDA       05:12:49
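To compare the timings above on one scale, the following sketch converts each wall-clock time to seconds and expresses it as a multiple of the fastest run. The helper name `to_seconds` is hypothetical, and the times are assumed to be in hh:mm:ss format.

```python
def to_seconds(hms):
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Wall-clock times copied from the Processing Time slide.
wall_clock = {
    "LDA 300": "00:21:48",
    "LDA 900": "00:42:42",
    "LDA + PMI 300": "05:05:13",
    "LDA + PMI 900": "17:26:08",
    "K-Means TFIDF": "09:37:40",
    "K-Means LDA": "03:49:04",
    "OPTICS TFIDF": "12:42:13",
    "OPTICS LDA": "05:12:49",
}

seconds = {model: to_seconds(t) for model, t in wall_clock.items()}
fastest = min(seconds.values())
# Slowdown factor of each strategy relative to the fastest one.
slowdown = {model: s / fastest for model, s in seconds.items()}

for model, factor in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{model:15s} {factor:5.1f}x")
```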

Evaluation Metrics
• Cluster cohesion
– Items in a cluster should be similar to each
other
– Items in a cluster should be different from
items in other clusters
• How to test this?
– “Intruder” test
– If you insert an intruder into a cluster, can
people find it?

Intruder Test
1. Randomly select one topic
2. Randomly select four items from the topic
3. Randomly select a second topic – the
“intruder” topic
4. Randomly select one item from the
second topic – the “intruder” item
5. Scramble the five items and let the user
choose which one is the “intruder”
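The five steps above can be sketched as follows. This is an illustration rather than the authors' implementation: the function name `build_intruder_unit`, the dictionary layout and the example items are hypothetical, and the sketch assumes that every item belongs to exactly one topic.

```python
import random

def build_intruder_unit(topics, rng=random):
    """Build one intruder-test unit.

    topics: {topic_id: [item, ...]}, each item assumed to be in one topic only.
    Returns (five shuffled items, the intruder item).
    """
    # Step 1: randomly select a topic (it must hold at least four items).
    eligible = [t for t, items in topics.items() if len(items) >= 4]
    topic = rng.choice(eligible)
    # Step 2: randomly select four items from that topic.
    members = rng.sample(topics[topic], 4)
    # Step 3: randomly select a second, different topic (the intruder topic).
    intruder_topic = rng.choice([t for t in topics if t != topic and topics[t]])
    # Step 4: randomly select one item from the intruder topic.
    intruder = rng.choice(topics[intruder_topic])
    # Step 5: scramble the five items before showing them to the user.
    unit = members + [intruder]
    rng.shuffle(unit)
    return unit, intruder

# Hypothetical topics and item identifiers.
topics = {
    "churches": ["c1", "c2", "c3", "c4", "c5"],
    "harbours": ["h1", "h2", "h3", "h4"],
    "portraits": ["p1", "p2"],
}
unit, intruder = build_intruder_unit(topics, random.Random(0))
```

Passing a seeded `random.Random` instance makes the unit reproducible, which helps when the same unit has to be shown to many participants.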

Cluster Cohesion – Cohesive

Cluster Cohesion – Not Cohesive

Evaluation Metrics
• Cohesive
– “Intruder” is chosen significantly more
frequently than by chance
– Choice distribution is significantly different
from the uniform distribution
• Borderline cohesive
– Two out of five items make up > 95% of the
answers
– “Intruder” is one of those two
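A minimal sketch of how these criteria could be applied to the answers for one unit. The slides do not name the statistical tests, the significance level, or the order in which the rules are checked, so everything here is an assumption: a one-sided exact binomial test against a chance rate of 1/5, a chi-square goodness-of-fit test against the uniform distribution, a 5% significance level, and the cohesive rule taking precedence over the borderline rule.

```python
from math import comb

# Chi-square critical value for 4 degrees of freedom at alpha = 0.05 (assumed level).
CHI2_CRIT_DF4 = 9.488

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def classify_unit(counts, intruder_index, alpha=0.05):
    """Classify one unit from the number of votes each of its five items received."""
    n = sum(counts)
    expected = n / len(counts)
    # Is the choice distribution different from uniform? (goodness-of-fit test)
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    non_uniform = chi2 > CHI2_CRIT_DF4
    # Is the intruder chosen more often than by chance? (one-sided binomial test)
    p_value = binom_tail(counts[intruder_index], n, 1 / len(counts))
    if non_uniform and p_value < alpha:
        return "cohesive"
    # Borderline: two items account for > 95% of answers, one being the intruder.
    top_two = sorted(range(len(counts)), key=counts.__getitem__, reverse=True)[:2]
    if intruder_index in top_two and sum(counts[i] for i in top_two) > 0.95 * n:
        return "borderline"
    return "non-cohesive"

# Hypothetical vote counts for units with 27 ratings each; the intruder is item 0.
clear = classify_unit([22, 2, 1, 1, 1], 0)
split = classify_unit([3, 24, 0, 0, 0], 0)
flat = classify_unit([5, 6, 5, 6, 5], 0)
```

Under these assumptions the three hypothetical units are labelled cohesive, borderline and non-cohesive respectively; a different choice of tests or rule order could shift units between the first two categories.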

Evaluation Bounds
• Upper bound
– Manual annotation
• 936 topics
• Lower bound
– 3 cohesive topics
– <5% likelihood of seeing that number of cohesive
topics by chance
• Control data
– 10 “really, totally, completely obvious” intruders
used to filter participants who randomly select
answers

Experiment
• Crowd-sourced using staff & students at
Sheffield University
– 700 participants
• 9 clustering strategies
– 30 units per strategy – total of 270 units
• Results
– 8840 ratings
– 21 – 30 ratings per unit (median 27 ratings)

Results
Model            Cohesive  Borderline  Non-cohesive
Upper Bound         27         0            3
Lower Bound          3         0           27
LDA 300             15         6            9
LDA 900             20         4            6
LDA + PMI 300       16         4           10
LDA + PMI 900       21         2            7
K-Means TFIDF       24         3            3
K-Means LDA         20         0           10
OPTICS TFIDF        14         2           14
OPTICS LDA          16         0           14
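The counts above are easier to compare as proportions. The sketch below computes the share of cohesive units per strategy and ranks the automatic strategies; the variable and function names are hypothetical, and the numbers are copied from the table.

```python
# (cohesive, borderline, non-cohesive) counts from the Results slide.
results = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}

def cohesive_share(row):
    """Fraction of a strategy's units that were judged cohesive."""
    cohesive, borderline, non_cohesive = row
    return cohesive / (cohesive + borderline + non_cohesive)

# Rank only the automatic strategies, leaving out the two bounds.
automatic = {m: r for m, r in results.items() if not m.endswith("Bound")}
ranking = sorted(automatic, key=lambda m: cohesive_share(automatic[m]), reverse=True)

for model in ranking:
    print(f"{model:15s} {cohesive_share(automatic[model]):.0%}")
```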

Conclusions
• K-Means on TFIDF vectors is almost as good as the human
classification (24 vs. 27 cohesive units out of 30)
• LDA is very fast and approximately two
thirds of the topics are acceptably
cohesive

• Future work:
– Make it hierarchical
– Create hybrid algorithms

Thank you for listening

Find out more about the project:

http://www.paths-project.eu

m.mhall@sheffield.ac.uk

The research leading to these results has received funding from the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project
partners involved in PATHS (see: http://www.paths-project.eu).
