Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
Mark M. Hall, Mark Stevenson, Paul D. Clough


TPDL 2012, Cyprus, 24-27 September 2012
Opening Up Digital Cultural Heritage




Image credits:
http://www.flickr.com/photos/brokenthoughts/122096903/
Carl Collins, http://www.flickr.com/photos/carlcollins/199792939/
http://www.flickr.com/photos/usnationalarchives/4069633668/
Exploring Collections
• Exploring / Browsing as an alternative to
  Search (where applicable)
• Requires some kind of structuring of the
  data
• Manual structuring ideal
    – Expensive to generate
    – Integration of collections problematic
• Alternative: Automatic structuring via
  clustering

Test Collection
• 28133 photographs provided by the University of St Andrews Library
    – 85% pre 1940
    – 89% black and white
    – Majority UK
    – Title and description tend to be short
[Image: Ottery St Mary Church]


Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
    – 300 & 900 topics
    – With and without Pointwise Mutual Information
      (PMI) filtering
• K-Means
    – 900 clusters
    – TFIDF vectors & LDA topic vectors
• OPTICS
    – 900 clusters
    – TFIDF vectors & LDA topic vectors
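The slides do not say how PMI filtering is applied to the LDA topics. As a rough illustration only, the sketch below estimates pointwise mutual information between word pairs from document co-occurrence; the function name `pmi_scores` and the document-frequency estimate of the probabilities are assumptions, not the authors' method.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pmi_scores(documents):
    """Return PMI for every word pair that co-occurs in at least one document.

    documents: list of token lists. Probabilities are estimated as the
    fraction of documents containing a word (or both words of a pair).
    This estimate is an assumption made for illustration.
    """
    docs = [set(tokens) for tokens in documents]
    n_docs = len(docs)
    # Number of documents each word appears in.
    word_df = Counter(word for doc in docs for word in doc)
    # Number of documents each (alphabetically ordered) word pair appears in.
    pair_df = Counter(
        pair for doc in docs for pair in combinations(sorted(doc), 2)
    )
    return {
        (a, b): log2(
            (count / n_docs) / ((word_df[a] / n_docs) * (word_df[b] / n_docs))
        )
        for (a, b), count in pair_df.items()
    }


docs = [
    ["church", "tower"],
    ["church", "tower"],
    ["harbour", "boat"],
    ["church", "boat"],
]
scores = pmi_scores(docs)
```

Pairs that co-occur more often than their individual frequencies predict get a positive score; a filter could then drop topics or terms whose pairs score low, though the threshold would be a further assumption.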

Processing Time
Model                                     Wall-clock Time (hh:mm:ss)
LDA 300                                   00:21:48
LDA 900                                   00:42:42
LDA + PMI 300                             05:05:13
LDA + PMI 900                             17:26:08
K-Means TFIDF                             09:37:40
K-Means LDA                               03:49:04
OPTICS TFIDF                              12:42:13
OPTICS LDA                                05:12:49



Evaluation Metrics
• Cluster cohesion
    – Items in a cluster should be similar to each
      other
    – Items in a cluster should be different from
      items in other clusters
• How to test this?
    – “Intruder” test
    – If you insert an intruder into a cluster, can
      people find it?

Intruder Test
1. Randomly select one topic
2. Randomly select four items from the topic
3. Randomly select a second topic – the
   “intruder” topic
4. Randomly select one item from the
   second topic – the “intruder” item
5. Scramble the five items and let the user
   choose which one is the “intruder”
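The five steps above can be sketched as a small function. This is a minimal illustration, not the authors' code: the name `make_intruder_unit`, the dictionary layout, and the rule that the main topic must hold at least four items are all assumptions.

```python
import random


def make_intruder_unit(topics, rng=random):
    """Build one intruder-test unit.

    topics: dict mapping a topic id to a list of item ids (hypothetical layout).
    rng: the random module or a random.Random instance, for reproducibility.
    """
    # Step 1: pick the main topic (assumed to need at least four items).
    eligible = [t for t, items in topics.items() if len(items) >= 4]
    main_topic = rng.choice(eligible)
    # Step 2: pick four distinct items from it.
    main_items = rng.sample(topics[main_topic], 4)
    # Step 3: pick a different, non-empty topic as the intruder topic.
    others = [t for t, items in topics.items() if t != main_topic and items]
    intruder_topic = rng.choice(others)
    # Step 4: pick one item from the intruder topic.
    intruder_item = rng.choice(topics[intruder_topic])
    # Step 5: scramble the five items before showing them to the user.
    unit = main_items + [intruder_item]
    rng.shuffle(unit)
    return {
        "items": unit,
        "intruder": intruder_item,
        "topic": main_topic,
        "intruder_topic": intruder_topic,
    }


example_topics = {
    "a": [1, 2, 3, 4, 5],
    "b": [10, 11],
    "c": [20, 21, 22, 23],
}
unit = make_intruder_unit(example_topics, random.Random(42))
```

Passing a seeded `random.Random` makes the generated units repeatable, which helps when the same units must be shown to many participants.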

Cluster Cohesion – Cohesive




Cluster Cohesion – Not Cohesive




Evaluation Metrics
• Cohesive
    – “Intruder” is chosen significantly more
      frequently than by chance
    – Choice distribution is significantly different
      from the uniform distribution
• Borderline cohesive
    – Two out of five items make up > 95% of the
      answers
    – “Intruder” is one of those two
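The slides do not name the statistical tests behind these criteria. The sketch below assumes a one-sided exact binomial test for "more frequently than by chance" and a chi-square goodness-of-fit test against the uniform distribution, both at the 5% level. It reports each criterion separately rather than assigning a final label, since the slides do not say how the criteria take precedence over one another. All names are hypothetical.

```python
from math import comb

ALPHA = 0.05
# Chi-square critical value for 4 degrees of freedom (five options) at ALPHA.
CHI2_CRITICAL_DF4 = 9.488


def binomial_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def unit_criteria(counts, intruder_index):
    """Evaluate the slide's criteria for one five-item unit.

    counts: votes received by each of the five items.
    intruder_index: position of the true intruder in counts.
    """
    if len(counts) != 5 or sum(counts) == 0:
        raise ValueError("expected vote counts for exactly five items")
    n = sum(counts)
    expected = n / 5
    # Goodness of fit against a uniform choice over the five items.
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    # Chance of the intruder getting at least this many votes at random.
    p_intruder = binomial_tail(counts[intruder_index], n, 1 / 5)
    # The two most-chosen items and their combined share of the answers.
    top_two = sorted(range(5), key=lambda i: counts[i], reverse=True)[:2]
    top_two_share = sum(counts[i] for i in top_two) / n
    return {
        "intruder_above_chance": p_intruder < ALPHA,
        "non_uniform": chi2 > CHI2_CRITICAL_DF4,
        "intruder_in_dominant_pair": (
            top_two_share > 0.95 and intruder_index in top_two
        ),
    }


clear = unit_criteria([20, 2, 2, 2, 1], intruder_index=0)
split = unit_criteria([13, 14, 0, 0, 0], intruder_index=0)
flat = unit_criteria([5, 6, 5, 6, 5], intruder_index=0)
```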

Evaluation Bounds
• Upper bound
    – Manual annotation
         • 936 topics
• Lower bound
    – 3 cohesive topics
    – <5% likelihood of seeing that number of cohesive
      topics by chance
• Control data
    – 10 “really, totally, completely obvious” intruders
      used to filter participants who randomly select
      answers


Experiment
• Crowd-sourced using staff & students at
  Sheffield University
    – 700 participants
• 9 clustering strategies
    – 30 units per strategy – total of 270 units
• Results
    – 8840 ratings
    – 21 – 30 ratings per unit (median 27 ratings)


Results
Model                        Cohesive     Borderline   Non-Cohesive
Upper Bound                  27           0            3
Lower Bound                  3            0            27
LDA 300                      15           6            9
LDA 900                      20           4            6
LDA + PMI 300                16           4            10
LDA + PMI 900                21           2            7
K-Means TFIDF                24           3            3
K-Means LDA                  20           0            10
OPTICS TFIDF                 14           2            14
OPTICS LDA                   16           0            14
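To compare the rows, the counts can be turned into shares of the 30 units tested per strategy. The snippet below copies the figures from the table; the helper `cohesive_share` and its `include_borderline` switch are illustrative assumptions.

```python
# (cohesive, borderline, non-cohesive) counts copied from the results table.
RESULTS = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}


def cohesive_share(row, include_borderline=False):
    """Fraction of units judged cohesive, optionally counting borderline ones."""
    cohesive, borderline, _non_cohesive = row
    hits = cohesive + (borderline if include_borderline else 0)
    return hits / sum(row)


shares = {model: cohesive_share(row) for model, row in RESULTS.items()}
for model, share in shares.items():
    print(f"{model:<15} {share:.0%}")
```

On these figures K-Means TFIDF reaches 80% against 90% for the manual upper bound, and LDA 900 reaches two thirds.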

Conclusions
• K-means almost as good as the human
  classification
• LDA is very fast and approximately two
  thirds of the topics are acceptably
  cohesive

• Future work:
    – Make it hierarchical
    – Create hybrid algorithms

Thank you for listening

Find out more about the project: http://www.paths-project.eu
m.mhall@sheffield.ac.uk



The research leading to these results has received funding from the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project
partners involved in PATHS (see: http://www.paths-project.eu).

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 

Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

  • 1. Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
      Mark M. Hall, Mark Stevenson, Paul D. Clough
      TPDL 2012, Cyprus, 24-27 September 2012
  • 2. Opening Up Digital Cultural Heritage
      Image credits:
      http://www.flickr.com/photos/brokenthoughts/122096903/
      Carl Collins, http://www.flickr.com/photos/carlcollins/199792939/
      http://www.flickr.com/photos/usnationalarchives/4069633668/
  • 3. Exploring Collections
      • Exploring / Browsing as an alternative to Search (where applicable)
      • Requires some kind of structuring of the data
      • Manual structuring ideal
          – Expensive to generate
          – Integration of collections problematic
      • Alternative: Automatic structuring via clustering
  • 4. Test Collection
      • 28,133 photographs provided by the University of St Andrews Library
          – 85% pre-1940
          – 89% black and white
          – Majority UK
          – Title and description tend to be short
      • Image caption: Ottery St Mary Church
  • 5. Tested Clustering Strategies
      • Latent Dirichlet Allocation (LDA)
          – 300 & 900 topics
          – With and without Pointwise Mutual Information (PMI) filtering
      • K-Means
          – 900 clusters
          – TFIDF vectors & LDA topic vectors
      • OPTICS
          – 900 clusters
          – TFIDF vectors & LDA topic vectors
  • 6. Processing Time
      Model             Wall-clock Time (hh:mm:ss)
      LDA 300           00:21:48
      LDA 900           00:42:42
      LDA + PMI 300     05:05:13
      LDA + PMI 900     17:26:08
      K-Means TFIDF     09:37:40
      K-Means LDA       03:49:04
      OPTICS TFIDF      12:42:13
      OPTICS LDA        05:12:49
  • 7. Evaluation Metrics
      • Cluster cohesion
          – Items in a cluster should be similar to each other
          – Items in a cluster should be different from items in other clusters
      • How to test this?
          – “Intruder” test
          – If you insert an intruder into a cluster, can people find it?
  • 8. Intruder Test
      1. Randomly select one topic
      2. Randomly select four items from the topic
      3. Randomly select a second topic (the “intruder” topic)
      4. Randomly select one item from the second topic (the “intruder” item)
      5. Scramble the five items and let the user choose which one is the “intruder”
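The five steps of the intruder test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_intruder_unit`, the dictionary layout of `clusters`, and the restriction to topics with at least four items are all hypothetical choices made for the example. Items are assumed to belong to exactly one topic.

```python
import random


def build_intruder_unit(clusters, rng):
    """Build one five-item intruder-test unit.

    clusters: dict mapping a topic id to a list of item ids
              (hypothetical layout, items assumed to be in one topic only).
    rng:      a random.Random instance, passed in so runs are reproducible.
    Returns (unit, intruder): the shuffled five items and the intruder item.
    """
    # Step 1: randomly select one topic. Assumption: only topics with at
    # least four items are eligible, so that step 2 can always be carried out.
    eligible = [t for t, items in clusters.items() if len(items) >= 4]
    topic = rng.choice(eligible)

    # Step 2: randomly select four items from the topic.
    members = rng.sample(clusters[topic], 4)

    # Step 3: randomly select a second, different topic (the intruder topic).
    others = [t for t, items in clusters.items() if t != topic and items]
    intruder_topic = rng.choice(others)

    # Step 4: randomly select one item from the intruder topic.
    intruder = rng.choice(clusters[intruder_topic])

    # Step 5: scramble the five items before showing them to a participant.
    unit = members + [intruder]
    rng.shuffle(unit)
    return unit, intruder


# Usage with made-up item ids.
example_clusters = {
    "churches": ["c1", "c2", "c3", "c4", "c5"],
    "harbours": ["h1", "h2", "h3", "h4"],
    "portraits": ["p1", "p2"],
}
unit, intruder = build_intruder_unit(example_clusters, random.Random(0))
print(unit, "intruder:", intruder)
```

Passing the random number generator in explicitly keeps the sampled units reproducible, which matters when the same units are shown to many participants.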
  • 9. Cluster Cohesion – Cohesive
  • 10. Cluster Cohesion – Not Cohesive
  • 11. Evaluation Metrics
      • Cohesive
          – “Intruder” is chosen significantly more frequently than by chance
          – Choice distribution is significantly different from the uniform distribution
      • Borderline cohesive
          – Two out of five items make up > 95% of the answers
          – “Intruder” is one of those two
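The cohesive / borderline criteria above can be turned into a small decision function. The slides do not name the statistical tests used, so the sketch below makes its own assumptions: an exact one-sided binomial test (chance level 1/5) for "chosen more frequently than by chance", a chi-square goodness-of-fit test against the uniform distribution (4 degrees of freedom, critical value 9.488 at the 5% level), and the cohesive check applied before the borderline check. The names `binomial_tail` and `classify_unit` are hypothetical.

```python
from math import comb

# Chi-square critical value for 4 degrees of freedom at alpha = 0.05.
CHI2_CRIT_DF4_05 = 9.488


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def classify_unit(counts, intruder_index, alpha=0.05):
    """Classify one unit from its vote counts.

    counts:         five vote counts, one per displayed item.
    intruder_index: position of the intruder item in counts.
    Assumption: alpha must stay at 0.05 for the hard-coded critical value.
    """
    n = sum(counts)
    expected = n / len(counts)

    # Assumed test 1: is the choice distribution different from uniform?
    chi2 = sum((c - expected) ** 2 / expected for c in counts)

    # Assumed test 2: is the intruder chosen more often than chance (1 in 5)?
    p_intruder = binomial_tail(counts[intruder_index], n, 1 / len(counts))

    if p_intruder < alpha and chi2 > CHI2_CRIT_DF4_05:
        return "cohesive"

    # Borderline: the two most-chosen items take > 95% of the answers
    # and the intruder is one of those two.
    top_two = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)[:2]
    if sum(counts[i] for i in top_two) > 0.95 * n and intruder_index in top_two:
        return "borderline"

    return "non-cohesive"


# Usage with made-up vote counts for units rated 27 times each.
print(classify_unit([2, 1, 20, 2, 2], intruder_index=2))
print(classify_unit([19, 0, 8, 0, 0], intruder_index=2))
print(classify_unit([5, 6, 5, 6, 5], intruder_index=0))
```

Both tests use only the standard library, at the cost of hard-coding the critical value for five answer options and a 5% significance level.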
  • 12. Evaluation Bounds
      • Upper bound
          – Manual annotation
              • 936 topics
      • Lower bound
          – 3 cohesive topics
          – <5% likelihood of seeing that number of cohesive topics by chance
      • Control data
          – 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers
  • 13. Experiment
      • Crowd-sourced using staff & students at Sheffield University
          – 700 participants
      • 9 clustering strategies
          – 30 units per strategy
          – total of 270 units
      • Results
          – 8840 ratings
          – 21-30 ratings per unit (median 27 ratings)
  • 14. Results
      Model             Cohesive   Borderline   Non-Cohesive
      Upper Bound       27         0            3
      Lower Bound       3          0            27
      LDA 300           15         6            9
      LDA 900           20         4            6
      LDA + PMI 300     16         4            10
      LDA + PMI 900     21         2            7
      K-Means TFIDF     24         3            3
      K-Means LDA       20         0            10
      OPTICS TFIDF      14         2            14
      OPTICS LDA        16         0            14
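To compare the rows of the results table more easily, the counts can be converted into shares of the 30 units tested per strategy. The snippet below is a convenience sketch: the figures are copied from the table, while the helper name `cohesive_share` and the option to count borderline units as hits are assumptions made for the example.

```python
# (cohesive, borderline, non-cohesive) counts per model, from the results table.
results = {
    "Upper Bound": (27, 0, 3),
    "Lower Bound": (3, 0, 27),
    "LDA 300": (15, 6, 9),
    "LDA 900": (20, 4, 6),
    "LDA + PMI 300": (16, 4, 10),
    "LDA + PMI 900": (21, 2, 7),
    "K-Means TFIDF": (24, 3, 3),
    "K-Means LDA": (20, 0, 10),
    "OPTICS TFIDF": (14, 2, 14),
    "OPTICS LDA": (16, 0, 14),
}


def cohesive_share(row, include_borderline=False):
    """Fraction of units judged cohesive (optionally counting borderline ones)."""
    cohesive, borderline, non_cohesive = row
    total = cohesive + borderline + non_cohesive
    hits = cohesive + (borderline if include_borderline else 0)
    return hits / total


for model, row in results.items():
    strict = cohesive_share(row)
    lenient = cohesive_share(row, include_borderline=True)
    print(f"{model:<14} cohesive {strict:.0%}   cohesive or borderline {lenient:.0%}")
```

For example, K-Means on TFIDF vectors reaches 24 / 30 = 80% cohesive units against 27 / 30 = 90% for the manual upper bound, and LDA with 900 topics reaches 20 / 30, or about 67%.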
  • 15. Conclusions
      • K-means almost as good as the human classification
      • LDA is very fast and approximately two thirds of the topics are acceptably cohesive
      • Future work:
          – Make it hierarchical
          – Create hybrid algorithms
  • 16. Thank you for listening
      Find out more about the project: http://www.paths-project.eu
      m.mhall@sheffield.ac.uk
      The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project partners involved in PATHS (see: http://www.paths-project.eu).