Using Wikipedia as a reference for extracting semantic information from a text

Andrea Prato & Marco Ronchetti
Università di Trento, Italy
Explicit Semantic Analysis

Gabrilovich & Markovitch, 2007
Throw away:

Stopwords
Fragment pages (<100 words)
Suffixes (stemming)
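
The three filters above can be sketched as follows. This is a minimal illustration, not the authors' code: the stopword list is a tiny invented sample, `crude_stem` is a naive stand-in for a real stemmer (the slides do not say which one was used), and the names `preprocess_page` and `MIN_WORDS` are hypothetical.

```python
import re

# Tiny illustrative stopword list (assumption); a real system would use a larger one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for", "with"}
MIN_WORDS = 100  # pages with fewer words are treated as fragments and dropped


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Naive suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess_page(text):
    """Return the stemmed, stopword-free tokens of a page, or None for a fragment."""
    tokens = tokenize(text)
    if len(tokens) < MIN_WORDS:
        return None  # fragment page: thrown away
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]
```

A page of two words is rejected as a fragment, while a page of 100 or more words comes back as a list of stems with the stopwords removed.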
A sample (ESA)

Text:
The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected. Gene-therapy trials for X-SCID, which have been based on the assumption that IL2RG is minimally oncogenic, may therefore pose some risk to patients.

Concepts returned:
- Leukemia
- Severe combined immunodeficiency
- Cancer
- Non-Hodgkin lymphoma
- AIDS
- ICD-10 Chapter II: Neoplasms; Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
- Bone marrow transplant
- Immunosuppressive drug
- Acute lymphoblastic leukemia
- Multiple sclerosis
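
How a text ends up mapped to a ranked list of Wikipedia concepts can be sketched roughly as below. This is a simplified reading of Explicit Semantic Analysis, under stated assumptions: it uses plain term frequency times inverse document frequency with no normalization, and the function names `build_index` and `esa_concepts` are hypothetical. The published method differs in its weighting details.

```python
import math
from collections import Counter, defaultdict


def build_index(articles):
    """articles: dict mapping article title -> list of tokens.

    Returns an inverted index: word -> {title: tf-idf weight}.
    """
    n_articles = len(articles)
    doc_freq = Counter()
    for tokens in articles.values():
        doc_freq.update(set(tokens))

    index = defaultdict(dict)
    for title, tokens in articles.items():
        for word, tf in Counter(tokens).items():
            index[word][title] = tf * math.log(n_articles / doc_freq[word])
    return index


def esa_concepts(text_tokens, index, top_k=10):
    """Sum the concept vectors of the words in the text and rank the concepts."""
    scores = Counter()
    for word in text_tokens:
        for title, weight in index.get(word, {}).items():
            scores[title] += weight
    return [title for title, _ in scores.most_common(top_k)]
```

Each Wikipedia article plays the role of one concept, so the output is a list of article titles ordered by how strongly the words of the input text are associated with them.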
A sample (ESA)

Text (Lonely Planet Tourist Guide):
Being so tightly packed, Venice doesn't make an ideal place to come to practise your favourite sport, although you'll get a decent workout just walking around and up and down bridges! If you've got any energy left for some extra exercise, try a spot of swimming (although pools are rare) or even a jog. Venice is a bit of a desert for swimmers. You can go in off the Lido (if you're game) or at one of Venice's two public swimming pools (handily, they close in summer).

Concepts returned:
1 - Glossary_of_cue_sports_terms
2 - Swimming
3 - Ian_Thorpe
4 - NCAA_football_bowl_games,_2005-06
5 - Swimming_machine
6 - American_football_strategy
7 - Contract_bridge_glossary
8 - Olympic_Games
9 - Pingu_episodes_series_6
10 - Venice
…
15 - Corruption_in_Ghana
…
27 - Legislative_system_of_the_Peopleʼs_Republic_of_China
Clustering
Wikipedia is hyperlinked
Swimming is clustered with Olympic Games
Throw away:

Large aggregators
Category links
Numbers
Pages with more than N links (N=100)
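
A possible shape for this link filtering is sketched below. It assumes the link structure is available as a dictionary from page title to outgoing link titles, that category links carry a `Category:` prefix, and that "numbers" means purely numeric titles such as year pages. The slides do not confirm any of these details, and `filter_links` is a hypothetical name.

```python
import re

MAX_LINKS = 100  # the N=100 threshold from the slide


def is_number_title(title):
    """True for purely numeric titles, e.g. a year page such as '2005'."""
    return re.fullmatch(r"\d+", title) is not None


def filter_links(outlinks):
    """outlinks: dict mapping page title -> list of linked page titles.

    Drops pages with more than MAX_LINKS links (treated here as large
    aggregators) and removes category and number targets from the rest.
    """
    cleaned = {}
    for page, targets in outlinks.items():
        if len(targets) > MAX_LINKS:
            continue  # too many links: treated as an aggregator page
        cleaned[page] = [
            t for t in targets
            if not t.startswith("Category:") and not is_number_title(t)
        ]
    return cleaned
```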
After clustering:

Only 3 clusters have cardinality larger than 1.
The first cluster, with cardinality 21, was automatically named Swimming.
The second and the third both have cardinality 2, and they are named Training and Venice-bucentaur.
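
The slides do not state which clustering algorithm was used. One simple possibility, shown here purely as an assumption, is to group the returned concepts into connected components of the Wikipedia link graph, treating links as undirected. The function name `cluster_concepts` is hypothetical.

```python
def cluster_concepts(concepts, links):
    """Group concepts that are connected through hyperlinks.

    concepts: iterable of concept titles returned for a text.
    links: dict mapping page title -> list of linked page titles.
    Returns a list of clusters (sorted title lists), largest first.
    """
    concepts = list(concepts)
    wanted = set(concepts)

    # Build an undirected adjacency map restricted to the returned concepts.
    neighbours = {c: set() for c in concepts}
    for source, targets in links.items():
        if source not in wanted:
            continue
        for target in targets:
            if target in wanted and target != source:
                neighbours[source].add(target)
                neighbours[target].add(source)

    # Depth-first search to collect connected components.
    seen, clusters = set(), []
    for start in concepts:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return sorted(clusters, key=len, reverse=True)
```

Under this reading, concepts that no other returned concept links to end up in clusters of cardinality 1, which is how unrelated hits such as Corruption_in_Ghana could be separated from the dominant Swimming cluster.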
Validation: Turing test

Which one is machine-generated?

[Diagram: one text shown alongside three candidate classifications]
Outcome

20 texts of length ranging between 60 and 200 words. Texts were collected from various sources such as newspaper articles, textbooks, random web pages, and MSN Encarta.
Further improvements
Using only nouns

Use a POS tagger to identify syntactic roles in the document to be classified.
Keep only nouns (throw away the rest).


No degradation in the results!
Define Multiwords

Lexical multiword identification approach.
The following generative pattern is considered:

  ((Adj|Noun)+ | ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)*) Noun

  +: one or more   *: zero or more   ?: zero or one   |: or

Validation: a candidate multiword is valid if there is a Wikipedia entry related to it.
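
The pattern and the validation step can be sketched as below, assuming the text has already been POS-tagged into (word, tag) pairs. The tag names `ADJ`, `NOUN` and `PREP`, the `max_len` window, and the set of Wikipedia titles are illustrative assumptions; the slides do not specify the tagger or how the Wikipedia lookup is done.

```python
import re

# One letter per tag so the slide's pattern becomes an ordinary regex over a string.
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PREP": "P"}

# ((Adj|Noun)+ | ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)*) Noun
MULTIWORD = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def candidate_multiwords(tagged, max_len=4):
    """tagged: list of (word, tag) pairs. Returns phrases of 2..max_len words
    whose tag sequence matches the pattern."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    found = []
    for start in range(len(tagged)):
        last = min(len(tagged), start + max_len)
        for end in range(start + 2, last + 1):
            if MULTIWORD.fullmatch(codes[start:end]):
                found.append(" ".join(word for word, _ in tagged[start:end]))
    return found


def valid_multiwords(tagged, wikipedia_titles):
    """Keep only candidates that have a matching Wikipedia entry."""
    return [m for m in candidate_multiwords(tagged) if m.lower() in wikipedia_titles]
```

The pattern alone over-generates (for example it also accepts "immunodeficiency in gene"), which is why the Wikipedia lookup matters as a filter.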
Text with multiwords:

Keep all nouns
Keep all adjectives that are part of a multiword
Evaluation (human inspection of results)

100 samples (50 technical, 50 generic)
Multiwords improved the result significantly in 7 cases (5 technical)
They improved it marginally in 13 cases
They worsened it marginally in 6 cases

Overall improvement: 10% on technical text
Work in progress
Concept-mediated mapping
among documents
How similar are two docs?

Jaccard Index

[Diagram: Doc 1 is linked to Concepts 1, 2 and 3; Doc 3 is linked to Concepts 2, 3 and 4]
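
The Jaccard index of two documents is the size of the intersection of their concept sets divided by the size of their union. A minimal sketch, with the hypothetical function name `jaccard` and with the diagram read as Doc 1 having Concepts 1 to 3 and Doc 3 having Concepts 2 to 4:

```python
def jaccard(concepts_a, concepts_b):
    """Jaccard index of two concept collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


doc1 = {"Concept 1", "Concept 2", "Concept 3"}
doc3 = {"Concept 2", "Concept 3", "Concept 4"}
similarity = jaccard(doc1, doc3)  # 2 shared concepts out of 4 distinct ones
```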
Syllabi comparison
Interlinks
Mapping documents in different languages

Deploying Wikipedia Interlinks

Jaccard Index

[Diagram: Doc 1 is linked to Concepts 1, 2 and 3; Doc 3 is linked to Concepts 2, 3 and 4; the two sides are connected through INTERLINKS]
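
One way to read this slide: concepts extracted from a document in one language are translated through Wikipedia's interlanguage links, and the Jaccard index is then computed in the target language. The sketch below assumes the interlinks are available as a plain dictionary; the function names and the Italian/English titles are illustrative only, since the work is described as still in progress.

```python
def map_concepts(concepts, interlinks):
    """Translate concept titles through interlanguage links.

    interlinks: dict mapping a source-language title to its target-language
    title. Concepts without an interlink are dropped.
    """
    return {interlinks[c] for c in concepts if c in interlinks}


def cross_language_similarity(concepts_src, concepts_tgt, interlinks):
    """Jaccard index between mapped source concepts and target concepts."""
    mapped = map_concepts(concepts_src, interlinks)
    target = set(concepts_tgt)
    union = mapped | target
    return len(mapped & target) / len(union) if union else 0.0
```

Dropping concepts that have no interlink is a design choice made here for simplicity; it lowers the score only indirectly, by shrinking the mapped set.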
