+




    OpenWN-PT: Brazilian Portuguese Wordnet for all
    Alexandre Rademaker, EMAp/FGV, Rio
    Valeria de Paiva, Rearden Commerce, CA
    Gerard de Melo, Berkeley
    Rafael Hausler, EMAp/FGV
    Global Wordnet Conference 2012, Matsue, Japan
+ Fundação Getulio Vargas (FGV)
                                            http://www.fgv.br


 “Fundação Getulio Vargas (FGV) is a Brazilian
   higher education and research institution
   founded in December 20, 1944. It offers regular
   courses of Economics, Business Administration,
   Law, Social Sciences and Applied Mathematics.
   Its original goal was to train people for the
   country's public- and private-sector
   management. […] It is considered by
   Foreign Policy magazine to be a top-5
   "policymaker think-tank" worldwide.”
+ CPDOC and EMAp

  We are starting a project (part of MIST), a joint effort of CPDOC and
  EMAp, in which we want, in the long run, to use formal logical tools to
  reason about knowledge obtained from text in Portuguese. We also want to
  improve the structure of, and search in, the CPDOC databases and files.
+ CPDOC: Center of Brazilian Contemporary
  History (http://cpdoc.fgv.br)
  n    CPDOC is a major center for teaching and researching in the
        Social Sciences and Contemporary History located in Rio de
        Janeiro.

  n    CPDOC is the leading historical research institute in the
        country. It holds a major collection of personal archives, oral
        histories and audiovisual sources pertaining to Brazilian
        contemporary history.
        n    Personal Archives: About 200 archival funds, summing up to 1,8
              million documents, among text, images and videos.
        n    Oral History Program: A huge set of testimonies (in audio and
              video) consisting of more than 1.000 interviews, which
              correspond to up to 5 thousand hours of recordings.
        n    Brazilian Historical Biographic Dictionary (DHBB): in the current
              version, it comprehends 7.553 entries, of which 6.584 are of
              biographical nature and 969 related to institutions, events and
              concepts of interest for the Brazilian history after 1930.
+ EMAp: School of Applied Mathematics
  (http://emap.fgv.br)
  n    Created to develop expertise in Mathematics applied to science
        an technology and help advance FGV's own mission.

  n    Core team of highly creative and competent mathematicians
        experts in image and signal/sound processing. Not much in text
        processing.

  n    Huge demand for mathematical and computational tools to
        model the recent social changes in Brazil

  n    Active partnerships with other schools at FGV and other
        institutions like Light (power supplier company of RJ) , Petrobras
        etc.

  n    Undergraduate and graduate courses (Master)

  n    Some projects include: Mathematical Epidemiology, Facial
        Recognition, Modeling the Judiciary, Modeling Legal Conflicts
        and Natural Language Processing
+ MIST Project: images
    Asla Sá
    Applied Mathematics School – NLP Research Overview

    n    The original problem

    [Photograph]
    Legend: Left to right: (foreground) Flávio Marcílio (1st); Ernesto Geisel (2nd); Paulo Torres (3rd); Eloy José da Rocha (4th). (Background) Adalberto Pereira dos Santos (1st). Photo: Agência Nacional (Studio/Agency).
+ MIST Project: images
    n    Very Important Faces, developed by EMAp researchers
+ MIST Project: audio files
    Moacyr Silva

    n    Aligning text and sound
+ MIST Project: NLP and ontology engineering
  Alexandre Rademaker and Renato Rocha

  n    Conversion of the current authorized subject headings into a history
        thesaurus: people, processes, events, places etc. These will be
        afterward converted to domain ontologies and incorporated in the
        Semantic Portal.

  n    Unify access to the CPDOC Systems; Enhanced visibility to search
        engines with unification of concepts terminology;

  n    Integration with the Linked Open Data (LOD) via RDF triplification;

  n    Integration with the Learning Objects Databases and the FGV
        Digital Library;

  n    NLP to extract more relations and knowledge from texts (first DHBB)
+ OpenWordnet-PT?
  (aren’t all wordnets open?)
 There have been some attempts: WordNet.PT and
 WordNet.PT global (Lisbon), MultiWordNet.PT
 and the Brazilian WordNet by Bento Dias.




                                         We need a Portuguese
                                         Wordnet for our work, but
                                         none of the previous projects
                                         are openly available.
+ Inspiration: PARC’s Bridge Architecture
    [Diagram] Pipeline: Text -> XLE/LFG Parsing -> F-structure -> Transfer
    semantics -> KR Mapping -> AKR -> Inference Engine. Sources are turned
    into Assertions and a Question into a Query.

    Supporting components, left to right: LFG, XLE, MaxEnt models; Unified
    Lexicon, term rewriting, KR mapping rules, factives, deverbals; ECD,
    textual inference logics.



Basic idea: canonicalization of meanings
+ Simplifying PARC’s Bridge Architecture
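    [Diagram] Pipeline: Text -> Parsing -> F-structure -> semantics ->
    KR Mapping -> KR -> Inference Engines. Sources are turned into
    Assertions and a Question into a Query.

    Supporting components, left to right: Grammar, Stanford Parser; term
    rewriting, OpenWN-PT, SUMO-PT, KR mapping rules; textual inference
    logics.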



Idea: Simplify and reproduce components in PORTUGUESE
+ Language/KR (mis?)alignments:

 n  Language
   n  Generalizations come from the structure of the language
   n  Representations compositionally derived from sentence structure


 n  Knowledge    representation
   n  Generalizations come from the structure of the world
   n  Representations to support reasoning
   n  Maintain multiple interpretations

 n  Layered   bridge helps with the different constraints
 n  FIRST   STEP of simplified architecture:
  WORDNET for PORTUGUESE
+ OpenWN-PT: How?

 n    Leverage EuroWordNet, Global WordNet experiece

 n    Leverage YAGO, UWN experience…

 n    Recruited Gerard de Melo for project

 n    Gerard’s work: UWN/MENTA A large-scale multilingual
       lexical knowledge base built using statistical methods,
       transforming WordNet into a massively multilingual resource
       (over 1 million words and several million named entities in a
       single large multilingual taxonomy)

 n    Let us look at Portuguese-projection of UWN/Menta. This is an
       automated version of a Portuguese WordNet, publicly available.


                                 https://github.com/arademaker/wordnet-br
+ OpenWN-PT: is it done?
 n    Universal WordNet (UWN) experience: Towards a Universal Wordnet by
       Learning from Combined Evidence (de Melo, Weikum, (CIKM 2009) )

 n    A methodology for the automatic construction of a large-scale multilingual
       lexical database where words of many languages are hierarchically
       organized in terms of their meanings and their semantic relations to other
       words.

 n    Bootstrapped from WordNet, extends it with around 1.5 million meaning
       links for 800,000 words in over 200 languages, drawing on evidence
       extracted from a variety of resources including existing (monolingual)
       wordnets, (mostly bilingual) translation dictionaries, and parallel corpora.

       Graph-based scoring functions and statistical learning techniques are used
       to iteratively integrate this information and build an output graph.

 n    Experiments show high level of precision and coverage more than 86%.
       Approx 24K terms in Portuguese

 n    Is it good enough? Depends on application…
+ OpenWN-PT: How did we start?
   The file was generated by combining the following data:

   Princeton WordNet 3.0 was used to obtain English glosses and
   English terms for synset IDs.

   The unreleased 2010-12 version of UWN and MENTA provided
   candidate terms in Portuguese, candidate glosses in Portuguese
   (from Wikipedia), and candidate terms in Spanish.

   The EuroWordNet base concept list (5000_bc.xml) provides the base
   concept numbers. The original file was mapped from WordNet 2.0 to
   3.0 using the mappings from WN-Map. When multiple mappings for
   a WordNet 2.0 synset existed, all possible WordNet 3.0 synsets were
   kept. Hence, there may be multiple entries with the same base
   concept number.

http://nlp.lsi.upc.edu/web/index.php?option=com_content&task=view&id=21&Itemid=57




                                             https://github.com/arademaker/wordnet-br
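
The one-to-many remapping of base concepts described above can be sketched as follows. This is a minimal illustration, not the script actually used: the function name, the in-memory data layout, and the synset IDs are assumptions, and the real WN-Map files would need their own parser.

```python
from collections import defaultdict


def remap_base_concepts(base_concepts, wn20_to_wn30):
    """Map base concept numbers from WordNet 2.0 synset IDs to WordNet 3.0 IDs.

    base_concepts: dict of base concept number -> WordNet 2.0 synset ID
    wn20_to_wn30:  iterable of (wn20_id, wn30_id) pairs; a 2.0 ID may repeat
    Returns a list of (base concept number, wn30_id) rows.
    """
    targets = defaultdict(list)
    for old_id, new_id in wn20_to_wn30:
        targets[old_id].append(new_id)

    rows = []
    for bc_number, old_id in sorted(base_concepts.items()):
        # Keep every candidate 3.0 synset, so one base concept number
        # may appear on several rows.
        for new_id in targets.get(old_id, []):
            rows.append((bc_number, new_id))
    return rows


# Made-up IDs, for illustration only.
base_concepts = {1: "n-00001111", 2: "n-00002222"}
mappings = [
    ("n-00001111", "n-00001740"),
    ("n-00002222", "n-00002452"),
    ("n-00002222", "n-00002684"),
]
rows = remap_base_concepts(base_concepts, mappings)
print(rows)
```

Base concept 2 ends up on two rows here, which is the duplication the slide warns about.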
+ OpenWN-PT: what does it look like?
 n    Typical good entry with minor manual improvements.

 n    Automatic produces candidate Portuguese words for each
       of some of WN3.0 synsets.

 n    Check suggested words and add Portuguese gloss and
       examples.
+ OpenWN-PT: what does it look like?

                      Not very useful




                    Good automatic suggestion
+ OpenWN-PT: lexical gaps
+ OpenWN-PT: revisions




We are not using
linguistic experts;
revision is always
necessary!
+ OpenWN-PT: first step guidelines

  n    Read the English gloss and the English words.

  n    Come up with Portuguese words that express the same meaning
        as the English gloss and have the part-of-speech indicated by the
        first letter of the WordNet synset identifer and write them into "PT-
        Words-Man”.

  n    Write a Portuguese gloss into the "PT-Gloss” field. If the gloss
        contains English example sentences, then only translate them if
        their translations sound natural in Portuguese and if the
        translation actually contains the Portuguese words added to the
        synset.
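
As a small aid for the guidelines above, the part of speech can be read off the first letter of a synset identifier. The sketch below assumes identifiers that start with one of the standard WordNet letters (n, v, a, s, r) followed by an offset, such as the made-up `n00001740`; the exact ID layout in the project file is not shown in these slides.

```python
# Standard WordNet part-of-speech letters.
POS_BY_LETTER = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}


def pos_of(synset_id):
    """Return the part of speech encoded in the first letter of a synset ID."""
    letter = synset_id[:1].lower()
    if letter not in POS_BY_LETTER:
        raise ValueError(f"unrecognized part-of-speech letter in {synset_id!r}")
    return POS_BY_LETTER[letter]


# Made-up identifiers, for illustration only.
for synset_id in ["n00001740", "v00001740", "r00001740"]:
    print(synset_id, pos_of(synset_id))
```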
+
    Done? Not so simple...
    n    Checking is much easier than starting from scratch..

    n    But long and tedious work to check even the initial 5k synsets
          suggested by GWA let alone the 24k synsets already in UWN

    n    Necessary? YES! Lexical gaps of all sorts

    n    Evolving guidelines for translators/checkers

    n    Assumed we’d be done on 5K for this talk, but still working.

    n    Payoff expected: A huge body of work on data, hopefully
          reproducible in Portuguese
+ OpenWN-PT: next step

  n    Keep following the procedure described as the “expand
        approach” for the global wordnet grid.

  n    First translate the synsets in the Princeton WordNet to Portuguese,
        then take over the relations from Princeton and revise, adding the
        Portuguese terms that satisfy different relations. Then revise and
        revise and revise until we can guarantee the consistency of the
        taxonomy.

  n    Since we have a first target corpus, the Brazilian Historical
        Biographic Dictionary, we can also calculate word frequency to
        prioritize expansion of the OpenWN-PT.
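
The frequency-based prioritization mentioned above could look roughly like this. It is a sketch under simple assumptions: the corpus is a plain string, tokens are lowercase runs of letters, multiword terms are ignored, and a synset's score is the summed corpus frequency of its candidate words. The function names, the sample sentence, and the synset IDs are made up.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens (runs of Unicode letters) in a text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def prioritize(synsets, freqs):
    """Order synset IDs by the summed corpus frequency of their words.

    synsets: dict of synset ID -> list of candidate Portuguese words
    Ties are broken by synset ID so the order is deterministic.
    """
    def score(words):
        return sum(freqs[w.lower()] for w in words)

    ranked = sorted(synsets.items(), key=lambda item: (-score(item[1]), item[0]))
    return [synset_id for synset_id, _ in ranked]


# Toy corpus and made-up synset IDs, for illustration only.
corpus = "O deputado foi eleito deputado federal e depois ministro."
freqs = word_frequencies(corpus)
candidates = {"n-1": ["ministro"], "n-2": ["deputado"], "n-3": ["senador"]}
order = prioritize(candidates, freqs)
print(order)
```

Synsets whose words never occur in the corpus sink to the end of the list, so checkers would reach them last.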
+ Conclusions
 n    Took to heart GWA’s claim: need OPEN Portuguese WordNet, starting with 5k
       concepts suggested.

 n    Have automatically-constructed version obtained from Universal WordNet UWN/
       Menta

 n    We're not where we wanted to be, but things are progressing solidly. Many issues
       on working at a distance. We had hoped to have 5k synsets done by now. 812
       synsets is a good start, considering the Zipfian distribution of WF. Each synset has
       multiple words, and Francis and Kucera showed that with 1000 words, you can
       already understand 72% of written text.

 n    Of the 300 synsets that were double inspected/corrected by hand, Gerard
       methods really seem to be living up to expectations. The data is language, so it's
       messy, noisy and subject to interpretation, but mostly it seems good quality.

 n    Need to increase number of people doing it, need to create more checks. We
       want to experiment crowd sourcing, like http://tagger.thepcf.org.uk/, or game
       oriented, http://freerice.com/. To volunteers workers, maybe motivated by status
       upgrade like http://stackoverflow.com/. Try the Asian Wordnet Management
       System.
+

    Thanks!
+
    References
    Towards a Universal Wordnet by Learning from Combined Evidence.
    Gerard de Melo, Gerhard Weikum (2009). 18th ACM Conference on
    Information and Knowledge Management (CIKM 2009), Hong Kong, China.

    Bridges from Language to Logic: Concepts, Contexts and Ontologies.
    Valeria de Paiva (2010). Logical and Semantic Frameworks with
    Applications (LSFA'10), Natal, Brazil.

    A Basic Logic for Textual Inference. AAAI Workshop on Inference for
    Textual Question Answering, 2005.

    Textual Inference Logic: Take Two. CONTEXT 2007.

    Precision-focused Textual Inference. Workshop on Textual Entailment
    and Paraphrasing, 2007.

    PARC's Bridge and Question Answering System. Proceedings of Grammar
    Engineering Across Frameworks, 2007.

Mais conteúdo relacionado

Mais procurados

A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...Editor IJARCET
 
Relationship between the Semantic Web and NLP
Relationship between the Semantic Web and NLPRelationship between the Semantic Web and NLP
Relationship between the Semantic Web and NLPR A Akerkar
 
Tutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsTutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsAdrian Paschke
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003butest
 
Ph.D. Defense Presentation of "Open-Domain Word-Level Interpretation of Norwe...
Ph.D. Defense Presentation of "Open-Domain Word-Level Interpretation of Norwe...Ph.D. Defense Presentation of "Open-Domain Word-Level Interpretation of Norwe...
Ph.D. Defense Presentation of "Open-Domain Word-Level Interpretation of Norwe...Martin Thorsen Ranang
 
Knowledge Multimedia Processes in Technology Enhanced Learning
Knowledge Multimedia Processes in Technology Enhanced LearningKnowledge Multimedia Processes in Technology Enhanced Learning
Knowledge Multimedia Processes in Technology Enhanced LearningRalf Klamma
 
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...Petros Tsonis
 
Pal gov.tutorial2.session1.xml basics and namespaces
Pal gov.tutorial2.session1.xml basics and namespacesPal gov.tutorial2.session1.xml basics and namespaces
Pal gov.tutorial2.session1.xml basics and namespacesMustafa Jarrar
 
Statistics-based Approaches to Lexical Semantics
Statistics-based Approaches to Lexical SemanticsStatistics-based Approaches to Lexical Semantics
Statistics-based Approaches to Lexical SemanticsMartin Thorsen Ranang
 

Mais procurados (11)

A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
A Combined Approach to Part-of-Speech Tagging Using Features Extraction and H...
 
CGW 2010 - NLPN
CGW 2010 - NLPNCGW 2010 - NLPN
CGW 2010 - NLPN
 
Relationship between the Semantic Web and NLP
Relationship between the Semantic Web and NLPRelationship between the Semantic Web and NLP
Relationship between the Semantic Web and NLP
 
Tutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsTutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and Systems
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003
 
Ph.D. Defense Presentation of "Open-Domain Word-Level Interpretation of Norwe...
Ph.D. Defense Presentation of "Open-Domain Word-Level Interpretation of Norwe...Ph.D. Defense Presentation of "Open-Domain Word-Level Interpretation of Norwe...
Ph.D. Defense Presentation of "Open-Domain Word-Level Interpretation of Norwe...
 
Knowledge Multimedia Processes in Technology Enhanced Learning
Knowledge Multimedia Processes in Technology Enhanced LearningKnowledge Multimedia Processes in Technology Enhanced Learning
Knowledge Multimedia Processes in Technology Enhanced Learning
 
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
Challenge of Image Retrieval, Brighton, 2000 1 ANVIL: a System for the Retrie...
 
Pal gov.tutorial2.session1.xml basics and namespaces
Pal gov.tutorial2.session1.xml basics and namespacesPal gov.tutorial2.session1.xml basics and namespaces
Pal gov.tutorial2.session1.xml basics and namespaces
 
IPA Spring Days 2012
IPA Spring Days 2012IPA Spring Days 2012
IPA Spring Days 2012
 
Statistics-based Approaches to Lexical Semantics
Statistics-based Approaches to Lexical SemanticsStatistics-based Approaches to Lexical Semantics
Statistics-based Approaches to Lexical Semantics
 

Semelhante a OpenWN-PT: a Brazilian Wordnet for all

OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportAlexandre Rademaker
 
Logics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingLogics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingValeria de Paiva
 
Lexical Resources for Portuguese
Lexical Resources  for PortugueseLexical Resources  for Portuguese
Lexical Resources for PortugueseValeria de Paiva
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Facultad de Informática UCM
 
Seeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for PortugueseSeeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for PortugueseValeria de Paiva
 
Dimensions of Media Object Comprehensibility
Dimensions of Media Object ComprehensibilityDimensions of Media Object Comprehensibility
Dimensions of Media Object ComprehensibilityLawrie Hunter
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Christophe Tricot
 
Body-Part Nouns and Whole-Part Relations in Portuguese
Body-Part Nouns and Whole-Part Relations in PortugueseBody-Part Nouns and Whole-Part Relations in Portuguese
Body-Part Nouns and Whole-Part Relations in PortugueseJorge Baptista
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftSebastian Hellmann
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer modelsDing Li
 
Lean Logic for Lean Times: Varieties of Natural Logic
Lean Logic for Lean Times: Varieties of Natural LogicLean Logic for Lean Times: Varieties of Natural Logic
Lean Logic for Lean Times: Varieties of Natural LogicValeria de Paiva
 
Embedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PTEmbedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PTAlexandre Rademaker
 
Grammatical Framework for implementing multilingual frames and constructions
Grammatical Framework for implementing multilingual frames and constructionsGrammatical Framework for implementing multilingual frames and constructions
Grammatical Framework for implementing multilingual frames and constructionsNormunds Grūzītis
 
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The ServicesLynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The ServicesLynx Project
 
Multilingual Access to Cultural Heritage Content on the Semantic Web - Acl2013
Multilingual Access to Cultural Heritage Content on the Semantic Web - Acl2013Multilingual Access to Cultural Heritage Content on the Semantic Web - Acl2013
Multilingual Access to Cultural Heritage Content on the Semantic Web - Acl2013Mariana Damova, Ph.D
 

Semelhante a OpenWN-PT: a Brazilian Wordnet for all (20)

Pargram2011
Pargram2011Pargram2011
Pargram2011
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project Report
 
Logics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingLogics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese Understanding
 
Lexical Resources for Portuguese
Lexical Resources  for PortugueseLexical Resources  for Portuguese
Lexical Resources for Portuguese
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
 
Seeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for PortugueseSeeing is Correcting:Linked Open Data for Portuguese
Seeing is Correcting:Linked Open Data for Portuguese
 
Ontology development
Ontology developmentOntology development
Ontology development
 
Dimensions of Media Object Comprehensibility
Dimensions of Media Object ComprehensibilityDimensions of Media Object Comprehensibility
Dimensions of Media Object Comprehensibility
 
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Ext...
 
Body-Part Nouns and Whole-Part Relations in Portuguese
Body-Part Nouns and Whole-Part Relations in PortugueseBody-Part Nouns and Whole-Part Relations in Portuguese
Body-Part Nouns and Whole-Part Relations in Portuguese
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draft
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
4V - WP3 Progress Report (TIN2013-46238)
4V - WP3 Progress Report (TIN2013-46238)4V - WP3 Progress Report (TIN2013-46238)
4V - WP3 Progress Report (TIN2013-46238)
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Lean Logic for Lean Times: Varieties of Natural Logic
Lean Logic for Lean Times: Varieties of Natural LogicLean Logic for Lean Times: Varieties of Natural Logic
Lean Logic for Lean Times: Varieties of Natural Logic
 
Embedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PTEmbedding NomLex-BR nominalizations into OpenWordnet-PT
Embedding NomLex-BR nominalizations into OpenWordnet-PT
 
Grammatical Framework for implementing multilingual frames and constructions
Grammatical Framework for implementing multilingual frames and constructionsGrammatical Framework for implementing multilingual frames and constructions
Grammatical Framework for implementing multilingual frames and constructions
 
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The ServicesLynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services
 
Multilingual Access to Cultural Heritage Content on the Semantic Web - Acl2013
Multilingual Access to Cultural Heritage Content on the Semantic Web - Acl2013Multilingual Access to Cultural Heritage Content on the Semantic Web - Acl2013
Multilingual Access to Cultural Heritage Content on the Semantic Web - Acl2013
 
Anabela Barreiro - Alinhamentos
Anabela Barreiro - AlinhamentosAnabela Barreiro - Alinhamentos
Anabela Barreiro - Alinhamentos
 

Mais de Alexandre Rademaker

Verifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetVerifying Integrity Constraints of a RDF-based WordNet
Verifying Integrity Constraints of a RDF-based WordNetAlexandre Rademaker
 
An overview of Portuguese WordNets
An overview of Portuguese WordNetsAn overview of Portuguese WordNets
An overview of Portuguese WordNetsAlexandre Rademaker
 
On the Computational Complexity of Intuitionistic Hybrid Modal Logic
On the Computational Complexity of Intuitionistic Hybrid Modal LogicOn the Computational Complexity of Intuitionistic Hybrid Modal Logic
On the Computational Complexity of Intuitionistic Hybrid Modal LogicAlexandre Rademaker
 
A linked open data architecture for contemporary historical archives
A linked open data architecture for contemporary historical archivesA linked open data architecture for contemporary historical archives
A linked open data architecture for contemporary historical archivesAlexandre Rademaker
 
Processamento de Linguagem Natural em textos da História Comptemporânea do Br...
Processamento de Linguagem Natural em textos da História Comptemporânea do Br...Processamento de Linguagem Natural em textos da História Comptemporânea do Br...
Processamento de Linguagem Natural em textos da História Comptemporânea do Br...Alexandre Rademaker
 
On the proof theory for Description Logics
On the proof theory for Description LogicsOn the proof theory for Description Logics
On the proof theory for Description LogicsAlexandre Rademaker
 
A database approach to monitoring the quality of information in RDF stores
A database approach to monitoring the quality of information in RDF storesA database approach to monitoring the quality of information in RDF stores
A database approach to monitoring the quality of information in RDF storesAlexandre Rademaker
 
Intuitionistic Description Logic for Legal Reasoning
Intuitionistic Description Logic for Legal ReasoningIntuitionistic Description Logic for Legal Reasoning
Intuitionistic Description Logic for Legal ReasoningAlexandre Rademaker
 
Is it important to explain a theorem? A case study in UML and ALCQI
Is it important to explain a theorem? A case study in UML and ALCQIIs it important to explain a theorem? A case study in UML and ALCQI
Is it important to explain a theorem? A case study in UML and ALCQIAlexandre Rademaker
 

OpenWN-PT: a Brazilian Wordnet for all

  • 1. OpenWN-PT: Brazilian Portuguese Wordnet for all. Alexandre Rademaker (EMAp/FGV, Rio); Valeria de Paiva (Rearden Commerce, CA); Gerard de Melo (Berkeley); Rafael Hausler (EMAp/FGV). Global Wordnet Conference 2012, Matsue, Japan.
  • 2. Fundação Getulio Vargas (FGV), http://www.fgv.br: “Fundação Getulio Vargas (FGV) is a Brazilian higher education and research institution founded in December 20, 1944. It offers regular courses of Economics, Business Administration, Law, Social Sciences and Applied Mathematics. Its original goal was to train people for the country's public- and private-sector management. […] It is considered by Foreign Policy magazine to be a top-5 "policymaker think-tank" worldwide.”
  • 3. CPDOC and EMAp: We are starting a project (part of MIST), a joint effort of CPDOC and EMAp, in which we want, in the long run, to use formal logical tools to reason about knowledge obtained from text in Portuguese. We also want to improve the structure of, and search in, the CPDOC databases and files.
  • 4. CPDOC: Center of Brazilian Contemporary History (http://cpdoc.fgv.br)
      • CPDOC is a major center for teaching and research in the Social Sciences and Contemporary History, located in Rio de Janeiro.
      • CPDOC is the leading historical research institute in the country. It holds a major collection of personal archives, oral histories and audiovisual sources pertaining to Brazilian contemporary history.
          • Personal Archives: about 200 archival fonds, totaling 1.8 million documents, including text, images and videos.
          • Oral History Program: a large set of testimonies (audio and video) from more than 1,000 interviews, corresponding to up to 5,000 hours of recordings.
          • Brazilian Historical Biographic Dictionary (DHBB): the current version comprises 7,553 entries, of which 6,584 are biographical and 969 relate to institutions, events and concepts of interest for Brazilian history after 1930.
  • 5. EMAp: School of Applied Mathematics (http://emap.fgv.br)
      • Created to develop expertise in Mathematics applied to science and technology and to help advance FGV's own mission.
      • Core team of highly creative and competent mathematicians, experts in image and signal/sound processing. Not much in text processing.
      • Huge demand for mathematical and computational tools to model the recent social changes in Brazil.
      • Active partnerships with other schools at FGV and with other institutions such as Light (the power supply company of RJ), Petrobras, etc.
      • Undergraduate and graduate (Master's) courses.
      • Some projects: Mathematical Epidemiology, Facial Recognition, Modeling the Judiciary, Modeling Legal Conflicts, and Natural Language Processing.
  • 6. MIST Project: images (Asla Sá). From the Applied Mathematics School NLP Research Overview. The original problem. [Photograph] Legend: Left to right: (foreground) Flávio Marcílio (1st); Ernesto Geisel (2nd); Paulo Torres (3rd); Eloy José da Rocha (4th). (background) Adalberto Pereira dos Santos (1st). Photo: Agência Nacional (Estúdio/Agência).
  • 7. MIST Project: images. Very Important Faces, developed by EMAp researchers.
  • 8. MIST Project: audio files (Moacyr Silva). Aligning text and sound.
  • 9. MIST Project: NLP and ontology engineering (Alexandre Rademaker and Renato Rocha)
      • Conversion of the current authorized subject headings into a history thesaurus: people, processes, events, places, etc. These will afterwards be converted to domain ontologies and incorporated in the Semantic Portal.
      • Unified access to the CPDOC systems; enhanced visibility to search engines through a unified concept terminology.
      • Integration with Linked Open Data (LOD) via RDF triplification.
      • Integration with the Learning Objects Databases and the FGV Digital Library.
      • NLP to extract more relations and knowledge from texts (starting with the DHBB).
  • 10. OpenWordnet-PT? (Aren’t all wordnets open?) There have been some attempts: WordNet.PT and WordNet.PT global (Lisboa), MultiWordNet.PT, and the Brazilian WordNet by Bento Dias. We need a Portuguese wordnet for our work, but none of the previous projects is openly available.
  • 11. Inspiration: PARC’s Bridge Architecture. [Diagram; recoverable labels: Text (Sources, Question), XLE/LFG Parsing, F-structure, Transfer semantics, KR Mapping, AKR (Assertions, Query), Inference Engine; resources: LFG, XLE, MaxEnt models, Unified Lexicon, Term rewriting, KR mapping rules, Factives, Deverbals, ECD, Textual Inference logics.] Basic idea: canonicalization of meanings.
  • 12. Simplifying PARC’s Bridge Architecture. [Diagram; recoverable labels: Text (Sources, Question), Parsing, F-structure, semantics, KR Mapping, KR (Assertions, Query), Inference Engines; resources: Grammar, Stanford Parser, Term rewriting, OpenWN-PT, SUMO-PT, KR mapping rules, Textual Inference logics.] Idea: simplify and reproduce the components in PORTUGUESE.
  • 13. Language/KR (mis?)alignments:
      • Language
          • Generalizations come from the structure of the language.
          • Representations are compositionally derived from sentence structure.
      • Knowledge representation
          • Generalizations come from the structure of the world.
          • Representations support reasoning.
      • Maintain multiple interpretations.
      • A layered bridge helps with the different constraints.
      • FIRST STEP of the simplified architecture: a WORDNET for PORTUGUESE.
  • 14. OpenWN-PT: How?
      • Leverage the EuroWordNet and Global WordNet experience.
      • Leverage the YAGO and UWN experience…
      • Recruited Gerard de Melo for the project.
      • Gerard’s work, UWN/MENTA: a large-scale multilingual lexical knowledge base built using statistical methods, transforming WordNet into a massively multilingual resource (over 1 million words and several million named entities in a single large multilingual taxonomy).
      • Let us look at the Portuguese projection of UWN/MENTA. This is an automatically built version of a Portuguese WordNet, publicly available at https://github.com/arademaker/wordnet-br
  • 15. OpenWN-PT: is it done?
      • Universal WordNet (UWN) experience: Towards a Universal Wordnet by Learning from Combined Evidence (de Melo and Weikum, CIKM 2009).
      • A methodology for the automatic construction of a large-scale multilingual lexical database in which words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words.
      • Bootstrapped from WordNet, it extends it with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources, including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph.
      • Experiments show a high level of precision and coverage (more than 86%). Approximately 24K terms in Portuguese.
      • Is it good enough? Depends on the application…
  • 16. OpenWN-PT: How did we start? The file was generated by combining the following data:
      • Princeton WordNet 3.0 was used to obtain English glosses and English terms for synset IDs.
      • The unreleased 2010-12 version of UWN and MENTA provided candidate terms in Portuguese, candidate glosses in Portuguese (from Wikipedia), and candidate terms in Spanish.
      • The EuroWordNet base concept list (5000_bc.xml) provides the base concept numbers. The original file was mapped from WordNet 2.0 to 3.0 using the mappings from WN-Map. When multiple mappings for a WordNet 2.0 synset existed, all possible WordNet 3.0 synsets were kept. Hence, there may be multiple entries with the same base concept number.
      • http://nlp.lsi.upc.edu/web/index.php?option=com_content&task=view&id=21&Itemid=57
      • https://github.com/arademaker/wordnet-br
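The one-to-many step on slide 16 (keeping every WordNet 3.0 synset when a WordNet 2.0 synset has several mappings) can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `map_base_concepts`, the synset IDs, and the in-memory dictionaries are all hypothetical placeholders, and the real WN-Map and 5000_bc.xml file formats are not modeled here.

```python
# Hypothetical toy data: base concept number -> WordNet 2.0 synset ID.
# The IDs are invented placeholders, not real WordNet offsets.
base_concepts = {
    1: "wn20:0001-n",
    2: "wn20:0002-v",
}

# Hypothetical mapping table from WordNet 2.0 IDs to WordNet 3.0 IDs.
# A 2.0 synset may map to several 3.0 synsets.
wn20_to_wn30 = {
    "wn20:0001-n": ["wn30:0001-n"],
    "wn20:0002-v": ["wn30:0002-v", "wn30:0003-v"],
}


def map_base_concepts(base_concepts, mapping):
    """Return (base concept number, WordNet 3.0 ID) rows.

    Every possible target synset is kept, so one base concept number
    can appear in several rows.
    """
    rows = []
    for number, old_id in sorted(base_concepts.items()):
        for new_id in mapping.get(old_id, []):
            rows.append((number, new_id))
    return rows


rows = map_base_concepts(base_concepts, wn20_to_wn30)
for number, synset_id in rows:
    print(number, synset_id)
```

In this toy run, base concept 2 appears twice, which is the situation the slide describes as multiple entries with the same base concept number.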
  • 17. OpenWN-PT: what does it look like?
      • A typical good entry with minor manual improvements.
      • The automatic process produces candidate Portuguese words for some of the WN 3.0 synsets.
      • We check the suggested words and add a Portuguese gloss and examples.
  • 18. OpenWN-PT: what does it look like? [Example entries, annotated “Not very useful” and “Good automatic suggestion”.]
  • 20. OpenWN-PT: revisions. We are not using linguistic experts, so revision is always necessary!
  • 21. OpenWN-PT: first step guidelines
      • Read the English gloss and the English words.
      • Come up with Portuguese words that express the same meaning as the English gloss and have the part of speech indicated by the first letter of the WordNet synset identifier, and write them into the "PT-Words-Man" field.
      • Write a Portuguese gloss into the "PT-Gloss" field. If the gloss contains English example sentences, translate them only if the translations sound natural in Portuguese and actually contain the Portuguese words added to the synset.
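The guideline on slide 21 relies on reading the part of speech from the first letter of the synset identifier. A minimal sketch of that lookup is shown below. The letter codes are the standard WordNet synset-type letters; the function name `part_of_speech` and the identifier layout (letter followed by digits) are assumptions made for illustration, since the slides do not show the file's exact identifier format.

```python
# Standard WordNet synset-type letters.
POS_BY_LETTER = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}


def part_of_speech(synset_id):
    """Return the part of speech named by the first letter of synset_id.

    Assumes (hypothetically) that identifiers look like 'n00001740'.
    """
    letter = synset_id[:1].lower()
    if letter not in POS_BY_LETTER:
        raise ValueError(f"unknown part-of-speech letter in {synset_id!r}")
    return POS_BY_LETTER[letter]


# Invented identifiers, used only to exercise the lookup.
print(part_of_speech("n00000001"))
print(part_of_speech("v00000002"))
```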
  • 22. Done? Not so simple...
      • Checking is much easier than starting from scratch.
      • But it is long and tedious work to check even the initial 5k synsets suggested by GWA, let alone the 24k synsets already in UWN.
      • Necessary? YES! Lexical gaps of all sorts.
      • Evolving guidelines for translators/checkers.
      • We assumed we would be done with the 5k for this talk, but we are still working.
      • Expected payoff: a huge body of work on data, hopefully reproducible in Portuguese.
  • 23. OpenWN-PT: next step
      • Keep following the procedure described as the “expand approach” for the global wordnet grid.
      • First translate the synsets in the Princeton WordNet to Portuguese, then take over the relations from Princeton and revise, adding the Portuguese terms that satisfy different relations. Then revise and revise and revise until we can guarantee the consistency of the taxonomy.
      • Since we have a first target corpus, the Brazilian Historical Biographic Dictionary, we can also calculate word frequency to prioritize expansion of the OpenWN-PT.
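The frequency-based prioritization mentioned on slide 23 could look roughly like the sketch below: count the words of a target corpus and rank those not yet covered by the wordnet. Everything here is an assumption for illustration: the function name `rank_uncovered_words`, the naive letter-run tokenizer, the absence of lemmatization, and the tiny sample sentence standing in for the DHBB.

```python
import re
from collections import Counter


def rank_uncovered_words(corpus_text, covered_words):
    """Rank corpus words missing from the wordnet by descending frequency.

    Tokenization is a naive split on runs of letters, with no lemmatization.
    """
    tokens = re.findall(r"[^\W\d_]+", corpus_text.lower())
    counts = Counter(t for t in tokens if t not in covered_words)
    return counts.most_common()


# Invented sample text and coverage set, standing in for the real corpus
# and for the words already present in OpenWN-PT.
sample = "o presidente e o ministro; o presidente"
covered = {"o", "e"}
ranking = rank_uncovered_words(sample, covered)
print(ranking)
```

Words at the top of the ranking would be the first candidates for new or revised synsets.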
  • 24. Conclusions
      • We took to heart GWA’s claim: there is a need for an OPEN Portuguese WordNet, starting with the 5k concepts suggested.
      • We have an automatically constructed version obtained from Universal WordNet (UWN/MENTA).
      • We are not where we wanted to be, but things are progressing solidly. There are many issues with working at a distance. We had hoped to have 5k synsets done by now. 812 synsets is a good start, considering the Zipfian distribution of word frequencies. Each synset has multiple words, and Francis and Kucera showed that with 1000 words you can already understand 72% of written text.
      • Of the 300 synsets that were double inspected/corrected by hand, Gerard's methods really seem to be living up to expectations. The data is language, so it is messy, noisy and subject to interpretation, but mostly it seems to be of good quality.
      • We need to increase the number of people doing the work and to create more checks. We want to experiment with crowdsourcing, like http://tagger.thepcf.org.uk/, or with a game-oriented approach, like http://freerice.com/, relying on volunteer workers, maybe motivated by a status upgrade as on http://stackoverflow.com/. We also want to try the Asian Wordnet Management System.
  • 25. Thanks!
  • 26. References
      • Gerard de Melo, Gerhard Weikum (2009). "Towards a Universal Wordnet by Learning from Combined Evidence". 18th ACM Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China.
      • Valeria de Paiva (2010). "Bridges from Language to Logic: Concepts, Contexts and Ontologies". Logical and Semantic Frameworks with Applications, LSFA'10, Natal, Brazil.
      • "A Basic Logic for Textual Inference". AAAI Workshop on Inference for Textual Question Answering, 2005.
      • "Textual Inference Logic: Take Two". CONTEXT 2007.
      • "Precision-focused Textual Inference". Workshop on Textual Entailment and Paraphrasing, 2007.
      • "PARC's Bridge and Question Answering System". Proceedings of Grammar Engineering Across Frameworks, 2007.