© 2017 Nuance Communications, Inc. All rights reserved.
Universal Dependencies
for Inference
Valeria de Paiva AI & NLU Lab, Sunnyvale
(joint work with A. Rademaker and F. Chalub, IBM Research, Brazil)
February, 2017
Outline
•  The corpus SICK
•  Universal Dependencies
•  Processing SICK with
FreeLing & Parsey
•  Analysis
•  What’s Next?
•  SICK stands for Sentences Involving Compositional
Knowledge.
•  Result of the 5-year European project COMPOSES (2011-2016).
•  The SICK data set consists of about 10,000 English sentence
pairs, derived from captions of images.
•  Original sentences were normalized to remove unwanted
linguistic phenomena; the sentences were expanded to obtain
up to three new sentences suitable for evaluation; all the
sentences were then paired to obtain the final data set.
•  Each sentence pair was annotated for (relatedness and)
entailment by means of crowdsourcing techniques.
•  Entailment annotation led to 5595 neutral pairs, 1424
contradiction pairs, and 2821 entailment pairs.
•  All available from http://clic.cimec.unitn.it/composes/sick.html
The Corpus SICK
–  Simple, common-sense-like sentences: “People are walking”
or “One man in a big city is holding up a sign and begging for
money.”
–  After de-duplication of sentences we have 6076 sentences,
with 512 unique verb lemmas, 269 unique adjectives, 148
unique adverbs and 1076 unique nouns.
–  Created from captions of pictures, SICK is about daily
activities and entities; ones you can see in pictures taken by
real people and which require common sense concepts:
people, cats, dogs, running, climbing, eating, etc.
The SICK corpus
–  Full details, code and data included, can be found in our GitHub repository
https://github.com/own-pt/rte-sick
–  We run the sentences through FreeLing and use its tokenization to match the
UD dependencies, obtained using Parsey McParseface.
–  From the CoNLL representation of each sentence we obtain its bag-of-concepts
using the Princeton WordNet (PWN) mappings provided by FreeLing.
–  We use FreeLing’s Word Sense Disambiguation (WSD), based on Agirre’s UKB
state-of-the-art Personalized PageRank algorithm, to pick a favorite sense.
–  We use PWN to SUMO mappings to obtain logical concepts.
Processing the sentences
Parsey McParseface, FreeLing
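A minimal sketch of the bag-of-concepts step, assuming the annotated CoNLL rows look like the examples shown later in this deck: ten whitespace-separated columns, with the last one packing `XPOS|synset|SUMO-concept` and `?` meaning "no mapping". The function name `bag_of_concepts` is hypothetical; the actual code in the repository may be organized quite differently.

```python
from collections import Counter

# Assumed layout, reconstructed from the slide examples (not an official schema):
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC, where MISC packs
# "XPOS|synset|SUMO-concept" and "?" stands for "no mapping".
CONLL = """\
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=
"""


def bag_of_concepts(conll_text):
    """Return a Counter of (synset, sumo_concept) pairs for one sentence."""
    bag = Counter()
    for line in conll_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        columns = line.split()
        _xpos, synset, concept = columns[-1].split("|")
        if synset != "?":  # tokens without a WordNet sense add no concept
            bag[(synset, concept)] += 1
    return bag


bag = bag_of_concepts(CONLL)
for (synset, concept), count in bag.items():
    print(synset, concept, count)
```

A Counter is used rather than a set so that repeated concepts in longer sentences keep their multiplicity.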
Available from http://nlp.cs.upc.edu/freeling/
FreeLing
–  Established project, led by Lluís Padró, UPC (Universitat Politècnica de Catalunya),
since 2004. Latest release: FreeLing 4.0, May 2016
–  FreeLing is a C++ library providing language analysis functionalities:
–  morphological analysis, named entity detection, PoS-tagging, parsing, Word Sense
Disambiguation, Semantic Role Labelling, etc.
–  for a variety of languages: English, Spanish, Portuguese, Italian, French, German,
Russian, Catalan, Galician, Croatian, Slovene, etc.
–  We have been using FreeLing since 2014
A multilingual, open source language analysis tool suite
From https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
SyntaxNet
–  Project, led by Slav Petrov, Google Research, first release May 2016
–  SyntaxNet is an open-source neural network framework implemented in TensorFlow
that provides a foundation for Natural Language Understanding (NLU) systems.
–  Parsey McParseface is an English parser trained by SyntaxNet and used to analyze
English text, producing Stanford dependencies. Given a sentence as input, it tags
each word with a part-of-speech (POS) tag that describes the word's syntactic
function, and it determines the syntactic relationships between words in the
sentence, represented in the dependency parse tree.
–  SyntaxNet applies neural networks to the ambiguity problem using beam search in
the model.
A multilingual, open source language analysis tool suite, based on
NNs
Globally Normalized Transition-Based Neural Networks is their paper
Parsey McParseface
–  Given some data from the Google supported Universal Dependencies project, you
can train a parsing model on your own machine, they say.
–  On a standard benchmark (the 20-year-old Penn Treebank),
Parsey McParseface achieves 94% accuracy, beating their own previous
state-of-the-art results. On Web text it achieves over 90% parse accuracy
(linguists trained for this task agree in 96-97% of the cases).
–  Couple this with Manning’s famous 97% accuracy on POS tagging…
–  But: Alice drove down the street in her car
The English face of SyntaxNet
“our work is still cut out for us: we
would like to develop methods that
can learn world knowledge and
enable equal understanding of
natural language across all
languages and contexts”
Slav Petrov
Research Engineer, Google Research
Processing of SICK
–  We want to do inference with SICK sentences
–  We want to sort the sentences into 3 buckets:
–  Prop, Prop&Prop, notProp
–  Not so easy to bucket them…
–  Need to find all negations, need to find all conjunctions
of propositions
–  Examples: A woman is rock climbing , pausing and
calculating the route
–  vs
–  A man and a woman are walking through the woods
6076 sentences, but only 1814 unique lemmas, yay!
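The bucketing idea can be sketched roughly as follows. This is only an illustrative heuristic, not the authors' procedure: it assumes each token is a `(form, upos, deprel)` triple in Stanford-dependency style, treats a `neg` relation as negation, and treats only conjoined verbs as conjoined propositions. The two parses are hand-written for illustration and are not parser output.

```python
def bucket(tokens):
    """Assign a sentence to 'notProp', 'Prop&Prop' or 'Prop' (rough heuristic)."""
    if any(deprel == "neg" for _form, _upos, deprel in tokens):
        return "notProp"
    # Only conjoined verbs are taken to signal conjoined propositions;
    # conjoined nouns ("a man and a woman") stay inside a single proposition.
    if any(deprel == "conj" and upos == "VERB" for _form, upos, deprel in tokens):
        return "Prop&Prop"
    return "Prop"


# Hand-written, simplified parses of the two example sentences (assumptions).
climbing = [
    ("A", "DET", "det"), ("woman", "NOUN", "nsubj"), ("is", "VERB", "aux"),
    ("rock", "NOUN", "dep"), ("climbing", "VERB", "ROOT"), (",", "PUNCT", "punct"),
    ("pausing", "VERB", "conj"), ("and", "CONJ", "cc"),
    ("calculating", "VERB", "conj"), ("the", "DET", "det"), ("route", "NOUN", "dobj"),
]
walking = [
    ("A", "DET", "det"), ("man", "NOUN", "nsubj"), ("and", "CONJ", "cc"),
    ("a", "DET", "det"), ("woman", "NOUN", "conj"), ("are", "VERB", "aux"),
    ("walking", "VERB", "ROOT"), ("through", "ADP", "prep"),
    ("the", "DET", "det"), ("woods", "NOUN", "pobj"),
]

print(bucket(climbing))
print(bucket(walking))
```

The part-of-speech test on the `conj` token is what separates the two examples: both contain a conjunction, but only the first conjoins verbs.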
Processing of SICK
What do the CoNLL representations look like? Similar to this
Processing of SICK
–  SICK sentences were “expanded” from a core set of sentences in visual corpora, via
human constructed transformations such as adding passive or active voice, adding
negations, adding adjectival modifiers, etc.
–  Idea was to simplify the linguistic structure, and to create comparisons of different
linguistic phenomena.
–  They say: [caption corpora] contain sentences (as opposed to paragraphs) that
describe the same picture or video and are thus near paraphrases, are lean on
named entities and rich in generic terms.
–  They call this process “normalization” and provide us with 11 normalization rules.
–  Examples next slide
How did they construct their corpus?
How did they construct their corpus? (their LREC2014 paper)
How did they construct their corpus? More “normalization” rules
Processing of SICK
1.  There are still non-sentences like “A couple standing on the curb”, rule #10.
2.  There are 41 possessive dependencies (genitives like An animal is biting somebody 's hand) and 341 poss ones
(for possessive pronouns) as in A man is holding a mask in *his* raised hand, rule #1.
3.  There are still a few NEs in the corpus (e.g. “Seadoo”, a kind of jet ski) and ATVs (all-terrain vehicles), rule #2.
4.  There are a few non-“present continuous” sentences, like The speaker likes parrots, rule #3.
5.  Rule #4 (“Replace complex verb constructions into simpler ones”) is difficult to judge. But examples similar to the one given
(A man is attempting to surf down a hill made of sand → A man is surfing down a hill made of sand) happen in the corpus
several times. We have 4 instances of “trying to catch” and 37 xcomp dependency relations altogether.
6.  Rule #5 (Simplify verb phrases with modals and auxiliaries) is hard to check, as modals are not marked in the dependencies,
as far as I can see.
7.  Rule #6 (Replace phrasal verbs with a synonym if verb and preposition are not adjacent) only helps a little. There are 324
particles, most of them corresponding to particle verbs that change meaning (e.g. A toddler is standing up gets one concept
for standing and one for up).
8.  Rule #7 (Remove multiword expressions): we have 1148 noun-noun dependency relations and many are MWEs.
9.  Rule #8 (Remove dates and numbers) worked fairly well: we only have one temporal modifier (a sunny day), but 748 number
dependencies and a need to tie these to real numbers.
10.  Rule #9 (Turn subordinates into coordinates) also seems to have worked OK.
11.  Rule #11 (Remove indirect interrogative and parenthetical phrases): well, there are at least 4 real appositives, e.g. Four
middle eastern children , three girls and one boy , are climbing on the grotto with a pink interior.
The construction rules didn’t work too well…
Results of SICK
–  The curators of the corpus also made an effort to
reduce the amount of “encyclopedic knowledge” about
the world that is needed.
–  They say “To ensure the quality of the data set, all the
sentences were checked for grammatical or lexical
mistakes and disfluencies by a native English speaker.”
–  Reasonably well, we expect?
–  But the negations tell a different story. All (or almost all)
the expletives (539 of them) are negative, but negation
is not marked. This wouldn’t be a problem, if these
sentences made sense. Many do not.
How well did they do the job of simplifying the language?
Processing of SICK
Meaning-preserving Transformations (from LREC2014 paper again)
Processing of SICK
Meaning-altering Transformations
Results of SICK
According to the stats we have 436 neg dependency relations. They really seem
to correspond to negation using the adverb “not”.
Clearly we are missing many negations (most if not all of the 539 expletives
are negative), because the determiner ‘no’ is not marked as negative. This we
can correct semi-automatically.
But there are many atrocious sentences that do not make sense at all:
“A person is ignoring the motocross bike that is lying on its side and
there is no one is racing by” (maybe the third “is” is a typo?) or “There is
no man wearing clothes that are covered with paint or is sitting outside in a
busy area writing something.”
How can we measure how many bad sentences there are? We decided to
investigate a reduced sub-corpus.
Negation in SICK-PARSEY
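One way the semi-automatic correction might look, as a sketch only: besides the parser's `neg` relation, flag tokens whose lemma is a negative word such as ‘no’. The lemma list `NEGATIVE_LEMMAS` and the `(form, lemma, deprel)` token layout are assumptions made for illustration, and the example parse is hand-written rather than parser output.

```python
# Illustrative lexical cues (an assumption, not the authors' list).
NEGATIVE_LEMMAS = {"no", "nobody", "none", "nothing"}


def negation_cues(tokens):
    """tokens: list of (form, lemma, deprel). Return the forms that signal negation."""
    cues = []
    for form, lemma, deprel in tokens:
        # Keep what the parser already marks, and add the lexical cues it misses.
        if deprel == "neg" or lemma in NEGATIVE_LEMMAS:
            cues.append(form)
    return cues


# Hand-written parse of a shortened expletive sentence.
expletive = [
    ("There", "there", "expl"), ("is", "be", "ROOT"), ("no", "no", "det"),
    ("man", "man", "nsubj"), ("wearing", "wear", "partmod"),
    ("clothes", "clothes", "dobj"),
]
print(negation_cues(expletive))
```

Sentences flagged only by the lexical cues are the candidates for manual review, which is what keeps the correction semi-automatic.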
3-4-5: a hand-checked sub-corpus
–  Only one conjunction:
Paper and scissors both cut
–  13 copulas (4 of them are wrong:
The man is training, The man is rock climbing,
A woman is grating carrots, The woman is dicing garlic)
–  19 negations marked, but 26 negative expletives, 10 nobodies, 1
no person → should be 56 negations; needs work, we know.
–  Compounds? Have 20 nns, some are real compounds:
–  ping pong, golden retriever, sumo wrestlers, tiger cub, sumo
ringers, baby pandas, cartoon airplane
–  Some are processing mistakes, gerund for the noun, lots (14
in 390) hamster singing, lion walking, panda climbing, etc.
–  Also 9 cases of particle verbs (need to update PWN?)
390 sentences (385 nsubj+ 5 nsubjpass)
Problems pale in comparison to WSD
–  Mapping has many issues
–  QAD (quick and dirty, Tetreault & Lee)
–  Rewriting of CoNLL is necessary for
most sentences, future work…
Princeton WN to SUMO mapping
–  SUMO is intended as a foundation
ontology for a variety of information
processing systems.
–  First released in December 2000, it
uses a version of the language
SUO-KIF (first-order logic with
bells&whistles) in a LISP-like syntax.
–  A mapping from WordNet synsets to
SUMO is what makes it special.
–  Many domain ontologies to extend it
–  “the largest formal public ontology in
existence today.”
Why SUMO? Language is important
SUMO is open source, upper ontology
Princeton WN-SUMO mapping
text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=
Kinds of mapping:
1.  Equality: Concept=, the synset corresponds exactly to the concept (as in Walking= above)
2.  Containment: Concept+, the synset is a subconcept of the SUMO concept
3.  Negation: Concept[, the negation of the concept (e.g. silent = not communicating)
4.  A-box: Concept@, shouldn’t be happening here, but it does. OK for Asian, German?
Why SUMO?
SUMO’s mappings: the theory
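The mapping kinds can be read straight off the last character of the concept string. Below is a small sketch of that decoding; the function name `parse_mapping` and the label strings are made up for illustration, and the marker table follows only the markers that appear in this deck (`=`, `+`, `[`, `@`, plus `?` for unmapped tokens).

```python
# Marker-to-kind table, following the four kinds listed on the slide.
SUFFIX_KINDS = {
    "=": "equality",
    "+": "containment",
    "[": "negation",
    "@": "a-box",
}


def parse_mapping(concept):
    """Split e.g. 'Walking=' into ('Walking', 'equality'); '?' means unmapped."""
    if concept == "?":
        return (None, "unmapped")
    name, suffix = concept[:-1], concept[-1]
    if suffix not in SUFFIX_KINDS:
        raise ValueError(f"unknown mapping marker in {concept!r}")
    return (name, SUFFIX_KINDS[suffix])


for raw in ["Walking=", "Entity+", "LinguisticCommunication[", "LandArea@", "?"]:
    print(raw, parse_mapping(raw))
```

Raising on an unknown marker, rather than guessing, makes it easier to notice mapping markers that this small table does not cover.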
Examples
text = Men are sawing
1 Men man NOUN NNS _ 3 nsubj _ NNS|02472293-n|Hominid=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 sawing saw VERB VBG _ 0 ROOT _ VBG|01559590-v|Cutting+
“saw” is a very specific verb and has only one synset in PWN:
01559590-v saw | (cut with a saw; "saw wood for the fireplace")
But SUMO has no exact match for this concept, so it maps to Cutting+, a kind of Cutting,
the one done with a “saw”.
Much worse is the situation with “man”, which occurs in 33 PWN synsets: the WSD algorithm
chose Hominid=, but Man= (equivalent mapping) would be much better.
Containment
Examples
text = Some people are silent
1 Some some DET DT _ 2 det _ DT|?|?
2 people people NOUN NNS _ 4 nsubj _ NNS|07942152-n|GroupOfPeople=
3 are be VERB VBP _ 4 cop _ VBP|02604760-v|Entity+
4 silent silent ADJ JJ _ 0 ROOT _ JJ|00942163-a|LinguisticCommunication[
“silent” is an adjective in 6 synsets, mapped to LinguisticCommunication[, which is to be
read as non-LinguisticCommunication. This seems the wrong WSD, but the correct one
would give us Communication[ (negated subsuming mapping).
Negation
Examples
1 An a DET DT _ 3 det _ DT|?|?
2 American american ADJ JJ _ 3 amod _ JJ|02927303-a|LandArea@
3 footballer footballer NOUN NN _ 5 nsubj _ NN|10101634-n|ProfessionalAthlete+
4 is be VERB VBZ _ 5 aux _ VBZ|02604760-v|Entity+
5 wearing wear VERB VBG _ 0 ROOT _ VBG|00075021-v|RecreationOrExercise+
6 the the DET DT _ 10 det _ DT|?|?
7 red red ADJ JJ _ 10 amod _ JJ|00381097-a|Red=
8 and and CONJ CC _ 7 cc _ CC|?|?
9 white white ADJ JJ _ 7 conj _ JJ|00393105-a|White=
10 strip strip NOUN NN _ 5 dobj _ NN|09449510-n|SelfConnectedObject+
The mapping of ‘American’ to a concept related to the continent America is a reasonable hook for future work
A-box
Analysis
The numbers
Total SICK mappings:
38671 ? Most need to be discarded
29606 + most need to be made specific
16477 =
171 @
67 [
Need to deal with negated, a-box
concepts
Need more bodily process (sleeping..),
Need more baby-zoology (tiger, kitten
not the same)
But the killer is WSD
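To put the mapping counts in proportion, a few lines of arithmetic are enough. The sketch below uses only the five counts from the slide; the variable names are arbitrary. It shows that exact (=) mappings are just under a fifth of the total, while unmapped (?) tokens are the largest group.

```python
# Counts of mapping markers over the SICK corpus, as given on the slide.
counts = {"?": 38671, "+": 29606, "=": 16477, "@": 171, "[": 67}

total = sum(counts.values())
shares = {marker: n / total for marker, n in counts.items()}

print("total mappings:", total)
for marker, n in counts.items():
    # Right-align the raw count and show the share as a percentage.
    print(f"{marker}  {n:>6}  {shares[marker]:6.1%}")
```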
Conclusion
QAD with FreeLing WSD does NOT work
Could repeat Petrov’s words.
Could make a more careful corpus, checking its
processing.
Could rethink WSD. Have done so and included
all the PWN senses, for the time being.
It’s less wrong!!
Considering how to make sure that ontologies
pay for their upkeep, without throwing away
our neural network gains.
Future Work
Enhancing dependencies with friends
Want to use what we’ve learnt during the summer
from the presentations of Schuster, Lewis, Reddy
and Stanovsky. They all had rules that applied to
dependencies to make them more ‘semantic’. We
also have the AKR rules and the matcher rules to
use and improve.
At the moment, working on Stanovsky’s (and
Fauke’s) rules, which work for EN and DE, with
Prateek Jain.
Also working on the Entailment pairs of SICK
(note that concepts alone can give useful info).
CODA
We are the UD-Portuguese now!
There were two totally automatic
treebanks in Portuguese in UDs.
Zeman’s version started from
Linguateca’s Bosque and did
conversions. Google’s UD-BR used
web data and no manual input.
Started re-annotating Bosque in Oct
2016. We’re now the official UD-PT,
as we automatically converted and
manually checked (or are checking)
Bosque. Paper submitted.
Thank you
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...anjaliyadav012327
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 

Último (20)

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 

Universal Dependencies For Inference

  • 1. © 2017 Nuance Communications, Inc. All rights reserved. Universal Dependencies for Inference. Valeria de Paiva, AI & NLU Lab, Sunnyvale (joint work with A. Rademaker and F. Chalub, IBM Research, Brazil). February 2017
  • 2. Outline
      •  The corpus SICK
      •  Universal Dependencies
      •  Processing SICK with FreeLing & Parsey
      •  Analysis
      •  What’s Next?
  • 3. The Corpus SICK
      •  SICK stands for Sentences Involved in Compositional Knowledge.
      •  It is a result of the 5-year European project COMPOSES (2011-2016).
      •  The SICK data set consists of about 10,000 English sentence pairs, drawn from captions of images.
      •  The original sentences were normalized to remove unwanted linguistic phenomena; the normalized sentences were expanded to obtain up to three new sentences suitable for evaluation; all the sentences were then paired to obtain the final data set.
      •  Each sentence pair was annotated for (relatedness and) entailment by means of crowdsourcing techniques.
      •  The entailment annotation led to 5595 neutral pairs, 1424 contradiction pairs, and 2821 entailment pairs.
      •  All available from http://clic.cimec.unitn.it/composes/sick.html
  • 4. The SICK corpus
      –  Simple, common-sense sentences, such as “People are walking” or “One man in a big city is holding up a sign and begging for money”.
      –  After de-duplication we have 6076 sentences, with 512 unique verb lemmas, 269 unique adjectives, 148 unique adverbs and 1076 unique nouns.
      –  Created from captions of pictures, SICK is about daily activities and entities: ones you can see in pictures taken by real people and which require common-sense concepts such as people, cats, dogs, running, climbing, eating, etc.
  • 5. Processing the sentences: Parsey McParseface, FreeLing
      –  Full details, code and data included, can be found in our GitHub repository https://github.com/own-pt/rte-sick
      –  We run the sentences through FreeLing and use its tokenization to match the UD dependencies, obtained using Parsey McParseface.
      –  From the CoNLL representation of each sentence we obtain its bag-of-concepts using the Princeton WordNet (PWN) mappings provided by FreeLing.
      –  We use FreeLing’s Word Sense Disambiguation (WSD), based on Agirre’s UKB state-of-the-art Personalized PageRank algorithm, to pick a favorite sense.
      –  We use PWN to SUMO mappings to obtain logical concepts.
  • 6. FreeLing: A multilingual, open source language analysis tool suite
      –  Available from http://nlp.cs.upc.edu/freeling/
      –  Established project, led by Lluís Padró, UPC (Universitat Politècnica de Catalunya), since 2004. Latest release: FreeLing 4.0, May 2016.
      –  FreeLing is a C++ library providing language analysis functionalities: morphological analysis, named entity detection, PoS-tagging, parsing, Word Sense Disambiguation, Semantic Role Labelling, etc.
      –  It covers a variety of languages: English, Spanish, Portuguese, Italian, French, German, Russian, Catalan, Galician, Croatian, Slovene, etc.
      –  We have been using FreeLing since 2014.
  • 7. SyntaxNet: A multilingual, open source language analysis tool suite, based on NNs
      –  From https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
      –  Project led by Slav Petrov, Google Research; first release May 2016.
      –  SyntaxNet is an open-source neural network framework implemented in TensorFlow that provides a foundation for Natural Language Understanding (NLU) systems.
      –  Parsey McParseface is an English parser trained by SyntaxNet and used to analyze English text, producing Stanford dependencies. Given a sentence as input, it tags each word with a part-of-speech (POS) tag that describes the word's syntactic function, and it determines the syntactic relationships between words in the sentence, represented in the dependency parse tree.
      –  SyntaxNet applies neural networks to the ambiguity problem, using beam search in the model.
  • 8. Parsey McParseface: The English face of SyntaxNet
      –  Their paper is “Globally Normalized Transition-Based Neural Networks”.
      –  Given some data from the Google-supported Universal Dependencies project, you can train a parsing model on your own machine, they say.
      –  On a standard benchmark (the 20-year-old Penn Treebank), Parsey McParseface achieves 94% accuracy, beating their own previous state-of-the-art results. On Web text it achieves over 90% parse accuracy. (Linguists trained for this task agree in 96-97% of the cases.)
      –  Couple this with Manning’s famous 97% accuracy on POS-tagging…
      –  But: Alice drove down the street in her car
  • 9. “our work is still cut out for us: we would like to develop methods that can learn world knowledge and enable equal understanding of natural language across all languages and contexts” (Slav Petrov, Research Engineer, Google Research)
  • 10. Processing of SICK: 6076 sentences, but only 1814 unique lemmas, yay!
      –  We want to do inference with SICK sentences.
      –  We want the sentences in 3 buckets: Prop, Prop&Prop, notProp.
      –  Not so easy to bucket them...
      –  We need to find all negations and all conjunctions of propositions.
      –  Examples: “A woman is rock climbing, pausing and calculating the route” vs “A man and a woman are walking through the woods”.
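To make the bucketing problem concrete, here is a minimal sketch in Python. It is not the procedure used in the talk: the `Token` class, the `bucket` function and the two hand-written parses are hypothetical, and the heuristic assumes that a `neg` relation marks a negated proposition and that only a verb conjoined to another verb counts as a conjunction of propositions.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int     # 1-based position in the sentence
    form: str    # surface form
    upos: str    # coarse part-of-speech tag
    head: int    # index of the head token, 0 for the root
    deprel: str  # dependency relation label


def bucket(tokens):
    """Assign a sentence to notProp, Prop&Prop or Prop (heuristic only)."""
    by_idx = {t.idx: t for t in tokens}
    # Any explicit 'neg' relation sends the sentence to the negated bucket.
    if any(t.deprel == "neg" for t in tokens):
        return "notProp"
    # A verb conjoined to another verb is read as a conjunction of
    # propositions; conjoined nouns ("a man and a woman") are not.
    for t in tokens:
        head = by_idx.get(t.head)
        if (t.deprel == "conj" and t.upos == "VERB"
                and head is not None and head.upos == "VERB"):
            return "Prop&Prop"
    return "Prop"


# Hand-written, simplified parses for illustration (not actual parser output).
walking = [
    Token(1, "A", "DET", 2, "det"),
    Token(2, "man", "NOUN", 7, "nsubj"),
    Token(3, "and", "CONJ", 2, "cc"),
    Token(4, "a", "DET", 5, "det"),
    Token(5, "woman", "NOUN", 2, "conj"),
    Token(6, "are", "VERB", 7, "aux"),
    Token(7, "walking", "VERB", 0, "ROOT"),
]
climbing = [
    Token(1, "A", "DET", 2, "det"),
    Token(2, "woman", "NOUN", 4, "nsubj"),
    Token(3, "is", "VERB", 4, "aux"),
    Token(4, "climbing", "VERB", 0, "ROOT"),
    Token(5, "pausing", "VERB", 4, "conj"),
    Token(6, "and", "CONJ", 4, "cc"),
    Token(7, "calculating", "VERB", 4, "conj"),
]

print(bucket(walking))   # Prop
print(bucket(climbing))  # Prop&Prop
```

The sketch checks negation first, so a sentence that is both negated and conjoined lands in notProp; a real bucketing would have to decide how to treat such cases.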
  • 11. Processing of SICK: 6076 sentences, but only 1814 unique lemmas, yay! What do the CoNLL representations look like? [figure: example CoNLL representation]
  • 12. Processing of SICK: How did they construct their corpus?
      –  SICK sentences were “expanded” from a core set of sentences in visual corpora, via human-constructed transformations such as adding passive or active voice, adding negations, adding adjectival modifiers, etc.
      –  The idea was to simplify the linguistic structure, and to create comparisons of different linguistic phenomena.
      –  They say: [caption corpora] contain sentences (as opposed to paragraphs) that describe the same picture or video and are thus near paraphrases, are lean on named entities and rich in generic terms.
      –  They call this process “normalization”; they provide us with 11 normalization rules.
      –  Examples next slide
  • 13. Processing of SICK: How did they construct their corpus? (their LREC2014 paper)
  • 14. Processing of SICK: How did they construct their corpus? More “normalization” rules
  • 15. Processing of SICK: The construction rules didn’t work too well…
      1. There are still non-sentences like “A couple standing on the curb” (rule #10).
      2. There are 41 possessive dependencies (genitives, as in “An animal is biting somebody's hand”) and 341 poss ones (for possessive pronouns), as in “A man is holding a mask in *his* raised hand” (rule #1).
      3. There are still a few NEs in the corpus, e.g. “Seadoo” (a kind of jet ski) and ATVs (all-terrain vehicles) (rule #2).
      4. There are a few sentences that are not in the present continuous, like “The speaker likes parrots” (rule #3).
      5. Rule #4 (“Replace complex verb constructions into simpler ones”) is difficult to judge. But examples similar to the one given (“A man is attempting to surf down a hill made of sand” → “A man is surfing down a hill made of sand”) occur in the corpus several times. We have 4 instances of “trying to catch” and 37 xcomp dependency relations altogether.
      6. Rule #5 (Simplify verb phrases with modals and auxiliaries) is hard to check, as modals are not marked in the dependencies, as far as I can see.
      7. Rule #6 (Replace phrasal verbs with a synonym if verb and preposition are not adjacent) only helps a little. There are 324 particles, most of them corresponding to particle verbs that change meaning; e.g. “A toddler is standing up” gets one concept for “standing” and one for “up”.
      8. Rule #7 (Remove multiword expressions): we have 1148 noun-noun dependency relations, and many are MWEs.
      9. Rule #8 (Remove dates and numbers) worked fairly well: we have only one temporal modifier (“a sunny day”), but 748 number dependencies and a need to tie these to real numbers.
      10. Rule #9 (Turn subordinates into coordinates) also seems to have worked OK.
      11. Rule #11 (Remove indirect interrogative and parenthetical phrases): there are at least 4 real appositives, e.g. “Four middle eastern children, three girls and one boy, are climbing on the grotto with a pink interior”.
  • 16. Results of SICK: How well did they do the job of simplifying the language?
      –  The curators of the corpus also made an effort to reduce the amount of “encyclopedic knowledge” about the world that is needed.
      –  They say: “To ensure the quality of the data set, all the sentences were checked for grammatical or lexical mistakes and disfluencies by a native English speaker.”
      –  Reasonably well, we expect?
      –  But the negations tell a different story. All (or almost all) of the expletives (539 of them) are negative, but negation is not marked. This wouldn’t be a problem if these sentences made sense. Many do not.
  • 17. Processing of SICK: Meaning-preserving Transformations (from LREC2014 paper again)
  • 18. Processing of SICK: Meaning-altering Transformations
  • 19. Results of SICK: Negation in SICK-PARSEY
      According to the stats we have 436 neg dependency relations. They really seem to correspond to negation using the adverb “not”. Clearly we are missing many negations (most if not all of the 539 expletives are negative), because the determiners ‘no’ are not marked as negative. This we can correct semi-automatically.
      But there are many atrocious sentences that do not make sense at all: “A person is ignoring the motocross bike that is lying on its side and there is no one is racing by” (maybe the third “is” is a typo?) or “There is no man wearing clothes that are covered with paint or is sitting outside in a busy area writing something”.
      How can we measure how many bad sentences there are? We decided to investigate a reduced sub-corpus.
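The semi-automatic correction mentioned above could start from something like the following sketch. It is an illustration under stated assumptions, not the code from the rte-sick repository: the function name `negation_cues`, the `(form, lemma, deprel)` triple layout and the two small cue lists are hypothetical, and the example tokens are hand-written rather than parser output.

```python
# Cue lists are assumptions for illustration; a real list would be longer.
NEGATIVE_DETERMINERS = {"no"}
NEGATIVE_PRONOUNS = {"nobody"}


def negation_cues(tokens):
    """Return 0-based positions of tokens that signal negation.

    tokens: list of (form, lemma, deprel) triples.
    """
    cues = []
    for i, (form, lemma, deprel) in enumerate(tokens):
        if deprel == "neg":
            # Already marked by the parser (the adverb "not").
            cues.append(i)
        elif deprel == "det" and lemma in NEGATIVE_DETERMINERS:
            # Determiner "no", which the parser leaves unmarked.
            cues.append(i)
        elif lemma in NEGATIVE_PRONOUNS:
            cues.append(i)
    return cues


# Hand-written tokens for the start of "There is no man wearing clothes ..."
sentence = [
    ("There", "there", "expl"),
    ("is", "be", "ROOT"),
    ("no", "no", "det"),
    ("man", "man", "nsubj"),
]
print(negation_cues(sentence))  # [2]
```

Sentences flagged this way would still need a manual pass, since many of the negated sentences in the corpus are themselves ill-formed.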
  • 20. 3-4-5: a hand-checked sub-corpus. 390 sentences (385 nsubj + 5 nsubjpass). Problems pale in comparison to WSD.
      –  Only one conjunction: “Paper and scissors both cut”.
      –  13 copulas (4 of them are wrong: “The man is training”, “The man is rock climbing”, “A woman is grating carrots”, “The woman is dicing garlic”).
      –  19 negations marked, but 26 neg expletives, 10 nobodies, 1 no person → should be 56 negations. Needs work, we know.
      –  Compounds? We have 20 nns; some are real compounds: ping pong, golden retriever, sumo wrestlers, tiger cub, sumo ringers, baby pandas, cartoon airplane.
      –  Some are processing mistakes, with the gerund taken for the noun, and there are lots of them (14 in 390): hamster singing, lion walking, panda climbing, etc.
      –  Also 9 cases of particle verbs (need to update PWN?)
  • 21. Princeton WN to SUMO mapping. Why SUMO? Language is important. SUMO is open source, upper ontology.
      –  SUMO is intended as a foundation ontology for a variety of information processing systems.
      –  First released in December 2000, it uses a version of the language SUO-KIF (first-order logic with bells & whistles) in a LISP-like syntax.
      –  A mapping from WordNet synsets to SUMO, which makes it special.
      –  Many domain ontologies to extend it.
      –  “the largest formal public ontology in existence today.”
      –  Mapping has many issues.
      –  QAD (quick and dirty, Tetrault&Lee)
      –  Rewriting of CoNLL is necessary for most sentences, future work…
  • 22. Princeton WN-SUMO mapping. Why SUMO? SUMO’s mappings: the theory
      text = People are walking
      1  People   people  NOUN  NNS  _  3  nsubj  _  NNS|07942152-n|GroupOfPeople=
      2  are      be      VERB  VBP  _  3  aux    _  VBP|02604760-v|Entity+
      3  walking  walk    VERB  VBG  _  0  ROOT   _  VBG|01904930-v|Walking=
      Kinds of mapping:
      1. Equality: the synset corresponds exactly to the concept (as in Walking= above).
      2. Containment: Concept+, the synset is a subconcept of the SUMO concept.
      3. Negation: Concept[ (the negation of the concept), e.g. silent = not communicating.
      4. A-box: shouldn’t be happening here, but it does. OK for Asian, German?
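As a rough illustration of how the last column of these CoNLL lines can be read, the sketch below splits a field such as `VBG|01904930-v|Walking=` into its tag, synset and SUMO concept, and names the mapping kind from the trailing marker. The function name `parse_mapping` and the returned dictionary layout are hypothetical, and reading `@` as the A-box marker is an inference from the `LandArea@` example in the transcript, not something the slide states.

```python
# Trailing marker -> kind of PWN-to-SUMO mapping.
# '@' as A-box is an assumption based on the LandArea@ example.
KINDS = {"=": "equality", "+": "containment", "[": "negation", "@": "a-box"}


def parse_mapping(field):
    """Split 'TAG|SYNSET|Concept<marker>' into its parts."""
    tag, synset, concept = field.split("|")
    if concept == "?":
        # Tokens without a sense, e.g. determiners: 'DT|?|?'.
        return {"tag": tag, "synset": None, "concept": None, "kind": "unmapped"}
    marker = concept[-1]
    return {
        "tag": tag,
        "synset": synset,
        "concept": concept[:-1],
        "kind": KINDS.get(marker, "unknown"),
    }


print(parse_mapping("VBG|01904930-v|Walking="))
print(parse_mapping("VBP|02604760-v|Entity+"))
print(parse_mapping("DT|?|?"))
```

Keeping the marker as a separate `kind` field makes it easy to filter out unmapped tokens or to treat containment mappings differently from exact ones.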
  • 23. Examples: Containment
      text = Men are sawing
      1  Men     man  NOUN  NNS  _  3  nsubj  _  NNS|02472293-n|Hominid=
      2  are     be   VERB  VBP  _  3  aux    _  VBP|02604760-v|Entity+
      3  sawing  saw  VERB  VBG  _  0  ROOT   _  VBG|01559590-v|Cutting+
      “saw” is a very specific verb and has only one synset in PWN: 01559590-v saw | (cut with a saw; "saw wood for the fireplace"). But SUMO has no exact match for this concept, so it maps to Cutting+, a kind of Cutting, the one done with a “saw”.
      Much worse is the situation with “Man”, which occurs in 33 PWN synsets: the WSD algorithm decided for Hominid=, but Man= (equivalent mapping) would be much better.
  • 24. Examples: Negation
      text = Some people are silent
      1  Some    some    DET   DT   _  2  det    _  DT|?|?
      2  people  people  NOUN  NNS  _  4  nsubj  _  NNS|07942152-n|GroupOfPeople=
      3  are     be      VERB  VBP  _  4  cop    _  VBP|02604760-v|Entity+
      4  silent  silent  ADJ   JJ   _  0  ROOT   _  JJ|00942163-a|LinguisticCommunication[
      “silent” is an adjective in 6 synsets, mapped to LinguisticCommunication[, which is to be read as non-LinguisticCommunication. This seems the wrong WSD, but the correct one would give us Communication[ (negated subsuming mapping).
  • 25. Examples: A-box
      1   An          a           DET   DT   _  3   det    _  DT|?|?
      2   American    american    ADJ   JJ   _  3   amod   _  JJ|02927303-a|LandArea@
      3   footballer  footballer  NOUN  NN   _  5   nsubj  _  NN|10101634-n|ProfessionalAthlete+
      4   is          be          VERB  VBZ  _  5   aux    _  VBZ|02604760-v|Entity+
      5   wearing     wear        VERB  VBG  _  0   ROOT   _  VBG|00075021-v|RecreationOrExercise+
      6   the         the         DET   DT   _  10  det    _  DT|?|?
      7   red         red         ADJ   JJ   _  10  amod   _  JJ|00381097-a|Red=
      8   and         and         CONJ  CC   _  7   cc     _  CC|?|?
      9   white       white       ADJ   JJ   _  7   conj   _  JJ|00393105-a|White=
      10  strip       strip       NOUN  NN   _  5   dobj   _  NN|09449510-n|SelfConnectedObject+
      The mapping of ‘American’ to a concept related to the continent America (LandArea@) is a reasonable hook for future work.
  • 26. Analysis: The numbers
      Total SICK mappings:
      38671  ?   (most need to be discarded)
      29606  +   (most need to be made specific)
      16477  =
        171  @
         67  [
      Need to deal with negated and a-box concepts.
      Need more bodily processes (sleeping...).
      Need more baby-zoology (tiger, kitten are not the same).
      But the killer is WSD.
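The proportions behind these numbers are easy to check. The short sketch below assumes that each count on the slide pairs with the marker that follows it and that the five kinds together cover all mappings; both are readings of the flattened slide text rather than facts stated on it, and the name `mapping_shares` is hypothetical.

```python
# Counts as read from the slide; pairing each count with the marker
# that follows it is an assumption about the flattened layout.
COUNTS = {"?": 38671, "+": 29606, "=": 16477, "@": 171, "[": 67}


def mapping_shares(counts):
    """Return the total and each marker's share of that total."""
    total = sum(counts.values())
    shares = {marker: n / total for marker, n in counts.items()}
    return total, shares


total, shares = mapping_shares(COUNTS)
print(f"total mappings: {total}")
for marker, share in shares.items():
    print(f"{marker}  {COUNTS[marker]:6d}  {share:6.1%}")
```

Under these assumptions, a little under half of the mappings are unmapped `?` entries and fewer than a fifth are exact (`=`) matches, which fits the slide's remark that most mappings need to be discarded or made more specific.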
  • 27. Conclusion
      QAD with FreeLing WSD does NOT work.
      Could repeat Petrov’s words.
      Could make a more careful corpus, checking its processing.
      Could rethink WSD. Have done so and included all the PWN senses, for the time being. It’s less wrong!!
      Considering how to make sure that ontologies pay for their upkeep, without throwing away our Neural Networks gains.
  • 28. Future Work: Enhancing dependencies with friends
      We want to use what we learnt during the summer from the presentations of Schuster, Lewis, Reddy and Stanovsky. They all had rules that applied to dependencies to make them more ‘semantic’.
      We also have the AKR rules and the matcher rules to use and improve.
      At the moment, working on Stanovsky’s (and Fauke’s) rules, which work for EN and DE, with Prateek Jain.
      Also working on the Entailment pairs of SICK (note that simply the concepts can give useful info).
  • 29. CODA: We are the UD-Portuguese now!
      There were two totally automatic treebanks in Portuguese in UD. Zeman’s version started from Linguateca’s Bosque and did conversions. Google’s UD-BR used web data and no manual input.
      We started re-annotating Bosque in Oct 2016. We’re now the official UD-PT, as we automatically converted and manually checked (or are checking) Bosque. Paper submitted.
  • 30. Thank you