Towards Universal Semantic Understanding of Natural Languages

  1. Towards Universal Semantic Understanding of Natural Languages Yunyao Li (@yunyao_li) Senior Research Manager Scalable Knowledge Intelligence IBM Research - Almaden
  2. How many languages are there in the world? 2
  3. 3 7,102 known languages The 23 most-spoken languages cover 4.1+ billion people Source: https://www.iflscience.com/environment/worlds-most-spoken-languages-and-where-they-are-spoken/
  4. Conventional Approach towards Language Enablement 4 English Text English NLU English Applications German Text German NLU German Applications Chinese Text Chinese NLU Chinese Applications Separate NLU pipeline for each language Separate application for each language
  5. Universal Semantic Understanding of Natural Languages 5 English Text German Text Universal NLU Cross-lingual Applications Chinese Text Single NLU pipeline for different languages Develop once for different languages
  6. The Challenges 6 Models – Built for one task at a time Training Data – High-quality labeled data is required but hard to obtain Meaning Representation – Different meaning representations for different languages – Different mention representations for the same language Our Research: Auto-Generation + Expert/Crowd Curation Unified Meaning Representation High-Quality Parser + Programmable Abstraction + Human-Machine Co-creation
  7. Semantic Role Labeling (SRL): Who did what to whom, when, where and how? John hastily ordered a dozen dandelions for Mary from Amazon’s Flower Shop. order.02 (request to be delivered) A0: Orderer (WHO) A1: Thing ordered (DID WHAT) A2: Benefactive, ordered-for (FOR WHOM) A3: Source (WHERE) AM-MNR: Manner (HOW)
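The SRL analysis above can be sketched as a small data structure; this is a hypothetical representation (the `answer` helper is illustrative, not part of any system described here), with role fillers taken from the slide's labels.

```python
# Hypothetical SRL output for:
# "John hastily ordered a dozen dandelions for Mary from Amazon's Flower Shop."
srl_parse = {
    "predicate": "ordered",
    "frame": "order.02",                  # request to be delivered
    "arguments": {
        "A0": "John",                     # orderer        (WHO)
        "A1": "a dozen dandelions",       # thing ordered  (DID WHAT)
        "A2": "Mary",                     # benefactive    (FOR WHOM)
        "A3": "Amazon's Flower Shop",     # source         (WHERE)
        "AM-MNR": "hastily",              # manner         (HOW)
    },
}

def answer(role):
    """Answer a 'who did what to whom' style question by role label."""
    return srl_parse["arguments"].get(role)
```

For example, `answer("A0")` returns the WHO filler and `answer("AM-MNR")` the HOW filler.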
  8. Syntax vs. Semantic Parsing: What types of labels are valid across languages? Dirk broke the window with a hammer. [A0] Break.01 [A1] [A2] The window was broken by Dirk. [A1] Break.01 [A0] The window broke. [A1] Break.01 Break.01: A0 – Breaker, A1 – Thing broken, A2 – Instrument, A3 – Pieces Break.15: A0 – Journalist, exposer; A1 – Story, thing exposed • Lexical, morphological and syntactic labels differ greatly • Shallow semantic labels remain stable
  9. SRL Resources Other languages • Chinese Proposition Bank • Hindi Proposition Bank • German FrameNet • French? Spanish? Russian? Arabic? … English • FrameNet • PropBank 1. Limited coverage 2. Language-specific formalisms 订购 A0: buyer A1: commodity A2: seller order.02 A0: orderer A1: thing ordered A2: benefactive, ordered-for A3: source We want different languages to share the same semantic labels
  10. Shared Frames Across Languages Multilingual input text: WhatsApp was bought by Facebook [A1] Buy.01 [A0] Facebook hat WhatsApp gekauft [A0] [A1] Buy.01 Facebook a acheté WhatsApp [A0] Buy.01 [A1] Cross-lingual representation: buy.01, Buyer = Facebook, Thing bought = WhatsApp
  11. The Challenges 11 Models – Built for one task at a time Training Data – High-quality labeled data is required but hard to obtain Meaning Representation – Different meaning representations for different languages – Different mention representations for the same language Our Research: Auto-Generation + Expert Frame Curation + Crowdsourcing Unified Meaning Representation High-Quality Parser + Programmable Abstraction + Human-Machine Co-creation
  12. Universal Proposition Banks Generate SRL resources for many other languages • Shared frame set • Minimal effort Corpus of annotated text data: Il faut qu‘il y ait des responsables [A0] Need.01 Je suis responsable pour le chaos [A1] Be.01 [A2] [AM-PRD] Les services postaux ont acheté des … [A0] Buy.01 [A1] [A2] Frame set: Buy.01 A0 – Buyer A1 – Thing bought A2 – Seller A3 – Price paid A4 – Benefactive Pay.01 A0 – Payer A1 – Money A2 – Person being paid A3 – Commodity
  13. Current Practices 13 Annotator training: months Annotation: years Repeat for each language!
  14. Auto-Generation of Universal Proposition Banks 14 Our idea: annotation projection with parallel corpora. Example: TV subtitles English subtitles: I [BUYER] would buy that [ITEM] for a dollar [PRICE]! German subtitles (via projection): Das [ITEM] würde ich [BUYER] für einen Dollar [PRICE] kaufen Training data: • Semantically annotated • Multilingual • Large amount Resource: https://www.youtube.com/watch?v=u5HOt0ZOcYk
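The projection step can be sketched as copying role labels along word alignments; this is a toy illustration of the idea (the token indices and alignment pairs below are hypothetical, hand-built for this subtitle pair, not output of a real aligner).

```python
# English and German sides of the parallel subtitle pair.
en_tokens = ["I", "would", "buy", "that", "for", "a", "dollar"]
de_tokens = ["Das", "würde", "ich", "für", "einen", "Dollar", "kaufen"]

# Role labels on English tokens (token index -> label); hypothetical spans.
en_labels = {0: "BUYER", 3: "ITEM", 6: "PRICE"}

# Word alignment: English index -> German index (hypothetical, hand-built).
alignment = {0: 2, 1: 1, 2: 6, 3: 0, 4: 3, 5: 4, 6: 5}

def project(source_labels, alignment):
    """Project each labeled source token onto its aligned target token."""
    return {alignment[i]: label
            for i, label in source_labels.items() if i in alignment}

de_labels = project(en_labels, alignment)
# "ich" receives BUYER, "Das" receives ITEM, "Dollar" receives PRICE.
```

Real projection works on argument spans and predicates rather than single tokens, but the label-copying core is the same.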
  15. We need to hold people responsible Il faut qu‘ il y ait des responsables English sentence: Target sentence: Hold.01A0 A1 A3Need.01 Hold.01 Incorrect projection! There need to be those responsible A1 Main error sources: • Translation shift • Source-language SRL errors However: Projections Not Always Possible
  16. Filtered Projection & Bootstrapping Two-step process – Filters detect translation shift and block projections (more precision at the cost of recall) – Bootstrap learning to increase recall – Generated 7 universal proposition banks from 3 language groups • Version 1.0: https://github.com/System-T/UniversalPropositions/ • Version 2.0 coming soon [ACL’15] Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling
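A projection filter of this kind can be sketched as a simple precision-oriented gate; the criteria below (matching predicate part-of-speech, a minimum alignment confidence) are illustrative stand-ins, not the exact filters from the paper.

```python
def should_project(src_pred_pos, tgt_pred_pos, alignment_conf, min_conf=0.7):
    """Toy projection filter: block the projection when the aligned
    predicate is not verbal on both sides (a proxy for translation
    shift, e.g. a verb rendered as a noun) or when the word alignment
    is low-confidence. Trades recall for precision; bootstrapping then
    recovers recall from the high-precision seed data."""
    if src_pred_pos != "VERB" or tgt_pred_pos != "VERB":
        return False  # likely translation shift: do not project
    return alignment_conf >= min_conf
```

For instance, "hold responsible" translated as the French nominal "des responsables" would be blocked by the part-of-speech check.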
  17. Multilingual Aliasing • Problem: Target language frame lexicon automatically generated from alignments – False frames – Redundant frames • Expert curation of frame mappings [COLING’16] Multilingual Aliasing for Auto-Generating Proposition Banks
  18. Low-Resource Languages Apply the approach to low-resource languages: Bengali, Malayalam, Tamil – Fewer sources of parallel data – Almost no NLP: no syntactic parsing, lemmatization, etc. Crowdsourcing for data curation [EMNLP’16] Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages
  19. Crowd-in-the-Loop Curation Annotation tasks (all): raw text corpus → predicted annotations corpus → Task Router → curated annotations corpus Easy tasks are curated by the crowd Difficult tasks are curated by experts [EMNLP’17] CROWD-IN-THE-LOOP: A Hybrid Approach for Annotating Semantic Roles
  20. Task Router Classifier
  21. Effectiveness of Crowd-in-the-Loop ↑9pp F1 improvement over SRL results, ↓66.4pp expert effort ↑10pp F1 improvement over SRL results, ↓87.3pp expert effort (latest results, in submission)
  22. The Challenges 22 Models – Built for one task at a time Training Data – High-quality labeled data is required but hard to obtain Meaning Representation – Different meaning representations for different languages – Different mention representations for the same language Our Research: Auto-Generation + Expert Frame Curation + Crowdsourcing Unified Meaning Representation High-Quality Parser + Programmable Abstraction + Human-Machine Co-creation
  23. What Makes SRL So Difficult? Heavy-tailed distribution of class labels – Common frames • say.01 (8243), have.01 (2040), sell.01 (1009) – Many uncommon frames • swindle.01, feed.01, hum.01, toast.01 – Almost half of all frames seen fewer than 3 times in training data Many low-frequency exceptions – Difficult to capture in a model [Chart: distribution of frame labels]
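The heavy-tail observation is easy to reproduce on toy counts; the frequencies for the rare frames below are illustrative placeholders (only the say.01/have.01/sell.01 counts come from the slide).

```python
from collections import Counter

# Toy frame-label counts: a few frames dominate, most are rare.
# Head counts are from the slide; tail counts are illustrative.
frame_counts = Counter({
    "say.01": 8243, "have.01": 2040, "sell.01": 1009,
    "swindle.01": 1, "feed.01": 2, "hum.01": 1, "toast.01": 2,
})

# Fraction of frame types seen fewer than 3 times in training data.
rare = [frame for frame, count in frame_counts.items() if count < 3]
rare_fraction = len(rare) / len(frame_counts)
```

Even in this tiny sample, over half the frame *types* fall under the 3-occurrence threshold while contributing almost none of the *tokens*, which is exactly what makes the tail hard for a global classifier.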
  24. Low-Frequency Exceptions Strong correlation between the syntactic function of an argument and its role Example: passive subject The window was broken by Dirk SBJ PMOD VC NMOD → A1 The silver was sold by the man. SBJ PMOD VC NMOD → A1 Creditors were told to hold off. SBJ OPRD VC IM PRT TELL.01 A0: speaker (agent) A1: utterance (topic) A2: hearer (recipient)
  25. Local Bias 86% of passive subjects are labeled A1 (over 4,000 instances in training data) 87% of passive subjects of Tell.01 are labeled A2 (53 instances in training data) Most classifiers – Bag-of-features – Learn weights mapping features to classes – Perform generalization Question: How do we explicitly capture low-frequency exceptions?
  26. Instance-based Learning kNN: k-Nearest Neighbors classification – Find the k most similar instances in the training data – Derive the class label from the nearest neighbors Example: Creditors were told to hold off. (passive subject of TELL.01) Composite features, ordered by distance: 1. “creditor” as passive subject of TELL.01 2. noun as passive subject of TELL.01 … n. any passive subject of any agentive verb Main idea: Back off to the most specific composite feature seen at least k times [COLING 2016] K-SRL: Instance-based Learning for Semantic Role Labeling
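The back-off idea can be sketched as follows; the composite features and counts below are toy stand-ins (K-SRL's actual features combine lemma, syntactic function, voice and predicate, and its distance metric is richer than this ordered list).

```python
from collections import Counter

# Toy training index: composite feature -> observed role labels.
training_index = {
    ("creditor", "passive-subj", "tell.01"): ["A2"],
    ("noun", "passive-subj", "tell.01"): ["A2", "A2", "A2", "A1"],
    ("noun", "passive-subj", "*"): ["A1"] * 86 + ["A0"] * 14,
}

def classify(features, k=3):
    """Back off from the most specific composite feature to more
    general ones until at least k training instances are found, then
    return the majority label among those neighbors."""
    for feature in features:  # ordered most specific -> most general
        labels = training_index.get(feature, [])
        if len(labels) >= k:
            return Counter(labels).most_common(1)[0][0]
    return None

# "Creditors were told to hold off." -> passive subject of tell.01
backoff_chain = [
    ("creditor", "passive-subj", "tell.01"),  # seen once: too few
    ("noun", "passive-subj", "tell.01"),      # seen 4 times: enough
    ("noun", "passive-subj", "*"),            # global fallback
]
```

Here `classify(backoff_chain)` stops at the second feature and returns A2, the tell.01-specific exception, instead of the globally dominant A1 that a bag-of-features classifier would prefer.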
  27. Results • Significantly outperforms previous approaches, especially on out-of-domain data • Small neighborhoods suffice (k=3) ↑0.6pp F1 in-domain, ↑2.3pp F1 out-of-domain Latest results (improvement over the state of the art; in submission, DL + IL) [COLING 2016] K-SRL: Instance-based Learning for Semantic Role Labeling
  28. The Challenges 28 Models – Built for one task at a time Training Data – High-quality labeled data is required but hard to obtain Meaning Representation – Different meaning representations for different languages – Different mention representations for the same language Our Research: Auto-Generation + Expert Frame Curation + Crowdsourcing Unified Meaning Representation High-Quality Parser + Programmable Abstraction + Human-Machine Co-creation
  29. Crosslingual Information Extraction Task: Extract who bought what Multilingual input text: WhatsApp was bought by Facebook [A1] Buy.01 [A0] Facebook hat WhatsApp gekauft [A0] [A1] Buy.01 Facebook a acheté WhatsApp [A0] Buy.01 [A1] Cross-lingual representation: buy.01, Buyer = Facebook, Thing bought = WhatsApp Crosslingual extraction: Sentence | Verb | Buyer | Thing bought 1 | buy.01 | Facebook | WhatsApp 2 | buy.01 | Facebook | WhatsApp 3 | buy.01 | Facebook | WhatsApp [NAACL’18] SystemT: Declarative Text Understanding for Enterprise [ACL’16] POLYGLOT: Multilingual Semantic Role Labeling with Unified Labels [COLING’16] Multilingual Information Extraction with PolyglotIE https://vimeo.com/180382223
  30. Transparent Linguistic Models for Contract Understanding 30 [NAACL-NLLP’19] Transparent Linguistic Models for Contract Understanding and Comparison https://www.ibm.com/cloud/compare-and-comply
  31. Transparent Model Design Purchaser will purchase the Assets by a cash payment. Element Obligation for Purchaser [NAACL-NLLP’19] Transparent Linguistic Models for Contract Understanding and Comparison https://www.ibm.com/cloud/compare-and-comply
  32. Transparent Model Design Purchaser will purchase the Assets by a cash payment. Element [Purchaser]A0 [will]TENSE-FUTURE purchase [the Assets]A1 [by a cash payment]ARGM-MNR Core NLP Understanding Core NLP Primitives & Operators Provided by SystemT [ACL '10, NAACL ‘18] Semantic NLP Primitives [NAACL-NLLP’19] Transparent Linguistic Models for Contract Understanding and Comparison https://www.ibm.com/cloud/compare-and-comply
  33. Transparent Model Design Purchaser will purchase the Assets by a cash payment. Element Legal Domain LLEs [Purchaser]ARG0 [will]TENSE-FUTURE purchase [the Assets]ARG1 [by a cash payment]ARGM-MNR LLE1: PREDICATE ∈ DICT Business-Transaction ∧ TENSE = Future ∧ POLARITY = Positive → NATURE = Obligation ∧ PARTY = ARG0 LLE2: …........ Domain Specific Concepts Business transact. verbs in future tense with positive polarity Core NLP Primitives & Operators Semantic NLP Primitives [NAACL-NLLP’19] Transparent Linguistic Models for Contract Understanding and Comparison https://www.ibm.com/cloud/compare-and-comply
  34. Transparent Model Design Purchaser will purchase the Assets by a cash payment. Element Model Output [Purchaser]ARG0 [will]TENSE-FUTURE purchase [the Assets]ARG1 [by a cash payment]ARGM-MNR Obligation for Purchaser Nature/Party: Domain Specific Concepts Core NLP Primitives & Operators LLE1: PREDICATE ∈ DICT Business-Transaction ∧ TENSE = Future ∧ POLARITY = Positive → NATURE = Obligation ∧ PARTY = ARG0 LLE2: …........ Legal Domain LLEsSemantic NLP Primitives [NAACL-NLLP’19] Transparent Linguistic Models for Contract Understanding and Comparison https://www.ibm.com/cloud/compare-and-comply
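The LLE on these slides can be read as an executable predicate over the semantic parse; the sketch below is a hypothetical rendering in code (the `BUSINESS_TRANSACTION_VERBS` dictionary and the clause representation are illustrative, not the SystemT/Compare & Comply implementation).

```python
# Toy dictionary standing in for DICT Business-Transaction.
BUSINESS_TRANSACTION_VERBS = {"purchase", "buy", "sell", "pay"}

def apply_lle1(clause):
    """Evaluate LLE1: PREDICATE in Business-Transaction dictionary,
    TENSE = future, POLARITY = positive
    -> NATURE = Obligation, PARTY = ARG0."""
    if (clause["predicate"] in BUSINESS_TRANSACTION_VERBS
            and clause["tense"] == "future"
            and clause["polarity"] == "positive"):
        return {"nature": "Obligation", "party": clause["ARG0"]}
    return None

# SRL output for: "Purchaser will purchase the Assets by a cash payment."
clause = {"predicate": "purchase", "tense": "future",
          "polarity": "positive", "ARG0": "Purchaser"}
```

Applied to the example clause, the rule fires and yields an Obligation for Purchaser, matching the model output on the slide; a non-transactional predicate leaves the rule silent.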
  35. Human & Machine Co-Creation Labeled Data → Deep Learning → Learned Rules (Explainable) → Modify Rules → Evaluation Results → Production Machine performs the heavy lifting to abstract out patterns Humans verify/modify the transparent model Evaluation & Deployment Raises the abstraction level at which domain experts interact with the model
  36. HEIDL Demo Label being assigned Various ways of selecting/ranking rules Center panel lists all rules Rule-specific performance metrics [ACL’19] HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop
  37. HEIDL Demo Examples available at the click of a button [ACL’19] HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop
  38. Center panel lists all rules HEIDL Demo Playground mode allows adding and dropping of predicates from a rule [ACL’19] HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop
  39. User Study: Human+Machine Co-Created Model Performance User study – 4 NLP engineers with 1-2 years of experience – 2 NLP experts with 10+ years of experience Key takeaways – Explanation of learned rules: the visualization tool is very effective – Reduction in human labor: a co-created model built within 1.5 person-hours outperforms a black-box sentence classifier – Reduced requirement on human expertise: the co-created model is on par with the super-expert’s model [Chart: F-measure of RuleNN+Human vs. BiLSTM for users Ua–Ud] [ACL’19] HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop
  40. Conclusion Research prototype → Early adoption (EN) → Cross-lingual adaptation • Compliance: Watson Compare & Comply • Email: Watson Workspace • Healthcare: Watson Drug Discovery • Material Science: Advanced Material Discovery • … • 10+ languages • State-of-the-art models • Papers: 10+ publications • Patents: 6 patents filed • Data: ibm.biz/LanguageData • Code: Chinese SOUNDEX https://pypi.org/project/chinesesoundex-1.0/ • ongoing
  41. Thank You! 41 To learn more: • Role of AI in Enterprise Applications (ibm.biz/RoleOfAI) Research projects: • ibm.biz/ScalableKnowledgeIntelligence • ibm.biz/SystemT Data sets: • ibm.biz/LanguageData Follow me: • LinkedIn: https://www.linkedin.com/in/yunyao-li/ • Twitter: @yunyao_li By now, you should be able to: – Identify challenges towards universal semantic understanding of natural languages – Understand the current state of the art in addressing these challenges – Define general use cases for universal semantic understanding of natural languages