Documenting and modeling inflectional paradigms in under-resourced languages

  1. Documenting and Modeling Inflectional Paradigms in Under-resourced Languages. By Ekaterina Vylomova (evylomova@gmail.com). Cardamom Seminar Series, Feb 1, 2022
  2. NLP: Universal or Euroversal?
  3. Multilinguality
  4. Approx. 7,000* languages in the world… * Language vs. dialect distinction: “A language is a dialect with its own army and navy” (Max Weinreich)
  5. Roman Jakobson on differences between Languages ``Languages differ essentially in what they must convey and not in what they may convey''
  6. Chinese (isolating; strict word order): wǒmen xué le zhè xiē shēngcí. I.PL.AN learn PAST this PL new.word ``We learned these new words.''
  7. Russian (synthetic; flexible word order): My vyučili eti novyje slova. We learn.PAST.PL this.ACC.PL new.ACC.PL word.ACC.PL ``We learned these new words.''
  8. Nannu-n-niuti-kkuminar-tu-rujussu-u-vuq. Polar.bear-catch-instrument.for.achieving-something.good.for-PART- big-be-3SG.INDIC ``It (a dog) is good for catching polar bears with.'' Speedtalk in “Gulf” by Robert Heinlein West Greenlandic (Polysynthetic; Fortescue (2017))
  9. Aban-yawoith-warrgah-marne-ganj-ginje-ng. 1/3PL-again-wrong-BEN-meat-cook-PP ``I cooked the wrong meat for them again'' Kunwinjku (Polysynthetic; Evans (2003))
  10. Approx. 7,000* languages in the world… but the majority of NLP technologies focus only on the best-documented languages, such as English, French, German, Russian, Hindi, or Finnish. * Language vs. dialect distinction: “A language is a dialect with its own army and navy” (Max Weinreich)
  11. Most NLP is Standard Average European (SAE; the term was introduced by Whorf, 1939). According to Martin Haspelmath (2001), “euroversals” include: ● definite and indefinite articles (e.g. English the vs. a); ● a periphrastic perfect formed with 'have' plus a passive participle (e.g. English I have said); ● the verb is inflected for person and number of the subject, but subject pronouns may not be dropped even when this would be unambiguous. ● Some features that are common in European languages but also found elsewhere: ○ lack of distinction between inclusive and exclusive first-person plural pronouns ("we and you" vs. "we and not you"); ○ lack of distinction between alienable and inalienable (e.g. body part) possession.
  12. Most NLP is Standard Average European. Morphology: Position of Case Affixes (WALS 51A). Suffixing morphology: Eurasian and Australian languages. Prefixation: Mesoamerican languages and African languages spoken south of the Sahara.
  13. Most NLP is Standard Average European. Morphology: Inclusive/Exclusive Distinction in Independent Pronouns (WALS 39A). No incl./excl. distinction: Eurasian. Incl./excl. distinction: Australian and South American languages and;
  14. Most NLP is Standard Average European. Morphology: Possessive Classification (WALS 59A). No possessive classification: Eurasian. Possessive classification: Australian, African, American.
  15. Most NLP is Standard Average European. Morphology: The Past Tense (WALS 66A). Past tense, 1: Eurasian. No past tense: South-East Asia, African, American.
  16. UniMorph: Universal Morphosyntactic annotation
  17. Wiktionary: > 8000 language codes (incl. historic) https://en.wiktionary.org/wiki/Wiktionary:List_of_languages
  18. Wiktionary: language-specific. беглец (“male escapee/runaway”) + pos=N, case=ACC, number=SG → беглеца. Functionally similar feature names (cases, numbers, etc.) differ across languages, and they depend heavily on the descriptive tradition used to document or describe a language!
  19. UniMorph: Universal Morphology 1) 23 dimensions of meaning (TAM, case, number, animacy), 212 features 2) A-morphous morphology (Anderson, 1992) 3) Paradigms extracted from English Wiktionary (Kirov et al., 2016) https://unimorph.github.io/doc/unimorph-schema.pdf By John Sylak-Glassman, 2015
  20. UniMorph: Universal Morphology https://unimorph.github.io/doc/unimorph-schema.pdf A Russian noun declension: Lemma Form UM Tags/Features
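
A minimal sketch of how such lemma / form / feature triples could be read and queried, assuming a tab-separated layout with semicolon-joined features. The helper names (`Entry`, `parse_unimorph_line`, `forms_for`) and the two sample rows are illustrative assumptions; the tags are built from the Wiktionary example above (pos=N, case=ACC, number=SG), and the actual UniMorph data for Russian may carry additional features.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    lemma: str
    form: str
    features: frozenset


def parse_unimorph_line(line: str) -> Entry:
    """Split one 'lemma<TAB>form<TAB>TAG;TAG;...' row into its parts."""
    lemma, form, tags = line.rstrip("\n").split("\t")
    return Entry(lemma, form, frozenset(tags.split(";")))


def forms_for(entries, lemma, wanted):
    """Return every form of `lemma` whose features include all of `wanted`."""
    return [e.form for e in entries
            if e.lemma == lemma and set(wanted) <= e.features]


# Illustrative rows only (assumed, not copied from a UniMorph release).
SAMPLE = "беглец\tбеглец\tN;NOM;SG\nбеглец\tбеглеца\tN;ACC;SG\n"
entries = [parse_unimorph_line(row) for row in SAMPLE.splitlines() if row.strip()]

print(forms_for(entries, "беглец", {"ACC", "SG"}))
```

Storing the features as a set rather than a string makes the lookup independent of the order in which tags are written.
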
  21. From Language-specific to Universal features Descriptive categories (specific to languages) vs. comparative concepts (Haspelmath, 2010)
  22. Gender / noun class: ● Masculine ● Feminine ● Neuter. Bantu languages: approx. 23 noun classes. Nakh-Daghestanian languages: 2–8 classes. Corbett, Greville G. 1991. Gender. Cambridge, UK: Cambridge University Press.
  24. Case Systems: Core
  25. Case Systems: Non-Core. Semblative (like something, e.g. English “human-like”; Evenki) → COMPV; Abessive (English “-less”, “without”; Uralic) → PRIV; Causal case (“because of this…”; Moksha) → LGSPEC; Prepositional case (location, “about *”; Slavic) → ESS XXXX
  26. Case Systems: Local. Uralic: Inessive (“in”) → IN+ESS; Elative (“out of”) → IN+ABL; Illative (“into”) → IN+ALL; Allative (“onto”) → AT+ALL; Essive → FRML. Legend: Pl – Place; Dst – Distal; Mot – Motion; Asp – Aspect
  27. Case Systems: challenges in UniMorph. Combinations of cases (and other features): Evenki: Accusative reflexive-genitive (ACC+PSSRS); Accusative Definite (ACC+DEF)/Indefinite (ACC+INDF). Case compounding: Uralic (e.g. Beserman Udmurt; from Maria Usacheva). Case stacking: Kayardild (up to 6 cases!); also in Chukchi, Sumerian, Basque, and others.
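
The mapping from language-specific case names to UniMorph labels on the two preceding case slides can be sketched as a lookup table. This is only an illustration: the dictionary holds exactly the pairs listed on the slides, while the function name `to_unimorph` and the choice to fail on unknown case names are assumptions, not part of the UniMorph tooling.

```python
# Pairs taken from the "Non-Core" and "Local" case slides.
CASE_TO_UNIMORPH = {
    "semblative": "COMPV",
    "abessive": "PRIV",
    "causal": "LGSPEC",
    "prepositional": "ESS",
    "inessive": "IN+ESS",
    "elative": "IN+ABL",
    "illative": "IN+ALL",
    "allative": "AT+ALL",
    "essive": "FRML",
}


def to_unimorph(pos: str, case: str, number: str) -> str:
    """Build a semicolon-joined tag string from a descriptive case name.

    Raises KeyError for a case name that has no mapping yet, so that gaps
    in the conversion are noticed rather than silently dropped.
    """
    return ";".join([pos, CASE_TO_UNIMORPH[case.lower()], number])


print(to_unimorph("N", "Inessive", "SG"))
print(to_unimorph("N", "Abessive", "PL"))
```

Failing loudly on an unmapped name is one possible design; a converter could instead fall back to a language-specific label, as the slide does for the Moksha causal case.
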
  28. Scripts and Writing Systems
  29. A language may use different scripts over time … and it’s often a political question! 1. E.g., USSR invented many writing systems for various indigenous peoples that didn’t have their own system 2. Still, some republics have ongoing debates on the system they should use (e.g., Latin- vs. Cyrillic-based)
  31. A language may use different scripts over time
      Language | Arabic-based | Latin-based | Cyrillic-based | Other
      Bashkir | < end of 1920s | 1930s | 1940– | -
      Kazakh | < end of 1920s | 1930s | 1940– | -
      Tatar | < end of 1920s | 1930s; 2000s | 1940– | -
      Uzbek | < end of 1920s | 1930s; 1990– | 1940– | -
      Uyghur | < end of 1920s; but still used in China | 1930–1940s | 1946– | -
      Buryat | - | 1930s | end of 1930s– | <1920s: Mongolian
      Chukchi | - | 1930s | 1937– | Tenevil
      Udmurt | - | 1930s | end of 18th cent.– | -
      Evenki | - | 1930s | 1937– | Mongolian (China)
      All existing scripts and data may be considered, but it is better to check with native speakers which one they would prefer!
  32. SIGMORPHON Shared Task on Morphological Reinflection
  34. SIGMORPHON Shared Task on Morphological (Re-)Inflection (Cotterell et al., 2016–2018; McCarthy et al., 2019; Vylomova et al., 2020; Pimentel, Ryskina et al., 2021)
      Lemma   Tag            Form
      RUN     V;PAST         ran
      RUN     V;PRES;1;SG    run
      RUN     V;PRES;2;SG    run
      RUN     V;PRES;3;SG    runs
      RUN     V;PRES;PL      run
      RUN     V;PART         running
      Inflection: RUN + V;PST → ran. Reinflection: running + V;PST → ran.
      Approx. 96% average accuracy on high-resource languages, and significantly less on under-resourced languages. Winning systems are neural seq2seq models. See more details in my SIGTYP talk.
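
To make the two task formats and the accuracy figure concrete, here is a small sketch of how items might be represented and scored by exact match. The tuple layout, the function name `accuracy`, and the `predicted` list (including its one deliberate error) are invented for illustration and do not come from any shared-task system; the gold forms and tags are those of the RUN table.

```python
# Inflection item: (lemma, target tags, gold form).
inflection_items = [
    ("run", "V;PAST", "ran"),
    ("run", "V;PRES;3;SG", "runs"),
    ("run", "V;PART", "running"),
]

# Reinflection item: (source form, target tags, gold form); the lemma is not given.
reinflection_item = ("running", "V;PST", "ran")


def accuracy(gold, predicted):
    """Exact-match accuracy: share of predictions identical to the gold form."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


gold = [form for _, _, form in inflection_items]
predicted = ["ran", "runs", "runing"]  # hypothetical system output with one error
print(f"{accuracy(gold, predicted):.2f}")
```

Exact match gives no partial credit, so a single wrong character counts the same as a completely wrong form.
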
  35. Error Taxonomy (Gorman et al., 2019) ● Free variation errors: more than one acceptable form exists ● Extraction errors: flaws in UniMorph’s parsing of Wiktionary ● Wiktionary errors: errors in Wiktionary or in the linguistic source data itself ● Silly errors: “bizarre” errors which defy any purely linguistic characterization (e.g. the system outputs ``*membled'' instead of ``mailed'', or enters a loop such as ``ynawemaylmyylmyylmyylmyylmyylmyym...'' instead of ``ysnewem'') ● Allomorphy errors: misapplication of existing allomorphic patterns ● Spelling errors: forms that do not follow language-specific orthographic conventions
  36. Error Taxonomy (Gorman et al., 2019) Majority of errors are due to allomorphy
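
The looping kind of "silly error" in the taxonomy above can be flagged automatically. The following is a rough heuristic, not the procedure used by Gorman et al.: it marks a predicted form as a suspected loop when some substring of at least `min_unit` characters occurs `min_repeats` or more times in a row. The function name and both thresholds are assumptions chosen for illustration.

```python
import re


def looks_like_loop(form: str, min_unit: int = 2, min_repeats: int = 3) -> bool:
    """Return True if `form` contains a unit repeated back-to-back.

    The regex captures a unit of at least `min_unit` characters and then
    requires at least `min_repeats - 1` further consecutive copies of it.
    """
    pattern = re.compile(r"(.{%d,}?)\1{%d,}" % (min_unit, min_repeats - 1))
    return pattern.search(form) is not None


# Predicted and expected forms quoted on the slides.
print(looks_like_loop("ynawemaylmyylmyylmyylmyylmyylmyym"))
print(looks_like_loop("ysnewem"))
```

A check like this only catches degenerate repetition; languages with productive reduplication would need a higher threshold or a language-specific exception list.
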
  38. Allomorphy Errors ● Stem-final vowels in Finnish (*pohjanpystykorvojen); consonant gradation in Finnish (*ei kiemurda) ● Ablaut in Dutch and German (*pront; *saufte) ● Umlaut (*Einwohnerzähle, *Förmer), plural suffixes, and verbal prefixes in German (*umkehre) ● Linking vowels in Hungarian (*masszázsakból instead of masszázsokból) ● Yers (*klęsek instead of klęsk) and genitive singular suffixes in Polish (*izotopa) ● Animacy in Polish and Russian (грузин vs. магазин in ACC.SG) ● Aspect in Russian (*будешь сорвать) ● Internal inflection in Russian compounds (*государствах-донорах, *лёгких промышленности (ACC.PL)). Poor performance on unseen lemmas.
  39. Some writing systems are more challenging! Tibetan: ○ Nonce words and impossible combinations of component units (Di et al., 2019)
  40. A Case Study on Nen (PNG; Muradoglu et al., 2020) ● Spoken in the village of Bimadbn in the Western Province of PNG, by approx. 400 people ● Verbs: prefixing, middle, and ambifixing ● Distributed Exponence (DE): “morphosyntactic feature values can only be determined after unification of multiple structural positions”
  41. A Case Study on Nen (PNG; Muradoglu et al., 2020) ● Low accuracy with a small number of training samples (<1000) ● Allomorphy: vowel harmony ● Looping: *ynawemaylmyylmyylmyylmy-ylmyylmyymaya mawemyymamya. How well do they generalize? Syncretism test: all the TAM categories exhibit syncretism between the second- and third-person singular actor, except the past perfective slot, where the two take different forms. When the past perfective forms are not observed in training, systems tend to predict them as syncretic (generalizing from the observed slots) and so mispredict the actual, exceptional forms.
  42. Using machine learning to evaluate our data quality
  43. Experiment data 15 25 35 22 What are the most “challenging” languages? ● Syc: Syriac (train: 1217 rows/534 lemmas) ● Ckt: Chukchi (132 rows/118 lemmas) ● Itl: Itelmen (1246 rows/849 lemmas) ● Gup: Kunwinjku (214 rows/62 lemmas) ● Bra: Braj (1082 rows/825 lemmas) ● Ail: Eibela (918 rows/470 lemmas) ● Evn: Evenki (5216 rows/2919 lemmas) ● Sjo: Xibe (290 rows/206 lemmas) ST 2021
  44. Experiment data 15 25 35 22 Evenki : ● The dataset has been created from oral speech samples ● Little attempt at any standardization in the oral speech transcription ● The Evenki language is known to have rich dialectal variation ● The vowel harmony is complex: not all suffixes obey it, and it is also dialect-dependent Schema: ● Various past tense forms are all annotated as PST ● Several comitative suffixes all annotated as COM ● Some features are present in the word form but they receive no annotation at all Elena Klyachko ST 2021
  45. Experiment data 15 25 35 22 Chukchi (Luorawetlat): ● The dataset has been created from oral speech samples ● Very small and sparse dataset ● Polypersonal agreement ● Very productive incorporation (English analogue: I am bookreading/dishwashing/fruitcutting) ● Complex vowel harmony: morphemes are either "plus" (with [a], [o], [e], [ə]) or "minus" (with [i], [e], [u], [ə]). If a word with minus morphemes gets an affix with plus morphemes, all the minus vowels turn into plus: [i] → [e], [e] → [a], [u] → [o]. Example: нутэнут + N;DAT → нотагты. Schema: ● No attempt to split genderlects, e.g. «walrus»: male рыркы (rərkə) vs. female цыццы (tsəttsə). Elena Klyachko, Maria Ryskina ST 2021
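
The vowel shift described for Chukchi can be sketched as a character mapping applied to the stem when a "plus" affix is attached. This is a toy illustration under stated assumptions: the Cyrillic correspondences (и for [i], э for [e], у for [u]), the segmentation of the slide's example into a stem нутэ- and a suffix -гты, and the treatment of that suffix as a "plus" morpheme are assumptions made here, and the function name is hypothetical.

```python
# Minus-to-plus shifts from the slide: [i] -> [e], [e] -> [a], [u] -> [o],
# written with the assumed Cyrillic letters и, э, у.
MINUS_TO_PLUS = str.maketrans({"и": "э", "э": "а", "у": "о"})


def attach_plus_suffix(stem: str, suffix: str) -> str:
    """Shift every minus vowel of the stem to its plus counterpart, then
    append the suffix. str.translate maps each character exactly once, so
    и becomes э and is not shifted a second time to а."""
    return stem.translate(MINUS_TO_PLUS) + suffix


print(attach_plus_suffix("нутэ", "гты"))
```

A neural model has to learn this whole-word change from very few examples, which is one reason the harmony is listed as a difficulty; the sketch ignores reduplication, schwa, and any exceptions.
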
  46. Experiment data 15 25 35 22 Itelmen: ● Very small and sparse dataset ● Verbs mark both subjects (with prefixes and suffixes) and objects (with suffixes); the combinations are hard to capture:
      Gold | Tag | Predicted
      čʼelisčiŋnen | V;NO3S+AB3S;FIN;IND;PST | čʼelnen
      enčʼiɬivnen | V;NO3S+AB3S;FIN;IND;PRS;CAUS | ənčʼiɬiznen
      nəntxlakninˀin | V;NO3P+AB3S;FIN;FOC;IND;PST | nəntxlakisxenˀin
      ● Complex phono- and morphotactics:
      ktavolknan | V.CVB;NFIN | ktavolˀin
      ktekejknen | V.CVB;NFIN | ktekejˀin
      Karina Sheifer, Sofya Ganieva, Matvey Plugaryov ST 2021
  47. Experiment data 15 25 35 22 Kunwinjku ● Very small and sparse dataset ● Complex phonotactics: borlbme+V;2;PL;Non-PST → *ngurriborlbme (instead of ngurriborle) ● Falling into loops: *ngarrrrrrrrrrrrrrmbbbijj (should be karribelbmerrinj) or *ngadjarridarrkddrrdddrrmerri (should be karriyawoyhdjarrkbidyikarrmerrimeninj) William Lane, Steven Bird ST 2021
  48. Experiment data 15 25 35 22 Eibela ● Very small and sparse dataset (comes from interlinear texts) ● Complex features, many of which didn’t have accurate UniMorph analogues:
      ASS (associated place) → Approximative APPRX
      ASS.EV (associated event) → Intransitive INTR
      A.NOM (agent nominalization) → Noun N
      CONTINUOUS/PROG → Progressive PROG
      COORD (coordinator) → Comitative COM
      DEL.IMP: DEL (delayed imperative) → Imperative-Jussive IMP; Future FUT
      DEO.DUR (deontic.durative) → Obligative OBLIG; Durative DUR
      D.S (different subject) → DS
      HYPO (hypothetical) → Irrealis IRR
      PERFECT/RESULTATIVE → Perfect PRF
      PROHIB → IMP;NEG
      PROGRESSIVE → PROG
      SIM/SIM, S.S (simultaneous) → Simultaneous Multiclausal Aspect SIMMA
      ST.IMP (strong imperative) → Imperative-Jussive Mood; Remote REMT
      TEL.EV (telic) → Telic TEL
      UNCERT (uncertain) → Potential POT
      ● Misprediction of vowel length: to:mulu + N;ERG → tomulε (should be to:mu:lε:)
      Grant Aiton ST 2021
  49. Which language data can be used for typological studies (e.g., compare morphological complexity)?
  50. Approx. amount of data for each ST2021 language
  52. Approx. amount of data for each ST2021 language. Plot annotations: polysynthetic languages represented by a few samples; a few samples from glosses, no full paradigms; mainly full paradigms. Chukchi data contains only 3 out of 9 grammatical cases and 3 out of 6 tenses. Check how representative the dataset is before you use it for typological studies!
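
One simple way to check representativeness is to compare the feature values attested in a dataset against the inventory a grammar describes. The sketch below assumes semicolon-joined tag strings; the function name, the six-case inventory, and the sample tags are placeholders for illustration and are not the actual Chukchi case system or shared-task data.

```python
def feature_coverage(tag_strings, inventory):
    """Return (attested, missing) subsets of `inventory` given tag strings."""
    observed = set()
    for tags in tag_strings:
        observed.update(tags.split(";"))
    attested = observed & set(inventory)
    return attested, set(inventory) - attested


# Hypothetical inventory and data, for illustration only.
CASE_INVENTORY = {"ABS", "ERG", "DAT", "LOC", "ABL", "COM"}
sample_tags = ["N;ABS;SG", "N;ABS;PL", "N;DAT;SG", "N;LOC;SG"]

attested, missing = feature_coverage(sample_tags, CASE_INVENTORY)
print(f"{len(attested)} of {len(CASE_INVENTORY)} cases attested; "
      f"missing: {sorted(missing)}")
```

Running the same check per dimension (case, tense, person) gives a quick picture of which paradigm cells a typological comparison would be blind to.
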
  53. Current challenges and open questions
  54. Paradigms in polysynthetic languages. Currently we have a very limited representation of paradigms in polysynthetic languages. It is not clear what a potential paradigm should contain.
  55. Derivation–Inflection continuum. There is no strict boundary between inflection and derivation. Derivational morphology also shows systematicity and regularity (English agentive “-er”, absence “-less”, ability “-able”). There are no clear dimensions of derivational meanings.
  56. Morphology and Syntax. Should we include clitics, copulas, MWEs? If we do, the size of paradigm tables grows exponentially. What might be an efficient way to store them?
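
The growth in table size can be illustrated with a little arithmetic: if every combination of values is a possible cell, the number of cells is the product of the sizes of all dimensions, so each added dimension multiplies the table. The dimension names and values below are made up for the example and do not describe any particular language.

```python
from math import prod


def paradigm_size(dimensions):
    """Number of cells if every combination of feature values is a slot."""
    return prod(len(values) for values in dimensions.values())


# Hypothetical nominal paradigm: 6 cases x 2 numbers.
base = {
    "case": ["NOM", "ACC", "GEN", "DAT", "INS", "ESS"],
    "number": ["SG", "PL"],
}
# Adding one more dimension (e.g. a clitic slot with 3 options) multiplies the table.
with_clitic = {**base, "clitic": ["none", "clitic_1", "clitic_2"]}

print(paradigm_size(base), paradigm_size(with_clitic))
```

Real paradigms are usually sparser than the full product, since many combinations do not occur, which is part of why the storage question on the slide is open.
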
  57. Thank you! Join and follow us: https://twitter.com/unimorph_/