Closing the Gap: Data Models for Documentary Linguistics
1. Closing the Gap: Data Models for Documentary Linguistics
Baden Hughes
Department of Computer Science and Software Engineering
The University of Melbourne
badenh@cs.mu.oz.au
2. Overview
Overall Context
The Electronic Data Format Challenge
Common Problems
Data Encoding Models
Lexicons, interlinear texts, paradigms, syntactic trees, annotation
standards, query languages
Linguistic Motivations vs Computational Interests
New Types of Data Exploration
Effects on Linguistic Analysis
New Tools
Conclusions
Latrobe Uni - Linguistics Seminar - 20050505 2
3. Overall Context
Large amounts of human language data
continue to be managed in electronic form
and analysed in fieldwork-driven linguistic
documentation
Increasing focus on acquisition-centric
methodologies which have vastly increased
the rate of growth of linguistic data
Reasonably static basic linguistic data
structures largely grounded in print domain
4. The Electronic Data Format Challenge
The methods used for the digital encoding of
linguistic data are often disparate
Often at best reduced to native formats supported by
widely-used tools such as Shoebox
Conversion is typically complex and lossy
Sometimes this can’t be predicted in advance
Many utility manipulation functions required to move
data between analytical applications and outputs
These functions are largely external to analytical
environments, with some notable exceptions (eg regular
expression manipulation)
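As a rough illustration of the kind of utility function meant here, the following Python sketch uses a regular expression to read backslash-marker records of the sort produced by tools such as Shoebox. The markers \lx, \ps and \ge, the sample words and the function name parse_records are assumptions made for this example, not a fixed schema or an existing tool.

```python
import re

# Invented sample in backslash-marker style. The markers \lx (lexeme),
# \ps (part of speech) and \ge (gloss) are illustrative assumptions.
SAMPLE = r"""
\lx kani
\ps n
\ge crab

\lx tuna
\ps v
\ge run
"""

# One field per line: a backslash, a marker name, then the field value.
FIELD = re.compile(r"^\\(\w+)\s*(.*)$")

def parse_records(text, record_marker="lx"):
    """Split marker-prefixed lines into a list of dict records."""
    records = []
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if not match:
            continue  # blank and unmarked lines are skipped in this sketch
        marker, value = match.groups()
        if marker == record_marker:
            records.append({})  # the record marker opens a new record
        if records:
            records[-1].setdefault(marker, []).append(value.strip())
    return records

records = parse_records(SAMPLE)
for record in records:
    print(record)
```

Values are kept in lists because a marker may legitimately repeat within a record; silently discarding repeats is one way such conversions become lossy.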
5. Common Problems
Despite diversity of language and analytical
approach, many documentary and descriptive
linguists face a common challenge: the
interoperability and longevity of electronic data
generated in fieldwork settings.
Repurposing data
Publishing data on the web
Publishing in papers
New analysis tools
New generation formats
6. The Emergence of Abstract Language Data Encoding Models
Recently, a number of formal data encoding models for
linguistic data types have emerged from projects
investigating "best practice" methods for preserving
linguistic data.
We will briefly consider models for
lexicons
interlinear texts
paradigms
syntactic trees
annotation standards
query languages
7. Data Models (1)
Lexicons
Bell & Bird (2001)
Interlinear Text
Bow, Hughes & Bird (2003)
Hughes, Bird & Bow (2003)
Linguistic Paradigms
Penton, Bow, Bird & Hughes (2004)
Penton & Bird (2004)
8. Data Models (2)
Syntactic Trees
Lai & Bird (2004)
Annotation Standards
Farrar, Lewis & Langendoen (2002)
Farrar & Langendoen (2003)
Query Languages
Bird, Chen, Davidson, Lee & Zheng (2005)
Cassidy & Bird (2000)
Taylor (2004)
9. Linguistic Motivations
Data models – so what?
It is the combined utility of these models that makes
them attractive to documentary linguists
The challenge is to lower the barrier to use of these
technologies in fieldwork and analytical contexts
Linguists (mostly) don’t care about the technology,
they just want to do linguistics!
Computer scientists are generally not interested in
linguistics …
10. Computational Interests
The development of such models may be inherently
interesting to computationally inclined researchers
Human language data encoding and annotation is
genuinely interesting in computer science terms;
unfortunately basic data modelling isn't
Technologists have a bad habit of providing advice which
is intended well but lacks traction for non-technical
communities (eg “use XML”)
Many of the solutions are XML-based, but contain many
more components than just XML encoded data
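To make the point about "more than just XML encoded data" concrete, here is a minimal Python sketch that encodes one interlinear phrase as XML and then queries it. The element and attribute names (phrase, word, morpheme, gloss, translation), the helper build_phrase and the sample data are assumptions for illustration only; they are not the schema of any of the cited models.

```python
import xml.etree.ElementTree as ET

def build_phrase(words, translation):
    """Encode one phrase as nested phrase/word/morpheme elements."""
    phrase = ET.Element("phrase")
    for form, morphemes in words:
        word = ET.SubElement(phrase, "word", {"form": form})
        for morph_form, gloss in morphemes:
            morph = ET.SubElement(word, "morpheme", {"gloss": gloss})
            morph.text = morph_form
    ET.SubElement(phrase, "translation").text = translation
    return phrase

# Invented example data, not drawn from any real language.
phrase = build_phrase(
    [("kanila", [("kani", "crab"), ("la", "PL")]),
     ("tunai", [("tuna", "run"), ("i", "PST")])],
    "The crabs ran.",
)

# The encoded data is only one component: a query layer sits on top of it.
xml_text = ET.tostring(phrase, encoding="unicode")
glosses = [m.get("gloss") for m in phrase.findall("./word/morpheme")]
print(xml_text)
print(glosses)
```

The encoding itself is a few lines; the usefulness comes from the model of how the levels relate and from the queries and renderings built over it.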
11. New Types of Data Exploration (1)
Open implemented solutions for a range of
manipulations are available
Lexicons
Generation of different types of lexicons
Interlinear Text (see following examples …)
Generation of different types of interlinear text
Induction of morphosyntactic glossing from lexicons
Generation of lexicons from interlinear text
Enrichment of lexicons from interlinear text
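One of the manipulations listed above, generating a lexicon from interlinear text, can be sketched in a few lines of Python. The data layout (phrases as lists of words, words as lists of morpheme and gloss pairs), the function name lexicon_from_interlinear and the sample data are assumptions for this example, not the implementation behind the open solutions mentioned.

```python
from collections import defaultdict

# Invented interlinear data: each phrase is a list of words, and each word
# is a list of (morpheme, gloss) pairs.
TEXT = [
    [[("kani", "crab"), ("la", "PL")], [("tuna", "run"), ("i", "PST")]],
    [[("kani", "crab")], [("tuna", "run"), ("la", "PL")]],
]

def lexicon_from_interlinear(phrases):
    """Collect every morpheme with its glosses and their frequencies."""
    lexicon = defaultdict(lambda: defaultdict(int))
    for phrase in phrases:
        for word in phrase:
            for morpheme, gloss in word:
                lexicon[morpheme][gloss] += 1
    # Convert to plain dicts, ordered alphabetically by morpheme.
    return {m: dict(g) for m, g in sorted(lexicon.items())}

lexicon = lexicon_from_interlinear(TEXT)
for morpheme, glosses in lexicon.items():
    print(morpheme, glosses)
```

Keeping a count per gloss means the same structure could also support enrichment of an existing lexicon, for example by flagging morphemes that occur with more than one gloss.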
14. New Types of Data Exploration (2)
Open implemented solutions for a range of
manipulations are available
Syntactic Trees
Induction of trees from interlinear text
Creation of interlinear text from syntactic tree drawing
Creation of lexicons from syntactic trees
Paradigms (see following examples …)
Generation of different types of paradigms
Induction of paradigms from interlinear text
Annotation of interlinear text from paradigms
Enrichment of lexicons from paradigms
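As a small illustration of generating different types of paradigms from the same data, the Python sketch below lays out feature-annotated forms as a table along any two chosen features. The feature names (person, number), the function build_paradigm and the forms themselves are invented for this example and do not reproduce the cited paradigm model.

```python
def build_paradigm(forms, row_feature, col_feature):
    """Arrange annotated forms as a table of rows and columns."""
    rows = sorted({f[row_feature] for f in forms})
    cols = sorted({f[col_feature] for f in forms})
    cells = {(f[row_feature], f[col_feature]): f["form"] for f in forms}
    table = [[""] + cols]  # header row
    for r in rows:
        # "-" marks a cell for which no form is attested
        table.append([r] + [cells.get((r, c), "-") for c in cols])
    return table

# Invented verb forms annotated with two features.
FORMS = [
    {"form": "tunan", "person": "1", "number": "sg"},
    {"form": "tunam", "person": "1", "number": "pl"},
    {"form": "tunas", "person": "2", "number": "sg"},
    {"form": "tunat", "person": "2", "number": "pl"},
]

by_person = build_paradigm(FORMS, "person", "number")
by_number = build_paradigm(FORMS, "number", "person")
for row in by_person:
    print("\t".join(row))
```

Because the forms are stored with their features rather than in a fixed grid, swapping the two feature arguments yields a differently oriented paradigm without touching the data.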
17. Effects on Linguistic Analysis
Integrated encoding standards for linguistic
data affect the practice of linguistic analysis
Some analysis types are now easier
New possibilities emerge
New analytical challenges are discovered
Data linkage/integration is certainly one of the
improvements
18. New Tools
The next generation of tools which support these
data models natively is emerging, eg FIELD, ELAN,
Toolbox (almost)
“Middleware” which allows the translation of legacy
formats to and from these models is reasonably
widely available
Analytical tools are increasingly being implemented
with web-grounded technologies and using web-
derived models
Open source/open data approaches are becoming
pervasive
19. Conclusion
Reducing the gap between computationally tractable
representations on which a high degree of
functionality can be built and simple underlying
formats driven by fieldwork-oriented tools is
advantageous to all parties
It reduces the intermediate data-munging steps which
require technical knowledge rather than linguistic
knowledge
While we are not quite “there yet”, the light at the
end of the tunnel is definitely there
Growing community of philosophically aligned
computer scientists and linguists
20. References
Bell & Bird, 2001. A Preliminary Study of the Structure of Lexicon Entries. Proceedings of
the Workshop on Web-Based Language Documentation and Description.
Bow, Hughes & Bird, 2003. Towards a General Model for Interlinear Text. Proceedings of
EMELD 2003.
Farrar, Lewis & Langendoen, 2002. A Common Ontology for Linguistic Concepts.
Proceedings of the Knowledge Technologies Conference.
Farrar & Langendoen, 2003. A Linguistic Ontology for the Semantic Web. GLOT
International 7(3).
Hughes, Bird & Bow, 2003. Encoding and Presenting Interlinear Text Using XML
Technologies. Proceedings of ALTW 2003.
Lai & Bird, 2004. Querying and Updating Treebanks: A Critical Survey and Requirements
Analysis. Proceedings of ALTW 2004.
Penton, Bow, Bird & Hughes, 2004. Towards a General Model for Linguistic Paradigms.
Proceedings of EMELD 2004.
Penton & Bird, 2004. Representing and Rendering Linguistic Paradigms. Proceedings of
ALTW 2004.
Bird, Chen, Davidson, Lee & Zheng, 2005. Extending XPath to Support Linguistic Queries.
Proceedings of PLANX 2005.
Cassidy & Bird, 2000. Querying databases of annotated speech. Proceedings of the
Eleventh Australasian Database Conference.
Taylor, 2004. XSLT as a Linguistic Query Language. BSc(Hons) Thesis, University of
Melbourne.