NOTE: This page is still in progress. At present it contains a number of conceptual recommendations without suggestions for how they could be implemented. Please feel free to contact the author with comments, complaints, and suggestions: Alexis Palmer, apalmer@coli.uni-sb.de
This page discusses one particular way in which linguistic data may be re-purposed: as training data for statistical machine learning approaches in computational linguistics and/or natural language processing. More specifically, we're talking about training data for supervised or semi-supervised methods -- methods that learn from labeled data.
Linguistic data as training material for machine learning
| Case study subdiscipline: | Computational linguistics, language documentation and description |
| Goals of this case study: | To highlight how decisions related to the annotation and storage of linguistic data (in this case, interlinear glossed texts from a language documentation project) can make the data more or less useful as training data for statistical machine learning methods |
1. What makes data good training data?
The key word here is consistency. Roughly put, the machine model learns generalizations over observed data and uses those to predict analyses for previously unseen data. In order for it to generalize well, the collection of data must be as internally consistent as possible in the way that it is coded/labeled.
A second important consideration is the underlying data structure. For efficient machine processing, there must be some explicit indication of relations between the text and its annotations. These two points are illustrated below with examples of interlinear glossed text, a common way of representing language data.
2. Labeling consistency
Typographic consistency
It is important in any data collection and annotation effort that all annotators work from one agreed-upon set of labels. For the sake of the machine learner, it is also important to adhere to capitalization and punctuation conventions. For example, 'PST' and 'pst' may both be intended to indicate a past tense morpheme, but the machine will see them as two distinct labels. Of course, many such issues can be handled by processing the data post-annotation and pre-model training, but to do so efficiently requires text manipulation skills that those producing the original data may or may not have.
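As a rough illustration of the kind of post-annotation clean-up mentioned above, the following Python sketch collapses case and stray punctuation variants of a grammatical label into one form. The function name and the specific rules (upper-casing, dropping a trailing period) are assumptions made for illustration, not conventions of any particular project. Such a rule should only be applied to grammatical labels, not to lemma translations on the same tier.

```python
def normalize_label(label: str) -> str:
    """Trim whitespace, drop a trailing period, and upper-case a gloss label.

    Hypothetical convention: grammatical labels are written in upper case
    with no trailing punctuation.
    """
    cleaned = label.strip().rstrip(".")
    return cleaned.upper()


# Variants that a human reads as the same label but a learner treats as distinct.
raw_labels = ["PST", "pst", " Pst.", "NEG"]
normalized = [normalize_label(label) for label in raw_labels]

# After normalization only two distinct labels remain.
distinct = sorted(set(normalized))
```

Before normalization the learner would see four different strings here; afterwards it sees two.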
One way for projects to maintain labeling consistency is by use of an annotation interface which restricts the space of allowed labels.
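Where such an interface is not available, a similar check can be run over the data after the fact. The sketch below is a minimal example under stated assumptions: the label inventory is a small illustrative set taken from the Uspanteko example on this page, and the function name is hypothetical.

```python
# Assumption: an illustrative closed inventory of grammatical labels.
ALLOWED_LABELS = {"NEG", "INC", "E1S", "SC", "PREP"}


def find_unknown_labels(glosses, allowed=ALLOWED_LABELS):
    """Return (position, label) pairs for labels outside the allowed inventory.

    Affix hyphens are ignored for the comparison, so "INC-" matches "INC".
    """
    unknown = []
    for position, gloss in enumerate(glosses):
        core = gloss.strip("-")
        if core not in allowed:
            unknown.append((position, gloss))
    return unknown


# "e1s-" is flagged because it differs in case from the agreed label "E1S".
problems = find_unknown_labels(["NEG", "INC-", "e1s-", "-SC"])
```

A report like `problems` can be handed back to annotators for correction rather than silently rewritten.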
Analytic consistency
Maintaining analytic consistency is a much more difficult task. In cases where the analysis is reasonably well-understood at the outset of annotation, agreed-upon conventions for analysis and annotation may be made available to annotators in the form of a detailed annotation manual. It is often the case, however, that analysis and annotation proceed in parallel. In documentation and description of less-studied (or previously unstudied) languages, this is in fact the normal situation.
Several bits of record-keeping can help to deal with changing analyses:
- tracking the source of each label (i.e. the specific annotator) as well as the time and date of annotation
- documenting changes in analysis and/or labeling conventions, indicating the nature and source of the change, how the change should be manifested in the annotation (in other words, what did the previous analysis look like? what does the new analysis look like?), the date and time at which the decision to change the analysis was made, and whether or not the change has been back-propagated to previously-labeled data
- using annotation tools and/or data formats which are able to maintain a historical record of changes in the data (along with the metadata associated with those changes)
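The record-keeping points above can be sketched as a small data structure in which a label is never overwritten, only superseded. This is one possible design, not a description of any existing tool; the class names, field names, and the placeholder label "SUF" are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class LabelEvent:
    """One labeling decision: what was assigned, by whom, when, and why."""
    label: str
    annotator: str
    timestamp: datetime
    note: str = ""


@dataclass
class TrackedLabel:
    """A morpheme whose full labeling history is kept."""
    morpheme: str
    history: List[LabelEvent] = field(default_factory=list)

    def set_label(self, label, annotator, note="", timestamp=None):
        # Append instead of overwriting, so earlier analyses stay recoverable.
        when = timestamp or datetime.now(timezone.utc)
        self.history.append(LabelEvent(label, annotator, when, note))

    @property
    def current(self) -> Optional[str]:
        return self.history[-1].label if self.history else None


item = TrackedLabel("-j")
item.set_label("SUF", "annotator_a", note="initial pass (placeholder label)")
item.set_label("SC", "annotator_b", note="reanalysis; not yet back-propagated")
```

Keeping the history makes it possible to select, for training, only the data labeled under one version of the analysis.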
We recognize that some of these desiderata are not easily attainable with currently-available systems for text glossing and interlinearization, particularly in the language documentation context. We thus add our voice to those calling for development of an open source, updated, general-purpose system for text interlinearization and glossing.
3. Data structures
First, we point to the pages of WG1: Annotation Standards as well as the Existing Resources page for many valuable resources pertaining to standardization of data structures for annotation. The resources presented on these pages include links to proposed standards and extensive bibliographic references related to this topic.
Interlinear glossed text (IGT)
The particular concern in this case study is the use of interlinear glossed text (IGT) as training data for a machine learner. First, here's an example of IGT from the Mayan language Uspanteko (Pixabaj et al.).
Full text: Kita' tinch'ab'ej laj inyolj iin.
| TEXT | kita' | tinch'ab'ej | | | | laj | inyolj | | iin |
| MORPHEME | kita' | t- | in- | ch'abe | -j | laj | in- | yolj | iin |
| GLOSS | NEG | INC- | E1S- | hablar | -SC | PREP | E1S- | idioma | yo |
Spanish translation: No le hablo en mi idioma.
English translation: I don't speak to him in my language.
Links between annotation tiers
The table above shows three tiers of annotation for this Uspanteko clause. The 'TEXT' tier contains each word of the clause (word boundaries are indicated by double-line cell borders). The 'MORPHEME' tier shows a segmentation of each word into its component morphemes, and the 'GLOSS' tier shows a morpheme-by-morpheme gloss of the clause, including both gloss labels for non-stem morphemes (e.g. NEG for kita') and lemma translations for stem morphemes (e.g. hablar for ch'abe).
Two NLP tasks we might imagine learning from such data are morphological segmentation (producing the 'MORPHEME' tier, given at least the 'TEXT' tier and perhaps the translation(s) as well) and morpheme glossing (roughly, given the 'MORPHEME' tier, produce the 'GLOSS' tier). This is where the data structure used to represent the interlinear text becomes crucial!
Most often when we encounter IGT -- as, in fact, in the table above -- the links between annotation tiers are conveyed through visual aspects of the presentation. Here, for example, the association of morphemes with the words they belong to is communicated using double-line borders at word boundaries. Visually-oriented presentations of IGT do not generally provide the explicit encoding of these relationships that a machine learner needs to make sense of the data. In order to use IGT as training data, it must be presented to the machine learner in a format that directly encodes links between elements from one annotation tier to those on another.
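To make the idea of directly encoded links concrete, here is a minimal Python sketch that stores the Uspanteko clause above with each word explicitly owning its morphemes and each morpheme explicitly carrying its gloss. The class and function names are assumptions chosen for illustration; the data is taken from the table on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morpheme:
    form: str
    gloss: str  # link from the MORPHEME tier to the GLOSS tier


@dataclass
class Word:
    text: str
    morphemes: List[Morpheme]  # link from the TEXT tier to the MORPHEME tier


clause = [
    Word("kita'", [Morpheme("kita'", "NEG")]),
    Word("tinch'ab'ej", [
        Morpheme("t-", "INC-"),
        Morpheme("in-", "E1S-"),
        Morpheme("ch'abe", "hablar"),
        Morpheme("-j", "-SC"),
    ]),
    Word("laj", [Morpheme("laj", "PREP")]),
    Word("inyolj", [Morpheme("in-", "E1S-"), Morpheme("yolj", "idioma")]),
    Word("iin", [Morpheme("iin", "yo")]),
]


def segmentation_pairs(words):
    """Training pairs for morphological segmentation: word -> morpheme forms."""
    return [(w.text, [m.form for m in w.morphemes]) for w in words]


def glossing_pairs(words):
    """Training pairs for morpheme glossing: morpheme form -> gloss."""
    return [(m.form, m.gloss) for w in words for m in w.morphemes]
```

With the links explicit, input/output pairs for both tasks fall out of the structure directly, with no need to recover word boundaries from the visual layout.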
Structured representational formats
What is needed to address this concern is a format which preserves structured links between annotation tiers. XML formats are one way of preserving such links. At the same time, using XML follows current recommendations regarding longevity and portability of data (for example, Bird and Simons 2003, EMELD School of Best Practices). Several XML formats for IGT have been proposed, including EthnoER's EOPAS (Schroeter and Thieberger 2006), IGT-XML (Palmer and Erk 2007), and an earlier model outlined in Bow, Hughes, and Bird (2003). Another approach is the use of Annotation Graphs (e.g. Bird and Liberman 2001, Maeda et al. 2002).
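As a loose illustration of how XML nesting can preserve links between tiers, the sketch below serializes one word of the Uspanteko example using Python's standard xml.etree.ElementTree module. The element and attribute names (clause, word, morpheme, form, gloss) are invented for this example and do not follow the schema of EOPAS, IGT-XML, or any other proposed format.

```python
import xml.etree.ElementTree as ET


def clause_to_xml(words):
    """Build an XML tree in which each morpheme is nested inside its word.

    `words` is a list of (word_text, [(morpheme_form, gloss), ...]) tuples.
    The element and attribute names are hypothetical.
    """
    root = ET.Element("clause")
    for w_index, (text, morphemes) in enumerate(words, start=1):
        word_el = ET.SubElement(root, "word", id=f"w{w_index}", text=text)
        for m_index, (form, gloss) in enumerate(morphemes, start=1):
            ET.SubElement(
                word_el,
                "morpheme",
                id=f"w{w_index}m{m_index}",
                form=form,
                gloss=gloss,
            )
    return root


words = [("inyolj", [("in-", "E1S-"), ("yolj", "idioma")])]
root = clause_to_xml(words)
xml_text = ET.tostring(root, encoding="unicode")

# Reading the file back recovers the word-to-morpheme links from the nesting.
parsed = ET.fromstring(xml_text)
glosses = [m.get("gloss") for m in parsed.findall("./word/morpheme")]
```

Nesting is only one option; the proposed formats cited above may instead link tiers through identifiers and references.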
References:
- Bird, Steven and Mark Liberman. 2001. 'A formal framework for linguistic annotation.' Speech Communication, 33(1-2): 23-60.
- Bird, Steven and Gary Simons. 2003. 'Seven dimensions of portability for language documentation and description.' Language, 79(3): 557-582.
- Bow, Catherine, Baden Hughes, and Steven Bird. 2003. 'Towards a general model of interlinear text.' In Proceedings of EMELD Workshop 2003: Digitizing and Annotating Texts and Field Recordings. LSA Institute: Lansing MI, USA.
- Maeda, Kazuaki, Steven Bird, Xiaoyi Ma, and Haejoong Lee. 2002. 'Creating Annotation Tools with the Annotation Graph Toolkit.' In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC).
- Palmer, Alexis and Katrin Erk. 2007. 'IGT-XML: An XML format for interlinearized glossed text.' In Proceedings of the Linguistic Annotation Workshop (LAW-07), ACL 2007.
- Pixabaj, Telma Can (coordinator), Miguel Angel Vicente Méndez, María Vicente Méndez, and Oswaldo Ajcot Damián. Uspanteko text collection, in Text Collections in Four Mayan Languages, 2003-2007. OKMA (Oxlajuuj Keej Maya' Ajtz'iib'). Supported by Endangered Languages Documentation Programme (SOAS, University of London).
- Schroeter, Ronald and Nicholas Thieberger. 2006. 'EOPAS: the EthnoER online representation of interlinear text.' In Sustainable Data from Digital Fieldwork (proceedings of conference held at the University of Sydney, 4-6 December 2006). Sydney University Press.