Machine reusability of data

NOTE: This page is still in progress. At present it contains a number of conceptual recommendations without suggestions for how they could be implemented. Please feel free to contact the author with comments, complaints, and suggestions: Alexis Palmer, apalmer@coli.uni-sb.de

This page discusses one particular manner in which linguistic data may be re-purposed: as training data for statistical machine learning approaches in computational linguistics and/or natural language processing. More specifically, we're talking about training data for supervised or semi-supervised methods -- methods that learn from labeled data.

Linguistic data as training material for machine learning

Case study subdiscipline: Computational linguistics, language documentation and description
Goals of this case study: To highlight how decisions related to the annotation and storage of linguistic data (in this case, interlinear glossed texts from a language documentation project) can make the data more or less useful as training data for statistical machine learning methods


1. What makes data good training data?

The key word here is consistency. Roughly put, the machine model learns generalizations over observed data and uses those to predict analyses for previously unseen data. In order for it to generalize well, the collection of data must be as internally consistent as possible in the way that it is coded/labeled.

A second important consideration is the underlying data structure. For efficient machine processing, there must be some explicit indication of relations between the text and its annotations. These two points are illustrated below with examples of interlinear glossed text, a common way of representing language data.

2. Labeling consistency

Typographic consistency
It is important in any data collection and annotation effort that all annotators work from one agreed-upon set of labels. For the sake of the machine learner, it is also important to adhere to capitalization and punctuation conventions. For example, 'PST' and 'pst' may both be intended to indicate a past tense morpheme, but the machine will see them as two distinct labels. Of course, many such issues can be handled by processing the data post-annotation and pre-model training, but to do so efficiently requires text manipulation skills that those producing the original data may or may not have.
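As a minimal sketch of the kind of post-annotation cleanup mentioned above, the following Python maps case and punctuation variants of a label onto one canonical spelling. The label inventory and the function names are assumptions made for illustration; a real project would substitute its own agreed-upon label set. Pieces that are not in the inventory (for example stem translations such as hablar) are passed through unchanged.

```python
# Hypothetical canonical label inventory; a real project would define its own.
CANONICAL_LABELS = {"PST", "NEG", "INC", "E1S", "SC", "PREP"}

# Look up the canonical spelling by its case-folded form.
_LOOKUP = {label.casefold(): label for label in CANONICAL_LABELS}


def normalize_label(raw: str) -> str:
    """Return the canonical spelling of a gloss label, or the input unchanged.

    Surrounding whitespace and leading/trailing periods are ignored when matching.
    """
    key = raw.strip().strip(".").casefold()
    return _LOOKUP.get(key, raw.strip())


def normalize_gloss(gloss: str) -> str:
    """Normalize each hyphen-separated piece of a word-level gloss."""
    return "-".join(normalize_label(piece) for piece in gloss.split("-"))


print(normalize_label("pst"))
print(normalize_gloss("inc-e1s-hablar-sc"))
```

One caveat with this approach: a stem translation that happens to coincide with a label when case is ignored would also be rewritten, so the output is worth spot-checking.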

One way for projects to maintain labeling consistency is by use of an annotation interface which restricts the space of allowed labels.
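The idea of restricting the space of allowed labels can be illustrated with a small Python sketch. The class name LabelInventory, its accept method, and the example labels are all hypothetical; this is not the behavior of any particular annotation tool, only an indication of the check such a tool might perform before storing a label.

```python
class LabelInventory:
    """A closed set of gloss labels that an annotation tool could enforce."""

    def __init__(self, labels):
        self._labels = frozenset(labels)

    def accept(self, label: str) -> str:
        """Return the label if it is in the inventory; otherwise raise ValueError."""
        if label not in self._labels:
            raise ValueError(
                f"{label!r} is not in the agreed label set: {sorted(self._labels)}"
            )
        return label


# Hypothetical inventory for illustration only.
inventory = LabelInventory({"NEG", "INC", "E1S", "SC", "PREP", "PST"})

inventory.accept("PST")      # accepted as-is
try:
    inventory.accept("pst")  # rejected: not spelled as in the inventory
except ValueError as err:
    print(err)
```

Rejecting the variant at entry time, rather than repairing it afterwards, keeps the burden of text manipulation off those producing the original data.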

Analytic consistency
Maintaining analytic consistency is a much more difficult task. In cases where the analysis is reasonably well-understood at the outset of annotation, agreed-upon conventions for analysis and annotation may be made available to annotators in the form of a detailed annotation manual. It is often the case, however, that analysis and annotation proceed in parallel. In documentation and description of less-studied (or previously unstudied) languages, this is in fact the normal situation.

Several bits of record-keeping can help to deal with changing analyses:
  • tracking the source of each label (i.e. the specific annotator) as well as the time and date of annotation
  • documenting changes in analysis and/or labeling conventions, indicating the nature and source of the change, how the change should be manifested in the annotation (in other words, what did the previous analysis look like? what does the new analysis look like?), the date and time at which the decision to change the analysis was made, and whether or not the change has been back-propagated to previously-labeled data
  • using annotation tools and/or data formats which are able to maintain a historical record of changes in the data (along with the metadata associated with those changes)
We recognize that some of these desiderata are not easily attainable with currently-available systems for text glossing and interlinearization, particularly in the language documentation context. We thus add our voice to those calling for development of an open source, updated, general-purpose system for text interlinearization and glossing.
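To make the record-keeping suggestions above more concrete, here is one possible sketch in Python of the metadata they imply. All class and field names are assumptions introduced for illustration, and the labels used in the example are placeholders rather than real analyses. Each label carries its annotator and timestamp, each change of convention is documented as its own record, and a morpheme keeps its full labeling history instead of only the latest label.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ConventionChange:
    """Documents one change in analysis or labeling convention."""
    change_id: str
    decided_at: datetime
    decided_by: str
    old_analysis: str       # what the previous analysis looked like
    new_analysis: str       # what the new analysis looks like
    back_propagated: bool   # has previously-labeled data been updated?


@dataclass(frozen=True)
class LabelEvent:
    """One act of labeling: who assigned which label, and when."""
    label: str
    annotator: str
    timestamp: datetime
    change_id: Optional[str] = None  # ConventionChange that motivated it, if any


@dataclass
class AnnotatedMorpheme:
    form: str
    history: List[LabelEvent] = field(default_factory=list)

    def relabel(self, label, annotator, change_id=None, when=None):
        """Append a new label to the history without discarding older ones."""
        when = when or datetime.now(timezone.utc)
        self.history.append(LabelEvent(label, annotator, when, change_id))

    @property
    def current_label(self):
        return self.history[-1].label if self.history else None


# Placeholder labels and annotator names, for illustration only.
morpheme = AnnotatedMorpheme("t")
morpheme.relabel("LABEL_A", annotator="annotator_1")
morpheme.relabel("LABEL_B", annotator="annotator_2", change_id="C1")
print(morpheme.current_label, len(morpheme.history))
```

Because the history is only ever appended to, data labeled under an older convention can still be identified and, if desired, filtered out or relabeled before model training.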

3. Data structures

First, we point to the pages of WG1: Annotation Standards as well as the Existing Resources page for many valuable resources pertaining to standardization of data structures for annotation. The resources presented on these pages include links to proposed standards and extensive bibliographic references related to this topic.

Interlinear glossed text (IGT)
The particular concern in this case study is the use of interlinear glossed text (IGT) as training data for a machine learner. First, here's an example of IGT from the Mayan language Uspanteko:

Full text: Kita' tinch'ab'ej laj inyolj iin.

TEXT      kita'  tinch'ab'ej        laj   inyolj      iin
MORPHEME  kita'  t-in-ch'abe-j      laj   in-yolj     iin
GLOSS     NEG    INC-E1S-hablar-SC  PREP  E1S-idioma  yo

Spanish translation: No le hablo en mi idioma.
English translation: I don't speak to him in my language.

Links between annotation tiers
The table above shows two tiers of annotation for the clause given in the 'Full text' line. The 'MORPHEME' tier shows a segmentation of words into their component morphemes. The 'GLOSS' tier shows a morpheme-by-morpheme gloss of the clause, including both gloss labels for non-stem morphemes (e.g. NEG for kita') and lemma translations for stem morphemes (e.g. hablar for ch'abe).
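The links between tiers can be made explicit in a data structure. The Python sketch below is one possible representation, not a proposed standard: the names Morpheme, Word, and align_tiers are hypothetical, and the code assumes that morphemes and their glosses are separated by hyphens and appear in the same order on both tiers, as in the Uspanteko example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morpheme:
    form: str
    gloss: str


@dataclass
class Word:
    text: str
    morphemes: List[Morpheme]


def align_tiers(text_tier, morpheme_tier, gloss_tier, sep="-"):
    """Pair each word with its morphemes, and each morpheme with its gloss.

    Raises ValueError if the tiers cannot be aligned one-to-one.
    """
    if not (len(text_tier) == len(morpheme_tier) == len(gloss_tier)):
        raise ValueError("tiers contain different numbers of words")
    words = []
    for text, segmented, glossed in zip(text_tier, morpheme_tier, gloss_tier):
        forms = segmented.split(sep)
        glosses = glossed.split(sep)
        if len(forms) != len(glosses):
            raise ValueError(f"morpheme/gloss count mismatch in {text!r}")
        words.append(Word(text, [Morpheme(f, g) for f, g in zip(forms, glosses)]))
    return words


words = align_tiers(
    ["kita'", "tinch'ab'ej", "laj", "inyolj", "iin"],
    ["kita'", "t-in-ch'abe-j", "laj", "in-yolj", "iin"],
    ["NEG", "INC-E1S-hablar-SC", "PREP", "E1S-idioma", "yo"],
)
for word in words:
    print(word.text, [(m.form, m.gloss) for m in word.morphemes])
```

With the alignment stored explicitly, a learner can be given (morpheme, gloss) pairs directly, and a mismatch between tiers is reported at load time instead of silently producing misaligned training examples.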