Group 1: Annotation Standards

Working group members: Mary Beckman (co-chair), Stuart Robinson (co-chair), Sarah Churng, Greville Corbett, Charles Fillmore, Richard Wright

Preamble

The Annotation Standards group was charged with identifying and documenting existing and needed standards for the annotation of linguistic data. Further, the group was asked to consider possible standards that may need to be developed in the future. Annotation standards support interoperability, aggregation of data, and (ideally) applications that help linguists address the research questions that they are interested in answering while creating consistently annotated data as a side-effect. Another consideration that the group took into account is that sometimes these goals may be in partial conflict with standards for the ethical treatment of human subjects.

This wiki page is the report ("white paper") from the group, who acknowledge the helpful comments of the other attendees at the Cyberling09 workshop, particularly Emily Bender, Nancy Ide, and Mark Liberman.

Table of contents

  1. What is annotation and what is it good for?
  2. What are annotation standards and what are they for?
  3. What does it take to be a good annotation standard?
  4. The state of the art (with some case studies)
  5. Existing annotation standards and resources
  6. References
1. What is annotation and what is it good for?

Annotation is the act of adding, to primary linguistic data, information representing analyses or models of aspects of the data. For example, if the primary linguistic data are an audio recording of a sequence of turns in a conversation between two speakers, then one type of annotation could be the marking of speaker-change points in the conversation within a layer of annotations related to the analysis of discourse structure. Another series of annotation layers could begin with an orthographic and/or a segmental transcription of the speech. Other annotation layers in this series might include a tokenization (segmentation) and glossing of the words or other similar units in the orthographic or segmental transcriptions of the recording. Other series of annotation layers on the morphosyntactic side of language could include a subsequent set of part-of-speech assignments to the words and/or a parsing of the syntactic structures of the sentences and other linguistic expressions in the recording. A parallel series of annotation layers on the phonetics/phonology side of language could include the tagging of linguistically significant events in the spectral patterns of the utterances (e.g., the release burst of each plosive and the transitions between different voice qualities), a parsing of prosodic structures that group segments into syllables and higher-order constituents, and the identification of salient points of coordination in the rhythms at different levels (e.g., marking of stressed or accented syllables).

1.1. What elements of an analysis can be annotated?

As the above example illustrates, elements of a linguistic analysis that can be annotated and for which annotation conventions can be codified separately are of at least three types: (1) tokenization/segmentation, (2) syntagmatic structure, and (3) paradigmatic content of the events/tokens and structure. In the ontology of linguistic annotations, these aspects could be thought of as (1) the identification of instances or things, (2) the identification of relations among things, and (3) the identification of classes of things or relational functions. Bird and Liberman (2001) give an insightful discussion and a framework for formulating good ways of treating all three aspects. They propose to formalize this ontology in terms of the annotation graph -- a directed acyclic graph in which each annotation token is (minimally) a triple consisting of two nodes that point to positions in the string of labels on any annotation tier, together with the label for the arc connecting these points, as in the two figures below.

Figure 1. Spectrogram (and Praat TextGrid merge of original label files) for the first three words in utterance train/dr1/fjsp0/sa1 from the TIMIT corpus (top) with a screen shot of the original phn and wrd label files (the first and third tiers of the Praat TextGrid) and the associated annotation graph snippet from Figure 2a in Bird and Liberman (2001).

Figure 2. Sample view for the first utterance in a Hayu narrative from the LACITO archive (http://lacito.vjf.cnrs.fr/archivage/tools/list_rsc.php?lg=Hayu) (top) with a screen shot of the snippet of the annotation file and associated annotation graph from Figure 5 in Bird and Liberman (2001).
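
The triple structure described by Bird and Liberman (2001) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Arc, node_times, and labels_under are hypothetical, the time offsets are invented rather than taken from TIMIT, and the labels are only loosely modeled on the phn and wrd tiers shown in Figure 1.

```python
from collections import namedtuple

# One annotation token: a labeled arc between two nodes on a named tier.
Arc = namedtuple("Arc", ["start", "end", "tier", "label"])

# Node id -> time offset in seconds. These values are invented for
# illustration; they are NOT the offsets in the TIMIT label files.
node_times = {0: 0.00, 1: 0.15, 2: 0.27, 3: 0.36, 4: 0.42, 5: 0.50, 6: 0.55}

arcs = [
    Arc(0, 1, "phn", "h#"),
    Arc(1, 2, "phn", "sh"),
    Arc(2, 3, "phn", "iy"),
    Arc(3, 4, "phn", "hv"),
    Arc(4, 5, "phn", "ae"),
    Arc(5, 6, "phn", "dcl"),
    Arc(1, 3, "wrd", "she"),
    Arc(3, 6, "wrd", "had"),
]

def labels_under(span, tier):
    """Return labels on `tier` whose arcs lie inside the time span of `span`."""
    lo, hi = node_times[span.start], node_times[span.end]
    return [a.label for a in arcs
            if a.tier == tier
            and lo <= node_times[a.start] and node_times[a.end] <= hi]

she, had = arcs[6], arcs[7]
print(labels_under(she, "phn"))
print(labels_under(had, "phn"))
```

Because the word arcs and the segment arcs point to the same nodes, the relation between the two tiers does not need to be stored separately: it can be recovered from the shared nodes.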


1.2. Examples

We expand on these three different aspects by illustrating each in reference to the type of example described above, where the primary linguistic data are an audio recording, and also in reference to cases where the primary linguistic data are instead a written text, which may or may not have begun as an orthographic transcription of an audio recording.

(1) For audio data, tokenization considerations can include not just the need to decide on the number of things that are instantiated at any given level of annotation, but also the need to agree upon where token boundaries (e.g., edges of segments or words) should be placed relative to the disparate spectral cues to the often asynchronous and/or smoothly changing postures of different articulatory systems. For text data, tokenization considerations similarly can include the need to introduce word boundaries (e.g., as spaces in text) for languages with writing systems that do not use space for word separation, or, for English, the need to separate punctuation that shows significant syntagmatic boundaries -- by adding surrounding spaces -- from punctuation that is a part of a name or word (e.g., "that ' s that ! " as opposed to "etc.", "Mrs.", "U.S.A.").
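
The English text case can be sketched as follows. This is a minimal illustration, not a recommended tokenizer: the function name tokenize and the PROTECTED list are hypothetical, and the list contains only the three forms mentioned above. A real convention would also have to say what happens when a protected form is followed by further punctuation, which this sketch does not handle.

```python
import re

# Hypothetical list of forms whose punctuation belongs to the word itself.
PROTECTED = {"etc.", "Mrs.", "U.S.A."}

def tokenize(text):
    """Split on whitespace, then split punctuation off unprotected chunks."""
    tokens = []
    for chunk in text.split():
        if chunk in PROTECTED:
            # The periods are part of the name or word: keep the chunk whole.
            tokens.append(chunk)
        else:
            # Every punctuation character becomes a token of its own.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(" ".join(tokenize("that's that!")))
print(tokenize("Mrs. Smith visited the U.S.A. etc."))
```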

(2) For audio data, syntagmatic structure can include agreed-upon conventions for identifying time points on different annotation tiers that should be synchronized because they are the same event (e.g., the time stamp for the beginning edge of the first segment in a word is also the time stamp for the beginning edge of the word, and should change if/when the segment boundary is moved). It can also include the principles for differentiating different types of coordination across annotation tiers (e.g., when the time of a linguistically meaningful fundamental frequency maximum is identified relative to the time of a stop release). For text data, syntagmatic structure can include the bracketing of sequences of text that function as single constituents, or the indexing of anaphors to their antecedents, or the indexing of discontinuous collocates, as between the first and last words in "wreak this type of havoc".
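
One way to honor the "same event" convention for audio data is to let both tiers refer to a single shared time point rather than to two copies of a number. The sketch below assumes hypothetical Anchor and Interval classes and invented times; it is not the data model of any particular tool.

```python
from dataclasses import dataclass

@dataclass
class Anchor:
    """A time point that may be shared by intervals on several tiers."""
    time: float

@dataclass
class Interval:
    start: Anchor
    end: Anchor
    label: str

# Invented times, for illustration only.
word_onset = Anchor(0.15)
segment = Interval(word_onset, Anchor(0.27), "sh")   # segment tier
word = Interval(word_onset, Anchor(0.36), "she")     # word tier

# Moving the segment boundary moves the word boundary too,
# because both intervals point at the same Anchor object.
segment.start.time = 0.17
print(word.start.time)
```

Time points that are merely coordinated (such as a fundamental frequency maximum located relative to a stop release) would instead be separate anchors linked by an explicit relation, so that moving one does not silently move the other.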

(3) For audio data, the development of annotation conventions inevitably includes work to agree on the set of contrasting types to distinguish at several levels. For example, in tagging consonant and vowel segments, should the labels include only "broad" phoneme classes, or should major allophones be distinguished, or even finer phonetic detail marked? Whose intonational analysis should be adopted in tagging utterance melody? For text data, similarly, annotation conventions can involve work to codify the set of paradigmatic contrasts at many levels, from conventions regarding the number of different types of filled pauses and how to spell them in an orthographic transcription of recorded speech, to conventions for identifying cells in morphological paradigms, which might need to be differentiated in significance across languages. For example, systems for morphosyntactic glossing intended for cross-language comparison might need to recognize that for one language 'singular' is understood as in paradigmatic contrast to 'plural' while in another language it might be in contrast with 'dual' and 'plural'. This could conceivably be achieved by adding a "legend" to a given annotation layer, for a given language, linking each special annotation category to a description of the relevant contrast set, e.g., in a grammar.
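
The "legend" idea can be sketched as a per-language lookup from a gloss label to the set of values it contrasts with. Everything here is hypothetical: the language names lang_A and lang_B, the LEGEND table, and the function contrast_set are placeholders, and a real legend would presumably link to a description in a grammar rather than to a bare tuple.

```python
# Hypothetical legend: for each language, each feature lists its contrast set.
LEGEND = {
    "lang_A": {"number": ("singular", "plural")},
    "lang_B": {"number": ("singular", "dual", "plural")},
}

def contrast_set(language, label):
    """Return (feature, other values) that `label` contrasts with in `language`."""
    for feature, values in LEGEND[language].items():
        if label in values:
            return feature, tuple(v for v in values if v != label)
    raise KeyError(f"{label!r} is not in the legend for {language!r}")

print(contrast_set("lang_A", "singular"))
print(contrast_set("lang_B", "singular"))
```

The same gloss label thus resolves to different contrast sets depending on the language whose legend is consulted.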

1.3. The purpose of annotation and its relationship to the data

As the last example should make clear, conventions for each aspect of an annotation scheme cannot be established without thinking carefully about what the data are and what the annotations are for. In general, annotations are dedicated to specific purposes: it is hard to imagine a corpus development project that seeks to account for every detectable phenomenon in a language sample. Sociophoneticians who are interested in the increased use of creaky voice by young female speakers of American English would want to have an annotation tier where beginning and endpoints of creaky voice are marked; phonologists who are interested in comparing the probability of finding a particular segment sequence within a word to the probability of finding that sequence across a word boundary obviously do not need such a tier. Researchers interested in preferred or habitual locutions on the part of individual speakers obviously need to have speaker-ID information in the data they study; those interested in finding examples of syntactic phenomena obviously do not need such information. These different needs for more or fewer levels of annotation can be described as different points on a scale of granularity. But the different granularities can also involve the same level of annotation, but different degrees of specificity in the paradigmatic set.

The following is an example of levels of granularity in the syntactic description of English rate expressions such as forty dollars an hour, forty miles an hour, forty miles a gallon, forty times a day, forty dollars an ounce, and the like. In dealing with such phrases, one purpose might be that of providing a preliminary mark-up for a parser, inasmuch as this pattern of two adjacent NPs is not a part of the ordinary grammar of the language. For this purpose, it would be enough to block off such phrases and mark them as NP: as such they then can fit into PPs (moving at forty miles an hour) and VPs (earns forty dollars an hour, gets forty miles a gallon), etc. A quite different purpose one might have for annotating such expressions is providing a mark-up that is usable for language understanding efforts: in such cases the type of unit (linear extent, money amount, time, weight, etc.) in each of the two parts of these expressions should be indicated (with information from a lexicon), allowing the automatic assignment of the phrases to such categories as Fuel-efficiency, Price-per-unit, Frequency, Speed and the like.
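
The two granularities can be sketched as two outputs of one function. The lexicon UNIT_TYPE, the table RATE_TYPE, the unit types "count" and "volume", and the function annotate_rate are all assumptions made for this illustration; the pairing of unit types with category names is one plausible reading of the examples above, and pairs that are not listed are simply left unassigned.

```python
# Hypothetical lexicon: unit word -> type of unit.
UNIT_TYPE = {
    "dollars": "money amount", "miles": "linear extent", "times": "count",
    "hour": "time", "day": "time", "gallon": "volume", "ounce": "weight",
}

# Hypothetical mapping from (measured type, per-unit type) to a category.
RATE_TYPE = {
    ("linear extent", "time"): "Speed",
    ("linear extent", "volume"): "Fuel-efficiency",
    ("count", "time"): "Frequency",
    ("money amount", "weight"): "Price-per-unit",
}

def annotate_rate(phrase, fine=False):
    """Coarse mark-up: just an NP. Fine mark-up: add unit types and category."""
    annotation = {"text": phrase, "category": "NP"}
    if fine:
        words = phrase.split()            # e.g. forty / miles / an / hour
        measured = UNIT_TYPE.get(words[1])
        per = UNIT_TYPE.get(words[-1])
        annotation["unit_types"] = (measured, per)
        annotation["rate_type"] = RATE_TYPE.get((measured, per))  # None if unlisted
    return annotation

print(annotate_rate("forty miles an hour"))
print(annotate_rate("forty miles an hour", fine=True))
```

The coarse annotation is enough for the parser-oriented purpose; the fine annotation adds exactly the information that the language-understanding purpose needs, without changing the bracketing.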

An obvious and important use of annotation is that of providing a layer of representation that is available for further analysis; in a sense, this amounts to regarding one person's annotation as another person's primary data. For example, phonological or orthographic transcription allows morphosyntactic analysis of a sample of speech more directly than the acoustic trace.

This property of layered annotation raises the issue, strongly associated with the late John Sinclair, that mistakes in one layer of annotation compound mistakes in higher layers. This point can be seen in the fact that descriptions of English syntax tend to accept the tokenization implied in the standard orthography, so that, for example, whose and another are treated as single units. Proposals about the clitic vs. suffix analysis of the possessive marker in English would be argued differently if whose were who's, making who the hell's fault is that? seem not so anomalous; and descriptions of the pattern that allows a mere twenty dollars, an extra five pages, an additional twenty dollars could be seen as incorporating another five pages (i.e., as an other five pages).

An analogous situation arises when phonological descriptions accept the tokenization or the set of paradigmatic categories implicit in the conventional segmental transcription for a language, even when describing speech produced by child speakers who have not yet acquired the phonological system of the language.

These pitfalls suggest that the sociology of developing annotation conventions might be an object of study in its own right. They also suggest that linguists should think flexibly about the types of things that should be considered annotations, so that conventions can be developed for how to link these things to the primary data. For example, it might be appropriate to think of responses from naive judges, elicited over the web using Mechanical Turk or the like, as a kind of annotation, in which case, it could be useful to develop standards for eliciting these judgments and tools for linking the responses back to the corpus of primary data that provided the stimuli. It might also be appropriate to think of skilled formant "correction" as a kind of annotation, in which case, there could be standards for "correcting" formants and associating the formant traces with the corpus, as in the development of the tools for the Origins of New Zealand English Project.


2. What are annotation standards and what are they for?

Some annotation is created to serve a single researcher's needs. If the annotation practices developed by this researcher, for a certain class of phenomena, are consistent, so much the better for this solo researcher. An issue of annotation standards arises when there is a need or opportunity for other researchers to work with the same data, or when researchers become interested in the same kinds of phenomena in other data samples, or in other languages, and want to be able to make generalizations.

An annotation standard, then, is a set of conventions that is associated with a commitment to adhere to the conventions by a community of users. A standard can evolve gradually in a community of researchers who are working on similar problems in some language domain, so that assumptions about the analytic space converge in some way that promotes the natural emergence of infrastructure for developing, transmitting, and codifying a standard. A standard can also arise from adopting a tool that brings with it assumptions about the data being analyzed that can be met by adhering to the standard.

No matter the path of convergence, however, annotation standards cannot be defined without reference to a shared set of assumptions and an associated community of analysts. As a corollary, a set of annotation conventions cannot be evaluated (or standardized) without at least an initial reference to a community of analysts and users.

Within such a community, annotation adds value by spreading the workload of providing agreed-upon analyses to a larger set of shared primary data than can be analyzed by a researcher working alone. Looking outward from the core community, annotation provides expert analyses for others who might not otherwise have access to the primary data.

This understanding of the relationship between the analyses of the data that are to be encoded in the annotations (the "model") and the primary data themselves leads to the following characterization of what annotation standards are for and how they can be evaluated.

2.1. Within the original community of developers and users ...

  • It is critically important to ground any annotation schema in terms of the particular model of the phenomena being annotated, and to develop it in relationship to the question being asked and the shared assumptions of the community about the phenomena being observed and modeled. Within this initial community, the annotations evolve as a set of "common law" rules about what the observed phenomena are. These rules will specify how the data should be segmented into tokens, how the tokens will be labeled in terms of an agreed upon inventory of contrasting types and relationships, and how relationships among tokens will be parsed and labeled.
  • A set of annotation conventions (rules), therefore, can only be evaluated first in relationship to the initial user community and their questions. While there might be domain-specific evaluation criteria, defined relative to independent observational tools, one critically important evaluation criterion that is common to all domains is the reliability of the annotations. Is the annotation consistent within and between annotators? If an annotator observes the same data twice in different independent annotation sessions, are the analyses (the annotations) the same? Similarly, if two different annotators observe the same data independently, do they arrive at the same analyses (tokenization and labelings)?
  • To achieve consistency typically requires a long iterative process of "common law" development, during which two or more users annotate some set of data separately, then convene to discuss and adjudicate the disagreements, formulate new principles to cover the cases discussed, and then start a new round of independent annotation, comparison, discussion. The initial users will need to agree on the degree of consistency that is needed to accomplish their goals. An ancillary set of "laws" will need to be developed to reliably differentiate between disagreements that arise from intrinsic ambiguity and disagreements that arise because the conventions and annotation tools are not yet at the point of required coverage/stability/useability.
  • It can be useful in the process of developing, evaluating, and using an annotation standard to work on the different aspects separately. For example, if at some stage of development (or in some subcommunity of users), the tokenization is more reliable than the identification of relationships, the annotations might be adequate for some subset of the initial purposes, but not others. It is then important to develop conventions for tagging corpora or parts of corpora for relevant facts such as which version of the annotation scheme was used or the level of experience and/or training of the annotator(s).
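
The consistency questions above can be made concrete with a small calculation. The sketch below computes raw agreement between two annotators and Cohen's kappa, a chance-corrected measure that is widely used for this purpose but is not prescribed by this report; the function names and the part-of-speech labels are invented for illustration. The same functions apply to intra-annotator consistency if the two label sequences come from one annotator in two independent sessions.

```python
from collections import Counter

def observed_agreement(a, b):
    """Proportion of items on which the two label sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: both annotators pick the same label independently.
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Invented labels from two annotators for the same eight tokens.
annotator_1 = ["N", "V", "N", "N", "V", "N", "V", "V"]
annotator_2 = ["N", "V", "N", "V", "V", "N", "N", "V"]

print(observed_agreement(annotator_1, annotator_2))
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on six of eight tokens (0.75), but half of that agreement would be expected by chance given how often each label was used, so kappa is 0.5. Which threshold counts as consistent enough remains a decision for the user community, as noted above.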
2.2. When extending to a new community of users ...

  • Annotations developed within a particular community of initial developers and users might be extended to another community of users who are addressing different questions and may have different model assumptions. The goodness of a standard then becomes a product not just of the initial developers/users, but also of the flexibility/ingenuity of the later users of annotated data.
  • The needs of various communities are in some cases overlapping (both phoneticians and sociolinguists may seek standards for phonetic annotation) and in other cases conflicting (a fieldworker may want language-specific idiosyncratic part-of-speech categories in interlinear glossing whereas a typologist may want agreed-upon cross-linguistically motivated categories). To get a sense of the potential disparities among different communities of users, we listed the first sets that came to mind:
    • sociolinguists
    • computational linguists and NLP practitioners
    • language acquisition specialists
    • psycholinguists and laboratory phonologists
    • specialists in speech and language disorder
    • fieldworkers and language typologists
    • stylometrists, disputed-author researchers, etc.
    • educators evaluating text complexity, comprehensibility
  • Even within a single later-adopting community, however, the questions and needs may differ in relationship to different types of primary data. Here we can differentiate at least among (1) full video recordings, (2) audio-only recordings, and (3) spoken utterances that were recorded only as text in the first level of "annotation" of the fieldworkers' written transcription. The initial tokenization/labelling of each of these primary data types may be an orthographic transcription, and in the case of type (3), the initial tokenization/labelling then becomes the only record. For some communities of users, the models and questions related to such transcriptions might differ dramatically from the models/questions that can be applied to data that are (4) originally written texts.
  • In adopting (and adapting) a set of annotation conventions to a new set of questions and applications, then, it is again useful to ask: What aspects of the annotations can we usefully tease apart and evaluate/adopt/develop separately?

3. What does it take to be a good annotation standard?


3.1. Best practices (themes)


The associated properties that define a good annotation standard can be grouped into a few overarching themes and associated questions about the annotation conventions:

  • Consistency/Reliability
    • What is the history of the annotation conventions? Did they evolve in careful, iterative rounds of (1) discussion of the goals of the annotation set, (2) independent annotation of a suitably diverse corpus of primary data by a large number of annotators, (3) calculation of inter-annotator agreement, and (4) discussion of points of agreement and disagreement and incremental revision?
    • Are there standards / mechanisms for continued calibration of consistency within and between annotators?
    • What are the published intra-annotator and inter-annotator consistency rates?
    • Are the conventions designed to allow transparent, easy, reliable "back-tracking" to the primary data, via time stamps or sequence position nodes?
  • Useability
    • Is there good (accessible and extensible) documentation?
    • Is there a suitably diverse and continuous community for teaching (and testing the ability of) new annotators / users?
    • Are there good tools for annotating and using the annotations, and good community mechanisms for building / extending / sharing tools?
    • Is there a reliable connection between the annotations and the primary data that allows the user to track back to the data to check a suitable subset of the annotations?
    • Is the design of the annotation schema such that annotations can be used as reliable tags back into the primary data, for easy queries using standard query tools?
  • Resilience
    • How does the standard deal with inter-annotator disagreement? Is information about disagreements preserved so that they can be analyzed in the course of developing the next version, to determine whether there are common cases of inherent ambiguity that need to be marked, or new cases that the conventions do not yet cover?
    • Are there principled mechanisms for marking degree of uncertainty about difficult or ambiguous cases? (See the CHAT manual for a thoughtful discussion of this question.)
    • Are there graceful ways of choosing to provide more or less specific degrees of analysis?
    • Are there good mechanisms for providing and getting the most out of partial annotations?
    • Are there robust ways of extending partial annotations to more of a corpus and of verifying and modifying the annotations of a corpus?
    • Relatedly, are there good mechanisms for keeping track of which parts of a corpus are in what state of annotation and verification / modification?
  • Accountability/Responsibility
    • Again, are there robust mechanisms for maintaining transparent links back to the primary data? And are these mechanisms ethical? Do they ensure the explicitly or implicitly agreed-upon degree of confidentiality of the person or people who produced the primary data (or who produced some subset of the annotations by providing naive judgments)? The issue of confidentiality is especially vexing when the primary data are video recordings. (See sections 4.3 and 4.4.)
    • Do the standards encourage (or even allow) later "consumers" to credit the annotations in publication?
    • Are the annotators (or the annotator level) for different parts of a corpus or different aspects of the annotation identified in a way that allows later users to partition the annotations -- e.g., into annotations by native speakers versus non-native speakers?
  • Interoperability
    • Can the annotation be validated and used in different tools or computational models?
    • Is the logical structure of all three aspects of the annotation conventions transparent, and transparently related to the documented descriptions of the annotated phenomena?
    • Also, is it possible to translate to and from some other annotation conventions that have been used for this set of phenomena, in a way that makes it possible to share data across different analytic frameworks?
    • Are the formats for encoding the different aspects of the annotation conducive to using the annotations for purposes different from the originally intended ones?
    • Are the definitions of the annotation elements freely available and stored in an open format?
    • Are any requisite tools for annotating or using the annotations free open source?
  • Extensibility/Adaptability
    • Can the annotation schema be extended to annotating utterances in other styles from the utterance sets for which it was developed? Can the annotation conventions be used for utterances produced by other speaker types? Can they be extended to (or readily adapted for) annotating data from other dialects, other languages, and so on?
    • Is there a solid and suitably diverse core of users (and "maintainers") to allow the standard to evolve and change in response to user feedback and/or to new needs?
    • Is there a sensible consensus or mechanism for deciding when to "publish" a new version?
    • Are there good standards and mechanisms for versioning? For example, is there a robust way to permanently associate meta-data about which version of the conventions was used in annotating (different parts of) any corpus? Are there tools for keeping track of who the taggers were at different levels / times, and are there tools (or at least a "crib") for how to "translate" across corpora and/or across levels of annotation as the standard evolves and expands?

3.2. Best practices ("tangibles")


The hallmarks of an emerging annotation standard therefore begin with these two important social characteristics:
  • community
    • There is a sustainably large and diverse community of core users/maintainers.
  • history
    • There is a history of effective dissemination of the conventions and recruitment of new core users.
Other more tangible accoutrements of annotation standards that have exemplified the best practices identified above include:
  • documentation
    • The conventions are adequately and fully documented, in a "reference manual" that can be consulted easily by experienced users.
  • training manual
    • Ideally, there is also a separate, well-tested training manual (or a standard syllabus for training courses) that leads new users through a graduated sequence of more and more difficult examples culled from data that were annotated in developing the documentation and/or the reliability metrics.
  • inter-annotator reliability metrics
    • There are published records of inter-annotator consistency tests. Ideally, these tests differentiate between disagreements that stem from intrinsic ambiguity and disagreements that have other, remediable sources such as inadequacies in coverage, deficiencies in the documentation, or the like.
  • computational tools
    • Members of the community have invested in developing computational tools that increase the reliability of the annotations.
  • conventions for metadata
    • The community has developed mechanisms for protecting confidentiality of the producers of the data, crediting of the provenance of the annotations, and so on.
  • conventions for responsible maintenance
    • A set of conventions, or a more elaborate institutional framework, has also emerged, for responsible maintenance of the conventions, for continued elaboration of the documentation, and for updating of the training manual (or re-accreditation of the training courses).

4. The state of the art (with some case studies)

In this section, we illustrate the considerations outlined in Section 3 by briefly reviewing the development and current state of annotation standards in four very broad areas. These reviews highlight two factors that promote or hinder the development of reliable and resilient standards.

The first is the degree to which the "semantics" of the target phenomena are naturally constrained. At one extreme is the case of phonological annotation of consonants and vowels of spoken languages. Here, the aerodynamics of the vocal-auditory channel tightly constrains the tokenization and types of relationships at the lowest level of the prosodic hierarchy in an extremely robust way. An example at the other extreme is the analysis of grammatical constructions, where it is difficult to even imagine what boundaries could be imposed by nature on what a language-specific morphosyntactic construction can mean.

The second factor is the age of the language type and/or of the systematic linguistic investigation of the phenomena across languages. Spoken languages may have existed as long as there have been modern Homo sapiens, and the "annotation" of consonants and vowels goes back to the first alphabetic writing systems. By contrast, the systematic study of signed languages has a much shorter history.

4.1. Phonology of spoken languages

4.1.1. Annotation systems for vowels and consonants

As noted above, tokenization and other aspects of the analysis of categories at the lowest level of the prosodic hierarchy for spoken languages are tightly constrained by the (psycho)physics of the human articulatory and auditory systems. As a result, it has been relatively easy to develop conventions for annotating utterances of spoken languages at this level of this part of the grammar, using an alphabetic analysis, and the International Phonetic Alphabet is a premier example of a well-developed annotation standard. It is maintained and updated by the International Phonetic Association, which was established in 1886 and today is associated with the International Congress of Phonetic Sciences, a meeting held every four years which attracts several thousand attendees.

The International Phonetic Alphabet has a handbook, which documents the annotation conventions and specifies a well-codified format for presenting a catalog of the consonant and vowel inventories of a spoken language variety using the IPA. (There is a long-standing section of the Journal of the International Phonetic Association devoted to publishing such language-specific schema.) The most recent version of the handbook was published in 1999, after a conference convened in 1989 to review the coverage of the categories in the IPA consonant and vowel charts and the lists of other symbols for categories that do not fit neatly into the tokenization and paradigmatic features that are encoded in the consonant and vowel charts. In between the conference and the publication of the revised Handbook, the Journal of the International Phonetic Association published a report on deliberations of the conference as a whole (JIPA, 19: 67-80) as well as reports from working subgroups charged with deliberating on various more focused issues such as "Computer Coding of IPA symbols" (JIPA, 19: 81-82) and "the best means of transcription of disordered speech" (JIPA, 24:95-98) and correspondence from members of the association commenting on the proposed revisions and deliberations at the conference (e.g., JIPA, 20: 22-32).

While there is no official training manual, there is a long history of teaching the annotation conventions (and the phonological analyses on which they are based), which predates the founding of the International Phonetic Association. For example, Henry Sweet's A handbook of phonetics, published in 1877, includes vowel and consonant tables that are organized in terms of the same dimensions of analysis as the modern IPA chart -- i.e., openness, place, rounding for vowels and place, manner, laryngeal properties for consonants. The same basic approach is also taken in most subsequent textbooks, including Peter Ladefoged's well-known A course in phonetics, which is still used in training students in the annotation of vowels and consonants in many departments of phonetics, linguistics, logopedics, and speech & hearing science.

There is also a history of research on inter-annotator consistency rates for segmental transcription, and on the factors that affect transcription consistency. In general, it is easier to be consistent the closer the annotation is to a "broad" phonemic transcription. For example, Eisen (1993) reports complete agreement among three transcribers of only 50% for a "narrow" transcription, even when distinguishing among only ten "major class" categories such as "voiced plosive". When the same transcribers were asked instead to note only segments that deviated from an automatically inserted broad "dictionary" form transcription, consistency improved to 85%.
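The "complete agreement" measure mentioned above can be made concrete with a small sketch. It assumes that the transcriptions have already been aligned segment by segment (which is itself a nontrivial step in practice), and the labels and transcribers below are invented for illustration; they are not Eisen's data.

```python
from typing import Sequence


def complete_agreement_rate(transcriptions: Sequence[Sequence[str]]) -> float:
    """Return the share of segments on which every transcriber chose the same label.

    Assumes one label per segment and identical segmentation across transcribers.
    """
    if not transcriptions:
        raise ValueError("need at least one transcription")
    lengths = {len(t) for t in transcriptions}
    if len(lengths) != 1:
        raise ValueError("transcriptions must cover the same number of segments")
    n_segments = lengths.pop()
    if n_segments == 0:
        raise ValueError("transcriptions are empty")
    # A segment counts as agreed only if all labels for it are identical.
    agreed = sum(1 for labels in zip(*transcriptions) if len(set(labels)) == 1)
    return agreed / n_segments


# Invented labels for four segments from three hypothetical transcribers.
transcriber_a = ["voiced plosive", "vowel", "nasal", "vowel"]
transcriber_b = ["voiced plosive", "vowel", "nasal", "vowel"]
transcriber_c = ["voiceless plosive", "vowel", "nasal", "vowel"]

rate = complete_agreement_rate([transcriber_a, transcriber_b, transcriber_c])
print(f"complete agreement on {rate:.0%} of segments")
```

Complete agreement is a strict measure: a single dissenting transcriber removes a segment from the count, so the rate tends to fall as more transcribers or finer label sets are added.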

4.1.2. Where the analysis breaks down

The different levels of reliability for "broad" versus "narrow" transcription hint at some of the things that affect inter-annotator reliability. Reliability is highest when the primary data are clean recordings of careful fluent utterances produced by adult native speakers of a dialect for which there is a consensus phonemic analysis that can be the basis for the tokenization and consonant and vowel label set, and when the goal of the annotation is to produce a "broad" phonemic transcription as the basis for morphological analysis or the like. Reliability is lower when recordings are noisy, when the primary data are casual or dysfluent utterances, when the dialect of the speaker(s) is an understudied variety that differs from the dialect on which the IPA description is based, or when the goal is to produce a "narrow" transcription as the basis for sociophonetic analysis of variation in the speech community or the like. In the latter cases, reliability is improved if annotators are well trained in phonetics (not just in "classical phonemics") and have recourse to tools such as the interactive spectrographic display window in the Praat signal analysis tool. However, no amount of phonetic training will resolve the inherent unreliability of shoehorning "sub-phonemic" paradigmatic variation and "suprasegmental" parsing differences into a phonemic segmental model.

Alphabetic annotation of pre-school children's speech poses special challenges, then, because it assumes (paradigmatic and syntagmatic) phonological structures that may not be in place until the child is much older. See the discussion of this point in Pye, Wilcox, and Siren (1988), Hewlett & Waters (2004), Edwards and Beckman (2008), among many others. Pye and colleagues suggest, for example, that points of low inter-transcriber reliability can indicate places where the standard phonemic analysis of the target language is particularly inappropriate for the child's developing phonological system. Edwards and Beckman (2008) suggest that "narrow" transcription is best supplemented in this case by experiments eliciting perceptual responses from phonetically untrained native speaker/listeners. All of these researchers remind us that tokenization and paradigmatic differentiation at the level of vowels and consonants is a product of the interaction of the natural constraints from the aerodynamics with the exigencies of lexical contrast in dense neighborhoods. The phonemic analysis that is the basis for the IPA conventions is less compelling for a speaker whose lexicon is still too small to have very dense phonological neighborhoods.

Prosodic phenomena such as stress and syllable structure, and intonational phenomena such as the melodies that group syllables together into phrases and the like, pose a related challenge for phonological annotation. Because there is no comparably compelling natural basis for tokenization of melodic events, spoken languages are much more diverse in the ways in which utterances are structured above the leaf nodes of the prosodic hierarchy. The Working Group on Labeling of Suprasegmentals at the 1989 conference that led to the current IPA handbook recognized this by deciding to recommend no standard annotation conventions for intonation (Bruce, 1989). A basic principle of the ToBI annotation framework also says that "phonetic transcription" of prosody and intonation is impossible. See Beckman, Hirschberg, and Shattuck-Hufnagel (2005) for further explication of this point and the implications for the development of annotation conventions for prosody and intonation. See also Pitrelli et al. (1994) for the remarkably good inter-annotator reliability rates that nonetheless can be achieved when conventions are specific to a particular dialect.

4.2. Morphosyntax and semantics

4.2.1. Leipzig Glossing Rules

The Leipzig Glossing Rules, which build on Lehmann (1983), are a de facto standard for morphosyntactic glossing. They represent a light-touch codification of previous best practice. They allow for analyses of varying levels of granularity, subject only to the requirement that the segmentation in the glossing line must match that in the source language line. There is a standard set of abbreviations, which could usefully be extended. Documentation consists of a short document, with examples, freely available online.

In their current version, the Leipzig Glossing Rules provide the means and the options for linguists of different persuasions to give adequate morphosyntactic glosses. A next step would be to suggest how users should characterize their own usage in a particular publication. At the most obvious level, all additional abbreviations should be specified. Then, we should note that the abbreviations are mainly for feature values (for ‘singular’, ‘feminine’ and so on); it would be good practice to specify which feature each is a value of. Normally this is obvious, but occasionally it can cause confusion. Finally, since the rules can be used for different purposes, it is helpful if users clarify and spell out their assumptions. For most purposes, when glossing "yesterday we bid three hundred pounds for that horse", we would gloss "bid" as past tense. The information is derived not from the form itself but from the time adverbial, since the form "bid" could be present or even imperative. For writing about tense, word order, argument structure, and so on, this solution is fine. If writing about syncretism, however, this ambiguity of the form would matter, and would need to be indicated appropriately. More generally, the annotator frequently has to be selective in the level of detail included in the morphosyntactic glossing. This means that we cannot expect that different linguists would provide identical annotations. But we should aim for a greater level of consistency than we often find. The rules offer the alternatives, but good practice requires us to choose consciously, to specify the choices made and to apply them consistently.

In terms of tools, it would be useful to have a tool that would check annotations for internal consistency and for any unintentional departures from the conventions.
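As a rough illustration of what such a checker might do, the following sketch tests two things: that the morpheme and clitic boundaries in each glossed word match those in the corresponding source word, and that every upper-case category label has been declared. The function name, the declared abbreviation set, and the simplified Turkish example are assumptions made for illustration. The sketch also assumes that lexical glosses are written in lower case, and it ignores many conventions that a real tool would need to handle.

```python
import re

BOUNDARIES = re.compile(r"[-=]")   # morpheme (-) and clitic (=) boundaries


def check_gloss(source, gloss, declared):
    """Return a list of problems found in one source line and its gloss line."""
    problems = []
    source_words, gloss_words = source.split(), gloss.split()
    if len(source_words) != len(gloss_words):
        problems.append(
            f"word count differs: {len(source_words)} vs {len(gloss_words)}"
        )
    for position, (src, gl) in enumerate(zip(source_words, gloss_words), start=1):
        # The sequence of boundary symbols must be identical in both lines.
        if BOUNDARIES.findall(src) != BOUNDARIES.findall(gl):
            problems.append(f"word {position}: boundaries differ ({src} / {gl})")
        # Upper-case elements are treated as category labels to be declared.
        for element in re.split(r"[-=.]", gl):
            if element.isupper() and element not in declared:
                problems.append(f"word {position}: undeclared abbreviation {element}")
    return problems


# Hypothetical abbreviation list for one publication.
declared = {"PL", "1SG", "LOC"}

ok = check_gloss("ev-ler-im-de", "house-PL-1SG-LOC", declared)
missing_morpheme = check_gloss("ev-ler-im-de", "house-PL-LOC", declared)
unknown_label = check_gloss("ev-ler-im-de", "house-PLU-1SG-LOC", declared)
```

Here `ok` is an empty list, while the other two calls each report one problem. A check of this kind cannot tell whether a gloss is linguistically right; it only flags departures from the author's own declared conventions.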

4.2.2. FrameNet Annotation Criteria

Certain annotation practices followed in FrameNet can be seen as arbitrary but motivated. One of the central notions of FrameNet is that of Valence: general descriptions of the combinatory possibilities of individual lexical heads (verbs, nouns, adjectives, and some prepositions), expressed in both syntactic and semantic-role terms. Valence descriptions are derived automatically from a body of annotated sentences, so it is obviously necessary to agree on how the annotations are structured. The need to pair syntactic arguments with semantic roles motivates our decision to include the "markers" of a phrase with the constituent. For example, if we wish to identify the speaker and the content of an announcement from the phrase the announcement by the governor of her decision to resign, those two elements are blocked off as [by the governor] and [of her decision to resign]: some projects would ignore the prepositions and select only the NPs in those expressions. Similarly, in the governor's announcement that she intended to resign the labels would be assigned to [the governor's] and [that she intended to resign], including both the genitive suffix and the that-clause. And similarly, then, in [the governor] announced [to the world] [that she intended to resign]. A differently motivated project might leave out the structure-markers in order to represent the "content" more faithfully; FrameNet includes these in order to match semantic and syntactic segmentation, allowing users to recognize the structural elements.

It will be seen that the annotations themselves do not distinguish markers that are determined by the grammar (the possessive ending in the governor's announcement), from those that are determined by the meaning of the PP as a whole (the preposition in under the table), or by the governing lexical head (the preposition on in we can depend on Harry). Such information is recoverable from the grammar and the lexicon, but it is not part of the annotation.
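The way valence descriptions can be derived from annotated sentences may be sketched as follows. This is not FrameNet's data format or software: the `Span` class, the role names, and the phrase-type labels are illustrative assumptions, and the "Addressee" label for [to the world] is a guess, since the text above names only the speaker and the content. The spans include their markers, as described above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str          # annotated constituent, markers included
    role: str          # semantic role label (illustrative name)
    phrase_type: str   # syntactic label (illustrative name)


# The three examples discussed above, one list of spans per example.
ANNOTATED = [
    [Span("by the governor", "Speaker", "PP[by]"),
     Span("of her decision to resign", "Content", "PP[of]")],
    [Span("the governor's", "Speaker", "Poss"),
     Span("that she intended to resign", "Content", "that-clause")],
    [Span("the governor", "Speaker", "NP"),
     Span("to the world", "Addressee", "PP[to]"),
     Span("that she intended to resign", "Content", "that-clause")],
]


def valence_patterns(examples):
    """Count each distinct sequence of (role, phrase type) pairs."""
    return Counter(
        tuple((span.role, span.phrase_type) for span in spans)
        for spans in examples
    )


def realizations(examples, role):
    """Count the phrase types that realize one role across all examples."""
    return Counter(
        span.phrase_type
        for spans in examples
        for span in spans
        if span.role == role
    )


patterns = valence_patterns(ANNOTATED)
content_forms = realizations(ANNOTATED, "Content")
```

Because the preposition is kept inside the span, the syntactic label can record it (for instance `PP[by]` versus `PP[of]`), which is what lets a summary of this kind pair each semantic role with the way it is marked.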

The standardization of this FrameNet practice has various consequences. All annotators on the Berkeley project agree to use it in their work, and various FrameNet or FrameNet-like projects in other languages have agreed to follow the same or analogous conventions. These are Spanish FrameNet, Japanese FrameNet and the SALSA project in Germany. Furthermore, Professor Hiroaki Sato of Senshu University in Tokyo manages a browser of FrameNet data ("FrameSQL", http://sato.fm.senshu-u.ac.jp/fn2_13/notes/index.html) and is developing a way of pairing valence patterns across the various languages that have FrameNet databases; the comparisons work best if all users treat function markers in the same way.

4.2.3. The Penn Treebank

The Penn Treebank is a collection of English texts that have been grammatically annotated for part of speech, grammatical function (predicate argument relations), and constituency (Marcus et al., 1993; Marcus et al., 1994; Taylor et al., 2001). Although the Penn Treebank contains material taken from four sources (the Wall Street Journal, the Brown Corpus, the Switchboard corpus, and ATIS), its collection of annotated Wall Street Journal newspaper articles is the most heavily utilized, to such an extent that many researchers think of it as the Wall Street Journal corpus. It is distributed under a commercial license by the Linguistic Data Consortium (http://www.cis.upenn.edu/~treebank/). The corpus is distributed as a collection of texts with accompanying stand-off annotation. The Wall Street Journal section of the corpus, for example, consists of 2499 articles (totalling approximately a million words) published in the Wall Street Journal during a three-year period in the eighties. For each article, there are a number of plain text files that contain various types of annotation (one for part of speech, another for bracketing) as well as a "master" annotation file that contains all of the annotation merged together.
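To give a sense of how little machinery plain-text bracketing requires, here is a minimal sketch of a reader for bracketed trees of the kind commonly used for treebank parses. The example sentence and its analysis are invented, and the sketch makes simplifying assumptions: every bracket carries a label, and words contain no parentheses or spaces. The distributed files may include details that it does not handle.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def parse_node():
        nonlocal position
        if tokens[position] != "(":
            raise ValueError(f"expected '(' at token {position}")
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[position])   # a word at a leaf
                position += 1
        position += 1                                # consume the ")"
        return (label, children)

    tree = parse_node()
    if position != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def tagged_words(tree):
    """Yield (word, part-of-speech tag) pairs from the leaves, left to right."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_words(child)


# Invented example, not taken from the corpus.
tree = parse_bracketed("(S (NP (DT the) (NN governor)) (VP (VBD resigned)))")
words = list(tagged_words(tree))
```

The same nested structure yields both the constituency (the tuples) and the part-of-speech layer (the labels immediately above the words), which is one reason a merged file can stand in for the separate annotation files.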

The Penn Treebank is an important point of reference in grammatical annotation given its success. Not only has it become an important resource in computational linguistics (much like WordNet or CELEX), it has also inspired a large number of similar projects for other languages--e.g., Chinese (Xue et al., 2002), Czech (Hajicova, 1998), Spanish (Navarro et al., 2003), and German (Brants et al., 2002), to name only a few.

The Penn Treebank Project has a number of strengths that help explain its popularity. Chief among these, of course, is that by providing a non-trivial amount of annotated newspaper text it managed to scratch an itch felt by the community of researchers interested in computational linguistics, natural language processing, and related fields. But in addition the Penn Treebank Project designed the corpus in such a way that it could be easily used: it consists only of plain text files, it is well documented, and it is versioned.

Despite its success, the Penn Treebank has a number of weaknesses. One of these is the absence of a standard toolkit for its manipulation. This is a shortcoming that has affected its development and impedes its adoption by those lacking the resources to develop their own toolkits for its manipulation. (The creation of various open source toolkits has ameliorated this problem to some extent, but it is a recent development compared to the age of the Penn Treebank Project.) Another is the model used to describe grammatical functions, which adheres to an old-fashioned Government and Binding analysis that posits, among other things, traces for movement. Although the annotation is couched in a multistratal theory of grammar, this has not hindered its use in monostratal theories of grammar, such as LFG (Frank, 2000).

In fact, it is unrealistic to expect a great deal of standardization in annotation for grammatical information given the highly contentious nature of grammatical theory itself. The difficulties inherent in this problem can be seen in attempts to develop treebanks in languages with more flexible word order and discontinuous constituency, such as German. Although it is possible to annotate German using a grammatical theory that posits traces, it leads to inelegance, and German treebanks have as a result departed from the Penn Treebank's grammatical model. (The NeGra and TiGer annotation schemes use graphs with crossing edges rather than simple context-free trees.)
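The difference between a context-free tree and a graph with crossing edges can be illustrated by representing each constituent as the set of token positions that it covers. In the sketch below, the sentence, the labels, and the analysis are assumptions made for illustration (punctuation is omitted); they are not taken from the NeGra or TiGer schemes.

```python
def is_discontinuous(token_positions):
    """True if the covered token positions do not form one unbroken run."""
    ordered = sorted(token_positions)
    if not ordered:
        raise ValueError("a constituent must cover at least one token")
    return ordered[-1] - ordered[0] + 1 != len(ordered)


# "Ein Mann kam, der lachte" ("A man came who laughed"), with the relative
# clause extraposed past the verb. The analysis is illustrative only.
TOKENS = ["Ein", "Mann", "kam", "der", "lachte"]
CONSTITUENTS = {
    "S": {0, 1, 2, 3, 4},
    "NP": {0, 1, 3, 4},      # "Ein Mann ... der lachte", interrupted by "kam"
    "S-REL": {3, 4},
}


def yield_of(label):
    """Return the words covered by a constituent, in surface order."""
    return [TOKENS[i] for i in sorted(CONSTITUENTS[label])]


discontinuous = [
    label for label, positions in CONSTITUENTS.items()
    if is_discontinuous(positions)
]
```

Here only the `NP` is discontinuous. A bracketed context-free representation cannot express it directly, and would need a trace or a similar device instead.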

4.2.4. The difficulty of developing annotation systems for grammatical constructions

There are well-known tree-banks that offer syntactic parses of all of the sentences in a sample, meeting certain levels of adequacy, such as the Penn Treebank. There are various levels of part-of-speech tagging for large corpora, such as the British National Corpus, that are for the most part successful. But a complete record of the special grammatical constructions in a text does not seem feasible.

For research purposes it should be possible to tag (say) all comparative sentences in a text, identifying the scales and the phrases that directly or indirectly indicate the entities being compared. It should be in principle possible to identify all idioms or tight collocations in a text, however long this might take. It should be possible to notice constructions with certain peculiarities for the sake of assembling examples for further study, such as, for English, the pattern that has a degree-modified adjective followed by an indefinite NP marked by the preposition of. (Do you need {[this big] [of a box]}?) But the expressions that represent individual constructions are frequently tightly intertwined, and the effort to work out the nature of such integration on a large scale is not likely to be possible. A sentence like He's in no bigger of a hurry than you are exhibits a comparative structure (bigger ... than you are), a collocational idiom (in ... a hurry, one of the few uses of hurry as a noun), the puzzling structure with the of-phrase (bigger of a hurry), a special minimizing use of the word no with a compared adjective (consider the difference between he's not smarter than your mother and he's no smarter than your mother), and the particular form of the than-clause (than you are vs. than you, than expected, than ever, etc.). Representing the working of all of these constructions and their articulation is not to be expected.

Research that collects and explores examples of grammatical constructions, idioms, collocations, and multiword expressions in general, and illustrates their properties one at a time, must be an essential task for linguistics and computational linguistics, both for grammar writing and as a way of producing learning corpora for machine-learning techniques to improve syntactic parsers. But since many of the most important constructions cannot easily be associated with individual words in a sentence, or with specific nodes in a parse tree, there is little likelihood of acquiring large-scale accurate annotations of grammatical constructions, beyond familiar parsing and chunking of nonproblematic sentences, any time soon.

The problem is further compounded by the fact that cross-framework agreement on syntactic phenomena in general is not easy to achieve: dependency-based and constituency-based treatments are not always interconvertible; theoreticians who seek to minimize redundancy in their analyses would not see the same number of construction types in a given text as the grammarian who wishes to work with structures of finer granularity.

The proposal in this section favors rich analysis of small texts, together with extensive sampling of given constructional phenomena one at a time, or in small families of constructions. Such a combined approach should eventually lead to understanding the importance of non-core constructions and multiword expressions, classifying their variety, estimating how many of them there are, determining their relevance in profiling different genres, estimating their "density" in different kinds of texts, exploring the manner in which they are learned, and evaluating their contribution to measures of language complexity.

A sample of constructional annotations prepared within the FrameNet project can be seen at http://www.icsi.berkeley.edu/~hsato/cxn00/21colorTag/index.html.


4.3. Annotation of Gesture

The two factors relevant for annotation developments, as discussed above in the beginning of Section 4, are especially salient for the progression of standards in gesture annotation. We discuss gesture here as inclusive of gestures in sign languages as well as discourse-related gestures of the spontaneous type that accompany natural spoken language narratives.

First, in contrast to the tightly constrained audio-articulatory modality of spoken language systems, which are articulated using the vocal tract, gesture systems involve the visual-gestural modality. Traditionally, gestures are understood to be articulated through movements of the hands. In the case of sign languages, however, the category of gestures has recently been expanded to include certain movements involving the head, face, and shoulders. (See Neidle et al. 2000 for some discussion of nonmanual gestures in ASL, and Boyes-Braem, 2001 for the descriptions of 'mouth gestures' in multiple European sign languages.) Moreover, the visual space of gestures in sign languages exists as a complex continuum that involves the signed phonemic structures at the lowest level of the prosodic hierarchy, larger signed morphemes that engage in spatially bound agreement relationships with other signs, and 'nonmanual' gestures of higher-order prosodic structures. So while the conventions for annotating gestures in the traditional sense may be developed relatively straightforwardly using video analysis, developing annotation standards for gestures that handle the complexity of these relationships must also rely on multiple and dynamic layers of annotation of the types indicated in Section 1.

The Language Archiving Technology tool, ELAN, is a professional tool for the complex annotation of video and audio sources. "Tiers" are implemented for simultaneously displaying and annotating parallel levels of analysis. These can be nested for dependencies between, say, an independent parent annotation of morpheme-by-morpheme transcription, and referring tiers for the varying gestural articulators (e.g. hand vs. mouth). A full manual for ELAN is available online.
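The idea of nested tiers can be sketched with a small data structure. This is not ELAN's file format or programming interface: the class names are hypothetical, the annotation values are placeholders, and the rule that a dependent annotation must fall inside some annotation on its parent tier is one simplified reading of the dependency described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    start_ms: int
    end_ms: int
    value: str


@dataclass
class Tier:
    name: str
    parent: Optional["Tier"] = None
    annotations: list = field(default_factory=list)

    def add(self, start_ms, end_ms, value):
        """Add a time-aligned annotation, checking it against the parent tier."""
        if start_ms >= end_ms:
            raise ValueError("an annotation must have positive duration")
        if self.parent is not None and not any(
            p.start_ms <= start_ms and end_ms <= p.end_ms
            for p in self.parent.annotations
        ):
            raise ValueError(
                f"no annotation on tier {self.parent.name!r} contains this interval"
            )
        self.annotations.append(Annotation(start_ms, end_ms, value))

    def at(self, time_ms):
        """Return the values of all annotations that cover a given time."""
        return [a.value for a in self.annotations if a.start_ms <= time_ms < a.end_ms]


# Placeholder values; the tier names follow the hand vs. mouth example above.
gloss = Tier("gloss")
hand = Tier("hand", parent=gloss)
mouth = Tier("mouth", parent=gloss)

gloss.add(0, 1000, "SIGN-A")
hand.add(100, 900, "handshape-1")
mouth.add(200, 600, "mouth-gesture-1")

layers_at_500 = {tier.name: tier.at(500) for tier in (gloss, hand, mouth)}
```

Querying all tiers at one time point, as in `layers_at_500`, is what makes the parallel layers searchable together while keeping each articulator's annotation separate.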

Second, the systematic investigation of gestures as a linguistic phenomenon is a relatively new pursuit. This is true for both the annotation of gestures in spoken language narrative studies (such as with the McNeill Lab project) and the annotation of sign languages (see Neidle et al.'s SignStream Project, and the Berkeley Transcription System manual). The next section presents some considerations for developing unified conventions in the annotation of sign languages.


4.4. A unified annotation standard for signed languages

Finally, we address the need for a unified annotation of sign languages, as identified in this workshop group and codified within the desiderata for qualities of annotation standards in general. First, however, some prior discussion concerning the dissemination of tools and standards among the communities of practice (sign language linguists) is necessary. Although several tools currently exist for the scientific annotation of video data (see above for gesture annotation using Anvil, ELAN, and Cross-Modal Analysis of Signal and Sense, for example), and although the target users are a close-knit community, a widespread standard for sign language transcription and annotation is lacking.

4.4.1. What sign language annotation is, and what it is not

We begin by clarifying the purpose of a unified standard for sign languages. We emphasize that we do not aim to advocate a writing system of signs, nor do we intend for annotation to replace the primary linguistic video data with a derived set of data. Rather, annotation of sign languages should complement the data as a way of tagging and searching the data. The goal of a unified sign language annotation standard is to provide a shared platform of convention for collaborating across the various linguistic domains.

4.4.2. What a unified standard provides

For all linguists, annotation is paramount, and standards promote convergence. For sign language linguists, the annotation of primary video data poses several challenges. Few standards exist, for example, when it comes to annotating sign languages for fundamental linguistic phenomena such as pronominalization or indexicalization within the interlinear gloss. A more complex issue is how to transcribe certain non-manual features that are coarticulated with the manual signs: as functional labels (neg), as abbreviations of the action (head shake), or as an even further breakdown of the correlates involved? On the practical side, high-quality video data can require large amounts of storage, and tools for analyzing video require higher processor speeds and more memory. These obstacles to data sharing lead to hurdles for standardization of data annotation. Still, through further practice and dissemination, advances in sign language annotation provide the potential for increasingly consistent annotations and conventions. The ideal situation (projected solution) is one where sign language linguists, whether collaborating in an international workshop setting or via remote communications, have access to one mutually accessible standard that is extensible for all sign languages, interoperable across varying domains and models of interest, granular across levels of linguistic analysis, and practical for continuous usability.

5. Existing annotation standards and resources

This section lists the various annotation conventions and other resources for developing and discussing annotation standards that were suggested by participants of Cyberling09.

5.1. Phonetics and phonology

  • Phone segment tagging symbols, including both:
    • IPA and its various ASCII-fications, such as SAMPA, WorldBet
    • and language-specific phoneme-segment encodings such as ArpaBet for American English and the CSJ encoding for Japanese
  • The various ToBI conventions and similar conventions in other frameworks such as the ToDI conventions
  • The PhonBank conventions and tools

5.2. Morphosyntax

  • Leipzig Glossing Rules
  • ISO Morphosyntactic Annotation Format (MAF)
  • ISO Lexical Markup Framework (LMF): Homepage (with Publications and Tools)
  • Typecraft: a labeling system which, for any verb construction of a given language, provides a template for that construction type displaying its argument structure, in a fashion as transparent as possible. The template is constructed from a universally established inventory of labeling primitives.
    http://www.typecraft.org/tc2wiki/Verbconstructions_cross-linguistically_-_Introduction
  • tags for short-unit word (SUW) and long-unit word (LUW) in the CSJ

5.3. Syntax and semantics

5.4. Pragmatics and discourse structure

  • CHAT conventions for segmenting turns and identifying the participant and the setting
  • DAMSL
  • and other schemes documented and discussed at the 1998 DRI meeting such as:
    • intentional structure annotation (see Nakatani, Grosz, Ahn, and Hirschberg, 1995)

5.5. Gesture

5.6. Other resources

  • The EMU Speech Database System (see Cassidy and Harrington, 2001): sourceforge page
  • ISO TC37 SC4 - Language Resource Management : Homepage

6. References

  • Beckman, Mary E., Julia Hirschberg, and Stefanie Shattuck-Hufnagel. 2005. The original ToBI system and the evolution of the ToBI framework. In Sun-Ah Jun, ed. Prosodic Typology: The Phonology of Intonation and Phrasing, pp. 9-54. Oxford University Press.
  • Bird, Steven, and Jonathan Harrington (2001). Editorial: Speech annotation and corpus tools. Speech Communication, 33(1,2): 1-4.
  • Bird, Steven, and Mark Liberman (1999). Annotation graphs as a framework for multidimensional linguistic data analysis, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL, Madrid, Spain. http://acl.ldc.upenn.edu/W/W99/W99-0301.pdf
  • Bird, Steven, and Mark Liberman (2001). A formal framework for linguistic annotation. Speech Communication, 33(1,2): 23-60.
  • Bird, Steven, and Gary Simons (2003). Extending Dublin Core Metadata to support the description and discovery of language resources. Computers and the Humanities, 37, 375-388.
  • Bird, Steven, and Gary Simons (2003). Seven dimensions of portability for language documentation and description. Language, 79(3): 557–82.
  • Bow, Catherine, Baden Hughes, and Steven Bird (2003). Towards a general model of interlinear text. In Proceedings of the EMELD Conference 2003: Digitizing and annotating texts and field recordings. http://linguistlist.org/emeld/workshop/2003/proceedings03.html.
  • Brants, Thorsten, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER Treebank. In E. Hinrichs and K. Simov (eds.), Proceedings of the First Workshop on Treebanks and Linguistic Theories, pp. 24–41, Sozopol, Bulgaria.
  • Bruce, Gösta (1989). Report from the IPA working group on suprasegmental categories. Working Papers, Lund University, Department of Linguistics, 35: 15-40.
  • Brugman, H., P. Wittenburg, S. C. Levinson, and S. Kita. Multimodal annotations in gesture and sign language studies. In M. Rodriguez González & C. Paz Suárez Araujo, eds., Third international conference on language resources and evaluation (pp. 176-182). http://www.mpi.nl/institute/research-groups/language-and-cognition-group/publications
  • Cassidy, Steve, and Jonathan Harrington (2001). Multi-level annotation in the Emu speech database management system. Speech Communication, 33 (1,2): 61-77.
  • Comrie, Bernard, Martin Haspelmath and Balthasar Bickel. (2004, revised 2008). The Leipzig Glossing Rules. Available at:
    http://www.eva.mpg.de/lingua/resources/glossing-rules.php
  • Edwards, J., & Beckman, M. E. (2008). Methodological questions in studying phonological acquisition. Clinical Linguistics and Phonetics, 22(12): 939-958.
  • Eisen, B. (1993). Reliability of speech segmentation and labelling at different levels of transcription. Proceedings of the 3rd European Conference on Speech Communication and Technology, Vol. 1, pp. 673-676.
  • Frank, A. 2000. Automatic F-Structure Annotation of Treebank Trees. In M. Butt and T. H. King (eds.), The Fifth International Conference on Lexical-Functional Grammar, The University of California at Berkeley, July 19-20 2000, CSLI Publications, Stanford, CA.
  • Gabbard, Ryan, Seth Kulick, and Mitchell Marcus. 2006. Fully parsing the Penn Treebank. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, New York, NY, 184–191. Morristown, NJ: Association for Computational Linguistics.
  • Hajicova, E. 1998. Prague Dependency Treebank: From Analytic to Tectogrammatical Annotation. In Proc. TSD’98.
  • Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn Treebank grammar. Proceedings of COLING/ACL98: Joint Meeting of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics, Montréal, Canada, 699–703. Morristown, NJ: Association for Computational Linguistics.
  • Hewlett, Nigel and Waters, Daphne (2004). Gradient change in the acquisition of phonology. Clinical Linguistics & Phonetics,18 (6): 523-533.
  • Ide, N. and Suderman, K. (2007). GrAF: A Graph-based Format for Linguistic Annotations. Proceedings of the Linguistic Annotation Workshop, held in conjunction with ACL 2007, Prague, June 28-29, 1-8.
  • Ide, N., Romary, L. (2007). Towards International Standards for Language Resources. In Dybkjaer, L., Hemsen, H., Minker, W. (Eds.), Evaluation of Text and Speech Systems, Springer, 263-84.
  • Ide, N., Romary, L. (2006). Representing Linguistic Corpora and Their Annotations. Proceedings of the Fifth Language Resources and Evaluation Conference (LREC), Genoa, Italy.
  • Ide, N., Romary, L. (2004). International standard for a linguistic annotation framework. Journal of Natural Language Engineering, 10:3-4, 211-225.
  • Ide, N., Romary, L. (2004). A Registry of Standard Data Categories for Linguistic Annotation. Proceedings of the Fourth Language Resources and Evaluation Conference (LREC), Lisbon, 135-39.
  • Ide, N., Bonhomme, P., Romary, L. (2000). XCES: An XML-based Standard for Linguistic Corpora. Proceedings of the Second Language Resources and Evaluation Conference (LREC), Athens, Greece, 825-30.
  • Jacobson, Michel, Boyd Michailovsky, John B. Lowe (2001). Linguistic documents synchronizing sound and text. Speech Communication, 33 (1,2): 79-96.
  • Kipp, Michael (2001). Anvil - A Generic Annotation Tool for Multimodal Dialogue. Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech), pp. 1367-1370, Aalborg, September 2001.
  • Lausberg, Hedda, and Han Sloetjes (2009). Coding gestural behavior with the NEUROGES–ELAN system. Behavior Research Methods, 41 (3), 841-849.
  • Lehmann, Christian (1983). Directions for interlinear morphemic translations. Folia Linguistica 16.193-224.
  • Marcus, Mitchell, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large annotated corpus of English: The Penn Treebank. Computational Linguistics 19.313–30.
  • Marcus, Mitchell, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: annotating predicate argument structure. Proceedings of the workshop on Human Language Technology, Princeton, NJ, 110–5. Morristown, NJ: Association for Computational Linguistics.

  • McKelvie, David, Amy Isard, Andreas Mengel, Morten Baun Møller, Michael Grosse, and Marion Klein (2001). The MATE workbench -- An annotation tool for XML coded speech corpora. Speech Communication, 33 (1,2): 97-112.
  • Nakatani, Christine H., Barbara J. Grosz, David D. Ahn, and Julia Hirschberg (1995). Instructions for annotating discourse. TR: 21-95, Harvard University, Cambridge, MA.
  • Navarro, Borja, Montserrat Civit, M. Antonia Martí, R. Marcos, B. Fernández. 2003. Syntactic, Semantic and Pragmatic Annotation in Cast3LB. Shallow Processing of Large Corpora (SProLaC), a Workshop on Corpus Linguistics, Lancaster, UK.
  • Neidle, Carol, Stan Sclaroff, and Vassilis Athitsos (2001). SignStream: A tool for linguistic and computer vision research on visual-gestural language data. Behavior Research Methods, 33, 311-320.
  • Pitrelli, John F., Mary E. Beckman, and Julia Hirschberg (1994). Evaluation of prosodic transcription
    labelling reliability in the ToBI framework. Proceedings of the 1994 International Conference on Spoken Language Processing, Vol. 1, pp. 123-126.
  • Pye, C., K. A. Wilcox, and K. A. Siren (1988). Refining transcriptions: the significance of transcriber ‘errors’. Journal of Child Language, 15, 17–37.
  • Quek, Francis, Dan McNeill, Robert Bryll, and Mary Harper (2002). Gesture Spatialization in Natural Discourse Segmentation. Proceedings of the Seventh International Conference on Spoken Language Processing, Vol. 1, Denver, CO, pp. 189-192.
  • Stirling, Lesley, Janet Fletcher, Ilana Mushin, and Roger Wales (2001). Representational issues in annotation: Using the Australian map task corpus to relate prosody and discourse structure. Speech Communication, 33 (1,2): 113-134.
  • Syrdal, Ann K., Julia Hirschberg, Julie McGory, and Mary Beckman (2001). Automatic ToBI prediction and alignment to speed manual labeling of prosody. Speech Communication, 33 (1,2): 135-151.
  • Taylor, A., M. Marcus, and B. Santorini. 2001. The Penn Treebank: An overview. In A. Abeillé (ed.), Building and Using Syntactically Annotated Corpora, Kluwer.
  • Taylor, Paul, Alan W. Black, and Richard Caley (2001). Heterogeneous relation graphs as a formalism for representing linguistic information. Speech Communication, 33 (1,2): 153-174.
  • Telljohann, Heike, Erhard Hinrichs, and Sandra Kübler. 2004. The TüBa-D/Z treebank: Annotating German with a context-free backbone. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 2004.
  • Telljohann, Heike, Erhard Hinrichs, Sandra Kuebler, and Heike Zinsmeister. 2006. Stylebook for the Tuebingen Treebank of Written German (TueBa-D/Z). Technical report, Seminar fuer Sprachwissenschaft, Universitaet Tuebingen, Tuebingen. Revised version.
  • Trippel, Thorsten, Michael Maxwell, Greville Corbett, Cambell Prince, Christopher Manning, Stephen Grimes and Steve Moran (2008). Lexicon Schemas and Related Data Models: when Standards Meet Users. Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC'08).
  • Xue, Nianwen, Fu-Dong Chiou, and Martha Palmer. 2002. Building a Large-Scale Annotated Chinese Corpus. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan.