A call for tools
Working group 3 was asked to discuss "tools for linguistics", with the possible secondary question of identifying the properties of a "killer application" for linguistic research.
Our field (linguistics) has many tools already. We have annotation engines, analysis engines, and data-annotation standards. Sometimes, these tools are mutually compatible, but in general our tools exist for a sub-community within the field, and the tools work just well enough for the current research in that sub-community.
Currently open issues in technical tools for linguistics research seem to include:
- How to collaborate (though Group 7 seems to be covering this)
- Data provenance and proper citation of data-collections (Group 4 seems to be addressing some of this)
- Data control (controlling access and controlling changes)
Our group's discussion made clear that there is at least one technical area where linguistics -- as a field -- lacks a widely-used tool: active collaboration on data collection, curation, and sharing.
A dream
Let's imagine what an active online collaboration tool might look like. We provide here a brief narrative suggesting what might be available for a linguistics community with such a tool:
Sissala studies (an imaginary tale)
August 1-20: Researcher S. makes recordings in field
September 3: 150 conversations uploaded to the shared research platform, tagged as well as S.'s notes allow. S. remains in Burkina Faso, but (because he logged in) his proprietorship of that data is clearly tagged. He sets the access permission on this data to be shared with certain other members of the community.
September 20: School begins in North America. An instructor's upper-level class in "Phonetics of Pitch" selects 5 of these conversations for annotation for F0 as part of the class projects. Some conversations are transcribed by more than one student. Each transcription is itself tagged with the identity of the student and of the instructor, who is the one granting access to the specific conversations. As the instructor corrects the students' work, both the students' transcriptions and the instructor's improvements are logged and made available back to S.
Four undergrad students studying IPA at another American university transcribe 15 sentences each; their teacher transcribes 30 and double-checks the students' transcriptions as well.
S., from Burkina Faso, is able to see these transcriptions and sends back some commentary, noting a phonological distinction that the IPA students may not have captured.
October 1: Parts-of-speech are attached by grad students in the Netherlands, who are doing typological studies of West African languages' word-order phenomena.
October 20: A syntactician in Australia takes note of new work on Sissala.
In less than an academic quarter, new data and rich linguistic annotations of that data are shared across four continents, among students, researchers, and faculty. The students are using the same tools as the faculty and the field researchers: classwork is a good preparation for fieldwork. The students' own practicum is made useful both to other students and to the original researcher, who may not have time to perform this transcription himself but may be able to find the time to offer feedback.
In this (perhaps utopian) vision, research and education communicate using the same tools, and researchers across different sub-disciplines and different levels of expertise may also use these tools.
Working data sharing
What might such an active collaboration look like, using 2009 technology? Web programming and widely-supported data-distribution protocols make it fairly easy to store and distribute large amounts of data in a digitally accessible way. Creating, storing, and managing data on server computers sounds complex (and indeed is), but these challenges are largely addressed by other communities with parallel needs, both commercial and non-commercial.
We envision an environment in which data-collection is (to draw an analogy) like constructing a post (or series of posts) for a weblog. The data (or post) may be stored on the server (which we semi-jocularly dub "cyberling.org" for this discussion) even before distribution (or "publication"); it may go through multiple drafts, and it is possible (as with blog posts, on some platforms) that only some readers may *ever* be able to read it; attribution and edit history are tracked, and distribution to other locations, authors, and/or readers is straightforward.
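The weblog analogy can be made concrete with a small sketch. Everything below is an illustrative assumption rather than a description of any existing system: the class names `SharedRecord` and `Revision`, the user names, and the sample texts are all hypothetical. The sketch only shows that drafts, restricted readership, attribution, and edit history fit in a very small data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Revision:
    """One saved draft: who wrote it, what it said, and when."""
    author: str
    text: str
    timestamp: datetime


@dataclass
class SharedRecord:
    """A piece of data handled like a blog post (hypothetical model)."""
    owner: str
    readers: set = field(default_factory=set)      # users the owner has granted access
    revisions: list = field(default_factory=list)  # full edit history, oldest first

    def can_read(self, user: str) -> bool:
        return user == self.owner or user in self.readers

    def revise(self, author: str, text: str) -> None:
        # Only users with access may contribute; every edit is attributed.
        if not self.can_read(author):
            raise PermissionError(f"{author} has no access to this record")
        self.revisions.append(Revision(author, text, datetime.now(timezone.utc)))

    def current(self) -> str:
        return self.revisions[-1].text if self.revisions else ""

    def contributors(self) -> list:
        # Authors in order of first contribution, without duplicates.
        return list(dict.fromkeys(r.author for r in self.revisions))


# Hypothetical usage: the owner uploads notes, then shares with one colleague.
record = SharedRecord(owner="researcher_s")
record.revise("researcher_s", "conversation 12: rough notes")
record.readers.add("instructor_a")
record.revise("instructor_a", "conversation 12: notes plus F0 annotation")
print(record.contributors())
```

A real service would rely on an existing revision-control and authentication layer instead of in-memory lists; the point here is only how little structure the analogy requires.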
Contemporary web programming has ready-made solutions for many of these challenges. (This is not to say that the current solutions are ideal, only that they currently work well enough.) Complete, well-tested strategies exist for backup (redundancy), storage, confirming and tracking identity, attribution, access control, and revision control.
Using these online environments also allows multiple researchers (possibly in different locations!) to extend and curate data collaboratively. These possibilities only grow richer as disciplines interact: for example, data collected for fieldwork in phonology and morphology may inform further research on sociolinguistics, syntax, or typology for researchers far away, if the disciplines can share datasets through online collaboration. Collaboration in this manner can offer benefits to reputation (through citation of one's field notes and through visibility of one's own work) and new opportunities for researchers to share work in more repeatable, formally visible ways.
Motivating existing researchers
While online data sharing strategies offer many additional exciting possibilities for collaboration, the "bootstrapping" problem lurks: existing researchers rarely have a strong interest in changing their current work patterns to join an as-yet-nonexistent collaboration network. The benefits of collaboration will not appear unless many researchers are already involved; the first one on the dance floor will look goofy until the party really gets started.
Of course, additional objections may arise: "there's not enough time or money"; "we've not done it before, why start now?"; and the perennial concern, "this is not our area of expertise". It is a truism of both open-source and for-profit software that no project will succeed unless it "scratches someone's itch", that is, unless it meets existing unmet needs of its users. In open-source software, the "itch" is often the desire of a capable programmer to improve her own tools; in for-profit software, the itch is the prospect that someone will pay for the product or the labor.
In this paper, we concern ourselves with discovering the "itch": as researchers, as collectors and curators of linguistic data, and as a community, we need individual (and collective) reasons to move towards this sort of sharing --- reasons to find the funding and the time to promote the kind of adjustment to our research methods that would allow this collaboration.
As researchers interested in sharing data, we want to encourage our users to join the collaboration even before the benefits of the network effects appear. New users, the "first ones on the dance floor", need good individual reasons to join, especially with the community in its infancy. Thus, the question we address here is: "why should I join a linguistics corpus-management service built on web tech?"
Case studies: Immediate Benefits
We believe that most linguistics researchers will benefit quite readily from this kind of tool built on web technology. In this section, we provide a sketch of several linguistics researchers who are fairly diverse in their subfields, interests, and experience, and we point out the benefits that each would gain from adapting his or her work to use these sorts of tools.
existing work: recorded conversations, annotated for a very narrow range of phonetic and sociological features. Current research tools: Praat, DAT tapes, and R for analysis.
Sharing data into the platforms suggested here gives her:
- an offsite backup
- shared access to the data with her research-assistants and annotators
- control over who has access to these data
- revision history (including opportunities to review and reconcile conflicting annotations)
existing work: documenting a language's lexicon in the field. Current research tools: Toolbox, and a paper notebook.
Sharing data into the platforms suggested here gives him:
- backup
- shared access to the under-development lexicon with his advisor (who is at home teaching)
- having uploaded the lexicon in one format, downloading in a different one is easy
- browsing the lexicon through the web
- revision history
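The "upload in one format, download in another" benefit can be illustrated with a minimal sketch. It assumes the lexicon is kept in a Toolbox-style plain-text layout with one backslash marker per line; the specific markers (`\lx`, `\ps`, `\ge`), the function name `parse_entries`, and the two entries are placeholders chosen for illustration, not real field data or a complete treatment of the format.

```python
import json

# Hypothetical sample in a Toolbox-style layout: one backslash marker per line.
# Markers assumed here: \lx headword, \ps part of speech, \ge English gloss.
SAMPLE = r"""
\lx wordA
\ps n
\ge first gloss

\lx wordB
\ps v
\ge second gloss
"""


def parse_entries(text, record_marker="lx"):
    """Group marker lines into one dict per entry; record_marker starts a new entry."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything that is not a marker line
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            entries.append({})
        if entries:
            # Repeated markers inside one entry accumulate in a list.
            entries[-1].setdefault(marker, []).append(value.strip())
    return entries


entries = parse_entries(SAMPLE)
# Re-export the same lexicon as JSON, e.g. for browsing through the web.
print(json.dumps(entries, indent=2, ensure_ascii=False))
```

Once the entries are held in a neutral structure like this, each additional export format is one more small function instead of a manual conversion.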
Teaching work: uses Praat. Students are required to learn enough computer tools to turn in IPA assignments in digital form. A perennial frustration: some submission formats corrupt the students' transcriptions.
By sharing some data to transcribe with the class, this instructor can ask his students to transcribe directly into the online tools described here. The students get:
- practice with the tools that they will use as future professionals
- work with real field data
- attribution for their own work (their transcriptions are logged as their own)
The instructor gets:
- extra annotation passes over the same data
- clear indications of the students' transcription record (through the revision control and identity management)
- standardized transcription responses
Currently writing a syntax squib. Her research is currently looking at hundreds of Hungarian sentences, in search of left-displacement phenomena. The squib itself has only four (representative) examples.
If this student puts her examples into this sharing environment, she gets:
- backup of her notes
- organized documentation of the evidence behind that squib's point for later inclusion in her dissertation
- reproducibility -- other researchers who might challenge the representativeness of her four examples can find other examples
Currently running a Thai word-segmentation competition. Dataset involves thousands of sentences of Thai text, word-segmented by native Thai speakers. By using the platform provided, he can get:
- data backup
- easy indexing of multiple annotations of the same segments
- easy revision tracking, as disagreements are resolved
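As a rough illustration of indexing several annotations of the same segment, the sketch below compares two word segmentations of one unspaced string by the character offsets at which each annotator placed a break. The function names, the agreement measure (shared boundaries divided by all proposed boundaries), and the English placeholder string are assumptions made for this example; a real competition would define its own scoring.

```python
def boundaries(words):
    """Return the set of character offsets at which a segmentation places a break."""
    offsets, position = set(), 0
    for word in words[:-1]:          # no boundary after the final word
        position += len(word)
        offsets.add(position)
    return offsets


def compare(seg_a, seg_b):
    """Compare two segmentations of the same unspaced text."""
    if "".join(seg_a) != "".join(seg_b):
        raise ValueError("segmentations cover different text")
    a, b = boundaries(seg_a), boundaries(seg_b)
    union = a | b
    agreement = len(a & b) / len(union) if union else 1.0
    return agreement, sorted(a ^ b)  # disputed offsets can go to an arbitrator


# Placeholder text, not Thai: two annotators segment "thecatsat".
annotator_1 = ["the", "cat", "sat"]
annotator_2 = ["the", "cats", "at"]
agreement, disputed = compare(annotator_1, annotator_2)
print(f"agreement: {agreement:.2f}, disputed offsets: {disputed}")
```

Storing boundaries as offsets into the shared text means any number of annotators can be compared without copying the text itself.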
Additional Benefits
A successful platform would probably provide access to several simple-to-use applications (such as tools to visualize the uploaded data): for example, visual and tabular reports of (word-)frequencies in your data, cross-tabulation, and lemmatizers; visualization of syntactic annotation; simple annotation tools for adding further layers of annotation to a data set; and so on. Over time, community members can develop and share additional tools. Existing tools (e.g. ANVIL, Praat, Linguistic Search Engine, TigerSearch) could be integrated into the platform and combined with intuitive user interfaces, thereby providing additional motivation for linguists to join the community and upload their data.
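To give a sense of how small such a reporting tool can be, here is a minimal sketch of a word-frequency report and a speaker-by-part-of-speech cross-tabulation. The token list, speaker labels, and tag names are invented for illustration; a real tool would read them from the uploaded, annotated data.

```python
from collections import Counter

# Hypothetical (speaker, part-of-speech, word) tokens standing in for uploaded data.
tokens = [
    ("spk1", "N", "house"), ("spk1", "V", "see"), ("spk1", "N", "house"),
    ("spk2", "N", "dog"), ("spk2", "V", "see"), ("spk2", "V", "run"),
]

# Tabular word-frequency report.
word_freq = Counter(word for _, _, word in tokens)
print("word frequencies:")
for word, count in word_freq.most_common():
    print(f"  {word:<8}{count}")

# Cross-tabulation: how many tokens of each tag each speaker produced.
crosstab = Counter((speaker, pos) for speaker, pos, _ in tokens)
speakers = sorted({s for s, _, _ in tokens})
tags = sorted({p for _, p, _ in tokens})
print("speaker x part of speech:")
print("  " + " " * 8 + "".join(f"{t:>4}" for t in tags))
for s in speakers:
    print(f"  {s:<8}" + "".join(f"{crosstab[(s, t)]:>4}" for t in tags))
```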
In the long run, the platform we envision can also facilitate the development of annotation standards. Annotation standards that have been developed for one task can become objects that are shared with other users. Just like any other type of data, annotation schemes can have tags for authorship, editorship, and revision history. This way, not only primary linguistic data but also secondary data (part-of-speech sets, syntactic annotation schemes, etc.) can be shared and improved by community members.
What to build first?
The system required to get all of the possible benefits of every one of these use cases will require substantial work. In this section, we propose a somewhat more limited scope as an initial goal.
We suggest that phonetic and phonological transcription is a particularly well-suited task for this sort of distributed data-sharing. Technology and standards are well-defined for sharing audio and transcriptions (essentially, text).
The issues of sharing audio and transcriptions among field researchers, their assistants, professional and student transcribers, and arbitrators are not simple, because they involve tracking meta-information, access control, and revision control. Software developers, for their own needs, have developed good tools for dealing with most of these issues.
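As one small example of tooling that already exists, the text-comparison module in Python's standard library (`difflib`) can locate the points at which two transcriptions of the same utterance differ, which is the information an arbitrator needs. The helper name `disagreements` and the two transcription strings are hypothetical, and character-level comparison is an assumption; transcriptions with combining diacritics would need to be split into segments first.

```python
from difflib import SequenceMatcher


def disagreements(transcription_a, transcription_b):
    """List the spans where two transcriptions of one utterance differ."""
    matcher = SequenceMatcher(None, transcription_a, transcription_b, autojunk=False)
    return [
        (tag, transcription_a[i1:i2], transcription_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"           # keep only the points of disagreement
    ]


# Hypothetical transcriptions of the same recording by two transcribers.
student = "pata"
instructor = "pada"
for tag, a_span, b_span in disagreements(student, instructor):
    print(f"{tag}: student wrote {a_span!r}, instructor wrote {b_span!r}")
```

Each reported span could be stored alongside the identities of the two transcribers, so that the resolution becomes part of the revision history.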
Limiting the scope of this tool initially to audio files and multiple text transcriptions of those files would make working out these challenges somewhat simpler, and the resulting tool could later be extended more easily to other forms of annotation (e.g., coding features that depend on other annotations, such as syntax above transcription).
Even with this limited scope, building the infrastructure for simple collaboration on transcription would be a useful contribution to:
- Fieldworkers
- Linguistics education - students and faculty
- Sharing of transcriptions - between fieldworkers and their research colleagues
- Collaboration among remote researchers (even when none of them are fieldworkers)
Conclusion
Too often, discussions about "tools" are really discussions not about tools, but about how to make our tools interoperate, how to make sure that doing work on tools or data gets proper attribution (or the related question: how it gets funding), or one of several other concerns that are -- at root -- questions outside of the tools themselves.
We have tools. The technology for sharing data, for managing access, revision history, redundancy, and privacy already exists, and is in use on the Internet every day. As the linguistics research community, we are not using the tools that exist for these tasks -- and some work must be done in order to use them to our (and their!) best ability. Nevertheless, the tools exist, and the next generation of linguists will thank us for already having them in place.