Version User Scope of changes
Jul 19 2009, 5:21 PM EDT (current) robert_forkel 4 words added, 55 words deleted
Jul 18 2009, 4:31 PM EDT robert_forkel

The Tools group is charged with identifying and documenting existing and needed tools which will be the face of the cyberinfrastructure for ordinary working linguists. These tools include both those used by data creators (e.g., linguists annotating data that they later share) and data consumers (e.g., linguists using the annotated data of others to create new kinds of data).

Conference Results: Design Doc for a Linguistics-oriented Data Publishing and Sharing Platform
While there are many tools and methods for gathering and analyzing data that are particular to the various subdisciplines, sharing and publishing of data seems to be the more pressing issue. The following describes such a platform.
High level requirements:
  • data backup
  • data conversion
  • remote access
  • publishing (legal/copyright/etc)
  • graded access
Notes:

Existing Tools

  • TypeCraft Collaborative text annotation
  • WALS The World Atlas of Language Structures Online
  • ODIN Online Database of INterlinear glossed text
  • TextGrid "TextGrid aims to create a community grid for the collaborative editing, annotation, analysis and publication of specialist texts. It thus forms a cornerstone in the emerging e-Humanities."
  • Natural Language Toolkit (NLTK) "Open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks, with distributions for Windows, Mac OSX and Linux."
  • eHumanities Desktop (project is in alpha development stage, no description available yet)
  • Roma: TEI validation tool "These pages will help you design your own TEI validator, as a DTD, RELAXNG or W3C Schema."
  • Chorus is a version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed. Chorus is a Palaso Project.
  • e-Linguistics: building a cyberinfrastructure for linguistics (including a Python toolkit for data migration; documentation is still being posted)
  • Consistent Document Engineering Toolkit
  • Thai language specific tools

General purpose tools in use by linguists:
  • R Project for statistical computing and the linguistics packages in EMU.
  • Praat doing phonetics by computer
  • Python
  • ANVIL video annotation research tool

Needed Tools
  • A FOS aligner tool (or aligner development tool) at a grain finer than the intervals marked off fairly automatically in LDC's transcriber tool
  • ??

What's a killer application?

Google Maps may be a good example for a killer app:
  • it's killer in the way it brought mapping data to everyone.
  • it actually killed, e.g. gml - at least gml's hope for mass adoption.
  • it didn't piggyback on a standard, but set one: kml - and it turned out, creating xml files isn't that much of a problem, if you want it badly enough.
But
  • can there be something like micro-killer-apps?
  • can there be something like scientific killer apps? doesn't "scientific" mean "too small to be killer"?
Following the Google Maps example a killer app would help pull data out of the drawers. This might happen in two ways:
  1. Make publishing data easier or
  2. provide big enough incentives to submit to tedious publishing.

What could killer apps for linguistics look like?

  • search engines? or the semantic web (see this blog post for an idea of what this could mean)?
  • data visualization?
  • can "archiving" or "longterm preservation" be a killer app? (Does not sound like it, does it?)
  • is reproducible research enough of an incentive to publish data?
May the killer app be something social/political, like a new model for scientific recognition on the web? And if so, what can we do to bring it about? Foster skills?

Killer applications are applications that are used lots and lots

Therefore a good question might be: Who are the linguists interested in finding and/or producing reusable data?

  • computational linguists (yes)
  • corpus linguists (yes)
  • typologists: encyclopedic works like the SIL guides are already interesting and useful, and there is growing interest in sharing data and linguistic ontology
  • descriptive linguists (as above)
  • theoretical linguists: much theoretical work does not currently use reusable (or computationally accessible) data, with some exceptions [e.g. the LFG and HPSG communities].

But data for the computational linguist is probably not quite the same as data for the typologist (or the theoretical linguist).

Likewise, a killer app for a computational linguist is probably something very different from an application that a descriptive linguist engaged in fieldwork would care to call a useful tool. Theoretical linguists might be interested in searching for data along yet another set of dimensions. Finally, the generation of reusable resources, if considered important at all, must pay off academically to attract more than the occasional linguist. Perhaps we can conclude from this that we need a cluster of tools rather than this one application - together they might be a killer. :)

So following the definition above ("killer apps are apps that are used a lot"), we can probably assume that future killer apps will be on the web.

There seem to be two concerns here:
  1. what does 'data' look like for each field, and can we share specifications?
  2. what does an 'application' look like for linguists of various stripes?
These concerns need not pit "computational" linguists against other types of linguist.

What does 'data' look like for each field, and can we share specifications?

Like any science, linguistics is based on data, yet the form this data takes and the role it plays depend crucially on how we conceive of language and on the particular approach chosen to investigate its nature. Does that mean that there is no such thing as "the empirical base of our field"? Not necessarily; it only means that this base must consist of a multitude of different types of linguistic data. If so, free access to and reusability of this data might be a commodity that most of us find useful.
Let's assume we could agree on that point; what exactly does that mean for future linguistic tools? It all seems to come back to the same point, namely that we are chasing a ghost by looking for that one killer app; instead, what we most likely need are several different tools, able to cater to the multitude of needs that define the linguistic field as a whole.

Desirable Characteristics of Apps

  • No dead ends for data: While some apps (e.g. FileMaker) may be "killer" in how they help organize data, they also make reusing the data hard.
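
One way to read "no dead ends" is that any tool holding linguistic data should be able to write it out in open, plain-text formats that other tools can read back. The following is a minimal sketch of that idea in Python, not a proposal for an actual exchange format: the record fields (`form`, `gloss`, `language`) and the function names are hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical lexicon records; the field names are illustrative only.
RECORDS = [
    {"form": "Haus", "gloss": "house", "language": "German"},
    {"form": "maison", "gloss": "house", "language": "French"},
]


def to_json(records):
    """Serialize records as JSON text, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialize records as CSV text with a header row taken from the first record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Read CSV text back into a list of dictionaries (all values are strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


if __name__ == "__main__":
    print(to_json(RECORDS))
    print(to_csv(RECORDS))
```

Because both outputs are plain text, the data can be read back by other programs; `from_csv(to_csv(RECORDS))` returns the original records in this sketch, since all values are strings.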