STORAGE
-- Standard software interface for contributing to repository systems. SWORD (based on the Atom Publishing Protocol) lets users and other systems contribute content to different repository systems. The linguistics community should look into building on SWORD for linguistics-specific needs.
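To make the deposit idea concrete, here is a minimal sketch of an Atom Publishing Protocol style deposit: an Atom entry is built and wrapped in an HTTP POST aimed at a collection URL. The repository URL, the identifier, and the function names are hypothetical, and the sketch leaves out authentication, file packaging, and any repository-specific headers a real deposit service would likely require. Nothing is sent over the network.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_atom_entry(entry_id, title, author, summary, updated):
    """Serialize a minimal Atom entry describing the item to deposit."""
    ET.register_namespace("", ATOM_NS)  # write Atom as the default namespace
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    ET.SubElement(entry, f"{{{ATOM_NS}}}summary").text = summary
    return ET.tostring(entry, encoding="utf-8")


def build_deposit_request(collection_url, entry_bytes):
    """Prepare (but do not send) a POST of the entry to a collection URL."""
    return urllib.request.Request(
        collection_url,
        data=entry_bytes,
        method="POST",
        headers={"Content-Type": "application/atom+xml;type=entry"},
    )


if __name__ == "__main__":
    entry = build_atom_entry(
        "urn:example:wordlist-1",            # hypothetical identifier
        "Sample wordlist",
        "A. Linguist",
        "A small annotated wordlist.",
        "2009-01-01T00:00:00Z",
    )
    # Hypothetical collection URL; a real one comes from the repository.
    request = build_deposit_request(
        "https://repository.example.org/collection", entry
    )
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would actually perform the deposit.
```

A linguistics-specific profile could keep this envelope and only standardize what goes inside the entry or alongside it.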
RETRIEVAL
SEARCH
-- RESTful ways for repository systems to declare how they can be queried. The Atom Syndication Format can be very useful for expressing the results of queries. Atom entries are good containers for specifying where individual records can be retrieved (and these individual records can be expressed in discipline-specific XML formats). However, it would be good if an Atom feed could also describe, in a machine-readable way, how its source collection can be queried (which query parameters are available, what values those parameters can take, etc.). Also, sometimes it is more useful to obtain the summarized results of a query than the individual records. I've been thinking about this in my work on Open Context. One service I find useful is exposing faceted metadata in machine-readable formats (I'm using Atom). Exposing metadata like this gives the system a simple way to define how it can be queried, and it also offers useful summary information about the contents of the collection.
But, in the case of Open Context, no widely adopted standard exists to support these kinds of capabilities.
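Since no standard exists, the following is only a sketch of what a client could do with faceted metadata in an Atom feed. The feed layout is invented for illustration and is not Open Context's actual format: each facet is a feed-level link whose rel is a made-up IRI, whose href carries the query that selects the facet, and whose hypothetical ex:count attribute gives the number of matching records. From those links a client can recover both the available query parameters and a summary of the collection.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit

ATOM_NS = "http://www.w3.org/2005/Atom"
EX_NS = "http://example.org/facets"            # hypothetical extension namespace
FACET_REL = "http://example.org/facets#facet"  # hypothetical link relation

SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:ex="http://example.org/facets">
  <title>Search results</title>
  <link rel="http://example.org/facets#facet" ex:count="12"
        href="https://repository.example.org/search?language=Quechua"/>
  <link rel="http://example.org/facets#facet" ex:count="7"
        href="https://repository.example.org/search?language=Aymara"/>
  <link rel="http://example.org/facets#facet" ex:count="5"
        href="https://repository.example.org/search?genre=narrative"/>
  <entry>
    <title>Record 1</title>
    <link rel="alternate" type="application/xml"
          href="https://repository.example.org/records/1.xml"/>
  </entry>
</feed>"""


def describe_facets(feed_xml):
    """Map each query parameter to its advertised values and record counts."""
    root = ET.fromstring(feed_xml)
    facets = {}
    # Only feed-level links are inspected; links inside entries point at records.
    for link in root.findall(f"{{{ATOM_NS}}}link"):
        if link.get("rel") != FACET_REL:
            continue
        count = int(link.get(f"{{{EX_NS}}}count"))
        query = parse_qs(urlsplit(link.get("href")).query)
        for param, values in query.items():
            for value in values:
                facets.setdefault(param, {})[value] = count
    return facets


if __name__ == "__main__":
    for param, values in describe_facets(SAMPLE_FEED).items():
        print(param, values)
```

The same feed still carries ordinary entries, so a client that ignores the facet links loses nothing.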
ACCESS/REUSE
-- Standards for using and developing basic natural language processing (NLP) tools (POS taggers? spell checkers? ... ??), or for creating training data from existing linguistically annotated data. This may be getting a bit 'out there', but I'm thinking about the extreme prevalence of statistical machine learning methods in computational linguistics and the value of annotated linguistic data for training tools that use such methods. Are there ways of making this data easier to adapt as training data for developing NLP tools?
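As one illustration of the question, here is a small sketch that turns annotated records into training data and fits the simplest possible statistical tagger, a most-frequent-tag baseline. The record layout (parallel "words" and "pos" lists), the tag names, and the function names are all assumptions made up for this example; real annotated corpora would need a format-specific reader in front of it.

```python
from collections import Counter, defaultdict

# Hypothetical annotated records: one list of words and one of POS tags each.
ANNOTATED = [
    {"words": ["the", "dog", "barks"], "pos": ["DET", "NOUN", "VERB"]},
    {"words": ["a", "dog", "sleeps"], "pos": ["DET", "NOUN", "VERB"]},
    {"words": ["the", "cat", "sleeps"], "pos": ["DET", "NOUN", "VERB"]},
]


def to_training_sentences(records):
    """Convert records into lists of (word, tag) pairs, one list per sentence."""
    sentences = []
    for record in records:
        if len(record["words"]) != len(record["pos"]):
            continue  # skip records whose annotation is misaligned
        sentences.append(list(zip(record["words"], record["pos"])))
    return sentences


def train_unigram_tagger(sentences):
    """Learn the most frequent tag for each lowercased word form."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="NOUN"):
    """Tag words with the learned model, using a default tag for unseen words."""
    return [(word, model.get(word.lower(), default)) for word in words]


if __name__ == "__main__":
    model = train_unigram_tagger(to_training_sentences(ANNOTATED))
    print(tag(model, ["the", "cat", "barks"]))
```

The interesting standards question sits in the first function: if repositories agreed on how annotations are exposed, that conversion step would not have to be rewritten for every collection.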