Version User Scope of changes
Jul 19 2009, 12:45 PM EDT (current) ekansa 133 words added
Jul 19 2009, 12:12 PM EDT ekansa 158 words added, 2 words deleted

STORAGE
-- Standard software interface to contribute to repository systems.
SWORD (Simple Web-service Offering Repository Deposit, based on the Atom Publishing Protocol) lets users and other systems contribute content to different repository systems. The linguistics community should look into building on SWORD for linguistics-specific needs.
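
As a rough illustration of what a deposit over the Atom Publishing Protocol involves, the sketch below builds (but does not send) a POST request carrying an Atom entry. The collection URL, the sample metadata, and the function name are hypothetical, and a real deposit would typically add packaging and authentication details that are not shown here.

```python
import urllib.request
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_deposit_request(collection_url, title, author, summary):
    """Build an AtomPub-style POST request for a new entry (nothing is sent)."""
    ET.register_namespace("", ATOM_NS)
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # Atom entries need an id and an updated timestamp; the server may replace both.
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = uuid.uuid4().urn
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = datetime.now(
        timezone.utc
    ).isoformat()
    author_el = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author_el, f"{{{ATOM_NS}}}name").text = author
    ET.SubElement(entry, f"{{{ATOM_NS}}}summary").text = summary
    body = ET.tostring(entry, encoding="utf-8")
    return urllib.request.Request(
        collection_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/atom+xml;type=entry",
            "Slug": title,  # AtomPub hint for naming the new resource
        },
    )


# Hypothetical collection URL and metadata, for illustration only.
request = build_deposit_request(
    "http://repository.example.org/collections/linguistics",
    "Field recordings session 1",
    "A. Linguist",
    "Annotated audio transcripts.",
)
```

Passing `request` to `urllib.request.urlopen` would perform the deposit against a server that accepts it; the point here is only that the client side needs little more than an HTTP library and an XML serializer.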

RETRIEVAL

SEARCH
-- RESTful ways lower barriers to entry, but standards for REST-style architectures need more development
Web-service standards are most mature for WS/SOAP-style services, which see extensive use in enterprise, government, and some cyberinfrastructure environments (typically where there is a high degree of centralization). However, SOAP services are more difficult to scale across distributed, multi-organizational, and multidisciplinary environments. REST-style approaches would be a better fit for cross-disciplinary cyberinfrastructure, but they suffer from a lack of comparably developed standards (as described below).
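
To make the "barrier to entry" point concrete, here is a minimal sketch that expresses the same search both ways. The endpoint, the operation name, and the parameter names are hypothetical; the only real identifier is the SOAP 1.1 envelope namespace.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Hypothetical service endpoint, for illustration only.
BASE_URL = "http://repository.example.org/search"
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def rest_query_url(params):
    """REST style: the whole query is a URL that any HTTP client can GET."""
    return BASE_URL + "?" + urlencode(params)


def soap_query_envelope(operation, params):
    """SOAP style: the same parameters wrapped in an XML envelope, which the
    client must POST and which must match the service's published contract."""
    fields = "".join(
        f"<{name}>{escape(str(value))}</{name}>" for name, value in params.items()
    )
    return (
        f'<soap:Envelope xmlns:soap="{SOAP_NS}">'
        f"<soap:Body><{operation}>{fields}</{operation}></soap:Body>"
        "</soap:Envelope>"
    )


query = {"q": "verb", "lang": "hau"}
url = rest_query_url(query)
envelope = soap_query_envelope("Search", query)
```

The REST form can be pasted into a browser or a link; the SOAP form needs tooling on both ends, which is part of why it fits centralized environments better than loosely coordinated ones.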

-- RESTful ways for repository systems to declare how they can be queried.

The Atom Syndication Format can be very useful for expressing the results of queries. Atom entries are good containers for specifying where individual records can be retrieved (and those records can themselves be expressed in discipline-specific XML formats). However, it would be valuable if an Atom feed could also describe, in a machine-readable way, how its source collection can be queried (which query parameters are available, what values they can take, etc.). To a limited degree the OpenSearch standard accomplishes this, but it is geared mainly toward simple full-text searches (like Google). A better-designed, more general standard than OpenSearch is needed.
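
For reference, this is roughly what a client can learn from an OpenSearch-style URL template, sketched under the assumption of a hypothetical repository template. The helper names are mine. Note that the template reveals parameter names and whether they are optional, but nothing about the values a parameter accepts, which is the gap described above.

```python
import re
from urllib.parse import quote

# OpenSearch URL templates mark parameters as {name}, or {name?} when optional.
PARAM_PATTERN = re.compile(r"\{([A-Za-z][\w:]*)(\?)?\}")


def template_parameters(template):
    """Return {parameter name: is_required} for an OpenSearch-style template."""
    return {
        name: optional != "?"
        for name, optional in PARAM_PATTERN.findall(template)
    }


def fill_template(template, values):
    """Substitute values into the template; optional parameters may be omitted."""

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters collapse to an empty string
        raise KeyError(f"missing required parameter: {name}")

    return PARAM_PATTERN.sub(substitute, template)


# Hypothetical template of the kind a repository might publish.
TEMPLATE = "http://repository.example.org/search?q={searchTerms}&start={startIndex?}"

parameters = template_parameters(TEMPLATE)
query_url = fill_template(TEMPLATE, {"searchTerms": "noun class"})
```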

Also, it is sometimes more useful to obtain a summary of a query's results than the individual records. I've been thinking about this in my work on Open Context. One service I find useful is exposing faceted metadata in machine-readable formats (I'm using Atom). Exposing metadata this way gives the system a simple means of declaring how it can be queried, and it also offers useful summary information about the contents of the collection. But, as I found with Open Context, no widely adopted standard exists to support these kinds of capabilities.
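
A sketch of how a client might consume facets published as an Atom feed follows. This is not Open Context's actual format: the `f:` extension namespace, its `field` and `count` elements, the sample feed, and the URLs are all invented for illustration, precisely because no shared standard defines them.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "f": "http://example.org/ns/facets",  # hypothetical extension namespace
}

# Hypothetical feed: each entry is one facet value, with a record count and
# a link to the query narrowed by that value.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:f="http://example.org/ns/facets">
  <title>Facets for query: pottery</title>
  <entry>
    <title>Bronze Age</title>
    <link href="http://repository.example.org/search?q=pottery&amp;period=bronze-age"/>
    <f:field>period</f:field>
    <f:count>42</f:count>
  </entry>
  <entry>
    <title>Iron Age</title>
    <link href="http://repository.example.org/search?q=pottery&amp;period=iron-age"/>
    <f:field>period</f:field>
    <f:count>17</f:count>
  </entry>
  <entry>
    <title>Site A</title>
    <link href="http://repository.example.org/search?q=pottery&amp;site=a"/>
    <f:field>site</f:field>
    <f:count>30</f:count>
  </entry>
</feed>"""


def read_facets(feed_xml):
    """Group facet entries by field as (value, count, narrowed query URL)."""
    facets = defaultdict(list)
    root = ET.fromstring(feed_xml)
    for entry in root.findall("atom:entry", NS):
        field = entry.findtext("f:field", namespaces=NS)
        value = entry.findtext("atom:title", namespaces=NS)
        count = int(entry.findtext("f:count", namespaces=NS))
        href = entry.find("atom:link", NS).get("href")
        facets[field].append((value, count, href))
    return dict(facets)


facets = read_facets(SAMPLE_FEED)
```

The field names double as a declaration of which query parameters exist, the values show what those parameters accept, and the counts summarize the collection, so one feed serves both purposes described above.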



ACCESS/REUSE


-- standards for using and developing basic natural language processing (NLP) tools (POS taggers? spell checkers? ... ??), or for creating training data from existing linguistically-annotated data
  • This may be getting a bit 'out there', but I'm thinking about the extreme prevalence of statistical machine learning methods in computational linguistics and the value of annotated linguistic data for training tools that use such methods. Are there ways of making this data easier to adapt as training data for developing NLP tools?
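
One way to picture the adaptation step is sketched below, assuming the annotations can be reduced to (word, part-of-speech) pairs. The sample sentences, the tag labels, and the function names are invented, and the "tagger" is a deliberately trivial most-frequent-tag baseline rather than a realistic statistical model.

```python
from collections import Counter, defaultdict

# Hypothetical annotated records: each sentence is a list of (word, tag) pairs.
ANNOTATED = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("peels", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("trees", "NOUN"), ("bark", "VERB")],
]


def to_training_lines(sentences):
    """Flatten sentences into 'word<TAB>tag' lines, with a blank line after each."""
    lines = []
    for sentence in sentences:
        lines.extend(f"{word}\t{tag}" for word, tag in sentence)
        lines.append("")
    return lines


def train_baseline_tagger(lines):
    """Learn each word's most frequent tag from the flattened training lines."""
    counts = defaultdict(Counter)
    for line in lines:
        if line:  # skip sentence separators
            word, tag = line.split("\t")
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(model, words, default="NOUN"):
    """Tag words with the learned model, falling back to a default for unknowns."""
    return [(word, model.get(word.lower(), default)) for word in words]


training_lines = to_training_lines(ANNOTATED)
model = train_baseline_tagger(training_lines)
tagged = tag_words(model, ["the", "dogs", "bark"])
```

The conversion function is the part a standard could make reusable: if annotated corpora shared an export of this kind, the training step could be swapped for any tool that reads it.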