See also pages for subfield-specific best practices.
See also information re: participating in the development of ISO standards.
Existing annotation standards and resources
This section lists the annotation conventions and other resources for developing and discussing annotation standards that were suggested by participants of Cyberling09. It was copied from the working group 1 page on 23 July 2009; ultimately (after the workshop whitepapers are published) this page should be the permanent home for information on existing standards.
6.1. Phonetics and phonology
- Phone segment tagging symbols, including both:
- and language-specific phoneme-segment encodings such as ArpaBet for American English and the CSJ encoding for Japanese
- the various ToBI conventions
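As a rough illustration of what a language-specific phoneme-segment encoding looks like in practice, the sketch below converts a few ArpaBet symbols to IPA. The mapping table is deliberately partial, and the function name `arpabet_to_ipa` and the handling of trailing stress digits (the convention used in the CMU Pronouncing Dictionary) are assumptions made for this example, not part of any resource listed above.

```python
import re

# Partial ArpaBet-to-IPA table (illustrative subset only, not a full inventory).
ARPABET_TO_IPA = {
    "AE": "æ",
    "IY": "i",
    "K": "k",
    "T": "t",
    "S": "s",
    "SH": "ʃ",
    "NG": "ŋ",
}


def arpabet_to_ipa(transcription):
    """Convert a space-separated ArpaBet string to a list of IPA symbols."""
    ipa = []
    for symbol in transcription.split():
        # Vowels may carry a trailing stress digit (0, 1 or 2); drop it.
        base = re.sub(r"[012]$", "", symbol)
        if base not in ARPABET_TO_IPA:
            raise KeyError("symbol not in the partial table: " + symbol)
        ipa.append(ARPABET_TO_IPA[base])
    return ipa


print("".join(arpabet_to_ipa("K AE1 T")))
```

A complete converter would need the full symbol inventory and a decision about how to represent stress in the IPA output.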
6.2. Morphosyntax
- Typecraft: a labeling system which, for any verb construction of a given language, provides a template for that construction type displaying its argument structure, in a fashion as transparent as possible. The template is constructed from a universally established inventory of labeling primitives.
http://www.typecraft.org/tc2wiki/Verbconstructions_cross-linguistically_-_Introduction
- tags for short-unit word (SUW) and long-unit word (LUW) in the CSJ
6.3. Syntax and semantics
6.4. Pragmatics and discourse structure
- CHAT conventions for segmenting turns and identifying the participant and the setting
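To make the turn and participant conventions concrete, here is a minimal sketch that extracts speaker turns from a CHAT-style transcript. It assumes the basic layout in which main tiers start with `*`, a speaker code, a colon and a tab, while header lines start with `@` and dependent tiers with `%`. The sample transcript and the function name `speaker_turns` are invented for illustration, and continuation lines and other details of real CHAT files are ignored.

```python
import re

# Invented sample in a simplified CHAT-like layout.
SAMPLE = """@Begin
@Participants:\tCHI Target_Child, MOT Mother
*MOT:\twhat do you want ?
*CHI:\tmore cookie .
%com:\tpoints at the jar
@End"""

# A main tier: asterisk, speaker code, colon, tab, then the utterance.
MAIN_TIER = re.compile(r"^\*([A-Z0-9]+):\t(.*)$")


def speaker_turns(chat_text):
    """Return (speaker, utterance) pairs, skipping headers and dependent tiers."""
    turns = []
    for line in chat_text.splitlines():
        match = MAIN_TIER.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


for speaker, utterance in speaker_turns(SAMPLE):
    print(speaker, "->", utterance)
```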
6.5. Gesture
6.6. Other resources
- ISO TC37 SC4 - Language Resource Management: Homepage
Existing standards for storage, retrieval, and search
This section was copied from the working group 2 page on 23 July 2009, and this information should ultimately live here.
STORAGE
-- text encoding standards
- Unicode
- Text Encoding Initiative (TEI) P5 Standard "These guidelines make recommendations about suitable ways of representing those features of textual resources which need to be identified explicitly in order to facilitate processing by computer programs. In particular, they specify a set of markers (or tags) which may be inserted in the electronic representation of the text, in order to mark the text structure and other features of interest."
- ...
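The idea of markers inserted into the electronic text can be sketched with a tiny TEI-like fragment. The snippet below is an assumption-laden illustration: the sample is not a complete, valid TEI P5 document (it has no header, for instance), and the helper `tagged_spans` is a hypothetical name. It only shows how tagged features become retrievable by a program once they are marked explicitly.

```python
import xml.etree.ElementTree as ET

TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"

# Minimal fragment for illustration; a real TEI document also needs a teiHeader.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>She said <foreign xml:lang="de">Guten Tag</foreign> to <name>Anna</name>.</p>
    </body>
  </text>
</TEI>"""


def tagged_spans(xml_text, element):
    """Return the text of every element with the given TEI local name."""
    root = ET.fromstring(xml_text)
    qualified = "{" + TEI_NAMESPACE + "}" + element
    return [node.text for node in root.iter(qualified)]


print(tagged_spans(SAMPLE, "foreign"))
print(tagged_spans(SAMPLE, "name"))
```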
(domain-specific) terminological standards
---- storage and retrieval standards
- repository systems (e.g. www.escidoc.org and www.fedora-commons.org) might be relevant in relation to long-term archiving
- DELAMAN is 'an international umbrella body for archives and other initiatives with the goal of documenting and archiving endangered languages and cultures worldwide. Our aim is to stimulate interaction about practical matters that result from the experiences of fieldworkers and archivists, and to act as an information clearinghouse.'
- The Rosetta Project is another archive actively promoting a set of best practices for storage of language data.
RETRIEVAL
-- reference/identification standards (i.e. metadata)
-- Citation Standards
- COinS (ContextObjects in Spans, a simple standard for embedding Dublin Core citation metadata in a web page)
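A minimal sketch of how such citation metadata might be embedded: COinS places an OpenURL ContextObject, encoded as key/value pairs, in the `title` attribute of an empty `span` with class `Z3988`. The helper name `coins_span` and the choice of three Dublin Core fields are assumptions for this example; real records usually carry more fields.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(title, creator, date):
    """Build a COinS span carrying a small Dublin Core record."""
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:dc"),
        ("rft.title", title),
        ("rft.creator", creator),
        ("rft.date", date),
    ]
    # URL-encode the key/value pairs, then escape "&" for use in an HTML attribute.
    query = urlencode(fields)
    return '<span class="Z3988" title="{}"></span>'.format(escape(query, quote=True))


print(coins_span("A Grammar", "Doe, Jane", "2009"))
```

The bibliographic values passed in above are placeholders, not a real publication.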
SEARCH
- OpenSearch. A very simple standard for sharing search results, usually by expressing them in the Atom Syndication Format. Although it is not the best standard in terms of design and extensibility, it is relatively easy to adopt. OpenSearch also has proposed geographic extensions that describe how to query a collection based on geographic parameters.
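Part of what makes this standard easy to adopt is that a search service advertises a URL template and the client only has to fill in the placeholders. The sketch below illustrates that step, assuming a template with `{searchTerms}` and an optional `{startIndex?}` parameter. The host `example.org` and the helper `fill_template` are hypothetical.

```python
import re
from urllib.parse import quote

# Hypothetical template of the kind found in an OpenSearch description document.
TEMPLATE = "http://example.org/search?q={searchTerms}&start={startIndex?}&format=atom"


def fill_template(template, **params):
    """Substitute {name} and optional {name?} placeholders in a URL template."""

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in params:
            return quote(str(params[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError("missing required parameter: " + name)

    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)


print(fill_template(TEMPLATE, searchTerms="noun class"))
```

The same substitution would apply to prefixed parameters from an extension, provided the client knows how to supply their values.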
ACCESS/REUSE
-- Cultural Heritage Global Schema and Ontologies
- CIDOC (an ontology mainly applied by European museums and other heritage organizations that is nicely abstracted and very generalized, but is complex and has some difficulties in application)
- OCHRE/ArchaeoML (a somewhat simpler global schema / ontology for cultural heritage applications, including archaeology, epigraphy and philology. It is highly abstract, so projects and collections retain their native descriptive terminologies while some degree of interoperability and shared services is facilitated.)
-- Copyright and Intellectual Property
- Creative Commons provides a series of standard copyright licenses and associated metadata that explicitly grant certain permissions and set conditions for use and reuse of copyrighted content. These are useful for defining how content can be used. However, they are complicated to apply to scientific data, since US copyright law distinguishes between "facts" (ideas, concepts, objective data) and "expressions". Since many scientific datasets contain factual measurements and observations, they may not be protected by copyright. To make matters more complicated, the boundary between a fact and an expression is blurred and often ambiguous. This legal ambiguity and complexity makes it harder to use and reuse scientific data. Therefore, Creative Commons' scientific arm, "Science Commons", recommends that scientists not use Creative Commons copyright licenses for scientific data. Instead, Science Commons recommends that scientific applications explicitly dedicate data to the public domain using the "CC-Zero" declaration. CC-Zero removes legal ambiguity around data, removes all restrictions on reuse, and, in theory, maximizes the scientific value of data.
-- APIs/standards for interfaces with other resources (e.g. corpora, lexica/lexical resources, treebanks?, ...)
- WordNet "This document presents a standard conversion of Princeton WordNet to RDF/OWL. It describes how it was converted and gives examples of how it may be queried for use in Semantic Web applications."
Existing tools, web services, and other technologies
This section was copied in part from the working group 3 page on 23 July 2009.
TOOLS
- TypeCraft Collaborative text annotation
- WALS The World Atlas of Language Structures Online
- ODIN Online Database of INterlinear glossed text
- TextGrid "TextGrid aims to create a community grid for the collaborative editing, annotation, analysis and publication of specialist texts. It thus forms a cornerstone in the emerging e-Humanities."
- Natural Language Toolkit (NLTK) "Open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks, with distributions for Windows, Mac OSX and Linux."
- eHumanities Desktop (project is in alpha development stage, no description available yet)
- Roma: TEI validation tool "These pages will help you design your own TEI validator, as a DTD, RELAXNG or W3C Schema."
- Chorus is a version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed. Chorus is a Palaso Project.
- e-Linguistics: building a cyberinfrastructure for linguistics (including a Python toolkit for data migration; documentation is still being posted)
- Consistent Document Engineering Toolkit
- Thai language specific tools
General purpose tools in use by linguists:
- R Project for statistical computing and the linguistics packages in EMU.
- Praat doing phonetics by computer
- Python
- ANVIL video annotation research tool
WEB SERVICES
-- Representational state transfer (REST) - An architectural style, widely used as a de facto standard, in which clients and servers communicate over the internet.
- Communication is conventionally limited to four verbs (methods) of the HTTP protocol: GET, POST, PUT, DELETE. These four commands are used to manipulate any resource on the internet that has a URI. Limiting the interface to four commands keeps the semantics of communication between client and server simple.
- Resource: Any entity; anything that can be identified with a URI (e.g. phone number, car, person, idea).
- GET: Retrieve one or many resources from the server.
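- POST: Add a new resource to the server (the usual convention; the server decides where the new resource lives).
- PUT: Create or replace the resource at a given URI, i.e. update it in full.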
- DELETE: Destroy a resource from the server.
- Note that GET, POST, PUT and DELETE are passed as the method in the first line (the request line) of an HTTP request. GET is the default method when submitting a URL request using a web browser.
- MultiTree Examples:
- GET http://multitree.linguistlist.org/codes/pol
- Semantics: “Get the code resource ‘pol’ from the multitree.linguistlist.org server.”
- Returns an HTML formatted page of all data pertaining to the ‘pol’ code resource in the MultiTree repository
- GET http://multitree.linguistlist.org/codes/pol/trees.json
- Semantics: “Get all tree resources that contain the code resource ‘pol’ and return them in JSON format.”
- Returns a JSON document that lists all tree resources which contain the code ‘pol’.
- JSON: JavaScript Object Notation, a lightweight text format for data interchange. It is similar in purpose to XML but far more succinct, since it is designed primarily for communication between machines.
- It is possible to implement a RESTful web service from the ground up, though several popular frameworks help facilitate its construction:
- REST can be compared to other approaches to communication over the internet, such as RPC and SOAP. Although these have been used more widely than REST, they are criticized for adding unnecessary complexity to the design of communication interfaces between clients and servers.
- See also: Web Standards
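The request pattern described in this list can be sketched with Python's standard library. The example builds, but does not send, requests against the MultiTree URLs quoted above, and then decodes a JSON body. The helper names, the DELETE path and the shape of the sample JSON are assumptions for illustration: the actual MultiTree response format is not documented here, and the service may not accept write requests at all.

```python
import json
from urllib.request import Request

BASE_URL = "http://multitree.linguistlist.org"


def build_request(method, path):
    """Build an HTTP request object for a resource; nothing is sent yet."""
    return Request(BASE_URL + path, method=method)


# GET the tree resources that contain the code 'pol', in JSON format.
get_trees = build_request("GET", "/codes/pol/trees.json")

# The same resource-plus-verb pattern with another method (hypothetical path).
delete_tree = build_request("DELETE", "/trees/example")

# To actually send a request: urllib.request.urlopen(get_trees)

# Invented response body; the real service may use different field names.
SAMPLE_BODY = '{"code": "pol", "trees": [{"id": 1, "name": "Example tree"}]}'


def tree_names(body):
    """Decode a JSON body and list the names of the trees it contains."""
    return [tree["name"] for tree in json.loads(body)["trees"]]


print(get_trees.get_method(), get_trees.full_url)
print(tree_names(SAMPLE_BODY))
```

Keeping request construction separate from sending makes the verb and the resource URI easy to inspect before any network traffic occurs.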
OTHER TECHNOLOGIES
-- here