Interlinearisation case study

As an example of a real-world application of standards, and of an attempt to set standards in interlinearisation, we can look at TypeCraft (henceforth TC).

From the TypeCraft project home page:
TypeCraft [is] a multi-lingual online database of linguistically-annotated natural language text, embedded in a collaboration and information tool. This set-up allows users (projects as well as individuals) to create their own domains, to invite others, as well as share their data with the public. The kernel of TypeCraft is morphological word-level annotation in a relational database setting, wrapped into a communication system, not unlike popular online community sites. TypeCraft allows you to import raw text for annotation and export annotated data to MS Word, OpenOffice.org, LaTeX or XML for further use.

One of the primary goals of TC is to provide a standards-based tool for interlinear glossing and data sharing. Some of the standards (existing ones or those developed specifically for TC) that are relevant to WG2's expertise are:
  1. Encoding
    • TC uses Unicode everywhere from internal storage, through user interaction to external export formats.
  2. Storage
    • The internal storage is a relational database, but the external format is XML, with an openly defined DTD (migration to an XSD is planned). This allows us to adjust the internal database in any way we see fit to improve the tool, while keeping the external XML format stable so that other tools and systems can rely on it. If we need to extend the XML format, we will strive to preserve existing structures and only add what is new. This ensures backwards compatibility.
  3. Retrieval
    • TC is an online tool, and every annotated phrase or collection of phrases can be accessed with a single URL. TC is also wiki-based, and annotated phrases can be embedded in wiki articles. This makes the data accessible to everyone reading a TCwiki article (think wiki-based papers on linguistics!) and, since it is a wiki, the phrases can be discussed on the article's talk page.
  4. Metadata
    • TC keeps track of the language of annotated resources by using the ISO 639-3 language codes. The ISO standard is not perfect (language names are not always accurate, and sometimes the official name for a particular language code is unacceptable to some speakers of the language), but ISO language codes are nevertheless a good start and much better than having users supply language names themselves.
    • TC uses a fixed set of glossing and part-of-speech tags. This prevents typos when entering tag names (the system will refuse to save anything with misspelled tags) and enforces consistent usage (e.g. singular should be called only SG and not both S and SG). In some cases, though, we allow several tags for one and the same grammatical feature in order to reflect parallel standards.
    • Related glossing tags are grouped together into larger virtual tags (e.g. DAT and ABL are the tags for the dative and the ablative case, and both belong to CASE).
    • Rudimentary support for text-level metadata with planned support for OLAC.
  5. Access
    • Annotated interlinear glosses can be exported to XML, as mentioned above. The power of XML is often underestimated. XML is reusable: a good XML representation can be transformed into many other formats. TC uses a single XML format and different XSL (Extensible Stylesheet Language) transformations to produce the same data in other formats: HTML (visually appealing to humans when rendered by a browser, and easily imported into MS Word/OpenOffice.org as part of a paper) and LaTeX (to include in a paper). XSL transformations are powerful enough to combine exported data from TC with data from other systems (e.g. a lexicon tool) to produce, for example, a lexicon in PDF (via LaTeX), where every entry is illustrated by an example annotated in TC.
    • TC is community-based, centred around a wiki website, and allows for data sharing and collaborative work (groups of people can work on the same texts/phrases).
    • Sharing and usage rights:
      • Every annotated phrase can be either private or published. Published phrases are accessible (read-only) to all users of the system.
      • Users of the system can belong to different groups and phrases can be shared with a particular group. All users in that group will have both read and write access to the shared phrases.
      • Future plans: manage usage rights by using access control lists (ACLs). ACLs are a powerful way to restrict or grant rights with fine granularity.
  6. Search
    • Searching within TC is consistent because of the enforced standardisation of metadata. E.g. when a user searches for class markers in Bantu languages (or any other similar morphemes in other languages), he or she does not have to guess what the annotator might have used but can just look through the list of glossing tags and pick those that correspond to class markers. The same holds for languages: all languages are stored as references to ISO codes, so it is impossible to annotate the same language under different names.
    • TC makes every bit of the annotation searchable. The search is powerful: information from different annotation levels can be combined in a single query, e.g. all phrases where a NOUN is annotated for both CASE and ANIMACY. This makes it easy to access data on particular linguistic phenomena.
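
To make the metadata and search points above more concrete, here is a minimal Python sketch of how a fixed tag set, virtual tags and ISO 639-3 references can work together. It is not TC's actual implementation or data model. The class and function names (Word, Phrase, validate, search), the tags PL and ANIM, the NUMBER group, the part-of-speech tag VERB and the three sample language codes are assumptions made for illustration. Only SG, DAT, ABL, NOUN, CASE and ANIMACY are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

# Hypothetical inventory: each glossing tag maps to the virtual tag it belongs to.
# SG, DAT, ABL, CASE and ANIMACY appear in the text; the rest is assumed.
GLOSS_GROUPS = {
    "SG": "NUMBER",
    "PL": "NUMBER",
    "DAT": "CASE",
    "ABL": "CASE",
    "ANIM": "ANIMACY",
}
POS_TAGS = {"NOUN", "VERB"}             # assumed part-of-speech inventory
LANGUAGE_CODES = {"aka", "nob", "swh"}  # tiny stand-in for the ISO 639-3 code table


@dataclass(frozen=True)
class Word:
    form: str
    pos: str
    glosses: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Phrase:
    language: str  # an ISO 639-3 code, never a free-text language name
    words: Tuple[Word, ...]


def validate(phrase: Phrase) -> None:
    """Refuse (by raising ValueError) anything with an unknown code or tag."""
    if phrase.language not in LANGUAGE_CODES:
        raise ValueError(f"unknown ISO 639-3 code: {phrase.language!r}")
    for word in phrase.words:
        if word.pos not in POS_TAGS:
            raise ValueError(f"unknown part-of-speech tag: {word.pos!r}")
        for tag in word.glosses:
            if tag not in GLOSS_GROUPS:
                raise ValueError(f"unknown glossing tag: {tag!r}")


def groups_of(word: Word) -> Set[str]:
    """Return the virtual tags (e.g. CASE) covered by a word's glossing tags."""
    return {GLOSS_GROUPS[tag] for tag in word.glosses}


def search(phrases: Iterable[Phrase], pos: str, groups: Iterable[str]) -> List[Phrase]:
    """Find phrases containing a word with the given POS that is annotated
    for every requested virtual tag."""
    required = set(groups)
    return [
        phrase
        for phrase in phrases
        if any(w.pos == pos and required <= groups_of(w) for w in phrase.words)
    ]
```

In this sketch, validate rejects a phrase whose word is glossed S instead of SG, and search(phrases, "NOUN", {"CASE", "ANIMACY"}) corresponds to the query "all phrases where a NOUN is annotated for both CASE and ANIMACY". Because tags and language codes come from closed sets, the query never has to guess at spelling variants.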



Latest page update: made by alexispalmer, Aug 28 2009, 10:16 AM EDT.