Reasons for Using Unicode

Why use Unicode?


Linguistic data should be created and stored using standards. For written text, the character encoding standard is Unicode (http://unicode.org), which is kept synchronized with ISO/IEC 10646.

Linguists who create text using a non-Unicode font put their data at risk: it won't be easily found by widely used search processes, nor will it be saved in a stable, standardized format that supports long-term preservation.

PDFs, Word documents, webpages, and other documents that are put online but created using a non-Unicode font will cause problems for searching. Using images for missing letters or symbols also presents a problem, as in the following example.

The snippets below come from the article “A Preliminary Study of Jaw Movement in Arrernte Consonant Production” by Marija Tabain (Journal of the International Phonetic Association 2009, 39: 33-51). There are two versions available: an HTML version and a PDF.

The HTML version uses images for certain IPA symbols, including one for the voiceless retroflex stop “ʈ”.


Unfortunately, it is not possible to search for the images within the HTML document itself, or to run a Google search across the Internet on such an image, so one won't be able to locate “ʈ” in the HTML document.

In the PDF version of the same text, a non-Unicode font has been used for the “ʈ”. At first glance, the “ʈ” appears fine in the PDF.

However, the font used has put the glyph for “ʈ” (Unicode LATIN SMALL LETTER T WITH RETROFLEX HOOK, U+0288) in the spot that Unicode allocates to the DAGGER (†, U+2020). As a result, it is not possible to search this document for “ʈ” (U+0288), because the character actually stored is the dagger; only its appearance has been changed. (In a similar way, many old Greek fonts would put the lowercase alpha on top of Latin "a" in the font, lowercase beta on top of "b", etc. To search for alpha, one had to search for the Latin letter "a". With Unicode, alpha has its own number [code point], U+03B1, which is different from that of Latin lowercase "a", U+0061, so now it is possible to search for alpha separately from Latin "a".)
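
The collision described above can be checked with Python's standard unicodedata module. This is only a minimal sketch: the string pdf_text is an invented stand-in for text extracted from a document set in such a font, not actual content from the article.

```python
import unicodedata

# The two characters are distinct code points with distinct names.
retroflex_t = "\u0288"
dagger = "\u2020"
for ch in (retroflex_t, dagger):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Hypothetical text as stored in a document that uses a non-Unicode font:
# the reader sees a retroflex t, but the stored character is the dagger.
pdf_text = "a\u2020a"

# A search for the real retroflex t fails; only a search for the dagger hits.
found_retroflex = retroflex_t in pdf_text
found_dagger = dagger in pdf_text
print(found_retroflex, found_dagger)
```

The same check applies to the Greek case: a document set in an old Greek font stores U+0061 where the reader sees alpha, so a search for U+03B1 finds nothing.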

A non-Unicode font will present other problems for the user: If one copies and pastes the letter “ʈ” from the above PDF into a Unicode-compliant word-processing document, it appears as a dagger.

The above example demonstrates that using a non-Unicode font can prevent search engines from finding documents and text properly.

Another important factor is that old documents created with non-Unicode fonts will be hard to read in the future, since they won’t be based on an international standard. If an elderly linguist were to die and leave his data on his computer (which he had keyed in with his own non-standard font), it may take considerable time and effort to convert the data into a standardized format that is usable by others, with the possibility that the data could be lost forever.
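
The kind of conversion mentioned above can be sketched as a simple character remapping. This is an illustration under stated assumptions, not a real converter: the table LEGACY_TO_UNICODE and the function convert_legacy_text are hypothetical names, and the single mapping entry is taken from the dagger example earlier on this page. A real legacy font would need a complete table, built by inspecting the font.

```python
# Hypothetical mapping for one legacy font: the code point the font misuses
# mapped to the Unicode character that its glyph actually depicts.
LEGACY_TO_UNICODE = {
    "\u2020": "\u0288",  # DAGGER slot drawn as LATIN SMALL LETTER T WITH RETROFLEX HOOK
}


def convert_legacy_text(text, mapping):
    """Return text with every legacy code point replaced by its Unicode equivalent."""
    return text.translate(str.maketrans(mapping))


# Invented sample: what was typed as a retroflex t is stored as a dagger.
legacy_sample = "a\u2020a"
converted = convert_legacy_text(legacy_sample, LEGACY_TO_UNICODE)
print(converted)
```

Such a remapping should be applied only to text known to be set in the legacy font. Otherwise a genuine dagger (for example, a footnote marker) would be converted as well, which is part of why this work takes time and effort.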

Tools (Fonts/Keyboards/etc.):

The listing below is not comprehensive; it is intended only to point to a few reliable webpages that offer tools and other useful information on Unicode-enabled products.

Fonts

Most core fonts that come with recent operating systems are Unicode-based, but they may not include all the special characters required by linguists. (Note: A new font that will be released with Windows 7, Ebrima, will include improved support for many African languages that use the Latin script. It also includes the Vai and N'Ko scripts.)

"Character pickers"

These enable users to select the letters or symbols they want and copy and paste them into documents.

Keyboards and Inputting Methods:


Other Script Standards and Useful Resources:

ISO Script Codes (ISO 15924): http://www.unicode.org/iso15924/

Recommendations on the Development of New Orthographies: http://www.unicode.org/notes/tn19/
This is a set of guidelines for linguists who are devising orthographies, so that the orthography can be used on computers.

Unicode for Language Documentation
An informative page from E-MELD with sections on (a) adding characters to Unicode, (b) precomposed forms (letters made up of a base character plus one or more diacritics, and why many of these are not encoded as single characters in Unicode), and (c) IPA and Unicode.

How to Get Involved in ISO Standards Development
This document also includes a short section on how to get involved in the development of the Unicode Standard.
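
The point about precomposed forms in the E-MELD entry above can be illustrated with Python's standard unicodedata module. This is a minimal sketch; the two sample letters were chosen for illustration and do not come from the E-MELD page.

```python
import unicodedata

# "e with acute" exists both as one precomposed character and as a sequence
# of base letter + combining acute accent. The two look alike but differ.
precomposed = "\u00e9"
decomposed = "e\u0301"
print(precomposed == decomposed)

# NFC normalization composes the sequence into the precomposed character.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)

# Open e (U+025B) with acute has no precomposed character, so it stays a
# two-code-point sequence even after NFC normalization.
open_e_acute = "\u025b\u0301"
normalized_open_e = unicodedata.normalize("NFC", open_e_acute)
print(len(normalized_open_e))
```

Normalizing both the stored text and the search term to the same form (NFC or NFD) before comparing avoids missed matches between the two spellings.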


Latest page update: made by DeborahAnderson, Aug 29 2009, 3:05 PM EDT
Keyword tags: Unicode