<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="http://cyberling.elanguage.net/xsl/rss2html.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://cyberling.elanguage.net/scripts/wpcss/wiki/cyberling/skin/spots/rss" type="text/css" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Cyberling Wiki - Recently Updated Pages</title><link>http://cyberling.elanguage.net/pageSearch/updated</link><description>Recently Updated Pages on http://cyberling.elanguage.net</description><language>en-us</language><webMaster>info@wetpaint.com</webMaster><pubDate>Fri, 09 Oct 2009 13:57:06 CDT</pubDate><lastBuildDate>Fri, 09 Oct 2009 13:57:06 CDT</lastBuildDate><generator>wetpaint.com</generator><ttl>60</ttl><image><title>Cyberling Wiki</title><url>http://image.wetpaint.com/image/3/LvaayjtMtf4eurpclDNXnw8275</url><link>http://cyberling.elanguage.net</link><description>linguistics, cyberinfrastructure</description></image><item><title>Group 2: Standards for Storage, Retrieval, and Search of Data</title><link>http://cyberling.elanguage.net/page/Group+2%3A+Standards+for+Storage%2C+Retrieval%2C+and+Search+of+Data</link><author>AliciaBW</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Group+2%3A+Standards+for+Storage%2C+Retrieval%2C+and+Search+of+Data</guid><pubDate>Fri, 09 Oct 2009 13:57:06 CDT</pubDate><description>&lt;i&gt;As part of the Cyberling2009 workshop at Berkeley, this working group was charged with identifying and documenting existing and needed standards for the digital storage, retrieval, and search of linguistic data. Another concern was the potential for reuse of language data by parties other than the original creators of the data. 
We present the results of our working sessions as a set of wiki pages, as outlined below.&lt;br&gt;&lt;/i&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;Members ---&lt;br&gt;&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;Debbie Anderson, Eric Kansa, Pavel Mihaylov, Johanna Nichols, Alexis Palmer, Alicia Wassink&lt;/font&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;br&gt;Process ---&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;br&gt;While standards of some kinds &lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;for storage, retrieval, and search of linguistic data &lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font color=&quot;#333333&quot;&gt;do exist in linguistics, many subfields of linguistics talk more about &amp;quot;best practices&amp;quot; and &amp;quot;common practices&amp;quot; than they do about &amp;quot;standards&amp;quot;. We discussed the ways these terms are used on our &lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Big+ideas&quot; target=&quot;_self&quot;&gt;Big Ideas&lt;/a&gt; page. We were not tasked with discussion of the related and important issue of annotation standards. 
For discussion of this issue, please see &lt;a href=&quot;http://cyberling.elanguage.net/page/Group+1%3A+Annotation+Standards&quot; target=&quot;_self&quot;&gt;Group 1: Annotation Standards&lt;/a&gt;.&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;blockquote&gt;&lt;b&gt;Summary of Big Ideas regarding data sharing (storage, retrieval and search)&lt;/b&gt;:&lt;br&gt;How do we define &amp;#39;&lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Big+ideas&quot; target=&quot;_self&quot;&gt;other standards&lt;/a&gt;&amp;#39;?&lt;br&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Big+ideas&quot; target=&quot;_self&quot;&gt;Standards vs. Best Practices&lt;/a&gt;&lt;br&gt;How do we encourage &lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Big+ideas&quot; target=&quot;_self&quot;&gt;adoption of standards&lt;/a&gt; in linguistics?&lt;br&gt;&lt;blockquote&gt;Data sharing: the publication model (for more on this, see the white paper from &lt;a href=&quot;http://cyberling.elanguage.net/page/Group+4+White+Paper&quot; target=&quot;_self&quot;&gt;Group 4&lt;/a&gt;)&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Subfield-specific+practices&quot; target=&quot;_self&quot;&gt;Standards are great, now how do I use them?&lt;/a&gt;&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Getting+Involved+in+ISO+Standards+Development&quot; target=&quot;_self&quot;&gt;How can I participate in the creation of ISO standards?&lt;/a&gt;&lt;br&gt;&lt;br&gt;&lt;/blockquote&gt;&lt;b&gt;Issues addressed within the WG2 wiki pages:&lt;br&gt;&lt;/b&gt;&lt;ul&gt;&lt;li&gt;Unicode character encoding standards for increasing stable display, readability, and sharing of data&lt;/li&gt;&lt;li&gt;Relational database storage&lt;/li&gt;&lt;li&gt;Wiki-based sharing of 
research&lt;br&gt;&lt;/li&gt;&lt;li&gt;Metadata tags for increased transparency and usability of data&lt;/li&gt;&lt;li&gt;Version control&lt;/li&gt;&lt;li&gt;Web standards for sharing datasets&lt;/li&gt;&lt;li&gt;Machine reusability of data (under construction)&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;Results ---&lt;br&gt;&lt;/font&gt;&lt;ol&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;A set of examples/&lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Case+Studies&quot; target=&quot;_self&quot;&gt;case studies&lt;/a&gt; demonstrating &lt;i&gt;applications &lt;/i&gt;of standards for storage, retrieval, and search and their utility for linguistic research. &lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;A set of &lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Subfield-specific+practices&quot; target=&quot;_self&quot;&gt;subfield-specific seed lists&lt;/a&gt; of common practices, requirements, conventions, etc. The purpose of creating these lists is twofold. First, the lists should be helpful to individual linguists working in the subfield in question. Second, they can serve as reference material to linguists from &lt;i&gt;other&lt;/i&gt; areas who might wish to annotate beyond their individual research concern. 
&lt;br&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;A seed list of &lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Existing+Standards&quot; target=&quot;_self&quot;&gt;existing standards&lt;/a&gt; for storage, retrieval, and search of linguistic data.&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;A handful of recommendations regarding not-yet-existent but &lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Needed+Standards&quot; target=&quot;_self&quot;&gt;needed standards&lt;/a&gt; for linguistics cyberinfrastructure.&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;Additional &lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Resources&quot; target=&quot;_self&quot;&gt;resources&lt;/a&gt;: relevant links, papers, etc.&lt;/font&gt;&lt;br&gt;&lt;/li&gt;&lt;/ol&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;Notes from &lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Notes&quot; target=&quot;_self&quot;&gt;working sessions&lt;/a&gt;&lt;/font&gt;&lt;br&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Member Bios</title><link>http://cyberling.elanguage.net/page/Member+Bios</link><author>TerryLangendoen</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Member+Bios</guid><comments>update to my biosketch</comments><pubDate>Sat, 26 Sep 2009 17:33:13 CDT</pubDate><description>&lt;b&gt;&lt;font size=&quot;4&quot;&gt;Working Group 1: Annotation Standards&lt;br&gt;&lt;/font&gt;&lt;/b&gt;  &lt;h3&gt;  Mary Beckman (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: mebeckman   &lt;br&gt;website: http://ling.osu.edu/~mbeckman&lt;br&gt;email: mbeckman@ling.osu.edu&lt;br&gt;I am a linguist who has worked (among many other things) on developing a framework for building annotation conventions for prosodic categories in specific language varieties (see 
http://ling.osu.edu/~tobi) and on developing a database of child and adult productions of word forms that target lingual obstruents (see http://ling.osu.edu/~edwards). Some of my current funded work is described at http://www.ling.ohio-state.edu/~edwards/socialDynamics/NSFhighlight.html and I am interested (among other things) in developing the infrastructure for facilitating this kind of collaborative approach that incorporates corpus-based work, experimental work, and computational modeling. &lt;br&gt;&lt;br&gt;&lt;h3&gt;  Stuart Robinson (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: stuartrobinson   &lt;br&gt;website: https://www.zapata.org/stuart&lt;br&gt;email: stuart@zapata.org&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;Sarah Churng&lt;/font&gt;&lt;br&gt;wiki username: ashragurnch&lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://students.washington.edu/ashra/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://students.washington.edu/ashra/&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.net/page/ashra%40u.washington.edu&quot; target=&quot;_self&quot;&gt;ashra@u.washington.edu&lt;/a&gt;&lt;br&gt;I am a graduate student at the University of Washington. My main goal at the Cyberling workshop is to foster discussion of how cyberlinguistic infrastructure may facilitate sign language transcription and annotation, as part of the Annotation Standards working group, through best practice efforts, dissemination through the web, etc. 
I hope to compound these discussions with current research in deaf literacy, in which the first language acquisition of sign languages is hypothesized to bootstrap spoken language literacy.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Greville Corbett&lt;br&gt;&lt;/h3&gt;wiki username:   &lt;br&gt;website: http://www.surrey.ac.uk/LIS/SMG/gcorbett.htm&lt;br&gt;email: &lt;br&gt;I work at the University of Surrey, where I lead the Surrey Morphology Group. I have worked particularly on the typology of features, as in Gender (1991), Number (2000) and Agreement (2006), which makes me very positive towards the Leipzig Glossing Rules, while aware of what still needs to be done there. The SMG has produced several typological databases, freely available over the web (http://www.surrey.ac.uk/LIS/SMG/web_resources.htm ), so I have a particular interest in how we ensure that the research and effort involved in setting up such resources is respected and acknowledged. I collaborated with Marina Chumakina, Dunstan Brown and Harley Quilliam on the Archi Dictionary (http://www.smg.surrey.ac.uk/archi/linguists/index.aspx) , with all the issues of scripts, sound files, picture files, web access and so on.&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Chuck Fillmore&lt;br&gt;&lt;/h3&gt;wiki username: cjfillmore   &lt;br&gt;website: &lt;br&gt;email: fillmore@icsi.berkeley.edu&lt;br&gt;A long-retired member of the Berkeley linguistics department. Associated since retirement with the &amp;quot;FrameNet&amp;quot; project, producing a kind of valency dictionary whose categories are based on knowledge of the &lt;b&gt;semantic frames&lt;/b&gt; that underlie individual words&amp;#39; meanings (http://framenet.icsi.berkeley.edu). The analyses are based on annotations of sentences extracted from a large corpus of written English. 
Currently also trying to use the same methods to build a registry of &amp;quot;&lt;b&gt;minority grammatical constructions&lt;/b&gt;&amp;quot; (the ones that ordinary parsers can&amp;#39;t handle, or that familiar compositional principles can&amp;#39;t interpret). I&amp;#39;ve been trying with some student colleagues to develop an abbreviated way to annotate phrases with respect to how they instantiate individual constructions. Each construct annotation has to be keyed to a full description of the construction, in the way that each lexical annotation has to be keyed to a full description of the relevant frame. Annotation of all the words or all the constructions in individual sentences requires layered stand-off annotation sets.&lt;br&gt;I lacked the patience to complete the complicated interview that would have allowed me to insert a photo - I could have tried to find out what my zodiac sign is, but I don&amp;#39;t have a good photo anyway: I&amp;#39;m tallish, red-haired, slow-moving, walk with a cane.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Richard Wright&lt;br&gt;&lt;/h3&gt;wiki username: rawright@u.washington.edu   &lt;br&gt;website: http://depts.washington.edu/phonlab/people/wright.html&lt;br&gt;email: rawright@u.washington.edu&lt;br&gt;I am an associate professor of linguistics and the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://depts.washington.edu/phonlab/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Phonetics Lab&lt;/a&gt; director at the University of Washington. I specialize in phonetic research and field linguistics, primarily in the acoustic domain. 
My interest in the Cyberling workshop stems from my experience developing and working with corpora of spoken language (in part with &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://faculty.washington.edu/wassink/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Alicia Wassink&lt;/a&gt;), in working on voice-based human-machine interfaces (&amp;quot;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://ssli.ee.washington.edu/vj/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;The Vocal Joystick&lt;/a&gt;&amp;quot;), and in working (with &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://staff.washington.edu/stiv/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Steve Moran&lt;/a&gt;) to develop a typological database of phonological inventories tied to ontological models of phonological feature theories that expands on previous work such as UPSID and the Stanford Phonology Archive (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://phoible.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;PHOIBLE&lt;/a&gt;). I am particularly interested in methods for representing the sounds of languages in ways that are machine readable and standardized across languages and across applications.&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font size=&quot;4&quot;&gt;Working Group 2: Other Standards&lt;br&gt;&lt;/font&gt;&lt;/b&gt;  &lt;h3&gt;  Johanna Nichols (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: Johanna.Nichols   &lt;br&gt;website: http://linguistics.berkeley.edu/~ingush/, http://socrates.berkeley.edu/~jbn/&lt;br&gt;email: johanna@berkeley.edu&lt;br&gt;&lt;br&gt;I am Professor emeritus in Slavic linguistics at UC Berkeley. 
I co-founded and co-direct (with Balthasar Bickel; see below) the Autotyp databases and research project (http://uni-leipzig.de/~autotyp/), which will probably have its complete genealogical classification on-line by the time the Cyberling workshop begins. I work on questions of phylogeny, detecting and demonstrating linguistic relatedness, deep linguistic prehistory, and typology, and I have a documentation project creating very large corpora of spoken Ingush and Chechen (East Caucasian). All of these projects require combining data from different fields (linguistics, archaeology, ethnography, human genetics). I am concerned with seeing standards and tools developed that will not require every documentary linguist to start from scratch electronically, and that will serve linguists and language users rather than vice versa. I have developed all-lower-ascii practical writing systems for Ingush and Chechen and a similar system for my East Caucasian etymological database and am distressed at what I perceive as increasing pressure to put corpora, syntactic examples, dictionary headwords, etc. into unreadable phonetic transcription just because of font availability. For my Ingush corpus I&amp;#39;m working out ways of interlinearizing and lemmatizing in languages where lemmas, inflectional categories, etc. are properties of clauses rather than words. I&amp;#39;m also concerned with the lack of instructions, documentation, labels, etc., etc. 
in languages readable by speakers of minority languages in Russia.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Alexis Palmer, Saarland University (CoLi) and the University of Texas at Austin (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/alexispalmer&quot; target=&quot;_self&quot;&gt;alexispalmer&lt;/a&gt;   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://comp.ling.utexas.edu/apalmer&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;http://comp.ling.utexas.edu/apalmer&quot;&gt;http://comp.ling.utexas.edu/apalmer&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.net/page/apalmer%40coli.uni-sb.de&quot; target=&quot;_self&quot; title=&quot;apalmer@coli.uni-sb.de&quot;&gt;apalmer@coli.uni-sb.de&lt;/a&gt;&lt;br&gt;In May 2009 I began a new position as a postdoc in the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.mmci.uni-saarland.de/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;M2CI Cluster of Excellence&lt;/a&gt; at Saarland University in Saarbr&amp;uuml;cken, Germany. There I am working with Caroline Sporleder in a research group on &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.coli.uni-saarland.de/projects/comodis/publications.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Computational Modelling of Discourse and Semantics&lt;/a&gt;. I&amp;#39;m also just finishing my PhD in &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://comp.ling.utexas.edu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;computational linguistics&lt;/a&gt; at the University of Texas at Austin, under the supervision of Jason Baldridge and Katrin Erk. 
My thesis research, which is connected with the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://comp.ling.utexas.edu/earl/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;EARL project&lt;/a&gt;, has to do with integrating automatic labeling and human annotation for more efficient production of interlinear glossed text (IGT). In general, I&amp;#39;m interested in the potential for applying techniques and methodologies from computational linguistics (CL) to research in other linguistic subfields, and particularly to documentation of endangered languages. Bridging the space between CL and the rest of linguistics raises a host of issues -- data formatting and availability, code reusability and availability, modularity and generalizability of computational models, etc. -- related to the aims and core ideas of Cyberling 2009.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Debbie Anderson&lt;br&gt;&lt;/h3&gt;wiki username: DeborahAnderson   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linguistics.berkeley.edu/sei/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://linguistics.berkeley.edu/sei/&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:dwanders@sonic.net&quot; target=&quot;_self&quot;&gt;dwanders@sonic.net&lt;/a&gt;&lt;br&gt;I run a project at UC Berkeley, the Script Encoding Initiative, that assists groups (and individuals) in getting eligible scripts and characters into the Unicode Standard/ISO 10646, the international character code standard. I am particularly keen on making sure linguists and members of the user communities (especially minority language speakers) get a voice in the development of standards. 
I am the UC Berkeley representative to the Unicode Consortium, and a member of the US delegation to ISO/IEC JTC 1 SC2 Working Group 2 on coded character sets.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Eric Kansa&lt;br&gt;&lt;/h3&gt;wiki username: ekansa   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://isd.ischool.berkely.edu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://isd.ischool.berkely.edu&lt;/a&gt;&lt;br&gt;email: ekansa@ischool.berkeley.edu&lt;br&gt;Eric C. Kansa is Executive Director of the Information and Service Design Program and is an Adjunct Professor at the UC Berkeley School of Information (I School). His primary role is to develop service design projects that bring I School students and faculty to work in collaboration with partner organizations. His research interests include efforts to enhance the accessibility and usability of research data collected in the field sciences, as well as the impact of ubiquitous information accessibility on the consumer experience of services. Before coming to UC Berkeley, Eric was cofounder and Executive Director of the Alexandria Archive Institute, a nonprofit organization. There he led development of &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.opencontext.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Open Context&lt;/a&gt;, an online system for publishing primary research data collected in the field sciences. This follows a position on the faculty of Harvard University, where he served as Lecturer and Undergraduate Tutor for the Department of Anthropology. He graduated from the University of California, San Diego with a BA in Cultural Anthropology. Eric was awarded a doctorate in Anthropology at Harvard University in 2001. 
Eric is currently Convener of the Society for American Archaeology&amp;#39;s Digital Data Interest Group.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Pavel Mihaylov&lt;br&gt;&lt;/h3&gt;wiki username: pavel.mihaylov   &lt;br&gt;email: bin,at,bash,dot,info&lt;br&gt;I am a computational linguist working for &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.ontotext.com/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Ontotext&lt;/a&gt;, a mixed industry/research company based in Sofia, Bulgaria. My main occupation is web mining/information extraction and finite-state morphologies. Together with Dorothee Beermann, I work on TypeCraft (see Dorothee&amp;#39;s bio). Other than the computational bit, I have a general interest in linguistics and languages.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Alicia Wassink&lt;br&gt;&lt;/h3&gt;wiki username: AliciaBW   &lt;br&gt;website: http://faculty.washington.edu/wassink/&lt;br&gt;email: wassink@u.washington.edu&lt;br&gt;I work in acoustic phonetics, sociolinguistics and creole linguistics. In the Cyberling workshop, I&amp;#39;m wearing my sociophonetician hat. I am currently co-authoring a chapter on best practices in sociophonetics pertaining to the instrumental analysis of vowels for addressing research questions of interest to sociolinguists. This chapter is about setting standards vis-a-vis best practices in data analysis, and isn&amp;#39;t so much about data storage, retrieval or archiving, which are all topics I&amp;#39;m interested in discussing as part of this working group. As director of the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://depts.washington.edu/sociolab/index.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Sociolinguistics Laboratory&lt;/a&gt; at the University of Washington, Dept. 
of Linguistics, I&amp;#39;ve run metadata tutorials to train my students to use established protocols for associating metadata with their audiofiles that will increase data accessibility, searchability, acoustic analysis and ease of querying.&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font size=&quot;4&quot;&gt;Working Group 3: Tools&lt;br&gt;&lt;/font&gt;&lt;/b&gt;  &lt;h3&gt;  Bill Byrne (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/bill_byrne&quot; target=&quot;_self&quot;&gt;bill_byrne&lt;/a&gt;   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.linkedin.com/in/billb&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;About me&quot;&gt;About me&lt;/a&gt;&lt;br&gt;email: billb@google.com&lt;br&gt;I design speech user interfaces at Google (e.g. &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.youtube.com/watch?v=y3z7Tw1K17A&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;iPhone app&quot;&gt;iPhone app&lt;/a&gt;) and have been in this field for the last ten years. Having completed a linguistics PhD in 1998, I continue to follow theoretical and applied work but I have always been troubled by the lack of data available to researchers. I see Cyberling as an interesting and encouraging activity for the field. 
I would love to help develop more ways for linguists in all subdisciplines to easily gain access to very large sets of data as well as share their own data with the rest of the world.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Robert Forkel (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/robert_forkel&quot; target=&quot;_self&quot;&gt;robert_forkel&lt;/a&gt;   &lt;br&gt;website: &lt;a href=&quot;http://cyberling.elanguage.nethttps://dev.livingreviews.org/projects/epubtk/wiki/people/robert&quot; target=&quot;_self&quot;&gt;https://dev.livingreviews.org/projects/epubtk/wiki/people/robert&lt;/a&gt;&lt;br&gt;email: forkel@mpdl.mpg.de &lt;br&gt;I am a mathematician-turned-software-developer working at the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://mpdl.mpg.de/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Max Planck Digital Library&lt;/a&gt;. My interest in web infrastructure for linguistics started while I was working on &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://wals.info/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;WALS Online&lt;/a&gt;. With a couple more projects coming in, I&amp;#39;m interested in ways to publish linguistic data as part of the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linkeddata.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Linked Open Data Cloud&lt;/a&gt;. 
The data I&amp;#39;m concerned with mainly is word lists, interlinear glosses, etc.&lt;br&gt;&lt;br&gt;What I hope to take away from cyberling is a clearer idea about the lowest level of quality/granularity which would make sharing data still fruitful - I&amp;#39;m looking for low-hanging fruit.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Dorothee Beermann&lt;br&gt;&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/DorotheeBeermann&quot; target=&quot;_self&quot; title=&quot;DorotheeBeermann&quot;&gt;DorotheeBeermann&lt;/a&gt;   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.hf.ntnu.no/hf/isk/Ansatte/dorothee.beermann/personInfo.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;http://www.hf.ntnu.dorothee.beermann&quot;&gt;http://www.hf.ntnu.dorothee.beermann&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:dorothee.beermann@hf.ntnu.no&quot; target=&quot;_self&quot; title=&quot;dorothee.beermann@hf.ntnu.no&quot;&gt;dorothee.beermann@hf.ntnu.no&lt;/a&gt;&lt;br&gt;I am associate professor at the Norwegian University of Science and Technology (NTNU) in Trondheim, Norway.&lt;br&gt;My interest in Cyberlinguistics results from working across linguistic fields, from grammar engineering on the one hand to African linguistics on the other. Perhaps in particular when working across frameworks one would like to know what defines linguistics as a whole. Can we for example find a common answer to the question: &amp;#39;What defines linguistic methodology?&amp;#39;? &lt;br&gt;I am in particular interested in the role that interlinear glosses (IGs) play in linguistic research. 
Not so happy with the role IGs play at present, I would like to help facilitate a development where they become an independent linguistic resource - accessible to all of us.&lt;br&gt;Together with Pavel Mihaylov I have created a linguistic tool that helps to generate, store and retrieve IGs in a setting that allows sharing them with a group of colleagues or publishing them online (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.typecraft.org/tc2wiki/Main_Page&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;TypeCraft&quot;&gt;TypeCraft&lt;/a&gt;). &lt;br&gt;&lt;br&gt;&lt;h3&gt;  Arienne Dwyer&lt;br&gt;&lt;/h3&gt;wiki username:   &lt;br&gt;website: &lt;br&gt;email: &lt;br&gt;[insert bio here]&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Florian Jaeger, University of Rochester&lt;br&gt;&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/FlorianJaeger&quot; target=&quot;_self&quot;&gt;FlorianJaeger&lt;/a&gt;   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.hlp.rochester.edu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.hlp.rochester.edu&lt;/a&gt; [nevermind the security warning; trust me]&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:fjaeger@bcs.rochester.edu&quot; target=&quot;_self&quot;&gt;fjaeger@bcs.rochester.edu&lt;/a&gt;&lt;br&gt;I received my M.A. in Linguistics and Computer Science (HU &amp;amp; TU Berlin, with a visit to UC Berkeley) and my PhD in Linguistics with a designation in cognitive science (Stanford University with a visit to MIT). I have been at Rochester (Brain and Cognitive Sciences and Computer Science) since 01/2007, where I am working on efficient language production, maintenance of probabilistic linguistic representations, and other such stuff. My involvement in Cyberling 2009 is related to my interests in replicability and extensibility of scientific work. 
This implies development of tools and annotation standards that make the data sets developed by one researcher useful to others. I am also interested in cheap &amp;#39;technology&amp;#39; with potentially high impact, such as taking laptops to the field to run psycholinguistic studies, or the use of online platforms like Mechanical Turk to elicit large amounts of data from many languages at low cost. This often results in unbalanced, highly clustered data (similar to corpus data) for which modern statistical methods are required (in which I am also interested ... conveniently).&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Jeremy G. Kahn&lt;br&gt;&lt;/h3&gt;wiki username: JeremyKahn   &lt;br&gt;website: &lt;br&gt;email: jgk@washington.edu&lt;br&gt;I am a Ph.D. student in Linguistics at the University of Washington. I work as a Research Assistant to Mari Ostendorf in the Signal, Speech and Language Interpretation laboratory within UW Electrical Engineering, and I am currently a Visiting Fellow at the SRI Speech Technology and Research laboratory in Menlo Park. My primary research area is in using syntactic information to support a speech-recognition and machine-translation pipeline. 
I am interested in data annotation both on principle and as a matter of necessity; heavily statistical research software systems have tremendous issues with data compatibility, portability, and reconciliation.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Virach Sornlertlamvanich&lt;br&gt;&lt;/h3&gt;wiki username: virach   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tcllab.org/virach&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.tcllab.org/virach&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:virach@tcllab.org&quot; target=&quot;_self&quot;&gt;virach@tcllab.org&lt;/a&gt;&lt;br&gt;I am the Assistant Executive Director of the National Electronics and Computer Technology Center (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.nectec.or.th/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;NECTEC&lt;/a&gt;), and the Co-director of the Thai Computational Linguistics Laboratory (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tcllab.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TCL&lt;/a&gt;). My research interests lie in computational linguistics, covering morphological, syntactic and semantic representation and analysis. 
My previous works have been provided in publications and implementations such as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://lexitron.nectec.or.th/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;LEXiTRON&lt;/a&gt; (English-Thai dictionary), &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://asianwordnet.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Asian WordNet&lt;/a&gt;, &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tcllab.org/kui/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Kui&lt;/a&gt; (Knowledge Unifying Initiator: online collaboration editing tool), word segmentation (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.cmu.edu/~paisarn/software.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;SWATH&lt;/a&gt;), &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.hlt.nectec.or.th/orchid&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ORCHID&lt;/a&gt; (Thai POS tagged corpus), &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://suparsit.com/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ParSit&lt;/a&gt; (English-Thai online machine translation), &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://sansarn.com/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Sansarn&lt;/a&gt; (Thai search engine portal), etc. Currently, I am working as a chair of Asian Language Resource group of AFNLP (Asian Federation of NLP), director member of AAMT (Asia-Pacific Association for Machine Translation). 
I am also conducting a series of school of &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tcllab.org/add&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ADD&lt;/a&gt; (Asian Applied NLP for linguistics diversity and language resource Development) for NLP networking and collaboration in language resource development especially for Asian countries.&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font size=&quot;4&quot;&gt;Working Group 4: Data Reliability &amp;amp; Provenance&lt;br&gt;&lt;/font&gt;&lt;/b&gt;  &lt;h3&gt;  Peter Austin (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: pkaustin   &lt;br&gt;website: http://www.hrelp.org/aboutus/staff/index.php?cd=pa&lt;br&gt;email: pa2@soas.ac.uk&lt;br&gt;&lt;br&gt;I am Marit Rausing Chair in Field Linguistics at the School of Oriental and African Studies in London and Director of the Endangered Languages Academic Programme. My research interests lie in the theory and practice of language documentation and description, endangered languages, morphosyntactic typology, Lexical Functional Grammar, and languages of eastern Indonesia and Aboriginal Australia. At SOAS I teach a course on &lt;i&gt;Technology and Language Documentation&lt;/i&gt; that covers data modeling, workflow, metadata, archiving, ethics and protocols, and software tools, and have participated in training workshops that cover these topics.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Martin Haspelmath, Max Planck Institute for Evolutionary Anthropology&lt;br&gt;&lt;/h3&gt;wiki username: haspelmath   &lt;br&gt;website: http://www.eva.mpg.de/lingua/staff/haspelmath/home.php&lt;br&gt;email: haspelmath@eva.mpg.de&lt;br&gt;&lt;br&gt;I am a typologist interested in linking structural data from as many different languages as possible (typological databases). As of 2008, the World Atlas of Language Structures (http://wals.info) has been online, and I&amp;#39;m interested in getting more such datasets out. 
I hope that linguists will soon publish their dictionaries and corpora online, and that it will become normal to see such online resources as regular (peer-reviewed) publications, because without the incentive of regular publication I fear linguists will not share their materials.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Kurt Bollacker&lt;br&gt;&lt;/h3&gt;wiki username:   &lt;br&gt;website: &lt;br&gt;email: &lt;br&gt;[insert bio here]&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Tracy Holloway King&lt;br&gt;&lt;/h3&gt;wiki username: tracyhollowayking   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www-csli.stanford.edu/~thking/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www-csli.stanford.edu/~thking/&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:tracyhollowayking@gmail.com&quot; target=&quot;_self&quot;&gt;tracyhollowayking@gmail.com&lt;/a&gt;&lt;br&gt;I am on the LSA TAC committee (with a number of people on this list). I currently manage the natural language engineering groups at Powerset, a semantic search company acquired in 2008 by Microsoft. I have been a co-organizer for the first three grammar engineering across frameworks (GEAF) workshops. 
I am particularly interested in making sure that resources, platforms, and theories created can be used cross-linguistically.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Koenraad de Smedt&lt;br&gt;&lt;/h3&gt;wiki username: Koenraad   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://ling.uib.no/desmedt/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://ling.uib.no/desmedt/&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:desmedt@uib.no&quot; target=&quot;_self&quot;&gt;desmedt@uib.no&lt;/a&gt;&lt;br&gt;I am professor of Computational Linguistics at the University of Bergen, Norway and head of the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://ling.uib.no/lamore&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Research Group on Language Models and Resources (LaMoRe)&lt;/a&gt;. Lately I have been working on parsebanking. I am currently the national contact person for CLARIN in Norway and a member of the Science Opportunities Panel of the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.forskningsradet.no/servlet/Satellite?c=Page&amp;cid=1226485583597&amp;pagename=evita%2FHovedsidemal&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;eVITA Programme&lt;/a&gt; Committee (Research Council of Norway). I am also coordinator of the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://clara.uib.no/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;CLARA&lt;/a&gt;, a Marie Curie ITN which will start up in the fall of 2009 (see &lt;a href=&quot;http://cyberling.elanguage.net/page/Organizations+and+Initiatives&quot; target=&quot;_self&quot;&gt;Organizations and Initiatives&lt;/a&gt;). 
In 2007 I organized a &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://tlt07.uib.no/ulaindex.php?page=program&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Workshop on Unified Linguistic Annotation&lt;/a&gt;. &lt;br&gt;&lt;br&gt;&lt;h3&gt;  Paul Trilsbeek&lt;br&gt;&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/paul.trilsbeek&quot; target=&quot;_self&quot;&gt;paul.trilsbeek&lt;/a&gt;   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.mpi.nl/people/trilsbeek-paul&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.mpi.nl/people/trilsbeek-paul&lt;/a&gt;&lt;br&gt;email: Paul.Trilsbeek@mpi.nl&lt;br&gt;I am an archive manager for the language archive at the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.mpi.nl/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Max Planck Institute for Psycholinguistics&lt;/a&gt;, part of which is the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.mpi.nl/DOBES&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;DOBES&lt;/a&gt; archive of endangered languages. At MPI we also develop an array of &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lat-mpi.eu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;linguistic tools and a framework for digital archiving of language resources&lt;/a&gt;. 
In the European &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.clarin.eu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;CLARIN&lt;/a&gt; project, which aims at creating a common infrastructure for language resources and technology, the MPI plays an important role in the technical work package (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.clarin.eu/wp2&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;WP2&lt;/a&gt;).&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font size=&quot;4&quot;&gt;Working Group 5: Models from Other Fields&lt;br&gt;&lt;/font&gt;&lt;/b&gt;  &lt;h3&gt;  Scott Farrar (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: sofarrar   &lt;br&gt;website: http://faculty.washington.edu/farrar/&lt;br&gt;email: farrar@u.washington.edu&lt;br&gt;I am an Assistant Professor of Linguistics at the University of Washington, teaching computational linguistics in the Professional Master&amp;#39;s in Computational Linguistics Program at the University of Washington. My primary interest is in computational linguistics with a focus on &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://uakari.ling.washington.edu/e-linguistics&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;e-linguistics&lt;/a&gt;, or how to apply computational techniques in traditional linguistics research. I received my PhD in Linguistics from the University of Arizona in 2003. Before joining the CLMA Program, I worked at the University of Bremen in Germany and in Cameroon on a fieldwork assignment researching endangered Beboid languages. 
I am currently funded by the National Science Foundation on grant BCS-0720670 entitled, &amp;quot;Implementing the GOLD Community of Practice: Laying the Foundations for a Linguistics Cyberinfrastructure.&amp;quot; &lt;br&gt;&lt;br&gt;&lt;h3&gt;  Terry Langendoen (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: TerryLangendoen   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linguistics.arizona.edu/~langendoen&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://linguistics.arizona.edu/~langendoen&lt;/a&gt; &lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:langendt@email.arizona.edu&quot; target=&quot;_self&quot;&gt;langendt@email.arizona.edu&lt;/a&gt;&lt;br&gt;I am Professor Emeritus of Linguistics at the University of Arizona, having retired from academia in 2005. From 2006 to 2008, I was a Program Director in Linguistics at the National Science Foundation, and for the past year have been working part-time as an Expert in the Robust Intelligence Program in the Division of Information and Intelligent Systems at NSF. I thank Nancy Ide for getting me interested in the problem of annotation of electronic linguistic data by inviting me to become part of the NEH-supported Text Encoding Initiative (TEI) in 1987. In that project, I worked with Gary Simons to develop recommendations for the encoding of linguistic structure in SGML, the predecessor to XML, including a general-purpose annotation format for feature structures. In 2001, I began work on the NSF-supported E-MELD project, and together with Scott Farrar and Will Lewis developed the initial specifications of the General Ontology for Linguistic Description (GOLD), with the idea of enabling linguists to annotate their data without being committed to specific markup syntax, as in TEI, and to enable data annotated in different ways to be computationally interoperable. 
At the January 2009 LSA Annual Meeting, Emily Bender and I organized a special session on Computational Linguistics in Support of Linguistic Theory; in our presentation we touched on several of the themes of this workshop. I discussed Cyberling 2009 briefly at the end of my contribution &amp;quot;Opportunities at NSF&amp;quot; in the special section entitled &amp;quot;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.psychologicalscience.org/observer/getArticle.cfm?id=2544&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Keeping Science Moving in Tight Times&amp;quot;&lt;/a&gt; in the September 2009 issue of the Association for Psychological Science &lt;i&gt;Observer&lt;/i&gt;. &lt;br&gt;&lt;br&gt;&lt;h3&gt;  Balthasar Bickel&lt;br&gt;&lt;/h3&gt;wiki username:   &lt;br&gt;website: &lt;br&gt;email:&lt;br&gt;[insert bio here]&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Steve Moran&lt;br&gt;&lt;/h3&gt;wiki username: Steve_Moran   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://staff.washington.edu/stiv/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://staff.washington.edu/stiv/&lt;/a&gt;&lt;br&gt;email: stiv@u.washington.edu&lt;br&gt;I am a PhD student in the Linguistics Department at the University of Washington. My research interests include language documentation and developing cyberinfrastructure for interoperability of linguistic data. As a field linguist I am active in Prof. Jeffrey Heath&amp;#39;s Dogon languages project that is documenting the Dogon languages of Mali, and creating an online comparative lexicographic (and multimedia) website (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://dogonlanguages.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://dogonlanguages.org/&lt;/a&gt;). 
With Richard Wright, I am also developing a typological database of phonological inventories that is tied to ontological models of phonological feature theories. We are making these resources available at our project website, PHOIBLE (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://phoible.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://phoible.org/&lt;/a&gt;). Previously I worked for the Linguist List, specifically on the E-MELD project to foster the consensus of &amp;quot;best practice&amp;quot; standards for the digital archiving of endangered languages data.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Cornelius Puschmann, University of D&amp;uuml;sseldorf&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/coffee001&quot; target=&quot;_self&quot;&gt;coffee001&lt;/a&gt;   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://ynada.com/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://ynada.com/&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:cornelius.puschmann@uni-duesseldorf.de&quot; target=&quot;_self&quot;&gt;cornelius.puschmann@uni-duesseldorf.de&lt;/a&gt;&lt;br&gt;I am a postdoc at the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.phil-fak.uni-duesseldorf.de/anglistik3/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Department of English Language and Linguistics&lt;/a&gt; at the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.uni-duesseldorf.de/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;University of D&amp;uuml;sseldorf&lt;/a&gt;, Germany. 
My involvement in Cyberling 2009 stems from my role as the technical coordinator of &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://elanguage.net/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;eLanguage&lt;/a&gt;, the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://lsadc.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;LSA&lt;/a&gt;&amp;#39;s Open Access publishing platform. I am also a strong proponent for &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.berlin6.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Open Access&lt;/a&gt; and Open Data in linguistics and in other disciplines.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Dwight van Tuyl, Eastern Michigan University&lt;br&gt;&lt;/h3&gt;wiki username: dvantuyl   &lt;br&gt;website: http://linguistlist.org/people/dwight.html&lt;br&gt;email: dwight@linguistlist.org&lt;br&gt;I&amp;#39;m a programmer at the LINGUIST List at Eastern Michigan University. We&amp;#39;ve recently finished the GOLD Community website at http://linguistics-ontology.org which attempts to build a community around the General Ontology of Linguistic Description currently being developed by Scott Farrar of the University of Washington. At the LINGUIST List, we plan on using GOLD in our latest project, LEGO, for annotating lexical data with GOLD concept URI&amp;#39;s. 
I&amp;#39;m hoping to come back from this workshop with an understanding of what tools and interoperable standards could be used for projects like LEGO in order to provide a low barrier of entry for participating in a cyberinfrastructure for linguists.&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font size=&quot;4&quot;&gt;Working Group 6: Funding Models&lt;br&gt;&lt;/font&gt;&lt;/b&gt;  &lt;h3&gt;  Mark Liberman (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: MarkYLiberman   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://ling.upenn.edu/~myl&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://ling.upenn.edu/~myl&lt;/a&gt;&lt;br&gt;email: MarkYLiberman@gmail.com&lt;br&gt;Professor (Linguistics, Computer and Information Sciences) at University of Pennsylvania; Director, Linguistic Data Consortium.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  David Lightfoot (co-chair)&lt;br&gt;&lt;/h3&gt;wiki username: DavidLightfoot   &lt;br&gt;website: &lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:lightd@georgetown.edu&quot; target=&quot;_self&quot;&gt;lightd@georgetown.edu&lt;/a&gt;&lt;br&gt;David Lightfoot writes mainly on syntactic theory, language acquisition and historical change, which he views as intimately related. He argues that internal language change is contingent and fluky, takes place in a sequence of bursts, and is best viewed as the cumulative effect of changes in individual grammars, where a grammar is a &amp;quot;language organ&amp;quot; represented in a person&amp;#39;s mind/brain and embodying his/her language faculty. That, in turn, entails a non-standard view of language acquisition as &amp;quot;cue-based.&amp;quot; He has published eleven books, most recently The Development of Language (Blackwell, 1999), Syntactic Effects of Morphological Change (ed.) (Oxford UP, 2002), The Language Organ (with S.R. Anderson) (Cambridge UP, 2002), and How New Languages Emerge (Cambridge UP, 2006). 
He is also the author of more than 100 articles, book chapters and reviews. He is general editor for the Generative Syntax series published by Blackwell, and serves on the linguistics editorial board at Cambridge University Press. In 2004, he was elected a fellow of the American Association for the Advancement of Science, and in 2006, as a fellow of the Linguistic Society of America.&lt;br&gt;&lt;br&gt;Dr. Lightfoot has held regular professorial appointments at several universities including McGill University, where he taught many undergraduates who went on to become major figures in linguistics and psychology including Mark Baltin, Alan Prince, Michael Rochemont, Alison Gopnik, Elan Dresher, Norbert Hornstein, Amy Weinberg, Ren&amp;eacute;e Baillargeon and Elizabeth Cowper; the University of Utrecht in the Netherlands; and the University of Maryland, where he established and chaired for 12 years, a new department of linguistics with a unique focus--viewing linguistics as the study of the human language organ. He was also the associate director of the neuroscience and cognitive sciences program there. In 2001, he moved to Georgetown University as dean of the graduate school. In addition, he has held short-term appointments at universities in Austria, Brazil, Canada, Germany, Switzerland and the United Kingdom. In June 2005, he became assistant director of the National Science Foundation, heading the Directorate for Social, Behavioral and Economic Sciences.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Anthony Aristar&lt;br&gt;&lt;/h3&gt;wiki username: aristar   &lt;br&gt;website: http://linguistlist.org/aristar&lt;br&gt;email: aristar@linguistlist.org&lt;br&gt;I started life as a historical linguist and typologist, but was drawn, during the early Internet age (it was called Arpanet then!) to its potential as a medium for academic exchange. 
I founded the LINGUIST List in 1990 as a step in this direction, and the list grew so much that I and my co-Moderator Helen Aristar-Dry realized that we needed to rethink what we were and where LINGUIST could go. We started applying for grants to build infrastructure for the discipline, and in 2004 Helen and I became co-Directors of the Institute for Language Information and Technology, housed at Eastern Michigan University. Our focus for the last few years has continued to be linguistic infrastructure, but we also now deal extensively with standards for linguistics on the Internet (e.g. our EMELD project) and work on digitizing endangered languages data. We have the following ongoing projects, funded by either NSF or NEH: MultiTree (http://multitree.linguistlist.org/) which is collecting all known hypotheses on language relationships, LLMAP (http://llmap.org) which is aimed at making GIS a fundamental part of linguistics, GOLDComm (http://linguistics-ontology.org/), which is aimed at expanding the GOLD ontology for linguistic description, LEGO (http://linguistlist.org/projects/lego.cfm) which has as its goal the development of several &amp;quot;building blocks&amp;quot; for lexical data interoperability within linguistics, and RELISH, a collaborative project with the Max Planck Institute for Psycholinguistics and The Johann Wolfgang Goethe-Universit&amp;auml;t Frankfurt, aimed at unifying two digital collections of endangered languages with special attention given to harmonizing the European and American standards for language documentation and lexicon building.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Collin Baker&lt;br&gt;&lt;/h3&gt;wiki username: collinfb   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://framenet.icsi.berkeley.edu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://framenet.icsi.berkeley.edu&lt;/a&gt;&lt;br&gt;email: collinb@icsi.berkeley.edu&lt;br&gt;&lt;br&gt;I am a linguist, 
working as manager of the FrameNet Project, founded and directed by Prof. Charles Fillmore, which is part of the AI group at the International Computer Science Institute in Berkeley. For the last decade, we have been building a rich lexical semantic database for English, based on frame semantic principles and grounded on manually annotated corpus examples of usage. We are currently participating in a joint annotation project for the American National Corpus (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://americannationalcorpus.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://americannationalcorpus.org&lt;/a&gt;), collaborating with colleagues building FrameNets for Spanish, German, Chinese, Japanese, etc., planning an alignment of FrameNet with WordNet, and exploring crowdsourcing as a means of gathering annotation data. I am interested in the problem of funding resource building, particularly long-term efforts. &lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Helen Dry&lt;br&gt;&lt;/h3&gt;wiki username:   &lt;br&gt;website: &lt;br&gt;email: &lt;br&gt;[insert bio here]&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Laura Welcher&lt;br&gt;&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/lbwelch&quot; target=&quot;_self&quot;&gt;lbwelch&lt;/a&gt;   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.rosettaproject.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.rosettaproject.org&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:laura@longnow.org&quot; target=&quot;_self&quot;&gt;laura@longnow.org&lt;/a&gt;&lt;br&gt;&lt;br&gt;I direct The Rosetta Project at &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.longnow.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;The Long Now Foundation&lt;/a&gt; in San Francisco, and am one of the 
co-organizers of Cyberling 2009. My interest in cyberlinguistics originally developed out of my experience in linguistic fieldwork, using specialized tools like Shoebox/Toolbox, as well as trying to make general tools like Filemaker Pro work for lexicography. Both of these tasks quickly gave me the sense that better tools are needed for what linguists do! Besides language documentation, my work at The Rosetta Project has underscored the need for standards upon which to build tools. The Rosetta Project maintains an archival collection for all of the world&amp;#39;s languages in multiple media formats. How does one search such an archive? How do experts interact with the content? How do users without any knowledge of language names, ISO codes, and language relationships find out information about the nearly 7,000 languages on the planet? These are some of the problems any project that claims to be &amp;quot;All Languages&amp;quot; must deal with. Our new archival structure is distributed and publicly interactive -- all languages and language relationships are available as open content in our &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.freebase.com/view/base/rosetta/views/langoid&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Rosetta Base in Freebase&lt;/a&gt;, all of the archived materials are in our &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.archive.org/details/rosettaproject&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Rosetta Collection in the Internet Archive&lt;/a&gt;, and we are currently building a user-editable wiki interface on top of this (currently in alpha mode, so please ask me if you&amp;#39;d like a demo -- we also need a good name for it...Rosetta Panglossia?). 
A companion project to the digital archive is &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://rosettaproject.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;The Rosetta Disk&lt;/a&gt; -- a microscopic version of the collection, built out of materials that can last for millennia -- this is one of the showpiece artifacts to get people engaged in long-term thinking, along with the Foundation&amp;#39;s &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://longnow.org/projects/clock/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;10,000 Year Clock of the Long Now&lt;/a&gt;.&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font size=&quot;4&quot;&gt;Working Group 7: Collaboration Structure&lt;/font&gt;&lt;br&gt;&lt;/b&gt;  &lt;h3&gt;  Brian MacWhinney (chair)&lt;br&gt;&lt;/h3&gt;wiki username: macw   &lt;br&gt;website: talkbank.org&lt;br&gt;email: macw@cmu.edu&lt;br&gt;Brian MacWhinney, Professor of Psychology, Computational Linguistics, and Modern Languages at Carnegie Mellon University, has developed a model of first and second language acquisition and processing called the Competition Model. He has also developed the CHILDES Project (childes.psy.cmu.edu) for the computational study of child language transcript data and the TalkBank (talkbank.org) system for the study of conversational interactions.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Emily M. 
Bender, University of Washington&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/EmilyMBender&quot; target=&quot;_self&quot;&gt;EmilyMBender&lt;/a&gt;   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://faculty.washington.edu/ebender/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://faculty.washington.edu/ebender/&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:ebender@u.washington.edu&quot; target=&quot;_self&quot;&gt;ebender@u.washington.edu&lt;/a&gt;&lt;br&gt;I am one of the co-organizers of Cyberling 2009. My interest in cyberinfrastructure for linguistics stems from my work on grammar engineering for linguistic hypothesis testing. I see this as one example of computational methods in support of linguistic analysis: using computers to systematically work with larger data sets and manage greater complexity than we could do without computational aids. I am also very interested in the issue of culture change within the field of linguistics, i.e., how to create a culture in which data sharing and the validation of hypotheses against large datasets are expected and rewarded.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Nicoletta Calzolari&lt;br&gt;&lt;/h3&gt;wiki username:   &lt;br&gt;website: &lt;br&gt;email: &lt;br&gt;[insert bio here]&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Nancy Ide&lt;br&gt;&lt;/h3&gt;wiki username:   &lt;br&gt;website: &lt;br&gt;email: &lt;br&gt;[insert bio here]&lt;br&gt;&lt;br&gt;&lt;h3&gt;  David Robinson&lt;br&gt;&lt;/h3&gt;wiki username: drobinsonlsa   &lt;br&gt;website: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lsadc.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.lsadc.org&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:drobinson@lsadc.org&quot; target=&quot;_self&quot;&gt;drobinson@lsadc.org&lt;/a&gt;&lt;br&gt;I am the 
Director of Membership and Meetings for the Linguistic Society of America. I will be attending Cyberling 2009 in order to assess how the LSA can best put the findings of this workshop at the disposal of LSA members as well as the profession at large, and what role the LSA can play in the development of a cyberinfrastructure for the profession. &lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font size=&quot;4&quot;&gt;Other Collaborators&lt;br&gt;&lt;/font&gt;&lt;/b&gt;  &lt;h3&gt;  Jeff Good&lt;br&gt;&lt;/h3&gt;wiki username:&lt;a href=&quot;http://cyberling.elanguage.net/account/jcgood&quot; target=&quot;_self&quot;&gt;jcgood&lt;/a&gt;   &lt;br&gt;website:&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://buffalo.edu/~jcgood/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://buffalo.edu/~jcgood/&lt;/a&gt;&lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:jcgood@buffalo.edu&quot; target=&quot;_self&quot;&gt;jcgood@buffalo.edu&lt;/a&gt;&lt;br&gt;I am one of the co-organizers of the Cyberling workshop. (Unfortunately, however, I will not be able to attend most of it.) I am interested in how cyberlinguistic infrastructure can facilitate work in language description, typology, and comparative and historical linguistics.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Dan McCloy, University of Washington&lt;/h3&gt;wiki username: &lt;a href=&quot;http://cyberling.elanguage.net/account/danmccloy&quot; target=&quot;_self&quot;&gt;danmccloy&lt;/a&gt;   &lt;br&gt;email: &lt;a href=&quot;http://cyberling.elanguage.netmailto:drmccloy@u.washington.edu&quot; target=&quot;_self&quot;&gt;drmccloy@u.washington.edu&lt;/a&gt;&lt;br&gt;I am a graduate student in Linguistics at the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://depts.washington.edu/lingweb/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;University of Washington&lt;/a&gt;. My research is primarily in formal semantics. 
As one of the organizers of Cyberling 2009, I am the primary point-of-contact for most inquiries about workshop logistics.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Tandy Warnow&lt;br&gt;&lt;/h3&gt;wiki username:   &lt;br&gt;website: &lt;br&gt;email: &lt;br&gt;[insert bio here]&lt;br&gt;&lt;div&gt;  &lt;/div&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Towards a collaboration platform for linguistics</title><link>http://cyberling.elanguage.net/page/Towards+a+collaboration+platform+for+linguistics</link><author>EmilyMBender</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Towards+a+collaboration+platform+for+linguistics</guid><comments>formatting edits</comments><pubDate>Sun, 13 Sep 2009 23:51:30 CDT</pubDate><description>&lt;br&gt;&lt;b&gt;A call for tools.&lt;/b&gt;&lt;br&gt;&lt;br&gt;Working group 3 was asked to discuss &amp;quot;tools for linguistics&amp;quot;, with the possible secondary questions of identifying the properties of a &amp;quot;killer application&amp;quot; for linguistic research.&lt;br&gt;&lt;br&gt;Our field (linguistics) has many tools already. We have annotation engines, analysis engines, and data-annotation standards. 
Sometimes, these tools are mutually compatible, but in general our tools exist for a sub-community within the field, and the tools work just well enough for the current research in that sub-community.&lt;br&gt;&lt;br&gt;Currently open issues in technical tools for linguistics research seem to include:&lt;br&gt;&lt;ul&gt;&lt;li&gt;How to collaborate (though Group 7 seems to be covering this)&lt;/li&gt;&lt;li&gt;Data provenance and proper citation of data-collections (Group 4 seems to be addressing some of this)&lt;/li&gt;&lt;li&gt;Data control (controlling access and controlling changes)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;Our group&amp;#39;s discussion made clear that there is at least one technical area where linguistics -- as a field -- lacks a widely-used tool: active collaboration on data collection, curation, and sharing.&lt;br&gt;&lt;br&gt;&lt;b&gt;A dream&lt;/b&gt;&lt;br&gt;&lt;br&gt;Let&amp;#39;s imagine what an active online collaboration tool might look like. We provide here a brief narrative suggesting what might be available for a linguistics community with such a tool:&lt;br&gt;&lt;br&gt;&lt;blockquote&gt;&lt;i&gt;Sissala studies (an imaginary tale)&lt;/i&gt;&lt;br&gt;&lt;br&gt;&lt;u&gt;August 1-20&lt;/u&gt;: Researcher S. makes recordings in the field&lt;br&gt;&lt;br&gt;&lt;u&gt;September 3&lt;/u&gt;: 150 conversations uploaded to the shared research platform, tagged as well as S.&amp;#39;s notes allow. S. remains in Burkina Faso, but (because he logged in) his proprietorship of that data is clearly tagged. He sets the access permission on this data to be shared with certain other members of the community.&lt;br&gt;&lt;br&gt;&lt;u&gt;September 20&lt;/u&gt;: School begins in North America. An instructor&amp;#39;s upper-level class in &amp;quot;Phonetics of Pitch&amp;quot; selects 5 of these conversations for annotation for F0 as part of the class projects. Some conversations are transcribed by more than one student. 
Each transcription is itself tagged with the identity of the student and of the instructor, who is herself the one granting access to the specific conversations. As the students&amp;#39; work is corrected, both the student&amp;#39;s and the instructor&amp;#39;s transcriptions and improvements are logged and made available back to S.&lt;br&gt;&lt;br&gt;Four undergrad students studying IPA at another American university transcribe 15 sentences each; their teacher transcribes 30 and double-checks the students&amp;#39; transcriptions as well.&lt;br&gt;&lt;br&gt;S., from Burkina Faso, is able to see these transcriptions and sends back some commentary, noting a phonological distinction that the IPA students may not have captured.&lt;br&gt;&lt;br&gt;&lt;u&gt;October 1&lt;/u&gt;: Part-of-speech tags are attached by grad students in the Netherlands, who are doing typological studies of West African languages&amp;#39; word-order phenomena.&lt;br&gt;&lt;br&gt;&lt;u&gt;October 20&lt;/u&gt;: A syntactician in Australia takes note of new work on Sissala.&lt;br&gt;&lt;/blockquote&gt;&lt;br&gt;In less than an academic quarter, new data and rich linguistic annotations of that data are shared across four continents, among students, researchers, and faculty. The students are using the same tools as the faculty and the field researchers: classwork is a good preparation for fieldwork. The students&amp;#39; own practicum is made useful both to other students and to the original researcher, who may (or may not) have time to perform this transcription, but may be able to find the time to offer feedback.&lt;br&gt;&lt;br&gt;In this (perhaps utopian) vision, research and education communicate using the same tools, and researchers across different sub-disciplines and different levels of expertise may also use these tools.&lt;br&gt;&lt;br&gt;&lt;b&gt;Working data sharing&lt;/b&gt;&lt;br&gt;&lt;br&gt;What might such an active collaboration look like, using 2009 technology? 
Web programming and widely-supported data-distribution protocols make storing and distributing large amounts of data in a digitally-accessible way fairly easy. Creating, storing and managing data on server computers sounds complex (and indeed is), but these challenges have largely been addressed by other communities with parallel needs, both commercial and non-commercial.&lt;br&gt;&lt;br&gt;We envision an environment in which data-collection is (to draw an analogy) like constructing a post (or series of posts) for a weblog. The data (or post) may be stored on the server (which we semi-jocularly dub &amp;quot;cyberling.org&amp;quot; for this discussion) even before distribution (or &amp;quot;publication&amp;quot;); it may go through multiple drafts, and it is possible (as with blog posts, on some platforms) that only some readers may &lt;i&gt;ever&lt;/i&gt; be able to read it; attribution and edit history are tracked, and distribution to other locations, authors, and/or readers is straightforward.&lt;br&gt;&lt;br&gt;Contemporary web programming has ready-made solutions for many of these challenges. (This is not to say that the current solutions are &lt;i&gt;ideal&lt;/i&gt;, only that they currently work well enough.) Complete, well-tested strategies exist for backup (redundancy), storage, confirming and tracking identity, attribution, access control, and revision control.&lt;br&gt;&lt;br&gt;Using these online environments also allows multiple researchers (possibly in different locations!) to extend and curate data collaboratively. These possibilities only grow richer as disciplines interact: for example, data collected for fieldwork in phonology and morphology may inform further research on sociolinguistics, syntax, or typology for researchers far away, if the disciplines can share datasets through online collaboration. 
Collaboration in this manner can offer benefits to reputation (through citation of one&amp;#39;s field notes and through visibility of one&amp;#39;s own work) and new opportunities for researchers to share work in more repeatable, formally-visible ways.&lt;br&gt;&lt;br&gt;&lt;b&gt;Motivating existing researchers&lt;/b&gt;&lt;br&gt;&lt;br&gt;While online data sharing strategies offer many additional exciting possibilities for collaboration, the &amp;quot;bootstrapping&amp;quot; problem lurks: existing researchers rarely have a strong interest in changing their current work patterns to join an as-yet-nonexistent collaboration network --- the benefits of collaboration will not appear unless many researchers are already involved; the first one on the dance floor will look goofy until the party really gets started.&lt;br&gt;&lt;br&gt;Of course, additional objections may arise: there&amp;#39;s not enough time or money; we&amp;#39;ve not done it before, why start now?, and the perennial concern: this is not our area of expertise. It is a truism of both open-source and for-profit software that no project will succeed unless it &amp;quot;scratches someone&amp;#39;s itch&amp;quot; --- that it meets existing unmet needs of its users. 
In open-source software, the &amp;quot;itch&amp;quot; is often the desire of a capable programmer to improve her own tools; in for-profit software, the itch is the prospect that someone will pay for the product or the labor.&lt;br&gt;&lt;br&gt;In this paper, we concern ourselves with discovering the &amp;quot;itch&amp;quot;: as researchers, as collectors and curators of linguistic data, and as a community, we need individual (and collective) reasons to move towards this sort of sharing --- reasons to find the funding and the time to promote the kind of adjustment to our research methods that would allow this collaboration.&lt;br&gt;&lt;br&gt;As researchers interested in sharing data, we want to encourage our users to join the collaboration, even before the benefits of the network effects appear. New users --- the &amp;quot;first ones on the dance floor&amp;quot; --- need good reasons to join &lt;i&gt;before&lt;/i&gt; the network effects appear.&lt;br&gt;&lt;br&gt;New users --- especially with the community in its infancy --- need good &lt;i&gt;individual&lt;/i&gt; reasons to join. Thus, the question we address here is &amp;quot;why should &lt;i&gt;I&lt;/i&gt; join a linguistics corpus-management service built on web tech?&amp;quot;&lt;br&gt;&lt;br&gt;&lt;b&gt;Case studies: Immediate Benefits&lt;/b&gt;&lt;br&gt;&lt;br&gt;We believe that most linguistics researchers will benefit quite readily from this kind of tool built on web technology. In this section, we sketch several linguistics researchers who are fairly diverse in their subfields, interests, and experience, and we point out the benefits that each would gain from adapting his or her work to use these sorts of tools.&lt;br&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;Sociophonetician&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;existing work: recorded conversations, annotated for a very narrow range of phonetic and sociological features. 
Current research tools: Praat, DAT tapes, and R for analysis.&lt;br&gt;Sharing data into the platforms suggested here gives her:&lt;br&gt;&lt;ul&gt;&lt;li&gt;an offsite backup&lt;/li&gt;&lt;li&gt;shared access to the data with her research-assistants and annotators&lt;/li&gt;&lt;li&gt;control over who has access to these data&lt;/li&gt;&lt;li&gt;revision history (including opportunities to review and reconcile conflicting annotations)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;Field lexicographer&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;existing work: documenting a language&amp;#39;s lexicon in the field. Current research tools: Toolbox, and a paper notebook.&lt;br&gt;Sharing data into the platforms suggested here gives him:&lt;br&gt;&lt;ul&gt;&lt;li&gt;backup&lt;/li&gt;&lt;li&gt;shared access to the under-development lexicon with his advisor (who is at home teaching)&lt;/li&gt;&lt;li&gt;having uploaded the lexicon in one format, downloading in a different one is easy&lt;/li&gt;&lt;li&gt;browsing the lexicon through the web&lt;/li&gt;&lt;li&gt;revision history&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;phonetics instructor&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;Teaching work: uses Praat. Students are required to learn enough computer tools to turn in IPA assignments in digital form. A perennial frustration: some submission formats corrupt the students&amp;#39; transcriptions.&lt;br&gt;By sharing some data to transcribe with the class, this instructor can ask his students to transcribe directly into the online tools described here. 
The students get:&lt;br&gt;&lt;ul&gt;&lt;li&gt;practice with the tools that they will use as future professionals&lt;/li&gt;&lt;li&gt;work with real field data&lt;/li&gt;&lt;li&gt;attribution for their own work (their transcriptions are logged as their own)&lt;/li&gt;&lt;/ul&gt;The instructor gets:&lt;br&gt;&lt;ul&gt;&lt;li&gt;extra annotation passes over the same data&lt;/li&gt;&lt;li&gt;clear indications of the students&amp;#39; transcription record (through the revision control and identity management)&lt;/li&gt;&lt;li&gt;standardized transcription responses&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;the doctoral student&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;Currently writing a syntax squib. Her research is currently looking at hundreds of Hungarian sentences, in search of left-displacement phenomena. The squib itself has only four (representative) examples.&lt;br&gt;If this student puts her examples into this sharing environment, she gets:&lt;br&gt;&lt;ul&gt;&lt;li&gt;backup of her notes&lt;/li&gt;&lt;li&gt;organized documentation of the evidence behind that squib&amp;#39;s point for later inclusion in her dissertation&lt;/li&gt;&lt;li&gt;reproducibility -- other researchers who might challenge the representativeness of her four examples can find other examples&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;the NLP evaluation guru&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;Currently running a Thai word-segmentation competition. Dataset involves thousands of sentences of Thai text, word-segmented by native Thai speakers. 
By using the platform provided, he can get:&lt;br&gt;&lt;ul&gt;&lt;li&gt;data backup&lt;/li&gt;&lt;li&gt;easy indexing of multiple annotations of the same segments&lt;/li&gt;&lt;li&gt;easy revision tracking, as disagreements are resolved &lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;&lt;b&gt;Additional Benefits&lt;br&gt;&lt;br&gt;&lt;/b&gt;A successful platform would probably provide access to several simple-to-use applications (such as tools to visualize the uploaded data). Examples include visual and tabular reports of (word-)frequencies in your data, cross-tabulation, and lemmatizers; visualization of syntactic annotation; simple annotation tools for adding further layers of annotation to a data set; and so on. Over time, community members can develop and share additional tools. Existing tools (e.g., ANVIL, Praat, Linguistic Search Engine, TigerSearch) could be integrated into the platform, combined with intuitive user interfaces, thereby providing additional motivation to linguists to join the community and upload their data. &lt;br&gt;&lt;b&gt;&lt;br&gt;&lt;/b&gt;In the long run, the platform we envision can also facilitate the development of annotation standards. Annotation standards that have been developed for one task can become objects that can be shared with other users. Just like any other type of data, annotation schemes can have tags for authorship, editorship, and revision history. This way, not only primary linguistic data, but also secondary data (part-of-speech sets, syntactic annotation schemes, etc.) can be shared and improved by community members. &lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;br&gt;What to build first?&lt;/b&gt;&lt;br&gt;&lt;br&gt;A system that delivers all of the possible benefits of every one of these use cases will require substantial work. 
In this section, we propose a somewhat more limited scope as an initial goal.&lt;br&gt;&lt;br&gt;We suggest that phonetic and phonological transcription is a task particularly well suited to this sort of distributed data-sharing. Technology and standards are well-defined for sharing audio and transcriptions (essentially, text). &lt;br&gt;&lt;br&gt;The issues of sharing audio and transcriptions among field researchers, their assistants, professional and student transcribers, and arbitrators are not simple, because they involve tracking meta-information, access control, and revision control. Software developers, for their own needs, have developed good tools for dealing with most of these issues. &lt;br&gt;&lt;br&gt;Limiting the scope of this tool initially to audio files and multiple text transcriptions of those files would make working out these challenges somewhat simpler --- and the result could more easily be extended, in the future, to other forms of annotation (e.g., coding features that depend on other annotations, such as syntax above transcription).&lt;br&gt;&lt;br&gt;Building the infrastructure for simple collaboration on transcription, however, would be a useful contribution to &lt;br&gt;&lt;ul&gt;&lt;li&gt;Fieldworkers&lt;/li&gt;&lt;li&gt;Linguistics education - students and faculty&lt;/li&gt;&lt;li&gt;Sharing of transcriptions - between fieldworkers and their research colleagues&lt;/li&gt;&lt;li&gt;Collaboration among remote researchers (even when none of them are fieldworkers)&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br&gt;&lt;br&gt;Too often, discussions about &amp;quot;tools&amp;quot; are really discussions not about tools, but about how to make our tools interoperate, how to make sure that doing work on tools or data gets proper attribution (or the related question: how it gets funding), or one of several other concerns that are -- at root -- questions outside of the tools themselves.&lt;br&gt;&lt;br&gt;We have tools. 
The technology for sharing data, for managing access, revision history, redundancy, and privacy already exists, and is in use on the Internet every day. As the linguistics research community, we are not using the tools that exist for these tasks -- and some work must be done in order to use them to our (and their!) best ability. Nevertheless, the tools exist, and the next generation of linguists will thank us for already having them in place.&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Group 1: Annotation Standards</title><link>http://cyberling.elanguage.net/page/Group+1%3A+Annotation+Standards</link><author>mebeckman</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Group+1%3A+Annotation+Standards</guid><pubDate>Fri, 11 Sep 2009 12:26:35 CDT</pubDate><description>&lt;font color=&quot;#808080&quot;&gt;&lt;b&gt;&lt;font size=&quot;3&quot;&gt;Working group members:&lt;/font&gt;&lt;/b&gt; &lt;/font&gt;Mary Beckman (co-chair), Stuart Robinson (co-chair), Sarah Churng, Greville Corbett, Charles Fillmore, Richard Wright&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;5&quot;&gt;Preamble&lt;/font&gt;&lt;br&gt;&lt;/b&gt;&lt;br&gt;The Annotation Standards group was charged with identifying and documenting existing and needed standards for the annotation of linguistic data. Further, the group was asked to consider possible standards that may need to be developed in the future. Annotation standards support interoperability, aggregation of data, and (ideally) applications that help linguists address the research questions that they are interested in answering while creating consistently annotated data as a side-effect. Another consideration that the group took into account is that sometimes these goals may be in partial conflict with standards for the ethical treatment of human subjects. 
&lt;br&gt;&lt;br&gt;This wiki page is the report (&amp;quot;white paper&amp;quot;) from the group, who acknowledge the helpful comments of the other attendees at the Cyberling09 workshop, particularly Emily Bender, Nancy Ide, and Mark Liberman.&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;5&quot;&gt;&lt;b&gt;Table of contents&lt;br&gt;&lt;/b&gt;&lt;/font&gt;&lt;br&gt;&lt;ol&gt;&lt;li&gt;&lt;font color=&quot;#000000&quot;&gt;What is annotation and what is it good for?&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;font size=&quot;3&quot;&gt;What are annotation standards and what are they for?&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;font size=&quot;3&quot;&gt;What does it take to be a good annotation standard?&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font color=&quot;#000000&quot; size=&quot;3&quot;&gt;The state of the art (with some case studies)&lt;br&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font color=&quot;#000000&quot; size=&quot;4&quot;&gt;&lt;font size=&quot;3&quot;&gt;Existing annotation standards and resources&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font color=&quot;#000000&quot; size=&quot;3&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;References&lt;/font&gt;&lt;br&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ol&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;b&gt;&lt;font size=&quot;5&quot;&gt;1. What is annotation and what is it good for?&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;font color=&quot;#000000&quot; size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;Annotation is the act of adding, to primary linguistic data, information representing analyses or models of aspects of the data. 
For example, if the primary linguistic data are an audio recording of a sequence of turns in a conversation between two speakers, then one type of annotation could be the marking of speaker-change points in the conversation within a layer of annotations related to the analysis of discourse structure. Another series of annotation layers could begin with an orthographic and/or a segmental transcription of the speech. Other annotation layers in this series might include a tokenization (segmentation) and glossing of the words or other similar units in the orthographic or segmental transcriptions of the recording. Other series of annotation layers on the morphosyntactic side of language could include a subsequent set of part-of-speech assignments to the words and/or a parsing of the syntactic structures of the sentences and other linguistic expressions in the recording. A parallel series of annotation layers on the phonetics/phonology side of language could include the tagging of linguistically significant events in the spectral patterns of the utterances (e.g., the release burst of each plosive and the transitions between different voice qualities), a parsing of prosodic structures that group segments into syllables and higher-order constituents, and the identification of salient points of coordination in the rhythms at different levels (e.g., marking of stressed or accented syllables).&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;1.1. What elements of an analysis can be annotated?&lt;/font&gt;&lt;br&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;font color=&quot;#333333&quot;&gt;As the above example illustrates, elements of a linguistic analysis that can be annotated and for which annotation conventions can be codified separately are of at least three types: (1) tokenization/segmentation, (2) syntagmatic structure, (3) paradigmatic content of the events/tokens and structure. 
In the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Ontology_%28information_science%29&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ontology&lt;/a&gt; of linguistic annotations, these aspects could be thought of as (1) the identification of instances or things, (2) the identification of relations among things, and (3) the identification of classes of things or relational functions. Bird and Liberman (2001) give an insightful discussion and a framework for formulating good ways of treating all three aspects. They propose to formalize this ontology in terms of the &lt;b&gt;&lt;i&gt;annotation graph&lt;/i&gt;&lt;/b&gt; -- a directed acyclic graph, in which each annotation token is (minimally) a triple consisting of two nodes that point to the positions in the string of labels on any annotation tier, and the label for the arc connecting these points, as in the two figures below.&lt;br&gt;&lt;br&gt; &lt;br&gt;&lt;br&gt; &lt;br&gt;&lt;br&gt;&lt;b&gt;Figure 1.&lt;/b&gt; Spectrogram (and Praat TextGrid merge of original label files) for the first three words in utterance train/dr1/fjsp0/sa1 from the TIMIT corpus (top) with a screen shot of the original &lt;/font&gt;&lt;font color=&quot;#333333&quot;&gt;phn and &lt;/font&gt;&lt;font color=&quot;#333333&quot;&gt;wrd label files (the first and third tiers of the Praat TextGrid) and the associated annotation graph snippet from Figure 2a in Bird and Liberman (2001).&lt;br&gt;&lt;br&gt; &lt;br&gt;&lt;br&gt; &lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#ff0000&quot;&gt; &lt;/font&gt;&lt;br&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;b&gt;Figure 2.&lt;/b&gt; Sample view for the first utterance in &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://lacito.vjf.cnrs.fr/archivage/tools/list_rsc.php?lg=Hayu&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;a Hayu narrative&lt;/a&gt; from the LACITO archive (top) with a 
screen shot of the&lt;/font&gt;&lt;font color=&quot;#333333&quot;&gt; &lt;/font&gt;&lt;font color=&quot;#333333&quot;&gt;snippet of the annotation file and associated &lt;/font&gt;&lt;font color=&quot;#333333&quot;&gt;annotation graph from Figure 5 in Bird and Liberman (2001).&lt;br&gt; &lt;br&gt; &lt;/font&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;1.2. Examples&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;We expand on these three different aspects by illustrating each in reference to the type of example described above, where the primary linguistic data are an audio recording, and also in reference to cases where the primary linguistic data are instead a written text, which may or may not have begun as an orthographic transcription of an audio recording. &lt;br&gt;&lt;/font&gt;&lt;br&gt;(1) For audio data, tokenization considerations can include not just the need to decide on the number of things that are instantiated at any given level of annotation, but also the need to agree upon where token boundaries (e.g., edges of segments or words) should be placed relative to the disparate spectral cues to the often asynchronous and/or smoothly changing postures of different articulatory systems. For text data, tokenization considerations similarly can include the need to introduce word boundaries (e.g., as spaces in text) for languages with writing systems that do not use space for word separation, or, for English, the need to separate punctuation that shows significant syntagmatic boundaries -- by adding surrounding spaces -- from punctuation that is a part of a name or word (e.g., &amp;quot;that &amp;#39; s that ! &amp;quot; as opposed to &amp;quot;etc.&amp;quot;, &amp;quot;Mrs.&amp;quot;, &amp;quot;U.S.A.&amp;quot;). 
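&lt;br&gt;&lt;br&gt;The annotation graph described in section 1.1, and the question of where token boundaries sit, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Bird and Liberman implementation: the class names Node and Arc, the helper function tier, the TIMIT-style labels, and the time offsets are all made up for this example (the offsets are not taken from the TIMIT corpus). Each arc holds two nodes and a label, plus the name of its tier. Because the word arcs and the phone arcs share node objects, moving one boundary moves it on every tier that uses it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """An anchor point on the timeline; several arcs may share one node."""
    time: float  # offset in seconds (invented values, for illustration only)


@dataclass
class Arc:
    """One annotation token: start node, end node, tier name, and label."""
    start: Node
    end: Node
    tier: str
    label: str


def tier(arcs, name):
    """Return (label, start time, end time) for every arc on one tier."""
    return [(a.label, a.start.time, a.end.time) for a in arcs if a.tier == name]


# Hypothetical boundaries for the two words "she had".
nodes = [Node(t) for t in (0.00, 0.10, 0.25, 0.32, 0.50, 0.58)]
arcs = [
    Arc(nodes[0], nodes[1], "phn", "sh"),
    Arc(nodes[1], nodes[2], "phn", "iy"),
    Arc(nodes[2], nodes[3], "phn", "hv"),
    Arc(nodes[3], nodes[4], "phn", "ae"),
    Arc(nodes[4], nodes[5], "phn", "dcl"),
    # Word arcs reuse the phone-tier nodes instead of copying time stamps.
    Arc(nodes[0], nodes[2], "wrd", "she"),
    Arc(nodes[2], nodes[5], "wrd", "had"),
]

# Correct the boundary between "iy" and "hv"; the word tier follows along.
nodes[2].time = 0.27
print(tier(arcs, "wrd"))
```

Sharing nodes rather than copying time stamps is the design choice that keeps the tiers from drifting apart when a boundary is corrected.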
&lt;br&gt;&lt;br&gt;(2) For audio data, syntagmatic structure can include agreed-upon conventions for identifying time points on different annotation tiers that should be synchronized because they are the same event (e.g., the time stamp for the beginning edge of the first segment in a word is also the time stamp for the beginning edge of the word, and should change if/when the segment boundary is moved). It can also include the principles for differentiating different types of coordination across annotation tiers (e.g., when the time of a linguistically meaningful fundamental frequency maximum is identified relative to the time of a stop release). For text data, syntagmatic structure can include the bracketing of sequences of text that function as single constituents, or the indexing of anaphors to their antecedents, or the indexing of discontinuous collocates, as between the first and last words in &amp;quot;&lt;u&gt;wreak&lt;/u&gt; this type of &lt;u&gt;havoc&lt;/u&gt;&amp;quot;. &lt;br&gt;&lt;br&gt;(3) For audio data, the development of annotation conventions inevitably includes work to agree on the set of contrasting types to distinguish at several levels. For example, in tagging consonant and vowel segments, should the labels include only &amp;quot;broad&amp;quot; phoneme classes, or should major allophones be distinguished, or even finer phonetic detail marked? Whose intonational analysis should be adopted in tagging utterance melody? For text data, similarly, annotation conventions can involve work to codify the set of paradigmatic contrasts at many levels, from conventions regarding the number of different types of filled pauses and how to spell them in an orthographic transcription of recorded speech, to conventions for identifying cells in morphological paradigms, which might need to be differentiated in significance across languages. 
For example, systems for morphosyntactic glossing intended for cross-language comparison might need to recognize that for one language &amp;#39;singular&amp;#39; is understood as being in paradigmatic contrast with &amp;#39;plural&amp;#39; while in another language it might be in contrast with &amp;#39;dual&amp;#39; and &amp;#39;plural&amp;#39;. This could conceivably be achieved by adding a &amp;quot;legend&amp;quot; to a given annotation layer, for a given language, linking each special annotation category to a description of the relevant contrast set, e.g., in a grammar.&lt;br&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;1.3. The purpose of annotation and its relationship to the data&lt;/font&gt;&lt;br&gt;&lt;/b&gt;&lt;br&gt;&lt;/font&gt;As the last example should make clear, conventions for each aspect of an annotation scheme cannot be established without thinking carefully about what the data are and what the annotations are for. In general, annotations are dedicated to specific purposes: it is hard to imagine a corpus development project that seeks to account for every detectable phenomenon in a language sample. Sociophoneticians who are interested in the increased use of creaky voice by young female speakers of American English would want to have an annotation tier where the beginning and end points of creaky voice are marked; phonologists who are interested in comparing the probability of finding a particular segment sequence within a word to the probability of finding that sequence across a word boundary obviously do not need such a tier. Researchers interested in preferred or habitual locutions on the part of individual speakers obviously need to have speaker-ID information in the data they study; those interested in finding examples of syntactic phenomena obviously do not need such information. 
These different needs for more or fewer levels of annotation can be described as different points on a scale of granularity. But the different granularities can also involve the same level of annotation, but different degrees of specificity in the paradigmatic set. &lt;br&gt;&lt;br&gt;The following is an example of levels of granularity in the syntactic description of English rate expressions such as &lt;i&gt;forty dollars an hour&lt;/i&gt;, &lt;i&gt;forty miles an hour&lt;/i&gt;, &lt;i&gt;forty miles a gallon&lt;/i&gt;, &lt;i&gt;forty times a day&lt;/i&gt;, &lt;i&gt;forty dollars an ounce&lt;/i&gt;, and the like. In dealing with such phrases, one purpose might be that of providing a preliminary mark-up for a parser, inasmuch as this pattern of two adjacent NPs is not a part of the ordinary grammar of the language. For this purpose, it would be enough to block off such phrases and mark them as NP; as such, they can then fit into PPs (&lt;i&gt;moving at forty miles an hour&lt;/i&gt;) and VPs (&lt;i&gt;earns forty dollars an hour&lt;/i&gt;, &lt;i&gt;gets forty miles a gallon&lt;/i&gt;), etc. A quite different purpose one might have for annotating such expressions is providing a mark-up that is usable for language understanding efforts: in such cases the type of unit (linear extent, money amount, time, weight, etc.) in each of the two parts of these expressions should be indicated (with information from a lexicon), allowing the automatic assignment of the phrases to such categories as Fuel-efficiency, Price-per-unit, Frequency, Speed and the like.&lt;br&gt;&lt;br&gt;An obvious and important use of annotation is that of providing a layer of representation that is available for further analysis; in a sense, this amounts to regarding one person&amp;#39;s annotation as another person&amp;#39;s primary data. For example, phonological or orthographic transcription allows morphosyntactic analysis of a sample of speech more directly than the acoustic trace. 
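&lt;br&gt;&lt;br&gt;The two granularities in the rate-expression example above can be sketched as follows. This is a minimal illustration under stated assumptions: the unit lexicon, the category table, and the function name annotate_rate are hypothetical and exist only for this example. The Pay-rate category in particular is not named in the text; it is an assumed instance of the further categories the text leaves open. The sketch also assumes that every phrase has the fixed shape NUMBER UNIT a/an UNIT.

```python
# Hypothetical mini-lexicon mapping unit nouns to unit types.
UNIT_TYPES = {
    "dollars": "money",
    "miles": "distance",
    "times": "count",
    "hour": "time",
    "day": "time",
    "gallon": "volume",
    "ounce": "weight",
}

# (numerator type, denominator type) mapped to a rate category.
RATE_CATEGORIES = {
    ("distance", "time"): "Speed",
    ("distance", "volume"): "Fuel-efficiency",
    ("count", "time"): "Frequency",
    ("money", "weight"): "Price-per-unit",
    ("money", "time"): "Pay-rate",  # assumption: not named in the text
}


def annotate_rate(phrase, fine=False):
    """Annotate a rate expression such as 'forty miles an hour'.

    Coarse mode only brackets the phrase as an NP (enough for a parser).
    Fine mode also records the unit types and looks up a rate category.
    """
    annotation = {"text": phrase, "cat": "NP"}
    if not fine:
        return annotation
    words = phrase.split()
    # Assumed shape: NUMBER UNIT a/an UNIT, so units are words 1 and 3.
    types = (UNIT_TYPES.get(words[1]), UNIT_TYPES.get(words[3]))
    annotation["unit_types"] = types
    annotation["rate"] = RATE_CATEGORIES.get(types)  # None if unknown
    return annotation


for example in ("forty miles an hour", "forty miles a gallon",
                "forty times a day", "forty dollars an ounce"):
    print(example, annotate_rate(example, fine=True)["rate"])
```

The coarse annotation is a strict subset of the fine one, so a corpus annotated at the finer granularity can still serve the parser-oriented purpose.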
&lt;br&gt;&lt;br&gt;This property of layered annotation raises the issue, strongly associated with the late John Sinclair, that mistakes in one layer of annotation compound mistakes in higher layers. This point can be seen in the fact that descriptions of English syntax tend to accept the tokenization implied in the standard orthography, so that, for example, &lt;i&gt;whose&lt;/i&gt; and &lt;i&gt;another&lt;/i&gt; are treated as single units. Proposals about the clitic vs. suffix analysis of the possessive marker in English would be argued differently if &lt;i&gt;whose&lt;/i&gt; were &lt;i&gt;who&amp;#39;s&lt;/i&gt;, making &lt;i&gt;who the hell&amp;#39;s fault is that?&lt;/i&gt; seem not so anomalous; and descriptions of the pattern that allows &lt;i&gt;a mere twenty dollars&lt;/i&gt;, &lt;i&gt;an extra five pages&lt;/i&gt;, &lt;i&gt;an additional twenty dollars&lt;/i&gt; could be seen as incorporating &lt;i&gt;another five pages&lt;/i&gt; (i.e., as &lt;i&gt;an other five pages&lt;/i&gt;).&lt;br&gt;&lt;br&gt;An analogous situation arises when phonological descriptions accept the tokenization or the set of paradigmatic categories implicit in the conventional segmental transcription for a language, even when describing speech produced by child speakers who have not yet acquired the phonological system of the language. &lt;br&gt;&lt;br&gt;These pitfalls suggest that the sociology of developing annotation conventions might be an object of study in its own right. They also suggest that linguists should think flexibly about the types of things that should be considered annotations, so that conventions can be developed for how to link these things to the primary data. 
For example, it might be appropriate to think of responses from naive judges, elicited over the web using Mechanical Turk or the like, as a kind of annotation, in which case it could be useful to develop standards for eliciting these judgments and tools for linking the responses back to the corpus of primary data that provided the stimuli. It might also be appropriate to think of skilled formant &amp;quot;correction&amp;quot; as a kind of annotation, in which case there could be standards for &amp;quot;correcting&amp;quot; formants and associating the formant traces with the corpus, as in the development of the tools for the &lt;a class=&quot;external&quot; href=&quot;http://www.ling.canterbury.ac.nz/onze/cc.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Origins of New Zealand English Project&lt;/a&gt;. &lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;5&quot;&gt;2. What are annotation standards and what are they for?&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#333333&quot;&gt;Some annotation is created to serve a single researcher&amp;#39;s needs. If the annotation practices developed by this researcher, for a certain class of phenomena, are consistent, so much the better for this solo researcher. An issue of annotation &lt;u&gt;standards&lt;/u&gt; arises when there is a need or opportunity for other researchers to work with the same data, or when researchers become interested in the same kinds of phenomena in other data samples, or in other languages, and want to be able to make generalizations.&lt;/font&gt;&lt;br&gt;&lt;br&gt;An annotation standard, then, is a set of conventions that is associated with a commitment to adhere to the conventions by a community of users. 
A standard can evolve gradually in a community of researchers who are working on similar problems in some language domain, so that assumptions about the analytic space converge in some way that promotes the natural emergence of infrastructure for developing, transmitting, and codifying a standard. A standard can also arise from adopting a tool that brings with it assumptions about the data being analyzed that can be met by adhering to the standard.&lt;br&gt;&lt;br&gt;No matter the path of convergence, however, annotation standards cannot be defined without reference to a shared set of assumptions and an associated community of analysts. As a corollary, a set of annotation conventions cannot be evaluated (or standardized) without at least an initial reference to a community of analysts and users. &lt;br&gt;&lt;br&gt;Within such a community, annotation adds value by spreading the workload of providing agreed-upon analyses to a larger set of shared primary data than can be analyzed by a researcher working alone. Looking outward from the core community, annotation provides expert analyses for others who might not otherwise have access to the primary data.&lt;br&gt; &lt;br&gt;This understanding of the relationship between the analyses of the data that are to be encoded in the annotations (the &amp;quot;model&amp;quot;) and the primary data themselves leads to the following characterization of what annotation standards are for and how they can be evaluated.&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;2.1. Within the original community of developers and users ...&lt;/font&gt;&lt;br&gt;&lt;/b&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;It is critically important to ground any annotation schema in terms of the particular model of the phenomena being annotated, and to develop it in relationship to the question being asked and the shared assumptions of the community about the phenomena being observed and modeled. 
Within this initial community, the annotations evolve as a set of &amp;quot;common law&amp;quot; rules about what the observed phenomena are. These rules will specify how the data should be segmented into tokens, how the tokens will be labeled in terms of an agreed upon inventory of contrasting types and relationships, and how relationships among tokens will be parsed and labeled. &lt;/li&gt;&lt;li&gt;A set of annotation conventions (rules), therefore, can only be evaluated first in relationship to the initial user community and their questions. While there might be domain-specific evaluation criteria, defined relative to independent observational tools, one critically important evaluation criterion that is common to all domains is the reliability of the annotations. Is the annotation consistent within and between annotators? If an annotator observes the same data twice in different independent annotation sessions, are the analyses (the annotations) the same? Similarly, if two different annotators observe the same data independently, do they arrive at the same analyses (tokenization and labelings)? &lt;br&gt;&lt;/li&gt;&lt;li&gt;To achieve consistency typically requires a long iterative process of &amp;quot;common law&amp;quot; development, during which two or more users annotate some set of data separately, then convene to discuss and adjudicate the disagreements, formulate new principles to cover the cases discussed, and then start a new round of independent annotation, comparison, discussion. The initial users will need to agree on the degree of consistency that is needed to accomplish their goals. 
An ancillary set of &amp;quot;laws&amp;quot; will need to be developed to reliably differentiate between disagreements that arise from intrinsic ambiguity and disagreements that arise because the conventions and annotation tools are not yet at the point of required coverage/stability/usability.&lt;/li&gt;&lt;li&gt;It can be useful in the process of developing, evaluating, and using an annotation standard to work on the different aspects separately. For example, if at some stage of development (or in some subcommunity of users), the tokenization is more reliable than the identification of relationships, the annotations might be adequate for some subset of the initial purposes, but not others. It is then important to develop conventions for tagging corpora or parts of corpora for relevant facts such as which version of the annotation scheme was used or the level of experience and/or training of the annotator(s).&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;&lt;font size=&quot;4&quot;&gt;&lt;font color=&quot;#808080&quot;&gt;2.2. When extending to a new community of users ...&lt;/font&gt;&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;Annotations developed within a particular community of initial developers and users might be extended to another community of users who are addressing different questions and may have different model assumptions. The goodness of a standard then becomes a product not just of the initial developers/users, but also of the flexibility/ingenuity of the later users of annotated data. &lt;br&gt;&lt;/li&gt;&lt;li&gt;The needs of various communities are in some cases overlapping (both phoneticians and sociolinguists may seek standards for phonetic annotation) and in other cases conflicting (a fieldworker may want language-specific idiosyncratic part-of-speech categories in interlinear glossing whereas a typologist may want agreed-upon cross-linguistically motivated categories). 
To get a sense of the potential disparities among different communities of users, we listed the first sets that came to mind:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;sociolinguists&lt;/li&gt;&lt;li&gt;computational linguists and NLP practitioners&lt;br&gt;&lt;/li&gt;&lt;li&gt;language acquisition specialists&lt;br&gt;&lt;/li&gt;&lt;li&gt;psycholinguists and laboratory phonologists&lt;/li&gt;&lt;li&gt;specialists in speech and language disorder&lt;/li&gt;&lt;li&gt;fieldworkers and language typologists&lt;/li&gt;&lt;li&gt;stylometrists, disputed-author researchers, etc.&lt;/li&gt;&lt;li&gt;educators evaluating text complexity, comprehensibility&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Even within a single later-adopting community, however, the questions and needs may differ in relationship to different types of primary data. Here we can differentiate at least among (1) full video recordings, (2) audio-only recordings, and (3) spoken utterances that were recorded only as text in the first level of &amp;quot;annotation&amp;quot; of the fieldworkers&amp;#39; written transcription. The initial tokenization/labelling of each of these primary data types may be an orthographic transcription, and in the case of type (3), the initial tokenization/labelling then becomes the only record. For some communities of users, the models and questions related to such transcriptions might differ dramatically from the models/questions that can be applied to data that are (4) originally written texts. &lt;br&gt;&lt;/li&gt;&lt;li&gt;In adopting (and adapting) a set of annotation conventions to a new set of questions and applications, then, it is again useful to ask: What aspects of the annotations can we usefully tease apart and evaluate/adopt/develop separately?&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;br&gt;&lt;b&gt;&lt;font size=&quot;5&quot;&gt;3. 
What does it take to be a good annotation standard?&lt;/font&gt;&lt;/b&gt;&lt;/font&gt;&lt;b&gt;&lt;font size=&quot;5&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;h2&gt;&lt;font size=&quot;4&quot;&gt;&lt;b&gt;3.1. Best practices (themes)&lt;br&gt;&lt;/b&gt;&lt;/font&gt;&lt;/h2&gt;&lt;br&gt;The associated properties that define a good annotation standard can be grouped into a few overarching themes and associated questions about the annotation conventions:&lt;br&gt;&lt;br&gt;   &lt;ul&gt;&lt;li&gt;Consistency/Reliability&lt;/li&gt;&lt;ul&gt;&lt;li&gt;What is the history of the annotation conventions? Did they evolve in careful, iterative rounds of (1) discussion of the goals of the annotation set, (2) independent annotation of a suitably diverse corpus of primary data by a large number of annotators, (3) calculation of inter-annotator agreement, and (4) discussion of points of agreement and disagreement and incremental revision?&lt;/li&gt;&lt;li&gt;Are there standards / mechanisms for continued calibration of consistency within and between annotators? &lt;br&gt;&lt;/li&gt;&lt;li&gt;What are the published intra-annotator and inter-annotator consistency rates?&lt;/li&gt;&lt;li&gt;Are the conventions designed to allow transparent, easy, reliable &amp;quot;back-tracking&amp;quot; to the primary data, via time stamps or via sequence position nodes within an annotation stream that has a reasonably fine-grained tokenization?&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Usability&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Is there good (accessible and extensible) documentation? 
&lt;br&gt;&lt;/li&gt;&lt;li&gt;Is there a suitably diverse and continuous community for teaching (and testing the ability of) new annotators / users?&lt;br&gt;&lt;/li&gt;&lt;li&gt;Are there good tools for annotating and using the annotations, and good community mechanisms for building / extending / sharing tools?&lt;/li&gt;&lt;li&gt;Is there a reliable connection between the annotations and the primary data that allows the user to track back to the data to check a suitable subset of the annotations?&lt;/li&gt;&lt;li&gt;Is the design of the annotation schema such that annotations can be used as reliable tags back into the primary data, for easy queries using standard query tools? &lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Resilience&lt;br&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;How does the standard deal with inter-annotator disagreement? Is information about disagreements preserved so that they can be analyzed in the course of developing the next version, to determine whether there are common cases of inherent ambiguity that need to be marked, or new cases that the conventions do not yet cover?&lt;/li&gt;&lt;li&gt;Are there principled mechanisms for marking degree of uncertainty about difficult or ambiguous cases? (See the CHAT manual for a thoughtful discussion of this question.)&lt;/li&gt;&lt;li&gt;Are there graceful ways of choosing to provide more or less specific degrees of analysis? &lt;br&gt;&lt;/li&gt;&lt;li&gt;Are there good mechanisms for providing and getting the most out of partial annotations? &lt;br&gt;&lt;/li&gt;&lt;li&gt;Are there robust ways of extending partial annotations to more of a corpus and of verifying and modifying the annotations of a corpus?&lt;/li&gt;&lt;li&gt;Relatedly, are there good mechanisms for keeping track of which parts of a corpus are in what state of annotation and verification / modification? 
&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Accountability/Responsibility&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Again, are there robust mechanisms for maintaining transparent links back to the primary data? And are these mechanisms ethical? Do they ensure the explicitly or implicitly agreed-upon degree of confidentiality of the person or people who produced the primary data (or who produced some subset of the annotations by providing naive judgments)? The issue of confidentiality is especially vexing when the primary data are video recordings. (See sections 4.3 and 4.4.) &lt;br&gt;&lt;/li&gt;&lt;li&gt;Do the standards encourage (or even allow) later &amp;quot;consumers&amp;quot; to credit the annotations in publication? &lt;br&gt;&lt;/li&gt;&lt;li&gt;Are the annotators (or the annotator level) for different parts of a corpus or different aspects of the annotation identified in a way that allows later users to partition the annotations -- e.g., into annotations by native speakers versus non-native speakers? &lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt; &lt;ul&gt;&lt;li&gt;Interoperability&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Can the annotation be validated and used in different tools or computational models? &lt;br&gt;&lt;/li&gt;&lt;li&gt;Is the logical structure of all three aspects of the annotation conventions transparent, and transparently related to the documented descriptions of the annotated phenomena?&lt;/li&gt;&lt;li&gt;Also, is it possible to translate to and from some other annotation conventions that have been used for this set of phenomena, in a way that makes it possible to share data across different analytic frameworks? &lt;br&gt;&lt;/li&gt;&lt;li&gt;Are the formats for encoding the different aspects of the annotation conducive to using the annotations for purposes different from the originally intended ones?&lt;/li&gt;&lt;li&gt;Are the definitions of the annotation elements freely available and stored in an open format? 
&lt;br&gt;&lt;/li&gt;&lt;li&gt;Are any requisite tools for annotating or using the annotations free and open source?&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Extensibility/Adaptability&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Can the annotation schema be extended to annotating utterances in other styles from the utterance sets for which it was developed? Can the annotation conventions be used for utterances produced by other speaker types? Can they be extended to (or readily adapted for) annotating data from other dialects, other languages, etc.?&lt;br&gt;&lt;/li&gt;&lt;li&gt;Is there a solid and suitably diverse core of users (and &amp;quot;maintainers&amp;quot;) to allow the standard to evolve and change in response to user feedback and/or to new needs? &lt;br&gt;&lt;/li&gt;&lt;li&gt;Is there a sensible consensus or mechanism for deciding when to &amp;quot;publish&amp;quot; a new version?&lt;/li&gt;&lt;li&gt;Are there good standards and mechanisms for versioning? For example, is there a robust way to permanently associate meta-data about which version of the conventions was used in annotating (different parts of) any corpus? Are there tools for keeping track of who the taggers were at different levels / times, and are there tools (or at least a &amp;quot;crib&amp;quot;) for how to &amp;quot;translate&amp;quot; across corpora and/or across levels of annotation as the standard evolves and expands?&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br&gt;&lt;h2&gt;&lt;font size=&quot;4&quot;&gt;&lt;b&gt;3.2. 
Best practices (&amp;quot;tangibles&amp;quot;)&lt;br&gt;&lt;/b&gt;&lt;/font&gt;&lt;/h2&gt;&lt;br&gt;The hallmarks of an emerging annotation standard therefore begin with these two important social characteristics:&lt;br&gt;&lt;ul&gt;&lt;li&gt;community&lt;/li&gt;&lt;ul&gt;&lt;li&gt;There is a sustainably large and diverse community of core users/maintainers.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;history&lt;/li&gt;&lt;ul&gt;&lt;li&gt;There is a history of effective dissemination of the conventions and recruitment of new core users.&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;Other more tangible accoutrements of annotation standards that have exemplified the best practices identified above include:&lt;br&gt;&lt;ul&gt;&lt;li&gt;documentation&lt;/li&gt;&lt;ul&gt;&lt;li&gt;The conventions are adequately and fully documented, in a &amp;quot;reference manual&amp;quot; that can be consulted easily by experienced users. &lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;training manual&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Ideally, there is also a separate, well-tested training manual (or a standard syllabus for training courses) that leads new users through a graduated sequence of more and more difficult examples culled from data that were annotated in developing the documentation and/or the reliability metrics.&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;inter-annotator reliability metrics&lt;/li&gt;&lt;ul&gt;&lt;li&gt;There are published records of inter-annotator consistency tests. Ideally, these tests differentiate between disagreements that stem from intrinsic ambiguity and disagreements that have other, remediable sources such as inadequacies in coverage, deficiencies in the documentation, or the like. 
&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;computational tools&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Members of the community have invested in developing computational tools that increase the reliability of the annotations.&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;conventions for metadata&lt;/li&gt;&lt;ul&gt;&lt;li&gt;The community has developed mechanisms for protecting confidentiality of the producers of the data, crediting of the provenance of the annotations, and so on. &lt;/li&gt;&lt;/ul&gt;&lt;li&gt;conventions for responsible maintenance&lt;/li&gt;&lt;ul&gt;&lt;li&gt;A set of conventions, or a more elaborate institutional framework, has also emerged, for responsible maintenance of the conventions, for continued elaboration of the documentation, and for updating of the training manual (or re-accreditation of the training courses).&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br&gt;&lt;b&gt;&lt;font size=&quot;5&quot;&gt;&lt;font color=&quot;#808080&quot;&gt;4. The state of the art (with some case studies)&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;In this section, we illustrate the considerations outlined in Section 3 by briefly reviewing the development and current state of annotation standards in four very broad areas. These reviews highlight two factors that promote or hinder the development of reliable and resilient standards. &lt;br&gt;&lt;br&gt;The first is the degree to which the &amp;quot;semantics&amp;quot; of the target phenomena are naturally constrained. At one extreme is the case of phonological annotation of consonants and vowels of spoken languages. Here, the aerodynamics of the vocal-auditory channel tightly constrains the tokenization and types of relationships at the lowest level of the prosodic hierarchy in an extremely robust way. 
An example at the other extreme is the analysis of grammatical constructions, where it is difficult to even imagine what boundaries could be imposed by nature on what a language-specific morphosyntactic construction can mean. &lt;br&gt;&lt;br&gt;The second factor is the age of the language type and/or of the systematic linguistic investigation of the phenomena across languages. Spoken languages may have existed as long as there have been modern Homo sapiens, and the &amp;quot;annotation&amp;quot; of consonants and vowels goes back to the first alphabetic writing systems. By contrast, the systematic study of signed languages has a much shorter history. &lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;4.1. Phonology of spoken languages&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font size=&quot;3&quot;&gt;4.1.1. Annotation systems for vowels and consonants&lt;/font&gt;&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;As noted above, tokenization and other aspects of the analysis of categories at the lowest level of the prosodic hierarchy for spoken languages are tightly constrained by the (psycho)physics of the human articulatory and auditory systems. As a result, it has been relatively easy to develop conventions for annotating utterances of spoken languages at this level of the grammar, using an alphabetic analysis, and the &lt;a class=&quot;external&quot; href=&quot;http://www.langsci.ucl.ac.uk/ipa/ipachart.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;b&gt;International Phonetic Alphabet&lt;/b&gt;&lt;/a&gt; is a premier example of a well-developed annotation standard. 
It is maintained and updated by the &lt;a class=&quot;external&quot; href=&quot;http://www.langsci.ucl.ac.uk/ipa/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;b&gt;International Phonetic Association&lt;/b&gt;&lt;/a&gt;, which was established in 1886 and today is associated with the International Congress of Phonetic Sciences, a meeting held every four years which attracts several thousand attendees. &lt;br&gt;&lt;br&gt;The International Phonetic Alphabet has a &lt;a class=&quot;external&quot; href=&quot;http://www.langsci.ucl.ac.uk/ipa/handbook.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;handbook&lt;/a&gt;, which documents the annotation conventions and specifies a well-codified format for presenting a catalog of the consonant and vowel inventories of a spoken language variety using the IPA. (There is a long-standing section of the &lt;i&gt;Journal of the International Phonetic Association&lt;/i&gt; devoted to publishing such language-specific schema.) The most recent version of the handbook was published in 1999, after a conference convened in 1989 to review the coverage of the categories in the IPA consonant and vowel charts and the lists of other symbols for categories that do not fit neatly into the tokenization and paradigmatic features that are encoded in the consonant and vowel charts. 
In between the conference and the publication of the revised Handbook, the &lt;i&gt;Journal of the International Phonetic Association&lt;/i&gt; published a report on deliberations of the conference as a whole (&lt;i&gt;JIPA&lt;/i&gt;, 19: 67-80) as well as reports from working subgroups charged with deliberating on various more focused issues such as &amp;quot;Computer Coding of IPA symbols&amp;quot; (&lt;i&gt;JIPA&lt;/i&gt;, 19: 81-82) and &amp;quot;the best means of transcription of disordered speech&amp;quot; (&lt;i&gt;JIPA&lt;/i&gt;, 24:95-98) and correspondence from members of the association commenting on the proposed revisions and deliberations at the conference (e.g., &lt;i&gt;JIPA&lt;/i&gt;, 20: 22-32).&lt;br&gt;&lt;br&gt;While there is no official training manual, there is a long history of teaching the annotation conventions (and the phonological analyses on which they are based), which predates the founding of the International Phonetic Association. For example, Henry Sweet&amp;#39;s &lt;i&gt;A handbook of phonetics&lt;/i&gt;, published in 1877, includes vowel and consonant tables that are organized in terms of the same dimensions of analysis as the modern IPA chart -- i.e., openness, place, rounding for vowels and place, manner, laryngeal properties for consonants. The same basic approach is also taken in most subsequent textbooks, including Peter Ladefoged&amp;#39;s well-known &lt;i&gt;A course in phonetics&lt;/i&gt;, which is still used in training students in the annotation of vowels and consonants in many departments of phonetics, linguistics, logopedics, and speech &amp;amp; hearing science. &lt;br&gt;&lt;br&gt;There is also a history of research on inter-annotator consistency rates for segmental transcription, and on the factors that affect transcription consistency. In general, it is easier to be consistent the closer the annotation is to a &amp;quot;broad&amp;quot; phonemic transcription. 
For example, Eisen (1993) reports complete agreement among three transcribers of only 50% for a &amp;quot;narrow&amp;quot; transcription, even when distinguishing among only ten &amp;quot;major class&amp;quot; categories such as &amp;quot;voiced plosive&amp;quot;. When the same transcribers were asked instead to note only segments that deviated from an automatically inserted broad &amp;quot;dictionary&amp;quot; form transcription, consistency improved to 85%. &lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font size=&quot;3&quot;&gt;4.1.2. Where the analysis breaks down&lt;/font&gt;&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;The different levels of reliability for &amp;quot;broad&amp;quot; versus &amp;quot;narrow&amp;quot; transcription hint at some of the things that affect inter-annotator reliability. Reliability is highest when the primary data are clean recordings of careful fluent utterances produced by adult native speakers of a dialect for which there is a consensus phonemic analysis that can be the basis for the tokenization and consonant and vowel label set, and when the goal of the annotation is to produce a &amp;quot;broad&amp;quot; phonemic transcription as the basis for morphological analysis or the like. Reliability is lower when recordings are noisy, when the primary data are casual or dysfluent utterances, when the dialect of the speaker(s) is an understudied variety that differs from the dialect on which the IPA description is based, or when the goal is to produce a &amp;quot;narrow&amp;quot; transcription as the basis for sociophonetic analysis of variation in the speech community or the like. In the latter cases, reliability is improved if annotators are well trained in phonetics (not just in &amp;quot;classical phonemics&amp;quot;) and have recourse to tools such as the interactive spectrographic display window in the Praat signal analysis tool. 
However, no amount of phonetic training will resolve the inherent unreliability of shoehorning &amp;quot;sub-phonemic&amp;quot; paradigmatic variation and &amp;quot;suprasegmental&amp;quot; parsing differences into a phonemic segmental model. &lt;br&gt;&lt;br&gt;Alphabetic annotation of pre-school children&amp;#39;s speech poses special challenges, then, because it assumes (paradigmatic and syntagmatic) phonological structures that may not be in place until the child is much older. Very cohesive research groups, such as the Stanford Child Phonology project that collected and annotated a cross-linguistic longitudinal corpus between 1967 and 1992, can achieve published inter-annotator consistency rates as high as 90-95% agreement. However, this agreement is typically achieved by regular (at least weekly) meetings among the primary transcribers, during which inconsistent tags are &amp;quot;corrected&amp;quot; to a consensus category (or to an &amp;quot;expert&amp;quot; tie-breaker category when consensus is impossible). Pye, Wilcox, and Siren (1988) suggest that this practice hides the true nature of the difficulty, since points of low inter-transcriber reliability can indicate places where the standard phonemic analysis of the target language is particularly inappropriate for the child&amp;#39;s developing phonological system. Hewlett and Waters (2004) make a similar point, suggesting that the problem of obscuring &amp;quot;sub-phonemic&amp;quot; variability is compounded in most large-scale cross-sectional norming studies, in which fairly &amp;quot;broad&amp;quot; transcription is used in on-the-fly observations without a permanent audio recording, in order to be able to collect data from a large number of children. 
Edwards and Beckman (2008) suggest that even in cross-sectional studies where &amp;quot;narrow&amp;quot; transcription of recordings is done, transcription should be supplemented by experiments eliciting (potentially continuous) perceptual responses from phonetically untrained native speaker/listeners. All of these researchers remind us that tokenization and paradigmatic differentiation at the level of vowels and consonants is a product of the interaction of the natural constraints from the aerodynamics with the exigencies of lexical contrast in dense neighborhoods. The phonemic analysis that is the basis for the IPA conventions is less compelling for a speaker whose lexicon is still too small to have very dense phonological neighborhoods. &lt;br&gt;&lt;br&gt;Prosodic phenomena such as stress and syllable structure, and intonational phenomena such as the melodies that group syllables together into phrases and the like, pose a related challenge for phonological annotation. Because there is no comparably compelling natural basis for tokenization of melodic events, spoken languages are much more diverse in the ways in which utterances are structured above the leaf nodes of the prosodic hierarchy. The Working Group on Labeling of Suprasegmentals at the 1989 conference that led to the current IPA handbook recognized this by deciding to recommend no standard annotation conventions for intonation (Bruce, 1989). A basic principle of the ToBI annotation framework also says that &amp;quot;phonetic transcription&amp;quot; of prosody and intonation is impossible. See Beckman, Hirschberg, and Shattuck-Hufnagel (2005) for further explication of this point and the implications for the development of annotation conventions for prosody and intonation. See also Pitrelli et al. (1994) for the remarkably good inter-annotator reliability rates that nonetheless can be achieved when conventions are specific to a particular dialect. 
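&lt;br&gt;&lt;br&gt;Consistency rates of the kind cited in this section can be computed in more than one way, and the choice affects the number reported. The following is a minimal sketch, not the procedure used in any of the studies cited: it computes the share of tokens on which all transcribers agree completely, and a chance-corrected pairwise measure (Cohen&amp;#39;s kappa). The function names and the toy label sequences are assumptions made for illustration, and the sketch presupposes that all transcribers labeled the same, already aligned tokens.

```python
from collections import Counter


def complete_agreement_rate(transcriptions):
    """Share of tokens on which every transcriber chose the same label.

    `transcriptions` is a list of equal-length label sequences,
    one sequence per transcriber, aligned token by token.
    """
    columns = list(zip(*transcriptions))
    agreed = sum(1 for labels in columns if len(set(labels)) == 1)
    return agreed / len(columns)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two transcribers."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if both labeled at random with their own label rates.
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1:
        # Both transcribers used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (invented): three transcribers labeling four stop tokens.
t1 = ["p", "b", "p", "b"]
t2 = ["p", "b", "b", "b"]
t3 = ["p", "b", "p", "p"]

print(complete_agreement_rate([t1, t2, t3]))
print(cohen_kappa(t1, t2))
```

Complete agreement among three or more transcribers is a stricter criterion than pairwise agreement, which is one reason to report which measure lies behind a published rate.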
&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;b&gt;4.2. Morphosyntax and semantics&lt;br&gt;&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;font size=&quot;3&quot;&gt;In the preceding section we described how phonological and phonetic annotation is easier for broad class distinctions at the leaf nodes of the prosodic hierarchy, where tokenization is naturally constrained by the psychophysics of speech production. There may be an analogous difference in the degree of difficulty of annotation for morphosyntactic structures and semantics of spoken languages. Specifically, it again seems easier to agree on tokenization of elements that are closer to the bottom of the constituent hierarchy, where the psychophysics of the vocal-auditory channel interact with more general cognitive considerations of attention and memory to promote a &amp;quot;temporal&amp;quot; or sequential (as opposed to a &amp;quot;spatial&amp;quot; or simultaneous) decomposition of the form/meaning mapping. We illustrate by describing standards for annotating aspects of &lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;font size=&quot;3&quot;&gt;morphology, syntax, and semantics&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;font size=&quot;3&quot;&gt; that have emerged in three different communities, before touching on some of the challenges in these areas of the grammar.&lt;br&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;4.2.1. 
The Penn Treebank&lt;br&gt;&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font color=&quot;#333333&quot; size=&quot;3&quot;&gt;The Penn Treebank is a collection of English texts that have been grammatically annotated for part of speech, grammatical function (predicate argument relations), and constituency (Marcus et al., 1993; Marcus et al., 1994; Taylor et al., 2001). For each sentence there is an accompanying syntactic analysis (a tree), hence the term treebank (a bank of syntactic trees). Although the Penn Treebank contains material taken from multiple sources (the Wall Street Journal, the Brown Corpus, the Switchboard corpus, and ATIS), its collection of annotated Wall Street Journal newspaper articles is so well known that many researchers think of the Penn Treebank as the Wall Street Journal corpus. It is distributed under a commercial license by the Linguistic Data Consortium (http://www.cis.upenn.edu/~treebank/). &lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font color=&quot;#333333&quot; size=&quot;3&quot;&gt;The corpus is distributed as a collection of texts with accompanying stand-off annotation. The Wall Street Journal section of the corpus, for example, consists of 2499 articles (totalling approximately a million words) published in the Wall Street Journal during a three-year period. For each article, there are a number of plain text files that contain various types of annotation as well as a &amp;quot;master&amp;quot; annotation file that integrates all of the annotation.&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font color=&quot;#333333&quot; size=&quot;3&quot;&gt;&lt;br&gt;The Penn Treebank is an important point of reference in grammatical annotation given its success. 
Not only has it become an important resource in computational linguistics (like WordNet or CELEX), as any search of the literature will reveal, but it has also inspired a large number of similar projects for other languages--e.g., Chinese (Xue et al., 2002), Czech (Hajicova, 1998), Spanish (Navarro et al., 2003), and German (Brants et al., 2002).&lt;br&gt;&lt;br&gt;The Penn Treebank Project has a number of strengths that help explain its popularity. Chief among these, of course, is that by providing a non-trivial amount of annotated newspaper text it managed to scratch an itch felt by the community of researchers interested in computational linguistics, natural language processing, and related fields. In addition, the Penn Treebank Project designed the corpus so that it could be used easily: it consists only of plain text files (easily processed), comes with good documentation, and is versioned.&lt;br&gt;&lt;br&gt;Despite its success, the Penn Treebank has its weaknesses. One of these is the absence of a standard toolkit or user application for its viewing and/or manipulation. This is unfortunate since such a toolkit could have improved the pace of development and helped improve annotation quality (by eliminating errors that could be detected through automated validation). The under-availability of tools also impedes its adoption by those lacking the resources to develop their own tools. (The creation of various open source toolkits has ameliorated this problem to some extent, but this is a recent development relative to the age of the Penn Treebank Project.) Another potential weakness is the grammatical model used to describe grammatical functions, which adheres to an old-fashioned Government and Binding analysis that posits, among other things, traces for movement. 
However, even though the annotation is couched in a multistratal theory of grammar, this has not hindered its use in monostratal theories of grammar, such as LFG (Frank, 2000).&lt;br&gt;&lt;br&gt;In fact, it is unrealistic to expect a great deal of standardization of the content of annotation for grammatical information given the highly contentious nature of grammatical theory itself. The difficulties inherent in this problem can be seen in attempts to develop treebanks in languages with more flexible word order and discontinuous constituency, such as German. Although it is possible to annotate German using a grammatical theory that posits traces, doing so leads to inelegance, and German treebanks have as a result departed from the Penn Treebank&amp;#39;s grammatical model. (The NeGra and TiGer annotation schemes use graphs with crossing edges rather than simple context-free trees.)&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;b&gt;&lt;font size=&quot;3&quot;&gt;4.2.2. Leipzig Glossing Rules&lt;/font&gt;&lt;br&gt;&lt;/b&gt;&lt;/font&gt;&lt;br&gt;The &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.eva.mpg.de/lingua/resources/glossing-rules.php&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;b&gt;Leipzig Glossing Rules&lt;/b&gt;&lt;/a&gt;, which build on Lehmann (1983), are a de facto standard for morphosyntactic glossing. They represent a light-touch codification of previous best practice. They allow for analyses of varying levels of granularity, subject only to the requirement that the segmentation in the glossing line must match that in the source language line. There is a standard set of abbreviations, which could usefully be extended. Documentation consists of a short document, with examples, freely available on-line. 
&lt;br&gt; &lt;br&gt;In their current version, the Leipzig Glossing Rules provide the means and the options for linguists of different persuasions to give adequate morphosyntactic glosses. A next step would be to suggest how users should characterize their own usage in a particular publication. At the most obvious level, all additional abbreviations should be specified. Then, we should note that the abbreviations are mainly for feature values (for &amp;lsquo;singular&amp;rsquo;, &amp;lsquo;feminine&amp;rsquo; and so on); it would be good practice to specify which feature each is a value of. Normally this is obvious, occasionally it can cause confusion. Finally, since the rules can be used for different purposes, it is helpful if users clarify and spell out their assumptions. For most purposes, when glossing &lt;i&gt;yesterday we bid three hundred pounds for that horse&lt;/i&gt;, we would gloss &lt;i&gt;bid&lt;/i&gt; as past tense. The information is derived not from the form itself but from the time adverbial, since the form &lt;i&gt;bid&lt;/i&gt; could be present or even imperative. For writing about tense, word order, argument structure, and so on, this solution is fine. If writing about syncretism, however, this theoretical ambiguity would matter, and would need to be indicated appropriately. More generally, the annotator frequently has to be selective in the level of detail included in the morphosyntactic glossing. This means that we cannot expect that different linguists would provide identical annotations. But we should aim for a greater level of consistency than we often find. The rules offer the alternatives, but good practice requires us to choose consciously, to specify the choices made and to apply them consistently. &lt;br&gt; &lt;br&gt;In terms of tools, it would be useful to have a tool that would check annotations for internal consistency and for any unintentional departures from the conventions. 
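&lt;br&gt;&lt;br&gt;As a rough illustration of what such a checking tool might do, the Python sketch below tests two of the points made above: that the segmentation of the gloss line matches that of the source line, and that every abbreviation used has been declared. It is a minimal sketch under stated assumptions, not an existing tool. The function names, the sample glosses, and the declared abbreviation set are hypothetical, and the sketch assumes that words are separated by spaces, that morpheme and clitic boundaries are written with a hyphen and an equals sign, and that grammatical labels are written in capitals.

```python
import re

# Boundary symbols assumed here: hyphen for morphemes, equals sign for clitics.
BOUNDARY = re.compile(r"[-=]")


def check_alignment(source_line, gloss_line):
    """Return a list of problems where gloss and source segmentation differ."""
    source_words = source_line.split()
    gloss_words = gloss_line.split()
    if len(source_words) != len(gloss_words):
        return [f"word count differs: {len(source_words)} vs {len(gloss_words)}"]
    problems = []
    for position, (src, gls) in enumerate(zip(source_words, gloss_words), start=1):
        # The sequence of boundary symbols must be identical in both lines.
        if BOUNDARY.findall(src) != BOUNDARY.findall(gls):
            problems.append(f"word {position}: '{src}' vs '{gls}'")
    return problems


def undeclared_abbreviations(gloss_line, declared):
    """Return capitalized labels in the gloss that are not in the declared set.

    Leading person digits are stripped, so 1PL is checked as PL. A lexical
    gloss written in capitals (such as the English pronoun I) would be
    flagged as well; a real tool would need a better test.
    """
    pieces = re.split(r"[\s\-=.]+", gloss_line)
    labels = {p.lstrip("0123456789") for p in pieces if p.isupper()}
    return sorted(labels - set(declared))


# Hypothetical example: an English sentence glossed in English.
source = "we bid three hundred pound-s"
gloss = "1PL bid.PST three hundred pound-PL"

print(check_alignment(source, gloss))
print(undeclared_abbreviations(gloss, {"PL"}))
```

A check of this kind cannot decide whether &lt;i&gt;bid&lt;/i&gt; ought to be glossed as past tense; it can only flag departures from whatever conventions the annotator has chosen and declared.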
&lt;br&gt; &lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;b&gt;&lt;font size=&quot;3&quot;&gt;4.2.3. FrameNet Annotation Criteria&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#333333&quot; size=&quot;3&quot;&gt;The Berkeley FrameNet&lt;/font&gt;&lt;/font&gt; project is using annotation of written corpora to build an on-line lexical resource for English, based on frame semantics and supported by corpus evidence.&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font color=&quot;#333333&quot; size=&quot;3&quot;&gt;Certain annotation practices followed in FrameNet can be seen as arbitrary but motivated. One of the central notions of FrameNet is that of Valence: general descriptions of the combinatory possibilities of individual lexical heads (verbs, nouns, adjectives, and some prepositions), expressed in both syntactic and semantic-role terms. Valence descriptions are derived automatically from a body of annotated sentences, so it is obviously necessary to agree on how the annotations are structured. The need to pair syntactic arguments with semantic roles motivates our decision to include the &amp;quot;markers&amp;quot; of a phrase with the constituent. For example, if we wish to identify the speaker and the content of an announcement from the phrase &lt;i&gt;the &lt;b&gt;announcement&lt;/b&gt; by the governor of her decision to resign&lt;/i&gt;, those two elements are blocked off as [&lt;i&gt;by the governor&lt;/i&gt;] and [&lt;i&gt;of her decision to resign&lt;/i&gt;]: some projects would ignore the prepositions and select only the NPs in those expressions. Similarly, in &lt;i&gt;the governor&amp;#39;s &lt;b&gt;announcement&lt;/b&gt; that she intended to resign&lt;/i&gt; the labels would be assigned to [&lt;i&gt;the governor&amp;#39;s&lt;/i&gt;] and [&lt;i&gt;that she intended to resign&lt;/i&gt;], including both the genitive suffix and the that-clause. 
And similarly, then, in [&lt;i&gt;the governor&lt;/i&gt;] &lt;b&gt;&lt;i&gt;announced&lt;/i&gt;&lt;/b&gt; [&lt;i&gt;to the world&lt;/i&gt;] [&lt;i&gt;that she intended to resign&lt;/i&gt;]. A differently motivated project might leave out the structure-markers in order to represent the &amp;quot;content&amp;quot; more faithfully; FrameNet includes these in order to match semantic and syntactic segmentation, allowing users to recognize the structural elements.&lt;br&gt;&lt;br&gt;It will be seen that the annotations themselves do not distinguish markers that are determined by the grammar (the possessive ending in &lt;i&gt;the governor&lt;b&gt;&amp;#39;s&lt;/b&gt; announcement&lt;/i&gt;), from those that are determined by the meaning of the PP as a whole (the preposition in &lt;i&gt;under the table&lt;/i&gt;), or by the governing lexical head (the preposition &lt;i&gt;on&lt;/i&gt; in &lt;i&gt;we can depend on Harry&lt;/i&gt;). Such information is recoverable from the grammar and the lexicon, but it is not part of the annotation.&lt;br&gt;&lt;br&gt;The standardization of this FrameNet practice has various consequences. All annotators on the Berkeley project agree to use it in their work, and various FrameNet or FrameNet-like projects in other languages have agreed to follow the same, or analogous, conventions. These are Spanish FrameNet, Japanese FrameNet and the SALSA project in Germany. 
Furthermore, Professor Hiroaki Sato of Senshu University in Tokyo manages a browser of FrameNet data (&amp;quot;FrameSQL&amp;quot; http://sato.fm.senshu-u.ac.jp/fn2_13/notes/index.html) and he is developing a way of pairing valence patterns across the various languages that have FrameNet databases; the comparisons work best if all users treat function markers in the same way.&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;4.2.4.&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt; The difficulty of developing annotation systems for grammatical constructions&lt;/font&gt;&lt;/b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font color=&quot;#333333&quot; size=&quot;3&quot;&gt;&lt;br&gt;&lt;br&gt;There are well-known tree-banks that offer syntactic parses of all of the sentences in a sample, meeting certain levels of adequacy, such as the Penn Treebank. There are various levels of part-of-speech tagging for large corpora, such as the British National Corpus, that are for the most part successful. But a complete record of the special grammatical constructions in a text does not seem feasible. &lt;br&gt;&lt;br&gt;For research purposes it should be possible to tag (say) all comparative sentences in a text, identifying the scales and the phrases that directly or indirectly indicate the entities being compared. It should be in principle possible to identify all idioms or tight collocations in a text, however long this might take. It should be possible to notice constructions with certain peculiarities for the sake of assembling examples for further study, such as, for English, the pattern that has a degree-modified adjective followed by an indefinite NP marked by the preposition &lt;i&gt;of&lt;/i&gt;. 
(&lt;i&gt;Do you need {[this big] [of a box]}?&lt;/i&gt;) But the expressions that represent individual constructions are frequently tightly intertwined, and the effort to work out the nature of such integration on a large scale is not likely to be possible. A sentence like &lt;i&gt;He&amp;#39;s in no bigger of a hurry than you are&lt;/i&gt; exhibits a comparative structure (&lt;i&gt;bigger ... than you are&lt;/i&gt;), a collocational idiom (&lt;i&gt;in ... a hurry&lt;/i&gt;, one of the few uses of &lt;i&gt;hurry&lt;/i&gt; as a noun), the puzzling structure with the &lt;i&gt;of&lt;/i&gt;-phrase (&lt;i&gt;bigger of a hurry&lt;/i&gt;), a special minimizing use of the word &lt;i&gt;no&lt;/i&gt; with a compared adjective (consider the difference between &lt;i&gt;he&amp;#39;s &lt;u&gt;not&lt;/u&gt; smarter than your mother&lt;/i&gt; and &lt;i&gt;he&amp;#39;s &lt;u&gt;no&lt;/u&gt; smarter than your mother&lt;/i&gt;), and the particular form of the &lt;i&gt;than&lt;/i&gt;-clause (&lt;i&gt;than you are&lt;/i&gt; vs. &lt;i&gt;than you&lt;/i&gt;, &lt;i&gt;than expected&lt;/i&gt;, &lt;i&gt;than ever&lt;/i&gt;, etc.). Representing the working of all of these constructions and their articulation is not to be expected.&lt;br&gt;&lt;br&gt;Research that collects and explores examples of grammatical constructions, idioms, collocations, and multiword expressions in general, and illustrates their properties one at a time, has got to be an essential task for linguistics and computational linguistics, both for grammar writing and as a way of producing learning corpora for machine-learning techniques to improve syntactic parsers. But since many of the most important constructions cannot easily be associated with individual words in a sentence, or with specific nodes in a parse tree, there is little likelihood of acquiring large-scale accurate annotations of grammatical constructions, beyond familiar parsing and chunking of nonproblematic sentences, any time soon. 
&lt;/font&gt;&lt;/font&gt;The problem is further compounded by the fact that cross-framework agreement on syntactic phenomena in general is not easy to achieve: dependency-based and constituency-based treatments are not always interconvertible; theoreticians who seek to minimize redundancy in their analyses would not see the same number of construction types in a given text as the grammarian who wishes to work with structures of finer granularity. &lt;br&gt;&lt;br&gt;The proposal in this section favors rich analysis of small texts, together with extensive sampling of given constructional phenomena one at a time, or in small families of constructions. Such a combined approach should eventually lead to understanding the importance of non-core constructions and multiword expressions, classifying their variety, estimating how many of them there are, determining their relevance in profiling different genres, estimating their &amp;quot;density&amp;quot; in different kinds of texts, exploring the manner in which they are learned, and evaluating their contribution to measures of language complexity.&lt;br&gt;&lt;br&gt;A sample of constructional annotations prepared within the FrameNet project can be seen at http://www.icsi.berkeley.edu/~hsato/cxn00/21colorTag/index.html.&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;b&gt;4.3. Annotation of gesture&lt;/b&gt;&lt;/font&gt;&lt;br&gt;&lt;br&gt;The two factors relevant for annotation developments, as discussed above in the beginning paragraphs of Section 4, are especially salient for the progression of standards in gesture annotation. We discuss gesture here as inclusive of gestures in signed languages as well as discourse-related gestures of the spontaneous type that accompany natural spoken language narratives. 
&lt;br&gt;&lt;br&gt;First, in contrast to the tightly constrained audio-articulatory modality of spoken language systems, which are articulated using the vocal tract, gesture systems involve the visual-gestural modality. Traditionally, gestures are understood to be articulated using the hands in movements. In the case of sign languages, however, the category of gestures has recently been expanded to include certain movements involving the head, face, and shoulders. (See Neidle et al. 2000 for some discussion of nonmanual gestures in ASL, and Boyes-Braem, 2001 for the descriptions of &amp;quot;mouth gestures&amp;quot; in multiple European sign languages.) Moreover, the visual space of gestures in sign languages exists as a complex continuum which involves the signed phonemic structures in the lowest level of the prosodic hierarchy, larger signed morphemes which engage in spatially bound agreement relationships with other signs, and &amp;quot;nonmanual&amp;quot; gestures of higher-order prosodic structures. So while the conventions for annotating gestures in the traditional sense may be developed relatively straightforwardly using video analysis, developing annotation standards for gestures which handle the complexity of these relationships must also rely on multiple and dynamic layers of annotation of the types illustrated in the figures and examples in Section 1.&lt;br&gt;&lt;br&gt;The Language Archiving Technology tool, &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lat-mpi.eu/tools/elan/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ELAN&lt;/a&gt;, is a professional tool for the complex annotation of video and audio sources. &amp;quot;Tiers&amp;quot; are implemented for simultaneously displaying and annotating parallel levels of analysis. 
These can be nested for dependencies between, say, an independent parent annotation of morpheme-by-morpheme transcription, and referring tiers for the varying gestural articulators (e.g. hand vs. mouth). A full &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lat-mpi.eu/tools/elan/manual/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;manual&lt;/a&gt; for ELAN is available online.&lt;br&gt;&lt;br&gt;Second, the systematic investigation of gestures as a linguistic phenomenon is a relatively new pursuit. This is true for both the annotation of gestures in spoken language narrative studies (such as with the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://mcneilllab.uchicago.edu/topics/annotation.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;McNeill Lab&lt;/a&gt; project) and the annotation of sign languages (see Neidle et al.&amp;#39;s &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.bu.edu/asllrp/SignStream/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Sign Stream Project&lt;/a&gt;, and the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://childes.psy.cmu.edu/manuals/bts.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Berkeley Transcription System manual&lt;/a&gt;). The next section presents some considerations for developing unified conventions in the annotation of sign languages.&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;4.4. A unified annotation standard for signed languages&lt;/font&gt;&lt;font size=&quot;4&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;Finally, we address the need for a unified annotation of sign languages, as identified in this workshop group and codified within the desiderata for qualities of annotation standards in general. 
First, however, some prior discussion concerning the dissemination of tools and standards among the communities of practice (sign language linguists) is necessary. For although several tools currently exist for the scientific annotation of video data (see above for gesture annotation using Anvil, ELAN, and Cross-Modal Analysis of Signal and Sense, for example), and while the target users are a close-knit community, a widespread standard for sign language transcription and annotation is lacking.&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;b&gt;4.4.1. What sign language annotation is, and what it is not&lt;/b&gt;&lt;br&gt;&lt;/font&gt;&lt;br&gt;We begin by clarifying the purpose of a unified standard for sign languages. We emphasize that we do not aim to advocate a writing system of signs, nor do we intend for annotation to replace the primary linguistic video data with a derived set of data. Rather, annotation of sign languages should complement the primary data record as a way of tagging and searching the data. The goal of a unified sign language annotation standard is to provide a shared platform of convention for collaborating across the various linguistic domains. &lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;b&gt;4.4.2. &lt;/b&gt;&lt;/font&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot;&gt;What a unified standard provides&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;For all linguists, annotation is paramount, and standards promote convergence. For sign language linguists, the annotation of primary video data poses several challenges. Few standards exist, for example, when it comes to annotating sign languages for fundamental linguistic phenomena such as pronominalization or indexicalization within the interlinear gloss. 
A more complex issue is the matter of transcribing certain non-manual features that are coarticulated with the manual signs--as functional labels (neg), abbreviations of the action (head shake), or even further break-down of the correlates involved. On the practical side, high-quality video data can require large loads of memory, and utilizing tools for analyzing video requires higher processor speeds and memory load. &lt;br&gt;&lt;br&gt;These issues stand to hinder the standardization of data annotation. Through further practice and dissemination, however, advances in sign language annotation standards provide the potential for consistency, conversation, and conventionalized practices among a growing community. The ideal situation (projected solution) is one where sign language linguists, whether collaborating in an international workshop setting or via remote communications, have access to one mutually accessible standard that is extensible for all sign languages, interoperable across varying domains and models of interest, granular across levels of linguistic analysis, and practical for continuous usability.&lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;5&quot;&gt;5. &lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;5&quot;&gt;Existing annotation standards and resources&lt;/font&gt;&lt;br&gt;&lt;/b&gt;&lt;br&gt;This section lists the various annotation conventions and other resources for developing and discussing annotation standards that were suggested by participants of Cyberling09. &lt;br&gt;&lt;br&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;5.1. 
Phonetics and phonology&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;Phone segment tagging symbols, including both:&lt;br&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;IPA and its various ASCII-fications, such as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.phon.ucl.ac.uk/home/sampa/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;SAMPA&quot;&gt;SAMPA&lt;/a&gt;, &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/WorldBet&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;WorldBet&quot;&gt;WorldBet&lt;/a&gt;&lt;/li&gt;&lt;li&gt;and language-specific phoneme-segment encodings such as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Arpabet&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;ArpaBet&quot;&gt;ArpaBet&lt;/a&gt; for American English and the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.kokken.go.jp/katsudo/seika/corpus/public/labeling.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;CSJ encoding&quot;&gt;CSJ encoding&lt;/a&gt; for Japanese&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;The various &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.ling.ohio-state.edu/%7Etobi/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ToBI&lt;/a&gt; conventions and similar conventions in other frameworks such as the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://todi.let.kun.nl/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ToDI&lt;/a&gt; conventions&lt;br&gt;&lt;/li&gt;&lt;li&gt;The &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://childes.psy.cmu.edu/phon/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;PhonBank&lt;/a&gt; conventions and 
tools&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;5.2. Morphosyntax &lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.eva.mpg.de/lingua/resources/glossing-rules.php&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Leipzig Glossing rules&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/ISO_TC_37-4_N225_CD_MAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Morphosyntactic Annotation Format (MAF)&lt;/a&gt;&lt;/li&gt;&lt;li&gt;ISO Lexical Markup Framework (LMF): &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lexicalmarkupframework.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Homepage (with Publications and Tools)&quot;&gt;Homepage (with Publications and Tools)&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Typecraft:  a &lt;i&gt;labeling system&lt;/i&gt; which, for any verb construction of a given language, provides a template for that construction type displaying its argument structure, in a fashion as transparent as possible. The template is constructed from a universally established inventory of labeling primitives.&lt;br&gt;http://www.typecraft.org/tc2wiki/Verbconstructions_cross-linguistically_-_Introduction&lt;br&gt;&lt;/li&gt;&lt;li&gt;tags for short-unit word (SUW) and long-unit word (LUW) in the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.kokken.go.jp/katsudo/seika/corpus/public/5.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;CSJ&quot;&gt;CSJ&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;b&gt;&lt;font size=&quot;4&quot;&gt;5.3. 
Syntax and semantics&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://framenet.icsi.berkeley.edu/index.php?option=com_wrapper&amp;Itemid=126&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;FrameNet&quot;&gt;FrameNet annotation manual&lt;br&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;The tokenization guidelines, part-of-speech tags, and bracketing conventions for the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cis.upenn.edu/%7Etreebank/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Penn TreeBank project&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/ISO_TC37_SC4_N285_MetaModelSynAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Syntactic Annotation Format (SynAF)&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/new_doc/iso_tc37_sc4_n269_ver10_wg2_24617-1_semaf-time_utf8.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Semantic Annotation Format - Time and Events (SemAF-TIME)&lt;/a&gt; (formerly &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.timeml.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TimeML&lt;/a&gt;)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;5.4. 
Pragmatics and discourse structure&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://childes.psy.cmu.edu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;CHAT&quot;&gt;CHAT&lt;/a&gt; conventions for segmenting turns and identifying the participant and the setting&lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.rochester.edu/research/cisd/resources/damsl/RevisedManual/RevisedManual.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;DAMSL&quot;&gt;DAMSL&lt;br&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;and other schemes documented and discussed at the 1998 &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.umd.edu/users/traum/DSD/schemes.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;DRI meeting&quot;&gt;DRI meeting&lt;/a&gt; such as:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;intentional structure annotation (see Nakatani, Grosz, Ahn, and Hirschberg, 1995)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;5.5. 
&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;Gesture&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;David McNeill&amp;#39;s &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://mcneilllab.uchicago.edu/topics/annotation.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Gesture annotation&quot;&gt;Gesture annotation&lt;/a&gt; &lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://childes.psy.cmu.edu/manuals/bts.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;BTS sign transcription system&lt;/a&gt;&lt;/li&gt;&lt;li&gt;NEUROGES-ELAN system: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.berlingesturecenter.de/seminare/neurogeselan/neurogeselan.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;workshop series homepage&quot;&gt;workshop series homepage&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;Michael Kipp&amp;#39;s &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.dfki.de/%7Ekipp/anvil/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Anvil&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Carol Neidle&amp;#39;s &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.bu.edu/asllrp/SignStream/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;SignStream&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Vislab&amp;#39;s (Francis Quek&amp;#39;s) Cross-Modal Analysis of Signal and Sense for multi-modal human discourse &lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;5.6. 
Other resources&lt;/font&gt;&lt;font size=&quot;4&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;The EMU Speech Database System (see Cassidy and Harrington, 2001): &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://emu.sourceforge.net/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;sourceforge page&quot;&gt;sourceforge page&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;ISO TC37 SC4 - Language Resource Management : &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.tc37.sc4.org&quot; target=&quot;_self&quot;&gt;Homepage&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.isocat.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Data Category Registry for linguistic concepts&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;    &lt;b&gt;&lt;i&gt;&lt;u&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linguistics-ontology.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;GOLD (General Ontology for Linguistic Description)&lt;/a&gt;&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;various (lineages of) POS tagging systems such as:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;the ones assumed in the taggers linked into &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www-nlp.stanford.edu/links/statnlp.html#Taggers&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;the Stanford NLP resources page&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;text transcription symbols promoted on the LDC Corpus Cookbook page for &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://projects.ldc.upenn.edu/Corpus_Cookbook/transcription/symbols.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;transcribing 
text&quot;&gt;transcription/symbols&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.ilc.cnr.it/EAGLES/isle/ISLE_Home_Page.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;International Standards for Language Engineering (ISLE)&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;ISO Linguistic Annotation Framework (LAF) &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Ffaculty%2Fide%2Fpapers%2FLAF-LREC06.pdf&quot; target=&quot;_self&quot;&gt;Overview&lt;/a&gt; and &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/iso_tc37_sc4_N463_rev00_wg1_wd_LAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Draft Standard (under revision)&lt;/a&gt;&lt;/li&gt;&lt;li&gt;XML Serialization for LAF: &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Ffaculty%2Fide%2Fpapers%2FLAW.pdf&quot; target=&quot;_self&quot;&gt;ISO Graph Annotation Format (GrAF)&lt;/a&gt;&lt;/li&gt;&lt;li&gt;SoundIndex and related search and view tools developed in the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://lacito.vjf.cnrs.fr/archivage/description.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;LACITO&quot;&gt;LACITO&lt;/a&gt; Linguistic Data Archiving Project (see Jacobson, Michailovsky, and Lowe, 2001).&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.xces.org&quot; target=&quot;_self&quot;&gt;XML Corpus Encoding Standard (XCES) &lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Fsigann&quot; target=&quot;_self&quot;&gt;SIGAnn : ACL Special Interest Group on Annotations &lt;/a&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Fsigann&quot; 
target=&quot;_self&quot;&gt;&lt;br&gt;&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;NXT System: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://groups.inf.ed.ac.uk/switchboard/links.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Switchboard in NXT&quot;&gt;Switchboard in NXT&lt;/a&gt; &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.amiproject.org/showcase/standards-and-toolkits/nite-xml-toolkit-for-annotations&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;NITE XML Toolkit&quot;&gt;NITE XML Toolkit&lt;/a&gt; &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://videolectures.net/mlmi04ch_carletta_iab/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Video lecture by Jean Carletta: the NITE XML Toolkit Meets the ICSI Meeting Corpus&quot;&gt;Video lecture by Jean Carletta: the NITE XML Toolkit Meets the ICSI Meeting Corpus&lt;/a&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;tool and framework for building &lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;E-MELD (Electronic Metastructure for Endangered Language Data): &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://emeld.org/index.cfm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Homepage&quot;&gt;Homepage&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;All of the annotation systems listed on the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.ldc.upenn.edu/annotation/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;COCOSDA technical topic domain Corpus Annotation Tools page&quot;&gt;COCOSDA Corpus Annotation Tools page&lt;/a&gt;&lt;/li&gt;&lt;li&gt;The &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.scs.leeds.ac.uk/amalgam/amalgam/amalghome.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Automatic 
Mapping Among Lexico-Grammatical Annotation Models project&lt;/a&gt; resources&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot; size=&quot;5&quot;&gt;6. &lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;5&quot;&gt;References&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;Beckman, Mary E., Julia Hirschberg, and Stefanie Shattuck-Hufnagel. 2005. The original ToBI system and the evolution of the ToBI framework. In Sun-Ah Jun, ed. &lt;i&gt;Prosodic Typology: The Phonology of Intonation and Phrasing&lt;/i&gt;, pp. 9-54. Oxford University Press.&lt;/li&gt;&lt;li&gt;Bird, Steven, and Jonathan Harrington (2001). Editorial: Speech annotation and corpus tools. &lt;i&gt;Speech Communication&lt;/i&gt;, 33(1,2): 1-4.&lt;/li&gt;&lt;li&gt;Bird, Steven, and Mark Liberman (1999). Annotation graphs as a framework for multidimensional linguistic data analysis, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL, Madrid, Spain. &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://acl.ldc.upenn.edu/W/W99/W99-0301.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;http://acl.ldc.upenn.edu/W/W99/W99-0301.pdf&quot;&gt;http://acl.ldc.upenn.edu/W/W99/W99-0301.pdf&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;Bird, Steven, and Mark Liberman (2001). A formal framework for linguistic annotation. &lt;i&gt;Speech Communication&lt;/i&gt;, 33(1,2): 23-60.&lt;/li&gt;&lt;li&gt;Bird, Steven, and Gary Simons (2003). Extending Dublin Core Metadata to support the description and discovery of language resources. &lt;i&gt;Computing and the Humanities&lt;/i&gt;, 37, 375-388.&lt;/li&gt;&lt;li&gt;    &lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt; Bird, Steven, and Gary Simons (2003). Seven dimensions of portability for language documentation and description. 
&lt;i&gt;Language&lt;/i&gt;, 79(3): 557&amp;ndash;82.&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Bow, Catherine, Baden Hughes, and Steven Bird (2003). Towards a general model of interlinear text. In &lt;i&gt;Proceedings of the EMELD Conference 2003: Digitizing and annotating texts and field recordings&lt;/i&gt;. &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linguistlist.org/emeld/workshop/2003/proceedings03.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://linguistlist.org/emeld/workshop/2003/proceedings03.html&lt;/a&gt;.  &lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;Brants, Thorsten, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER Treebank. In E. Hinrichs and K. Simov (eds.), &lt;i&gt;Proceedings of the First Workshop on Treebanks and Linguistic Theories&lt;/i&gt;, pp. 24&amp;ndash;41, Sozopol, Bulgaria.&lt;/li&gt;&lt;li&gt;Bruce, G&amp;ouml;sta (1989). Report from the IPA working group on suprasegmental categories. &lt;i&gt;Working Papers, Lund University, Department of Linguistics&lt;/i&gt;, 35: 15-40.&lt;br&gt;&lt;/li&gt;&lt;li&gt;Brugman, H., P. Wittenburg, S. C. Levinson, and S. Kita. Multimodal annotations in gesture and sign language studies. In M. Rodriguez Gonz&amp;aacute;lez &amp;amp; C. Paz Su&amp;aacute;rez Araujo, eds., &lt;i&gt;Third international conference on language resources and evaluation&lt;/i&gt;(pp. 176-182). 
&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.mpi.nl/institute/research-groups/language-and-cognition-group/publications&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;http://www.mpi.nl/institute/research-groups/language-and-cognition-group/publications&quot;&gt;http://www.mpi.nl/institute/research-groups/language-and-cognition-group/publications&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Cassidy, Steve, and Jonathan Harrington (2001). Multi-level annotation in the Emu speech database management system. &lt;i&gt;Speech Communication&lt;/i&gt;, 33 (1,2): 61-77.&lt;/li&gt;&lt;li&gt;Comrie, Bernard, Martin Haspelmath and Balthasar Bickel. (2004, revised 2008). The Leipzig Glossing Rules. Available at: &lt;br&gt;http://www.eva.mpg.de/lingua/resources/glossing-rules.php&lt;/li&gt;&lt;li&gt;Edwards, J., &amp;amp; Beckman, M. E. (2008). Methodological questions in studying phonological acquisition. &lt;i&gt;Clinical Linguistics and Phonetics&lt;/i&gt;, 22(12): 939-958.&lt;/li&gt;&lt;li&gt;Eisen, B. (1993). Reliability of speech segmentation and labelling at different levels of transcription. &lt;i&gt;Proceedings of the 3rd European Conference on Speech Communication and Technology&lt;/i&gt;, Vol. 1, pp. 673-676.&lt;/li&gt;&lt;li&gt;Frank, A. 2000. Automatic F-Structure Annotation of Treebank Trees. In M. Butt and T. H. King (eds.), &lt;i&gt;The Fifth International Conference on Lexical-Functional Grammar&lt;/i&gt;, The University of California at Berkeley, July 19-20 2000, CSLI Publications, Stanford, CA.&lt;br&gt;&lt;/li&gt;&lt;li&gt;Gabbard, Ryan, Seth Kulick, and Mitchell Marcus. 2006. Fully parsing the Penn Treebank. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, New York, NY, 184&amp;ndash;191. Morristown, NJ: Association for Computational Linguistics. 
&lt;/li&gt;&lt;li&gt;Hajicova, E. 1998. Prague Dependency Treebank: From Analytic to Tectogrammatical Annotation. In Proc. TSD&amp;rsquo;98.&lt;br&gt;&lt;/li&gt;&lt;li&gt;Krotov, Alexander, Mark Hepple, Robert J. Gaizauskas, and Yorick Wilks. 1998. Compacting the Penn Treebank grammar. Proceedings of COLING/ACL98: Joint Meeting of the 36th Annual Meeting of the ACL and the 17th International Conference on Computational Linguistics, Montr&amp;eacute;al, Canada, 699&amp;ndash;703. Morristown, NJ: Association for Computational Linguistics. &lt;br&gt;&lt;/li&gt;&lt;li&gt;Hewlett, Nigel and Waters, Daphne (2004). Gradient change in the acquisition of phonology. &lt;i&gt;Clinical Linguistics &amp;amp; Phonetics&lt;/i&gt;,18 (6): 523-533.&lt;br&gt;&lt;/li&gt;&lt;li&gt;Ide, N. and Suderman, K. (2007). &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.vassar.edu/faculty/ide/papers/LAW.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;GrAF: A Graph-based Format for Linguistic Annotations.&lt;/a&gt; &lt;i&gt;Proceedings of the Linguistic Annotation Workshop&lt;/i&gt;, held in conjunction with ACL 2007, Prague, June 28-29, 1-8.&lt;/li&gt;&lt;li&gt;Ide, N., Romary, L. (2007). &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.vassar.edu/%7Eide/papers/Elsnet.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Towards International Standards for Language Resources.&lt;/a&gt; In Dybkjaer, L., Hemsen, H., Minker, W. (Eds.), &lt;i&gt;Evaluation of Text and Speech Systems&lt;/i&gt;, Springer, 263-84.&lt;/li&gt;&lt;li&gt;Ide, N., Romary, L.. (2006). 
&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.vassar.edu/%7Eide/papers/LAF-LREC06.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Representing Linguistic Corpora and Their Annotations.&lt;/a&gt; &lt;i&gt;Proceedings of the Fifth Language Resources and Evaluation Conference&lt;/i&gt; (LREC), Genoa, Italy.&lt;/li&gt;&lt;li&gt;Ide, N., Romary, L. (2004). &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.vassar.edu/%7Eide/papers/JNLE-rev.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;International standard for a linguistic annotation framework.&lt;/a&gt; &lt;i&gt;Journal of Natural Language Engineering&lt;/i&gt;, 10:3-4, 211-225.&lt;/li&gt;&lt;li&gt;Ide, N., Romary, L. (2004). &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.vassar.edu/%7Eide/papers/LREC2004-DCR.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;A Registry of Standard Data Categories for Linguistic Annotation&lt;/a&gt;. &lt;i&gt;Proceedings of the Fourth Language Resources and Evaluation Conference&lt;/i&gt; (LREC), Lisbon, 135-39.&lt;/li&gt;&lt;li&gt;Ide, N., Bonhomme, P., Romary, L. (2000). XCES: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.vassar.edu/%7Eide/papers/xces-lrec00.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;An XML-based Standard for Linguistic Corpora&lt;/a&gt;. &lt;i&gt;Proceedings of the Second Language Resources and Evaluation Conference&lt;/i&gt; (LREC), Athens, Greece, 825-30.&lt;/li&gt;&lt;li&gt;Jacobson, Michel, Boyd Michailovsky, John B. Lowe (2001). Linguistic documents synchronizing sound and text. &lt;i&gt;Speech Communication&lt;/i&gt;, 33 (1,2): 79-96.&lt;br&gt;&lt;/li&gt;&lt;li&gt;Kipp, Michael (2001). 
&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.dfki.de/%7Ekipp/public_archive/kipp2001-eurospeech.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Anvil - A Generic Annotation Tool for Multimodal Dialogue&lt;/a&gt; &lt;i&gt;Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech)&lt;/i&gt;, pp. 1367-1370, Aalborg, September 2001. &lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Lausberg, &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Hedda, and Han Sloetjes (2009). Coding gestural behavior with the NEUROGES&amp;ndash;ELAN system. &lt;i&gt;Behavior Research Methods&lt;/i&gt;, 41 (3), 841-849.&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;Lehmann, Christian (1983). Directions for interlinear morphemic  translations. &lt;i&gt;Folia Linguistica&lt;/i&gt; 16.193-224. &lt;/li&gt;&lt;li&gt;Marcus, Mitchell, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large annotated corpus of English: The Penn Treebank. Computational Linguistics 19.313&amp;ndash;30. &lt;br&gt;&lt;/li&gt;&lt;li&gt;Marcus, Mitchell, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: annotating predicate argument structure. &lt;i&gt;Proceedings of the workshop on Human Language Technology&lt;/i&gt;, Princeton, NJ, 110&amp;ndash;5. Morristown, NJ: Association for Computational Linguistics. 
&lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;McKelvie, &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;David, &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt; Amy Isard, Andreas Mengel, Morten Baun M&amp;oslash;ller, Michael Grosse, and Marion Klein &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;(2001). The MATE workbench -- An annotation tool for XML coded speech corpora. &lt;i&gt;Speech Communication&lt;/i&gt;, 33 (1,2): 97-112.&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Nakatani, Christine H., Barbara J. Grosz, David D. Ahn, and Julia Hirschberg (1995). &lt;i&gt;Instructions for annotating discourse&lt;/i&gt;. TR: 21-95, Harvard University, Cambridge, MA.&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;Navarro, Borja, Montserrat Civit, M. Antonia Mart&amp;iacute;, R. Marcos, B. Fern&amp;aacute;ndez. 2003. Syntactic, Semantic and Pragmatic Annotation in Cast3LB. &lt;i&gt;Shallow Processing of Large Corpora (SProLaC), a Workshop on Corpus Linguistics&lt;/i&gt;, Lancaster, UK.&lt;/li&gt;&lt;li&gt;Neidle, Carol, Stan Sclaroff, and Vassilis Athitsos (2001). SignStream: A tool for linguistic and computer vision research on visual-gestural language data. &lt;i&gt;Behavior Research Methods,&lt;/i&gt; 33, 311-320.&lt;/li&gt;&lt;li&gt;Pitrelli, John F., Mary E. Beckman, and Julia Hirschberg (1994). Evaluation of prosodic transcription labelling reliability in the ToBI framework. 
&lt;i&gt;Proceedings of the 1994 International Conference on Spoken Language Processing&lt;/i&gt;, Vol. 1, pp. 123-126.&lt;br&gt;&lt;/li&gt;&lt;li&gt;Pye, C., K. A. Wilcox, and K. A. Siren (1988). Refining transcriptions: the significance of transcriber &amp;lsquo;errors&amp;rsquo; &lt;br&gt;&lt;i&gt;Journal of Child Language&lt;/i&gt;, 15, 17&amp;ndash;37. &lt;br&gt;&lt;/li&gt;&lt;li&gt;Quek, Francis, Dan McNeill, Robert Bryll, and Mary Harper (2002) &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.terpconnect.umd.edu/%7Emharper/papers/space-icslp.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Gesture Spatialization in Natural Discourse Segmentation&lt;/a&gt;. &lt;i&gt;Proceedings of the Seventh International Conference on Spoken Language Processing&lt;/i&gt;, Vol. 1, Denver CO, pp.189-192. &lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Stirling, &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Lesley, &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Janet Fletcher, Ilana Mushin, and Roger Wales (2001). Representational issues in annotation: Using the Australian map task corpus to relate prosody and discourse structure. &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;i&gt;Speech Communication&lt;/i&gt;, 33 (1,2): 113-134.&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;Syrdal, Ann K., Julia Hirschberg,  Julie McGory, and Mary Beckman (2001). 
&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;i&gt;Speech Communication&lt;/i&gt;, 33 (1,2): &lt;/font&gt;&lt;/font&gt;135-151. &lt;/li&gt;&lt;li&gt;Taylor, A., Marcus M. and Santorini B. 2001. The Penn TreeBank: an Overview. In Abeill&amp;eacute; A. (ed.), &lt;i&gt;Building and Using Syntactically Annotated Corpora&lt;/i&gt;, Kluwer.&lt;/li&gt;&lt;li&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Taylor, &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Paul, &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Alan W. Black, and Richard Caley (2001). Heterogeneous relation graphs as a formalism for representing linguistic information. &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;i&gt;Speech Communication&lt;/i&gt;, 33 (1,2): 153-174.&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;Telljohann, Heike, Erhard Hinrichs, and Sandra K&amp;uuml;bler. 2004. The T&amp;uuml;Ba-D/Z treebank: Annotating German with a context-free backbone. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 2004.&lt;/li&gt;&lt;li&gt;Telljohann, Heike, Erhard Hinrichs, Sandra Kuebler, and Heike Zinsmeister. 2006. &lt;i&gt;Stylebook for the Tuebingen Treebank of Written German (TueBa-D/Z)&lt;/i&gt;. Technischer Bericht, Seminar fuer Sprachwissenschaft, Universitaet Tuebingen, Tuebingen. Revidierte Fassung. 
&lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Trippel, &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Thorsten, &lt;/font&gt;&lt;/font&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;&lt;font face=&quot;Arial Unicode MS,Arial,Helvetica&quot;&gt;Michael Maxwell, Greville Corbett, Cambell Prince, Christopher Manning, Stephen Grimes and Steve Moran (2008). Lexicon Schemas and Related Data Models: when Standards Meet Users. &lt;i&gt;Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC&amp;#39;08)&lt;/i&gt;. &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lrec-conf.org/proceedings/lrec2008/summaries/812.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;text of paper and slides&quot;&gt;text of paper and slides&lt;/a&gt;&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;Xue, Nianwen, Fu-Dong Chiou, and Martha Palmer. 2002. Building a Large-Scale Annotated Chinese Corpus. In &lt;i&gt;Proceedings of the 19th International Conference on Computational Linguistics&lt;/i&gt; (COLING 2002), Taipei, Taiwan.&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Group 7: Collaboration Structure</title><link>http://cyberling.elanguage.net/page/Group+7%3A+Collaboration+Structure</link><author>EmilyMBender</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Group+7%3A+Collaboration+Structure</guid><comments>Filled in coordination models, minor formatting edits</comments><pubDate>Thu, 10 Sep 2009 19:11:42 CDT</pubDate><description>.&lt;h3&gt;The collaboration structure group is charged with considering methods for enhancing collaboration and communication on three levels. 
Level 1 involves forming communication pathways that can link individual researchers to the overall agenda of developing a shared cyberinfrastructure. Level 2 involves the collaborations that are needed between system developers and tool developers to ensure maximum interoperability and open access between data formats and programs. Level 3 involves support for collaborations between linguists and researchers in other sciences, grounded in the use of a shared, open access cyberinfrastructure. For each of these levels, we need to design lightweight methods for ensuring ongoing collaboration and coordination.&lt;/h3&gt;&lt;br&gt;The specific agenda items for this group include:&lt;br&gt;1. Data level interoperability: roundtrips between formats, transductions of formats, funding for the process of developing compatibility.&lt;br&gt;2. Tool level interoperability and methods for collaboration in tool development.&lt;br&gt;3. Issues arising from a commitment to open access.&lt;br&gt;4. How to maximize data-sharing: the role of NSF, NIH, and LSA in promoting greater commitment to data-sharing.&lt;br&gt;5. Characterization of linguistic digital data types and methods for linking to non-digital data.&lt;br&gt;6. An agenda for developing linkages to other sciences - the bigger picture of Extended Linguistics.&lt;br&gt;7. Lightweight administration within a framework of complex organizations: NIH, NSF, LSA, CLARIN, etc.&lt;br&gt;&lt;br&gt;Following are further analyses of these seven agenda items:&lt;br&gt;&lt;br&gt;1. Data level interoperability. &lt;br&gt;&lt;ul&gt;  &lt;li&gt;  Level 1 involves compatibility in &lt;i&gt;annotation format&lt;/i&gt;, such as the formats provided by frameworks like Annotation Graphs (AG) or the Linguistic Annotation Framework&amp;#39;s Graph Annotation Format (GrAF). 
&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Level 2 involves compatibility in terms of data categories (content), wherein categories are the same conceptually and can be mapped to one another. This can be facilitated by ontological resources like GOLD as well as the ISOcat data category registry.&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Level 3 compatibility involves use of a common set of notational conventions to express a fully declared range of content categories.&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;2. Tool level interoperability and methods for collaboration in tool development.   &lt;br&gt;&lt;ul&gt;  &lt;li&gt;  The LRT, TalkBank, and E-MELD standards are largely similar. &lt;br&gt;&lt;/li&gt;&lt;li&gt;  Media standardization issues for streaming servers and programs. YouTube, Google, Mozilla, and others are developing standards and systems that we could adopt.&lt;br&gt;&lt;/li&gt;&lt;li&gt;  Roundtrips between tool formats: CHAT, Anvil, EAF, AG, EXMARaLDA, Wavesurfer, TEI, SALT. Some of these tools are already interoperable, but these pathways need to be made clearer to users.   &lt;/li&gt;&lt;li&gt;  AG Tools approach. This approach allows programmers to develop new tools from the AG Toolkit. However, this is only linked at Level 1. Can there be a similar approach at Levels 2 and 3?&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;3. Issues arising from a commitment to open access. What data must be kept away from the public and what data can be made freely available? How can linguists work together to increase access to larger amounts of linguistically important data?   &lt;br&gt;&lt;ul&gt;  &lt;li&gt;  Sharing and IRB principles at talkbank.org/share.   &lt;/li&gt;&lt;li&gt;  Legacy data vs. forward-looking protocols (E-MELD and AphasiaBank, as examples)  &lt;/li&gt;&lt;li&gt;  Community, population constraints&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;4. Methods for promoting a higher level of data contribution and individual researcher &amp;quot;buy in&amp;quot;.   
&lt;br&gt;&lt;ul&gt;  &lt;li&gt;  Inducements: publication, easy tool linkage   &lt;/li&gt;&lt;li&gt;  Community: role of LSA   &lt;/li&gt;&lt;li&gt;  Obligations and standards: role of NIH, NSF, DARPA, IE   &lt;/li&gt;&lt;li&gt;  Leading role of the European Community in setting standards for data-sharing for grant recipients&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;5. Characterization of linguistic digital data types and methods for linking to non-digital data.   &lt;br&gt;Here, it is important to distinguish the emphasis on corpora and linked media from the many other types of digital data that are of interest to linguists. In the area of Linguistic Exploration, the fundamental objects may be word lists, sentence lists, or dictionaries. In Linguistic Anthropology, digitized records of objects are important. This eventually extends to Archaeology and even to information on human genetics. In the Learning Sciences, there is an emphasis on linking classroom video to individual student portfolios that may include letters, tests, art work, and so on. For digital libraries, it is important to make clear where the hard copies actually reside. For many of these objects, identification can be made through the assignment of digital object identifiers (DOIs). However, this is largely unexplored territory for most linguists.&lt;br&gt;&lt;br&gt;6. An agenda for developing linkages to other sciences - the bigger picture of Extended Linguistics. Here, the &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Ftalkbank.org%2Fshare%2Freport.doc&quot; target=&quot;_self&quot;&gt;MacWhinney-Groves NSF&lt;/a&gt; report should be particularly helpful.&lt;br&gt;&lt;br&gt;7. Lightweight administration within a framework of complex organizations: NIH, NSF, LSA, CLARIN, etc. There is a perception that some work on the development of shared cyberinfrastructure has been top-heavy on committee work and reports without producing a significant amount of shared interoperable resources. 
Is there a way to build organizational structures that produce open-access products? Who should determine patterns of collaboration, or should these patterns &amp;quot;emerge&amp;quot; through specific, less-organized exchanges? But then how can these interactions be guided toward cooperation and interoperability? Perhaps an emphasis on standards for collaboration might be possible.&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;Recommended Readings&lt;br&gt;&lt;/font&gt;  &lt;ul&gt;  &lt;li&gt;  &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.vassar.edu/sigann/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;SIGAnn website&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Ftalkbank.org&quot; target=&quot;_self&quot;&gt;TalkBank website&lt;/a&gt; (see ground rules for sharing, software, browsable database) &lt;br&gt;&lt;/li&gt;&lt;li&gt;  &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Ftalkbank.org%2Fshare%2Freport.doc&quot; target=&quot;_self&quot;&gt;NSF SBE Cyberinfrastructure Report&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;  SILT Proposal (attachment)   &lt;/li&gt;&lt;li&gt;  &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.flarenet.eu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;FLaReNet website&lt;/a&gt;&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  ISO committee for Language Resource Management &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;webpage&lt;/a&gt;&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Linguistic Annotation Framework and Graph Annotation Format descriptions (attachments)&lt;/li&gt;&lt;/ul&gt;  &lt;h2&gt;  Additional Links&lt;/h2&gt;  &lt;ul&gt;  &lt;li&gt;  ISO TC37 SC4 - Language Resource Management : &lt;a 
href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.tc37.sc4.org&quot; target=&quot;_self&quot;&gt;Homepage&lt;/a&gt;   &lt;ul&gt;  &lt;li&gt;  ISO Linguistic Annotation Framework (LAF) &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Ffaculty%2Fide%2Fpapers%2FLAF-LREC06.pdf&quot; target=&quot;_self&quot;&gt;Overview&lt;/a&gt; and &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/iso_tc37_sc4_N463_rev00_wg1_wd_LAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Draft Standard (under revision)&lt;/a&gt;; XML Serialization for LAF : &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Ffaculty%2Fide%2Fpapers%2FLAW.pdf&quot; target=&quot;_self&quot;&gt;ISO Graph Annotation Format (GrAF)&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;blockquote&gt;  &lt;ul&gt;  &lt;li&gt;  &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/ISO_TC_37-4_N225_CD_MAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Morphosyntactic Annotation Format (MAF)&lt;/a&gt;   &lt;/li&gt;&lt;li&gt;  &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/ISO_TC37_SC4_N285_MetaModelSynAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Syntactic Annotation Format (SynAF)&lt;/a&gt;   &lt;/li&gt;&lt;li&gt;  &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/new_doc/iso_tc37_sc4_n269_ver10_wg2_24617-1_semaf-time_utf8.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Semantic Annotation Format - Time and Events (SemAF-TIME)&lt;/a&gt; (formerly &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.timeml.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TimeML&lt;/a&gt;)   &lt;/li&gt;&lt;li&gt;  ISO 
Lexical Markup Framework (LMF): &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lexicalmarkupframework.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Homepage (with Publications and Tools)&lt;/a&gt;   &lt;/li&gt;&lt;li&gt;  &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.isocat.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Data Category Registry for linguistic concepts (ISOcat)&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;  &lt;ul&gt;  &lt;li&gt;  GOLD &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linguistics-ontology.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;(General Ontology of Linguistic Description)&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Getting+Involved+in+ISO+Standards+Development&quot; target=&quot;_self&quot;&gt;How to Get Involved in ISO Standards Development&lt;/a&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;br&gt; &lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;White paper draft/notes  &lt;h3&gt;  Introduction&lt;/h3&gt;&lt;br&gt;In considering collaboration, we distinguish first between joint research efforts and large-scale coordination. The former may cross many boundaries: of disciplines, institutions, and nations. It is also by and large already successful, and not something that we thought this group needed to focus directly on (though of course a cyberinfrastructure for linguistics would, as an important contribution, provide further support for this kind of collaboration). Rather, we defined our task as the fostering of large-scale coordination such as is required to get the field of linguistics (and sister disciplines in the language sciences) to organize around the creation of a cyberinfrastructure and its population with data. 
We saw three main facets to the large-scale coordination of effort: cooperation, community building, and communication. The following figure summarizes the relationships we see among these concepts.&lt;br&gt;&lt;br&gt; &lt;br&gt;&lt;h3&gt;  Large-scale coordination of effort&lt;/h3&gt;&lt;br&gt;Under the heading of large-scale coordination of effort, we include such things as the reuse of data, especially across (otherwise) unrelated research groups; mechanisms for publishing/sharing data; evaluation/quality control of resources (broadly defined to include data sets, tools, standards, etc); establishment of and agreement on standards; and coherence of principles, goals, and architectures across groups involved in the creation of a cyberinfrastructure. In all of these concerns, we place a high priority on avoiding duplication of effort. In order to foster large-scale coordination of effort, we see three main classes of tools: formal modes of cooperation, explicit work towards community building, and vehicles of communication. 
&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Coordination Models&lt;br&gt;&lt;/h3&gt;&lt;br&gt;We identified a few examples of models of coordination:&lt;br&gt;&lt;br&gt;&lt;ul&gt;  &lt;li&gt;  ISO: exploit existing infrastructure for coordination&lt;/li&gt;&lt;li&gt;TalkBank: coordinate among researcher groups such as AphasiaBank, CHILDES, PhonBank, ClassBank, etc.&lt;br&gt;&lt;/li&gt;&lt;li&gt;  TEI: build infrastructure from the ground up   &lt;/li&gt;&lt;li&gt;  CLARIN: exploit an EC framework (ESFRI) financing the feasibility part of establishing European infrastructures, for CLARIN in the Humanities, to be then continued with national funding from the various EU countries  &lt;/li&gt;&lt;li&gt;  FLaReNet/SILT: parallel international funded efforts to establish international networks, with bottom up coordination by the projects&amp;#39; coordinators&lt;/li&gt;&lt;/ul&gt;Clearly, the various coordination models are linked to different funding models, and can (to differing degrees) be deliberately fostered by funding agencies.   &lt;br&gt;&lt;br&gt;&lt;h3&gt;The Role of the LSA&lt;/h3&gt;  &lt;font color=&quot;#000000&quot; face=&quot;Arial&quot;&gt;&lt;br&gt;An informal working group has been created within the LSA, consisting &lt;/font&gt;&lt;font color=&quot;#000000&quot; face=&quot;Arial&quot;&gt;of LSA staff, leadership and technology consultants.  
The working group &lt;/font&gt;&lt;font color=&quot;#000000&quot; face=&quot;Arial&quot;&gt;will make detailed recommendations to the LSA&amp;#39;s Executive Committee &lt;/font&gt;  &lt;font color=&quot;#000000&quot; face=&quot;Arial&quot;&gt;concerning specific actions the LSA can undertake, both immediately and &lt;/font&gt;&lt;font color=&quot;#000000&quot; face=&quot;Arial&quot;&gt;in the future, to facilitate the development of a cyberinfrastructure &lt;/font&gt;&lt;font color=&quot;#000000&quot; face=&quot;Arial&quot;&gt;for linguistics, disseminate information related to this endeavor, and &lt;/font&gt;  &lt;font color=&quot;#000000&quot; face=&quot;Arial&quot;&gt;promulgate a &amp;quot;culture change&amp;quot; with regard to sharing of data, tools, and results.  These actions would build on the initial steps already taken by the LSA in this regard, such as its digital publishing platform, eLanguage.&lt;/font&gt;&lt;br&gt;   &lt;h3&gt;&lt;br&gt;  &lt;/h3&gt;&lt;h3&gt;Community Building&lt;/h3&gt;&lt;br&gt;We began with the observation that successful technology is always supported by (while also supporting) a community of users. We define a community for these purposes as a group of people working on similar issues, using the same tools/platforms/resources, who talk to each other and who share principles and practices in their research efforts. Different communities may display these various properties to differing extents. To give some examples, SIGs (special interest groups within larger scholarly bodies) illustrate communities primarily defined by working on similar issues. Computational linguistic communities that have grown up around the creation and use of tools and resources include the developer and user groups of WordNet, FrameNet, NLTK, GATE and UIMA. There are also communities of linguists who are producing shared databases, but not new computational tools or formats. These include groups such as CHILDES, PhonBank, AphasiaBank, LIDES, etc. 
Vehicles for communication (discussed further below) can also create communities of their own. Examples here include the readership of LINGUIST List and the readership of the Corpora list. Note also that communities can vary greatly in size (readership of LINGUIST List being at one extreme) and of course overlap with one another, as individuals belong to multiple different communities.&lt;br&gt;&lt;br&gt;Since communities are important for both communication (see below) and the success of software projects, we considered ways to foster the development of communities, while recognizing that such things cannot be precisely engineered (nor their success precisely predicted). Means of community building include: &lt;br&gt;&lt;br&gt;&lt;ul&gt;  &lt;li&gt;  Funding programs sponsoring multiple groups working on the same/similar problems (e.g., language documentation sponsored by ELDP, DARPA programs)   &lt;/li&gt;&lt;li&gt;  Evaluation campaigns (many examples from computational linguistics here: Semeval, MUC, TREC, CLEF, CoNLL shared tasks)   &lt;/li&gt;&lt;li&gt;  On-line fora (from large ones such as LINGUIST List to small ones such as user groups for particular tools)   &lt;/li&gt;&lt;li&gt;  SIGs (of particular relevance here is ACL&amp;#39;s SIGANN)&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Workshops (e.g., LAW, E-MELD workshops, Cyberling 2009) and conferences (LREC is of particular note here in having created a community) &lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Tutorials &lt;br&gt;&lt;ul&gt;  &lt;li&gt;  At summer schools (LSA institute, EuroLAN, Johns Hopkins, ESSLLI)   &lt;/li&gt;&lt;li&gt;  On the web   &lt;/li&gt;&lt;li&gt;  As part of funded projects (e.g., SILT)&lt;/li&gt;&lt;/ul&gt;  &lt;/li&gt;&lt;li&gt;  Journals (LRE, Language, eLanguage, Computational Linguistics...)&lt;/li&gt;&lt;/ul&gt;  &lt;h3&gt;  &lt;br&gt;Communication&lt;/h3&gt;&lt;br&gt;We identified communication as a problem that cross-cuts many aspects of large-scale coordination of effort. 
In particular, we need to communicate about standards (availability and development), tool and resource availability, needs assessment, and principles &amp;amp; practices. People working on tools and standards across linguistics and the language sciences more broadly need to be aware of each other and each other&amp;#39;s efforts and need to be able to communicate with potential users for needs assessment. (Mark Liberman commented that every successful piece of software starts with someone scratching an itch: they have a problem, build a solution, and share that solution. But not everyone who has a problem that can be solved with software has the means/skills to build that software themselves.) People potentially using tools and standards need to be able to find them. People who should be using tools and standards but don&amp;#39;t yet know about them, need to be reached.&lt;br&gt;&lt;br&gt;With these kinds of communication in mind, we developed a list of potential communication vehicles:&lt;br&gt;&lt;br&gt;&lt;ul&gt;  &lt;li&gt;  Existing communities&amp;#39; infrastructure (newsletters, meetings, websites), including both informal communities and scholarly organizations&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Teaching materials, especially made available over the web (syllabi, problem sets, web-based tutorials)&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Wikis/blogs (&amp;quot;bottom-up web-based communication&amp;quot;)&lt;br&gt;&lt;ul&gt;  &lt;li&gt;  On-going maintenance of information collections   &lt;/li&gt;&lt;li&gt;  Reasons for people to come back to the on-line communication site   &lt;/li&gt;&lt;li&gt;  ad words: Set up context sensitive &amp;quot;ads&amp;quot; on the model of Google AdWords which could run on LINGUIST, lsadc.org, etc, where the things being advertised are relevant projects and standards (and no money is exchanged)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;/li&gt;&lt;li&gt;  Funded collaborations (e.g., SILT/FLaReNet)&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  
Workshops/tutorials   &lt;/li&gt;&lt;li&gt;  Reviewing guidelines/review feedback   &lt;ul&gt;  &lt;li&gt;  Pushing funding agencies to require plans (and follow through) for using standards and publishing data for proposals that use tools/create data&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Pushing funding agencies to require proposals for new tools/standards to appropriately cite and situate themselves within the existing tools/standards ecology   &lt;/li&gt;&lt;li&gt;  Conference/journal reviewing check for appropriate citations of data, tools, resources&lt;/li&gt;&lt;/ul&gt;  &lt;/li&gt;&lt;li&gt;  Resource maps/eliciting metadata (cf. LREC 2010)   &lt;/li&gt;&lt;li&gt;  Journals like Journal of Experimental Linguistics which publish code along with the resulting research.&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Idea from Steve Moran: A new journal (perhaps in the eLanguage set) on the model of Journal of Experimental Linguistics, which publishes data sets collected in the field. Maybe called &amp;quot;Journal of Linguistic Description&amp;quot;?&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Content&lt;/h3&gt;&lt;br&gt;This section briefly outlines the content that we need to be communicating about within the large-scale coordination of effort required to bring about a cyberinfrastructure for linguistics (and the language sciences).&lt;br&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;Data level interoperability, on two levels: Interoperability of data and annotation format, including means of mapping between existing formats (cf. GrAF) and interoperability of annotation content, via ontologies (cf. GOLD) or other inventories of linguistic categories (cf. ISOcat).&lt;/li&gt;&lt;li&gt;Tool level interoperability and methods for collaboration in tool development.&lt;/li&gt;&lt;li&gt;Managing issues arising from commitment to Open Access, establishing and publicizing shared principles.&lt;/li&gt;&lt;li&gt;Promoting individual researcher &amp;quot;buy-in&amp;quot;. 
A cyberinfrastructure is not useful until it is populated with data, but our field is in need of culture change in this respect. We believe this culture change can be achieved through a combination of making it easier for individual researchers to contribute data (through useful tools), educational campaigns on the part of the LSA and similar groups, funding agencies establishing policies requiring data sharing, and publication venues requiring testing against and citing existing available data.&lt;/li&gt;&lt;li&gt;Connection to other fields: Linguistics is only one of the language sciences, and digitized data from many fields (education, political science, law, ...) can be valuable for linguists. As we develop our infrastructure, we need to be mindful of how it fits into this larger ecology, and where teaming up with other language sciences can bring economies of scale. Closer to home, the field of computational linguistics has a good deal of cyberinfrastructure (and communication around cyberinfrastructure) established. We envision creating a portal into cyberinfrastructure concerns for people who identify as linguists, which rather than attempting to encompass all language-related cyberinfrastructure itself links to existing efforts in allied disciplines.&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;  &lt;h3&gt;  Summary: 5 Cs of cyberinfrastructure&lt;br&gt;&lt;/h3&gt;&lt;br&gt;&lt;ul&gt;  &lt;li&gt;  Collaboration: Joint research alone isn&amp;#39;t enough to bring about a cyberinfrastructure, though it will play a key role. We need large-scale coordination of effort. 
&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Cooperation models: There are many ways to coordinate effort, and we will probably use all of them.&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Coordination: On the technical side, we need interoperability, which entails coordination on standards.&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Communication: On the people side, we can&amp;#39;t achieve coordination without communication, bringing people in, making them aware of each other, and keeping them in touch.&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Community Building: Key to both successful communication and successful software.&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;h3&gt;  Action items - Short term&lt;br&gt;&lt;/h3&gt;&lt;br&gt;&lt;ul&gt;  &lt;li&gt;  Draft recommendations to funding agencies regarding standards, data publication, etc.   &lt;/li&gt;&lt;li&gt;  Draft recommendations to journal editors and conference organizers regarding citing tools/resources and publishing data   &lt;/li&gt;&lt;li&gt;  Create teaching resources (through LSA?)   &lt;/li&gt;&lt;li&gt;  Continue this conversation (all WGs, on the wiki for now)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;h3&gt;  Action items - Long term&lt;/h3&gt;&lt;br&gt;&lt;ul&gt;  &lt;li&gt;  Ensure communication among projects/groups   &lt;/li&gt;&lt;li&gt;  Push those developing standards for specific areas (e.g., PHON group connected to TalkBank) to contribute to ISO TC37 SC4   &lt;/li&gt;&lt;li&gt;  Work towards data/annotation harmonization&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>annotation streams</title><link>http://cyberling.elanguage.net/page/annotation+streams</link><author>mebeckman</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/annotation+streams</guid><pubDate>Tue, 08 Sep 2009 09:20:41 CDT</pubDate><description>This wiki page is a record of a discussion about terminology. 
The Annotation Standards working group considered the different terms being used for different paths (the logical term if we use the annotation graph representation), and for describing various types of relationship among different paths. For now this page is simply the body of e-mail messages in this thread, which could be edited later to make a coherent description of the terminological ambiguities, relating them to discussion of the ontology in the paragraph where the &amp;quot;white paper&amp;quot; cites the Annotation Graph framework of Bird and Liberman (2001), and linked into the main page there if/when these pages migrate to their eventual home. &lt;br&gt;&lt;br&gt;n.b. Calling the paths &amp;quot;streams&amp;quot; invokes Hertz&amp;#39;s Delta system, as described in, e.g.: &lt;br&gt;Hertz, Susan R. (1990). The Delta programming language: an integrated approach to non-linear phonology, phonetics, and speech synthesis. In John Kingston &amp;amp; Mary E. Beckman (eds.), &lt;i&gt;Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech&lt;/i&gt;, pp. 215-25. Cambridge University Press.&lt;br&gt;&lt;br&gt;_____________________________________________________________________&lt;br&gt;&lt;br&gt;From: Charles Fillmore&lt;br&gt;Subject: &amp;quot;layered annotation&amp;quot; &lt;br&gt;Date: Sun, 6 Sep 2009 16:58:45 -0700 &lt;br&gt;&lt;br&gt;I think I might have introduced a confusion of terminology in connection with &amp;quot;layered annotation.&amp;quot; &lt;br&gt; &lt;br&gt;Since we&amp;#39;re in general talking about stand-off annotations, we can have one layer that shows part of speech, one that shows lexical meanings, one that shows speaker switch, etc., etc., and that&amp;#39;s all fine. But I find that I&amp;#39;ve also used &amp;quot;layering&amp;quot; to refer to the problem of using one annotator&amp;#39;s product to form the basis of another level of analysis (and have claimed that this is both necessary and problematic). 
Can anybody think of a way of phrasing the latter that would avoid confusion?&lt;br&gt;&lt;br&gt;From: Sarah Churng&lt;br&gt;Date: Sun, 6 Sep 2009 17:20:23 -0700&lt;br&gt;&lt;br&gt;For &amp;#39;layers&amp;#39; aimed at the simultaneous displays of different time-aligned annotations such as pos tags, lexical transcriptions, intonational phrases, etc. (in contrast to the levels of analysis &amp;#39;layers&amp;#39;), I think we referred to these in our meetings as tiers for linguistic types? &lt;br&gt; &lt;br&gt;In ELAN, stackable tiers are available both as independent tiers which are time-alignable (such as direct transcriptions) and as referring tiers which are not time-alignable and must inherit from parent transcription tiers (such as with translations). &lt;br&gt;&lt;br&gt;Is this getting close to what you mean? &lt;br&gt;&lt;br&gt;From: Stuart Robinson&lt;br&gt;Date: Mon, 7 Sep 2009 00:36:54 +0000 (UTC)&lt;br&gt;&lt;br&gt;At the risk of muddying the waters, a tier in Sarah&amp;#39;s sense could have multiple layers then, no? So Chuck&amp;#39;s example would be one where a tier for morphosyntax or whatever has multiple layers (i.e., annotation on annotation). On top of that you might also have versioning.&lt;br&gt;&lt;br&gt;From: Sarah Churng&lt;br&gt;Date: Sun, 6 Sep 2009 18:41:04 -0700&lt;br&gt;&lt;br&gt;Right, and, to a large extent, I think the documentation for ELAN actually faces the same conundrum Chuck brings up---that is, they seem to use &amp;quot;tier&amp;quot; in both senses Chuck is trying to distinguish. First, there are the tiers I mention below for aligning &amp;quot;layers&amp;quot; of POS tagging, lexical meaning, etc, and these straightforwardly can be handled across a time axis. &lt;br&gt;&lt;br&gt;Second, there are tier-to-tier relationships, so to speak, between different tier types. These are of the kind that Chuck was originally asking to give a different label than &amp;quot;layer&amp;quot; to in his e-mail. 
ELAN gets around this by distinguishing &amp;quot;parent&amp;quot; vs. &amp;quot;child&amp;quot; tiers. So, the product of a parent tier is available for the annotation of a child tier, but not vice versa. &lt;br&gt;&lt;br&gt;Is this something we can adopt? It nicely admits that there is an overlap of &amp;quot;layers&amp;quot; for both senses of the word, but makes it clear that not all layers are created equal in the sense of what each annotation can inherit. And the issue of multiple embedded layers that Stuart brings up is independent and able to coinhabit with the different &amp;#39;parent&amp;#39; vs. &amp;#39;child&amp;#39; layers. &lt;br&gt;&lt;br&gt;The documentation on tiers in ELAN: http://www.lat-mpi.eu/tools/elan/manual/ch05s01.html/view&lt;br&gt;&lt;br&gt;From: Mary Beckman &amp;lt;mbeckman@ling.osu.edu&amp;gt;&lt;br&gt;Date: Mon, 7 Sep 2009 09:41:19 -0400&lt;br&gt;&lt;br&gt;Muddying the waters further (or perhaps clarifying?), I think the distinction that we&amp;#39;re making maybe is not between types of annotation streams and their relationships, but between a static and a dynamic view, no? &lt;br&gt; &lt;br&gt;That is, when I hear/read &amp;quot;tier&amp;quot;, I tend to think of the simultaneous display and/or the associated simultaneous development of different streams of parallel annotations that are anchored to the same primary data. This anchoring can be either via a time stamp (as in Figure 1) or just via reference to shared nodes in the annotation graph (as in the phrase-internal sharing of nodes for the word-by-word transcription and gloss in Figure 2). But this is a static &amp;quot;result-oriented&amp;quot; view of the parallel analysis streams and their relationships. &lt;br&gt;&lt;br&gt;By contrast, when I hear or read &amp;quot;layer&amp;quot;, I tend to think of the dynamics of how the different annotation streams were originally developed. 
Although there are cases where the analyses have a necessary order -- e.g., a word-by-word gloss probably has to come after a transcription and tokenization of the primary data -- there are also many cases where the &amp;quot;layering&amp;quot; is arbitrary or idiosyncratic. For example, in (ame_)ToBI labelling, some people, such as Stef Shattuck-Hufnagel, find that they have to mark the Break Indices first and then go back and mark the Tones. Others, such as Nanette Veilleux (and me), can&amp;#39;t do Break Indices before we do Tones. If the annotation isn&amp;#39;t left in a partial state, though, there is no way to recover the difference between Stef and Nanette. So this is a dynamic &amp;quot;process-oriented&amp;quot; view of the relationship. &lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Machine reusability of data</title><link>http://cyberling.elanguage.net/page/Machine+reusability+of+data</link><author>alexispalmer</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Machine+reusability+of+data</guid><comments>added refs and links</comments><pubDate>Tue, 08 Sep 2009 06:27:52 CDT</pubDate><description>&lt;font color=&quot;#0000ff&quot;&gt;NOTE: this page is still in process. at present it contains a number of conceptual recommendations without suggestions of how these recommendations could be implemented. Please feel free to contact the author with comments, complaints, and suggestions: Alexis Palmer, apalmer@coli.uni-sb.de&lt;/font&gt;&lt;br&gt;&lt;br&gt;This page discusses one particular manner in which linguistic data may be re-purposed: as training data for statistical machine learning approaches in computational linguistics and/or natural language processing. 
More specifically, we&amp;#39;re talking about training data for supervised or semi-supervised methods -- methods that learn from labeled data.&lt;br&gt;&lt;br&gt;&lt;h2&gt;Linguistic data as training material for machine learning&lt;br&gt;&lt;/h2&gt;&lt;table align=&quot;bottom&quot; cellpadding=&quot;3&quot; class=&quot;WPC-edit-style-grid1 WPC-edit-border-all WPC-edit-styleData-color1=%23ebebeb&amp;color2=%23c7c7c7&quot; height=&quot;176&quot; width=&quot;1156&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;23%&quot;&gt;Case study subdiscipline: &lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;77%&quot;&gt;Computational linguistics, language documentation and description&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;23%&quot;&gt;Goals of this case study:&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;77%&quot;&gt;To highlight how decisions related to the annotation and storage of linguistic data (in this case, interlinear glossed texts from a language documentation project) can make the data more or less useful as training data for statistical machine learning methods&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;1. What makes data good training data?&lt;br&gt;&lt;br&gt;&lt;/font&gt;The key word here is &lt;b&gt;consistency&lt;/b&gt;. Roughly put, the machine model learns generalizations over observed data and uses those to predict analyses for previously-unseen data. In order for it to generalize well, the collection of data must be as internally-consistent as possible in the way that it is coded/labeled. &lt;br&gt;&lt;br&gt;A second important consideration is the &lt;b&gt;underlying data structure&lt;/b&gt;. For efficient machine processing, there must be some explicit indication of relations between the text and its annotations. 
These two points are illustrated below with examples of interlinear glossed text, a common way of representing language data.&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;2. Labeling consistency&lt;/font&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;Typographic consistency&lt;br&gt;&lt;/b&gt;It is important in any data collection and annotation effort that all annotators work from one agreed-upon set of labels. For the sake of the machine learner, it is also important to adhere to capitalization and punctuation conventions. For example, &lt;b&gt;&amp;#39;PST&amp;#39;&lt;/b&gt; and &lt;b&gt;&amp;#39;pst&amp;#39;&lt;/b&gt; may both be intended to indicate a past tense morpheme, but the machine will see them as two distinct labels. Of course, many such issues can be handled by processing the data post-annotation and pre-model training, but to do so efficiently requires text manipulation skills that those producing the original data may or may not have. &lt;br&gt;&lt;br&gt;One way for projects to maintain labeling consistency is to use an annotation interface that restricts the space of allowed labels.&lt;br&gt;&lt;b&gt;&lt;br&gt;Analytic consistency&lt;br&gt;&lt;/b&gt;Maintaining analytic consistency is a much more difficult task. In cases where the analysis is reasonably well-understood at the outset of annotation, agreed-upon conventions for analysis and annotation may be made available to annotators in the form of a detailed annotation manual. It is often the case, however, that analysis and annotation proceed in parallel. In documentation and description of less-studied (or previously unstudied) languages, this is in fact the normal situation.&lt;br&gt;&lt;br&gt;Several bits of record-keeping can help to deal with changing analyses:&lt;br&gt;&lt;ul&gt;&lt;li&gt;tracking the source of each label (i.e. 
the specific annotator) as well as the time and date of annotation&lt;/li&gt;&lt;li&gt;documenting changes in analysis and/or labeling conventions, indicating the nature and source of the change, how the change should be manifested in the annotation (in other words, what did the previous analysis look like? what does the new analysis look like?), the date and time at which the decision to change the analysis was made, and whether or not the change has been back-propagated to previously-labeled data&lt;/li&gt;&lt;li&gt;using annotation tools and/or data formats which are able to maintain a historical record of changes in the data (along with the metadata associated with those changes)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;We recognize that some of these desiderata are not easily attainable with currently-available systems for text glossing and interlinearization, particularly in the language documentation context. We thus add our voice to those calling for development of an open source, updated, general-purpose system for text interlinearization and glossing.&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;3. Data structures&lt;/font&gt;&lt;br&gt;&lt;br&gt;First, we point to the pages of &lt;a href=&quot;http://cyberling.elanguage.net/page/Group+1%3A+Annotation+Standards&quot; target=&quot;_self&quot;&gt;WG1: Annotation Standards&lt;/a&gt; as well as the &lt;a href=&quot;http://cyberling.elanguage.net/page/Existing+Standards+and+Technologies&quot; target=&quot;_self&quot;&gt;Existing Resources&lt;/a&gt; page for many valuable resources pertaining to standardization of data structures for annotation. The resources presented on these pages include links to proposed standards and extensive bibliographic references related to this topic. &lt;br&gt;&lt;br&gt;&lt;b&gt;Interlinear glossed text (IGT)&lt;/b&gt;&lt;br&gt;The particular concern in this case study is the use of interlinear glossed text (IGT) as training data for a machine learner. 
First, here&amp;#39;s an example of IGT from the Mayan language Uspanteko (Pixabaj et al.).&lt;br&gt;&lt;br&gt;Full text:&lt;i&gt; Kita&amp;#39; tinch&amp;#39;ab&amp;#39;ej laj inyolj iin.&lt;/i&gt;&lt;br&gt;&lt;br&gt;&lt;table align=&quot;bottom&quot; cellpadding=&quot;3&quot; class=&quot;WPC-edit-style-grid1 WPC-edit-border-all WPC-edit-styleData-color1=%23ebebeb&amp;color2=%23c7c7c7&quot; width=&quot;1000&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td bgcolor=&quot;#d9d9d9&quot; class=&quot;WPC-edit-custom-bgColor WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;&lt;b&gt;TEXT&lt;/b&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderRight-double WPC-edit-custom-borderRight WPC-edit-borderLeft-double WPC-edit-custom-borderLeft&quot; width=&quot;10%&quot;&gt;kita&amp;#39;&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-double WPC-edit-custom-borderLeft WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;tinch&amp;#39;ab&amp;#39;ej&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-double WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;laj&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-double WPC-edit-custom-borderLeft WPC-edit-custom-borderRight 
WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;inyolj&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderLeft WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-double WPC-edit-custom-borderLeft&quot; width=&quot;10%&quot;&gt;iin&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td bgcolor=&quot;#d9d9d9&quot; class=&quot;WPC-edit-custom-bgColor WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;&lt;b&gt;MORPHEME&lt;/b&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderRight-double WPC-edit-custom-borderRight WPC-edit-borderLeft-double WPC-edit-custom-borderLeft&quot; width=&quot;10%&quot;&gt;kita&amp;#39;&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-double WPC-edit-custom-borderLeft WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;t-&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;in-&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;ch&amp;#39;abe&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;-j&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-double WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;laj&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-double WPC-edit-custom-borderLeft 
WPC-edit-custom-borderRight WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;in-&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderLeft WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;yolj&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-borderLeft-double WPC-edit-custom-borderLeft&quot; width=&quot;10%&quot;&gt;iin&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td bgcolor=&quot;#d9d9d9&quot; class=&quot;WPC-edit-custom-bgColor WPC-edit-custom-borderRightWPC-edit-custom-bgColor WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;&lt;b&gt;GLOSS&lt;/b&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderRight WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight WPC-edit-borderLeft-double WPC-edit-custom-borderLeft&quot; width=&quot;10%&quot;&gt;NEG&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderRight WPC-edit-custom-borderLeft WPC-edit-borderLeft-double WPC-edit-custom-borderLeft WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;INC-&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderTop WPC-edit-custom-borderLeft WPC-edit-custom-borderRight WPC-edit-custom-borderTop WPC-edit-custom-borderBottom WPC-edit-borderBottom-solid WPC-edit-custom-borderBottom WPC-edit-borderTop-solid WPC-edit-custom-borderTop WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;E1S-&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderLeft WPC-edit-custom-borderRight WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;hablar&lt;/td&gt;&lt;td align=&quot;center&quot; 
class=&quot;WPC-edit-custom-borderLeft WPC-edit-custom-borderRight WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;-SC&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderLeft WPC-edit-custom-borderRight WPC-edit-borderLeft-double WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;PREP&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderLeft WPC-edit-custom-borderRight WPC-edit-borderLeft-double WPC-edit-custom-borderLeft WPC-edit-custom-borderRight WPC-edit-borderRight-none WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;E1S-&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderLeft WPC-edit-custom-borderRight WPC-edit-custom-borderLeft WPC-edit-borderLeft-none WPC-edit-custom-borderLeft WPC-edit-borderRight-double WPC-edit-custom-borderRight&quot; width=&quot;10%&quot;&gt;idioma&lt;/td&gt;&lt;td align=&quot;center&quot; class=&quot;WPC-edit-custom-borderLeft WPC-edit-borderLeft-double WPC-edit-custom-borderLeft&quot; width=&quot;10%&quot;&gt;yo&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br&gt;Spanish translation: &lt;i&gt;No le hablo en mi idioma.&lt;/i&gt;&lt;br&gt; English translation:&lt;i&gt; I don&amp;#39;t speak to him in my language.&lt;/i&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;Links between annotation tiers&lt;br&gt;&lt;/b&gt;The table above shows three tiers of annotation for this Uspanteko clause. The &amp;#39;TEXT&amp;#39; tier contains each word of the clause (word boundaries are indicated by double-line cell borders). The &amp;#39;MORPHEME&amp;#39; tier shows a segmentation of each word into its component morphemes, and the &amp;#39;GLOSS&amp;#39; tier shows a morpheme-by-morpheme gloss of the clause, including both gloss labels for non-stem morphemes (e.g. 
NEG for &lt;i&gt;kita&amp;#39;&lt;/i&gt;) and lemma translations for stem morphemes (e.g. &lt;i&gt;hablar&lt;b&gt; &lt;/b&gt;&lt;/i&gt;for &lt;i&gt;ch&amp;#39;abe&lt;/i&gt;).&lt;br&gt;&lt;br&gt;Two NLP tasks we might imagine learning from such data are &lt;b&gt;morphological segmentation &lt;/b&gt;(producing the &amp;#39;MORPHEME&amp;#39; tier, given at least the &amp;#39;TEXT&amp;#39; tier and perhaps the translation(s) as well) and &lt;b&gt;morpheme glossing &lt;/b&gt;(roughly, given the &amp;#39;MORPHEME&amp;#39; tier, produce the &amp;#39;GLOSS&amp;#39; tier). This is where the data structure used to represent the interlinear text becomes crucial! &lt;br&gt;&lt;br&gt;Most often when we encounter IGT -- as, in fact, in the table above -- the links between annotation tiers are conveyed through visual aspects of the presentation. Here, for example, the association of morphemes with the words they belong to is communicated using double-line borders at word boundaries. Visually-oriented presentations of IGT do not generally provide the explicit encoding of these relationships that a machine learner needs to make sense of the data. In order to use IGT as training data, it must be presented to the machine learner in a format that &lt;b&gt;directly encodes links between elements from one annotation tier to those on another&lt;/b&gt;. &lt;br&gt;&lt;br&gt;&lt;b&gt;Structured representational formats&lt;/b&gt;&lt;br&gt;What is needed to address this concern is a format which preserves structured links between annotation tiers. XML formats are one way of preserving said links. At the same time, using XML follows current recommendations regarding longevity and portability of data (for example, Bird and Simons 2003, &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://emeld.org/school/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;EMELD School of Best Practices&lt;/a&gt;). 
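&lt;br&gt;&lt;br&gt;As a rough illustration of what such explicit links could look like, independent of any particular serialization such as XML, the following Python sketch stores the Uspanteko clause above so that every word points to the morphemes it contains and every morpheme carries its own gloss. The names Word, Morpheme, segmentation_pairs and glossing_pairs are invented for this example and are not taken from any of the formats cited on this page.&lt;br&gt;
&lt;pre&gt;
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    """One morpheme with its gloss (link between the MORPHEME and GLOSS tiers)."""
    form: str
    gloss: str


@dataclass
class Word:
    """One word of the TEXT tier, explicitly linked to the morphemes it contains."""
    text: str
    morphemes: List[Morpheme] = field(default_factory=list)


# The Uspanteko clause from the table above, with the tier links made explicit.
clause = [
    Word("kita'", [Morpheme("kita'", "NEG")]),
    Word("tinch'ab'ej", [
        Morpheme("t-", "INC-"),
        Morpheme("in-", "E1S-"),
        Morpheme("ch'abe", "hablar"),
        Morpheme("-j", "-SC"),
    ]),
    Word("laj", [Morpheme("laj", "PREP")]),
    Word("inyolj", [Morpheme("in-", "E1S-"), Morpheme("yolj", "idioma")]),
    Word("iin", [Morpheme("iin", "yo")]),
]


def segmentation_pairs(words):
    """Pairs for morphological segmentation: (word, list of morpheme forms)."""
    return [(w.text, [m.form for m in w.morphemes]) for w in words]


def glossing_pairs(words):
    """Pairs for morpheme glossing: (morpheme form, gloss label)."""
    return [(m.form, m.gloss) for w in words for m in w.morphemes]
```
&lt;/pre&gt;
With the links stored this way, training pairs for both tasks can be read off directly instead of being inferred from cell borders.&lt;br&gt;&lt;br&gt;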
Several XML formats for IGT have been proposed, including &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://ethnoer.unimelb.edu.au/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;EthnoER&lt;/a&gt;&amp;#39;s EOPAS (Schroeter and Thieberger 2006), IGT-XML (Palmer and Erk 2007), and an earlier model outlined by Bow, Hughes, and Bird (Bow et al. 2003)&lt;font color=&quot;#ff0000&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;. Another approach is the use of Annotation Graphs (e.g. Bird and Liberman 2001, Maeda et al. 2002)&lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#ff0000&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;. &lt;br&gt;&lt;br&gt;&lt;font size=&quot;3&quot;&gt;&lt;font color=&quot;#808080&quot;&gt;References:&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;---Bird, Steven and Mark Liberman. 2001. &amp;#39;A formal framework for linguistic annotation.&amp;#39; &lt;i&gt;Speech Communication.&lt;/i&gt; 33(1-2): 23-60.&lt;br&gt;---Bird, Steven and Gary Simons. 2003. &amp;#39;Seven dimensions of portability for language documentation and description.&amp;#39; &lt;i&gt;Language&lt;/i&gt;, 79(3): 557-582.&lt;br&gt;---Bow, Catherine, Baden Hughes, and Steven Bird. 2003. &amp;#39;Towards a general model of interlinear text.&amp;#39; In &lt;i&gt;Proceedings of EMELD Workshop 2003: Digitizing and Annotating Texts and Field Recordings&lt;/i&gt;. LSA Institute: Lansing MI, USA.&lt;br&gt;---Maeda, Kazuaki, Steven Bird, Xiaoyi Ma, and Haejoong Lee. 2002. &amp;#39;Creating Annotation Tools with the Annotation Graph Toolkit&amp;#39;. In &lt;i&gt;Proceedings of the Third International Conference on Language Resources and Evaluation (LREC). &lt;/i&gt;&lt;br&gt;---Palmer, Alexis and Katrin Erk. 2007. 
&amp;#39;IGT-XML: An XML format for interlinearized glossed text.&amp;#39; In &lt;i&gt;Proceedings of the Linguistic Annotation Workshop (LAW-07), ACL 2007.&lt;/i&gt;&lt;br&gt;---Pixabaj, Telma Can (coordinator), Miguel Angel Vicente M&amp;eacute;ndez, Mar&amp;iacute;a Vicente M&amp;eacute;ndez, and Oswaldo Ajcot Dami&amp;aacute;n. Uspanteko text collection, in &lt;i&gt;Text Collections in Four Mayan Languages, 2003-2007.&lt;/i&gt; &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.okma.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;OKMA &lt;/a&gt;(Oxlajuuj Keej Maya&amp;#39; Ajtz&amp;#39;iib&amp;#39;), Supported by &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.hrelp.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Endangered Languages Documentation Programme&lt;/a&gt; (SOAS, University of London).&lt;br&gt;---Schroeter, Ronald and Nicholas Thieberger. 2006. &amp;#39;EOPAS: the EthnoER online representation of interlinear text.&amp;#39; In &lt;i&gt;Sustainable Data from Digital Fieldwork &lt;/i&gt;(proceedings of conference held at the University of Sydney, 4-6 December 2006). 
Sydney University Press.&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#ff0000&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font size=&quot;3&quot;&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Discussion: Where should all this information go?</title><link>http://cyberling.elanguage.net/page/Discussion%3A+Where+should+all+this+information+go%3F</link><author>mebeckman</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Discussion%3A+Where+should+all+this+information+go%3F</guid><pubDate>Mon, 07 Sep 2009 04:53:29 CDT</pubDate><description>There was some discussion at the workshop of eventually moving some or all of the information on this wiki into a wikipedia article or series of articles. This page exists to host a discussion of that idea, and provide a forum for other ideas about where this information should live going forward.&lt;br&gt;&lt;br&gt;&lt;h3&gt;Wikipedia and/or glottopedia&lt;/h3&gt;&lt;br&gt;Martin Haspelmath suggested that, while wikipedia may be appropriate for disseminating articles aimed at the proverbial &amp;quot;educated lay reader&amp;quot;, we need to also disseminate more technical information to linguists using a specialist site such as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.glottopedia.de/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;glottopedia&quot;&gt;glottopedia&lt;/a&gt;. 
Mark Liberman countered that much information on wikipedia is quite technical, citing the articles on various terms and concepts in maths and computer science, such as this article on &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Maximum_likelihood&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;maximum likelihood&quot;&gt;maximum likelihood&lt;/a&gt; or this article on &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Ontology_%28information_science%29&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;ontology (in information science)&quot;&gt;ontology (in information science)&lt;/a&gt;. &lt;br&gt;&lt;br&gt;In evaluating the arguments for and against establishing a separate wiki for more technical articles, we probably should look at opinion pieces such as this article on &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://thedecisiontree.com/blog/?p=72&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;why does wikipedia suck on science&lt;/a&gt;, where Thomas Goetz says, &amp;quot;Curious about just what &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Epigenetics&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;epigenetics&lt;/a&gt; is? Figure you really should know what &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Mitochondrion&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;mitochondria&lt;/a&gt; do? 
Don&amp;rsquo;t count on Wikipedia - odds are their analysis is too pedantic for you, as it is for me.&amp;quot; &lt;br&gt;&lt;br&gt;In &lt;font color=&quot;#333333&quot;&gt;deciding whether and how to use Wikipedia, we may want to look at an initiative at NIH to train NIH scientists in how to write effective Wikipedia articles.&lt;/font&gt; &lt;i&gt;Wired Science&lt;/i&gt; has an &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.wired.com/wiredscience/2009/07/wikipedia-training-scientists-on-wiki-culture&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;article about a &amp;quot;Wikipedia Academy&amp;quot;&lt;/a&gt; that NIH hosted recently. &lt;br&gt;&lt;br&gt;It may also be useful to review some of the insights about the sociology of wiki culture in Marshall Poe&amp;#39;s &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.theatlantic.com/doc/200609/wikipedia&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;article in the September 2009 Atlantic Monthly&quot;&gt;article in the September 2006 &lt;i&gt;Atlantic Monthly&lt;/i&gt;&lt;/a&gt; and Nicholson Baker&amp;#39;s &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.nybooks.com/articles/21131&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;March 2008 New York Review of Books piece&quot;&gt;March 2008 &lt;i&gt;New York Review of Books&lt;/i&gt; piece&lt;/a&gt;. 
More information on these issues can be gleaned from reading the opinions expressed in (and experiences with Wiki editing recounted in) some of the comments on &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://arstechnica.com/media/news/2009/05/wikipedia-hoax-reveals-limits-of-journalists-research.ars&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;this Ars Technica piece on the Maurice Jarre Wikipedia hoax&lt;/a&gt;.&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Group 4 White Paper</title><link>http://cyberling.elanguage.net/page/Group+4+White+Paper</link><author>haspelmath</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Group+4+White+Paper</guid><comments>I was a co-chair originally, but stepped down in favor of Tracy Holloway King.</comments><pubDate>Wed, 02 Sep 2009 09:48:50 CDT</pubDate><description>&lt;br&gt;&lt;h2&gt;  &lt;b&gt;Data Reliability and Provenance&lt;/b&gt;&lt;/h2&gt;&lt;br&gt;Peter Austin (co-chair)&lt;br&gt;Martin Haspelmath &lt;br&gt;Kurt Bollacker&lt;br&gt;Tracy Holloway King (co-chair)&lt;br&gt;Koenraad de Smedt &lt;br&gt;Paul Trilsbeek&lt;br&gt;&lt;br&gt;&lt;h2&gt;  Abstract&lt;/h2&gt;&lt;br&gt;In this paper we discuss what data provenance and data reliability are, with special attention to the needs of the linguistics community when moving from simple data sets to more complex cyberinfrastructures. The integrity and completeness of provenance information as a basis for citation, rights management, etc. are crucial for those who record, annotate and compile materials, but also for the sources of raw materials, in particular for indigenous and minority communities, and for the scholars who use these materials as a basis for further scientific work. 
We suggest some first steps to promote data sharing and publication in the linguistics community.&lt;br&gt;&lt;br&gt;&lt;h2&gt;  Data Provenance&lt;/h2&gt;&lt;br&gt;&lt;i&gt;Provenance is the who, what, and when of metadata&lt;/i&gt;. When a data set is created, it is important to know where it comes from and who is responsible for its publication. Adequate information about provenance, i.e. information about how, where, when and by whom the data was collected, encoded and annotated, and who assumes responsibility for its publication, allows the quality of the data set to be assessed, provides a contact in case of questions, and establishes authorship of the data set, similar to the authorship of an academic paper.&lt;br&gt;&lt;br&gt;The contents and status of a data set are not always clear from a quick inspection of the data. The data might come from native-speaker informants, from published works of literature, from the web, etc. Adequate metadata are needed for cataloguing data sets so as to make them searchable, and to ensure that the data will be used in an appropriate way. Metadata should include a detailed description of the data set, including sufficient information on its provenance. In some cases provenance information should even be provided for separate sections of the data set wherever differences are relevant. For example, the data may have been collected over several years of field work; knowing when each section was collected, by whom, and under which circumstances may be important.&lt;br&gt;&lt;br&gt;Provenance is extremely important to the scientific community which uses data sets as the basis of further analysis, hypothesis testing, etc. For the user, provenance is crucial for assessing whether, and to what extent, a data set can form an appropriate basis for subsequent scientific work. 
Knowing how the data was collected and processed and who was responsible for it allows future users of the data set to judge the quality of the data, how it fits in with their research and how it compares to other data sets. Provenance is also important for the replicability of data and research results. Unless it is clear where the data came from and how it was created, data sets cannot be replicated. For example, if a data set has constituency trees annotated over it, it is important to know whether the trees were manually constructed, created automatically, or bootstrapped by manually correcting automatically constructed trees.&lt;br&gt;&lt;br&gt;For the creators, provenance provides a way to get credit for the scientific work done in creating the data sets and allows the community to cite the data sets with their authors in academic works. Creating a high-quality data set is extremely time-consuming and requires highly skilled linguists: as such, it is important that those doing this work get credit for their contribution. Provenance is also important in establishing and maintaining privacy rights for those who provided the data. There are many reasons that data may not be appropriate to publish freely in its entirety, e.g. individuals may be identifiable from videos or from the content of discussions, or private rituals may be recorded. In particular the rights of indigenous and minority communities should be respected and acknowledged.&lt;br&gt;&lt;br&gt;Provenance is tightly linked to the proper, standardized use of metadata and documentation in reliable environments. All data must be tagged with the appropriate metadata and linked to its documentation. This allows researchers to understand what is, and is not, included in the data set and its annotation and to correctly cite the data set, thereby acknowledging the work of the data set creators and allowing future researchers to reproduce their findings. 
Metadata and documentation are facilitated by adherence to standards and best practices. As students may not be familiar with these, more senior researchers and experts in natural language data curation need to share their knowledge and to facilitate students&amp;#39; adherence to the accepted standards.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Achieving Reliable Provenance Through Publication&lt;br&gt;&lt;/h3&gt;  A major question is how to achieve reliable data provenance in the linguistic community and promote the sharing of data. Creating data sets is a time-consuming, highly skilled task and the scientific community as a whole needs to acknowledge this and to provide support for those working on creating and maintaining data sets. Individuals contributing to and creating these data sets need to get institutional credit for data publication. For example, these should count for tenure reviews and other review processes and should be an integral part of grant applications, both in having data set publication be part of the grant work and in the granting agencies favoring researchers who publish curated data sets, just as they favor those with a proven track record of traditional academic publications.&lt;br&gt;&lt;br&gt;Given the nature of the linguistics community and the working paradigms that they are used to, we suggest&lt;i&gt; promoting curated data sets as publications&lt;/i&gt;. The technology is currently available to treat curated data as publication. There is extensive archival work on linguistic data sets, including the work done by organizations such as the Linguistic Data Consortium. There are also examples from other scientific fields where publication of data collections has become an established scientific practice. However, there needs to be extensive institutional and social engagement in order for curated data as publication to become the norm in the linguistics and language studies communities. 
&lt;br&gt;&lt;br&gt;Researchers need to be encouraged to publish curated data sets. This can be done in part by requiring it at the institutional level or as part of a grant award. It can also be aided by providing more infrastructure to publish data sets, including providing information on best practices and on how to access the necessary technologies. In addition, researchers need to cite the published data sets that they used in their research. Reviewers and publishers of articles and books should reject submissions if they do not cite the data sources that they used in their research. These citations should be standardized by using the metadata and the publishers of the curated data sets should facilitate this by providing information as to how the data set should be cited.&lt;br&gt;&lt;br&gt;The linguistic community will also need to provide support to ensure annotation and data quality control, just as it does with published academic papers. Not all data collection and annotation is done equally well and is of equal value. The community needs to have a way to recognize and acknowledge this, similar to the relative value of different book publishers, journals, and conference proceedings.&lt;br&gt;&lt;br&gt;Publication of data through a recognized publication channel with an ISSN would make it easier to cite the data uniformly and correctly, would clearly establish authorship, would allow for different editions, would enforce the use of standards and metadata and would be a catalyst for giving academic credit to the makers. It would also promote reviews and rating systems.&lt;br&gt;&lt;br&gt;In a different field, the data journal &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.earth-system-science-data.net/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Earth System Science Data&lt;/a&gt; promotes the rapid publication of research on original data sets. 
Their policy is outlined as follows:&lt;br&gt;&lt;blockquote&gt;  &lt;i&gt;&amp;quot;Articles in the data section may pertain to the planning, instrumentation and execution of experiments or collection of data. Any interpretation of data is outside the scope of regular articles.&amp;quot;&lt;/i&gt;&lt;/blockquote&gt;  &lt;blockquote&gt;  &lt;i&gt;&amp;quot;In the first stage, papers that pass a rapid access peer-review are immediately published on the Earth System Science Data Discussions (ESSDD) website. They are then subject to Interactive Public Discussion, during which the referees&amp;#39; comments (anonymous or attributed), additional short comments by other members of the scientific community (attributed) and the authors&amp;#39; replies are also published in ESSDD. In the second stage, the peer-review process is completed and, if accepted, the final revised papers are published in ESSD. To ensure publication precedence for authors, and to provide a lasting record of scientific discussion, ESSDD and ESSD are both ISSN-registered, permanently archived and fully citable.&lt;/i&gt;&amp;quot;&lt;br&gt;&lt;/blockquote&gt;  It should be noted that articles in ESSD seem to contain the data mostly within the articles themselves. A journal is, however, not the same as a cyberinfrastructure. A data publication is meant to announce to the community that data has been made available and how it was obtained and annotated, and also to give credit where it is due. The actual data can be accessed in a variety of ways, not necessarily through the same channel. Since publication is by nature public, there may be some issues with restricted data. Full metadata could be published, but the metadata could stipulate restrictions on the accessibility of the actual data (e.g. due to proprietary data or privacy issues).&lt;br&gt;&lt;br&gt;One could imagine organizations acting as the publishers of articles with their data sets. 
Publishers of data will have the responsibility of checking at least the formal aspects of published data, such as proper use of metadata, adherence to standards etc. It might be possible that the same data is published at different hosts. We would need buy-in from institutions and linguists to support this. Publishers would themselves be rated and would need to actively advertise their data publications and make them attractive to researchers. Peer review of data publications should be stimulated. Perhaps language resources would be published unedited first, then reviews can be added on later, where annotation might count as a type of review; by having the language data out, even in unedited form, people would be encouraged to annotate it. Many academic credit systems (e.g. in Norway) require full peer reviewing as well as the use of recognized academic publishing channels.&lt;br&gt;&lt;br&gt;&lt;h3&gt;  Persistence and Fine-Grained Provenance Information&lt;br&gt;&lt;/h3&gt;  Given that linguists change institutions and that URLs shift over time, it is important that future researchers be able to access the same data that is being used today and to be certain that this is the same data as was used by other researchers. On the Internet, the assignment of provenance information to a piece of information can be assured through the use of a Persistent Identifier (PID). PIDs are globally unique identifiers that remain the same even if the URL of a resource changes. Central PID resolvers are used to administer the locations and additional metadata information of the resources. One PID can be used to refer to multiple identical copies of a resource in different locations. 
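&lt;br&gt;&lt;br&gt;The idea of a central resolver can be sketched in a few lines of Python. This is only a toy in-memory illustration of the principle described above, not the interface of any real PID system; the class PidResolver, its methods, the PID string and the example.org URLs are all made up for this example.&lt;br&gt;
&lt;pre&gt;
```python
class PidResolver:
    """Toy in-memory resolver: maps each PID to the current locations of a resource."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        """Record one more location (an identical copy) for the resource named by pid."""
        urls = self._locations.setdefault(pid, [])
        if url not in urls:
            urls.append(url)

    def move(self, pid, old_url, new_url):
        """Update a location when a copy moves; the PID itself never changes."""
        urls = self._locations[pid]
        urls[urls.index(old_url)] = new_url

    def resolve(self, pid):
        """Return all currently known locations for pid."""
        return list(self._locations[pid])


resolver = PidResolver()
# Hypothetical PID and URLs, invented for illustration only.
pid = "example-pid/uspanteko-texts"
resolver.register(pid, "http://archive-a.example.org/uspanteko")
resolver.register(pid, "http://archive-b.example.org/uspanteko")
# One copy moves to a new URL; citations that use the PID stay valid.
resolver.move(pid, "http://archive-a.example.org/uspanteko",
              "http://archive-a.example.org/new/uspanteko")
print(resolver.resolve(pid))
```
&lt;/pre&gt;
A citation that uses the PID keeps working after the move only because someone updated the resolver entry, which is why the upkeep of the resolver matters as much as the assignment of identifiers.&lt;br&gt;&lt;br&gt;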
Examples of PID systems are the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.handle.net&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Handle&lt;/a&gt; system and the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.doi.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;DOI&lt;/a&gt; system. PID systems only work, though, as long as the administration of the resource links is maintained, so the assignment of PIDs alone does not guarantee the long-term stability of the resource references. The linguistic community might consider setting up a registration authority for linguistic data.&lt;br&gt;&lt;br&gt;Every entity involved in data set creation can be identified by a unique PID. These include entities such as people, organizations, and their roles, the data sets and documents themselves, and different views and mashups of the data. Value added to existing material through annotations etc. can be managed by cascading PIDs with different rights. A proliferation of provenance information may increase the size of the information by orders of magnitude, but disk space is cheap, so provenance can be recorded at a very fine grain, e.g. for sound clips indicated by start and stop time in a speech corpus. Assigning full provenance information can, however, be complicated, e.g. when material is translated or when value is added to copyrighted material (e.g. Wall Street Journal corpus) or when a speech corpus has a radio broadcast in the background.&lt;br&gt;&lt;br&gt;Tomorrow&amp;#39;s cyberinfrastructures will not be limited to static storage of data, but will be dynamic systems that process and present data according to users&amp;#39; needs. This context will present special challenges to handling provenance information. 
The following dynamic functionalities can be considered:&lt;br&gt;&lt;ul&gt;  &lt;li&gt;  customized presentation of data: filtering, reformatting, style sheets&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  pipelined processes   &lt;/li&gt;&lt;li&gt;  mashups: the integration of data from various sources (sometimes combining various modes)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;In contrast to paper materials, which are static and pre-edited, a cyberarchive allows (and should allow) the user to participate in filtering and presenting information (&amp;quot;play editor yourself&amp;quot;). An example is the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://wab.aksis.uib.no/transform/wab.php?modus=opsjoner&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Wittgenstein archives at Bergen&lt;/a&gt;, which contains digitized manuscripts: the user can choose to include or exclude certain pieces of information and has options for visualization. This creates challenges for provenance. How would you cite a specific view among many other possible views of this material? Such reference may be possible by generating a unique URI and PID for the transformed and formatted web page. An &amp;quot;I want to cite this&amp;quot; button would make this process easy for the user. Dynamic customized views could help with scientific collaboration since it is a good way to bring things together at the presentation level; the tools for this are very useful since most linguists do not have the user interface design skills to do this themselves.  &lt;br&gt;&lt;br&gt;The annotation layered on curated data is often done in conjunction with specific software. Having &amp;quot;software as a service&amp;quot; available to the linguistics community can aid in this process. It will be particularly valuable for institutions with less extensive computing infrastructure, allowing their researchers access to state of the art data set curation facilities. 
New frameworks such as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://uima-framework.sourceforge.net/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;UIMA&lt;/a&gt; can aid in the interactive pipelining of processes on data. This again is a challenge for provenance information. Ideally, a PID can be assigned to every step in the pipelining process; note that at every step, intermediate data could be worth storing as a new resource.&lt;br&gt;&lt;br&gt;Furthermore, Rosetta, &lt;a href=&quot;http://cyberling.elanguage.net/page/Freebase&quot; target=&quot;_self&quot;&gt;Freebase&lt;/a&gt;, the Internet Archive, etc. allow for mashups of data. Cyberinfrastructures should provide mashup functionality, i.e. the smart combination of data from various sources. Provenance is challenging for mashup data, since the mashup process is dynamic and every combination of specific versions of data produces a new mashup version. Also, some people might want to reference a particular rendering of a mashup. The number of views to reference could easily escalate when every user can have a personal view.&lt;br&gt;&lt;br&gt;&lt;h2&gt;  Reliable Identification, Authorization and Rights Management&lt;br&gt;&lt;/h2&gt;  As part of provenance, in recording the who, what, and when of metadata, it is necessary to have trusted identification of individuals, organizations, and services. This identification needs to persist over time so that decades after a data set is created, it is still possible to determine who created it and how it was created. 
Reliable identification of individuals is an important prerequisite for at least two purposes:&lt;br&gt;&lt;ol&gt;  &lt;li&gt;  Identifying authors and sources to give credit for data creation.&lt;br&gt;  &lt;/li&gt;&lt;li&gt;  Identifying users to provide authorization based on licenses and rights.&lt;br&gt;&lt;/li&gt;&lt;/ol&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.clarin.eu&quot; target=&quot;_self&quot;&gt;CLARIN&lt;/a&gt; has stated that a system of global e-identities can solve many problems associated with the current situation, in which people have different usernames at the various sites they use. A single logon should identify the user in an easy way. This is a largely solved problem at the technical level, but various solutions are available and none is widely used by the community:  &lt;br&gt;&lt;ol&gt;  &lt;li&gt;  &lt;a href=&quot;http://cyberling.elanguage.nethttps://www.myopenid.com/&quot; target=&quot;_self&quot;&gt;OpenID&lt;/a&gt;   &lt;/li&gt;&lt;li&gt;  Federations of local e-identity providers&lt;br&gt;&lt;/li&gt;&lt;/ol&gt;Authorization is a tuple linking a set of rights, the identity of a piece of data and the identity of a user. Authorization can take various forms depending on restrictions, e.g. access can be given to anyone, can be based on email domains (i.e. affiliated institutions), can require the acceptance of a user license, etc., depending on what the author or stakeholder decides. There are many such restrictions, e.g., access may be given for non-commercial purposes only, or access to sacred songs can be restricted to the initiated, or subparts of data may be proprietary and require additional agreements. There is a need to train people on what types of rights and access are appropriate. There may be privacy issues with sources such as hospital patient data, sign language data, etc. which may place restrictions on availability.  
&lt;br&gt;&lt;br&gt;Some language data, especially spontaneous speech and sign language, cannot be distributed due to privacy issues, in particular in utterances referring to people. This becomes very complex with international access, where different countries may have different rules for guarding privacy. This situation may require different country-specific licenses, so international cooperation may need legal advice from the start.&lt;br&gt;&lt;br&gt;Some technology exists to anonymize source materials, e.g. by manual or automatic masking, such as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://tlt07.uib.no/papers/6.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;making non-words out of words, but keeping POS&lt;/a&gt;. Media are harder to deal with than plain text, especially sign language, where much of the exact original data is needed; any encoding of a facial expression is going to be on the edge. Perhaps 3D models or avatars could be useful, but creating them is often not automatic and sometimes not possible, e.g. in sign language.&lt;br&gt;&lt;br&gt;At the start of the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.mpi.nl/DOBES&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;DOBES&lt;/a&gt; program, legal specialists advised keeping all data closed. Since that was not an option, a code of conduct was worked out which provides a workable solution. Yet, some data may need to remain closed to all but core researchers in a tightly controlled project. One could use information from original institution waivers to guide what permissions to assign to the data. The &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.clarin.eu&quot; target=&quot;_self&quot;&gt;CLARIN&lt;/a&gt; project has a working group on IPR and licensing issues. 
The European Parliament has expressed an interest in solving the complexity of the issues and may want to promote a revision of legislation, hopefully leading to wider availability of data for research purposes. Some lobbying of legislators may be necessary.&lt;br&gt;&lt;br&gt;Researchers are often unwilling to turn over their data for storage and distribution in repositories. One reason is that some people feel their data is not ready yet: once data is in a repository they feel it is cast in stone, which is a real problem for data which is never quite complete, such as a dictionary of a living language. It is therefore crucial that repositories offer versioning and updating of the stored materials. Some researchers might prefer to control distribution themselves from their own homepage. A possible solution could be that links or web pages could be generated from the repository automatically. This could also be useful for university administration which has to document research production. A soft approach to repositories could help lead people to understand what a data repository is and what it can start to enable.&lt;br&gt;&lt;br&gt;A special situation may occur when some data is retracted, either for privacy reasons or because of errors or other reasons. One could mark data as deleted, invalid or superseded without actually destroying the data; one may also want to be able to temporarily restrict data for certain reasons.&lt;br&gt;&lt;br&gt;&lt;h2&gt;  Suggested First Steps&lt;/h2&gt;&lt;br&gt;Curated data as publication and the corresponding data provenance and reliability could involve a major, long-term infrastructure project for the linguistic community. We would like to suggest a few simple first steps that the community as a whole can take. Our basic suggestion is to provide both carrots and sticks to the community and to pursue proactive education in data set publication. 
Although some linguists will become major contributors to curated data sets while others will play a more minor role, all linguists should understand and appreciate their importance: no linguist left behind. It will therefore be useful for the linguistic community to engage in extensive dissemination and training efforts and to establish links with ongoing generic projects on metadata standards and preservation (e.g. &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.loc.gov/standards/premis/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;PREMIS&lt;/a&gt;).&lt;br&gt;&lt;br&gt;Encouraging students is not enough since the actions of successful researchers speak much more loudly than words. Therefore, if well-known, successful members of the linguistic community publish and share their data sets, this will set a powerful example for the next generation. Those who do publish their data should provide a simple &amp;quot;cite as&amp;quot; button with the data to make it easy for those using the data to cite it in their works. This is particularly necessary since many linguists are unsure of the proper way to cite such data sets. Finally, it would be good to provide a service for data structure and integrity validation and for format conversion. Data structure and integrity validation can, at a minimum, check that the relevant metadata is in place before publication. As a more complex process, it can check the data structure to be sure that it is correct and complete, e.g. a tab-delimited file should have tabs between fields. 
A format conversion service would allow data set creators to publish their data sets in a variety of formats, both making it easier for researchers to use them and helping ensure the long-term survival of the data as formats change.&lt;br&gt;&lt;br&gt;There are also some institutional safeguards that can easily be put into place to ensure that data publication occurs and is done properly. First, publishers, editors, and reviewers can require provenance information for all data sources before agreeing to publish research work. Second, editors and funding agencies can encourage data sets to be published. This is already starting but should be further encouraged. Also, the funding agencies should ensure that the data is in fact published if it was part of the grant and if possible should publicize that the data is now available. If grantees do not publish their data as promised, the funding agency could either withhold any future grants to the grantee and its institution or withhold the final part of the grant until the data is published, similar to how some dissertation fellowships reserve part of the grant until the signed dissertation is submitted.&lt;br&gt;&lt;br&gt;Finally, the establishment of electronic data publishing journals in conjunction with a cyberinfrastructure should be considered, so as to provide a formal channel for establishing authorship of data sets and creating a scholarly reference in addition to a framework for peer review.&lt;br&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Events</title><link>http://cyberling.elanguage.net/page/Events</link><author>Koenraad</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Events</guid><pubDate>Wed, 02 Sep 2009 03:37:37 CDT</pubDate><description>Conferences and workshops related to cyberinfrastructure (chronological)&lt;br&gt;&lt;br&gt;&lt;a class=&quot;external&quot; 
href=&quot;http://cyberling.elanguage.nethttp://www.hf.uio.no/tekstlab/rilivs/program.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;RILIVS workshop on infrastructure for linguistic variation&lt;/a&gt;&lt;br&gt;Oslo, 17-18 Sep. 2009&lt;br&gt;The workshop is part of a series of explorative workshops financed by NOS-HS under the heading &amp;quot;Research Infrastructure for Linguistic Variation Studies&amp;quot; (RILiVS). RILiVS is based on the networks Scandinavian Dialect Syntax (ScanDiaSyn), Nordic Centre of Excellence in Microcomparative Syntax (NORMS) plus the Medieval Nordic Text Archive (Menota), SweDia 2000, and a newly established network for the documentation of Sami languages led from the University of Troms&amp;oslash;.&lt;br&gt;&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.csc.fi/english/pages/neeri09&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Networking Event for European Research Infrastructures + Standards Workshop&lt;/a&gt;&lt;br&gt;Helsinki, 30 Sep. - 2 Oct. 2009&lt;br&gt;CLARIN would like to invite all interested European research infrastructure initiatives to exchange information to get a more comprehensive picture which will be essential for moving to a more coherent European eInfrastructure scenario.&lt;br&gt;&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.alliancepermanentaccess.eu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;2009 Annual Conference of the Alliance for Permanent Access&lt;/a&gt;&lt;br&gt;Den Haag, 24 Nov. 
2009&lt;br&gt;The theme of the 2009 Annual Conference is: Keeping the Records of Science Accessible: towards a sustainable science data e-infrastructure!&lt;br&gt;&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.clarin.eu/events/e-humanities-workshop-at-5th-ieee-international-conference-on-e-science&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;e-Humanities workshop at 5th IEEE International Conference on e-Science&lt;/a&gt;&lt;br&gt;Oxford, 9-11 Dec. 2009&lt;br&gt;This e-Humanities track aims to showcase projects that contribute to e-Humanities, whether by providing integrated and interoperable infrastructures, or by offering new types of applications making use of such infrastructures and connecting the digital islands.&lt;br&gt;&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lrec-conf.org/lrec2010/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;The seventh international conference on Language Resources and Evaluation (LREC 2010)&lt;/a&gt;&lt;br&gt;Malta, 17-23 May 2010 (incl. 
workshops)&lt;br&gt;The LREC 2010 Map of Language Resources, Technologies and Evaluation will be a collective enterprise of the LREC community, as a first step towards the creation of a very broad, community-built, Open Resource Infrastructure.&lt;br&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>WG2: Resources</title><link>http://cyberling.elanguage.net/page/WG2%3A+Resources</link><author>danmccloy</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/WG2%3A+Resources</guid><pubDate>Mon, 31 Aug 2009 14:51:49 CDT</pubDate><description>&lt;h3&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;This page lists the resources referred to in the WG2 Wiki pages:&lt;/font&gt;&lt;/font&gt;&lt;/h3&gt;&lt;font color=&quot;#ff0000&quot;&gt;PLEASE CONTRIBUTE relevant links and resources&lt;/font&gt;&lt;br&gt;&lt;br&gt;&lt;h3&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;Encoding and Annotation Links&lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;:&lt;/font&gt;&lt;/h3&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://uni-leipzig.de/%7Eautotyp&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Autotyp&lt;/a&gt;&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO&lt;/a&gt; &lt;br&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org/iso/my_iso_job.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;My ISO Job: Guidance for Delegates and Experts&lt;/a&gt; (accessible document describing ISO and how experts can participate)&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org/iso/joining_in_2007.pdf&quot; rel=&quot;nofollow&quot; 
target=&quot;_blank&quot;&gt;Joining In: Participating in International Standardization&lt;/a&gt; (another easy-to-read document on participating in ISO standards development)&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org/iso/standards_development.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Standards Development&lt;/a&gt; (webpage on standards development with links)&lt;br&gt;&lt;/blockquote&gt;ISO Committees of interest to linguists:&lt;br&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TC 37/SC 4&lt;/a&gt; - Language resource management&lt;br&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/working_group.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TC 37/SC 4 Working Groups&lt;/a&gt;&lt;br&gt;&lt;/blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/iso_technical_committee.htm?commid=48124&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TC 37/SC 2&lt;/a&gt; - Terminographical and lexicographical working methods&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://std.dkuug.dk/jtc1/sc2/wg2/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;JTC 1/SC 2/WG2&lt;/a&gt;: Universal character set&lt;br&gt;&lt;/blockquote&gt;ISO Script Codes (ISO 15924): &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/iso15924/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;font color=&quot;#800080&quot; size=&quot;3&quot;&gt;http://www.unicode.org/iso15924/&lt;/font&gt;&lt;/a&gt;&lt;br&gt;&lt;a class=&quot;external&quot; 
href=&quot;http://cyberling.elanguage.nethttp://www.language-archives.org/OLAC/metadata.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;OLAC&lt;/a&gt;&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.typecraft.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TypeCraft&lt;/a&gt; &lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Unicode&lt;/a&gt;&lt;br&gt;&lt;blockquote&gt;Unicode codecharts&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/charts/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.unicode.org/charts/&lt;/a&gt;&lt;br&gt;&lt;/blockquote&gt;Fonts&lt;br&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://scripts.sil.org/IPAhome&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;font color=&quot;#0000ff&quot;&gt;http://scripts.sil.org/IPAhome&lt;/font&gt;&lt;/a&gt; (Unicode-enabled fonts with IPA and other resources)&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm&lt;/a&gt; (John Wells&amp;#39; website on fonts with IPA and Unicode info)&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linguistlist.org/sp/Fonts.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://linguistlist.org/sp/Fonts.html&lt;/a&gt; (LinguistList info on fonts)&lt;br&gt;&lt;/blockquote&gt;Character pickers&lt;br&gt; &lt;blockquote&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;a class=&quot;external&quot; 
href=&quot;http://cyberling.elanguage.nethttp://people.w3.org/rishida/scripts/pickers/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://people.w3.org/rishida/scripts/pickers/&lt;/a&gt;&lt;/font&gt;&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://weston.ruter.net/projects/ipa-chart/view/keyboard/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://weston.ruter.net/projects/ipa-chart/view/keyboard/&lt;/a&gt;&lt;br&gt;&lt;/blockquote&gt;Keyboard and Input Methods&lt;br&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://scripts.sil.org/UniIPAKeyboard&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;font color=&quot;#0000ff&quot;&gt;http://scripts.sil.org/UniIPAKeyboard&lt;/font&gt;&lt;/a&gt;&lt;br&gt;&lt;font color=&quot;#0000ff&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://linguistlist.org/cfdocs/emeld/school/classroom/unicode/ipafont.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://linguistlist.org/cfdocs/emeld/school/classroom/unicode/ipafont.htm&lt;/a&gt;&lt;/font&gt;&lt;/font&gt; (Tips on how to assign a keystroke for input and how to create a keyboard for character input)&lt;br&gt;&lt;font color=&quot;#0000ff&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.phon.ucl.ac.uk/resource/phonetics/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.phon.ucl.ac.uk/resource/phonetics/&lt;/a&gt;&lt;/font&gt;&lt;/font&gt; (Unicode phonetic keyboard for PC with SIL fonts)&lt;br&gt;&lt;/blockquote&gt;Recommendations on developing new orthographies&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/notes/tn19/&quot; rel=&quot;nofollow&quot; 
target=&quot;_blank&quot;&gt;http://www.unicode.org/notes/tn19/&lt;/a&gt; &lt;br&gt;&lt;/blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://linguistlist.org/cfdocs/emeld/school/classroom/unicode/documentation.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Unicode for Language Documentation&lt;/a&gt; (E-MELD document with useful information for linguists on Unicode)&lt;br&gt;&lt;/blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/XSL_Transformations&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;XSL-transformations&lt;/a&gt; (EXtensible Stylesheet Language)&lt;br&gt;&lt;br&gt;&lt;h3&gt;Data Collection Instruments Links:&lt;/h3&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://faculty.washington.edu/wassink/Brown+Bag/Elicitation+materials.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Elicitation Materials Clearinghouse&lt;/a&gt; (Sociolinguistics elicitation instruments, U of Washington, Sociolinguistics Laboratory)&lt;br&gt;Typological tools for field linguists (questionnaires, etc., Max Planck Institute for Ev. 
Anthro., Leipzig): &lt;br&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.eva.mpg.de/lingua/tools-at-lingboard/tools.php&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.eva.mpg.de/lingua/tools-at-lingboard/tools.php&lt;/a&gt;&lt;br&gt;&lt;/blockquote&gt; &lt;br&gt;&lt;h3&gt;Metadata Tagging Links:&lt;/h3&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.bartus.org&quot; target=&quot;_self&quot;&gt;Akustyk&lt;/a&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Sociolinguistics+common+practices&quot; target=&quot;_self&quot;&gt; &lt;/a&gt;&lt;br&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Sociolinguistics+common+practices&quot; target=&quot;_self&quot;&gt;Sociophonetics metadata tags&lt;/a&gt;&lt;br&gt;&lt;h3&gt;Web-Design Links:&lt;/h3&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Representational_State_Transfer&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;RESTful&lt;/a&gt; design patterns (REpresentational State Transfer)&lt;br&gt;&lt;br&gt;&lt;h3&gt;Storage Links:&lt;/h3&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.swordapp.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;SWORDS&lt;/a&gt; (based on the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://atompub.org/rfc4287.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Atom Publishing Protocol&lt;/a&gt;) &lt;br&gt;Version control software:&lt;br&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.nongnu.org/cvs/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Concurrent Versions System (CVS)&lt;/a&gt;: An open-source revision control system&lt;br&gt;&lt;a class=&quot;external&quot; 
href=&quot;http://cyberling.elanguage.nethttp://subversion.tigris.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Subversion&lt;/a&gt;: An open-source revision control system&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://sharepoint.microsoft.com/Pages/Default.aspx&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Microsoft Sharepoint Server&lt;/a&gt;, 2007&lt;br&gt;&lt;/blockquote&gt;&lt;h3&gt;&lt;br&gt;&lt;/h3&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;Infrastructure for Long-Term Archiving:&lt;br&gt;&lt;font color=&quot;#000000&quot; size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.escidoc.org&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;eSciDoc&lt;/a&gt; collaborative eResearch infrastructure&lt;/font&gt;&lt;font size=&quot;2&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;font size=&quot;2&quot;&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.fedora-commons.org&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Fedora Commons&lt;/a&gt; Repository Software&lt;/font&gt;&lt;br&gt;&lt;/font&gt;&lt;br&gt;&lt;h3&gt;Sharing Links:&lt;/h3&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/SOAP&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;WS/SOAP&lt;/a&gt; (standards for web services)&lt;br&gt;&lt;br&gt;&lt;h3&gt;Access Links:&lt;/h3&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Access_control_list&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Wiki access control lists&lt;/a&gt; (for assignment of usage rights and privileges)&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;Bibliography:&lt;/font&gt;&lt;br&gt;DiPaolo, M. and Yaeger-Dror, M. 
(forthcoming) &lt;u&gt;Best Practices in Sociophonetics&lt;/u&gt;. Cambridge UP&lt;u&gt;&lt;br&gt;&lt;/u&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Existing Standards and Technologies</title><link>http://cyberling.elanguage.net/page/Existing+Standards+and+Technologies</link><author>alexispalmer</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Existing+Standards+and+Technologies</guid><pubDate>Mon, 31 Aug 2009 13:25:35 CDT</pubDate><description>See also pages for &lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Subfield-specific+practices&quot; target=&quot;_self&quot;&gt;subfield-specific best practices&lt;/a&gt;&lt;br&gt;See also information re: &lt;a href=&quot;http://cyberling.elanguage.net/page/Getting+Involved+in+ISO+Standards+Development&quot; target=&quot;_self&quot;&gt;participating in the development of ISO standards&lt;br&gt;&lt;/a&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;5&quot;&gt;Existing annotation standards and resources&lt;/font&gt;&lt;font size=&quot;5&quot;&gt;&lt;br&gt;&lt;/font&gt;This section lists the various annotation conventions and other resources for developing and discussing annotation standards that were suggested by participants of Cyberling09. It was copied from &lt;a href=&quot;http://cyberling.elanguage.net/page/Group+1%3A+Annotation+Standards&quot; target=&quot;_self&quot;&gt;the working group 1 page&lt;/a&gt; on 23 July 2009, and ultimately (after the workshop whitepapers are published) this should be the permanent home for information on existing standards. &lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot;&gt;6.1. 
Phonetics and phonology&lt;br&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;Phone segment tagging symbols, including both:&lt;br&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;IPA and its various ASCII-fications, such as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.phon.ucl.ac.uk/home/sampa/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;SAMPA&quot;&gt;SAMPA&lt;/a&gt;, &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/WorldBet&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;WorldBet&quot;&gt;WorldBet&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;and language-specific phoneme-segment encodings such as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Arpabet&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;ArpaBet&quot;&gt;ArpaBet&lt;/a&gt; for American English and the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.kokken.go.jp/katsudo/seika/corpus/public/labeling.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;CSJ encoding&quot;&gt;CSJ encoding&lt;/a&gt; for Japanese&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;the various &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.ling.ohio-state.edu/%7Etobi/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ToBI&lt;/a&gt; conventions&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot;&gt;6.2. 
Morphosyntax &lt;br&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.eva.mpg.de/lingua/resources/glossing-rules.php&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Leipzig Glossing rules&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/ISO_TC_37-4_N225_CD_MAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Morphosyntactic Annotation Format (MAF)&lt;/a&gt;&lt;/li&gt;&lt;li&gt;ISO Lexical Markup Framework (LMF): &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lexicalmarkupframework.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Homepage (with Publications and Tools)&quot;&gt;Homepage (with Publications and Tools)&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Typecraft:  a &lt;i&gt;labeling system&lt;/i&gt; which, for any verb construction of a given language, provides a template for that construction type displaying its argument structure, in a fashion as transparent as possible. The template is constructed from a universally established inventory of labeling primitives.&lt;br&gt;http://www.typecraft.org/tc2wiki/Verbconstructions_cross-linguistically_-_Introduction&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;tags for short-unit word (SUW) and long-unit word (LUW) in the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.kokken.go.jp/katsudo/seika/corpus/public/5.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;CSJ&quot;&gt;CSJ&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot;&gt;6.3. 
Syntax and semantics&lt;br&gt;&lt;br&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://framenet.icsi.berkeley.edu/index.php?option=com_wrapper&amp;Itemid=126&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;FrameNet&quot;&gt;FrameNet annotation manual&lt;br&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;FILL IN HERE FOR ENGLISH TREEBANK, etc., etc., etc.&lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/ISO_TC37_SC4_N285_MetaModelSynAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Syntactic Annotation Format (SynAF)&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/new_doc/iso_tc37_sc4_n269_ver10_wg2_24617-1_semaf-time_utf8.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Semantic Annotation Format - Time and Events (SemAF-TIME)&lt;/a&gt; (formerly &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.timeml.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TimeML&lt;/a&gt;)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot;&gt;6.4. 
Pragmatics and discourse structure&lt;br&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://childes.psy.cmu.edu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;CHAT&quot;&gt;CHAT&lt;/a&gt; conventions for segmenting turns and identifying the participant and the setting&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.rochester.edu/research/cisd/resources/damsl/RevisedManual/RevisedManual.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;DAMSL&quot;&gt;DAMSL&lt;br&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;and other schemes documented and discussed at the 1998 &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.umd.edu/users/traum/DSD/schemes.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;DRI meeting&quot;&gt;DRI meeting&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot;&gt;6.5. 
&lt;/font&gt;&lt;font color=&quot;#808080&quot;&gt;Gesture&lt;/font&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;David McNeill&amp;#39;s &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://mcneilllab.uchicago.edu/topics/annotation.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Gesture annotation&quot;&gt;Gesture annotation&lt;/a&gt; &lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://childes.psy.cmu.edu/manuals/bts.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;BTS sign transcription system&lt;/a&gt;&lt;/li&gt;&lt;li&gt;NEUROGES-ELAN system: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.berlingesturecenter.de/seminare/neurogeselan/neurogeselan.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;workshop series homepage&quot;&gt;workshop series homepage&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;[look at ANVIL page and LDC list for more?] &lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot;&gt;6.6. 
Other resources&lt;/font&gt;&lt;br&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;ISO TC37 SC4 - Language Resource Management : &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.tc37.sc4.org&quot; target=&quot;_self&quot;&gt;Homepage&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/ISO_TC_37-4_N225_CD_MAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;br&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.isocat.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO Data Category Registry for linguistic concepts&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;    &lt;b&gt;&lt;i&gt;&lt;u&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linguistics-ontology.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;GOLD (General Ontology for Linguistic Description)&lt;/a&gt;&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;various (lineages of) POS tagging systems such as:&lt;/li&gt;&lt;li&gt;text transcription symbols promoted on the LDC Corpus Cookbook page for &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://projects.ldc.upenn.edu/Corpus_Cookbook/transcription/symbols.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;transcribing text&quot;&gt;transcription/symbols&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.ilc.cnr.it/EAGLES/isle/ISLE_Home_Page.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;International Standards for Language Engineering (ISLE)&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;ISO Linguistic Annotation Framework (LAF) &lt;a 
href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Ffaculty%2Fide%2Fpapers%2FLAF-LREC06.pdf&quot; target=&quot;_self&quot;&gt;Overview&lt;/a&gt; and &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/iso_tc37_sc4_N463_rev00_wg1_wd_LAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Draft Standard (under revision)&lt;/a&gt;&lt;/li&gt;&lt;li&gt;XML Serialization for LAF : &lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Ffaculty%2Fide%2Fpapers%2FLAW.pdf&quot; target=&quot;_self&quot;&gt;ISO Graph Annotation Format (GrAF)&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.xces.org&quot; target=&quot;_self&quot;&gt;XML Corpus Encoding Standard (XCES) &lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Fsigann&quot; target=&quot;_self&quot;&gt;SIGAnn : ACL Special Interest Group on Annotations &lt;/a&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/http%2F%2Fwww.cs.vassar.edu%2Fsigann&quot; target=&quot;_self&quot;&gt;&lt;br&gt;&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;NXT System: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://groups.inf.ed.ac.uk/switchboard/links.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Switchboard in NXT&quot;&gt;Switchboard in NXT&lt;/a&gt; &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.amiproject.org/showcase/standards-and-toolkits/nite-xml-toolkit-for-annotations&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;NITE XML Toolkit&quot;&gt;NITE XML Toolkit&lt;/a&gt; &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://videolectures.net/mlmi04ch_carletta_iab/&quot; rel=&quot;nofollow&quot; 
target=&quot;_blank&quot; title=&quot;Video lecture by Jean Carletta: the NITE XML Toolkit Meets the ICSI Meeting Corpus&quot;&gt;Video lecture by Jean Carletta: the NITE XML Toolkit Meets the ICSI Meeting Corpus&lt;/a&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;tool and framework for building &lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;E-MELD (Electronic Metastructure for Endangered Language Data): &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://emeld.org/index.cfm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Homepage&quot;&gt;Homepage&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;All of the annotation systems listed on the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.ldc.upenn.edu/annotation/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;COCOSDA technical topic domain Corpus Annotation Tools page&quot;&gt;COCOSDA Corpus Annotation Tools page&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font size=&quot;5&quot;&gt;Existing standards for storage, retrieval, and search&lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#333333&quot; size=&quot;3&quot;&gt;This section was copied from &lt;a href=&quot;http://cyberling.elanguage.net/page/WG2%3A+Existing+Standards&quot; target=&quot;_self&quot;&gt;the working group 2 page&lt;/a&gt; on 23 July 2009, and this information should ultimately live here.&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;br&gt;&lt;b&gt;STORAGE&lt;/b&gt;&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/b&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;i&gt;-- text encoding 
standards&lt;/i&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;i&gt;&lt;br&gt;&lt;/i&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://unicode.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Unicode&lt;/a&gt;&lt;br&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Text Encoding Initiative (TEI) P5 Standard&lt;/a&gt; &amp;quot;These guidelines make recommendations about suitable ways of representing those features of textual resources which need to be identified explicitly in order to facilitate processing by computer programs. In particular, they specify a set of markers (or tags) which may be inserted in the electronic representation of the text, in order to mark the text structure and other features of interest.&amp;quot;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;...&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt; &lt;i&gt;(domain-specific) terminological standards&lt;br&gt;&lt;/i&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;terminological standards (e.g., the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linguistics-ontology.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;GOLD ontology&lt;/a&gt;)&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;...&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;--&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;i&gt;-- storage and retrieval standards&lt;/i&gt;&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt; &lt;ul&gt;&lt;li&gt;&lt;font 
size=&quot;3&quot;&gt;repository systems (e.g. &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.escidoc.com/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;www.escidoc.org&lt;/a&gt; and &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.fedora-commons.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;www.fedora-commons.org&lt;/a&gt;), which might be relevant to long-term archiving&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.delaman.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;DELAMAN&lt;/a&gt; is &amp;#39;an international umbrella body for archives and other initiatives with the goal of documenting and archiving endangered languages and cultures worldwide. Our aim is to stimulate interaction about practical matters that result from the experiences of fieldworkers and archivists, and to act as an information clearinghouse.&amp;#39;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;The &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.rosettaproject.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Rosetta Project&lt;/a&gt; is another archive actively promoting a set of best practices for storage of language data.&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;b&gt;RETRIEVAL&lt;/b&gt;&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;i&gt;-- reference/identification standards (i.e. 
metadata)&lt;br&gt;&lt;/i&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.sil.org/ISO639-3/codes.asp&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO language codes&lt;/a&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.language-archives.org/tools.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;OLAC metadata&lt;/a&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://dublincore.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Dublin Core&lt;/a&gt; metadata standard&lt;br&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;-- &lt;i&gt;Citation Standards&lt;br&gt;&lt;/i&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://ocoins.info/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;COINS&lt;/a&gt; (a simple standard for embedding Dublin Core citation metadata in a web page)&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;br&gt;&lt;b&gt;SEARCH&lt;/b&gt;&lt;br&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.opensearch.org/Home&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Open Search&lt;/a&gt;. 
A very simple standard for sharing search results, usually by expressing such search results in the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.atomenabled.org/developers/syndication/atom-format-spec.php&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Atom Syndication Format&lt;/a&gt;. Although this is not the best standard (in terms of design, extensibility), it is relatively easy to adopt. Open Search also has proposed geographic extensions to describe how to query a collection based on geographic parameters. &lt;br&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font size=&quot;3&quot;&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot;&gt;ACCESS/REUSE&lt;/font&gt;&lt;/b&gt;&lt;br&gt;&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;-- &lt;i&gt;Cultural Heritage Global Schema and Ontologies&lt;br&gt;&lt;/i&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://cidoc.ics.forth.gr/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;CIDOC&lt;/a&gt; (an ontology mainly applied by European museums and other heritage organizations that is nicely abstracted and very generalized, but is complex and has some difficulties in application)&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://ochre.lib.uchicago.edu/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;OCHRE/ArchaeoML&lt;/a&gt; (a somewhat more simple global schema / ontology for cultural heritage applications, including archaeology, epigraphy and philology. 
It is highly abstract, so that projects and collections retain their native descriptive terminologies while some degree of interoperability and shared services is still possible.)&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;-- &lt;i&gt;Copyright and Intellectual Property&lt;br&gt;&lt;/i&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://creativecommons.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Creative Commons&lt;/a&gt; provides a series of standard copyright licenses and associated metadata that explicitly grant certain permissions and set conditions for the use and reuse of copyrighted content. These are useful for defining how content can be used. However, they are complicated to apply to scientific data, since US copyright law makes a distinction between &amp;quot;facts&amp;quot; (ideas, concepts, objective data) and &amp;quot;expressions&amp;quot;. Since many scientific datasets contain factual measurements and observations, they may not be protected by copyright. To make matters more complicated, the line between what is a fact and what is an expression is blurred. This legal ambiguity and complexity make it harder to use and reuse scientific data. Therefore, Creative Commons&amp;#39; scientific arm, &amp;quot;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://sciencecommons.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Science Commons&lt;/a&gt;&amp;quot;, recommends that scientists do not use Creative Commons copyright licenses for scientific data. 
Instead, Science Commons recommends that scientists explicitly dedicate data to the public domain using the &amp;quot;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://wiki.creativecommons.org/CC0&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;CC-Zero&lt;/a&gt;&amp;quot; declaration. CC-Zero removes legal ambiguity around data, removes all restrictions on reuse, and, in theory, maximizes the scientific value of data.&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt; &lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;i&gt;-- APIs/standards for interfaces with other resources (e.g. corpora, lexica/lexical resources, treebanks?, ...)&lt;br&gt;&lt;/i&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.w3.org/TR/wordnet-rdf/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;WordNet&lt;/a&gt; &amp;quot;This document presents a standard conversion of Princeton WordNet to RDF/OWL. 
It describes how it was converted and gives examples of how it may be queried for use in Semantic Web applications.&amp;quot;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font size=&quot;5&quot;&gt;Existing tools, web services, and other technologies&lt;br&gt; &lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;This section was copied in part from &lt;a href=&quot;http://cyberling.elanguage.net/page/Group+3%3A+Tools+%28existing+and+future+%22killer+apps%22%29&quot; target=&quot;_self&quot;&gt;the working group 3 page&lt;/a&gt; on 23 July 2009.&lt;br&gt;&lt;br&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;b&gt;TOOLS&lt;/b&gt;&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;br&gt; &lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/b&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.typecraft.org/tc2wiki/Main_Page&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TypeCraft&lt;/a&gt; Collaborative text annotation&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://wals.info/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;WALS&lt;/a&gt; The World Atlas of Language Structures Online&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.csufresno.edu/odin/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ODIN&lt;/a&gt; Online Database of INterlinear glossed text&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.textgrid.de/en.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TextGrid&lt;/a&gt; 
&amp;quot;TextGrid aims to create a community grid for the collaborative editing, annotation, analysis and publication of specialist texts. It thus forms a cornerstone in the emerging e-Humanities.&amp;quot;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.nltk.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Natural Language Toolkit (NLTK)&lt;/a&gt; &amp;quot;Open source Python modules, linguistic data and documentation for research and development in natural language processing, supporting dozens of NLP tasks, with distributions for Windows, Mac OSX and Linux.&amp;quot;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://hudesktop.hucompute.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;eHumanities Desktop&lt;/a&gt; (project is in alpha development stage, no description available yet)&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tei-c.org/Roma/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Roma: TEI validation tool&lt;/a&gt; &amp;quot;These pages will help you design your own TEI validator, as a DTD, RELAXNG or W3C Schema.&amp;quot;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://projects.palaso.org/projects/show/chorus&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Chorus&quot;&gt;Chorus&lt;/a&gt; is a version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed. 
Chorus is a Palaso Project.&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://purl.org/linguistics/e-linguistics&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;e-Linguistics&lt;/a&gt;: building a cyberinfrastructure for linguistics (including a Python toolkit for data migration; documentation is still being posted)&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.icsi.berkeley.edu/%7Ejan/projects/CDET/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Consistent Document Engineering Toolkit&quot;&gt;Consistent Document Engineering Toolkit&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;Thai language specific tools&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.cmu.edu/%7Epaisarn/software.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;SWATH&lt;/a&gt; Thai word segmentation and POS tagging&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.arts.chula.ac.th/%7Eling/wordseg/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;CU Thai word segmentation&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.arts.chula.ac.th/%7Eling/tts/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;CU Thai Romanization&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; 
href=&quot;http://cyberling.elanguage.nethttp://www.hlt.nectec.or.th/products/ispeech.php&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;iSpeech&lt;/a&gt; Automatic speech recognition toolkit&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.hlt.nectec.or.th/products/vaja.php&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Vaja &lt;/a&gt;Thai/English text-to-speech synthesis engine&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tcllab.org/libs/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;LIBS&lt;/a&gt; String kernel based language-script-encoding (LSE) identifier for 85 LSEs&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tcllab.org/kui/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;KUI&lt;/a&gt; Collaborative WordNet translation platform&lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tcllab.org/chumpol/sndws/api_regis.php&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Web service and Open API&lt;/a&gt; for Asian WordNet, Multilingual Morphological Analyzer, and Thai Soundex&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br&gt;General purpose tools in use by linguists:&lt;br&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.r-project.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;R Project&quot;&gt;R Project&lt;/a&gt; for statistical computing and the linguistics packages in &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://emu.sourceforge.net/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;EMU&quot;&gt;EMU&lt;/a&gt;.&lt;br&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; 
href=&quot;http://cyberling.elanguage.nethttp://www.fon.hum.uva.nl/praat/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Praat&lt;/a&gt; doing phonetics by computer&lt;font color=&quot;#999900&quot; face=&quot;Palatino,Book Antiqua,Times&quot; size=&quot;6&quot;&gt;&lt;b&gt;&lt;br&gt;&lt;/b&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://python.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;Python&quot;&gt;Python&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.anvil-software.de/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot; title=&quot;ANVIL&quot;&gt;ANVIL&lt;/a&gt; video annotation research tool&lt;/li&gt;&lt;/ul&gt;&lt;font size=&quot;3&quot;&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;font color=&quot;#808080&quot;&gt; &lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/b&gt;&lt;/font&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;b&gt;WEB SERVICES&lt;/b&gt;&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;br&gt; &lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/b&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;i&gt;-- &lt;/i&gt;&lt;/font&gt;    &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.ics.uci.edu/%7Efielding/pubs/dissertation/rest_arch_style.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;b&gt;Representational state transfer (&lt;/b&gt;&lt;b&gt;REST) &lt;/b&gt;&lt;/a&gt;  &lt;ul&gt;&lt;li&gt;    A design pattern used as a standard in which clients and servers are able to communicate over the internet.&lt;/li&gt;&lt;li&gt;  Communication is limited to the four verbs of the &lt;a class=&quot;external&quot; 
href=&quot;http://cyberling.elanguage.nethttp://www.w3.org/Protocols/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;HTTP &lt;/a&gt;Protocol: GET, POST, PUT, DELETE (HTTP defines further methods, such as HEAD and OPTIONS, but these four carry the core semantics). These four commands are used to manipulate any resource on the internet that has a URI. Restricting communication to a small, fixed set of commands keeps the semantics of communication between the client and server simple.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Resource: Any entity; anything that can be identified with a &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Uniform_Resource_Identifier&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;URI &lt;/a&gt;(e.g. phone number, car, person, idea).&lt;/li&gt;&lt;li&gt;GET: Retrieve one or many resources from the server.&lt;/li&gt;&lt;li&gt;POST: Submit data to a resource for processing; by convention, add a new resource to a collection on the server.&lt;/li&gt;&lt;li&gt;PUT: Create or replace the resource at the given URI; by convention, update an existing resource on the server.&lt;/li&gt;&lt;li&gt;DELETE: Remove a resource from the server.&lt;/li&gt;&lt;li&gt;Note that GET, POST, PUT and DELETE are sent as the method on the request line (the first line) of an HTTP request. 
GET is the default method when submitting a URL request using a web browser.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://multitree.linguistlist.org&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;MultiTree &lt;/a&gt;Examples:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;GET &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://multitree.linguistlist.org/codes/pol&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://multitree.linguistlist.org/codes/pol &lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Semantics: &amp;ldquo;Get the code resource &amp;lsquo;pol&amp;rsquo; from the multitree.linguistlist.org server.&amp;rdquo;&lt;/li&gt;&lt;li&gt;Returns an HTML-formatted page of all data pertaining to the &amp;lsquo;pol&amp;rsquo; code resource in the MultiTree repository.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;GET &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://multitree.linguistlist.org/codes/pol/trees.json&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://multitree.linguistlist.org/codes/pol/trees.json&lt;/a&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Semantics: &amp;ldquo;Get all tree resources that contain the code resource &amp;lsquo;pol&amp;rsquo; and return them in JSON format.&amp;rdquo;&lt;/li&gt;&lt;li&gt;Returns a JavaScript object that lists all tree resources that contain the code &amp;lsquo;pol&amp;rsquo;. &lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.json.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;JSON&lt;/a&gt;: JavaScript Object Notation, used for marking up data. 
Similar to XML but more succinct; it is intended mainly for communication between machines, though it remains readable by humans.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;It is possible to implement a RESTful web service from the ground up, though several popular frameworks facilitate its construction:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://rubyonrails.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Ruby on Rails&lt;/a&gt; (Ruby)&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.djangoproject.com/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Django &lt;/a&gt;(Python)&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://framework.zend.com/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ZEND &lt;/a&gt;(PHP)&lt;/li&gt;&lt;li&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.restlet.org&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Restlet &lt;/a&gt;(Java)&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;REST can be compared to other standards of communication over the internet: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://en.wikipedia.org/wiki/Remote_procedure_call&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;RPC &lt;/a&gt;and &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.w3.org/TR/soap/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;SOAP&lt;/a&gt;. 
The critique of these standards, though they have been used more widely than REST, is that they add unnecessary complexity to the design of communication interfaces between clients and servers.&lt;br&gt;&lt;/li&gt;&lt;li&gt;See also: &lt;a href=&quot;http://cyberling.elanguage.net/page/Web+Standards&quot; target=&quot;_self&quot; title=&quot;Web Standards&quot;&gt;Web Standards&lt;/a&gt;&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;b&gt;OTHER TECHNOLOGIES&lt;/b&gt;&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;&lt;b&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font color=&quot;#333333&quot;&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;br&gt; &lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/b&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;3&quot;&gt;&lt;i&gt;-- here&lt;/i&gt;&lt;/font&gt;&lt;br&gt; &lt;ul&gt;&lt;li&gt;&lt;br&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Getting Involved in ISO Standards Development</title><link>http://cyberling.elanguage.net/page/Getting+Involved+in+ISO+Standards+Development</link><author>alexispalmer</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Getting+Involved+in+ISO+Standards+Development</guid><comments>Moved from: WG2: Big ideas</comments><pubDate>Mon, 31 Aug 2009 13:23:46 CDT</pubDate><description>The topic of standards has been raised in several Working Groups, and a number of ISO standards have been mentioned at Cyberling 2009 (e.g., &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/new_doc/ISO_TC_37-4_N225_CD_MAF.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;MAF&lt;/a&gt;, ISO 639 [language tags], and &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.isocat.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISOcat&lt;/a&gt; [ISO 12620]). 
The discussion of standards raises a more general issue, not separately addressed: how can linguists participate in ISO standards development and why is it important that they do?&lt;br&gt;&lt;br&gt;&lt;h2&gt;What is ISO and why should linguists get involved in ISO standards development?&lt;/h2&gt;The International Organization for Standardization (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.iso.org&lt;/a&gt;) is a very large organization with standards covering everything from screw thread specifications to language tags. &lt;br&gt;&lt;br&gt;Because standards development is &amp;ldquo;based on voluntary involvement of all interests in the market place,&amp;rdquo; it is very important for linguists to participate in the creation of any standard that relates to language. The involvement of linguists can ensure that the ISO standards have been reviewed carefully and meet linguists&amp;rsquo; needs. For some ISO standards, such as those for character encoding, it is also crucial that user communities be directly consulted. This is particularly true for users of less well-known languages (and their writing systems). Since linguists often work with such user communities, they can act as intermediaries to encourage their participation. For groups of linguists that have been developing their own set of standards independent of ISO, funneling such work into an ISO standard will help guarantee long-term stability and longevity. &lt;br&gt;&lt;br&gt;&lt;h2&gt;The ISO standards process&lt;/h2&gt;The process of developing and approving a standard can be quite lengthy, lasting several years. Participation therefore requires time, commitment, and dedication. &lt;br&gt;&lt;br&gt;Meetings can be held in different locations throughout the world, so funding for travel is also a desideratum, though attending meetings is not required. 
Attendance is useful, however, as it is an opportunity to observe first-hand how ISO Working Groups function. It also enables participants to meet face-to-face, voice their concerns, and discuss topics in depth, which can be more difficult if handled via email. Since standards development is done through consensus, talking with others in person can help the group reach agreement more quickly. &lt;br&gt;&lt;br&gt;&lt;h2&gt;How to participate in ISO standards development&lt;br&gt;&lt;/h2&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;Read up on the ISO standards development process (for example, &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org/iso/my_iso_job.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.iso.org/iso/my_iso_job.pdf&lt;/a&gt;) and other documents on the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ISO website&lt;/a&gt; (such as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org/iso/joining_in_2007.pdf&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.iso.org/iso/joining_in_2007.pdf&lt;/a&gt; and &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org/iso/standards_development.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.iso.org/iso/standards_development.htm &lt;/a&gt;)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Contact the Convener of the Working Group (WG) you are interested in to see if you can participate (see list below). 
Read the relevant standard the WG is working on and related documents.&lt;br&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;Contact your country&amp;rsquo;s standards organization (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org/iso/iso_members&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.iso.org/iso/iso_members&lt;/a&gt;), describe your expertise, and see if you can participate in the Working Group in an official capacity.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Attend the Working Group meetings, if at all possible. The Working Group Convener can permit &amp;ldquo;invited guests&amp;rdquo; to participate in meetings.  There may be a limitation on the number of times you can participate as an &amp;ldquo;invited guest,&amp;rdquo; depending upon the rules of your standards organization, so if you feel you can commit to long-term participation, try to get appointed as a representative or an &amp;ldquo;expert&amp;rdquo; by your member body.  &lt;/li&gt;&lt;/ul&gt;    &lt;h2&gt;&lt;br&gt;&lt;/h2&gt;&lt;h2&gt;Participating in the Unicode Standard&amp;#39;s development&lt;/h2&gt;The description above focuses on how linguists can participate in ISO standards development. For linguists interested in character encoding, Unicode is another way to participate in standards development, as Unicode and ISO 10646 are completely synchronized. All characters accepted into ISO 10646 must also be approved by the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/consortium/utc.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Unicode Technical Committee (UTC)&lt;/a&gt;. &lt;br&gt;&lt;br&gt;For some linguists, participation in Unicode might be more accessible. 
&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/timesens/calendar.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;UTC meetings&lt;/a&gt; are held quarterly, often in the San Francisco Bay Area. In contrast, the ISO Working Group that oversees work on ISO 10646 (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://std.dkuug.dk/jtc1/sc2/wg2/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;JTC 1/SC 2/WG2&lt;/a&gt;) meets twice a year, often in different locations throughout the world. &lt;br&gt;&lt;br&gt;The UTC discussions can be quite technical in nature, as &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/consortium/memblist.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;UTC members&lt;/a&gt;   &amp;ndash; drawn largely from the computer industry   &amp;ndash; need to be sure that UTC decisions can be implemented on current platforms and in software, and are in line with Unicode policies. 
Still, input from linguists and members of the user community on character and script proposals is valuable and highly encouraged; feedback is particularly welcome through the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/reporting.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Unicode website form&lt;/a&gt;.&lt;br&gt;&lt;br&gt;Ways to get involved:&lt;br&gt;&lt;br&gt;&lt;ul&gt;&lt;li&gt;To pose a question on Unicode or to keep an eye on Unicode discussions, sign up for the public &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/consortium/distlist.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Unicode email list&lt;br&gt;&lt;br&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;To become more involved in Unicode development, &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/consortium/memblist.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;become a member of Unicode&lt;/a&gt;, either as a representative of your institution, as an individual, or as a student. This will give you access to Unicode documents and the &amp;quot;inner&amp;quot; Unicode email list (Unicore). &lt;br&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;If you are available, attend a Unicode Technical Committee meeting as an observer (or, as a member, see above). Contact Deborah Anderson, UC Berkeley representative to the Unicode Consortium, for further information (dwanders at berkeley dot edu [with no spaces]).&lt;br&gt;&lt;br&gt;&lt;/li&gt;&lt;li&gt;Submit feedback on the Unicode Standard, its other specifications, or any aspect of the Unicode website via the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/reporting.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Unicode website form&lt;/a&gt;. 
&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;&lt;h3&gt;References&lt;i&gt;: ISO Technical committees (TC) / Subcommittees (SC) / Joint Technical Committees (JTC) of Interest to Linguists&lt;/i&gt;&lt;/h3&gt;&lt;br&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TC 37/SC 4&lt;/a&gt; - Language resource management (work on feature structures, morpho-syntactic annotation framework, linguistic annotation framework, word segmentation of written texts, syntactic annotation framework, semantic annotation framework) &lt;br&gt;&lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/WG1/wg1.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Working Group 1&lt;/a&gt;: Basic descriptors and mechanisms for language resources&lt;br&gt;&lt;/blockquote&gt; &lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/WG2/wg2.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Working Group 2&lt;/a&gt;: Representation Schemes &lt;br&gt;&lt;/blockquote&gt; &lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/WG3/wg3.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Working Group 3&lt;/a&gt;: Multilingual text representation&lt;br&gt;&lt;/blockquote&gt; &lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/WG4/wg4.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Working Group 4&lt;/a&gt;: Lexical database&lt;br&gt;&lt;/blockquote&gt; &lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.tc37sc4.org/WG5/wg5.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Working Group 5&lt;/a&gt;: Workflow of language resource 
management&lt;br&gt;&lt;/blockquote&gt; &lt;br&gt; &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/iso_technical_committee.htm?commid=48124&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;TC 37/SC 2&lt;/a&gt; - Terminographical and lexicographical working methods &amp;ndash; codes for the representation of names of languages (ISO 639), terminological entries in standards, lexicographical production and marketing, interpreting/interpretation processes, &lt;br&gt;&lt;blockquote&gt; Working Group 1: Language coding&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt; Working Group 2: Terminography&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt; Working Group 3: Lexicography&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt; Working Group 4: Source identification for language resources&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt; Working Group 5: Requirements and certification schemes for cultural diversity management&lt;br&gt;&lt;/blockquote&gt;&lt;blockquote&gt; Working Group 6: Translation and Interpretation Services&lt;br&gt;&lt;/blockquote&gt;&lt;br&gt;JTC1/SC2 Coded character sets&lt;br&gt; &lt;blockquote&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://std.dkuug.dk/jtc1/sc2/wg2/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Working Group 2&lt;/a&gt;: Universal coded character set&lt;/blockquote&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Genealogical classification with AutoTyp</title><link>http://cyberling.elanguage.net/page/Genealogical+classification+with+AutoTyp</link><author>alexispalmer</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Genealogical+classification+with+AutoTyp</guid><comments>added one link</comments><pubDate>Mon, 31 Aug 2009 07:54:13 CDT</pubDate><description>&lt;table align=&quot;bottom&quot; 
cellpadding=&quot;3&quot; class=&quot;WPC-edit-style-grid1 WPC-edit-border-all WPC-edit-styleData-color1=%23ebebeb&amp;color2=%23c7c7c7&quot; height=&quot;176&quot; width=&quot;1156&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;23%&quot;&gt;Case study subdiscipline: &lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;77%&quot;&gt;Typology&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;23%&quot;&gt;Project title:&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;77%&quot;&gt;Autotyp typological databases&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;23%&quot;&gt;Software used: &lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;77%&quot;&gt;FileMaker Pro (TM)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;23%&quot;&gt;Goals of this case study:&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;77%&quot;&gt;Demonstrate the use of notes and a log to track changes in the genealogical classification in the database. &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br&gt;&lt;br&gt;See slideshow:  jn_cyberling_reanalysis_trail.ppt  (attachment to this page)&lt;br&gt;&lt;br&gt;(The genealogical classification can also be accessed online at &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://uni-leipzig.de/%7Eautotyp&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://uni-leipzig.de/~autotyp&lt;/a&gt; )&lt;br&gt;&lt;br&gt;This is a series of screenshots of records in the Autotyp typological database. One of the modules there is a genealogy for every language we have in the database (about 2600 languages so far). By now the genealogy amounts to a near-complete list of all the world&amp;#39;s language families and most of their major subgrouping. It is under constant revision as fieldwork and comparative work discover new languages and families, join families into larger groups, and otherwise change classification. 
The literature is also filled with speculative proposals about genealogical relationships that do not have adequate support. Keeping track of the revisions is not particularly burdensome and is absolutely essential for documenting the grounds for the classification. (Unlike most other classifications of the world&amp;#39;s languages, ours is based on evidence and criteria for non-chance degrees of resemblance, so every decision does need to be described.)&lt;br&gt;&lt;br&gt;We have a Notes field on all of our genealogical classification records, and this is used to record the history of editing and reclassifying. The slides in this slideshow show:&lt;br&gt;&lt;br&gt;3. The levels of classification we consider determinate. Stock and language entries are required for every record.&lt;br&gt;&lt;br&gt;4. The search interface as seen for a language search from the on-line search interface.&lt;br&gt;&lt;br&gt;6. Sample language record for Arabic, with its classification at different levels. Note that Afroasiatic, though a bona fide family, is not a stock but a higher-level grouping (demonstrable but not reconstructable).&lt;br&gt;&lt;br&gt;8-15 (see commentary slides 7 and 14): These are a set of screenshots of records for languages of the Penutian macrogroup. 
They document our changing decisions on the status of this group and the subgrouping of some of its components.&lt;br&gt;&lt;br&gt;16-17 Page from the database log outlining the changes made to Penutian in 2008.&lt;br&gt;&lt;br&gt;17-18 Slide showing evidence and counterevidence presented in Notes comments, on the subclassification of Western Malayo-Polynesian.&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>WG2: Subfield-specific practices</title><link>http://cyberling.elanguage.net/page/WG2%3A+Subfield-specific+practices</link><author>alexispalmer</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/WG2%3A+Subfield-specific+practices</guid><pubDate>Mon, 31 Aug 2009 07:23:33 CDT</pubDate><description>&lt;font color=&quot;#0000ff&quot; size=&quot;4&quot;&gt;&lt;font size=&quot;5&quot;&gt;Standards are great, now how can I use them?&lt;br&gt;&lt;/font&gt;&lt;br&gt;&lt;/font&gt;&lt;font size=&quot;4&quot;&gt;Widespread use of standards and/or best practices just won&amp;#39;t happen unless it is easy for people to:&lt;br&gt;&lt;/font&gt;&lt;ol&gt;&lt;li&gt;&lt;font size=&quot;4&quot;&gt;Locate information re: standards and what they entail.&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;4&quot;&gt;&lt;b&gt;Learn how to apply the standards to their own data.&lt;/b&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ol&gt;&lt;font size=&quot;4&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;font size=&quot;4&quot;&gt;The page(s) linked below provide common practices in some specific subfields of Linguistics:&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Sociolinguistics+common+practices&quot; target=&quot;_self&quot;&gt;Sociolinguistics&lt;/a&gt;&lt;br&gt;&lt;font color=&quot;#0000ff&quot;&gt;&lt;u&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Typology+common+practices&quot; target=&quot;_self&quot;&gt;Typology&lt;/a&gt;&lt;/u&gt;&lt;/font&gt;&lt;br&gt;&lt;a 
href=&quot;http://cyberling.elanguage.net/page/Language+documentation+common+practices&quot; target=&quot;_self&quot;&gt;Language documentation and description&lt;/a&gt;&lt;br&gt;&lt;br&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Phonetics+common+practices&quot; target=&quot;_self&quot;&gt;Phonetics (article stub)&lt;/a&gt;&lt;br&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Syntax+common+practices&quot; target=&quot;_self&quot;&gt;Syntax (article stub)&lt;/a&gt;&lt;br&gt;&lt;/font&gt;&lt;br&gt;&lt;font color=&quot;#ff0000&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;font size=&quot;3&quot;&gt;&lt;font color=&quot;#0000ff&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;------------------------------------------------------------------------------------------------------------&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;font color=&quot;#ff0000&quot;&gt;&lt;br&gt;&lt;font size=&quot;3&quot;&gt;PLEASE CONTRIBUTE!&lt;/font&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#ff0000&quot; size=&quot;3&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font color=&quot;#ff0000&quot;&gt;This is an opportunity to share your expertise regarding subfield-specific standards or best practices, how these can be applied in your field, and why one should bother doing so. We are seeking contributions to this page and its offspring, in two forms:&lt;br&gt; 1. Expansion of existing seed lists&lt;br&gt; 2. 
Creation of new seed lists&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>WG2: Case Studies</title><link>http://cyberling.elanguage.net/page/WG2%3A+Case+Studies</link><author>alexispalmer</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/WG2%3A+Case+Studies</guid><pubDate>Mon, 31 Aug 2009 07:20:13 CDT</pubDate><description>&lt;font size=&quot;3&quot;&gt;The case studies developed by Working Group 2 are intended to serve as illustrations of the following topics as these have arisen and been addressed in specific subfields of linguistics:&lt;br&gt;&lt;br&gt;&lt;/font&gt;&lt;ul&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;Unicode character encoding standards for increasing stable display, readability, and sharing of data&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;Relational database storage&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;Wiki-based sharing of research&lt;br&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;Metadata tags for increased transparency and usability of data&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;Version control&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;Web standards for sharing datasets&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;Machine reusability of language data&lt;br&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;font size=&quot;3&quot;&gt;&lt;br&gt;Case Studies:&lt;br&gt;&lt;/font&gt;&lt;ol&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Interlinearisation+case+study&quot; target=&quot;_self&quot;&gt;Interlinearisation: Web-based collaborative annotation using TypeCraft&lt;br&gt;&lt;/a&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Machine+reusability+of+data&quot; target=&quot;_self&quot;&gt;NLP and 
linguistic data: What makes data suitable for machine use?&lt;br&gt;&lt;/a&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Reasons+for+Using+Unicode&quot; target=&quot;_self&quot;&gt;&lt;font size=&quot;3&quot;&gt;Character encoding: Reasons for using Unicode&lt;/font&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Web+Standards&quot; target=&quot;_self&quot;&gt;Web standards: Web design for effective data-sharing&lt;br&gt;&lt;/a&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font size=&quot;3&quot;&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Sociolinguistics+case+study+%28version+control%29&quot; target=&quot;_self&quot; title=&quot;sociolinguistics case study: collaborative research utilizing version control&quot;&gt;Sociolinguistics case study: Collaborative research utilizing version control&lt;/a&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;a href=&quot;/page/Genealogical+classification+with+AutoTyp&quot; target=&quot;_self&quot;&gt;&lt;font size=&quot;3&quot;&gt;Typology: Genealogical classification using AutoTyp&lt;/font&gt;&lt;/a&gt;&lt;/li&gt;&lt;/ol&gt;&lt;font color=&quot;#ff0000&quot;&gt;&lt;br&gt;PLEASE CONTRIBUTE!&lt;font size=&quot;2&quot;&gt;&lt;br&gt;We welcome the contribution of additional case studies regarding standards for data storage, search, and retrieval (and other relevant topics... there are many). 
&lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#ff0000&quot; size=&quot;2&quot;&gt;&lt;font color=&quot;#0000ff&quot;&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Reasons for Using Unicode</title><link>http://cyberling.elanguage.net/page/Reasons+for+Using+Unicode</link><author>DeborahAnderson</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Reasons+for+Using+Unicode</guid><pubDate>Sat, 29 Aug 2009 14:05:43 CDT</pubDate><description>&lt;h2&gt;Why use Unicode?&lt;/h2&gt;&lt;br&gt;Linguistic data should be created and stored using standards. For written text, the character encoding standard is Unicode (&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://unicode.org/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://unicode.org&lt;/a&gt;) / ISO 10646. &lt;br&gt;&lt;br&gt;Linguists who create text using a non-Unicode font risk jeopardizing their data, for it won&amp;#39;t be easily found by widely used search processes, nor will it be saved in a stable, standardized format that will guarantee longevity. &lt;br&gt;&lt;br&gt;    PDFs, Word documents, webpages, and other documents that are put online but are created using a non-Unicode font will cause problems for searching. Using images for missing letters or symbols also presents a problem, as in the following example.&lt;br&gt;&lt;br&gt; The snippets below come from the article &amp;ldquo;A Preliminary Study of Jaw Movement in Arrernte Consonant Production&amp;rdquo; by Marija Tabain (&lt;i&gt;Journal of the International Phonetic Association&lt;/i&gt; 2009, 39: 33-51). 
There are two versions available: an HTML version and a PDF.&lt;br&gt;&lt;br&gt;The HTML version uses images for certain IPA symbols, including one for the voiceless retroflex stop &amp;ldquo;ʈ&amp;rdquo;.&lt;br&gt;&lt;br&gt; &lt;br&gt;&lt;br&gt;    Unfortunately, it is not possible to search on the images in the HTML document itself, or by doing a &amp;ldquo;Google&amp;rdquo; search across the Internet on such an image, so one won&amp;#39;t be able to locate &amp;ldquo;ʈ&amp;rdquo; in the HTML document.&lt;br&gt;&lt;br&gt;In the PDF version of the same text, a non-Unicode font has been used for the &amp;ldquo;ʈ&amp;rdquo;. At first glance, the &amp;ldquo;ʈ&amp;rdquo; appears fine in the PDF:   &lt;br&gt;&lt;br&gt; &lt;br&gt;&lt;br&gt;    However, the font used has put the glyph for &amp;ldquo;ʈ&amp;rdquo; (Unicode LATIN SMALL LETTER T WITH RETROFLEX HOOK, U+0288) in the spot that is properly allocated in Unicode for the DAGGER (&amp;dagger;, U+2020). As a result, it is not possible to search in this document for the &amp;ldquo;ʈ&amp;rdquo; (Unicode U+0288), because the &amp;ldquo;ʈ&amp;rdquo; has been overlaid on the dagger character. (In a similar way, many old Greek fonts would put the lowercase alpha on top of Latin &amp;quot;a&amp;quot; in the font, lowercase beta on top of &amp;quot;b&amp;quot;, etc. To search for alpha, one had to search on the Latin letter &amp;quot;a&amp;quot;. With Unicode, alpha has its own number [codepoint], which is different from the Latin lowercase &amp;quot;a&amp;quot;, so now it is possible to search for alpha separately from Latin &amp;quot;a&amp;quot;.) 
&lt;br&gt;&lt;br&gt;A non-Unicode font will present other problems for the user: If one copies and pastes the letter &amp;ldquo;ʈ&amp;rdquo; from the above PDF into a Unicode-compliant word-processing document, it appears as a dagger.&lt;br&gt;&lt;br&gt;The above example demonstrates that using a non-Unicode font can prevent search engines from finding documents and text properly. &lt;br&gt;&lt;br&gt;Another important factor is that old documents created with non-Unicode fonts will be hard to read in the future, since they won&amp;rsquo;t be based on an international standard. If an elderly linguist were to die and leave his data on his computer (which he had keyed in with his own non-standard font), it may take considerable time and effort to convert the data into a standardized format that is usable by others, with the possibility that the data could be lost forever.&lt;br&gt;&lt;br&gt;&lt;h2&gt;Tools (Fonts/Keyboards/etc.): &lt;/h2&gt;The listing below is not comprehensive, but is only intended to provide a few reliable webpages that provide tools and other useful information on Unicode-enabled products.&lt;br&gt;&lt;br&gt;&lt;h3&gt;&lt;i&gt;Fonts&lt;/i&gt;&lt;/h3&gt;Most core fonts that come with recent operating systems are Unicode-based, but they may not include all the special characters required by linguists. (Note: A new font that will be released with Windows 7, Ebrima, will include improved support for many African languages that use the Latin script. 
It also includes the Vai and N&amp;#39;Ko scripts.)&lt;br&gt;&lt;ul&gt;&lt;li&gt;Unicode-enabled fonts with IPA from SIL: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://scripts.sil.org/IPAhome&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;font color=&quot;#0000ff&quot;&gt;http://scripts.sil.org/IPAhome&lt;/font&gt;&lt;/a&gt; (Note: SIL fonts that are prefixed with &amp;quot;SIL IPA&amp;quot; are non-Unicode fonts) &lt;/li&gt;&lt;li&gt;John Wells&amp;#39;s webpage with a listing of IPA fonts (and Unicode input info): &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm&lt;/a&gt;&lt;/li&gt;&lt;li&gt;LinguistList info on fonts: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://linguistlist.org/sp/Fonts.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://linguistlist.org/sp/Fonts.html&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;&lt;i&gt;&amp;quot;Character pickers&amp;quot;&lt;/i&gt;&lt;/h3&gt;These enable users to select the letters or symbols they wish, and cut and paste them into documents. &lt;br&gt;&lt;ul&gt;&lt;li&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://people.w3.org/rishida/scripts/pickers/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://people.w3.org/rishida/scripts/pickers/&lt;/a&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;&lt;i&gt;Keyboards and Inputting Methods&lt;/i&gt;:&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;font color=&quot;#000000&quot;&gt;Keyboard info from SIL:&lt;/font&gt;&lt;i&gt;&lt;font color=&quot;#000000&quot;&gt; &lt;/font&gt;&lt;/i&gt;&lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://scripts.sil.org/UniIPAKeyboard&quot; 
rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;font color=&quot;#0000ff&quot;&gt;http://scripts.sil.org/UniIPAKeyboard&lt;/font&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;font color=&quot;#0000ff&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;Keyboards and Inputting Tips from E-MELD: &lt;a class=&quot;external&quot; href=&quot;http://linguistlist.org/cfdocs/emeld/school/classroom/unicode/ipafont.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://linguistlist.org/cfdocs/emeld/school/classroom/unicode/ipafont.htm&lt;/a&gt;&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;li&gt;&lt;font color=&quot;#0000ff&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;IPA Keyboards for the PC from the Speech, Hearing, and Phonetic Sciences Dept., University College London: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.phon.ucl.ac.uk/resource/phonetics/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.phon.ucl.ac.uk/resource/phonetics/&lt;/a&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;&lt;h2&gt;Other Script Standards and Useful Resources:&lt;/h2&gt;&lt;b&gt;ISO Script Codes&lt;/b&gt; (ISO 15924): &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/iso15924/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;font color=&quot;#800080&quot; size=&quot;3&quot;&gt;http://www.unicode.org/iso15924/&lt;/font&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;&lt;b&gt;Recommendations on the Development of New Orthographies&lt;/b&gt;: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.unicode.org/notes/tn19/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;http://www.unicode.org/notes/tn19/&lt;/a&gt;&lt;br&gt;This is a set of guidelines for linguists who are devising orthographies, so the orthography can be accessible to users on computers&lt;br&gt;&lt;br&gt;&lt;a class=&quot;external&quot; 
href=&quot;http://linguistlist.org/cfdocs/emeld/school/classroom/unicode/documentation.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;&lt;b&gt;Unicode for Language Documentation&lt;/b&gt;&lt;/a&gt;&lt;br&gt;An informative page from E-MELD with sections on (a) adding characters to Unicode, (b) precomposed forms (letters made up of a base character plus one or more diacritics) and why these are not in Unicode, and (c) IPA and Unicode.&lt;br&gt;&lt;b&gt;&lt;br&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Getting+Involved+in+ISO+Standards+Development&quot; target=&quot;_self&quot;&gt;How to Get Involved in ISO Standards Development&lt;/a&gt;&lt;br&gt;&lt;/b&gt;This document also includes a short section on how to get involved in the development of the Unicode Standard.&lt;b&gt;&lt;br&gt;&lt;/b&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>WG2: Big ideas</title><link>http://cyberling.elanguage.net/page/WG2%3A+Big+ideas</link><author>DeborahAnderson</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/WG2%3A+Big+ideas</guid><pubDate>Sat, 29 Aug 2009 13:10:04 CDT</pubDate><description>On this page: How do we define &amp;#39;other standards&amp;#39;?&lt;br&gt;An aside...Standards vs. Best Practices&lt;br&gt;How do we encourage adoption of standards in linguistics?&lt;br&gt;&lt;blockquote&gt;Data sharing: the publication model&lt;br&gt;Standards are great, now how do I use them?&lt;br&gt;How can I participate in the creation of ISO standards?&lt;br&gt;&lt;br&gt;&lt;/blockquote&gt;&lt;h3&gt;How do we define &amp;#39;other standards&amp;#39;?&lt;/h3&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font size=&quot;3&quot;&gt;For the purposes of this workshop, we take our domain of interest to be standards related to the &lt;b&gt;sharing of language data&lt;/b&gt; within the linguistics community. 
The discussion is organized around a questionable* division of data sharing into four subtopics:&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;div align=&quot;left&quot;&gt;&lt;font size=&quot;3&quot;&gt;&lt;font size=&quot;1&quot;&gt;*&lt;/font&gt;&lt;font size=&quot;1&quot;&gt;We call this a questionable division because the four are deeply interrelated.&lt;/font&gt;&lt;br&gt;&lt;/font&gt;&lt;/div&gt;&lt;br&gt;&lt;ol&gt;&lt;li&gt;Storage of digital data&lt;br&gt;&lt;/li&gt;&lt;li&gt;Retrieval and discoverability of digital data (i.e. discoverability at the document or resource level)&lt;br&gt;&lt;/li&gt;&lt;li&gt;Search of digital data (i.e. discoverability at the within-document level)&lt;/li&gt;&lt;li&gt;Access and reusability of digital data&lt;/li&gt;&lt;/ol&gt;&lt;h3&gt;&lt;br&gt;The table below displays some of the key areas of concern for each of the four topics we take to fall within our domain of interest. &lt;br&gt;&lt;/h3&gt;&lt;table align=&quot;bottom&quot; cellpadding=&quot;3&quot; class=&quot;WPC-edit-style-grid1 WPC-edit-border-all WPC-edit-styleData-color1=%23ebebeb&amp;color2=%23c7c7c7&quot; width=&quot;100%&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align=&quot;center&quot; bgcolor=&quot;#2db6cf&quot; class=&quot;WPC-edit-custom-bgColor&quot; width=&quot;25%&quot;&gt;&lt;b&gt;STORAGE&lt;/b&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; bgcolor=&quot;#2db6cf&quot; class=&quot;WPC-edit-custom-bgColor&quot; width=&quot;25%&quot;&gt;&lt;b&gt;RETRIEVAL&lt;/b&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; bgcolor=&quot;#2db6cf&quot; class=&quot;WPC-edit-custom-bgColorWPC-edit-custom-bgColorWPC-edit-custom-bgColor&quot; width=&quot;25%&quot;&gt;&lt;b&gt;SEARCH&lt;/b&gt;&lt;/td&gt;&lt;td align=&quot;center&quot; bgcolor=&quot;#2db6cf&quot; class=&quot;WPC-edit-custom-bgColor&quot; width=&quot;25%&quot;&gt;&lt;b&gt;ACCESS/REUSE&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; 
width=&quot;25%&quot;&gt;&lt;b&gt;METADATA&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;METADATA&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;METADATA&lt;br&gt;&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;METADATA&lt;br&gt;&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;DIGITIZATION &lt;/b&gt;of both primary data &amp;amp; metadata&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;VERSIONING &lt;/b&gt;tracking major changes/decisions made and motivations for same&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;SOURCE MATERIALS &lt;/b&gt;linking to audio/video source&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;table align=&quot;bottom&quot; cellpadding=&quot;3&quot; class=&quot;WPC-edit-style-grid1 WPC-edit-border-all WPC-edit-styleData-color1=%23ebebeb&amp;color2=%23c7c7c7&quot; width=&quot;100%&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;CITATION STANDARDS &lt;/b&gt;how to cite datasets&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;FORMATS &amp;amp; STANDARDS&lt;/b&gt; open access standards &amp;amp; formats&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;DIGITAL FINGERPRINTING&lt;br&gt;&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;** ANNOTATION CONVENTIONS &lt;/b&gt;collection and dissemination of conventions used in existing data collections&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;PRIVACY/LEGAL ISSUES &lt;/b&gt;related to user access, privilege assignment, copyright and data ownership&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;LEGACY DATA 
&lt;/b&gt;providing within-subfield model for best practices data sharing&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;ADAPTIVE CODING &lt;/b&gt;ability to adjust data coding scheme as knowledge evolves&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;CONSISTENCY &lt;/b&gt;consistency and quality of data annotations&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;** SUBFIELD-SPECIFIC USABILITY CONCERNS &lt;/b&gt;specialized standards, metadata sets, ontologies, etc.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;PUBLICATION&lt;br&gt;&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;STABLE ADDRESSING OF RESOURCES&lt;br&gt;&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;&lt;br&gt;&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;REUSABILITY &lt;/b&gt;repurposing of data for use in addressing new research questions by both humans and machine&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;ARCHIVING&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;WEB STANDARDS&lt;br&gt;&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;&lt;br&gt;&lt;/b&gt;&lt;/td&gt;&lt;td class=&quot;&quot; width=&quot;25%&quot;&gt;&lt;b&gt;&lt;br&gt;&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;** indicates action items listed in the table that may be implemented immediately or in the near term.&lt;h3&gt;&lt;br&gt;&lt;/h3&gt;&lt;h3&gt;&lt;br&gt;&lt;/h3&gt;&lt;h3&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;An aside...standards vs. 
best practices (a distinction with a difference?)&lt;/font&gt;&lt;/font&gt;&lt;/h3&gt;We weren&amp;rsquo;t certain that linguistics, as a field or academic culture, has a tradition of clearly differentiating standards from best practices. We do not have an organizational body within the field that sets standards. Subfields are autonomous and vary in the extent to which standards or best practices are discussed, named, and adhered to. We nonetheless considered a few possible distinctions that might generally be made between the two, so we can be as clear as possible about what we mean when we use the terms &amp;ldquo;standard&amp;rdquo; and &amp;ldquo;best practice&amp;rdquo; in these wiki pages:&lt;br&gt;&lt;br&gt;STANDARDS:&lt;br&gt;&lt;ul&gt;&lt;li&gt;Often, these are theory-neutral conventional systems for accomplishing some task (often related to analysis, description, or publication) in linguistics (e.g., &lt;a href=&quot;http://cyberling.elanguage.net/page/Group+1%3A+Annotation+Standards&quot; target=&quot;_self&quot;&gt;the IPA system for phonetic transcription&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;Named (so practitioners may name standards to which their practices adhere in published work, for example) &lt;/li&gt;&lt;li&gt;Official (new standards will explicitly supersede prior or existing ones), handed down from a high-level organization charged with regulating usage, nomenclature, etc.&lt;/li&gt;&lt;li&gt;Use is subject to sanction or mandate&lt;br&gt;&lt;/li&gt;&lt;li&gt;Developed over time via a process involving the deliberations of an organizational body of experts, after discussion and consensus &lt;/li&gt;&lt;li&gt;Follow from best practices, ranked and subjected to selection&lt;/li&gt;&lt;li&gt;links: discussions regarding standards &lt;/li&gt;&lt;li&gt;links: political issues&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;BEST PRACTICES:&lt;br&gt;&lt;ul&gt;&lt;li&gt;Often, principled practices rather than mandated systems for accomplishing some analytical, descriptive, or 
publication-related task in linguistics&lt;/li&gt;&lt;li&gt;Recommended, but not strictly enforced&lt;/li&gt;&lt;li&gt;Generated by practitioners in a bottom-up process, who wish to build consensus in practice and are often interested in motivating the need for a particular practice&lt;/li&gt;&lt;/ul&gt;&lt;br&gt;&lt;br&gt;&lt;h3&gt;How do we encourage adoption of standards in linguistics?&lt;/h3&gt;&lt;font color=&quot;#333333&quot; size=&quot;3&quot;&gt;Data sharing: the publication model offers some possibilities with regard to building incentives for adopting standards, acknowledging use of annotated corpora, receiving and giving credit for the use of annotated and marked-up data (as a scholarly practice of value to the field). Working group 5 explored &lt;a href=&quot;http://cyberling.elanguage.net/page/Group+5%3A+Models+From+Other+Fields&quot; target=&quot;_self&quot;&gt;ways that other disciplines are sharing data&lt;/a&gt;, so we may learn from these examples.&lt;/font&gt;&lt;font color=&quot;#808080&quot; size=&quot;4&quot;&gt;&lt;br&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font size=&quot;3&quot;&gt;Publication mechanisms for linguistic data collections are one possibility for encouraging adoption of standards. 
&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;ol&gt;&lt;li&gt;Receiving academic credit for publication of data would provide a needed incentive for doing the extra work needed to be sure that standards are followed.&lt;/li&gt;&lt;li&gt;Peer review will improve the quality of shared data.&lt;/li&gt;&lt;li&gt;Publication and proper citation of data facilitate demonstrating the scholarly contribution made by providing the data.&lt;/li&gt;&lt;li&gt;Publication of legacy data would provide a valuable training ground for young researchers as well as providing a model for preparation of data according to best practices.&lt;/li&gt;&lt;/ol&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;br&gt;Standards are great, now how can I use them?&lt;br&gt;&lt;/font&gt;&lt;/font&gt;Widespread use of standards and/or best practices just won&amp;#39;t happen unless it is easy for people to:&lt;br&gt;&lt;ol&gt;&lt;li&gt;Locate information re: standards and what they entail.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Learn how to apply the standards to their own data.&lt;/b&gt;&lt;/li&gt;&lt;/ol&gt;&lt;font color=&quot;#333333&quot; size=&quot;3&quot;&gt;Of course, a commitment to communication, collaboration, coordination, community building and open access to data are crucial for supporting the use of standards. &lt;a href=&quot;http://cyberling.elanguage.net/page/Group+7%3A+Collaboration+Structure&quot; target=&quot;_self&quot;&gt;Working Group 7&lt;/a&gt; discussed this issue.&lt;/font&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;How can I participate in the creation of ISO standards?&lt;br&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font size=&quot;3&quot;&gt;ISO is home to a wide array of standards, and the process of standardization can appear to be opaque and daunting to the outsider. 
A short page devoted to how linguists can participate in ISO standards development is located at: &lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;a href=&quot;http://cyberling.elanguage.net/page/Getting+Involved+in+ISO+Standards+Development&quot; target=&quot;_self&quot;&gt;How to Get Involved in ISO Standards Development&lt;/a&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;font color=&quot;#000000&quot; size=&quot;3&quot;&gt;.&lt;/font&gt;&lt;font color=&quot;#000000&quot; size=&quot;3&quot;&gt; (This page also includes a short section on how to participate in the development of the Unicode Standard.)&lt;/font&gt; &lt;br&gt; &lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;font color=&quot;#000000&quot;&gt;&lt;font size=&quot;3&quot;&gt;By actively participating in ISO standardization, linguists will have a vested interest in the use of such standards. Involvement by linguists also has the result of making sure standards are suited to current needs, and haven&amp;#39;t become fossilized. Ideally, participants should get recognition from their host institution for work on standards development, a job that often requires many hours of time and (at times) considerable personal expense. 
&lt;br&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;/font&gt;&lt;font color=&quot;#808080&quot;&gt;&lt;font size=&quot;4&quot;&gt;&lt;br&gt;Future work:&lt;br&gt;&lt;/font&gt;&lt;/font&gt;WG2: Big ideas -- not yet written up&lt;br&gt;&lt;ol&gt;&lt;li&gt;simple but powerful tools&lt;/li&gt;&lt;li&gt;privacy and ethics concerns must be considered &lt;/li&gt;&lt;/ol&gt;&lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Typology common practices</title><link>http://cyberling.elanguage.net/page/Typology+common+practices</link><author>Johanna.Nichols</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Typology+common+practices</guid><comments>Page drafted.</comments><pubDate>Fri, 28 Aug 2009 23:22:19 CDT</pubDate><description>&lt;br&gt;Common practices:&lt;br&gt; Several institutions and projects have large databases with cross-linguistic typological information on a number of languages. (List with links to be added.) The information has generally been gathered over some years from published grammars, one&amp;#39;s own fieldwork, consultations with experts, etc. and represents an extremely large outlay of time and resources. In some cases language data is solicited in return for authorship credit for that portion of the database. Ways that database owners might provide peer review for these contributions have occasionally been discussed but probably never implemented.&lt;br&gt; Several projects make some or all of their data available, via a search interface or as downloadable data files. There are no field-wide standards for content of fields, data format, etc. but this does not seem to be an obstacle to finding and using data. It is standard to link data to language names using ISO codes. 
There is a growing consensus that data is most useful if it is free of one&amp;#39;s own lumpings (thus, e.g., not &amp;quot;large/medium/small&amp;quot; for inventories of elements such as morphological cases or phonemes, but the actual number; not &amp;quot;VO/OV&amp;quot; but an actual specific basic word order). &lt;br&gt;&lt;br&gt;Needed practices:&lt;br&gt; Consensus on how to give database authors and owners credit for their work while also making data publicly available. Practices include: Make data publicly available and request a citation. Offer data on request in exchange for citation. Offer data on request in exchange for coauthorship. (All three models are used in other fields.)&lt;br&gt; Peer review for databases as a whole. (This is separate from the question of how the database owner obtains peer review for individual contributions to the database.) Accuracy of entries and usefulness, comprehensiveness, empirical and theoretical adequacy, etc. of data categories need review.&lt;br&gt; Appropriate credit (on one&amp;#39;s CV and with one&amp;#39;s employing institution or prospective employer) for creation and maintenance of databases. &lt;br&gt;    &lt;br&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item><item><title>Language documentation common practices</title><link>http://cyberling.elanguage.net/page/Language+documentation+common+practices</link><author>Johanna.Nichols</author><guid isPermaLink="false">http://cyberling.elanguage.net/page/Language+documentation+common+practices</guid><comments>Drafted section &quot;Language documentation common practices&quot;.</comments><pubDate>Fri, 28 Aug 2009 19:50:40 CDT</pubDate><description>&lt;b&gt;&lt;font color=&quot;#808080&quot;&gt;NOTE: this structure is offered only as a suggested skeleton. 
Please modify as you see fit.&lt;/font&gt;&lt;br&gt;&lt;br&gt;Introduction:&lt;/b&gt; The fieldworker (and this includes documentary linguists) needs to learn the structure of the language and produce a grammatical description, dictionary, and texts usable by researchers and others (including interested members of the speech community). He or she is usually also singlehandedly responsible for producing theoretical and comparative work on the language. Not much will be said here about producing grammars (though see Dryer 2006) or the general scholarly work, other than to note that the whole set of responsibilities gives the fieldworker a considerably heavier workload than the average for linguists, so tools and standards that save time are especially needed. The tasks in need of standards and tools for sharing and access are: recording and/or digitizing; transcribing; creating texts; creating a lexicon; and archival storage and access of recordings, texts, and lexicon (and perhaps other materials). &lt;br&gt;&lt;br&gt;&lt;b&gt;Recommended Readings:&lt;/b&gt;&lt;u&gt;&lt;br&gt;&lt;/u&gt;Dryer, Matthew S. 2006. Descriptive theories, explanatory theories, and basic linguistic theory. In Felix Ameka, Alan Dench and Nicholas Evans, eds., &lt;i&gt;Catching Language: Issues in Grammar Writing&lt;/i&gt;. Berlin: Mouton de Gruyter.&lt;u&gt;&lt;br&gt;&lt;br&gt;&lt;/u&gt;&lt;b&gt;Software:&lt;/b&gt;&lt;u&gt;&lt;br&gt;&lt;/u&gt;(See Existing standards... 
below.)&lt;br&gt;&lt;u&gt;&lt;br&gt;&lt;/u&gt;&lt;b&gt;Existing standards, common practices, and/or best practices:&lt;/b&gt;&lt;br&gt;Recording: Information available from the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.mpi.nl/DOBES/documents&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;DoBeS project&lt;/a&gt; of the Max Planck Institute for Psycholinguistics, Nijmegen; the &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.hrelp.org/archive/resources/index.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Hans Rausing Endangered Languages Project&lt;/a&gt;, SOAS, London; and others.&lt;br&gt;&lt;br&gt;Transcribing: Many field linguists use &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://sourceforge.net/projects/trans/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Transcriber&lt;/a&gt;, available from SourceForge.net.&lt;br&gt;&lt;br&gt;Text work. The main tasks are inputting or importing transcribed material, morphological interlinearizing, syntactic annotation, parsing, lemmatization, concordance building, and preparation for corpus searching and archiving. For standards for some kinds of interlinearization and annotation see the report of Working Group 1. To my knowledge there are no current widely used software tools that could be described as best practice, though there is a growing consensus about what they need to do (e.g. multiple annotation tiers; link to recordings; enable non-interlinear annotation for discontinuous, non-compositional, multi-word, and non-linear categories and functions; enable printout of text segments with standard publishable interlinears; enable corpus searches of all kinds; enable other access). There is also very little knowledge of what might go into theory-neutral syntactic annotation. 
&lt;br&gt; A list of tools dated 2004 and including some text tools is &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.cs.mu.oz.au/research/lt/emeld/softsurv2/documents/index.html&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;. I am aware of these updates and additions: &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.kura.ats.lmu.de/index.php&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;Kura&lt;/a&gt;; and a glossing tool under development by Thomas Mayer (presented at &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://lsa2009.berkeley.edu/alt8/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;ALT8&lt;/a&gt;, July 2009). The DoBeS project has elaborate and fairly specialized &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.lat-mpi.eu/tools/&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;tools&lt;/a&gt; for text and dictionary work.&lt;br&gt; &lt;br&gt;Dictionary compilation. The most common practice for compiling dictionaries of all kinds (descriptive, defining, etymological, etc.) seems to be use of commercial database software to create a self-standing database that does not, e.g., link to a text corpus. &lt;br&gt;&lt;br&gt;Archival storage and access: Archives have their own standards for metadata and data formats. The DoBeS project has an extensive &lt;a class=&quot;external&quot; href=&quot;http://cyberling.elanguage.nethttp://www.mpi.nl/DOBES/language_archives&quot; rel=&quot;nofollow&quot; target=&quot;_blank&quot;&gt;list of archives&lt;/a&gt; with links. &lt;br&gt;&lt;br&gt;&lt;b&gt;Needed standards and data-sharing resources:&lt;/b&gt;&lt;br&gt;&lt;br&gt;Easy-to-use tools for text work are badly needed. 
Basic research, leading ultimately to standards, for theory-neutral syntactic annotation is needed.&lt;br&gt;&lt;b&gt;&lt;br&gt;&lt;/b&gt;&lt;hr size=&quot;1&quot;&gt;&lt;br/&gt;</description></item></channel></rss>