WG5_final_report

Towards a Cyberinfrastructure for Linguistics: Models from Other Fields

prepared by Cyberling 2009 Working Group 5

co-chairs: Scott Farrar, D. Terence Langendoen
members: Steven Moran, Cornelius Puschmann, Dwight van Tuyl


Abstract

This document presents the findings of Cyberling 2009 Working Group 5, which was charged with reviewing how other fields have been successful (or not) at implementing a cyberinfrastructure. First, we present the rationale for having such a working group to begin with. We then turn to a brief introductory summary of the successes of key related fields. Next, we discuss how the following attributes and practices together make up the best recipe for a cyberinfrastructure: (1) having no central data store, (2) using appropriate conceptualizations, (3) providing flexible visualization tools, (4) being open, and (5) providing easy ways to cite materials. We then turn to a description of collaborative modularity, our philosophy on the ideal organizational structure for the field. Finally, we present particular technologies and practices that we found to be compelling and instructive for the field of linguistics.

1. The Charge and Rationale

The field of linguistics appears to be well equipped as a model science for the 21st century. Linguists are by nature users of technology and progressive. Linguistics is one of the first disciplines to use the Web to further its activities and build its community (cf. Linguist List, founded in 1989, and present on the Web since the mid-1990s). We as a field are users of advanced instrumentation to collect and analyze our data, and have produced a variety of software to assist our everyday work, e.g. Shoebox, ELAN, and Praat. On the other hand, linguistics is challenged in specific ways, and these have caused a slow adoption of many recent technologies that could help advance our science. For instance, the field is by its nature theoretically speculative and eschews the confines of standards and normative pressures. As a result we find ourselves with a lack of infrastructure and standards. This is particularly evident when it comes to our views on data and data providers. Linguists on the whole have not actively tried to make their data widely available to others, either because they have not seen the need to, not wanted to, or not known how to. Those who have made their data available have generally made little effort to link their data with the data of others. There are exceptions to this situation of course. In terms of standards, there are the IPA, Unicode, and ISO/Ethnologue language codes. Linguists have a rich tradition of citing data 'snippets' such as example sentences, and are beginning to recognize the importance of publishing data, even without an accompanying analysis. Still, linguists do not generally agree on a core conceptualization of linguistics (e.g., the basic inventory of feature types, language varieties, etc.) which makes it difficult to link large amounts of diverse data.

At Cyberling 2009 held in Berkeley, CA, our working group was charged with reviewing how other fields had been successful (or not) at implementing a cyberinfrastructure and handling their data. We met with a plan to discuss the current state of affairs for linguistics, especially where we could compare linguistics with other fields. We decided to choose a few particular fields and to discuss how they have approached their own cyberinfrastructure needs. We organized our discussion by first enumerating the key facets of the approaches that contribute to success. We also discussed bad examples of cyberinfrastructure from other fields. A good portion of the discussion centered around tools that have contributed to success. Such was the process that led to the current public statement (this document). We chose to explore the best facets of cyberinfrastructure from various fields (not necessarily choosing a specific field or tool to copy) including facets of the following categories:
  • Data handling practices
  • Collaborative/organizational structure
  • Internet technology and tools
The remainder of this document explores each of these categories in hopes that the field as a whole can learn from the efforts of other fields.

2. Brief Overview of Related Fields

We begin by giving a few examples of how other fields have handled their own cyberinfrastructure needs, showing what can be achieved when researchers, educators and technical support staff organize around the goal of enabling world-wide data collection, distribution and analysis. By "related fields" we simply mean those with data. Ordinarily, one would expect a group of linguists to look towards sister disciplines in the humanities or social sciences. For our charge, however, we chose to look much further afield, in particular, at the hard sciences and at current practices on the Web. We decided that these fields were more representative of the path that we feel leads to success and one which we believe linguistics should ultimately follow. Here is a summary listing of exemplary fields from our broad survey/discussion:
  • The Integrated Public Use Microdata Series (IPUMS) https://international.ipums.org/international/, based at the University of Minnesota, provides users with free access to interoperable data from 130 censuses from a total of 44 countries, representing nearly 280 million anonymized person records. It enables researchers to analyze and visualize the world's population in time and space in order to study social change and human impact on the environment.
  • The Worldwide Protein Data Bank (wwPDB) http://www.wwpdb.org/ is an international group of organizations that serve as protein data deposition, processing and distribution centers. It maintains a single PDB archive of macromolecular structural data that is freely and publicly available to the worldwide community. The database contains approximately 50,000 proteins that researchers have discovered and shared openly with the global community. The associated infrastructure makes it possible to visualize and analyze these molecules, permitting further progress in many fields and applications such as engineered drug design.
  • The Consortium for the Barcode of Life http://barcoding.si.edu/, sponsored by the Smithsonian Institution and the Sloan Foundation, is an international effort to develop reliable means for the identification of biological species. Barcoding uses a short DNA sequence in an organism's genome as a barcode equivalent to determine the species of a biological sample. Adoption of a barcode-format standard allows for a sample, whether in a museum or field collection, to be instantly linked to related information resources worldwide.
  • The National Virtual Observatory (NVO) http://www.us-vo.org/ is a partnership of US universities, observatories and federal agencies, to develop interactive visual portals to the sky, and is a founding member of the International Virtual Observatory Alliance (IVOA) http://www.ivoa.net/. These resources enable data from radio, optical and X-ray astronomical observatories and archives to be aggregated, compared and analyzed, providing opportunities for new discoveries about the nature and origins of the universe.

3. Data Handling Practices

In terms of how data are to be handled (processed, curated and searched), we find several aspects of other fields very compelling. The first is to have no central data store. By 'central data store', we refer to a single location (or site) at which all of the field's data are housed and maintained. With such a store, data providers must adhere to the central authority's recommended practices and formats alone, and the data are centrally managed. Whereas such an approach has its merits (uniformity and quality control), research in linguistics and the data that it produces are too diverse for such a top-down approach. As articulated in the opening paragraphs of this document, such an approach would likely require a common conceptualization for the field, something not yet achievable.

However, without any conceptualization at all, it would not be possible to group any data together in the first place. That brings us to our second point: the necessity of having appropriate conceptualizations of the field. By 'conceptualization' we refer to terminologies, such as the elements used in the Leipzig Glossing Rules (http://www.eva.mpg.de/lingua/resources/glossing-rules.php) or the ISO/Ethnologue language codes (http://www.ethnologue.com), and to ontologies, as in the General Ontology for Linguistic Description (GOLD) (Farrar and Langendoen, 2003). We certainly do not advocate any one terminology or ontology, but do advocate the field's awareness of such resources and their further development. Only recently have our journals, funding agencies and professional societies started to require the use of such resources. We see it as absolutely advantageous for all such organizations to recommend the use of standard terminologies and ontologies.

Terminologies and ontologies for data may not satisfy all users. One way to alleviate some of the resistance towards their use is to provide services that demonstrate their benefits, e.g., flexible ways to visualize data. That is, even if one linguist uses a particular set of terminology, another linguist can view the same data using a different set. Visualization also concerns the way in which data are arranged for human consumption. Viewed in various ways, the same dataset can provide many types of information and potentially lead to a better understanding. For instance, consider Google Sky, a service that combines astronomical data from different observatories. It provides a very user-friendly, visually pleasing, and scientifically accurate way to access data from world-class observatories such as the Hubble Space Telescope, the GALEX space-based ultraviolet observatory, the Chandra space-based X-ray observatory, and others. The point is that the same data can be visualized according to a number of scientific dimensions, and this is a key selling point for using standard terminologies and ontologies.
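The idea that one linguist's annotations can be viewed under another linguist's terminology can be sketched in a few lines of Python. The two tag sets, the concept names and the example tokens below are all hypothetical. The sketch assumes only that both terminologies have been mapped to a shared set of concepts, for instance ones drawn from an ontology.

```python
# Hypothetical mappings from two project-specific tag sets to shared concepts.
# None of these labels is taken from a real standard; they are placeholders.
TAGSET_A = {"PL": "PluralNumber", "SG": "SingularNumber", "PST": "PastTense"}
TAGSET_B = {"plur": "PluralNumber", "sing": "SingularNumber", "past": "PastTense"}


def invert(tagset):
    """Return a concept -> tag lookup for a tag -> concept mapping."""
    return {concept: tag for tag, concept in tagset.items()}


def retag(tokens, source, target):
    """Re-express (form, tag) pairs from the source tag set in the target one.

    Tags with no shared concept are kept unchanged and marked with '?',
    so that nothing is silently lost.
    """
    concept_to_target = invert(target)
    result = []
    for form, tag in tokens:
        concept = source.get(tag)
        new_tag = concept_to_target.get(concept, tag + "?")
        result.append((form, new_tag))
    return result


# Data annotated by one linguist with tag set A ...
annotated = [("dogs", "PL"), ("walked", "PST"), ("the", "DET")]
# ... viewed by another linguist who prefers tag set B.
view_b = retag(annotated, TAGSET_A, TAGSET_B)
print(view_b)
```

The underlying data are never rewritten; only the view changes, which is the sense in which a shared conceptualization pays for itself.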

Related is the concept of open data. For data, 'openness' means that anyone can access them. For many of the fields we surveyed (e.g., astronomy), there was no issue with opening up the data to the general scientific community or to the public. However, for some, the medical domain in particular, data are incredibly sensitive. These fields have, nevertheless, come up with very clever ways to publish their data. Often, this involves anonymization: the removal or masking of names, ages, addresses, etc. While anonymization may not be appropriate for all kinds of linguistics data (e.g., field data on endangered languages), it does provide one means by which data can be published.
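As a rough illustration of masking, the following Python sketch replaces names and addresses with pseudonyms and coarsens ages into ten-year bands. The field names, the record and the salt are invented for the example, and salted hashing is strictly speaking pseudonymization rather than full anonymization; real projects would need to decide with their communities and review boards what counts as safe.

```python
import hashlib

# Fields treated as directly identifying in this sketch (an assumption).
SENSITIVE_FIELDS = {"name", "address"}


def pseudonym(value, salt):
    """Return a short, stable pseudonym for a value using a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "anon-" + digest[:8]


def anonymize(record, salt):
    """Return a copy of a record with names/addresses masked and age coarsened."""
    cleaned = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            cleaned[key] = pseudonym(str(value), salt)
        elif key == "age":
            # Replace the exact age with a ten-year band, e.g. 47 -> "40-49".
            low = (int(value) // 10) * 10
            cleaned[key] = f"{low}-{low + 9}"
        else:
            cleaned[key] = value
    return cleaned


# Hypothetical speaker metadata attached to a recording.
speaker = {"name": "Jane Doe", "age": 47, "address": "1 Main St", "utterance": "hello"}
print(anonymize(speaker, salt="project-secret"))
```

Because the same salt always yields the same pseudonym, recordings by one speaker can still be grouped without revealing who the speaker is.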

Whether data are publicly accessible or not, they should be easy to cite and to credit. The field of geoscience, for instance, has created a state-of-the-art search facility for open datasets (see http://pangaea.de). Not only do users have access to the data, but each dataset is also assigned a Digital Object Identifier (DOI), a scheme that allows for unambiguous citation.
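To make the citation point concrete, here is a minimal Python sketch of how a dataset record carrying a DOI could be turned into a citation string. The metadata fields, the citation layout and the dataset itself are assumptions made for illustration; the DOI is a made-up placeholder that does not point at a real dataset. Only the use of the doi.org resolver reflects actual practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Minimal citation metadata for a published dataset (fields are illustrative)."""
    creators: tuple
    year: int
    title: str
    publisher: str
    doi: str


def doi_url(doi):
    """Turn a bare DOI into a resolvable link via the doi.org resolver."""
    return "https://doi.org/" + doi


def cite(ds):
    """Format a dataset citation as: Creators (Year): Title. Publisher. DOI link."""
    names = "; ".join(ds.creators)
    return f"{names} ({ds.year}): {ds.title}. {ds.publisher}. {doi_url(ds.doi)}"


# A placeholder record; neither the archive nor the DOI is real.
example = Dataset(
    creators=("Doe, Jane", "Roe, Richard"),
    year=2009,
    title="Word list for a hypothetical language",
    publisher="Example Archive",
    doi="10.5555/example.12345",
)
print(cite(example))
```

The identifier, not the hosting location, is what gets cited, so the citation remains valid even if the dataset later moves to another server.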

In summary then, these aspects of linguistics data handling have the greatest potential for success:
  • no central data store
  • appropriate conceptualizations of the field
  • flexible visualization of data
  • open data (but sensitive to access restrictions)
  • easy to cite (DOIs, URIs)

4. Collaborative/Organizational Structure

One of the most challenging aspects of creating a cyberinfrastructure is to accommodate collaboration while fostering individual innovation, all within some sort of organizational structure or structures. To this end we introduce more of an overall approach than an actual solution, one which we refer to as collaborative modularity. This approach concerns how projects are developed with an eye towards future collaboration. In the past most linguistics projects that could be regarded as digital were developed "in-house". For example, consider the many on-line linguistic databases (e.g., WALS and various other typological efforts). Excellent resources as they are, very few can be searched within a common framework or be accessed with a common API.

A particularly instructive domain for illustrating the benefits of the principle of collaborative modularity is astronomy. For this field there are three distinct models for a cyberinfrastructure. The first is NASA's Extragalactic DB (http://nedwww.ipac.caltech.edu/), an older model that compiles numbers and names from the scientific literature. It uses a traditional database and presents data to the user in a traditional format. In most respects, this project is bottom-up and is well grounded in data that are actually written about. On the other hand, the project focuses on "legacy data" and only then attempts an a posteriori merging of data to achieve meaningful comparisons. The second is the Sloan Digital Sky Survey (http://www.sdss.org/), a more top-down approach which compiles data from (more or less) a single observing instrument. What is interesting about this project is that it is managed, complete with a scientific council and spokesperson, a governing body, and a publication policy. The project by its very nature encourages interoperation, with more than 150 scientists having contributed to this common cause. The pay-off for contributing data includes wide dissemination of the data and the benefit of the many tools usable with the available datasets. Finally, there is the public-facing project Google Sky (http://www.google.com/sky/), put together by Google, astronomers and the University of Washington. It provides a very user-friendly, visually pleasing, and scientifically accurate way to access data from world-class observatories, including the Hubble Space Telescope, the GALEX space-based ultraviolet observatory, the Chandra space-based X-ray observatory, and others. It is an exemplar in display, but what makes it extraordinary is the amount of data that are made interoperable to achieve a single, unified effect.

The state of things for linguistics probably resembles the Extragalactic DB, with some sub-disciplines within linguistics having organizational structures like that of the Sloan project. While each approach has its merits, our field could greatly benefit from a project such as Google Sky. In the end, tailoring data for a certain project is not the best model for collaborative data. Instead, we suggest keeping the overall idea of data sharing in mind when projects are planned. For the developer on the ground, this means using the latest technology and practices in the most forward-looking way possible, even at the expense of time and other resources.

5. Internet Technology and Tools

This section covers specific Internet technology and software that we think will serve to enhance how a cyberinfrastructure for linguistics is created and maintained. First we cover trends and practices, ideas or technologies that are not embodied in any one specific tool. Second we turn to specific pieces of software that are useful for either adoption or comparison.

5.1 Internet trends and practices

By Internet "trends and practices" we refer not to specific tools (cf. the following section) but to general architectures and methods currently in use or becoming more popular. We consider these as best practices for handling data on the Web, and they include: Linked Data, cloud computing, and Web services.
We begin with Linked Data (http://linkeddata.org). Consider that the purpose of encoding data is to make every assumption explicit, especially regarding the meaning of annotations. An implementation must meet two criteria: (1) uniqueness, both of annotation elements and of data content, and (2) the ability to place such information on the Web and to link among individual data. Uniqueness refers to the identifiability of an individual datum at any level of granularity. That is, a single phonetic feature instance should be as uniquely identifiable as an entire text. We want the Web to be both the medium of data storage and the means of linking among those data. Uniqueness is achieved by using Uniform Resource Identifiers (URIs) for all data (Berners-Lee et al., 1998). A URI is a reference scheme that can be used to refer to anything whatsoever, from documents on the Web to actual physical entities (e.g., my horse) or even abstractions (e.g., world peace). As a naming scheme, the URIs used for Linked Data follow the familiar http scheme of the Web. For instance, the following URIs might refer to linguistic constructs:

http://linguistics.org/PluralNumber
http://linguistics.org/Lexeme
http://linguistics.org/LinguisticFeature

Note that the referent of a URI does not need to actually be located "on the Web". The following URIs are perfectly valid identifiers (for the obvious referents):

http://solar-system.org/Mars
http://people.com/JohnSmith
http://ideas.co.uk/WorldPeace

URIs are only the first requirement for the implementation. The second is a way to link the data. For this we use the Resource Description Framework (RDF) (Lassila and Swick, 1999) which is basically a graph model with a serialisation (physical representation) amenable to the structure of the Web. Within an RDF graph, we require that graph nodes and arcs all have URIs. Thus, every node and arc is identifiable and definable. The ontological framework is built atop the RDF/URI system and statements are made using URIs as predicate and argument. The basic element of an RDF graph is the triple which is of the form subject-predicate-object. Each concept, role and individual has a URI and occupies a node (or arc) in the RDF graph. The entire enterprise of placing data on the Web using URIs and RDF is known as the Linked Data approach (Berners-Lee, 2006).
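The triple structure described above can be sketched without any special library. In the following Python example, the two W3C vocabulary URIs (rdf:type and rdfs:subClassOf) are real, while everything under http://linguistics.org/ and http://example.org/ is illustrative, in the style of the example URIs given earlier, and is not assumed to resolve. The hasFeature predicate and the subclass relation are likewise hypothetical. The serialisation shown is N-Triples, one of the simpler RDF syntaxes.

```python
# Real W3C vocabulary URIs.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# Illustrative namespaces; none of these URIs is assumed to resolve.
LING = "http://linguistics.org/"
DATA = "http://example.org/data/"

# Each triple is (subject, predicate, object); every node and arc is a URI.
triples = [
    (LING + "PluralNumber", RDFS_SUBCLASS_OF, LING + "LinguisticFeature"),
    (DATA + "feature1", RDF_TYPE, LING + "PluralNumber"),
    (DATA + "lexeme1", RDF_TYPE, LING + "Lexeme"),
    (DATA + "lexeme1", LING + "hasFeature", DATA + "feature1"),
]


def to_ntriples(graph):
    """Serialise triples in N-Triples syntax: one '<s> <p> <o> .' per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in graph)


def objects(graph, subject, predicate):
    """Follow one arc type out of a node and return the nodes it reaches."""
    return [o for s, p, o in graph if s == subject and p == predicate]


print(to_ntriples(triples))
print(objects(triples, DATA + "lexeme1", LING + "hasFeature"))
```

Because every node and arc is a URI, triples published by different projects can be merged simply by concatenating their lists; shared URIs then act as the join points.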

We now turn to another Web technology that has recently grown in significance, that of cloud computing. Cloud computing utilizes the resources provided by Internet servers in order to process, maintain, and share data in a non-localized manner. As the landscape of computing becomes more diverse, with each platform having its own idiosyncratic advantages and limitations, many application developers are turning to the Internet as a unified platform accessible from any web browser. The benefits of this transition from local to cloud computing are numerous, and many of them correspond to key features that a linguistics cyberinfrastructure requires in order to be successful.
  • Stable repository: Data stored on a local machine require their owner to be diligent in maintaining a back-up plan, though in reality many data owners have no back-up plan to begin with. Storing data in the cloud frees data owners from this responsibility, as they are able to rely on technicians skilled in preserving data.
  • Computational resources: The problems involved with the diversity of hardware platforms can be alleviated by relying on the cloud or a server farm for computational power. This effectively frees the user's platform, be it a personal computer or a mobile device, from the burden of supplying the resources that an application for storing, retrieving and analyzing data requires.
  • Shared data: Cloud computing, by its nature, provides access to data that may have been produced by a variety of sources. In comparison to localized computing, cloud computing presents a significantly lower barrier to collaboration, as analysis of data is not limited to one's own local repository.
In order to develop software that fully utilizes the paradigm of cloud computing and a collaborative linguistics cyberinfrastructure, principles derived from the W3C's definition of web services need to be accounted for in the design of our systems. What this means is that projects shouldn't perceive themselves as segregated resources, but rather as modular components which are able to communicate with other web services for editing, storing, analyzing and displaying data. A web service uses web-related standards and approaches such as SOAP (a messaging protocol) and REST (an architectural style) to support interoperable machine-to-machine communication over a network. This machine-to-machine communication allows data from an originating repository to have a life of its own beyond the scope of an individual project's charge. Though a project's responsibilities may require the development of many kinds of services for editing, storing, analyzing and displaying data as a proof-of-concept blueprint, the communication interfaces between these components should be accessible so that the linguistic community may develop other web services that extend the use of its data. Each service may be exchanged for another in order to provide functionality not conceived of in the original project's design. Thus, each web service translates into added value.
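The notion that one service may be exchanged for another behind a shared interface can be sketched as follows. This Python example is a deliberately simplified stand-in: the two services, the JSON message layout and the in-memory repository are all hypothetical, and in a real deployment the messages would travel over HTTP rather than through direct function calls.

```python
import json
from typing import Protocol


class Analyzer(Protocol):
    """Interface any analysis service must offer: JSON text in, JSON text out."""

    def analyze(self, request: str) -> str: ...


class TokenCounter:
    """One hypothetical service: counts whitespace-separated tokens."""

    def analyze(self, request: str) -> str:
        text = json.loads(request)["text"]
        return json.dumps({"service": "token-count", "result": len(text.split())})


class TypeCounter:
    """A replacement service: counts distinct (lower-cased) word types."""

    def analyze(self, request: str) -> str:
        text = json.loads(request)["text"]
        types = {word.lower() for word in text.split()}
        return json.dumps({"service": "type-count", "result": len(types)})


def run(repository: dict, record_id: str, service: Analyzer) -> dict:
    """Fetch a record and hand it to whichever service is plugged in."""
    request = json.dumps({"id": record_id, "text": repository[record_id]})
    return json.loads(service.analyze(request))


# A stand-in for a project's data store; in practice this would sit behind HTTP.
repository = {"text-1": "the dog saw The cat"}

print(run(repository, "text-1", TokenCounter()))
print(run(repository, "text-1", TypeCounter()))
```

The repository never needs to know which analysis is applied to its data. As long as the message format is agreed upon, a third party can contribute a new service without touching the originating project's code.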

5.2 Sources of inspiration (pieces of software)

Here we summarize specific software that we found to be sources of inspiration for linguistics. The first is the Pangaea search tool for geoscience (http://pangaea.de). It is both a repository and an exemplar of how datasets could be cited. Pangaea was already mentioned with respect to DOIs allowing for unambiguous citation. We could compare this project to the efforts of the Open Language Archives Community (OLAC) (Simons and Bird 2003).

Next is the Freebase project (http://www.freebase.com/), an open, free community model for literally any kind of data. The database resides "in the cloud" and currently contains more than 6 million topics with accompanying facts. Anyone who has data can add to the database using its free and easy-to-use API. Data providers are encouraged to link their own data to existing resources when possible. We could compare this project to the aims of e-Linguistics (http://purl.org/linguistics/e-linguistics), namely providing ways to gather and merge disparate descriptive data in linguistics.

Turning to educational materials, a resource that excels both at providing them and at serving as a portal for its field is Nanohub (http://nanohub.org/), a website for nanoscience and technology. The site is a state-of-the-art exemplar of how an individual field can muster its resources to present an inviting and educationally friendly portal to its science. The closest resource we have in linguistics is perhaps the Linguist List (http://linguistlist.org), where researchers can share and discuss ideas (and data).

Next is OpenWetWare, an information and practices resource (in fact a wiki) for the dissemination and promotion of know-how and wisdom among researchers and groups who are working in biology and biological engineering. The wiki is impressive because its experiment protocols and courses are all found at a single location. In linguistics, we could compare the Glottopedia project (http://www.glottopedia.de) to this effort.

Finally, a tool that we found particularly compelling is ManyEyes (http://manyeyes.alphaworks.ibm.com/manyeyes/), which is used for the visualization of all kinds of data. The site provides many open datasets, but the main function of the service is that contributors can design and post novel visualization tools for viewing the same dataset, all according to an API.

6. Summary and Open Questions

We have presented the results of our Cyberling 2009 working group's discussion on how a cyberinfrastructure can be achieved by looking at exemplars from other fields. We chose to explore the best facets of cyberinfrastructure from various fields and began with a few illustrations of what they have achieved. By "related fields" we simply mean those with data, and we found the hard sciences (biology, astronomy, etc.) and various Web movements (Linked Data, cloud computing and web services) to possess the most compelling facets. We organized these facets in terms of the following categories:
  • Data handling practices
  • Collaborative/organizational structure
  • Internet technology and tools
In terms of data handling we found that successful fields did not rely on a central data store, but instead adhered to a distributed model. Success was also seen when adequate conceptualizations (terminologies and ontologies) were employed. Much more than just "nice" for presentations, we found flexible visualization of data to be key. Data should be as open as possible, but service providers should be sensitive to access restrictions where required. Finally, all data should be easy to cite, using DOI or URI systems.
For collaboration/organizational structure, we expounded our main philosophy of collaborative modularity. This approach concerns how the project ought to be developed, namely, with an eye towards future collaboration. We contrasted a few past and current projects (those developed for "in-house" use) with more advanced projects that seek interoperation from the very beginning.

These common threads were found in a number of Internet technologies and tools from which we gained much inspiration. First, we gave an overview of the Linked Data movement, cloud computing and Web services. Finally, we presented key tools that we found to be sources of inspiration for linguistics: Pangaea, Freebase, ManyEyes, just to name a few.

References Cited

Berners-Lee, T., Fielding, R., and Masinter, L. (1998, Aug). Uniform resource identifiers (URI): Generic syntax. Technical Report RFC 2396, IETF (Internet Engineering Task Force).

Farrar, Scott and Langendoen, D. Terence (2003). A linguistic ontology for the Semantic Web. GLOT International, 7(3), 97–100.

Lassila, Ora and Swick, Ralph R. (1999, Feb). Resource Description Framework (RDF) model and syntax specification. Recommendation, W3C. http://www.w3.org/TR/REC-rdf-syntax/.

Simons, Gary and Bird, Steven (2003). The open language archives community: An infrastructure for distributed archiving of language resources. Literary and Linguistic Computing, 18, 117–128.



Latest page update: made by TerryLangendoen, Aug 18 2009, 3:15 PM EDT