Version User Scope of changes
Aug 31 2009, 10:01 AM EDT Koenraad 1415 words added, 1 word deleted
Aug 31 2009, 9:29 AM EDT Koenraad 296 words added, 141 words deleted

Data Reliability and Provenance


Peter Austin (co-chair)
Martin Haspelmath (co-chair)
Kurt Bollacker
Tracy Holloway King
Koenraad de Smedt
Paul Trilsbeek

Overview


In this paper we discuss what data provenance and data reliability are, with special attention to the needs of the linguistics community when moving from simple data sets to more complex cyberinfrastructures. The integrity and completeness of provenance information as a basis for citation, rights management, etc. is crucial for those who record, annotate and compile materials, but also for the sources of raw materials, in particular for indigenous and minority communities, and for the scholars who use the materials as a basis for further scientific work. We suggest some first steps to promote data sharing and publication in the linguistics community.

Data Provenance


Provenance is the who, what, and when of metadata.

When a data set is created, it is important to know where it comes from and who is responsible for its publication. Adequate information about provenance, i.e. information about how, where, when and by whom the data was collected, encoded, annotated and documented, and who assumes responsibility for its publication, allows the quality of the data set to be assessed, provides a contact in case of questions, and establishes authorship of the data set, similar to the authorship of an academic paper.

The contents and status of a data set are not always clear from a quick inspection of the data. The data might come from native-speaker informants, from published works of literature, from the web, etc. Adequate metadata are needed for cataloguing data sets so as to make them searchable, and to ensure that the data will be used in an appropriate way. Metadata should include a detailed description of the data set, including provenance information. In some cases provenance information should even be provided for separate sections of the data set wherever differences are relevant. For example, the data may have been collected over several years of field work, and knowing when each section was collected, by whom and under which circumstances may be important.
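As a minimal sketch of what section-level provenance could look like in practice, the following Python fragment models a data set whose sections either carry their own provenance or inherit the default of the data set. All class names, field names and sample values are hypothetical and chosen for illustration only; they do not correspond to an existing metadata standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Provenance:
    """Who, how, where and when for a data set or one of its sections."""
    collected_by: str
    collected_when: str   # free-form date or date range
    collected_where: str
    method: str           # e.g. "elicitation", "web crawl"


@dataclass
class Section:
    name: str
    provenance: Optional[Provenance] = None  # None: inherit from the data set


@dataclass
class DataSet:
    title: str
    responsible: str      # who assumes responsibility for publication
    provenance: Provenance
    sections: List[Section] = field(default_factory=list)

    def provenance_of(self, section_name: str) -> Provenance:
        """Return the section's own provenance, else the data set default."""
        for section in self.sections:
            if section.name == section_name:
                return section.provenance or self.provenance
        raise KeyError(section_name)


# Hypothetical field-work corpus collected over several years.
corpus = DataSet(
    title="Sample Field Corpus",
    responsible="A. Example",
    provenance=Provenance("A. Example", "2005", "Village X", "elicitation"),
    sections=[
        Section("2005 word lists"),  # inherits the default provenance
        Section("2007 recordings",
                Provenance("B. Example", "2007", "Village Y", "spontaneous speech")),
    ],
)
```

Keeping a default at the data set level means that provenance only has to be repeated for the sections where it actually differs.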

Provenance is extremely important to the scientific community which uses data sets as the basis of further analysis, hypothesis testing, etc. For the user, provenance is crucial for assessing whether, and to what extent, a data set can form an appropriate basis for subsequent scientific work. Knowing how the data was collected and processed and who was responsible for it allows future users of the data set to judge the quality of the data, how it fits in with their research and how it compares to other data sets. Provenance is also important for the replicability of results. Unless it is clear where the data came from and how it was created, data sets cannot be replicated. For example, if a data set has constituency trees annotated over it, it is important to know whether the trees were manually constructed, created automatically, or bootstrapped by manually correcting automatically constructed trees.

For the creators, provenance provides a way to get credit for the scientific work done in creating the data sets and allows the community to cite the data sets with their authors in academic works. Creating a high quality data set is extremely time consuming and requires highly skilled linguists and so it is important that those doing this work get credit for their contribution. Provenance is also important in establishing and maintaining privacy rights for those who provided the data. There are many reasons that data may not be appropriate to publish freely in its entirety, e.g. individuals may be identifiable from videos or from the content of discussions, or private rituals may be recorded. In particular the rights of indigenous and minority communities should be respected and acknowledged.

As part of provenance, in recording the who, what, and when of metadata, it is necessary to have trusted identification of individuals, organizations, and services. This identification needs to persist over time so that decades after a data set is created, it is still possible to determine who created it and how it was created.

Achieving reliable provenance through publication


A major question is how to achieve reliable data provenance in the linguistic community and how to promote the sharing of data. Creating data sets is a time-consuming, highly skilled task, and the scientific community as a whole needs to acknowledge this and to provide support for those working on creating and maintaining data sets. Individuals contributing to and creating these data sets need to get institutional credit for data publication. For example, data publications should count for tenure reviews and other review processes and should be an integral part of grant applications, both in having data set publication be part of the grant work and in the granting agencies favoring researchers who publish curated data sets, just as they favor those with a proven track record of traditional academic publications.

Given the nature of the linguistics community and the working paradigms that its members are used to, we suggest promoting curated data sets as publications. The technology is currently available to treat curated data as publication. There is extensive archival work on linguistic data sets, including the work done by organizations such as the Linguistic Data Consortium. There are also examples from other scientific fields where publication of data collections has become an established scientific practice. However, there needs to be extensive institutional and social engagement in order for curated data as publication to become the norm in the linguistics and language studies communities.

Researchers need to be encouraged to publish curated data sets. This can be done in part by requiring it at the institutional level or as part of a grant award. It can also be aided by providing more infrastructure to publish data sets, including providing information on best practices and on how to access the necessary technologies. In addition, researchers need to cite the published data sets that they used in their research. Reviewers and publishers of articles and books should reject submissions if they do not cite the data sources they used in their research. These citations should be standardized by using the metadata, and the publishers of the curated data sets should facilitate this by providing information as to how the data set should be cited.
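To illustrate how a citation could be derived mechanically from metadata, here is a small Python sketch. The metadata keys, the citation layout and the sample values (including the identifier) are assumptions made for this example; a real publisher would define its own fields and citation style.

```python
def cite_as(metadata: dict) -> str:
    """Build a citation string from a (hypothetical) metadata record."""
    creators = metadata["creators"]
    if len(creators) > 2:
        names = creators[0] + " et al."
    else:
        names = " and ".join(creators)

    title = metadata["title"]
    if metadata.get("version"):
        # Citing the version makes clear which edition of the data was used.
        title += f", version {metadata['version']}"

    return " ".join([
        f"{names} ({metadata['year']}).",
        f"{title}.",
        f"{metadata['publisher']}.",
        metadata["identifier"],
    ])


# Hypothetical record; none of these values refer to a real data set.
record = {
    "creators": ["A. Example", "B. Example"],
    "year": 2009,
    "title": "Sample Word List",
    "version": "1.1",
    "publisher": "Example Data Archive",
    "identifier": "hdl:0000/example",
}
citation = cite_as(record)
```

Because the string is generated from the same metadata that describes the data set, every user who cites the data set produces the same reference.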

The linguistic community will also need to provide support to ensure annotation and data quality control, just as it does with published academic papers. Not all data collection and annotation is done equally well and is of equal value. The community needs to have a way to recognize and acknowledge this, similar to the relative value of different book publishers, journals, and conference proceedings.

Publication of data through a recognized publication channel with an ISSN would make it easier to cite the data uniformly and correctly, would clearly establish authorship, would allow for different editions, would enforce the use of standards and metadata and would be a catalyst for giving academic credit to the makers. It would also promote reviews and rating systems.

In a different field, the data journal Earth System Science Data promotes the rapid publication of research on original data sets. Their policy is outlined as follows:
"Articles in the data section may pertain to the planning, instrumentation and execution of experiments or collection of data. Any interpretation of data is outside the scope of regular articles."
"In the first stage, papers that pass a rapid access peer-review are immediately published on the Earth System Science Data Discussions (ESSDD) website. They are then subject to Interactive Public Discussion, during which the referees' comments (anonymous or attributed), additional short comments by other members of the scientific community (attributed) and the authors' replies are also published in ESSDD. In the second stage, the peer-review process is completed and, if accepted, the final revised papers are published in ESSD. To ensure publication precedence for authors, and to provide a lasting record of scientific discussion, ESSDD and ESSD are both ISSN-registered, permanently archived and fully citable."

However, articles in ESSD seem to contain the data mostly within the articles themselves. ESSD is not the same as a cyberinfrastructure. The publication is meant to announce to the community that data has been made available and how it was obtained and annotated, and also to give credit where it is due. The actual data could be accessed in a variety of ways, not necessarily through the same channel.

One could imagine organizations acting as the publishers of articles with their data sets. Publishers of data will have a responsibility to check at least the formal aspects of published data, such as proper use of metadata, adherence to standards, etc. The same data might be published at different hosts. We would need buy-in from institutions and linguists. Publishers would themselves be rated and would need to actively advertise their data publications and make them attractive to researchers.

Since publication is by nature public, there are some issues with restricted data. Full metadata could be published, but the metadata could stipulate restrictions on the accessibility of the actual data (e.g. due to proprietary data or privacy issues).

There are also issues with editing and peer review. Language resources might be published unedited first, with reviews added later, where annotation might count as a type of review; by having the language data out, even in unedited form, people would be encouraged to annotate it. Some academic credit systems, in countries like Norway, are based on criteria like peer reviewing and require recognized academic publishing channels.

Technical considerations


On the Internet, rights can be assigned to pieces of information associated with a handle. Handles are globally unique, persistent identifiers managed by organizations such as handle.net. The use of handles assures persistence which volatile URLs cannot offer. Even so, there are issues with trust and long-term maintenance. One might consider setting up a registration authority for linguistic data. Given that linguists change institutions and that URLs shift over time, it is important that future researchers be able to access the same data that is being used today and to be certain that this is the same data as was used by other researchers.

Every entity involved in data set creation can be identified by a unique handle. These include entities such as people, organizations, and their roles, the data sets and documents themselves, and different views and mashups of the data. Value added to existing material by annotations etc. can be managed by cascading handles with different rights. A proliferation of provenance information may increase the size of the information by orders of magnitude, but disk space is cheap, so provenance can be recorded at a very fine grain, e.g. for sound clips indicated by start and stop time in a speech corpus. Assigning full provenance information can however be complicated, e.g. when material is translated, when value is added to copyrighted material (e.g. the Wall Street Journal corpus) or when a speech corpus has a radio broadcast in the background.

The annotation layered on curated data is often done in conjunction with specific software. In addition, software is used to provide access to the data itself, including various reformattings and user-friendly views. It is important to preserve this software so that the annotations can be understood and recreated. In addition, as more software is made available, it can be used by other linguists in their work to enhance their data sets. Having software-as-a-service available to the linguistics community can aid in this process. It will be particularly valuable for institutions with less extensive computing infrastructure, allowing their researchers access to state of the art data set curation facilities.

In contrast to paper materials, which are static and pre-edited, a cyberarchive allows (or should allow) the user to participate in filtering and presenting information ("play editor yourself"). An example is the Wittgenstein archives at Bergen, which contains digitized manuscripts; the user can choose to include or exclude certain pieces of information and has options for visualization. How would you cite a specific view among many other possible views of this material? This could be done by generating a unique URI and handle for the transformed and formatted web page. An "I want to cite this" button would make this process easy for the user. Such dynamic aspects could help with collaboration, since they are a good way to bring things together at the presentation level. Tools for this are very useful because most linguists do not have the UI skills to build such views themselves; Rosetta is trying to provide some examples that let linguists deal with how to use and view the data.
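One possible way to obtain a stable identifier for a particular view is to derive it from the handle of the underlying document together with the display options the user selected. The Python sketch below does this with a hash over a canonical serialization; the function names, the option names, the sample handle and the base URI are all hypothetical, and a production system would additionally have to register the resulting identifier with a handle service.

```python
import hashlib
import json


def view_id(document_handle: str, options: dict) -> str:
    """Derive a short, reproducible identifier for one view of a document."""
    # Sorting the keys makes the identifier independent of the order in
    # which the user happened to select the display options.
    canonical = json.dumps(
        {"doc": document_handle, "options": options},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def citable_uri(document_handle: str, options: dict,
                base: str = "https://example.org/view/") -> str:
    """Return the URI an "I want to cite this" button could hand out."""
    return base + view_id(document_handle, options)
```

Two users who choose the same options for the same document receive the same URI, while any change in the options yields a different one. Truncating the hash keeps the URI short at the cost of a small collision risk, which a registry would need to check for.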

Rosetta, Freebase, Internet Archive etc. allow for mashups of data. Cyberinfrastructures should provide mashup functionality, i.e. the smart combination of data from various sources; Kurt and Laura have been working on this and have some demos of how one could display information together. It is challenging to reference mashup data since the mashup process is dynamic and every combination of specific versions of data produces a new mashup version. Also, some people might want to reference the particular rendering of a mashup. Referencing views could easily escalate when every user can have a personal view.

Identification of individuals, authorization and rights management


Reliable identification of individuals is an important prerequisite for at least two purposes:
  1. Identifying authors to give credit to data creation.
  2. Identifying users to provide authorization based on licenses, etc.
CLARIN has concluded that some system with global e-identities can solve many problems associated with the current situation of people having different usernames at the various sites they use. A single logon should identify the user in an easy way. This is a largely solved problem on a technical level, but various solutions are available and none are widely used by the community:
  1. OpenID
  2. Federations of local e-identity providers
Authorization is a tuple linking a set of rights, the identity of a piece of data and the identity of a user. Authorization can take various forms depending on restrictions, e.g. access can be given to anyone, can be based on email domains (i.e. affiliated institutions), can require the acceptance of a user license, etc., depending on what the author or stakeholder decides. There are many such restrictions, e.g. access may be given for non-commercial purposes only, access to sacred songs can be restricted to the initiated, or subparts of data may be proprietary and require additional agreements. There is a need to train people on what types of rights and access are appropriate. There may be privacy issues with sources such as patient data, sign language data, etc. which may place restrictions on availability.
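The tuple view of authorization can be sketched directly in code. In the following Python fragment, the class, the right names, the handles, the user identifiers and the wildcard convention for "anyone" are illustrative assumptions rather than part of any existing rights management system.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

ANYONE = "*"  # hypothetical wildcard meaning "any user"


@dataclass(frozen=True)
class Authorization:
    """A set of rights linked to one piece of data and one user."""
    rights: FrozenSet[str]
    data_id: str
    user_id: str


def is_allowed(grants: Iterable[Authorization],
               user_id: str, data_id: str, right: str) -> bool:
    """True if some grant gives this user (or anyone) the right on the data."""
    return any(
        g.data_id == data_id
        and g.user_id in (user_id, ANYONE)
        and right in g.rights
        for g in grants
    )


# Hypothetical grants: metadata readable by anyone, recordings only by one user.
grants = [
    Authorization(frozenset({"read"}), "hdl:0000/metadata", ANYONE),
    Authorization(frozenset({"read", "annotate"}), "hdl:0000/recordings", "user:alice"),
]
```

Restrictions such as license acceptance or institutional affiliation would be checked before a grant of this kind is issued; the tuple itself only records the outcome.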

Some language data, especially spontaneous speech and sign language, cannot be distributed due to privacy issues, in particular in utterances referring to people. This becomes very complex with international access, where different countries may have different rules for guarding privacy. This situation may require different country-specific licences, so international cooperation may need legal advice from the start.

Some technology exists to anonymize source materials, e.g. by manual or automatic masking, such as making non-words out of words while keeping their part of speech (POS). Media are harder to deal with than plain text, especially sign language, where much of the exact original data is needed; any encoding of a facial expression is going to be on the edge. Perhaps 3D models or avatars could be useful, but creating them is often not automatic and is sometimes not possible, e.g. in sign language.
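As a rough illustration of masking that keeps the POS information, the Python sketch below replaces selected tokens of a POS-tagged text with generated letter strings of the same length. The function names, the choice to mask only proper nouns by default and the tag label "PROPN" are assumptions made for this example; a generated string can by chance coincide with a real word, and this simple scheme is not a guarantee of anonymity.

```python
import hashlib
import string
from typing import FrozenSet, List, Tuple


def pseudo_word(word: str, salt: str) -> str:
    """Map a word to a lowercase letter string of the same length.

    The same word and salt always give the same result, so repeated
    mentions stay recognizable as the same (masked) item.
    """
    digest = hashlib.sha256((salt + word.lower()).encode("utf-8")).digest()
    letters = string.ascii_lowercase
    # The digest has 32 bytes; cycle through them for longer words.
    return "".join(letters[digest[i % len(digest)] % 26]
                   for i in range(len(word)))


def mask_tokens(tagged: List[Tuple[str, str]], salt: str,
                sensitive_tags: FrozenSet[str] = frozenset({"PROPN"}),
                ) -> List[Tuple[str, str]]:
    """Replace words carrying a sensitive tag, leaving every POS tag intact."""
    return [
        (pseudo_word(word, salt), tag) if tag in sensitive_tags else (word, tag)
        for word, tag in tagged
    ]
```

The salt should be kept private by the data holder, since without it the mapping cannot easily be rebuilt by simply hashing a list of candidate names.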

In DOBES, legal specialists advised keeping all data closed. A code of conduct and ethical rules may provide a workable solution, but some data may need to remain closed to all but core researchers in a tightly controlled project. One could use information from original institution waivers to guide what permissions to assign to the data. The CLARIN project has a working group on IPR and licensing issues. The European Parliament has expressed an interest in solving the complexity of the issues and may want to promote a revision of legislation, hopefully leading to wider availability of data for research purposes. Some lobbying towards legislators may be necessary.

A special situation may occur when some data is retracted, either for privacy reasons or because of errors or other reasons. One could mark data as deleted, invalid or superseded without actually destroying the data; one may also want to be able to temporarily restrict data for certain reasons.
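Marking data rather than destroying it can be modelled with a status field on each record. The following Python sketch is one way to do this; the status names, record fields and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    ACTIVE = "active"
    DELETED = "deleted"
    INVALID = "invalid"
    SUPERSEDED = "superseded"
    RESTRICTED = "restricted"  # temporary restriction


@dataclass
class Record:
    handle: str
    content: bytes
    status: Status = Status.ACTIVE
    reason: str = ""
    superseded_by: Optional[str] = None


def retract(record: Record, status: Status, reason: str,
            superseded_by: Optional[str] = None) -> None:
    """Change the status of a record; the stored content is left untouched."""
    record.status = status
    record.reason = reason
    record.superseded_by = superseded_by


def resolve(record: Record) -> dict:
    """Return what a user sees when following the record's handle."""
    if record.status is Status.ACTIVE:
        return {"handle": record.handle, "status": "active",
                "content": record.content}
    # Non-active records resolve to a notice instead of the data itself.
    return {"handle": record.handle, "status": record.status.value,
            "reason": record.reason, "superseded_by": record.superseded_by}
```

Because the handle still resolves after a retraction, existing citations do not break: they lead to a notice that explains the status and, where applicable, points to the superseding version.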

Researchers are often unwilling to turn over their data for storage and distribution in repositories. One reason is that some people feel their data is not ready yet: once data is in a repository they feel it is cast in stone, which is a real problem for data which is never quite complete such as a dictionary of a living language. It is therefore crucial that repositories offer versioning and updating of the stored materials. Some researchers might prefer to control distribution themselves from their own homepage. A possible solution could be that links or webpages could be generated from the repository automatically. This could also be useful for university administration which has to document research production. A soft approach to repositories could help lead people to understand what a data repository is and what it can start to enable.

Reliability


Good data provenance is useless without data reliability. At its very basic level, this means preservation of the bits so that the data itself does not sit on one machine which, if it failed, would result in the loss of that data (similarly, keeping the data locked on paper in a file cabinet is subject to the same problems). The archival institutions have established best practices for this basic level of data reliability.
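At the level of bit preservation, a common technique is to record a checksum when the data is deposited and to compare every stored copy against it at regular intervals. The Python sketch below shows the idea using SHA-256 from the standard library; the function names and the representation of copies as in-memory byte strings are simplifying assumptions, and this is not meant to describe the procedure of any particular archive.

```python
import hashlib
from typing import Dict, List


def checksum(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def damaged_copies(recorded: str, copies: Dict[str, bytes]) -> List[str]:
    """Return the names of all copies whose checksum differs from the record."""
    return sorted(name for name, data in copies.items()
                  if checksum(data) != recorded)
```

A copy that is reported as damaged can then be replaced from one of the intact copies, which is only possible if the data is stored in more than one place.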

Data reliability also involves the access to and use of the data, including issues of privacy. The data repository should state what types of people have access to the data and record who has accessed it, whether it be by downloading the data or browsing through the data in an archive. In addition, the repository must state how the data can be used, e.g. whether it can be republished or serve as the basis for commercial systems. By providing this information and by recording who has accessed the data, it becomes possible to detect violations of the use and privacy restrictions and to enforce those restrictions.

Comprehensibility is key to data reliability. All data must be tagged with the appropriate metadata and linked to its documentation. This allows researchers to understand what is, and is not, included in the data set and its annotation and to correctly cite the data set, thereby acknowledging the work of the data set creators and allowing future researchers to reproduce their findings. Metadata and documentation are facilitated by strict adherence to standards and best practices. As students may not be familiar with these, more senior researchers and experts in natural language data curation need to share their knowledge and to facilitate students' adherence to the accepted standards. As a very simple example of this, there are certain file formats which are standardly used for different types of data: each data set should record what formats it is using and the creator should ensure that these are ones that are used by the field.

Suggested First Steps


Curated data as publication and the corresponding data provenance and reliability could involve a major, long-term infrastructure project for the linguistic community. However, we would like to suggest a few simple first steps that the community as a whole can take, including workshops of this type. Our basic suggestion is to provide both carrots and sticks to the community and to pursue proactive education in data set publication. Although some linguists will become major contributors to curated data sets while others will play a more minor role, all linguists should understand and appreciate their importance: no linguist left behind.

There are several ways to encourage linguists to publish curated data. First of all, mentors need to publish and share their data sets as a model for the next generation. Encouraging students is not enough since the actions of successful researchers speak much more loudly than words: if senior members of the field are not publishing data, then newly minted PhDs will not do so. Those who do publish their data should provide a simple "cite as" button with the data to make it easy for those using the data to cite it in their works. This is particularly necessary since many linguists are unsure of the proper way in which to cite such data sets. Finally, it would be good to provide a service for data structure and integrity validation and for format conversion. Data structure and integrity validation can, at a minimum, check that the relevant metadata is in place before publication. As a more complex process, it can check over the data structure to be sure that it is correct and complete, e.g. a tab-delimited file should have tabs between fields. A format conversion service would allow data set creators to publish their data sets in a variety of formats, both making it easier for researchers to use them and helping ensure the long-term survival of the data as formats change.
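The two validation levels mentioned above can be sketched in a few lines of Python. The list of required metadata fields and the function names are assumptions for illustration; an actual service would check against whatever metadata standard the publisher has adopted.

```python
from typing import List

# Hypothetical minimum set of metadata fields required before publication.
REQUIRED_FIELDS = ("title", "creator", "date", "format")


def missing_metadata(metadata: dict) -> List[str]:
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS
            if not str(metadata.get(name, "")).strip()]


def malformed_rows(text: str, expected_fields: int) -> List[int]:
    """Return the 1-based numbers of lines with the wrong number of fields.

    Empty lines are skipped; every other line of a tab-delimited file is
    expected to contain exactly `expected_fields` tab-separated fields.
    """
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line and len(line.split("\t")) != expected_fields:
            bad.append(number)
    return bad
```

A publication service could refuse a submission, or return it to the creator with a report, whenever either function returns a non-empty list.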

There are also some institutional safeguards that can easily be put into place to ensure that data publication occurs and is done properly. First, publishers, editors, and reviewers can require provenance information for all data sources before agreeing to publish research work. Second, editors and funding agencies can encourage data sets to be published. This is already starting but should be further encouraged. Also, the funding agencies should ensure that the data is in fact published if it was part of the grant and if possible should publicize that the data is now available. If grantees do not publish their data as promised, the funding agency could either withhold any future grants to the grantee and its institution or withhold the final part of the grant until the data is published, similar to how some dissertation fellowships reserve part of the grant until the signed dissertation is submitted.