Data Reliability and Provenance
Peter Austin (co-chair)
Martin Haspelmath (co-chair)
Kurt Bollacker
Tracy Holloway King
Koenraad de Smedt
Paul Trilsbeek
Overview
In this paper we discuss what data provenance and data reliability are, with special attention to the needs of the linguistics community. We then suggest some first steps to promote data sharing and publication in the linguistics community.
Data Provenance
Provenance is the who, what, and when of metadata.
When a data set is created, it is important to know who contributed to it and who is responsible for it. Contributors include both those who collected the data and those who provided it: the data might come from native-speaker informants, from published works of literature, from the web, etc. Knowing this allows the quality of the data set to be assessed and compared with that of other data sets. The responsible party provides a contact in case of questions about the data set and serves as the author of the data set, similar to the author of an academic paper.
The contents of a data set are not always clear from a quick inspection of the data, so the data set needs to contain information about what it is. An obvious example is to report the language of the data set and its mode, e.g. video or audio recordings of spoken language, or highly edited written language. Any additional annotations provided with the data set, such as transcriptions of spoken language, translations, part-of-speech tagging, and morphological analysis, need to be described, along with who provided them. Technical information about the recording medium and data format is also necessary, e.g. what type of media was used to record and store the recordings, and whether the annotations are in tab-delimited format.
Finally, provenance must include when the data set was created, ideally with dates for the collection as a whole as well as for the different parts of it. For example, the data may have been collected over several years of field work and knowing when each section was collected may be important.
As part of provenance, in recording the who, what, and when of metadata, it is necessary to have trusted identification of individuals, organizations, and services. This identification needs to persist over time so that decades after a data set is created, it is still possible to determine who created it and how it was created.
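As an illustration of the who, what, and when described above, a provenance record could be sketched as a small data structure. This is only a minimal sketch: the class names, the field names, and the missing_fields check are our own assumptions for illustration and do not correspond to any existing metadata standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Contributor:
    """The 'who': a person or organization involved in the data set."""
    name: str
    role: str  # hypothetical values, e.g. "collector", "speaker", "annotator"
    identifier: Optional[str] = None  # placeholder for a persistent identifier


@dataclass
class ProvenanceRecord:
    """Hypothetical who/what/when record for one data set."""
    title: str
    language: str                    # the 'what': language of the data
    mode: str                        # e.g. "audio", "video", "written"
    data_format: str                 # e.g. "WAV", "tab-delimited text"
    responsible_party: Contributor   # the contact and "author" of the data set
    contributors: List[Contributor] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)
    collected_from: Optional[date] = None  # the 'when': start of collection
    collected_to: Optional[date] = None    # the 'when': end of collection

    def missing_fields(self) -> List[str]:
        """Return the names of who/when fields that are still empty."""
        missing = []
        if not self.contributors:
            missing.append("contributors")
        if self.collected_from is None:
            missing.append("collected_from")
        if self.collected_to is None:
            missing.append("collected_to")
        return missing
```

A record of this kind could be filled in incrementally during field work, with missing_fields used as a reminder of what still needs to be documented before the data set is deposited.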
Provenance is extremely important to the linguistic community which uses natural language data sets as the basis of all of its work. Provenance provides a way to assign credit for the creation of data sets and to cite those data sets in academic works. Creating a high quality data set is extremely time consuming and requires a highly skilled linguist and so it is important that those doing this work get credit for their contribution. Provenance is also important in establishing and maintaining privacy rights for those who provided the data. There are many reasons that data may not be appropriate to publish freely in its entirety, e.g. individuals may be identifiable from videos or from the content of discussions, private rituals may be described. Provenance also makes it possible to judge the value of the data. Knowing how the data was collected and processed and who was responsible for it allows future users of the data set to judge the quality of the data and how it fits in with their research and with their own and other data sets. Finally, provenance is important for the replicability of results. Unless it is clear where the data came from and how it was created, information is lost and the data set cannot be replicated. For example, if a data set has constituency trees annotated over it, it is important to know whether the trees were manually constructed, created automatically, or bootstrapped by manually correcting automatically constructed trees.
How to Achieve Provenance
A major question is how to achieve reliable data provenance in the linguistic community. Creating data sets is a time-consuming, highly skilled task, and the community as a whole needs to acknowledge this and to provide support for those working on creating and maintaining data sets. Given the nature of the linguistics community and the working paradigms that its members are used to, we suggest promoting curated data sets as publication.
The technology is currently available to treat curated data as publication. There is extensive archival work on linguistic data sets, including the work done by organizations such as the Linguistic Data Consortium.
However, there needs to be extensive institutional and social engagement in order for curated data as publication to become the norm in the linguistic community.
Individuals contributing to and creating these data sets need to get institutional credit for data publication. For example, these should count for tenure reviews and other review processes and should be an integral part of grant applications, both in having data set publication be part of the grant work and in the granting agencies favoring researchers who publish curated data sets, just as they favor those with a proven track record of traditional academic publications.
Researchers need to be encouraged to publish curated data sets. This can be done in part by requiring it at the institutional level or as a condition of a grant award. It can also be aided by providing more infrastructure to publish data sets, including providing information on best practices and on how to access the necessary technologies. In addition, researchers need to cite the published data sets that they used in their research. Reviewers and publishers of articles and books should reject submissions that do not cite the data sources used in the research. These citations should be standardized by using the metadata, and the publishers of the curated data sets should facilitate this by providing information as to exactly how the data set should be cited.
The linguistic community will also need to provide support to ensure annotation and data quality control, just as it does with published academic papers. Not all data collection and annotation is done equally well or is of equal value. The community needs to have a way to recognize and acknowledge this, similar to the relative value of different book publishers, journals, and conference proceedings.
One concrete way to help with the establishment and maintenance of provenance is to use handles. Handles are globally unique, persistent identifiers for various entities involved in data set creation. These include entities such as people, organizations, and their roles, the data sets and documents themselves, and different views and mashups of the data. Given that linguists change institutions and that URLs shift over time, it is important that future researchers be able to access the same data that is being used today and to be certain that this is the same data as was used by other researchers.
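The idea behind persistent identifiers can be illustrated with a toy registry in which the identifier stays fixed while the location it resolves to may change. This is a minimal in-memory sketch under our own assumptions: the class IdentifierRegistry and its methods are hypothetical, and it does not use or imitate the interface of any actual handle service.

```python
class IdentifierRegistry:
    """Toy registry mapping persistent identifiers to current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, location):
        # An identifier is assigned once and never reused for another entity.
        if identifier in self._locations:
            raise ValueError(f"identifier already registered: {identifier}")
        self._locations[identifier] = location

    def relocate(self, identifier, new_location):
        # The data moves (e.g. its creator changes institutions),
        # but the identifier cited in publications stays the same.
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_location

    def resolve(self, identifier):
        # Look up where the identified entity currently lives.
        return self._locations[identifier]
```

A citation would then record only the identifier, and a reader decades later would resolve it to wherever the data set is currently hosted.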
The annotation layered on curated data is often done in conjunction with specific software. In addition, software is used to provide access to the data itself, including various reformattings and user-friendly views. It is important to preserve this software so that the annotations can be understood and recreated. In addition, as more software is made available, it can be used by other linguists in their work to enhance their data sets. Having software-as-a-service available to the linguistics community can aid in this process. It will be particularly valuable for institutions with less extensive computing infrastructure, allowing their researchers access to state of the art data set curation facilities.
Reliability
Good data provenance is useless without data reliability. At its most basic level, this means preservation of the bits, so that the data does not sit on a single machine whose failure would result in the loss of that data (keeping the data locked on paper in a file cabinet is subject to the same problem). The archival institutions have established best practices for this basic level of data reliability.
Data reliability also involves the access to and use of the data, including issues of privacy. The data repository should state which categories of users have access to the data and record who has accessed it, whether by downloading the data or by browsing through the data in an archive. In addition, the data set must state how it can be used, e.g. whether it can be republished or serve as the basis for commercial systems. By providing this information and by recording who has accessed the data, it is possible to detect and respond to violations of the conditions on use and privacy.
Comprehensibility is key to data reliability. All data must be tagged with the appropriate metadata and linked to its documentation. This allows researchers to understand what is, and is not, included in the data set and its annotation and to correctly cite the data set, thereby acknowledging the work of the data set creators and allowing future researchers to reproduce their findings. Metadata and documentation are facilitated by strict adherence to standards and best practices. As students may not be familiar with these, more senior researchers and experts in natural language data curation need to share their knowledge and to facilitate students' adherence to the accepted standards. As a very simple example of this, there are certain file formats which are standardly used for different types of data: each data set should record what formats it is using, and the creator should ensure that these are ones that are used by the field.
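One common way to check that the bits have been preserved across several copies is to compare checksums. The following is a minimal sketch using SHA-256 from the Python standard library; the function names sha256_of_file and copies_match are our own and are not taken from any archive's tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths):
    """Return True if every copy in a non-empty list has the same checksum."""
    checksums = {sha256_of_file(path) for path in paths}
    return len(checksums) == 1
```

Storing the checksum alongside the metadata at deposit time would also let a later user confirm that a downloaded copy is identical to the one originally archived.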
Suggested First Steps
Curated data as publication and the corresponding data provenance and reliability could involve a major, long-term infrastructure project for the linguistic community. However, we would like to suggest a few simple first steps that the community as a whole can take, including workshops of this type. Our basic suggestion is to provide both carrots and sticks to the community and to pursue proactive education in data set publication. Although some linguists will become major contributors to curated data sets while others will play a more minor role, all linguists should understand and appreciate their importance: no linguist left behind.
There are several ways to encourage linguists to publish curated data. First of all, mentors need to publish and share their data sets as a model for the next generation. Encouraging students is not enough since the actions of successful researchers speak much more loudly than words: if senior members of the field are not publishing data, then newly minted PhDs will not do so. Those who do publish their data should provide a simple "cite as" button with the data to make it easy for those using the data to cite it in their works. This is particularly necessary since many linguists are unsure of the proper way to cite such data sets. Finally, it would be good to provide a service for data structure and integrity validation and for format conversion. Data structure and integrity validation can, at a minimum, check that the relevant metadata is in place before publication. As a more complex process, it can check the data structure to be sure that it is correct and complete, e.g. that a tab-delimited file has tabs between fields. A format conversion service would allow data set creators to publish their data sets in a variety of formats, both making it easier for researchers to use them and helping ensure the long-term survival of the data as formats change.
There are also some institutional safeguards that can easily be put into place to ensure that data publication occurs and is done properly. First, publishers, editors, and reviewers can require provenance information for all data sources before agreeing to publish research work. Second, editors and funding agencies can encourage data sets to be published. This is already starting to happen but should be further encouraged. Also, the funding agencies should ensure that the data is in fact published if publication was part of the grant and, if possible, should publicize that the data is now available. If grantees do not publish their data as promised, the funding agency could either withhold future grants from the grantee and its institution or withhold the final part of the grant until the data is published, similar to how some dissertation fellowships reserve part of the grant until the signed dissertation is submitted.
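The two levels of validation mentioned above, a metadata check and a structural check of a tab-delimited file, might be sketched as follows. The list of required metadata keys and both function names are assumptions made for this illustration; a real validation service would follow whatever metadata standard the community adopts.

```python
# Hypothetical set of metadata keys required before publication.
REQUIRED_METADATA = ("language", "mode", "creator", "date_created", "format")


def missing_metadata(metadata):
    """Return the required keys that are absent or empty in a metadata dict."""
    return [key for key in REQUIRED_METADATA if not metadata.get(key)]


def malformed_rows(text, expected_fields):
    """Return 1-based line numbers whose tab-separated field count is wrong."""
    bad_lines = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if len(line.split("\t")) != expected_fields:
            bad_lines.append(line_number)
    return bad_lines
```

For example, a three-column word list in which one line was typed with spaces instead of tabs would have that line reported by malformed_rows, so the creator can repair it before publication.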