Data Reliability and Provenance
Peter Austin (co-chair)
Martin Haspelmath (co-chair)
Kurt Bollacker
Tracy Holloway King (co-chair)
Koenraad de Smedt
Paul Trilsbeek
Abstract
In this paper we discuss what data provenance and data reliability are, with special attention to the needs of the linguistics community when moving from simple data sets to more complex cyberinfrastructures. The integrity and completeness of provenance information as a basis for citation, rights management, etc. is crucial for those who record, annotate and compile materials, but also for the sources of raw materials, in particular for indigenous and minority communities, and for the scholars who use these materials as a basis for further scientific work. We suggest some first steps to promote data sharing and publication in the linguistics community.
Data Provenance
Provenance is the who, what, and when of metadata. When a data set is created, it is important to know where it comes from and who is responsible for its publication. Adequate information about provenance, i.e. information about how, where, when and by whom the data was collected, encoded and annotated, and who assumes responsibility for its publication, allows the quality of the data set to be assessed, provides a contact in case of questions, and establishes authorship of the data set, similar to the authorship of an academic paper.
The contents and status of a data set are not always clear from a quick inspection of the data. The data might come from native-speaker informants, from published works of literature, from the web, etc. Adequate metadata are needed for cataloguing data sets so as to make them searchable, and to ensure that the data will be used in an appropriate way. Metadata should include a detailed description of the data set, including sufficient information on its provenance. In some cases provenance information should even be provided for separate sections of the data set wherever differences are relevant. For example, the data may have been collected over several years of field work; knowing when each section was collected, by whom, and under which circumstances may be important.
Provenance is extremely important to the scientific community, which uses data sets as the basis of further analysis, hypothesis testing, etc. For the user, provenance is crucial for assessing whether, and to what extent, a data set can form an appropriate basis for subsequent scientific work. Knowing how the data was collected and processed and who was responsible for it allows future users of the data set to judge the quality of the data, how it fits in with their research and how it compares to other data sets. Provenance is also important for the replicability of data and research results. Unless it is clear where the data came from and how it was created, data sets cannot be replicated. For example, if a data set has constituency trees annotated over it, it is important to know whether the trees were manually constructed, created automatically, or bootstrapped by manually correcting automatically constructed trees.
For the creators, provenance provides a way to get credit for the scientific work done in creating the data sets and allows the community to cite the data sets with their authors in academic works. Creating a high-quality data set is extremely time-consuming and requires highly skilled linguists; as such, it is important that those doing this work get credit for their contribution. Provenance is also important in establishing and maintaining privacy rights for those who provided the data. There are many reasons why it may not be appropriate to publish data freely in its entirety, e.g. individuals may be identifiable from videos or from the content of discussions, or private rituals may be recorded. In particular, the rights of indigenous and minority communities should be respected and acknowledged.
Provenance is tightly linked to the proper, standardized use of metadata and documentation in reliable environments. All data must be tagged with the appropriate metadata and linked to its documentation. This allows researchers to understand what is, and is not, included in the data set and its annotation and to correctly cite the data set, thereby acknowledging the work of the data set creators and allowing future researchers to reproduce their findings. Metadata and documentation are facilitated by adherence to standards and best practices. As students may not be familiar with these, more senior researchers and experts in natural language data curation need to share their knowledge and to facilitate students' adherence to the accepted standards.
Achieving Reliable Provenance Through Publication
A major question is how to achieve reliable data provenance in the linguistic community and promote the sharing of data. Creating data sets is a time-consuming, highly skilled task, and the scientific community as a whole needs to acknowledge this and to provide support for those working on creating and maintaining data sets. Individuals contributing to and creating these data sets need to get institutional credit for data publication. For example, data publications should count for tenure reviews and other review processes and should be an integral part of grant applications, both in having data set publication be part of the grant work and in the granting agencies favoring researchers who publish curated data sets, just as they favor those with a proven track record of traditional academic publications.
Given the nature of the linguistics community and the working paradigms that they are used to, we suggest promoting curated data sets as publications. The technology is currently available to treat curated data as publication. There is extensive archival work on linguistic data sets, including the work done by organizations such as the Linguistic Data Consortium. There are also examples from other scientific fields where publication of data collections has become an established scientific practice. However, there needs to be extensive institutional and social engagement in order for curated data as publication to become the norm in the linguistics and language studies communities.
Researchers need to be encouraged to publish curated data sets. This can be done in part by requiring it at the institutional level or as part of a grant award. It can also be aided by providing more infrastructure to publish data sets, including providing information on best practices and on how to access the necessary technologies. In addition, researchers need to cite the published data sets that they used in their research. Reviewers and publishers of articles and books should reject submissions that do not cite the data sources they used. These citations should be standardized by using the metadata, and the publishers of the curated data sets should facilitate this by providing information as to how the data set should be cited.
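As a rough illustration of how a publisher might derive a standardized citation from metadata, the following Python sketch formats a citation string from a metadata record. The field names (creators, year, title, version, publisher, pid), the citation layout, and the sample record are assumptions made for this example, not an established standard.

```python
def cite_as(metadata: dict) -> str:
    """Build a citation string from a metadata record.

    The field names and the layout are hypothetical; a real publisher would
    follow whatever citation convention the community agrees on.
    """
    creators = "; ".join(metadata["creators"])
    return (
        f"{creators} ({metadata['year']}). {metadata['title']}, "
        f"version {metadata['version']}. {metadata['publisher']}. "
        f"{metadata['pid']}"
    )


# Invented record, for illustration only.
record = {
    "creators": ["Example, A.", "Sample, B."],
    "year": 2009,
    "title": "An Example Field Corpus",
    "version": "1.0",
    "publisher": "Example Data Archive",
    "pid": "hdl:0000/example-corpus",
}
print(cite_as(record))
```

Because the citation is generated from the metadata rather than typed by hand, every user of the data set would cite it in the same form.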
The linguistic community will also need to provide support to ensure annotation and data quality control, just as it does with published academic papers. Not all data collection and annotation is done equally well or is of equal value. The community needs to have a way to recognize and acknowledge this, similar to the relative value of different book publishers, journals, and conference proceedings.
Publication of data through a recognized publication channel with an ISSN would make it easier to cite the data uniformly and correctly, would clearly establish authorship, would allow for different editions, would enforce the use of standards and metadata and would be a catalyst for giving academic credit to the makers. It would also promote reviews and rating systems.
In a different field, the data journal Earth System Science Data promotes the rapid publication of research on original data sets. Its policy is outlined as follows:
"Articles in the data section may pertain to the planning, instrumentation and execution of experiments or collection of data. Any interpretation of data is outside the scope of regular articles."
"In the first stage, papers that pass a rapid access peer-review are immediately published on the Earth System Science Data Discussions (ESSDD) website. They are then subject to Interactive Public Discussion, during which the referees' comments (anonymous or attributed), additional short comments by other members of the scientific community (attributed) and the authors' replies are also published in ESSDD. In the second stage, the peer-review process is completed and, if accepted, the final revised papers are published in ESSD. To ensure publication precedence for authors, and to provide a lasting record of scientific discussion, ESSDD and ESSD are both ISSN-registered, permanently archived and fully citable."
It should be noted that articles in ESSD seem to contain the data mostly within the articles themselves. A journal is, however, not the same as a cyberinfrastructure. A data publication is meant to announce to the community that data has been made available and how it was obtained and annotated, and also to give credit where it is due. The actual data can be accessed in a variety of ways, not necessarily through the same channel. Since publication is by nature public, there may be some issues with restricted data. Full metadata could be published, but the metadata could stipulate restrictions on the accessibility of the actual data (e.g. due to proprietary data or privacy issues).
One could imagine organizations acting as the publishers of articles with their data sets. Publishers of data will have the responsibility of checking at least the formal aspects of published data, such as proper use of metadata, adherence to standards etc. It might be possible that the same data is published at different hosts. We would need buy-in from institutions and linguists to support this. Publishers would themselves be rated and would need to actively advertise their data publications and make them attractive to researchers. Peer review of data publications should be stimulated. Perhaps language resources would be published unedited first, then reviews can be added on later, where annotation might count as a type of review; by having the language data out, even in unedited form, people would be encouraged to annotate it. Many academic credit systems (e.g. in Norway) require full peer reviewing as well as the use of recognized academic publishing channels.
Persistence and Fine-Grained Provenance Information
Given that linguists change institutions and that URLs shift over time, it is important that future researchers be able to access the same data that is being used today and to be certain that this is the same data as was used by other researchers. On the Internet, the assignment of provenance information to a piece of information can be assured through the use of a Persistent Identifier (PID). PIDs are globally unique identifiers that remain the same even if the URL of a resource changes. Central PID resolvers are used to administer the locations and additional metadata information of the resources. One PID can be used to refer to multiple identical copies of a resource in different locations. Examples of PID systems are the Handle System and the DOI system. PID systems only work, though, as long as the administration of the resource links is maintained, so the assignment of PIDs alone does not guarantee the long-term stability of the resource references. The linguistic community might consider setting up a registration authority for linguistic data.
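The role of a resolver can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not the interface of the Handle System or the DOI system: the class names, method names, PID string, and URLs are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PidRecord:
    """One resolver entry: a persistent identifier and its current copies."""
    pid: str
    locations: list = field(default_factory=list)


class PidResolver:
    """Hypothetical in-memory resolver; real systems run this as a maintained service."""

    def __init__(self):
        self._records = {}

    def register(self, pid, url):
        # One PID may point at several identical copies of a resource.
        record = self._records.setdefault(pid, PidRecord(pid))
        if url not in record.locations:
            record.locations.append(url)

    def move(self, pid, old_url, new_url):
        # The PID stays the same; only the administered location changes.
        locations = self._records[pid].locations
        locations[locations.index(old_url)] = new_url

    def resolve(self, pid):
        # Return a copy so callers cannot alter the resolver's table.
        return list(self._records[pid].locations)


resolver = PidResolver()
resolver.register("hdl:0000/example-corpus", "https://old.example.org/corpus")
resolver.register("hdl:0000/example-corpus", "https://mirror.example.org/corpus")
resolver.move(
    "hdl:0000/example-corpus",
    "https://old.example.org/corpus",
    "https://new.example.org/corpus",
)
print(resolver.resolve("hdl:0000/example-corpus"))
```

The sketch also makes the maintenance caveat concrete: if nobody calls `move` when a resource is relocated, the PID keeps resolving to a dead location.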
Every entity involved in data set creation can be identified by a unique PID. These entities include people, organizations, and their roles, the data sets and documents themselves, and different views and mashups of the data. Value added to existing material through annotations etc. can be managed by cascading PIDs with different rights. A proliferation of provenance information may increase the size of the information by orders of magnitude, but disk space is cheap, so provenance can be recorded at a very fine grain, e.g. for sound clips identified by start and stop time in a speech corpus. Assigning full provenance information can, however, be complicated, e.g. when material is translated, when value is added to copyrighted material (e.g. the Wall Street Journal corpus), or when a speech corpus has a radio broadcast in the background.
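A minimal sketch of such a fine-grained reference, assuming that a clip is addressed by appending a temporal fragment to the PID of the recording. The `#t=start,end` notation is borrowed from the W3C Media Fragments URI syntax; whether a given PID resolver would honor such fragments is an assumption, and the class, field names, and identifiers are hypothetical.

```python
from dataclasses import dataclass


def _seconds(value: float) -> str:
    """Format seconds with up to millisecond precision, without trailing zeros."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


@dataclass(frozen=True)
class ClipProvenance:
    """Provenance for one stretch of a recording (hypothetical fields)."""
    corpus_pid: str
    start_seconds: float
    end_seconds: float
    speaker_pid: str
    recorded_by_pid: str

    def reference(self) -> str:
        # Reject empty or reversed intervals before building the reference.
        if not 0 <= self.start_seconds < self.end_seconds:
            raise ValueError("clip must satisfy 0 <= start < end")
        return (
            f"{self.corpus_pid}"
            f"#t={_seconds(self.start_seconds)},{_seconds(self.end_seconds)}"
        )


clip = ClipProvenance(
    corpus_pid="hdl:0000/example-corpus",
    start_seconds=12.5,
    end_seconds=20.0,
    speaker_pid="hdl:0000/person-1",
    recorded_by_pid="hdl:0000/person-2",
)
print(clip.reference())
```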
Tomorrow's cyberinfrastructures will not be limited to static storage of data, but will be dynamic systems that process and present data according to users' needs. This context will present special challenges to handling provenance information. The following dynamic functionalities can be considered:
- customized presentation of data: filtering, reformatting, style sheets
- pipelined processes
- mashups: the integration of data from various sources (sometimes combining various modes)
In contrast to paper materials, which are static and pre-edited, a cyberarchive allows (and should allow) the user to participate in filtering and presenting information ("play editor yourself"). An example is the Wittgenstein Archives at Bergen, which contain digitized manuscripts: the user can choose to include or exclude certain pieces of information and has options for visualization. This creates challenges for provenance. How would you cite a specific view among many other possible views of this material? Such a reference may be possible by generating a unique URI and PID for the transformed and formatted web page. An "I want to cite this" button would make this process easy for the user. Dynamic customized views could help with scientific collaboration since they are a good way to bring things together at the presentation level; tools for this are very useful since most linguists do not have the user interface design skills to do this themselves.
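One conceivable way to make a view citable is to derive its identifier deterministically from the PID of the underlying resource and the options the user selected. The Python sketch below illustrates the idea only; the function name, the option names, and the identifier layout are assumptions, and nothing here describes how the Wittgenstein Archives actually work.

```python
import hashlib
import json


def view_identifier(resource_pid: str, view_options: dict) -> str:
    """Derive a stable identifier for one rendering of a resource.

    The options are serialized in a canonical form (sorted keys, fixed
    separators) so that the same choices always yield the same identifier.
    """
    canonical = json.dumps(view_options, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{resource_pid}|{canonical}".encode("utf-8")).hexdigest()
    return f"{resource_pid}?view={digest[:16]}"


# Hypothetical options a reader might have chosen in the interface.
options = {"show_deletions": False, "normalize_spelling": True, "style": "diplomatic"}
print(view_identifier("hdl:0000/example-manuscript", options))
```

An "I want to cite this" button would then only need to store the options under the generated identifier so that the same view can be regenerated later.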
The annotation layered on curated data is often done in conjunction with specific software. Having "software as a service" available to the linguistics community can aid in this process. It will be particularly valuable for institutions with less extensive computing infrastructure, allowing their researchers access to state-of-the-art data set curation facilities. New frameworks such as UIMA can aid in the interactive pipelining of processes on data. This again is a challenge for provenance information. Ideally, a PID can be assigned to every step in the pipelining process; note that at every step, intermediate data could be worth storing as a new resource.
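The idea of one identifier per pipeline step can be sketched as follows. This is plain Python and does not use UIMA; the `local:` identifiers generated with `uuid4` merely stand in for PIDs that a registration authority would issue, and all names are hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepRecord:
    """Provenance for one pipeline step and the intermediate data it produced."""
    pid: str                  # stand-in identifier for this intermediate result
    step_name: str
    input_pid: Optional[str]  # identifier of the data this step consumed
    output: object


def run_pipeline(source_pid, data, steps):
    """Run (name, function) steps in order, recording a provenance chain."""
    records = []
    previous_pid = source_pid
    for name, func in steps:
        data = func(data)
        pid = f"local:{uuid.uuid4()}"
        records.append(StepRecord(pid, name, previous_pid, data))
        previous_pid = pid
    return records


steps = [
    ("tokenize", str.split),
    ("lowercase", lambda tokens: [token.lower() for token in tokens]),
]
for record in run_pipeline("hdl:0000/raw-text", "The Cat Sat", steps):
    print(record.step_name, record.input_pid, "->", record.pid)
```

Following the `input_pid` links backwards from any record leads to the original source, so each intermediate result can be cited or stored as a resource in its own right.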
Furthermore, Rosetta, Freebase, the Internet Archive, etc. allow for mashups of data. Cyberinfrastructures should provide mashup functionality, i.e. the smart combination of data from various sources. Provenance is challenging for mashup data since the mashup process is dynamic and every combination of specific versions of data produces a new mashup version. Also, some people might want to reference the particular rendering of a mashup. Referencing views could easily escalate when every user can have a personal view.
Reliable Identification, Authorization and Rights Management
As part of provenance, in recording the who, what, and when of metadata, it is necessary to have trusted identification of individuals, organizations, and services. This identification needs to persist over time so that decades after a data set is created, it is still possible to determine who created it and how it was created. Reliable identification of individuals is an important prerequisite for at least two purposes:
- Identifying authors and sources to give credit to data creation.
- Identifying users to provide authorization based on licenses and rights.
CLARIN has stated that some system with global e-identities can solve many problems associated with the current situation of people having different usernames at the various sites they use. A single logon should identify the user in an easy way. This is a largely solved problem on a technical level, but although various solutions are available, none is widely used by the community:
- OpenID
- Federations of local e-identity providers
Authorization is a tuple linking a set of rights, the identity of a piece of data and the identity of a user. Authorization can take various forms depending on restrictions, e.g. access can be given to anyone, can be based on email domains (i.e. affiliated institutions), can require the acceptance of a user license, etc., depending on what the author or stakeholder decides. There are many such restrictions, e.g. access may be given for non-commercial purposes only, or access to sacred songs can be restricted to the initiated, or subparts of data may be proprietary and require additional agreements. There is a need to train people on what types of rights and access are appropriate. There may be privacy issues with sources such as hospital patient data, sign language data, etc. which may place restrictions on availability.
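The tuple view of authorization can be written down directly. The following Python sketch is illustrative only: the class, the `ANYONE` wildcard, the right names, and the identifiers are assumptions, and a real system would add licenses, email-domain rules, and expiry.

```python
from dataclasses import dataclass

ANYONE = "*"  # hypothetical wildcard meaning "any user"


@dataclass(frozen=True)
class Authorization:
    """A tuple linking a user identity, a piece of data, and a set of rights."""
    user_id: str
    resource_pid: str
    rights: frozenset


def is_allowed(grants, user_id, resource_pid, right):
    """Return True if any grant gives this user this right on this resource."""
    return any(
        grant.resource_pid == resource_pid
        and grant.user_id in (user_id, ANYONE)
        and right in grant.rights
        for grant in grants
    )


grants = [
    # Metadata is readable by anyone; the recordings only by one named user.
    Authorization(ANYONE, "hdl:0000/corpus-metadata", frozenset({"read"})),
    Authorization("user:alice", "hdl:0000/corpus-audio", frozenset({"read", "annotate"})),
]
print(is_allowed(grants, "user:bob", "hdl:0000/corpus-metadata", "read"))
print(is_allowed(grants, "user:bob", "hdl:0000/corpus-audio", "read"))
```

Keeping metadata and data under separate grants mirrors the case where full metadata is public while the data itself remains restricted.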
Some language data, especially spontaneous speech and sign language, cannot be distributed due to privacy issues, in particular in utterances referring to people. This becomes very complex with international access, where different countries may have different rules for guarding privacy. This situation may require different country-specific licenses, so international cooperation may need legal advice from the start.
Some technology exists to anonymize source materials, e.g. by manual or automatic masking, such as making non-words out of words while keeping their part-of-speech (POS) tags. Media are harder to deal with than plain text, especially sign language, where much of the exact original data is needed; any encoding of a facial expression is going to be a borderline case. Perhaps 3D models or avatars could be useful, but they can often not be produced automatically and are sometimes not possible at all, e.g. in sign language.
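For text, masking that keeps the POS tags can be sketched as below. This is a simplified illustration, not a vetted anonymization method: the syllable inventory, the salt, and the tag names are assumptions, a generated string may by chance coincide with a real word, and real privacy protection would need far more care.

```python
import hashlib

CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"


def pseudo_word(word: str, salt: str) -> str:
    """Map a word to a consonant-vowel non-word, consistently for a given salt."""
    digest = hashlib.sha256((salt + word.lower()).encode("utf-8")).digest()
    # Roughly preserve length: one CV syllable per two characters, 1 to 4 syllables.
    syllables = max(1, min(4, (len(word) + 1) // 2))
    return "".join(
        CONSONANTS[digest[2 * i] % len(CONSONANTS)]
        + VOWELS[digest[2 * i + 1] % len(VOWELS)]
        for i in range(syllables)
    )


def mask_tokens(tagged_tokens, salt, tags_to_mask=None):
    """Replace words by non-words but keep every POS tag.

    If tags_to_mask is None, every token is masked; otherwise only tokens
    whose tag is in the given set.
    """
    masked = []
    for word, pos in tagged_tokens:
        if tags_to_mask is None or pos in tags_to_mask:
            word = pseudo_word(word, salt)
        masked.append((word, pos))
    return masked


tokens = [("Mary", "PROPN"), ("saw", "VERB"), ("Mary", "PROPN")]
print(mask_tokens(tokens, salt="project-secret", tags_to_mask={"PROPN"}))
```

Because the mapping is consistent within one salt, repeated mentions of the same person remain recognizable as the same masked token, which keeps the text usable for many kinds of analysis.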
At the start of the DOBES program, legal specialists advised keeping all data closed. Since that was not an option, a code of conduct was worked out which provides a workable solution. Yet, some data may need to remain closed to all but core researchers in a tightly controlled project. One could use information from original institution waivers to guide what permissions to assign to the data. The CLARIN project has a working group on IPR and licensing issues. The European Parliament has expressed an interest in solving the complexity of the issues and may want to promote a revision of legislation, hopefully leading to wider availability of data for research purposes. Some lobbying towards legislators may be necessary.
Researchers are often unwilling to turn over their data for storage and distribution in repositories. One reason is that some people feel their data is not ready yet: once data is in a repository they feel it is cast in stone, which is a real problem for data which is never quite complete such as a dictionary of a living language. It is therefore crucial that repositories offer versioning and updating of the stored materials. Some researchers might prefer to control distribution themselves from their own homepage. A possible solution could be that links or web pages could be generated from the repository automatically. This could also be useful for university administration which has to document research production. A soft approach to repositories could help lead people to understand what a data repository is and what it can start to enable.
A special situation may occur when some data is retracted, either for privacy reasons or because of errors or other reasons. One could mark data as deleted, invalid or superseded without actually destroying the data; one may also want to be able to temporarily restrict data for certain reasons.
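Marking data rather than destroying it can be modeled with a status field and a change log, as in the following Python sketch. The status names follow the paragraph above; the class, field names, and sample reason are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    DELETED = "deleted"
    INVALID = "invalid"
    SUPERSEDED = "superseded"
    RESTRICTED = "restricted"  # temporary restriction, can be lifted again


@dataclass
class StoredItem:
    pid: str
    content: bytes
    status: Status = Status.ACTIVE
    history: list = field(default_factory=list)

    def mark(self, new_status: Status, reason: str) -> None:
        """Change the status and log the change; the content is never destroyed."""
        self.history.append((self.status, new_status, reason))
        self.status = new_status

    def is_accessible(self) -> bool:
        return self.status is Status.ACTIVE


item = StoredItem("hdl:0000/example-recording", b"...")
item.mark(Status.RESTRICTED, "privacy review pending")
item.mark(Status.ACTIVE, "review completed")
print(item.status, len(item.history))
```

The log records who changed what and why only as far as the `reason` string goes; a fuller model would also store the identity of the person making the change.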
Suggested First Steps
Curated data as publication and the corresponding data provenance and reliability could involve a major, long-term infrastructure project for the linguistic community. We would like to suggest a few simple first steps that the community as a whole can take. Our basic suggestion is to provide both carrots and sticks to the community and to pursue proactive education in data set publication. Although some linguists will become major contributors to curated data sets while others will play a more minor role, all linguists should understand and appreciate their importance: no linguist left behind. It will therefore be useful for the linguistic community to engage in extensive dissemination and training efforts and to establish links with ongoing generic projects on metadata standards and preservation (e.g. PREMIS).
Encouraging students is not enough since the actions of successful researchers speak much more loudly than words. Therefore, if well-known, successful members of the linguistic community publish and share their data sets, this will set a powerful example for the next generation. Those who do publish their data should provide a simple "cite as" button with the data to make it easy for those using the data to cite it in their works. This is particularly necessary since many linguists are unsure of the proper way in which to cite such data sets. Finally, it would be good to provide a service for data structure and integrity validation and for format conversion. Data structure and integrity validation can, at a minimum, check that the relevant metadata is in place before publication. As a more complex process, it can check over the data structure to be sure that it is correct and complete, e.g. a tab-delimited file should have tabs between fields. A format conversion service would allow data set creators to publish their data sets in a variety of formats, both making it easier for researchers to use them and helping ensure the long-term survival of the data as formats change.
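Both levels of validation can be illustrated with a short Python sketch. The list of required metadata fields and the sample data are assumptions made for the example; an actual service would check against an agreed metadata standard.

```python
import csv
import io

# Hypothetical minimum set of metadata fields required before publication.
REQUIRED_METADATA = ("title", "creator", "date", "license")


def missing_metadata(metadata: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [key for key in REQUIRED_METADATA if not str(metadata.get(key, "")).strip()]


def malformed_rows(tsv_text: str) -> list:
    """Return 1-based line numbers whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    if not rows:
        return []
    expected = len(rows[0])
    return [number for number, row in enumerate(rows[1:], start=2) if len(row) != expected]


metadata = {"title": "An Example Field Corpus", "creator": "Example, A.", "date": "2009"}
# The third line uses spaces where tabs were intended.
data = "form\tgloss\tpos\ndog\tanimal\tNOUN\nrun move VERB\n"

print(missing_metadata(metadata))
print(malformed_rows(data))
```

A check of this kind catches only formal problems such as missing fields or lost tabs; whether the content itself is correct still requires review by a linguist.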
There are also some institutional safeguards that can easily be put into place to ensure that data publication occurs and is done properly. First, publishers, editors, and reviewers can require provenance information for all data sources before agreeing to publish research work. Second, editors and funding agencies can encourage data sets to be published. This is already starting but should be further encouraged. Also, the funding agencies should ensure that the data is in fact published if it was part of the grant and, if possible, should publicize that the data is now available. If grantees do not publish their data as promised, the funding agency could either withhold future grants from the grantee and its institution or withhold the final part of the grant until the data is published, similar to how some dissertation fellowships reserve part of the grant until the signed dissertation is submitted.
Finally, the establishment of electronic data publishing journals in conjunction with a cyberinfrastructure should be considered, so as to provide a formal channel for establishing authorship of data sets and creating a scholarly reference in addition to a framework for peer review.