Group 4: Protecting Data Reliability and Provenance

The Provenance group will address the issue of protecting the reliability of data as it moves through the cyberinfrastructure, as well as its provenance. This is critical both for data creators (who need credit for the work they’ve done and for the academic contribution of collecting, curating and annotating data) and for data users (who need to know where the data has come from, so that they can judge how much credence to give it and how to credit its originator properly). Furthermore, as one person’s analysis is encoded in annotation, it becomes the next person’s data, so the provenance and reliability mechanisms need to scale to multiple layers of annotation over one original data set.

This group will also address the related issue of how to establish a culture of data sharing and what mechanisms to put in place to encourage people to share data.

Group members (member bio page):
  • Peter Austin (co-chair)
  • Martin Haspelmath (co-chair)
  • Kurt Bollacker
  • Tracy Holloway King
  • Koenraad de Smedt
  • Paul Trilsbeek
White Paper

Panel Report

slides
Interim report

Notes from Workshop

Reliability
  • Components of reliability:
    • How do we preserve the original bits?
    • How do we ensure sufficient accessibility?
    • How do we know the data is comprehensible/comparable?
  • How do we know when data is damaged or missing?
  • How do we know when we care?
  • If you don't do careful curation at the beginning, you have to do it at the end, and it will be more costly, more difficult and less reliable
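One common way to approach the questions about preserving the original bits and detecting damaged or missing data is a fixity check: record a checksum for every file at deposit time and compare against it later. The following is a minimal sketch under that assumption, not a method the group agreed on; the function names `build_manifest` and `audit` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (done at deposit time)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files now on disk against the stored manifest."""
    report: dict[str, list[str]] = {"missing": [], "damaged": []}
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file():
            report["missing"].append(name)      # file has disappeared
        elif sha256_of(path) != expected:
            report["damaged"].append(name)      # bits have changed
    return report
```

A check like this only answers whether the bits are intact; it says nothing about whether the data is still accessible or comprehensible, which need separate curation.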
Provenance
  • How do we encode provenance metadata, including:
    • Who made a change?
      • A person
      • An organization
      • An entity filling a role
    • What change was made?
    • Where do we store the provenance information?
      • With the original data
      • In a separate store/index
  • Do we have to keep all provenance information or can we allow some to be destroyed?
  • What do we do when provenance information is missing?
  • How do we decide to filter using provenance information?
  • Should data and metadata be in a flat model? (i.e. can we have provenance on provenance data?)
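To make the questions above concrete, here is one possible shape for a provenance record, sketched under two assumptions: records live in a separate store rather than with the original data, and a record may target another record, which gives provenance on provenance. All class, field and identifier names are hypothetical illustrations, not an agreed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Agent:
    """Who made a change."""
    kind: str  # "person", "organization" or "role"
    name: str


@dataclass(frozen=True)
class ProvenanceRecord:
    """One change event, kept apart from the data it describes."""
    record_id: str
    target_id: str      # id of a data item OR of another provenance record
    agent: Agent
    change: str         # what change was made
    timestamp: datetime


class ProvenanceStore:
    """A separate, append-only index of provenance records."""

    def __init__(self) -> None:
        self._records: list[ProvenanceRecord] = []

    def add(self, target_id: str, agent: Agent, change: str) -> ProvenanceRecord:
        record = ProvenanceRecord(
            record_id=f"prov-{len(self._records) + 1}",
            target_id=target_id,
            agent=agent,
            change=change,
            timestamp=datetime.now(timezone.utc),
        )
        self._records.append(record)
        return record

    def history(self, target_id: str) -> list[ProvenanceRecord]:
        """All recorded changes for one data item or one record."""
        return [r for r in self._records if r.target_id == target_id]
```

Usage, with made-up identifiers:

```python
store = ProvenanceStore()
first = store.add("corpus-1/file-7", Agent("person", "A. Annotator"), "added POS tags")
# Provenance on provenance: a curator vouches for the first record.
store.add(first.record_id, Agent("role", "curator"), "verified annotator identity")
```

Because the store is append-only, this sketch keeps all provenance information; allowing some of it to be destroyed would be a separate policy decision.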

Dynamics

  • How do we keep track of data which is by nature dynamic?
    • Consider monitor corpora which are continuously growing.
  • Annotations may change dynamically, e.g. treebanks may be updated as grammar models are developing
    • Strong versioning systems may be necessary.
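As an illustration of what versioning could mean for a continuously growing corpus, the sketch below gives every committed state an identifier derived from its content and its parent version, so that a citation can point at an exact state. This is an assumption about one workable design, not a recommendation from the workshop; `VersionedCorpus` and its methods are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Version:
    version_id: str
    parent_id: Optional[str]
    documents: tuple[str, ...]


class VersionedCorpus:
    """Keeps every committed state of a corpus addressable by id."""

    def __init__(self) -> None:
        self._versions: dict[str, Version] = {}
        self.head: Optional[str] = None

    def commit(self, documents: Iterable[str]) -> str:
        """Store a new state and return its content-derived id."""
        docs = tuple(documents)
        digest = hashlib.sha256()
        digest.update((self.head or "").encode("utf-8"))  # chain to parent
        for doc in docs:
            digest.update(b"\x00")                        # document separator
            digest.update(doc.encode("utf-8"))
        version_id = digest.hexdigest()[:12]
        self._versions[version_id] = Version(version_id, self.head, docs)
        self.head = version_id
        return version_id

    def checkout(self, version_id: str) -> tuple[str, ...]:
        """Return the documents exactly as they were in that version."""
        return self._versions[version_id].documents
```

Storing full snapshots is the simplest option and is wasteful for large corpora; a real system would more likely store differences between versions.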

Privacy

To what extent does this working group need to address privacy issues?
  • Many data sources (esp. spontaneous speech corpora) allow individuals to be identified
  • Can/should this information be kept in the system but kept private from users?
    • Which techniques for anonymizing, masking etc. are available?
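One masking technique that fits the "kept in the system but private" scenario is keyed pseudonymization: identifiers are replaced by stable pseudonyms that only the holder of a secret key can reproduce. The sketch below assumes records are plain dictionaries with a `speaker` field; that layout and the function names are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(speaker_id: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    mac = hmac.new(secret_key, speaker_id.encode("utf-8"), hashlib.sha256)
    return "SPK-" + mac.hexdigest()[:10]


def mask_record(
    record: dict,
    secret_key: bytes,
    private_fields: tuple[str, ...] = ("speaker",),
) -> dict:
    """Return a copy of the record with the private fields pseudonymized."""
    masked = dict(record)  # shallow copy; the original stays untouched
    for name in private_fields:
        if name in masked:
            masked[name] = pseudonymize(str(masked[name]), secret_key)
    return masked
```

This only masks explicit identifiers. In spontaneous speech corpora the voice and the content of the recording can still identify a speaker, so it does not amount to anonymization on its own.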

Data sharing

"Sharing data is a tenet of science, yet commonplace in only a few subdisciplines." Piwovar & Chapman 2008
  • Interest of scientists: fast dissemination, prompt access, maximal impact, quality assurance, feedback to the author, affordability
  • Companies typically want to buy into projects, but do not want to invest in long-term preservation; they want all the rights but intend to use the results immediately; thus many results get lost.
  • Value for supplier: stored intellectual property, PR value for organization
  • Value for consumer: saving cost of re-discovery
  • Value for society: strong trigger for further research and innovation
  • Rights on supplier side: copyright
  • Rights on consumer side: 'consumer' rights, fitness for purpose, protection against false claims
  • PsychData, ZPID: community-funded project, exceptional since many psychologists do not want to share data (among other things for ethical reasons, due to the personal character of some data). PsychData provides data archiving but does not archive everything; it focuses on large surveys, studies of unique populations, studies that cannot be replicated, longitudinal studies, etc.
  • DFG recommendation: Primary data as the basis for publications shall be securely stored for ten years in a durable form in the institution of their origin.

Economic aspects

  • Costs vary across the lifecycle stages of an infrastructure and are difficult to estimate.
  • High cost/value of data re-collection, lost data (sometimes unique data) and conversion of non-standardized data
  • Main economic problem: data deluge
  • There is an imperfect market: market value (demand/supply equilibrium) vs. moral obligations (open access, public funding), so we cannot use classical economic models
  • Some services can be contracted to a third party, e.g. Portico

"Trustworthy" Digital Repository assessment/audit methods




Latest page update: made by tracyhollowayking, Aug 17 2009, 12:24 PM EDT
Discussion thread: To what extent does this working group need to address privacy issues?
Started by mebeckman, Jul 14 2009, 6:18 PM EDT (2 replies; last post Jul 16 2009, 6:40 PM EDT by mebeckman)
In Working Group 1, we've been pondering where/how to make sure that there is a page that addresses privacy and ethical treatment of human subjects, which seems to be an issue that cross-cuts the charge to Group 2 and Group 4 as well as touching on the charge to us. Could the three groups figure out how best to make sure that this doesn't slip through the cracks between the groups?