Group 4 Notes


Who we are


Peter: descriptive, documentary linguistics perspective; SOAS London; endangered languages program; people doing documentation, collecting data

Kurt: digital research directory; Rosetta project, technical advisor; digital data preservation project, what does it mean to preserve it for a really long time; have some philosophies and rules of thumb, want to improve guidelines and see how works

Paul: Max Planck Nijmegen; have large archive with linguistics material; involved in big European projects like DOBES and CLARIN: preserving data, make it accessible, extra layers on top of data

Koenraad: comp ling prof in Bergen; language resources since Celex; treebanking and from that into infrastructures; in Europe lots of attention via national initiatives and wanting to tie things together; contact person for CLARIN in Norway

Tracy: Employee at Powerset, Microsoft. They use lots of data. An infrastructure implies adding layers of structure onto data, reusing and sharing them

Martin: Max Planck in Leipzig (evolutionary anthropology); linguistic diversity and history of it; language documentation; language comparison, language typology, WALS on line even though sold well in print; share data, especially if counts as publication, added incentive; will publish more on-line: 41 vocabularies (comparative dictionary); Max Planck has a digital library; ties with Rosetta; having issues with ISO since need unique code for each variety of each language, including historical

Keywords: Reliability, Provenance, Data Sharing

Data sharing at any scale, but even more so at a large scale, presupposes securing reliability and provenance. Furthermore, there is a sociological dimension to data sharing.

Provenance


The integrity and completeness of provenance information as a basis for citation, rights management etc. is crucial for those who record, annotate and compile materials, but also for the sources of raw materials, in particular for indigenous and minority communities.

Best practice dictates that handles are used rather than volatile URLs to refer to every piece of information. Handles are global identifiers or permalinks and can be managed by organizations such as handle.net. Even so, there are issues with trust and long-term maintenance. DOIs are already widely in use for e-journals. Perhaps a registration authority could be set up for linguistics.

Rights can be assigned to information associated with a handle. Added value by annotations etc. of existing material can be managed by cascading handles with different rights. A proliferation of provenance information may increase the size of the information by orders of magnitude but disk space is cheap and so we can do this very fine grained, e.g. sound clips indicated by start and stop time in a speech corpus.
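As a rough illustration of the fine-grained referencing described above, the sketch below models a sound clip as a handle plus start and stop times, with an annotation layered on top under its own rights. The class, the handle strings and the rights labels are all invented for illustration; they are not part of the Handle System or of any archive's actual interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Resource:
    """A piece of data identified by a persistent handle (hypothetical model)."""
    handle: str                                 # made-up identifier
    rights: str                                 # rights label attached to this handle
    derived_from: Optional["Resource"] = None   # parent resource, if any
    start_s: Optional[float] = None             # clip start in seconds, if a clip
    stop_s: Optional[float] = None              # clip stop in seconds, if a clip


def provenance_chain(resource: Resource) -> List[str]:
    """Return the handles from this resource back to the original source."""
    chain = []
    current: Optional[Resource] = resource
    while current is not None:
        chain.append(current.handle)
        current = current.derived_from
    return chain


recording = Resource("hdl:0000/recording-1", rights="community-restricted")
clip = Resource("hdl:0000/recording-1-clip-7", rights="community-restricted",
                derived_from=recording, start_s=12.5, stop_s=15.0)
annotation = Resource("hdl:0000/clip-7-gloss", rights="open-research",
                      derived_from=clip)

print(provenance_chain(annotation))
```

Each layer keeps its own rights label, so an openly licensed annotation can still point back to a restricted recording without exposing it.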

The technical infrastructure is on the verge of supporting this, but the field is not there yet; we need steps to get there, since scholarly practice lags behind.

Provenance can be tricky, e.g. when material is translated or when value is added to copyrighted material (e.g. Wall Street Journal corpus) or when a speech corpus has a radio broadcast in the background.

Rights and approvals


Paul Newman wrote an article on copyright for the scholar.

Some questions:

  • How to deal with copyright management, which is not the same as avoiding plagiarism?
  • How to make sure that relevant metadata is encoded?
  • How to cite and give credit for data, tools, etc?
  • How to assign rights to derived resources? E.g. X trains a stochastic disambiguator on a treebank which Y created by parsing a corpus compiled by Z with a grammar developed by W, etc.
  • How do we bring the whole field along in using best practice?
  • Do we need editorial boards who hand out stamps of approval?
  • Do vetted data presuppose formal criteria (e.g. appropriate metadata) or does the content need quality assurance (which could be costly)?
  • How can data be retracted when significant shortcomings are discovered?
Some possible steps toward solutions:

  • Approval of formal aspects such as metadata, consistency etc. is realistic and allows publishing quickly
  • Metadata and reusability should give credit; there should be models and best practice for these aspects, and easy systems and templates
  • There are existing repositories and plans for more extensive infrastructures (e.g. CLARIN) that accept contributions from non-affiliates.
  • Repositories, national libraries etc. deal with long-term preservation and access.
  • A permalink would help in finding out who uses data and in the possible retraction of data
  • Normally publishers take care of rights clearance; licenses may be necessary
  • Some existing services for getting a repository going: Amazon S3; Google

Reliability and permanence


Analog data as well as digital data on old media are deteriorating. Free digital resources on the web can last a long time when mirrored, but are nevertheless fragile if not maintained. Long term curation implies secure storage, distributed copies and regular migration to new information bearers. This is costly and requires a serious organization. Handles list all the master copies and keep track of migrating data. Big national libraries are beginning to take such responsibilities seriously. DOBES and the Max Planck Digital Library are doing this for language resources and CLARIN is moving toward such an infrastructure for all language resources.

The expense for physically storing all text is continually dropping, while audio and video take more space but are hopefully converging; so storage is not the issue and language data is nothing compared to the massive data from physics, astronomy, etc., being stored. The cost issues are in annotating, checking and indexing metadata, conversion to newer formats, and other curation. The organization and technical issues are made more complicated by the fact that language data are very heterogeneous.

The scientific and societal value of conserving language data can be considerable. Some language data cannot be reconstructed if lost, or the cost of repeating their creation is huge; sometimes metadata cannot be reconstructed either, because researchers forget (or die).

Identification and authorization of individuals


Reliable identification of individuals is an important prerequisite for at least two purposes:
  1. Identifying authors to give credit to data creation.
  2. Identifying users to provide authorization based on licenses, etc.
CLARIN has concluded that some system with global e-identities can solve many problems associated with the current situation of people having different usernames at the various sites they use. A single logon should identify the user in an easy way. On a technical level this is largely a solved problem, but of the various solutions available none is widely used by the community:
  1. OpenID
  2. Federations of local e-identity providers
Authorization is a tuple linking a set of rights, the identity of a piece of data and the identity of a user. Authorization can take various forms depending on restrictions, e.g. access can be given to anyone, can be based on email domains (i.e. affiliated institutions), can require the acceptance of a user license, etc., depending on what the author or stakeholder decides. There are many such restrictions, e.g., access may be given for non-commercial purposes only, or access to sacred songs can be restricted to the initiated, or subparts of data may be proprietary and require additional agreements. There is a need to train people on what types of rights and access are appropriate. There may be privacy issues with sources such as patient data, sign language data etc. which may place restrictions on availability.
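The tuple view of authorization can be written down quite directly. The following is a minimal sketch under stated assumptions: the `Authorization` class, the `is_allowed` helper, the right names and all identifiers are hypothetical, and real systems would add license acceptance, domain-based rules and expiry on top.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Authorization:
    """One grant: (set of rights, data identity, user identity)."""
    rights: FrozenSet[str]   # e.g. {"read", "download"} (labels are made up)
    data_id: str             # handle or other identifier of the data
    user_id: str             # global e-identity of the user


def is_allowed(grants: Set[Authorization], user_id: str,
               data_id: str, right: str) -> bool:
    """True if some grant gives this user this right on this piece of data."""
    return any(
        g.user_id == user_id and g.data_id == data_id and right in g.rights
        for g in grants
    )


grants = {
    Authorization(frozenset({"read"}), "hdl:0000/sacred-songs", "user:initiated-1"),
    Authorization(frozenset({"read", "download"}), "hdl:0000/wordlist", "user:guest-2"),
}

print(is_allowed(grants, "user:guest-2", "hdl:0000/wordlist", "download"))   # True
print(is_allowed(grants, "user:guest-2", "hdl:0000/sacred-songs", "read"))   # False
```

Access is denied by default: a user without a matching grant gets nothing, which fits cases like restricted sacred songs.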

Giving credit for data creation and sharing


The stick and the carrot may have to be used. Funding agencies are starting to require data publication or at least data archiving. These agencies could be motivated to host the data or at least a link. Credit should be given for adding value to data sets. There should be requirements not only for making data available, but making them available in a form that allows reuse, e.g. through appropriate use of standards, metadata and interfaces.

The responsibility for making data available should not just fall on the individual researcher, but on the institution. Here is also a carrot, since institutions will get prestige from hosting data that becomes well known and is the basis of many studies, but these rewards may be difficult to obtain for more specialized data (e.g. lesser known languages).

Many corpora and other data are hard to count as research results, but are extremely important, even for languages like English (e.g. BNC corpus); crucial input to much linguistics study.

Publication of data


A number of issues could be resolved by the formal publication of data collections rather than simpler models of data sharing. Publication of data through a recognized publication channel with ISSN etc. would make it easier to cite the data uniformly and correctly, would clearly establish authorship, would allow for different editions, would enforce the use of standards and metadata and would be a catalyst for giving academic credit to the makers. It would also promote reviews and rating systems.

The data journal Earth System Science Data promotes the rapid publication of research on original data sets. Here is their policy:
"Articles in the data section may pertain to the planning, instrumentation and execution of experiments or collection of data. Any interpretation of data is outside the scope of regular articles."
"In the first stage, papers that pass a rapid access peer-review are immediately published on the Earth System Science Data Discussions (ESSDD) website. They are then subject to Interactive Public Discussion, during which the referees' comments (anonymous or attributed), additional short comments by other members of the scientific community (attributed) and the authors' replies are also published in ESSDD. In the second stage, the peer-review process is completed and, if accepted, the final revised papers are published in ESSD. To ensure publication precedence for authors, and to provide a lasting record of scientific discussion, ESSDD and ESSD are both ISSN-registered, permanently archived and fully citable."

However, articles in ESSD seem to contain the data mostly within the articles themselves. ESSD is not the same as a cyberinfrastructure. The publication is meant to announce to the community that data has been made available and how it was obtained and annotated, and also to give credit where it is due. The actual data could be accessed in a variety of ways, not necessarily through the same channel.

One could imagine organizations acting as the publishers of articles with their datasets. Publishers of data will have a responsibility for checking at least the formal aspects of published data, such as proper use of metadata, adherence to standards etc. It might be possible that the same data is published at different hosts. We would need a buy-in from institutions and linguists. Publishers would themselves be rated and will need to actively advertise their data publications and make them attractive to researchers.

Since publication is by nature public, there are some issues with restricted data. Full metadata could be published, but the metadata could stipulate restrictions on the accessibility of the actual data (e.g. due to proprietary data or privacy issues).

There are also issues with editing and peer review. Maybe language resources would be published unedited first, and reviews could be added later, where annotation might count as a type of review; by having the language data out, even in unedited form, people would be encouraged to annotate it. Some academic credit systems in countries like Norway are based on criteria like peer reviewing and require that recognized academic publishing channels be used.

Ratings of publication channels for language data and impact assessment (cf. CiteSeer) may follow.

Michael Cysouw wrote a detailed proposal for modern dictionary publication.

Saturday morning session

Privacy and rights management


Some language data, especially spontaneous speech and sign language, cannot be distributed due to privacy issues, in particular in utterances referring to people. This becomes very complex with international access, where different countries may have different rules for guarding privacy. This situation may require different country-specific licences, so international cooperation may need legal advice from the start.

Some technology exists to anonymize source materials, e.g. by manual or automatic masking, such as making non-words out of words but keeping the POS. Media are harder to deal with than plain text, especially sign language, where much of the exact original data is needed; any encoding of a facial expression is going to be on the edge. Perhaps 3D models or avatars could be useful, but these are often not automatic and sometimes not possible, e.g. in sign language.
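For text, masking that keeps the POS could look roughly like the sketch below. It assumes the corpus is already tagged, treats the tag `NNP` (proper noun) as the sensitive category, and uses made-up placeholders of the form `XX1`, `XX2`; all of these are illustrative choices rather than an established procedure, and real anonymization would need far more care.

```python
from typing import Dict, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def mask_tokens(tagged: List[Token],
                sensitive_tags: Sequence[str] = ("NNP",)) -> List[Token]:
    """Replace words with sensitive tags by non-words, keeping the POS tag.

    The same word always maps to the same placeholder within one call,
    so references to the same person remain recognizable as such.
    """
    pseudonyms: Dict[str, str] = {}
    masked: List[Token] = []
    for word, pos in tagged:
        if pos in sensitive_tags:
            if word not in pseudonyms:
                pseudonyms[word] = f"XX{len(pseudonyms) + 1}"
            masked.append((pseudonyms[word], pos))
        else:
            masked.append((word, pos))
    return masked


sentence = [("Maria", "NNP"), ("saw", "VBD"), ("Jon", "NNP"),
            ("and", "CC"), ("Maria", "NNP"), ("left", "VBD")]
print(mask_tokens(sentence))
```

The tag sequence is untouched, so syntactic studies can still use the masked text.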

DOBES example: legal specialists advised to keep all data closed. A code of conduct/ethical rules is the only workable solution; some data needs to remain closed.
One could use information from original institution waivers to guide what permissions to put in. BNC for sign: some of the data is locked up for all but core researchers; divide between sharable and not.

With respect to rights management, there is a CLARIN working group on IPR and licensing issues. The European parliament has expressed an interest in solving the complexity of the issues and may want to promote a revision of legislation, hopefully leading to wider availability of data for research purposes. Some lobbying towards legislators may be necessary.

Reliability, persistence, acceptance, retraction


Data which may be inconsistently or obsoletely coded or which do not contain enough metadata is still abundant. Such data may need to be checked and converted when placed into an infrastructure. An example is the fact that some lexicography groups still write and format dictionaries directly in legacy word processors. Advice on formats, coding and annotation is needed, as well as roadmaps and tools that promote conversion into more up to date, powerful formats. An infrastructure should therefore contain a registry for conversion software and could well provide a user friendly service so that users can upload, validate and convert data. Certainly tools should not act as barriers, but as aids. ARXIV has a useful system for automated ingest in physics, with steps along the way to ensure compliance, including for research papers.

Once you have provenance, what if someone retracts/deletes all or some research data? One could mark data as deleted, invalid or superseded without actually destroying the data; privacy would be one reason but it is often economically or politically inconvenient to have data around (cf. Swiss bank case in WikiLeaks; recordings of politically sensitive events, should have been a trade secret). One may want to be able to temporarily restrict data for certain reasons.
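Marking rather than destroying can be sketched as a status field on each stored record. The code below is only an illustration: the status names follow the paragraph above, while the `Repository` class, its methods and the handles are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Status(Enum):
    ACTIVE = "active"
    RESTRICTED = "restricted"    # temporarily withheld
    INVALID = "invalid"          # known to have significant shortcomings
    SUPERSEDED = "superseded"    # replaced by a newer version
    DELETED = "deleted"          # withdrawn from view, bytes still kept


@dataclass
class Record:
    handle: str
    content: bytes
    status: Status = Status.ACTIVE
    superseded_by: Optional[str] = None


class Repository:
    """In-memory stand-in for an archive; nothing is physically removed."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def add(self, record: Record) -> None:
        self._records[record.handle] = record

    def mark(self, handle: str, status: Status,
             superseded_by: Optional[str] = None) -> None:
        record = self._records[handle]
        record.status = status
        record.superseded_by = superseded_by

    def resolve(self, handle: str) -> dict:
        """What a citation link shows: status always, content only if active."""
        record = self._records[handle]
        visible = record.content if record.status is Status.ACTIVE else None
        return {"handle": handle, "status": record.status.value,
                "superseded_by": record.superseded_by, "content": visible}


repo = Repository()
repo.add(Record("hdl:0000/dict-v1", b"first edition"))
repo.add(Record("hdl:0000/dict-v2", b"second edition"))
repo.mark("hdl:0000/dict-v1", Status.SUPERSEDED, superseded_by="hdl:0000/dict-v2")

print(repo.resolve("hdl:0000/dict-v1"))
```

An old citation still resolves, but it now reports the status and points to the replacement instead of silently breaking.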

Acceptance


How do we promote acceptance of the idea that data needs to be stored, properly encoded, made available to others, etc.?

Linguists and humanities scholars must be convinced of the fact that data will not only be read by people, but also by machines, and therefore there are requirements on encoding. However, this will only work if the encoding is transformed (or hidden) to make the data readable by humans as well. Presentation is a prerequisite for acceptance. Note that Wikipedia data encoding is all done at presentation level, which makes it hard to extract in a machine readable way, even when there is a template for the metadata; e.g. the language template is not used by English, which uses its own template.

A carrot for researchers is that once one has a machine readable version, one can produce many different views of the content, one can see multiple versions of the text and highlight different things, produce word lists, concordances etc.

Renderings and combinations of data


In contrast to paper materials, which are static and pre-edited, a cyberarchive allows (or should allow) the user to participate in filtering and presenting information ("play editor yourself"). An example is the Wittgenstein archives at Bergen, which contains digitized manuscripts; the user can choose to include or exclude certain pieces of information and has options for visualization. How would you cite a specific view among many other possible views of this material? This could be done by generating a unique URI and handle for the transformed and formatted web page. An "I want to cite this" button would make this process easy for the user. Such dynamic aspects could help with collaboration since it is a good way to bring things together at the presentation level; the tools for this are very useful since most linguists don't have the UI skills to do this themselves; Rosetta is trying to provide some examples; allow linguists to deal with how to use and view the data.
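One way an "I want to cite this" button might derive a stable identifier is to hash the source handle together with the chosen view options. This is a sketch under assumptions: the function name, the option names and the `?view=` URI layout are invented here, and a real service would also have to register the result with a handle system.

```python
import hashlib
import json


def view_identifier(source_handle: str, options: dict) -> str:
    """Derive a repeatable identifier for one particular view of a source.

    The options are serialized in a canonical form (sorted keys, fixed
    separators), so the same choices always give the same identifier
    regardless of the order in which the user made them.
    """
    canonical = json.dumps({"source": source_handle, "options": options},
                           sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{source_handle}?view={digest}"


first = view_identifier("hdl:0000/manuscript-1",
                        {"show_deletions": True, "normalize_spelling": False})
second = view_identifier("hdl:0000/manuscript-1",
                         {"normalize_spelling": False, "show_deletions": True})
print(first == second)
```

Because the identifier depends only on the source and the options, two users who pick the same view end up citing the same thing.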

Rosetta, Freebase, Internet Archive etc. allow for mashups of data. Cyberinfrastructures should provide mashup functionality, i.e. the smart combination of data from various sources; Kurt and Laura have been working on this and have some demos of how one could display information together. It is challenging to reference mashup data since the mashup process is dynamic and every combination of specific versions of data produces a new mashup version. Also, some people might want to reference the particular rendering of a mashup. Referencing views could easily escalate when every user can have a personal view.

Linguistic taxonomy browser: want provenance information so that can choose which information to use in a given view; also a motivation to people to include provenance data

Repositories


Researchers are often unwilling to turn over their data for storage and distribution in repositories. One reason is that some people feel their data is not ready yet: once data is in a repository they feel it is cast in stone, which is a real problem for data which is never quite complete such as a dictionary of a living language. It is therefore crucial that repositories offer versioning and updating of the stored materials. Some researchers might prefer to control distribution themselves from their own homepage. A possible solution could be that links or webpages could be generated from the repository automatically. This could also be useful for university administration which has to document research production. A soft approach to repositories could help lead people to understand what a data repository is and what it can start to enable.

Large infrastructures and funding


Large scale infrastructures could be expensive (cf. the CLARIN calculations of what it will cost to maintain a European research infrastructure for language). However, it should also be considered how much one can do with minimal funding but with guidelines and existing resources (such as Google, Amazon and the Internet archive). If cheap solutions are easy enough, people will use them if they provide a benefit over the status quo. Initial success can be a proof of concept which can then be a catalyst to get funding for scaling up things and make them more viable in the long term.

In view of the slow process of starting up a large infrastructure, it is better to get something going quickly even if the approach has not been fully worked out: this could save data that might otherwise be lost. A way to get started is by building small, easy tools that encode best practices, such as checking the consistency and completeness of metadata.
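A small metadata checker of the kind meant here might be no more than the following sketch. The list of required fields and the two format rules (an ISO 8601 date and a three-letter lowercase language code) are assumptions chosen for illustration, not an agreed standard.

```python
import datetime
from typing import Dict, List

# Assumed minimal field set; a real profile would be agreed by the community.
REQUIRED_FIELDS = ("title", "creator", "date", "language")


def check_metadata(record: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not (record.get(name) or "").strip():
            problems.append(f"missing field: {name}")

    date = (record.get("date") or "").strip()
    if date:
        try:
            datetime.date.fromisoformat(date)
        except ValueError:
            problems.append(f"date is not a valid ISO 8601 date: {date}")

    language = (record.get("language") or "").strip()
    if language and not (len(language) == 3 and language.isascii()
                         and language.isalpha() and language.islower()):
        problems.append(
            f"language is not a three-letter lowercase code: {language}")
    return problems


record = {"title": "Word list", "creator": "A. Linguist",
          "date": "15/08/2009", "language": "English"}
for problem in check_metadata(record):
    print(problem)
```

Returning messages rather than raising errors keeps the tool an aid and not a barrier: the depositor sees everything that needs fixing in one pass.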

When ultimately aiming for a large, encompassing infrastructure (such as what CLARIN is aiming for), interfaces will be very important -- both interfaces to human users and interface systems to let computer systems work together. The aim is to establish one seamless virtual space so that people don't have to know about the specifics of where certain resources are located. An infrastructure would not necessarily replace or copy existing repositories and services, but could aim at connecting them. Provenance information is very important as an underlying feature, since a user might combine a parser X with a tagger Y from different places and the system should track this.

A first connection between resources from different sources should perhaps be attempted on the level of metadata, where added value could be provided for interoperability on data level. An example application would be a simultaneous search across repositories for data, e.g. for all occurrences of a word within documents of a certain period, area, etc. Great added value would be provided by being able to place metadata restrictions (period, area, etc.) on search in the actual data (words, tags, etc.) in a distributed environment. Efficiency is not a trivial issue in this respect. Pulling down huge quantities of data and then searching and filtering might not be the optimal solution. Furthermore, downloading data should be avoided when this goes against stated restrictions due to copyright etc. Good pipelining of linguistic processes over data is crucial to large scale infrastructures, and requires a good interplay between advanced systems and the necessary computing power. Software as a service would be a good model.
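The idea of sending the query to the data, instead of pulling the data down, can be sketched as follows. Each archive filters on metadata and searches its own texts, returning only hit counts. The `Archive` class, the archive names and the sample documents are all invented; real repositories would sit behind network services.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, int]  # (archive name, document id, number of matches)


class Archive:
    """Stand-in for a remote archive that filters and searches on its own side."""

    def __init__(self, name: str, documents: List[Dict]) -> None:
        self.name = name
        self._documents = documents

    def search(self, word: str, **restrictions: str) -> List[Hit]:
        hits = []
        for doc in self._documents:
            meta = doc["metadata"]
            # Metadata restrictions come first, so excluded texts are never scanned.
            if any(meta.get(key) != value for key, value in restrictions.items()):
                continue
            count = doc["text"].lower().split().count(word.lower())
            if count:
                hits.append((self.name, doc["id"], count))
        return hits


def federated_search(archives: List[Archive], word: str,
                     **restrictions: str) -> List[Hit]:
    """Ask every archive the same question and merge the answers."""
    results: List[Hit] = []
    for archive in archives:
        results.extend(archive.search(word, **restrictions))
    return results


archive_a = Archive("archive-a", [
    {"id": "a1", "metadata": {"period": "1900s"}, "text": "the fjord and the farm"},
    {"id": "a2", "metadata": {"period": "1800s"}, "text": "the old fjord"},
])
archive_b = Archive("archive-b", [
    {"id": "b1", "metadata": {"period": "1900s"}, "text": "the dike"},
])

print(federated_search([archive_a, archive_b], "the", period="1900s"))
```

Only counts and identifiers leave each archive, which also respects restrictions on downloading the underlying texts.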

Also, data aggregation is an issue, and this can be provided as part of the infrastructure.

can pool computing resources to make this easier; and many are small computations; turnaround is faster; searching over arbitrary structures like trees and graphs

will need good organization effort and long term; would want mirrored across multiple sites, e.g. EU and NSF


metadata

what need to keep? who (strong identifier) and when (standard time stamp) minimally; OLAC was involved with this; what, e.g. version change, copy/move from elsewhere, transformation (preservation metadata standards) www.loc.gov/standards/premis
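A minimal record of who, when and what could look like the sketch below. It is loosely inspired by the event idea in preservation metadata, but the field names and the list of event types are assumptions taken from the note above; it does not claim to conform to PREMIS or OLAC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Event types taken from the note above; the exact vocabulary is an assumption.
EVENT_TYPES = {"version change", "copy", "move", "transformation"}


@dataclass(frozen=True)
class ProvenanceEvent:
    agent: str        # who: a strong identifier for the person or system
    timestamp: str    # when: ISO 8601 time stamp in UTC
    event_type: str   # what: one of EVENT_TYPES
    detail: str = ""  # free-text description, e.g. source location


def record_event(agent: str, event_type: str, detail: str = "",
                 now: Optional[datetime] = None) -> ProvenanceEvent:
    """Create an event, stamping it with the current UTC time by default."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    moment = now if now is not None else datetime.now(timezone.utc)
    stamp = moment.astimezone(timezone.utc).isoformat()
    return ProvenanceEvent(agent, stamp, event_type, detail)


event = record_event("user:example-1", "transformation",
                     detail="converted word-processor file to XML",
                     now=datetime(2009, 8, 15, 19, 48, tzinfo=timezone.utc))
print(event)
```

A menu-driven tool would only need to ask for the event type and detail; the agent and time stamp can be filled in automatically.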

would want to make this easy to use with menus, etc.; want a nice service tool to do this; SOAS uses this for some documents

what is required to collect and make it searchable; don't want user to have to continually redo; have service for this

Saturday afternoon



meeting with tools


service provision

given repositories to put stuff and ways to reference, would be value to making this work as service provision

connect tool builders with the data

have software run in the cloud/on the grid; much stuff is simple but some might require heavy computing power

example: want to test out idea about syntactic phenomena; don't want to have to download all the data and do all the processing and the data aggregation

what type of linguists would use? theoretical linguists; comp ling with complex tools

one central place? can have geared towards specific needs, e.g. sociolinguists; could have skins; could start with one-size-fits-all and then evolve

use cases to guide (could be start of a grant proposal)
satisfy the needs
what do, what problems, what can gain, how

put on wiki and let people add

firefox as a user interface
look for metadata

pull down data (e.g. from metadata)
process with pipeline
update with annotation and provenance
upload, which allows others to use it
search

storage as a service, including backups

don't want linguists to do what they are not good at: instead have experts in db, etc


many underlying systems are already in place, especially for other disciplines
some of these want to include humanities and so could link in with

what is special about linguistics given that a lot of this exists
  • interfaces for visualizing, GUIs
  • types of annotation: basic functions and APIs
  • what types of queries are possible, especially for efficient search

Germany's TextGrid, eclipse based
UIMA based projects

is this sociological/organizational or technological problem

sociology is very important:
  • what will people use
  • how encourage

directly sharing data is very valuable, how encourage

provenance for bootstrapped annotation
many layers of annotation

even with data cheap, want some type of quality control and also retraction of data

partially annotated, redoing any time ok for things like field data
make a regular part of process
encourage to put parts up, e.g. graded access
at first put up even just for self and then publish out

  1. first get data on server
  2. encourage by getting cool things for free, e.g. can organize and view
  3. then encourage to share
  4. full annotation can proceed

academics worried about stolen, or not getting full credit
  • publish data to help alleviate this
  • field linguistics notes could really benefit from this
  • some may want to "publish" just for self and maybe later let out
  • for when require publication of data as part of a grant, make it easy to put it out and ensure that is always available
  • educational mission is an important part of this: reach out to students and faculty

make sure is not responsibility of author to "fix" or alter; can license in such a way to require/encourage republication

service oriented architecture
  • ways to modify and transform; want to help keep things running
  • what happens when parts of a pipeline are no longer maintained
  • want backup and additional ways around
  • but will still have issues when get new platforms and upgrades

tools are harder to migrate than data: institutes like Max Planck are putting effort into this

mashups have similar issues

CLARIN: thinking of having larger centers that take over from other centers if need be

who would make funding decisions for things of this scale?
  • many components are out there
  • missing: human interface and how connects to the annotation standards; getting people to start using it; commitment (e.g. Max Planck guarantees data for 50 years)
  • what would the team that does this look like

are linguists involved in digital humanities project (Project Bamboo)?

cost: e.g. CLARIN 165 million over about 22 countries
use as much existing stuff as possible
and there will be volunteer work on this
try to get a few cool features to use initially

Work Session 3:


interim report:

software as a service
storage as service
high-performance computing as a service

customizable editions of the data (Wittgenstein example)
example of how can benefit even if put mainly text up with minimal annotation
tuning way you look at the data to specific research needs: one view is not enough
data mashups: user participates in creating data

what types of data are there? one person's analysis becomes another person's data
what generalizations apply to different types, e.g. publication applies to some types

some things will be "publications" and some not, but still want to be sharable
  • typology: data could be an additional publication with the typological analysis itself, e.g. coordination, loan words
  • lists of NPs used in analysis is a bit different: less detailed annotation, more automated acquisition
  • dictionaries, especially in non-final form
  • (Atlas of European Languages was published volume by volume with data, comments on what did and systems used, maps; all on paper)
  • field worker with annotated corpus with translation: getting and analyzing data very time consuming

document what did: many judgments go into how the annotation was done and need to know this; credits scientific achievement

privacy issues
some technology exists to anonymize source materials, e.g. by masking.
This becomes harder with international access, where different countries may have different rules for guarding privacy. This situation may require different country-specific licences and legal advice.

want: top level take away, one page take away, the whole thing

Work Session 4: clean up, start of report generation, and panel prep

possible organization of report:
  • intro: what are the issues




Latest page update: made by Koenraad, Aug 15 2009, 3:48 PM EDT