Sociolinguistics case study (version control)

Collaborative research utilizing version control

Case study subdiscipline: Sociolinguistics
Project title: DIALECT EVOLUTION AND ONGOING VARIABLE LINGUISTIC INPUT: ENGLISH IN THE PACIFIC NORTHWEST 200 YEARS AFTER LEWIS AND CLARK
National Science Foundation Award: BCS-0643374
(Alicia Beckford Wassink, Principal Investigator, University of Washington,
Department of Linguistics)
Software used: Microsoft SharePoint Server 2007 (for version control and remote collaboration)
Goals of this case study: Demonstrate the use of versioning software in an online research collaboration area (ORCA) to register changes made to spoken-language recordings and associated data, so that modifications can be tracked and the nature of, and motivations for, the changes are transparent



Project description from the project homepage (http://www.artsci.washington.edu/nwenglish/index.asp):
The Pacific Northwest English project investigates the features of English spoken in the Pacific Northwestern region of the United States (PNW), two hundred years after the introduction of non-indigenous speakers to the region. The Pacific Northwest English (PNWE) project explores the extent of English dialect development in the Pacific Northwest region of the United States. It also documents the stories of families with deep roots in the Pacific Northwest region. The project outcomes include: acoustic analysis of the vowels of speakers indigenous to the PNW (formant data, amplitude and duration measures), orthographic transcriptions of 200 hours of unscripted speech, and detailed social network measures and analyses for judgement sample speakers. Sociolinguistic investigation includes analysis of speaker participation in several phonological changes affecting North American English, and examination of questions related to interethnic contact and dialect focusing across three generations. Oral histories for participating families are being produced in conjunction with the Washington State Historical Society, and Prof. Jean Harris, Highline Community College.

1. Data Elicitation

• Data elicitation uses a hybrid methodology that combines phonetic analysis with a standard multi-part variationist sociolinguistic interview schedule, allowing collection of data in different spoken registers (unscripted conversation, one-on-one interview, reading passage, word lists, semantic differentials, syntactic diagnostic prompts).
• While we cannot make all elicitation instruments available in this wiki (to avoid exposing materials to potential respondents), similar elicitation materials are publicly available in the Elicitation Materials Clearinghouse, Sociolinguistics Laboratory, University of Washington.
• The two-part sample includes data recorded in the field for a judgement sample, and data recorded in the laboratory using telephony devices for a complementary random sample.

2. Storage

• Original recordings (recorded at a 44.1 kHz sampling rate in uncompressed form to compact flash media, using M-Audio MicroTrack digital flash recording devices) are stored in four locations (as required by IRB protocols): 1) on a file server subjected to regular, incremental backups, 2) on ISO-9660 formatted compact discs, locked in a cabinet accessible only to the principal investigator, 3) in redacted form on compact discs in a CD archive, 4) in redacted form on a file server in an online research collaboration area (Microsoft SharePoint ORCA).
• Redacted formats have been edited in Praat software for the removal of potentially identifying subject information, i.e., the acoustic signal has been attenuated to zero, while leaving the time dimension intact. This allows all versions of the soundfiles to retain original timings, enabling location of temporal events of interest across versions of the recordings and transcriptions (which have been time-stamped based upon the non-redacted versions of the signal).
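The idea behind this kind of redaction can be sketched in a few lines of Python. This is only an illustration, not the Praat procedure used in the project: the function name `redact`, the toy signal, and the 10 Hz sampling rate are all hypothetical and chosen so the result is easy to read. The point it shows is that the samples inside an interval are set to zero while the total number of samples, and therefore every timestamp, stays the same.

```python
def redact(samples, sample_rate, intervals):
    """Return a copy of `samples` with each (start_s, end_s) interval set to zero.

    The output has the same length as the input, so timestamps taken from
    the original recording still point at the same place in the redacted copy.
    """
    redacted = list(samples)  # work on a copy; the original is left untouched
    for start_s, end_s in intervals:
        first = max(0, int(round(start_s * sample_rate)))
        last = min(len(redacted), int(round(end_s * sample_rate)))
        for i in range(first, last):
            redacted[i] = 0
    return redacted


# Toy signal: one second of audio at a (hypothetical) 10 Hz sampling rate.
signal = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
clean = redact(signal, sample_rate=10, intervals=[(0.2, 0.5)])
print(clean)  # [5, 5, 0, 0, 0, 5, 5, 5, 5, 5]
```

Because nothing is cut out, an event time-stamped at 0.7 s in the original is still at 0.7 s in the redacted version.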

3. Version control

• Version control is provided via an online research collaboration workspace created using Microsoft SharePoint 2007. The workspace is accessed through a web browser, so it can be used from any platform (researchers on our team currently use Mac, Windows, and Linux operating systems). Versioning is particularly useful in the document libraries where soundfiles, transcriptions, and Praat text tiers are stored (Fig 1).

Fig 1. Screenshot showing organization of ORCA main page.


• Version control requires (in this case, although other versioning software varies) that each user check out a soundfile or transcript from the document library. SharePoint allows only one user to check out a file at a time, but other software (such as CVS, Subversion, etc.) does not have this limitation.
• The file is modified by the user.
• At the end of a work session, the user uploads the modified version of the file to the document library. The software prompts the user to comment on what changes were made to the document, and automatically stamps the new file with the upload time and version number (Figs 2a,b). Figure 2a shows the SharePoint pulldown for a file called SR2CF2A_non-conversational. This is an orthographic transcription file that exists in several versions because it has undergone anonymization. Suppose we want to view the version history for this file: Figure 2b shows it. The current iteration is version 6, which has been wiped of information that potentially identifies a study participant. Clicking on any version (from 1 to 6) prompts the user to view or restore that iteration.

Fig 2a. version history pulldown from main document library


Fig 2b. Screenshot of version history comment page

• Crucially, all prior versions are available to the user. This allows full control and comparison of different versions of the documents stored in the library without overwriting data.
• A discussion area within the ORCA allows discussion of substantive changes to collection, analysis, and other protocols so that important decisions may be registered as part of the project history.
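The check-out, modify, check-in cycle described in this section can be modeled with a small Python class. This is a toy sketch and not the SharePoint API: the class `DocumentLibrary`, its method names, the file name `transcript.txt`, and the user names are all hypothetical. It shows the three behaviours the text relies on: only one user may hold a file at a time, every check-in needs a comment and is timestamped, and earlier versions are never overwritten.

```python
from datetime import datetime, timezone


class DocumentLibrary:
    """Toy model of a lock-based document library (hypothetical, not SharePoint)."""

    def __init__(self):
        self._versions = {}     # file name -> list of version records
        self._checked_out = {}  # file name -> user currently holding the file

    def add(self, name, content, user, comment="initial upload"):
        self._versions[name] = []
        self._store(name, content, user, comment)

    def check_out(self, name, user):
        holder = self._checked_out.get(name)
        if holder is not None:
            raise RuntimeError(f"{name} is already checked out by {holder}")
        self._checked_out[name] = user
        return self._versions[name][-1]["content"]

    def check_in(self, name, content, user, comment):
        if self._checked_out.get(name) != user:
            raise RuntimeError(f"{user} has not checked out {name}")
        if not comment.strip():
            raise ValueError("a version comment is required")
        self._store(name, content, user, comment)
        del self._checked_out[name]

    def history(self, name):
        return [(v["version"], v["user"], v["comment"]) for v in self._versions[name]]

    def restore(self, name, version, user):
        # In this sketch, restoring stores the old content as a new version,
        # so nothing in the history is ever lost.
        if name in self._checked_out:
            raise RuntimeError(f"{name} is checked out by {self._checked_out[name]}")
        old = self._versions[name][version - 1]["content"]
        self._store(name, old, user, f"restored version {version}")

    def _store(self, name, content, user, comment):
        records = self._versions[name]
        records.append({
            "version": len(records) + 1,
            "content": content,
            "user": user,
            "comment": comment,
            "timestamp": datetime.now(timezone.utc),  # stamped automatically
        })


library = DocumentLibrary()
library.add("transcript.txt", "Speaker: my name is Jane Doe", user="pi")
text = library.check_out("transcript.txt", user="analyst")
library.check_in("transcript.txt", text.replace("Jane Doe", "[REDACTED]"),
                 user="analyst", comment="removed participant name")
for entry in library.history("transcript.txt"):
    print(entry)
# (1, 'pi', 'initial upload')
# (2, 'analyst', 'removed participant name')
```

Systems such as CVS and Subversion do not use an exclusive lock by default; they let several users edit at once and reconcile the edits afterwards.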


Fig 3. Screenshot showing topic list from the general discussion site

4. Metadata

• Akustyk software is used to associate project-, speaker-, and token-level metadata with events in the sound file.
• A project handbook registers methodology and decisions made.
  • The metadata associated with all recordings is here

5. Access
• SharePoint allows access to be restricted according to permissions criteria for each member of the research team. It is possible, in principle, to share redacted versions of the recordings with all members of the team who have data analysis functions, and to restrict access to the non-redacted versions to the PI. Permissions criteria are set by the principal investigator.
• The public face of the project includes: 1) the project website, 2) example soundfiles that may be played or downloaded from maps on the project website, 3) datafiles for particular speakers, which individuals and organizations may download from the project website, for the set of speakers who have consented that their materials be made available in this way (see Human Subjects consent form sample).
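A graded permissions structure of the kind described for the research team can be written down as a small lookup table. The role names and file categories below are hypothetical and only illustrate the principle; the actual criteria are set by the principal investigator inside SharePoint, not in code like this.

```python
# Hypothetical roles and file categories, for illustration only.
PERMISSIONS = {
    "pi": {"redacted", "non-redacted"},
    "analyst": {"redacted"},
}


def can_access(role, file_kind):
    """Return True if `role` may open files of category `file_kind`."""
    # Unknown roles get an empty set, i.e. no access by default.
    return file_kind in PERMISSIONS.get(role, set())


print(can_access("analyst", "redacted"))      # True
print(can_access("analyst", "non-redacted"))  # False
print(can_access("pi", "non-redacted"))       # True
```

Defaulting unknown roles to no access keeps the non-redacted recordings protected even if a new team member has not yet been assigned a role.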


Versioning software:
• Concurrent Versions System (CVS): An open-source revision control system (http://www.nongnu.org/cvs/)
• Subversion: An open-source revision control system (http://subversion.tigris.org/)
• Microsoft SharePoint Server 2007 (http://sharepoint.microsoft.com/Pages/Default.aspx)


Benefits of utilizing version control software:
-Very easy to use
-Allows the research team to avoid the pitfall of saving numerous copies of the same file(s) in the same, or worse, different locations, and having to remember those locations.
-Allows the research team to keep track of the current version of a working soundfile or spreadsheet (containing acoustic measures and demographic data in this case) across different users and/or different machines
-Offers the capability to revert to earlier versions of some or all of the files in a given workspace
-Some versioning software (e.g., Subversion) offers the ability to merge work done on the same file by different users
-Members of the working team located remotely may all access common elements in the same workspace when they *do* talk together (e.g., by video or teleconference), and may contribute without the risk of overwriting each other's work
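To give a feel for what merging means, here is a deliberately simplified three-way merge in Python. It is not the algorithm Subversion uses: it assumes that both users edited lines in place without adding or deleting any, and the function name `merge_lines` and the sample transcript lines are hypothetical. For each line it compares the two edited copies against the common starting version and keeps whichever side changed.

```python
def merge_lines(base, mine, theirs):
    """Merge two edited copies of `base`, line by line.

    Returns (merged_lines, conflicts), where `conflicts` lists the indices
    of lines that both users changed in different ways.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles edits that keep the line count")
    merged, conflicts = [], []
    for i, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t or t == b:
            merged.append(m)   # same edit on both sides, or only I changed it
        elif m == b:
            merged.append(t)   # only the other user changed it
        else:
            merged.append(b)   # both changed it differently: keep base, flag it
            conflicts.append(i)
    return merged, conflicts


base = ["0.0 hello", "1.5 my name is Jane", "3.0 I grew up here"]
mine = ["0.0 hello", "1.5 my name is [NAME]", "3.0 I grew up here"]
theirs = ["0.0 Hello.", "1.5 my name is Jane", "3.0 I grew up here"]

merged, conflicts = merge_lines(base, mine, theirs)
print(merged)     # ['0.0 Hello.', '1.5 my name is [NAME]', '3.0 I grew up here']
print(conflicts)  # []
```

Lines flagged as conflicts still need a human decision, which is one reason version comments matter.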

Considerations:
Researchers will have to make value judgements about exactly what changes or decisions are meaningful to them, as well as to potential users. This means that different types of users will find versions of the data useful to differing degrees. In addition, version comments will always be somewhat subjective. Some subfields may have developed best practices regarding comments useful for versioning.

Considerations for different types of user (assuming resource uses a graded permissions structure):
-External users accessing a large corpus may not need the version comments or the content of different versions. They mainly need to know the version number.
-Researchers picking up the corpus to address new research questions are the group most likely to benefit from access to versions of the data (to know what version of the data they are working with, to see how the structure and format of this dataset differ from earlier versions, and to understand changes made to the resource after they have used it).

We want version control to be easy for the researcher, and useful. Once we begin to use such a tool, it may record more information than some researchers need, but version control INCREASES the utility of the data resource for others, partly because it provides a minimal amount of metadata for the resource: when the resource in its current form was produced, by whom, and which version it is. If a researcher has even a small set of basic principles for judging what types of comments are meaningful, supplying version comments can require little user effort. In short, a little goes a long way.




Latest page update: made by AliciaBW, Jul 22 2009, 12:21 PM EDT