Group 7: Collaboration Structure

The collaboration structure group is charged with considering methods for enhancing collaboration and communication on three levels. Level 1 involves forming communication pathways that can link individual researchers to the overall agenda of developing a shared cyberinfrastructure. Level 2 involves the collaborations that are needed between system developers and tool developers to assure maximum interoperability and open access between data formats and programs. Level 3 involves support for collaborations between linguists and researchers in other sciences, grounded in the use of a shared, open access cyberinfrastructure. For each of these levels, we need to design lightweight methods for ensuring ongoing collaboration and coordination.


The specific agenda items for this group include:
1. Data level interoperability: roundtrips between formats, transductions of formats, funding for the process of developing compatibility
2. Tool level interoperability and methods for collaboration in tool development.
3. Issues arising from a commitment to open access.
4. How to maximize data-sharing: the role of NSF, NIH, and LSA in terms of promoting greater commitment to data-sharing.
5. Characterization of linguistic digital data types and methods for linking to non-digital data.
6. An agenda for developing linkages to other sciences - the bigger picture of Extended Linguistics.
7. Lightweight administration within a framework of complex organizations: NIH, NSF, LSA, CLARIN, etc.

Following are further analyses of these seven agenda items:

1. Data level interoperability.
  • Level 1 involves compatibility in annotation format, such as the formats provided by frameworks like Annotation Graphs (AG) or the Linguistic Annotation Framework's Graph Annotation Format (GrAF).
  • Level 2 involves compatibility in terms of data categories (content), wherein categories are the same conceptually and can be mapped to one another. This can be facilitated by ontological resources like GOLD as well as by ISOcat.
  • Level 3 compatibility involves use of a common set of notational conventions to express a fully declared range of content categories.
2. Tool level interoperability and methods for collaboration in tool development.
  • LRT, TalkBank, E-Meld standards are largely similar.
  • Media standardization issues for streaming serving and programs. YouTube, Google, Mozilla, and others are developing standards and systems that we could adopt.
  • Roundtrips between tool formats: CHAT, Anvil, EAF, AG, EXMARaLDA, Wavesurfer, TEI, SALT. Some of these tools are already interoperable, but these pathways need to be made clearer to users.
  • AG Tools approach. This approach allows programmers to develop new tools from the AG Toolkit. However, this is only linked at Level 1. Can there be a similar approach at Levels 2 and 3?
3. Issues arising from a commitment to open access. What data must be kept away from the public and what data can be made freely available? How can linguists work together to increase access to larger amounts of linguistically important data?
  • Sharing and IRB principles at talkbank.org/share.
  • Legacy data vs. forward-looking protocols (E-Meld, AphasiaBank, as examples)
  • Community, population constraints
4. Methods for promoting a higher level of data contribution and individual researcher "buy in".
  • Inducements: publication, easy tool linkage
  • Community: role of LSA
  • Obligations and standards: role of NIH, NSF, DARPA, IE
  • Leading role of the European Community in setting standards for data-sharing for grant recipients
5. Characterization of linguistic digital data types and methods for linking to non-digital data.
Here, it is important to distinguish the emphasis on corpora and linked media from the many other types of digital data that are of interest to linguists. In the area of Linguistic Exploration, the fundamental objects may be word lists, sentence lists, or dictionaries. In Linguistic Anthropology, digitized records of objects are important. This extends eventually to Archaeology and even to information on human genetics. In the Learning Sciences, there is an emphasis on linking classroom video to individual student portfolios that may include letters, tests, artwork, and so on. For digital libraries, it is important to make clear where the hard copies actually reside. For many of these objects, identification can be made through the assignment of digital object identifiers (DOIs). However, this is largely unexplored territory for most linguists.

6. An agenda for developing linkages to other sciences - the bigger picture of Extended Linguistics. Here, the MacWhinney-Groves NSF report should be particularly helpful.

7. Lightweight administration within a framework of complex organizations: NIH, NSF, LSA, CLARIN, etc. There is a perception that some work on the development of shared cyberinfrastructure has been top-heavy on committee work and reports without producing a significant amount of shared interoperable resources. Is there a way to build organizational structures that produce open-access products? Who should determine patterns of collaboration, or should these patterns "emerge" through specific less-organized exchanges? If so, how can these interactions be guided toward cooperation and interoperability? Perhaps an emphasis on standards for collaboration might be possible.
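
To make the "roundtrip" idea from agenda items 1 and 2 concrete, here is a minimal sketch in Python of what a format-level (Level 1) roundtrip check might look like. Everything in it is an assumption made for illustration: the `Span` record, the toy graph layout, and the function names are hypothetical and do not reproduce the AG Toolkit, GrAF, or any of the tool formats listed above. The only point is that a conversion pathway can be tested by converting there and back and comparing the result with the original.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """One time-aligned annotation in a hypothetical tier-based format."""
    start: float  # seconds
    end: float    # seconds
    tier: str     # e.g. "word" or "pos"
    label: str


def spans_to_graph(spans):
    """Convert spans to a toy graph form: time points become nodes,
    and each annotation becomes a labeled edge between two nodes."""
    times = sorted({t for s in spans for t in (s.start, s.end)})
    node_id = {t: i for i, t in enumerate(times)}
    nodes = {i: t for t, i in node_id.items()}
    edges = [(node_id[s.start], node_id[s.end], s.tier, s.label) for s in spans]
    return {"nodes": nodes, "edges": edges}


def graph_to_spans(graph):
    """Convert the toy graph form back to a list of spans."""
    nodes = graph["nodes"]
    return [Span(nodes[a], nodes[b], tier, label)
            for a, b, tier, label in graph["edges"]]


def roundtrip_ok(spans):
    """True if converting to the graph form and back loses nothing."""
    return sorted(graph_to_spans(spans_to_graph(spans))) == sorted(spans)


# Hypothetical sample data: two words and one part-of-speech label.
sample = [
    Span(0.0, 0.4, "word", "the"),
    Span(0.4, 0.9, "word", "dog"),
    Span(0.0, 0.4, "pos", "DET"),
]

if __name__ == "__main__":
    print(roundtrip_ok(sample))
```

A check of this kind only speaks to Level 1 (format). It says nothing about whether the labels themselves mean the same thing in two projects, which is the Level 2 question.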

Recommended Readings

Additional Links

White paper draft/notes

Introduction


In considering collaboration, we distinguish first between joint research efforts and large-scale coordination. The former may cross many boundaries: of disciplines, institutions, and nations. It is also by and large already successful, and not something that we thought this group needed to focus directly on (though of course a cyberinfrastructure for linguistics would, as an important contribution, provide further support for this kind of collaboration). Rather, we defined our task as the fostering of large-scale coordination such as is required to get the field of linguistics (and sister disciplines in the language sciences) to organize around the creation of a cyberinfrastructure and its population with data. We saw three main facets to the large-scale coordination of effort: cooperation, community building, and communication. The following figure summarizes the relationships we see among these concepts.

Collaboration graph

Large-scale coordination of effort


Under the heading of large-scale coordination of effort, we include such things as the reuse of data, especially across (otherwise) unrelated research groups; mechanisms for publishing/sharing data; evaluation/quality control of resources (broadly defined to include data sets, tools, standards, etc); establishment of and agreement on standards; and coherence of principles, goals, and architectures across groups involved in the creation of a cyberinfrastructure. In all of these concerns, we place a high priority on avoiding duplication of effort. In order to foster large-scale coordination of effort, we see three main classes of tools: formal modes of cooperation, explicit work towards community building, and vehicles of communication.

Coordination Models


We identified a few examples of models of coordination:

  • ISO: exploit existing infrastructure for coordination
  • TalkBank: coordinate among researcher groups such as AphasiaBank, CHILDES, PhonBank, ClassBank, etc.
  • TEI: build infrastructure from the ground up
  • CLARIN: exploit an EC framework (ESFRI) that finances the feasibility phase of establishing European infrastructures (for CLARIN, in the Humanities), to be continued with national funding from the various EU countries
  • FLaReNet/SILT: parallel internationally funded efforts to establish international networks, with bottom-up coordination by the projects' coordinators
Clearly, the various coordination models are linked to different funding models, and can (to differing degrees) be deliberately fostered by funding agencies.

The Role of the LSA


An informal working group has been created within the LSA, consisting of LSA staff, leadership and technology consultants. The working group will make detailed recommendations to the LSA's Executive Committee concerning specific actions the LSA can undertake, both immediately and in the future, to facilitate the development of a cyberinfrastructure for linguistics, disseminate information related to this endeavor, and promulgate a "culture change" with regard to sharing of data, tools, and results. These actions would build on the initial steps already taken by the LSA in this regard, such as its digital publishing platform, eLanguage.


Community Building


We began with the observation that successful technology is always supported by (while also supporting) a community of users. We define a community for these purposes as a group of people working on similar issues, using the same tools/platforms/resources, who talk to each other and who share principles and practices in their research efforts. Different communities may display these various properties to differing extents. To give some examples, SIGs (special interest groups within larger scholarly bodies) illustrate communities primarily defined by working on similar issues. Computational linguistic communities that have grown up around the creation and use of tools and resources include the developer and user groups of WordNet, FrameNet, NLTK, GATE and UIMA. There are also communities of linguists who are producing shared databases, but not new computational tools or formats. These include groups such as CHILDES, PhonBank, AphasiaBank, LIDES, etc. Vehicles for communication (discussed further below) can also create communities of their own. Examples here include the readership of LINGUIST List and the readership of the Corpora list. Note also that communities can vary greatly in size (readership of LINGUIST List being at one extreme) and of course overlap with one another, as individuals belong to multiple different communities.

Since communities are important for both communication (see below) and the success of software projects, we considered ways to foster the development of communities, while recognizing that such things cannot be precisely engineered (nor their success precisely predicted). Means of community building include:

  • Funding programs sponsoring multiple groups working on the same/similar problems (e.g., language documentation sponsored by ELDP, DARPA programs)
  • Evaluation campaigns (lots of examples from computational linguistics here: SemEval, MUC, TREC, CLEF, CoNLL shared tasks)
  • On-line fora (from big like LINGUIST List to small like user groups for particular tools)
  • SIGs (of particular relevance here is ACL's SIGANN)
  • Workshops (e.g., LAW, E-MELD workshops, Cyberling 2009) and conferences (LREC is of particular note here in having created a community)
  • Tutorials
    • At summer schools (LSA institute, EuroLAN, Johns Hopkins, ESSLLI)
    • On the web
    • As part of funded projects (e.g., SILT)
  • Journals (LRE, Language, eLanguage, Computational Linguistics...)


Communication


We identified communication as a problem that cross-cuts many aspects of large-scale coordination of effort. In particular, we need to communicate about standards (availability and development), tool and resource availability, needs assessment, and principles & practices. People working on tools and standards across linguistics and the language sciences more broadly need to be aware of each other and each other's efforts and need to be able to communicate with potential users for needs assessment. (Mark Liberman commented that every successful piece of software starts with someone scratching an itch: they have a problem, build a solution, and share that solution. But not everyone who has a problem that can be solved with software has the means/skills to build that software themselves.) People potentially using tools and standards need to be able to find them. People who should be using tools and standards but don't yet know about them need to be reached.

With these kinds of communication in mind, we developed a list of potential communication vehicles:

  • Existing communities' infrastructure (newsletters, meetings, websites), including both informal communities and scholarly organizations
  • Teaching materials, especially made available over the web (syllabi, problem sets, web-based tutorials)
  • Wikis/blogs ("bottom-up web-based communication")
    • On-going maintenance of information collections
    • Reasons for people to come back to the on-line communication site
    • ad words: Set up context sensitive "ads" on the model of Google AdWords which could run on LINGUIST, lsadc.org, etc, where the things being advertised are relevant projects and standards (and no money is exchanged)
  • Funded collaborations (e.g., SILT/FLaReNet)
  • Workshops/tutorials
  • Reviewing guidelines/review feedback
    • Pushing funding agencies to require plans (and follow through) for using standards and publishing data for proposals that use tools/create data
    • Pushing funding agencies to require proposals for new tools/standards to appropriately cite and situate themselves within the existing tools/standards ecology
    • Conference/journal reviewing check for appropriate citations of data, tools, resources
  • Resource maps/eliciting metadata (cf. LREC 2010)
  • Journals like Journal of Experimental Linguistics which publish code along with the resulting research.
  • Idea from Steve Moran: A new journal (perhaps in the eLanguage set) on the model of Journal of Experimental Linguistics, which publishes data sets collected in the field. Maybe called "Journal of Linguistic Description"?

Content


This section briefly outlines the content that we need to be communicating about within the large-scale coordination of effort required to bring about a cyberinfrastructure for linguistics (and the language sciences).

  • Data level interoperability, on two levels: Interoperability of data and annotation format, including means of mapping between existing formats (cf. GrAF) and interoperability of annotation content, via ontologies (cf. GOLD) or other inventories of linguistic categories (cf. ISOcat).
  • Tool level interoperability and methods for collaboration in tool development.
  • Managing issues arising from commitment to Open Access, establishing and publicizing shared principles.
  • Promoting individual researcher "buy-in". A cyberinfrastructure is not useful until it is populated with data, but our field is in need of culture change in this respect. We believe this culture change can be achieved through a combination of making it easier for individual researchers to contribute data (through useful tools), educational campaigns on the part of the LSA and similar groups, funding agencies establishing policies requiring data sharing, and publication venues requiring testing against and citing existing available data.
  • Connection to other fields: Linguistics is only one of the language sciences, and digitized data from many fields (education, political science, law, ...) can be valuable for linguists. As we develop our infrastructure, we need to be mindful of how it fits into this larger ecology, and where teaming up with other language sciences can bring economies of scale. Closer to home, the field of computational linguistics has a good deal of cyberinfrastructure (and communication around cyberinfrastructure) established. We envision creating a portal into cyberinfrastructure concerns for people who identify as linguists, which rather than attempting to encompass all language-related cyberinfrastructure itself links to existing efforts in allied disciplines.
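
The content-level interoperability described in the first bullet above can be illustrated with a small Python sketch. It assumes two hypothetical project tagsets ("projectA" and "projectB") that have each been mapped to a shared inventory of categories. The concept names are placeholders and are not real GOLD or ISOcat identifiers. Translation between the two tagsets then goes through the shared inventory rather than through a direct pairwise table.

```python
# Hypothetical project tagsets, each mapped to a shared concept label.
# The concept labels are placeholders, not real GOLD or ISOcat identifiers.
TO_CONCEPT = {
    "projectA": {"N": "noun", "V": "verb", "DET": "determiner"},
    "projectB": {"n": "noun", "v": "verb", "art": "determiner", "adj": "adjective"},
}


def translate(tag, source, target):
    """Map a tag from one tagset to another via the shared concept.
    Returns None when either side lacks a mapping for that concept."""
    concept = TO_CONCEPT[source].get(tag)
    if concept is None:
        return None
    for target_tag, target_concept in TO_CONCEPT[target].items():
        if target_concept == concept:
            return target_tag
    return None


def unmapped(source, target):
    """List the tags in the source tagset that have no counterpart in the target."""
    return sorted(tag for tag in TO_CONCEPT[source]
                  if translate(tag, source, target) is None)


if __name__ == "__main__":
    print(translate("DET", "projectA", "projectB"))
    print(unmapped("projectB", "projectA"))
```

One reason to route through a shared inventory is that each project maintains a single mapping instead of one mapping per partner project. The `unmapped` report also makes gaps between tagsets visible, which is the kind of information that would need to be communicated between groups.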

Summary: 5 Cs of cyberinfrastructure


  • Collaboration: Joint research alone isn't enough to bring about a cyberinfrastructure, though it will play a key role. We need large-scale coordination of effort.
  • Cooperation models: There are many ways to coordinate effort, and we will probably use all of them.
  • Coordination: On the technical side, we need interoperability, which entails coordination on standards.
  • Communication: On the people side, we can't achieve coordination without communication, bringing people in, making them aware of each other, and keeping them in touch.
  • Community Building: Key to both successful communication and successful software.

Action items - Short term


  • Draft recommendations to funding agencies regarding standards, data publication, etc.
  • Draft recommendations to journal editors and conference organizers regarding citing tools/resources and publishing data
  • Create teaching resources (through LSA?)
  • Continue this conversation (all WGs, on the wiki for now)

Action items - Long term


  • Ensure communication among projects/groups
  • Push those developing standards for specific areas (e.g., PHON group connected to TalkBank) to contribute to ISO TC37 SC4
  • Work towards data/annotation harmonization



Latest page update: made by EmilyMBender, Sep 10 2009, 8:11 PM EDT (filled in coordination models, minor formatting edits)
Attachments:
  • TAC-May-2009.pdf: LSA Technology Advisory Committee report to Exec Committee, May 2009 (posted by EmilyMBender, Jul 21 2009)
  • LAF.pdf: Overview of the ISO Linguistic Annotation Framework (LAF) (posted by nancyide, Jul 15 2009)
  • LAW.pdf: Overview of the Graph Annotation Format (GrAF) (posted by nancyide, Jul 15 2009)
  • Vienna09_Short_Report.pdf: no description (posted by nicoletta.calzolari, Jul 14 2009)
  • INTEROP.pdf: SILT Proposal (posted by danmccloy, Jul 8 2009)