Design Doc for Linguistics-Oriented Data Sharing/Publishing Platform
Use cases
who would be interested in the platform?
- the kwa lexicographer: a traditional linguist who wants to do the right thing.
- traci: linguist who has done work, which should not be lost.
- alicia: (big) data without time to annotate/analyse.
- virach: competition on thai word segmentation.
1. the kwa lexicographer
a linguist uses toolbox to write a lexicon of a kwa language dialect,
using custom markers (custom fields of lexical entries), so toolbox no longer works.
what the linguist wants to do is prepare a printed book.
the linguist would also be interested in getting interlinear glossed examples for the lexicon.
concerns:
- data conversion
- merging of data
- cross-platform publishing (including the non-web world)
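
The data conversion concern can be made concrete with a small sketch. It assumes the lexicon is stored as backslash-marker text (one field per line, a record starting at `\lx`), which is how toolbox lexicons are commonly laid out. The sample entries, the custom marker `\xdial`, and the function names `parse_sfm` and `unknown_markers` are hypothetical placeholders, not real kwa data or an existing API.

```python
from collections import defaultdict

# Markers the (hypothetical) converter already understands.
KNOWN_MARKERS = {"lx", "ps", "ge"}

# Placeholder entries, NOT real language data; \xdial is an invented custom marker.
SAMPLE = r"""\lx word1
\ps n
\ge gloss1
\xdial northern

\lx word2
\ps v
\ge gloss2
"""


def parse_sfm(text, record_marker="lx"):
    """Split backslash-marker text into records (marker -> list of values)."""
    records = []
    current = None
    last_marker = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate fields visually
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            # A record marker (or the very first field) opens a new record.
            if marker == record_marker or current is None:
                current = defaultdict(list)
                records.append(current)
            current[marker].append(value.strip())
            last_marker = marker
        elif current is not None and last_marker is not None:
            # A line without a marker continues the previous field.
            joined = current[last_marker][-1] + " " + line
            current[last_marker][-1] = joined.strip()
    return [dict(r) for r in records]


def unknown_markers(records, known=KNOWN_MARKERS):
    """List markers that a converter with only `known` fields would drop."""
    return sorted({m for r in records for m in r} - set(known))


if __name__ == "__main__":
    entries = parse_sfm(SAMPLE)
    print(entries)
    print(unknown_markers(entries))
```

Keeping every marker in the parsed record, including the ones the converter does not know, means custom fields survive conversion and can be reported to the linguist instead of being silently lost.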
extension (once on the platform):
- find new data (on kwa), transcribe it -> online collaboration for annotation
subfields:
- language documentation
- field methodologies
- phonology
2. traci: linguist that has done work, which no one should have to do again.
(radio shows, dictionary entries)
the squib
-> backup
-> interest in "publication clearance process" (the "lawyer button")
subfields:
concerns:
- backup
- documentation
- reproducibility
extension:
3. alicia: has (big) data but no time to annotate/analyse it.
(pacific nw phonetics) ("light annotation problem")
-> the case for video
-> the researcher that wants to reuse/reannotate data of others
subfields:
- sociolinguistics
- phonology
concerns:
- backup
- sharing with access control (graded access)
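
One way to read "graded access" is as an ordered set of access levels, where a resource states the minimum level needed to read it. The sketch below illustrates that reading only; the four levels, the `Resource` fields, and the user names are assumptions made for the example, not decisions recorded in this document.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Access(IntEnum):
    """Hypothetical access grades, ordered from least to most privileged."""
    PUBLIC = 0
    REGISTERED = 1
    COLLABORATOR = 2
    OWNER = 3


@dataclass
class Resource:
    name: str
    owner: str
    min_level: Access = Access.OWNER  # private unless the owner opens it up
    collaborators: set = field(default_factory=set)


def level_of(user, resource):
    """Return the grade a user holds on a resource (None = anonymous visitor)."""
    if user is None:
        return Access.PUBLIC
    if user == resource.owner:
        return Access.OWNER
    if user in resource.collaborators:
        return Access.COLLABORATOR
    return Access.REGISTERED


def can_read(user, resource):
    """A user may read when their grade reaches the resource's minimum."""
    return level_of(user, resource) >= resource.min_level


if __name__ == "__main__":
    recordings = Resource(
        name="recordings",
        owner="alicia",
        min_level=Access.COLLABORATOR,
        collaborators={"student1"},
    )
    for user in ("alicia", "student1", "stranger", None):
        print(user, can_read(user, recordings))
```

Defaulting `min_level` to `OWNER` makes backup safe by default: uploaded data stays private until the owner deliberately lowers the grade.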
extension:
4. virach: competition on thai word segmentation
the tool builder
-> tools in search of data
-> results in search of judges
subfields:
- computational linguistics
concerns:
- find users for tools
- software as a service (potential for mashups)
- distribution mechanism
- attribution
extension:
with a platform that solves the concerns mentioned above, what else do we get?
foster educational interaction: connect students with the community and
with data. (attribution will help here)
now it turns out that some concerns match up with extensions.
-> once the people are on the platform, the collaboration and network
effects set in.
- potential for citation, publication
Requirements
- data backup
- data conversion
- remote access
- publishing (legal/copyright/etc)
- graded access
- metadata: The platform should already provide for the basic metadata to allow for attribution (who created which data when). Tagging with language codes is necessary to enable network effects.
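
The metadata requirement can be sketched as a small record type. This is a minimal illustration, assuming three-letter lowercase language codes in the style of ISO 639-3 (the document only says "language codes"); the class `Metadata`, its field names, and the helper `by_language` are hypothetical. The check is on the shape of the code only, not on whether the code is actually assigned to a language.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumption: codes look like ISO 639-3 identifiers (three lowercase letters).
_CODE_SHAPE = re.compile(r"[a-z]{3}")


@dataclass(frozen=True)
class Metadata:
    """Basic attribution: who created which data when, plus language tags."""
    data_id: str
    creator: str
    created: datetime
    language_codes: tuple

    def __post_init__(self):
        for code in self.language_codes:
            if not _CODE_SHAPE.fullmatch(code):
                raise ValueError(f"not a three-letter language code: {code!r}")


def by_language(items, code):
    """Find all data tagged with a language code (the hook for network effects)."""
    return [m for m in items if code in m.language_codes]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    catalogue = [
        Metadata("segmenter-results", "virach", now, ("tha",)),
        Metadata("radio-shows", "traci", now, ("eng",)),
    ]
    print([m.data_id for m in by_language(catalogue, "tha")])
```

Making the record immutable (`frozen=True`) fits the attribution use: once data is published, who created it and when should not change silently.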
Potential
- platform to deliver software as a service