Automatically Annotated Repository of Digital Audio and Video Resources Community

Working Groups

Background: The AARDVARC workshops were funded by the National Science Foundation in Stage 1 of the 3-stage BCS initiative “Building Community and Capacity for Data-Intensive Research.” In Stage 1, the funded projects were to “develop visions, teams, and capabilities dedicated to creating new, large-scale, next-generation data resources and relevant analytic techniques.” Stage 2 is intended to fund a selection (perhaps one-fourth) of these communities to develop prototypes of the facilities designed in Stage 1. The small number of projects that advance to Stage 3 will be funded to develop the actual facility.

The overarching goal of AARDVARC is to design a digital repository of data from understudied languages which offers depositors automated or semi-automated transcription of their data. The repository would be designed to facilitate data sharing and to address the huge backlog of untranscribed field recordings that exists. In addition, by accumulating a large body of multilingual data in a uniform format, the repository will advance large-scale qualitative and quantitative studies in multiple fields.

The goals of the first AARDVARC workshop are threefold:
- To survey the need for a community-oriented speech and video corpora archive/annotation facility and the uses to which the data would be put
- To specify the administrative and technical requirements of the repository
- To assess the potential for semi-automated transcription of language resources by reviewing the tools and procedures currently in use, as well as the volume and formats of the existing data

At this point, we envision that the repository will operate on a “take one, leave one” basis: a user (e.g., a field linguist) will submit several hours of transcribed recordings as a training corpus and in return receive access to a trained annotation tool that will aid them in transcribing the rest of their recordings. The repository will support fine-grained search over deposited corpora, as well as download of individual language or multilingual corpora, when permissions are available.

We expect that the working groups will modify and correct these preliminary ideas.

Day 1: Uses and Design of the Repository

Group 1: USE CASES
Working Group Chair: Arienne Dwyer
Support: Chris DiCanio

Develop use cases for such a repository: how would speech corpora be used in different disciplines? Are there specifications that such corpora must meet in order to be useful?

Group 2
Working Group Chair: Mark Liberman

Identify the legal and intellectual property rights issues that the repository design must address. Evaluate different ways the repository might handle these issues.

Group 3
Working Group Chair: Gary Simons

What should the administrative structure of such an archive be? Who should govern it? What conditions should be specified for corpus sharing? How can such an archive be developed to be self-sustaining?

Day 2: The Feasibility of Automated Transcription

Group 1
Working Group Chair: Monica Macaulay

Review the results of the survey on existing data from understudied languages and cultures. Report on the amount of data and the most common formats and tools employed. If possible, make a recommendation about the data formats that the repository should support, and note any anticipated problems.

Group 2
Working Group Chair: Jeff Good
Support: Silke Hamann
Participants: Steven Abney, Jonathan Amith, Damir Cavar, Christian DiCanio, Arienne Dwyer, Jeff Good, Mietta Lennes, Tanja Schultz.

Download the annotation samples from different disciplines which are available at [link will be available soon]. Compare the annotation systems and content foci in order to determine the consequences, if any, of different choices of annotation tools. How difficult would it be to switch from these systems to one of the systems used for automatic speech recognition?

Summary and conclusions (slides by Jeff Good)

Group 3
Working Group Chair: Doug Whalen

What are the special requirements of video processing? What types of content focus are common? What is annotated, and how? Should audio and video corpora be part of the same repository?

Day 3

Group Reports

Panel Discussion: On the way to automatic annotation, how far are we from the goal? What short-term steps will bring us closer to the ultimate goal of an automatic speech recognition/annotation system?