The corpora, tools and methodologies topic welcomes posts, literature reviews, and videos/demos.
Facial expressions, vocal intonation, gestures and other non-vocal cues play an important role in communication. However, few freely available audiovisual conversational corpora can be used to study this topic. Below is a short list of some of them. If you are compiling a corpus or have a corpus available and you wish to add it to the list, please write to Zoraida Callejas at:
SCARE: A Situated Corpus with Annotated Referring Expressions (ENGLISH)
Each session in this corpus records the joint problem-solving of a pair of human partners working through a treasure-hunt style task in a virtual world. The corpus includes: QuickTime movies, audio files, video time-aligned transcriptions and annotation, audio-aligned word-level transcriptions, positional information from the virtual world’s log and state-change information on items that could be manipulated in the virtual world. The corpus is freely available for research and educational use only, and is especially suited for investigating task-oriented dialog or deictic expressions in English. Corpus web page (more information & download): http://slate.cse.ohio-state.edu/quake-corpora/scare/
The CHILDES Database (MULTILINGUAL)
The CHILDES database contains transcript and media data collected from conversations between young children and their playmates and caretakers. Conversations with older children and adults are available from TalkBank. The corpus includes: audio and video files and transcriptions in CHAT and CA/CHAT formats. The conversations are divided into the following categories: American English, British English, Bilingual (bilingual language acquisition), Clinical (data from adults and children with language disorders in various languages), Narrative (data on narrations by adults and children in various languages), Germanic Languages (data on the acquisition of Germanic languages), Romance Languages (data on the acquisition of Romance languages), Slavic Languages, East Asian Languages, Celtic Languages and Other (data on the acquisition of other languages). The corpus is distributed under the GPLv2 license. Corpus web page (more information & download): http://childes.psy.cmu.edu/
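For readers who want to work with CHAT transcriptions programmatically, the following is a minimal Python sketch, not an official CHILDES tool. It relies only on the basic CHAT line conventions: header lines start with "@", main speaker tiers start with "*", and dependent tiers start with "%". The function name parse_chat and the sample transcript are invented for illustration, and continuation lines (which begin with a tab) are ignored for simplicity.

```python
def parse_chat(text):
    """Split CHAT-style text into header lines and utterances.

    Simplified sketch: continuation lines (starting with a tab) are skipped.
    """
    headers, utterances = [], []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("@"):
            # Header line, e.g. @Begin or @Participants
            headers.append(line)
        elif line.startswith("*"):
            # Main tier: *SPEAKER:<tab>utterance
            speaker, _, words = line[1:].partition(":")
            utterances.append(
                {"speaker": speaker, "text": words.strip(), "tiers": {}}
            )
        elif line.startswith("%") and utterances:
            # Dependent tier attached to the most recent utterance
            name, _, value = line[1:].partition(":")
            utterances[-1]["tiers"][name] = value.strip()
    return headers, utterances


# Invented sample transcript, for illustration only.
SAMPLE = (
    "@Begin\n"
    "@Participants:\tCHI Target_Child, MOT Mother\n"
    "*CHI:\tmore cookie .\n"
    "%mor:\tqn|more n|cookie .\n"
    "*MOT:\tyou want more ?\n"
    "@End\n"
)

if __name__ == "__main__":
    headers, utterances = parse_chat(SAMPLE)
    for utt in utterances:
        print(utt["speaker"], utt["text"], utt["tiers"])
```

Real CHILDES transcripts contain many more header types and tier codes than this sketch handles, so a dedicated CHAT reader is preferable for serious analysis.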
IFA Dialog Video corpus (DUTCH)
The IFA Dialog Video corpus is a collection of annotated video recordings of friendly face-to-face dialogs. It is modelled on the face-to-face dialogs in the Spoken Dutch Corpus (CGN). For this corpus, 20 dialog conversations of 15 minutes each were recorded and annotated, giving 5 hours of speech in total. The corpus includes: compressed video (MPEG-4, DivX3, OGV), speech files (RIFF/WAV, SPEEX, Ogg Vorbis, MP3) and annotations. The annotations comprise: word- and phoneme-aligned orthographic transcription, part-of-speech labelling, gaze direction, readable dialog transcripts, SMIL XML files and metadata on the recordings and the speakers. The corpus is distributed under the GPLv2 license. The procedures and design of the corpus were adapted to make this corpus useful for other researchers of Dutch speech. Corpus web page (more information & download): http://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/
There are now three major corpora of meetings.
The earliest corpus is the ICSI meeting corpus. This corpus comprises 40 hours of recorded speech from meetings. The speech has been transcribed. This data is available from the LDC.
The AMI meeting corpus is a multimodal meeting corpus. Participant interactions were recorded using microphones, video cameras and electronic pens. There are about 100 hours of interactions in this corpus, which was funded by the European Union and is available from ELRA. Annotation tools are also available.
Meeting data has also been collected at Stanford and at CMU as part of the CALO project. This data is not yet publicly available.
Much of the existing software for supporting annotation was designed ad hoc, employing special-purpose data representations. A number of tools try to overcome this problem by employing XML as a representational format for linguistic annotation.
A salient toolkit is NITE XML, software that arose out of the European HLT project NITE (Natural Interactivity Tools Engineering). Although the NITE project itself finished in 2003, the software is now being maintained and further developed via SourceForge and is in use on a number of large distributed projects (e.g. it has been used for the AMI and ICSI Meeting Corpora as well as the Switchboard Corpus, the OASIS Dialogue Database, the Monitor Corpus and the Diagrams Corpus) as well as by individual researchers.
The NITE XML Toolkit provides open-source libraries to support heavily annotated corpora, whether they are multimodal, textual, monologue or dialogue. In addition, its built-in tools help with common tasks and can be extended using the toolkit's Java API. The toolkit also integrates a powerful query language and command-line tools for data analysis.
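To make the idea of XML as a representational format for annotation more concrete, here is a small Python sketch of stand-off annotation: a word layer carries timings, and a separate gesture layer points at words by id. This is a generic illustration only. The element names (words, w, gestures, gesture), the refs attribute and the resolve function are all hypothetical and do not reproduce the NITE XML Toolkit's actual file format, data model or API.

```python
import xml.etree.ElementTree as ET

# Hypothetical word layer: each word has an id and start/end times in seconds.
WORDS_XML = """<words>
  <w id="w1" start="0.00" end="0.31">put</w>
  <w id="w2" start="0.31" end="0.52">it</w>
  <w id="w3" start="0.52" end="0.90">there</w>
</words>"""

# Hypothetical gesture layer: annotations refer to word ids instead of
# copying the text, so layers can be edited independently.
GESTURES_XML = """<gestures>
  <gesture id="g1" type="pointing" refs="w2 w3"/>
</gestures>"""


def resolve(words_xml, annotations_xml):
    """Attach text and time span to each annotation via its word references."""
    words = {w.get("id"): w for w in ET.fromstring(words_xml).iter("w")}
    resolved = []
    for ann in ET.fromstring(annotations_xml).iter("gesture"):
        targets = [words[ref] for ref in ann.get("refs").split()]
        resolved.append(
            {
                "type": ann.get("type"),
                "text": " ".join(w.text for w in targets),
                "start": float(targets[0].get("start")),
                "end": float(targets[-1].get("end")),
            }
        )
    return resolved


if __name__ == "__main__":
    for item in resolve(WORDS_XML, GESTURES_XML):
        print(item)
```

Keeping each annotation layer in its own document, linked by ids, is one common way to let several annotators work on the same recordings without overwriting each other's data.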
Video lecture by Jean Carletta:
The NITE XML Toolkit meets the ICSI Meeting Corpus: import, annotation, and browsing
More information:
Contact person: Jonathan Kilgour (jonathan AT inf DOT ed DOT ac AND DOT uk)
NITE XML Toolkit Homepages: http://groups.inf.ed.ac.uk/nxt/index.shtml
Download:
The NITE XML Toolkit at SourceForge.net: http://sourceforge.net/projects/nite/files/