CLARIN-D for Linguistic Fieldwork, Ethnology, and Language Typology

CLARIN-D supports linguistic fieldwork, ethnology, and language typology by providing services for accessing language data for individual languages and language families, and for annotating and analysing audiovisual documentation of language use and cultural practice. The focus is on the creation and sustainable archiving of digital corpora and typological databases. Working Group 3 (WG3) »Linguistic Fieldwork, Ethnology, Language Typology« is a network of scholars whose aim is to support fieldwork through the digital research infrastructure.

Data for Research

With the DoBeS corpora, CLARIN offers the first collection of documented speech data worldwide, which did important pioneering work for further initiatives in this area. The DoBeS corpora contain annotated audiovisual data from 68 endangered languages and represent the world's linguistic diversity in extraordinary typological and geographical depth. Further collections with similar aims and structure are under construction at two CLARIN-D centres, in Hamburg and Cologne.

Furthermore, CLARIN makes a variety of resources for ethnology and language typology accessible. The Virtual Language Observatory (VLO) provides access to many research resources, for example on Yuracaré, a language isolate of the Bolivian Amazon, or Beaver, an Athabaskan language spoken in Canada. Thanks to its powerful and intuitive faceted search, the VLO is an excellent starting point when looking for resources on specific, less well-described languages.

→ More about »Accessing«

Software Tools for Research Projects

CLARIN-D offers several relevant tools and services for analysing and preparing language data.

CMDI-Maker offers quick and simple creation of metadata in IMDI and in the CMDI formats relevant to language documentation, so that the data can be archived in language archives and other repositories.

ELAN is a crucial tool for language documentation and other areas of linguistics that work with audiovisual data. ELAN enables researchers to create time-aligned transcripts and other annotations. ELAN can also import data from glossing tools such as Toolbox or FLEx.
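ELAN stores its annotations in an XML format (EAF), so time-aligned annotations can also be read with general-purpose tools. The following is a minimal sketch using only the Python standard library. The sample document and the function name read_aligned_annotations are hypothetical and reduced to the elements needed here; real EAF files contain more metadata, and annotations that refer to other annotations rather than to time slots are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced EAF content for illustration only.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
                            TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned_annotations(eaf_xml):
    """Return (tier_id, start_ms, end_ms, text) for each time-aligned annotation."""
    root = ET.fromstring(eaf_xml)

    # Map time slot IDs to millisecond values; slots without a value are skipped.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text))
    return rows


for row in read_aligned_annotations(SAMPLE_EAF):
    print(row)
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring; the rest of the sketch would stay the same.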

EXMARaLDA is a system of tools and formats for working with audiovisual corpora. Its transcription and annotation editor is interoperable with other transcription tools and supports manual annotation as well as automatic annotation via WebMAUS or web services such as WebLicht. It can also output and visualize audiovisual and transcription data in different formats and layouts. Besides the score editor, the system offers desktop tools for corpus and data management (Coma) and for search and analysis of transcription data, annotation data, and metadata (EXAKT).

Poio API is a free, open-source Python library for accessing and analysing language documentation data. Poio API converts data formats such as EAF, Toolbox data or TypeCraft XML, among others, into annotation graphs as defined in ISO 24612. Such graphs allow uniform access to linguistic data from a variety of sources.
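The idea of uniform access through annotation graphs can be illustrated with a small, self-contained sketch. It does not use Poio API and does not reproduce the ISO 24612 data model; the classes Node and AnnotationGraph, the two converter functions and the sample data are all hypothetical and only show how differently shaped sources can end up behind one query interface.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    start_ms: int
    end_ms: int
    annotations: dict = field(default_factory=dict)  # label -> value


@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (parent_id, child_id) pairs

    def add_node(self, node, parent_id=None):
        self.nodes[node.node_id] = node
        if parent_id is not None:
            self.edges.append((parent_id, node.node_id))

    def children(self, parent_id):
        return [self.nodes[c] for p, c in self.edges if p == parent_id]

    def values(self, label):
        """All annotation values stored under the given label."""
        return [n.annotations[label] for n in self.nodes.values()
                if label in n.annotations]


def from_tier_rows(rows):
    """Build a graph from (tier, start_ms, end_ms, value) tuples (tier-based source)."""
    graph = AnnotationGraph()
    for i, (tier, start, end, value) in enumerate(rows):
        graph.add_node(Node(f"n{i}", start, end, {tier: value}))
    return graph


def from_records(records):
    """Build a graph from dicts with one key per field (record-based source)."""
    graph = AnnotationGraph()
    for i, rec in enumerate(records):
        labels = {k: v for k, v in rec.items() if k not in ("start", "end")}
        graph.add_node(Node(f"n{i}", rec["start"], rec["end"], labels))
    return graph


# Two differently shaped, made-up sources.
tier_based = from_tier_rows([("utterance", 0, 1500, "hello there")])
tier_based.add_node(Node("w0", 0, 700, {"word": "hello"}), parent_id="n0")

record_based = from_records(
    [{"start": 0, "end": 900, "utterance": "good morning", "gloss": "greeting"}]
)

# The same query works on both graphs, regardless of the original format.
for graph in (tier_based, record_based):
    print(graph.values("utterance"))
```

Once both sources are converted, analysis code only needs to know the graph interface, not the original file formats.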

WebMAUS is a web service that enables precise automatic alignment of phones. WebMAUS can be used via its web interface as well as from within the annotation tools ELAN and EXMARaLDA.

→ More about »Analysing«

Providing Your Own Research Data

The CLARIN-D network not only provides tools for the analysis of language data but also offers the possibility to sustainably archive your own research data and make it available to the research community for reuse. In cooperation with a CLARIN centre, the data can be prepared so that it is sufficiently described by metadata. One tool for this purpose is CMDI-Maker, which can be used to create descriptions that allow the research community to easily find and access the data and results via dedicated search engines.

Would you like to provide your data through the CLARIN-D infrastructure? Contact a specialised centre or the CLARIN-D Helpdesk.

→ More about »Preparation and Depositing«

Use Case

One example of CLARIN-D tools in use is MultiCAST. The audiovisual data was transcribed, translated and annotated in ELAN. The annotation scheme developed for this project, GRAID, defines its own tier for the analysis of the data and thus makes use of the very flexible tier structure provided by ELAN. To archive and publish the corpus for the long term, metadata for the data was created with CMDI-Maker. Analysis of the MultiCAST GRAID annotations yielded important new insights into discourse and ergativity. The results of the analysis were published in Haig & Schnell (2016) and are publicly available in the Language Archive Cologne.

CLARIN Centres

The Hamburger Zentrum für Sprachkorpora (HZSK) has maintained Working Group 3 (WG3) »Linguistic Fieldwork, Ethnology, Language Typology« since October 2016. The HZSK offers help with accessing and analysing digital language resources, and with making language corpora sustainably available via the HZSK Repository. WG3 works closely with the HZSK on tools and workflows.

The CLARIN Centre at the Max Planck Institute for Psycholinguistics provides a number of central data sets, tools and services for WG3 via the Language Archive.

The CLARIN Knowledge Centre »Linguistic Diversity and Language Documentation« (CKLD), established in autumn 2017, is a collaboration of institutes of the Universities of London, Cologne and Hamburg. As a K-Centre, this institution offers supervision and support for research projects in the area of language documentation and language typology.  


  • Alexandre Arkhipov, Universität Hamburg, Institute for Finno-Ugrian Studies/Uralic Studies
  • Peter Bouda MA, Interdisciplinary Centre for Social and Language Documentation, Minde, Portugal
  • Dr. Michael Cysouw, Philipps-Universität Marburg, Forschungszentrum Deutscher Sprachatlas
  • PD Dr. Sebastian Drude, Vigdísarstofnun, Reykjavík
  • Dr. Volker Gast, Friedrich-Schiller-Universität Jena, Department of English and American Studies
  • Dr. Geoffrey Haig, Universität Bamberg, Institute for Oriental Studies
  • Dagmar Jung, Universität Zürich, ACQDIV Project
  • Dr. Johann-Mattis List, Max Planck Institute for the Science of Human History, Jena
  • Sebastian Nordhoff, Freie Universität Berlin, Arbeitsgruppe Deutsche Grammatik und Allgemeine Sprachwissenschaft
  • Kilu von Prince, Humboldt-Universität zu Berlin, Institut für deutsche Sprache und Linguistik
  • Michael Rießler, Universität Freiburg, Skandinavisches Seminar
  • Dr. Elena Skribnik, Ludwig-Maximilians-Universität München, Institut für Finnougristik / Uralistik
  • Sabine Stoll, Universität Zürich, Psycholinguistisches Laboratorium
  • Dr. Beáta Wagner-Nagy, Universität Hamburg, Institut für Finnougristik/Uralistik
  • Claudia Wegener, Universität zu Köln, Institut für Linguistik
  • Dr. Thomas Widlok, Universität zu Köln, Institut für Afrikanistik und Ägyptologie
  • Taras Zakharko MA, Universität Zürich, Institut für Vergleichende Sprachwissenschaft 
  • Nils Schiborr, Universität Bamberg, Institut für Orientalistik
  • Prof. Dr. Henning Schreiber, Asien-Afrika-Institut, Universität Hamburg

Chair and Contact

Resources from the Discipline for the Discipline

During the implementation phase of CLARIN-D, the working group identified important resources and tools, which were then developed further and prepared for reuse. Within CLARIN-D, these small projects are called curation projects.

Curation Projects