Exploring Language Learning with DAKODA

[10.04.2026]

How do people learn German? The CATALPA project DAKODA helps researchers find out more.

[Image: Five dice labeled A1 through C1; a finger tilts a sixth die, labeled C2, so that a German flag is visible.] Tests measure language proficiency. But how accurately do these tests reflect actual skill levels? The DAKODA project facilitates research on this topic.

Anyone coming from abroad to study in Germany must demonstrate proficiency in German. The standard requirement is at least level C1 of the Common European Framework of Reference for Languages (CEFR). In most cases, universities even require level C2 or a pass in the German Language Test for University Admission.

About the Project

DAKODA stands for Data Competencies in DaF/DaZ: Exploration of Language Technology Approaches for Analyzing L2 Acquisition Stages in Learner Corpora of German. DaF and DaZ are the German abbreviations for German as a foreign language and German as a second language; L2 refers to a second language. The project was funded by the Federal Ministry of Research, Technology, and Space. The University of Leipzig was a project partner. The project makes its data widely accessible, including to other researchers, and thereby creates an important foundation for basic research and learning engineering.

But how do you design such tests so that they truly reflect the language proficiency required at the university?

A Data Foundation for Further Research

Language researchers have been grappling with these types of questions for a long time. In order to answer them, scientists must take a closer look at language acquisition. "Until now, however, there has been no satisfactory data foundation for this," explains Prof. Dr. Torsten Zesch. "The DAKODA project has provided a solution to this problem."

Making Learner Corpora Usable for Research

The idea behind the project: Existing annotated collections of learner language, known as learner corpora, were consolidated and made accessible to researchers. "That wasn't exactly easy," says computational linguist Zesch. The corpora were very diverse. Sometimes they were essays; other times, they were fictional letters or descriptions of picture stories. Two corpora contained spoken texts, while the rest were written. The learners' language proficiency levels also varied greatly.

The goal was to enable analyses across the various corpora, regardless of what the individual texts are about. "The key here is how the verb is placed in learners' sentences because that reveals a lot about their language proficiency," says Zesch. According to the current state of research, there are various stages of language acquisition, from beginner sentences with typical errors ("Ich Schule gehen", roughly "I school go") to correct verb placement in subordinate clauses ("Bevor ich in die Schule gegangen bin, ...", "Before I went to school, ..."). "The verb plays a key role in this," says Zesch.

Interactive Dashboard Enables Searches

In addition to the technical challenges of computational linguistics, data protection issues also had to be resolved. Which existing corpora can be used legally, and which can be made available online? Some of the texts were written several decades ago, so the participants’ consent forms did not all meet today’s requirements.

However, after more than three years of work on the project, the DAKODA team, led by Zesch and Prof. Dr. Katrin Wisniewski from the University of Leipzig, is very satisfied with the result. A repository allows users to download corpora in various file formats. Via an interactive dashboard, researchers can analyze data across corpora, for example by language proficiency level or native language. Sample analyses using Jupyter Notebooks (a browser-based environment for data analysis) demonstrate how data from the corpora can be examined. "This is intended to facilitate access for researchers without in-depth IT knowledge," says Zesch.
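For readers curious what a cross-corpus analysis of this kind might look like in a notebook, here is a minimal Python sketch. It is not taken from the DAKODA notebooks: the field names (corpus, cefr, l1, verb_final_in_subclause) and all records are invented purely for illustration and do not reflect the project's actual data format or any real findings.

```python
from collections import defaultdict

# Toy records, invented for illustration only. The field names are
# assumptions and do not reflect the actual DAKODA data format.
records = [
    {"corpus": "corpus_a", "cefr": "A2", "l1": "Turkish", "verb_final_in_subclause": False},
    {"corpus": "corpus_a", "cefr": "A2", "l1": "Arabic", "verb_final_in_subclause": False},
    {"corpus": "corpus_b", "cefr": "A2", "l1": "Turkish", "verb_final_in_subclause": True},
    {"corpus": "corpus_b", "cefr": "B1", "l1": "Arabic", "verb_final_in_subclause": True},
    {"corpus": "corpus_a", "cefr": "B1", "l1": "Turkish", "verb_final_in_subclause": False},
    {"corpus": "corpus_b", "cefr": "C1", "l1": "Arabic", "verb_final_in_subclause": True},
    {"corpus": "corpus_a", "cefr": "C1", "l1": "Turkish", "verb_final_in_subclause": True},
]


def share_by(rows, key, feature):
    """Group rows by `key` and return the share of rows per group
    in which the boolean `feature` is true."""
    counts = defaultdict(lambda: [0, 0])  # group -> [hits, total]
    for row in rows:
        bucket = counts[row[key]]
        bucket[0] += bool(row[feature])
        bucket[1] += 1
    return {group: hits / total for group, (hits, total) in counts.items()}


# Share of sentences with correct verb-final subordinate clauses,
# pooled across both toy corpora and grouped by proficiency level.
shares = share_by(records, "cefr", "verb_final_in_subclause")
for level in sorted(shares):
    print(f"{level}: {shares[level]:.2f}")
```

Passing "l1" instead of "cefr" as the grouping key would break the same records down by native language, the other example mentioned above.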

Presentation at Methodology Fair

Dr. Josef Ruppenhofer (right) presented the project at a methodology fair in Mannheim.

To make it easier for researchers without in-depth IT knowledge to access the project's resources, project staff member Dr. Josef Ruppenhofer offered various workshops and also presented the project at conferences. Most recently, he introduced DAKODA at the "Deutsch im europäischen Sprachraum" methodology fair in Mannheim, which was organized by the Leibniz Institute for the German Language. There, he got many attendees interested in working with the corpora. "DAKODA opens up new possibilities for researchers to gain fundamental insights into language acquisition," says Zesch. "We are pleased that we at CATALPA were able to contribute to this."

Christina Lüdeke | 18.06.2026