Challenges of Tagging and Parsing Middle Low German

Melissa Farasyn, Mariya Koleva & Elisabeth Witzenhausen, Universiteit Gent

Middle Low German (MLG) is a cover term for a group of related dialects spoken in northern Germany from about 1250 until 1600. The language was only partly standardized. This poster reports on the construction of a tagged and parsed corpus of Historical Low German. The corpus will contain Old Saxon and MLG; work currently focuses on the MLG part. The corpus is balanced with respect to genre, period and scribal language. Furthermore, all texts are dated, localized, untranslated and in prose, in order to facilitate large-scale diachronic and diatopic research into the under-researched syntax of MLG. Syntactic annotation is the feature that sets the corpus apart from other initiatives working on digital corpora of historical German varieties. As the language data is highly heterogeneous and taggers usually rely on normalized spelling, constructing the corpus through POS and morphology tagging and parsing raises various challenges, for which we will present our solutions. The poster will also address the automatic POS and morphology tagging and the shallow parsing, which use a supervised machine learning algorithm to classify the data prior to syntactic annotation.