Data Mining for Large-Scale Corpus Linguistics

Christian Poelitz, Technische Universitaet Dortmund

Large digital corpora of written language, such as those held by the CLARIN-D centers, offer excellent opportunities for linguistic research on authentic language data. Their size allows for remarkable insights into how notable language usage phenomena are distributed over time and across domains. Despite these advances, the large number of hits that can be retrieved from corpora often poses challenges in concrete linguistic research settings. This is particularly the case if the queried word forms or constructions are (semantically) ambiguous. Besides large text corpora of reference texts and documents, additional resources such as dictionaries, statistics, temporal information, and WordNets are available. For the digital humanities, methods that automatically extract semantic concepts from text corpora in the context of available language resources are helpful tools for understanding large text collections. We present a toolbox for performing variety-linguistic and diachronic-linguistic tasks on heterogeneous language resources. Based on corpora and language resources provided by the Dictionary of the German Language, we present studies of word meanings over time and across document genres. We demonstrate how we use topic models with additional temporal and text class information to perform linguistic tasks on large text corpora. We discuss ways to evaluate topic models for corpus linguistic tasks and present possible qualitative and quantitative evaluation methods.