Document Type : Research Paper
Authors
1 Institute for Humanities and Cultural Studies
2 Allameh Tabataba'i University
Abstract
Dialectology has long been a major research topic, aiming to determine the geographical distribution of dialects and to classify them in order to develop dialect atlases. In this field, computers can be used to store and organize information, to find similarities between dialects, to visualize isoglosses on a geographic map, and so on. What these studies have in common is that they do not use full texts; they are mostly limited to questionnaires containing a few key words or phrases.
Another useful application of computers in dialectology is processing dialect data and annotating it automatically. In the current research, in addition to preparing a dialect corpus containing full texts for the Gilaki dialect, we build a language model that annotates the data at two levels: part-of-speech tags and lemmas. Since no annotated training data exists for building a Gilaki language model, we manually annotate the developed corpus and then create a statistical language model from it. To assess the quality of the developed model, the available data is divided into training and test sets, and the model is evaluated using 5-fold cross-validation. According to the experimental results, the models for lemmatization and part-of-speech tagging of the Gilaki dialect achieve scores of 91.20% and 90.79%, respectively.
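The evaluation procedure described above can be illustrated with a minimal sketch. It assumes a corpus stored as lists of (token, label) pairs and uses a simple most-frequent-label baseline in place of the statistical model, whose details the abstract does not give. All function names, the toy tokens, and the labels are hypothetical placeholders, not Gilaki data or the authors' implementation. The same loop would apply to lemmatization by using the lemma as the label.

```python
from collections import Counter, defaultdict


def k_fold_splits(items, k=5):
    """Yield (train, test) pairs; fold i holds out every k-th item starting at i."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def train_baseline(sentences):
    """Learn the most frequent label per token, plus a fallback for unseen tokens."""
    per_token = defaultdict(Counter)
    overall = Counter()
    for sentence in sentences:
        for token, label in sentence:
            per_token[token][label] += 1
            overall[label] += 1
    lookup = {tok: counts.most_common(1)[0][0] for tok, counts in per_token.items()}
    default = overall.most_common(1)[0][0]
    return lookup, default


def accuracy(model, sentences):
    """Fraction of tokens whose predicted label matches the gold label."""
    lookup, default = model
    correct = total = 0
    for sentence in sentences:
        for token, gold in sentence:
            correct += lookup.get(token, default) == gold
            total += 1
    return correct / total if total else 0.0


def cross_validate(sentences, k=5):
    """Train on k-1 folds, test on the held-out fold, and average the k scores."""
    scores = [
        accuracy(train_baseline(train), test)
        for train, test in k_fold_splits(sentences, k)
    ]
    return sum(scores) / len(scores), scores


# Toy corpus with placeholder tokens (hypothetical, not real dialect data).
TOY_CORPUS = [[("a", "DET"), ("b", "NOUN"), ("c", "VERB")] for _ in range(10)]

if __name__ == "__main__":
    mean_score, fold_scores = cross_validate(TOY_CORPUS, k=5)
    print(f"mean accuracy over {len(fold_scores)} folds: {mean_score:.4f}")
```

Averaging over the folds means every annotated sentence is used for testing exactly once, which matters when the manually annotated corpus is small.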
Keywords
- computational dialectology
- natural language processing
- the Gilaki dialect
- corpus
- automatic annotation
Main Subjects