Studying the Gilaki Dialect from the Computational Dialectology Perspective: Developing a Lemmatized and Part-of-Speech Tagged Corpus

Ghayoomi, Masood; Mohseni Khorrami, Shayan; BakhshiZadeh Gashti, Atena

doi:10.22054/ls.2024.79598.1655

Articles in Press

Document Type : Research Paper

Authors

¹ Institute for Humanities and Cultural Studies

² Allameh Tabataba'i University

https://doi.org/10.22054/ls.2024.79598.1655

Abstract

Dialectology has long been a major research topic, aiming to determine the geographical distribution of dialects and to classify them in order to develop dialect atlases. In this field, computers can be used to store and organize information, to find similarities between dialects, to visualize isoglosses on a geographic map, and the like. What these studies have in common is that they do not use full texts and are mostly limited to questionnaires containing a few key words or phrases.
Another useful application of computers in dialectology is processing dialectal data and annotating it automatically. In the current research, in addition to preparing a dialectal corpus of full texts for the Gilaki dialect, we develop a language model that annotates the data at two levels, namely part-of-speech tagging and lemmatization. Since no annotated training data exists for building a Gilaki language model, we manually annotate the developed corpus and then create a statistical language model. To assess the quality of the developed model, the available data is divided into training and test sets, and the model is evaluated using 5-fold cross-validation. According to the experimental results, the performance of the models for lemmatization and part-of-speech tagging of the Gilaki dialect is 91.20% and 90.79%, respectively.
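
The evaluation procedure described above, read here as k-fold cross-validation with k = 5, can be illustrated with a minimal sketch. The abstract does not specify the statistical model or the metric, so the code below makes its own assumptions: it swaps in a simple most-frequent-label baseline in place of the authors' model, it scores token-level accuracy, and the function names and the placeholder tokens (`tok_a`, `tok_b`, ...) are hypothetical and are not Gilaki data.

```python
from collections import Counter, defaultdict


def k_fold_splits(items, k):
    """Yield k (train, test) pairs; fold i tests on every k-th item starting at i."""
    for i in range(k):
        test = [x for j, x in enumerate(items) if j % k == i]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def train_baseline(sentences):
    """Assumed stand-in model: map each token to its most frequent label.

    Returns (lookup, fallback), where fallback is the most frequent label
    overall and is used for tokens never seen in training.
    """
    per_token = defaultdict(Counter)
    overall = Counter()
    for sentence in sentences:
        for token, label in sentence:
            per_token[token][label] += 1
            overall[label] += 1
    lookup = {tok: counts.most_common(1)[0][0] for tok, counts in per_token.items()}
    fallback = overall.most_common(1)[0][0]
    return lookup, fallback


def accuracy(model, sentences):
    """Token-level accuracy of the baseline on held-out sentences."""
    lookup, fallback = model
    correct = total = 0
    for sentence in sentences:
        for token, gold in sentence:
            correct += lookup.get(token, fallback) == gold
            total += 1
    return correct / total


def cross_validate(sentences, k=5):
    """Train on k-1 folds, test on the remaining fold, and average the k scores."""
    scores = [
        accuracy(train_baseline(train), test)
        for train, test in k_fold_splits(sentences, k)
    ]
    return sum(scores) / len(scores)


# Placeholder corpus: each sentence is a list of (token, label) pairs.
toy_corpus = [
    [("tok_a", "NOUN"), ("tok_b", "VERB")],
    [("tok_c", "ADJ"), ("tok_a", "NOUN"), ("tok_b", "VERB")],
    [("tok_a", "NOUN"), ("tok_d", "VERB")],
    [("tok_c", "ADJ"), ("tok_a", "NOUN")],
    [("tok_b", "VERB"), ("tok_a", "NOUN")],
]

if __name__ == "__main__":
    print(f"mean accuracy over 5 folds: {cross_validate(toy_corpus, k=5):.4f}")
```

The same loop serves both annotation levels: with part-of-speech tags as labels it scores tagging, and with lemmas as labels it scores lemmatization.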

Main Subjects

Computational Linguistics

Language Science Studies

Articles in Press, Accepted Manuscript
Available Online from 10 September 2024