Document Type: Research Paper
Authors
1 Institute for Humanities and Cultural Studies
2 Language and Computer Group, Academy of Persian Language and Literature, Tehran, Iran
Abstract
Advances in information technology and its integration with natural language have made it possible to understand and generate linguistic content algorithmically. Because different research groups use different data and processing algorithms, the results obtained from their tools are not comparable. To address this shortcoming, this article develops a benchmark corpus, based on the approved Persian Orthography Rule, for evaluating the performance of Persian editing tools.
The approved Persian Orthography Rule is first examined, and its rules are divided into eight main categories. A dataset of about 98,000 words is then collected in two genres, scientific and news, and checked against the rules. A portion of the words in this dataset is annotated according to the eight categories, and the evaluation of the tools is limited to these annotated words. The performance of five editing tools, namely ViraVirast, FarsiYar, Virastman, Paknevis, and Gagool, is compared. Comparing the output of the tools with the gold-standard data shows that, measured against the approved Persian Orthography Rule, Paknevis performs best, with an average of 74.07% across the scientific and news genres, and Gagool performs worst, with an average of 21.83%. Finally, the tools are compared with each other in each of the eight categories to determine their strengths and weaknesses.
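The evaluation described above, in which only annotated words are scored against the gold standard and results are then averaged over genres, might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the exact metric, so the sketch assumes simple accuracy over annotated tokens, a one-to-one token alignment between gold and tool output, and an unweighted mean over the two genres. The function names, category labels, and toy tokens are all hypothetical.

```python
from collections import defaultdict


def score_tool(gold, tool_output, marked):
    """Accuracy (in percent) over annotated positions only.

    gold, tool_output: token lists assumed to be aligned one-to-one.
    marked: dict mapping a token index to its (hypothetical) category label.
    Returns (overall_accuracy, {category: accuracy}).
    """
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for idx, cat in marked.items():
        per_cat[cat][1] += 1
        if tool_output[idx] == gold[idx]:
            per_cat[cat][0] += 1
    correct = sum(c for c, _ in per_cat.values())
    total = sum(t for _, t in per_cat.values())
    overall = 100.0 * correct / total if total else 0.0
    by_cat = {cat: 100.0 * c / t for cat, (c, t) in per_cat.items()}
    return overall, by_cat


def genre_average(scores):
    """Unweighted mean over genres (an assumption, not the paper's stated method)."""
    return sum(scores) / len(scores)


# Toy data: \u200c is the zero-width non-joiner used in Persian orthography.
gold = ["می\u200cرود", "کتاب\u200cها", "خانه"]
tool = ["می\u200cرود", "کتابها", "خانه"]
marked = {0: "prefix_spacing", 1: "plural_suffix"}  # index 2 is not annotated

overall, by_cat = score_tool(gold, tool, marked)
print(overall, by_cat)
```

In this toy case the unannotated third token does not affect the score, which mirrors the restriction of the evaluation to annotated words; the per-category dictionary corresponds to the category-level comparison of the tools.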
Keywords
Main Subjects