Comparative analysis of editorial tools using a developed benchmark corpus based on the approved Persian Orthography Rule

Ghayoomi, Masood; Nezami, Lili

doi:10.22054/ls.2025.88342.1747

Articles in Press

Document Type : Research Paper

Authors

Masood Ghayoomi ¹
Lili Nezami ²

¹ Institute for Humanities and Cultural Studies

² Language and Computer Group, Academy of Persian Language and Literature, Tehran, Iran

Abstract

Advances in information technology and its integration with natural language have made it possible to understand and generate linguistic content algorithmically. Because different research groups use different data and processing algorithms, the results obtained from their tools are not comparable. To address this shortcoming, this article develops a benchmark corpus, based on the approved Persian Orthography Rule, for evaluating the performance of Persian editing tools.
In this study, the approved Persian Orthography Rule is first examined and its rules are divided into eight main categories. A dataset of about 98,000 words in two genres, scientific and news, is then collected and checked against the rules. A fraction of the words in this dataset is marked according to the eight categories, and the evaluation of the tools is limited to these marked words. The performance of five editing tools, namely ViraVirast, FarsiYar, Virastman, Paknevis, and Gagool, is compared. Measured against the gold standard data, Paknevis has the highest performance, with an average of 74.07% across the scientific and news genres, and Gagool has the lowest, with an average of 21.83%, based on the approved Persian Orthography Rule. Finally, the tools are compared with one another in each of the eight categories to determine their strengths and weaknesses.
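The evaluation described in the abstract, scoring only the marked words and averaging over the two genres, can be illustrated with a minimal sketch. The abstract does not state the exact metric or data format, so everything here is an assumption made for illustration: the function names, the token-aligned input, exact-match accuracy as the score, and an unweighted mean over genres.

```python
from collections import defaultdict

# Assumed data layout (not from the paper): the gold standard is a list of
# (gold_form, category) pairs, where category is None for unmarked words,
# and a tool's output is a list of word forms aligned one-to-one with it.

def overall_accuracy(gold, predicted):
    """Exact-match accuracy computed over marked words only."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be token-aligned")
    marked = [(form, out) for (form, cat), out in zip(gold, predicted)
              if cat is not None]
    if not marked:
        return 0.0
    return sum(form == out for form, out in marked) / len(marked)

def category_accuracy(gold, predicted):
    """Exact-match accuracy per rule category, ignoring unmarked words."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be token-aligned")
    correct = defaultdict(int)
    total = defaultdict(int)
    for (form, cat), out in zip(gold, predicted):
        if cat is None:
            continue  # unmarked words do not count toward the score
        total[cat] += 1
        if form == out:
            correct[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

def genre_average(scores_by_genre):
    """Unweighted mean of per-genre scores (an assumed averaging scheme)."""
    return sum(scores_by_genre.values()) / len(scores_by_genre)
```

Under these assumptions, a tool's overall score would be `genre_average({"scientific": s, "news": n})`, where `s` and `n` come from `overall_accuracy` on each genre, and `category_accuracy` would give the per-category breakdown used to compare strengths and weaknesses.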

Main Subjects

Computational Linguistics

Language Science Studies

Articles in Press, Accepted Manuscript
Available Online from 11 December 2025
