ON THE ACCURACY AND COMPLEXITY OF THE MULTI-STAGE METHOD FOR CORRECTING DISTORTED TEXTS DEPENDING ON THE DEGREE OF DISTORTION

  • D.V. Vakhlakov, FGUP “NTC “Orion””
  • V.A. Peresypkin, FGUP “NTC “Orion””
  • A.V. Germanovich, Moscow State University, Institute of Asian and African Studies
  • S.Y. Melnikov, OOO “Lingvisticheskie I informatsionye tehnologii”
  • N.N. Copkalo, Southern Federal University

Keywords: multi-stage method of text correction, language model, Levenshtein distance, recall and precision of correction, F1-measure, WER, CER, linguistic experts

Abstract

One of the main factors that significantly complicate the understanding, translation and
analysis of texts obtained by automatic recognition of speech or images of texts is the presence of
distortions in the form of erroneous symbols, words and phrases. Until recently, there were no
effective software tools for correcting heavily distorted texts, although the task is relevant
both for Russian and for other widely used languages, given the active use of recognition
systems in advanced augmented reality systems. The authors have proposed a new multi-stage method
for correcting distorted texts, which significantly increases the accuracy of the correction (in
terms of the number of correctly corrected words in the text) and is based on the sequential detection
of errors and their correction. In this paper, we evaluate the accuracy and computational
complexity of the proposed method for correcting distorted texts at various levels of distortion, and
determine its place among other modern approaches to correction. The most typical errors of
recognition systems are: replacing a word with one that sounds or looks similar; replacing
several words with one; replacing one word with several; omitting words; and inserting or
deleting short words (including prepositions and conjunctions). Recognition therefore yields a
distorted text that consists mainly of dictionary words, even where it is distorted.
With a large number of distortions, the texts become almost unreadable. Because it is
difficult to collect enough texts covering a wide range of distortion levels from the
results of real machine recognition of speech and text images, distortions were modeled in
software. A text distortion technique that simulates the output of recognition systems over
a wide range of distortion levels was proposed and implemented, and the required number of
distorted texts was prepared. In the proposed multi-stage correction method, a word is
considered distorted if it is a non-dictionary word form or if its probability of occurrence in
the text under the chosen language model is below a given threshold. For each such
distorted word, a list of candidate variants is built that includes only those dictionary word
forms that lie at a specified Levenshtein distance from the word under consideration. The corrected
text is obtained from the tables of word variants by searching for the most probable chain
of word forms. The correction method consists of several stages; at each stage, only those fragments
of the text that remain distorted after the previous stage are corrected. In experiments on
correcting distorted texts, the proposed method achieved an average F-measure above 50% over the
distortion range from 0 to 75%. Linguistic experts confirmed the usefulness of the proposed approach
to correction and preferred it to other modern approaches, noting that at distortion levels
of up to 50% of words the corrected text is read with much less effort than the distorted one,
and at distortion levels of up to 70% of words the corrected text still allows useful
information about the content to be extracted.
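
The detection, candidate-generation and chain-search steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy corpus, the add-one-smoothed bigram model, the relative-frequency test for "low probability", the reading of "at a certain Levenshtein distance" as "within a maximum distance", and the staging by a growing distance limit are all assumptions made for illustration. The names `train_bigram`, `best_chain` and `correct` are hypothetical.

```python
import math
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def train_bigram(sentences):
    """Count unigrams and bigrams; "<s>" marks the sentence start (assumed toy model)."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_logp(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def best_chain(lattice, unigrams, bigrams, vocab_size):
    """Viterbi search for the most probable chain, one candidate per position."""
    scores = {"<s>": (0.0, [])}  # last word -> (log-probability, chain so far)
    for candidates in lattice:
        scores = {
            w: max(
                (s + bigram_logp(p, w, unigrams, bigrams, vocab_size), chain + [w])
                for p, (s, chain) in scores.items()
            )
            for w in candidates
        }
    return max(scores.values())[1]

def correct(text, unigrams, bigrams, min_prob=0.01, stage_distances=(1, 2)):
    """Multi-stage correction: each stage revisits only words still flagged as distorted.

    Assumption: stages differ only in the allowed Levenshtein distance.
    """
    dictionary = sorted(w for w in unigrams if w != "<s>")
    known = set(dictionary)
    total = sum(unigrams[w] for w in dictionary)
    words = text.split()
    for max_dist in stage_distances:
        lattice = []
        for w in words:
            distorted = w not in known or unigrams[w] / total < min_prob
            if distorted:
                # Keep the word itself if no dictionary form is close enough.
                cands = [d for d in dictionary if levenshtein(w, d) <= max_dist] or [w]
            else:
                cands = [w]
            lattice.append(cands)
        words = best_chain(lattice, unigrams, bigrams, len(dictionary))
    return " ".join(words)

unigrams, bigrams = train_bigram([
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat sat on the rug",
])
print(correct("the xat sat on the mat", unigrams, bigrams))  # the cat sat on the mat
print(correct("the dxx sat on the rug", unigrams, bigrams))  # the dog sat on the rug
```

In the first call, "xat" has three candidates at distance 1 ("cat", "mat", "sat"), and the chain search picks "cat" from context. In the second, "dxx" has no candidate at distance 1, so it stays flagged and is resolved only in the second stage, where the distance limit is 2.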

Published
2022-03-02
Section
SECTION III. INFORMATION PROCESSING IN DISTRIBUTED, RECONFIGURABLE AND NEURAL NE