MULTI-PASS METHOD FOR AUTOMATIC CORRECTION OF DISTORTED TEXTS

D.V. Vakhlakov; V.A. Peresypkin; S.Y. Melnikov

Keywords: Language model, automatic text correction, distorted text, noisy text, F1-measure, Levenshtein distance

Abstract

One of the main factors that significantly complicate the understanding, translation, and
analysis of texts obtained by automatic speech recognition or optical recognition of text images
is the distortion they contain in the form of erroneous characters, words, and phrases.
The most typical errors of recognition systems are: replacement of a word with one that sounds
or looks similar; replacement of several words with one; replacement of one word with several;
omission of words; and insertion or deletion of short words (including prepositions and conjunctions).
Recognition therefore yields a distorted text that consists mainly of dictionary words,
even at the points of distortion. When the amount of distortion is large, the texts become
almost unreadable. Automatic processing of such texts is very difficult, although the task is
relevant both for Russian and for other widely used languages. Correction software that works well at
low distortion levels shows unsatisfactory results on highly distorted texts, regardless of their
origin. This makes it necessary to develop dedicated approaches to
correcting distorted texts. A new multi-pass method for correcting distorted texts, based on sequential
error identification and correction, is proposed. Non-dictionary word
forms, and word forms whose probability of occurrence in the text under the selected
probabilistic model is below a preset threshold, are considered distorted. After individual words
are marked as distorted, the mark is propagated to their combinations, i.e. distorted text
fragments are extracted. For these fragments, a list of possible word variants is built; it includes
only those word forms from the dictionary that are located at a certain Levenshtein distance from
the word under study. The corrected text is obtained from the word variants by searching for the most
probable chain of word forms. The correction method consists of several passes; at each pass, only those
fragments of the text that remained distorted after the previous pass are corrected.
The method significantly improves the quality (accuracy) of correction. In the experiments
conducted, the quality of correction measured by the F1-measure increased by 9% for moderately
distorted texts and by 7.7% for highly distorted texts.
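
The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the
authors' implementation. The bigram language model, the toy vocabulary, the function names, the
"within max_dist" reading of the Levenshtein criterion, and the rule of widening the distance by
one at each pass are all assumptions made for illustration; the abstract does not specify them.
Propagation of the distortion mark to word combinations is only approximated here: a word following
a distorted word is flagged because its bigram probability falls below the threshold.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

def is_distorted(word, prev_word, vocab, bigram_prob, threshold):
    """Flag non-dictionary words and words whose probability is below the threshold."""
    return word not in vocab or bigram_prob(prev_word, word) < threshold

def candidates(word, vocab, max_dist):
    """Dictionary word forms within max_dist edits of word (assumed criterion)."""
    return sorted(w for w in vocab if levenshtein(word, w) <= max_dist)

def best_chain(options, bigram_prob, start="<s>"):
    """Viterbi search for the most probable chain, one word per position."""
    scores = {start: (0.0, [])}  # last word -> (log-probability, chain so far)
    for opts in options:
        scores = {
            w: max(((lp + math.log(bigram_prob(p, w)), chain + [w])
                    for p, (lp, chain) in scores.items()),
                   key=lambda t: t[0])
            for w in opts
        }
    return max(scores.values(), key=lambda t: t[0])[1]

def correct(words, vocab, bigram_prob, threshold, max_passes=3):
    """Multi-pass correction: each pass revisits only words still flagged as distorted."""
    words = list(words)
    for n in range(1, max_passes + 1):  # assumption: search radius grows with the pass number
        flags = [is_distorted(w, p, vocab, bigram_prob, threshold)
                 for p, w in zip(["<s>"] + words, words)]
        if not any(flags):
            break
        options = [(candidates(w, vocab, n) or [w]) if f else [w]
                   for w, f in zip(words, flags)]
        words = best_chain(options, bigram_prob)
    return words

# Hypothetical toy model: known bigrams get probability 0.5, all others 0.01.
VOCAB = {"the", "cat", "sat", "on", "mat"}
BIGRAMS = {("<s>", "the"), ("the", "cat"), ("cat", "sat"),
           ("sat", "on"), ("on", "the"), ("the", "mat")}

def toy_bigram_prob(prev, word):
    return 0.5 if (prev, word) in BIGRAMS else 0.01

print(correct(["the", "cot", "sat"], VOCAB, toy_bigram_prob, threshold=0.05))
# prints ['the', 'cat', 'sat']
```

The probability function must return strictly positive values (for example, through smoothing),
since the chain search sums log-probabilities.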
