ON THE EFFICIENCY OF THE NOISY TEXT CORRECTION SOFTWARE DEPENDING ON THE DISTORTION TYPE

D.A. Birin; V.A. Peresypkin; S.Y. Melnikov; I.A. Pisarev; N.N. Copkalo

D.A. Birin FGUP NII «Kvant»
V.A. Peresypkin FGUP «NTC "Orion"»
S.Y. Melnikov OOO «Lingvisticheskie I informatsionye tehnologii»
I.A. Pisarev Southern Federal University
N.N. Copkalo Southern Federal University

Keywords: Noisy texts, random distortions, automatic correction, post-processing

Abstract

The capabilities of four automatic text correction tools (Yandex.Speller, Afterscan, Bing Spell Check, and Texterra) for correcting noisy texts are analyzed. The distortions that arise when text is typed on a keyboard and when recognition systems operate are described. Experimental data are presented on the accuracy of correcting distorted texts obtained both by typing and as the output of real OCR systems processing low-quality images and of ASR systems operating in a noisy environment. To simulate the distortions caused by recognition systems, a two-stage model of random text distortion is proposed. At the first stage (word distortion), each word is distorted with a given probability: it is replaced with a random dictionary word at Levenshtein distance 1 or 2 from it. The replacement word is chosen according to the uniform distribution. At the second stage (character distortion), each character is distorted with a given probability: it is deleted with probability 1/3, or a random character is inserted before it with probability 1/3, or it is replaced with a random character of the alphabet with probability 1/3. The replacement character is chosen according to the uniform distribution. The distorted texts obtained in this way are corrected using Yandex.Speller and Bing Spell Check, and the percentage of correct words in the corrected text is calculated. The data are averaged over a set of texts. Correction accuracy is reported for the following parameter range: word distortion probabilities from 0 to 0.9 and character distortion probabilities from 0 to 0.5. The results show that Yandex.Speller, Bing Spell Check, and Texterra correct typing distortions well. These tools are ineffective at correcting distortions caused by recognition systems.
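The two-stage distortion model and the word-level accuracy measure described above can be sketched in Python as follows. This is only an illustrative reading of the abstract, not the authors' implementation. The function names, the Latin `ALPHABET`, the choice to apply character distortions inside words only, the fallback when a word has no dictionary neighbour, and the position-by-position comparison of words are all assumptions; the abstract does not specify them.

```python
import random

# Assumption: the alphabet used in the paper is not given; lowercase Latin is a stand-in.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def distort_words(words, dictionary, p_word, rng):
    """Stage 1: with probability p_word, replace a word with a uniformly chosen
    dictionary word at Levenshtein distance 1 or 2 from it."""
    out = []
    for w in words:
        if rng.random() < p_word:
            candidates = [d for d in dictionary if 1 <= levenshtein(w, d) <= 2]
            if candidates:  # assumption: leave the word unchanged if no neighbour exists
                w = rng.choice(candidates)
        out.append(w)
    return out


def distort_chars(word, p_char, rng):
    """Stage 2: with probability p_char, distort a character by deleting it,
    inserting a random character before it, or replacing it (1/3 each)."""
    out = []
    for ch in word:
        if rng.random() < p_char:
            op = rng.choice(("delete", "insert", "replace"))
            if op == "delete":
                continue
            if op == "insert":
                out.append(rng.choice(ALPHABET))
                out.append(ch)
            else:
                # Assumption: the random replacement may coincide with the original.
                out.append(rng.choice(ALPHABET))
        else:
            out.append(ch)
    return "".join(out)


def distort_text(words, dictionary, p_word, p_char, seed=0):
    """Apply stage 1 to the word list, then stage 2 to every resulting word."""
    rng = random.Random(seed)
    stage1 = distort_words(words, dictionary, p_word, rng)
    return [distort_chars(w, p_char, rng) for w in stage1]


def word_accuracy(reference, corrected):
    """Percentage of positions where the corrected word equals the reference word.
    Assumption: both lists are aligned word by word."""
    matches = sum(r == c for r, c in zip(reference, corrected))
    return 100.0 * matches / len(reference)
```

In an experiment of the kind described, the output of `distort_text` would be passed to a correction tool and `word_accuracy` computed against the original words, then averaged over several texts for each pair of probabilities.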
