One of the main factors that significantly complicate the understanding, translation and
analysis of texts obtained by automatic speech recognition or optical recognition of text images
is the distortion they contain in the form of erroneous characters, words and phrases.
The most typical errors of recognition systems are: – replacement of a word with one that sounds
or looks similar; – replacement of several words with one; – replacement of one word with several;
– omission of words; – insertion or deletion of short words (including prepositions and conjunctions).
The text produced by recognition contains distortions but consists mainly of dictionary
words, even at the points of distortion. When the amount of distortion is large, such texts become
almost unreadable. Automatic processing of such texts is very difficult, although this task is
relevant both for Russian and for other widely used languages. Correction software that works well
at low distortion levels shows unsatisfactory results on highly distorted texts, regardless of their
origin. This makes it necessary to develop dedicated approaches to
correcting distorted texts. A new multi-pass method for correcting distorted texts, based on
sequential error identification and correction, is proposed. Non-dictionary word
forms and word forms whose probability of occurrence in the text, according to the selected
probabilistic model, is below a preset threshold are considered distorted. After individual
words are marked as distorted, the mark is propagated to their combinations, i.e. distorted text
fragments are extracted. For the words of these fragments, a list of possible variants is built;
it includes only those dictionary word forms that lie at a certain Levenshtein distance from the
word under study. The corrected text is obtained from these variants by searching for the most
probable chain of word forms. The correction method consists of several passes; at each pass, only
those fragments of the text that remained distorted after the previous pass are corrected.
The method significantly increases the quality (accuracy) of correction. In the experiments
carried out, the quality of correction in terms of the F1-measure increased by 9% for moderately
distorted texts and by 7.7% for highly distorted texts.
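
The steps described above can be illustrated with a minimal Python sketch. It is not the
authors' implementation, and everything specific in it is an assumption made for illustration:
the probabilistic model is taken to be an add-one-smoothed bigram model, the threshold value is
arbitrary, distorted fragments are taken to be contiguous runs of marked words, the allowed
Levenshtein distance is taken to grow by one at each pass, and only one-to-one word replacements
are handled (merged, split, inserted and skipped words are not). All names (BigramModel, detect,
candidates, best_chain, correct) are hypothetical.

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BigramModel:
    """Assumed probabilistic model: bigrams with add-one smoothing."""
    def __init__(self, sentences):
        self.uni, self.bi = Counter(), Counter()
        for s in sentences:
            toks = ["<s>"] + list(s) + ["</s>"]
            self.uni.update(toks[:-1])
            self.bi.update(zip(toks, toks[1:]))
        self.vocab = set(self.uni) - {"<s>"}   # the "dictionary"
        self.v = len(self.vocab) + 1           # possible next tokens, incl. </s>

    def logp(self, prev, word):
        return math.log((self.bi[(prev, word)] + 1) / (self.uni[prev] + self.v))

def detect(model, tokens, threshold):
    """Mark non-dictionary words and words whose probability is below the threshold."""
    return [w not in model.vocab or model.logp(p, w) < threshold
            for p, w in zip(["<s>"] + tokens, tokens)]

def fragments(flags):
    """Group adjacent marked positions into (start, end) runs, end exclusive."""
    runs, start = [], None
    for i, f in enumerate(flags + [False]):
        if f and start is None:
            start = i
        elif not f and start is not None:
            runs.append((start, i))
            start = None
    return runs

def candidates(model, word, max_dist):
    """Dictionary word forms within max_dist edits; fall back to the word itself."""
    found = sorted(w for w in model.vocab if levenshtein(word, w) <= max_dist)
    return found or [word]

def best_chain(model, left, cands_per_pos, right):
    """Viterbi search for the most probable chain of candidates between two context words."""
    scores = {c: (model.logp(left, c), [c]) for c in cands_per_pos[0]}
    for cands in cands_per_pos[1:]:
        new = {}
        for c in cands:
            s, path = max(((s + model.logp(p, c), path) for p, (s, path) in scores.items()),
                          key=lambda t: t[0])
            new[c] = (s, path + [c])
        scores = new
    final = [(s + model.logp(c, right), path) for c, (s, path) in scores.items()]
    return max(final, key=lambda t: t[0])[1]

def correct(model, tokens, threshold=math.log(0.05), passes=3):
    """Multi-pass correction: each pass revisits only what is still marked as distorted."""
    tokens = list(tokens)
    allowed = set(range(len(tokens)))
    for k in range(1, passes + 1):             # assumption: distance limit grows per pass
        flags = [i in allowed and f
                 for i, f in enumerate(detect(model, tokens, threshold))]
        if not any(flags):
            break
        for start, end in fragments(flags):
            left = tokens[start - 1] if start else "<s>"
            right = tokens[end] if end < len(tokens) else "</s>"
            cands = [candidates(model, w, k) for w in tokens[start:end]]
            tokens[start:end] = best_chain(model, left, cands, right)
        allowed = {i for i, f in enumerate(flags) if f}
    return tokens

# Toy training corpus, for demonstration only.
corpus = ["the cat sat on the mat".split(), "the dog sat on the rug".split()]
model = BigramModel(corpus)
print(correct(model, "the cxt sat on the mat".split()))
```

In this sketch a word that has no dictionary candidate within the current distance limit is left
unchanged and stays marked, so a later pass with a larger limit can still correct it; that is one
simple way to realise the idea of revisiting only the fragments that remained distorted.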
