APPLICATION OF EXACT AND LIMIT APPROXIMATIONS OF STATISTICS PROBABILITY DISTRIBUTIONS FOR THE PROBLEM OF TEXT PROCESSING

  • A.K. Melnikov, STC CLSC «InformInvestGroup»
Keywords: Probability, test statistics, standard distribution, exact distribution, limit distribution, processing efficiency, relative efficiency of distribution, computational complexity of method, performance of multiprocessor computer system

Abstract

In this paper we consider the application of limit and exact approximations of the probability distributions of statistics to the problem of selecting texts with specific statistical properties. To select texts whose symbols are equiprobably distributed, we use a statistical goodness-of-fit criterion. As the standard distribution of the test statistic we use various approximations of it: limit distributions serve as the limit approximations, and Δ-exact distributions serve as the exact approximations. A Δ-exact distribution differs from the exact distribution by no more than a specified Δ. We present calculated Δ-exact distributions and show how they deviate from the values of the limit distributions for different statistics. We consider the notion of processing efficiency for the selection of equiprobable texts, which reflects the share of wrongly selected texts. We compare the processing efficiency obtained with exact and limit approximations of the standard distributions of test statistics. We prove that the processing efficiency does not decrease, and in many cases increases, when the exact approximation is used instead of the limit one. To compare statistical criteria that are based on the same test statistic but on different standard distributions, we introduce the concept of the relative efficiency of a distribution, which shows by what factor the number of wrongly selected texts increases when one distribution rather than another is used as the standard distribution of the criterion. We show the functional relationship between the concepts of “processing efficiency” and “relative efficiency” of distributions. Using high-performance computing facilities, which make it possible to calculate Δ-exact distributions for given values of the text length and the size of the text alphabet, we prove a statement about the relative efficiency of distributions. This statement makes it possible to select, from the set of distributions of the test statistic, the standard distribution of the criterion that gives the highest processing efficiency. In addition, we give examples of the values of relative efficiency for exact and limit approximations.
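The gap between an exact and a limit distribution can be illustrated on a toy case. The sketch below is not the author's Δ-exact method: it assumes Pearson's chi-square statistic as the test statistic, enumerates all symbol-count vectors by brute force (feasible only for very small text length `n` and alphabet size `k`), and compares the exact upper tail with the chi-square limit tail. The function names are hypothetical, and the closed-form limit tail used here is valid only for an even number of degrees of freedom, that is, for odd `k`.

```python
from collections import defaultdict
from math import exp, factorial


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def exact_distribution(n, k):
    """Exact distribution of Pearson's statistic for a text of n symbols drawn
    uniformly from an alphabet of size k (assumption: brute-force enumeration)."""
    by_sum_sq = defaultdict(float)
    for counts in compositions(n, k):
        ways = factorial(n)
        for c in counts:
            ways //= factorial(c)  # multinomial coefficient, kept as an exact integer
        # Group by the integer sum of squared counts so that equal statistic
        # values are merged without relying on floating-point keys.
        by_sum_sq[sum(c * c for c in counts)] += ways / k ** n
    # Pearson's statistic under equiprobability: (k / n) * sum(c_i^2) - n.
    return {k * s / n - n: p for s, p in sorted(by_sum_sq.items())}


def exact_tail(dist, x):
    """P(statistic >= x) under the exact distribution."""
    return sum(p for value, p in dist.items() if value >= x - 1e-12)


def chi2_limit_tail(x, df):
    """Upper tail of the chi-square limit law; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form implemented for positive even df only")
    half = x / 2
    return exp(-half) * sum(half ** j / factorial(j) for j in range(df // 2))


if __name__ == "__main__":
    n, k = 20, 3  # toy parameters chosen for illustration
    dist = exact_distribution(n, k)
    for x in sorted(dist)[-5:]:
        print(f"x={x:6.2f}  exact={exact_tail(dist, x):.3e}  "
              f"limit={chi2_limit_tail(x, k - 1):.3e}")
```

With a fixed critical value, the two tails give different significance levels. This is the kind of discrepancy that leads to wrongly selected texts when the limit distribution is used as the standard one.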

Published
2019-04-04
Section
SECTION III. MATHEMATICAL AND SOFTWARE