COMPARATIVE ANALYSIS OF METHODS OF VECTORIZATION OF HIGH DIMENSIONAL TEXT DATA

  • F.S. Bulyga Southern Federal University
  • V.M. Kureichik Southern Federal University
Keywords: Big data, clustering, cluster analysis, data mining, vectorization, text data clustering, k-means, Word2Vec, TF-IDF, Bag-of-Words

Abstract

This publication reviews the problem of representing textual information for subsequent cluster
analysis in the processing and management of high-dimensional data. Current requirements for
analytical, search, and recommendation information systems show that the information technology
market still lacks a holistic solution that delivers both sufficient speed and sufficient quality of
results. Finding such a solution requires an objective analysis of existing methods for representing
textual information in vector space, in order to form a complete picture of their advantages and
disadvantages and to derive criteria for developing a new approach free of the identified
weaknesses. The work is analytical and gives an overview of the current state of research on this
problem within a limited subject area. Clustering of text data is the automatic formation of
subsets whose elements are documents drawn from an unstructured sample of fixed dimension under
study. The process is a form of unsupervised learning: no expert manually assigns class labels to
the original sample of documents. Cluster analysis of text data is, however, impossible without
preprocessing, because the input data must first be standardized and reduced to a single format.
This publication discusses the text preprocessing methods used at this stage of cluster analysis.
Its novelty lies in establishing a theoretical basis for the main methods of text data
vectorization by systematizing the proposed assumptions and testing them objectively in a series of
experimental studies. The work differs from previously published studies in its systematization and
analysis of modern solutions and in its hypotheses about the relevance and effectiveness of the
authors' own hybridized approach to text data vectorization.
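
To make the preprocessing step concrete, the following is a minimal sketch of two of the
vectorization methods named in the keywords, Bag-of-Words and TF-IDF, written in plain Python. It
is an illustration under stated assumptions, not the authors' implementation: the three-sentence
corpus, the function names, and the particular TF-IDF variant (term frequency normalized by
document length, idf = ln(N / df)) are all chosen here for the example. Cosine similarity is used
only to show that documents on a similar topic end up closer in the resulting vector space.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(tokenized_docs):
    """Map every distinct token to a column index (sorted for determinism)."""
    terms = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(terms)}


def bag_of_words(tokenized_docs, vocab):
    """Return one raw term-count vector per document."""
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # vocab preserves insertion order, so column i holds the count of term i.
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vectors


def tf_idf(bow_vectors):
    """Weight counts as tf * idf (assumed variant: tf = count / length, idf = ln(N / df))."""
    n_docs = len(bow_vectors)
    n_terms = len(bow_vectors[0])
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for vec in bow_vectors if vec[j] > 0) for j in range(n_terms)]
    weighted = []
    for vec in bow_vectors:
        length = sum(vec) or 1  # guard against empty documents
        weighted.append(
            [(c / length) * math.log(n_docs / df[j]) for j, c in enumerate(vec)]
        )
    return weighted


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: two documents about a cat, one about finance.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock markets fell on monday",
]
tokens = [tokenize(doc) for doc in corpus]
vocab = build_vocabulary(tokens)
bow = bag_of_words(tokens, vocab)
tfidf = tf_idf(bow)

print("cat vs cat:    ", cosine(tfidf[0], tfidf[1]))
print("cat vs finance:", cosine(tfidf[0], tfidf[2]))
```

Vectors of this kind are what a clustering algorithm such as k-means takes as input. Word2Vec
differs in that it learns dense vectors from word context rather than from counts, so it is not
covered by this sketch.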

Published: 2023-06-07
Section: SECTION III. INFORMATION PROCESSING ALGORITHMS