• F. S. Bulyga Southern Federal University
• V. M. Kureichik Southern Federal University
Keywords: Big data, clustering, cluster analysis, data mining, vectorization, text data clustering, k-means, Word2Vec, TF-IDF, Bag-of-Words


This paper reviews the problem of representing textual information for subsequent cluster
analysis in the context of processing and managing high-dimensional data. Modern requirements
for analytical, search, and recommendation information systems show that the current information
technology market still lacks a holistic solution that delivers both sufficient speed and
sufficient quality of results. Finding such a solution requires an objective analysis of existing
approaches to representing textual information in vector space, both to form a complete picture
of their advantages and disadvantages and to derive criteria for developing a new approach free
of the identified weaknesses. The work is analytical and gives an overview of the current state
of the problem and of how thoroughly it has been studied within a limited subject area.
Clustering of text data is the automatic formation of subsets whose elements are documents from
an unstructured sample of a fixed dimension under study. This process can be classified as
unsupervised learning, which implies the absence of an expert who manually assigns class labels
to the original sample of documents. However, cluster analysis of text data is impossible without
preprocessing: the input data must first be standardized and reduced to a single format. The
paper discusses text preprocessing methods for this stage of cluster analysis. The novelty of the
paper lies in establishing a theoretical basis for the main text vectorization methods by
systematizing the proposed assumptions and verifying them objectively through a series of
experimental studies. The main difference from previously published work is the systematization
and analysis of modern solutions, together with hypotheses about the relevance and effectiveness
of the authors' own hybridized approach to text data vectorization.
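
To make the vectorize-then-cluster pipeline described above concrete, the following is a minimal sketch, not the authors' method or their hybridized approach. It assumes a smoothed TF-IDF weighting, L2 normalization, and plain k-means with a deterministic farthest-first initialization; the toy documents and all function names (`tokenize`, `tfidf_vectors`, `kmeans`) are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (a stand-in for real preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF (an assumption; other variants exist).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Lloyd's k-means with deterministic farthest-first initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus: two documents about cats, two about markets.
docs = [
    "the cat chased the mouse",
    "a cat and a kitten play",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy corpus the two topic groups share no terms, so the documents within each group end up closer to each other than to the other group. Real corpora would need the fuller preprocessing the paper discusses (for example stop-word removal and normalization of word forms) before vectorization.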


