• V.V. Bova Southern Federal University
• Y.A. Kravchenko Southern Federal University
• S.I. Rodzin Southern Federal University
Keywords: Text data cluster analysis, agglomerative clustering, quality metrics, non-hierarchical clustering, affine transformation method, dendrograms, scatterplots


The article addresses one of the important tasks of artificial intelligence: machine processing
of natural language. Solving this problem on the basis of cluster analysis makes it possible
to identify, formalize and integrate large amounts of linguistic expert information under conditions
of information uncertainty and weak structure of the original text resources obtained from
various subject areas. Cluster analysis is a powerful tool for exploratory analysis of text data,
which allows for an objective classification of any objects that are characterized by a number of
features and have hidden patterns. The paper reviews and analyzes modern modified algorithms for
agglomerative clustering (CURE, ROCK, CHAMELEON) and non-hierarchical clustering (PAM, CLARA),
as well as the affine transformation algorithm, which are used at various stages of text data
clustering and whose effectiveness is verified by experimental studies. The paper substantiates the
requirements for choosing the most efficient clustering method for increasing the efficiency of
intelligent processing of linguistic expert information. It also considers
methods for visualizing clustering results for interpreting the cluster structure and dependencies
on a set of text data elements and graphical means of their presentation in the form of
dendrograms, scatterplots, VOS similarity diagrams, and intensity maps. To compare the quality of
the algorithms, external metrics (V-measure, adjusted Rand index) and an internal metric
(silhouette) were used. The experiments showed that a hybrid approach is needed. When no
hypothesis about the initial number of clusters can be put forward, a hierarchical approach is
used first to select the number of clusters and place their centers by sequentially merging and
averaging the characteristics of the closest data points in a limited sample. Iterative clustering
algorithms, which are highly stable with respect to noise features and outliers, are then applied.
Hybridization increases the efficiency of clustering algorithms. The research results showed that
in order to increase the
computational efficiency and overcome the sensitivity of clustering algorithms to parameter
initialization, it is necessary to use metaheuristic approaches to optimize the parameters of the
learning model and search for a globally optimal solution.
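
The hybrid scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `hierarchical_seed` and `k_medoids`, the stopping parameter `merge_threshold`, and the toy two-dimensional points are all assumptions made for illustration. Real text data would first be converted to feature vectors. The iterative stage is a simple alternating k-medoids loop, in the spirit of PAM but not identical to it.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_seed(sample, merge_threshold):
    """Agglomerative pass over a limited sample.

    Repeatedly merges the two closest clusters, replacing them with the
    size-weighted average of their centroids, and stops once the closest
    pair is farther apart than merge_threshold (an assumed parameter).
    The number of surviving centroids is the estimated cluster count.
    """
    clusters = [(list(p), 1) for p in sample]  # (centroid, size)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > merge_threshold:
            break
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        merged = [(x * ni + y * nj) / (ni + nj) for x, y in zip(ci, cj)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((merged, ni + nj))
    return [centroid for centroid, _ in clusters]


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points]


def k_medoids(points, centers, max_iter=50):
    """Iterative stage: alternate assignment and medoid update.

    Each center is replaced by the cluster member with the smallest total
    distance to the other members, which limits the pull of outliers.
    """
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if not members:  # keep the old center for an empty cluster
                new_centers.append(centers[k])
                continue
            new_centers.append(
                min(members, key=lambda m: sum(dist(m, o) for o in members)))
        if new_centers == centers:
            break
        centers = new_centers
        labels = assign(points, centers)
    return centers, labels


# Toy data (hypothetical): two compact groups and one distant outlier.
blob_a = [(0.1 * i, 0.1 * (i % 3)) for i in range(10)]
blob_b = [(10 + 0.1 * i, 10 + 0.1 * (i % 3)) for i in range(10)]
outlier = (30.0, 30.0)
points = blob_a + blob_b + [outlier]

sample = points[::3]  # limited sample used only for seeding
seeds = hierarchical_seed(sample, merge_threshold=3.0)
medoids, labels = k_medoids(points, seeds)

print("estimated number of clusters:", len(seeds))
print("medoids:", medoids)
print("labels:", labels)
```

In this sketch the hierarchical stage only proposes the number of clusters and their starting centers. The medoid update then keeps each center on an actual data point, so a single distant point is absorbed into the nearest cluster without becoming its center.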


