METHODS AND ALGORITHMS FOR TEXT DATA CLUSTERING (REVIEW)

  • V.V. Bova, Southern Federal University
  • Y.A. Kravchenko, Southern Federal University
  • S.I. Rodzin, Southern Federal University
Keywords: text data cluster analysis, agglomerative clustering, quality metrics, non-hierarchical clustering, affine transformation method, dendrograms, scatterplots

Abstract

The article addresses an important task in artificial intelligence: natural language processing.
Approaching this task through cluster analysis makes it possible to identify, formalize, and
integrate large amounts of linguistic expert information when the original text resources, drawn
from various subject areas, are weakly structured and the information is uncertain. Cluster
analysis is a powerful tool for exploratory analysis of text data: it allows an objective
classification of objects that are characterized by a number of features and contain hidden
patterns. The paper reviews and analyzes modern modified algorithms for agglomerative clustering
(CURE, ROCK, CHAMELEON), non-hierarchical clustering (PAM, CLARA), and the affine transformation
algorithm, which are used at various stages of text data clustering and whose effectiveness has
been verified experimentally. The paper substantiates the requirements for choosing the most
efficient clustering method for improving the intelligent processing of linguistic expert
information. It also considers methods for visualizing clustering results, which help interpret
the cluster structure and dependencies in a set of text data elements, and the graphical means of
presenting them: dendrograms, scatterplots, VOS similarity diagrams, and intensity maps. To
compare the quality of the algorithms, internal and external performance metrics were used:
V-measure, Adjusted Rand index, and Silhouette. The experiments showed that a hybrid approach is
needed. When no hypothesis about the initial number of clusters can be put forward, a hierarchical
approach is used first to select the number of clusters and the positions of their centers by
sequentially merging the closest items of a limited sample and averaging their characteristics.
Iterative clustering algorithms, which are highly robust to noise features and outliers, are then
applied. This hybridization increases the efficiency of clustering algorithms. The results also
showed that increasing computational efficiency and overcoming sensitivity to parameter
initialization requires metaheuristic approaches that optimize the parameters of the learning
model and search for a globally optimal solution.
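
The hybrid scheme summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy 2-D vectors stand in for document vectors, and the function names, the centroid-based merge rule, the merge_limit stopping threshold, and the choice of k-means as the iterative stage are all assumptions made for illustration only.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def agglomerate(sample, merge_limit):
    """Hierarchical stage on a limited sample (assumed merge rule).

    Repeatedly merges the two clusters whose averaged vectors are closest
    and stops once the closest pair is farther apart than merge_limit.
    Returns the averaged vector (center) of every remaining cluster, so
    the number of clusters is found rather than supplied.
    """
    clusters = [[p] for p in sample]
    while len(clusters) > 1:
        cents = [mean(c) for c in clusters]
        d, i, j = min(
            (dist2(cents[i], cents[j]), i, j)
            for i in range(len(cents))
            for j in range(i + 1, len(cents))
        )
        if d > merge_limit ** 2:  # d is a squared distance
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return [mean(c) for c in clusters]


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
        for p in points
    ]


def kmeans(points, centers, iterations=20):
    """Iterative stage: plain k-means started from the given centers."""
    for _ in range(iterations):
        labels = assign(points, centers)
        updated = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its previous center.
            updated.append(mean(members) if members else centers[k])
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers


# Toy data: three tight groups of five 2-D vectors each.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5), (0.0, 0.0)]
groups = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(gx + dx, gy + dy) for gx, gy in groups for dx, dy in offsets]

sample = points[::2]                                  # limited sample
initial = agglomerate(sample, merge_limit=3.0)        # number and position of centers
labels, centers = kmeans(points, initial)             # refinement on all data
print(len(centers), labels)
```

In this sketch the hierarchical stage only sees every second vector, which keeps its quadratic pairwise search cheap, while the iterative stage processes the full data set from the centers it was given.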

Published
2022-11-01
Section
SECTION II. INFORMATION PROCESSING ALGORITHMS