METHODS AND ALGORITHMS FOR TEXT DATA CLUSTERING (REVIEW)

V.V. Bova; Y.A. Kravchenko; S.I. Rodzin

Abstract

The article addresses one of the important tasks of artificial intelligence: machine processing of natural language. Solving this task by means of cluster analysis makes it possible to identify, formalize, and integrate large amounts of linguistic expert information when the original text resources, drawn from various subject areas, are uncertain and weakly structured. Cluster analysis is a powerful tool for exploratory analysis of text data; it allows an objective classification of objects that are characterized by a number of features and contain hidden patterns. The paper reviews and analyzes modern modified algorithms for agglomerative clustering (CURE, ROCK, CHAMELEON), non-hierarchical clustering (PAM, CLARA), and the affinity propagation algorithm, which are used at various stages of text data clustering and whose effectiveness has been verified experimentally. It substantiates the requirements for choosing the most efficient clustering method for improving the intelligent processing of linguistic expert information. It also considers methods for visualizing clustering results in order to interpret the cluster structure and the dependencies among text data elements, together with graphical means of presenting them: dendrograms, scatterplots, VOS similarity diagrams, and intensity maps. To compare the quality of the algorithms, internal and external performance metrics were used: V-measure, Adjusted Rand index, and Silhouette. The experiments showed that a hybrid approach is needed. When no hypothesis about the initial number of clusters can be put forward, a hierarchical approach is first used to select the number of clusters and the positions of their centers, by sequentially merging and averaging the characteristics of the closest items in a limited sample. Iterative clustering algorithms, which are highly stable with respect to noise features and outliers, are then applied. Hybridization increases the efficiency of clustering algorithms. The results also showed that, to increase computational efficiency and overcome sensitivity to parameter initialization, metaheuristic approaches should be used to optimize the parameters of the learning model and to search for a globally optimal solution.
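The two-stage hybrid scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes documents are already represented as numeric vectors, uses centroid-distance merging on a random sample as the hierarchical stage, and uses plain k-means as the iterative stage. The function names, the `merge_threshold` parameter, and the toy two-dimensional data are all hypothetical choices made for illustration.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def agglomerate(sample, merge_threshold):
    """Stage 1 (hierarchical): repeatedly merge the two clusters whose
    centroids are closest, stopping once the closest pair is farther apart
    than merge_threshold. Returns the centroids of the remaining clusters."""
    clusters = [[p] for p in sample]
    while len(clusters) > 1:
        centers = [centroid(c) for c in clusters]
        dist, i, j = min(
            (math.dist(centers[a], centers[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > merge_threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def k_means(points, centers, max_iter=100):
    """Stage 2 (iterative): k-means seeded with the stage-1 centroids."""
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest current center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centers.append(centroid(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def hybrid_cluster(points, sample_size, merge_threshold, seed=0):
    """Estimate the number of clusters and their centers on a limited
    sample, then refine the partition on the full data set."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    seeds = agglomerate(sample, merge_threshold)
    return k_means(points, seeds)


# Toy data: three well-separated groups standing in for document vectors.
data_rng = random.Random(42)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    [cx + data_rng.uniform(-1, 1), cy + data_rng.uniform(-1, 1)]
    for cx, cy in blobs
    for _ in range(30)
]
labels, centers = hybrid_cluster(points, sample_size=30, merge_threshold=4.0)
print(len(centers), "clusters found")
```

In this sketch the number of clusters is never passed in; it falls out of the merge threshold applied to the sample, which is one simple way to avoid guessing it in advance. The resulting labels could then be scored with the metrics named in the abstract.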

Authors

V.V. Bova, Southern Federal University
Y.A. Kravchenko, Southern Federal University
S.I. Rodzin, Southern Federal University
