AGGLOMERATIVE CLUSTERIZATION ALGORITHMS FOR THE PROBLEMS OF ANALYSIS OF LINGUISTIC EXPERT INFORMATION
Abstract
This article presents the main problems and principles of data clustering, in particular
the principles and tasks of clustering text arrays of linguistic expert information.
The work identifies the main difficulties that arise in designing such systems, for example
the need to preprocess the data and to reduce the size of the initial sample. To perform
these tasks effectively, a solution needs an integrated approach that takes into account the
efficiency of the methods used for individual subtasks and delivers high efficiency at each
stage of the clustering process. Several groups of hierarchical clustering algorithms are
considered, with a focus on the subgroup of agglomerative algorithms as applied to clustering
linguistic expert information. A formal statement of the text clustering problem is given,
and the main group of implemented solutions based on the principles of agglomerative
clustering is identified: ROCK, CURE, and CHAMELEON. Each of these algorithms is reviewed in
detail, and its main advantages and disadvantages are formulated. The value of the work lies
in the collected data on the algorithms and in the results of the comparative analysis, which
make it possible to assess the feasibility of using these agglomerative clustering algorithms.
The novelty of the work lies in an overview of existing hierarchical clustering approaches to
the cluster analysis of linguistic expert information and in the results of the comparative
analysis of the considered algorithms.