SUBSYSTEM FOR AUTOMATIC TEXT ANNOTATION BASED ON MACHINE LEARNING METHODS

  • L.A. Gladkov Southern Federal University
  • N.V. Gladkova Southern Federal University
  • V.M. Kureichik Southern Federal University
Keywords: Text summarization, text mining, abstractive summarization, extractive summarization methods, recurrent neural networks, tokenization, stemming, long short-term memory networks

Abstract

This paper considers the problem of automatic text annotation and gives its formulation.
The relevance and importance of developing effective methods and software
systems for automatic text summarization in modern information systems are
substantiated. Definitions of the concepts “data” and “knowledge” are given. A list of tasks related
to the Data Mining field is described. The Text Mining problem and existing methods for solving
it are described in detail. The problem of summarizing texts is considered. The main stages of
solving the summarization problem are highlighted. The main methods of automatic text processing
are described, and their advantages and disadvantages are highlighted. Abstractive summarization
and extractive summarization methods are discussed in detail. A comparative analysis of the effectiveness
of various abstractive and extractive methods has been carried out, and their key advantages
and disadvantages have been highlighted. A brief description of the encoder-decoder
architecture is given from the point of view of its use in the developed algorithm
for automatic text summarization. A description of the model of recurrent neural networks is given,
and the advantages and disadvantages of such models are noted. The architecture of a recurrent
neural network is considered in relation to solving the problem of automatic text summarization. A
description of the modified model of a recurrent neural network – a neural network with long
short-term memory – is given. A description of the proposed automatic summarization algorithm and
the settings of its main parameters are given. A description of the developed automatic summarization
software subsystem is given. Computer modeling is performed and the results obtained during
computational experiments are presented. The quality of the solutions obtained was assessed. The
optimal parameters of the developed software system are determined. Directions for continuing
research are formulated.
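
For readers unfamiliar with the extractive approach mentioned above, the following is a minimal sketch of the general idea: tokenize the text, score each sentence by the average frequency of its words, and keep the top-scoring sentences in their original order. It is an illustration only and is not the algorithm proposed in the paper, which is based on recurrent neural networks with long short-term memory. The function names, the stop-word list, and the scoring rule are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "are", "it", "on"}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase word tokens with stop words removed (no stemming in this sketch).
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    sentences = split_sentences(text)
    # Word frequencies over the whole document.
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freqs[w] for w in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score; ties keep their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Emit the selected sentences in document order.
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    sample = "Neural networks summarize text. Neural networks need data. Cats sleep."
    print(summarize(sample, n_sentences=2))
```

An abstractive system, by contrast, generates new sentences rather than selecting existing ones, which is where an encoder-decoder architecture comes in.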

Published
2023-12-11
Section
SECTION II. DATA ANALYSIS AND MODELING