TEXT VECTORIZATION USING DATA MINING METHODS

  • Ali Mahmoud Mansour Southern Federal University
  • Juman Hussain Mohammad Southern Federal University
  • Y. A. Kravchenko Southern Federal University
Keywords: Text vectorization, data mining, classification, clustering, machine learning, concepts, semantic

Abstract

In text mining tasks, a textual representation should be not only efficient but also interpretable,
as this enables an understanding of the operational logic underlying the data mining
models. Traditional text vectorization methods such as TF-IDF and bag-of-words are effective and
intuitively interpretable, but they suffer from the "curse of dimensionality" and
are unable to capture the meanings of words. Modern distributed methods, on the other hand, effectively
capture hidden semantics, but they are computationally intensive, time-consuming,
and uninterpretable. This article proposes a new text vectorization method called Bag of Weighted
Concepts (BoWC), which represents a document according to the concept information it contains. The
proposed method creates concepts by clustering word vectors (i.e., word embeddings) and then uses the
frequencies of these concept clusters to build document vectors. To enrich the resulting document
representation, a new modified weighting function is proposed for weighting concepts based
on statistics extracted from word embedding information. The generated vectors are characterized
by interpretability, low dimensionality, high accuracy, and low computational cost when used in
data mining tasks. The proposed method has been tested on five different benchmark datasets in
two data mining tasks, document clustering and classification, and compared with several baselines,
including Bag-of-words, TF-IDF, Averaged GloVe, Bag-of-Concepts, and VLAC. The results
indicate that BoWC outperforms most baselines and gives 7% better accuracy on average.
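
The pipeline described in the abstract (cluster word vectors into concepts, count concept occurrences per document, then weight the counts) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy 2-D embeddings, the function names, and the fixed k-means initialization are all hypothetical. The abstract does not specify the proposed weighting function, so the sketch substitutes a generic smoothed inverse-document-frequency weight applied to concepts instead of words.

```python
import math
from collections import Counter

# Toy 2-D "embeddings" (hypothetical values; real ones would come from a
# pretrained model such as GloVe).
WORD_VECTORS = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2), "horse": (0.85, 0.15),
    "car": (0.1, 0.9), "truck": (0.2, 0.8), "bus": (0.15, 0.85),
}

def nearest(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, centroids[i])))

def kmeans(vectors, init_centroids, iterations=10):
    """Plain Lloyd's k-means; initial centroids are fixed for reproducibility."""
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def build_concepts(word_vectors, init_words):
    """Cluster word vectors and return a word -> concept-id mapping."""
    centroids = kmeans(list(word_vectors.values()),
                       [word_vectors[w] for w in init_words])
    return {w: nearest(v, centroids) for w, v in word_vectors.items()}

def concept_counts(tokens, word_to_concept, n_concepts):
    """Count how often each concept occurs in a tokenized document."""
    counts = Counter(word_to_concept[t] for t in tokens if t in word_to_concept)
    return [counts.get(c, 0) for c in range(n_concepts)]

def weighted_vectors(docs, word_to_concept, n_concepts):
    """Concept frequencies scaled by a smoothed IDF.

    ASSUMPTION: this generic weight stands in for the paper's own
    weighting function, which the abstract does not define.
    """
    raw = [concept_counts(d, word_to_concept, n_concepts) for d in docs]
    n_docs = len(docs)
    df = [sum(1 for r in raw if r[c] > 0) for c in range(n_concepts)]
    idf = [math.log((1 + n_docs) / (1 + df[c])) + 1 for c in range(n_concepts)]
    vectors = []
    for r in raw:
        total = sum(r) or 1  # avoid division by zero for empty documents
        vectors.append([r[c] / total * idf[c] for c in range(n_concepts)])
    return vectors

docs = [["cat", "dog", "cat"], ["car", "truck", "bus", "dog"], ["horse", "cat"]]
word_to_concept = build_concepts(WORD_VECTORS, init_words=["cat", "car"])
vectors = weighted_vectors(docs, word_to_concept, n_concepts=2)
```

Each document vector has one dimension per concept, not one per vocabulary word, which is where the low dimensionality comes from. Each dimension can also be inspected by listing the words assigned to that concept, which is the sense in which such a representation stays interpretable.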

Published
2021-07-18
Section
SECTION IV. INFORMATION ANALYSIS AND PATTERN RECOGNITION