TEXT VECTORIZATION USING DATA MINING METHODS

Keywords:

Text vectorization, data mining, classification, clustering, machine learning, concepts, semantics

Abstract

In text mining tasks, a textual representation should be not only efficient but also interpretable,
as interpretability makes it possible to understand the operational logic underlying data mining
models. Traditional text vectorization methods such as TF-IDF and bag-of-words are effective and
intuitively interpretable, but they suffer from the "curse of dimensionality" and are
unable to capture the meanings of words. Modern distributed methods, on the other hand, effectively
capture hidden semantics, but they are computationally intensive, time-consuming,
and uninterpretable. This article proposes a new text vectorization method called Bag of Weighted
Concepts (BoWC), which represents a document according to the concept information it contains. The
proposed method creates concepts by clustering word vectors (i.e., word embeddings) and then uses the
frequencies of these concept clusters to build document vectors. To enrich the resulting document
representation, a new modified weighting function is proposed that weights concepts based
on statistics extracted from word embedding information. The generated vectors are characterized
by interpretability, low dimensionality, high accuracy, and low computational cost when used in
data mining tasks. The proposed method was tested on five benchmark datasets in
two data mining tasks, document clustering and classification, and compared with several baselines,
including Bag-of-words, TF-IDF, Averaged GloVe, Bag-of-Concepts, and VLAC. The results
indicate that BoWC outperforms most baselines and gives 7% better accuracy on average.
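
The pipeline described in the abstract (cluster word vectors into concepts, then represent each document by weighted concept frequencies) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm or the modified weighting function, so plain k-means and a smoothed IDF over concepts are used here as stand-ins. The 2-D embeddings, the toy documents, and the names `kmeans`, `build_concepts`, and `concept_vectors` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-D word embeddings, invented for illustration. A real run
# would load pretrained vectors (for example GloVe) instead.
EMBEDDINGS = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2), "horse": (0.85, 0.15),
    "car": (0.1, 0.9), "truck": (0.2, 0.8), "bus": (0.15, 0.85),
}


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means, seeded deterministically from evenly spaced points."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def build_concepts(embeddings, k):
    """Cluster word vectors; each cluster id stands for one concept."""
    words = list(embeddings)
    labels = kmeans([embeddings[w] for w in words], k)
    return dict(zip(words, labels))


def concept_vectors(docs, word_to_concept, k):
    """Turn tokenized documents into k-dimensional weighted concept vectors."""
    # Raw concept counts per document; out-of-vocabulary tokens are skipped.
    counts = []
    for doc in docs:
        c = Counter(word_to_concept[w] for w in doc if w in word_to_concept)
        counts.append([c[j] for j in range(k)])

    # Stand-in weighting (assumption): smoothed inverse document frequency
    # computed over concepts rather than over words.
    n = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(k)]
    idf = [math.log((1 + n) / (1 + df[j])) + 1 for j in range(k)]

    vectors = []
    for row in counts:
        total = sum(row) or 1  # avoid division by zero for empty documents
        vectors.append([row[j] / total * idf[j] for j in range(k)])
    return vectors


docs = [["cat", "dog", "car"], ["truck", "bus", "car"], ["horse", "dog"]]
word_to_concept = build_concepts(EMBEDDINGS, k=2)
vectors = concept_vectors(docs, word_to_concept, k=2)
```

Each document vector has one dimension per concept, so its length is fixed by the number of clusters rather than by the vocabulary size, and every dimension can be inspected by listing the words assigned to that cluster. This is the sense in which such a representation is both low-dimensional and interpretable.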

Published

2021-07-18

Section

SECTION IV. INFORMATION ANALYSIS AND PATTERN RECOGNITION