A TRANSFORMER-BASED ALGORITHM FOR CLASSIFYING LONG TEXTS

Keywords:

Document classification, BERT, transformers, attention mechanism, Sentence BERT, TF-IDF, text mining

Abstract

The article addresses the pressing problem of representing and classifying long text documents with transformers. Transformer-based text representation methods cannot process long sequences efficiently because their self-attention mechanism scales quadratically with sequence length. This limitation leads to high computational complexity and makes such models inapplicable to long documents. To overcome this drawback, the article develops an algorithm based on the SBERT transformer that builds a vector representation of long text documents. The key idea of the algorithm is to combine two procedures for creating the vector representation: the first segments the text and averages the segment vectors, while the second concatenates the segment vectors. This combination of procedures preserves important information from long documents. To verify the effectiveness of the algorithm, a computational experiment was conducted comparing a group of classifiers built on the proposed algorithm with a group of well-known text vectorization methods, such as TF-IDF, LSA, and BoWC. The results of the experiment showed that transformer-based classifiers generally achieve better classification accuracy than the classical methods. However, this advantage comes at the cost of higher computational complexity and, accordingly, longer training and inference times. On the other hand, classical text vectorization methods such as TF-IDF, LSA, and BoWC were faster, making them preferable when pre-encoding is not possible and real-time operation is required. The proposed algorithm proved highly effective and increased classification accuracy on the BBC dataset by 0.5% according to the F1 criterion.
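The two representation procedures described in the abstract (segment the document, then either average or concatenate the segment vectors) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segment encoder below is a random-vector placeholder standing in for the SBERT model, and the segment length, embedding size, and segment cap are illustrative assumptions.

```python
import numpy as np

SEGMENT_LEN = 64   # assumed maximum tokens per segment (illustrative)
EMB_DIM = 384      # assumed embedding size; common for small SBERT models

def encode_segment(tokens):
    """Placeholder for SBERT: deterministically maps a token list to a
    vector. In the real algorithm a SentenceTransformer would encode it."""
    seed = abs(hash(" ".join(tokens))) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMB_DIM)

def represent_long_document(text, n_segments=4):
    """Return the two representations of a long document:
    (1) the mean of segment embeddings, (2) their concatenation."""
    tokens = text.split()
    # Split the token sequence into fixed-length segments.
    segments = [tokens[i:i + SEGMENT_LEN]
                for i in range(0, len(tokens), SEGMENT_LEN)]
    segments = segments[:n_segments]  # cap the number of segments kept
    vecs = np.stack([encode_segment(seg) for seg in segments])
    mean_vec = vecs.mean(axis=0)   # averaging procedure
    concat_vec = vecs.reshape(-1)  # concatenation procedure
    return mean_vec, concat_vec
```

The averaged vector has a fixed size regardless of document length, while the concatenated vector grows with the number of segments retained; combining the two is what lets the algorithm keep both global and per-segment information.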

Published

2024-08-12

Section

SECTION II. INFORMATION PROCESSING ALGORITHMS