BASIC APPROACHES TO EXTRACTING TEXTUAL INFORMATION (OVERVIEW)

  • V.V. Kureichik Southern Federal University
  • P.S. Gerasimenko Southern Federal University
Keywords: Full-text search, B-trees, vector space model, inverse index, n-gram indexing, two-phase text search, indexes, information extraction, ranking, neural networks, fuzzy logic, binned algorithms

Abstract

This article is devoted to the review of known and modern approaches, methods and algorithms of
full-text search. A brief history of the solution of the problem of search in unstructured text data, its development
and relevance are described. The main task of search in text data is formulated. The definition of
the database index is given. The target function of the search information system is defined in general
terms and possible compromise variations of its parameters when solving various applied problems are
described. A generalized architecture of a modern search information system is given with the division of
the search problem into two phases: the primary extraction of relevant records and their subsequent ranking
to form the final search results. The article provides basic descriptions of the main algorithms and
methods of full-text search, such as: search by terms (logical search), search using trees and their varieties
(B-trees, UB-trees, tries), search based on n-grams (including search based on frequency representation),
use of the vector space model (VSM), search based on an inverted (reverse) index, search using the apparatus of fuzzy logic and bioinspired methods. The main advantages and disadvantages of these methods
are given, their applicability in various conditions is described, and possible methods for optimizing
the search for text data to improve the accuracy, speed of search and efficiency of resource use are considered.
Possible promising directions in the field of solving the problem of primary information extraction
are presented. Some methods for determining the similarity of text records for solving the ranking
problem based on the apparatus of fuzzy logic are given. The article touches upon the issues of increasing
the relevance of primary extraction using artificial intelligence methods, neural networks, fuzzy logic and
bioinspired methods, in particular methods for expanding the search query and/or expanding the processed
text records. The influence of the boundary conditions of the search system construction on increasing
its efficiency is described. In conclusion, the article summarizes the review and discusses the prospects
for further development of various full-text search methods.

References

1. Guha R.V., McCool R., Miller E. Semantic search, Proceedings of the 12th International Conference
on World Wide Web, Budapest, 2003, pp. 700-709.
2. Weaver W. Machine Translation of Languages. MIT Press, Cambridge, MA, Reprinted from a memorandum
written by Weaver in 1949, 1955.
3. Salton G., Wong A., Yang C.S. A vector space model for automatic indexing, Communications of the
ACM, 1975, Vol. 18, No. 11, pp. 613-620.
4. Rygl J., Pomikálek J., Řehůřek R., Růžička M. Semantic Vector Encoding and Similarity Search Using
Fulltext Search Engines, Proceedings of the 2nd Workshop on Representation Learning for NLP, 2017.
5. Baeza-Yates R., Ribeiro-Neto B. Modern Information Retrieval. ACM Press New York, 1999.
6. Manning C.D., Raghavan P., Sch"utze, H. Introduction to Information Retrieval. Cambridge
University Press, 2008.
7. Yan W., Zhang X. A Concise Concurrent B + -Tree for Persistent Memory, ACM Transactions on
Architecture and Code Optimization, 2023, Vol. 21 (2).
8. Leis V., Kemper A., Neumann T. The adaptive radix tree: ARTful indexing for main-memory
databases, 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, QLD,
Australia, 2013, pp. 38-49.
9. Salton G. Information Retrieval: Data Structures and Algorithms. Reading, Massachussetts: Addison-
Wesley, 1989.
10. Giulio Ermanno P., Rossano V. Techniques for Inverted Index Compression, ACM Computing
Surveys, 2020, Vol. 53, pp. 1-36.
11. Miller E., Shen D., Liu J., Nicholas C. Performance and Scalability of a Large-Scale N-gram Based
Information Retrieval System, Journal of Digital Information, 2000, Vol 1, No 5.
12. Srinivasa K., Devi B. GPU Based N-Gram String Matching Algorithm with Score Table Approach for
String Searching in Many Documents, Journal of The Institution of Engineers (India) Series B, 2017,
Vol. 98, pp. 467-476.
13. Kudryavtsev K. Search in text documents based on N-grams and Fourier window transformation,
Computer Science, 2014, Vol. 65, pp. 871-880.
14. Jones K. A Statistical Interpretation of Term Specificity in Retrieval, Journal of Documentation, 2004,
Vol. 60, pp. 493-502.
15. Chebil W., Soualmia L. Improving semantic information retrieval by combining possibilistic networks,
vector space model and pseudo-relevance feedback, Journal of Information Science, 2023.
16. Eminagaoglu M. A new similarity measure for vector space models in text classification and
information retrieval, Journal of Information Science, 2020, Vol. 48.
17. Singh A., Yadav A., Rana A. K-means with Three different Distance Metrics, International Journal of
Computer Applications, 2013, Vol. 67, pp. 13-17.
18. Nogueira R., Yang W., Lin J., Cho K. Document expansion by query prediction, arXiv preprint
arXiv:1904.08375, 2019.
19. Mao Y., He P., Liu X., Shen Y., Gao J., Han J., Chen W. Generation-augmented retrieval for opendomain
question answering, arXiv preprint arXiv:2009.08553, 2020.
20. Yan M., Li C., Bi B., Wang W., Huang S. A unified pretraining framework for passage ranking and
expansion, Proceedings of the AAAI Conference on Artificial Intelligence, 2021, Vol. 35, pp. 4555-4563.
21. Efron M., Organisciak P., Fenlon K. Improving retrieval of short texts through document expansion,
Proceedings of the 35th international ACM SIGIR conference on Research and development in
information retrieval, 2012, pp. 911–920.
22. Agirre E., Arregi X., Otegi A. Document expansion based on WordNet for robust IR, Coling 2010:
Posters, 2010, pp. 9-17.
23. Shen Y., He X., Gao J., Deng L. Latent semantic models with deep neural networks for information
retrieval, Proceedings of the 37th international ACM SIGIR conference on Research & development in
information retrieval. ACM, 2014, pp. 269-278.
24. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser L., Polosukhin I.
Attention Is All You Need, Advances in Neural Information Processing Systems, 2017, pp. 6000-6010.
25. Devlin J., Chang M.W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers
for Language Understanding, Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and
Short Papers), 2019, pp. 4171–4186.
26. Chimah J., Ude F. Current trends in information retrieval systems: review of fuzzy set theory and fuzzy
Boolean retrieval models, Journal of Library Services and Technologies, 2020, Vol. 2, pp. 48-56.
27. Bartos T., Eckhardt A., Skopal T. Fuzzy approach to non-metric similarity indexing, Proceedings - 4th
International Conference on SImilarity Search and APplications, SISAP 2011, 2011, pp. 115-116.
28. Mounira C., Jedidi A., Gargouri F. Semantic/Fuzzy Information Retrieval System, International
Journal of Information Technology and Web Engineering, 2017.
29. Abualigah L., Hanandeh E. Applying Genetic Algorithms to Information Retrieval Using Vector
Space Model, International Journal of Computer Science, Engineering and Applications, 2015, Vol. 5,
pp. 19-28.
30. Russell A. A Genetic Algorithm for Query Optimization, Department of Computer and Information
Sciences University of Strathclyde, Glasgow August 26, 2019.
Published
2024-10-08
Section
SECTION I. INFORMATION PROCESSING ALGORITHMS