A REVIEW OF METHODS FOR IMPROVING REASONING IN LARGE LANGUAGE MODELS

Authors

DOI:

https://doi.org/10.18522/2311-3103-2026-1-%25p

Keywords:

Large language model, reasoning methods, artificial intelligence

Abstract

The emergence of large language models has become an important milestone in natural language processing: such models demonstrate impressive results in text generation, transformation, and analysis, as well as in a wide range of applied tasks. However, despite significant practical success, large language models possess limited reasoning capabilities. These limitations manifest as difficulties in generalizing knowledge beyond the training distribution, challenges in transferring knowledge to new contexts, and reduced accuracy when performing multi-step logical and mathematical operations. The goal of this work is to examine methods for improving the reasoning abilities of large language models, where reasoning is understood as the process of forming and evaluating inferences based on available information. The paper discusses the main types of reasoning relevant to large language models: mathematical, logical, and commonsense reasoning. It lists the benchmarks most commonly used to assess the reasoning quality of language models. An overview is given of the methods used, as of 2025, to enhance reasoning in large language models. Depending on the stage of application (during training or at inference time), the work examines approaches to training data preparation, architectural modifications of language models, training and fine-tuning procedures (including those using specially constructed synthetic datasets), reinforcement learning, various chain-of-thought construction techniques, mechanisms for integrating external tools, and multi-agent approaches. The paper also discusses existing limitations of large language models, including the lack of conceptual understanding, poor out-of-distribution generalization, and reduced effectiveness as task complexity increases. Finally, the most promising methods for improving the quality and reliability of reasoning in large language models are highlighted.
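Two of the inference-time techniques surveyed, chain-of-thought prompting [35, 36] and self-consistency voting [39], can be sketched in a few lines. The function names and the sampled answers below are illustrative assumptions; a real pipeline would obtain the candidate answers by sampling a model several times rather than from a hard-coded list.

```python
from collections import Counter

def build_cot_prompt(question: str) -> str:
    # Zero-shot chain-of-thought trigger (Kojima et al. [36]):
    # appending this phrase elicits intermediate reasoning steps.
    return f"Q: {question}\nA: Let's think step by step."

def self_consistency_vote(answers: list[str]) -> str:
    # Self-consistency (Wang et al. [39]): sample several reasoning
    # chains and keep the most frequent final answer.
    counts = Counter(a.strip() for a in answers)
    return counts.most_common(1)[0][0]

# Hypothetical final answers parsed from five sampled reasoning chains:
sampled = ["40 km/h", "40 km/h", "45 km/h", "40 km/h", "90 km/h"]
print(self_consistency_vote(sampled))  # majority answer: 40 km/h
```

Majority voting over independently sampled chains is attractive because it requires no model changes: it trades extra inference compute for accuracy, which is the common thread of the test-time scaling methods discussed in the survey.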

References

1. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I. Attention is all you need, Advances in Neural Information Processing Systems, 2017.

2. Mirzadeh I., Alizadeh K., Shahrokhi H., Tuzel O., Bengio S., Farajtabar M. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models, arXiv preprint arXiv:2410.05229, 2024.

3. Minaee S., Mikolov T., Nikzad N., Chenaghlu M., Socher R., Amatriain X., Gao J. Large language models: A survey, arXiv preprint arXiv:2402.06196, 2024.

4. Berglund L., Tong M., Kaufmann M., Balesni M., Stickland A. C., Korbak T., Evans O. The reversal curse: LLMs trained on "A is B" fail to learn "B is A", arXiv preprint arXiv:2309.12288, 2023.

5. Wu Z., Qiu L., Ross A., Akyürek E., Chen B., Wang B., Kim N., Andreas J., Kim Y. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks, arXiv preprint arXiv:2307.02477, 2023.

6. Wang L., Ma C., Feng X., Zhang Z., Yang H., Zhang J., Chen Z., Tang J., Chen X., Lin Y., others. A survey on large language model based autonomous agents, Frontiers of Computer Science, 2024, Vol. 18, No. 6, pp. 186345.

7. Arkoudas K. GPT-4 can’t reason, arXiv preprint arXiv:2308.03762, 2023.

8. Chang Y., Wang X., Wang J., Wu Y., Yang L., Zhu K., Chen H., Yi X., Wang C., Wang Y., Ye W., Zhang Y., Chang Y., Yu P. S., Yang Q., Xie X. A Survey on Evaluation of Large Language Models, ACM Trans. Intell. Syst. Technol, 2024, Vol. 15, No. 3. DOI: 10.1145/3641289.

9. Mahowald K., Ivanova A.A., Blank I.A., Kanwisher N., Tenenbaum J.B., Fedorenko E. Dissociating language and thought in large language models, Trends in Cognitive Sciences, 2024.

10. Plaat A., Wong A., Verberne S., Broekens J., Stein N. van, Back T. Reasoning with large language models, a survey, arXiv preprint arXiv:2407.11511, 2024.

11. Penedo G., Kydlíček H., Lozhkov A., Mitchell M., Raffel C., Von Werra L., Wolf T., others. The fineweb datasets: Decanting the web for the finest text data at scale, arXiv preprint arXiv:2406.17557, 2024.

12. Shao Z., Wang P., Zhu Q., Xu R., Song J., Bi X., Zhang H., Zhang M., Li Y., Wu Y., others. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, arXiv preprint arXiv:2402.03300, 2024.

13. Aryabumi V., Su Y., Ma R., Morisot A., Zhang I., Locatelli A., Fadaee M., Üstün A., Hooker S. To Code, or Not To Code? Exploring Impact of Code in Pre-training, arXiv preprint arXiv:2408.10914, 2024.

14. Morishita T., Morio G., Yamaguchi A., Sogawa Y. Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus, arXiv preprint arXiv:2411.12498, 2024.

15. Abdin M., Aneja J., Behl H., Bubeck S., Eldan R., Gunasekar S., Harrison M., Hewett R. J., Javaheripi M., Kauffmann P., others. Phi-4 Technical Report, arXiv preprint arXiv:2412.08905, 2024.

16. Chen A., Li A., Gong B., Jiang B., Fei B., Yang B., Shan B., Yu C., Wang C., Zhu C., others. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention, arXiv preprint arXiv:2506.13585, 2025.

17. Veličković P., Perivolaropoulos C., Barbero F., Pascanu R. Softmax is not enough (for sharp out-of-distribution), arXiv preprint arXiv:2410.01104, 2024.

18. Zhang Y., Backurs A., Bubeck S., Eldan R., Gunasekar S., Wagner T. Unveiling transformers with lego: a synthetic reasoning task, arXiv preprint arXiv:2206.04301, 2022.

19. Li C., Liu J., Chen Y., Jia Y., Li Z. KunlunBaize: LLM with Multi-Scale Convolution and Multi-Token Prediction Under TransformerX Framework, arXiv preprint arXiv:2503.04784, 2025.

20. Hao S., Sukhbaatar S., Su D., Li X., Hu Z., Weston J., Tian Y. Training large language models to reason in a continuous latent space, arXiv preprint arXiv:2412.06769, 2024.

21. Zaytsev D.V. Pochemu bol'shie yazykovye modeli ne (vsegda) rassuzhdayut kak lyudi? [Why don't large language models (always) reason like humans?], Vestnik Moskovskogo universiteta. Seriya 7. Filosofiya [Moscow University Bulletin. Series 7. Philosophy], 2024, No. 1, pp. 76-93.

22. Pagnoni A., Pasunuru R., Rodriguez P., Nguyen J., Muller B., Li M., Zhou C., Yu L., Weston J., Zettlemoyer L., others. Byte Latent Transformer: Patches Scale Better Than Tokens, arXiv preprint arXiv:2412.09871, 2024.

23. Xia B., Shen B., Zhu D., Zhang D., Wang G., Zhang H., Liu H., Xiao J., Dong J., Zhao L., others. MiMo: Unlocking the Reasoning Potential of Language Model - From Pretraining to Posttraining, arXiv preprint arXiv:2505.07608, 2025.

24. Zelikman E., Wu Y., Mu J., Goodman N. Star: Bootstrapping reasoning with reasoning, Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 15476-15488.

25. Wu T., Lan J., Yuan W., Jiao J., Weston J., Sukhbaatar S. Thinking LLMs: General Instruction Following with Thought Generation, arXiv preprint arXiv:2410.10630, 2024.

26. Li W.-D., Hu K., Larsen C., Wu Y., Alford S., Woo C., Dunn S. M., Tang H., Naim M., Nguyen D., others. Combining induction and transduction for abstract reasoning, arXiv preprint arXiv:2411.02272, 2024.

27. Kumar A., Zhuang V., Agarwal R., Su Y., Co-Reyes J. D., Singh A., Baumli K., Iqbal S., Bishop C., Roelofs R., others. Training language models to self-correct via reinforcement learning, arXiv preprint arXiv:2409.12917, 2024.

28. Chen J. C.-Y., Wang Z., Palangi H., Han R., Ebrahimi S., Le L., Perot V., Mishra S., Bansal M., Lee C.-Y., others. Reverse Thinking Makes LLMs Stronger Reasoners, arXiv preprint arXiv:2411.19865, 2024.

29. Muennighoff N., Yang Z., Shi W., Li X. L., Fei-Fei L., Hajishirzi H., Zettlemoyer L., Liang P., Candès E., Hashimoto T. s1: Simple test-time scaling, arXiv preprint arXiv:2501.19393, 2025.

30. Ye Y., Huang Z., Xiao Y., Chern E., Xia S., Liu P. LIMO: Less is More for Reasoning, 2025.

31. Learning to Reason with LLMs. Available at: https://openai.com/index/learning-to-reason-with-llms/ (accessed 22 November 2024).

32. Zeng Z., Cheng Q., Yin Z., Wang B., Li S., Zhou Y., Guo Q., Huang X., Qiu X. Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective, 2024.

33. Guo D., Yang D., Zhang H., Song J., Zhang R., Xu R., Zhu Q., Ma S., Wang P., Bi X., others. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, arXiv preprint arXiv:2501.12948, 2025.

34. Huan M., Li Y., Zheng T., Xu X., Kim S., Du M., Poovendran R., Neubig G., Yue X. Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning, 2025.

35. Wei J., Wang X., Schuurmans D., Bosma M., Xia F., Chi E., Le Q. V., Zhou D., others. Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 24824-24837.

36. Kojima T., Gu S.S., Reid M., Matsuo Y., Iwasawa Y. Large language models are zero-shot reasoners, Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 22199-22213.

37. Long J. Large language model guided tree-of-thought, arXiv preprint arXiv:2305.08291, 2023.

38. Besta M., Blach N., Kubicek A., Gerstenberger R., Podstawski M., Gianinazzi L., Gajda J., Lehmann T., Niewiadomski H., Nyczyk P., others. Graph of thoughts: Solving elaborate problems with large language models, Proceedings of the AAAI Conference on Artificial Intelligence, 2024, pp. 17682-17690.

39. Wang X., Wei J., Schuurmans D., Le Q., Chi E., Narang S., Chowdhery A., Zhou D. Self-consistency improves chain of thought reasoning in language models, arXiv preprint arXiv:2203.11171, 2022.

40. Madaan A., Tandon N., Gupta P., Hallinan S., Gao L., Wiegreffe S., Alon U., Dziri N., Prabhumoye S., Yang Y., others. Self-refine: Iterative refinement with self-feedback, Advances in Neural Information Processing Systems, 2024, Vol. 36.

41. Miao N., Teh Y.W., Rainforth T. SelfCheck: Using LLMs to zero-shot check their own step-by-step reasoning, arXiv preprint arXiv:2308.00436, 2023.

42. Shinn N., Cassano F., Gopinath A., Narasimhan K., Yao S. Reflexion: Language agents with verbal reinforcement learning, Advances in Neural Information Processing Systems, 2024, Vol. 36.

43. Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., Goyal N., Küttler H., Lewis M., Yih W., Rocktäschel T., others. Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems, 2020, Vol. 33, pp. 9459-9474.

44. Jiang J., Chen J., Li J., Ren R., Wang S., Zhao W. X., Song Y., Zhang T. RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement, arXiv preprint arXiv:2412.12881, 2024.

45. Zhou P., Pujara J., Ren X., Chen X., Cheng H.-T., Le Q.V., Chi E.H., Zhou D., Mishra S., Zheng H.S. Self-discover: Large language models self-compose reasoning structures, arXiv preprint arXiv:2402.03620, 2024.

46. Akyürek E., Damani M., Qiu L., Guo H., Kim Y., Andreas J. The Surprising Effectiveness of Test-Time Training for Abstract Reasoning, arXiv preprint arXiv:2411.07279, 2024.

47. Snell C., Lee J., Xu K., Kumar A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters, arXiv preprint arXiv:2408.03314, 2024.

48. Zhong T., Liu Z., Pan Y., Zhang Y., Zhou Y., Liang S., Wu Z., Lyu Y., Shu P., Yu X., others. Evaluation of OpenAI o1: Opportunities and challenges of AGI, arXiv preprint arXiv:2409.18486, 2024.

49. Zhao Y., Yin H., Zeng B., Wang H., Shi T., Lyu C., Wang L., Luo W., Zhang K. Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions, arXiv preprint arXiv:2411.14405, 2024.

50. Scaling test-time compute - a Hugging Face Space by HuggingFaceH4. Available at: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute (accessed 17 December 2024).

51. Zheng C., Zhang Z., Zhang B., Lin R., Lu K., Yu B., Liu D., Zhou J., Lin J. ProcessBench: Identifying Process Errors in Mathematical Reasoning, arXiv preprint arXiv:2412.06559, 2024.

52. Schick T., Dwivedi-Yu J., Dessì R., Raileanu R., Lomeli M., Hambro E., Zettlemoyer L., Cancedda N., Scialom T. Toolformer: Language models can teach themselves to use tools, Advances in Neural Information Processing Systems, 2023, Vol. 36, pp. 68539-68551.

53. Li C., Liang J., Zeng A., Chen X., Hausman K., Sadigh D., Levine S., Fei-Fei L., Xia F., Ichter B. Chain of code: Reasoning with a language model-augmented code emulator, arXiv preprint arXiv:2312.04474, 2023.

54. Chen W., Ma X., Wang X., Cohen W.W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, arXiv preprint arXiv:2211.12588, 2022.

55. Gao L., Madaan A., Zhou S., Alon U., Liu P., Yang Y., Callan J., Neubig G. PAL: Program-aided language models, International Conference on Machine Learning, 2023, pp. 10764-10799.

56. Yao S., Zhao J., Yu D., Du N., Shafran I., Narasimhan K., Cao Y. React: Synergizing reasoning and acting in language models, arXiv preprint arXiv:2210.03629, 2022.

57. Gou Z., Shao Z., Gong Y., Shen Y., Yang Y., Huang M., Duan N., Chen W. ToRA: A tool-integrated reasoning agent for mathematical problem solving, arXiv preprint arXiv:2309.17452, 2023.

58. Motwani S. R., Smith C., Das R. J., Rybchuk M., Torr P.H., Laptev I., Pizzati F., Clark R., Witt C.S. de. MALT: Improving Reasoning with Multi-Agent LLM Training, arXiv preprint arXiv:2412.01928, 2024.

59. Wang Q., Wang Z., Su Y., Tong H., Song Y. Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key?, arXiv preprint arXiv:2402.18272, 2024.

60. Chen X., Xu J., Liang T., He Z., Pang J., Yu D., Song L., Liu Q., Zhou M., Zhang Z., others. Do not think that much for 2+3=? On the overthinking of o1-like LLMs, arXiv preprint arXiv:2412.21187, 2024.

61. Pu X., Saxon M., Hua W., Wang W.Y. THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models, arXiv preprint arXiv:2504.13367, 2025.

62. Sui Y., Chuang Y.-N., Wang G., Zhang J., Zhang T., Yuan J., Liu H., Wen A., Chen H., Hu X., others. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models, arXiv preprint arXiv:2503.16419, 2025.

63. Fang G., Ma X., Wang X. Thinkless: LLM Learns When to Think, arXiv preprint arXiv:2505.13379, 2025.

64. Chen Y., Benton J., Radhakrishnan A., Uesato J., Denison C., Schulman J., Somani A., Hase P., Wagner M., Roger F., others. Reasoning Models Don't Always Say What They Think, arXiv preprint arXiv:2505.05410, 2025.

65. Yue Y., Chen Z., Lu R., Zhao A., Wang Z., Song S., Huang G. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?, arXiv preprint arXiv:2504.13837, 2025.

66. Liu Z., Chen C., Li W., Pang T., Du C., Lin M. There May Not be Aha Moment in R1-Zero-like Training - A Pilot Study, 2025.

67. Mancoridis M., Weeks B., Vafa K., Mullainathan S. Potemkin Understanding in Large Language Models, arXiv preprint arXiv:2506.21521, 2025.

68. Shojaee P., Mirzadeh I., Alizadeh K., Horton M., Bengio S., Farajtabar M. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, arXiv preprint arXiv:2506.06941, 2025.

69. Opus C., Lawsen A. The Illusion of the Illusion of Thinking, arXiv preprint arXiv:2506.09250, 2025.

70. Malek A., Ge J., Jin C., György A., Szepesvári C. Frontier LLMs Still Struggle with Simple Reasoning Tasks, arXiv preprint arXiv:2507.07313, 2025.


Published

2026-02-27


Section

SECTION IV. MACHINE LEARNING AND NEURAL NETWORKS