MULTI-AGENT ALGORITHM FOR CREATING A RESIDUAL PROBLEM-SOLVING SCHEME IN DISTRIBUTED APPLIED SOFTWARE PACKAGES

  • A.G. Feoktistov Matrosov Institute for System Dynamics and Control Theory of SB RAS
  • R.O. Kostromin Matrosov Institute for System Dynamics and Control Theory of SB RAS
  • I.A. Sidorov Matrosov Institute for System Dynamics and Control Theory of SB RAS
  • S.A. Gorsky Matrosov Institute for System Dynamics and Control Theory of SB RAS
Keywords: Distributed applied software package, problem-solving scheme, multi-agent management, fault-tolerance

Abstract

Basic software tools that implement technologies for organizing computations in high-performance computing systems now provide a basis for the mass creation and use of parallel and distributed applications. Tools for creating applied software packages and workflow support systems are being actively developed and applied in practice. However, an analysis of their practical application shows that the fault tolerance of problem-solving processes in distributed applied software packages must be increased for problems that include sets of interrelated subproblems. This need becomes especially urgent when problems are solved in a heterogeneous distributed computing environment. Clusters, including hybrid clusters with heterogeneous nodes, are the main components of such an environment. High-performance servers, storage systems, personal computers, and other computing elements complement its infrastructure. The paper presents an adaptive multi-agent algorithm for redistributing jobs across the resources of such an environment. The algorithm is used when the problem-solving process in a distributed applied software package is restarted after a software or hardware failure. In contrast to the well-known fault-tolerance algorithms used in workflow management systems, this algorithm relies on program specialization methods to create and execute a residual problem-solving scheme. It also makes active use of meta-monitoring of computational resources. A comparative analysis of experimental results obtained by semi-natural modeling, in which various meta-schedulers supported the fault tolerance of scheme execution for problems of distributed applied software packages, demonstrated the advantage of the proposed approach to multi-agent management in a heterogeneous distributed computing environment.
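
The abstract does not spell out how a residual problem-solving scheme is built, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a scheme is an acyclic set of modules with named input and output parameters, and that outputs of modules completed before the failure are still available. The names `Scheme`, `residual_scheme`, `known`, and `targets` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical scheme description (an assumption, not the paper's format):
# module name -> (input parameters, output parameters).
Scheme = Dict[str, Tuple[Set[str], Set[str]]]


def residual_scheme(scheme: Scheme, known: Set[str], targets: Set[str]) -> List[str]:
    """Return the modules that still have to run, dependencies first.

    `known` holds parameters whose values survived the failure: the initial
    data plus the outputs of modules that finished successfully.
    The scheme is assumed to be acyclic.
    """
    # Map every parameter to the module that produces it.
    producer = {p: m for m, (_, outs) in scheme.items() for p in outs}

    order: List[str] = []
    visited: Set[str] = set()

    def require(param: str) -> None:
        if param in known:
            return  # value already available, nothing to recompute
        if param not in producer:
            raise ValueError(f"no module produces parameter {param!r}")
        module = producer[param]
        if module in visited:
            return
        visited.add(module)
        for dep in sorted(scheme[module][0]):
            require(dep)
        order.append(module)  # post-order: inputs are scheduled first

    for target in sorted(targets):
        require(target)
    return order


# Illustrative data: m1 finished before the failure, so "b" is known.
example: Scheme = {
    "m1": ({"a"}, {"b"}),
    "m2": ({"a"}, {"c"}),
    "m3": ({"b", "c"}, {"d"}),
    "m4": ({"d"}, {"e"}),
}
print(residual_scheme(example, known={"a", "b"}, targets={"e"}))
# prints ['m2', 'm3', 'm4']
```

In this sketch only the modules whose results are missing are rescheduled. Assigning those modules to resources, which the paper handles with agents and meta-monitoring, is outside its scope.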

Published
2019-04-03
Section
SECTION. II. DISTRIBUTED AND CLOUD COMPUTING