СУММИРОВАНИЕ МНОГОКРАТНОЙ ТОЧНОСТИ НА ЦЕНТРАЛЬНЫХ И ГРАФИЧЕСКИХ ПРОЦЕССОРАХ С ИСПОЛЬЗОВАНИЕМ БИБЛИОТЕКИ MPRES

K.S. Isupov; V.S. Knyazkov; A.S. Kuvaev

K.S. Isupov Vyatka State University
V.S. Knyazkov Vyatka State University
A.S. Kuvaev Vyatka State University

Keywords: Computer arithmetic, high precision, summation, residue number system, programming for graphics processors

Abstract

In many scientific applications, it is necessary to compute the sums of floating-point num-bers. Summation is a building block for many numerical algorithms, such as dot product, Taylor series, polynomial interpolation and numerical integration. However, the summation of large sets of numbers in finite-precision IEEE 754 arithmetic can be very inaccurate due to the accumulation of rounding errors. There are various ways to diminish rounding errors in the floating-point sum-mation. One of them is the use of multiple-precision arithmetic libraries. Such libraries provide data structures and subroutines for processing numbers whose precision exceeds the IEEE 754 floating-point formats. In this paper, we consider multiple-precision summation on hybrid CPU-GPU platforms using MPRES, a new software library for multiple-precision computations on CPUs and CUDA compatible GPUs. Unlike existing multiple-precision libraries based on the binary representation of numbers, MPRES uses a residue number system (RNS). In RNS, the num-ber is represented as a tuple of residues obtained by dividing this number by a given set of moduli, and multiple-precision operations such as addition, subtraction and multiplication are naturally divided into groups of reduced-precision operations on residues, performed in parallel and with-out carry propagation. We consider the algorithm for the addition of multiple-precision floatingpoint numbers in MPRES, as well as three summation algorithms: (1) recursive summation, (2) pairwise summation, and (3) block-parallel hybrid CPU-GPU summation. Experiments show that the hybrid algorithm allows the full utilization of the GPU’s resources, and therefore demonstrates better performance. On the other hand, the parallel computation of the digits (residues) of multi-ple-precision significands in RNS reduces computation time.

References

1. Kornerup P., Lefèvre V., Louvet N., Muller J. On the Computation of Correctly Rounded Sums, IEEE Transactions on Computers, 2012, Vol. 61, No. 3, pp. 289-298.
2. Higham N.J. Accuracy and stability of numerical algorithms. Philadelphia, USA: SIAM, 2002, 680 p.
3. Priest D.M. Algorithms for arbitrary precision floating point arithmetic, Proceedings of the 10th IEEE Symposium on Computer Arithmetic. Grenoble, France: IEEE, 1991, pp. 132-143.
4. Higham N.J. The Accuracy of Floating Point Summation, SIAM Journal on Scientific Compu-ting, 1993, Vol. 14, No. 4, pp. 783-799.
5. Anderson I.J. A Distillation Algorithm for Floating-Point Summation, SIAM Journal on Scien-tific Computing, 1999, Vol. 20, No. 5, pp. 1797-1806.
6. Zhu Y.-K., Yong J.-H., Zheng G.-Q. A New Distillation Algorithm for Floating-Point Summa-tion, SIAM Journal on Scientific Computing, 2005, Vol. 26, No. 6, pp. 2066-2078.
7. Ogita T., Rump S.M., Oishi S. Accurate Sum and Dot Product, SIAM Journal on Scientific Computing, 2005, Vol. 26, No. 6, pp. 1955-1988.
8. Rump S.M. Ultimately Fast Accurate Summation, SIAM Journal on Scientific Computing, 2009, Vol. 31, No. 5, pp. 3466-3502.
9. Goodrich M.T., Eldawy A. Parallel Algorithms for Summing Floating-Point Numbers, Pro-ceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures. Pacific Grove, California, USA: ACM, 2016, pp. 13-22.
10. Demmel J., Nguyen H.D. Fast reproducible floating-point summation, Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH). Austin, Texas: IEEE, 2013, pp. 163-172.
11. Demmel J., Nguyen H.D. Parallel reproducible summation, IEEE Transactions on Computers, 2015, Vol. 64, No. 7, pp. 2060-2070.
12. Collange S., Defour D., Graillat S., Iakymchuk R. Numerical reproducibility for the parallel re-duction on multi- and many-core architectures, Parallel Computing, 2015, Vol. 49, pp. 83-97.
13. Neal R.M. Fast Exact Summation Using Small and Large Superaccumulators, Technical re-port, University of Toronto, 2015. Available at: http://www.cs.toronto.edu/~ rad-ford/ftp/xsum.pdf (accessed 08 October 2018).
14. Kadric E., Gurniak P. and DeHon A. Accurate Parallel Floating-Point Accumulation, IEEE Transactions on Computers, 2016, Vol. 65, No. 11, pp. 3224-3238.
15. ReproBLAS: Reproducible BLAS. Available at: https://bebop.cs.berkeley.edu/reproblas/ in-dex.php (accessed 31 October 2018).
16. ExBLAS – Exact BLAS. Available at: https://exblas.lip6.fr/ (accessed 31 October 2018).
17. Bailey D.H., Barrio R., Borwein J.M. High-precision computation: Mathematical physics and dynamics, Applied Mathematics and Computation, 2012, Vol. 218, No. 20, pp. 10106-10121.
18. Joldes M., Muller J.M., Popescu V. Implementation and Performance Evaluation of an Ex-tended Precision Floating-Point Arithmetic Library for High-Accuracy Semidefinite Pro-gramming, Proceedings of the 24th IEEE Symposium on Computer Arithmetic (ARITH). Lon-don, UK: IEEE. 2017, pp. 27-34.
19. Zhao K., Chu X. GPUMP: a Multiple-Precision Integer Library for GPUs, Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010). Bradford, UK: IEEE, 2010, pp. 1164-1168.
20. Lu M., He B., Luo Q. Supporting extended precision on graphics processors, Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN’10). Indi-anapolis, Indiana, USA: ACM, 2010, pp. 19-26.
21. Isupov K., Kuvaev A., Popov M., Zaviyalov A. Multiple-Precision Residue-Based Arithmetic Library for Parallel CPU-GPU Architectures: Data Types and Features, Lecture Notes in Com-puter Science. Springer, 2017, Vol. 10421, pp. 196-204.
22. Hida Y., Li X.S., Bailey D.H. Algorithms for quad-double precision floating point arithmetic, Proceedings of the 15th IEEE Symposium on Computer Arithmetic. Vail, CO, USA: IEEE, 2001, pp. 155-162.
23. Bailey D.H., Hida Y., Li X.S., Thompson B. ARPREC: An arbitrary precision computation package, Technical report, Lawrence Berkeley National Lab. 2002. Available at: https://www.osti.gov/biblio/817634 (accessed 25 October 2018).
24. Akushskiy I.Ya., Yuditskiy D.I. Mashinnaya arifmetika v ostatochnykh klassakh [Machine arithmetic in residual classes]. Moscow: Sov. Radio, 1968, 440 p.
25. Omondi A., Premkumar B. Residue number systems: theory and implementation. London: Imperial College Press, 2007, 312 p.
26. Mohan P.A. Residue Number Systems. Theory and Applications. Cham: Birkhäuser, 2016, 351 p.
27. Dalton B., Wang A., Blainey B. SIMDizing pairwise sums: a summation algorithm balancing accuracy with throughput, Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing (WPMVP’14). Orlando, Florida, USA: IEEE, 2014, pp. 65-70.
28. Harris M. Optimizing parallel reduction in CUDA. Available at: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf (accessed 04 November 2018).
29. Isupov K., Knyazkov V. Interval Estimation of Relative Values in Residue Number System, Journal of Circuits, Systems and Computers, 2018, Vol. 27, No. 1, pp. 1850004.
30. Hauser J.R. Handling Floating-Point Exceptions in Numeric Programs, ACM Transactions on Programming Languages and Systems, 1996, Vol. 18, No. 2, pp. 139-174.
31. Isupov K.S., Knyaz'kov V.S., Kuvaev A.S. Effektivnoe masshtabirovanie v sisteme ostatochnykh klassov s ispol'zovaniem interval'nykh otsenok [Effective scaling in the system of residual classes using interval estimates], Vychislitel'nye tekhnologii [Computational tech-nologies], 2018, Vol. 23, No. 3, pp. 39-57.
32. McNamee J.M. A comparison of methods for accurate summation, ACM SIGSAM Bull, 2004, Vol. 38, No. 1, pp. 1-7.
33. Fousse L., Hanrot G., Lefèvre V., Pélissier P., Zimmermann P. MPFR: a multiple-precision binary floating-point library with correct rounding, ACM Transactions on Mathematical Soft-ware, 2007, Vol. 33, No. 2, Article No. 13.

SUMMATION OF THE MULTIPLE ACCURACY ON CENTRAL AND GRAPHIC PROCESSORS USING THE MPRES LIBRARY

Abstract

References