Stochastic subspace correction methods and fault tolerance

Griebel, Michael; Oswald, Peter

doi:10.1090/mcom/3459

Stochastic subspace correction methods and fault tolerance
HTML articles powered by AMS MathViewer

by Michael Griebel and Peter Oswald HTML | PDF

Math. Comp. 89 (2020), 279-312 Request permission

Abstract:

We present convergence results in expectation for stochastic subspace correction schemes and their accelerated versions to solve symmetric positive-definite variational problems, and discuss their potential for achieving fault tolerance in an unreliable compute network. We employ an overlapping domain decomposition algorithm for PDE discretizations to discuss the latter aspect.

References

E. Agullo, L. Giraud, and M. Zounon, On the resilience of a parallel hybrid solver, In IEEE 22nd International Conference on High Performance Computing (HiPC), 2015, 75–84; also RR-8744, INRIA Bordeaux 2015 (hal-01165186v2).
Mark Ainsworth and Christian Glusa, Is the multigrid method fault tolerant? The two-grid case, SIAM J. Sci. Comput. 39 (2017), no. 2, C116–C143. MR 3633785, DOI 10.1137/16M1100691
Mark Ainsworth and Christian Glusa, Is the multigrid method fault tolerant? The multilevel case, SIAM J. Sci. Comput. 39 (2017), no. 6, C393–C416. MR 3719032, DOI 10.1137/16M1097274
M. Altenbernd and D. Göddeke, Soft fault detection and correction for multigrid, Int. J. High Perf. Comput. Appl., vol. 32:6 (2017), 897–912. DOI: 10.1177/1094342016684006
Randolph E. Bank and Michael Holst, A new paradigm for parallel adaptive meshing algorithms, SIAM Rev. 45 (2003), no. 2, 291–323. Reprinted from SIAM J. Sci. Comput. 22 (2000), no. 4, 1411–1443 [MR1797889]. MR 2010380, DOI 10.1137/S003614450342061
Olivier Fercoq and Peter Richtárik, Optimization in high dimensions via accelerated, parallel, and proximal coordinate descent, SIAM Rev. 58 (2016), no. 4, 739–771. MR 3567606, DOI 10.1137/16M1085905
Z. Chen and J. Dongarra, Algorithm-based fault tolerance for fail-stop failures, IEEE Transactions on Parallel and Distributed Systems 19:12 (2008), 1628–1641.
Tao Cui, Jinchao Xu, and Chen-Song Zhang, An error-resilient redundant subspace correction method, Comput. Vis. Sci. 18 (2017), no. 2-3, 65–77. MR 3603807, DOI 10.1007/s00791-016-0270-6
Aymeric Dieuleveut and Francis Bach, Nonparametric stochastic approximation with large step-sizes, Ann. Statist. 44 (2016), no. 4, 1363–1399. MR 3519927, DOI 10.1214/15-AOS1391
Michael Griebel and Peter Oswald, Greedy and randomized versions of the multiplicative Schwarz method, Linear Algebra Appl. 437 (2012), no. 7, 1596–1610. MR 2946344, DOI 10.1016/j.laa.2012.04.052
Michael Griebel and Peter Oswald, Stochastic subspace correction in Hilbert space, Constr. Approx. 48 (2018), no. 3, 501–521. MR 3869450, DOI 10.1007/s00365-018-9447-1
M. Griebel and P. Oswald, Stochastic subspace correction methods and fault tolerance, INS Preprint No. 1809, Bonn Univ., 2018; arXiv:1807.11315 (July 30, 2018).
Michael Griebel, Michael Schneider, and Christoph Zenger, A combination technique for the solution of sparse grid problems, Iterative methods in linear algebra (Brussels, 1991) North-Holland, Amsterdam, 1992, pp. 263–281. MR 1159736
Michael Griebel and Gerhard Zumbusch, Parallel adaptive subspace correction schemes with applications to elasticity, Comput. Methods Appl. Mech. Engrg. 184 (2000), no. 2-4, 303–332. Vistas in domain decomposition and parallel processing in computational mechanics. MR 1764193, DOI 10.1016/S0045-7825(99)00233-9
Markus Huber, Björn Gmeiner, Ulrich Rüde, and Barbara Wohlmuth, Resilience for massively parallel multigrid solvers, SIAM J. Sci. Comput. 38 (2016), no. 5, S217–S239. MR 3565560, DOI 10.1137/15M1026122
S. Kavila, P. Raju, S. Satapathy, A. Machiraju, G. Kinnera, and K. Rasly, A survey on fault management techniques in distributed computing, In: Proc. of Int. Conf. on Front. of Intell. Comput., AISC 199 S. Satapathy et al. (Eds.), 2013, pp. 593–602, DOI: 10.1007/978364235314767
David E. Keyes and William D. Gropp, A comparison of domain decomposition techniques for elliptic partial differential equations and their parallel implementation, SIAM J. Sci. Statist. Comput. 8 (1987), no. 2, S166–S202. Parallel processing for scientific computing (Norfolk, Va., 1985). MR 879402, DOI 10.1137/0908020
Yin Tat Lee and Aaron Sidford, Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems, 2013 IEEE 54th Annual Symposium on Foundations of Computer Science—FOCS 2013, IEEE Computer Soc., Los Alamitos, CA, 2013, pp. 147–156. MR 3246216, DOI 10.1109/FOCS.2013.24
Ji Liu and Stephen J. Wright, An accelerated randomized Kaczmarz algorithm, Math. Comp. 85 (2016), no. 297, 153–178. MR 3404446, DOI 10.1090/mcom/2971
Yurii Nesterov, Introductory lectures on convex optimization, Applied Optimization, vol. 87, Kluwer Academic Publishers, Boston, MA, 2004. A basic course. MR 2142598, DOI 10.1007/978-1-4419-8853-9
Yu. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM J. Optim. 22 (2012), no. 2, 341–362. MR 2968857, DOI 10.1137/100802001
Peter Oswald, Multilevel finite element approximation, Teubner Skripten zur Numerik. [Teubner Scripts on Numerical Mathematics], B. G. Teubner, Stuttgart, 1994. Theory and applications. MR 1312165, DOI 10.1007/978-3-322-91215-2
Peter Oswald and Weiqi Zhou, Convergence analysis for Kaczmarz-type methods in a Hilbert space framework, Linear Algebra Appl. 478 (2015), 131–161. MR 3342418, DOI 10.1016/j.laa.2015.03.028
S. Pauli, P. Arbenz, and C. Schwab, Intrinsic fault tolerance of multi-level Monte Carlo methods, In Parallel Computing: Accelerating Computational Science and Engineering (CSE), M. Bader et al. (Eds.), IOS Press, 2014, pp. 471–480.
Boris T. Polyak, Introduction to optimization, Translations Series in Mathematics and Engineering, Optimization Software, Inc., Publications Division, New York, 1987. Translated from the Russian; With a foreword by Dimitri P. Bertsekas. MR 1099605
P. Richtárik and M. Takǎć, Stochastic reformulations of linear systems: Algorithms and convergence theory, arxiv:1706.01109v2 (6 Jun 2017).
F. Rizzi, K. Morris, K. Sargsyan, P. Mycek, C. Safta, O. Le Maître, O. Knio, and B. Debusschere, Partial differential equations preconditioner resilient to soft and hard faults, Int. J. High Perf. Comput. Appl. 32:5 (2017), 658–673, DOI: 10.1177/1094342016684975
Khachik Sargsyan, Francesco Rizzi, Paul Mycek, Cosmin Safta, Karla Morris, Habib Najm, Olivier Le Maître, Omar Knio, and Bert Debusschere, Fault resilient domain decomposition preconditioner for PDEs, SIAM J. Sci. Comput. 37 (2015), no. 5, A2317–A2345. MR 3395134, DOI 10.1137/15M1014474
Barry F. Smith, A parallel implementation of an iterative substructuring algorithm for problems in three dimensions, SIAM J. Sci. Comput. 14 (1993), no. 2, 406–423. MR 1204238, DOI 10.1137/0914025
M. Snir, R. Wisniewski et al., Addressing failures in exascale computing, Int. J. High Perf. Comput. Appl. 28:2 (2014), 129–173, DOI: 10.1177/1094342014522573
L. Stals, Algorithm-based fault recovery of adaptively refined parallel multilevel grids, Int. J. High Perf. Comput. Appl. 33:1 (2017), 189–211, DOI: 10.1177/1094342017720801
Miroslav Stoyanov and Clayton Webster, Numerical analysis of fixed point algorithms in the presence of hardware faults, SIAM J. Sci. Comput. 37 (2015), no. 5, C532–C553. MR 3394363, DOI 10.1137/140991406
Andrea Toselli and Olof Widlund, Domain decomposition methods—algorithms and theory, Springer Series in Computational Mathematics, vol. 34, Springer-Verlag, Berlin, 2005. MR 2104179, DOI 10.1007/b137868
M. Treaster, A survey of fault-tolerance and fault-recovery techniques in parallel systems, ACM Computing Research Repository (CoRR 501002, 1–11) (2005); arxiv:0501002v1.
Jinchao Xu, Iterative methods by space decomposition and subspace correction, SIAM Rev. 34 (1992), no. 4, 581–613. MR 1193013, DOI 10.1137/1034116