Improving the maximum attainable accuracy of communication-avoiding Krylov subspace methods Erin Carson and James Demmel Householder Symposium XIX June 8-13, 2014, Spa, Belgium Model Problem: 2D Poisson on grid, . Equilibration (diagonal scaling) used. RHS set s.t. elements of true solution . This affect Roundoff We Residual extend can be error replacement this worse can strategy for

cause communication-avoiding strategy toacommunication-avoiding decrease of van in der attainable Vorst and accuracy (orvariants. Ye -step) (1999) of Krylov Accuracy improves subspace can attainable be methods Krylov

improved accuracy subspace limits for minimal for practice methods. classical performance applicability! Krylov methods cost! 2 Communication bottleneck in KSMs A Krylov Subspace Method is a projection process onto the Krylov subspace orthogonal to some .

For linear systems, approx. solution to by imposing and , where . Projection process in each iteration: 1. Add a dimension to Sparse Matrix-Vector Multiplication (SpMV) Parallel: comm. vector entries w/ neighbors Sequential: read /vectors from slow memory SpMV orthogonalize Dependencies between 2. communication-bound Orthogonalize with respect to kernels in each Inner products iteration limit Parallel: global reduction performance! Sequential: multiple reads/writes to slow memory 1

Example: Classical conjugate gradient (CG) for solving Let for until convergence do SpMVs and inner products require communication in each iteration! end for 2 Communication-avoiding CA-KSMs Krylov methods can be reorganized to reduce communication cost by Many CA-KSMs (or -step KSMs) derived in the literature: CG, GMRES, Orthomin, MINRES, Lanczos, Arnoldi, CGS, Orthodir, BICG, CGS, BICGSTAB, QMR Related references: (Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), GMRES (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kinkaid, 2001), (Hoemmen, 2010), (Philippe and Reichel, 2012), (C., Knight, Demmel 2013), (Feuerriegel and Bcker, 2013).

Reduction in communication can translate to speedups on practical problems Recent results: x speedup with for CA-BICGSTAB on GMG bottom-solve ( cores, on Cray XE6) (Williams, et al., 2014). 3 CA-CG overview Starting where ),at iteration for , it can be shown that to compute the next steps (iterations . 1. Compute matrix of dimension -by-, giving the recurrence relation , where is -by- and is 2x2 block diagonal with upper Hessenberg blocks, and is with columns and set to 0. Represent length- iterates by length- coordinates in : Communication cost: Same latency cost as one SpMV using matrix powers kernel (e.g., Hoemmen et al., 2007), assuming sufficient sparsity structure (low diameter) 2. Compute Gram matrix of dimension -by-. Communication cost: One reduction 4 No communication in inner loop!

CG for do end for CA-CG Compute such that Compute for do end for [ + , + , + ]= [ , , , , , ] 5 CA-KSMs in finite precision CA-KSMs are mathematically equivalent to classical KSMs But have different behavior in finite precision! Roundoff errors have two discernable effects: 1. Loss of attainable accuracy (due to deviation of true residual and updated residual ) 2. Delay of convergence (due to perturbation of Lanczos recurrence)

Generally worse with increasing Obstacle to solving practical problems: Decrease in attainable accuracy some problems that KSM can solve cant be solved with CA variant Delay of convergence if # iterations increases more than time per iteration decreases due to CA techniques, no speedup expected! 6 Maximum attainable accuracy of CG In classical CG, iterates are updated by and Formulas for and do not depend on each other - rounding errors cause the true residual, , and the updated residual, , to deviate The size of , determines the maximum attainable accuracy Write the true residual as Then the size of the true residual is bounded by When , and have similar magnitude When , depends on Many results on attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst and Fokkema (1994), Sleijpen, van der Vorst and Modersitzki (2001), Bjrck, Elfving and Strako (1998) and Gutknecht and Strako (2000). 7

Example: Comparison of convergence of true and updated residuals for CG vs. CACG using a monomial basis, for various values Model problem (2D Poisson on grid) 8 Better conditioned polynomial bases can be used instead of monomial. Two common choices: Newton and Chebyshev - see, e.g., (Philippe and Reichel, 2012). Behavior closer to that of classical CG But can still see some loss of attainable accuracy compared to CG Clearer for higher values 9 Residual replacement strategy for CG van der Vorst and Ye (1999): Improve accuracy by replacing updated residual by the true residual in certain iterations, combined with group update. Choose when to replace with to meet two constraints: 1. Replace often enough so that at termination, is small relative to 2. Dont replace so often that original convergence mechanism of updated residuals is

destroyed (avoid large perturbations to finite precision CG recurrence) Use computable bound for to update error estimate in each iteration: , If , (group update), set , set If updated residual converges to , true residual reduced to 10 Sources of roundoff error in CA-CG Computing the -step Krylov basis: Error in computing -step basis Updating coordinate vectors in the inner loop: Error in updating coefficient vectors with Recovering CG vectors for use in next outer loop: Error in basis change

11 Maximum attainable accuracy of CA-CG We can write the deviation of the true and updated residuals in terms of these errors: Using standard rounding error results, this allows us to obtain an upper bound on . 12 A computable bound We extend van der Vorst and Yes residual replacement strategy to CA-CG Making use of the bound for in CA-CG, update error estimate by: All lower order terms in CA-CG Residual replacement does flops 1no reduction per iterations flopsper per iterations; iterations; communication

Estimated only once not asymptotically increase communication or computation! = otherwise where 13 Residual replacement for CA-CG Use the same replacement condition as van der Vorst and Ye (1999): where is a tolerance parameter, and we initially set . Pseudo-code for residual replacement with group update for CA-CG: if break from inner loop and begin new outer loop end 14

15 Residual Replacement Indices Total Number of Reductions s=4 s=4 s=8 s=8 s=12 s=12 CACG Mono. Mono. CACG 354 203 224, 334, 401, 517 157 135, 2119 557

CACG CACG Newt. Newt. 353 196 340 102 326 68 CACG CACG Cheb. Cheb. 365 197 353 99 346 71 CG CG

355 669 #Inreplacements small compared to total addition to attainable accuracy, iterations RR doesnt significantly affect convergence rate is incredibly communication savings!implementations important in practical Can have bothrate speed and accuracy! Convergence depends

on basis 18 15 Current work: CA-Lanczos convergence analysis , , for , Classic Lanczos rounding error result of Paige (1976): where , , and 0 = ( ) 1 = ( ) These results form the basis for Paiges influential results in (Paige, 1980). 16

Current work: CA-Lanczos convergence analysis Let for , For CA-Lanczos, we have: with (vs. for Lanczos) , (vs. for Lanczos) where , If is numerically full rank for and if , the results of (Paige, 1980) apply to CA-Lanczos (with these news values of ). Confirms the empirically observed effect of basis conditioning on convergence. 17 Thank you!

contact: [email protected] http://www.cs.berkeley.edu/~ecc2z/ 21 Extra Slides Replacement Iterations Newton: 240, 374 Chebyshev: 271, 384 23 Upper bound on maximum attainable accuracy in communicationavoiding Krylov subspace methods in finite precision Applies to CA-CG and CA-BICG Implicit residual replacement strategy for maintaining agreement between residuals to within Strategy can be implemented within method without asymptotically increasing communication or computation 24 25

26 Total Iterations CACG Mono. CACG Newt. CACG Cheb. s=4 813 782 785 s=8 1246

817 785 s=12 6675 813 850 CG 669 27 Total Number of Communication Steps CACG Mono. CACG Newt.

CACG Cheb. s=4 203 196 197 s=8 157 102 99 s=12 557

68 71 CG 669 28 Motivation: Communication-avoiding, KSM bottleneck, how CA-KSMs work (1) Speedups, but numerical properties bad (2) attainable accuracy, convergence Today focus on attainable accuracy but mention current work related to convergence Include plots In order to be practical, numerical properties cant negate CA benefits Related work CA-KSMs (1) Bounds on attainable accuracy, residual replacement strategies, VdV and Ye (2) CA-CG derivation (2) Maximum attainable accuracy bound (1) Residual replacement strategy of Vdv and Ye (1) Residual replacement strategy for CA-CG (1) How can be computed cheaply (1)

Plots (1) Current work (2): results of Paige and analogous results for CA-case 29 Improving Maximum Attainable Accuracy in CA-KSMs o Van der Vorst and Ye (1999) : Residual replacement used in combination with group-update of solution vector to improve the maximum attainable accuracy o Given computable upper bound for deviation, replacement steps chosen to satisfy two constraints: (1) Deviation must not grow so large that attainable accuracy is limited (2) Replacement must not perturb Lanczos recurrence relation for computed residuals such that convergence deteriorates o When the computed residual converges to level but true residual deviates, strategy reduces true residual, to level o We devise an analogous method for CA-KSMs o Requires determining a computable upper bound for deviation of residuals In CA-KSMs, we can write the deviation of the true and computed residual as: 30 A Computable Bound We

can bound the deviation of the true and computed residual in CACG with residual replacement in each iteration by updating the quantity by last inner itr. otherwise 31 Residual Replacement Strategy Based on the perturbed Lanczos recurrence (see, e.g. Van der Vorst and Ye, 1999), we should replace the residual when where is a tolerance parameter, chosen as , and we initially set . Pseudo-code for residual replacement with group update for CA-CG: if break from inner loop end 32

Communication-avoiding CA-KSMs Krylov methods can be reorganized to reduce communication cost by Main idea: Outer Loop k: 1 communication step Expand Krylov basis by O(s) dimensions up front, stored in matrix Same cost as a single SpMVs (for well-partitioned ) using matrix powers kernel (see, e.g., Hoemmen et al., 2007) Compute Gram matrix : one global reduction Inner Loop j: computation steps Update -vectors of coordinates for , in No communication! Quantities either local (parallel) or fit in fast memory (sequential) Many CA-KSMs (or -step KSMs) derived in the literature: (Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kinkaid, 2001), (Hoemmen, 2010). 3 CA-CG Derivation Starting at iteration for , it can be shown that for iterations where ,

Define the basis matrices , where , , where , where is a polynomial of degree satisfying , This allows us to write , 4 CA-CG Derivation Let , , and . Then we see that we can represent iterates by their coordinates in : , , and multiplication by can be written: Computing : Matrix powers kernel, same communication cost as 1 SpMV for wellpartitioned Then after computing and , for iterations , ,

Inner products can be written: Vector updates done implicitly by updating coordinates: Computing : one global reduction No communication for iterations! All small dimensional: local/fit in cache. 5 Example: CA-CG for solving Let for until convergence do Compute and Let for do end for Recover , , end for

via CA Matrix Powers Kernel Global reduction to compute Local computations within inner loop require no communication! 6 Related Work: -step methods Authors KSM Basis Van Rosendale, 1983 CG

Monomial Polynomial No No Leland, 1989 CG Monomial Polynomial No No Walker, 1988 GMRES

Monomial None No No Chronopoulos and Gear, 1989 CG Monomial None No No Monomial None

No No Chronopoulos and Orthomin, Kim, 1990 GMRES Precond? Mtx Pwrs? TSQR? Chronopoulos, 1991 MINRES Monomial None No

No Kim and Chronopoulos, 1991 Symm. Lanczos, Arnoldi Monomial None No No de Sturler, 1991 GMRES Chebyshev

None No No 37 Related Work: -step methods Authors KSM Basis Precond? Mtx Pwrs? TSQR? Joubert and Carey, 1992 Chronopoulos and Kim, 1992

Bai, Hu, and Reichel, 1991 Erhel, 1995 de Sturler and van der Vorst, 2005 Toledo, 1995 GMRES Chebyshev No Yes* No Nonsymm. Lanczos GMRES Monomial No

No No Newton No No No GMRES GMRES Newton Chebyshev No General No No

No No CG Monomial Polynomial Yes* No CGR, Orthomin Monomial No No No

Orthodir Monomial No No No Chronopoulos and Swanson, 1990 Chronopoulos and Kinkaid, 2001 38 Communication bottleneck in KSMs A Krylov Subspace Method is a projection process onto the Krylov subspace orthogonal to some .

For linear systems, approx. solution to by imposing and , where . Projection process in each iteration: 1. Add a dimension to Sparse Matrix-Vector Multiplication (SpMV) Parallel: comm. vector entries w/ neighbors Sequential: read /vectors from slow memory SpMV orthogonalize Dependencies between 2. communication-bound Orthogonalize with respect to kernels in each Inner products iteration limit Parallel: global reduction performance!

Sequential: multiple reads/writes to slow memory 1 Residual replacement strategy for CG Van der Vorst and Ye (1999): Improve accuracy by replacing updated residual by the true residual in certain iterations, combined with group update. Bound error in residual update with residual replacement scheme: gives an upper bound for Want to replace with whenever 1. has grown significantly since last replacement step, and 2. is not much larger than (avoid large perturbations to finite precision CG recurrence) Update , an estimate of (and bound for ), in each iteration: , where is the maximal # non-zeros per row of If , (group update), set , set If updated residual converges to , true residual reduced to 10 Communication and computation costs We can compute the terms we need in calculation of as follows:

Estimated only once at the start of the algorithm. Computed once per outer loop no communication. ^ + ^ + Can be estimated using Gram matrix: All lower order terms in and context of CA-CG algorithm Residual Replacement Strategy does not where is available from a previous iteration. No additional communication is increase or needed,asymptotically and additional computation cost iscommunication per outer loop. computation costs!

If we compute at the start of each outer loop, we can compute these quantities in each inner loop in the 2-norm for a factor of additional communication per outer loop (a lower order term). If we can compute in the same reduction step to compute , the latency cost is not affected (otherwise 1 extra message). 15