GPU-Accelerated Parallel Algorithms for Structured Sparse Matrices
Esmail Abdul Fattah
esmail.abdulfattah@kaust.edu.sa
King Abdullah University of Science and Technology
Hatem Ltaief, Håvard Rue, David E. Keyes
Introduction
Computing only the inverse elements that matter
Given a sparse matrix A, we need specific elements of A-1. However, the full inverse is typically dense, making direct computation prohibitive.
Compute only (A-1)ij where Aij ≠ 0, preserving the sparsity pattern.
Motivation
Different sparsity levels, same selected inverse structure
Matrix A (sparse)
Few spatial locations
Matrix B (denser)
More spatial locations
Selected Inverse
Same dense arrowhead!
Same dense arrowhead!
Regardless of initial sparsity, the selected inverse retains the arrowhead structure. This enables efficient tile-based computation.
Arrowhead matrices arise from Kronecker products modeling space-time dependencies:
Data Structure
From scalar indices to contiguous tile storage
Element position in matrix
Tile containing element
Contiguous memory layout
Tiles are stored contiguously in memory, enabling cache-efficient BLAS-3 operations and predictable memory access patterns.
Core Concept
full → full
diag → full!
diag+ → full!
off-diag → min
corner → min
full → full
arrow → arrow! ★
partial → partial
off-diag → min
corner → min
Algorithm
3×3 tiled matrix example: LTΣ = L-1
Σ (inverse)
Level 3: Σ₁₁ and Σ₂₀ on same anti-diagonal → no mutual deps → run in parallel!
Parallelization
3×3 tiled matrix example
Σ (inverse) - Parallel Execution
Phase 1: All diagonals parallel. Phase 2: Anti-diagonal parallelism (Σ₁₁ ∥ Σ₂₀). Larger matrices = more parallelism.
Applicability
Algorithm not restricted to arrowhead structure
Computing the inverse on sparsity pattern of A may require additional tiles due to fill in L, but does not imply computation becomes dense. Substantial savings are common in practice.
Not forming a tile ≠ tile of A-1 is zero. We simply avoid forming tiles not required for the requested output.
Related Work
Goal: compute A−1ij for all (i,j) in the sparsity pattern of A, not just a few entries
Non-zeros of A
(before fill-in)
→
compute A−1ij
at every (i,j)
where Aij ≠ 0
Selected entries of A−1
(same sparsity pattern)
MUMPS
Solves A·x = eᵢ per entry via elimination tree pruning. Efficient for few arbitrary entries, not for computing the full sparsity pattern.
Amestoy et al., SIAM J. Matrix Anal. Appl., 2001
PSelInv (PEXSI)
Supernodal left-looking selected inversion. Slower than PARDISO on large structured problems, with out-of-memory failures. PEXSI is not actively maintained.
Jacquelin et al., ACM TOMS, 2017 | Verbosio, PhD thesis, 2019
Serinv
Specialized for block-tridiagonal + arrowhead matrices only. sTiles targets general sparse SPD matrices beyond this fixed structure.
Maillou et al., arXiv:2503.17528, 2025
Panua-PARDISO ✓ chosen benchmark
State-of-the-art supernodal LLT solver, strongest competitor for full-pattern selected inversion.
Schenk et al., Future Gener. Comput. Syst., 2001 | Schenk & Gärtner, Parallel Comput., 2002
Experimental Setup
Arrowhead structured matrices from INLA-based models
| ID | Size | Bandwidth | Arrow | Density |
|---|---|---|---|---|
| Small (10K) | ||||
| 1-3 | 10,010 | 100-300 | 10 | 0.4-0.6% |
| 4-6 | 10,200 | 100-300 | 200 | 3.9-4.1% |
| Medium (100K) | ||||
| 7-9 | 100,010 | 1K-3K | 10 | 0.1-0.3% |
| 10-12 | 100,200 | 1K-3K | 200 | 0.5-0.6% |
| Large (500K) | ||||
| 13-15 | 500,010 | 1K-3K | 10 | 0.02-0.05% |
| 16-18 | 500,200 | 1K-3K | 200 | 0.1-0.13% |
Arrowhead: Block diagonal + dense last row/column
Bandwidth: Width of diagonal band
Arrow thickness: Size of arrowhead blocks
Size: 50,010
Bandwidth: 15,000
High compute intensity for A100
Matrices reflect INLA-based Bayesian models for spatio-temporal data
sTiles vs PARDISO for selected inversion on 500K matrices (computing (A⁻¹)ᵢⱼ on sparsity pattern)
sTiles vs PARDISO
SelInv / Chol time ratio
Reference: PARDISO Selected Inversion [Schenk et al., SIAM SISC 2007]
Scalability
AMD EPYC 7713 · 1-52 cores · 100K matrices
Matrix ID 12 (high bandwidth): 16× speedup at 52 cores. sTiles scales consistently while PARDISO saturates early.
100K matrices · AMD EPYC 7713 · 52 cores
Bandwidth 1500
Bandwidth 3000
Applicability
SuiteSparse matrices — sTiles tiling framework generalizes beyond arrowhead structure
ship_001
n = 34,920 | nnz = 3,896,496
| # | P-Chol | P-SelInv | S-Chol | S-SelInv |
|---|---|---|---|---|
| 1 | 0.393 | 0.906 | 0.591 | 0.897 |
| 8 | 0.080 | 0.278 | 0.104 | 0.270 |
| 16 | 0.077 | 0.289 | 0.077 | 0.295 |
| 32 | 0.064 | 0.358 | 0.089 | 0.326 |
| 50 | 0.079 | 0.284 | 0.088 | 0.357 |
nd6k
n = 18,000 | nnz = 6,897,316
| # | P-Chol | P-SelInv | S-Chol | S-SelInv |
|---|---|---|---|---|
| 1 | 5.335 | 10.500 | 3.479 | 6.360 |
| 8 | 0.849 | 6.354 | 0.540 | 1.273 |
| 16 | 0.550 | 5.642 | 0.351 | 0.934 |
| 32 | 0.483 | 4.972 | 0.409 | 1.271 |
| 50 | 0.552 | 4.124 | 0.456 | 1.320 |
Matrix 50K · BW=15000 · NVIDIA A100 vs AMD EPYC 7713
Large tiles maximize kernel throughput but reduce task parallelism.
Conclusion
Enabling faster Bayesian inference for large-scale spatial statistics.
Questions?