Dense Tiling Meets Structured Sparsity

Scalable Algorithms for High-Performance Computing

Esmail Abdul Fattah

esmail.abdulfattah@kaust.edu.sa


King Abdullah University of Science and Technology (KAUST)

Hatem Ltaief, Håvard Rue, David E. Keyes

SIAM Conference on Parallel Processing for Scientific Computing (PP26)

Why This Matters

Bayesian Inference at Scale

Major global institutions rely on Bayesian methods that generate structured sparse matrices requiring efficient factorization.

πŸ₯
WHO [1]

Excess mortality estimation

📊
CDC [2]

Teen birth rates analysis

🌐
UN [3]

World Population Prospects

🌡️
Climate [4]

Interpolating climate variables

The Bottleneck

All these applications use INLA (Integrated Nested Laplace Approximations), which requires:

Cholesky factorization - performed hundreds of times on the same sparsity pattern
Selected inversion - for posterior marginals
Structured matrices - e.g., arrowhead, banded

Key insight: As data resolution increases, these matrices grow larger and denser, becoming the computational bottleneck.

[1] WHO. Methods for estimating excess mortality associated with COVID-19 pandemic. World Health Data Platform, 2022.

[2] CDC. County-level teen birth rates in the United States. National Center for Health Statistics, 2024.

[3] UN DESA. World Population Prospects 2024: Methodology of population estimates. UN DESA/POP/2024, 2024.

[4] Fioravanti et al. Interpolating climate variables using INLA and SPDE. Environmetrics, 2023.

INLA: Global Impact

INLA enables fast Bayesian inference for latent Gaussian models, powering research in epidemiology, ecology, climate science, public health, and beyond.

INLA Global Applications Map

Source: r-inla.org

The Computational Challenge

INLA generates diverse structured sparse matrices from Kronecker products of spatial, temporal, and other model components.

Dense
Arrowhead
Banded Arrowhead
Sparse Arrowhead
Tridiagonal
Block Diagonal
Block-Tridiagonal
Nested Block-Tridiagonal
Block-Diagonal Arrowhead
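As a toy illustration of how Kronecker-type model components induce these structures (hypothetical sizes, not from the talk), the sketch below builds a small space-time precision matrix as a Kronecker sum and checks that its nonzero blocks form a block-tridiagonal pattern:

```python
import numpy as np

# Hypothetical sizes: 4 time points, 3 spatial sites.
# Q = Q_time kron I + I kron Q_space (a Kronecker sum); an RW1-style
# temporal precision is tridiagonal, so Q comes out block-tridiagonal.
nt, ns = 4, 3
Q_time = 2 * np.eye(nt) - np.eye(nt, k=1) - np.eye(nt, k=-1)
Q_space = 2 * np.eye(ns) - np.eye(ns, k=1) - np.eye(ns, k=-1)
Q = np.kron(Q_time, np.eye(ns)) + np.kron(np.eye(nt), Q_space)

# Mark which ns x ns blocks of Q contain any nonzero entry.
blocks = np.array([[np.any(Q[i*ns:(i+1)*ns, j*ns:(j+1)*ns] != 0)
                    for j in range(nt)] for i in range(nt)])
print(blocks.astype(int))  # 1s only where |i - j| <= 1: block-tridiagonal
```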

Sparse Solvers

  • Low arithmetic intensity
  • Poor cache utilization
  • Irregular memory access

Dense Solvers

  • O(n³) complexity
  • Ignores sparsity pattern
  • Wastes memory on zeros

Neither approach is optimal! We need the efficiency of sparse + the performance of dense.

The sTiles Approach

sTiles = Hybrid framework combining sparse techniques (ordering, skip zeros) with dense performance (LAPACK/BLAS-3 on tiles)

Why Tiles?

Inspired by PLASMA dense linear algebra library:

  • Uniform grid of fixed-size dense blocks
  • Regular data structures for predictable dependencies
  • High arithmetic intensity via LAPACK/BLAS-3: POTRF, TRSM, SYRK, GEMM
  • Fine-grained parallelism via tile independence
  • Static scheduling maximizes data locality

Sparse Techniques

Preserve sparsity through preprocessing:

  • Permutation (RCM, AMD, ND) to minimize fill-in
  • Symbolic factorization to identify nonzero tiles
  • Skip zero tiles entirely during computation
  • Operate only on dense nonzero tile clusters

Sparse Matrix (sparse points)

→

Dense Tiles

→

LAPACK/BLAS-3

Skip zero tiles: only 44 / 100 tiles computed

Buttari et al. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing, 2009.
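To make the tile map concrete, here is a small sketch (illustrative sizes and a made-up banded-plus-arrowhead matrix, not the actual sTiles code) that maps a sparse matrix onto a uniform tile grid and counts how many lower-triangular tiles actually need to be computed:

```python
import numpy as np

def nonzero_tile_map(A, ts):
    """Boolean map of which ts x ts tiles of A contain nonzeros."""
    nt = A.shape[0] // ts
    return np.array([[np.any(A[i*ts:(i+1)*ts, j*ts:(j+1)*ts] != 0)
                      for j in range(nt)] for i in range(nt)])

# Illustrative 100x100 symmetric matrix: bandwidth 5, arrow thickness 10.
n, ts, bw, arrow = 100, 10, 5, 10
A = np.zeros((n, n))
for i in range(n):
    A[i, max(0, i - bw):i + 1] = 1.0   # lower band (incl. diagonal)
A[n - arrow:, :] = 1.0                 # arrowhead rows at the end
A = A + A.T                            # symmetrize

tiles = np.tril(nonzero_tile_map(A, ts))   # Cholesky works on the lower triangle
total = (n // ts) * (n // ts + 1) // 2
print(f"{int(tiles.sum())} / {total} lower tiles nonzero")  # the rest are skipped
```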

The Fill-in Challenge

During Cholesky factorization, empty tiles can become non-empty due to GEMM updates

Fill-in case: a zero tile Aij receives the update Aij − Lik × Ljkᵀ with both Lik and Ljk non-empty, so the previously empty tile becomes filled (FILL-IN!).

No-fill case: when Lik and Ljk have disjoint column supports (e.g., cols 1-2 vs. cols 3-4), the product Lik × Ljkᵀ is zero and the tile stays empty (No fill-in!).

Failing to account for this creates many unnecessary fill-in tiles! A good ordering keeps empty tiles empty.
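The effect of ordering on tile-level fill can be sketched with a tiny symbolic pass (a simplification of real symbolic factorization, on made-up 6×6 tile patterns): an arrowhead whose dense border comes first fills the whole lower triangle, while the same structure with the border last stays fill-free.

```python
import numpy as np

def symbolic_tile_fill(mask):
    """Tile-level symbolic Cholesky: tile (i,j) becomes nonzero whenever
    an update L[i,k] x L[j,k]^T touches it (both factor tiles nonzero)."""
    L = np.tril(mask).copy()
    nt = L.shape[0]
    for k in range(nt):
        for i in range(k + 1, nt):
            if L[i, k]:
                for j in range(k + 1, i + 1):
                    if L[j, k]:
                        L[i, j] = True  # GEMM/SYRK update fills this tile
    return L

nt = 6
bad = np.eye(nt, dtype=bool)            # arrowhead with dense FIRST row/column
bad[0, :] = bad[:, 0] = True
good = np.eye(nt, dtype=bool)           # same structure, dense border LAST
good[-1, :] = good[:, -1] = True

print(int(symbolic_tile_fill(bad).sum()))   # 21: entire lower triangle fills
print(int(symbolic_tile_fill(good).sum()))  # 11: no fill beyond the pattern
```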

The Memory Trade-off

When we tile, we allocate full dense storage for every non-zero tile, even if it's sparse inside!

Sparse Tile (diagonal): 4 elements → 16 allocated!

Dense Tile (full): 16 elements → 16 allocated ✓

Same memory cost!
A sparse tile (4 elements) uses the same storage as a dense tile (16 elements)

Worth it?
Yes, if the tile is dense enough! For diagonal tiles, we use customized kernels or smaller tile sizes
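A quick back-of-the-envelope for this trade-off (pure arithmetic, not sTiles code): the overhead of dense tile storage is ts²/nnz, so a diagonal-only tile wastes a factor of ts while a full tile wastes nothing.

```python
def storage_overhead(nnz_in_tile, ts):
    """Dense storage allocated for a ts x ts tile vs. its actual nonzeros."""
    return (ts * ts) / nnz_in_tile

print(storage_overhead(4, 4))      # diagonal-only 4x4 tile: 4.0x memory
print(storage_overhead(16, 4))     # fully dense 4x4 tile: 1.0x, no waste
print(storage_overhead(120, 120))  # diagonal-only CPU tile (ts = 120): 120.0x
```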

CPU Tile Size

~120

L2/L3 cache efficiency

GPU Tile Size

~600

maximize occupancy

Ordering Strategies

Permutation is critical: it determines fill-in, parallelism, and whether we preserve structure

Available Orderings

  • RCM - Reduces bandwidth
  • AMD, COLAMD, CAMD - Minimize fill-in (SuiteSparse)
  • ND - Graph partitioning (METIS)
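For instance, RCM's bandwidth reduction can be demonstrated with SciPy's implementation on a deliberately scrambled path graph (an illustrative toy, not an INLA matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    r, c = np.nonzero(A)
    return int(np.max(np.abs(r - c)))

# A path graph with randomly scrambled vertex labels: huge bandwidth,
# even though a good relabeling makes the matrix tridiagonal.
rng = np.random.default_rng(0)
n = 200
labels = rng.permutation(n)
A = np.eye(n)
for k in range(n - 1):
    i, j = labels[k], labels[k + 1]
    A[i, j] = A[j, i] = 1.0

perm = reverse_cuthill_mckee(csr_matrix(A), symmetric_mode=True)
B = A[np.ix_(perm, perm)]
print(bandwidth(A), "->", bandwidth(B))  # RCM recovers a narrow band
```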

Flexible Combinations

  • Auto-select: run strategies in parallel, choose best for application
  • Tile-level ordering: order the tiles themselves
  • Hierarchical:
    global ND (METIS) + different local ordering per partition (e.g., RCM on P₁, AMD on P₂)

METIS Limitation

ND (METIS) creates independent partitions, but not always optimal:

  • Often creates imbalanced partitions
  • Uneven task distribution between partitions
  • Load imbalance hurts parallel efficiency

Our Approach

  • Adapted ND: balance partition sizes
  • Custom orderings: structure-aware strategies
  • Hybrid: combine best of each method

Ordering Strategies

Nested Dissection creates independent partitions, but uniform tiling breaks this independence

Nested Dissection (partitions P1, P2)

→

With Uniform Tiling

Tiles crossing partition boundaries create dependencies between independent subproblems

Tree Reduction

Exposing parallelism in left-looking Cholesky for arrowhead matrices

The Problem: Sequential Bottleneck

In left-looking Cholesky, many GEMMs update the same target tile sequentially:

Sequential GEMMs

GEMM → GEMM → GEMM → ... (serial dependency)

The Solution: Tree Reduction

Run GEMMs in parallel to temporary tiles, then combine with tree-based GEADD:

Tree Reduction
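A minimal sketch of the reduction (sequential Python standing in for the parallel GEADDs): k partial GEMM products are combined pairwise, giving ceil(log₂ k) levels instead of a length-k serial chain, with the same result up to floating-point rounding.

```python
import numpy as np

def tree_reduce(blocks):
    """Pairwise (tree) accumulation of update tiles; returns (sum, depth)."""
    blocks, depth = list(blocks), 0
    while len(blocks) > 1:
        nxt = [blocks[i] + blocks[i + 1] for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:              # odd block carried to the next level
            nxt.append(blocks[-1])
        blocks, depth = nxt, depth + 1
    return blocks[0], depth

rng = np.random.default_rng(0)
k, ts = 8, 4                             # 8 GEMM updates onto one 4x4 tile
updates = [rng.standard_normal((ts, ts)) @ rng.standard_normal((ts, ts)).T
           for _ in range(k)]

sequential = sum(updates[1:], updates[0])        # the serial-chain result
tiled, depth = tree_reduce(updates)
print(depth)                                     # 3 levels for k = 8
print(bool(np.allclose(sequential, tiled)))      # True: same accumulated update
```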

DAG Comparison: Dense vs Arrowhead

6×6 Tiles

Arrowhead Matrix

Arrowhead structure

Dense Matrix

Dense DAG

Wide DAG = High parallelism

Arrowhead DAG

Thin DAG = Limited parallelism

Tree reduction recovers parallelism in the thin arrowhead DAG

POTRF
SYRK
GEMM
TRSM

sTiles Pipeline

1 Preprocessing

Multi Ordering

RCM, AMD, Adaptable ND

→

Symbolic Factorization

Fill-in analysis

→

Tiling (CTSF)

Compressed Tile Storage

→

Static Scheduling

Each core's tasks saved

Run in parallel: same pattern, different values

2 Cholesky Factorization

Tasks Saved

From preprocessing

→

Tile Kernels

POTRF, TRSM, SYRK, GEMM

→

Tree Reduction

Parallel GEMMs + GEADDs

→

GPU Acceleration

cuBLAS / cuSOLVER

3 Solve

Tasks Saved

From preprocessing

→

Forward

Ly = b

→

Backward

Lᵀx = y

→

Solution x

Ax = b
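The two triangular sweeps can be illustrated with standard dense kernels (NumPy/SciPy stand-ins for the tiled TRSM solves):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # a well-conditioned SPD test matrix
b = rng.standard_normal(n)

L = cholesky(A, lower=True)                # A = L Lᵀ
y = solve_triangular(L, b, lower=True)     # forward:  L y = b
x = solve_triangular(L.T, y, lower=False)  # backward: Lᵀ x = y
print(bool(np.allclose(A @ x, b)))         # True
```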

4 Selected Inversion

Tasks Saved

From preprocessing

→

Selection

Sparsity + fill-in (auto)

→

Two-Phase Numeric

Row-wise + accumulation

→

(A⁻¹)ᵢⱼ

On sparsity pattern
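What "selected inversion" returns can be pinned down with a brute-force reference (dense inverse of a tiny made-up banded matrix; the real algorithm never forms A⁻¹):

```python
import numpy as np

rng = np.random.default_rng(0)
n, bw = 20, 2
A = np.zeros((n, n))
for i in range(n):                      # small symmetric banded test matrix
    for j in range(max(0, i - bw), i):
        A[i, j] = A[j, i] = rng.standard_normal()
np.fill_diagonal(A, 10.0)               # diagonally dominant => SPD

pattern = A != 0
Ainv = np.linalg.inv(A)                        # brute-force reference only
selected = np.where(pattern, Ainv, 0.0)        # keep entries on A's pattern

# The diagonal (posterior marginal variances in INLA) is always selected.
print(bool(np.allclose(np.diag(selected), np.diag(Ainv))))
print(int(pattern.sum()), "of", n * n, "entries kept")
```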

Test Matrix Characteristics

18 arrowhead matrices from INLA framework covering diverse sizes and structures

Parameter           | Range       | Meaning
Matrix Size (n)     | 10K → 1M    | Number of rows/columns
Bandwidth (b)       | 100 → 15K   | Width of banded region
Arrow Thickness (t) | 10 → 200    | Dense columns at end
Density             | 0.02% → 4%  | Fraction of nonzeros

Matrix Groups

Group 1-2

10K-100K

Group 3

500K

GPU Tests

50K-1M

Arrowhead Structure

Bandwidth (b) Arrow (t)

Why this matters

Sparse enough to skip zeros, dense enough to use BLAS

Test Hardware

S1

Intel Xeon Gold 6230R (52 cores, 2.1 GHz)

S2

AMD EPYC 7713 (128 cores, 1.5 GHz)

GPU

NVIDIA A100 (80GB HBM2)

Performance: Cholesky Factorization

Comparison vs state-of-the-art sparse solvers across 18 test matrices (52 cores)

Group 1 Results
Group 3 Results

vs PARDISO

0.12× – 11.08×

vs SymPACK

0.72× – 9.34×

vs CHOLMOD

0.50× – 8.41×

vs MUMPS

0.43× – 5.07×

<1× only for narrow BW (≤1000), thin arrowhead

Libraries: CHOLMOD [Chen et al., ACM TOMS 2008], SymPACK [Jacquelin et al., SC 2016], MUMPS [Amestoy et al., 2001], PARDISO [Schenk & GΓ€rtner, 2004]

Non-Arrowhead Cases: General Sparse Matrices

SuiteSparse matrices: the sTiles tiling framework generalizes beyond arrowhead structure

msc23052

n = 23,052  |  nnz = 588,933

Threads | PARDISO (s) | sTiles (s)
1       | 0.095       | 0.658
8       | 0.031       | 0.112
16      | 0.028       | 0.078
32      | 0.035       | 0.076
52      | 0.042       | 0.113
ship_001

n = 34,920  |  nnz = 3,896,496

Threads | PARDISO (s) | sTiles (s)
1       | 0.393       | 0.591
8       | 0.080       | 0.104
16      | 0.077       | 0.077
32      | 0.064       | 0.089
50      | 0.079       | 0.088
nd6k

n = 18,000  |  nnz = 6,897,316

Threads | PARDISO (s) | sTiles (s)
1       | 5.335       | 3.479
8       | 0.849       | 0.540
16      | 0.550       | 0.351
32      | 0.483       | 0.409
52      | 0.552       | 0.456

Performance: Tree Reduction

Parallel GEMM accumulation: speedup from tree reduction vs sequential

Tree Reduction Speedup

The Problem

Left-looking Cholesky: many GEMMs update same target tile sequentially

Tree Reduction Solution

Parallel GEMMs to temporary tiles → Tree-based GEADD accumulation → Final update

Speedup up to

5× at 16-32 cores

50K

GEMMs benefit most

log₂(n)

Reduction depth

Performance: Core Scaling

Strong scaling: Matrix 9 (n=100K, BW=3000, thick=10) on Intel Xeon 6230R, 1 to 52 cores

Core Scaling

sTiles parallel efficiency (this matrix)

1 → 8 cores: 84%
1 → 16 cores: 65%
1 → 32 cores: 41%
1 → 52 cores: 30%

Efficiency drops beyond 16 cores

The arrowhead structure has a long critical path: available parallelism is bounded regardless of core count

PLASMA included as reference

Illustrates the cost of ignoring sparsity; not a practical competitor for this problem class

Selected Inversion

Selected Inversion Illustration

More details on this in my talk at:

Friday, March 6 : CP12 : 11:15–11:35 AM : Room T9 SR 055

GPU-Accelerated Parallel Selected Inversion for Structured Matrices

Esmail Abdul Fattah, Hatem Ltaief, Håvard Rue, David E. Keyes (KAUST)

Selected Inversion: Library Landscape

Goal: compute (A⁻¹)ᵢⱼ for all (i,j) in the sparsity pattern of A, not just a few entries

Non-zeros of A

Sparse matrix A

(before fill-in)

→

compute (A⁻¹)ᵢⱼ
at every (i,j)
where Aᵢⱼ ≠ 0

Selected entries of A⁻¹

Selected inverse entries

(same sparsity pattern)

MUMPS

Solves A·x = eᵢ per entry via elimination tree pruning. Efficient for few arbitrary entries, not for computing the full sparsity pattern.

Amestoy et al., SIAM J. Matrix Anal. Appl., 2001

PSelInv (PEXSI)

Supernodal left-looking selected inversion. Slower than PARDISO on large structured problems, with out-of-memory failures. PEXSI is not actively maintained.

Jacquelin et al., ACM TOMS, 2017  |  Verbosio, PhD thesis, 2019

Serinv

Specialized for block-tridiagonal + arrowhead matrices only. sTiles targets general sparse SPD matrices beyond this fixed structure.

Maillou et al., arXiv:2503.17528, 2025

Panua-PARDISO ✓ chosen benchmark

State-of-the-art supernodal LLᵀ solver, strongest competitor for full-pattern selected inversion.

Schenk et al., Future Gener. Comput. Syst., 2001  |  Schenk & GΓ€rtner, Parallel Comput., 2002

Performance: Selected Inversion

sTiles vs PARDISO for selected inversion on 500K matrices (computing (A⁻¹)ᵢⱼ on the sparsity pattern)

Selected Inversion Comparison

Speedup vs PARDISO

0.9× – 9.2×

Matrix 15 (n=500K, b=3000)

High Bandwidth Matrices

IDs 14, 15, 17, 18: 2.6× to 9.2× faster than PARDISO

Low Bandwidth Matrices

IDs 13, 16: Similar performance (sparse overhead dominates)

When sTiles Wins

High-bandwidth matrices: dense tiles enable BLAS efficiency; the advantage grows with bandwidth

Reference: PARDISO Selected Inversion [Schenk et al., SIAM SISC 2007]

Selected Inversion: Density Effect

How inverse time scales with matrix density (100K matrices | Set 1: BW=1500, Set 2: BW=3000)

Sparsity vs Inverse Time

sTiles (lime)

Flat scaling - time stays constant as density increases

PARDISO (black)

Growing cost - time increases with density

Why sTiles stays flat:

Fixed tile size → predictable BLAS workload

0.15× – 12×

Speedup range (Set 1)

0.09× – 7.4×

Speedup range (Set 2)

GPU Acceleration

NVIDIA A100 (80GB) vs 64 AMD EPYC 7713 cores | GPU tile size: 600, CPU tile size: 120

Cholesky Factorization

Matrix: n=50K, b=15K

CPU (64 cores)

3.68s

4.7×

speedup

GPU (16 streams)

0.79s

Matrix: n=1M, b=3K

CPU (32 cores)

15.95s

2.8×

speedup

GPU (32 streams)

5.75s

Selected Inversion

Matrix: n=50K, b=15K

CPU (64 cores)

20.38s

5.72×

speedup

GPU (32 streams)

3.56s

Including data transfer overhead

Tile Size Trade-off

GPU (600): Maximize kernel throughput

CPU (120): Maximize task parallelism

Hardware: NVIDIA A100-SXM4 (80GB HBM2) | AMD EPYC 7713 (64 cores, 1.99 GHz)

Dense Matrix: No Significant Loss

One matrix (n=10,200, fully dense): static vs. dynamic scheduling, not a general claim

PLASMA vs sTiles

Observation (this matrix)

sTiles and PLASMA track closely across all core counts

PLASMA: Parallel Linear Algebra Software for Multicore Architectures [Agullo et al., 2009]

Summary

Dense Tiling for Structured Sparsity

Best of dense (parallelism) + sparse (efficiency)

Static Scheduling + Tree Reduction

Maximize locality, expose parallelism in thin DAGs

Hybrid CPU-GPU Implementation

Adaptive tile sizing for each architecture

Real-World Applications

INLA for Bayesian inference: COVID mortality, climate modeling, population forecasting

Core Message: Treating structured sparsity as dense tiles unlocks massive parallelism and order-of-magnitude speedups for large-scale Bayesian inference.

Thank You!

Questions & Discussion

Esmail Abdul Fattah

King Abdullah University of Science and Technology (KAUST)
