sparse smart structured scalable

A High-Performance Framework for Sparse Matrices

sTiles targets the full Cholesky → solve → selected inverse pipeline on a single shared-memory node, using tile-based parallelism and GPU acceleration. Today's release handles symmetric positive-definite matrices across the entire spectrum, from very sparse to fully dense, with one unified solver. Distributed-memory support is on the roadmap.

[Figure: density patterns handled by the solver, from Sparse through Semi-Sparse and Semi-Dense to Dense]
In Production

Deployed in R-INLA

sTiles powers the sparse matrix engine of R-INLA, a widely used package for Bayesian inference with applications in spatial statistics, epidemiology, and ecology. The framework is designed for any application requiring high-performance sparse Cholesky factorization and selected inversion.

Research

Publications

Peer-reviewed research papers describing the sTiles framework and its applications.


sTiles: An Accelerated Computational Framework for Sparse Factorizations of Structured Matrices

Esmail Abdul Fattah, Hatem Ltaief, Håvard Rue, David Keyes

Core paper focused on the sTiles solver architecture and tile-based algorithms.

Read on IEEE Xplore →
ISC 2025

GPU-Accelerated Parallel Selected Inversion for Structured Matrices Using sTiles

Esmail Abdul Fattah, Hatem Ltaief, Håvard Rue, David Keyes

Focus on GPU acceleration for selected inversion computations.

Read on arXiv →
Under Review

Elevating INLA: Next-Level Speed with sTiles

Application-focused paper demonstrating INLA acceleration with sTiles.

In Progress
Capabilities

Key Features

Designed for high-performance sparse matrix computations with modern hardware.

Tile-Based Factorization

Configurable tile sizes for cache-friendly Cholesky factorizations. Smart tiles adapt storage format based on fill-in density.

Intelligent Ordering

Multiple ordering algorithms including AMD, RCM, and SCOTCH nested dissection, with auto, parallel, and smart ordering strategies.

GPU Acceleration

Optional CUDA + cuSOLVER integration for GPU-accelerated dense kernels. Mix CPU tiling with accelerator compute.

Selected Inversion

Computes only the entries of the inverse that match the sparsity pattern, which avoids forming the full, generally dense, inverse.

Multi-RHS Solvers

Solve multiple right-hand sides efficiently by reusing factorizations. Triangular solves: L, L^T, LL^T.


Diagnostics

Built-in timing, log-determinant computation, memory tracking, and export capabilities for benchmarking.

Learn

Examples

The examples are ordered from basic to advanced, and each one includes the full code.

1

Minimal Cholesky Factorization

Factorize a sparse SPD matrix and compute the log-determinant. Shows the core workflow.

Beginner
example_minimal.cpp C++
// Minimal sTiles example: Cholesky + log-determinant
#include "stiles.h"
#include <vector>
#include <cstdio>

int main() {
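    // 4x4 tridiagonal SPD matrix in COO format, one triangle only
    // (entries listed with row >= col, diagonal included)
    int N = 4, NNZ = 7;
    std::vector<int> rows = {0, 1, 1, 2, 2, 3, 3};
    std::vector<int> cols = {0, 0, 1, 1, 2, 2, 3};
    std::vector<double> vals = {10.0, -1.0, 10.0, -1.0, 10.0, -1.0, 10.0};

    // One group with one call, 4 cores, variant 0 (standard Cholesky),
    // selected inversion disabled
    int calls[] = {1}, cores[] = {4}, variant[] = {0};
    bool need_inv[] = {false};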

    void* stile = nullptr;
    sTiles_create(&stile, 1, calls, cores, variant, need_inv);
    sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows.data(), cols.data());
    sTiles_init_group(0, &stile);
    sTiles_assign_values(0, 0, &stile, vals.data());

    sTiles_bind(0, 0, &stile);
    sTiles_chol(0, 0, &stile);
    double logdet = sTiles_get_logdet(0, 0, &stile);
    sTiles_unbind(0, 0, &stile);

    printf("Log-determinant: %.10f\n", logdet);
    sTiles_quit();
    return 0;
}
2

Solving Linear Systems (Ax = b)

Solve one or more right-hand sides using the Cholesky factorization.

Intermediate
example_solve.cpp C++
// sTiles example: Solving Ax = b using Cholesky factorization
#include "stiles.h"
#include <vector>
#include <cstdio>

int main() {
    // 4x4 tridiagonal SPD matrix (same as before)
    int N = 4, NNZ = 7;
    std::vector<int> rows = {0, 1, 1, 2, 2, 3, 3};
    std::vector<int> cols = {0, 0, 1, 1, 2, 2, 3};
    std::vector<double> vals = {10.0, -1.0, 10.0, -1.0, 10.0, -1.0, 10.0};

    // Right-hand side vectors (2 RHS, column-major)
    int nrhs = 2;
    std::vector<double> b = {
        9.0, 8.0, 8.0, 9.0,   // First RHS
        1.0, 1.0, 1.0, 1.0    // Second RHS
    };

    // Setup
    int calls[] = {1}, cores[] = {4}, variant[] = {0};
    bool need_inv[] = {false};

    void* stile = nullptr;
    sTiles_create(&stile, 1, calls, cores, variant, need_inv);
    sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows.data(), cols.data());
    sTiles_init_group(0, &stile);
    sTiles_assign_values(0, 0, &stile, vals.data());

    // Factorize and solve
    sTiles_bind(0, 0, &stile);
    sTiles_chol(0, 0, &stile);
    sTiles_solve_LLT(0, 0, &stile, b.data(), nrhs);  // b is overwritten with x
    sTiles_unbind(0, 0, &stile);

    // Print solutions
    printf("Solution 1: [%.4f, %.4f, %.4f, %.4f]\n", b[0], b[1], b[2], b[3]);
    printf("Solution 2: [%.4f, %.4f, %.4f, %.4f]\n", b[4], b[5], b[6], b[7]);

    sTiles_quit();
    return 0;
}
3

Selected Inversion (Marginal Variances)

Compute the diagonal of A⁻¹, and the off-diagonal entries within the sparsity pattern, without forming the full inverse.

Intermediate
example_selinv.cpp C++
// sTiles example: Selected inversion for marginal variances
#include "stiles.h"
#include <vector>
#include <cstdio>

int main() {
    // 4x4 tridiagonal SPD matrix
    int N = 4, NNZ = 7;
    std::vector<int> rows = {0, 1, 1, 2, 2, 3, 3};
    std::vector<int> cols = {0, 0, 1, 1, 2, 2, 3};
    std::vector<double> vals = {10.0, -1.0, 10.0, -1.0, 10.0, -1.0, 10.0};

    // Enable selected inversion for this group
    int calls[] = {1}, cores[] = {4}, variant[] = {0};
    bool need_inv[] = {true};  // ← Enable selinv!

    void* stile = nullptr;
    sTiles_create(&stile, 1, calls, cores, variant, need_inv);
    sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows.data(), cols.data());
    sTiles_init_group(0, &stile);
    sTiles_assign_values(0, 0, &stile, vals.data());

    // Factorize then compute selected inverse
    sTiles_bind(0, 0, &stile);
    sTiles_chol(0, 0, &stile);
    sTiles_selinv(0, 0, &stile);  // Compute inverse elements
    sTiles_unbind(0, 0, &stile);

    // Query diagonal elements (marginal variances)
    printf("Marginal variances (diagonal of A^-1):\n");
    for (int i = 0; i < N; i++) {
        double var_i = sTiles_get_selinv_elm(0, 0, i, i, &stile);
        printf("  Var[%d] = %.6f\n", i, var_i);
    }

    // Query off-diagonal elements (covariances within sparsity pattern)
    printf("\nCovariances (off-diagonal of A^-1):\n");
    printf("  Cov[0,1] = %.6f\n", sTiles_get_selinv_elm(0, 0, 1, 0, &stile));
    printf("  Cov[1,2] = %.6f\n", sTiles_get_selinv_elm(0, 0, 2, 1, &stile));
    printf("  Cov[2,3] = %.6f\n", sTiles_get_selinv_elm(0, 0, 3, 2, &stile));

    sTiles_quit();
    return 0;
}
4

Multi-Call Parallel Processing

Process multiple matrices with the same sparsity pattern in parallel using OpenMP.

Advanced
example_multicall.cpp C++
// sTiles example: Multiple matrices with same sparsity, processed in parallel
#include "stiles.h"
#include <vector>
#include <cstdio>
#include <omp.h>

int main() {
    // Shared sparsity pattern (4x4 tridiagonal)
    int N = 4, NNZ = 7;
    int num_calls = 4;  // 4 matrices to process

    // Each call needs its own copy of row/col indices
    std::vector<std::vector<int>> all_rows(num_calls), all_cols(num_calls);
    std::vector<std::vector<double>> all_vals(num_calls);

    for (int c = 0; c < num_calls; c++) {
        all_rows[c] = {0, 1, 1, 2, 2, 3, 3};
        all_cols[c] = {0, 0, 1, 1, 2, 2, 3};
        // Different diagonal values for each call
        double diag = 10.0 + c * 2.0;
        all_vals[c] = {diag, -1.0, diag, -1.0, diag, -1.0, diag};
    }

    // Setup: 1 group with 4 calls, 2 cores per call
    int calls[] = {num_calls};
    int cores[] = {2};  // 2 cores per call = 8 total threads
    int variant[] = {0};
    bool need_inv[] = {true};

    void* stile = nullptr;
    sTiles_create(&stile, 1, calls, cores, variant, need_inv);

    // Assign graph for each call (same pattern, different arrays)
    for (int c = 0; c < num_calls; c++) {
        sTiles_assign_graph_one_call(0, c, &stile, N, NNZ,
            all_rows[c].data(), all_cols[c].data());
    }
    sTiles_init_group(0, &stile);

    // Assign values for each call
    for (int c = 0; c < num_calls; c++) {
        sTiles_assign_values(0, c, &stile, all_vals[c].data());
    }

    // Process all calls in parallel using OpenMP
    #pragma omp parallel for num_threads(num_calls)
    for (int c = 0; c < num_calls; c++) {
        sTiles_bind(0, c, &stile);
        sTiles_chol(0, c, &stile);
        sTiles_selinv(0, c, &stile);
        sTiles_unbind(0, c, &stile);
    }

    // Collect results
    printf("Results from %d parallel factorizations:\n", num_calls);
    for (int c = 0; c < num_calls; c++) {
        double logdet = sTiles_get_logdet(0, c, &stile);
        double var0 = sTiles_get_selinv_elm(0, c, 0, 0, &stile);
        printf("  Call %d: logdet=%.4f, Var[0]=%.6f\n", c, logdet, var0);
    }

    sTiles_quit();
    return 0;
}
5

Iterative Value Updates (Reuse Pattern)

Update values and re-factorize without re-initialization, as needed in iterative algorithms.

Advanced
example_reuse.cpp C++
// sTiles example: Update values and re-factorize without re-initialization
#include "stiles.h"
#include <vector>
#include <cstdio>

int main() {
    // 4x4 tridiagonal SPD matrix
    int N = 4, NNZ = 7;
    std::vector<int> rows = {0, 1, 1, 2, 2, 3, 3};
    std::vector<int> cols = {0, 0, 1, 1, 2, 2, 3};
    std::vector<double> vals = {10.0, -1.0, 10.0, -1.0, 10.0, -1.0, 10.0};

    int calls[] = {1}, cores[] = {4}, variant[] = {0};
    bool need_inv[] = {true};

    void* stile = nullptr;
    sTiles_create(&stile, 1, calls, cores, variant, need_inv);
    sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows.data(), cols.data());
    sTiles_init_group(0, &stile);  // Called once!

    // Simulate iterative algorithm with 5 iterations
    for (int iter = 0; iter < 5; iter++) {
        // Update diagonal values (simulate parameter changes)
        double diag = 10.0 + iter * 1.5;
        vals[0] = vals[2] = vals[4] = vals[6] = diag;

        // Assign new values (no re-init needed!)
        sTiles_assign_values(0, 0, &stile, vals.data());

        // Re-factorize with updated values
        sTiles_bind(0, 0, &stile);
        sTiles_chol(0, 0, &stile);
        sTiles_selinv(0, 0, &stile);
        sTiles_unbind(0, 0, &stile);

        // Query results for this iteration
        double logdet = sTiles_get_logdet(0, 0, &stile);
        double var0 = sTiles_get_selinv_elm(0, 0, 0, 0, &stile);

        printf("Iter %d: diag=%.1f, logdet=%.4f, Var[0]=%.6f\n",
               iter, diag, logdet, var0);
    }

    sTiles_quit();
    return 0;
}
Talks

Presentations

Conference talks and workshop presentations on sTiles.

From Single Core to Many Cores to GPUs

INLA Workshop, University of Glasgow, Scotland (2025)

Tailored for statisticians working with INLA methodology.

sTiles: Accelerated Sparse Factorizations

ISC High Performance 2025, Hamburg, Germany

Technical presentation for HPC audience.

What's Slowing Down Your Statistical Model, and How Tiling Fixes It?

CIRAD, Montpellier, France (Apr. 2026)

Seminar on computational bottlenecks in spatial statistical models and how tiling-based algorithms (sTiles) accelerate them.

Dense Tiling Meets Structured Sparsity: Scalable Algorithms with sTiles

SIAM PP26 – Minisymposium (Mar. 4, 2026)

Minisymposium talk on scalable tiling algorithms for structured sparse matrices.

GPU-Accelerated Parallel Selected Inversion for Structured Matrices

SIAM PP26 – Contributed Talk (2026)

GPU-accelerated selected inversion using sTiles for structured sparse matrices.

Documentation

API Reference

C API for integrating sTiles into your applications.

Object Creation & Initialization

Groups and Calls: the two-level parallelism model

Required reading before calling sTiles_create and sTiles_assign_graph_one_call

Use sTiles_create to declare how many matrices you need to factorize and how many threads to use, then sTiles_assign_graph_one_call to hand off each matrix's sparsity pattern, and sTiles_init_group to run the symbolic phase. The two parameters that govern this setup are groups and calls per group: a group owns a sparsity pattern (symbolic factorization runs once per group), and each group holds one or more calls, independent matrices with different values that are factorized in parallel.

sTiles_create( num_groups=2, calls_per_group={2,3}, cores_per_group={8,6} )

Group 0: sparsity pattern A (symbolic phase runs once)
  Call 0: vals₀ · 8 cores
  Call 1: vals₁ · 8 cores
Group 1: sparsity pattern B (symbolic phase runs once)
  Call 0: vals₀ · 6 cores
  Call 1: vals₁ · 6 cores
  Call 2: vals₂ · 6 cores

Group: one sparsity pattern; symbolic phase runs once
Call: one numerical matrix; chol / selinv / solve run per call
Cores: threads per call for tile-parallel computation

sTiles_create Create sTiles solver object
int sTiles_create(void** stile, int num_groups, const int* calls_per_group, const int* cores_per_group, const int* factor_type, const bool* get_inverse);

Parameters

stile Output pointer to the created sTiles object. Pass the address of a void* variable.
num_groups Number of groups. Each group owns one sparsity pattern; symbolic factorization runs once per group (ordering, fill analysis, tile layout) and is reused by every call in that group. Use multiple groups when you have matrices with structurally different sparsity patterns that you want to factorize together. Most single-matrix problems use num_groups = 1.
calls_per_group Array of length num_groups. calls_per_group[g] is the number of independent matrices in group g, all sharing group g's sparsity pattern but with different numerical values. Each call gets its own factor storage and is factorized independently via sTiles_chol / sTiles_selinv. Calls within a group can run in parallel (launch one OpenMP thread per call). Use calls_per_group[g] = 1 for a single matrix. Use calls_per_group[g] = N_θ to factorize N_θ hyperparameter samples simultaneously (INLA pattern).
cores_per_group Array of length num_groups. cores_per_group[g] is the number of CPU threads allocated to each call in group g for tile-level parallelism inside sTiles_chol / sTiles_selinv. Total threads consumed at peak ≈ calls_per_group[g] × cores_per_group[g] per group. For a single-call setup this is simply the number of cores for the factorization.
factor_type Array of length num_groups. Factorization variant per group (0 = standard Cholesky).
get_inverse Array of length num_groups. Set get_inverse[g] = true to enable selected inversion (sTiles_selinv) for group g. Memory for the inverse is pre-allocated during sTiles_init_group; calling sTiles_selinv on a group where this is false will fail.

Returns: 0 on success, negative error code on failure.

sTiles_assign_graph_one_call Assign sparsity pattern for one call
int sTiles_assign_graph_one_call(int group_id, int call_id, void** stile, int N, int NNZ, int* rows, int* cols);

Parameters

group_id, call_id Target group and call indices (0-based)
N Matrix dimension (N x N)
NNZ Number of non-zeros in one triangle of the symmetric matrix, including the diagonal (the examples on this page pass entries with row ≥ col)
rows, cols COO format row/column indices (0-based). Important: sTiles stores these pointers directly and does not copy the arrays. The arrays must remain valid and unmodified until sTiles_quit() or until the call is reassigned.

Returns: 0 on success.

sTiles_init_group Initialize group (symbolic factorization)
int sTiles_init_group(int group_id, void** stile);

Performs symbolic factorization, ordering, and memory allocation. Call after assigning graphs.

sTiles_assign_values Assign numerical values to matrix
int sTiles_assign_values(int group_id, int call_id, void** stile, double* values);

Values can be updated between solves without re-initializing.

Core Configuration

sTiles_set_tile_size Set tile dimension for blocked operations
void sTiles_set_tile_size(int size);

Parameters

size Tile dimension in elements. Typical values: 32, 40, 64, 128.
sTiles_set_tile_type_mode Choose tile storage format
void sTiles_set_tile_type_mode(int value);

Parameters

value 0 = dense tiles (standard column-major storage), 1 = semisparse tiles (LAPACK banded storage for sparse tiles, reducing memory for low fill-in tiles), 2-3 = additional experimental modes. Mode 1 is recommended for large sparse matrices with significant sparsity within tiles.
sTiles_set_control_param / sTiles_get_control_param Direct access to internal control parameters
void sTiles_set_control_param(int index, int value);
int sTiles_get_control_param(int index);

Parameters

index Parameter index (0-19):
• 0: Semisparse pruning mode (prefer sTiles_set_correction_mode)
• 1: Tile size (prefer sTiles_set_tile_size)
• 2: Ordering strategy mode (prefer sTiles_set_ordering_mode)
• 3: Tile type mode (prefer sTiles_set_tile_type_mode)
• 4: Tile ordering mode (prefer sTiles_set_tile_ordering_mode)
• 5: Tile ordering size (prefer sTiles_set_tile_ordering_size)
• 6: Force nested dissection mode (prefer sTiles_force_ND)
• 7: Inverse storage mode (prefer sTiles_set_inverse_storage_mode)
• 8-19: Reserved for future use
value Value to set (for set function)

Returns (get): Parameter value, or -1 if index out of range.

Note: Prefer the dedicated setter functions (sTiles_set_tile_size, etc.) for type safety and validation. Reserve sTiles_set_control_param and sTiles_get_control_param for advanced scenarios or for querying the current state.

Execution

sTiles_bind / sTiles_unbind Activate/deactivate persistent thread teams
int sTiles_bind(int group_id, int call_id, void** stile);
int sTiles_unbind(int group_id, int call_id, void** stile);

sTiles_bind: Activates a persistent thread team for the specified call. This does not create new threads but activates pre-allocated worker threads for parallel tile operations. Must be called before sTiles_chol, sTiles_selinv, or solve functions.

sTiles_unbind: Deactivates the thread team, releasing workers back to the pool. Always call after finishing computations on a call. Forgetting to unbind may cause resource leaks or deadlocks.

Returns: 0 on success.

sTiles_chol Perform Cholesky factorization (A = LL^T)
int sTiles_chol(int group_id, int call_id, void** stile);

Returns 0 on success, non-zero if the matrix is not positive definite.

sTiles_selinv Compute selected inverse elements
int sTiles_selinv(int group_id, int call_id, void** stile);

Computes A^{-1} elements matching the sparsity pattern. Call sTiles_chol first.

sTiles_solve_LLT Solve Ax = b using Cholesky factorization
int sTiles_solve_LLT(int group_id, int call_id, void** stile, double* b, int nrhs);

In-place solve. b is overwritten with solution x. Column-major for multiple RHS.

Query & Cleanup

sTiles_get_logdet Get log-determinant of factored matrix
double sTiles_get_logdet(int group_id, int call_id, void** stile);

Returns log|A| = 2 * sum(log(L_ii)). Computed efficiently during factorization.

sTiles_get_selinv_elm Retrieve inverse element A^{-1}[i][j]
double sTiles_get_selinv_elm(int group_id, int call_id, int row, int col, void** stile);

Returns A^{-1}[row][col] if within the sparsity pattern (0-based indices).

sTiles_quit Clean shutdown of sTiles
void sTiles_quit(void);

Releases all allocated memory across all internal memory managers: MemoryManager, TileMemoryManager, TreeMemoryManager, AlgorithmsMemoryManager, OrderingMemoryManager, CpuSmartTileMemoryManager, TileIndexerMemoryManager, and GpuMemoryManager (if GPU enabled). Also destroys all thread contexts and CUDA handles. Call once at program termination.

Basic Usage Example C++
// Setup arrays
int num_groups = 1;
int calls_per_group[] = {1};
int cores_per_group[] = {4};      // 4 cores for parallel factorization
int factor_type[] = {0};          // 0 = standard Cholesky
bool get_inverse[] = {true};     // Compute selected inverse

// Create solver object
void* stile = nullptr;
sTiles_create(&stile, num_groups, calls_per_group,
              cores_per_group, factor_type, get_inverse);

// Assign sparsity pattern (sTiles takes ownership of rows/cols pointers!)
sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows, cols);

// Initialize: symbolic factorization, ordering, memory allocation
sTiles_init_group(0, &stile);

// Assign numerical values (can be updated without re-init)
sTiles_assign_values(0, 0, &stile, values);

// Compute: bind -> chol -> selinv -> unbind
sTiles_bind(0, 0, &stile);
sTiles_chol(0, 0, &stile);
sTiles_selinv(0, 0, &stile);
sTiles_unbind(0, 0, &stile);

// Query results
double logdet = sTiles_get_logdet(0, 0, &stile);
double var_i = sTiles_get_selinv_elm(0, 0, i, i, &stile);  // Diagonal = variance

// Cleanup all memory
sTiles_quit();
Setup

Getting Started

sTiles is distributed as pre-built binaries. Simply include the header and link against the library.

Quick Start

1

Download

Contact us to obtain the sTiles package for your platform

2

Include Header

#include "stiles.h"

3

Link Library

Link against libstiles.so (Linux) or libstiles.dylib (macOS)

4

Run

Call sTiles API functions from your C/C++ application

Pre-built Binaries

Platform         Description
macOS ARM64      Apple Silicon (M1/M2/M3/M4) with Accelerate framework
Linux x86_64     Generic build; works on all x86_64 CPUs
Linux AVX2       Optimized for modern desktops/laptops (Intel 2013+, AMD 2017+)
Linux AVX-512    HPC clusters (Intel Skylake-X 2017+, AMD Zen4 2022+)

Compilation Example Shell
# Linux with OpenBLAS
g++ -O3 -fopenmp myapp.cpp \
  -I/path/to/stiles/include \
  -L/path/to/stiles/lib -Wl,-rpath,/path/to/stiles/lib -lstiles \
  -lopenblas -lpthread -lm -o myapp

# Linux with Intel MKL
g++ -O3 -fopenmp myapp.cpp \
  -I/path/to/stiles/include \
  -L/path/to/stiles/lib -Wl,-rpath,/path/to/stiles/lib -lstiles \
  -lmkl_gf_lp64 -lmkl_core -lmkl_sequential -lpthread -lm -o myapp

# macOS (Accelerate framework built-in, libomp via Homebrew)
clang++ -O3 -Xpreprocessor -fopenmp myapp.cpp \
  -I/path/to/stiles/include \
  -L/path/to/stiles/lib -Wl,-rpath,/path/to/stiles/lib -lstiles \
  -framework Accelerate -lomp -o myapp
Reference

How to Cite

If you use sTiles in your research, please cite the appropriate paper(s) below.


General Use of sTiles

@inproceedings{fattah2025stiles,
  title     = {{sTiles}: An Accelerated Computational
               Framework for Sparse Factorizations
               of Structured Matrices},
  author    = {Fattah, Esmail Abdul and Ltaief, Hatem
               and Rue, H{\aa}vard and Keyes, David},
  booktitle = {ISC High Performance 2025 Research Paper
               Proceedings (40th International Conference)},
  pages     = {1--14},
  year      = {2025},
  organization = {Prometeus GmbH}
}

Selected Inverse Functionality

@article{fattah2025gpu,
  title   = {{GPU}-Accelerated Parallel Selected
             Inversion for Structured Matrices
             Using {sTiles}},
  author  = {Fattah, Esmail Abdul and Ltaief, Hatem
             and Rue, H{\aa}vard and Keyes, David},
  journal = {arXiv preprint arXiv:2504.19171},
  year    = {2025}
}
Connect

Get In Touch

Esmail Abdul Fattah

Developer & Maintainer

King Abdullah University of Science and Technology (KAUST)
Thuwal, Saudi Arabia

✉ Email List

Join the sTiles mailing list to receive updates about releases, new features, and research developments.

Subscribe →

💻 Contribute

Interested in contributing to sTiles? We're looking for help with Python wrappers and additional language bindings.

Get Involved →

🧪 Test Matrices

Have matrices that perform poorly with current solvers? Share them with us to help improve sTiles robustness.

Send Matrices →

Acknowledgments

sTiles builds upon excellent open-source libraries and research. We gratefully acknowledge the following projects and their contributors.

  • SCOTCH: Graph partitioning and sparse matrix ordering library. Used for nested dissection ordering.
    labri.fr/perso/pelegrin/scotch
    Pellegrini, F. & Roman, J. (1996). "SCOTCH: A Software Package for Static Mapping by Dual Recursive Bipartitioning of Process and Architecture Graphs."
  • SuiteSparse: Suite of sparse matrix algorithms including AMD, COLAMD, and CHOLMOD ordering methods.
    people.engr.tamu.edu/davis/suitesparse.html
    Davis, T. A. (2006). "Direct Methods for Sparse Linear Systems." SIAM.
  • METIS: Graph partitioning library for fill-reducing orderings and nested dissection.
    github.com/KarypisLab/METIS
    Karypis, G. & Kumar, V. (1998). "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs." SIAM J. Sci. Comput.
  • BLAS/LAPACK: Foundational linear algebra libraries for dense matrix operations within tiles.
    netlib.org/lapack
    Anderson, E. et al. (1999). "LAPACK Users' Guide." SIAM.
  • OpenMP: Shared-memory parallel programming API for multi-threaded execution.
    openmp.org
    OpenMP Architecture Review Board. "OpenMP Application Programming Interface."
  • RCM Algorithm: Reverse Cuthill-McKee algorithm for bandwidth reduction in sparse matrices.
    Cuthill, E. & McKee, J. (1969). "Reducing the bandwidth of sparse symmetric matrices." ACM '69.