sTiles targets the full Cholesky → solve → selected inverse pipeline on a single shared-memory node, using tile-based parallelism and GPU acceleration. Today's release handles symmetric positive-definite matrices across the entire spectrum, from very sparse to fully dense, with one unified solver. Distributed-memory support is on the roadmap.
Peer-reviewed research papers describing the sTiles framework and its applications.
Core paper focused on the sTiles solver architecture and tile-based algorithms.
Read on IEEE Xplore →Focus on GPU acceleration for selected inversion computations.
Read on arXiv →Application-focused paper demonstrating INLA acceleration with sTiles.
Designed for high-performance sparse matrix computations with modern hardware.
Configurable tile sizes for cache-friendly Cholesky factorizations. Smart tiles adapt storage format based on fill-in density.
Multiple ordering algorithms including AMD, RCM, and SCOTCH nested dissection, with auto, parallel, and smart ordering strategies.
Optional CUDA + cuSOLVER integration for GPU-accelerated dense kernels. Mix CPU tiling with accelerator compute.
Efficient computation of selected inverse elements matching the sparsity pattern. Much faster than full matrix inversion.
Solve multiple right-hand sides efficiently by reusing factorizations. Triangular solves: L, L^T, LL^T.
Built-in timing, log-determinant computation, memory tracking, and export capabilities for benchmarking.
Click any example to view the full code. Ordered from basic to advanced.
Factorize a sparse SPD matrix and compute log-determinant. Shows the core workflow.
// Minimal sTiles example: Cholesky + log-determinant
#include "stiles.h"
#include <vector>
#include <cstdio>
int main() {
int N = 4, NNZ = 7;
std::vector<int> rows = {0, 1, 1, 2, 2, 3, 3};
std::vector<int> cols = {0, 0, 1, 1, 2, 2, 3};
std::vector<double> vals = {10.0, -1.0, 10.0, -1.0, 10.0, -1.0, 10.0};
int calls[] = {1}, cores[] = {4}, variant[] = {0};
bool need_inv[] = {false};
void* stile = nullptr;
sTiles_create(&stile, 1, calls, cores, variant, need_inv);
sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows.data(), cols.data());
sTiles_init_group(0, &stile);
sTiles_assign_values(0, 0, &stile, vals.data());
sTiles_bind(0, 0, &stile);
sTiles_chol(0, 0, &stile);
double logdet = sTiles_get_logdet(0, 0, &stile);
sTiles_unbind(0, 0, &stile);
printf("Log-determinant: %.10f\n", logdet);
sTiles_quit();
return 0;
}
Solve one or more right-hand sides using the Cholesky factorization.
// sTiles example: Solving Ax = b using Cholesky factorization
#include "stiles.h"
#include <vector>
#include <cstdio>
int main() {
// 4x4 tridiagonal SPD matrix (same as before)
int N = 4, NNZ = 7;
std::vector<int> rows = {0, 1, 1, 2, 2, 3, 3};
std::vector<int> cols = {0, 0, 1, 1, 2, 2, 3};
std::vector<double> vals = {10.0, -1.0, 10.0, -1.0, 10.0, -1.0, 10.0};
// Right-hand side vectors (2 RHS, column-major)
int nrhs = 2;
std::vector<double> b = {
9.0, 8.0, 8.0, 9.0, // First RHS
1.0, 1.0, 1.0, 1.0 // Second RHS
};
// Setup
int calls[] = {1}, cores[] = {4}, variant[] = {0};
bool need_inv[] = {false};
void* stile = nullptr;
sTiles_create(&stile, 1, calls, cores, variant, need_inv);
sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows.data(), cols.data());
sTiles_init_group(0, &stile);
sTiles_assign_values(0, 0, &stile, vals.data());
// Factorize and solve
sTiles_bind(0, 0, &stile);
sTiles_chol(0, 0, &stile);
sTiles_solve_LLT(0, 0, &stile, b.data(), nrhs); // b is overwritten with x
sTiles_unbind(0, 0, &stile);
// Print solutions
printf("Solution 1: [%.4f, %.4f, %.4f, %.4f]\n", b[0], b[1], b[2], b[3]);
printf("Solution 2: [%.4f, %.4f, %.4f, %.4f]\n", b[4], b[5], b[6], b[7]);
sTiles_quit();
return 0;
}
Compute diagonal of A⁻¹ efficiently. Much faster than full matrix inversion.
// sTiles example: Selected inversion for marginal variances
#include "stiles.h"
#include <vector>
#include <cstdio>
int main() {
// 4x4 tridiagonal SPD matrix
int N = 4, NNZ = 7;
std::vector<int> rows = {0, 1, 1, 2, 2, 3, 3};
std::vector<int> cols = {0, 0, 1, 1, 2, 2, 3};
std::vector<double> vals = {10.0, -1.0, 10.0, -1.0, 10.0, -1.0, 10.0};
// Enable selected inversion for this group
int calls[] = {1}, cores[] = {4}, variant[] = {0};
bool need_inv[] = {true}; // ← Enable selinv!
void* stile = nullptr;
sTiles_create(&stile, 1, calls, cores, variant, need_inv);
sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows.data(), cols.data());
sTiles_init_group(0, &stile);
sTiles_assign_values(0, 0, &stile, vals.data());
// Factorize then compute selected inverse
sTiles_bind(0, 0, &stile);
sTiles_chol(0, 0, &stile);
sTiles_selinv(0, 0, &stile); // Compute inverse elements
sTiles_unbind(0, 0, &stile);
// Query diagonal elements (marginal variances)
printf("Marginal variances (diagonal of A^-1):\n");
for (int i = 0; i < N; i++) {
double var_i = sTiles_get_selinv_elm(0, 0, i, i, &stile);
printf(" Var[%d] = %.6f\n", i, var_i);
}
// Query off-diagonal elements (covariances within sparsity pattern)
printf("\nCovariances (off-diagonal of A^-1):\n");
printf(" Cov[0,1] = %.6f\n", sTiles_get_selinv_elm(0, 0, 1, 0, &stile));
printf(" Cov[1,2] = %.6f\n", sTiles_get_selinv_elm(0, 0, 2, 1, &stile));
printf(" Cov[2,3] = %.6f\n", sTiles_get_selinv_elm(0, 0, 3, 2, &stile));
sTiles_quit();
return 0;
}
Process multiple matrices with same sparsity pattern in parallel using OpenMP.
// sTiles example: Multiple matrices with same sparsity, processed in parallel
#include "stiles.h"
#include <vector>
#include <cstdio>
#include <omp.h>
int main() {
// Shared sparsity pattern (4x4 tridiagonal)
int N = 4, NNZ = 7;
int num_calls = 4; // 4 matrices to process
// Each call needs its own copy of row/col indices
std::vector<std::vector<int>> all_rows(num_calls), all_cols(num_calls);
std::vector<std::vector<double>> all_vals(num_calls);
for (int c = 0; c < num_calls; c++) {
all_rows[c] = {0, 1, 1, 2, 2, 3, 3};
all_cols[c] = {0, 0, 1, 1, 2, 2, 3};
// Different diagonal values for each call
double diag = 10.0 + c * 2.0;
all_vals[c] = {diag, -1.0, diag, -1.0, diag, -1.0, diag};
}
// Setup: 1 group with 4 calls, 2 cores per call
int calls[] = {num_calls};
int cores[] = {2}; // 2 cores per call = 8 total threads
int variant[] = {0};
bool need_inv[] = {true};
void* stile = nullptr;
sTiles_create(&stile, 1, calls, cores, variant, need_inv);
// Assign graph for each call (same pattern, different arrays)
for (int c = 0; c < num_calls; c++) {
sTiles_assign_graph_one_call(0, c, &stile, N, NNZ,
all_rows[c].data(), all_cols[c].data());
}
sTiles_init_group(0, &stile);
// Assign values for each call
for (int c = 0; c < num_calls; c++) {
sTiles_assign_values(0, c, &stile, all_vals[c].data());
}
// Process all calls in parallel using OpenMP
#pragma omp parallel for num_threads(num_calls)
for (int c = 0; c < num_calls; c++) {
sTiles_bind(0, c, &stile);
sTiles_chol(0, c, &stile);
sTiles_selinv(0, c, &stile);
sTiles_unbind(0, c, &stile);
}
// Collect results
printf("Results from %d parallel factorizations:\n", num_calls);
for (int c = 0; c < num_calls; c++) {
double logdet = sTiles_get_logdet(0, c, &stile);
double var0 = sTiles_get_selinv_elm(0, c, 0, 0, &stile);
printf(" Call %d: logdet=%.4f, Var[0]=%.6f\n", c, logdet, var0);
}
sTiles_quit();
return 0;
}
Update values and re-factorize without re-initialization. For iterative algorithms.
// sTiles example: Update values and re-factorize without re-initialization
#include "stiles.h"
#include <vector>
#include <cstdio>
int main() {
// 4x4 tridiagonal SPD matrix
int N = 4, NNZ = 7;
std::vector<int> rows = {0, 1, 1, 2, 2, 3, 3};
std::vector<int> cols = {0, 0, 1, 1, 2, 2, 3};
std::vector<double> vals = {10.0, -1.0, 10.0, -1.0, 10.0, -1.0, 10.0};
int calls[] = {1}, cores[] = {4}, variant[] = {0};
bool need_inv[] = {true};
void* stile = nullptr;
sTiles_create(&stile, 1, calls, cores, variant, need_inv);
sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows.data(), cols.data());
sTiles_init_group(0, &stile); // Called once!
// Simulate iterative algorithm with 5 iterations
for (int iter = 0; iter < 5; iter++) {
// Update diagonal values (simulate parameter changes)
double diag = 10.0 + iter * 1.5;
vals[0] = vals[2] = vals[4] = vals[6] = diag;
// Assign new values (no re-init needed!)
sTiles_assign_values(0, 0, &stile, vals.data());
// Re-factorize with updated values
sTiles_bind(0, 0, &stile);
sTiles_chol(0, 0, &stile);
sTiles_selinv(0, 0, &stile);
sTiles_unbind(0, 0, &stile);
// Query results for this iteration
double logdet = sTiles_get_logdet(0, 0, &stile);
double var0 = sTiles_get_selinv_elm(0, 0, 0, 0, &stile);
printf("Iter %d: diag=%.1f, logdet=%.4f, Var[0]=%.6f\n",
iter, diag, logdet, var0);
}
sTiles_quit();
return 0;
}
Conference talks and workshop presentations on sTiles.
INLA Workshop, University of Glasgow, Scotland (2025)
Tailored for statisticians working with INLA methodology.
ISC High Performance 2025, Hamburg, Germany
Technical presentation for HPC audience.
CIRAD, Montpellier, France (Apr. 2026)
Seminar on computational bottlenecks in spatial statistical models and how tiling-based algorithms (sTiles) accelerate them.
SIAM PP26 – Minisymposium (Mar. 4, 2026)
Minisymposium talk on scalable tiling algorithms for structured sparse matrices.
SIAM PP26 – Contributed Talk (2026)
GPU-accelerated selected inversion using sTiles for structured sparse matrices.
C API for integrating sTiles into your applications. Click to expand function details.
sTiles_create and sTiles_assign_graph_one_call
Use sTiles_create to declare how many matrices you need to factorize and how many threads to use, then
sTiles_assign_graph_one_call to hand off each matrix's sparsity pattern, and
sTiles_init_group to run the symbolic phase.
The two parameters that govern this setup are groups and calls per group:
a group owns a sparsity pattern (symbolic factorization runs once per group), and each group holds one or more calls,
independent matrices with different values that are factorized in parallel.
int sTiles_create(void** stile, int num_groups, const int* calls_per_group, const int* cores_per_group, const int* factor_type, const bool* get_inverse);
void* variable.
num_groups = 1.
num_groups.
calls_per_group[g] is the number of independent matrices in group g, all sharing group g's sparsity pattern but with different numerical values.
Each call gets its own factor storage and is factorized independently via sTiles_chol / sTiles_selinv.
Calls within a group run in parallel (launch one OpenMP thread per call).
Use calls_per_group[g] = 1 for a single matrix.
Use calls_per_group[g] = N_θ to factorize Nθ hyperparameter samples simultaneously (INLA pattern).
num_groups.
cores_per_group[g] is the number of CPU threads allocated to each call in group g for tile-level parallelism inside sTiles_chol / sTiles_selinv.
Total threads consumed at peak ≈ calls_per_group[g] × cores_per_group[g] per group.
For a single-call setup this is simply the number of cores for the factorization.
num_groups. Factorization variant per group (0 = standard Cholesky).
num_groups. Set get_inverse[g] = true to enable selected inversion (sTiles_selinv) for group g. Memory for the inverse is pre-allocated during sTiles_init_group; calling sTiles_selinv on a group where this is false will fail.
Returns: 0 on success, negative error code on failure.
int sTiles_assign_graph_one_call(int group_id, int call_id, void** stile, int N, int NNZ, int* rows, int* cols);
Returns: 0 on success.
int sTiles_init_group(int group_id, void** stile);
Performs symbolic factorization, ordering, and memory allocation. Call after assigning graphs.
int sTiles_assign_values(int group_id, int call_id, void** stile, double* values);
Values can be updated between solves without re-initializing.
void sTiles_set_tile_size(int size);
void sTiles_set_tile_type_mode(int value);
void sTiles_set_control_param(int index, int value);
int sTiles_get_control_param(int index);
Returns (get): Parameter value, or -1 if index out of range.
Note: Prefer the dedicated setter functions (sTiles_set_tile_size, etc.) for type safety and validation. Use these functions for advanced scenarios or querying current state.
int sTiles_bind(int group_id, int call_id, void** stile);
int sTiles_unbind(int group_id, int call_id, void** stile);
sTiles_bind: Activates a persistent thread team for the specified call. This does not create new threads but activates pre-allocated worker threads for parallel tile operations. Must be called before sTiles_chol, sTiles_selinv, or solve functions.
sTiles_unbind: Deactivates the thread team, releasing workers back to the pool. Always call after finishing computations on a call. Forgetting to unbind may cause resource leaks or deadlocks.
Returns: 0 on success.
int sTiles_chol(int group_id, int call_id, void** stile);
Returns 0 on success, non-zero if matrix is not positive definite.
int sTiles_selinv(int group_id, int call_id, void** stile);
Computes A^{-1} elements matching the sparsity pattern. Call sTiles_chol first.
int sTiles_solve_LLT(int group_id, int call_id, void** stile, double* b, int nrhs);
In-place solve. b is overwritten with solution x. Column-major for multiple RHS.
double sTiles_get_logdet(int group_id, int call_id, void** stile);
Returns log|A| = 2 * sum(log(L_ii)). Computed efficiently during factorization.
double sTiles_get_selinv_elm(int group_id, int call_id, int row, int col, void** stile);
Returns A^{-1}[row][col] if within the sparsity pattern (0-based indices).
void sTiles_quit(void);
Releases all allocated memory across all internal memory managers: MemoryManager, TileMemoryManager, TreeMemoryManager, AlgorithmsMemoryManager, OrderingMemoryManager, CpuSmartTileMemoryManager, TileIndexerMemoryManager, and GpuMemoryManager (if GPU enabled). Also destroys all thread contexts and CUDA handles. Call once at program termination.
// Setup arrays
int num_groups = 1;
int calls_per_group[] = {1};
int cores_per_group[] = {4}; // 4 cores for parallel factorization
int factor_type[] = {0}; // Standard sparse factorization
bool get_inverse[] = {true}; // Compute selected inverse
// Create solver object
void* stile = nullptr;
sTiles_create(&stile, num_groups, calls_per_group,
cores_per_group, factor_type, get_inverse);
// Assign sparsity pattern (sTiles takes ownership of rows/cols pointers!)
sTiles_assign_graph_one_call(0, 0, &stile, N, NNZ, rows, cols);
// Initialize: symbolic factorization, ordering, memory allocation
sTiles_init_group(0, &stile);
// Assign numerical values (can be updated without re-init)
sTiles_assign_values(0, 0, &stile, values);
// Compute: bind -> chol -> selinv -> unbind
sTiles_bind(0, 0, &stile);
sTiles_chol(0, 0, &stile);
sTiles_selinv(0, 0, &stile);
sTiles_unbind(0, 0, &stile);
// Query results
double logdet = sTiles_get_logdet(0, 0, &stile);
double var_i = sTiles_get_selinv_elm(0, 0, i, i, &stile); // Diagonal = variance
// Cleanup all memory
sTiles_quit();
sTiles is distributed as pre-built binaries. Simply include the header and link against the library.
Contact us to obtain the sTiles package for your platform
#include "stiles.h"
Link against libstiles.so (Linux) or libstiles.dylib (macOS)
Call sTiles API functions from your C/C++ application
| Platform | Description |
|---|---|
| macOS ARM64 | Apple Silicon (M1/M2/M3/M4) with Accelerate framework |
| Linux x86_64 | Generic build - works on all x86_64 CPUs |
| Linux AVX2 | Optimized for modern desktops/laptops (Intel 2013+, AMD 2017+) |
| Linux AVX-512 | HPC clusters (Intel Skylake-X 2017+, AMD Zen4 2022+) |
# Linux with OpenBLAS
g++ -O3 -fopenmp myapp.cpp \
-I/path/to/stiles/include \
-L/path/to/stiles/lib -Wl,-rpath,/path/to/stiles/lib -lstiles \
-lopenblas -lpthread -lm -o myapp
# Linux with Intel MKL
g++ -O3 -fopenmp myapp.cpp \
-I/path/to/stiles/include \
-L/path/to/stiles/lib -Wl,-rpath,/path/to/stiles/lib -lstiles \
-lmkl_gf_lp64 -lmkl_core -lmkl_sequential -lpthread -lm -o myapp
# macOS (Accelerate framework built-in, libomp via Homebrew)
clang++ -O3 -Xpreprocessor -fopenmp myapp.cpp \
-I/path/to/stiles/include \
-L/path/to/stiles/lib -Wl,-rpath,/path/to/stiles/lib -lstiles \
-framework Accelerate -lomp -o myapp
If you use sTiles in your research, please cite the appropriate paper(s) below.
@inproceedings{fattah2025stiles,
title = {{sTiles}: An Accelerated Computational
Framework for Sparse Factorizations
of Structured Matrices},
author = {Fattah, Esmail Abdul and Ltaief, Hatem
and Rue, H{\aa}vard and Keyes, David},
booktitle = {ISC High Performance 2025 Research Paper
Proceedings (40th International Conference)},
pages = {1--14},
year = {2025},
organization = {Prometeus GmbH}
}
@article{fattah2025gpu,
title = {{GPU}-Accelerated Parallel Selected
Inversion for Structured Matrices
Using {sTiles}},
author = {Fattah, Esmail Abdul and Ltaief, Hatem
and Rue, H{\aa}vard and Keyes, David},
journal = {arXiv preprint arXiv:2504.19171},
year = {2025}
}
Developer & Maintainer
King Abdullah University of Science and Technology (KAUST)
Thuwal, Saudi Arabia
Join the sTiles mailing list to receive updates about releases, new features, and research developments.
Subscribe →Interested in contributing to sTiles? We're looking for help with Python wrappers and additional language bindings.
Get Involved →Have matrices that perform poorly with current solvers? Share them with us to help improve sTiles robustness.
Send Matrices →sTiles builds upon excellent open-source libraries and research. We gratefully acknowledge the following projects and their contributors.