Skip to content

SIMD and parallel execution

Li targets CPU HPC: vector lanes on one core, many cores on shared memory — without user-installed parallel frameworks.

Provability status

Disjointness for parallel for is enforced today partly via string heuristics in policy.cpp, not full Lean discharge. Decorators (@parallel, …) parse but do not yet elaborate. See Provability gaps (G-par, G-dec).

Two layers

Layer Syntax Hardware
SIMD simd[T, N] AVX / NEON vector units
Multi-core parallel for li_parallel_for (--cores) (linked by lic build)

Inner SIMD + outer parallel for is the standard Li pattern for hot loops.

SIMD type

var a: simd[f64, 4]
var b: simd[f32, 8]

Rules (v1):

  • T ∈ integer or float lane types (f32, f64, i32, …).
  • N{4, 8} in the current compiler (spec allows more).
  • No silent fallback to scalar if the CPU cannot do N lanes.

Intrinsics (today)

Intrinsic Role
__li_simd_splat_f64(x) Broadcast scalar
__li_simd_mul_f64(a, b) Lane-wise multiply
__li_simd_add_f64(a, b) Lane-wise add
__li_horiz_sum_f64(v) Sum lanes to scalar

Stdlib names like horizontal_sum and dot are planned to wrap these.

parallel for

parallel for i in 0..<N
  requires disjoint_row(i, grid)
  invariant row_ok(i, grid)
  decreases N - i
=
  compute_row(grid, i)

Proof names the checker knows

Include a disjointness clause in source, for example:

  • disjoint_row(i, grid)
  • disjoint_elem(i, buf)
  • disjoint_slice(...)

Without such a clause, lic build rejects the loop.

What is rejected (by design)

Pattern Result
All threads write buf[0] Compile error
Overlapping par_slice ranges Compile error
Mutable capture without Sync Compile error
borrow mut across iterations Compile error
Fake disjoint proof Lean / VC rejection

Exploit fixtures live in li-tests/race_shared_memory/.

Thread count

lic build app.li -o app --cores=8 --threads-per-core=1
./app

See execution surface. Native pool replaces OpenMP on user link path.

Send and Sync (spec)

Values captured into parallel regions must be safe to share. Laws are defined in the Lean semantics; the surface syntax enforces Sync when shared mutable state appears.

GPU (future)

gpu proc and device buffers are not v1 — CPU SIMD + OpenMP first.

Benchmarks: Benchmarks guide.