Skip to content

Fast math and parallelism

Status

Math surface (dot, a @ b, 2d C = A @ B, sum, element-wise ops) is implemented on fixed array tiles. @vectorized(lanes=4) on def is validated; @no_vectorize disables f64×4 array SIMD in codegen. @vectorized on for parses but does not elaborate loops yet. @parallel requires disjoint=. See Provability gaps.

Li is built for scientific and high-performance work. Prefer:

  1. Math on arraysdot(x, y), x @ y, C = A @ B, sum(a), scalar loops for AXPY.
  2. parallel for — many CPU cores with proof that iterations do not alias shared memory.

You do not install NumPy, OpenMP bindings, or a thread library yourself — Li’s compiler wires native vector instructions and (on Linux) OpenMP when your program passes the proof gate.

Start here: Math-first HPC examples · Linear algebra


Math surface (preferred)

You write Meaning
dot(x, y) or x @ y Inner product of matching array[N, float]
C = A @ B 2d matmul on array[M, array[K, float]] tiles
sum(a) Reduction over fixed array[N, int] or array[N, float]
y[i] = alpha * x[i] + y[i] AXPY via index loop (element-wise arrays planned)

Tier 1 benchmarks use this style — see benchmarks/tier1_micro/simd_dot/li/main.li and matmul_naive/li/main.li (no __li_simd_* in user sources).


Parallel loops

parallel for j in 0..<8
  requires disjoint_elem(j, buf)
  decreases 8 - j
=
  buf[j] = 0.0
Piece Meaning
parallel for Run iterations on multiple cores
0..<8 j runs 0, 1, …, 7
requires disjoint_elem(...) Proof obligation: each iteration touches different memory
Body What each iteration does

If Li cannot see that iterations are independent, lic build fails. That is how Li prevents data races by default.

Thread count

lic build sim.li -o sim --threads=8

Or set LI_OMP_THREADS=8 in the environment before running the binary (Linux with OpenMP linked).


Putting it together

A typical HPC workflow:

  1. Store fields in array[N, float] or 2d array[M, array[K, float]].
  2. Express kernels with dot / @ / sum or proved parallel for.
  3. Let the compiler lower to scalar/SIMD/OpenMP (see appendix for MIR names).

See benchmarks/tier2_physics/md_lennard_jones/li/main.li for a pure-Li MD driver.


Compiler intrinsics appendix

Low-level SIMD types remain for compiler tests and codegen debugging — not for new handbook or benchmark sources.

var v: simd[f64, 4] = __li_simd_splat_f64(1.5)
Call Effect
__li_simd_splat_f64(x) Broadcast scalar to all lanes
__li_simd_mul_f64(a, b) Lane-wise multiply
__li_simd_add_f64(a, b) Lane-wise add
__li_horiz_sum_f64(v) Add all lanes into one float

Lane counts 4 and 8 are supported; other sizes are rejected at compile time.

Migration: replace manual intrinsic loops with dot / @ on array tiles; see Examples gallery.

More detail: SIMD & parallel reference.


What is not allowed (on purpose)

You cannot… Why
Share one variable across parallel iterations without proof Data races
Mix int and float silently Catches science bugs
Skip requires / ensures / decreases No proof without promises
Use Any or unsafe Breaks the guarantee story