Skip to content

Mathematical linear-algebra surface (user syntax → SIMD / parallel)

Depends on: 2h (scalar Python-math), 2g (def), 7a–7b (SIMD + parallel for lowering), 7d (decorators)
Blocks: Tier 1 matmul_* / simd_dot benchmarks with pure math source (no user simd(...) / intrinsics)
Design spec addendum: docs/superpowers/specs/2026-05-16-li-math-linalg-surface.md (to land with phase)

Principle (binding)

Users write mathematics. The compiler and stdlib lower to simd, parallel for, OpenMP, and (later) GPU — same as today’s proved cores.

Layer Who sees it Examples
Surface Every scientist / agent C += A @ B, y = alpha * x + beta * y, norm(v), sum(a * b)
Execution control HPC authors @cpu, @parallel(disjoint=...), @vectorized(lanes=8) on def / loops
Internal / audit only Compiler tests, “expanded form” appendix simd[f64, 8], __li_simd_*, MIR dumps

Forbidden as primary style: asking users to call simd(...), __li_simd_mul_f64, or manual lane loops in handbook, Tetris-adjacent tutorials, or Tier 1–2 benchmark Li sources.


Surface syntax (v1 → Phase 3 tensor)

Scalars and element-wise (v1 on array[N, T])

def axpy(alpha: float64, x: array[N, float64], y: array[N, float64]) -> None:
  requires N > 0
  @cpu
  @parallel(disjoint=disjoint_index)
  @vectorized(lanes=8, T=float64)
  for i in 0..<N:
    y[i] = alpha * x[i] + y[i]   # infix *, + — not simd intrinsics

Inner products and norms (v1)

def dot(x: array[N, float64], y: array[N, float64]) -> float64:
  requires N > 0
  ensures result == sum_{i in 0..<N} x[i] * y[i]
  var acc: float64 = 0.0
  @vectorized(lanes=8)
  for i in 0..<N:
    acc += x[i] * y[i]   # lowers to simd horizontal sum + FMA chain
  return acc

Matrix multiply (Tier 1 benchmark shape)

# types: tensor[(M,K), f64] when Phase 3 lands; v1 may use array + refinements
@cpu
@parallel(disjoint=disjoint_tile)
def matmul_add(C: tensor[(M, N), float64],
               A: tensor[(M, K), float64],
               B: tensor[(K, N), float64]) -> None:
  requires shapes_compatible(A, B, C)
  C += A @ B    # mathematical matrix multiply — NOT simd(A) or @vectorized(...)

Blocked cache-friendly variant (Tier 1 matmul_blocked)

@cpu
@parallel(disjoint=disjoint_block)
def matmul_blocked(C, A, B, block: int) -> None:
  for bi in blocked_range(0, M, block):
    for bj in blocked_range(0, N, block):
      for bk in blocked_range(0, K, block):
        C[bi:bi+block, bj:bj+block] += A[bi:bi+block, bk:bk+block] @ B[bk:bk+block, bj:bj+block]

Slice + infix @ desugar to proved index loops; inner micro-kernel gets @vectorized from the loop or block def.


Lowering pipeline

flowchart LR
  math[Math surface expr]
  deco[Decorators placement]
  elab[Shape + broadcast check]
  core[Core loops + simd MIR]
  omp[OpenMP / LLVM vectorize]

  math --> elab
  deco --> core
  elab --> core --> omp
Surface form Lowers to (v1 CPU)
a * b on compatible arrays/tensors element-wise loop → @vectorizedsimd lanes
sum(expr) / dot(x,y) reduction with proved associativity where needed
A @ B ikj/ikj-blocked loops + SIMD inner + optional @parallel on outer
alpha * x + y AXPY → FMA vectorized loop

Proofs: shape/dimension errors at compile time; A @ B requires inner dim match; parallel outer loops still need disjoint=.


Sub-phases

Sub Deliverable Exit
2i-a Infix *, +, -, /, ** on numeric arrays; sum; parse + types li-tests/math_linalg/scalar_elementwise/
2i-b dot, norm, axpy, **, reductions partial — prelude dot/norm/axpy; math_linalg/reductions/; float Lean Props open
2i-c Binary @ for 2D matmul desugar (fixed small shapes + tensor when ready) donematmul_*.li on main
2f / P-linalg requires/ensures on fixed dot/sum/matmul entry partial (#151) — closed int specimens; loop dot open
7e-a Connect math expr lowering to existing 7a SIMD MIR simd_dot Li source has zero __li_simd_* in user file
7e-b Matmul math + decorators in Tier 1 benches bench.py --tier 1 CSV: li vs cpp/rust/julia
7e-c Docs + gallery See below

Documentation deliverables

Doc Content
docs/language/linear-algebra.md Math surface, shapes, @ vs element-wise *
docs/guide/math-hpc-examples.md Full samples: dot, axpy, matmul, MD force as math
Update docs/guide/fast-math-and-parallelism.md Lead with math + decorators; move intrinsics to “compiler expansion”
Update docs/guide/examples-gallery.md Side-by-side: math source vs expanded audit form
README.md hello One line sum(a * b) not simd

Tests

Suite Role
li-tests/math_syntax/ Scalar **, //, % (Phase 2h)
li-tests/math_linalg/ Shape errors, broadcast rules, A @ B dim mismatch fail
li-tests/math_linalg/golden/ Optional MIR/LLVM snapshot for dot / small matmul
Harness Tier 1 simd_dot, matmul_naive, matmul_blockedLi column uses math-only sources

Benchmarks vs reference HPC (mandatory)

Use existing benchmarks plan harness (bench.py, cross-lang CSV, plots).

Benchmark Li source style Compare
simd_dot sum(a * b) or dot(a,b) + @vectorized C++ Clang, Rust --release, Julia
matmul_naive C += A @ B + @parallel where legal Same + Python+NumPy column
matmul_blocked blocked slices + math @ Eigen C++ optional column

Regression policy (unchanged): Li within 1.2× C++ on same machine for Tier 1–2, or investigate before release.

Report columns: ns/element or GFLOPS; 1-thread and N-thread; label li_math vs legacy li_intrinsic during migration only.


Relation to decorators (7d)

  • Math = what you compute (A @ B, alpha * x + y).
  • Decorators = how / where it runs (@cpu, @parallel, @vectorized, @gpu).
  • Never require simd(...) in user code to get vectorization — @vectorized + math lowering does.

Exit gate

  • ./li-tests/run_all.sh math_linalg
  • Tier 1 Li benchmarks use math-only sources (li_pure=True on simd_dot, matmul_*)
  • Handbook pages published (linear-algebra.md, math-hpc-examples.md)
  • No user-facing doc recommends __li_simd_* as the default path
  • 2i-b norm, sum/dot, reductions/ suite; same-length ** / prelude axpy / scalar×array (no broadcast) — float Lean Props still open
  • 2i-broadcast length-1 element-wise broadcast (broadcast_len1_*.li); non-broadcast length mismatch (broadcast_invalid_len2_vs_len4.li, elementwise_len_mismatch.li) — full NumPy rank rules open
  • P-linalg loop implementation ≡ closed-form ensures in Lean (G-lean) (linalg_dot4_int_loop_open.li + discharge_linalg_int_lean.sh; float Props still G-math open)
  • Tier 1 perf ≤1.2× C++ (benchmarks dashboard)