RFC: lillm — native LLM inference library¶

Status: Draft (Wave 0 scaffold)
Date: 2026-05-29
Program: PH-LLM-program.md
Package: packages/li-llm (import llm)

Problem¶

World Studio agentic workflows need local LLM inference without shipping PyTorch or vendor Python runtimes. Agents must call llm.generate from pure Li code with honest correctness gates.

Proposal¶

Composable modules:

Tokenizer (llm_tokenize, llm_detokenize) — BPE/byte-level, vocab file load
Loader (llm_load_weights) — safetensors primary, GGUF secondary (f32/f16)
Forward (llm_forward) — transformer blocks with RoPE, built on li-ml matmul
Generate (llm_generate) — KV-cache + greedy decode loop

Weight format¶

Format	Role	Notes
safetensors	Primary	HF export; no pickle
GGUF	Secondary	llama.cpp-compatible subset
ONNX logits	Oracle only (Tier-4)	Correctness reference, not runtime

HF → safetensors import path (WP-LLM-07)¶

Export model from Hugging Face Hub to safetensors (no custom code in runtime)
Place under fixtures/ or user cache path documented in PH-LLM program
llm_load_weights(path) maps tensor names → Li layer structs
Future: lillm import CLI wrapper (Li script, not Python product helper)

No PyTorch runtime in v1.

CPU vs GPU¶

Wave	Path	Depends
Wave 1	CPU matmul via `li-ml` / `linalg`	PH-ML WP-ML-04
Wave 2	GPU via `@gpu` + LKIR emit	PH-HW WP-HW-06, PH-ML-GPU Wave 2

Studio integration¶

User → li-studio → studio.ai.complete → llm.generate → native tokens
                  ↘ Cursor SDK cloud (fallback when no weights)

Orchestration lives in li-studio-ai; inference in li-llm only.

Trusted backends (Wave 2 — WP-LLM-08)¶

Optional trait for Ollama / OpenAI-compatible HTTP backends:

Marked [trusted] in docs
Audit gate before product default
Local native path remains default when weights present

Li syntax¶

Use def for all APIs. extern proc only for documented FFI fences. Every exported def needs requires / ensures / decreases.

Proof / trust¶

Component	Proved	Trusted
Tokenizer roundtrip	WP-LLM-01 smoke	vocab file bytes
Forward logits	ULP vs ONNX oracle (Tier-4)	weight file integrity
Generate	greedy smoke	sampling policy (v1: greedy only)

Open questions¶

Fixed max seq len vs dynamic KV growth for v1
GGUF quantization subset (Q4_K_M?) for Wave 2
Model card / license gate in llm_load_weights