
lm.c

lm.c is a lightweight inference engine that runs large language models directly on the CPU. Built for accessibility and efficiency, it brings AI capabilities to standard hardware with zero external dependencies.

Overall System Architecture

This section outlines the high-level data flow and processing stages within lm.c: how a model is loaded, how it is executed, and how text output is generated.

GGUF File Loading → Header & Metadata Parsing → Tensor Info Loading → Quantization Handling → Transformer Execution → Token Generation → Text Output

Core Components

lm.c is built from a small set of optimized core components, each with a specific role in efficient, portable LLM inference.

GGUF Parser

Handles all GGUF metadata types and quantization formats with zero dependencies

Quantization Engine

Supports 30+ GGML quantization formats from F32 to IQ1_M
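
For a sense of what these formats look like, GGML's Q8_0 stores weights in blocks of 32 values that share a single scale, with each value kept as a signed 8-bit integer. The sketch below is illustrative (the names are not lm.c's API, and the scale is held as a float here; the on-disk format stores it as fp16):

#include <stdint.h>

#define QK8_0 32

/* Illustrative Q8_0 block: one scale shared by 32 signed 8-bit quants. */
typedef struct {
    float  d;          /* block scale (fp16 in the actual format) */
    int8_t qs[QK8_0];  /* quantized values */
} block_q8_0_sketch;

/* On-the-fly dequantization of a single block: w[i] = d * q[i]. */
static void dequantize_q8_0(const block_q8_0_sketch *b, float *out) {
    for (int i = 0; i < QK8_0; i++)
        out[i] = b->d * (float)b->qs[i];
}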

CPU Inference

Optimized transformer execution with minimal memory footprint

Portable Runtime

Single-file C99 implementation runs anywhere

GGUF File Structure

The GGUF file format is central to lm.c, allowing for efficient storage and loading of large language models. This section illustrates its key structural elements.

  • Magic Header ("GGUF")
  • Version (uint32)
  • Tensor Count (uint64)
  • Metadata (key-value pairs)
  • Tensor Names (strings)
  • Dimensions (uint64[])
  • Quantization (GGML_TYPE)
  • Tensor Data (aligned)

struct gguf_header_t {
    uint32_t magic;              // "GGUF"
    uint32_t version;            // Format version
    uint64_t tensor_count;       // Number of tensors
    uint64_t metadata_kv_count;
    gguf_metadata_kv_t metadata_kv[];
};
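
As an illustration of how these fixed-size fields are consumed, a minimal reader might look like the sketch below. It assumes a little-endian host, and the function name is illustrative rather than part of lm.c's API.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: read and validate the fixed-size GGUF header fields. */
int read_gguf_header(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    uint32_t magic = 0, version = 0;
    uint64_t tensor_count = 0, metadata_kv_count = 0;

    if (fread(&magic, sizeof magic, 1, f) != 1 ||
        fread(&version, sizeof version, 1, f) != 1 ||
        fread(&tensor_count, sizeof tensor_count, 1, f) != 1 ||
        fread(&metadata_kv_count, sizeof metadata_kv_count, 1, f) != 1) {
        fclose(f);
        return -1;
    }

    if (magic != 0x46554747u) {  /* bytes "GGUF" read on a little-endian host */
        fclose(f);
        return -1;
    }

    printf("GGUF v%" PRIu32 ": %" PRIu64 " tensors, %" PRIu64 " metadata pairs\n",
           version, tensor_count, metadata_kv_count);
    fclose(f);
    return 0;
}

After this header come the metadata key-value pairs and the tensor descriptions, followed by the aligned tensor data itself.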

Transformer Layer Architecture

This section outlines the internal structure of a single transformer layer within lm.c, highlighting the key sub-components involved in processing token embeddings.

Token Embeddings → RMS Normalization → Multi-Head Attention (Q/K/V Projections) → RMS Normalization → Feed Forward Network (SwiGLU Activation) → Output Projection → Sampling & Decoding
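
Two of these stages are compact enough to sketch directly in C. The functions below illustrate RMS normalization and the SwiGLU combination inside the feed-forward network, following the LLaMA-style conventions implied above; the names are illustrative, and the gate/up projections are assumed to be precomputed.

#include <math.h>

/* RMS normalization: scale x by 1/sqrt(mean(x^2) + eps), then by a learned weight. */
static void rms_norm(const float *x, const float *weight,
                     float *out, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / sqrtf(ss / (float)n + eps);
    for (int i = 0; i < n; i++) out[i] = x[i] * scale * weight[i];
}

/* SwiGLU: ffn(x) = W_down( silu(W_gate x) * (W_up x) ).
   Here `gate` and `up` hold the two projections already applied to x. */
static void swiglu(const float *gate, const float *up, float *out, int n) {
    for (int i = 0; i < n; i++) {
        const float g    = gate[i];
        const float silu = g / (1.0f + expf(-g));  /* SiLU activation */
        out[i] = silu * up[i];
    }
}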

Memory Efficient Design

lm.c prioritizes a minimal memory footprint, using several techniques to keep resource usage low. This section highlights the key aspects of its memory-optimized design.

  • GGUF Parser: minimal overhead
  • Quantization: on-the-fly dequantization
  • Tensor Mapping: zero-copy access (see the sketch after this list)
  • Activation Buffers: reusable memory
  • KV Cache: optimized storage
  • Token Buffers: efficient allocation
  • SIMD Registers: vectorized operations
  • Thread Pools: parallel execution
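
The zero-copy tensor access mentioned above is typically achieved by memory-mapping the model file, so weights are paged in on demand instead of being copied into allocated buffers. A minimal POSIX sketch (illustrative name, minimal error handling):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole model file read-only; tensors are then read in place
   at base + data_offset, with no copies into separate buffers. */
static void *map_model_file(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED) return NULL;

    *size_out = (size_t)st.st_size;
    return base;
}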

Development Roadmap

This roadmap outlines the ongoing and planned development efforts for lm.c, showcasing our commitment to continuous improvement and expansion of its capabilities.

lm.c Implementation Progress

  • GGUF File Loader: Complete with metadata extraction
  • Tensor Data Mapping: Memory-mapped tensor access
  • Quantization Kernels: All 30+ GGML formats
  • Transformer Layers: CPU-optimized implementation
  • Tokenization: Byte-pair encoding support
  • Sampling: Temperature-based token selection
  • SIMD Optimization: AVX2/NEON acceleration (see the sketch after this list)
  • Thread Parallelism: Multi-core support
  • Interactive Mode: Chat interface
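
To illustrate the kind of kernel the SIMD work targets, here is a sketch of an AVX2 dot product such as the inner loop of a matrix-vector multiply. It assumes AVX2 plus FMA support; a NEON version follows the same structure, and the function name is illustrative.

#include <immintrin.h>

/* Dot product processing 8 floats per iteration; a scalar loop handles the tail. */
static float dot_avx2(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);  /* acc += va * vb */
    }
    float buf[8];
    _mm256_storeu_ps(buf, acc);
    float sum = buf[0] + buf[1] + buf[2] + buf[3] +
                buf[4] + buf[5] + buf[6] + buf[7];
    for (; i < n; i++) sum += a[i] * b[i];
    return sum;
}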

Inference Workflow

The inference workflow in lm.c is designed for speed and accuracy. This section illustrates the step-by-step process from input text to generated output.

Input Text → Tokenization → Embedding Lookup → Transformer Layers (Layer Norm → Attention → FFN → Residual Add) → Final Norm → Output Projection → Sampling → Generated Text
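
The sampling step above can be sketched as a temperature-scaled softmax followed by a draw from the resulting distribution. The function below is an illustrative version (temperature must be positive; greedy, top-k, and top-p selection are refinements of the same idea):

#include <math.h>
#include <stdlib.h>

/* Sample a token id from `logits` after scaling by 1/temperature. */
static int sample_token(const float *logits, float *probs,
                        int vocab_size, float temperature) {
    /* Softmax over logits / temperature, shifted by the max for stability. */
    float max_logit = logits[0];
    for (int i = 1; i < vocab_size; i++)
        if (logits[i] > max_logit) max_logit = logits[i];

    float sum = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        probs[i] = expf((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }

    /* Draw from the cumulative distribution. */
    float r = ((float)rand() / (float)RAND_MAX) * sum;
    float acc = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        acc += probs[i];
        if (r <= acc) return i;
    }
    return vocab_size - 1;  /* guard against rounding */
}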

Performance Optimizations

lm.c incorporates advanced CPU-specific optimizations to achieve high performance even on resource-constrained hardware. This section details the key techniques employed.

CPU-Specific Enhancements

Quantization Aware Ops

Process quantized weights directly
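
One way to read "directly" here: the dot product can run over quantized blocks without first expanding the weights to floats, accumulating integer products per block and applying the scales once. An illustrative sketch for two Q8_0-style operands (the names and the float scale are simplifications, as in the earlier quantization sketch):

#include <stdint.h>

typedef struct {
    float  d;        /* per-block scale */
    int8_t qs[32];   /* 32 signed 8-bit quants */
} q8_block_sketch;

/* Dot product over quantized blocks; n must be a multiple of 32. */
static float vec_dot_q8(const q8_block_sketch *a,
                        const q8_block_sketch *b, int n) {
    float sum = 0.0f;
    for (int blk = 0; blk < n / 32; blk++) {
        int32_t acc = 0;
        for (int i = 0; i < 32; i++)
            acc += (int32_t)a[blk].qs[i] * (int32_t)b[blk].qs[i];
        sum += a[blk].d * b[blk].d * (float)acc;
    }
    return sum;
}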

Block Processing

Optimized cache utilization

Memory Mapping

Zero-copy weight access

Thread Parallelism

Layer-wise execution
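
In practice this kind of parallelism means splitting the work inside each step, for example the rows of a matrix-vector product, across a pool of worker threads. A minimal pthreads sketch (illustrative names; a real engine would keep a persistent thread pool rather than creating threads per call):

#include <pthread.h>
#include <stddef.h>

/* Each worker computes output rows [row_begin, row_end). */
typedef struct {
    const float *mat;   /* rows x cols, row-major */
    const float *vec;   /* length cols */
    float *out;         /* length rows */
    int cols;
    int row_begin, row_end;
} matvec_job;

static void *matvec_worker(void *arg) {
    matvec_job *job = (matvec_job *)arg;
    for (int r = job->row_begin; r < job->row_end; r++) {
        const float *row = job->mat + (size_t)r * (size_t)job->cols;
        float sum = 0.0f;
        for (int c = 0; c < job->cols; c++) sum += row[c] * job->vec[c];
        job->out[r] = sum;
    }
    return NULL;
}

static void matvec_parallel(const float *mat, const float *vec, float *out,
                            int rows, int cols, int n_threads) {
    pthread_t tid[16];
    matvec_job jobs[16];
    if (n_threads > 16) n_threads = 16;
    for (int t = 0; t < n_threads; t++) {
        jobs[t] = (matvec_job){ mat, vec, out, cols,
                                rows * t / n_threads, rows * (t + 1) / n_threads };
        pthread_create(&tid[t], NULL, matvec_worker, &jobs[t]);
    }
    for (int t = 0; t < n_threads; t++) pthread_join(&tid[t], NULL);
}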

Ready to explore lm.c?

Dive into the code, contribute to the project, or simply learn more about how lm.c is pushing the boundaries of accessible AI.