
lm.c

lm.c is a lightweight inference engine that runs large language models directly on the CPU. Built for accessibility and efficiency, it brings AI capabilities to standard hardware with zero external dependencies.

Overall System Architecture

This section outlines the high-level data flow and processing stages within lm.c: how a model is loaded, how it is executed, and how text output is generated.

GGUF File Loading → Header & Metadata Parsing → Tensor Info Loading → Quantization Handling → Transformer Execution → Token Generation → Text Output

Core Components

lm.c is built from a small set of optimized core components, each with a specific role in efficient, portable LLM inference.

GGUF Parser

Handles all GGUF metadata types and quantization formats with zero dependencies

Quantization Engine

Supports 30+ GGML quantization formats from F32 to IQ1_M
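
For a sense of what these formats look like, GGML's Q8_0 stores weights in blocks of 32 values that share a single scale, with each value kept as a signed 8-bit integer. The sketch below is illustrative (the names are not lm.c's API, and the scale is held as a float here; the on-disk format stores it as fp16):

#include <stdint.h>

#define QK8_0 32

/* Illustrative Q8_0 block: one scale shared by 32 signed 8-bit quants. */
typedef struct {
    float  d;          /* block scale (fp16 in the actual format) */
    int8_t qs[QK8_0];  /* quantized values */
} block_q8_0_sketch;

/* On-the-fly dequantization of a single block: w[i] = d * q[i]. */
static void dequantize_q8_0(const block_q8_0_sketch *b, float *out) {
    for (int i = 0; i < QK8_0; i++)
        out[i] = b->d * (float)b->qs[i];
}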

CPU Inference

Optimized transformer execution with minimal memory footprint

Portable Runtime

Single-file C99 implementation runs anywhere

GGUF File Structure

The GGUF file format is central to lm.c, allowing for efficient storage and loading of large language models. This section illustrates its key structural elements.

  • Magic Header ("GGUF")
  • Version (uint32)
  • Tensor Count (uint64)
  • Metadata (key-value pairs)
  • Tensor Names (strings)
  • Dimensions (uint64[])
  • Quantization (GGML_TYPE)
  • Tensor Data (aligned)

struct gguf_header_t {
    uint32_t magic;              // "GGUF"
    uint32_t version;            // Format version
    uint64_t tensor_count;       // Number of tensors
    uint64_t metadata_kv_count;
    gguf_metadata_kv_t metadata_kv[];
};
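
As an illustration of how these fixed-size fields are consumed, a minimal reader might look like the sketch below. It assumes a little-endian host, and the function name is illustrative rather than part of lm.c's API.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: read and validate the fixed-size GGUF header fields. */
int read_gguf_header(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    uint32_t magic = 0, version = 0;
    uint64_t tensor_count = 0, metadata_kv_count = 0;

    if (fread(&magic, sizeof magic, 1, f) != 1 ||
        fread(&version, sizeof version, 1, f) != 1 ||
        fread(&tensor_count, sizeof tensor_count, 1, f) != 1 ||
        fread(&metadata_kv_count, sizeof metadata_kv_count, 1, f) != 1) {
        fclose(f);
        return -1;
    }

    if (magic != 0x46554747u) {  /* bytes "GGUF" read on a little-endian host */
        fclose(f);
        return -1;
    }

    printf("GGUF v%" PRIu32 ": %" PRIu64 " tensors, %" PRIu64 " metadata pairs\n",
           version, tensor_count, metadata_kv_count);
    fclose(f);
    return 0;
}

After this header come the metadata key-value pairs and the tensor descriptions, followed by the aligned tensor data itself.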

Transformer Layer Architecture

This section outlines the internal structure of a single transformer layer within lm.c, highlighting the key sub-components involved in processing token embeddings.

Token Embeddings → RMS Normalization → Multi-Head Attention (Q/K/V Projections) → RMS Normalization → Feed Forward Network (SwiGLU Activation) → Output Projection → Sampling & Decoding
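
Two of these stages are compact enough to sketch directly in C. The functions below illustrate RMS normalization and the SwiGLU combination inside the feed-forward network, following the LLaMA-style conventions implied above; the names are illustrative, and the gate/up projections are assumed to be precomputed.

#include <math.h>

/* RMS normalization: scale x by 1/sqrt(mean(x^2) + eps), then by a learned weight. */
static void rms_norm(const float *x, const float *weight,
                     float *out, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / sqrtf(ss / (float)n + eps);
    for (int i = 0; i < n; i++) out[i] = x[i] * scale * weight[i];
}

/* SwiGLU: ffn(x) = W_down( silu(W_gate x) * (W_up x) ).
   Here `gate` and `up` hold the two projections already applied to x. */
static void swiglu(const float *gate, const float *up, float *out, int n) {
    for (int i = 0; i < n; i++) {
        const float g    = gate[i];
        const float silu = g / (1.0f + expf(-g));  /* SiLU activation */
        out[i] = silu * up[i];
    }
}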

Memory Efficient Design

lm.c prioritizes a minimal memory footprint, using several techniques to keep resource usage low. This section highlights the key aspects of its memory-optimized design.

  • GGUF Parser: minimal overhead
  • Quantization: on-the-fly dequantization
  • Tensor Mapping: zero-copy access (see the sketch after this list)
  • Activation Buffers: reusable memory
  • KV Cache: optimized storage
  • Token Buffers: efficient allocation
  • SIMD Registers: vectorized operations
  • Thread Pools: parallel execution
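
The zero-copy tensor access mentioned above is typically achieved by memory-mapping the model file, so weights are paged in on demand instead of being copied into allocated buffers. A minimal POSIX sketch (illustrative name, minimal error handling):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole model file read-only; tensors are then read in place
   at base + data_offset, with no copies into separate buffers. */
static void *map_model_file(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED) return NULL;

    *size_out = (size_t)st.st_size;
    return base;
}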

Development Roadmap

This roadmap outlines the ongoing and planned development efforts for lm.c, showcasing our commitment to continuous improvement and expansion of its capabilities.

lm.c Implementation Progress

  • GGUF File Loader: Complete with metadata extraction
  • Tensor Data Mapping: Memory-mapped tensor access
  • Quantization Kernels: All 30+ GGML formats
  • Transformer Layers: CPU-optimized implementation
  • Tokenization: Byte-pair encoding support
  • Sampling: Temperature-based token selection
  • SIMD Optimization: AVX2/NEON acceleration (see the sketch after this list)
  • Thread Parallelism: Multi-core support
  • Interactive Mode: Chat interface
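
To illustrate the kind of kernel the SIMD work targets, here is a sketch of an AVX2 dot product such as the inner loop of a matrix-vector multiply. It assumes AVX2 plus FMA support; a NEON version follows the same structure, and the function name is illustrative.

#include <immintrin.h>

/* Dot product processing 8 floats per iteration; a scalar loop handles the tail. */
static float dot_avx2(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);  /* acc += va * vb */
    }
    float buf[8];
    _mm256_storeu_ps(buf, acc);
    float sum = buf[0] + buf[1] + buf[2] + buf[3] +
                buf[4] + buf[5] + buf[6] + buf[7];
    for (; i < n; i++) sum += a[i] * b[i];
    return sum;
}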

Inference Workflow

The inference workflow in lm.c is designed for speed and accuracy. This section illustrates the step-by-step process from input text to generated output.

Input Text → Tokenization → Embedding Lookup → Transformer Layers (Layer Norm → Attention → FFN → Residual Add) → Final Norm → Output Projection → Sampling → Generated Text
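
The sampling step above can be sketched as a temperature-scaled softmax followed by a draw from the resulting distribution. The function below is an illustrative version (temperature must be positive; greedy, top-k, and top-p selection are refinements of the same idea):

#include <math.h>
#include <stdlib.h>

/* Sample a token id from `logits` after scaling by 1/temperature. */
static int sample_token(const float *logits, float *probs,
                        int vocab_size, float temperature) {
    /* Softmax over logits / temperature, shifted by the max for stability. */
    float max_logit = logits[0];
    for (int i = 1; i < vocab_size; i++)
        if (logits[i] > max_logit) max_logit = logits[i];

    float sum = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        probs[i] = expf((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }

    /* Draw from the cumulative distribution. */
    float r = ((float)rand() / (float)RAND_MAX) * sum;
    float acc = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        acc += probs[i];
        if (r <= acc) return i;
    }
    return vocab_size - 1;  /* guard against rounding */
}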

Performance Optimizations

lm.c incorporates advanced CPU-specific optimizations to achieve high performance even on resource-constrained hardware. This section details the key techniques employed.

CPU-Specific Enhancements

Quantization Aware Ops

Process quantized weights directly
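
One way to read "directly" here: the dot product can run over quantized blocks without first expanding the weights to floats, accumulating integer products per block and applying the scales once. An illustrative sketch for two Q8_0-style operands (the names and the float scale are simplifications, as in the earlier quantization sketch):

#include <stdint.h>

typedef struct {
    float  d;        /* per-block scale */
    int8_t qs[32];   /* 32 signed 8-bit quants */
} q8_block_sketch;

/* Dot product over quantized blocks; n must be a multiple of 32. */
static float vec_dot_q8(const q8_block_sketch *a,
                        const q8_block_sketch *b, int n) {
    float sum = 0.0f;
    for (int blk = 0; blk < n / 32; blk++) {
        int32_t acc = 0;
        for (int i = 0; i < 32; i++)
            acc += (int32_t)a[blk].qs[i] * (int32_t)b[blk].qs[i];
        sum += a[blk].d * b[blk].d * (float)acc;
    }
    return sum;
}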

Block Processing

Optimized cache utilization

Memory Mapping

Zero-copy weight access

Thread Parallelism

Layer-wise execution
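
In practice this kind of parallelism means splitting the work inside each step, for example the rows of a matrix-vector product, across a pool of worker threads. A minimal pthreads sketch (illustrative names; a real engine would keep a persistent thread pool rather than creating threads per call):

#include <pthread.h>
#include <stddef.h>

/* Each worker computes output rows [row_begin, row_end). */
typedef struct {
    const float *mat;   /* rows x cols, row-major */
    const float *vec;   /* length cols */
    float *out;         /* length rows */
    int cols;
    int row_begin, row_end;
} matvec_job;

static void *matvec_worker(void *arg) {
    matvec_job *job = (matvec_job *)arg;
    for (int r = job->row_begin; r < job->row_end; r++) {
        const float *row = job->mat + (size_t)r * (size_t)job->cols;
        float sum = 0.0f;
        for (int c = 0; c < job->cols; c++) sum += row[c] * job->vec[c];
        job->out[r] = sum;
    }
    return NULL;
}

static void matvec_parallel(const float *mat, const float *vec, float *out,
                            int rows, int cols, int n_threads) {
    pthread_t tid[16];
    matvec_job jobs[16];
    if (n_threads > 16) n_threads = 16;
    for (int t = 0; t < n_threads; t++) {
        jobs[t] = (matvec_job){ mat, vec, out, cols,
                                rows * t / n_threads, rows * (t + 1) / n_threads };
        pthread_create(&tid[t], NULL, matvec_worker, &jobs[t]);
    }
    for (int t = 0; t < n_threads; t++) pthread_join(&tid[t], NULL);
}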

Ready to explore lm.c?

Dive into the code, contribute to the project, or simply learn more about how lm.c is pushing the boundaries of accessible AI.