Act VII: Packing for Travel
lesson gguf · 8 min · 40 xp

GGUF containers

One file, everything inside

safetensors solves weight storage, but a HuggingFace model is a directory: multiple safetensors shards, an index file, config.json, tokenizer.json, chat_template.json, generation_config.json. Distributing that cleanly is annoying. For consumer tools like Ollama and LM Studio, “here is a model” should mean here is a file.

GGUF (GGML Unified Format) is the answer. One file contains: architecture metadata (hidden size, layers, head count, RoPE base, etc.), the tokenizer, the chat template, generation config, and all the weight tensors — typically pre-quantized.

GGUF landed in August 2023 as Georgi Gerganov's replacement for the earlier GGML/GGJT formats, which had versioning chaos and no stable metadata schema. The design rule was forward-extensible: metadata is a typed key-value store (strings, ints, floats, arrays, nested KVs), so a new field like rope.scaling.type or tokenizer.ggml.pre can ship without breaking old loaders — they ignore unknown keys instead of refusing the file. That is why a 2024 llama.cpp build can still open a late-2023 GGUF, and why a 2023 build gracefully degrades on a 2025 one. Compare this to the HuggingFace world, where a new architecture requires a new modeling_*.py file shipped with trust_remote_code=True — a GGUF just adds metadata keys.
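The ignore-unknown-keys policy is simple enough to sketch in a few lines. This is a hypothetical loader, not llama.cpp's actual code; the key names are real GGUF metadata keys, but the function and the known-key set are illustrative:

```python
def extract_known(metadata: dict) -> dict:
    """Hypothetical loader policy: keep only the metadata keys this build
    understands and silently skip the rest, so a file carrying newer keys
    (e.g. rope.scaling.type) still loads in an older reader."""
    known = {
        "general.architecture",
        "llama.context_length",
        "tokenizer.ggml.model",
        "tokenizer.chat_template",
    }
    return {k: v for k, v in metadata.items() if k in known}
```

An old reader handed a file with a 2025-era key simply never looks at it; nothing in the fixed file layout depends on the reader understanding every key.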

GGUF file layout

- magic 'GGUF' — 4 bytes, file type discriminator
- version + counts — file version, tensor count, metadata count
- metadata (KV store) — model architecture · tokenizer · chat template · RoPE base · context length · quantization level, all in one KV table
- tensor info list — name, shape, dtype, offset for every tensor
- aligned tensor data — quantized tensors in whatever format the metadata says: Q4_K_M, Q5_K_M, Q8_0, etc.

The key win: the chat template lives in the file. Tools that read a GGUF know exactly how to prompt it without guessing. This kills an entire class of “my model outputs are garbage” bugs.
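The fixed preamble above is small enough to parse by hand. A minimal sketch, following the little-endian field layout from the GGUF spec (magic, uint32 version, uint64 tensor count, uint64 metadata KV count) — real readers continue from here into the KV table:

```python
import struct

def read_gguf_header(buf: bytes) -> dict:
    """Parse the fixed GGUF preamble from the first 24 bytes of a file."""
    if buf[0:4] != b"GGUF":
        raise ValueError(f"not a GGUF file (magic = {buf[0:4]!r})")
    version, = struct.unpack_from("<I", buf, 4)             # uint32
    tensor_count, = struct.unpack_from("<Q", buf, 8)        # uint64
    metadata_kv_count, = struct.unpack_from("<Q", buf, 16)  # uint64
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

Everything interesting — the KV table, then the tensor info list — sits immediately after these 24 bytes, which is why a complete GGUF parser stays so small.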

Why GGUF is Ollama's default format

Ollama, LM Studio, Jan, GPT4All, and llama.cpp all consume GGUFs directly. You don't need a Python environment, a tokenizer config, or HuggingFace credentials to run one — just the file. For local inference on consumer hardware, this is the right abstraction.

Every modern SLM is available as GGUFs for multiple quantization levels (Q4_K_M, Q5_K_M, Q6_K, Q8_0). The next lesson shows what the letters mean.

The tensor data region is aligned to a configurable boundary — general.alignment in the metadata, default 32 bytes — so the same mmap-and-DMA path that safetensors unlocks works here too.

For models larger than a single file can conveniently hold (GGUF has no hard size cap, but filesystems, HTTP range requests, and HuggingFace's LFS prefer ≤50 GB shards), GGUF supports split files named model-00001-of-00003.gguf, with the shard count and tensor-to-shard mapping in the metadata of shard one.

The important property is that a GGUF is tool-lock-in-free: the format spec fits on a single page of GitHub markdown, the parser is a few hundred lines of C, and there are independent Rust, Go, Python, and Zig readers. No framework owns it — which is precisely why every local runtime converged on it.
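Both conventions above reduce to one-liners. A sketch of the alignment rounding (round an offset up to the next multiple of general.alignment) and the shard naming scheme; the function names are illustrative, not from any real library:

```python
def align_offset(offset: int, alignment: int = 32) -> int:
    """Round a tensor data offset up to the next multiple of
    general.alignment (the GGUF default is 32 bytes)."""
    return (offset + alignment - 1) // alignment * alignment

def shard_name(prefix: str, i: int, n: int) -> str:
    """Split-file naming convention: model-00001-of-00003.gguf."""
    return f"{prefix}-{i:05d}-of-{n:05d}.gguf"
```

The alignment means each tensor starts on a boundary an mmap-backed loader can hand straight to the accelerator without copying.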