One file, everything inside
safetensors solves weight storage, but a HuggingFace model is a directory: multiple safetensors shards, an index file, config.json, tokenizer.json, chat_template.json, generation_config.json. Distributing that cleanly is annoying. For consumer tools like Ollama and LM Studio, “here is a model” should mean here is a file.
GGUF is the answer. One file contains: architecture metadata (hidden size, layer count, head count, RoPE base, and so on), the tokenizer, the chat template, the generation config, and all the weight tensors, typically pre-quantized.
GGUF landed in August 2023 as Georgi Gerganov's replacement for the earlier GGML/GGJT formats, which had versioning chaos and no stable metadata schema. The design rule was forward-extensible: metadata is a typed key-value store (strings, ints, floats, arrays, nested KVs), so a new field like rope.scaling.type or tokenizer.ggml.pre can ship without breaking old loaders — they ignore unknown keys instead of refusing the file. That is why a 2024 llama.cpp build can still open a late-2023 GGUF, and why a 2023 build gracefully degrades on a 2025 one. Compare this to the HuggingFace world, where a new architecture requires a new modeling_*.py file shipped with trust_remote_code=True — a GGUF just adds metadata keys.
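The forward-extensibility falls out of how simple the binary layout is. Below is a minimal sketch (not the llama.cpp implementation) of parsing a GGUF v3 header with Python's standard struct module: magic, version, tensor count, KV count, then the typed key-value pairs. It handles only a few of the spec's value types for brevity; the point is that every key is read the same way, so a loader can store keys it does not recognize instead of failing on them.

```python
import struct

GGUF_MAGIC = 0x46554747  # the bytes "GGUF", read as a little-endian uint32

def read_string(buf, off):
    # GGUF string: uint64 byte length, then UTF-8 bytes (no terminator)
    (n,) = struct.unpack_from("<Q", buf, off)
    off += 8
    return buf[off:off + n].decode("utf-8"), off + n

def read_value(buf, off):
    # Each value is prefixed with a uint32 type tag; subset of types shown
    (vtype,) = struct.unpack_from("<I", buf, off)
    off += 4
    if vtype == 4:  # uint32
        (v,) = struct.unpack_from("<I", buf, off)
        return v, off + 4
    if vtype == 6:  # float32
        (v,) = struct.unpack_from("<f", buf, off)
        return v, off + 4
    if vtype == 8:  # string
        return read_string(buf, off)
    raise ValueError(f"value type {vtype} not handled in this sketch")

def read_header(buf):
    # Fixed prefix: magic, version (both uint32), tensor count, KV count (uint64)
    magic, version, n_tensors, n_kv = struct.unpack_from("<IIQQ", buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    off = 24
    meta = {}
    for _ in range(n_kv):
        key, off = read_string(buf, off)
        val, off = read_value(buf, off)
        meta[key] = val  # unknown keys are simply stored, never rejected
    return version, n_tensors, meta
```

A real reader continues past the KV section into the tensor-info records, but the metadata loop above is the part that makes old builds tolerant of new keys.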
Why GGUF is Ollama's default format
Ollama, LM Studio, Jan, GPT4All, and llama.cpp all consume GGUFs directly. You don't need a Python environment, a tokenizer config, or HuggingFace credentials to run one — just the file. For local inference on consumer hardware, this is the right abstraction.
Every modern SLM is available as a GGUF at multiple quantization levels (Q4_K_M, Q5_K_M, Q6_K, Q8_0). The next lesson explains what the letters mean.
The tensor data region is aligned to a configurable boundary — general.alignment in the metadata, default 32 bytes — so the same mmap-and-DMA path that safetensors unlocks works here too. For models larger than a single file can conveniently hold (GGUF has no hard size cap, but filesystems, HTTP range requests, and HuggingFace's LFS prefer ≤50 GB shards), GGUF supports split files named model-00001-of-00003.gguf, with split metadata keys in each shard's header recording its index and the total shard count. The important property is that a GGUF is tool-lock-in-free: the format spec fits on a single page of GitHub markdown, the parser is a few hundred lines of C, and there are independent Rust, Go, Python, and Zig readers. No framework owns it — which is precisely why every local runtime converged on it.
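The alignment rule is just round-up-to-the-next-multiple: after the header and tensor-info section, the writer pads with zero bytes until the tensor data region starts on a multiple of general.alignment. A one-line sketch of that padding arithmetic:

```python
DEFAULT_ALIGNMENT = 32  # overridable via the general.alignment metadata key

def align_offset(offset: int, alignment: int = DEFAULT_ALIGNMENT) -> int:
    # Round offset up to the next multiple of alignment (no-op if already aligned)
    return offset + (alignment - offset % alignment) % alignment
```

Because every tensor's data then starts on an aligned boundary within the file, a runtime can mmap the whole file and hand tensor pointers straight to compute kernels without copying.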