Act VII · Packing for Travel
lesson safetensors · 8 min · 40 xp

safetensors anatomy

Header · offsets · mmap-friendly tensor region

Pickle is not safe — and it was the default

PyTorch's original .bin file format is a Python pickle. Loading a pickle runs arbitrary Python code embedded in the file — so every HuggingFace model download used to be a potential remote code execution. Real supply-chain attacks existed in the wild.

HuggingFace's safetensors format fixes this with a design that cannot execute code: the file contains only a JSON header and raw tensor bytes. Loading it cannot do anything except populate tensor buffers.

The attack surface was not theoretical. Pickle's __reduce__ protocol lets an object name any callable (os.system, subprocess.Popen, an exec() bootstrap) to run at unpickle time. HuggingFace ran a malware scanner on the Hub after 2023 incidents in which uploaded .bin files dropped reverse shells on researchers who ran AutoModel.from_pretrained. A JSON header parser, by contrast, has no callable to hijack: the loader reads a dtype enum, a shape array, and two integers, then hands a byte range to the tensor allocator. There is no code path that can do anything else.
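The __reduce__ hijack can be shown in a few lines. A harmless sketch, with print standing in for the os.system a real exploit would name:

```python
import pickle

class Payload:
    # __reduce__ lets pickle record "call this callable with these args"
    # instead of the object's state. A real exploit would name os.system
    # or subprocess.Popen here; print keeps the demo harmless.
    def __reduce__(self):
        return (print, ("arbitrary code ran at unpickle time",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # runs print(...); no Payload ever comes back
```

Note that the "unpickled object" is simply whatever the smuggled callable returned, here None: the file format itself is the execution vector.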

safetensors file layout
offset 0..8
u64 header size
8 bytes, little-endian: the size of the JSON header in bytes.
8..8+N
JSON header
Tensor name → dtype, shape, byte range.
{
  "model.layers.0.attn.q_proj.weight": {
    "dtype": "BF16",
    "shape": [3072, 3072],
    "data_offsets": [0, 18874368]
  },
  ...
}
rest
tensor data
Raw contiguous bytes. One tensor after another, aligned. Memory-mappable — you can mmap the whole file and the OS pages tensors in lazily.
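The layout above is simple enough to parse by hand. A sketch that builds a minimal safetensors-style blob in memory (one illustrative F32 tensor named "w"; real files also pad the header) and reads it back the way a loader would:

```python
import json
import struct

# Build a minimal safetensors-style file in memory: one FP32 tensor
# of shape [2, 2], i.e. 16 bytes of data. Names and values are illustrative.
header = {"w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
header_bytes = json.dumps(header).encode("utf-8")
tensor_bytes = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + tensor_bytes

# Parse it back: u64 header size, JSON header, then byte ranges.
(n,) = struct.unpack_from("<Q", blob, 0)
meta = json.loads(blob[8 : 8 + n])
start, end = meta["w"]["data_offsets"]
data = blob[8 + n + start : 8 + n + end]  # offsets are relative to the data region
values = struct.unpack("<4f", data)
print(values)  # (1.0, 2.0, 3.0, 4.0)
```

The loader touches nothing but integers, a JSON document, and byte ranges, which is the whole security argument in code form.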

Why mmap matters for loading speed

Because the data region is contiguous and aligned, you can memory-map the file with mmap(). Tensors become virtual memory regions — the OS pages them in on demand, shares them across processes, and evicts cold ones under pressure. Loading a 70B model from disk takes the time to parse a few hundred KB of JSON; the tensors themselves are never read until you use them.

This is why HuggingFace models “load fast” on repeat visits: the first load paged tensors into the page cache, and subsequent loads are cache hits.
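The lazy-loading idea can be sketched with Python's mmap module, using a throwaway file of raw floats in place of a real checkpoint (a real loader would parse the JSON header first to find each tensor's byte range):

```python
import mmap
import os
import struct
import tempfile

# Write 16 bytes of "tensor data" to a throwaway file.
fd, path = tempfile.mkstemp()
os.write(fd, struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))
os.close(fd)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # No read() call: the OS pages these bytes in only when touched,
        # and the page cache is shared across every process mapping the file.
        values = struct.unpack_from("<4f", mm, 0)

os.remove(path)
print(values)  # (1.0, 2.0, 3.0, 4.0)
```

Mapping is nearly free regardless of file size; the cost is paid per page, on first touch, which is exactly the "parse a few hundred KB of JSON, defer the tensors" behavior described above.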

The tensor data region begins at an aligned offset (the writer pads the JSON header so the first tensor starts on an alignment boundary), and an mmap() mapping is itself page-aligned; together these are the concrete properties that make zero-copy possible. A CUDA loader can cudaHostRegister the mmapped region and DMA straight from the page cache into VRAM: no intermediate torch.load deserialization, no temporary host-side FP32 buffer. vLLM and TGI exploited this to cut 7B model cold-start from ~45s (pickle round-trip) to under a second on warm cache; the bottleneck becomes PCIe bandwidth, not Python. This is why every serious inference runtime requires safetensors on disk: the security story was the entry argument, but the loading-speed story is what kept it.
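The padding trick is a one-liner of arithmetic. A sketch of how a writer could align the data region; trailing whitespace after the JSON is the usual padding mechanism, and the 4 KiB target here is an assumption for illustration, not a guarantee of the format:

```python
import json
import struct

ALIGN = 4096  # assumed page size; the actual padding target is writer-dependent

header = {"w": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
raw = json.dumps(header).encode("utf-8")

# Pad with trailing spaces (still valid JSON) so the data region,
# which starts at 8 + header_size, lands on an ALIGN boundary.
pad = -(8 + len(raw)) % ALIGN
padded = raw + b" " * pad
prefix = struct.pack("<Q", len(padded)) + padded
print(len(prefix) % ALIGN)  # 0: the first tensor byte is aligned
```

Once the first tensor byte sits on a page boundary, every tensor's mapped address can be handed to the GPU stack without an intermediate copy.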