Pickle is not safe — and it was the default
PyTorch's original .bin file format is a Python pickle. Loading a pickle runs arbitrary Python code embedded in the file — so every HuggingFace model download used to be a potential remote code execution. Real supply-chain attacks existed in the wild.
HuggingFace's safetensors format fixes this with a design that cannot execute code: the file contains only a JSON header and raw tensor bytes. Loading it cannot do anything except populate tensor buffers.
The attack surface was not theoretical. Pickle's __reduce__ protocol lets an object name any callable — os.system, subprocess.Popen, a bootstrapping exec() — to run at unpickle time. HuggingFace ran a malware scanner on the Hub after 2023 incidents where uploaded .bin files dropped reverse shells on researchers who ran AutoModel.from_pretrained. A JSON header parser, by contrast, has no callable to hijack: the loader reads a dtype enum, a shape array, and two integers, then hands a byte range to the tensor allocator. There is no code path that can do anything else.
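A minimal demonstration of the mechanism, with the harmless shell builtin `true` standing in for a real payload:

```python
import os
import pickle

class Evil:
    # __reduce__ returns (callable, args); pickle stores this pair in the
    # file and *calls it* at load time to "reconstruct" the object.
    def __reduce__(self):
        return (os.system, ("true",))  # harmless stand-in for a real payload

payload = pickle.dumps(Evil())
result = pickle.loads(payload)  # runs the shell command during unpickling
```

Nothing about `payload` looks suspicious on disk; the code runs the moment anything unpickles it.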
```json
{
  "model.layers.0.attn.q_proj.weight": {
    "dtype": "BF16",
    "shape": [3072, 3072],
    "data_offsets": [0, 18874368]
  },
  ...
}
```

Why mmap matters for loading speed
Because the data region is contiguous and aligned, you can memory-map the file with mmap(). Tensors become virtual memory regions — the OS pages them in on demand, shares them across processes, and evicts cold ones under pressure. Loading a 70B model from disk takes the time to parse a few hundred KB of JSON; the tensors themselves are never read until you use them.
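A sketch of that load path, writing a tiny safetensors-style file by hand and mapping it back. The file name and the single tensor are made up for illustration; the layout follows the published format (an 8-byte little-endian header length, the JSON header, then raw tensor bytes):

```python
import json
import mmap
import struct

import numpy as np

# Write a minimal file: length prefix, JSON header, raw tensor bytes.
tensor = np.arange(12, dtype=np.float32).reshape(3, 4)
header = {"w": {"dtype": "F32", "shape": [3, 4],
                "data_offsets": [0, tensor.nbytes]}}
hdr = json.dumps(header).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(hdr)))
    f.write(hdr)
    f.write(tensor.tobytes())

# Zero-copy load: mmap the file, parse the header, then view the byte
# range directly — no bytes are copied into a Python-side buffer.
with open("demo.safetensors", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
(hdr_len,) = struct.unpack("<Q", mm[:8])
meta = json.loads(mm[8:8 + hdr_len])
start, end = meta["w"]["data_offsets"]
shape = meta["w"]["shape"]
view = np.frombuffer(mm, dtype=np.float32,
                     count=int(np.prod(shape)),
                     offset=8 + hdr_len + start).reshape(shape)
```

`view` is backed by the page cache: the OS faults pages in only when the array is actually touched.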
This is why HuggingFace models “load fast” on repeat visits: the first load paged tensors into the page cache, and subsequent loads are cache hits.
The tensor data region is aligned to a page boundary (the header is padded so the first tensor starts at a multiple of the OS page size), which is the concrete property that makes zero-copy possible. A CUDA loader can cudaHostRegister the mmapped region and DMA straight from the page cache into VRAM — no intermediate torch.load deserialization, no temporary host-side FP32 buffer. vLLM and TGI exploited this to cut 7B model cold-start from ~45s (pickle round-trip) to under a second on warm cache; the bottleneck becomes PCIe bandwidth, not Python. This is why every serious inference runtime requires safetensors on disk — the security story was the entry argument, but the loading-speed story is what kept it.
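The alignment itself is just whitespace. A sketch of the padding step, assuming a loader pads the header with trailing spaces (legal JSON whitespace, so the padded header still parses):

```python
import json
import mmap

PAGE = mmap.PAGESIZE  # OS page size, typically 4096

# Hypothetical one-tensor header for illustration.
header = {"w": {"dtype": "F32", "shape": [4], "data_offsets": [0, 16]}}
raw = json.dumps(header).encode()

# Pad with spaces so the 8-byte length prefix plus the header ends
# exactly on a page boundary; the first tensor byte then starts
# page-aligned, the property that enables zero-copy mapping.
total = 8 + len(raw)
padded_total = (total + PAGE - 1) // PAGE * PAGE
raw = raw.ljust(padded_total - 8, b" ")
```

The padding costs at most one page per file and buys page-granular mapping for every tensor after it.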