GPT-OSS
GPT-OSS is OpenAI's first open-weights release in over six years: two MoE models, 20B and 120B total parameters. The architecture confirms what outside teams had reverse-engineered about the closed GPT-series.
The last such release was GPT-2 in 2019. The architecture, as it turns out, is conventional: MoE for the MLP blocks, GQA for attention, SwiGLU activations, RoPE positional encoding. Nothing exotic. The contribution is the release, not the architecture, but that is itself a signal: the frontier open-weights ecosystem (DeepSeek, Kimi, Qwen, Llama 4) pushed closed labs to ship something in order to stay relevant in the open conversation.
Both variants are MoE: 20B total with roughly 3.6B active per token, 120B total with roughly 5B active. Those active-parameter counts are far smaller than DeepSeek-V3's 37B or Kimi K2's 32B, which suggests a tight top-k over relatively small experts, though the exact routing configuration wasn't detailed at release time. The MoE lesson covers the routing and expert-count math; plug in GPT-OSS's published numbers to see where it lands on the activate-to-total ratio curve.
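As a back-of-the-envelope check, the activate-to-total ratio is just active parameters over total parameters. A minimal sketch, using the approximate figures quoted above (the nominal 20B/120B totals, and DeepSeek-V3's published 37B-of-671B):

```python
def active_ratio(active_b: float, total_b: float) -> float:
    """Fraction of parameters an MoE touches per token (active / total)."""
    return active_b / total_b

# Approximate figures in billions, as quoted in the text above.
models = {
    "gpt-oss-20b": (3.6, 20.0),
    "gpt-oss-120b": (5.0, 120.0),
    "deepseek-v3": (37.0, 671.0),
}

for name, (active, total) in models.items():
    print(f"{name}: {active_ratio(active, total):.1%} active per token")
```

Note the asymmetry this exposes: the 120B's ratio (~4%) sits near DeepSeek-V3's (~5.5%), while the 20B's (~18%) is much denser per token.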
If you're choosing an open-weights model for inference today, GPT-OSS-20B sits in a useful spot: small enough to run on a single consumer GPU with AWQ quantisation, large enough to compete with Llama 3.1 70B-class instruction quality on many evals. The vLLM lesson covers the production deployment path. For pure architecture study, DeepSeek-V3 is still the richer target: GPT-OSS confirms the recipe but doesn't extend it.
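A rough weights-only memory estimate shows why 4-bit quantisation puts the 20B model in single-consumer-GPU territory. This is a sketch under the nominal ~20B parameter count; it ignores KV cache, activations, and quantisation scale/metadata overhead, all of which add real memory on top:

```python
def weight_memory_gib(n_params_billion: float, bits_per_weight: int) -> float:
    """Weights-only footprint in GiB; ignores KV cache, activations, scales."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30

for bits in (16, 4):
    gib = weight_memory_gib(20.0, bits)
    # bf16 lands around 37 GiB (datacenter GPU); 4-bit lands under 10 GiB,
    # leaving headroom for KV cache on a 16-24 GiB consumer card.
    print(f"{bits}-bit weights for ~20B params: {gib:.1f} GiB")
```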
- Sizes: 20B, 120B (MoE)
- Active per token: ~3.6B (20B) / ~5B (120B)
- Architecture: MoE, GQA, SwiGLU, RoPE
- Context: 128K
- License: Apache 2.0
- Act II · 22 min · 65 xp · Eight stations, two lanterns. Why DeepSeek-V3 claims 671B parameters but only activates 37B per token: top-k routing, shared experts, and the load-balance thermostat.
- Act VII · 12 min · 50 xp · Activation-aware quantization: AWQ vs GPTQ. How AWQ keeps the 1% of salient weights at full precision, and how GPTQ walks column-by-column propagating error; calibration matters differently.
- Act IX · 9 min · 45 xp · vLLM in production. Tensor parallelism, continuous batching, and PagedAttention in one config; deploy a production LLM endpoint with vLLM.