The models, one page each.
Each hub is a one-page architectural summary of one SLM or LLM family, cross-linked to the Microscale lessons that explain its choices. Sorted newest first.
- Moonshot AI · released February 2026 · 1 lesson
Kimi K2
Kimi K2 scales the DeepSeek-V3 recipe from 256 routed experts to 384, with one always-on shared expert and a top-8 router. Same bones, bigger room.
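The routing pattern K2 inherits from V3, top-k selection over a large pool of routed experts plus an always-on shared expert, can be sketched in a few lines. Toy dimensions and a simple softmax-over-the-selected-k renormalization here; the production router (bias terms, load balancing, grouping) is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, router_w, experts, shared_expert, k=8):
    """Route a token to its top-k experts, plus one always-on shared expert."""
    logits = x @ router_w                  # (num_experts,) router scores
    topk = np.argsort(logits)[-k:]         # indices of the k best-scoring experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()               # softmax over the selected k only
    out = shared_expert(x)                 # shared expert fires for every token
    for w, i in zip(weights, topk):
        out = out + w * experts[i](x)
    return out

# Toy setup: 384 routed experts, each a tiny linear map.
d, num_experts = 16, 384
router_w = rng.normal(size=(d, num_experts))
expert_mats = rng.normal(size=(num_experts, d, d)) / np.sqrt(d)
experts = [lambda x, M=M: x @ M for M in expert_mats]
shared_M = rng.normal(size=(d, d)) / np.sqrt(d)
shared_expert = lambda x: x @ shared_M

x = rng.normal(size=d)
y = moe_layer(x, router_w, experts, shared_expert, k=8)
```

Only 8 of the 384 routed expert matrices are ever touched per token, which is the whole point: total parameters grow with the expert count, active parameters do not.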
- OpenAI · released August 2025 · 3 lessons
GPT-OSS
GPT-OSS is OpenAI's first open-weights release in over six years: two MoE models, 20B and 120B total parameters. The architecture confirms what outside teams had reverse-engineered about the closed GPT-series.
- Hugging Face · released July 2025 · 3 lessons
SmolLM3
SmolLM3 is a 3B dense model that uses a NoPE layer every 4th block: three RoPE layers, one NoPE layer, repeat. The justification: not every attention layer needs an explicit positional signal to do its job.
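The 3:1 interleave is just a layer-index rule. A minimal sketch, assuming the NoPE layer falls on every 4th position (the exact offset within the period is an assumption here):

```python
def layer_uses_rope(layer_idx, period=4):
    """SmolLM3-style pattern: RoPE on three layers, NoPE on every fourth."""
    return (layer_idx + 1) % period != 0

# First 8 layers of the stack under this rule:
pattern = ["RoPE" if layer_uses_rope(i) else "NoPE" for i in range(8)]
```

The NoPE layers apply plain attention with no rotary embedding, relying on the causal mask and the surrounding RoPE layers for position information.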
- Meta · released April 2025 · 2 lessons
Llama 4
Llama 4 is Meta's first big MoE. Scout (17B active / 109B total) and Maverick (17B active / 400B total). Natively multimodal from the start. Scout extends context to 10M tokens.
- Google DeepMind · released March 2025 · 2 lessons
Gemma 3
Gemma 3's signature architectural choice is a 5:1 ratio of local-sliding-window attention to full attention. Most Gemma 3 layers only look at the last 4K tokens. Every sixth layer looks globally.
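The 5:1 split comes down to two mask shapes chosen by layer index. A toy sketch with a small window, assuming the global layer is every 6th in the stack (the exact placement is an assumption; Gemma 3's real window is 1K tokens for local layers in some configs, with details varying by model size):

```python
import numpy as np

def attention_mask(num_tokens, layer_idx, window=4, global_period=6):
    """Causal mask; every sixth layer attends globally, the rest see only the last `window` tokens."""
    q = np.arange(num_tokens)[:, None]     # query positions, column vector
    k = np.arange(num_tokens)[None, :]     # key positions, row vector
    causal = k <= q                        # no attending to the future
    if (layer_idx + 1) % global_period == 0:
        return causal                      # global layer: full causal attention
    return causal & (q - k < window)       # local layer: sliding window

m_local = attention_mask(8, layer_idx=0)   # window-limited
m_global = attention_mask(8, layer_idx=5)  # full causal
```

Because five of every six layers never attend beyond the window, the KV cache for those layers stays bounded regardless of context length, which is where most of the memory savings come from.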
- DeepSeek · released January 2025 · 2 lessons
DeepSeek-R1
DeepSeek-R1 is DeepSeek-V3's base model with a reasoning post-training stage on top. The architecture doesn't change. The training recipe does.
- Alibaba · released 2025 · 3 lessons
Qwen3
Qwen3 ships at every scale: 0.6B, 1.7B, 4B, 8B, 14B, 32B dense, plus a 30B-A3B and a 235B-A22B MoE. The pick-your-size family for open-weights work in 2026.
- DeepSeek · released December 2024 · 4 lessons
DeepSeek-V3
DeepSeek-V3 is the reference architecture this site keeps pointing at: MLA for attention, MoE for the MLP, and MTP bolted on the output for inference speed. 671B parameters total, 37B active per token.
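The core MLA idea, caching one small latent per token instead of full keys and values, fits in a few lines. Toy shapes only; real MLA adds a decoupled RoPE path and per-head structure that this sketch ignores:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 8, 10

# Down-project hidden states to a small latent; up-project to K and V at attention time.
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

h = rng.normal(size=(seq, d_model))
latent_cache = h @ W_down            # (seq, d_latent) — this is all that gets cached
k = latent_cache @ W_up_k            # (seq, d_model) keys, re-expanded on the fly
v = latent_cache @ W_up_v            # (seq, d_model) values, re-expanded on the fly

# Cache cost drops from 2 * d_model floats per token (K and V) to d_latent.
savings = (2 * d_model) / d_latent   # 16x in this toy setup
```

The up-projections can also be folded into the query and output projections at inference time, so the re-expansion is not paid as a separate matmul in practice.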
- Microsoft · released December 2024 · 2 lessons
Phi-4
Phi-4 is a 14B dense model trained to prove you can out-punch 70B-class competitors by spending compute on what the model learns, not just how much. Synthetic textbook data as the headline move.
- Meta · released September 2024 · 1 lesson
Llama 3.2
Llama 3.2 is where Meta entered the SLM conversation seriously. 1B and 3B text models built for edge inference; 11B and 90B vision models bolted on top.
- Meta · released July 2024 · 2 lessons
Llama 3.1
Llama 3.1 is the open-weights dense-transformer workhorse. 405B scales the frontier-class recipe; 70B and 8B are the teacher and student variants behind half of 2025's distillation pipelines.