The models, one page each.
Each hub is a one-page architectural summary of one SLM or LLM family, cross-linked to the Microscale lessons that explain its choices. Sorted newest first.
- Moonshot AI · released February 2026 · 1 lesson
Kimi K2
Kimi K2 scales the DeepSeek-V3 recipe from 256 routed experts to 384, with one always-on shared expert and a top-8 router. Same bones, bigger room.
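The routing pattern K2 inherits from V3, top-k selection over a large pool of routed experts plus an always-on shared expert, can be sketched in a few lines. Toy dimensions and a simple softmax-over-the-selected-k renormalization here; the production router (bias terms, load balancing, grouping) is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, router_w, experts, shared_expert, k=8):
    """Route a token to its top-k experts, plus one always-on shared expert."""
    logits = x @ router_w                  # (num_experts,) router scores
    topk = np.argsort(logits)[-k:]         # indices of the k best-scoring experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()               # softmax over the selected k only
    out = shared_expert(x)                 # shared expert fires for every token
    for w, i in zip(weights, topk):
        out = out + w * experts[i](x)
    return out

# Toy setup: 384 routed experts, each a tiny linear map.
d, num_experts = 16, 384
router_w = rng.normal(size=(d, num_experts))
expert_mats = rng.normal(size=(num_experts, d, d)) / np.sqrt(d)
experts = [lambda x, M=M: x @ M for M in expert_mats]
shared_M = rng.normal(size=(d, d)) / np.sqrt(d)
shared_expert = lambda x: x @ shared_M

x = rng.normal(size=d)
y = moe_layer(x, router_w, experts, shared_expert, k=8)
```

Only 8 of the 384 routed expert matrices are ever touched per token, which is the whole point: total parameters grow with the expert count, active parameters do not.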
- OpenAI · released August 2025 · 3 lessons
GPT-OSS
GPT-OSS is OpenAI's first open-weights release in over six years: two MoE models, 20B and 120B total parameters. The architecture confirms what outside teams had reverse-engineered about the closed GPT-series.
- Hugging Face · released July 2025 · 3 lessons
SmolLM3
SmolLM3 is a 3B dense model that uses a NoPE layer every 4th block: three RoPE layers, one NoPE layer, repeat. The justification: not every attention layer needs an explicit positional signal to do its job.
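The 3:1 interleave is just a layer-index rule. A minimal sketch, assuming the NoPE layer falls on every 4th position (the exact offset within the period is an assumption here):

```python
def layer_uses_rope(layer_idx, period=4):
    """SmolLM3-style pattern: RoPE on three layers, NoPE on every fourth."""
    return (layer_idx + 1) % period != 0

# First 8 layers of the stack under this rule:
pattern = ["RoPE" if layer_uses_rope(i) else "NoPE" for i in range(8)]
```

The NoPE layers apply plain attention with no rotary embedding, relying on the causal mask and the surrounding RoPE layers for position information.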
- Meta · released April 2025 · 2 lessons
Llama 4
Llama 4 is Meta's first big MoE. Scout (17B active / 109B total) and Maverick (17B active / 400B total). Natively multimodal from the start. Scout extends context to 10M tokens.
- Google DeepMind · released March 2025 · 2 lessons
Gemma 3
Gemma 3's signature architectural choice is a 5:1 ratio of local-sliding-window attention to full attention. Most Gemma 3 layers only look at the last 4K tokens. Every sixth layer looks globally.
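The 5:1 split comes down to two mask shapes chosen by layer index. A toy sketch with a small window, assuming the global layer is every 6th in the stack (the exact placement is an assumption; Gemma 3's real window is 1K tokens for local layers in some configs, with details varying by model size):

```python
import numpy as np

def attention_mask(num_tokens, layer_idx, window=4, global_period=6):
    """Causal mask; every sixth layer attends globally, the rest see only the last `window` tokens."""
    q = np.arange(num_tokens)[:, None]     # query positions, column vector
    k = np.arange(num_tokens)[None, :]     # key positions, row vector
    causal = k <= q                        # no attending to the future
    if (layer_idx + 1) % global_period == 0:
        return causal                      # global layer: full causal attention
    return causal & (q - k < window)       # local layer: sliding window

m_local = attention_mask(8, layer_idx=0)   # window-limited
m_global = attention_mask(8, layer_idx=5)  # full causal
```

Because five of every six layers never attend beyond the window, the KV cache for those layers stays bounded regardless of context length, which is where most of the memory savings come from.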
- DeepSeek · released January 2025 · 2 lessons
DeepSeek-R1
DeepSeek-R1 is DeepSeek-V3's base model with a reasoning post-training stage on top. The architecture doesn't change. The training recipe does.
- Alibaba · released 2025 · 3 lessons
Qwen3
Qwen3 ships at every scale: 0.6B, 1.7B, 4B, 8B, 14B, 32B dense, plus a 30B-A3B and a 235B-A22B MoE. The pick-your-size family for open-weights work in 2026.
- DeepSeek · released December 2024 · 4 lessons
DeepSeek-V3
DeepSeek-V3 is the reference architecture this site keeps pointing at: MLA for attention, MoE for the MLP, and MTP bolted on the output for inference speed. 671B parameters total, 37B active per token.
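The core MLA idea, caching one small latent per token instead of full keys and values, fits in a few lines. Toy shapes only; real MLA adds a decoupled RoPE path and per-head structure that this sketch ignores:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq = 64, 8, 10

# Down-project hidden states to a small latent; up-project to K and V at attention time.
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

h = rng.normal(size=(seq, d_model))
latent_cache = h @ W_down            # (seq, d_latent) — this is all that gets cached
k = latent_cache @ W_up_k            # (seq, d_model) keys, re-expanded on the fly
v = latent_cache @ W_up_v            # (seq, d_model) values, re-expanded on the fly

# Cache cost drops from 2 * d_model floats per token (K and V) to d_latent.
savings = (2 * d_model) / d_latent   # 16x in this toy setup
```

The up-projections can also be folded into the query and output projections at inference time, so the re-expansion is not paid as a separate matmul in practice.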
- Microsoft · released December 2024 · 2 lessons
Phi-4
Phi-4 is a 14B dense model trained to prove you can out-punch 70B-class competitors by spending compute on what the model learns, not just how much. Synthetic textbook data as the headline move.
- Meta · released September 2024 · 1 lesson
Llama 3.2
Llama 3.2 is where Meta entered the SLM conversation seriously. 1B and 3B text models built for edge inference; 11B and 90B vision models bolted on top.
- Meta · released July 2024 · 2 lessons
Llama 3.1
Llama 3.1 is the open-weights dense-transformer workhorse. 405B scales the frontier-class recipe; 70B and 8B are the teacher and student variants behind half of 2025's distillation pipelines.