the aha moment
Take one model (Qwen3-0.6B-Q4_K_M) and serve it through Ollama, llama.cpp-server, and (depending on your hardware) vLLM or MLX-LM. Measure cold-start time, TTFT, tok/s at batch=1, tok/s at batch=8, and peak memory for each. Find the crossover point where vLLM's batching advantage overtakes Ollama's simplicity — on YOUR hardware, not someone else's benchmark blog.
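TTFT and tok/s can both be captured with one timing harness wrapped around any streaming client, since Ollama, llama.cpp-server, and vLLM all stream tokens over HTTP. A minimal sketch of such a harness; the `fake_stream` generator below is a hypothetical stand-in for your client's streaming iterator, not part of any of these runtimes:

```python
import time

def benchmark_stream(token_iter):
    """Measure time-to-first-token and decode throughput for one request.

    token_iter: any iterator yielding generated tokens (or text chunks).
    Returns (ttft_seconds, tokens_per_second).
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if first is None:
            first = now  # first token arrived: this fixes TTFT
        count += 1
    end = time.perf_counter()
    ttft = first - start
    # Throughput over the decode phase only (after the first token),
    # so prefill time doesn't inflate or deflate tok/s.
    decode_time = end - first
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else float("nan")
    return ttft, tps

# Hypothetical stand-in: simulates a server streaming 20 tokens.
def fake_stream(n=20, delay=0.001):
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = benchmark_stream(fake_stream())
```

Swap `fake_stream()` for the streaming iterator your client library returns, and run the same prompt against each runtime to fill in the comparison table.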
the facts
- Time: 45–60 min
- Hardware: CPU · GPU · Mac
- Act: IX · Ship It
- Status: Coming soon
- Artifact: A runtime-comparison table + a recommendation for your specific hardware.
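Turning the comparison table into a recommendation is a one-liner once the measurements are in. A sketch with hypothetical throughput numbers (replace them with your own measured tok/s):

```python
# Hypothetical measured throughputs in tok/s, keyed by runtime and batch size.
# These numbers are illustrative only; yours will differ by hardware.
results = {
    "ollama": {1: 38.0, 8: 52.0},
    "vllm":   {1: 35.0, 8: 190.0},
}

def recommend(results, batch):
    """Return the runtime with the highest tok/s at the given batch size."""
    return max(results, key=lambda name: results[name][batch])

best_single = recommend(results, 1)  # interactive, one request at a time
best_batch = recommend(results, 8)   # concurrent / batched serving
```

With these made-up numbers the crossover shows up between batch=1 and batch=8: the simpler runtime wins single-stream, the batching engine wins under load.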
run it locally
Clone the labs repo and run this lab as a script or open it as a notebook:
git clone https://github.com/iqbal-sk/Microscale-labs.git
cd Microscale-labs
just setup-auto   # auto-detects CPU / CUDA / Mac
just run 12       # or: jupyter lab labs/12-inference-showdown/lab.py
Full install options (uv, pip, or the platform-specific CUDA paths) are in the labs README.
read alongside