the aha moment
Take one model (Qwen3-0.6B-Q4_K_M) and serve it through Ollama, llama.cpp-server, and (depending on your hardware) vLLM or MLX-LM. Measure cold-start time, TTFT, tok/s at batch=1, tok/s at batch=8, and peak memory for each. Find the crossover point where vLLM's batching advantage overtakes Ollama's simplicity — on YOUR hardware, not someone else's benchmark blog.
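TTFT and tok/s can both be captured with one timing harness wrapped around any streaming client, since Ollama, llama.cpp-server, and vLLM all stream tokens over HTTP. A minimal sketch of such a harness; the `fake_stream` generator below is a hypothetical stand-in for your client's streaming iterator, not part of any of these runtimes:

```python
import time

def benchmark_stream(token_iter):
    """Measure time-to-first-token and decode throughput for one request.

    token_iter: any iterator yielding generated tokens (or text chunks).
    Returns (ttft_seconds, tokens_per_second).
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if first is None:
            first = now  # first token arrived: this fixes TTFT
        count += 1
    end = time.perf_counter()
    ttft = first - start
    # Throughput over the decode phase only (after the first token),
    # so prefill time doesn't inflate or deflate tok/s.
    decode_time = end - first
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else float("nan")
    return ttft, tps

# Hypothetical stand-in: simulates a server streaming 20 tokens.
def fake_stream(n=20, delay=0.001):
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = benchmark_stream(fake_stream())
```

Swap `fake_stream()` for the streaming iterator your client library returns, and run the same prompt against each runtime to fill in the comparison table.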
the facts
- Time: 45–60 min
- Hardware: CPU · GPU · Mac
- Act: IX · Ship It
- Status: Coming soon
- Artifact: A runtime-comparison table + a recommendation for your specific hardware.
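Turning the comparison table into a recommendation is a one-liner once the measurements are in. A sketch with hypothetical throughput numbers (replace them with your own measured tok/s):

```python
# Hypothetical measured throughputs in tok/s, keyed by runtime and batch size.
# These numbers are illustrative only; yours will differ by hardware.
results = {
    "ollama": {1: 38.0, 8: 52.0},
    "vllm":   {1: 35.0, 8: 190.0},
}

def recommend(results, batch):
    """Return the runtime with the highest tok/s at the given batch size."""
    return max(results, key=lambda name: results[name][batch])

best_single = recommend(results, 1)  # interactive, one request at a time
best_batch = recommend(results, 8)   # concurrent / batched serving
```

With these made-up numbers the crossover shows up between batch=1 and batch=8: the simpler runtime wins single-stream, the batching engine wins under load.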
run it locally
Clone the labs repo and run this lab as a script or open it as a notebook:
git clone https://github.com/iqbal-sk/Microscale-labs.git
cd Microscale-labs
just setup-auto   # auto-detects CPU / CUDA / Mac
just run 12       # or: jupyter lab labs/12-inference-showdown/lab.py
Full install options (uv, pip, or the platform-specific CUDA paths) are in the labs README.
read alongside