Lutz Roeder · Dec 30, 2024 · Updated Dec 28, 2025

Tiny Models

How much code does it take to load pre-trained weights and run inference with a transformer language model? GPT-2 fits in 175 lines of PyTorch, and Gemma 3 in 200 lines of JAX. Both run on a CPU with no GPU required, making them a good starting point for understanding what's actually inside these models. The code is available in the lutzroeder/models repository on GitHub.

GPT-2

The GPT-2 architecture (2019) is straightforward: learned position embeddings and stacked transformer blocks of multi-head causal attention and an MLP, each wrapped in layer norm and a residual connection. The tokenizer is a hand-coded BPE implementation rather than an imported library. The model downloads OpenAI's pre-trained weights from Hugging Face and generates text with top-k sampling.
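A single GPT-2 block can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's code; the fused q/k/v projection and `is_causal` masking are one common way to write it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    # One GPT-2 transformer block: pre-layer-norm, causal
    # multi-head attention, then an MLP, each with a residual.
    def __init__(self, d, heads):
        super().__init__()
        self.heads = heads
        self.ln1 = nn.LayerNorm(d)
        self.qkv = nn.Linear(d, 3 * d)   # fused q, k, v projection
        self.proj = nn.Linear(d, d)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
        # split into heads: (batch, heads, time, head_dim)
        q, k, v = (z.view(b, t, self.heads, d // self.heads).transpose(1, 2)
                   for z in (q, k, v))
        # is_causal=True masks out future positions
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(y.transpose(1, 2).reshape(b, t, d))
        return x + self.mlp(self.ln2(x))
```

The full model stacks a dozen of these between the token/position embeddings and the final projection onto the vocabulary.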

$ python gpt2.py "The Eiffel tower is"
The Eiffel tower is a landmark in the heart of the city...
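The top-k sampling step mentioned above is small enough to show whole. A minimal sketch, not the repository's implementation:

```python
import torch

def sample_top_k(logits, k=40, temperature=1.0):
    # keep only the k largest logits, renormalize them,
    # and sample one token id from the truncated distribution
    values, indices = torch.topk(logits / temperature, k)
    probs = torch.softmax(values, dim=-1)
    return indices[torch.multinomial(probs, num_samples=1)].item()
```

Truncating to the top k tokens prevents the tail of the distribution from occasionally producing nonsense, while temperature trades determinism for variety.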

Gemma 3

Six years of architectural iteration separate the two models. The transformer skeleton is the same, but nearly every component has been replaced. Learned positions become rotary embeddings (RoPE), standard multi-head attention becomes grouped query attention with fewer KV heads, and the simple MLP becomes a gated MLP with two parallel projections. Some layers use sliding window attention instead of attending to the full context. The weights are loaded by parsing the safetensors file directly.
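Rotary embeddings replace the learned position table with a rotation applied to the query and key vectors themselves. A minimal sketch in JAX, using the split-halves rotation convention (one of two common layouts, not necessarily the one Gemma 3 ships with):

```python
import jax.numpy as jnp

def rope(x, positions, base=10000.0):
    # x: (seq, heads, head_dim); rotate pairs of dimensions by a
    # position-dependent angle instead of adding a learned vector
    half = x.shape[-1] // 2
    freqs = base ** (-jnp.arange(half) / half)      # (half,)
    angles = positions[:, None] * freqs[None, :]    # (seq, half)
    cos = jnp.cos(angles)[:, None, :]               # broadcast over heads
    sin = jnp.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return jnp.concatenate(
        [x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Because the rotation angle depends only on the position, the dot product between a rotated query and key depends only on their relative distance, which is what makes RoPE attractive for long contexts.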

$ python gemma3.py "What is the capital of France?"
The capital of France is Paris.
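The gated MLP mentioned above swaps the single hidden projection for two parallel ones, where an activated gate multiplies the other branch elementwise. A minimal sketch under the common GeLU-gated formulation (an assumption here, not lifted from the repository):

```python
import jax
import jax.numpy as jnp

def gated_mlp(x, w_gate, w_up, w_down):
    # two parallel projections: the GELU-activated gate multiplies
    # the up projection elementwise before projecting back down
    return (jax.nn.gelu(x @ w_gate) * (x @ w_up)) @ w_down
```

The gate lets the network modulate each hidden unit multiplicatively, which in practice outperforms a plain two-layer MLP at the same parameter count.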

PyTorch and JAX

The two implementations illustrate a difference in programming model. PyTorch is object-oriented: you subclass a module, store state in it, and call it to run a forward pass. JAX is functional: the forward pass is a pure function compiled with jax.jit, which traces the entire computation graph, and no mutable state lives inside the model. This forces a different way of thinking. Compilation makes execution noticeably faster, even on CPU.
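The JAX side of that contrast can be shown in a few lines: parameters are an explicit argument to a pure function, and the first call traces and compiles the whole computation. A toy example, not code from either model:

```python
import jax
import jax.numpy as jnp

# In PyTorch the weights would live on a module object; in JAX they
# are passed in explicitly, so the function stays pure and jit-able.
@jax.jit
def forward(params, x):
    return jnp.tanh(x @ params["w"] + params["b"])

params = {"w": jnp.ones((4, 2)), "b": jnp.zeros(2)}
y = forward(params, jnp.ones((1, 4)))  # traced and compiled on first call
```

Subsequent calls with the same shapes reuse the compiled graph, which is where the CPU speedup comes from.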

Try It

Like Tiny Agents, these are hands-on playgrounds. You can step through each layer, see how attention patterns differ between GPT-2 and Gemma 3, and understand what six years of transformer research actually changed in practice.