From Tokens to Chips
You ask an AI coding assistant to fix a failing test, and it does. But what actually happens between your prompt and the patch?
This is a guided technical primer on the compute stack underneath AI coding tools: tokens, vectors, model layers, GPUs, memory bandwidth, batching, KV cache, and chip design.
It is not a summary of the lectures. It is the missing map I wanted before watching them.
Start here
Who this is for
A bridge for software engineers using AI tools.
This guide is for you if:
- you use ChatGPT, Claude Code, Codex, Cursor, Gemini, or similar tools;
- you understand software systems, but not machine-learning infrastructure deeply;
- you keep hearing terms like KV cache, batching, memory bandwidth, HBM, MoE, and scaling laws;
- you want enough intuition to watch the Reiner Pope lectures without getting lost.
The frame
The collaboration has a compute stack underneath.
You ask an AI coding assistant to fix a failing test. It receives the issue, inspects files, looks at logs, proposes a patch, and explains the diff.
On the surface, this feels like collaboration. Underneath, your codebase becomes tokens. Tokens become numbers. Numbers move through model layers. Layers mostly do matrix operations. Matrix operations run on GPUs and AI accelerators. Memory movement becomes a bottleneck. Batching shapes cost. KV cache carries context forward. Chip design determines what is practical.
Running example
"The payment retry test is failing. Check the logs, inspect the billing worker, make the smallest safe fix, and explain the diff."
This one request is the thread we follow down the stack: from words, to tokens, to memory, to hardware, to product economics.
The map
The chatbot is the last mile of a much deeper machine.
Product surface
- AI coding assistant
Context becomes representation
- Prompt, code, logs, tests
- Tokens
- Vectors / embeddings
Model work
- Model layers
- Matrix multiplication
Serving hardware
- GPU / AI accelerator
- Memory bandwidth
- Batching
- KV cache
Silicon design
- Chip design
The useful product surface is only the top of the stack. Underneath it, text becomes numerical work, and numerical work has to move through real hardware.
First read path
Read these first.
Treat the rest as second-pass concepts.
- Token
- Vector / embedding
- Forward pass
- Matrix multiplication
- GPU
- Memory bandwidth
- Training vs inference
- Batching
- Prefill vs decode
- Attention
- KV cache
- Long context
- Why chips have a shape
Treat MoE, pipeline parallelism, scaling laws, Chinchilla, systolic arrays, cache/scratchpad, and logic gates as second-pass concepts. You do not need these on the first read. They help when you return to the lectures.
How trust is tracked
Claims, lectures, background, and metaphors stay separate.
Throughout the guide, I separate sourced technical claims from my own teaching metaphors. Lines like "long context is a memory bill", "logs are cheap for humans to paste and expensive for models to carry", and "the chatbot is the last mile of a much deeper machine" are analogies or interpretations, not source quotes.
- Source-backed fact: Directly grounded in a primary source.
- Lecture-derived claim: From Reiner Pope's Dwarkesh lectures.
- Standard technical background: Common infrastructure or ML framing.
- Teaching analogy / interpretation: My explanatory bridge, not a source quote.
How to use this
Read it in three passes, not as a glossary.
The goal is to make the lectures easier to enter. You are not trying to memorise chip terminology on the first read; you are building hooks.
First pass
Get the shape of the stack.
Follow the descent and the payment-retry example without opening every detail. Notice the route: text, vectors, layers, matrix work, memory, chips.
Second pass
Connect the concepts to the coding-assistant example.
Open the entries that explain why your prompt, logs, files, and tests become useful evidence or expensive context.
Third pass
Watch the lectures with handles.
Use the lecture maps as a checklist for batching, rooflines, KV cache, MoE, systolic arrays, and scratchpads.
If you only keep one idea on the first read, keep this one: an AI coding assistant is not just a clever text box. It is a chain of translations. Modern AI is not only about models getting smarter. It is about moving numerical information through hardware fast enough, cheaply enough, and reliably enough to become useful.
Which translation surprised you most: the moment text becomes mathematics, or the moment mathematics becomes silicon? If this guide changed where your mental model cracked, I'd love to hear where.
Concept ladder
The descent, told once.
One walk from your request down to silicon. Each stage opens with the intuition and the running example, then gives you reference entries: a plain definition, a technical nudge, why it matters, and the grounding.
Layer 1 of 5
Text becomes math
The model does not see code like a compiler. It sees token IDs, then vectors, then transformed vectors.
The assistant receives characters: prompt, code, file names, logs, errors, and prior conversation.
Tokenisation turns that material into token IDs. Prompt length is therefore real serving work, not just a UX detail.
Then token IDs become vectors. From there, the model is transforming numerical representations, not reading English or TypeScript directly.
Takeaway: Language becomes a shape that math can move around.
Stage 1
Token
A token is the unit of text the model actually processes. It can be a word, part of a word, punctuation, whitespace, or code syntax.
Appears in Lecture 1
Go deeper
Before your prompt reaches the model, the text is split into tokens and each token is mapped to an integer ID. Those IDs are the interface between human text and numerical computation.
File paths, stack traces, logs, and code snippets all become tokens before the model can use them as context.
Token count shapes context length, latency, and cost.
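Do not think of tokens as exactly words. Code and logs often split into far more tokens than their word count suggests, because identifiers, punctuation, and numbers break into many small pieces.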
OpenAI token docs.
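To make the "tokens are not words" point concrete, here is a minimal sketch of a toy tokenizer. It uses a tiny hypothetical vocabulary and greedy longest-match splitting, which is a stand-in I chose for readability; production tokenizers learn much larger vocabularies and use different algorithms. The names `VOCAB`, `tokenize`, and `to_ids` are illustrative only.

```python
# Hypothetical vocabulary for illustration; real tokenizers learn far larger ones.
VOCAB = ["payment", "retry", "test", "fail", "ed", "_", ":", " "]


def tokenize(text: str, vocab: list[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become their own token."""
    pieces = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            # No vocabulary piece matched here: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def to_ids(tokens: list[str], vocab: list[str]) -> list[int]:
    """Map each token to an integer ID; fallback characters get IDs past the vocabulary."""
    table = {piece: idx for idx, piece in enumerate(vocab)}
    return [table[t] if t in table else len(vocab) + ord(t) for t in tokens]


prose = "payment retry test failed"
log_line = "test_retry failed: 503"
for text in (prose, log_line):
    tokens = tokenize(text, VOCAB)
    print(len(text.split()), "words ->", len(tokens), "tokens", to_ids(tokens, VOCAB))
```

In this toy setup the log line has fewer words than the prose sentence but more tokens, because the underscore, colon, and digits each become their own piece.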
Stage 1
Parameter
A parameter is a learned number inside the model.
Appears in Lecture 1
Go deeper
Parameters are model weights learned during training and reused during inference to transform token representations.
When the assistant matches a common retry-loop bug pattern, that behaviour comes from patterns distributed across many learned weights.
Model size is partly a question of how many learned numbers must be stored, moved, and used.
A parameter is not a fact in a database. It is one number in a learned computation.
Standard neural-network background; Karpathy's Neural Networks: Zero to Hero is the intuition source. Reiner and Chinchilla become relevant later for model size, active parameters, and training trade-offs.
Stage 1
Vector / embedding
An embedding turns a token into a list of numbers that the model can operate on.
Go deeper
The token ID is mapped to a dense vector, usually floating-point values, placing text into a mathematical space.
The tokens for billing, retry, timeout, and failed test become vectors the model can compare and transform.
Embeddings are where language first becomes geometry.
An embedding is not a sentence summary by itself. It is a numerical representation used by later computation.
OpenAI embeddings guide.
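A small sketch of what "language becomes geometry" can mean in practice: each token gets a list of numbers, and closeness between lists can be measured. The three-dimensional vectors below are invented for illustration (real embeddings are learned and have hundreds or thousands of dimensions), and `EMBEDDINGS` and `cosine` are hypothetical names.

```python
import math

# Made-up 3-dimensional vectors; real embeddings are learned during training.
EMBEDDINGS = {
    "retry": [0.9, 0.1, 0.3],
    "timeout": [0.8, 0.2, 0.4],
    "logo": [-0.5, 0.9, -0.2],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, lower means less aligned."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print("retry vs timeout:", round(cosine(EMBEDDINGS["retry"], EMBEDDINGS["timeout"]), 3))
print("retry vs logo:   ", round(cosine(EMBEDDINGS["retry"], EMBEDDINGS["logo"]), 3))
```

With these invented values, "retry" sits closer to "timeout" than to "logo". That is the kind of relationship later layers can compare and transform.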
Stage 1
Forward pass
A forward pass is one run of inputs through the model to produce outputs.
Appears in Lecture 1
Go deeper
During inference, the model applies its layers to token representations and produces probabilities for next tokens.
The assistant processes the prompt and produces the next suggested word, code token, or explanation token through repeated forward passes.
Serving an AI coding tool is mostly about doing forward passes quickly and economically.
Inference is not training. The weights are read and used, not updated.
Standard LLM serving background and Lecture 1 framing.
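As a rough sketch of a forward pass, the toy below maps a token to a vector, applies one fixed weight matrix, and turns the resulting scores into next-token probabilities. A real model stacks many layers with attention in between; the vocabulary, `EMBED`, and `W_OUT` values are all invented for illustration.

```python
import math

VOCAB = ["retry", "timeout", "fix"]
# Hypothetical fixed weights: during inference these are read, never changed.
EMBED = {"retry": [1.0, 0.0], "timeout": [0.0, 1.0], "fix": [1.0, 1.0]}
W_OUT = [[0.2, 1.5], [1.0, 0.1], [0.5, 0.5]]  # one row of weights per vocabulary entry


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(token: str) -> list[float]:
    """One pass: token -> vector -> scores -> next-token probabilities."""
    hidden = EMBED[token]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]
    return softmax(logits)


probs = forward("retry")
best = VOCAB[probs.index(max(probs))]
print(dict(zip(VOCAB, (round(p, 3) for p in probs))), "-> most likely next token:", best)
```

Generating an answer repeats this kind of pass once per output token, which is why serving cost tracks the number of forward passes.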
Layer 2 of 5
Math becomes hardware work
The point is not to become a linear algebra expert. It is to notice that the workload has a shape hardware can exploit.
Once the work is numerical, the hardware question is simple: how fast can we do lots of similar arithmetic?
Matrix multiplication is built from tiny multiply-accumulate operations. The hard part is scale and data movement.
Takeaway: The assistant's fluency depends on industrial-scale arithmetic.
Stage 2
Matrix
A matrix is a grid of numbers.
Go deeper
Model layers hold their learned weights as large matrices and apply them to activations, the intermediate vectors computed for each token.
The assistant's internal representation of the failing test is repeatedly reshaped by matrices.
Once language becomes numbers, much of the work becomes matrix work.
The matrix is not the model's memory in a human sense. It is the shape of a computation.
Standard linear algebra background.
Stage 2
Matrix multiplication
Matrix multiplication combines grids of numbers to produce new grids or vectors.
Appears in Lecture 2
Go deeper
Transformer computation involves many large matrix and tensor operations, including projections, feed-forward layers, and attention-related operations.
Your prompt becomes batches of numerical operations that accelerators can perform in parallel.
This is why AI hardware is optimised around high-throughput numerical operations.
The model is not doing symbolic code review like a compiler. It is doing learned numerical transformations.
NVIDIA matrix multiplication guide.
Stage 2
Multiply-accumulate
Multiply two numbers, add the result to a running total, repeat at enormous scale.
Appears in Lecture 2
Go deeper
MAC operations are the tiny arithmetic core behind dot products and matrix multiplication.
A code explanation that feels conversational is built from vast numbers of these tiny arithmetic steps.
Chip design often starts with making this repeated operation fast and efficient.
The operation is simple; the scale, data movement, and scheduling are the hard parts.
Lecture 2 chip-design path from small operations upward.
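The sketch below writes matrix multiplication as nothing but multiply-accumulate steps and counts them. It is deliberately naive pure Python, not how an accelerator or a numerical library does it; `matmul` here is an illustrative helper.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> tuple[list[list[float]], int]:
    """Multiply two matrices using explicit multiply-accumulate steps, and count them."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
            out[i][j] = acc
    return out, macs


product, mac_count = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, "using", mac_count, "multiply-accumulates")
```

The count is rows × inner × cols, so it grows quickly: a 4096 × 4096 weight matrix applied to a single 4096-wide vector already needs about 16.8 million of these steps.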
Stage 2
GPU
A GPU is a processor built for doing many similar numerical operations in parallel.
Appears in Lecture 1, Lecture 2
Go deeper
Compared with CPUs, GPUs trade broad flexibility for massive parallel throughput and specialised matrix units.
Serving many coding-assistant requests means packing similar work onto accelerators efficiently.
LLM economics depends on how well model work maps to accelerator hardware.
A GPU is not magically fast for everything. It shines when the work can be parallelised and fed with data.
NVIDIA performance sources and Lecture 2 hardware discussion.
Layer 3 of 5
Hardware becomes product cost
A coding assistant has to serve real users under latency and cost constraints. That is where batching, memory bandwidth, and serving phases matter.
Batching can improve throughput, but it changes latency. Bigger models can help, but size, active parameters, memory placement, and communication all matter.
At frontier scale, chips and distributed systems decide which product features are practical.
Takeaway: The user experience is downstream of hardware economics.
Stage 3
Memory bandwidth
Memory bandwidth is how quickly data can be moved to where computation happens.
Appears in Lecture 1, Lecture 2
Go deeper
Even when arithmetic is fast, model serving can be limited by moving weights, activations, and cache data.
A long log pasted into the assistant is not just text. It is data that must be carried through memory systems.
Speed and cost are often bottlenecked by movement, not only arithmetic.
Do not assume more compute always fixes latency. The model may be waiting on memory.
NVIDIA inference optimisation and Lecture 1 roofline framing.
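A back-of-envelope roofline check, as a sketch: estimate the time a step would take if only arithmetic mattered, then if only memory traffic mattered, and take the larger. The hardware figures and model size below are round hypothetical numbers, not the specs of any real chip, and the "2 FLOPs per parameter per token" figure is a common approximation I am assuming.

```python
def step_time(flops: float, bytes_moved: float,
              peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> tuple[float, str]:
    """Simple roofline estimate: the slower of compute and memory sets the pace."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / bandwidth_bytes_per_s
    bound = "compute" if compute_time >= memory_time else "memory"
    return max(compute_time, memory_time), bound


# Hypothetical accelerator: 1e15 FLOP/s of arithmetic, 2e12 bytes/s of memory bandwidth.
PEAK_FLOPS = 1e15
BANDWIDTH = 2e12

# Hypothetical model: 10 billion parameters stored in 2 bytes each.
params = 10e9
weight_bytes = params * 2
flops_per_token = 2 * params  # assumed: one multiply and one add per parameter

seconds, bound = step_time(flops_per_token, weight_bytes, PEAK_FLOPS, BANDWIDTH)
print(f"one decode step for one request: {seconds * 1000:.1f} ms, {bound}-bound")
```

Under these assumptions a single-request decode step is limited by reading the weights, not by arithmetic, which is the situation the "model may be waiting on memory" warning describes.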
Stage 3
Training vs inference
Training creates or updates the model; inference uses the model to answer.
Appears in Lecture 1
Go deeper
Training optimises weights using large datasets. Inference runs forward passes with fixed weights for user requests.
Your coding assistant is usually doing inference when it proposes a patch.
Training and serving have different cost profiles and hardware pressures.
A model responding to you is not usually learning from that one interaction by changing its core weights.
Standard technical background and Lecture 1 serving focus.
Stage 3
Batching
Batching groups work together so hardware can be used more efficiently.
Appears in Lecture 1
Go deeper
Serving systems may combine multiple requests or token operations to improve throughput, at some cost in latency.
Your small code-fix request may share accelerator time with other users' requests.
Batching shapes cost, speed, and how products feel under load.
Bigger batches raise total throughput, but your individual request may wait just as long or longer.
Grounded in Reiner Pope's Lecture 1 and NVIDIA inference material.
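One way to see why batching helps is a small estimate: the weights are read from memory once per step regardless of how many requests share that step, while arithmetic grows with batch size. The sketch below assumes the same kind of simple roofline model and round hypothetical hardware numbers, and it ignores KV cache and activation traffic to keep the arithmetic visible.

```python
def decode_step(batch_size: int, params: float = 10e9, bytes_per_param: int = 2,
                peak_flops: float = 1e15, bandwidth: float = 2e12) -> tuple[float, float]:
    """Return (seconds per step, tokens per second) for one batched decode step."""
    flops = 2 * params * batch_size          # arithmetic scales with the batch
    weight_bytes = params * bytes_per_param  # weights are read once per step
    seconds = max(flops / peak_flops, weight_bytes / bandwidth)
    return seconds, batch_size / seconds


for batch in (1, 64, 1000):
    seconds, tokens_per_s = decode_step(batch)
    print(f"batch {batch:>4}: {seconds * 1000:6.1f} ms per step, {tokens_per_s:>8.0f} tokens/s")
```

With these made-up numbers, going from 1 to 64 requests leaves the step time unchanged while throughput rises 64-fold. Past the point where arithmetic becomes the limit, each step gets slower, so every user in the batch waits longer per token.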
Stage 3
Prefill vs decode
Prefill processes the prompt; decode generates the answer one token at a time.
Appears in Lecture 1
Go deeper
Prefill processes the input context in parallel. Decode repeatedly uses prior context to produce the next token.
Reading the issue, files, and logs is prefill-like. Writing the patch explanation is decode-like.
The two phases stress hardware differently and explain why long prompts and long answers feel different.
A model does not generate the whole response in one instant. Decode is sequential at the token level.
NVIDIA inference optimisation and Lecture 1.
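The two phases can be sketched as a loop. The model here is a stub (`fake_next_token` is a hypothetical placeholder that just invents a token name), so the only thing this shows is the shape of the work: one pass over the whole prompt, then one pass per generated token.

```python
from typing import Callable


def generate(prompt_tokens: list[str], max_new_tokens: int,
             next_token: Callable[[list[str]], str]) -> tuple[list[str], list[tuple[str, int]]]:
    """Run a prefill step over the prompt, then decode one token at a time."""
    context = list(prompt_tokens)
    # Prefill: all prompt tokens are processed together in one pass.
    passes = [("prefill", len(context))]
    for _ in range(max_new_tokens):
        # Decode: each new token depends on everything before it, so steps are sequential.
        context.append(next_token(context))
        passes.append(("decode", 1))
    return context[len(prompt_tokens):], passes


def fake_next_token(context: list[str]) -> str:
    """Stand-in for a real model; returns a placeholder token."""
    return f"tok{len(context)}"


answer, passes = generate(["fix", "the", "retry", "test"], 3, fake_next_token)
print(answer)
print(passes)
```

A long prompt makes the single prefill pass bigger; a long answer adds more sequential decode passes. That difference is one reason the two feel different to wait for.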
Layer 4 of 5
Context becomes memory
Engineers want the assistant to read the repo, remember the logs, and keep the diff in mind. That wish quickly becomes a memory-system question.
Attention helps tokens use information from other positions. That is how a failing assertion can connect to a billing worker, retry policy, and candidate patch.
Prefill processes the prompt. Decode writes the answer token by token. The KV cache helps, but it must still be stored and moved.
Stage 4
Attention
Attention lets tokens use information from other tokens in the context.
Appears in Lecture 1
Go deeper
Transformers compute query, key, and value representations so token positions can weight and combine information from other positions.
The assistant can connect the failing assertion to a billing-worker function shown many lines earlier.
Attention is central to why long prompts can influence later generated tokens.
Attention is not human attention. It is a learned numerical mechanism.
Attention Is All You Need, 3Blue1Brown, and Jay Alammar transformer explainers.
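Here is a minimal sketch of scaled dot-product attention for a single query, in the form described in Attention Is All You Need. In a real transformer the query, key, and value vectors come from learned projections; the hand-written vectors below are invented so the weighting is easy to follow.

```python
import math


def attend(query: list[float], keys: list[list[float]],
           values: list[list[float]]) -> tuple[list[float], list[float]]:
    """Score the query against each key, softmax the scores, and blend the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Two earlier positions: think "billing-worker function" and "unrelated comment".
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
query = [4.0, 0.0]  # the failing assertion "asks" for something like the first key

weights, output = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(x, 3) for x in output])
```

The query lines up with the first key, so most of the weight and most of the blended value come from that position.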
Stage 4
KV cache
The KV cache stores key/value information from previous tokens so generation can continue without recomputing everything.
Appears in Lecture 1
Go deeper
During decode, cached keys and values from earlier context are reused when producing later tokens.
If your prompt includes a stack trace, the model carries useful key/value state from that context while drafting the fix.
KV cache makes generation practical but turns context into a memory bill.
Cache helps avoid recomputation, but storing and reading it still costs memory.
Grounded in Reiner Pope's Lecture 1 and NVIDIA inference material.
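To put a rough number on the memory bill, the sketch below estimates KV cache size with the usual bookkeeping: one key vector and one value vector per token, per layer, per key/value head. The model dimensions are hypothetical, chosen only to give round results, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size for one request."""
    # The leading 2 counts one key vector plus one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model shape: 32 layers, 8 KV heads, 128 dimensions per head, 2-byte values.
per_token = kv_cache_bytes(32, 8, 128, context_tokens=1)
long_prompt = kv_cache_bytes(32, 8, 128, context_tokens=100_000)

print(f"per token:      {per_token / 1024:.0f} KiB")
print(f"100,000 tokens: {long_prompt / 1e9:.1f} GB for a single request")
```

The cost is linear in context length, so a pasted log that doubles the prompt also doubles the state that must be stored and read back on every decode step.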
Stage 4
Long context
Long context means the model can consider more tokens in one request.
Appears in Lecture 1
Go deeper
More context increases the amount of token state that must be processed, stored, and attended over.
Pasting an entire repository gives the assistant more evidence, but also a larger computation and memory problem.
Long context is useful, but it is not free.
The product problem is not "include everything"; it is "include the right context at the right time." Irrelevant logs can dilute focus and add cost.
Standard background; see the token, KV cache, and inference sources below.
Stage 4
Memory hierarchy / HBM
Modern accelerators use layers of memory, with fast memory close to compute and larger memory farther away.
Appears in Lecture 1, Lecture 2
Go deeper
HBM, caches, scratchpads, registers, and off-chip memory differ in capacity, bandwidth, latency, and control.
The assistant's useful context has to live somewhere while the model generates; where it lives affects speed.
AI chips are shaped as much by memory movement as by arithmetic.
Do not think of memory as one uniform pool.
Lecture 2 cache/scratchpad path and NVIDIA performance material.
Logs are cheap for a human to paste and expensive for a model to carry.
Layer 5 of 5
Scale shapes chips
These concepts bridge into Reiner Pope's frontier-scale discussion: racks, data movement, power, specialised chips, and training economics.
Second-pass concepts. You do not need these on the first read. They help when you return to the lectures.
Stage 5
Mixture of Experts
MoE models route tokens through selected expert subnetworks rather than using every parameter for every token.
Appears in Lecture 1
Go deeper
An MoE model may route a token through only some expert subnetworks. These experts are learned components, not necessarily human-readable specialists like code, SQL, or testing.
The routing choice is part of the learned computation, so only some internal pathways may be active for a given token.
MoE changes the relationship between model size, memory placement, communication, and work per token.
Total parameters and active parameters are not the same thing.
Lecture 1 model layout and active-parameter discussion.
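A small sketch of the total-versus-active distinction. It picks the top-scoring experts for one token and counts parameters under a simplified layout (some shared parameters plus equally sized experts). The router scores and sizes are invented, and top-k selection is one common routing scheme I am assuming, not a description of any particular model.

```python
def top_k_experts(router_scores: list[float], k: int) -> list[int]:
    """Pick the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


def moe_params(shared: int, num_experts: int, per_expert: int,
               experts_per_token: int) -> tuple[int, int]:
    """Return (total parameters stored, parameters used for one token)."""
    total = shared + num_experts * per_expert
    active = shared + experts_per_token * per_expert
    return total, active


# Hypothetical router output for one token across four experts.
print("experts chosen:", top_k_experts([0.10, 0.70, 0.05, 0.15], k=2))

# Hypothetical sizes: 2B shared parameters, 8 experts of 1B each, 2 experts per token.
total, active = moe_params(2_000_000_000, 8, 1_000_000_000, 2)
print(f"total: {total / 1e9:.0f}B parameters, active per token: {active / 1e9:.0f}B")
```

All of the parameters still have to be stored somewhere, but only the active share does arithmetic for a given token. That is why MoE changes memory placement and communication as well as work per token.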
Stage 5
Pipeline parallelism
Pipeline parallelism splits model work across multiple devices or stages.
Appears in Lecture 1
Go deeper
Different parts of a model can run on different accelerators, with data passed between stages.
A frontier model may be too large or too busy for one chip, so serving becomes a distributed systems problem.
Scaling AI is not just a single-chip problem. It is scheduling, networking, and utilisation.
Parallelism adds communication costs and coordination complexity.
Lecture 1 discussion of splitting work across racks.
Stage 5
Scaling laws
Scaling laws describe patterns between model size, data, compute, and performance.
Appears in Lecture 1
Go deeper
They are empirical relationships that guide training strategy and resource allocation.
They help answer whether to train a bigger model, use more data, or spend compute differently.
They turn frontier AI into a planning and economics question, not only an algorithm question.
A scaling law is not a guarantee for every architecture, dataset, or product use case.
Kaplan et al. for broader neural language model scaling laws; Chinchilla and Lecture 1 for compute-optimal model-size-versus-training-token trade-offs.
Stage 5
Chinchilla
Chinchilla is shorthand for compute-optimal training trade-offs between model size and training tokens.
Appears in Lecture 1
Go deeper
The Chinchilla work argued many large models were undertrained relative to their parameter count.
It reframes the question from 'how big is the model?' to 'how was compute allocated?'
Training strategy affects the models that later become practical to serve.
Bigger is not automatically the best use of compute.
Training Compute-Optimal Large Language Models.
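To see what "how was compute allocated?" looks like as arithmetic, here is a sketch using two widely quoted approximations that I am assuming rather than taking from this guide: training compute is roughly 6 × parameters × tokens, and a Chinchilla-style compute-optimal run uses roughly 20 training tokens per parameter. Both are rules of thumb, not exact laws.

```python
import math

FLOPS_PER_PARAM_PER_TOKEN = 6  # assumed: common approximation for training compute
TOKENS_PER_PARAM = 20          # assumed: rough Chinchilla-style rule of thumb


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a given model size and token count."""
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Split a compute budget into (parameters, tokens) under the assumed ratio."""
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


budget = training_flops(70e9, 1.4e12)  # roughly the published Chinchilla configuration
params, tokens = compute_optimal_split(budget)
print(f"budget: {budget:.2e} FLOPs -> {params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
```

The same budget could instead fund a much larger model trained on fewer tokens. The Chinchilla argument is that such a model would be undertrained for its size.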
Stage 5
Logic gates
Logic gates are tiny physical circuits that implement basic operations on bits.
Appears in Lecture 2
Go deeper
Chip design builds upward from transistors and gates to arithmetic units, memory, and dataflow.
The cloud service answering your prompt ultimately depends on physical circuits switching.
Lecture 2 grounds AI in physics and hardware, not abstraction alone.
The cloud is not weightless. It is machines, power, heat, and silicon.
Lecture 2 chip design from the bottom up.
Stage 5
Systolic arrays
A systolic array is a grid of compute units that rhythmically passes data through for matrix work.
Appears in Lecture 2
Go deeper
It choreographs data movement so multiply-accumulate work can happen with high reuse and throughput.
Instead of sending every number back and forth to distant memory, the chip moves data through a local pattern.
It is a concrete example of hardware taking the shape of matrix multiplication.
It is not just more cores. It is a dataflow design.
Lecture 2 and matrix multiplication sources.
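A rough sketch of the timing idea behind a systolic array. It assumes an output-stationary layout where cell (i, j) keeps a running total and operands arrive skewed by one cycle per row and column. The code models the schedule (which operand pair reaches which cell on which cycle), not the actual registers and wires, and it is my simplification rather than a design from the lecture.

```python
def systolic_matmul(a: list[list[float]], b: list[list[float]]) -> tuple[list[list[float]], int]:
    """Simulate an output-stationary systolic schedule; return (result, cycles used)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0.0] * cols for _ in range(rows)]
    cycles = rows + cols + inner - 2
    for t in range(cycles):
        # Every cell does at most one multiply-accumulate per cycle, all in parallel.
        for i in range(rows):
            for j in range(cols):
                k = t - i - j  # which operand pair reaches cell (i, j) on this cycle
                if 0 <= k < inner:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, "in", cycles, "cycles")
```

Each value enters at an edge and is handed from neighbour to neighbour, so it is fetched from farther memory once and reused across a whole row or column of cells.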
Stage 5
Cache vs scratchpad
Cache is usually hardware-managed nearby memory; scratchpad is explicitly managed nearby memory, often by software, compiler, or runtime.
Appears in Lecture 2
Go deeper
Both put data close to compute, but they differ in who controls placement and movement.
Accelerator efficiency depends on keeping the right model data close at the right time.
This distinction explains why chip design and compiler/runtime design are linked.
Nearby memory is scarce, so placement choices matter.
Lecture 2 memory hierarchy discussion.
Stage 5
Why chips have a shape
Chips reflect the workloads they are built to run.
Appears in Lecture 2
Go deeper
Area, power, memory bandwidth, interconnect, arithmetic units, and programmability trade off against each other.
AI accelerators look the way they do because matrix work and data movement dominate the economics.
Understanding chips helps explain what future AI coding tools can practically do.
No chip is best at everything. Specialisation buys performance by giving something up.
Lecture 2 overall frame.
Lecture 1 · The destination
How to watch Lecture 1: The math behind how LLMs are trained and served
Use this lecture to connect model behaviour to serving economics. The core move is to stop asking only "what can the model do?" and start asking "what work happens per token, where is the bottleneck, and how does the serving system keep the accelerator useful?"
Open Lecture 1 source
Before watching
Keep the payment retry prompt in mind. The prompt-read phase is the system absorbing evidence. The answer-write phase is the system generating new tokens while carrying earlier context forward.
What should click
Latency, throughput, and cost are not afterthoughts. They are shaped by model weights, active parameters, batching, memory bandwidth, KV cache, and whether the workload is compute-limited or memory-limited.
| Lecture area | What to listen for | Concepts from this guide |
|---|---|---|
| Batch size | How grouping work improves accelerator use while changing latency. | batching, GPU |
| Token cost | Why prompt length and generated length become serving work. | token, prefill/decode |
| Roofline analysis | Whether the bottleneck is arithmetic throughput or memory movement. | GPU, memory bandwidth |
| Memory bandwidth | Why moving weights, activations, and cache data can dominate latency. | memory bandwidth, HBM |
| Prefill/decode | The difference between reading context and writing the answer. | prefill vs decode, forward pass |
| KV cache | How stored key/value state makes decode practical and memory-hungry. | attention, KV cache |
| Long-context cost | Why larger context windows create more state to process and carry. | long context, memory hierarchy |
| MoE | How sparse expert routing changes active parameters and placement. | MoE, active parameters |
| Pipeline parallelism | How frontier models get split across devices or racks. | pipeline parallelism |
| Chinchilla | Why compute allocation matters, not only model size. | scaling laws, Chinchilla |
Lecture 2 · The destination
How to watch Lecture 2: Chip design from the bottom up
Use this lecture to bring the abstraction all the way down to silicon. The surprising lesson is that the shape of the chip follows the shape of the workload: repeated arithmetic, scarce nearby memory, expensive data movement, and the need to keep many operations flowing.
Open Lecture 2 source
Before watching
Do not start with the whole GPU. Start with one tiny operation: multiply, add, repeat. Then ask what must be built around that operation so it can happen billions or trillions of times usefully.
What should click
Systolic arrays, scratchpads, caches, and specialised accelerators are not random hardware trivia. They are ways of arranging compute and memory so matrix-heavy AI workloads waste less time moving data.
| Lecture area | What to listen for | Concepts from this guide |
|---|---|---|
| Logic gates | How physical circuits become the base of computation. | logic gates |
| Matrix multiplication | Why model work reduces to repeated structured numerical operations. | matrix multiplication |
| Multiply-accumulate | The tiny arithmetic step repeated at enormous scale. | multiply-accumulate |
| Systolic arrays | Matrix multiplication as choreographed local dataflow. | systolic arrays |
| Cache/scratchpad | Who controls nearby memory and why placement matters. | cache vs scratchpad, memory hierarchy |
| CPU vs GPU | Flexible control flow versus high-volume parallel arithmetic. | GPU |
| GPU as matrix machine | Why modern accelerators are shaped around matrix-heavy work. | GPU, matrix multiplication |
| AI accelerator design | How area, power, memory, interconnect, and programmability trade off. | why chips have a shape |
Source spine
The trail this guide leans on.
- Reiner Pope / Dwarkesh - The math behind how LLMs are trained and served
- Reiner Pope / Dwarkesh - Chip design from the bottom up
- OpenAI token docs
- OpenAI embeddings guide
- Attention Is All You Need
- Kaplan et al. / OpenAI - Scaling Laws for Neural Language Models
- Chinchilla / Training Compute-Optimal Large Language Models
- NVIDIA matrix multiplication guide
- NVIDIA inference optimisation
- Karpathy - Neural Networks: Zero to Hero
- 3Blue1Brown - Transformers, the tech behind LLMs
- 3Blue1Brown - Attention in transformers
- Jay Alammar - The Illustrated Transformer
- Jay Alammar - The Illustrated GPT-2
Glossary
Small definitions for repeated terms.
- token: A chunk of text or code represented as an ID for model processing.
- parameter: A learned number in the model's weights.
- vector: An ordered list of numbers.
- embedding: A vector representation of a token or piece of content.
- forward pass: A run through the model that produces output probabilities.
- inference: Using a trained model to answer a request.
- training: Updating model weights from data and objective functions.
- matrix: A rectangular grid of numbers.
- matrix multiplication: The operation that combines matrices and vectors across model layers.
- multiply-accumulate: Multiply values and add the result into a running sum.
- CPU: A general-purpose processor optimised for flexible control flow.
- GPU: A parallel processor optimised for high-volume numerical work.
- TPU: A tensor-processing accelerator designed for machine-learning workloads.
- AI accelerator: Hardware specialised for AI math and data movement.
- memory bandwidth: The rate at which data can move through memory systems.
- HBM: High Bandwidth Memory, used near accelerators for high-throughput data access.
- batching: Grouping work together to improve hardware utilisation.
- prefill: The phase that processes the input prompt/context.
- decode: The phase that generates output tokens step by step.
- attention: A transformer mechanism for relating tokens to other tokens in context.
- query: A learned representation used to ask what other token information matters.
- key: A learned representation used for matching against queries.
- value: A learned representation carrying information to combine after attention weights are computed.
- KV cache: Stored key/value state from prior tokens used during generation.
- long context: A larger token window available to the model in one request.
- MoE: Mixture of Experts, a model layout that activates selected expert pathways.
- pipeline parallelism: Splitting model work across stages or devices.
- scaling laws: Empirical relationships among compute, data, model size, and performance.
- Chinchilla: A compute-optimal training result about balancing model size and training tokens.
- logic gate: A basic circuit element that computes with bits.
- systolic array: A grid-like dataflow design for efficient matrix operations.
- cache: Hardware-managed nearby memory.
- scratchpad: Explicitly managed nearby memory, often controlled by software, compiler, or runtime.
Appendix
How this guide is grounded.
These are the analogies in the guide. They rest on real grounding, but the framing is mine.
| Teaching analogy | What it rests on | Source trail |
|---|---|---|
| Long context is a memory bill | KV cache plus memory bandwidth: more context means more key/value state to store and move, not just more text. | |
| Logs are cheap for humans to paste and expensive for models to carry | Token and context cost: pasted text becomes tokens that must be processed, cached, and attended over. | |
| The chatbot is the last mile of a much deeper machine | The full descent: the product surface sits on top of tokens, matrix work, memory movement, and silicon. | |
| An AI coding assistant is a chain of translations | The text-to-math-to-hardware path: each stage hands a representation to the next. | |