From Tokens to Chips

You ask an AI coding assistant to fix a failing test, and it does. But what actually happens between your prompt and the patch?

This is a guided technical primer on the compute stack underneath AI coding tools: tokens, vectors, model layers, GPUs, memory bandwidth, batching, KV cache, and chip design.

It is not a summary of the lectures. It is the missing map I wanted before watching them.

Start here

Who this is for

A bridge for software engineers using AI tools.

This guide is for you if:

you use ChatGPT, Claude Code, Codex, Cursor, Gemini, or similar tools;
you understand software systems, but not machine-learning infrastructure deeply;
you keep hearing terms like KV cache, batching, memory bandwidth, HBM, MoE, and scaling laws;
you want enough intuition to watch the Reiner Pope lectures without getting lost.

The frame

The collaboration has a compute stack underneath.

You ask an AI coding assistant to fix a failing test. It receives the issue, inspects files, looks at logs, proposes a patch, and explains the diff.

On the surface, this feels like collaboration. Underneath, your codebase becomes tokens. Tokens become numbers. Numbers move through model layers. Layers mostly do matrix operations. Matrix operations run on GPUs and AI accelerators. Memory movement becomes a bottleneck. Batching shapes cost. KV cache carries context forward. Chip design determines what is practical.

Running example

"The payment retry test is failing. Check the logs, inspect the billing worker, make the smallest safe fix, and explain the diff."

This one request is the thread we follow down the stack: from words, to tokens, to memory, to hardware, to product economics.

The map

The chatbot is the last mile of a much deeper machine.

Product surface

AI coding assistant

Context becomes representation

Prompt, code, logs, tests
Tokens
Vectors / embeddings

Model work

Model layers
Matrix multiplication

Serving hardware

GPU / AI accelerator
Memory bandwidth
Batching
KV cache

Silicon design

Chip design

The useful product surface is only the top of the stack. Underneath it, text becomes numerical work, and numerical work has to move through real hardware.

First read path

Read these first.

Treat the rest as second-pass concepts.

Token
Vector / embedding
Forward pass
Matrix multiplication
GPU
Memory bandwidth
Training vs inference
Batching
Prefill vs decode
Attention
KV cache
Long context
Why chips have a shape

Treat MoE, pipeline parallelism, scaling laws, Chinchilla, systolic arrays, cache/scratchpad, and logic gates as second-pass concepts. You do not need these on the first read. They help when you return to the lectures.

How trust is tracked

Claims, lectures, background, and metaphors stay separate.

Throughout the guide, I separate sourced technical claims from my own teaching metaphors. Lines like "long context is a memory bill", "logs are cheap for humans to paste and expensive for models to carry", and "the chatbot is the last mile of a much deeper machine" are analogies or interpretations, not source quotes.

Source-backed fact: Directly grounded in a primary source.
Lecture-derived claim: From Reiner Pope's Dwarkesh lectures.
Standard technical background: Common infrastructure or ML framing.
Teaching analogy / interpretation: My explanatory bridge, not a source quote.

How to use this

Read it in three passes, not as a glossary.

The goal is to make the lectures easier to enter. You are not trying to memorise chip terminology on the first read; you are building hooks.

First pass

Get the shape of the stack.

Follow the descent and the payment-retry example without opening every detail. Notice the route: text, vectors, layers, matrix work, memory, chips.

Second pass

Connect the concepts to the coding-assistant example.

Open the entries that explain why your prompt, logs, files, and tests become useful evidence or expensive context.

Third pass

Watch the lectures with handles.

Use the lecture maps as a checklist for batching, rooflines, KV cache, MoE, systolic arrays, and scratchpads.

If you only keep one idea on the first read, keep this one: an AI coding assistant is not just a clever text box. It is a chain of translations. Modern AI is not only about models getting smarter. It is about moving numerical information through hardware fast enough, cheaply enough, and reliably enough to become useful.

Which translation surprised you most: the moment text becomes mathematics, or the moment mathematics becomes silicon? If this guide changed where your mental model cracked, I'd love to hear where.

Concept ladder

The descent, told once.

One walk from your request down to silicon. Each stage opens with the intuition and the running example, then gives you reference entries: a plain definition, a technical nudge, why it matters, and the grounding.

Layer 1 of 5

Text becomes math

The model does not see code like a compiler. It sees token IDs, then vectors, then transformed vectors.

The assistant receives characters: prompt, code, file names, logs, errors, and prior conversation.

Tokenisation turns that material into token IDs. Prompt length is therefore real serving work, not just a UX detail.

Then token IDs become vectors. From there, the model is transforming numerical representations, not reading English or TypeScript directly.

Takeaway: Language becomes a shape that math can move around.

Stage 1

Token

Source-backed fact

A token is the unit of text the model actually processes. It can be a word, part of a word, punctuation, whitespace, or code syntax.

Appears in Lecture 1

Go deeper

Slightly technical

Before your prompt reaches the model's numerical computation, a tokeniser splits the text into tokens and maps each one to an integer ID. Those IDs are the interface between human text and numerical computation.
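To make the text-to-ID step concrete, here is a toy sketch in Python. It is not how production tokenisers work (those typically use learned subword vocabularies); the regex split, the `toy_tokenise` name, and the vocabulary built on the fly are all assumptions made for illustration.

```python
import re

def toy_tokenise(text, vocab):
    """Split text into crude pieces and map each piece to an integer ID.

    Toy scheme (an assumption for illustration): runs of letters or digits,
    runs of whitespace, and single punctuation characters are one token each.
    """
    pieces = re.findall(r"[A-Za-z0-9]+|\s+|[^A-Za-z0-9\s]", text)
    ids = []
    for piece in pieces:
        if piece not in vocab:
            vocab[piece] = len(vocab)  # assign the next free ID
        ids.append(vocab[piece])
    return pieces, ids

vocab = {}
pieces, ids = toy_tokenise("retry_count += 1", vocab)
print(pieces)
print(ids)
```

Even in this toy scheme, three space-separated chunks of code become eight tokens, which is the sense in which token count and word count drift apart.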

Software-engineering example

File paths, stack traces, logs, and code snippets all become tokens before the model can use them as context.

Why it matters

Token count shapes context length, latency, and cost.

Common trap

Do not think of tokens as exactly words. Code and logs often split into many more tokens than their word count suggests.

Grounding

OpenAI token docs.

Source trail

OpenAI token docs; Dwarkesh / Reiner: LLM math and serving

Stage 1

Parameter

Standard background

A parameter is a learned number inside the model.

Appears in Lecture 1

Go deeper

Slightly technical

Parameters are model weights learned during training and reused during inference to transform token representations.

Software-engineering example

When the assistant matches a common retry-loop bug pattern, that behaviour comes from patterns distributed across many learned weights.

Why it matters

Model size is partly a question of how many learned numbers must be stored, moved, and used.

Common trap

A parameter is not a fact in a database. It is one number in a learned computation.

Grounding

Standard neural-network background; Karpathy's Neural Networks: Zero to Hero is the intuition source. Reiner and Chinchilla become relevant later for model size, active parameters, and training trade-offs.

Source trail

Karpathy: Zero to Hero; Dwarkesh / Reiner: LLM math and serving; Chinchilla paper

Stage 1

Vector / embedding

Source-backed fact

An embedding turns a token into a list of numbers that the model can operate on.

Go deeper

Slightly technical

The token ID is mapped to a dense vector, usually floating-point values, placing text into a mathematical space.
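A minimal sketch of that mapping, assuming a tiny vocabulary and a four-dimensional space. Real models learn the table values during training; here random numbers stand in for them, and the names `make_embedding_table` and `embed` are hypothetical.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Build a table with one vector per token ID.

    Random values stand in for learned ones (an assumption for illustration).
    """
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Embedding lookup: each token ID selects one row of the table."""
    return [table[token_id] for token_id in token_ids]

table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed([3, 7, 3], table)
print(len(vectors), "vectors of length", len(vectors[0]))
```

The same ID always selects the same row, so the first and third vectors here are identical. Any difference between two uses of the same token has to come from later layers.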

Software-engineering example

The tokens for billing, retry, timeout, and failed test become vectors the model can compare and transform.

Why it matters

Embeddings are where language first becomes geometry.

Common trap

An embedding is not a sentence summary by itself. It is a numerical representation used by later computation.

Grounding

OpenAI embeddings guide.

Source trail

OpenAI embeddings guide; Jay Alammar: Illustrated Transformer

Stage 1

Forward pass

Standard background

A forward pass is one run of inputs through the model to produce outputs.

Appears in Lecture 1

Go deeper

Slightly technical

During inference, the model applies its layers to token representations and produces probabilities for next tokens.
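The last step of that description, turning scores into next-token probabilities, can be sketched as follows. This is a single toy output layer with hand-picked weights, not a real model; `toy_forward` and its inputs are hypothetical.

```python
import math

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_forward(hidden, output_weights):
    """Score every vocabulary entry against the hidden vector, then normalise.

    hidden: the vector for the current position.
    output_weights: one row of weights per vocabulary entry (hypothetical values).
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(logits)

probs = toy_forward([1.0, 0.0], [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]])
next_token = max(range(len(probs)), key=probs.__getitem__)
print(probs, next_token)
```

A real forward pass runs many layers before this point, but the output has the same shape: one probability per vocabulary entry.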

Software-engineering example

The assistant processes the prompt and produces the next suggested word, code token, or explanation token through repeated forward passes.

Why it matters

Serving an AI coding tool is mostly about doing forward passes quickly and economically.

Common trap

Inference is not training. The weights are used, not broadly rewritten.

Grounding

Standard LLM serving background and Lecture 1 framing.

Source trail

Dwarkesh / Reiner: LLM math and serving; NVIDIA inference optimisation

Layer 2 of 5

Math becomes hardware work

The point is not to become a linear algebra expert. It is to notice that the workload has a shape hardware can exploit.

Once the work is numerical, the hardware question is simple: how fast can we do lots of similar arithmetic?

Matrix multiplication is built from tiny multiply-accumulate operations. The hard part is scale and data movement.

Takeaway: The assistant's fluency depends on industrial-scale arithmetic.

Stage 2

Matrix

Standard background

A matrix is a grid of numbers.

Go deeper

Slightly technical

Model layers transform vectors using large matrices of learned weights. The intermediate results, called activations, are also held as matrices.

Software-engineering example

The assistant's internal representation of the failing test is repeatedly reshaped by matrices.

Why it matters

Once language becomes numbers, much of the work becomes matrix work.

Common trap

The matrix is not the model's memory in a human sense. It is the shape of a computation.

Grounding

Standard linear algebra background.

Source trail

NVIDIA matrix multiplication guide; Karpathy: Zero to Hero

Stage 2

Matrix multiplication

Source-backed fact

Matrix multiplication combines grids of numbers to produce new grids or vectors.

Appears in Lecture 2

Go deeper

Slightly technical

Transformer computation involves many large matrix and tensor operations, including projections, feed-forward layers, and attention-related operations.

Software-engineering example

Your prompt becomes batches of numerical operations that accelerators can perform in parallel.

Why it matters

This is why AI hardware is optimised around high-throughput numerical operations.

Common trap

The model is not doing symbolic code review like a compiler. It is doing learned numerical transformations.

Grounding

NVIDIA matrix multiplication guide.

Source trail

NVIDIA matrix multiplication guide; Dwarkesh / Reiner: chip design

Stage 2

Multiply-accumulate

Lecture-derived claim

Multiply two numbers, add the result to a running total, repeat at enormous scale.

Appears in Lecture 2

Go deeper

Slightly technical

MAC operations are the tiny arithmetic core behind dot products and matrix multiplication.
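A short sketch of how the MAC sits inside a dot product and a matrix multiplication. The function names are mine, and plain Python loops are used only to make each step visible; real hardware runs these steps in parallel.

```python
def dot_with_macs(xs, ws):
    """Dot product written as explicit multiply-accumulate steps."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc += x * w  # one multiply-accumulate (MAC)
    return acc

def matmul_counting_macs(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, counting MACs."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    macs = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
                macs += 1
    return out, macs

product, macs = matmul_counting_macs([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, macs)
```

The count is always m x k x n. Pushing one vector through a single 4096 x 4096 weight matrix is already about 16.8 million MACs, which is where the scale comes from.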

Software-engineering example

A code explanation that feels conversational is built from vast numbers of these tiny arithmetic steps.

Why it matters

Chip design often starts with making this repeated operation fast and efficient.

Common trap

The operation is simple; the scale, data movement, and scheduling are the hard parts.

Grounding

Lecture 2 chip-design path from small operations upward.

Source trail

Dwarkesh / Reiner: chip design; NVIDIA matrix multiplication guide

Stage 2

GPU

Standard background

A GPU is a processor built for doing many similar numerical operations in parallel.

Appears in Lecture 1, Lecture 2

Go deeper

Slightly technical

Compared with CPUs, GPUs trade broad flexibility for massive parallel throughput and specialised matrix units.

Software-engineering example

Serving many coding-assistant requests means packing similar work onto accelerators efficiently.

Why it matters

LLM economics depends on how well model work maps to accelerator hardware.

Common trap

A GPU is not magically fast for everything. It shines when the work can be parallelised and fed with data.

Grounding

NVIDIA performance sources and Lecture 2 hardware discussion.

Source trail

Dwarkesh / Reiner: LLM math and serving; Dwarkesh / Reiner: chip design; NVIDIA matrix multiplication guide

Layer 3 of 5

Hardware becomes product cost

A coding assistant has to serve real users under latency and cost constraints. That is where batching, memory bandwidth, and serving phases matter.

Batching can improve throughput, but it changes latency. Bigger models can help, but size, active parameters, memory placement, and communication all matter.

At frontier scale, chips and distributed systems decide which product features are practical.

Takeaway: The user experience is downstream of hardware economics.

Stage 3

Memory bandwidth

Lecture-derived claim

Memory bandwidth is how quickly data can be moved to where computation happens.

Appears in Lecture 1, Lecture 2

Go deeper

Slightly technical

Even when arithmetic is fast, model serving can be limited by moving weights, activations, and cache data.
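One way to see this is a simple roofline estimate: the run time is bounded by whichever is slower, the arithmetic or the data movement. The sketch below assumes compute and memory traffic overlap perfectly, and the hardware figures are round hypothetical numbers, not the specification of any real chip.

```python
def roofline_time(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Lower bound on run time under a simple roofline model.

    Assumes compute and memory traffic overlap perfectly, so the slower
    of the two sets the time.
    """
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / bandwidth_bytes_per_s
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound

# Hypothetical accelerator, round numbers chosen for illustration only.
PEAK = 1.0e15       # FLOP/s
BANDWIDTH = 2.0e12  # bytes/s

# One token through one 8192 x 8192 weight matrix stored at 2 bytes per weight.
d = 8192
flops = 2 * d * d         # one multiply and one add per weight
weight_bytes = 2 * d * d  # every weight is read once
seconds, bound = roofline_time(flops, weight_bytes, PEAK, BANDWIDTH)
print(bound, seconds)
```

With one token in flight, each weight is fetched to do just two arithmetic operations, so under these assumptions the memory side sets the time by a wide margin.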

Software-engineering example

A long log pasted into the assistant is not just text. It is data that must be carried through memory systems.

Why it matters

Speed and cost are often bottlenecked by movement, not only arithmetic.

Common trap

Do not assume more compute always fixes latency. The model may be waiting on memory.

Grounding

NVIDIA inference optimisation and Lecture 1 roofline framing.

Source trail

Dwarkesh / Reiner: LLM math and serving; NVIDIA inference optimisation

Stage 3

Training vs inference

Standard background

Training creates or updates the model; inference uses the model to answer.

Appears in Lecture 1

Go deeper

Slightly technical

Training optimises weights using large datasets. Inference runs forward passes with fixed weights for user requests.

Software-engineering example

Your coding assistant is usually doing inference when it proposes a patch.

Why it matters

Training and serving have different cost profiles and hardware pressures.

Common trap

A model responding to you is not usually learning from that one interaction by changing its core weights.

Grounding

Standard technical background and Lecture 1 serving focus.

Source trail

Dwarkesh / Reiner: LLM math and serving; Chinchilla paper

Stage 3

Batching

Lecture-derived claim

Batching groups work together so hardware can be used more efficiently.

Appears in Lecture 1

Go deeper

Slightly technical

Serving systems may combine multiple requests or token operations to improve throughput, with latency trade-offs.
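The throughput-versus-latency trade can be sketched with a rough per-step estimate. The assumptions are deliberately crude: weights are read once per decode step and shared by the whole batch, KV cache and activation traffic are ignored, and the model and hardware numbers are hypothetical.

```python
def decode_step(batch_size, weight_bytes, flops_per_token,
                peak_flops_per_s, bandwidth_bytes_per_s):
    """Estimate one decode step for a batch under a simple roofline model.

    Simplifying assumptions: weights are read once per step and shared by
    the whole batch; KV cache and activation traffic are ignored.
    """
    memory_s = weight_bytes / bandwidth_bytes_per_s
    compute_s = batch_size * flops_per_token / peak_flops_per_s
    step_s = max(memory_s, compute_s)
    return step_s, batch_size / step_s  # (seconds per step, tokens per second)

# Hypothetical model and hardware numbers, for illustration only.
WEIGHT_BYTES = 140e9          # e.g. 70e9 parameters at 2 bytes each
FLOPS_PER_TOKEN = 2 * 70e9    # about two operations per parameter
PEAK, BANDWIDTH = 1.0e15, 2.0e12

for batch in (1, 64, 1024):
    step_s, tokens_per_s = decode_step(batch, WEIGHT_BYTES, FLOPS_PER_TOKEN,
                                       PEAK, BANDWIDTH)
    print(batch, round(step_s * 1000, 1), "ms/step", round(tokens_per_s), "tokens/s")
```

Under these made-up numbers, batches of 1 and 64 take the same 70 ms per step because reading the weights dominates, while a batch of 1024 roughly doubles the step time but serves far more tokens per second.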

Software-engineering example

Your small code-fix request may share accelerator time with other users' requests.

Why it matters

Batching shapes cost, speed, and how products feel under load.

Common trap

Bigger batches can improve throughput without improving your individual wait time, and can sometimes lengthen it.

Grounding

Grounded in Reiner Pope's Lecture 1 and NVIDIA inference material.

Source trail

Dwarkesh / Reiner: LLM math and serving; NVIDIA inference optimisation

Stage 3

Prefill vs decode

Standard background

Prefill processes the prompt; decode generates the answer one token at a time.

Appears in Lecture 1

Go deeper

Slightly technical

Prefill processes the input context in parallel. Decode repeatedly uses prior context to produce the next token.
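The control flow of the two phases can be sketched with a stand-in model. `toy_model` below is a hypothetical placeholder that returns an arbitrary next-token ID; only the shape of the loop is the point.

```python
def toy_model(context):
    """Stand-in for a real model (hypothetical): 'predicts' the next token ID
    as the sum of the context modulo 50."""
    return sum(context) % 50

def generate(prompt_ids, max_new_tokens):
    """Return the generated IDs and the number of sequential decode steps.

    Assumes max_new_tokens is at least 1.
    """
    context = list(prompt_ids)

    # Prefill: the whole prompt is already known, so a real system can
    # process all of its positions in one parallel pass.
    next_id = toy_model(context)
    output = [next_id]
    context.append(next_id)

    # Decode: each new token depends on the one before it, so these
    # steps have to run one after another.
    decode_steps = 0
    while len(output) < max_new_tokens:
        next_id = toy_model(context)
        output.append(next_id)
        context.append(next_id)
        decode_steps += 1
    return output, decode_steps

tokens, steps = generate([3, 4, 5], max_new_tokens=4)
print(tokens, steps)
```

However long the prompt is, prefill is one pass here. The number of sequential steps grows only with the length of the answer.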

Software-engineering example

Reading the issue, files, and logs is prefill-like. Writing the patch explanation is decode-like.

Why it matters

The two phases stress hardware differently and explain why long prompts and long answers feel different.

Common trap

A model does not generate the whole response in one instant. Decode is sequential at the token level.

Grounding

NVIDIA inference optimisation and Lecture 1.

Source trail

Dwarkesh / Reiner: LLM math and serving; NVIDIA inference optimisation

Layer 4 of 5

Context becomes memory

Engineers want the assistant to read the repo, remember the logs, and keep the diff in mind. That wish quickly becomes a memory-system question.

Attention helps tokens use information from other positions. That is how a failing assertion can connect to a billing worker, retry policy, and candidate patch.

Prefill processes the prompt. Decode writes the answer token by token. The KV cache helps, but it must still be stored and moved.

Stage 4

Attention

Source-backed fact

Attention lets tokens use information from other tokens in the context.

Appears in Lecture 1

Go deeper

Slightly technical

Transformers compute query, key, and value representations so token positions can weight and combine information from other positions.
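A minimal sketch of scaled dot-product attention for a single query position, in plain Python. In a real transformer the queries, keys, and values come from learned projections of the token vectors; the hand-written inputs here are assumptions for illustration.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query position.

    Scores the query against every key, turns the scores into weights
    with a softmax, and returns the weighted mix of the values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return mixed, weights

mixed, weights = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights, mixed)
```

The query lines up with the first key, so the first value gets the larger weight. That is the numerical version of one position pulling information from another.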

Software-engineering example

The assistant can connect the failing assertion to a billing-worker function shown many lines earlier.

Why it matters

Attention is central to why long prompts can influence later generated tokens.

Common trap

Attention is not human attention. It is a learned numerical mechanism.

Grounding

Attention Is All You Need, 3Blue1Brown, and Jay Alammar transformer explainers.

Source trail

Attention Is All You Need; 3Blue1Brown: attention; Jay Alammar: Illustrated Transformer

Stage 4

KV cache

Lecture-derived claim

The KV cache stores key/value information from previous tokens so generation can continue without recomputing everything.

Appears in Lecture 1

Go deeper

Slightly technical

During decode, cached keys and values from earlier context are reused when producing later tokens.
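A back-of-envelope sketch of how much memory that cached state takes. The formula counts one key vector and one value vector per token, per layer, per key/value head. The model configuration is hypothetical and does not describe any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The factor of 2 covers one key vector and one value vector per token,
    per layer, per key/value head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

# Hypothetical configuration, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=1)
long_context = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=100_000)
print(per_token, "bytes per token")
print(round(long_context / 1e9, 1), "GB for 100,000 tokens")
```

With this configuration each token costs 128 KiB of cache, so a 100,000-token context needs about 13 GB for a single request. The growth is linear in context length, and it is paid per concurrent request.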

Software-engineering example

If your prompt includes a stack trace, the model carries useful key/value state from that context while drafting the fix.

Why it matters

KV cache makes generation practical but turns context into a memory bill.

Common trap

Cache helps avoid recomputation, but storing and reading it still costs memory.

Grounding

Grounded in Reiner Pope's Lecture 1 and NVIDIA inference material.

Source trail

Dwarkesh / Reiner: LLM math and serving; NVIDIA inference optimisation

Stage 4

Long context

Standard background

Long context means the model can consider more tokens in one request.

Appears in Lecture 1

Go deeper

Slightly technical

More context increases the amount of token state that must be processed, stored, and attended over.

Software-engineering example

Pasting an entire repository gives the assistant more evidence, but also a larger computation and memory problem.

Why it matters

Long context is useful, but it is not free.

Common trap

The product problem is not "include everything"; it is "include the right context at the right time." Irrelevant logs can dilute focus and add cost.

Grounding

Standard background; see the token, KV cache, and inference sources below.

Source trail

Dwarkesh / Reiner: LLM math and serving; NVIDIA inference optimisation; OpenAI token docs

Stage 4

Memory hierarchy / HBM

Lecture-derived claim

Modern accelerators use layers of memory, with fast memory close to compute and larger memory farther away.

Appears in Lecture 1, Lecture 2

Go deeper

Slightly technical

HBM, caches, scratchpads, registers, and off-chip memory differ in capacity, bandwidth, latency, and control.

Software-engineering example

The assistant's useful context has to live somewhere while the model generates; where it lives affects speed.

Why it matters

AI chips are shaped as much by memory movement as by arithmetic.

Common trap

Do not think of memory as one uniform pool.

Grounding

Lecture 2 cache/scratchpad path and NVIDIA performance material.

Source trail

Dwarkesh / Reiner: chip design; NVIDIA inference optimisation

Logs are cheap for a human to paste and expensive for a model to carry.

Layer 5 of 5

Scale shapes chips

These concepts bridge into Reiner Pope's frontier-scale discussion: racks, data movement, power, specialised chips, and training economics.

Second-pass concepts. You do not need these on the first read. They help when you return to the lectures.

Stage 5

Mixture of Experts

Lecture-derived claim

MoE models route tokens through selected expert subnetworks rather than using every parameter for every token.

Appears in Lecture 1

Go deeper

Slightly technical

An MoE model may route a token through only some expert subnetworks. These experts are learned components, not necessarily human-readable specialists like code, SQL, or testing.
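A toy sketch of routing, assuming top-k selection over router scores. Top-k is one common choice, and the guide's sources are not being quoted here on the exact scheme; the scores, expert count, and parameter figures are hypothetical.

```python
def route_top_k(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])

def expert_params(num_experts, params_per_expert, k):
    """Return (total, active) expert parameters when k experts run per token."""
    return num_experts * params_per_expert, k * params_per_expert

chosen = route_top_k([0.1, 2.3, -0.5, 1.7], k=2)
total, active = expert_params(num_experts=8, params_per_expert=1_000_000_000, k=2)
print(chosen, total, active)
```

All eight experts still have to be stored somewhere, but only two do arithmetic for this token. That gap between stored and active parameters is what changes the memory and communication picture.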

Software-engineering example

The routing choice is part of the learned computation, so only some internal pathways may be active for a given token.

Why it matters

MoE changes the relationship between model size, memory placement, communication, and work per token.

Common trap

Total parameters and active parameters are not the same thing.

Grounding

Lecture 1 model layout and active-parameter discussion.

Source trail

Dwarkesh / Reiner: LLM math and serving

Stage 5

Pipeline parallelism

Lecture-derived claim

Pipeline parallelism splits a model's layers into consecutive stages that run on different devices.

Appears in Lecture 1

Go deeper

Slightly technical

Different parts of a model can run on different accelerators, with data passed between stages.
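A small sketch of the idea, with toy "layers" that are plain Python functions and "devices" that exist only as list positions. Real systems also overlap several requests across stages to keep every device busy, which this sketch leaves out.

```python
def split_into_stages(num_layers, num_stages):
    """Assign consecutive layer indices to stages as evenly as possible."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

def run_pipeline(x, layers, stages):
    """Run the stages in order; each hand-off stands for a device-to-device transfer."""
    transfers = 0
    for idx, stage in enumerate(stages):
        for layer_index in stage:
            x = layers[layer_index](x)
        if idx < len(stages) - 1:
            transfers += 1  # data leaves this device for the next one
    return x, transfers

# Ten toy layers; layer i just adds i to its input.
layers = [lambda v, i=i: v + i for i in range(10)]
stages = split_into_stages(num_layers=10, num_stages=4)
result, transfers = run_pipeline(0, layers, stages)
print(stages, result, transfers)
```

The answer is the same as running all ten layers on one device. What changes is that three hand-offs now sit on the critical path, which is the communication cost in miniature.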

Software-engineering example

A frontier model may be too large or too busy for one chip, so serving becomes a distributed systems problem.

Why it matters

Scaling AI is not just a single-chip problem. It is scheduling, networking, and utilisation.

Common trap

Parallelism adds communication costs and coordination complexity.

Grounding

Lecture 1 discussion of splitting work across racks.

Source trail

Dwarkesh / Reiner: LLM math and serving

Stage 5

Scaling laws

Source-backed fact

Scaling laws describe patterns between model size, data, compute, and performance.

Appears in Lecture 1

Go deeper

Slightly technical

They are empirical relationships that guide training strategy and resource allocation.

Software-engineering example

They help answer whether to train a bigger model, use more data, or spend compute differently.

Why it matters

They turn frontier AI into a planning and economics question, not only an algorithm question.

Common trap

A scaling law is not a guarantee for every architecture, dataset, or product use case.

Grounding

Kaplan et al. for broader neural language model scaling laws; Chinchilla and Lecture 1 for compute-optimal model-size-versus-training-token trade-offs.

Source trail

Kaplan et al.: scaling laws; Chinchilla paper; Dwarkesh / Reiner: LLM math and serving

Stage 5

Chinchilla

Source-backed fact

Chinchilla is shorthand for compute-optimal training trade-offs between model size and training tokens.

Appears in Lecture 1

Go deeper

Slightly technical

The Chinchilla work argued many large models were undertrained relative to their parameter count.
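A rough sketch of the arithmetic behind that argument. It uses the common approximation that training costs about 6 FLOPs per parameter per training token, and the model sizes and token counts as I understand them from the Chinchilla paper (Chinchilla: 70B parameters, 1.4T tokens; Gopher: 280B parameters, 300B tokens). Treat the outputs as order-of-magnitude estimates.

```python
def training_flops(params, tokens):
    """Rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens

def tokens_per_param(params, tokens):
    """How many training tokens the model saw for each parameter."""
    return tokens / params

# Figures as reported in the Chinchilla paper (see the lead-in caveat).
models = {
    "Chinchilla": (70e9, 1.4e12),
    "Gopher": (280e9, 300e9),
}
for name, (params, tokens) in models.items():
    print(name,
          round(tokens_per_param(params, tokens), 1), "tokens/param",
          f"{training_flops(params, tokens):.2e}", "FLOPs")
```

The two estimated compute budgets are comparable, but the smaller model saw about 20 tokens per parameter against roughly 1 for the larger one. That is what "how was compute allocated?" means in practice.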

Software-engineering example

It reframes the question from 'how big is the model?' to 'how was compute allocated?'

Why it matters

Training strategy affects the models that later become practical to serve.

Common trap

Bigger is not automatically the best use of compute.

Grounding

Training Compute-Optimal Large Language Models.

Source trail

Chinchilla paper; Dwarkesh / Reiner: LLM math and serving

Stage 5

Logic gates

Lecture-derived claim

Logic gates are tiny physical circuits that implement basic operations on bits.

Appears in Lecture 2

Go deeper

Slightly technical

Chip design builds upward from transistors and gates to arithmetic units, memory, and dataflow.
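The step from gates to arithmetic can be sketched in software. This builds a full adder from AND, OR, and XOR, then chains full adders into a ripple-carry adder. It is a standard textbook construction used here for illustration, not necessarily the circuit the lecture walks through.

```python
def and_gate(a, b):
    return a & b

def or_gate(a, b):
    return a | b

def xor_gate(a, b):
    return a ^ b

def full_adder(a, b, carry_in):
    """Add three bits using only gates; return (sum_bit, carry_out)."""
    partial = xor_gate(a, b)
    sum_bit = xor_gate(partial, carry_in)
    carry_out = or_gate(and_gate(a, b), and_gate(partial, carry_in))
    return sum_bit, carry_out

def ripple_add(x, y, width=8):
    """Add two unsigned integers by chaining full adders, lowest bit first."""
    carry, result = 0, 0
    for i in range(width):
        sum_bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= sum_bit << i
    return result  # wraps around if the sum needs more than `width` bits

print(ripple_add(13, 29))
```

Multipliers and multiply-accumulate units are built from the same kinds of pieces, just many more of them.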

Software-engineering example

The cloud service answering your prompt ultimately depends on physical circuits switching.

Why it matters

Lecture 2 grounds AI in physics and hardware, not abstraction alone.

Common trap

The cloud is not weightless. It is machines, power, heat, and silicon.

Grounding

Lecture 2 chip design from the bottom up.

Source trail

Dwarkesh / Reiner: chip design

Stage 5

Systolic arrays

Lecture-derived claim

A systolic array is a grid of compute units that rhythmically passes data through for matrix work.

Appears in Lecture 2

Go deeper

Slightly technical

It choreographs data movement so multiply-accumulate work can happen with high reuse and throughput.
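A software simulation can make the rhythm visible. The sketch below assumes an output-stationary layout, where each processing element keeps one output value while inputs stream past. Real designs may use other dataflows, so treat the layout and the function name as assumptions.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    One processing element (PE) sits at each output position (i, j).
    Row i of `a` enters from the left delayed by i cycles, and column j of
    `b` enters from the top delayed by j cycles, so the pair a[i][p], b[p][j]
    meets at PE (i, j) on cycle p + i + j.
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]
    cycles = m + n + k - 2  # the last pair meets on cycle m + n + k - 3
    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                p = t - i - j  # which operand pair reaches this PE now
                if 0 <= p < k:
                    acc[i][j] += a[i][p] * b[p][j]  # one MAC per PE per cycle
    return acc, cycles

result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

Each input value is handed from neighbour to neighbour instead of being fetched again from memory, which is where the reuse comes from.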

Software-engineering example

Instead of sending every number back and forth to distant memory, the chip moves data through a local pattern.

Why it matters

It is a concrete example of hardware taking the shape of matrix multiplication.

Common trap

It is not just more cores. It is a dataflow design.

Grounding

Lecture 2 and matrix multiplication sources.

Source trail

Dwarkesh / Reiner: chip design; NVIDIA matrix multiplication guide

Stage 5

Cache vs scratchpad

Lecture-derived claim

Cache is usually hardware-managed nearby memory; scratchpad is explicitly managed nearby memory, often by software, compiler, or runtime.

Appears in Lecture 2

Go deeper

Slightly technical

Both put data close to compute, but they differ in who controls placement and movement.
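The scratchpad side of that contrast can be sketched as a tiled matrix multiplication in which the program itself decides what to copy into nearby memory. The "far memory" read counter is a modelling device of mine, and the sketch assumes square matrices whose size is a multiple of the tile size.

```python
def tiled_matmul(a, b, tile):
    """Square matmul that stages tiles in an explicitly managed buffer.

    Returns the product and the number of values copied from 'far' memory.
    Assumes the matrix size is a multiple of `tile`.
    """
    n = len(a)
    out = [[0] * n for _ in range(n)]
    far_reads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, n, tile):
                # Software decides what sits in nearby memory: copy two tiles in.
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                far_reads += 2 * tile * tile
                # All arithmetic below touches only the staged tiles.
                for i in range(tile):
                    for j in range(tile):
                        for p in range(tile):
                            out[i0 + i][j0 + j] += a_tile[i][p] * b_tile[p][j]
    return out, far_reads

a = [[4 * i + j for j in range(4)] for i in range(4)]
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
for tile in (1, 2, 4):
    product, far_reads = tiled_matmul(a, identity, tile)
    print(tile, far_reads)
```

The arithmetic is identical in every case, but larger tiles cut the number of far reads because each staged value is reused more times before it is evicted. A hardware-managed cache aims for the same reuse without the program spelling it out.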

Software-engineering example

Accelerator efficiency depends on keeping the right model data close at the right time.

Why it matters

This distinction explains why chip design and compiler/runtime design are linked.

Common trap

Nearby memory is scarce, so placement choices matter.

Grounding

Lecture 2 memory hierarchy discussion.

Source trail

Dwarkesh / Reiner: chip design

Stage 5

Why chips have a shape

Lecture-derived claim

Chips reflect the workloads they are built to run.

Appears in Lecture 2

Go deeper

Slightly technical

Area, power, memory bandwidth, interconnect, arithmetic units, and programmability trade off against each other.

Software-engineering example

AI accelerators look the way they do because matrix work and data movement dominate the economics.

Why it matters

Understanding chips helps explain what future AI coding tools can practically do.

Common trap

No chip is best at everything. Specialisation buys performance by giving something up.

Grounding

Lecture 2 overall frame.

Source trail

Dwarkesh / Reiner: chip design; NVIDIA matrix multiplication guide

Lecture 1 · The destination

How to watch Lecture 1: The math behind how LLMs are trained and served

Use this lecture to connect model behaviour to serving economics. The core move is to stop asking only "what can the model do?" and start asking "what work happens per token, where is the bottleneck, and how does the serving system keep the accelerator useful?"

Before watching

Keep the payment retry prompt in mind. The prompt-read phase is the system absorbing evidence. The answer-write phase is the system generating new tokens while carrying earlier context forward.

What should click

Latency, throughput, and cost are not afterthoughts. They are shaped by model weights, active parameters, batching, memory bandwidth, KV cache, and whether the workload is compute-limited or memory-limited.

Lecture area	What to listen for	Concepts from this guide
Batch size	How grouping work improves accelerator use while changing latency.	batching, GPU
Token cost	Why prompt length and generated length become serving work.	token, prefill/decode
Roofline analysis	Whether the bottleneck is arithmetic throughput or memory movement.	GPU, memory bandwidth
Memory bandwidth	Why moving weights, activations, and cache data can dominate latency.	memory bandwidth, HBM
Prefill/decode	The difference between reading context and writing the answer.	prefill vs decode, forward pass
KV cache	How stored key/value state makes decode practical and memory-hungry.	attention, KV cache
Long-context cost	Why larger context windows create more state to process and carry.	long context, memory hierarchy
MoE	How sparse expert routing changes active parameters and placement.	MoE, active parameters
Pipeline parallelism	How frontier models get split across devices or racks.	pipeline parallelism
Chinchilla	Why compute allocation matters, not only model size.	scaling laws, Chinchilla

Lecture 2 · The destination

How to watch Lecture 2: Chip design from the bottom up

Use this lecture to bring the abstraction all the way down to silicon. The surprising lesson is that the shape of the chip follows the shape of the workload: repeated arithmetic, scarce nearby memory, expensive data movement, and the need to keep many operations flowing.

Before watching

Do not start with the whole GPU. Start with one tiny operation: multiply, add, repeat. Then ask what must be built around that operation so it can happen billions or trillions of times usefully.

What should click

Systolic arrays, scratchpads, caches, and specialised accelerators are not random hardware trivia. They are ways of arranging compute and memory so matrix-heavy AI workloads waste less time moving data.

Lecture area	What to listen for	Concepts from this guide
Logic gates	How physical circuits become the base of computation.	logic gates
Matrix multiplication	Why model work reduces to repeated structured numerical operations.	matrix multiplication
Multiply-accumulate	The tiny arithmetic step repeated at enormous scale.	multiply-accumulate
Systolic arrays	Matrix multiplication as choreographed local dataflow.	systolic arrays
Cache/scratchpad	Who controls nearby memory and why placement matters.	cache vs scratchpad, memory hierarchy
CPU vs GPU	Flexible control flow versus high-volume parallel arithmetic.	GPU
GPU as matrix machine	Why modern accelerators are shaped around matrix-heavy work.	GPU, matrix multiplication
AI accelerator design	How area, power, memory, interconnect, and programmability trade off.	why chips have a shape

Source spine

The trail this guide leans on.

Glossary

Small definitions for repeated terms.

token: A chunk of text or code represented as an ID for model processing.
parameter: A learned number in the model's weights.
vector: An ordered list of numbers.
embedding: A vector representation of a token or piece of content.
forward pass: A run through the model that produces output probabilities.
inference: Using a trained model to answer a request.
training: Updating model weights from data and objective functions.
matrix: A rectangular grid of numbers.
matrix multiplication: The operation that combines matrices and vectors across model layers.
multiply-accumulate: Multiply values and add the result into a running sum.
CPU: A general-purpose processor optimised for flexible control flow.
GPU: A parallel processor optimised for high-volume numerical work.
TPU: A tensor-processing accelerator designed for machine-learning workloads.
AI accelerator: Hardware specialised for AI math and data movement.
memory bandwidth: The rate at which data can move through memory systems.
HBM: High Bandwidth Memory, used near accelerators for high-throughput data access.
batching: Grouping work together to improve hardware utilisation.
prefill: The phase that processes the input prompt/context.
decode: The phase that generates output tokens step by step.
attention: A transformer mechanism for relating tokens to other tokens in context.
query: A learned representation used to ask what other token information matters.
key: A learned representation used for matching against queries.
value: A learned representation carrying information to combine after attention weights are computed.
KV cache: Stored key/value state from prior tokens used during generation.
long context: A larger token window available to the model in one request.
MoE: Mixture of Experts, a model layout that activates selected expert pathways.
pipeline parallelism: Splitting model work across stages or devices.
scaling laws: Empirical relationships among compute, data, model size, and performance.
Chinchilla: A compute-optimal training result about balancing model size and training tokens.
logic gate: A basic circuit element that computes with bits.
systolic array: A grid-like dataflow design for efficient matrix operations.
cache: Hardware-managed nearby memory.
scratchpad: Explicitly managed nearby memory, often controlled by software, compiler, or runtime.

Appendix: How this guide is grounded.

These are the analogies in the guide. They rest on real grounding, but the framing is mine.

Teaching analogy	What it rests on	Source trail
Long context is a memory bill	KV cache plus memory bandwidth: more context means more key/value state to store and move, not just more text.	NVIDIA inference optimisation; Dwarkesh / Reiner: LLM math and serving; OpenAI token docs
Logs are cheap for humans to paste and expensive for models to carry	Token and context cost: pasted text becomes tokens that must be processed, cached, and attended over.	OpenAI token docs; Dwarkesh / Reiner: LLM math and serving
The chatbot is the last mile of a much deeper machine	The full descent: the product surface sits on top of tokens, matrix work, memory movement, and silicon.	Dwarkesh / Reiner: LLM math and serving; Dwarkesh / Reiner: chip design
An AI coding assistant is a chain of translations	The text-to-math-to-hardware path: each stage hands a representation to the next.	Dwarkesh / Reiner: LLM math and serving