From Agents to Tokens: My AI Engineering Learning Plan

7 min read
Tags: ai-engineering, learning-in-public, llm

I use LLMs every day — they've become part of how I build and ship. But it's gone a step past using them: at work I've been building agentic systems for other people to use. A chat interface that knows our datasets, decides which API calls to make, pulls the data, runs the calculations, and answers the user's question. I build these systems — and I still couldn't tell you, precisely, how the model at the center of them works.

That's the gap this plan closes: not "user" to "power user," but "builds on top of the black box" to "has opened the box" — working down from agents to tokens.

This post is the plan, and publishing it is the commitment device: I learn a stage, I write it up, and the site keeps me honest. (Full disclosure: I drafted the syllabus with Claude's help — which feels like the appropriate way to plan a course of study on the thing itself.)

Why "AI engineering" and not just "LLMs"

The model is only half the story. The other half is the discipline that has grown around using models: prompt engineering, agentic loops and harnesses, tool use, evals, serving. The industry has mostly settled on AI engineering as the name for that layer, and the name matters because it sets the syllabus — studying attention math without studying harnesses and evals would be like learning how an engine works but never driving.

So the path runs the whole stack. I live at the top of it, in the agents layer, and every question I have points down — from agents to tokens. The learning has to build back up the other way, though, which is why the map below starts at the bottom.

Questions I can't answer yet

A few questions have been sitting in my head for a while, and they're the real reason this plan exists. For most of them I can point at the likely step — "something in the tokenizer," "probably post-training" — but pointing at a step and knowing what happens inside it are different things.

How do these models see? Most people carry a rough mental model of LLMs reading text: break it into tokens, predict the next one. Fine. But then how does the same architecture become multimodal? What does it even mean to "tokenize" an image? How does a next-word predictor also generate a picture, watch a video, or hear audio — is everything secretly a token, or is something else going on?

How is one model multilingual? If "understanding a word" means learning an embedding for its token, that story sounds suspiciously English-shaped. Yet the same frozen weights answer in Hindi, debug Python, and translate between language pairs nobody explicitly trained them on. Where does meaning actually live, if not in the words of any one language?

Why don't typos break anything? Misspell half the words in a prompt and the model doesn't stumble — it answers as if nothing happened, no correction needed. But a typo completely changes how the text tokenizes. Shouldn't that scramble everything downstream? Why is understanding so robust to mangled input?
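To make the premise concrete, here is a minimal sketch of a byte-pair-style tokenizer in plain Python. Everything in it is an assumption for illustration: the three-word corpus, the helper names `learn_bpe`, `apply_merge`, and `encode`, and the decision to skip word-boundary markers. Production tokenizers are trained on vastly more text and differ in the details. All it shows is that swapping two letters can turn one token into several.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Hypothetical toy corpus; real vocabularies come from billions of words.
toy_corpus = ["token"] * 5 + ["tokens"] * 2 + ["broken"] * 2
merges = learn_bpe(toy_corpus, num_merges=6)

clean = encode("token", merges)
typo = encode("tokne", merges)
print(clean)  # the frequent word collapses into a single token
print(typo)   # the typo falls back to several smaller pieces
```

The token sequence the model receives really is different, which is what makes the robustness puzzling rather than obvious.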

Where does a model's personality come from? Ask ChatGPT and Claude the same question and you'll get different tone, different instincts, different taste — yet they're all trained on more or less the same internet. So where does the difference come from? Is it post-training — the fine-tuning and reinforcement learning the labs layer on after pretraining, where a raw next-word predictor somehow becomes an assistant with opinions? And the question has another layer: products like Codex and Claude Code aren't bare models — they're models inside a harness: a system prompt, a set of tools, a loop deciding what happens next. How much of what we read as personality is the model itself, and how much is the harness it's wearing?

I don't want hand-wavy answers to any of these. Most of them sit at the bottom of the stack — tokenization and embeddings — which is exactly why the path starts there. The personality question is split between the middle and the top: post-training and the harness layer.

The map

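[Figure: the stack, drawn top to bottom. Top layer, marked "I build here": agents · tool use · harnesses · evals. Between the layers: prompts, tokens. Bottom layer, the base model (the box), marked "this plan: open the box": tokenization, embeddings, attention, training, fine-tuning, inference. Caption: the plan, roughly.]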

1. Tokenization. Why do models see subwords instead of words? How does BPE actually work? Is the tokenizer the reason models are famously bad at spelling — and if so, why are they so unbothered by my typos? And what does "tokenizing" an image or a second of audio even mean?

2. Embeddings. What makes a vector "mean" something? Why is cosine similarity the default measure? How do fifty languages end up sharing one embedding space? And what do positional encodings add?

3. Attention and the transformer. The one I most want to earn rather than nod along to: what do queries, keys, and values actually do? What does the causal mask hide, and from whom? Where, physically, do the billions of parameters live?

4. Training. What is next-token prediction really optimizing? How do I read a loss curve? And what did the scaling-law papers claim that reshaped how everyone budgets a training run?

5. Fine-tuning and alignment. What separates a base model from an instruct model when the weights are nearly the same? Is this where a model's personality actually gets made? How does LoRA get away with training so few parameters? What's the actual difference between RLHF and DPO?

6. Inference. What do temperature, top-k, and top-p do, mathematically, to a distribution? Why does long context eat memory (the KV cache)? How does quantization shrink a model without ruining it?

7. Agents, tool use, and harnesses. The layer I build in every day — but shipping harnesses and understanding them systematically are different things. What makes a harness reliable? Where do agentic loops fail, and how do you evaluate them? How much of a product's "personality" is the harness rather than the model? Expect field notes rather than textbook summaries.

8. Evals and interpretability. How do you honestly measure a model? Why do benchmarks saturate and leak? And what can interpretability actually claim today versus what gets claimed for it?

Stages 1 through 3 build on each other, so they come first and in order. After that I'll follow interest. Stage 7 runs in parallel throughout, because it's my day job.
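For a sense of what stage 3 will have to unpack, here is a rough numpy sketch of single-head causal self-attention. It is a sketch under stated assumptions, not a reference implementation: the weights are random rather than trained, the shapes are tiny, and the names `causal_self_attention`, `w_q`, `w_k`, and `w_v` are my own placeholders. It leaves out multiple heads, the output projection, and everything else a real transformer block carries.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns the attended values and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len)
    seq_len = x.shape[0]
    # True above the diagonal: position i may not look at positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal
```

The printed matrix is the kind of thing an attention heatmap would draw: row i shows how much position i attends to each earlier position, and the zeros above the diagonal are the causal mask at work.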

How I'll write it up

Mostly as TILs — short posts, one idea each, published the week I learn it. I promised TILs in my very first post here and never delivered one; this is where that changes.

Occasionally, when running the code is the point, a stage will get an interactive post instead. This site can execute Python in the browser, and some of these topics are perfect for it: a BPE tokenizer you can type into, an attention heatmap you can poke at, a sampling playground with temperature and top-k sliders. One constraint keeps those honest — the browser runtime speaks numpy, not PyTorch, so anything interactive has to be implemented from scratch. That's not a limitation so much as the whole pedagogical point.
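As a taste of what a from-scratch numpy version looks like, here is a minimal sketch of the core of a sampling playground: temperature and top-k applied to a vector of logits. The logits are made up, the function name `next_token_probs` is hypothetical, and this is one common way to define these knobs rather than any particular library's behavior.

```python
import numpy as np

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn raw logits into sampling probabilities.

    temperature > 0 rescales the logits: below 1 sharpens the
    distribution, above 1 flattens it. top_k keeps only the k largest
    logits (ties at the cutoff are all kept) and zeroes out the rest.
    """
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]  # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)
    logits = logits - logits.max()  # stabilize before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()

# Made-up logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.5, 1.0, 2.0):
    print(t, np.round(next_token_probs(logits, temperature=t), 3))

top2 = next_token_probs(logits, top_k=2)
print(np.round(top2, 3))  # only the two most likely tokens survive

# Drawing one token id from the adjusted distribution.
rng = np.random.default_rng(0)
probs = next_token_probs(logits, temperature=0.8, top_k=3)
token_id = rng.choice(len(logits), p=probs)
```

Wiring `temperature` and `top_k` to sliders and redrawing the bar chart of `probs` on every change would be most of the interactive post.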

Holding myself to it

The last stretch of this blog went quiet for a long time while the work went elsewhere. The fix isn't ambition, it's cadence: small pieces, shipped while they're fresh. If you want to follow along, the RSS feed is live — the next post will be a TIL about tokenization.