Pillar 0: LLM Foundations

You don’t need to train models, but you need to understand the machinery well enough to reason about behavior.

Every pillar in this document assumes you can reason about why LLMs behave the way they do. Not at a research level, but at the level where your mental model of the system matches reality closely enough to make good decisions. When you understand that transformers process tokens (not words), that attention is how the model decides what matters, and that embeddings are how machines represent meaning as math, you stop treating AI as a magic box and start treating it as an engineered system with predictable characteristics.

This foundation pays off across everything else: context engineering makes more sense when you understand context windows mechanically; prompt engineering clicks when you know how tokenization and attention interact; cost awareness becomes intuitive when you understand what a token actually is.

You understand that LLMs are next-token prediction machines. An LLM generates one token at a time, left to right, each time asking “given everything so far, what is the most probable next token?” This is the single most important mental model for working with AI. It explains why prompt order matters (the model builds on what came before), why “think step by step” works (intermediate reasoning tokens influence subsequent ones), why few-shot examples work (you shape the probability distribution by showing the pattern you want continued), and why temperature is not a magic dial (it reshapes the probability distribution at each individual token choice). Every prompting technique, every agent pattern, and every reliability strategy in this document is a consequence of this mechanism.
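
The mechanism can be sketched in a few lines. This is a toy illustration, not any provider's implementation: a hand-made vocabulary and logits stand in for a real model's output head, and the softmax-with-temperature step shows how temperature reshapes the distribution at each individual token choice.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution over next tokens.
    Low temperature sharpens the distribution (near-deterministic);
    high temperature flattens it (more varied output)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and logits standing in for a real model's output head.
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 1.0, 0.5, 0.1]

cold = softmax(logits, temperature=0.2)  # one token dominates
hot = softmax(logits, temperature=1.5)   # much flatter distribution

# Greedy decoding picks the argmax; sampling draws from the distribution.
greedy_token = vocab[cold.index(max(cold))]
```

Generation is just this step in a loop: append the chosen token to the input and ask again, which is why everything already in context shapes what comes next.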

You understand the training pipeline that shapes model behavior. A model goes through three stages that determine how it behaves: pre-training (raw text completion on internet-scale data - the base model is just an autocomplete engine), instruction tuning (fine-tuning on instruction/response pairs - this is what makes it follow prompts instead of just completing text), and RLHF or Constitutional AI (alignment training - this is what makes it helpful, teaches it to refuse harmful requests, and, problematically, makes it sycophantic). Understanding this pipeline explains why models over-help (RLHF rewards confidence), why system prompts and role-setting work (they tap into instruction tuning), and why the same base model can behave completely differently depending on how it was fine-tuned.

You understand context windows and how attention distributes across them. The context window is the total token budget shared between your input and the model’s output. Attention is not uniform across this window - models attend more strongly to the beginning and end, with weaker attention in the middle. Longer context means more computation, higher latency, and higher cost. You should understand why this matters mechanically so the practical techniques in Pillar 1: Context Engineering make sense rather than feeling like arbitrary rules.
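
The shared-budget point reduces to simple arithmetic. A minimal sketch, with an illustrative window size rather than any provider's spec:

```python
def fits_context(input_tokens: int, max_output_tokens: int,
                 context_window: int = 200_000) -> bool:
    """The context window is one shared budget: prompt tokens plus the
    tokens reserved for the model's reply must fit inside it.
    200_000 is an illustrative window size, not any provider's limit."""
    return input_tokens + max_output_tokens <= context_window

fits_context(180_000, 4_096)   # room for the reply
fits_context(199_000, 4_096)   # over budget: the reply gets truncated
```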

You understand the transformer architecture at a conceptual level. You know that modern LLMs are built on the transformer architecture introduced in Attention Is All You Need (2017). You understand the core idea: self-attention lets the model weigh the relevance of every part of the input when producing each part of the output. You don’t need to implement one, but you should be able to explain why transformers handle long-range dependencies better than prior architectures, why they parallelize well during training, and why attention is the mechanism that makes context engineering work.
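
The core operation is small enough to write out. This is scaled dot-product attention over hand-made 2-d vectors (real models use hundreds of dimensions and learned projections); it shows the one idea that matters here: each output is a weighted mix of all the values, with weights set by how well the query matches each key.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: score each key against the query,
    softmax the scores into weights, return the weighted mix of values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# One query attending over three positions: every position contributes,
# weighted by relevance - this is why any token can "see" any other,
# regardless of distance, which prior recurrent architectures could not do.
out = attention([[1.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```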

You understand tokenization and why it matters practically. LLMs don’t see characters or words. They see tokens, sub-word units produced by algorithms like BPE (Byte Pair Encoding). This has real consequences: code often tokenizes less efficiently than prose (meaning higher costs), unusual variable names consume more tokens than common ones, and token boundaries can affect how the model processes your input. When you understand tokenization, you understand why prompt length affects cost, why some languages are more expensive to process than others, and why “just add more context” has a real price tag.
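
One BPE merge step is enough to see why common strings collapse into few tokens while rare identifiers stay fragmented. A toy sketch (real tokenizers learn thousands of merges from a training corpus; this just applies the greedy merge rule to one string):

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE merge: find the most frequent adjacent pair and fuse it
    into a single token everywhere it occurs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; a few merges fuse the frequent "th"/"the" runs,
# while rare substrings stay as single characters - and cost more tokens.
tokens = list("the theme thesis")
for _ in range(3):
    tokens = bpe_merge_step(tokens)
```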

You understand embeddings and vector similarity. Embeddings are how LLMs convert text into numerical vectors that capture semantic meaning. Similar concepts end up near each other in vector space. This is the foundation for how RAG works: documents get embedded, queries get embedded, and the system retrieves documents whose vectors are closest to the query vector. Understanding this helps you reason about why retrieval sometimes fails (the embedding didn’t capture the right semantic relationship) and why chunk size and overlap matter in RAG pipelines.
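
"Closest in vector space" means cosine similarity in practice. A minimal sketch with hand-made 3-d vectors standing in for a real model's ~1000-d embeddings:

```python
import math

def cosine_similarity(a, b):
    """How closely two embedding vectors point in the same direction:
    1.0 = same semantic direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a)) *
            math.sqrt(sum(y * y for y in b)))
    return dot / norm

# Toy "embeddings": similar concepts get similar vectors.
dog = [0.9, 0.8, 0.1]
puppy = [0.85, 0.9, 0.15]
invoice = [0.1, 0.05, 0.95]

cosine_similarity(dog, puppy)    # high: nearby in vector space
cosine_similarity(dog, invoice)  # low: semantically distant
```

Retrieval failures become easier to reason about from here: if the embedding model never placed two related concepts near each other, no amount of tuning downstream will surface the right document.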

You understand inference parameters and how they shape output. Temperature is the parameter every developer touches first. It controls output randomness by reshaping the probability distribution over next tokens. The practical guidance: 0.0-0.2 for code generation and data extraction, 0.5-0.7 for general tasks, 0.7-1.0 for creative work. A critical nuance: the same temperature value produces different behavior across providers (Anthropic defaults to 1.0, OpenAI defaults vary by model), and some reasoning models lock temperature entirely. Top-p (nucleus sampling) limits the token pool to the smallest set whose cumulative probability exceeds a threshold. The key rule: don’t combine temperature and top-p adjustments simultaneously, as this introduces unpredictable interactions. Top-k is even more specialized and most providers don’t expose it.
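
Top-p is easy to demystify with a toy distribution. A sketch of the nucleus-sampling filter itself (the values are illustrative, not from any real model):

```python
def top_p_filter(probs, p=0.9):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability reaches p, drop the rest, and renormalize what's kept."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {idx: prob / total for idx, prob in kept}

# A toy next-token distribution: one dominant candidate, a long tail.
probs = [0.55, 0.25, 0.12, 0.05, 0.03]
top_p_filter(probs, p=0.9)  # the low-probability tail is cut entirely
```

Because temperature reshapes the same distribution this filter then truncates, adjusting both at once makes the effective sampling pool hard to predict - which is the mechanical reason behind the "change one, not both" rule.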

You understand structured output and constrained decoding. The progression from “please return JSON” (unreliable) to API-native constrained decoding (guaranteed schema compliance) represents one of the most important developments for production LLM use. All major providers now support strict JSON schema enforcement via constrained decoding, a technique that masks invalid tokens during generation so the model literally cannot produce schema-violating output. The practical developer workflow: define your output schema (Pydantic in Python, Zod in TypeScript), pass it to the API’s structured output parameter, receive typed and validated responses. No regex parsing, no retry loops, no fragile extraction logic.
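
To see what constrained decoding replaces, here is the fragile path written out: hand-rolled parsing and validation of a model's JSON reply, using only the standard library (`Invoice` and its fields are illustrative). With API-native structured output you pass a schema instead and the API guarantees the shape, so this entire function disappears.

```python
import json
from dataclasses import dataclass

@dataclass
class Invoice:
    vendor: str
    total: float

def parse_model_output(raw: str) -> Invoice:
    """Hand-rolled validation of a model's JSON reply - the brittle
    approach that constrained decoding makes unnecessary."""
    data = json.loads(raw)  # raises if the model wrapped JSON in prose
    return Invoice(vendor=str(data["vendor"]), total=float(data["total"]))

parse_model_output('{"vendor": "Acme", "total": 1299.5}')
```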

You understand adaptive thinking and test-time compute. Standard LLMs apply the same computation to every question. Modern reasoning models (Claude with extended thinking, OpenAI o1/o3) dynamically allocate more compute to harder problems by generating intermediate reasoning tokens before answering. This means cost and latency are now variable - a simple question might cost 100 tokens, a hard one might use 10,000 thinking tokens. Reasoning models lock temperature because randomness derails multi-step logic chains. Understanding this mechanism explains why the same model gives wildly different response times and costs for different queries. For practical guidance on when to use deeper vs lighter thinking modes, see Pillar 3: Prompt Engineering.

You understand why LLMs hallucinate and what that means for system design. LLMs hallucinate because training and evaluation procedures reward guessing over acknowledging uncertainty. Like a multiple-choice test where guessing gives a 1-in-4 chance of being right while “I don’t know” scores zero, models learn to produce confident-sounding text even when they lack knowledge. RLHF (reinforcement learning from human feedback) amplifies this because human judges prefer detailed, confident answers. The practical implication: you must design every system assuming hallucinations will occur. Mitigation is layered: RAG for grounding in source material, chain-of-thought prompting for reasoning transparency, explicit “refuse rather than guess” instructions, and human review for high-stakes outputs. Understanding the cause informs every reliability decision. For verification practices, see Pillar 6: Verification and Security.

You understand sycophancy and why models tend to agree with you. RLHF trains models by rewarding answers that human evaluators preferred, and humans preferred answers that agreed with them. The model learned: agreement = good score. This means if you ask “Isn’t it true that X?” the model will tend to confirm, even when X is wrong. If you frame a question with emotional investment (“I spent three weeks on this architecture”), the model will find reasons to praise it. This is not a bug you can fix with one prompt - it’s a trained behavior. Treating model agreement as validation is one of the most common mistakes engineers make.

You know how to reduce sycophancy in practice. The counters are straightforward but require discipline:

  • Don’t lead the witness. “Compare React and Vue performance, include cases where each wins” not “Isn’t React faster than Vue?”
  • Remove emotional framing. “Review this architecture for flaws. Be direct.” not “I spent three weeks on this, what do you think?”
  • Ask for disagreement explicitly. “Challenge my assumptions. If my premise is wrong, say so.” This gives the model permission to override its agreement training.
  • Ask for counterarguments. “Give me 3 reasons this approach might fail” or “Steel-man the opposing view.” Forces the model into critical mode.
  • Use persona prompting. “Act as a skeptical senior engineer reviewing a PR. Find problems.” or “You are a meticulous code reviewer with very high standards for code quality. Find every potential issue.” A critical persona overrides the default helpful-agreeable persona.
  • Use two-pass review. First ask the model to generate, then: “Now critique your own answer. What might be wrong?” Splitting generation from evaluation produces more honest assessment because the model is no longer defending its own output.
  • Ask for confidence levels. “Rate your confidence 1-10 for each claim. For anything below 7, explain your uncertainty.” This forces the model to distinguish between what it knows and what it’s guessing.
  • Never use confirmation questions. “This is correct, right?”, “Does this make sense?”, and “Am I right that…?” are all leading questions that trigger agreement. Use “Is this correct? If not, explain why.” or “Evaluate this for correctness.” instead.
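
Several of these counters can be baked into a reusable wrapper so the discipline doesn't depend on remembering them per prompt. A hypothetical helper (the preamble wording is illustrative, not a canonical anti-sycophancy prompt):

```python
def debias_prompt(question: str) -> str:
    """Hypothetical helper applying the counters above: ask for critique,
    explicit disagreement, and confidence levels instead of agreement."""
    return (
        "Evaluate the following for correctness. Challenge my assumptions; "
        "if my premise is wrong, say so. Give counterarguments where they "
        "exist, and rate your confidence 1-10 for each claim.\n\n"
        + question
    )

debias_prompt("Isn't React faster than Vue?")
```

Note that the original leading question is left intact; the preamble gives the model permission to contradict it rather than hiding the bias.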

You understand RAG well enough to reason about quality. You are already using RAG daily - Claude Code reads files before answering (retrieval), Cursor’s @codebase embeds your query and matches it against your project (retrieval), Copilot indexes your repo to inform completions (retrieval). RAG is not something you will encounter someday; it is how your AI tools work right now. You should understand the pipeline stages: loading/ingestion (getting documents in), indexing/embedding (converting to vectors), storing (vector databases), querying/retrieval (finding relevant chunks), and generation (producing answers from retrieved context). Each stage has knobs that affect output quality. When a RAG system gives bad answers, you should be able to reason about whether the problem is in retrieval (wrong chunks found), generation (model hallucinating despite good context), or ingestion (documents chunked poorly).
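
The whole pipeline fits in a toy sketch. The "embedding" here is a bag-of-words counter purely for illustration - real systems use a learned embedding model - but the stages (ingest, index, retrieve, then generate from the retrieved chunk) are the same ones worth debugging in isolation:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; real pipelines use a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

# Ingestion + indexing: chunk the documents and embed each chunk.
chunks = [
    "Refunds are processed within five business days.",
    "Shipping is free for orders over fifty dollars.",
    "Contact support by email for account issues.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Retrieval: embed the query, return the closest chunk.
query = embed("how long do refunds take to process")
best_chunk = max(index, key=lambda item: cosine(query, item[1]))[0]
# Generation would then answer from best_chunk, not from the model's memory.
```

Notice that "processed" and "process" don't match in this toy version - exactly the kind of near-miss a learned embedding model exists to smooth over, and the kind of failure to suspect when retrieval returns the wrong chunks.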

You understand agent patterns and recognize them in your daily tools. AI agents are LLMs with tools and loops. The core patterns are: prompt chaining (sequential steps), routing (directing to specialized handlers), parallelization (running multiple tasks concurrently), orchestrator-workers (a coordinator dispatching to specialized agents), and evaluator-optimizer (using an LLM to grade LLM responses, sometimes called “LLM-as-a-Judge”). You are already using these patterns: when Claude Code plans a multi-file refactor it is orchestrating workers, when you give feedback and it revises you are the evaluator in an evaluator-optimizer loop, when you break a task into steps you are prompt chaining. The most important principle from Anthropic’s own agent research: use the simplest pattern that works. Single prompt before chain, chain before workflow, workflow before autonomous agent. Most tasks don’t need agents. The named prompting strategies that power these patterns (zero-shot, few-shot, chain-of-thought, ReAct, Reflexion) are covered in Pillar 3: Prompt Engineering.
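
The evaluator-optimizer loop, for instance, is just this control flow. A sketch with stub functions standing in for real LLM calls (`generate` and `evaluate` are hypothetical stand-ins, not any framework's API):

```python
def generate(task, feedback=""):
    """Stand-in for an LLM call; a real system would hit an API here."""
    return f"draft for {task!r}" + (" (revised)" if feedback else "")

def evaluate(draft):
    """Stand-in evaluator ('LLM-as-a-Judge'): returns (passed, feedback)."""
    return ("revised" in draft, "tighten the wording")

def evaluator_optimizer(task, max_rounds=3):
    """Generate, grade, revise until the evaluator accepts or the round
    budget runs out. Reach for this only after a single prompt fails."""
    feedback = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        ok, feedback = evaluate(draft)
        if ok:
            return draft
    return draft

evaluator_optimizer("summarize the incident report")
```

The `max_rounds` budget is the part that matters in production: without it, a never-satisfied evaluator burns tokens indefinitely.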

You understand function calling, tool design, and how MCP extends it. Function calling is the mechanism that lets LLMs interact with external systems: the model decides when to call a function, structures the arguments, and incorporates the result. MCP (Model Context Protocol) standardizes this into a protocol so tools are portable across AI clients. A critical insight from Anthropic’s agent research: tool design matters more than prompts. When building tools for LLMs, make the right action easy and the wrong action impossible - reject invalid inputs at the tool level rather than asking the model politely in the prompt. Narrow, specific tools outperform broad, generic ones. Claude Code’s own tools demonstrate this: Edit uses search-and-replace (hard to misuse) rather than unified diffs (error-prone), and dedicated Grep/Glob/Read tools outperform a single generic shell tool. Understanding this pipeline helps you debug tool failures, design better integrations, and reason about the security implications of giving models access to external systems.
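
"Make the wrong action impossible" looks like this in practice: a narrow tool that rejects bad arguments itself, paired with the schema declaration the model would be given. A sketch under illustrative names (`read_file`, the `/workspace/` sandbox, and the schema shape are assumptions for the example):

```python
def read_file(path: str) -> str:
    """A narrow tool: invalid arguments are rejected here, at the tool
    boundary, not politely discouraged in the prompt. The model sees the
    error message and can self-correct on the next call."""
    if not path.startswith("/workspace/"):
        raise ValueError("path must be inside /workspace/")
    if ".." in path:
        raise ValueError("path traversal is not allowed")
    # (file I/O elided; a real tool would open and return the contents)
    return f"contents of {path}"

# The matching declaration the model would see when deciding to call it.
READ_FILE_SCHEMA = {
    "name": "read_file",
    "description": "Read one file from the /workspace/ sandbox.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
```

The validation doubles as the security boundary: even a prompt-injected model cannot read outside the sandbox, because the tool, not the prompt, enforces the rule.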

You understand that multimodal input is now a standard capability. Image understanding is a default feature in every frontier model. Common developer use cases include document processing, chart analysis, OCR, UI screenshot understanding, and visual QA. The API pattern is straightforward: pass images as base64 or URLs alongside text messages. Audio and video capabilities are more specialized and vary by provider, but vision is table-stakes knowledge for any developer working with LLM APIs.
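
A sketch of the base64 variant, using Anthropic-style content blocks (field names vary by provider, so treat the exact keys as illustrative and check your API's docs):

```python
import base64

def image_message(image_bytes: bytes, question: str) -> dict:
    """Build a vision request: an image block plus a text block in one
    user message. The content-block shape shown is Anthropic-style."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }

# Placeholder bytes stand in for a real PNG file read from disk.
msg = image_message(b"\x89PNG...", "What does this chart show?")
```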

Common mistakes

  • Treating LLMs as deterministic systems that always produce the same output for the same input
  • Not understanding why the same prompt costs different amounts across models and providers
  • Building on top of RAG without understanding why retrieval quality varies
  • Not understanding the security implications of function calling (the model can be manipulated into calling tools with malicious arguments)
  • Using temperature 1.0 for code generation or 0.0 for creative writing without understanding why those are bad defaults
  • Assuming embeddings capture all semantic relationships equally (they don’t; domain-specific concepts often need fine-tuned embeddings)
  • Treating “please return JSON” as equivalent to constrained decoding (it isn’t; one is a suggestion, the other is a guarantee)
  • Building production systems without accounting for hallucination as a structural property of the technology
  • Asking the model to validate your approach and treating its agreement as evidence of correctness (sycophancy means the model is biased toward agreeing with you)
  • Guessing file paths, function names, or API signatures instead of using tools to verify them (tool-based ground truth is the most effective anti-hallucination mechanism)
  • Using an autonomous agent when a simple prompt chain would work (complexity should be earned, not the default)
  • Dumping entire files as context when only one function is relevant (every irrelevant token competes for attention with the tokens that matter)