Pillar 0: LLM Foundations
You understand the machinery well enough to reason about LLM behavior.
You understand the machinery well enough to reason about LLM behavior.
Every other pillar assumes you can reason about why LLMs behave the way they do. Not at a research level, but at the level where your mental model matches reality closely enough to make good decisions. When you understand that transformers process tokens, that attention is how the model decides what matters, and that embeddings are how meaning becomes math, you stop treating AI as a magic box and start treating it as an engineered system you can reason about.
What We Expect
Section titled “What We Expect”You understand that LLMs are next-token prediction machines
Section titled “You understand that LLMs are next-token prediction machines”An LLM generates one token at a time, left to right, each time asking "given everything so far, what is the most probable next token?" This single mechanism explains why prompt order matters, why "think step by step" works, why few-shot examples work, and why temperature is not a magic dial. Every prompting technique, agent pattern, and reliability strategy in this document is a consequence of it.
You understand the training pipeline that shapes behavior
Section titled “You understand the training pipeline that shapes behavior”Models go through three stages: pre-training (raw text completion on internet-scale data; the base model is just an autocomplete engine), instruction tuning (what makes it follow prompts), and alignment via RLHF or Constitutional AI (what makes it helpful, refuse harmful requests, and problematically, sycophantic). This pipeline explains why models over-help, why system prompts work, and why the same base model behaves completely differently after different fine-tuning.
You understand context windows and how attention distributes across them
Section titled “You understand context windows and how attention distributes across them”The context window is the total token budget shared between your input and the model's output. Attention is not uniform across this window: models attend more strongly to the beginning and end, with weaker attention in the middle. Longer context means more computation, higher latency, and higher cost. See Pillar 1: Context Engineering for the practical techniques that follow from this.
You understand the transformer architecture at a conceptual level
Section titled “You understand the transformer architecture at a conceptual level”Modern LLMs are built on the transformer (Attention Is All You Need, 2017). You don't need to implement one; you should be able to explain that self-attention lets the model weigh the relevance of every part of the input when producing each part of the output. The deeper mechanics are in Learning Paths.
You understand tokenization and why it matters practically
Section titled “You understand tokenization and why it matters practically”LLMs see tokens (sub-word units produced by algorithms like BPE), not characters or words. Code tokenizes less efficiently than prose, unusual variable names consume more tokens than common ones, and token boundaries affect how the model processes input. This is why prompt length drives cost and why "just add more context" has a price tag.
You understand embeddings and vector similarity
Section titled “You understand embeddings and vector similarity”Embeddings convert text into numerical vectors that capture semantic meaning; similar concepts cluster in vector space. This is the foundation of RAG and the reason retrieval can fail when the embedding doesn't capture the right semantic relationship.
You understand inference parameters and how they shape output
Section titled “You understand inference parameters and how they shape output”Temperature, top-p, and top-k all control output randomness in different ways, and their defaults vary by provider. Many reasoning models constrain temperature entirely. Picking the right values for code generation, data extraction, and creative work is a learned skill; the practical bands and provider quirks live in Learning Paths.
You understand structured output and constrained decoding
Section titled “You understand structured output and constrained decoding”"Please return JSON" is unreliable; API-native constrained decoding is the strongest available syntactic guarantee. All major providers (OpenAI, Anthropic, Google) support strict JSON schema enforcement that masks invalid tokens during generation so the model cannot emit schema-violating tokens. Define your schema (Pydantic, Zod), pass it to the structured-output parameter, receive typed responses without regex parsing.
Constrained decoding eliminates malformed output for the schema features the provider supports; it does not eliminate every failure mode. Per Anthropic's own documentation, safety refusals can return non-schema content with stop_reason: "refusal", truncation against max_tokens returns incomplete JSON that requires a retry with a higher limit, sufficiently complex schemas can fail compilation, and the model can still produce schema-valid but semantically wrong output (hallucinated string fields, wrong enum value, fabricated IDs). Treat structured output as the syntactic floor, and keep application-level validation, refusal handling, truncation retries, and semantic checks at the boundary.
You understand adaptive thinking and test-time compute
Section titled “You understand adaptive thinking and test-time compute”Modern reasoning models (Claude with extended thinking, OpenAI's o-series, similar features elsewhere) dynamically allocate more compute to harder problems by generating intermediate reasoning tokens before answering. Cost and latency are now variable. For practical guidance on when to use deeper thinking modes, see Pillar 3: Prompt Engineering.
You understand why LLMs hallucinate and what that means for system design
Section titled “You understand why LLMs hallucinate and what that means for system design”Training and evaluation procedures reward guessing over acknowledging uncertainty, and RLHF amplifies this because human judges prefer detailed, confident answers. Design every system assuming hallucinations will occur.
Mitigation is layered: RAG for grounding, chain-of-thought style rationales paired with external verification (tests, retrieval, tool call results, citations, human review), explicit "refuse rather than guess" instructions, and human review for high-stakes outputs. Chain-of-thought is a generation pattern that often improves performance on multi-step problems; it is not a faithful trace of the model's internal computation, and treating it as "reasoning transparency" overstates what it provides (Anthropic: Reasoning models don't always say what they think). For verification practices, see Pillar 6: Verification and Security.
You understand sycophancy and how to counter it
Section titled “You understand sycophancy and how to counter it”RLHF trained models to agree with humans because evaluators preferred answers that agreed with them. The model learned: agreement = good score. If you ask "Isn't it true that X?" the model tends to confirm even when X is wrong; if you frame a question with emotional investment, the model finds reasons to praise.
Counters include not leading the witness, asking for disagreement, using critical personas, two-pass review, and asking for confidence levels. The full checklist with worked examples lives in Learning Paths: Reducing Sycophancy. Treating model agreement as validation is one of the most common mistakes engineers make.
You understand RAG well enough to reason about quality
Section titled “You understand RAG well enough to reason about quality”You already use RAG daily: Claude Code reads files before answering, Cursor's @codebase matches your query against your project, Copilot indexes your repo to inform completions.
The pipeline (loading, indexing, embedding, retrieval, generation) has knobs at each stage. When a RAG system gives bad answers, you should be able to reason about whether the problem is in retrieval (wrong chunks), generation (hallucinating despite good context), or ingestion (chunked poorly).
You understand agent patterns and recognize them in your daily tools
Section titled “You understand agent patterns and recognize them in your daily tools”Agents are LLMs with tools and loops. Core patterns: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer. When Claude Code plans a multi-file refactor it's orchestrating workers; when you give feedback and it revises you're the evaluator. The principle that matters: use the simplest pattern that works. Single prompt before chain, chain before workflow, workflow before autonomous agent. Most tasks don't need agents. The named prompting strategies behind these patterns are covered in Pillar 3: Prompt Engineering.
You understand function calling, tool design, and how MCP extends it
Section titled “You understand function calling, tool design, and how MCP extends it”Function calling lets LLMs interact with external systems: the model decides when to call, structures arguments, and incorporates the result. MCP standardizes this into a portable protocol.
The critical insight from agent research: tool design matters more than prompts. Make the right action easy and the wrong action impossible at the tool level rather than asking the model politely. Narrow, specific tools outperform broad generic ones; dedicated Grep/Glob/Read tools outperform a single shell tool.
You understand that multimodal input is standard
Section titled “You understand that multimodal input is standard”Image understanding is a default feature across frontier models (OpenAI, Anthropic, Google all ship vision). Common use cases: document processing, chart analysis, OCR, UI screenshot understanding, visual QA. The API pattern is straightforward: pass images as base64 or URLs alongside text. Audio and video are more specialized and vary by provider, but vision is table-stakes.
Anti-patterns
Section titled “Anti-patterns”- Treating LLMs as deterministic systems that always produce the same output for the same input
- Not understanding why the same prompt costs different amounts across models and providers
- Building on top of RAG without understanding why retrieval quality varies
- Not understanding the security implications of function calling (the model can be manipulated into calling tools with malicious arguments)
- Using temperature 1.0 for code generation or 0.0 for creative writing without understanding why those are bad defaults
- Assuming embeddings capture all semantic relationships equally (they don't; domain-specific concepts often need fine-tuned embeddings)
- Treating "please return JSON" as equivalent to constrained decoding (one is a suggestion, the other is a syntactic guarantee with documented exceptions for refusals, truncation, and semantic correctness)
- Building production systems without accounting for hallucination as a structural property of the technology
- Asking the model to validate your approach and treating its agreement as evidence of correctness (sycophancy means the model is biased toward agreeing with you)
- Guessing file paths, function names, or API signatures instead of using tools to verify them (tool-based ground truth is the most effective anti-hallucination mechanism)
- Using an autonomous agent when a simple prompt chain would work (complexity should be earned, not the default)
- Dumping entire files as context when only one function is relevant (every irrelevant token competes for attention with the tokens that matter)
Resources
Section titled “Resources”- Intro to Large Language Models - Andrej Karpathy's accessible overview, the canonical primer on next-token prediction
- HuggingFace LLM Course: Training Pipeline - Pre-training, instruction tuning, and RLHF explained
- The Illustrated Transformer - Visual explanation of the transformer architecture, used in courses at Stanford, Harvard, and MIT
- Attention Is All You Need - The original 2017 paper introducing the transformer
- HuggingFace LLM Course: Tokenizers (Chapter 6) - BPE, WordPiece, and SentencePiece reference
- The Illustrated Word2Vec - Visual guide to embeddings and vector similarity
- Prompting Guide: LLM Settings - Vendor-agnostic explanation of temperature, top-p, and other inference parameters
- OpenAI: Reasoning Models - Adaptive thinking and test-time compute
- OpenAI: Why Language Models Hallucinate - Why training incentives produce hallucination
- Anthropic: Reasoning Models Don't Always Say What They Think - Empirical study of chain-of-thought faithfulness; Claude 3.7 Sonnet acknowledged provided hints only ~25% of the time, supporting the "rationale, not transparency" framing
- LlamaIndex: Understanding RAG - Full RAG pipeline walkthrough from ingestion through retrieval
- Anthropic: Building Effective Agents - Agent architecture patterns and the case for simplicity
- MCP Specification - The Model Context Protocol for portable tool integration
- See Learning Paths for deeper dives on each topic