Pillar 0: LLM Foundations

You understand the machinery well enough to reason about LLM behavior.

Every other pillar assumes you can reason about why LLMs behave the way they do. Not at a research level, but at the level where your mental model matches reality closely enough to make good decisions. When you understand that transformers process tokens, that attention is how the model decides what matters, and that embeddings are how meaning becomes math, you stop treating AI as a magic box and start treating it as an engineered system you can reason about.

An LLM generates one token at a time, left to right, each time asking "given everything so far, what is the most probable next token?" This single mechanism explains why prompt order matters, why "think step by step" works, why few-shot examples work, and why temperature is not a magic dial. Every prompting technique, agent pattern, and reliability strategy in this document is a consequence of it.
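A minimal sketch of that loop, assuming a hypothetical lookup table in place of a real model. The tokens and probabilities are made up for illustration, and a real LLM conditions on the entire context rather than only the previous token.

```python
# Hypothetical next-token table: previous token -> candidate next tokens.
# The probabilities are invented for illustration only.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "sat": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
}


def generate_greedy(max_tokens: int = 10) -> list[str]:
    """Generate left to right, always picking the most probable next token."""
    tokens = ["<s>"]
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(candidates, key=candidates.get)
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


print(generate_greedy())  # ['the', 'cat', 'sat']
```

Each choice depends only on what came before it, which is why text placed earlier in a prompt shapes everything generated after it.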

Models go through three stages: pre-training (raw text completion on internet-scale data; the base model is just an autocomplete engine), instruction tuning (what makes it follow prompts), and alignment via RLHF or Constitutional AI (what makes it helpful and willing to refuse harmful requests, but also, problematically, sycophantic). This pipeline explains why models over-help, why system prompts work, and why the same base model behaves completely differently after different fine-tuning.

The context window is the total token budget shared between your input and the model's output. Attention is not uniform across this window: models tend to use information at the beginning and end more reliably than information in the middle. Longer context means more computation, higher latency, and higher cost. See Pillar 1: Context Engineering for the practical techniques that follow from this.
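The shared budget can be sketched as simple arithmetic. The function name, the optional reserve, and the numbers are hypothetical; real window sizes and token counts depend on the model and its tokenizer.

```python
def max_output_tokens(context_window: int, input_tokens: int, reserve: int = 0) -> int:
    """Tokens left for the model's output once the input has been counted.

    `reserve` is an optional safety margin (an assumption of this sketch,
    not a provider parameter).
    """
    remaining = context_window - input_tokens - reserve
    return max(remaining, 0)  # an oversized input leaves no room for output


# Hypothetical 8,000-token window with a 6,500-token prompt.
print(max_output_tokens(8000, 6500))  # 1500
```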

Modern LLMs are built on the transformer (Attention Is All You Need, 2017). You don't need to implement one; you should be able to explain that self-attention lets the model weigh the relevance of every part of the input when producing each part of the output. The deeper mechanics are in Learning Paths.
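For intuition, here is a minimal sketch of scaled dot-product attention in plain Python. It leaves out the learned query, key, and value projections, multiple heads, and masking that a real transformer uses, so treat it as an illustration of the weighting idea only.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a relevance-weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Relevance of every input position to this query.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Blend the value vectors according to those weights.
        blended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(blended)
    return outputs
```

When every key is equally relevant to a query, the weights are uniform and the output is the plain average of the values; a key that matches the query better pulls the output toward its value.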

LLMs see tokens (sub-word units produced by algorithms like BPE), not characters or words. Code tokenizes less efficiently than prose, unusual variable names consume more tokens than common ones, and token boundaries affect how the model processes input. This is why prompt length drives cost and why "just add more context" has a price tag.
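A toy sketch of the BPE idea: repeatedly merge the most frequent adjacent pair of symbols. Real tokenizers learn their merges once from a large training corpus and then apply that fixed vocabulary; this simplified version learns merges from the input string itself, and the function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace each occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text: str, num_merges: int) -> list[str]:
    """Start from characters and apply up to `num_merges` greedy merges."""
    tokens = list(text)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens


print(toy_bpe("abab", 1))  # ['ab', 'ab']
```

Frequent sequences collapse into few tokens while rare ones stay fragmented, which is one way to see why unusual identifiers cost more tokens than common words.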

Embeddings convert text into numerical vectors that capture semantic meaning; similar concepts cluster in vector space. This is the foundation of RAG and the reason retrieval can fail when the embedding doesn't capture the right semantic relationship.
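To make "cluster in vector space" concrete, here is a small sketch using cosine similarity. The three-dimensional vectors are hand-made, hypothetical stand-ins; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for illustration.
EMBEDDINGS = {
    "kitten": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}


def nearest(word: str) -> str:
    """Return the other word whose vector is most similar to `word`."""
    others = [w for w in EMBEDDINGS if w != word]
    return max(others, key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]))


print(nearest("kitten"))  # cat
```

Retrieval built on this only works as well as the vectors do: if two related concepts end up far apart in the space, no similarity search will connect them.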

Temperature, top-p, and top-k all control output randomness in different ways, and their defaults vary by provider. Many reasoning models constrain temperature entirely. Picking the right values for code generation, data extraction, and creative work is a learned skill; the practical bands and provider quirks live in Learning Paths.
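The three controls can be sketched as transformations of the next-token distribution. This is a simplified illustration with hypothetical function names; providers differ in how they combine and order these steps, and in how they handle a temperature of zero.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Lower temperature sharpens the distribution; higher flattens it.

    Temperature must be > 0 in this sketch.
    """
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs: list[float], keep: set[int]) -> list[float]:
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs: list[float], k: int) -> list[float]:
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def top_p_filter(probs: list[float], p: float) -> list[float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)
```

Temperature reshapes the whole distribution, while top-k and top-p cut off its tail before sampling, so the three settings interact rather than acting as independent dials.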

"Please return JSON" is unreliable; API-native constrained decoding is the strongest available syntactic guarantee. All major providers (OpenAI, Anthropic, Google) support strict JSON schema enforcement, which masks invalid tokens during generation so the model cannot emit output that violates the schema. Define your schema (Pydantic, Zod), pass it to the structured-output parameter, and receive typed responses without regex parsing.

Constrained decoding eliminates malformed output for the schema features the provider supports; it does not eliminate every failure mode. Per Anthropic's own documentation, safety refusals can return non-schema content with stop_reason: "refusal", truncation against max_tokens returns incomplete JSON that requires a retry with a higher limit, sufficiently complex schemas can fail compilation, and the model can still produce schema-valid but semantically wrong output (hallucinated string fields, wrong enum value, fabricated IDs). Treat structured output as the syntactic floor, and keep application-level validation, refusal handling, truncation retries, and semantic checks at the boundary.
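The boundary checks described above might look like the following sketch. The response shape (a dict with `stop_reason` and `text`), the status strings, and the `ticket_id` field are assumptions made for illustration; check your provider's SDK for the actual response object and stop-reason values.

```python
import json


def parse_structured_response(response: dict, known_ticket_ids: set[str]) -> dict:
    """Classify a structured-output response before trusting its contents.

    `response` is a hypothetical, simplified shape: {"stop_reason": ..., "text": ...}.
    """
    stop_reason = response.get("stop_reason")
    if stop_reason == "refusal":
        # Refusals may carry non-schema content; do not try to parse it.
        return {"status": "refused"}
    if stop_reason == "max_tokens":
        # Truncated output is likely incomplete JSON; retry with a higher limit.
        return {"status": "retry_with_higher_limit"}
    try:
        data = json.loads(response["text"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return {"status": "invalid_json"}
    # Semantic check: schema-valid output can still reference a fabricated ID.
    if not isinstance(data, dict) or data.get("ticket_id") not in known_ticket_ids:
        return {"status": "semantic_error"}
    return {"status": "ok", "data": data}
```

The ordering is deliberate: stop-reason checks come first because parsing a refusal or a truncated payload would only produce a misleading JSON error.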

Modern reasoning models (Claude with extended thinking, OpenAI's o-series, similar features elsewhere) dynamically allocate more compute to harder problems by generating intermediate reasoning tokens before answering. Cost and latency are now variable. For practical guidance on when to use deeper thinking modes, see Pillar 3: Prompt Engineering.

Hallucination is structural: training and evaluation procedures reward guessing over acknowledging uncertainty, and RLHF amplifies this because human judges prefer detailed, confident answers. Design every system assuming hallucinations will occur.

Mitigation is layered: RAG for grounding, chain-of-thought-style rationales paired with external verification (tests, retrieval, tool call results, citations), explicit "refuse rather than guess" instructions, and human review for high-stakes outputs. Chain-of-thought is a generation pattern that often improves performance on multi-step problems; it is not a faithful trace of the model's internal computation, and treating it as "reasoning transparency" overstates what it provides (Anthropic: Reasoning models don't always say what they think). For verification practices, see Pillar 6: Verification and Security.

You understand sycophancy and how to counter it

RLHF trained models to agree with humans because evaluators preferred answers that agreed with them. The model learned: agreement = good score. If you ask "Isn't it true that X?" the model tends to confirm even when X is wrong; if you frame a question with emotional investment, the model finds reasons to praise.

Counters include not leading the witness, asking for disagreement, using critical personas, two-pass review, and asking for confidence levels. The full checklist with worked examples lives in Learning Paths: Reducing Sycophancy. Treating model agreement as validation is one of the most common mistakes engineers make.

You already use RAG daily: Claude Code reads files before answering, Cursor's @codebase matches your query against your project, Copilot indexes your repo to inform completions.

The pipeline (loading, indexing, embedding, retrieval, generation) has knobs at each stage. When a RAG system gives bad answers, you should be able to reason about whether the problem is in retrieval (wrong chunks), generation (hallucinating despite good context), or ingestion (chunked poorly).
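The stages can be sketched end to end with a toy pipeline. Word-count vectors stand in for a real embedding model here, and every function name is hypothetical; the point is to show where each knob sits, not how a production system is built.

```python
import math
from collections import Counter


def chunk(text: str, size: int) -> list[str]:
    """Ingestion knob: split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Retrieval knob: return the `top_k` chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Generation input: retrieved context followed by the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Even this toy version shows a retrieval failure mode: a query containing "expire" shares no token with a chunk containing "expires", so exact-word matching gives that pair no credit. Real embeddings narrow such gaps but do not remove them.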

Agents are LLMs with tools and loops. Core patterns: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer. When Claude Code plans a multi-file refactor it's orchestrating workers; when you give feedback and it revises you're the evaluator. The principle that matters: use the simplest pattern that works. Single prompt before chain, chain before workflow, workflow before autonomous agent. Most tasks don't need agents. The named prompting strategies behind these patterns are covered in Pillar 3: Prompt Engineering.
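Two of these patterns can be sketched as plain control flow around model calls. The callables passed in are hypothetical stand-ins for whatever function invokes your model; nothing here is tied to a specific SDK.

```python
from typing import Callable, Optional


def run_chain(steps: list[Callable[[str], str]], initial_input: str) -> str:
    """Prompt chaining: each step's output becomes the next step's input."""
    result = initial_input
    for step in steps:
        result = step(result)
    return result


def evaluator_optimizer(
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Evaluator-optimizer: revise a draft until the evaluator has no feedback.

    `generate(task, feedback)` returns a draft; `evaluate(draft)` returns
    feedback text, or None when the draft is acceptable.
    """
    draft = generate(task, None)
    for _ in range(max_rounds):
        feedback = evaluate(draft)
        if feedback is None:
            return draft
        draft = generate(task, feedback)
    return draft  # best effort once the round budget is spent
```

Both are ordinary loops with a fixed shape, which is the sense in which a chain or workflow is simpler than an autonomous agent: the code, not the model, decides what happens next.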

Function calling lets LLMs interact with external systems: the model decides when to call, structures arguments, and incorporates the result. MCP standardizes this into a portable protocol.

The critical insight from agent research: tool design matters more than prompts. Make the right action easy and the wrong action impossible at the tool level rather than asking the model politely. Narrow, specific tools outperform broad generic ones; dedicated Grep/Glob/Read tools outperform a single shell tool.
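As one illustration of enforcing constraints at the tool level, here is a sketch of a narrow file-reading tool that refuses any path outside a project root. The class name and interface are hypothetical and not taken from any agent framework; `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


class ReadFileTool:
    """A narrow tool: read one UTF-8 text file inside a fixed root directory."""

    def __init__(self, root: str):
        self.root = Path(root).resolve()

    def run(self, relative_path: str) -> str:
        # Resolve symlinks and ".." segments before checking containment.
        target = (self.root / relative_path).resolve()
        if not target.is_relative_to(self.root):
            # The wrong action is rejected here, whatever the prompt said.
            raise PermissionError(f"{relative_path!r} is outside the project root")
        return target.read_text(encoding="utf-8")
```

Compared with a general shell tool, this leaves the model nothing to get wrong beyond the path itself, and a manipulated argument such as `../secrets.txt` fails in code rather than relying on the model to decline.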

Image understanding is a default feature across frontier models (OpenAI, Anthropic, and Google all ship vision). Common use cases: document processing, chart analysis, OCR, UI screenshot understanding, visual QA. The API pattern is straightforward: pass images as base64 or URLs alongside text. Audio and video are more specialized and vary by provider, but vision is table stakes.

  • Treating LLMs as deterministic systems that always produce the same output for the same input
  • Not understanding why the same prompt costs different amounts across models and providers
  • Building on top of RAG without understanding why retrieval quality varies
  • Not understanding the security implications of function calling (the model can be manipulated into calling tools with malicious arguments)
  • Using temperature 1.0 for code generation or 0.0 for creative writing without understanding why those are bad defaults
  • Assuming embeddings capture all semantic relationships equally (they don't; domain-specific concepts often need fine-tuned embeddings)
  • Treating "please return JSON" as equivalent to constrained decoding (one is a suggestion, the other is a syntactic guarantee with documented exceptions for refusals, truncation, and semantic correctness)
  • Building production systems without accounting for hallucination as a structural property of the technology
  • Asking the model to validate your approach and treating its agreement as evidence of correctness (sycophancy means the model is biased toward agreeing with you)
  • Guessing file paths, function names, or API signatures instead of using tools to verify them (tool-based ground truth is the most effective anti-hallucination mechanism)
  • Using an autonomous agent when a simple prompt chain would work (complexity should be earned, not the default)
  • Dumping entire files as context when only one function is relevant (every irrelevant token competes for attention with the tokens that matter)