Pillar 7: Workflow and Tooling

Be the orchestrator, not the bottleneck. Leverage agents, sessions, and tools.

The math of productivity changed when AI agents became capable of autonomous work. Every moment you spend on delegatable work blocks not just you, but all the parallel processes you could have spawned. Think like a CPU scheduler: your attention is the scarcest resource in the system. Before touching any task, ask yourself whether it could run in parallel while you work on something else.

That said, autonomy is a dial, not a switch. Start with tight human-in-the-loop steering (max 3 turns between check-ins) until you trust the prompt and verification criteria, then open it up for longer autonomous runs.
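One way to picture the dial is a loop that pauses for human review every few turns. The sketch below is illustrative only: `run_with_checkins`, `execute`, and `approve` are hypothetical names rather than any tool's API, and the default of 3 simply mirrors the check-in limit suggested above.

```python
from typing import Callable, Iterable


def run_with_checkins(
    steps: Iterable[str],
    execute: Callable[[str], str],
    approve: Callable[[list], bool],
    max_turns_between_checkins: int = 3,
) -> list:
    """Run agent steps, pausing for human review every N turns.

    `execute` stands in for one agent turn; `approve` stands in for the
    human check-in. Both are hypothetical callbacks.
    """
    completed: list = []
    pending_review: list = []
    for step in steps:
        pending_review.append(execute(step))
        if len(pending_review) >= max_turns_between_checkins:
            if not approve(pending_review):
                return completed  # stop; the rejected batch is not kept
            completed.extend(pending_review)
            pending_review = []
    # Final check-in for any turns left over after the last full batch.
    if pending_review and approve(pending_review):
        completed.extend(pending_review)
    return completed
```

Raising `max_turns_between_checkins` is the "opening up" step: the code path is identical, only the interval between reviews grows.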

What We Expect

You understand session management and use it deliberately

Start new sessions for new tasks. Most AI coding tools support resuming previous sessions and labeling them for retrieval. Use whatever your tool exposes; the principle (clean session per task) survives the specific commands. Your session strategy directly affects context quality: a clean session for a focused task produces better results than a sprawling conversation covering multiple concerns.

You give your AI tools, not just instructions

Why describe your database schema when the AI could query it directly? Why explain API contracts when it could read the OpenAPI spec through MCP? AI with well-scoped tools usually outperforms AI without tools, especially for tasks where the answer depends on facts the model cannot infer. Invest in configuring MCP servers, browser automation, and test runners that the AI can use in its agent loop.

You scope tool access to least privilege

Tools beat instructions only when scoped correctly. The same MCP server that lets an agent read your OpenAPI spec can, misconfigured, let it write to production. Default to read-only access; mutating tools require an approval gate or explicit human-in-the-loop confirmation. Default to non-production environments; production data and production-mutating tools require explicit policy approval, not implicit availability.

Audit and log tool calls with enough fidelity to reconstruct what happened during an autonomous run. Limit credential scope (per-environment tokens, short-lived federated identity, no long-lived static keys) per Pillar 10. The right action should be easy; the wrong action should be impossible at the tool layer, not the prompt.
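A minimal sketch of enforcing these rules at the tool layer rather than in the prompt. Everything here is hypothetical (`Tool`, `ToolGate`, `deny_all`, the environment names); it is not the API of MCP or of any agent framework. As a simplification, it blocks mutating calls in production outright and does not model policy approval for production reads.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable


def deny_all(tool_name: str, args: dict) -> bool:
    """Default approval callback: refuse every mutating call."""
    return False


@dataclass
class Tool:
    name: str
    func: Callable[..., Any]
    mutating: bool = False  # read-only unless declared otherwise


@dataclass
class ToolGate:
    environment: str = "staging"  # hypothetical environment label
    approve: Callable[[str, dict], bool] = deny_all
    audit_log: list = field(default_factory=list)

    def call(self, tool: Tool, **args: Any) -> Any:
        if not tool.mutating:
            allowed, reason = True, "read-only"
        elif self.environment == "production":
            allowed, reason = False, "mutating tool blocked in production"
        else:
            allowed = self.approve(tool.name, args)
            reason = "approved" if allowed else "approval denied"
        # Log every attempt, including denied ones, before acting.
        self.audit_log.append(json.dumps({
            "time": time.time(), "tool": tool.name, "args": args,
            "environment": self.environment,
            "allowed": allowed, "reason": reason,
        }, default=str))
        if not allowed:
            raise PermissionError(f"{tool.name}: {reason}")
        return tool.func(**args)
```

Because the gate wraps the call itself, a prompt that asks the agent to "be careful" is never the only thing standing between it and a write.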

You leverage Skills for repeatable expertise

Skills are folder-based packages (a SKILL.md plus optional scripts and references) that teach an agent how to perform a specific task: a code-review checklist, a deployment runbook, a doc-generation pattern, a frontend component scaffold. Originally built by Anthropic, Agent Skills is now an open standard supported across Claude Code, Cursor, Codex, GitHub Copilot, and roughly thirty other agents. Where MCP tools give the agent new actions, skills give it new procedural knowledge.

Use community skills for general workflows; write your own when a task is recurring and team-specific. Install team-shared skills per-project (checked into source control) and personal-utility skills globally. Vet anything you install per Pillar 10: skills run with the agent's permissions, and unvetted marketplaces are a real supply-chain surface.

You choose models deliberately based on the landscape

The gap between top models has compressed significantly; the difference between the top model and the 10th-ranked is roughly 5% (as of Q1 2026). Different models lead different capability domains: some lead coding benchmarks (SWE-bench), others lead multimodal benchmarks (MMMU-Pro), and others offer the largest context windows. Open-source models can provide 10-100x cost savings for simpler tasks.

Read benchmarks critically. MMLU and HumanEval are saturated for frontier models and no longer differentiate them. More meaningful signals come from SWE-bench (real code on actual GitHub issues), LM Arena Vision (human preference for multimodal), and independent reproductions like Artificial Analysis and Vals.ai. Task-specific evaluation on your representative workloads matters more than any leaderboard score.

Translate that landscape into selection. A practical starting point: reserve your most capable model for roughly 30% of your work (complex reasoning, architectural planning, tricky debugging) and use a faster, cheaper model for the remaining 70% (routine implementation, boilerplate, test writing, documentation). Model routing (directing simple queries to cheap/fast models and complex queries to expensive/capable ones) is becoming a production standard; UC Berkeley's RouteLLM (ICLR 2025) reported cost reductions of over 85% on MT Bench while retaining 95% of GPT-4's performance.
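The simplest form of routing is a static rule keyed on task type. The sketch below assumes hypothetical task labels and placeholder model names (`capable-model`, `fast-model`); it is a hand-written rule, not RouteLLM's learned router.

```python
# Task types that justify the most capable model, taken from the
# examples above. The label strings themselves are hypothetical.
COMPLEX_TASKS = {"complex_reasoning", "architecture", "debugging"}


def route_model(
    task_type: str,
    capable: str = "capable-model",  # placeholder model name
    fast: str = "fast-model",        # placeholder model name
) -> str:
    """Return the model to use for a task; default to the cheaper one."""
    return capable if task_type in COMPLEX_TASKS else fast
```

Defaulting to the cheaper model and escalating by exception keeps the premium model reserved for the minority of work that needs it.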

You apply the prompting vs. RAG vs. fine-tuning hierarchy to LLM-powered features

Before building any LLM-powered feature, choose the right approach. The widely accepted hierarchy is: start with prompt engineering (hours to implement, near-zero cost), escalate to RAG when you need current or proprietary data, and fine-tune only when persistent behavioral changes are required (weeks to implement, significant cost).

The key diagnostic question: "Do we need new facts, or new behavior?" New facts point to RAG; if your team is building an internal tool that answers questions about company policies or product docs that change regularly, you need retrieval because the model simply does not have those facts. New behavior (tone, style, complex classification patterns) points to fine-tuning; if you need every response across thousands of requests to match a specific house style and format, and prompt instructions alone are not producing the consistency you need, that is a fine-tuning case.

A critical misconception to avoid: fine-tuning does not reliably inject new knowledge; it changes behavior and style, not factual recall. Growing context windows (now 1M+ tokens in some models) are also shifting some RAG use cases back to prompt engineering, since you can fit entire document sets in-context. If a customer support bot's full response guidelines and FAQ content fit in a single prompt, start there before building a retrieval pipeline.
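The hierarchy and the diagnostic question can be written down as a small decision helper. This is a sketch of the reasoning above, not a rule engine; the function and parameter names are hypothetical, and real decisions will weigh more factors.

```python
def choose_approaches(
    needs_new_facts: bool,
    needs_new_behavior: bool,
    facts_fit_in_context: bool = False,
    prompting_is_consistent_enough: bool = True,
) -> list:
    """Return the approaches to apply, in escalation order."""
    approaches = ["prompt engineering"]  # always the starting point
    # New facts point to RAG, unless they fit in a single prompt.
    if needs_new_facts and not facts_fit_in_context:
        approaches.append("RAG")
    # New behavior points to fine-tuning, but only once prompt
    # instructions alone have failed to give consistent results.
    if needs_new_behavior and not prompting_is_consistent_enough:
        approaches.append("fine-tuning")
    return approaches
```

Note that fine-tuning is never selected on the basis of `needs_new_facts`, which mirrors the misconception called out above.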

You treat cost awareness as a professional skill

All major providers charge separately for input and output tokens, with output tokens costing 3-5x more than input. The cost difference between model tiers within a single provider can be 10-15x or more. Check your provider's current pricing page; these numbers shift regularly as competition drives costs down.

Three cost levers matter most: prompt caching (saves 60-90% on repeated prefixes), batch APIs (significant discounts for latency-insensitive workloads), and model routing (directing simple tasks to cheaper models). Developers who ignore cost optimization either burn through budgets that get their AI access revoked or avoid using AI where it would help because they assume everything is expensive. Track your usage so you can make data-informed decisions about where premium models earn their price and where lighter models do the job.
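A worked example of the arithmetic helps make the levers concrete. The prices below ($3 and $15 per million tokens) are made-up illustrative values, not any provider's rates, and the 90% cache discount is the upper end of the range quoted above. The sketch also ignores any surcharge a provider may apply when first writing to the cache.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    cached_tokens: int = 0,
    cache_discount: float = 0.9,
) -> float:
    """Estimate the dollar cost of one request.

    `cached_tokens` is the part of the input served from a prompt cache
    and billed at (1 - cache_discount) of the normal input price.
    """
    uncached = input_tokens - cached_tokens
    billed_input = uncached + cached_tokens * (1 - cache_discount)
    input_cost = billed_input * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: $3 per million input tokens, $15 per million output.
baseline = request_cost(10_000, 1_000, 3.0, 15.0)
with_cache = request_cost(10_000, 1_000, 3.0, 15.0, cached_tokens=8_000)
print(f"no cache: ${baseline:.4f}, 8k-token cached prefix: ${with_cache:.4f}")
```

Under these assumed prices the uncached request costs about 4.5 cents and the cached one about 2.3 cents. The output tokens are untouched by caching, which is why verbose responses dominate cost once the prefix is cached.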

You leverage structured workflows for complex tasks

For work that exceeds a single session's capacity, use spec files and plan documents as state persistence. Your progress lives in markdown files with completion status. Combined with session resume, this gives you workflow resilience: if you are interrupted or the session ends, you pick up exactly where you left off. See Pillar 2: Planning Before Code for how to structure these artifacts.
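As a sketch of how a plan document can act as resumable state, the snippet below reads completion status from a markdown file and finds the next open step. It assumes the plan uses task-list checkboxes (`- [ ]` and `- [x]`), which is one common convention and not something this guide prescribes; the function names are hypothetical.

```python
import re
from typing import Optional

# Matches task-list lines such as "- [ ] Write tests" or "* [x] Done".
CHECKBOX = re.compile(r"^\s*[-*] \[( |x|X)\] (.+)$")


def parse_plan(markdown: str) -> list:
    """Return (done, description) pairs for each checkbox line."""
    steps = []
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if match:
            done = match.group(1).lower() == "x"
            steps.append((done, match.group(2).strip()))
    return steps


def next_step(markdown: str) -> Optional[str]:
    """Return the first incomplete step, or None if the plan is done."""
    for done, description in parse_plan(markdown):
        if not done:
            return description
    return None
```

A resumed session can be pointed at the result of `next_step` instead of re-reading the whole conversation history.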

You use version control as your safety net for autonomous runs

Before letting an AI run autonomously, ensure you have a clean commit point. Check the AI's work at intervals. The longer the leash, the more important the rollback strategy. See Pillar 6: Verification and Security for the full verification framework.
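A minimal pre-flight check, assuming the project is a Git repository: `git status --porcelain` prints nothing when the working tree is clean, so an empty result means there is a clean commit point to roll back to. The function names are hypothetical helpers, not part of any tool.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """True when `git status --porcelain` reported no changes."""
    return porcelain_output.strip() == ""


def ready_for_autonomous_run(repo_dir: str = ".") -> bool:
    """Check that the working tree has no uncommitted or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if git itself fails
    )
    return is_clean(result.stdout)
```

Running this before handing over control, and committing at each verification interval, keeps every rollback target a known-good state.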

Anti-patterns

Doing work manually that agents can do in parallel
Using the same session for hours across multiple unrelated tasks
Not configuring tools (MCP, hooks, test runners) that would make the AI self-sufficient
Wiring an agent into a production database or mutating API without read-only scoping, approval gates, or audit logging
Re-explaining the same procedural knowledge to the AI in every session instead of capturing it as a skill
Using the cheapest model for complex reasoning tasks where a more capable model would save time on rework
Running everything on the most expensive model without considering whether a lighter model would produce equivalent results
Not tracking token usage or understanding the cost implications of large context windows
Running long autonomous sessions without commit checkpoints or verification criteria
Not learning the slash commands and capabilities of your tools; new features ship regularly
Assuming fine-tuning will fix factual accuracy problems (it won't; that's RAG's job)
Reaching for RAG when the data would fit in a single prompt with a large context window

Resources

MCP Protocol - The Model Context Protocol specification for extending AI capabilities with external tools
MCP Apps - Interactive UI components served by MCP servers, rendering dashboards, forms, and visualizations directly in the conversation
Chatbot Arena (LMSYS) - Live model rankings based on human preference voting
LM Arena Vision Leaderboard - Multimodal model rankings
SWE-bench - Real-world coding benchmark for evaluating AI on actual GitHub issues
MMMU-Pro - Multimodal understanding benchmark
Artificial Analysis: MMMU-Pro - Independent benchmark evaluations
Vals.ai: MMMU - Independent benchmark reproduction
RouteLLM: Model Routing (ICLR 2025) - Over 85% cost reduction on MT Bench while retaining 95% of GPT-4's performance through learned model routing
Anthropic: Prompt Caching - How prompt caching works for cost optimization
Anthropic Model Overview - Model capabilities, context windows, and pricing tiers (representative of how providers structure offerings)
See Learning Paths for deeper dives
