Pillar 7: Workflow and Tooling

Be the orchestrator, not the bottleneck. Leverage agents, sessions, and tools.

The math of productivity changed when AI agents became capable of autonomous work. Every moment you spend on delegatable work blocks not just you, but all the parallel processes you could have spawned. Think like a CPU scheduler: your attention is the scarcest resource in the system. Before touching any task, ask yourself whether it could run in parallel while you work on something else.

That said, autonomy is a dial, not a switch. Start with tight human-in-the-loop steering (max 3 turns between check-ins) until you trust the prompt and verification criteria, then open it up for longer autonomous runs.
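One way to picture the dial is as a check-in interval on the agent loop. The sketch below is illustrative only: `agent_step` and `approve` are hypothetical callbacks standing in for whatever your tool exposes, and the default interval of 3 simply mirrors the starting point suggested above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunLog:
    turns_run: int
    checkins: List[int]  # turn numbers at which a human reviewed progress


def run_with_checkins(
    agent_step: Callable[[int], bool],
    approve: Callable[[int], bool],
    max_turns: int,
    checkin_every: int = 3,
) -> RunLog:
    """Run agent turns, pausing for human review every `checkin_every` turns.

    agent_step(turn) returns True when the task is finished.
    approve(turn) returns False when the human wants to stop the run.
    Both callbacks are hypothetical placeholders.
    """
    if checkin_every < 1:
        raise ValueError("checkin_every must be at least 1")
    checkins: List[int] = []
    for turn in range(1, max_turns + 1):
        if agent_step(turn):
            return RunLog(turn, checkins)
        if turn % checkin_every == 0:
            checkins.append(turn)
            if not approve(turn):
                return RunLog(turn, checkins)
    return RunLog(max_turns, checkins)
```

Raising `checkin_every` as trust in the prompt and verification criteria grows is the "opening up" step; nothing else in the loop needs to change.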

You understand session management and use it deliberately

Start new sessions for new tasks. Most AI coding tools support resuming previous sessions and labeling them for retrieval. Use whatever your tool exposes; the principle (clean session per task) survives the specific commands. Your session strategy directly affects context quality: a clean session for a focused task produces better results than a sprawling conversation covering multiple concerns.

Why describe your database schema when the AI could query it directly? Why explain API contracts when it could read the OpenAPI spec through MCP? AI with well-scoped tools usually outperforms AI without tools, especially for tasks where the answer depends on facts the model cannot infer. Invest in configuring MCP servers, browser automation, and test runners that the AI can use in its agent loop.
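As a rough illustration of "let the AI query it directly", here is a minimal sketch of a schema-introspection helper that could sit behind a tool. It assumes a local SQLite file and uses the standard-library `sqlite3` module; the function name `describe_schema` is hypothetical, and a real setup would usually expose something like this through an MCP server rather than a bare function.

```python
import sqlite3
from typing import Dict, List


def describe_schema(db_path: str) -> Dict[str, List[str]]:
    """Return {table: ["column TYPE", ...]} for a SQLite database.

    The connection is opened with mode=ro so this helper cannot write.
    Assumes db_path contains no characters that need URI escaping.
    """
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        schema: Dict[str, List[str]] = {}
        for table in tables:
            # table_info rows are (cid, name, type, notnull, dflt_value, pk)
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [f"{col[1]} {col[2]}" for col in columns]
        return schema
    finally:
        conn.close()
```

The output is compact enough to hand to the model as-is, and it stays correct when the schema changes, which a hand-written description does not.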

Tools beat instructions only when scoped correctly. The same MCP server that lets an agent read your OpenAPI spec can, misconfigured, let it write to production. Default to read-only access; mutating tools require an approval gate or explicit human-in-the-loop confirmation. Default to non-production environments; production data and production-mutating tools require explicit policy approval, not implicit availability.

Audit and log tool calls with enough fidelity to reconstruct what happened during an autonomous run. Limit credential scope (per-environment tokens, short-lived federated identity, no long-lived static keys) per Pillar 10. The right action should be easy; the wrong action should be blocked at the tool layer, not merely discouraged in the prompt.
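The scoping and audit rules above can be sketched as a thin gate in front of every tool call. This is an assumption-laden illustration, not a real MCP API: `ToolGate`, `ToolDenied`, and the `approver` callback are hypothetical names, and the policy shown (mutating tools are refused outright in production and need approval elsewhere) is one strict reading of the guidance.

```python
import json
import time
from typing import Any, Callable, Dict, List, Tuple


class ToolDenied(Exception):
    """Raised when the gate refuses a tool call."""


class ToolGate:
    """Hypothetical wrapper: read-only by default, approval for mutations."""

    def __init__(
        self,
        approver: Callable[[str, Dict[str, Any]], bool],
        environment: str = "staging",
    ) -> None:
        self._tools: Dict[str, Tuple[Callable[..., Any], bool]] = {}
        self._approver = approver
        self.environment = environment
        # JSON lines kept in memory here; a real system would persist them.
        self.audit_log: List[str] = []

    def register(self, name: str, fn: Callable[..., Any], mutating: bool = False) -> None:
        self._tools[name] = (fn, mutating)

    def call(self, name: str, **kwargs: Any) -> Any:
        fn, mutating = self._tools[name]  # unknown tools raise KeyError
        allowed = True
        if mutating:
            allowed = self.environment != "production" and self._approver(name, kwargs)
        # Record every attempt, including denied ones.
        self.audit_log.append(
            json.dumps(
                {
                    "ts": time.time(),
                    "tool": name,
                    "args": kwargs,
                    "mutating": mutating,
                    "environment": self.environment,
                    "allowed": allowed,
                },
                default=str,
            )
        )
        if not allowed:
            raise ToolDenied(f"{name} was not approved")
        return fn(**kwargs)
```

Because denied calls are logged too, the audit trail shows what the agent tried to do, not only what it was permitted to do.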

Skills are folder-based packages (a SKILL.md plus optional scripts and references) that teach an agent how to perform a specific task: a code-review checklist, a deployment runbook, a doc-generation pattern, a frontend component scaffold. Originally built by Anthropic, Agent Skills is now an open standard supported across Claude Code, Cursor, Codex, GitHub Copilot, and roughly thirty other agents. Where MCP tools give the agent new actions, skills give it new procedural knowledge.

Use community skills for general workflows; write your own when a task is recurring and team-specific. Install team-shared skills per-project (checked into source control) and personal-utility skills globally. Vet anything you install per Pillar 10: skills run with the agent's permissions, and unvetted marketplaces are a real supply-chain surface.
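To make the folder shape concrete, here is a small sketch that scaffolds a team-specific skill. It assumes a SKILL.md that opens with YAML frontmatter carrying `name` and `description` fields; treat that layout, the `scaffold_skill` helper, and the target directory as assumptions, and check your agent's documentation for the exact location and fields it expects.

```python
from pathlib import Path
from typing import List

# Assumed SKILL.md layout: frontmatter with name/description, then instructions.
SKILL_TEMPLATE = """---
name: {name}
description: {description}
---

# {title}

## Steps

{steps}
"""


def scaffold_skill(root: Path, name: str, description: str, steps: List[str]) -> Path:
    """Create <root>/<name>/SKILL.md and return its path (hypothetical helper)."""
    skill_dir = root / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))
    body = SKILL_TEMPLATE.format(
        name=name,
        description=description,
        title=name.replace("-", " ").title(),
        steps=numbered,
    )
    skill_path = skill_dir / "SKILL.md"
    skill_path.write_text(body, encoding="utf-8")
    return skill_path
```

Pointing `root` at a directory inside the repository gives the per-project, source-controlled install; pointing it at a user-level directory gives the global one.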

You choose models deliberately based on the landscape

The gap between top models has compressed significantly; the difference between the top model and the 10th-ranked is roughly 5% (as of Q1 2026). Different models lead different capability domains: some lead coding benchmarks (SWE-bench), others lead multimodal (MMMU-Pro), others offer the largest context windows. Open-source models provide 10-100x cost savings for simpler tasks.

Read benchmarks critically. MMLU and HumanEval are saturated for frontier models and no longer differentiate them. More meaningful signals come from SWE-bench (real code on actual GitHub issues), LM Arena Vision (human preference for multimodal), and independent reproductions like Artificial Analysis and Vals.ai. Task-specific evaluation on your representative workloads matters more than any leaderboard score.

Translate that landscape into selection. A practical starting point: reserve your most capable model for roughly 30% of your work (complex reasoning, architectural planning, tricky debugging) and use a faster, cheaper model for the remaining 70% (routine implementation, boilerplate, test writing, documentation). Model routing (directing simple queries to cheap/fast models and complex queries to expensive/capable ones) is becoming a production standard; UC Berkeley's RouteLLM (ICLR 2025) demonstrated 85%+ cost reduction while maintaining 95% of top-model quality.
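A deliberately simple sketch of routing follows. It is a keyword-and-size heuristic, not RouteLLM's learned router, and every name in it (`route_task`, the hint list, the model labels, the file threshold) is a placeholder to replace with signals from your own workload.

```python
from dataclasses import dataclass

# Hypothetical signals that a task needs the more capable model.
COMPLEX_HINTS = ("architecture", "design", "debug", "race condition", "migrate", "refactor")


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def route_task(
    description: str,
    files_touched: int,
    *,
    capable: str = "capable-model",  # placeholder label, not a real model name
    fast: str = "fast-model",        # placeholder label, not a real model name
    file_threshold: int = 5,
) -> Route:
    """Pick a model tier from a task description and its blast radius."""
    text = description.lower()
    hits = [hint for hint in COMPLEX_HINTS if hint in text]
    if hits:
        return Route(capable, f"matched hint: {hits[0]}")
    if files_touched > file_threshold:
        return Route(capable, f"touches {files_touched} files")
    return Route(fast, "routine task")
```

Logging the `reason` alongside the outcome lets you check later whether the routing rule actually tracks where the capable model earns its cost.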

You apply the prompting vs. RAG vs. fine-tuning hierarchy to LLM-powered features

Before building any LLM-powered feature, choose the right approach. There is near-universal consensus on the hierarchy: start with prompt engineering (hours to implement, near-zero cost), escalate to RAG when you need current or proprietary data, and fine-tune only when persistent behavioral changes are required (weeks to implement, significant cost).

The key diagnostic question: "Do we need new facts, or new behavior?" New facts point to RAG; if your team is building an internal tool that answers questions about company policies or product docs that change regularly, you need retrieval because the model simply does not have those facts. New behavior (tone, style, complex classification patterns) points to fine-tuning; if you need every response across thousands of requests to match a specific house style and format, and prompt instructions alone are not producing the consistency you need, that is a fine-tuning case.

A critical misconception to avoid: fine-tuning does not reliably inject new knowledge; it changes behavior and style, not factual recall. Growing context windows (now 1M+ tokens in some models) are also shifting some RAG use cases back to prompt engineering, since you can fit entire document sets in-context. If a customer support bot's full response guidelines and FAQ content fit in a single prompt, start there before building a retrieval pipeline.
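The hierarchy and the "new facts or new behavior" question can be written down as a small decision function. This is a sketch under stated assumptions: the parameter names are invented for illustration, token counts are whatever estimate you trust, and when both retrieval and fine-tuning seem warranted it returns the cheaper step first.

```python
from enum import Enum


class Approach(Enum):
    PROMPT = "prompt engineering"
    RAG = "retrieval-augmented generation"
    FINE_TUNE = "fine-tuning"


def choose_approach(
    needs_new_facts: bool,
    corpus_tokens: int,
    context_budget_tokens: int,
    needs_new_behavior: bool,
    prompting_is_consistent: bool,
) -> Approach:
    """Walk the hierarchy from cheapest to most expensive option."""
    # New facts that do not fit in the prompt call for retrieval.
    if needs_new_facts and corpus_tokens > context_budget_tokens:
        return Approach.RAG
    # New behavior that prompting cannot hold steady calls for fine-tuning.
    if needs_new_behavior and not prompting_is_consistent:
        return Approach.FINE_TUNE
    # Everything else, including facts that fit in-context, starts with prompting.
    return Approach.PROMPT
```

For the customer support bot mentioned above, a corpus that fits inside the context budget resolves to prompt engineering even though the bot needs facts the model lacks.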

You treat cost awareness as a professional skill

All major providers charge separately for input and output tokens, with output tokens costing 3-5x more than input. The cost difference between model tiers within a single provider can be 10-15x or more. Check your provider's current pricing page; these numbers shift regularly as competition drives costs down.

Three cost levers matter most: prompt caching (saves 60-90% on repeated prefixes), batch APIs (significant discounts for latency-insensitive workloads), and model routing (directing simple tasks to cheaper models). Developers who ignore cost optimization either burn through budgets that get their AI access revoked or avoid using AI where it would help because they assume everything is expensive. Track your usage so you can make data-informed decisions about where premium models earn their price and where lighter models do the job.
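A back-of-the-envelope estimator makes these levers easy to compare. The prices used in the usage note are made-up placeholders, not any provider's rates, and the model is simplified: it applies a flat discount to cached input tokens and ignores any surcharge a provider may add for writing to the cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float                # price per million input tokens
    output_per_mtok: float               # price per million output tokens
    cached_input_discount: float = 0.0   # fraction saved on cached input tokens


def request_cost(
    pricing: Pricing,
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> float:
    """Estimate the cost of one request in the pricing's currency."""
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_input_tokens
    cached_rate = pricing.input_per_mtok * (1 - pricing.cached_input_discount)
    total = (
        fresh_tokens * pricing.input_per_mtok
        + cached_input_tokens * cached_rate
        + output_tokens * pricing.output_per_mtok
    )
    return total / 1_000_000
```

With a hypothetical `Pricing(2.0, 8.0, cached_input_discount=0.9)`, comparing `request_cost(p, 100_000, 10_000)` against the same call with `cached_input_tokens=80_000` shows how much of a long, repeated prefix is worth caching before you change anything else.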

For work that exceeds a single session's capacity, use spec files and plan documents as state persistence. Your progress lives in markdown files with completion status. Combined with session resume, this gives you workflow resilience: if you are interrupted or the session ends, you pick up exactly where you left off. See Pillar 2: Planning Before Code for how to structure these artifacts.
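As a sketch of markdown-as-state, the helper below reads completion status from a plan file and reports the next open item. It assumes the plan uses task-list checkboxes (`- [ ]` and `- [x]`); that convention and the function names are assumptions for illustration, so adapt the pattern to however your plan documents mark status.

```python
import re
from typing import List, Optional, Tuple

# Matches "- [ ] title" or "* [x] title", with optional leading indentation.
TASK_LINE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*\S)\s*$")


def parse_plan(markdown: str) -> List[Tuple[bool, str]]:
    """Return (done, title) for every checkbox line, in document order."""
    tasks: List[Tuple[bool, str]] = []
    for line in markdown.splitlines():
        match = TASK_LINE.match(line)
        if match:
            tasks.append((match.group(1).lower() == "x", match.group(2)))
    return tasks


def next_task(markdown: str) -> Optional[str]:
    """Return the first unfinished task, or None when the plan is complete."""
    for done, title in parse_plan(markdown):
        if not done:
            return title
    return None
```

A fresh session can start from `next_task(plan_text)` instead of a re-explanation of everything that came before.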

Before letting an AI run autonomously, ensure you have a clean commit point. Check the AI's work at intervals. The longer the leash, the more important the rollback strategy. See Pillar 6: Verification and Security for the full verification framework.
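A minimal pre-flight check for the clean commit point might look like the following. It shells out to `git status --porcelain`, whose short format puts a two-character status and a space before each path; the function names are hypothetical, and the sketch assumes `git` is on the PATH and that `repo_dir` is inside a repository.

```python
import subprocess
from typing import List


def dirty_paths(porcelain_output: str) -> List[str]:
    """Extract paths from `git status --porcelain` output (XY<space>path)."""
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def assert_clean_worktree(repo_dir: str = ".") -> None:
    """Raise if there are uncommitted or untracked changes in repo_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    dirty = dirty_paths(result.stdout)
    if dirty:
        raise RuntimeError(
            "Commit or stash before starting an autonomous run: " + ", ".join(dirty)
        )
```

Calling `assert_clean_worktree()` before handing over control means any later rollback returns to a known commit rather than to a mix of your edits and the agent's.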

  • Doing work manually that agents can do in parallel
  • Using the same session for hours across multiple unrelated tasks
  • Not configuring tools (MCP, hooks, test runners) that would make the AI self-sufficient
  • Wiring an agent into a production database or mutating API without read-only scoping, approval gates, or audit logging
  • Re-explaining the same procedural knowledge to the AI in every session instead of capturing it as a skill
  • Using the cheapest model for complex reasoning tasks where a more capable model would save time on rework
  • Running everything on the most expensive model without considering whether a lighter model would produce equivalent results
  • Not tracking token usage or understanding the cost implications of large context windows
  • Running long autonomous sessions without commit checkpoints or verification criteria
  • Not learning the slash commands and capabilities of your tools; new features ship regularly
  • Assuming fine-tuning will fix factual accuracy problems (it won't; that's RAG's job)
  • Reaching for RAG when the data would fit in a single prompt with a large context window