Pillar 11: Knowing When NOT to Use AI
The best AI engineers know when to put it down.
Every other pillar in this repository teaches you how to use AI well. This one teaches you when to stop. AI coding assistants create a gravitational pull toward using them for everything, and the beginner mindset treats AI like a hammer that makes everything look like a nail.
The real risk is not the hallucinations that crash at compile time. As Simon Willison argues, the dangerous mistakes are the ones that compile cleanly: subtle logic errors, security oversights, and architectural decisions that look reasonable but compound. IEEE Spectrum calls these "silent failures" the dominant risk category. The code runs. The tests pass. And the bug ships.
The 2025 DORA State of AI-Assisted Software Development frames the structural risk in one line: AI is "an amplifier, magnifying an organization's existing strengths and weaknesses." The downstream cost shows up in the data. A large-scale 2025 study found human-written code remains superior across every quality metric measured, despite being structurally more complex. Faros AI telemetry across 22,000 developers shows median PR review time up 441% YoY, 31% more PRs merged unreviewed, and incidents per PR up 242.7%. Whether AI helps or hurts is decided by your engineering culture, not the model.
What We Expect
You recognize the task categories where AI still struggles
Frontier model coding capability has improved substantially through 2026: top models now score in the 50-60% range on real-world software engineering tasks per the SWE-bench Pro public leaderboard (as of Q2 2026), with substantial year-over-year gains. The absolutist claim that "AI can't code" is no longer accurate. But specific categories remain consistently hard:
- Novel algorithms that require deep mathematical reasoning
- Security-critical code paths where subtle errors have outsized consequences
- Complex multi-system integrations where the AI cannot see the full picture
- Performance-sensitive code where naive implementations carry hidden costs
- Domain logic that crosses system boundaries
The 2024 framing of that last category was "AI lacks training data for your domain." In 2026, with MCP context, RAG over codebases, and skill loading, the AI usually has fragments of domain knowledge but stitches them wrong at integration points. Locally-correct code, globally-wrong assembly.
MIT research mapped the specific roadblocks: AI fails at large codebases (millions of lines), struggles with global architectural coherence while generating locally correct code, and hallucinates code that looks plausible but violates internal conventions. The common failure types are well-categorized: code that does not compile, code that is overly convoluted, functions that contradict themselves, and hallucinations that make up nonexistent functions. When you encounter these patterns, slow down, write more of the code yourself, and use AI for specific sub-problems where you can verify the output.
You match AI's allowed role to the risk of the work
Recognizing the categories above is the awareness layer. Translating them into a decision rule is the operational layer. The matrix below is a default starting point; teams should harden the gradient based on their risk profile and regulatory exposure.
| Risk level | Examples | AI role | Required process |
|---|---|---|---|
| Prohibited | Crypto primitive implementation, auth core logic, irreversible destructive operations on production, regulated PHI/PCI handling without contractual coverage | Not used | Human-only; AI may not draft or review |
| Review-only (AI assists, human writes) | Security boundary code, payment flows, identity-and-access logic, key management integration, cryptographic protocol use | AI critiques, human writes | Human writes; AI may review and suggest fixes; named human owner signs off |
| Draft + mandatory expert review | Complex domain logic, multi-system integration, performance-sensitive paths, novel algorithms | AI drafts under tight scope | Human writes spec and acceptance tests first; AI implements; named expert reviews; tests gate merge |
| Standard | Routine features, refactors, tests, docs, boilerplate | AI drafts or assists at engineer's discretion | Standard guardrails per Pillar 5 and Pillar 6 |
The matrix is a default, not a ceiling. When a task sits between rows, treat the higher-risk row as the binding one.
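The tie-break rule can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `RiskLevel` enum, the `AI_ROLE` mapping, and both function names are hypothetical, and the role strings are copied from the matrix above.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """Rows of the matrix, ordered so a larger value means higher risk."""
    STANDARD = 0
    DRAFT_EXPERT_REVIEW = 1
    REVIEW_ONLY = 2
    PROHIBITED = 3


# AI role per row, copied from the matrix (names here are illustrative).
AI_ROLE = {
    RiskLevel.STANDARD: "AI drafts or assists at engineer's discretion",
    RiskLevel.DRAFT_EXPERT_REVIEW: "AI drafts under tight scope",
    RiskLevel.REVIEW_ONLY: "AI critiques, human writes",
    RiskLevel.PROHIBITED: "Not used",
}


def binding_level(candidates):
    """Return the binding row when a task plausibly fits several rows."""
    candidates = list(candidates)
    if not candidates:
        raise ValueError("classify the task into at least one row first")
    # The higher-risk row always wins.
    return max(candidates)


def allowed_ai_role(candidates):
    """Look up the AI role for the binding row."""
    return AI_ROLE[binding_level(candidates)]
```

For example, a task that could be read as either a routine refactor or a change to payment flows resolves to `RiskLevel.REVIEW_ONLY`, so the human writes the code and AI only critiques it.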
You do not vibe code into production
"Vibe coding" (letting AI generate code you accept without fully understanding) has a place for throwaway prototypes and weekend experiments. Even Andrej Karpathy, who coined the term, framed it as "not too bad for throwaway weekend projects" rather than a production approach. It does not belong in production codebases.
Research summarized by Kaspersky found that 45% of AI-generated code contains classic OWASP Top-10 vulnerabilities, and that security deteriorates with iteration: after five modification rounds, code has 37% more critical vulnerabilities than it started with. Qodo's 2025 research found 71% of developers say they won't merge AI code without manual review, yet many junior developers still deploy AI-generated code they don't fully understand. If you cannot explain what the code does, why it does it, and how it fails, it is not ready to ship.
You stop and reframe when the AI is struggling
If the AI takes too many iterations, produces contradictory outputs, or keeps regressing to the same incorrect pattern, that is a signal. The Axur engineering team's recommendation holds: if the AI assistant takes too long or struggles with a complex prompt, stop it and reframe the problem. Break it into smaller pieces, provide more context, or switch to a different approach. Stubbornly iterating on a failing prompt is a time sink.
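One way to make the stop signal explicit is a small check over the history of failed attempts. The sketch below is an assumption-laden illustration: the function name, the idea of a "failure signature" (for instance the name of the failing test), and both thresholds are hypothetical defaults, not values from the Axur guidance.

```python
from collections import Counter


def should_reframe(failure_signatures, max_iterations=5, max_repeats=2):
    """Decide whether to stop iterating and reframe the problem.

    failure_signatures: one string per failed attempt, e.g. the failing
    test name or error class (a hypothetical convention).
    """
    # Too many attempts overall: the prompt itself is probably the problem.
    if len(failure_signatures) >= max_iterations:
        return True
    # The same failure keeps coming back: the AI is regressing to a pattern.
    counts = Counter(failure_signatures)
    return any(count >= max_repeats for count in counts.values())
```

With the defaults, `should_reframe(["test_totals", "test_rounding", "test_totals"])` returns `True` because the same failure has appeared twice.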
You account for the verification tax
Generating code and verifying code are different cognitive tasks. The DORA 2025 report names this directly: time saved during generation gets reallocated to auditing, and auditing AI output requires reverse-engineering intent from text the engineer did not write. DORA's March 2026 follow-up sharpens the framing: because AI tools cannot reliably signal uncertainty, engineers are forced to treat every interaction as potentially deceptive, and verification becomes a fundamentally different cognitive task than creation. The decision rule that follows: before delegating, ask whether you can verify the output faster than you could write it. If not, do not delegate. Treat verification time as a real budget, not a free byproduct.
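The decision rule reduces to a single comparison. The sketch below is illustrative only: the function name is hypothetical, the inputs are your own honest estimates, and the optional `prompt_minutes` term (time spent writing prompts and supplying context) is an added assumption that is not part of the rule as stated.

```python
def should_delegate(write_minutes, verify_minutes, prompt_minutes=0.0):
    """Return True only if delegating is expected to cost less than writing.

    All inputs are estimates in minutes. prompt_minutes is an optional
    extension (an assumption of this sketch), defaulting to zero.
    """
    if min(write_minutes, verify_minutes, prompt_minutes) < 0:
        raise ValueError("time estimates must be non-negative")
    # Verification is a real budget: ties go to writing it yourself.
    return prompt_minutes + verify_minutes < write_minutes
```

A task you could write in 30 minutes but would need 30 minutes to verify fails the check, so you write it yourself.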
You are alert to the "looks right, is wrong" failure mode
AI-generated code that compiles and passes basic testing can still contain: functions that contradict themselves, overly convoluted implementations of simple problems, references to non-existent packages (the opening that slopsquatting attacks exploit), deprecated API patterns that work today but will break, and security vulnerabilities disguised in plausible-looking code.
A USENIX 2025 study testing 16 LLMs found that roughly 20% of AI-generated code references non-existent packages, with 43% of those hallucinated names repeating consistently across runs. GitClear's 211M-line longitudinal analysis found duplicated code blocks rose 8x in 2024 versus prior years, refactoring-associated changes dropped from 25% (2021) to under 10% (2024), and copy/pasted lines exceeded moved lines for the first time in the dataset's history. Each pattern is locally plausible. Together they describe a codebase becoming harder to maintain.
You match autonomy to your ability to verify
The more autonomy you grant the AI, the more correctness you must independently verify, and the harder verification gets. A suggestion you accept after reading is cheap to verify; an autonomous agent that ran 47 tool calls, edited 12 files, and pushed a PR forces you to reconstruct everything it did. Match the leash to your verification budget.
Reasoning-mode models do not fix this. The November 2025 AA-Omniscience benchmark explicitly penalizes wrong answers and rewards admitting uncertainty - and most production benchmarks do the opposite, training models to guess rather than refuse. The downstream effect shows in OpenAI's own o3 and o4-mini system card: o3 hallucinated 33% of the time on PersonQA, double the rate of its predecessor o1, and o4-mini reached 48%. More compute does not equal more honesty.
You guard against automation bias and over-reliance
Automation bias is the documented tendency to favor AI recommendations even when contradictory evidence is present. Thoughtworks placed "complacency with AI-generated code" on their Technology Radar as a recognized risk, noting that AI-driven confidence often comes at the expense of critical thinking, with automation bias, anchoring bias, and review fatigue all contributing.
In coding, this manifests as accepting AI output without tracing through the logic, deferring architectural decisions to the model, and losing the habit of critical evaluation. The METR study captured the perception gap: 16 experienced open-source developers working on familiar codebases in early 2025 believed AI made them 20% faster, while measured outcomes showed they were 19% slower. The scope matters - the slowdown is conditioned on high prior codebase familiarity and tasks the developers had already partially scoped, not on AI being categorically slower. A 2026 follow-up from METR suggests returning developers are now seeing roughly an 18% speedup, with substantial selection-bias caveats that make the new estimate weaker evidence than the original. The durable lesson survives both estimates: you cannot trust your intuition about whether AI is helping on your work. You need to measure it.
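Measuring can start very simply: record how long comparable tasks take with and without AI, then compare the result to what you believed. The sketch below is a rough illustration, not the METR methodology: the function names are hypothetical, it compares medians of whatever samples you collect, and without randomized task assignment the result is only a sanity check on intuition.

```python
from statistics import median


def time_change_pct(with_ai_minutes, without_ai_minutes):
    """Percent change in median task time when using AI.

    Positive means tasks took longer with AI; negative means they were faster.
    """
    baseline = median(without_ai_minutes)
    if baseline <= 0:
        raise ValueError("baseline median must be positive")
    return (median(with_ai_minutes) - baseline) * 100 / baseline


def perception_gap_pct(perceived_change_pct, with_ai_minutes, without_ai_minutes):
    """Measured minus perceived change in task time, in percentage points."""
    measured = time_change_pct(with_ai_minutes, without_ai_minutes)
    return measured - perceived_change_pct
```

If you believed AI cut your task time by 20% (`perceived_change_pct=-20.0`) but your median task time actually rose by 20%, the gap is 40 percentage points in the wrong direction.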
You maintain the ability to work without AI
If your productivity drops to near zero when your AI tool has an outage, that is a warning sign. You should be able to read code, debug, reason about architecture, and write implementations without AI assistance. AI is an accelerant, not a crutch.
Anthropic's 2026 research found developers who delegated code generation to AI scored 17% lower on comprehension tests; MIT Media Lab measured similar declines in memory and neural connectivity from prolonged AI use, and ICIS 2025 found developer expertise is the primary factor mitigating hallucination impact. Maintaining your fundamentals is the safety net that makes AI collaboration viable.
Anti-patterns
- Using AI for every task regardless of whether it is a good fit
- Vibe coding into production: accepting AI output you cannot explain or debug
- Continuing to iterate on a failing AI interaction instead of stepping back and reframing
- Trusting AI-generated security code without dedicated expert review
- Treating verification as free, including by granting autonomy beyond what you can verify ("the agent ran for an hour, the diff is 2,000 lines, I will skim it")
- Assuming reasoning-mode models hallucinate less; on grounded recall tasks they often hallucinate more
- Not having a plan B when your AI tool is down or producing poor results
- Letting AI choose frameworks, libraries, or architectures for domains you do not understand well enough to evaluate the choice
- Assuming AI's productivity impact is constant rather than measuring actual outcomes for your team and tasks
- Deploying AI-generated code that passes tests but has never been read by a human who understands the business logic
Resources
AI Coding Limitations and Failure Modes
- Simon Willison: Hallucinations in Code (2025) - Why subtle AI mistakes are more dangerous than obvious hallucinations
- IEEE Spectrum: AI Coding Degrades (2026) - Silent failures and quality plateau in AI coding tools
- MIT: Roadblocks to Autonomous Software Engineering (2025) - Where AI fails at scale: large codebases, architectural coherence, internal conventions
- InfoWorld: AI-Assisted Coding Creates More Problems (2025) - Secondary coverage of the GitClear longitudinal findings
- Axur: Best Practices for AI-Assisted Coding (2025) - Recognizing when the AI is struggling and reframing the problem
Code Quality Comparisons
- Human vs. AI Code: Defects, Vulnerabilities, Complexity (arXiv 2508.21634, 2025) - Large-scale study confirming human code superior across all quality metrics. Primary academic citation for the AI-vs-human quality gap.
- GitClear: AI Copilot Code Quality (2025) - 211M-line longitudinal analysis. Refactoring dropped, copy/paste rose past moved lines for the first time, churn nearly doubled. Strongest empirical case for AI degrading long-term maintainability.
- Qodo: State of AI Code Quality (2025) - 71% of developers won't merge AI code without manual review
- CodeRabbit: AI vs. Human Code Quality (2025) - Vendor-published analysis of 470 PRs reporting 1.7x more issues and 2.7x more security vulnerabilities. Corroborates the arXiv finding; treat as a data point, not the headline source.
Industry Telemetry and Surveys
- DORA 2025 State of AI-Assisted Software Development - Google Cloud's annual research program. Frames AI as an "amplifier" with documented correlation between higher AI adoption and increased delivery instability.
- DORA 2026 follow-up: Balancing AI Tensions - March 2026 update on the verification tax and how the amplifier effect has played out.
- Faros AI: Acceleration Whiplash (2026) - 22,000-developer telemetry: PR review +441% YoY, +31% unreviewed merges, incidents/PR +242.7%.
- Lightrun: 43% of AI Code Changes Need Production Debugging (Q1 2026) - 200 senior SRE/DevOps leaders; tied to the March 2026 Amazon outages.
Vibe Coding Risks
- Kaspersky: Security Risks of Vibe Coding (2025) - 45% of AI code contains OWASP Top-10 vulnerabilities; security worsens with iteration
- TheServerSide: The Case Against Vibe Coding (2025) - Even the term's creator says it's not for production
- Qodo: State of AI Code Quality (2025) - 71% of developers won't merge AI code without manual review; 46% distrust AI accuracy
Hallucinations, Benchmarks, and Supply Chain
- Artificial Analysis: AA-Omniscience benchmark (Nov 2025) - Knowledge and hallucination benchmark that penalizes wrong answers and rewards admitting uncertainty. Primary source for the reasoning-mode honesty critique.
- OpenAI: o3 and o4-mini System Card (April 2025) - Load-bearing primary source for the o3 33% / o4-mini 48% PersonQA hallucination figures.
- USENIX 2025: Package Hallucinations by Code-Generating LLMs - 20% hallucination rate across 16 LLMs, peer-reviewed
- ITBrew: Slopsquatting Explained (2025) - How attackers weaponize hallucinated package names
Automation Bias and Cognitive Effects
- Thoughtworks: Complacency with AI-Generated Code - Technology Radar entry on automation bias, anchoring bias, and review fatigue
- METR: AI Impact on Developer Productivity (2025) - Developers believe they're 20% faster but are 19% slower
- METR: 2026 Update on the Productivity Experiment - The 2026 follow-up that revises the original estimate toward a speedup for returning developers, with substantial selection-bias caveats. Required reading alongside the original study.
- MIT Media Lab: Your Brain on ChatGPT (2025) - Measurable cognitive effects of prolonged AI use
- Anthropic: AI Assistance and Coding Skills (2026) - 17% comprehension drop when delegating to AI
Related Pillars
- Pillar 6: Verification and Security - The review discipline that catches what AI gets wrong
- Pillar 4: The AI as Collaborator - Working iteratively to catch problems early
- Pillar 8: Continuous Evolution - Maintaining skills alongside AI adoption
- See Learning Paths for deeper dives