Pillar 11: Knowing When NOT to Use AI

The best AI engineers know when to put it down.

Every other pillar in this repository teaches you how to use AI well. This one teaches you when to stop. AI coding assistants create a gravitational pull toward using them for everything, and the beginner mindset treats AI like a hammer that makes everything look like a nail.

The real risk is not the hallucinations that crash at compile time. As Simon Willison argues, the dangerous mistakes are the ones that compile cleanly: subtle logic errors, security oversights, and architectural decisions that look reasonable but compound. IEEE Spectrum calls these "silent failures" and treats them as the dominant risk category. The code runs. The tests pass. And the bug ships.

The 2025 DORA State of AI-Assisted Software Development frames the structural risk in one line: AI is "an amplifier, magnifying an organization's existing strengths and weaknesses." The downstream cost shows up in the data. A large-scale 2025 study found human-written code remains superior across every quality metric measured, despite being structurally more complex. Faros AI telemetry across 22,000 developers shows median PR review time up 441% YoY, 31% more PRs merged unreviewed, and incidents per PR up 242.7%. Whether AI helps or hurts is decided by your engineering culture, not the model.

What We Expect

You recognize the task categories where AI still struggles

Frontier model coding capability has improved substantially through 2026: top models now score in the 50-60% range on real-world software engineering tasks per the SWE-bench Pro public leaderboard (as of Q2 2026), with large year-over-year gains. The absolutist claim that "AI can't code" is no longer accurate. But specific categories remain consistently hard:

Novel algorithms that require deep mathematical reasoning
Security-critical code paths where subtle errors have outsized consequences
Complex multi-system integrations where the AI cannot see the full picture
Performance-sensitive code where naive implementations carry hidden costs
Domain logic that crosses system boundaries

The 2024 framing of that last category was "AI lacks training data for your domain." In 2026, with MCP context, RAG over codebases, and skill loading, the AI usually has fragments of domain knowledge but stitches them together wrongly at integration points. Locally correct code, globally wrong assembly.

MIT research mapped the specific roadblocks: AI fails at large codebases (millions of lines), struggles with global architectural coherence while generating locally correct code, and hallucinates code that looks plausible but violates internal conventions. The common failure types are well-categorized: code that does not compile, code that is overly convoluted, functions that contradict themselves, and hallucinations that make up nonexistent functions. When you encounter these patterns, slow down, write more of the code yourself, and use AI for specific sub-problems where you can verify the output.

You match AI's allowed role to the risk of the work

Recognizing the categories above is the awareness layer. Translating them into a decision rule is the operational layer. The matrix below is a default starting point; teams should harden the gradient based on their risk profile and regulatory exposure.

Risk level	Examples	AI role	Required process
Prohibited	Crypto primitive implementation, auth core logic, irreversible destructive operations on production, regulated PHI/PCI handling without contractual coverage	Not used	Human-only; AI may not draft or review
Review-only (AI assists, human writes)	Security boundary code, payment flows, identity-and-access logic, key management integration, cryptographic protocol use	AI critiques, human writes	Human writes; AI may review and suggest fixes; named human owner signs off
Draft + mandatory expert review	Complex domain logic, multi-system integration, performance-sensitive paths, novel algorithms	AI drafts under tight scope	Human writes spec and acceptance tests first; AI implements; named expert reviews; tests gate merge
Standard	Routine features, refactors, tests, docs, boilerplate	AI drafts or assists at engineer's discretion	Standard guardrails per Pillar 5 and Pillar 6

The matrix is a default, not a ceiling. When a task sits between rows, treat the higher-risk row as the binding one.
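The "higher-risk row binds" rule can be written down as a small lookup. The following is a minimal sketch, not a prescribed implementation: the names `RiskLevel`, `AI_ROLE`, `binding_level`, and `allowed_ai_role` are hypothetical, and the role strings are copied from the matrix above.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """Rows of the matrix, ordered so a larger value means higher risk."""

    STANDARD = 0
    DRAFT_WITH_EXPERT_REVIEW = 1
    REVIEW_ONLY = 2
    PROHIBITED = 3


# AI role per row, copied from the "AI role" column of the matrix.
AI_ROLE = {
    RiskLevel.STANDARD: "AI drafts or assists at engineer's discretion",
    RiskLevel.DRAFT_WITH_EXPERT_REVIEW: "AI drafts under tight scope",
    RiskLevel.REVIEW_ONLY: "AI critiques, human writes",
    RiskLevel.PROHIBITED: "Not used",
}


def binding_level(candidates):
    """Return the highest-risk row among the rows a task plausibly fits."""
    levels = list(candidates)
    if not levels:
        raise ValueError("classify the task into at least one row first")
    return max(levels)


def allowed_ai_role(candidates):
    """Look up the AI role for the binding (highest-risk) row."""
    return AI_ROLE[binding_level(candidates)]
```

For example, a refactor that also touches a payment flow fits both the Standard and Review-only rows, so `allowed_ai_role([RiskLevel.STANDARD, RiskLevel.REVIEW_ONLY])` resolves to the Review-only role. Classifying the task into candidate rows remains a human judgment; the sketch only encodes the tie-break.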

You do not vibe code into production

"Vibe coding" (letting AI generate code you accept without fully understanding) has a place for throwaway prototypes and weekend experiments. Even Andrej Karpathy, who coined the term, framed it as "not too bad for throwaway weekend projects" rather than a production approach. It does not belong in production codebases.

Research from Kaspersky found that 45% of AI-generated code contains classic OWASP Top-10 vulnerabilities, and security deteriorates with iteration: after five modification rounds, code has 37% more critical vulnerabilities than it started with. Qodo's 2025 research found 71% of developers say they won't merge AI code without manual review, yet many junior developers still deploy AI-generated code they don't fully understand. If you cannot explain what the code does, why it does it, and how it fails, it is not ready to ship.

You stop and reframe when the AI is struggling

If the AI takes too many iterations, produces contradictory outputs, or keeps regressing to the same incorrect pattern, that is a signal. The Axur engineering team's recommendation holds: if the AI assistant takes too long or struggles with a complex prompt, stop it and reframe the problem. Break it into smaller pieces, provide more context, or switch to a different approach. Stubbornly iterating on a failing prompt is a time sink.

You account for the verification tax

Generating code and verifying code are different cognitive tasks. The DORA 2025 report names this directly: time saved during generation gets reallocated to auditing, and auditing AI output requires reverse-engineering intent from text the engineer did not write. DORA's March 2026 follow-up sharpens the framing: because AI tools cannot reliably signal uncertainty, engineers are forced to treat every interaction as potentially deceptive, and verification becomes a fundamentally different cognitive task than creation. The decision rule that follows: before delegating, ask whether you can verify the output faster than you could write it. If not, do not delegate. Treat verification time as a real budget, not a free byproduct.
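The delegation rule above reduces to one comparison. This is a minimal sketch under stated assumptions: `TaskEstimate` and `should_delegate` are hypothetical names, and both inputs are the engineer's own rough estimates, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEstimate:
    """Rough, self-reported estimates for one task, in minutes."""

    write_minutes: float   # time to write the code yourself
    verify_minutes: float  # time to fully verify AI-written code for the same task


def should_delegate(estimate: TaskEstimate) -> bool:
    """Delegate only when verifying is strictly faster than writing."""
    if estimate.write_minutes < 0 or estimate.verify_minutes < 0:
        raise ValueError("estimates must be non-negative")
    return estimate.verify_minutes < estimate.write_minutes
```

Boilerplate that takes 30 minutes to write and 10 to check passes the test; a 20-minute change to unfamiliar concurrency code that would take 45 minutes to verify does not. A tie resolves to "do not delegate", which matches treating verification time as a real budget.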

You are alert to the "looks right, is wrong" failure mode

AI-generated code that compiles and passes basic testing can still contain: functions that contradict themselves, overly convoluted implementations of simple problems, references to non-existent packages (slopsquatting), deprecated API patterns that work today but will break, and security vulnerabilities disguised in plausible-looking code.

A USENIX 2025 study testing 16 LLMs found that roughly 20% of AI-generated code references non-existent packages, with 43% of those hallucinated names repeating consistently across runs. GitClear's 211M-line longitudinal analysis found duplicated code blocks rose 8x in 2024 versus prior years, refactoring-associated changes dropped from 25% (2021) to under 10% (2024), and copy/pasted lines exceeded moved lines for the first time in the dataset's history. Each pattern is locally plausible. Together they describe a codebase becoming harder to maintain.

You match autonomy to your ability to verify

The more autonomy you grant the AI, the more correctness you must independently verify, and the harder verification gets. A suggestion you accept after reading is cheap to verify; an autonomous agent that ran 47 tool calls, edited 12 files, and pushed a PR forces you to reconstruct everything it did. Match the leash to your verification budget.

Reasoning-mode models do not fix this. The November 2025 AA-Omniscience benchmark explicitly penalizes wrong answers and rewards admitting uncertainty, whereas most production benchmarks do the opposite, training models to guess rather than refuse. The downstream effect shows in OpenAI's own o3 and o4-mini system card: o3 hallucinated 33% of the time on PersonQA, roughly double the rate of its predecessor o1, and o4-mini reached 48%. More compute does not equal more honesty.

You guard against automation bias and over-reliance

Automation bias is the documented tendency to favor AI recommendations even when contradictory evidence is present. Thoughtworks placed "complacency with AI-generated code" on their Technology Radar as a recognized risk, noting that AI-driven confidence often comes at the expense of critical thinking, with automation bias, anchoring bias, and review fatigue all contributing.

In coding, this manifests as accepting AI output without tracing through the logic, deferring architectural decisions to the model, and losing the habit of critical evaluation. The METR study captured the perception gap: 16 experienced open-source developers working on familiar codebases in early 2025 believed AI made them 20% faster, while measured outcomes showed they were 19% slower. The scope matters: the slowdown is conditioned on high prior codebase familiarity and tasks the developers had already partially scoped, not on AI being categorically slower. A 2026 follow-up from METR suggests returning developers are now seeing roughly an 18% speedup, with substantial selection-bias caveats that make the new estimate weaker evidence than the original. The durable lesson survives both estimates: you cannot trust your intuition about whether AI is helping on your work. You need to measure it.
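The arithmetic behind "measure it" is simple, even though the experimental design is not. The sketch below is an assumption-laden illustration, not the METR methodology: `percent_time_change` is a hypothetical helper, the durations are invented, and a real comparison would need controls for task difficulty that this omits.

```python
from statistics import mean


def percent_time_change(without_ai_minutes, with_ai_minutes):
    """Percent change in mean task time with AI; positive means slower."""
    baseline = mean(without_ai_minutes)
    if baseline <= 0:
        raise ValueError("baseline durations must be positive")
    return (mean(with_ai_minutes) - baseline) * 100 / baseline


# Hypothetical task durations in minutes, for illustration only.
without_ai = [100, 100]
with_ai = [119, 119]

measured = percent_time_change(without_ai, with_ai)  # about +19: slower with AI
perceived = -20.0  # the team's belief: tasks take 20% less time with AI
perception_gap = measured - perceived
```

The useful output is the gap between `measured` and `perceived`, not either number alone. If the two disagree in sign, intuition is pointing the wrong way.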

You maintain the ability to work without AI

If your productivity drops to near zero when your AI tool has an outage, that is a warning sign. You should be able to read code, debug, reason about architecture, and write implementations without AI assistance. AI is an accelerant, not a crutch.

Anthropic's 2026 research found developers who delegated code generation to AI scored 17% lower on comprehension tests. MIT Media Lab measured declines in memory and neural connectivity from prolonged AI use. ICIS 2025 found developer expertise is the primary factor mitigating hallucination impact. Maintaining your fundamentals is the safety net that makes AI collaboration viable.

Anti-patterns

Using AI for every task regardless of whether it is a good fit
Vibe coding into production: accepting AI output you cannot explain or debug
Continuing to iterate on a failing AI interaction instead of stepping back and reframing
Trusting AI-generated security code without dedicated expert review
Treating verification as free, including by granting autonomy beyond what you can verify ("the agent ran for an hour, the diff is 2,000 lines, I will skim it")
Assuming reasoning-mode models hallucinate less; on grounded recall tasks they often hallucinate more
Not having a plan B when your AI tool is down or producing poor results
Letting AI choose frameworks, libraries, or architectures for domains you do not understand well enough to evaluate the choice
Assuming AI's productivity impact is constant rather than measuring actual outcomes for your team and tasks
Deploying AI-generated code that passes tests but has never been read by a human who understands the business logic

Resources

AI Coding Limitations and Failure Modes

Simon Willison: Hallucinations in Code (2025) - Why subtle AI mistakes are more dangerous than obvious hallucinations
IEEE Spectrum: AI Coding Degrades (2026) - Silent failures and quality plateau in AI coding tools
MIT: Roadblocks to Autonomous Software Engineering (2025) - Where AI fails at scale: large codebases, architectural coherence, internal conventions
InfoWorld: AI-Assisted Coding Creates More Problems (2025) - Secondary coverage of the GitClear longitudinal findings
Axur: Best Practices for AI-Assisted Coding (2025) - Recognizing when the AI is struggling and reframing the problem

Code Quality Comparisons

Human vs. AI Code: Defects, Vulnerabilities, Complexity (arXiv 2508.21634, 2025) - Large-scale study confirming human code superior across all quality metrics. Primary academic citation for the AI-vs-human quality gap.
GitClear: AI Copilot Code Quality (2025) - 211M-line longitudinal analysis. Refactoring dropped, copy/paste rose past moved lines for the first time, churn nearly doubled. Strongest empirical case for AI degrading long-term maintainability.
Qodo: State of AI Code Quality (2025) - 71% of developers won't merge AI code without manual review
CodeRabbit: AI vs. Human Code Quality (2025) - Vendor-published analysis of 470 PRs reporting 1.7x more issues and 2.7x more security vulnerabilities. Corroborates the arXiv finding; treat as a data point, not the headline source.

Industry Telemetry and Surveys

DORA 2025 State of AI-Assisted Software Development - Google Cloud's annual research program. Frames AI as an "amplifier" with documented correlation between higher AI adoption and increased delivery instability.
DORA 2026 follow-up: Balancing AI Tensions - March 2026 update on the verification tax and how the amplifier effect has played out.
Faros AI: Acceleration Whiplash (2026) - 22,000-developer telemetry: PR review +441% YoY, +31% unreviewed merges, incidents/PR +242.7%.
Lightrun: 43% of AI Code Changes Need Production Debugging (Q1 2026) - 200 senior SRE/DevOps leaders; tied to the March 2026 Amazon outages.

Vibe Coding Risks

Kaspersky: Security Risks of Vibe Coding (2025) - 45% of AI code contains OWASP Top-10 vulnerabilities; security worsens with iteration
TheServerSide: The Case Against Vibe Coding (2025) - Even the term's creator says it's not for production
Qodo: State of AI Code Quality (2025) - 71% of developers won't merge AI code without manual review; 46% distrust AI accuracy

Hallucinations, Benchmarks, and Supply Chain

Artificial Analysis: AA-Omniscience benchmark (Nov 2025) - Knowledge and hallucination benchmark that penalizes wrong answers and rewards admitting uncertainty. Primary source for the reasoning-mode honesty critique.
OpenAI: o3 and o4-mini System Card (April 2025) - Load-bearing primary source for the o3 33% / o4-mini 48% PersonQA hallucination figures.
USENIX 2025: Package Hallucinations by Code-Generating LLMs - 20% hallucination rate across 16 LLMs, peer-reviewed
ITBrew: Slopsquatting Explained (2025) - How attackers weaponize hallucinated package names

Automation Bias and Cognitive Effects

Thoughtworks: Complacency with AI-Generated Code - Technology Radar entry on automation bias, anchoring bias, and review fatigue
METR: AI Impact on Developer Productivity (2025) - Developers believe they're 20% faster but are 19% slower
METR: 2026 Update on the Productivity Experiment - The 2026 follow-up that revises the original estimate toward a speedup for returning developers, with substantial selection-bias caveats. Required reading alongside the original study.
MIT Media Lab: Your Brain on ChatGPT (2025) - Measurable cognitive effects of prolonged AI use
Anthropic: AI Assistance and Coding Skills (2026) - 17% comprehension drop when delegating to AI

Related Pillars

Pillar 6: Verification and Security - The review discipline that catches what AI gets wrong
Pillar 4: The AI as Collaborator - Working iteratively to catch problems early
Pillar 8: Continuous Evolution - Maintaining skills alongside AI adoption
See Learning Paths for deeper dives
