Pillar 11: Knowing When NOT to Use AI

The best AI engineers know when to put it down.

Every other pillar in this repository teaches you how to use AI well. This one teaches you when to stop. AI coding assistants create a gravitational pull toward using them for everything, and the beginner mindset treats AI like a hammer that makes everything look like a nail.

The real risk is not the hallucinations that crash at compile time. As Simon Willison argues, the dangerous mistakes are the ones that compile cleanly: subtle logic errors, security oversights, and architectural decisions that look reasonable but compound. IEEE Spectrum calls these "silent failures" and treats them as the dominant risk category. The code runs. The tests pass. And the bug ships.

The 2025 DORA State of AI-Assisted Software Development frames the structural risk in one line: AI is "an amplifier, magnifying an organization's existing strengths and weaknesses." The downstream cost shows up in the data. A large-scale 2025 study found human-written code remains superior across every quality metric measured, despite being structurally more complex. Faros AI telemetry across 22,000 developers shows median PR review time up 441% YoY, 31% more PRs merged unreviewed, and incidents per PR up 242.7%. Whether AI helps or hurts is decided by your engineering culture, not the model.

You recognize the task categories where AI still struggles

Frontier model coding capability has improved substantially through 2026: top models now score in the 50-60% range on real-world software engineering tasks per the SWE-bench Pro public leaderboard (as of Q2 2026), a clear gain over the prior year. The absolutist claim that "AI can't code" is no longer accurate. But specific categories remain consistently hard:

  • Novel algorithms that require deep mathematical reasoning
  • Security-critical code paths where subtle errors have outsized consequences
  • Complex multi-system integrations where the AI cannot see the full picture
  • Performance-sensitive code where naive implementations carry hidden costs
  • Domain logic that crosses system boundaries

The 2024 framing of that last category was "AI lacks training data for your domain." In 2026, with MCP context, RAG over codebases, and skill loading, the AI usually has fragments of domain knowledge but stitches them wrong at integration points. Locally-correct code, globally-wrong assembly.

MIT research mapped the specific roadblocks: AI fails at large codebases (millions of lines), struggles with global architectural coherence while generating locally correct code, and hallucinates code that looks plausible but violates internal conventions. The common failure types are well-categorized: code that does not compile, code that is overly convoluted, functions that contradict themselves, and hallucinations that make up nonexistent functions. When you encounter these patterns, slow down, write more of the code yourself, and use AI for specific sub-problems where you can verify the output.

You match AI's allowed role to the risk of the work

Recognizing the categories above is the awareness layer. Translating them into a decision rule is the operational layer. The matrix below is a default starting point; teams should harden the gradient based on their risk profile and regulatory exposure.

| Risk level | Examples | AI role | Required process |
| --- | --- | --- | --- |
| Prohibited | Crypto primitive implementation, auth core logic, irreversible destructive operations on production, regulated PHI/PCI handling without contractual coverage | Not used | Human-only; AI may not draft or review |
| Review-only (AI assists, human writes) | Security boundary code, payment flows, identity-and-access logic, key management integration, cryptographic protocol use | AI critiques, human writes | Human writes; AI may review and suggest fixes; named human owner signs off |
| Draft + mandatory expert review | Complex domain logic, multi-system integration, performance-sensitive paths, novel algorithms | AI drafts under tight scope | Human writes spec and acceptance tests first; AI implements; named expert reviews; tests gate merge |
| Standard | Routine features, refactors, tests, docs, boilerplate | AI drafts or assists at engineer's discretion | Standard guardrails per Pillar 5 and Pillar 6 |

The matrix is a default, not a ceiling. When a task sits between rows, treat the higher-risk row as the binding one.
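The "higher-risk row binds" rule can be written down as a small lookup. The following is a minimal sketch, not a prescribed implementation: the `RiskTier` enum, the `AI_ROLE` mapping, and both function names are hypothetical, and the role strings are copied from the matrix above.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    """Rows of the matrix, ordered so that a larger value means higher risk."""
    STANDARD = 0
    DRAFT_WITH_EXPERT_REVIEW = 1
    REVIEW_ONLY = 2
    PROHIBITED = 3


# AI role per tier, taken from the "AI role" column of the matrix.
AI_ROLE = {
    RiskTier.STANDARD: "AI drafts or assists at engineer's discretion",
    RiskTier.DRAFT_WITH_EXPERT_REVIEW: "AI drafts under tight scope",
    RiskTier.REVIEW_ONLY: "AI critiques, human writes",
    RiskTier.PROHIBITED: "Not used",
}


def binding_tier(candidate_tiers):
    """Return the tier that governs a task matching several rows.

    When a task sits between rows, the higher-risk row is binding.
    """
    tiers = list(candidate_tiers)
    if not tiers:
        raise ValueError("at least one candidate tier is required")
    return max(tiers)


def allowed_ai_role(candidate_tiers):
    """Look up the AI role permitted under the binding tier."""
    return AI_ROLE[binding_tier(candidate_tiers)]
```

For example, a routine refactor that also touches a payment flow would be classified as both `STANDARD` and `REVIEW_ONLY`, and `allowed_ai_role` would return the review-only role. A team that hardens the gradient would change the mapping, not the rule.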

"Vibe coding" (letting AI generate code you accept without fully understanding) has a place for throwaway prototypes and weekend experiments. Even Andrej Karpathy, who coined the term, framed it as "not too bad for throwaway weekend projects" rather than a production approach. It does not belong in production codebases.

Research from Kaspersky found that 45% of AI-generated code contains classic OWASP Top-10 vulnerabilities, and security deteriorates with iteration: after five modification rounds, code has 37% more critical vulnerabilities than it started with. Qodo's 2025 research found 71% of developers say they won't merge AI code without manual review, yet many junior developers still deploy AI-generated code they don't fully understand. If you cannot explain what the code does, why it does it, and how it fails, it is not ready to ship.

If the AI takes too many iterations, produces contradictory outputs, or keeps regressing to the same incorrect pattern, that is a signal. The Axur engineering team's recommendation holds: if the AI assistant takes too long or struggles with a complex prompt, stop it and reframe the problem. Break it into smaller pieces, provide more context, or switch to a different approach. Stubbornly iterating on a failing prompt is a time sink.

Generating code and verifying code are different cognitive tasks. The DORA 2025 report names this directly: time saved during generation gets reallocated to auditing, and auditing AI output requires reverse-engineering intent from text the engineer did not write. DORA's March 2026 follow-up sharpens the framing: because AI tools cannot reliably signal uncertainty, engineers are forced to treat every interaction as potentially deceptive, and verification becomes a fundamentally different cognitive task than creation. The decision rule that follows: before delegating, ask whether you can verify the output faster than you could write it. If not, do not delegate. Treat verification time as a real budget, not a free byproduct.
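The delegation rule above reduces to a single comparison. This sketch assumes you can produce rough time estimates for both activities; the function name and parameters are hypothetical, and the estimates themselves are the hard part.

```python
def should_delegate(write_minutes: float, verify_minutes: float) -> bool:
    """Apply the rule: delegate only if verifying is faster than writing.

    write_minutes:  estimated time to write the code yourself.
    verify_minutes: estimated time to fully verify AI-generated output,
                    including reconstructing intent you did not author.
    """
    if write_minutes < 0 or verify_minutes < 0:
        raise ValueError("time estimates must be non-negative")
    # Ties go to writing it yourself: equal cost buys no saving,
    # and you keep first-hand understanding of the code.
    return verify_minutes < write_minutes
```

Under this sketch, `should_delegate(write_minutes=30, verify_minutes=45)` returns `False`: the verification budget exceeds the cost of writing the code directly.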

AI-generated code that compiles and passes basic testing can still contain: functions that contradict themselves, overly convoluted implementations of simple problems, references to non-existent packages (slopsquatting), deprecated API patterns that work today but will break, and security vulnerabilities disguised in plausible-looking code.

A USENIX 2025 study testing 16 LLMs found that roughly 20% of AI-generated code references non-existent packages, with 43% of those hallucinated names repeating consistently across runs. GitClear's 211M-line longitudinal analysis found duplicated code blocks rose 8x in 2024 versus prior years, refactoring-associated changes dropped from 25% (2021) to under 10% (2024), and copy/pasted lines exceeded moved lines for the first time in the dataset's history. Each pattern is locally plausible. Together they describe a codebase becoming harder to maintain.
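One cheap pre-merge check for hallucinated dependencies is to list the imports in generated code that do not resolve in the current environment. The sketch below uses only the standard library (`ast` and `importlib.util.find_spec`); the function names are hypothetical. It is a first filter under clear limitations: it checks import names against what is installed locally, not distribution names against a package index, and a name that resolves is not thereby proven legitimate.

```python
import ast
import importlib.util


def _resolves(module_name: str) -> bool:
    """Return True if a top-level module can be located locally."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for unusual entries; treat as unresolved.
        return False


def unresolved_imports(source: str) -> list[str]:
    """List top-level modules imported by `source` that cannot be found.

    Relative imports (from . import x) are skipped because they refer to
    the project itself, not to third-party packages.
    """
    top_level = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                top_level.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.level == 0 and node.module:
                top_level.add(node.module.split(".")[0])
    return sorted(name for name in top_level if not _resolves(name))
```

Any name this returns deserves a manual lookup before anyone runs an install command for it, since a hallucinated name that repeats across runs is exactly what a slopsquatting attacker would register.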

You match autonomy to your ability to verify

The more autonomy you grant the AI, the more correctness you must independently verify, and the harder verification gets. A suggestion you accept after reading is cheap to verify; an autonomous agent that ran 47 tool calls, edited 12 files, and pushed a PR forces you to reconstruct everything it did. Match the leash to your verification budget.

Reasoning-mode models do not fix this. The November 2025 AA-Omniscience benchmark explicitly penalizes wrong answers and rewards admitting uncertainty, whereas most production benchmarks do the opposite, training models to guess rather than refuse. The downstream effect shows in OpenAI's own o3 and o4-mini system card: o3 hallucinated 33% of the time on PersonQA, double the rate of its predecessor o1, and o4-mini reached 48%. More compute does not equal more honesty.

Automation bias is the documented tendency to favor AI recommendations even when contradictory evidence is present. Thoughtworks placed "complacency with AI-generated code" on their Technology Radar as a recognized risk, noting that AI-driven confidence often comes at the expense of critical thinking, with automation bias, anchoring bias, and review fatigue all contributing.

In coding, this manifests as accepting AI output without tracing through the logic, deferring architectural decisions to the model, and losing the habit of critical evaluation. The METR study captured the perception gap: 16 experienced open-source developers working on familiar codebases in early 2025 believed AI made them 20% faster, while measured outcomes showed they were 19% slower. The scope matters: the slowdown is conditioned on high prior codebase familiarity and tasks the developers had already partially scoped, not on AI being categorically slower. A 2026 follow-up from METR suggests returning developers are now seeing roughly an 18% speedup, with substantial selection-bias caveats that make the new estimate weaker evidence than the original. The durable lesson survives both estimates: you cannot trust your intuition about whether AI is helping on your work. You need to measure it.
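A rough way to start measuring is to compare task completion times with and without AI against what the team believes is happening. The sketch below is an illustration under strong assumptions, not the METR methodology: the function names are hypothetical, it compares medians of two samples, and it ignores differences in task mix, which a real study would control by assigning tasks to conditions at random.

```python
from statistics import median


def measured_time_change(baseline_minutes, ai_minutes):
    """Relative change in median task time when using AI.

    Negative means tasks finished faster with AI; positive means slower.
    """
    base = median(baseline_minutes)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return (median(ai_minutes) - base) / base


def perception_gap(perceived_change, measured_change):
    """Difference between measured and perceived change in task time.

    A positive gap means AI helped less than people believed.
    """
    return measured_change - perceived_change
```

With hypothetical inputs that mirror the direction of the perception gap described above, a perceived change of `-0.20` against a measured change of `+0.19` gives a gap of about `0.39`: the team feels faster while the clock says slower.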

If your productivity drops to near zero when your AI tool has an outage, that is a warning sign. You should be able to read code, debug, reason about architecture, and write implementations without AI assistance. AI is an accelerant, not a crutch.

Anthropic's 2026 research found developers who delegated code generation to AI scored 17% lower on comprehension tests; MIT Media Lab measured similar declines in memory and neural connectivity from prolonged AI use, and ICIS 2025 found developer expertise is the primary factor mitigating hallucination impact. Maintaining your fundamentals is the safety net that makes AI collaboration viable.

  • Using AI for every task regardless of whether it is a good fit
  • Vibe coding into production: accepting AI output you cannot explain or debug
  • Continuing to iterate on a failing AI interaction instead of stepping back and reframing
  • Trusting AI-generated security code without dedicated expert review
  • Treating verification as free, including by granting autonomy beyond what you can verify ("the agent ran for an hour, the diff is 2,000 lines, I will skim it")
  • Assuming reasoning-mode models hallucinate less; on grounded recall tasks they often hallucinate more
  • Not having a plan B when your AI tool is down or producing poor results
  • Letting AI choose frameworks, libraries, or architectures for domains you do not understand well enough to evaluate the choice
  • Assuming AI's productivity impact is constant rather than measuring actual outcomes for your team and tasks
  • Deploying AI-generated code that passes tests but has never been read by a human who understands the business logic

Hallucinations, Benchmarks, and Supply Chain
