Pillar 10: Data Hygiene and IP

What you put into AI tools matters as much as what comes out.

AI-assisted development creates two categories of data risk that traditional development does not. On the input side, every prompt, every file you share, every MCP connection, and every adjacent communication channel (Slack, email, screen shares, screenshots) is a potential leak vector for secrets, proprietary code, and sensitive client data. On the output side, AI-generated code can introduce hardcoded credentials, reference non-existent packages that attackers can squat on, and produce code with unclear intellectual property status.

GitGuardian's 2026 State of Secrets Sprawl report found that secret leak rates in AI-assisted code were roughly double the GitHub-wide baseline, with AI-assisted commits leaking secrets at approximately 3.2%. MCP server configurations alone exposed over 24,000 unique secrets, with 2,117 confirmed as valid credentials. Research from Harmonic Security found that 8.5% of prompts submitted to AI tools included sensitive information such as PII, credentials, and internal file references.

These are not theoretical risks. In 2023, Samsung banned ChatGPT after engineers pasted proprietary source code into the tool. More recently, the OpenClaw security crisis demonstrated the risks of AI agent tooling at scale: a Meta employee working in AI safety and alignment posted on X that she could not prevent ClawBot from deleting a major portion of her email inbox, and researchers found 1,184 malicious skills in ClawHub's marketplace, including infostealers disguised as legitimate tools.

For embedded engineers working on client codebases, the risk profile is elevated. You are handling code and data that belongs to someone else.

What We Expect

You never paste secrets, credentials, or API keys into AI prompts, and you handle secrets through managed infrastructure

Treat every AI interaction as potentially logged; if you would not post it on a public Slack channel, do not put it in a prompt. That rule covers hardcoded tokens in code files you share as context, environment variables you copy into a prompt for debugging, and screenshots that contain sensitive information.

The behavioral rule only holds when secrets have a legitimate place to live: a managed vault as the source of truth, runtime retrieval via managed identity or short-lived federated tokens (not long-lived static keys in CI), and rotation treated as routine rather than exceptional.
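
A minimal sketch of runtime retrieval, assuming the vault agent or CI identity binding injects each secret as an environment variable before the process starts. The names `get_secret` and `MissingSecretError` are hypothetical, not part of any library. The point is that the code never carries a literal fallback value, so there is nothing secret-bearing to paste into a prompt or commit.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the environment."""


def get_secret(name: str, env=os.environ) -> str:
    """Return a secret injected at runtime; never fall back to a literal."""
    value = env.get(name, "")
    if not value:
        # Fail loudly instead of substituting a default credential.
        raise MissingSecretError(
            f"secret {name!r} is not set; check the vault or CI identity binding"
        )
    return value
```

A caller would use something like `get_secret("DB_PASSWORD")` at startup. Passing `env` explicitly keeps the function testable without touching the real environment.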

You understand what data leaves your machine

Know the difference between tools that run entirely on your machine (for example, locally hosted models) and tools that send data to external APIs. A CLI tool with local file access can still upload whatever it reads, so local file access is not the same as local processing. Understand your AI tool's data retention and training policies. When working on client projects with confidentiality requirements, verify that your tool configuration complies.

Default rule for sensitive data: do not share client code, regulated data (PHI, PII, financial records), proprietary IP, production logs, or secret-bearing material with AI tools unless all three of the following are true: (a) your contract with the data owner permits it, (b) your organization's policy permits it for the specific tool, and (c) the tool configuration is verified to comply (training disabled if required, retention configured, region-locked if required). Default to local-only or vetted enterprise-tier tools for sensitive workloads. When in doubt, escalate before pasting.
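
The default rule above is a default-deny gate over three conditions. One way to sketch it, with hypothetical names (`SharingCheck`, `may_share`, `blocking_reasons`) chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingCheck:
    contract_permits: bool  # (a) contract with the data owner allows it
    policy_permits: bool    # (b) organization policy allows it for this tool
    config_verified: bool   # (c) tool configuration verified to comply


def may_share(check: SharingCheck) -> bool:
    """Default deny: sharing is allowed only when all three conditions hold."""
    return check.contract_permits and check.policy_permits and check.config_verified


def blocking_reasons(check: SharingCheck) -> list[str]:
    """List the unmet conditions so the engineer knows what to escalate."""
    reasons = []
    if not check.contract_permits:
        reasons.append("contract does not permit sharing")
    if not check.policy_permits:
        reasons.append("organization policy does not permit this tool")
    if not check.config_verified:
        reasons.append("tool configuration is not verified")
    return reasons
```

A single unmet condition is enough to block sharing, and the list of reasons doubles as the content of the escalation message.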

You audit MCP configurations for credential exposure

MCP server configs are a common vector for secrets leakage. Never store credentials directly in MCP configuration files. Use environment variables, secrets managers, or client-side authentication patterns instead. Review any MCP configuration before committing it to source control.
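
As a rough illustration of a pre-commit audit, the sketch below assumes the JSON layout many MCP clients use (a top-level `mcpServers` object whose entries may carry an `env` map). It also assumes your client expands `${VAR}` or `$VAR` references, which varies by client and should be checked. The function name and the key-name heuristics are hypothetical, and the check only inspects `env` values, not `args`.

```python
import json
import re

# Values that only reference a variable, e.g. "${GITHUB_TOKEN}" or "$GITHUB_TOKEN".
ENV_REFERENCE = re.compile(r"^\$\{?[A-Za-z_][A-Za-z0-9_]*\}?$")
# Key names that usually carry credentials (heuristic, not exhaustive).
SENSITIVE_KEY = re.compile(r"(token|secret|password|api[_-]?key|credential)", re.IGNORECASE)


def find_inline_secrets(config_text: str) -> list[str]:
    """Return 'server.KEY' for every env entry that looks like a literal credential."""
    config = json.loads(config_text)
    findings = []
    for server, spec in config.get("mcpServers", {}).items():
        for key, value in spec.get("env", {}).items():
            is_literal = isinstance(value, str) and value and not ENV_REFERENCE.match(value)
            if SENSITIVE_KEY.search(key) and is_literal:
                findings.append(f"{server}.{key}")
    return findings
```

A heuristic like this complements a dedicated secret scanner rather than replacing it; it mainly catches the obvious case of a token typed straight into the config.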

You treat team communication as a leak surface, not just AI prompts

Credentials leak at least as often through Slack DMs, email threads, tickets, screen-share recordings, screenshots, and pasted error logs as they do through AI tools. Assume anything pasted is archived.

If a teammate needs access, grant them the role rather than pasting the key; when a credential must change hands, use a short-lived single-view share link rather than chat or email. Scrub logs and screenshots for tokens, session IDs, and PII before posting anywhere.
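
Scrubbing can be partly automated before a log excerpt is posted. The following is a minimal sketch with a handful of illustrative patterns (AWS-style access key IDs, bearer tokens, `key=value` credentials, email addresses). The pattern list is an assumption for demonstration and will miss many real formats, so a manual read-through is still needed.

```python
import re

# Illustrative patterns only; a real scrubber needs a maintained rule set.
PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY_ID]"),
    (re.compile(r"(?i)\b(bearer)\s+[A-Za-z0-9._~+/-]+=*"), r"\1 [REDACTED_TOKEN]"),
    (
        re.compile(r"(?i)\b(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
]


def scrub(text: str) -> str:
    """Apply each redaction pattern in order and return the cleaned text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

For example, `scrub("password=hunter2 user=alice@example.com")` returns `password=[REDACTED] user=[REDACTED_EMAIL]`.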

You run security scanning on AI-generated code

AI-generated code should go through the same security tooling as human-written code, with extra attention to the failure modes AI introduces. The core scanning categories:

SCA (Software Composition Analysis): Scan AI-generated dependencies for known vulnerabilities and verify packages actually exist. Tools like Snyk, Sonatype, and Mend.io flag vulnerable or phantom dependencies before they ship.
SAST (Static Application Security Testing): Scan generated code for security flaws, injection vulnerabilities, and insecure patterns. Tools like Semgrep, SonarQube, and Checkmarx catch issues that compile cleanly but fail security review.
DAST/IAST (Dynamic/Interactive Application Security Testing): Test running applications for security vulnerabilities by simulating real-world attacks against live endpoints. DAST tools identify issues like authentication flaws, injection vulnerabilities, and misconfigurations that only appear at runtime. Tools like OWASP ZAP, Burp Suite, and Acunetix probe applications externally to uncover exploitable weaknesses that static analysis may miss.
Secret scanning: Scan for hardcoded credentials, API keys, and tokens that AI may embed in generated code. GitGuardian, TruffleHog, and Gitleaks detect secrets before they reach source control.
SBOM (Software Bill of Materials): Maintain an inventory of all components in your codebase, including AI-generated code. SBOMs are becoming a compliance requirement and are critical for auditing what AI contributed and what dependencies it introduced.
IaC (Infrastructure as Code) Scanning: Scan infrastructure definitions (Terraform, CloudFormation, Kubernetes manifests, etc.) for misconfigurations, insecure defaults, and excessive permissions before deployment. This prevents issues like public storage exposure, open network access, or overprivileged roles. Tools like Checkov and tfsec detect risks early in the delivery pipeline.
Container and Image Scanning: Scan container images and underlying OS packages for vulnerabilities, misconfigurations, and outdated components. AI-generated Dockerfiles introduce specific failure modes: secrets baked into build context or layers, floating base image tags, and missing build-stage separation. Exclude secret-bearing paths from the build context, keep secrets out of image layers, pin base images for reproducibility, use multi-stage builds, and sign images destined for production. Tools like Trivy, Grype, and Docker Scout flag image and OS-package vulnerabilities; Cosign handles signing and verification.

None of these are new to AI-assisted development, but AI changes the volume and the risk profile. You're generating more code faster, with less manual review per line. The tooling must scale to match.
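
To make the phantom-dependency check concrete, here is a small sketch that compares a requirements-style file against a vetted allowlist before anything is installed. The allowlist is an assumption (it could come from a lockfile or an internal registry), the function names are hypothetical, and the parsing handles only simple `name==version` style lines. It does not replace an SCA tool.

```python
import re

# Matches the package name at the start of a requirements-style line.
NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name: str) -> str:
    """Normalize names the way PyPI does: lowercase, runs of -_. become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unvetted_packages(requirements_text: str, vetted: set[str]) -> list[str]:
    """Return package names that are not on the vetted allowlist."""
    vetted_norm = {normalize(v) for v in vetted}
    unknown = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r / -e
        match = NAME.match(line)
        if match and normalize(match.group(1)) not in vetted_norm:
            unknown.append(match.group(1))
    return unknown
```

Anything the function returns deserves a manual look: confirm the package exists, is maintained, and is the one the AI meant rather than a lookalike name.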

You review AI-generated code for license compliance

AI models can regurgitate training-set examples, including copyleft-licensed open-source code, without flagging the source. The risk concentrates in two places: third-party dependencies the AI suggests, and substantial inline code blocks that may match existing public repositories.

Standard SCA license scanning catches the first: copyleft or otherwise incompatibly licensed dependencies in your manifest. Catching the second requires snippet-similarity scanning, which compares inline source code against public OSS corpora and is not part of typical CI. For client deliverables, core product IP, or anything where ownership matters, run snippet scanning before merge; the dependency layer alone will not flag inline regurgitation.
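
The idea behind snippet-similarity scanning can be illustrated with token shingles and Jaccard similarity. This is a toy sketch under stated assumptions: it compares two strings only, whereas commercial scanners fingerprint code against large indexed corpora and are more robust to renaming. The function names and the window size `k = 5` are arbitrary choices for illustration.

```python
import re


def shingles(source: str, k: int = 5) -> set[tuple[str, ...]]:
    """Return the set of k-token windows, ignoring whitespace and layout."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: str, b: str, k: int = 5) -> float:
    """Share of k-token windows the two snippets have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0  # snippets shorter than k tokens cannot be compared
    return len(sa & sb) / len(sa | sb)
```

Because the comparison works on tokens, reformatting a copied block does not hide it, but renaming identifiers would lower the score. That gap is one reason to rely on a dedicated scanner for anything where ownership matters.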

You understand the IP implications of AI-generated code

In the United States, the Copyright Office has stated that works predominantly generated by AI without meaningful human authorship are not eligible for copyright protection. AI tool providers prominently display warranty disclaimers that push the due diligence burden back onto the businesses integrating AI-generated code.

This means: you cannot assume AI-generated code is free of licensing encumbrances, and copyright protection over purely AI-generated output is jurisdiction-dependent and unsettled. Meaningful human authorship is necessary but not sufficient. For code where ownership matters (client deliverables, core product IP, anything carrying compliance or warranty obligations), apply operational controls in addition to author review: provenance tracking that records what was AI-generated and what was human-written, snippet-similarity scanning against public OSS corpora to catch inline regurgitation, SCA license scanning on dependencies, review of your AI provider's terms (training rights, indemnification, attribution requirements), and contract review for client deliverables. For high-stakes deliverables, route through legal and compliance early. This guidance is operational, not legal advice.
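
Provenance tracking can start as simply as a commit-message convention. The sketch below assumes a hypothetical `AI-Assisted: yes` trailer in the final paragraph of each commit message. That trailer name is an illustration, not an established standard, and your team or client may mandate a different mechanism.

```python
def is_ai_assisted(commit_message: str, trailer: str = "AI-Assisted") -> bool:
    """True when the final paragraph carries '<trailer>: yes' or '<trailer>: true'."""
    paragraphs = [p for p in commit_message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == trailer.lower():
            return value.strip().lower() in {"yes", "true"}
    return False


def provenance_summary(messages: list[str]) -> dict[str, int]:
    """Count commits labeled as AI-assisted versus everything else."""
    flagged = sum(is_ai_assisted(m) for m in messages)
    return {"ai_assisted": flagged, "human_only_or_unlabeled": len(messages) - flagged}
```

Unlabeled commits are counted separately from labeled ones on purpose: the absence of a trailer shows only that nobody recorded the origin, not that the code was human-written.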

Anti-patterns

Defaulting to "share unless I think of a reason not to" rather than "do not share unless contract, policy, and configuration all explicitly permit it"
Pasting error logs containing credentials or API keys into AI prompts for debugging
Committing MCP configuration files with hardcoded secrets to source control
Leaving a leaked credential in place instead of rotating immediately once it has appeared in any shared surface
Not running SCA or SAST on AI-generated code before merge
Installing AI-suggested dependencies without verifying they exist and are maintained (slopsquatting risk)
Assuming AI-generated code is automatically free of licensing issues
Sharing client proprietary code with AI tools without understanding the data handling policies
Using AI tools on air-gapped or compliance-sensitive projects without verifying data flow
No SBOM tracking for AI-generated components in your codebase
Installing AI agent plugins or skills from unvetted marketplaces (the OpenClaw lesson)

Resources

Data Leakage and Secrets

GitGuardian: 2026 State of Secrets Sprawl - Definitive data on secret leak rates in AI-assisted development, including MCP configuration exposure
Harmonic Security: GenAI Prompt Data Leakage (2025) - 8.5% of AI prompts contain sensitive information
OWASP Secrets Management Cheat Sheet - Best practices for secrets lifecycle management
Secret Scanning Tools (2026) - Landscape of tools for detecting leaked credentials

CI and Supply Chain Patterns

OpenID Connect (OIDC) for federated CI credentials - OIDC is the open standard for replacing long-lived static cloud keys in CI with short-lived federated tokens. The pattern is supported across major cloud providers; the linked walkthrough is the most-cited practitioner reference.
SLSA Provenance - Open Linux Foundation specification for supply-chain attestation: what a build artifact contains and how it was built. The standard reference in any modern supply-chain conversation.
Sigstore - Open Linux Foundation standard for image signing and verification; the emerging answer to "how do you prove this artifact is the one your CI built."

Security Scanning

OWASP Top 10 for LLM Applications - Industry-standard framework for LLM security risks
CISA SBOM Resources - Federal guidance on Software Bill of Materials practices
NIST Secure Software Development Framework (SSDF) - Federal framework covering SCA, SAST, secret management, and supply-chain hygiene as a unified discipline. The closest single document to a "what every team should be doing" reference.

IP and Licensing

US Copyright Office: AI and Copyright - Official guidance on copyright eligibility for AI-generated works

Cautionary Examples

Samsung Bans ChatGPT After Data Leak (2023) - Engineers pasted proprietary source code into AI tools
The OpenClaw Security Crisis (2026) - Malicious skills, deleted data, and the risks of unvetted AI agent marketplaces
ClawHavoc: 1,184 Malicious Skills in ClawHub - Supply chain attack on AI agent skill marketplace

Related Pillars

Pillar 6: Verification and Security - The verification framework for AI-generated code
See Learning Paths for deeper dives
