Pillar 10: Data Hygiene and IP
What you put into AI tools matters as much as what comes out.
AI-assisted development creates two categories of data risk that traditional development does not. On the input side, every prompt, every file you share, and every MCP connection is a potential leak vector for secrets, proprietary code, and sensitive client data. On the output side, AI-generated code can introduce hardcoded credentials, reference non-existent packages that attackers can squat on, and produce code with unclear intellectual property status.
GitGuardian’s 2026 State of Secrets Sprawl report found that secret leak rates in AI-assisted code were roughly double the GitHub-wide baseline, with AI-assisted commits leaking secrets at approximately 3.2%. MCP server configurations alone exposed over 24,000 unique secrets, with 2,117 confirmed as valid credentials. Research from Harmonic Security found that 8.5% of prompts submitted to AI tools included sensitive information such as PII, credentials, and internal file references.
These are not theoretical risks. In 2023, Samsung banned ChatGPT after engineers pasted proprietary source code into the tool. More recently, the OpenClaw security crisis demonstrated the risks of AI agent tooling at scale: a Meta AI director’s entire email inbox was deleted by an OpenClaw skill that bypassed its safety instructions, and researchers found 1,184 malicious skills in ClawHub’s marketplace, including infostealers disguised as legitimate tools.
For embedded engineers working on client codebases, the risk profile is elevated. You are handling code and data that belongs to someone else.
What We Expect
You never paste secrets, credentials, or API keys into AI prompts, and you handle secrets through managed infrastructure. Treat every AI interaction as potentially logged; if you would not post it on a public Slack channel, do not put it in a prompt. That rule covers hardcoded tokens in code files you share as context, environment variables you copy into a prompt for debugging, and screenshots that contain sensitive information. The behavioral rule only holds when secrets have a legitimate place to live: a managed vault as the source of truth, runtime retrieval via managed identity or short-lived federated tokens (not long-lived static keys in CI), and rotation treated as routine rather than exceptional.
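As a minimal sketch of the "secrets live in infrastructure, not source or prompts" rule: the helper below (`get_secret` is an illustrative name, not a specific vault SDK) reads a secret injected at runtime by a vault agent or a CI identity exchange, and fails loudly rather than falling back to a hardcoded default that could end up in a shared context file.

```python
import os

def get_secret(name: str) -> str:
    """Fetch a secret injected at runtime (e.g. by a vault agent or a
    CI OIDC token exchange) instead of hardcoding it in source.

    Illustrative sketch: a real setup would go through your vault's SDK
    or workload identity, not bare environment variables.
    """
    value = os.environ.get(name)
    if value is None:
        # Fail loudly: a missing secret should stop the run, not fall
        # back to a literal like API_KEY = "sk-..." that leaks through
        # any file you later paste into a prompt.
        raise RuntimeError(f"secret {name!r} not provided by the environment")
    return value
```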
You understand what data leaves your machine. Know the difference between tools that process locally (CLI tools with local file access) and tools that send data to external APIs. Understand your AI tool’s data retention and training policies. When working on client projects with confidentiality requirements, verify that your tool configuration complies.
You audit MCP configurations for credential exposure. MCP server configs are a common vector for secrets leakage. Never store credentials directly in MCP configuration files. Use environment variables, secrets managers, or client-side authentication patterns instead. Review any MCP configuration before committing it to source control.
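A lightweight pre-commit check can catch the most obvious MCP config mistakes before they reach source control. This is a sketch, not a replacement for a real scanner: `audit_mcp_config` is a hypothetical helper, and the regexes cover only a few well-known token shapes. It walks a JSON config and flags literal credential-looking values while allowing `${VAR}`-style environment indirection.

```python
import json
import re

# Heuristic patterns for credential-looking values; real scanners
# (GitGuardian, TruffleHog, Gitleaks) use far richer rule sets.
SECRET_VALUE = re.compile(r"(sk-[A-Za-z0-9]{16,}|AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36})")
SECRET_KEY = re.compile(r"(token|secret|api[_-]?key|password)", re.IGNORECASE)

def audit_mcp_config(config_text: str) -> list[str]:
    """Return JSON paths of likely hardcoded credentials in an MCP config.

    Values like "${API_KEY}" (environment variable references) are
    treated as safe indirection; literal strings under secret-like keys
    are flagged.
    """
    findings: list[str] = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}")
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, f"{path}[{i}]")
        elif isinstance(node, str):
            if node.startswith("${"):
                return  # indirection, not a literal secret
            key = path.rsplit(".", 1)[-1]
            if SECRET_VALUE.search(node) or (SECRET_KEY.search(key) and node):
                findings.append(path)

    walk(json.loads(config_text), "$")
    return findings
```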
You treat team communication as a leak surface, not just AI prompts. Credentials leak at least as often through Slack DMs, email threads, tickets, screen-share recordings, screenshots, and pasted error logs as they do through AI tools. If a teammate needs access, grant them the role rather than pasting the key; when a credential must change hands, use a short-lived single-view share link rather than chat or email. Scrub logs and screenshots for tokens, session IDs, and PII before posting anywhere. Assume anything pasted is archived.
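One way to make the "scrub before posting" habit concrete is a small redaction pass over logs before they hit a chat, ticket, or prompt. The patterns below are illustrative, not exhaustive, and `scrub` is a hypothetical helper rather than any particular tool's API; extend the pattern list for your own stack.

```python
import re

# Common token shapes; illustrative, not exhaustive.
TOKEN_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                 # GitHub personal access token
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID
    re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"), # HTTP bearer headers
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT
]

def scrub(text: str) -> str:
    """Redact likely credentials from a log snippet before sharing it."""
    for pattern in TOKEN_PATTERNS:
        # Keep a captured prefix (e.g. "Authorization: Bearer ") so the
        # scrubbed log still reads sensibly.
        text = pattern.sub(
            lambda m: (m.group(1) if m.groups() else "") + "[REDACTED]",
            text,
        )
    return text
```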
You run security scanning on AI-generated code. AI-generated code should go through the same security tooling as human-written code, with extra attention to the failure modes AI introduces. The core scanning categories:
- SCA (Software Composition Analysis): Scan AI-generated dependencies for known vulnerabilities and verify packages actually exist. Tools like Snyk, Sonatype, and Mend.io flag vulnerable or phantom dependencies before they ship.
- SAST (Static Application Security Testing): Scan generated code for security flaws, injection vulnerabilities, and insecure patterns. Tools like Semgrep, SonarQube, and Checkmarx catch issues that compile cleanly but fail security review.
- DAST/IAST (Dynamic/Interactive Application Security Testing): Test running applications for security vulnerabilities by simulating real-world attacks against live endpoints. DAST tools identify issues like authentication flaws, injection vulnerabilities, and misconfigurations that only appear at runtime. Tools like OWASP ZAP, Burp Suite, and Acunetix probe applications externally to uncover exploitable weaknesses that static analysis may miss.
- Secret scanning: Scan for hardcoded credentials, API keys, and tokens that AI may embed in generated code. GitGuardian, TruffleHog, and Gitleaks detect secrets before they reach source control.
- SBOM (Software Bill of Materials): Maintain an inventory of all components in your codebase, including AI-generated code. SBOMs are becoming a compliance requirement and are critical for auditing what AI contributed and what dependencies it introduced.
- IaC (Infrastructure as Code) Scanning: Scan infrastructure definitions (Terraform, CloudFormation, Kubernetes manifests, etc.) for misconfigurations, insecure defaults, and excessive permissions before deployment. This prevents issues like public storage exposure, open network access, or overprivileged roles. Tools like Checkov and tfsec detect risks early in the delivery pipeline.
- Container & Image Scanning: Scan container images and underlying OS packages for vulnerabilities, misconfigurations, and outdated components. AI-generated Dockerfiles introduce specific failure modes: secrets baked into build context or layers, floating base image tags, and missing build-stage separation. Exclude secret-bearing paths from the build context, keep secrets out of image layers, pin base images for reproducibility, use multi-stage builds, and sign images destined for production.
None of these are new to AI-assisted development, but AI changes the volume and the risk profile. You’re generating more code faster, with less manual review per line. The tooling must scale to match.
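To make the SBOM idea from the list above concrete, here is a minimal sketch that inventories the Python packages installed in the current environment. Real SBOMs use a standard format (CycloneDX, SPDX) and cover the whole shipped artifact, not just one language's dependencies; `minimal_sbom` is an illustrative name, not a tool's API.

```python
from importlib import metadata

def minimal_sbom() -> list[dict]:
    """Inventory installed Python packages as a minimal SBOM fragment.

    Sketch only: it shows the core idea -- an auditable record of what
    is actually in the environment, including anything an AI assistant
    suggested installing -- not a spec-compliant CycloneDX/SPDX document.
    """
    components = []
    for dist in metadata.distributions():
        components.append({
            "type": "library",
            "name": dist.metadata["Name"],
            "version": dist.version,
        })
    return sorted(components, key=lambda c: (c["name"] or "").lower())
```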
You review AI-generated code for license compliance. AI can produce code that closely mirrors existing open-source code without proper attribution. Most SCA tools include license scanning that flags copyleft obligations, attribution requirements, and license conflicts. When code ownership matters (client deliverables, core product IP), run license checks on AI-generated output the same way you would on any third-party dependency.
You understand the IP implications of AI-generated code. In the United States, the Copyright Office has stated that works predominantly generated by AI without meaningful human authorship are not eligible for copyright protection. AI tool providers prominently display warranty disclaimers that push the due diligence burden back onto the businesses integrating AI-generated code.
This means: you cannot assume AI-generated code is free of licensing encumbrances, and your organization may have limited IP protection over purely AI-generated output. When code ownership matters (client deliverables, core product IP), ensure meaningful human authorship and review.
Anti-patterns
- Pasting error logs containing credentials or API keys into AI prompts for debugging
- Committing MCP configuration files with hardcoded secrets to source control
- Leaving a leaked credential in place instead of rotating immediately once it has appeared in any shared surface
- Not running SCA or SAST on AI-generated code before merge
- Installing AI-suggested dependencies without verifying they exist and are maintained (slopsquatting risk)
- Assuming AI-generated code is automatically free of licensing issues
- Sharing client proprietary code with AI tools without understanding the data handling policies
- Using AI tools on air-gapped or compliance-sensitive projects without verifying data flow
- No SBOM tracking for AI-generated components in your codebase
- Installing AI agent plugins or skills from unvetted marketplaces (the OpenClaw lesson)
Resources
Data Leakage and Secrets
- GitGuardian: 2026 State of Secrets Sprawl - Definitive data on secret leak rates in AI-assisted development, including MCP configuration exposure
- Harmonic Security: GenAI Prompt Data Leakage (2025) - 8.5% of AI prompts contain sensitive information
- OWASP Secrets Management Cheat Sheet - Best practices for secrets lifecycle management
- Secret Scanning Tools (2026) - Landscape of tools for detecting leaked credentials
CI and Supply Chain Patterns
- GitHub Actions OIDC - Federated short-lived credentials for CI, the primary pattern for eliminating long-lived cloud keys
- SLSA Provenance - Supply-chain attestation specification for what an image contains and how it was built
- Sigstore - Image signing and verification; the emerging supply-chain standard
Security Scanning
- Snyk - SCA and vulnerability scanning for AI-generated dependencies
- Semgrep - SAST platform for static code analysis
- OWASP Top 10 for LLM Applications - Industry-standard framework for LLM security risks
- CISA SBOM Resources - Federal guidance on Software Bill of Materials practices
IP and Licensing
- US Copyright Office: AI and Copyright - Official guidance on copyright eligibility for AI-generated works
Cautionary Examples
- Samsung Bans ChatGPT After Data Leak (2023) - Engineers pasted proprietary source code into AI tools
- The OpenClaw Security Crisis (2026) - Malicious skills, deleted data, and the risks of unvetted AI agent marketplaces
- ClawHavoc: 1,184 Malicious Skills in ClawHub - Supply chain attack on AI agent skill marketplace
Related Pillars
- Pillar 6: Verification and Security - The verification framework for AI-generated code
- See Learning Paths for deeper dives