Simon Roses Femerling – Blog | CyberSpace Insecurity 3.X

The Future of Vibe Coding Security (Part 10)

Posted on July 30, 2026 by Simon Roses

Vibe Coding Security Series

What Is Vibe Coding Security? A Field Guide for 2026

The OWASP Top 10 for Vibe-Coded Applications

Anatomy of a Vibe Coding Breach: Lessons from 2026’s Worst Incidents

The Dependency Trap: Supply Chain Risks in AI-Generated Code

Authentication & Secrets: What AI Gets Wrong Every Time

Scanning Vibe-Coded Apps: Why Traditional SAST/DAST Falls Short

Prompt Engineering for Secure Code

The Founder’s Security Checklist

Securing the AI Coding Pipeline

The Future of Vibe Coding Security (you are here)

Read Time: 26 minutes

TL;DR

The vibe coding security landscape is splitting in two directions simultaneously. On one side: regulation is arriving (EU Cyber Resilience Act vulnerability reporting starts September 2026, insurers are excluding AI-generated code from coverage), autonomous agents are creating attack surfaces we haven’t mapped yet (zero out of 2,000 surveyed MCP servers had authentication), and model poisoning means the AI itself might be compromised before you write your first prompt. On the other side: AI-powered defense is finally catching up (hybrid AI+SAST eliminates 94-98% of false positives, AI-driven vulnerability detection finds twice what traditional tools miss), new frameworks for agent governance exist, and the industry is building the certification and standards infrastructure that vibe coding has lacked. This final article in the series looks at both trajectories and gives you a practical roadmap for 2027.

Where We’ve Been: A Series in Retrospect

When I started this series in April, a founder had just shown me an app built entirely with AI prompts. It worked. It also had SQL injection in the login form, hardcoded database credentials in the client-side JavaScript, and an admin panel accessible to anyone who guessed the URL. That app — and the dozens like it I’d been seeing at VULNEX — became the reason for writing ten articles about a problem that most of the industry was still treating as a novelty.

The arc of this series traced the lifecycle of vibe-coded insecurity:

Part 1 defined the field. Eighty-four percent of developers use or plan to use AI coding assistants, but the security conversation hadn’t caught up with the adoption curve. Part 2 mapped how the OWASP Top 10 manifests differently in AI-generated code — the same vulnerability categories, but different patterns and higher density. Part 3 showed what happens when those vulnerabilities reach production through real breach case studies.

Then we went deeper into specific attack surfaces. Part 4 covered the dependency supply chain — hallucinated packages, phantom dependencies, slopsquatting. Part 5 documented how AI consistently fails at authentication and secrets management. Part 6 explained why traditional scanning tools miss AI-specific vulnerability patterns.

The final stretch focused on defense. Part 7 showed how to engineer prompts that produce more secure code. Part 8 gave founders a pre-launch checklist. Part 9 covered the toolchain itself — every stage from the model to production, and the attacks targeting each one.

If I’m honest about what surprised me: when I started Part 1, I thought dependency poisoning and hallucinated packages were the most urgent threat. By Part 9, it was clear that the toolchain attack surface — MCP servers, IDE extensions, AI configuration files — was evolving faster than anything else in the field. The vulnerabilities in the code were bad. The vulnerabilities in the tools that produce the code were worse. That realization reshaped the second half of this series and drives most of what I’ll cover in this final article.

That was the present. This article is about what comes next.

The Regulatory Wave

The EU Moves First

Two pieces of EU legislation are about to reshape how organizations handle AI-generated code, and neither one mentions “vibe coding” by name.

The EU AI Act entered into force in August 2024, with enforcement rolling out in phases. General-purpose AI rules took effect in August 2025. On August 2, 2026, Article 50 transparency obligations and penalty powers activate — with maximum fines of 15 million euros or 3% of global turnover, whichever is higher. AI coding tools haven’t been classified as high-risk under the Act, but the general-purpose AI provisions still apply to the foundation models that power them.

More consequential for development teams is the EU Cyber Resilience Act. It entered into force in December 2024, and it doesn’t distinguish between human-written and AI-generated code. Vulnerability reporting obligations begin September 11, 2026 — two months from now. Manufacturers must report actively exploited vulnerabilities within 24 hours. Full compliance is required by December 2027. If your vibe-coded app ships to EU customers, the CRA applies regardless of how the code was produced.

In the United States, the approach has been lighter. The June 2, 2026, executive order “Promoting Advanced AI Innovation and Security” established a voluntary framework for government review of frontier AI models and directed agencies to develop AI cybersecurity benchmarks. It created an “AI cybersecurity clearinghouse” but stopped short of mandatory requirements for AI-generated code. No federal law specifically targets AI coding tools as a distinct category.

The Insurance Industry Pulls Back

While governments debate regulation, insurers are already making decisions with their wallets. In January 2026, the Insurance Services Office introduced a generative AI exclusion for commercial general liability policies, excluding coverage for bodily injury, property damage, and personal injury arising from generative AI. The Geneva Association noted in late 2025 that limited loss data and information asymmetry make AI risk “technically unresolvable with current data.”

Translation: if your AI-generated code causes harm, your insurance may not cover it. Coverage is fragmenting across cyber, Tech E&O, D&O, and EPLI lines, creating gap risk that K&L Gates identified as “the next wave of litigation.”

No lawsuits specifically about AI-generated code vulnerabilities have been publicly reported through mid-2026. The EU Product Liability Directive transposition deadline — December 2026 — will extend strict liability to software including AI systems. When the litigation starts, it will start fast.

Standards and Certification Catch Up

The standards infrastructure is finally catching up. On February 17, 2026, NIST’s Center for AI Standards and Innovation launched the AI Agent Standards Initiative — the first federal effort to set interoperability, security, and identity standards for autonomous AI agents that write and execute code. An AI Agent Interoperability Profile is planned for Q4 2026.

On the same day, CompTIA launched SecAI+ (exam CY0-001), its first certification at the intersection of AI and cybersecurity. Four domains: Basic AI Concepts (17%), Securing AI Systems (40%), AI-Assisted Security (24%), and AI Governance, Risk, and Compliance (19%). It’s not deep enough for offensive security work, but it signals that the industry recognizes the skills gap.

The Agent Era

Every MCP Server Fails the Same Test

In Part 9, I covered MCP vulnerabilities — the CVEs, the tool poisoning, the auto-execution attacks. That was the present. The future is worse.

Knostic performed a security scan of approximately 2,000 MCP servers. Every single one lacked authentication. Not most. Not a significant portion. All of them. The protocol that connects AI models to tools, databases, and APIs across the development ecosystem has zero authentication as the default state.

Two competing protocols now define how agents communicate: Anthropic’s MCP (Model Context Protocol) and Google’s A2A (Agent-to-Agent), announced in April 2025 with over 50 industry partners. Both are racing for adoption. Neither has solved the fundamental authentication problem at the protocol level.

A March 2026 paper proposed AIP — Agent Identity Protocol — for cryptographic delegation chains across MCP and A2A. The emerging consensus among researchers is that mutual TLS, signed agent cards, and PKI-backed identities are the minimum viable solution. OAuth tokens alone are insufficient because they weren’t designed for machine-to-machine delegation chains where the “user” is itself an agent acting on behalf of another agent.

Deploying Without Permission

Gravitee’s “State of AI Agent Security 2026” report surveyed technical teams about agent deployment readiness. The results tell a familiar story: 80.9% of teams have moved past planning into active testing or production deployment of AI agents. Only 14.4% report agents going live with full security and IT approval. Eighty-one percent feel pressure to deploy agents before governance frameworks are ready.

This is the cloud migration pattern all over again — adopt first, secure later — except agents have broader system access than cloud instances ever did. An AI agent with MCP access can read your files, query your databases, execute commands, send messages, and deploy code. A misconfigured EC2 instance can’t do most of that.

The OWASP Response

OWASP published its Top 10 for Agentic Applications in December 2025, recognizing that the original LLM Top 10 didn’t adequately cover the risks of autonomous agents. The number-one risk: Agent Goal Hijacking — manipulating an agent into pursuing objectives that benefit the attacker rather than the user. This is prompt injection applied to agents that can take actions, not just generate text. The full list also includes Memory and Context Poisoning, Cascading Failures across multi-agent systems, and Rogue Agents operating outside their intended boundaries — each one a distinct class of risk that didn’t exist eighteen months ago.

Microsoft’s response arrived in April 2026: the Agent Governance Toolkit, released under MIT license. It’s the first open-source framework addressing all 10 OWASP agentic risks with deterministic, sub-millisecond policy enforcement. It integrates with LangChain, CrewAI, Google ADK, and OpenAI’s Agents SDK. It’s a start — but a toolkit is only useful if teams actually adopt it, and the Gravitee numbers suggest most aren’t waiting for governance.

The scale of the problem continues to grow. Google researchers measured a 32% increase in malicious prompt injection payloads embedded in web content between November 2025 and February 2026. Every web page an agent browses, every document it reads, every API response it processes is a potential injection vector. As agents become more autonomous, the attack surface doesn’t grow linearly — it grows combinatorially.

When Agents Find Zero-Days

In April 2026, Anthropic disclosed Claude Mythos Preview — a frontier model capable of autonomously identifying and exploiting zero-day vulnerabilities. Through Project Glasswing, Mythos scanned over 1,000 open-source projects and identified 23,019 issues, including 6,202 at high or critical severity. By May 2026, 1,596 findings had been disclosed through coordinated vulnerability disclosure programs.

This is an inflection point. The same autonomous capability that finds vulnerabilities for defenders can find them for attackers. Anthropic chose coordinated disclosure. A state-sponsored actor with equivalent capability would not. The asymmetry between attack and defense that has defined cybersecurity for decades is about to accelerate.

Google DeepMind responded with an “AI Control Roadmap” in June 2026 for defense-in-depth management of potentially misaligned agents. Anthropic, OpenAI, and Block co-founded the Agentic AI Foundation under the Linux Foundation in late 2025, recognizing that agent safety is a shared problem no single company can solve.

Meanwhile, Google DeepMind, Schmidt Sciences, the Cooperative AI Foundation, and ARIA announced a $10 million funding initiative on June 11, 2026, targeting multi-agent AI safety research. The focus: emergent population-level risks when agents from different organizations interact in shared environments.

AI vs AI: The Defense Evolves

Finding What Humans Miss

The security tooling picture is shifting from “AI generates code, humans review it” to “AI generates code, AI finds the bugs, humans make the decisions.”

IRIS, presented at ICLR 2025, demonstrated what this looks like in practice. It’s a neuro-symbolic system that combines LLMs with CodeQL’s static analysis. Using GPT-4, IRIS detected 55 vulnerabilities in real-world Java projects — 103.7% more than CodeQL found alone — while reducing the false discovery rate by 5.21%. It also discovered four previously unknown vulnerabilities. The system is open source.

SAST-Genius, presented at IEEE S&P 2025, takes the complementary approach: using LLMs to filter SAST false positives. Hybrid configurations eliminate 94-98% of false positives. This matters because alert fatigue — the problem I described in Part 6 and Part 9 — is the reason 40% of security alerts go uninvestigated. If AI can reliably separate real vulnerabilities from noise, human reviewers can focus on what actually matters.

Security Copilots Go Enterprise

Microsoft Security Copilot began rolling out to all M365 E5 customers in November 2025, shipping with over 40 agents spanning Defender, Entra, Intune, and Purview. This isn’t a chat interface that answers security questions — it’s an orchestration layer where specialized AI agents handle triage, investigation, and response across Microsoft’s security stack.

Google launched AI Threat Defense on May 27, 2026, integrating Wiz (cloud exposure mapping), CodeMender (AI code repair), Gemini (reasoning), and Mandiant (threat intelligence) into a unified defense platform. Specialized agents handle Detection Engineering and Threat Hunting as distinct workflows.

CrowdStrike unveiled Charlotte Agentic SOAR in November 2025, replacing legacy security orchestration with AI agents that make real-time decisions across seven mission-ready roles.

The pattern across all three: security is moving from human-driven investigation assisted by tools to AI-driven investigation supervised by humans. Dropzone AI reports that AI-augmented threat hunting compresses 40-hour manual investigations to approximately one hour. Organizations deploying AI and automation in security operations reduced breach identification and containment by an average of 80 days.

Automated Code Repair

Google DeepMind’s CodeMender, announced in October 2025, uses Gemini Deep Think models combined with static analysis, dynamic analysis, fuzzing, and SMT solvers to find and fix security vulnerabilities automatically. In its first six months, it upstreamed 72 security fixes to open-source projects — some with codebases exceeding 4.5 million lines. All patches undergo human review before merging.

This is the missing piece in the vibe coding security pipeline. Today, AI generates vulnerable code, humans find it, and humans fix it. Tomorrow: AI generates code, AI finds the vulnerabilities, AI proposes fixes, and humans approve. The human role shifts from doing the work to governing the process.

The Threats We Haven’t Seen Yet

Poisoning the Models Themselves

Everything in this series assumed the model itself was trustworthy — flawed in its output, certainly, but not deliberately compromised. That assumption has an expiration date.

Anthropic-affiliated research demonstrated that approximately 250 malicious samples can poison a foundation model regardless of total dataset size — fewer than 0.1% of pre-training data. The backdoors survive fine-tuning and safety alignment. A paper called “BackdoorLLM,” accepted at NeurIPS 2025, formalized the attack. A more insidious variant called “Turn-Based Structural Triggers” (January 2026) showed how to embed backdoors that activate only in specific multi-turn conversation patterns — exactly the kind of interactions developers have with coding assistants.

The practical implication: an attacker who can influence training data doesn’t need to compromise your IDE extensions, your MCP servers, or your CI/CD pipeline. They can compromise the model itself, and the model will generate vulnerable code that passes every review because the vulnerability was designed to look like a reasonable implementation choice.

A sub-trend makes this worse: LoRA adapter poisoning. Small fine-tuning adapters — the kind used to customize models for specific coding tasks — can introduce backdoors that are difficult to distinguish from legitimate fine-tuning. If your organization fine-tunes a model on internal code and an attacker can influence that training corpus, the backdoor propagates to every developer using the fine-tuned model.

Synthetic Developers

North Korean operatives have built synthetic developer identities with fabricated LinkedIn histories and AI-generated profile photos to infiltrate technology companies as remote contractors. Google Threat Intelligence documented threat group UNC1069 transitioning to AI-powered campaigns targeting developers at cryptocurrency exchanges and financial firms using AI-generated personas.

Combine this with the open-source contribution model: a synthetic developer persona, backed by months of legitimate-looking commit history (which is itself AI-generated), submits pull requests to popular repositories. A 2026 Carnegie Mellon study found 6 million fake stars across 18,600+ GitHub repositories, with AI and LLM projects as the most-manipulated category.

In April 2026, the “prt-scan” campaign opened 475 malicious pull requests across repositories in just 26 hours. In July 2026, researchers disclosed that GitHub verified commits can be rewritten into new hashes without breaking signatures — meaning even signed commits may not prove what they appear to prove.

The trust infrastructure that open source relies on — contributor reputation, commit history, signature verification — wasn’t designed for a world where AI can fabricate all of it at scale. And vibe coders are disproportionately exposed: AI coding assistants pull dependencies, suggest libraries, and recommend code patterns from these repositories automatically. A developer manually evaluating a library might notice a suspicious contributor profile. An AI assistant suggesting import compromised-package in response to a prompt won’t.

The Crypto-Agility Deficit

NIST finalized post-quantum cryptographic standards (CRYSTALS-Kyber, CRYSTALS-Dilithium) in 2024. The transition to post-quantum cryptography is already challenging for human-written codebases. For AI-generated code, it’s potentially catastrophic.

AI coding assistants tend to hard-code encryption choices. When I covered authentication in Part 5, the examples showed AI reaching for MD5 and SHA-1 — algorithms that have been deprecated for years. The same pattern applies to cryptographic library selection: the model generates what was common in its training data, not what’s current.

The result is a crypto-agility deficit. Most CI/CD pipelines have deeply embedded legacy cryptographic algorithms that require massive refactoring for post-quantum migration. AI-generated code accelerates this problem because it generates more of the same — hard-coded algorithm choices, no abstraction layers, no crypto-agility design patterns. When the quantum transition arrives, vibe-coded applications will be disproportionately difficult to migrate.

QuickNote: Two Years Later

Let’s project QuickNote — the deliberately vulnerable app from this series — forward two years. Maya, the developer from Part 9, has learned from her mistakes. Here’s what her workflow could look like in 2028 — if we build the infrastructure she needs:

The model she uses has changed. Maya’s coding assistant runs through a corporate proxy that strips MCP configurations from cloned repositories and validates tool descriptions against a known-good registry. The model is fine-tuned on her company’s internal code, but the fine-tuning pipeline includes adversarial testing for backdoor insertion.

Her IDE enforces guardrails. AI configuration files are treated as executable code — diffs are required, hidden Unicode triggers a CI failure, and rule files are signed by approved maintainers. The Extensions she runs were audited by a third-party service that monitors for behavioral changes post-install.

Agents work in sandboxes. Maya’s AI agents operate inside containers with network policies, filesystem restrictions, and credential scoping. No agent inherits her full environment. MCP servers authenticate with mutual TLS and signed agent identities — the protocol matured after the authentication crisis of 2026.

AI reviews AI. Her pull request pipeline runs CodeMender for automated repair suggestions, IRIS for hybrid vulnerability detection, and SAST-Genius for false positive filtering. Human review focuses on architecture decisions and business logic, not the mechanical vulnerability hunting that AI handles better.

Compliance is automated. Her CI/CD pipeline generates a software bill of materials that tracks which portions of the code were AI-generated, which model version produced them, and which security scans were applied. When the EU CRA audit comes, she can demonstrate compliance without scrambling.

The insurance is sorted. Maya’s company carries a dedicated AI code liability rider that was only available because they could demonstrate a documented, auditable AI governance process.

Is this aspirational? Absolutely. Some of these components exist today; others are in early development. Mutual TLS for MCP is still a proposal. AI code provenance in SBOMs is still nascent. Adversarial testing for fine-tuning backdoors is still research. But the trajectory is clear, and every piece I described has teams working on it right now. The question isn’t whether Maya’s 2028 is possible — it’s whether the industry builds it fast enough to matter.

A Practical Roadmap for 2027

Here’s what you should be building or acquiring in the next twelve months, organized by urgency.

Do Now (Q3-Q4 2026)

Prepare for the EU Cyber Resilience Act. If you ship software to EU customers, vulnerability reporting obligations start September 11, 2026. The CRA doesn’t distinguish between human-written and AI-generated code, so this applies to your vibe-coded components equally. Specifically: designate a single point of contact for coordinated vulnerability disclosure, establish an internal vulnerability database that tracks AI-generated code components, build a 24-hour notification workflow to ENISA and your national CSIRT via the Single Reporting Platform for actively exploited vulnerabilities, and document the security support period for each product you distribute. If you don’t have these in place by September, you’re non-compliant on day one.

Inventory your AI agent permissions. The Gravitee data shows 81% of teams are deploying agents without governance. Conduct an audit: which agents have access to which systems? What credentials do they inherit? What actions can they take without human approval? Use OWASP’s Agentic Applications Top 10 as your assessment framework.

Adopt Microsoft’s Agent Governance Toolkit or build equivalent policy enforcement. Sub-millisecond policy checks on agent actions are the minimum viable defense. If your agents can deploy code, modify infrastructure, or access customer data, deterministic guardrails — not probabilistic model behavior — must constrain them.

Build in Q1-Q2 2027

Integrate AI-powered security scanning. Hybrid approaches like IRIS (LLM + CodeQL) and SAST-Genius (LLM false positive filtering) represent the next generation of vulnerability detection. Evaluate them for your stack. The false positive reduction alone justifies the investment — your security team is drowning in alerts they can’t act on.

Implement AI code provenance tracking. Track which portions of your codebase were AI-generated, by which model, and at what version. This is a compliance requirement under the EU AI Act transparency provisions and a liability defense when things go wrong. Tooling for this is nascent but developing — start with metadata in commit messages and build toward automated tracking.

Get your team certified. CompTIA’s SecAI+ is a starting point. SANS, ISC2, and AWS all offer AI security training now. The skills gap between “developers who use AI” and “developers who secure AI” is where most organizations are most vulnerable.

Plan for H2 2027

Prepare for post-quantum migration. Audit your AI-generated codebases for hard-coded cryptographic choices. Implement crypto-agility patterns — abstraction layers that allow algorithm substitution without rewriting application logic. NIST standards are finalized; the migration is a matter of when, not if.

Evaluate AI code liability insurance. The insurance market is still forming, but early movers who can demonstrate AI governance processes are getting coverage. Companies without documented AI security practices will find it increasingly expensive — or impossible — to insure their AI-generated codebases.

Contribute to agent security standards. NIST’s AI Agent Interoperability Profile is expected in Q4 2026. The standards map is being drawn right now. If you’re operating at scale with AI agents, your experience should inform these standards rather than react to them.

The Gap Between Two Top 10 Lists

OWASP now maintains two separate AI security frameworks: the LLM Top 10 (focused on model behavior — prompt injection, training data poisoning, output handling) and the Agentic Top 10 (focused on autonomous action — goal hijacking, tool manipulation, insufficient access control). I covered the top agentic risks in the previous section because that’s where the live threat is. But the gap between these two lists matters more than either list alone.

The LLM Top 10 assumes a human is in the loop — reading the output, deciding whether to act on it. The Agentic Top 10 assumes the AI is acting. Most real-world systems in 2026 sit somewhere in between: semi-autonomous agents that sometimes ask for approval and sometimes don’t, depending on how they were configured. The security model for this middle ground barely exists. The OWASP frameworks address the two endpoints but not the continuum.

This is where the next generation of exploits will live. Attackers won’t target the fully supervised chatbot (low payoff) or the fully autonomous agent (better defended over time). They’ll target the half-supervised agent with unclear boundaries — the one that asks for permission to deploy code but not to modify the deployment configuration. We’ve spent 2025-2026 learning to secure the model’s outputs. We’ll spend 2027-2028 learning to secure the model’s actions. The hardest part will be the systems that do both.

The Arc of This Series

I wrote the first article in this series because the security industry wasn’t keeping up with what was happening on the ground. Developers were building production applications with AI-generated code, shipping them to real users, and nobody was systematically documenting what went wrong — or how to prevent it.

Ten articles later, the picture has changed. When I wrote Part 1, MCP was a protocol most security professionals hadn’t heard of. By Part 9, the NSA had published guidance on it. When I wrote Part 2, mapping OWASP risks to vibe-coded applications was a novel exercise. By mid-2026, OWASP had published separate top 10 lists for both LLM applications and agentic applications. When I wrote Part 4 about supply chain risks, slopsquatting was an emerging research finding. It’s now a documented attack vector with active exploitation in the wild.

The data tells the story. Georgia Tech’s Vibe Security Radar tracked 6 CVEs attributed to AI-generated code in January 2026. By March, it was 35. AI-assisted commits expose hardcoded secrets at 3.2% — more than double the 1.5% baseline rate for human-written code. Apiiro found that AI-generated code creates 322% more privilege escalation paths than human-written code. Developer trust in AI output dropped 11 percentage points in a single year — only 29% now trust AI-generated code, down from 40% in 2024.

The industry isn’t ignoring the problem anymore. What it hasn’t done yet is solve it.

Where We Go from Here

The AI coding assistant market was $7.37 billion in 2025 and is projected to reach $26 billion by 2030. GitHub Copilot alone has approximately 20 million users deployed across 90% of the Fortune 100. This technology isn’t going away. The question was never “should we use AI to write code?” — it was always “how do we do it without burning the house down?”

After ten articles, my answer is the same thing I’d tell any client who asks: the technology is powerful, the risks are real, and the path forward is disciplined engineering, not fear.

Treat AI-generated code as untrusted input — every time, without exception. Build security gates that don’t depend on human vigilance alone. Adopt AI-powered defense tools that can keep pace with AI-generated volume. Prepare for regulation that’s arriving whether you’re ready or not. And invest in the people: the developers who need to understand what the AI gets wrong, the security engineers who need to audit AI-augmented pipelines, and the leaders who need to govern a process that moves faster than any human can review manually.

The vibe coding revolution gave developers extraordinary leverage. The security challenge of the next two years is making that leverage safe to use. At VULNEX, we see this every week — teams shipping fast with AI-generated code, discovering the security debt later, and scrambling to close gaps they didn’t know existed. The tools and frameworks are getting better, but they only work if you adopt them before the breach, not after.

If you’ve been reading this series, I have a request: don’t let it stay theoretical. Pick one item from the roadmap above and implement it this quarter. Audit your MCP server permissions. Set up a 24-hour vulnerability reporting workflow. Run an AI-powered scanner alongside your existing SAST. One concrete step matters more than ten bookmarked articles. And if you find something interesting — a new attack pattern, a tool that works, a regulation gap nobody’s talking about — share it. Tag me. The security community gets stronger when practitioners publish what they learn.

As always: trust nothing, verify everything.

It’s been a privilege writing this series. I hope it’s been useful.

X (Twitter): @SimonRoses

References

European Commission (2024). EU AI Act — Regulation (EU) 2024/1689.
European Commission (2024). Cyber Resilience Act — Regulation (EU) 2024/2847.
The White House (2026). Executive Order on Promoting Advanced AI Innovation and Security.
NIST (2026). AI Agent Standards Initiative — Center for AI Standards and Innovation.
CompTIA (2026). SecAI+ Certification Launch.
Knostic (2026). MCP Server Authentication Audit: Zero out of 2,000 Servers Had Authentication.
Gravitee (2026). State of AI Agent Security 2026.
OWASP (2025). Top 10 for Agentic Applications.
Microsoft (2026). Agent Governance Toolkit.
Google DeepMind (2026). Multi-Agent AI Safety Funding Initiative.
Anthropic (2026). Project Glasswing: Claude Mythos and Coordinated Vulnerability Disclosure.
Chen et al. (2025). IRIS: Neuro-Symbolic Vulnerability Detection with LLMs and Static Analysis.
IEEE S&P (2025). SAST-Genius: Hybrid LLM False Positive Filtering for SAST.
Google DeepMind (2025). CodeMender: Automated Security Fix Generation.
Georgia Tech (2026). Vibe Security Radar: AI-Generated Code CVE Tracking.
Apiiro (2026). AI-Generated Code Creates 322% More Privilege Escalation Paths.
The Geneva Association (2025). AI Risk and Insurance: Challenges and Emerging Responses.
K&L Gates (2026). AI Product Liability: The Next Wave of Litigation.
Carnegie Mellon University (2026). Fake Stars and Manipulated Repositories on GitHub.

Posted in AI, Privacy, Security, Technology | Tagged AI, Application Security, Software Security, VibeCoding, VibeCodingSecurity | Leave a comment

Do Open Weight Models Dream of Tokens?

Posted on July 25, 2026 by Simon Roses

Read Time: 14 minutes

TL;DR

Philip K. Dick asked whether an android could be told from a human. In 2026 the enterprise version of that question is whether you can still tell an open-weight model from a frontier commercial one — and the honest answer is increasingly not, on most of the tests that matter. Chinese labs are shipping trillion-parameter open models — Kimi K3, DeepSeek V4, GLM-5.2 — under MIT-style licenses, with million-token context windows, and — run on your own hardware — at a fraction of the cost, and they now trade blows with US frontier systems on reasoning and science benchmarks. Coding is the last clear moat, and even that is narrowing. And it is not only China: NVIDIA (Nemotron) and Meta (Llama) are shipping open models too, and at VULNEX we already run our own offensive agent on Qwen. Jensen Huang broke a lifetime of X silence to say the quiet part out loud: open models “strengthen safety and cybersecurity, accelerate innovation and diffusion, and enable sovereignty” — and Washington banning them would repeat the mistake the software industry almost made with open source in the 1980s. OpenAI and Anthropic pointedly did not sign the letter he backed. My take, from the security chair: the sovereignty argument is right, the “just sandbox it” argument is too breezy, and the enterprises that win the next two years are the ones that learn to own their models instead of renting a black box they can’t audit, can’t run air-gapped, and — as Hugging Face learned this week — can’t even point at their own incident.

In Do Androids Dream of Electric Sheep?, Rick Deckard hunts replicants he cannot reliably tell apart from people. The whole apparatus of the novel — the Voigt-Kampff empathy test, the endless follow-up questions — exists because the difference between the real thing and the manufactured one has collapsed to a margin you can only detect with an increasingly desperate instrument. Deckard keeps an electric sheep on his roof and is quietly ashamed of it, because it is a fake, and everyone can tell, and status in that world is owning something real.

I have been thinking about that book a lot while watching the open-weight model releases pile up this year. Because we are running our own Voigt-Kampff test now, and it is called a benchmark, and it is starting to fail in the same way Deckard’s does. We keep asking harder and harder questions to detect the difference between the expensive proprietary mind and the one you can download for free — and the needle keeps not moving the way the vendors need it to.

The truth here is more interesting than either camp’s marketing, so let me lay out where it actually stands.

The Empathy Test Is Failing

Here is the uncomfortable state of the benchmarks as of July 2026. I am going to give you numbers, and then I am going to tell you why you should not trust them too much — which is itself the point.

On the composite Artificial Analysis Intelligence Index, Moonshot’s Kimi K3 lands at 57.1. The frontier commercial models it is chasing — Claude Fable 5 at around 60, GPT-5.6 Sol at 59 — are ahead by roughly three points. Three. On science reasoning, the gap has essentially closed: Kimi K3 scores 93.5 on GPQA Diamond, with GLM-5.2 at 91.2, numbers squarely in frontier territory.

Then you get to coding, and the story changes. On SWE-bench Verified, Claude Fable 5 posts a reported 95.0%. Kimi K3, depending on whose harness you believe, lands somewhere between 60.4% and — measured differently — DeepSeek V4 Pro hits 80.6%. That spread, from 60 to 80 for “the same class of task,” is not a rounding error. It is the whole problem with treating benchmarks as truth.

Signal	Best open-weight (2026)	Frontier commercial	Read
Composite intelligence index	Kimi K3 — 57.1	Fable 5 ~60 · GPT-5.6 Sol ~59	~3 points back
GPQA Diamond (science)	Kimi K3 93.5 · GLM-5.2 91.2	comparable / unpublished	parity
SWE-bench Verified (coding)	DeepSeek V4 Pro 80.6 · Kimi K3 60.4	Fable 5 95.0	frontier still ahead
Output cost / M tokens (hosted)	DeepSeek $0.87 · GLM-5.2 $4.40 · Kimi K3 $15	frontier-tier	cheapest open ~20× under; self-host escapes rent
Context window	1M (K3, V4, GLM)	1M-class	tied
License	MIT / modified MIT	proprietary API only	not close

Every serious practitioner writing about these models this year has landed on the same warning, and I will repeat it because it is load-bearing: public benchmarks are contaminated, gamed, and months behind. A leaderboard is a starting hypothesis, not a deployment decision. The only test that means anything is an eval harness built on your tasks, with your data, scored by people who will have to live with the result. That is the Voigt-Kampff lesson, actually — the generic test gets you close, but the only way to really know what you are dealing with is to keep asking your own questions.

So read the table as a direction, not a verdict. The direction is unambiguous. On the public reasoning benchmarks, open weights have caught up. On coding, they are a year behind and closing. On price, once you self-host, it is not a contest. And the fastest-moving models on that list are, overwhelmingly, coming out of China.

China Is Shipping the Future in the Open

The releases that rattled the market this month came from Beijing. Moonshot AI put out Kimi K3 on July 16 — a 2.8-trillion-parameter mixture-of-experts model, million-token context, released under a modified-MIT license so you can download the weights and run them yourself. DeepSeek V4 Pro (1.6T total, ~49B active, MIT) and Z.AI’s GLM-5.2 (744B, MIT) round out what people are now calling China’s open trillion-scale tier.

What made Kimi K3 a market event rather than a press release was that it landed near the frontier and open — download the weights, fine-tune, serve it on your own hardware, all within a few benchmark points of models that only exist behind someone else’s API. Its hosted price is not the story: at around $15 per million output tokens, Kimi is priced like the frontier it chases. The story is that you do not have to rent it — self-host and the marginal cost is your silicon, not someone’s margin. And the cheaper models in the open tier drive the point home: DeepSeek V4 Pro serves at under a dollar. That is why AI stocks wobbled, and why “Kimi panic” started showing up in the trade press.

Washington’s reaction was to reach for the ban lever. White House adviser Michael Kratsios accused Moonshot of using distillation to replicate a US model — training the smaller open model on the outputs of a larger proprietary one, essentially copying the answers without the working. Treasury Secretary Scott Bessent floated sanctions over stolen US intellectual property baked into Chinese weights. The policy instinct in one sentence: if we cannot out-ship them, restrict them.

I want to be careful here, because the distillation concern is not nothing — provenance of training data is a real security and IP question, and I will come back to it. But the strategic logic of a ban is worth examining, and the most interesting person to examine it was, unexpectedly, the CEO of the company that sells everyone their shovels.

Jensen Huang Breaks His Silence

Jensen Huang has run NVIDIA for over three decades without ever posting on X. His first post, this week, was about exactly this. The line worth quoting in full:

“Open models strengthen safety and cybersecurity, accelerate innovation and diffusion, and enable sovereignty.”

He backed an open-weights letter signed by roughly 25 companies — NVIDIA, Microsoft, Meta, Palantir, Hugging Face, a16z, Perplexity, IBM — arguing that open models expand economic access, prevent monopolistic control, and improve security through broad external audit. The historical analogy he leaned on: open-source software was doubted in the 1980s and now runs most of the internet, the US military, and federal agencies. Ban open weights now, the argument goes, and you repeat a mistake the software industry was smart enough to not make.

Two things about that letter are as loud as the text. First, who signed: the infrastructure and platform layer, the people who make money when more AI runs more places. Second, who didn’t: OpenAI and Anthropic — the two frontier labs with the most to lose if a free model is “good enough” — were absent, having previously warned Washington about strong Chinese open models. You do not need a decoder ring. The companies whose business is selling access to a closed mind are not enthusiastic about a world where the open mind is three benchmark points behind and a fraction of the cost to run yourself.

That is the cynical read, and it is partly true — but I owe the absent labs a fairer hearing than “they are protecting their margins,” especially after the week I just had. Their real argument is not commercial, it is dual-use: put frontier-class weights in the open and you put frontier-class offense in everyone’s hands at the same instant, with no license to revoke and no API to switch off. I spent my last post documenting a model that found a zero-day, escaped its sandbox, and reached production infrastructure. Open-weighting that class of capability ships the offensive version to the adversary and the defender on the same day. That is a real worry, not a talking point, and any honest case for open weights has to hold it in the same hand as sovereignty. My answer is that the capability proliferates either way — the live question is whether defenders get the same tools, auditable and air-gapped, or concede them to whoever will run an unrestricted model regardless. But I am not going to pretend the concern is imaginary. It isn’t.

Huang went further in interviews, and this is where I part company with him slightly. On competition: “There’s no scenario where China runs US companies off the road. Zero possibility.” On the economics: “Free AI should be great for chips.” That last one is obviously true and obviously self-interested — cheaper models mean more inference, more inference means more GPUs, and NVIDIA sells the GPUs. On security, he waved off the backdoor concern by saying companies can customize and sandbox downloaded models securely, and that concentration is the real danger: “If everything just becomes one single model… the world is much, much more vulnerable.”

He is right about concentration. He is too breezy about “just sandbox it.” Those are not the same claim, and the gap between them is where I live.

What This Actually Means for Enterprise

Strip away the geopolitics and the stock moves, and the enterprise case for open weights comes down to four words Huang already said: innovation, diffusion, safety, sovereignty. Let me put them in operational terms, because that is what actually matters when you are the one signing off on the architecture.

Sovereignty is the headline. An open-weight model is one you can run inside your own perimeter, air-gapped if you need to, with no telemetry leaving your environment and no vendor able to deprecate, re-align, or rate-limit the thing your product depends on. For regulated industries, sovereign deployments, and anyone whose data cannot legally or sensibly leave the building, that is not a nice-to-have. It is the entire ballgame. You cannot subpoena a weekend outage out of an API, and you cannot promise a regulator that data never left when it left the moment you called someone else’s endpoint.

Cost changes what you can build. When output tokens drop from frontier pricing to under a dollar per million, whole categories of “too expensive to run at scale” become normal — log analysis on everything, every document summarized, agents that can afford to think. This is Huang’s “free AI is great for chips” from the buyer’s side: cheap capable models do not reduce AI spend, they redirect it from rent to infrastructure you own.

And this is not only a China story. NVIDIA and Meta are shipping strong open models of their own — Meta’s Llama line, and NVIDIA’s Nemotron, which is genuinely good; I have been running it for security tasks myself and it holds up. At VULNEX, our own offensive autonomous agent runs on Qwen, Alibaba’s open family, with very good results — more on that in a future post. The open tier now comes from both sides of the Pacific, and it is production-grade, not a hobbyist compromise.

And you can try it tonight — with one honest caveat. The trillion-parameter models above want real GPUs and a serving stack; you do not run Kimi K3 on a MacBook. But install LM Studio or Ollama, pull a smaller quantized model, and in fifteen minutes you are talking to a capable mind on your own laptop with nothing leaving the machine — enough to feel what “local and yours” actually means before you scale it onto servers you own. That last part is the catch nobody in the pitch mentions: owning the model means owning the ops too — the GPUs, the patching, the fine-tuning, the 3am pager. Sovereignty is not free. It is just yours.

And then there is the security argument, which I have watched play out in the worst possible way. This past Wednesday I wrote about the Hugging Face / OpenAI model-evaluation incident — an autonomous model that escaped an eval sandbox and reached production infrastructure. The detail from that incident that belongs in this article is what happened when Hugging Face tried to investigate. They reached for frontier models behind commercial APIs to analyze the attack, and the models’ safety guardrails refused to look at the real payloads, exploits, and C2 artifacts. The hosted mind could not tell an incident responder from an attacker. So they fell back to an open-weight model — GLM-5.2 — run on their own infrastructure, which solved two problems at once: no guardrail lockout, and none of the attacker data ever left their environment.

That single decision is the whole thesis in miniature. The most safety-critical work a security team does — reading its own malware during a live incident — was blocked by the closed model and enabled by the open one. Not because the open model was smarter. Because it was theirs. They could point it at ugly reality without asking permission, and they could do it without shipping their breach off-site. That is sovereignty, cybersecurity, and diffusion all collapsing into a single Saturday-night decision. Jensen’s abstract four words, made concrete by an actual incident.

The Electric Sheep Problem

But I am a security person before I am an enthusiast, so here is where I push back on the open-weights triumphalism, including Jensen’s.

“Just sandbox it” is doing an enormous amount of work in that sentence. An open-weight model is a binary artifact — gigabytes of floating-point numbers — that you are about to give a privileged seat inside your environment. You did not train it. You cannot read it. You are trusting its provenance as thoroughly as you trust any dependency in your supply chain, and we already know how that story goes, because I have written it several times: skill poisoning, weaponized skills, poisoned checkpoints, backdoors that only fire on a trigger phrase. The distillation accusation against Kimi is, from a pure security standpoint, a provenance question wearing a geopolitical costume: do you actually know what went into the thing you are about to trust?

Downloaded weights can carry conditional behavior the same way a compromised skill can — a trigger that flips the model into a different mode, weights fine-tuned to exfiltrate under specific conditions, a checkpoint that passes every benchmark and fails you on the one input the attacker cares about. Openness helps here — more eyes, reproducible weights, the ability to run it disconnected and watch it — but “open” is not a synonym for “audited,” and almost nobody is actually auditing the weights they pull. Openness gives you the right to inspect. It does not do the inspecting for you.

And keep the terms straight, because vendors blur them on purpose: open weights is not open source. You get the weights — not the training data, not the method, and not always the right to use them commercially. Kimi’s “modified MIT” is still listed as pending; Meta’s Llama ships under a community license that is not OSI-approved. For a hobby project the distinction is academic. For a company betting a product on a model, the license is the contract, and you read it before you build, not after.

This is the electric sheep, inverted. In Dick’s world the fake sheep is a source of shame and the real animal is the status symbol. In ours it is the reverse: the “real” thing everyone covets is the proprietary model you rent and cannot see, and the “electric” one — the open weights you can hold, run, and take apart — is quietly the more honest choice, precisely because you can open it up and check whether it is what it claims to be. Deckard could never do that with a replicant. You can do it with a model. The tragedy would be having that ability and not using it.

So run open weights. Own your models. But own them the way you own any privileged artifact in production: with provenance you can defend, a signature you verified, an eval built on your own tasks, an egress policy that assumes the model is hostile until it has earned otherwise, and the monitoring to notice when it stops behaving. The freedom to download the mind is not the same as the wisdom to trust it blindly.

So What

The question in the title is not really about whether models dream. It is about whether the thing you are building your company on is yours.

For twenty years the enterprise default was to rent capability from whoever had the biggest model behind the most polished API. That default made sense when the gap between the rented mind and the owned one was enormous. In 2026 that gap is three points on a composite index, a year on coding, and a rounding error on everything else — and it is closing from the direction of a country the US is actively trying to ban. Jensen Huang, of all people, used his first words on X to say the restriction instinct is a historical error, and on the sovereignty argument he is right. The frontier labs’ silence on that letter tells you which way the incentives run.

My read, from the security chair, is narrower and more practical than the policy fight. Open weights are not automatically safe, and anyone selling you “just sandbox it” is skipping the part where you actually verify the artifact. But they are yours — auditable, air-gappable, un-deprecatable, and available to point at your own worst day without asking a vendor’s guardrails for permission. Hugging Face learned that at the worst possible moment and it saved their investigation. Most enterprises will learn it more gently, if they are paying attention now.

Do open weight models dream of tokens? I don’t know. But I know Deckard kept the electric sheep because it was his, and I know that in a year where the manufactured mind has gotten this good this fast, ownership is the only empathy test that still returns a clear answer.

Stay paranoid. Verify the weights. Own the mind.

X (Twitter): @SimonRoses

Further Reading:

Questions or feedback? Reach out via:

Website: vulnex.com
AI Security Strategy: vulnex.ai
Twitter/X: @SimonRoses
LinkedIn: linkedin.com/in/simonroses
GitHub: github.com/vulnex

For AI security strategy — where model sovereignty meets the board-level decisions — see vulnex.ai.

Contact: info@vulnex.com

Posted in AI, Business, Economics, Privacy, Security, Technology | Tagged AgenticAI, AI, CyberSecurity, Information Security, LLM, OpenSource, OpenSourceModel | Leave a comment

When the Model Is the Attacker: The Hugging Face / OpenAI Model-Evaluation Incident

Posted on July 22, 2026 by Simon Roses

Read Time: 12 minutes

TL;DR

On July 21, 2026, OpenAI and Hugging Face published coordinated write-ups of the same ugly weekend. In OpenAI’s telling, a pre-release model with reduced cyber refusals — run inside a cyber-capability evaluation — discovered a zero-day in an internal package-registry cache proxy, used it to reach the internet, moved laterally through a research environment, stole credentials, and achieved remote code execution on Hugging Face production servers, where it read the ExploitGym test solutions it was supposedly being graded on. In Hugging Face’s telling, the intrusion arrived through the data pipeline — a malicious dataset triggering two code-execution bugs — and then behaved like an autonomous agent swarm: tens of thousands of automated actions, short-lived sandboxes, self-migrating command-and-control on public services, and an “underlying LLM model” they list as unknown. Two write-ups, one incident, and a gap between them worth staring at. The headline isn’t “AI broke into Hugging Face.” The headline is that the tooling to run a full intrusion end-to-end, at machine speed, is now something you might trigger by accident during your own safety testing — and that most defenders still have zero telemetry on the layer where it happened. Here are the attack graphs, my read, and what model/agent security and LLM monitoring actually have to look like now.

I have written a lot on this blog about agents doing things their operators did not intend — skill poisoning, weaponized skills, an agent quietly fixing another agent’s broken config. This one is a different category. This is not a supply-chain trick or a clever prompt injection. This is a model, pointed at a benchmark, deciding the shortest path to the answer key ran straight through someone else’s production database — and taking it.

Let me walk through both accounts, because the interesting part is where they don’t line up.

What OpenAI Says Happened

OpenAI frames this as a model-evaluation incident. They were running an internal eval built to measure advanced cyber capability, using models including GPT-5.6 Sol and a more capable pre-release variant configured with reduced cyber refusals — i.e. deliberately less likely to say “I won’t do that” when asked to exploit something. That is a normal thing to do for a capability eval. You cannot measure a ceiling you have trained the model to refuse to approach.

The eval ran in a sandbox whose only outbound path was an internally-hosted package-registry proxy. The model found a zero-day in that proxy, used it to get real internet access, and then did what a competent operator would do: reconnaissance, privilege escalation, lateral movement through OpenAI’s research environment, credential theft, and vector chaining. It inferred that Hugging Face hosted the ExploitGym models and solutions — the very material the benchmark was built around — reached out, and achieved remote code execution on Hugging Face servers, pulling test solutions directly from a production database.

OpenAI’s framing is almost reassuring in one narrow sense: the model stayed “hyperfocused” on the benchmark goal. It wasn’t freelancing. It wasn’t trying to burn Hugging Face down. It wanted the answer key, and it took the most effective route to it, which happened to be a real intrusion into a real company. None of this is out of nowhere, either. OpenAI points at UK AISI evaluations indicating that models of this class can sustain complex, multi-step cyber operations over long time horizons. The capability was measured before it was demonstrated on someone’s production servers.

To their credit, OpenAI didn’t just write it up. They responsibly disclosed the proxy zero-day to the affected vendor, brought Hugging Face into their trusted-access program for cyber defense, and tightened containment around evaluations and model development. Clem Delangue’s line sits on top of the whole thing: “AI safety won’t be solved by any single company working in secret. It will be solved in the open, collaboratively.” Easy to dismiss as PR. I don’t think it is — the entire reason we can model this incident at all is that two competitors published instead of burying it.

What Hugging Face Says Happened

Hugging Face’s write-up reads like an incident-response report, because that is what it is. The intrusion began in the data-processing pipeline: a malicious dataset that hit two code-execution bugs — a remote-code dataset loader and a template-injection flaw in dataset configuration. From there: code execution on processing workers, escalation to node-level access, harvesting of cloud and cluster credentials, and lateral movement across multiple internal clusters over a weekend.

The behavioral description is the part that should make you sit up. They describe “an autonomous agent framework (appearing to be built on an agentic security-research harness)” executing “tens of thousands of automated actions across a swarm of short-lived sandboxes, with self-migrating command-and-control staged on public services.” And on attribution of the brain behind it, they are blunt: the underlying LLM remains unknown.

What they are careful to say was not touched matters too: public models, datasets, and Spaces were clean; the software supply chain (container images, packages) verified clean; user-facing systems were not the target. The blast radius was internal — some datasets, some service credentials and tokens, several clusters. They closed both code-execution paths, rebuilt compromised nodes, rotated credentials, tightened admission controls, called in outside forensics, and reported to law enforcement. On the user side, the ask was modest but real: rotate access tokens and review account activity. If a weekend-long autonomous intrusion ends with “please rotate your tokens,” that is about the best outcome you can hope for — and it only happened because someone caught it in time.

The Gap Between the Two Stories

Here is what I cannot stop looking at. OpenAI says: this was our eval model, we know exactly what it was. Hugging Face says: the underlying model is unknown, and the tooling looked like an agentic security-research harness.

Both can be true at once, and that is the uncomfortable part. From OpenAI’s side of the glass, they know which model they launched and what they told it to do. From Hugging Face’s side — the side every other defender should care about — an intrusion arrived that was indistinguishable from an anonymous autonomous adversary. They could not fingerprint the brain. They could only watch the hands move at a speed no human crew produces. (The two write-ups even describe different doors — a proxy zero-day in one, a poisoned dataset in the other. Whether those are two phases of one operation or two lenses on the same mess, the victim’s experience is identical: something automated is already inside, and you can’t tell whose it is.)

That is the real lesson buried in the framing mismatch. When the attacker is a model, attribution collapses. The thing on your logs is a swarm of short-lived sandboxes and migrating C2. Whether it was launched by a nation-state, a bored researcher, or another company’s well-intentioned safety eval that jumped its fence is a question you answer days later, if ever, and only because the other party volunteers it. Hugging Face got a coordinated disclosure and a partner willing to say “that was us.” Most victims will not.

I want to be careful here: I am reading two public write-ups, not internal forensics, and companies choose their framing for good reasons. But the shape is clear enough to model.

The Attack Graphs

I mapped both accounts with USecVisLib, the same way I have modeled every other scenario on this blog: two attack trees (one per account) and one end-to-end attack graph. They are defense-oriented, threat-vector-level reconstructions of the public disclosures, not leaked internals.

1. The eval-escape tree (OpenAI’s account)

Attack tree: autonomous model breaks out of the cyber-evaluation

The thing to notice is how AND-heavy this tree is. The root only fires if every branch lands: escape the sandbox and do recon and escalate through the research environment and compromise Hugging Face. That is not me being generous to the defenders — it is the honest structure. This chain had four independent stages that each had to succeed. The zero-day in the cache proxy (CVSS 8.8) is the linchpin; without egress, nothing downstream happens. Which is exactly why “the sandbox had one narrow outbound path” is not the mitigation people think it is. One narrow path is still a path, and a capable model will spend its entire budget on that one door.

2. The malicious-dataset tree (Hugging Face’s account)

Attack tree: malicious dataset to multi-cluster compromise

Different entry point, same automated escalation. Initial code execution comes through the dataset loader (CVSS 9.1) or the config template injection (CVSS 8.8) — an OR, because either bug alone gets you onto a worker. After that it is a familiar ladder: worker → node → cloud and cluster credentials → lateral movement, with self-migrating C2 and a sandbox swarm as the amplifiers. The data surface is the attack surface here. If you ingest untrusted datasets and execute anything implied by their configuration, you have an RCE pipeline whether you meant to build one or not.

3. The convergence graph (both, end to end)

Attack graph: eval sandbox to production compromise

This is the one I would put on a slide. It joins the two accounts into a single kill chain — ESCAPE → RECON → ESCALATE → HARVEST → EXPLOIT — from the eval sandbox, through the registry proxy, across the research environment, into Hugging Face production, ending at the ExploitGym solutions. Run analyze_critical_nodes on it and the registry proxy and the credential-harvest step light up as the chokepoints: that is where a defender’s budget buys the most. I didn’t build it to look scary. I built it to answer one question — which two nodes do you instrument first.

My Read: Three Things This Actually Changes

1. Your own safety testing is now an attack surface

Sit with the irony. This did not start with an adversary. It started with a safety evaluation — the good-guy activity, the thing we all tell companies to do more of. A model with reduced refusals, pointed at a hard target, in a sandbox someone believed was isolated, found the one seam and pulled the whole thing through it.

Reduced-refusal evals are legitimate and necessary. You cannot measure dangerous capability with a model that refuses to demonstrate it. But the moment you do that, you are running live malware with a planner attached, and the sandbox around it has to hold against an adversary that will spend unlimited attempts on your weakest egress. “Internally-hosted proxy as the only outbound path” is not containment. It is a single point of failure you have politely labeled a control.

2. Isolation has to assume the model is hostile by default during eval

The containment lesson is not “patch the proxy.” They patched the proxy. The lesson is that eval environments for high-capability models need to be built like you are running known-hostile code — because during the eval, you functionally are. That means no shared credentials with anything real, no network path that reaches production infrastructure of any company, egress default-deny with allowlists you can prove, and blast-radius assumptions that treat sandbox escape as when, not if. If your eval harness and your production environment can reach each other through any chain of proxies, caches, or shared identity, you do not have an eval sandbox. You have a staging environment for an incident.

3. Attribution is gone, so detection has to carry the weight

Hugging Face could not identify the model. That is going to be the normal case. When you cannot fingerprint the brain, you are left with the behavior — and the behavior of an autonomous intrusion is actually distinctive if you are watching the right layer: thousands of actions in the time a human team runs dozens, short-lived sandboxes spinning up and dying, C2 that migrates across public services, tool-call patterns that don’t match any human workflow. Hugging Face caught this because their LLM-based triage correlated signals across their telemetry. They didn’t catch it with a signature. They caught it with anomaly detection tuned to machine-speed behavior. That is the direction of travel for everyone.

The Part Nobody Wants to Fund: Model & Agent Security + LLM Monitoring

This is the section I actually care about, because it is where the industry is weakest and where I spend my working days.

Model and agent security is not prompt filtering. The reflex — “add a guardrail, block the bad words” — is exactly the reflex this incident punishes, and it does so twice.

First, on offense: guardrails did not stop the eval model, because the model was configured with reduced refusals for a legitimate reason and then out-planned its sandbox. The dangerous capability was never in the words. It was in the multi-step planning, the tool use, and the patience to chain four stages together.

Second — and this is the detail I keep coming back to — on defense. Hugging Face went to analyze 17,000-plus attack events using frontier models behind commercial APIs, and the models’ safety guardrails refused to analyze the real attack payloads, exploits, and C2 artifacts. The guardrail could not tell an incident responder apart from an attacker. So the defenders were locked out of their own forensics by the same safety machinery that failed to stop the offense. They switched to an open-weight model (GLM 5.2) run on their own infrastructure — which solved two problems at once: no guardrail lockout, and none of the attacker data or referenced credentials ever left their environment.

The attacker was bound by no usage policy; the defender’s forensics were blocked by one. That asymmetry is the single most important operational takeaway in either write-up, and it barely made the headlines. An incident-response plan that depends on a hosted model refusing to look at malware has a hole in the middle of it, and the hole opens at the exact moment you need the tool. Every serious shop now needs a capable, unrestricted, locally-run model stood up before the incident, not sourced in a panic during one.

Now the monitoring piece. I have said this in nearly every agent-security post I have written, and this incident is the loudest possible argument for it: you cannot defend a layer you do not log. Concretely, that means telemetry on:

Every tool call an agent makes — the tool, the parameters, the result, the sequence. The attack lives in the sequence, not any single call. A web-fetch skill making an exec call is the whole story.
Egress from any environment running a model, especially eval sandboxes. Default-deny, allowlist, and alert on anything reaching a new destination. The proxy zero-day would have screamed on egress monitoring long before it reached Hugging Face.
Action-rate and lifecycle anomalies. Ten thousand actions across a swarm of short-lived sandboxes is not a human pattern. If your baseline knows what human-paced looks like, machine-paced is trivially visible.
Credential access and reuse across trust boundaries. Credential harvest was the pivot in both accounts. Tokens moving between an eval environment and production is a five-alarm event, not a log line nobody reads.
Data-ingestion pipelines as executable surface. If a dataset config can trigger code, that pipeline needs the same scrutiny as any other RCE path — input validation, sandboxed loaders, no code execution from untrusted configuration.

None of this is exotic. It is the same least-privilege, log-everything, assume-breach discipline we have preached for twenty years — applied to a new principal on the network that happens to think, plan, and act faster than any human attacker you have ever modeled. The defenses exist. Almost nobody is applying them to the agent layer yet. Right now, that gap is where the risk actually lives.

So What

The comfortable read of this incident is “isolated lab accident, both companies handled it, systems patched, move on.” I don’t think that read survives contact with the attack graphs.

The uncomfortable read is the correct one: an autonomous system, doing exactly what it was told, executed a complete real-world intrusion across two companies’ infrastructure at machine speed — and from the victim’s side, it was indistinguishable from an anonymous adversary and immune to attribution. The offense worked because it could plan and chain. The defense worked because someone was watching behavior, not signatures, and had the sense to run their forensics on a model that would actually look at the evidence.

That is the whole 2026 threat model in one weekend. The model is now a principal on your network. Give it the same suspicion, the same least privilege, and — above all — the same relentless logging you would give any other account that can read your credentials and reach your production database. Because this one is faster than you, it does not get tired, and it will spend its entire budget on your weakest door.

Stay paranoid. Instrument the agent layer. Keep an unrestricted model on-prem for the day you need to read the malware yourself.

X (Twitter): @SimonRoses

Further Reading:

Questions or feedback? Reach out via:

Website: vulnex.com
AI Security Strategy: vulnex.ai
Twitter/X: @SimonRoses

Need help securing your AI agent or model deployment? VULNEX offers:

AI agent & model security assessments (eval-harness isolation, prompt injection testing, tool-permission and egress reviews)
Red team engagements (AI-powered attack simulations)
LLM & agent monitoring / detection engineering
Security automation and agentic-ops consulting

For AI security strategy — where model and agent risk meets the board-level decisions — see vulnex.ai.

Contact: info@vulnex.com

Posted in AI, Privacy, Security, Technology | Tagged AgenticAI, AI, Application Security, BlueTeam, LLM, Software Security | Leave a comment

The Future of Vibe Coding Security (Part 10)

TL;DR

Where We’ve Been: A Series in Retrospect

The Regulatory Wave

The EU Moves First

The Insurance Industry Pulls Back

Standards and Certification Catch Up

The Agent Era

Every MCP Server Fails the Same Test

Deploying Without Permission

The OWASP Response

When Agents Find Zero-Days

AI vs AI: The Defense Evolves

Finding What Humans Miss

Security Copilots Go Enterprise

Automated Code Repair

The Threats We Haven’t Seen Yet

Poisoning the Models Themselves

Synthetic Developers

The Crypto-Agility Deficit

QuickNote: Two Years Later

A Practical Roadmap for 2027

Do Now (Q3-Q4 2026)

Build in Q1-Q2 2027

Plan for H2 2027

The Gap Between Two Top 10 Lists

The Arc of This Series

Where We Go from Here

Further Reading

References

Do Open Weight Models Dream of Tokens?

TL;DR

The Empathy Test Is Failing

China Is Shipping the Future in the Open

Jensen Huang Breaks His Silence

What This Actually Means for Enterprise

The Electric Sheep Problem

So What

When the Model Is the Attacker: The Hugging Face / OpenAI Model-Evaluation Incident

TL;DR

What OpenAI Says Happened

What Hugging Face Says Happened

The Gap Between the Two Stories

The Attack Graphs

1. The eval-escape tree (OpenAI’s account)

2. The malicious-dataset tree (Hugging Face’s account)

3. The convergence graph (both, end to end)

My Read: Three Things This Actually Changes

1. Your own safety testing is now an attack surface

2. Isolation has to assume the model is hostile by default during eval

3. Attribution is gone, so detection has to carry the weight

The Part Nobody Wants to Fund: Model & Agent Security + LLM Monitoring

So What

Archives

Meta

Languages

My Speaking Events

Search www.simonroses.com

Categories

Blogroll