Microsoft Turned 100 AI Agents Loose on Windows. They Found 16 Bugs.

Two days ago I wrote about Google confirming the first AI-generated zero-day exploit — a working 2FA bypass built by criminal threat actors using an AI model. The conclusion was grim: attackers are using AI to find and weaponize vulnerabilities at a pace that human defenders can't match.

On May 12, Microsoft published the other side of that story.

A system called MDASH — short for multi-model agentic scanning harness — found 16 previously unknown vulnerabilities in the Windows networking and authentication stack. Four of them were critical remote code execution flaws. All 16 were patched in this month's Patch Tuesday release. The bugs weren't theoretical. They were in your Windows machine, reachable over the network, and they're now fixed because AI found them before an attacker did.

What MDASH Actually Is

MDASH isn't a single AI model pointed at source code with a prompt that says "find bugs." It's an architecture — over 100 specialized AI agents orchestrated across multiple foundation models, working through a pipeline that progresses from discovery to validation to proof.

The pipeline has five stages: prepare, scan, validate, deduplicate, and prove. Frontier models handle the heavy reasoning — understanding code semantics, identifying potential vulnerability patterns, reasoning about memory safety. Smaller distilled models act as high-volume debaters, running parallel analysis passes at lower cost. A separate frontier model provides an independent counterpoint, specifically to challenge the conclusions of the others.

This is the key architectural insight: MDASH uses model disagreement as a quality signal. When multiple agents analyze the same code path and reach different conclusions about whether a vulnerability exists, that disagreement triggers deeper investigation. When they agree, the pipeline moves forward. When they don't, the system escalates to more expensive, more capable analysis. The result is a system that catches more bugs while producing fewer false positives than any single model could.

Domain-specific plugins inject the kind of context that foundation models can't infer on their own — kernel calling conventions, lock invariants, interprocess communication trust boundaries. The agents don't just read code. They understand the execution environment the code runs in.

Microsoft's Autonomous Code Security team built it. It's currently in limited private preview with Microsoft's internal security engineering teams and a small set of external customers.

What It Found

The 16 vulnerabilities span the Windows TCP/IP stack, the IKEEXT IPsec service, HTTP.sys, Netlogon, DNS resolution, and the Telnet client. Ten were kernel-mode. Six were user-mode. Most were reachable from a network position without authentication.

Two of the critical findings deserve a closer look.

CVE-2026-33824 lives in IKEEXT, the Windows service that handles IKE and AuthIP keying for IPsec. An unauthenticated remote attacker can trigger a deterministic double-free of a 16-byte heap allocation by sending a crafted IKE_SA_INIT with Microsoft's "IPsec Security Realm Id" vendor-ID payload, followed by a single IKEv2 fragment that reassembles immediately. Because IKEEXT runs as LocalSystem inside svchost.exe, successful exploitation gives the attacker pre-authentication remote code execution in one of the highest-privilege contexts on the system. Any host configured as an IKEv2 responder — which includes most Windows VPN servers — was exposed over UDP port 500.

CVE-2026-33827 is a remote unauthenticated use-after-free in tcpip.sys triggered through Strict Source and Record Route (SSRR) packet handling. This is a kernel-mode vulnerability in the core networking stack. The attack surface is any Windows machine that processes IP packets — which is, effectively, every Windows machine connected to a network.

These aren't obscure edge cases in deprecated components. They're in the TCP/IP stack and the IPsec service — fundamental networking infrastructure that runs on every Windows deployment. And they were invisible to human code reviewers for years.

The Benchmark Numbers

Microsoft submitted MDASH to CyberGym, a public benchmark developed by UC Berkeley researchers that measures how well AI systems can reproduce real-world vulnerabilities. The benchmark contains 1,507 tasks drawn from 188 open-source software projects.

MDASH scored 88.45%, placing it at the top of the CyberGym leaderboard. Anthropic's Mythos Preview — the vulnerability-discovery system that generated significant concern when it was previewed earlier this year — came in second at 83.1%. OpenAI's GPT-5.5 scored 81.8%.

Against Microsoft's own historical data, the results were sharper. MDASH achieved 96% recall against five years of confirmed Microsoft Security Response Center vulnerabilities in clfs.sys (the Common Log File System driver) and 100% recall in tcpip.sys. On a private test driver containing 21 deliberately planted vulnerabilities, MDASH found all 21 with zero false positives.

These numbers tell a specific story: AI vulnerability discovery has crossed the threshold from "interesting research" to "production-grade capability." MDASH isn't finding toy bugs in contrived test cases. It's finding real vulnerabilities in one of the most scrutinized codebases on the planet, and it's doing it more reliably than human security researchers have managed over the past five years.

The Arms Race Is Not a Metaphor

Here's what the past week looks like when you lay the timeline flat.

May 7: Microsoft discloses CVSS 10.0 and 9.8 RCE vulnerabilities in Semantic Kernel, its own AI agent framework — prompt injection escalating to code execution. May 11: Google's GTIG confirms the first AI-generated zero-day exploit used by criminal threat actors. May 12: Microsoft unveils MDASH, demonstrating that AI can find 16 real vulnerabilities in Windows defensively. May 13: Researchers disclose Bleeding Llama, a CVSS 9.1 vulnerability in Ollama that leaks the entire process memory — including API keys and conversation data — from 300,000 servers.

In a single week, AI was used to build offensive exploits, discover defensive vulnerabilities, and was itself revealed as a vulnerable attack surface. The three roles — weapon, shield, target — are now concurrent.

This isn't a future scenario. Three companies are competing to build the most capable vulnerability-discovery AI: Microsoft with MDASH, Anthropic with Mythos, and OpenAI with Daybreak. The CyberGym leaderboard is the public scoreboard. The capability is improving on a quarterly basis. The arms race framing isn't hype. It's a description of what's happening.

The uncomfortable question is straightforward: if MDASH found 16 vulnerabilities in one pass through one section of the Windows codebase, how many are left? Windows has hundreds of millions of lines of code across thousands of components. MDASH's 96% recall on clfs.sys means there are likely bugs it missed even in the components it already analyzed. Scale that across the full codebase, and the number of undiscovered vulnerabilities in production software is almost certainly larger than anyone has publicly estimated.

And Microsoft has MDASH. What comparable tooling does the average enterprise have? The answer is nothing. The asymmetry between organizations that can deploy AI-powered vulnerability discovery and those that can't is about to become a defining factor in security posture.

What This Changes for Patch Tuesday

May 2026 Patch Tuesday addressed 137 vulnerabilities total. Sixteen of them were found by MDASH. That means roughly 12% of this month's patches exist because an AI system found bugs that human reviewers hadn't caught.

Microsoft is on pace to break its annual vulnerability record this year, and AI-driven code analysis is a major factor. The Record reported that Microsoft's AI security testing is enabling an accelerated patch cadence, and Oracle announced it's shifting from quarterly to monthly security updates for the same reason — AI-driven testing is producing findings faster than the old release schedule can absorb.

For organizations that still run monthly or quarterly patch cycles, this creates a new problem. The volume of patches is increasing because AI is finding more bugs. The severity isn't decreasing — four of MDASH's sixteen findings were critical RCEs. And the time between "vulnerability found by defensive AI" and "vulnerability found by offensive AI" is shrinking toward zero.

If a defensive AI found CVE-2026-33824 in the IKEv2 service, an offensive AI could find it too. The only question is timing. The patch exists now. The window between "patch available" and "exploit available" is the only protection, and that window is measured in days, not months.

What to Do

Patch May 2026 Patch Tuesday immediately. Not next week. Not at the end of the month. The 16 AI-discovered vulnerabilities include pre-authentication RCE in the IPsec service and a use-after-free in the TCP/IP stack. If your Windows servers are reachable over a network and unpatched, you're exposed to attacks that AI can now generate faster than you can evaluate them.
Prioritize CVE-2026-33824 if you run Windows VPN. Any host configured as an IKEv2 responder — which includes most Windows-based VPN concentrators and Always On VPN deployments — was directly reachable for pre-auth RCE over UDP/500. If you can't patch immediately, restrict inbound UDP/500 to known peer addresses.
Audit your AI infrastructure. The Bleeding Llama disclosure (CVE-2026-7482) this week showed that Ollama servers are leaking process memory to anyone who can reach them. If you're running Ollama, update to v0.17.1 or later. If you're exposing any AI model serving infrastructure to the network, treat it like any other production service — not like a development toy.
Shorten your patch cycle. If you're still patching monthly, the math no longer works. AI-discovered vulnerabilities are compressing the window between disclosure and exploitation. Mandiant's M-Trends 2026 data showed 28.3% of CVEs exploited within 24 hours. Aim for a 72-hour window on critical patches, or accept that you're operating within the blast radius.
Don't ignore the cPanel wave. CVE-2026-41940, a CRLF injection that bypasses cPanel authentication entirely — no password, no 2FA, no log entry — was exploited by 44,000 IPs this month. If your website runs on shared hosting with cPanel, ask your hosting provider whether they've patched. If they can't tell you, find a new host.

What the Scoreboard Tells Us

The CyberGym leaderboard is the clearest signal of where this is heading. Three major AI companies are competing to build systems that can find vulnerabilities in real software. The scores are improving every quarter. The current top score is 88.45%. Within a year, it will be higher. Within two years, AI systems will likely find more vulnerabilities in a single week than human security researchers find in a year.

That sounds like good news for defenders, and in some ways it is. MDASH found 16 bugs that might have taken years to discover through manual review. Those bugs are now patched. Users are safer.

But the same capability doesn't stay exclusive to defenders. The techniques that MDASH uses — multi-agent analysis, model disagreement, domain-specific plugins — are architectural choices, not trade secrets. Academic papers describe them. Open-source frameworks implement them. The gap between a well-funded defensive AI and a capable offensive AI is narrowing, and it will continue to narrow.

The week of May 7–14, 2026, will probably be remembered as the week the AI vulnerability arms race became undeniable. Offensive AI built a zero-day. Defensive AI found sixteen bugs. AI infrastructure itself was exposed as critically vulnerable. All in seven days.

The question was never whether this would happen. It was whether defenders would patch fast enough once it did. If your answer to that question is "we'll get to it next cycle," you're already behind.

Sources: Microsoft Security Blog — MDASH, The Hacker News — MDASH 16 Flaws, CSO Online, GeekWire — CyberGym Benchmark, The Record — Microsoft Vulnerability Record Pace, SiliconANGLE, Neowin