The Security Treadmill: When Finding Flaws Becomes Free, the Scarce Skill Is Deciding What's Worth Fixing

Autonomous AI systems are collapsing the cost of finding software vulnerabilities toward zero, making the supply of known flaws nearly infinite. The scarce, high-leverage work has shifted from discovery to deciding which findings actually deserve a fix, given that only about 6% of published CVEs are ever exploited.

[Image: a figure picking one golden ticket from an endless treadmill of glowing tickets]

For decades, finding a vulnerability was slow and expensive, so the cost of discovery quietly did our triage for us. AI collapses that cost toward zero, which turns the supply of known flaws nearly infinite and reframes "patch everything" as a market with no finish line. The hard, valuable work moves from finding problems to deciding which ones matter. Only about 6% of published CVEs are ever exploited, so a backlog is not a risk ledger.

The bottleneck just moved, and almost nobody priced it in

Here is the thing we never said out loud. For most of the history of software security, finding a flaw was the expensive part, and that expense was doing quiet work on our behalf. It triaged for us. When discovery costs real money and real expert hours, the act of finding a vulnerability is itself a vote that the vulnerability might be worth the trouble. Scarcity was the filter.

That filter is dissolving.

In DARPA's AI Cyber Challenge final, autonomous systems produced bug reports and patches at an average cost of roughly $152 per task. For comparison, DARPA noted that equivalent bug bounties can range from hundreds to hundreds of thousands of dollars. The machines submitted patches in an average of 45 minutes. DARPA's director put the old world plainly: finding and patching vulnerabilities "using current methods is slow, expensive, and depends on a limited workforce."

Read that quote again, because it describes a constraint we built an entire industry on top of. Slow, expensive, limited. Those three properties were load-bearing. They kept the volume of findings inside the range a human team could reason about. Remove them and the volume goes somewhere new.

The first-order story is the cheerful one. More flaws found, more patches shipped, faster, cheaper. That story is true. It is also the boring half. The interesting half is what happens to your priorities, your budget, and your sanity when the supply of "known problems" stops being scarce.

The capability is real, and the curve is steep

It would be easy to wave this away as a contest artifact. Synthetic bugs, controlled conditions, a leaderboard. Don't.

In that same DARPA final, the competing systems identified 86% of the planted vulnerabilities. A year earlier, at the semifinals, that number was 37%. Patching jumped from 25% to 68% over the same stretch. And along the way the systems turned up 18 real, previously unknown vulnerabilities across more than 54 million lines of code. That is not a flat capability. That is a curve bending upward fast.

Production software is already feeling it. Google's autonomous "Big Sleep" agent reported roughly 20 previously unknown vulnerabilities in widely used open-source software, including a SQLite zero-day that had survived both traditional fuzzing and manual review. The relevant detail isn't the count. It's that the agent caught something the established methods had missed in code that millions of people depend on.

So set aside the specific tools and version numbers. They'll be obsolete by the time you finish reading. What matters is the durable shape underneath: a task that used to require scarce experts is becoming something a machine does cheaply, continuously, and at scale. Treat the examples as illustrations of a pattern, not as headlines.

When a capability moves like this, the smart question is never "how good is it today." The smart question is "what breaks downstream when this is abundant."

The trap: free discovery, infinite supply, perpetual demand

Here's what breaks. When discovery is nearly free, the supply of known flaws becomes effectively infinite. And an infinite supply of findings, paired with the reasonable-sounding goal of "fix the vulnerabilities," produces a demand curve with no ceiling.

Think about the economics for a second. Nothing is ever 100% secure. There is no state of "done." So a market priced on finding and fixing everything has, by construction, an addressable market that is never exhausted. "Patch the planet" sounds like a mission. Structurally, it's a subscription with no end date.

We already see the system straining under the old, human-paced rate of discovery. CVE submissions rose 263% between 2020 and 2025. NIST enriched nearly 42,000 CVEs in a single year, about 45% more than any prior year, and still couldn't keep up. So it changed its operating model: going forward it will fully enrich only the CVEs that are known to be exploited, present in federal software, or designated critical. The official scorekeeper of vulnerabilities looked at the firehose and decided it could no longer process everything.

Now imagine that firehose with the nozzle removed. That's the world cheap autonomous discovery creates.

I'll name the failure mode directly. The security treadmill is what you get when you treat an infinite stream of findings as a to-do list instead of a prioritization problem. You move fast. You burn budget. You ship patches all day. And you arrive nowhere, because the queue refills faster than you can drain it, forever. The motion feels like progress. The position never changes.
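The treadmill dynamic can be sketched as a toy queue model. Every number here is a made-up assumption for illustration (the starting backlog, the weekly discovery rates, the team's weekly fix capacity), not a measurement from any of the sources cited in this essay.

```python
from typing import List


def simulate_backlog(start: int, found_per_week: int,
                     fixed_per_week: int, weeks: int) -> List[int]:
    """Return the open-findings count at the end of each week."""
    backlog = start
    history = []
    for _ in range(weeks):
        # New findings arrive, the team drains what it can, never below zero.
        backlog = max(0, backlog + found_per_week - fixed_per_week)
        history.append(backlog)
    return history


# Hypothetical team that can fix 50 findings a week.
human_paced = simulate_backlog(start=200, found_per_week=40,
                               fixed_per_week=50, weeks=12)
machine_paced = simulate_backlog(start=200, found_per_week=400,
                                 fixed_per_week=50, weeks=12)

print("human-paced backlog after 12 weeks:", human_paced[-1])
print("machine-paced backlog after 12 weeks:", machine_paced[-1])
```

With the same fix capacity, the first queue shrinks and the second grows every week. Raising fix capacity only helps until discovery gets cheaper again, which is why the exit is a different question, not a faster belt.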

Organizations that survive this don't try to drain the queue. They change the question.

The uncomfortable truth: most findings don't matter

Here is the stat that should reorganize how you think about all of this. Of all published CVEs, only about 6% have ever been observed exploited in the wild. In the underlying study, that was 13,807 out of 237,687. The other 94% sit in databases, generating tickets, consuming attention, and mostly never touching a real attack.

Let that land. A backlog of findings is not a ledger of risk. It is a list of possibilities, and the overwhelming majority of those possibilities never become anyone's problem.

This was already true when humans, slow and expensive, were doing the finding. The 6% figure predates the autonomous-discovery wave. What AI changes is the denominator. If discovery used to surface, say, ten thousand findings a year and roughly six hundred mattered, cheap discovery might surface a hundred thousand, of which roughly six thousand matter. The ratio holds, more or less. The materiality is still concentrated in a thin slice. But now that slice is buried under ten times the noise, about ninety-four thousand findings that will never matter instead of about ninety-four hundred, and nothing about cheap discovery tells you which is which.
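The denominator arithmetic is easy to check. This sketch takes the study's two counts from above and applies the resulting rate to two hypothetical yearly finding volumes. The volumes, and the assumption that the exploitation rate stays constant as discovery scales, are illustrative only.

```python
EXPLOITED = 13_807    # CVEs observed exploited in the wild (from the study)
PUBLISHED = 237_687   # all published CVEs in the same study

rate = EXPLOITED / PUBLISHED  # a little under 6%


def split(findings: int, exploited_rate: float) -> tuple:
    """Split a pile of findings into (likely material, likely noise).

    Assumes the historical exploitation rate carries over unchanged,
    which is an assumption, not a result.
    """
    material = round(findings * exploited_rate)
    return material, findings - material


for volume in (10_000, 100_000):  # hypothetical findings per year
    material, noise = split(volume, rate)
    print(f"{volume:>7} findings: ~{material} material, ~{noise} noise")
```

The material slice grows in absolute terms, but so does everything around it, and the share of the pile that matters stays exactly as thin.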

A team that treats every finding as obligatory work has, in effect, agreed to spend most of its effort on things that will never hurt anyone. That isn't diligence. It's a category error with a security budget attached.

And notice which flaws get the attention under the treadmill model. The obvious, pattern-matchable ones get auto-triaged and patched faster than ever, including plenty that never warranted the worry. The non-obvious, structural problems, the design flaws and trust-boundary mistakes that don't fit a signature, survive untouched. Automated abundance is very good at the legible and very weak at the deep. Optimize for volume and you systematically clear the cheap stuff while the expensive stuff waits.

What actually gets scarce

When discovery goes to zero, value flows to whatever is still hard. And the hard thing is judgment.

Specifically, three judgments:

  • Materiality. Does this flaw touch anything that matters, or is it a theoretical defect in a code path nobody can reach with anything valuable behind it?
  • Blast radius. If it were exploited, what actually breaks, and how far does the damage travel?
  • Exploitability. Is there a realistic path from this finding to a working attack, or does it require conditions an adversary will never assemble?

None of these are search problems. They're decision problems, and they depend on context the discovery tool doesn't have: your architecture, your data, your threat model, what you can tolerate losing. Automating the search does not automate the deciding. The deciding is where the leverage now lives.
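One way to make those three judgments explicit is to record them per finding and rank by their product. This is a minimal sketch, not an established scoring standard. The `Finding` type, the 0-to-1 scales, and the multiplication rule are all assumptions, and the scores themselves are meant to come from people who know the system, not from the discovery tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    ident: str
    materiality: float     # 0..1: does it touch anything that matters?
    blast_radius: float    # 0..1: how far does the damage travel?
    exploitability: float  # 0..1: is there a realistic attack path?

    def priority(self) -> float:
        # Multiplying means a zero on any one judgment zeroes the priority.
        return self.materiality * self.blast_radius * self.exploitability


def top_findings(findings: List[Finding], budget: int) -> List[Finding]:
    """Return the `budget` highest-priority findings, best first."""
    ranked = sorted(findings, key=lambda f: f.priority(), reverse=True)
    return ranked[:budget]


# Hypothetical findings with human-assigned scores.
backlog = [
    Finding("F-1", materiality=0.9, blast_radius=0.8, exploitability=0.7),
    Finding("F-2", materiality=0.2, blast_radius=0.9, exploitability=0.9),
    Finding("F-3", materiality=1.0, blast_radius=1.0, exploitability=0.0),
    Finding("F-4", materiality=0.6, blast_radius=0.5, exploitability=0.9),
]

for f in top_findings(backlog, budget=2):
    print(f.ident, round(f.priority(), 3))
```

The product is a deliberate choice in this sketch: F-3 looks catastrophic on two axes but has no realistic attack path, so it drops to the bottom instead of averaging its way into the top two.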

The most telling signal is that the referee has already made this move. NIST didn't respond to overload by hiring its way to processing everything. It switched to risk-based triage: exploited, federal, or critical first. The institution whose entire job was to process the vulnerability stream concluded that processing the whole stream is the wrong objective. If the scorekeeper now prioritizes instead of completes, the rest of us have no excuse for pretending completeness is the goal.
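The three criteria described above (exploited, federal, or critical) amount to a simple filter. The sketch below illustrates that shape only and is not NIST's actual implementation. The record type, field names, and identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CveRecord:
    cve_id: str
    known_exploited: bool = False
    in_federal_software: bool = False
    designated_critical: bool = False


def handled_first(record: CveRecord) -> bool:
    """True if the record meets any one of the three triage criteria."""
    return (record.known_exploited
            or record.in_federal_software
            or record.designated_critical)


def partition(records: List[CveRecord]) -> Tuple[List[CveRecord], List[CveRecord]]:
    """Split records into (handled first, deferred), preserving order."""
    first, deferred = [], []
    for record in records:
        (first if handled_first(record) else deferred).append(record)
    return first, deferred


# Placeholder identifiers, not real CVE numbers.
stream = [
    CveRecord("EXAMPLE-0001", known_exploited=True),
    CveRecord("EXAMPLE-0002"),
    CveRecord("EXAMPLE-0003", designated_critical=True),
    CveRecord("EXAMPLE-0004"),
]

first, deferred = partition(stream)
print([r.cve_id for r in first], [r.cve_id for r in deferred])
```

The point of the sketch is the return type: two lists, not one. Deferral is an explicit, visible outcome instead of a failure to finish.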

Scarce skill, restated: the ability to look at a mountain of true-but-trivial findings and confidently say which handful you'd be negligent to ignore. That judgment doesn't scale by buying more compute. It scales by building better decision processes and trusting experienced people to run them.

The pattern is bigger than security

Step back and this stops looking like a security story at all. It's a general law of automation.

When automation makes a discovery-bound task cheap, value migrates to whatever remains scarce, and what remains scarce is almost always the deciding. We've watched it elsewhere. When generating text got cheap, the bottleneck became editorial judgment about what's worth saying and what's true. When writing code got cheap, the bottleneck became deciding what to build and whether it's correct. When finding flaws gets cheap, the bottleneck becomes deciding which flaws matter.

The shape repeats: abundance in production, scarcity in discernment. The tool collapses the cost of doing the thing. It does nothing to collapse the cost of knowing whether the thing was worth doing. If anything it raises that cost, because now there's far more output to discern across.

So any time you hear that AI has made some expensive task nearly free, ask the second-order question immediately. Not "how much do we save on the task," but "what just became the new bottleneck, and are we organized around it or against it?" The teams that win the next decade are the ones who answer that question early and restructure before the firehose teaches them the hard way.

What builders and buyers should actually do

This is an essay about incentives, not a runbook, so I'll keep the prescriptions structural.

  1. Architect around prioritization, not completeness. Design your security program assuming the input stream is infinite and your job is to allocate finite attention well. A system built to reach zero open findings is a system built to fail. A system built to ensure the most material findings always get handled first is a system that stays sane.

  2. Treat "found" and "must-fix" as different questions with different owners. Discovery can be automated and cheap. The promotion of a finding from "known" to "we are spending money on this" should be a deliberate, defensible decision, not an automatic consequence of the finding existing.

  3. Keep a human holding authority over what ships and what gets fixed. Cheap discovery and cheap patching make it tempting to close the loop entirely and let the machines triage themselves. Resist it where the stakes are real. The deciding is exactly the part you don't want to automate away, because it's the part that's now scarce and valuable.

  4. Demand exploitability and impact evidence, not raw counts. When a vendor or an internal dashboard leads with the number of vulnerabilities found, treat that as a yellow flag. Volume of findings is the metric that abundance makes meaningless. Ask instead: which of these are exploitable, what's the blast radius, and what's your basis for that claim?

  5. Refuse incentives that reward volume. Any contract, tool, or team scorecard that pays out per finding is now structurally misaligned, because the supply of findings is heading toward infinite. Pay for risk reduced, not for boxes generated. If the metric rewards filling the queue, the queue will get filled, and you'll be back on the treadmill by Friday.
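Points 2 and 3 can be made concrete in a ticketing model where a finding cannot become "must-fix" without a named human and a stated reason. This is a sketch under assumptions: the `Ticket` type, the two statuses, and the `promote` function are hypothetical names, not features of any particular tracker.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    KNOWN = "known"        # discovered, possibly by a machine, no spend committed
    MUST_FIX = "must-fix"  # a person has decided this is worth money


@dataclass
class Ticket:
    finding_id: str
    status: Status = Status.KNOWN
    approved_by: Optional[str] = None
    rationale: Optional[str] = None


def promote(ticket: Ticket, approver: str, rationale: str) -> Ticket:
    """Promote a finding to must-fix; requires a named owner and a reason."""
    if not approver.strip() or not rationale.strip():
        raise ValueError("promotion needs a named approver and a rationale")
    ticket.status = Status.MUST_FIX
    ticket.approved_by = approver.strip()
    ticket.rationale = rationale.strip()
    return ticket


# Discovery creates tickets freely; nothing is must-fix by default.
ticket = Ticket("F-1")
promote(ticket, approver="security lead",
        rationale="reachable from the public API, guards customer data")
print(ticket.status.value, "approved by", ticket.approved_by)
```

Because every ticket starts as `KNOWN`, the open-findings count can stay non-zero on purpose, and the record of who promoted what, and why, is the defensible decision trail that point 2 asks for.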

The organizations that thrive here will look, from the outside, like they're doing less. Fewer tickets touched, fewer patches shipped, a smaller open-findings number left deliberately non-zero. What they're actually doing is harder and more valuable: spending their scarce judgment on the small set of things that can actually hurt them, and consciously ignoring the rest. The treadmill rewards motion. The exit rewards discernment. Choose the exit.

FAQ

Does cheap AI vulnerability discovery make organizations more secure?

Not automatically. Cheaper, faster discovery is genuinely useful, and autonomous systems are already finding real flaws in production software that traditional methods missed. But more findings only improve security if you also improve your ability to judge which findings matter. Without that judgment, abundant discovery just produces a larger backlog and a busier team, not a safer system. The gain comes from prioritization, not from volume.

Why doesn't "patch everything" work as a security strategy?

Because nothing is ever fully secure, so "everything" has no endpoint, and cheap discovery makes the supply of findings effectively infinite. A program built on fixing every known flaw is committing to a workload with no ceiling. Meanwhile, only about 6% of published CVEs are ever observed exploited in the wild, so most of that work targets problems that will never cause harm. Completeness is the wrong objective.

If most vulnerabilities are never exploited, why track them at all?

Because you can't tell which 6% matter without surveying the full set first. Discovery still has value as input. The mistake is treating the full list as an obligation rather than a candidate pool. The real work is filtering that pool down to the findings with genuine materiality, blast radius, and exploitability, then concentrating effort there instead of spreading it evenly across noise.

What skills become more valuable as AI automates vulnerability discovery?

Judgment about risk. Specifically, the ability to assess whether a flaw is materially dangerous, how far damage would spread if exploited, and whether a realistic attack path exists. These are context-dependent decisions that discovery tools can't make, because they require knowledge of your architecture, data, and threat model. As finding flaws gets cheap, deciding which to fix becomes the scarce, high-leverage skill.

How should buyers evaluate AI security tools without falling for volume metrics?

Ignore raw counts of vulnerabilities found, because abundance makes that number meaningless. Ask instead for exploitability and impact evidence: which findings are realistically attackable, what would break, and what's the reasoning behind the assessment. Avoid any pricing or scorecard that rewards the quantity of findings generated, since the supply of findings is heading toward infinite. Pay for risk reduced, not for tickets created.

Further reading

  • DARPA, "AI Cyber Challenge marks pivotal inflection point for cyber defense" (2025) — Official results of the autonomous find-and-patch competition: AI cyber reasoning systems produced bug reports and patches at roughly $152 per task, in about 45 minutes on average. darpa.mil
  • Google, "A summer of security: empowering cyber defenders with AI" (2025) — Google's Big Sleep agent finding real-world flaws in widely used open-source software, including the SQLite zero-day CVE-2025-6965 caught before it could be exploited. blog.google
  • NIST, "NIST Updates NVD Operations to Address Record CVE Growth" (2026) — CVE submissions rose 263% between 2020 and 2025, pushing NIST to a risk-based model instead of enriching every CVE. nist.gov
  • Cyentia Institute / FIRST EPSS, exploitation-in-the-wild study (2024) — Only about 6% of published CVEs have ever been observed exploited in the wild. first.org/epss