We were supposed to be doing routine follow-through. Check the box, confirm the pattern, move on.

After four clean CVEForge runs, most teams would call it a day. We kept going - not because we expected fireworks, but because the only way to trust a tool is to watch it work on things you didn’t cherry-pick. So we pointed it at a Java codebase and hit enter.

The run started as a standard reproduction pass. It ended with the “fixed” version still popping shells.

When we stepped back across all five CVEs, the numbers were uncomfortable: three out of five ended with bypass or incomplete-fix outcomes. That’s either a streak of bad luck or a pattern worth paying attention to.

This post is about that last part. Not hype. Not “AI replaces researchers.” Just a very practical observation: we ran five CVEs through the same pipeline, and three times the “fixed” version was still exploitable.

Five is not a sample size you’d publish a paper about. But three out of five is not noise either.

Why This Run Mattered

Almost every CVEForge run so far had been PHP. Laravel, WeGIA, Kanboard - Composer dependencies, Apache, the usual LAMP-adjacent stack. Comfortable territory. We needed something that would actually stress the pipeline.

CVE-2025-60355 gave us that. Spring Boot. FreeMarker templates. A Java codebase split across admin and web components, where the injection point is authenticated but the trigger path isn’t. And a vendor who’d already shipped what they called a fix.

CVE-2025-60355: The Fix That Wasn’t

The short version: CVEForge reproduced the full exploit chain, tested it against the vendor’s patched version, and got a root shell on both.

What CVEForge Actually Did

Same pipeline as before. Pull intelligence, clone the repo, trace the data flow, build Docker labs for vulnerable and patched versions, write a PoC, verify it works on one and fails on the other. If the fix looks narrow, run a bypass pass.
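The verify-then-bypass step boils down to a small decision rule. Here is a minimal sketch of it, assuming three boolean observations per run. The class, enum, and parameter names are hypothetical and are not taken from CVEForge itself:

```java
/** Hypothetical sketch of the verification rubric; not CVEForge's actual code. */
final class FixVerification {

    enum Outcome { NOT_REPRODUCED, FIX_HELD, BYPASS_CONFIRMED, INCOMPLETE_FIX }

    /**
     * @param pocOnVulnerable  the original PoC succeeded against the vulnerable lab
     * @param pocOnPatched     the same, unchanged PoC succeeded against the patched lab
     * @param variantOnPatched a modified payload from the bypass pass succeeded against the patched lab
     */
    static Outcome classify(boolean pocOnVulnerable, boolean pocOnPatched, boolean variantOnPatched) {
        if (!pocOnVulnerable) {
            return Outcome.NOT_REPRODUCED;   // nothing to compare against yet
        }
        if (pocOnPatched) {
            return Outcome.INCOMPLETE_FIX;   // unchanged payload still works on the "fixed" build
        }
        if (variantOnPatched) {
            return Outcome.BYPASS_CONFIRMED; // the fix blocks the reported path but not a neighbor
        }
        return Outcome.FIX_HELD;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, false, false)); // FIX_HELD
        System.out.println(classify(true, false, true));  // BYPASS_CONFIRMED
        System.out.println(classify(true, true, false));  // INCOMPLETE_FIX
    }
}
```

Checking the unchanged PoC before any variant is deliberate: it separates a fix that never touched the sink from a fix that was real but narrow.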

What changed between runs wasn’t the model or the architecture - it was the boring stuff. Better Docker networking assumptions. Stricter verification on patched targets. Cleaner artifact handoff between agents. All learned from previous failures and baked into the prompts.

Less dramatic. More reliable. Exactly what you want from tooling you’re going to run again tomorrow.

The Java Case: CVE-2025-60355 (OneBlog)

OneBlog is a Java blogging platform. Spring Boot, FreeMarker templates, the kind of setup you see in hundreds of Chinese open-source projects on GitHub. The vulnerable behavior sits in how it configures FreeMarker to render stored templates.

If you know FreeMarker, you can probably already guess the shape of this bug:

Configuration cfg = new Configuration(Configuration.VERSION_2_3_22);
cfg.setNumberFormat("#");
// no cfg.setNewBuiltinClassResolver(...) call, so ?new keeps the unrestricted default resolver
t = new Template("", new StringReader(templateContent), cfg);

No restrictive TemplateClassResolver. That’s the whole bug. Without one, FreeMarker’s ?new built-in can instantiate any class that implements TemplateModel - including freemarker.template.utility.Execute, which runs operating system commands.
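For contrast, here is a minimal sketch of what a restricted configuration could look like. This is not the vendor’s fix and not OneBlog code. The class and method names are hypothetical, and it needs the FreeMarker library on the classpath. The one line that matters is the setNewBuiltinClassResolver call, which is part of FreeMarker’s public API:

```java
import freemarker.core.TemplateClassResolver;
import freemarker.template.Configuration;
import freemarker.template.Template;

import java.io.IOException;
import java.io.StringReader;

/** Hypothetical helper; only the resolver setting differs from the vulnerable snippet. */
final class RestrictedTemplateParser {

    static Template parse(String templateContent) throws IOException {
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_22);
        cfg.setNumberFormat("#");
        // Refuse every class lookup made through the ?new built-in.
        // TemplateClassResolver.SAFER_RESOLVER is a looser alternative that
        // still blocks Execute, ObjectConstructor and JythonRuntime.
        cfg.setNewBuiltinClassResolver(TemplateClassResolver.ALLOWS_NOTHING_RESOLVER);
        return new Template("", new StringReader(templateContent), cfg);
    }
}
```

Whether ALLOWS_NOTHING_RESOLVER is acceptable depends on whether any stored template legitimately uses ?new. If none does, the strictest setting costs nothing.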

The exploit chain has a nice two-hop quality to it: inject the payload through an authenticated admin endpoint (POST /template/edit), then trigger it through a completely public rendering path (GET /robots.txt). The admin plants the bomb. The internet detonates it.

PoC result on vulnerable target:

uid=0(root) gid=0(root) groups=0(root)

The interesting part came when CVEForge tested the “fixed” version (v2.3.9). Same payload. Same code path. Same root shell.

This wasn’t a narrow bypass of a strong fix - the kind where the fix blocks one encoding but misses another. This was closer to a version bump with no meaningful change to the vulnerable sink. The FreeMarker configuration was still wide open. The ?new() built-in still had unrestricted class resolution. The fix, as shipped, didn’t fix the thing.

Current Scorecard: 5 CVEs, 3 Bypass Outcomes

Here is the current set in eip-pocs-and-cves:

CVE | Class | Patched Behavior | Bypass Outcome
--- | --- | --- | ---
CVE-2025-53833 | SSTI (PHP/Laravel) | Patched behavior held in verification | No bypass confirmed
CVE-2025-58159 | Upload to RCE (PHP) | Endpoint removal blocked exploit path | No bypass confirmed
CVE-2025-55010 | Deserialization (PHP) | Primary sink fixed | Bypass confirmed via alternate deserialization path
CVE-2026-28296 | CRLF injection (GVFS/FTP) | Direct vector blocked | Bypass confirmed via server-supplied symlink target path
CVE-2025-60355 | Template injection (Java/FreeMarker) | Claimed fixed version remained exploitable in tests | Bypass/incomplete fix confirmed

Three out of five landed on the same theme: patching the reported path is not always patching the vulnerability class.

The Real Signal

The interesting part is not that an AI found something flashy. It’s what happens when you automate the boring part: testing the fix.

When reproduction is automated, patched-vs-vulnerable comparison is mandatory, and bypass checks are a normal part of the workflow rather than an afterthought, a lot of fixes start looking narrower than they did in the commit message. Anyone who’s audited patches for a living knows this feeling - that quiet suspicion when a fix touches one call site but the same pattern exists three functions over.
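That “same pattern three functions over” suspicion can be turned into a crude first-pass check. The sketch below assumes the FreeMarker case: it flags any source file that constructs a Configuration but never calls setNewBuiltinClassResolver. The class name and file names are hypothetical, and a text match like this is a triage hint rather than proof, since a resolver can also be set elsewhere (for example through framework settings):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical heuristic: find files that build a FreeMarker config without restricting ?new. */
final class SiblingSinkScan {

    private static final Pattern BUILDS_CONFIG =
            Pattern.compile("new\\s+Configuration\\s*\\(");
    private static final Pattern SETS_RESOLVER =
            Pattern.compile("setNewBuiltinClassResolver\\s*\\(");

    /** Returns the names of files that match the sink pattern but not the mitigation. */
    static List<String> unrestrictedFiles(Map<String, String> sourcesByFile) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, String> entry : sourcesByFile.entrySet()) {
            String code = entry.getValue();
            boolean buildsConfig = BUILDS_CONFIG.matcher(code).find();
            boolean setsResolver = SETS_RESOLVER.matcher(code).find();
            if (buildsConfig && !setsResolver) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, String> sources = new LinkedHashMap<>();
        sources.put("PatchedRenderer.java",
                "Configuration cfg = new Configuration(v); cfg.setNewBuiltinClassResolver(r);");
        sources.put("SiblingRenderer.java",
                "Configuration cfg = new Configuration(v); cfg.setNumberFormat(\"#\");");
        sources.put("Unrelated.java", "int x = 1;");
        System.out.println(unrestrictedFiles(sources)); // [SiblingRenderer.java]
    }
}
```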

What CVEForge adds isn’t insight. It’s consistency. The same skeptical workflow, run the same way, across different codebases and vulnerability classes. It doesn’t get tired. It doesn’t skip the patched-version test because it’s 6 PM on a Friday.

The question it keeps asking is simple:

When a CVE fix lands, does the adjacent code still carry equivalent exploitability?

What We Are Not Claiming

Before someone writes a hot take about “60% of CVE fixes are broken” - no. Five CVEs is not a sample size. Selection bias is real. Tooling bias is real. We picked application-level vulnerabilities in open-source web stacks because that’s what CVEForge is built for.

But five runs, three bypass outcomes, across PHP and Java, across four different vulnerability classes? That’s enough to keep running. If the next ten come back clean, great. If they don’t, we’ll have something worth talking about with actual numbers behind it.
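To put a number on how little five runs pin down, here is a small sketch that computes a Wilson score interval for 3 out of 5. The choice of interval is ours, for illustration only, and it assumes independent, randomly chosen cases, which these are not:

```java
import java.util.Locale;

/** Illustrative only: Wilson score interval for a binomial proportion. */
final class BypassRateInterval {

    /** Returns {lower, upper} for the given z value (1.96 for roughly 95%). */
    static double[] wilson(int successes, int trials, double z) {
        if (trials <= 0 || successes < 0 || successes > trials) {
            throw new IllegalArgumentException("need 0 <= successes <= trials and trials > 0");
        }
        double n = trials;
        double p = successes / n;
        double z2 = z * z;
        double denom = 1.0 + z2 / n;
        double center = (p + z2 / (2.0 * n)) / denom;
        double half = z * Math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom;
        return new double[] { center - half, center + half };
    }

    public static void main(String[] args) {
        double[] ci = wilson(3, 5, 1.96);
        // Prints: 3/5 bypass outcomes: 0.23 to 0.88
        System.out.printf(Locale.ROOT, "3/5 bypass outcomes: %.2f to %.2f%n", ci[0], ci[1]);
    }
}
```

Even under those generous assumptions, the plausible range runs from roughly a quarter to nearly nine in ten, which is why the only honest next step is more runs.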

Run Cost Snapshot (This Case)

For this CVE-2025-60355 run, CVEForge logged:

  • Total cost: $13.33 ($13.325611)
  • Total runtime: 44m 18s (2,657,548 ms)

Phase spend breakdown:

  • Intel: $2.49
  • Analysis: $1.58
  • Lab build: $4.82
  • PoC verify: $0.78
  • Reporting + bypass: $3.65

That profile is consistent with what we would expect in a harder Java case: lab construction and the verification-heavy phases account for more of the cost than initial triage.
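The split is easy to check from the phase figures above. This sketch works in whole cents to avoid floating-point drift; the class and method names are hypothetical. The rounded phase figures sum to $13.32, one cent under the logged total, presumably because each phase was rounded separately:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Recomputes each phase's share of spend from the rounded figures in this post. */
final class PhaseSpend {

    /** Phase costs in cents, in the order listed in the post. */
    static Map<String, Integer> phaseCents() {
        Map<String, Integer> phases = new LinkedHashMap<>();
        phases.put("Intel", 249);
        phases.put("Analysis", 158);
        phases.put("Lab build", 482);
        phases.put("PoC verify", 78);
        phases.put("Reporting + bypass", 365);
        return phases;
    }

    static int totalCents(Map<String, Integer> phases) {
        int sum = 0;
        for (int cents : phases.values()) {
            sum += cents;
        }
        return sum;
    }

    static double sharePercent(int cents, int totalCents) {
        return 100.0 * cents / totalCents;
    }

    public static void main(String[] args) {
        Map<String, Integer> phases = phaseCents();
        int total = totalCents(phases);
        System.out.printf(Locale.ROOT, "Sum of phases: $%.2f%n", total / 100.0);
        for (Map.Entry<String, Integer> e : phases.entrySet()) {
            System.out.printf(Locale.ROOT, "%-20s %5.1f%%%n",
                    e.getKey(), sharePercent(e.getValue(), total));
        }
    }
}
```

On these figures, lab build alone is a little over a third of the spend, and Intel plus Analysis together come to less than lab build by itself.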

Where We Go Next

The plan is boring in the best way: more CVEs, same verification rubric, track bypass rates over time, keep tightening prompts from real failure logs. Method, repetition, evidence over opinions.

We’re finally measuring something that the industry has been hand-waving about for years. When a fix ships, does it actually fix the thing? Not “did the CI pass” or “did the reporter close the issue” - does the vulnerability class go away?

Early returns suggest the answer is “not always.” We’d like to know how often.

