72 Hours, 24 CVE Proof of Concept Exploits, and 8 Disclosure Submissions: The CVEForge Stress Test

CVEForge is our autonomous exploit development pipeline - six AI agents that take a CVE number and produce a working proof-of-concept, a full Docker lab, a vulnerability analysis, and, when the vendor’s fix is incomplete, a bypass with disclosure-ready documentation. We built it on top of Shannon and the EIP MCP server.

After the first five blog posts - seven CVEs, three bypass findings, a lot of lessons about Docker networking - we had a pipeline that worked. Every time. Reliably enough that feeding it another CVE had become a low-ceremony operation: pick a target, type the command, check back in forty minutes.

So we kept feeding it. For three days straight.

What came out the other end was more than we expected. Not because any single finding was earth-shattering - though one of them got submitted to GitLab’s bug bounty program, which was new for us - but because of the sheer volume. Twenty-four CVEs. All twenty-four produced working exploits. Ten bypass or incomplete-fix findings. Eight responsible disclosure submissions. The kind of throughput that makes you reconsider what “routine patch validation” could look like at scale.

This is the full accounting. Not curated highlights - everything, including the ones that were boring and the ones that made us nervous.

The Numbers

Let’s get the spreadsheet out of the way.

The table below is the full CVEForge output - every run, including the seven from earlier blog posts. Twenty-four CVEs. All in one place.

#	CVE	Product	Class	CVSS	PoC	Bypass	Post
1	CVE-2026-28296	GVFS (GNOME)	CRLF injection	4.3	2 scripts	Bypass found	Post 1
2	CVE-2025-53833	LaRecipe (Laravel)	SSTI / RCE	10.0	1 script	Clean fix	Post 2
3	CVE-2025-58159	WeGIA	File upload / RCE	9.9	1 script	Clean fix	Post 3
4	CVE-2025-55010	Kanboard	Deserialization / RCE	9.1	2 scripts	Bypass found	Post 3
5	CVE-2025-60355	OneBlog	FreeMarker SSTI / RCE	9.8	2 scripts	Fix never applied	Post 4
6	CVE-2025-10622	Foreman (Red Hat)	Command injection	8.4	2 scripts	Clean fix	Post 5
7	CVE-2026-28372	GNU telnetd	Privilege escalation	8.4	2 scripts	Clean fix	Post 5
8	CVE-2025-66489	Cal.com	Auth bypass	9.8	3 scripts	Clean fix	New
9	CVE-2026-26321	OpenClaw	Path traversal / SSRF	7.5	3 scripts	Clean fix	New
10	CVE-2016-15057	Apache Continuum	Pre-auth cmd injection	9.8	3 scripts	Clean fix	New
11	CVE-2025-11539	Grafana Image Renderer	File write / RCE	9.9	3 scripts	Clean fix	New
12	CVE-2024-56143	Strapi 5	IDOR / data exfiltration	8.2	3 scripts	Clean fix	New
13	CVE-2025-58046	DataEase	JNDI injection / RCE	9.8	3 scripts	Bypass found	New
14	CVE-2025-69985	FUXA SCADA	Auth bypass / RCE	9.8	5 scripts	5 bypasses	New
15	CVE-2026-1868	GitLab AI Gateway	Jinja2 SSTI / DoS	9.9	4 scripts	4 bypasses (0-day)	New
16	CVE-2026-23906	Apache Druid	LDAP auth bypass	9.8	4 scripts	Clean fix	New
17	CVE-2026-2635	MLflow	Default credentials	9.1	4 scripts	Bypass found	New
18	CVE-2026-26988	LibreNMS	SQL injection	8.8	6 scripts	Bypass found	New
19	CVE-2026-2749	Centreon	Path traversal / RCE	9.9	3 scripts	Clean fix	New
20	CVE-2026-28215	Hoppscotch	Unauth config overwrite	9.1	3 scripts	Clean fix	New
21	CVE-2026-28370	OpenStack Vitrage	eval() injection / RCE	9.1	4 scripts	Bypass found	New
22	CVE-2026-28409	WeGIA	Command injection / RCE	9.9	6 scripts	Bypass found	New
23	CVE-2026-28417	Vim netrw	Command injection	9.8	5 scripts	Clean fix	New
24	CVE-2026-28268	Vikunja	Password reset token reuse	9.8	3 scripts	Clean fix	New

All twenty-four produced working PoCs. A 100% completion rate across the full run.

Twenty-four CVEs. Twenty-four working exploits. Ten bypass or incomplete-fix findings. Eight responsible disclosure submissions. Across PHP, Java, Python, Node.js, Go, and C. Across SSTI, SQLi, IDOR, command injection, deserialization, path traversal, JNDI, LDAP bypass, eval injection, and auth bypass. Three days.

That said - the pipeline wasn’t on autopilot. We were in the loop the whole time, and the system got better because we kept poking at it.

The Tuning

We should be honest about this part, because the “we just typed a command and walked away” narrative is tempting but incomplete.

Between runs, we kept tweaking. Not the core architecture - the six agents, the phase sequencing, the MCP integration all stayed the same. But the prompts evolved constantly, the way a recipe evolves when you cook it twenty-four times in a row.

The path bug. Early on, the lab build agent kept writing PoC scripts that referenced hardcoded container IPs from a previous run. Turns out we had a path resolution issue where the workspace context wasn’t being passed cleanly between phases. Not a glamorous fix - twenty minutes of staring at a variable scope - but it was the difference between 60% and 95% lab build success rates.

Pattern matching prompts. After the third bypass finding, we started tuning the bypass agent’s prompts to look for similar patterns to the ones it had already found. If a fix blocks * and ** operators, does it also block %? If a fix validates the Referer header, does it also validate the JWT secret? We weren’t telling the agent the answers - we were teaching it to ask better questions. The difference showed: the early runs found one bypass per CVE when they found any. The later runs (FUXA, GitLab, LibreNMS) found entire families of bypasses.

The “read the fix diff first” rewrite. The analysis agent originally started with the CVE description and worked forward to the code. We flipped it: read the fix diff first, then work backward to understand what was vulnerable. Sounds obvious in retrospect. The effect was immediate - the agent stopped writing PoCs for vulnerabilities it didn’t fully understand and started writing PoCs that matched the actual code path. False positives dropped to near zero.

Docker hint injection. Some applications have… creative… build processes. Cal.com needs a Postgres database seeded with Prisma migrations before the app will even start. Strapi needs a specific version of Node with specific env vars. After a few painful lab builds, we started enriching the workspace context with build hints scraped from the project’s CI configuration. The agent could already read READMEs and docker-compose files - we just gave it one more source of ground truth.

None of these changes were revolutionary. They were the kind of unglamorous refinements that separate a demo from a tool. We typed a CVE number and walked away, yes - but we’d spent the previous evening reading failed build logs and muttering at prompt templates. The automation isn’t magic. It’s the residue of a lot of very boring debugging.

The Highlights

Not all CVEs are created equal. Some are interesting because the vulnerability is clever. Some because the fix is broken. Some because the implications scale. Here are the ones that made us sit up.

GitLab AI Gateway: From Reproduction to HackerOne

CVE-2026-1868 started as a routine run. Jinja2 SSTI in GitLab’s AI Gateway - the prompt rendering engine used by Duo, GitLab’s AI assistant. CVSS 9.9. CVEForge reproduced it, wrote three PoC scripts, confirmed 27 out of 27 test payloads. Standard pipeline output.

Then the bypass agent got involved.

GitLab’s fix introduced a custom Jinja2 sandbox class that restricts specific operators and marks callables as unsafe. Solid approach. But the agent found two architectural blind spots in the sandbox design - places where the Jinja2 runtime invokes functionality through code paths that the sandbox doesn’t intercept. Four independent bypass vectors, all achieving the same impact as the original vulnerability on the fully patched code.

We verified every bypass locally against the exact patched class, across three GitLab versions: the first patched release, the latest release, and main. All vulnerable. We also confirmed that the original attack vectors (the ones the fix was designed to block) correctly fail. Not a false positive.

Since this is an active 0-day, we’re keeping the technical details under wraps until GitLab has had time to respond. We submitted a full security report via HackerOne - four bypass techniques, a self-contained reproduction script, suggested fix code, and a production attack path assessment. CVSS 7.7 (High). The report is pending triage.

That was a first for us. CVEForge went from “reproduce a CVE” to “find a 0-day in a vendor’s fix and submit a responsible disclosure” in a single run. The bypass agent earned its keep. We’ll publish the full technical details once the fix ships.

FUXA: The Fix That Wasn’t (Again)

FUXA is an open-source SCADA/HMI platform - the kind of software that controls industrial processes. CVE-2025-69985 is nominally an authentication bypass, but the reality is worse.

CVEForge reproduced the original vulnerability: spoof a Referer header to bypass JWT authentication, hit /api/runscript, get RCE via Module._compile(). The vendor’s fix (commit 5e7679b0) rewired the middleware so the Referer bypass no longer works. Good.

Except the JWT signing secret is still hardcoded as frangoteam751 in the public source code. And verifyToken never actually rejects - it auto-generates guest JWTs for unauthenticated requests. And when secureEnabled defaults to false, verifyGroups returns admin privileges for everyone.

The bypass agent found five independent vectors on the patched version. The most devastating: forge an admin JWT using the hardcoded secret, send it with your /api/runscript request, get a root shell. The “fix” changed the lock on the front door. The key is published in the source code.

Underneath, this is still the incomplete fix for CVE-2023-33831 from three years ago. Same codebase, same architectural problem, same hardcoded secret. The door keeps getting a new lock. The key keeps being taped to the door frame.
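
The three failure modes above (a published signing secret, a verifier that never rejects, and a permissive default) each have a fail-closed counterpart. The sketch below is a toy model, not FUXA's code: the environment variable name TOKEN_SIGNING_SECRET, the 32-character minimum, and the simple "payload.mac" token format (not a real JWT) are all assumptions made for illustration.

```python
import hashlib
import hmac
import os
from typing import Mapping, Optional


def load_signing_secret(env: Mapping[str, str] = os.environ) -> bytes:
    """Read the secret from the environment; refuse to run without a real one."""
    secret = env.get("TOKEN_SIGNING_SECRET", "")  # hypothetical variable name
    if len(secret) < 32:
        raise RuntimeError("signing secret missing or too short")
    return secret.encode()


def sign(payload: str, secret: bytes) -> str:
    """Toy token: '<payload>.<hex HMAC-SHA256>'. Not a JWT."""
    mac = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{mac}"


def verify_token(token: Optional[str], secret: bytes) -> Optional[str]:
    """Return the payload of a valid token, otherwise None.

    A missing or malformed token is rejected; no guest token is minted.
    """
    if not token or "." not in token:
        return None
    payload, _, mac = token.rpartition(".")
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(mac.encode(), expected.encode()):
        return payload
    return None
```

With the secret held outside the repository, a token signed with any value an attacker can read in the source fails verification, and an unauthenticated request stays unauthenticated instead of being upgraded.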

Cal.com: !credentials.totpCode and the Art of Falling Through

Sometimes a vulnerability is so clean it reads like a textbook example. Cal.com - the popular open-source scheduling platform - had a critical auth bypass where including any truthy value in the totpCode field during login skips password verification entirely:

if (user.password?.hash && !credentials.totpCode) {
    // Password verification ONLY happens inside this block
    const isCorrectPassword = await verifyPassword(credentials.password, user.password.hash);
}

Send totpCode: "anything". The condition !credentials.totpCode evaluates to false. The password check never executes. The code falls through to the 2FA verification - which only runs if the user has 2FA enabled. For the majority of accounts without 2FA: no check at all. Login succeeds. Full account access with only an email address.

CVEForge built the lab (Cal.com is a large Node.js application - 2 GB of dependencies), wrote the PoC, demonstrated three attack vectors: TOTP field bypass, account enumeration via timing, and session hijacking. The fix in v5.9.8 restructured the auth flow to validate password first, unconditionally. Clean fix. But the original bug is the kind that makes you audit every if (!field) gate in your auth code.
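
To make the fall-through concrete, here is a small Python model of the two control flows. It is a sketch under stated assumptions, not Cal.com's code: the User class, the stand-in verify_password and verify_totp helpers, and both function names are hypothetical, and the "fixed" variant only mirrors the description of validating the password first, unconditionally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    password_hash: Optional[str]
    two_factor_enabled: bool = False


def verify_password(password: str, password_hash: str) -> bool:
    # Stand-in for a real hash comparison; illustration only.
    return password_hash == f"hash({password})"


def verify_totp(code: str) -> bool:
    # Stand-in; a real check validates the code against the user's TOTP secret.
    return code == "123456"


def login_flawed(user: User, password: str, totp_code: Optional[str] = None) -> bool:
    # Same shape as the quoted condition: the password is only checked
    # when no TOTP code was supplied.
    if user.password_hash and not totp_code:
        if not verify_password(password, user.password_hash):
            return False
    # The 2FA gate only applies to users who enabled 2FA.
    if user.two_factor_enabled:
        if not totp_code or not verify_totp(totp_code):
            return False
    return True


def login_password_first(user: User, password: str, totp_code: Optional[str] = None) -> bool:
    # The password check runs unconditionally; 2FA is an additional gate.
    if not user.password_hash or not verify_password(password, user.password_hash):
        return False
    if user.two_factor_enabled:
        return bool(totp_code) and verify_totp(totp_code)
    return True
```

In this model, a user without 2FA who sends a wrong password plus any non-empty TOTP value passes login_flawed and is rejected by login_password_first. That difference is worth encoding as a regression test in any auth flow with optional second factors.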

Strapi 5: The lookup Parameter

CVE-2024-56143 in Strapi 5 is an IDOR that lets an unauthenticated attacker extract admin password hashes character by character through the public content API. The lookup parameter - designed as an internal-only mechanism for publication status and locale filtering - passes through sanitizeQuery() and validateParams() unscathed because neither function rejects unknown parameters. It lands directly in a database WHERE clause via an object spread:

return assoc('where', { ...params?.lookup, ...query.where }, query);

One HTTP GET request per character guess:

GET /api/articles?lookup[createdBy][password][$startsWith]=$2b$10$abc

If articles come back, the prefix matches. If empty, it doesn’t. Repeat until you’ve extracted the full bcrypt hash. CVEForge wrote the blind extraction script - it’s methodical and patient, exactly the kind of boring automation that makes this bug class dangerous in practice.
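
The root cause is a merge that trusts a parameter nobody validated. The following Python model shows the shape of the problem and one possible mitigation, rejecting unknown top-level parameters. It is a sketch, not Strapi's implementation: the function names and the ALLOWED_PARAMS set are hypothetical, and the filter value is a placeholder.

```python
from typing import Any, Dict

# Hypothetical allowlist of public query parameters, for illustration only.
ALLOWED_PARAMS = {"filters", "sort", "pagination"}


def build_query_flawed(params: Dict[str, Any], query: Dict[str, Any]) -> Dict[str, Any]:
    # Same shape as the quoted merge: whatever arrives in params["lookup"]
    # is spread straight into the WHERE clause.
    where = {**params.get("lookup", {}), **query.get("where", {})}
    return {**query, "where": where}


def validate_params_strict(params: Dict[str, Any]) -> Dict[str, Any]:
    # Reject anything that is not explicitly part of the public API.
    unknown = set(params) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"unknown query parameters: {sorted(unknown)}")
    return params


# A caller-supplied lookup ends up as a filter on a related private field.
untrusted = {"lookup": {"createdBy": {"password": {"$startsWith": "x"}}}}
merged = build_query_flawed(untrusted, {"where": {"publishedAt": {"$notNull": True}}})
print(sorted(merged["where"]))
```

Running validate_params_strict on the same input raises before the merge is ever reached, which is the property the original sanitizers lacked: they filtered what they recognized and ignored what they did not.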

Apache Druid: The LDAP Anonymous Bind

CVE-2026-23906 is the kind of bug that every LDAP-integrated application should be tested for but rarely is. Apache Druid’s druid-basic-security extension accepts empty passwords because an LDAP simple bind with an empty password is treated as an anonymous or unauthenticated bind rather than a failed login, and many servers answer it with success (RFC 4513). The fix adds an explicit empty-password check. Simple, correct, the kind of one-liner that makes you wonder why it wasn’t there from the start:

if (password == null || password.isEmpty()) {
    throw new BasicSecurityAuthenticationException("Empty password not allowed");
}
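
The same guard is easy to model and test in isolation. The Python sketch below uses a toy in-memory directory instead of a real LDAP client. fake_ldap_simple_bind, the AuthenticationError class, and both authenticate functions are hypothetical names, and the toy server is assumed to permit empty-password binds.

```python
from typing import Dict


class AuthenticationError(Exception):
    pass


def fake_ldap_simple_bind(dn: str, password: str, directory: Dict[str, str]) -> bool:
    """Toy directory server that reports success for an empty-password bind."""
    if password == "":
        return True  # models a server that permits unauthenticated binds
    return directory.get(dn) == password


def authenticate_flawed(dn: str, password: str, directory: Dict[str, str]) -> bool:
    # Treats "bind succeeded" as "user proved their identity".
    return fake_ldap_simple_bind(dn, password, directory)


def authenticate_guarded(dn: str, password: str, directory: Dict[str, str]) -> bool:
    # Reject empty credentials before they ever reach the directory.
    if not password:
        raise AuthenticationError("Empty password not allowed")
    return fake_ldap_simple_bind(dn, password, directory)
```

The guard belongs in the application because the directory is behaving as specified; only the caller knows that a successful bind is being used as proof of identity.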

Vim netrw: A URI and a Shell

CVE-2026-28417 in Vim’s built-in netrw plugin is a command injection via scp:// URIs. The hostname validation regex is not anchored at the end - it only checks that the string starts with an alphanumeric character, so anything can follow - and MakeSshCmd() substitutes hostnames directly into shell commands without shellescape(). Open scp://a;id;b/dir/ in Vim and your terminal runs id. CVEForge wrote five PoC variants covering scp, sftp, and ssh protocols.

The beauty of this one is the attack surface. Vim is everywhere. netrw is loaded by default. All it takes is opening a URI - from a Git config, a Markdown link, a shell alias.
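
Both halves of the bug, a pattern that only constrains the first character and a missing escape step, are easy to show outside Vim. The Python sketch below is an analogue, not netrw's code: both regular expressions are hypothetical, the strict one only covers bare hostnames, and shlex.quote stands in for Vim's shellescape().

```python
import re
import shlex

# Hypothetical patterns for illustration; not the expressions used by netrw.
LOOSE_HOST = re.compile(r"[A-Za-z0-9]")  # only constrains the first character
STRICT_HOST = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?")


def is_valid_host_loose(host: str) -> bool:
    # match() anchors at the start only, so trailing shell syntax is accepted.
    return LOOSE_HOST.match(host) is not None


def is_valid_host_strict(host: str) -> bool:
    # fullmatch() requires the entire string to fit the pattern.
    return STRICT_HOST.fullmatch(host) is not None


hostile = "a;id;b"
print(is_valid_host_loose(hostile))   # accepted by the loose check
print(is_valid_host_strict(hostile))  # rejected by the strict check
print(shlex.quote(hostile))           # quoted so a shell treats it as one word
```

Validation and escaping are independent defenses. Either one alone would have stopped the payload in this example, and doing both means a future regex mistake does not immediately become command execution.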

The Bypass Rate

We’ve been tracking this since the beginning. Adding the latest batch to the running total:

Batch	CVEs Completed	Clean Fix	Bypass/Incomplete	Rate
Posts 1-5 (7 CVEs)	7	4	3	43%
This batch (17 new)	17	10	7	41%
Total	24	14	10	42%

Forty-two percent of the CVE fixes we’ve tested have had some form of bypass or incomplete-fix finding. We’ve been careful in previous posts not to overstate this - five was not a sample size, seven was not a sample size - but twenty-four starts to tell a story. Even accounting for selection bias (we pick application-level vulns in open-source web stacks), a 42% incomplete-fix rate is striking.
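
For readers who want to check the arithmetic, or see how much uncertainty a sample of twenty-four carries, the sketch below recomputes the rates from the table and adds a Wilson score interval. The interval is our addition for illustration. It treats the CVEs as independent draws, which the selection caveats make only approximately true.

```python
from math import sqrt
from typing import Tuple


def rate(bypasses: int, total: int) -> float:
    return bypasses / total


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin


# (bypass or incomplete-fix findings, CVEs completed), taken from the table above.
batches = {"Posts 1-5": (3, 7), "This batch": (7, 17), "Total": (10, 24)}

for name, (found, total) in batches.items():
    low, high = wilson_interval(found, total)
    print(f"{name}: {rate(found, total):.0%} (95% interval roughly {low:.0%} to {high:.0%})")
```

For 10 of 24 the interval comes out at roughly 24% to 61%. The point estimate is stable across batches, but the plausible range is still wide.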

The caveats still apply. We’re selecting for CVEs that have source code available, that are in languages CVEForge handles well, that have clear fix commits to test against. This is not a random sample of all CVE fixes ever. But it’s also not cherry-picked - we’ve been running these against whatever looked interesting, across eight languages and a dozen vulnerability classes.

The pattern we keep seeing: the fix addresses the reported vector but not the vulnerability class. The patch locks one door. The bypass walks through another door in the same hallway.

What Nearly Broke

Every run produced a working PoC. We’re as surprised as you are. But “produced a working PoC” and “went smoothly” are very different statements.

Lab build is still the bottleneck. The slowest, most expensive, most failure-prone phase. Building Docker containers for applications with complex dependency chains - Gradle builds, npm monorepos with 2 GB of node_modules, Ruby on Rails with migration-seeded databases - is genuinely hard. The agent is getting better at it (it reads the project’s README, docker-compose.yml, and now the CI config before designing its own lab), but it still wastes turns on build failures about 30% of the time. The runs that succeeded often needed two or three attempts at the Dockerfile before the lab came up clean. Watching an AI agent debug a Gradle build failure is a lot like watching a first-year CS student debug a Gradle build failure - the same mix of overconfidence and confusion, just faster.

Cost adds up. API costs scale with complexity - more agent turns means more tokens, and bypass runs are significantly more expensive than clean-fix runs because the bypass agent keeps iterating. Here’s the rough per-CVE breakdown:

#	CVE	Product	Est. Cost	Notes
1	CVE-2026-28296	GVFS	~$12	Bypass analysis
2	CVE-2025-53833	LaRecipe	~$7	Simple, 1 PoC
3	CVE-2025-58159	WeGIA (upload)	~$7	Simple, 1 PoC
4	CVE-2025-55010	Kanboard	~$14	Bypass, 5 PoCs
5	CVE-2025-60355	OneBlog	~$13	Bypass, Java
6	CVE-2025-10622	Foreman	~$11	Complex lab (Rails)
7	CVE-2026-28372	GNU telnetd	~$10	C target, custom lab
8	CVE-2025-66489	Cal.com	~$15	Huge codebase (2 GB deps)
9	CVE-2026-26321	OpenClaw	~$11	6 PoCs, clean fix
10	CVE-2016-15057	Apache Continuum	~$9	Legacy Java, straightforward
11	CVE-2025-11539	Grafana	~$13	Complex Docker lab
12	CVE-2024-56143	Strapi 5	~$14	Large Node.js app
13	CVE-2025-58046	DataEase	~$18	12 PoCs, 4 bypass files
14	CVE-2025-69985	FUXA	~$19	5 independent bypasses
15	CVE-2026-1868	GitLab AI Gateway	~$17	4 bypasses, 0-day
16	CVE-2026-23906	Apache Druid	~$13	Complex LDAP lab
17	CVE-2026-2635	MLflow	~$14	Bypass, 3 bypass files
18	CVE-2026-26988	LibreNMS	~$20	7 PoCs, 9 bypass files
19	CVE-2026-2749	Centreon	~$13	Clean fix, complex lab
20	CVE-2026-28215	Hoppscotch	~$10	Clean fix, moderate
21	CVE-2026-28370	Vitrage	~$15	Bypass, eval injection
22	CVE-2026-28409	WeGIA (cmd inj)	~$16	6 PoCs, 3 bypass files
23	CVE-2026-28417	Vim netrw	~$12	5 PoC variants, clean fix
24	CVE-2026-28268	Vikunja	~$11	Auth bypass, 3 PoCs
		Total	~$314	Avg ~$13/CVE
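
The totals, and the claim that bypass runs cost more than clean-fix runs, can be checked directly from the two tables. In the sketch below the per-run figures are the estimates from the cost table, and a run counts as a bypass run when the summary table's Bypass column reports anything other than a clean fix.

```python
# Estimated cost in USD per run, keyed by row number in the cost table.
costs = {
    1: 12, 2: 7, 3: 7, 4: 14, 5: 13, 6: 11, 7: 10, 8: 15,
    9: 11, 10: 9, 11: 13, 12: 14, 13: 18, 14: 19, 15: 17, 16: 13,
    17: 14, 18: 20, 19: 13, 20: 10, 21: 15, 22: 16, 23: 12, 24: 11,
}

# Rows whose Bypass column in the summary table is not "Clean fix".
bypass_rows = {1, 4, 5, 13, 14, 15, 17, 18, 21, 22}

total = sum(costs.values())
average = total / len(costs)

bypass_costs = [cost for row, cost in costs.items() if row in bypass_rows]
clean_costs = [cost for row, cost in costs.items() if row not in bypass_rows]

bypass_avg = sum(bypass_costs) / len(bypass_costs)
clean_avg = sum(clean_costs) / len(clean_costs)

print(f"total ${total}, average ${average:.2f} per CVE")
print(f"bypass runs average ${bypass_avg:.2f}, clean-fix runs average ${clean_avg:.2f}")
```

On these estimates the ten bypass runs average about $15.80 and the fourteen clean-fix runs about $11.14, a gap of a little over four dollars per run.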

Remember what that cost covers: not just the PoC, but the full working lab - vulnerable and patched Docker containers built from source, with seeded databases, configured services, and exposed endpoints. Plus the vulnerability analysis, the writeup, and when triggered, the bypass analysis with independent proof scripts. Seven dollars gets you a complete reproduction environment for a simple CVE. Twenty dollars gets you a multi-container lab, seven exploit variants, nine bypass analysis files, and a finding ready for disclosure. The most expensive run (LibreNMS) would take a human analyst the better part of a day to replicate manually.

What the Pipeline Produced

For each successful run, CVEForge produced:

README.md - Full writeup: root cause, affected versions, lab setup, PoC usage, fix analysis
Vulnerability analysis - Data flow tracing from input to sink, fix completeness assessment
Docker lab - docker-compose.yml with vulnerable and patched containers
PoC scripts - Standalone Python exploits, zero or minimal dependencies
Bypass analysis (when triggered) - Independent bypass vectors with proof scripts
Verification report - Test matrix showing each payload’s result against vulnerable and patched targets

The consistency matters more than any individual artifact. Every run follows the same template. Every PoC has the same structure. Every verification report uses the same pass/fail criteria. You can compare findings across CVEs because they were measured the same way.

The Disclosures

Ten incomplete-fix findings meant ten conversations about responsible disclosure. We submitted eight reports through three different channels, depending on what each project supports.

Six GitHub security issues. Most open-source projects don’t have bug bounty programs or dedicated security teams. The responsible path is a GitHub issue - ideally through their Security Advisory workflow if the project has it enabled, otherwise a regular issue with enough detail to reproduce. Each report included the CVEForge-generated vulnerability analysis, bypass techniques, and reproduction steps. For projects like FUXA, DataEase, MLflow, LibreNMS, OpenStack Vitrage, and WeGIA, this was the right channel.

One MITRE submission. For bypass findings where the original CVE fix was demonstrably incomplete, MITRE needs to know. The original CVE describes a vulnerability that the vendor claims is fixed - if the fix doesn’t hold, that’s material information for anyone relying on the CVE entry for patch prioritization. We submitted a report documenting the incomplete fix with evidence.

One HackerOne submission. GitLab has a formal bug bounty program with real triage, real bounties, and a 95% response efficiency rating. We wrote a complete security report for the CVE-2026-1868 bypass: summary, root cause analysis, four bypass techniques with payloads, a self-contained Python reproduction script (no Docker, no lab - just pip install jinja2), a production attack path through the Duo Agent Platform, impact assessment, and suggested fix code. Attached the full bypass PoC. Set the CVSS to 7.7 (High). Hit submit.

The step from “find bypass” to “submit disclosure” turned out to be shorter than we expected. CVEForge’s deliverables are already structured for it - the vulnerability analysis becomes the root cause section, the PoC becomes the reproduction steps, the bypass analysis becomes the finding. We documented the full submission process so we can repeat it.

Eight reports in three days. Not because we were trying to hit a number - because the pipeline kept finding things that needed reporting.

As embargoes lift and vendors ship their fixes, we’ll publish each CVE alongside its full lab environment, PoC scripts, and bypass analysis - the same way we’ve published everything so far, just on the vendor’s timeline instead of ours.

Three Days

We started this to stress-test the pipeline. To find out where it breaks, what it can’t handle, whether the bypass agent’s early findings were a fluke or a pattern. We expected to find the ceiling - the point where the complexity exceeds what six agents and a good prompt set can handle.

We haven’t found it yet. Which is either exciting or terrifying, depending on your relationship with automation.

Twenty-four CVEs across PHP, Java, Python, Node.js, Go, and C. SSTI, SQLi, IDOR, command injection, deserialization, path traversal, JNDI, LDAP bypass, eval injection, auth bypass, and password reset token reuse. Cal.com’s 2 GB monorepo and Vim’s 30-year-old netrw plugin. GitLab’s enterprise AI gateway and FUXA’s SCADA HMI. Every single run produced a working exploit.

The bypass rate didn’t go down with volume. If anything, the bigger the sample got, the more stable the number became. Forty-two percent of the fixes we tested were incomplete in some way. That’s not a pipeline artifact - it’s a reflection of how hard it is to patch a vulnerability class versus patching a vulnerability instance. Developers fix the bug. They don’t always fix the kind of bug.

CVEForge isn’t replacing security researchers. It’s running the boring verification step that nobody has time for: does this fix actually fix the thing? When the answer is no - and it’s no about four times in ten - having a complete lab, a working PoC, and a structured analysis ready to go means the finding can move straight to disclosure instead of sitting in someone’s “I should look at that” pile.

We started with one prompt and a CRLF injection. Six blog posts later, we have twenty-four CVEs, twenty-four working exploits, ten incomplete-fix findings, eight disclosure submissions, and a pipeline that - when we’re not fiddling with its prompts at 2 AM - mostly runs itself.

The pipeline keeps running. We’ll let you know what it finds next.

All labs and PoCs are published as they’re verified:

GitHub: eip-pocs-and-cves
EIP Labs page (pre-built Docker images)
EIP MCP server (the intelligence layer)

CVEForge Series:
From CRLF Injection PoC to Fix Bypass - the one-prompt precursor
CVE-2025-53833: From CVE Number to Root Shell in 32 Minutes - the first full run
Zero to RCE: Three Vulnerability Classes - three CVEs, three PoCs, one bypass
OneBlog: 3 Bypasses in 5 Runs - the Java case study
Foreman & Telnetd - two CVEs, two very different fixes
72 Hours, 24 CVEs - you are here
The One That Failed - when the pipeline hits a wall