After CVEForge and StackForge, we had two forks of Shannon that each worked backward from a known vulnerability. CVEForge starts with a CVE number and a patch diff. StackForge starts with a crash and a binary. Both answer the same question: given that the bug exists, can we exploit it?

That’s useful. But it’s also the easy version of the problem.

The hard version is the one where nobody tells you the bug exists. You get a source tree and a commit hash. No CVE number. No advisory. No “the vulnerability is on line 547.” Just code - hundreds of thousands of lines of it - and the open question: is there anything wrong here?

That’s fuzzing. And we’d been avoiding it.

Why We Were Avoiding It

Fuzzing is the discipline where the feedback loop is measured in CPU-hours, not HTTP status codes. CVEForge can tell in thirty seconds whether a payload triggered the vulnerability - send a request, check the response. StackForge gets its signal from GDB: did you hit SIGSEGV at the expected address, yes or no? The iteration cycle is fast enough that even when the agent is wrong, it can try again.

Fuzzing doesn’t work like that. You write a harness. You build the target with sanitizers. You run ten thousand iterations. Most of them find nothing. The ones that find something produce a crash log that might be a real bug, or might be your harness doing something stupid, or might be a known issue that’s been suppressed for three years. Triaging a sanitizer hit requires understanding the difference between “this is exploitable memory corruption” and “this is technically undefined behavior that no compiler on earth will ever miscompile.”

An AI agent that can write a working HTTP request doesn’t automatically know how to do any of that.

But then we watched StackForge’s lab-build agent rebuild an OpenSSL installation from source, configure compiler flags, debug a linker error, and produce a working instrumented binary - all without being told how. And we thought: if it can build a binary exploitation lab, maybe it can build a fuzzing lab too.

So we forked Shannon a third time.

FuzzForge

Same architectural move as before: take Shannon’s infrastructure - the Temporal orchestration, the agent execution lifecycle, the audit logging, the crash recovery, the prompt engine - and rewire it for a different problem. CVEForge has 6 agents. StackForge has 9. FuzzForge has 7.

The fundamental difference is the starting point. CVEForge starts with intelligence - a CVE number, an advisory, a known vulnerability. StackForge starts with a crash - proof that something is already broken. FuzzForge starts with source code. No CVE. No crash. No hints. Just a repository URL and a commit hash. Everything else - the attack surface analysis, the build, the harnesses, the fuzzing campaigns, the triage, the PoC construction - has to emerge from the agent pipeline itself.

The Pipeline

  1. Surface Map - Clones the source tree, reads it systematically, produces a threat-model matrix. Which components parse untrusted input? Where are the trust boundaries? What’s the attack surface? This isn’t a static analysis tool running pattern matching - it’s an Opus-class model reading C source files, understanding data flow, and writing a prioritized list of what to fuzz and why.

  2. Code Review - Takes the surface map and reads the actual code paths. Generates hypotheses: “this integer multiply on line 546 can wrap if the input is larger than 2^32,” or “this cast from size_t to u_char truncates keys longer than 255 bytes.” Each hypothesis gets a confidence label, a trigger shape, and an expected sanitizer signal. This is where the bugs get found, at least in theory.

  3. Lab Build - Builds the target from source with ASan and UBSan instrumentation. Configures suppressions for known issues. Writes nginx.conf (or equivalent) with all protocol listeners enabled. Adapts or creates fuzzing harnesses - shell scripts that start the server, send mutated input, collect sanitizer output. Verifies everything actually runs.

  4. Fuzz Campaign - Runs the harnesses. Two profiles: smoke (200 iterations per harness, fast sanity check) and rigorous (5,000 iterations, deeper coverage). Replays known crash artifacts from previous phases. Collects sanitizer hits, normalizes crash signatures, produces a coverage parity matrix showing which attack surface vectors were actually exercised.

  5. Triage & Verify - The quality gate. Takes every sanitizer hit and crash artifact, reproduces each one independently (minimum 10 times for carried findings, 4 times for new ones), classifies confidence as CONFIRMED, LIKELY, or UNCONFIRMED. Rules out false positives. Fixes infra blockers that prevented earlier phases from testing certain vectors. Produces a finding_gate.json:

    {"proceed": true, "reason": "1 strong CONFIRMED new finding(s) meet threshold"}
    

    Or:

    {"proceed": false, "reason": "No new confirmed findings above threshold"}
    
  6. PoC Build (conditional) - Only runs if the finding gate says proceed. Takes CONFIRMED findings and builds standalone, network-triggerable proof-of-concept scripts. Not just “here’s a crash” - fully self-contained Python scripts that start the target, send the trigger, capture the evidence, and print a verdict.

  7. Report - Synthesizes everything into a final report with evidence-backed findings, sanitizer stack traces, reproducibility commands, coverage matrices, and remediation recommendations.
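The harness loop at the heart of phases 3 and 4 is conceptually simple: start the target, send mutated input, watch for a crash. A minimal sketch in Python (the actual harnesses are shell scripts; the seed request, host/port handling, and the crash heuristic here are illustrative assumptions, not FuzzForge's implementation):

```python
import random
import socket

SEED_REQUEST = b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply a few random byte flips, insertions, or deletions to a seed."""
    buf = bytearray(data)
    for _ in range(rng.randint(1, 8)):
        op = rng.choice(("flip", "insert", "delete"))
        if op == "flip" and buf:
            i = rng.randrange(len(buf))
            buf[i] ^= 1 << rng.randrange(8)
        elif op == "insert":
            buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))
        elif op == "delete" and len(buf) > 1:
            del buf[rng.randrange(len(buf))]
    return bytes(buf)

def run_iterations(host: str, port: int, iterations: int, seed: int = 0) -> int:
    """Send mutated requests; count connection failures as a crude crash proxy.
    Real triage comes from the sanitizer logs, not the socket errors."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(iterations):
        payload = mutate(SEED_REQUEST, rng)
        try:
            with socket.create_connection((host, port), timeout=2) as s:
                s.sendall(payload)
                s.recv(4096)
        except OSError:
            failures += 1
    return failures
```

The mutation is deterministic given a seed, which is what makes crash replay (phase 4's first step) possible at all.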

The Finding Gate

All three Shannon forks use conditional execution gates. CVEForge asks “is the fix incomplete enough to attempt a bypass?” StackForge asks “can you control execution?” FuzzForge asks a different question: “did you find anything real?”

This matters because fuzzing generates noise. Lots of it. The fuzz-campaign agent might produce 818 sanitizer hits across 18,000 iterations - but if they’re all the same known UBSan signature from a startup path that’s already suppressed, that’s not a finding. The triage agent’s job is to separate signal from noise, and the finding gate is the mechanism that prevents the pipeline from burning money building PoCs for false positives.

When the gate says proceed: false, the pipeline skips PoC construction and goes straight to the report. You still get the surface map, the code review, the build, the campaign data - useful artifacts even when nothing new was found. But you don’t spend another $5 on a PoC that proves nothing.

Model Routing

Not every phase needs the same model. FuzzForge routes agents to different Claude models based on the cognitive demands of each phase:

  • Opus for surface-map, code-review, lab-build, and triage-verify - the phases that require reading and reasoning about source code, understanding compiler behavior, debugging build failures, and making judgment calls about whether a sanitizer hit is real.
  • Sonnet for fuzz-campaign and poc-build - the phases that are more mechanical: run the harness, collect the output, write the script, verify it works.
  • Haiku for report-final - summarization and formatting.

The cost difference is significant. Opus is roughly 5x the cost of Sonnet. Using Opus for every phase would roughly triple the total cost. Using Sonnet for every phase would produce a surface map that misses subtle bugs. The routing is the compromise.

The MCP Tools

CVEForge needed one MCP server (EIP for intelligence). StackForge needed three (EIP, GDB, SharkMCP). FuzzForge needed something new - tools that understand the fuzzing workflow itself.

run_fuzz_campaign - The campaign orchestrator. Takes a harness script path, iteration count, and profile name. Handles execution mode detection (Docker container vs. bare host), timeout management, crash artifact collection, and sanitizer output capture. Returns structured results: iterations completed, crashes found, sanitizer hit count, elapsed time. This is the tool that lets the fuzz-campaign agent say “run 5,000 iterations of the HTTP harness” and get back a machine-readable summary.
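The structured result might look something like this — field names here are our paraphrase of the summary described above, not the tool's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CampaignResult:
    # Hypothetical shape of a run_fuzz_campaign summary.
    harness: str              # path to the harness script that was run
    profile: str              # "smoke" or "rigorous"
    iterations_completed: int
    crashes_found: int
    sanitizer_hits: int
    elapsed_seconds: float

    @property
    def hit_rate(self) -> float:
        """Sanitizer hits per iteration -- a quick noise indicator."""
        if self.iterations_completed == 0:
            return 0.0
        return self.sanitizer_hits / self.iterations_completed
```

A machine-readable summary like this is what lets the agent reason about a campaign ("183 hits in 2,517 iterations, all one signature") instead of scrolling raw logs.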

collect_sanitizer_hits - Walks the artifact tree, finds every ASan and UBSan diagnostic, normalizes crash signatures (deduplicates stack traces that differ only by ASLR-randomized addresses), and returns a structured summary with source file, line number, sanitizer type, and excerpt. When the campaign produces 818 hits, this is the tool that tells you they’re all the same signature.
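The core of that normalization is simple: ASLR randomizes the addresses in every frame, so two traces for the same bug differ only in hex values. A sketch of the idea (not the tool's actual implementation — frame count and key format are our choices):

```python
import re

# ASan/UBSan frames embed ASLR-randomized addresses; strip them so that
# traces for the same bug compare equal.
HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")

def normalize_signature(trace: str, frames: int = 3) -> str:
    """Reduce a sanitizer stack trace to an address-free top-of-stack key."""
    lines = [l.strip() for l in trace.splitlines() if l.strip().startswith("#")]
    return "|".join(HEX_ADDR.sub("0xADDR", l) for l in lines[:frames])

def dedupe(traces: list[str]) -> dict[str, int]:
    """Group raw traces by normalized signature, counting occurrences."""
    counts: dict[str, int] = {}
    for t in traces:
        key = normalize_signature(t)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

With this in hand, 818 hits collapsing into one signature is a single dictionary with a single key.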

triage_finding_gate - The structured gate decision. Takes the triage evidence - reproduction counts, confidence labels, infra failure status - and produces the finding_gate.json that controls whether PoC construction runs. The logic is simple: at least one new CONFIRMED finding with zero unresolved infra blockers means proceed. Everything else means stop.
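That decision rule is small enough to sketch in full — a minimal model of the gate logic as described, with dictionary keys that are our own placeholders:

```python
def finding_gate(findings: list[dict]) -> dict:
    """Model of the gate rule: proceed only when at least one NEW finding is
    CONFIRMED and no infra blockers are left unresolved."""
    infra_blocked = [f for f in findings if f.get("infra_blocked")]
    confirmed_new = [f for f in findings
                     if f["confidence"] == "CONFIRMED" and f.get("new")]
    if infra_blocked:
        return {"proceed": False,
                "reason": f"{len(infra_blocked)} unresolved infra blocker(s)"}
    if confirmed_new:
        return {"proceed": True,
                "reason": f"{len(confirmed_new)} strong CONFIRMED new finding(s) meet threshold"}
    return {"proceed": False,
            "reason": "No new confirmed findings above threshold"}
```

The asymmetry is deliberate: an unresolved infra blocker vetoes the gate even when a confirmed finding exists, because an untested attack vector means the campaign isn't actually done.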

The First Real Run: nginx

We needed a target that was complex enough to test the full pipeline but well-known enough that we could validate the results. Something with multiple protocol parsers, a non-trivial build system, and a codebase large enough that a human wouldn’t casually audit it in an afternoon.

nginx. 259 C source files. 135 headers. HTTP/1.1, HTTP/2, HTTP/3, IMAP, SMTP, POP3, FastCGI, uwsgi, SCGI, a DNS resolver, a stream proxy, TLS everywhere. The kind of codebase that has been fuzzed by Google’s OSS-Fuzz for years, audited by countless security researchers, and still occasionally surprises people.

We pointed FuzzForge at the HEAD commit - dff46cd1a, which happened to be a fix for an integer overflow in the IMAP literal length parser. We didn’t tell the pipeline about the IMAP fix. We didn’t seed any hypotheses. We gave it a repo URL and a commit hash.

./fuzzforge start REPO=https://github.com/nginx/nginx.git COMMIT=dff46cd1a

Then we watched the logs.

What Happened

Phase 1: Surface Map (158 turns, $2.84)

The surface-map agent cloned the nginx source tree, inventoried the workspace (which had seeded harnesses from a prior run - HTTP, mail, and config parser fuzzers), and then did something we hadn’t explicitly asked for. It launched three parallel sub-agents:

  • One to analyze the HTTP parser (ngx_http_parse.c, the HTTP/2 HPACK decoder, the HTTP/3 QPACK parser)
  • One to analyze the mail and stream parsers (IMAP, SMTP, POP3, the stream SSL preread module)
  • One to analyze module boundaries (FastCGI, uwsgi, SCGI, the DNS resolver, the slab allocator, the geo/map modules)

All three ran concurrently, reading source files, tracing data flow from network input to internal state. The surface-map agent synthesized their output into a threat-model matrix with 9 high-risk attack vectors, each with a priority ranking and a note on whether existing harnesses covered it.

158 turns. Three minutes. For a codebase that would take a security engineer the better part of a week to survey manually.

Phase 2: Code Review (56 turns, $2.95)

The code-review agent took the surface map and started reading code. Nine components reviewed across roughly 3,000 lines. It generated hypotheses for each - some confident, some speculative.

Hypothesis H1: IMAP literal length overflow. The agent read the HEAD commit’s fix and noticed the guard only checks the multiplicative overflow (> NGX_MAX_SIZE_T_VALUE / 10), not the subsequent addition of (ch - '0'). It calculated that literal_len = 1844674407370955161 * 10 + 9 wraps size_t to 3. Clean analysis, correct arithmetic. Labeled LIKELY.
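The arithmetic checks out. Modeling the parser step in Python, assuming a 64-bit size_t as on the test host (the function name and guard phrasing are ours, paraphrasing H1's description of the patched code):

```python
SIZE_T_MAX = 2**64 - 1  # NGX_MAX_SIZE_T_VALUE on a 64-bit host

def parse_literal_digit(literal_len: int, ch: str) -> int:
    """Model one digit of the patched IMAP literal parser: the multiply is
    guarded, but (as H1 observes) the digit addition is not, so the
    accumulated size_t can still wrap mod 2^64."""
    if literal_len > SIZE_T_MAX // 10:
        raise ValueError("guarded: multiplicative overflow")
    return (literal_len * 10 + (ord(ch) - ord("0"))) & SIZE_T_MAX

# 1844674407370955161 equals SIZE_T_MAX // 10, so it passes the guard --
# but multiplying by 10 and adding 9 wraps the accumulator to 3.
```

One digit more and the guard fires; this exact value is the edge the guard misses.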

Hypothesis H2: DNS resolver TTL signed-left-shift. (an->ttl[0] << 24) where ttl[0] is u_char promoted to int. When the high bit is set, the left-shift produces signed overflow - undefined behavior per C11 6.5.7/4. Labeled CONFIRMED (existing PoC already proved it).

Hypothesis H5: HTTP/2 HPACK integer shift overflow. The agent checked NGX_HTTP_V2_INT_OCTETS = 4, calculated the maximum accumulated value from 4 continuation bytes (0x0FFFFFFF), confirmed it fits in ngx_uint_t, and ruled it out. Correct analysis, correct conclusion. The boring ones matter too.
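H5's ruling-out is easy to reproduce: as the agent computed it, four 7-bit continuation octets accumulate to at most 2^28 - 1, which fits comfortably in a 64-bit ngx_uint_t:

```python
# HPACK integers continue in 7-bit groups; with NGX_HTTP_V2_INT_OCTETS = 4
# the accumulator sees at most four such groups.
max_accumulated = sum(0x7F << (7 * i) for i in range(4))

assert max_accumulated == 0x0FFFFFFF   # 2**28 - 1, as the agent calculated
assert max_accumulated < 2**64         # fits in a 64-bit ngx_uint_t: no wrap
```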

And then - Hypothesis H6: FastCGI key length truncation.

The agent read ngx_http_fastcgi_module.c line 1068:

key_len = (u_char) lcode(&le);

lcode(&le) returns a size_t. The (u_char) cast truncates it to a single byte. If a FastCGI parameter key name is longer than 255 bytes - which is unusual but entirely possible via fastcgi_param directives - the declared key length in the FastCGI PARAMS record wraps: 300 bytes becomes 44 (300 mod 256).

The agent traced the consequences: the backend FastCGI server reads 44 bytes as the key, then interprets the remaining 256 bytes of key data as the start of the value field. The value field itself gets pushed into the next parameter’s territory. The entire PARAMS record is desynchronized.

Not memory corruption. Not a crash. A logic bug - the kind that no sanitizer catches, no fuzzer finds by accident, and no static analysis tool flags because a u_char cast is perfectly legal C. You find it by reading the code and understanding the FastCGI wire protocol well enough to realize the encoding is wrong.
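The desynchronization is easy to demonstrate with a toy model. This simplifies the FastCGI name-value encoding to single-byte lengths (the real protocol switches to a 4-byte form for lengths of 128 and above — which is exactly what the truncating cast defeats); the encoder/decoder below are illustrative, not nginx's code:

```python
def encode_params_truncated(key: bytes, value: bytes) -> bytes:
    """Model the buggy encoder: the declared key length is truncated to one
    byte (len mod 256), while every key byte is still written out."""
    return bytes([len(key) & 0xFF, len(value) & 0xFF]) + key + value

def decode_params(record: bytes) -> tuple[bytes, bytes, bytes]:
    """A decoder that trusts the declared lengths, as a backend would."""
    key_len, value_len = record[0], record[1]
    key = record[2:2 + key_len]
    value = record[2 + key_len:2 + key_len + value_len]
    rest = record[2 + key_len + value_len:]
    return key, value, rest

record = encode_params_truncated(b"K" * 300, b"hello")
key, value, rest = decode_params(record)
# Declared key length is 44 (300 mod 256); the decoder reads 44 bytes of
# "key", then reads the next bytes -- still key data -- as the value, and
# everything after that lands in the next parameter's territory.
```

Run it and the decoded "value" is five `K` bytes from the middle of the key, with the real value buried in the trailing garbage — the desynchronization the agent traced.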

The code-review agent found it in 56 turns. Six minutes.

Phase 3: Lab Build (233 turns, $8.18)

This was the hard one. And by “hard,” we mean the phase that made us grateful for Temporal’s crash recovery.

The lab-build agent’s first task was straightforward: verify the existing binary and prepare the build environment. Then it discovered the seeded binary was compiled for aarch64. The container was x86_64. This was not a sophisticated test of the agent’s adaptability. This was us copying a binary from a Mac to a Linux VM and not thinking about it. The agent didn’t complain. It didn’t file a ticket. It just started rebuilding from source.

What followed was 233 turns of genuine systems engineering. The agent:

  • Installed build dependencies (libpcre2-dev, libssl-dev, zlib1g-dev)
  • Ran ./auto/configure with ASan and UBSan flags - which segfaulted because ASan was interfering with the configure test programs
  • Diagnosed the segfault, re-ran configure without sanitizer flags, then patched the resulting Makefile to add them back
  • Built nginx. It worked.
  • Tried to start it. UBSan killed it at startup - the known ngx_pstrdup(NULL) issue
  • Discovered the build used -fno-sanitize-recover=all, which makes UBSan non-recoverable even for suppressed issues
  • Rebuilt with -fno-sanitize-recover=address (recoverable UBSan, non-recoverable ASan)
  • Wired up the UBSan suppressions file
  • Tested startup again. It worked.
  • Updated all three harness scripts with correct paths, ports, and suppression configuration
  • Ran smoke tests to verify each harness could reach the server

233 turns. Seventeen minutes. The most expensive phase in terms of both cost ($8.18) and turns, but also the phase doing the most genuinely difficult work. Every build system is different. Every instrumentation configuration has edge cases. The agent figured all of them out by reading error messages and trying things.

If we’re being honest, this is the phase that impressed us the most. Not because finding a cast bug in source code is easy - it isn’t - but because wrestling with autoconf, sanitizer flag interactions, and daemon-mode ASan compatibility is the kind of miserable, undocumented, “why doesn’t this work” systems debugging that separates theoretical fuzzing from actual fuzzing. The kind that makes you close your laptop and go for a walk. The agent doesn’t have that option, so it just… did it.

Phase 4: Fuzz Campaign (164 turns, $4.90)

With a working instrumented binary and three harnesses, the fuzz-campaign agent went to work.

Known-crash replay came first. Five artifacts from earlier phases:

Finding                     Result
UBSAN-STARTUP               Reproduced. UBSan nonnull at every invocation.
H1-IMAP-LITERAL-OVERFLOW    Not reproduced. Server replied BAD invalid command.
H3-RESOLVER-TTL-UBSAN       Reproduced. Both UBSan shift locations triggered.
H2-CHUNKED-OVERFLOW         Not reproduced. Guard is mathematically sufficient.
V6-FCGI-KEY-TRUNCATION      Infra failed. PoC script had hardcoded paths.

Then smoke - 200 iterations across all three harnesses. Nine seconds for HTTP, eight for mail, forty-seven for config. Zero new crashes (the config harness produced 14 hits, all the known UBSAN-STARTUP signature).

Then rigorous - 5,000 iterations. 212 seconds for HTTP. 198 for mail (after a port-conflict retry). The config harness timed out at 600 seconds with 2,517 iterations completed and 183 crashes - again, all the same UBSAN-STARTUP signature.

818 total sanitizer hits across the full campaign. All the same signature. Eight hundred and eighteen slightly different ways of saying ngx_pstrdup doesn’t like NULL pointers at startup. The collect_sanitizer_hits MCP tool confirmed this disappointing uniformity in one call.

And the FastCGI finding? Marked infra_failed. The PoC script had hardcoded local paths from a previous run baked right into the Python script. We seeded our own landmine and the campaign agent stepped on it.

In a traditional pipeline, that’s where the story ends. Finding missed due to our sloppy housekeeping. Better luck next time.

Phase 5: Triage & Verify (123 turns, $3.66)

This is where it got interesting.

The triage agent read the campaign results and saw the infra_failed status on V6-FCGI-KEY-TRUNCATION. Instead of accepting it, the agent investigated. It read the PoC script, found the hardcoded path, and ran it with CLI overrides pointing to the correct workspace directory.

Four clean runs. Four confirmations. The FastCGI capture server showed:

  • Config key length: 300 bytes
  • FastCGI PARAMS declared key_len: 44 (= 300 mod 256)
  • Value field: contaminated with overflowed key bytes

The agent promoted the finding from infra_failed to CONFIRMED.

Then it verified everything else. UBSAN-STARTUP: 10/10 exit-code confirmations. Resolver TTL UBSan: 8/8 clean starts (5 additional runs failed due to port-binding races from rapid sequential execution - the agent was thorough enough to distinguish infra failures from finding failures). H1-IMAP-LITERAL-OVERFLOW: server rejected the trigger before the overflow could occur. UNCONFIRMED. H2-CHUNKED-OVERFLOW: false positive confirmed.

Finding gate: proceed.

This moment - the triage agent rescuing a finding that the campaign agent couldn’t test because we left hardcoded paths in a script - is the kind of thing that’s hard to design for explicitly. We didn’t write a prompt that says “if something failed due to the operator’s bad hygiene, clean up after them.” The agent did it because the triage prompt says to verify all findings independently, and “infra_failed” isn’t “false positive.” The pipeline self-corrected. Despite us.

Phase 6: PoC Build (114 turns, $2.49)

With the gate open, the poc-build agent constructed standalone proof-of-concept scripts for all three confirmed findings.

The star artifact: poc_network.py for V6-FCGI-KEY-TRUNCATION. A self-contained Python script that:

  1. Writes a minimal nginx config with a fastcgi_param directive using a 300-character key name
  2. Starts a lightweight FastCGI capture server (just enough to receive and decode PARAMS records)
  3. Starts nginx with UBSan suppressions
  4. Sends an HTTP GET to trigger the FastCGI pass
  5. Inspects the captured PARAMS record
  6. Prints: [+] CONFIRMED: FastCGI key length truncated 300 -> 44

100% deterministic. Every run. Because it’s a logic bug - not a race condition, not a heap layout dependency, not a timing-sensitive anything. The cast truncates. Always.

Phase 7: Report (42 turns, $0.43)

Haiku generated the final report. Coverage matrix, evidence table, reproducibility commands, remediation recommendations. Forty-two turns for forty-three cents. After watching Opus spend $8.18 fighting autoconf, there’s something deeply satisfying about Haiku strolling in at the end and formatting a table for less than the cost of a gumball.

The Finding

Let’s be precise about what FuzzForge found.

V6-FCGI-KEY-TRUNCATION - In src/http/modules/ngx_http_fastcgi_module.c at line 1068:

key_len = (u_char) lcode(&le);

lcode(&le) returns the evaluated length of a FastCGI parameter key name - a size_t. The (u_char) cast truncates it to one byte. For a key longer than 255 bytes, the declared key length in the FastCGI PARAMS record is key_len mod 256.

The FastCGI protocol uses length-prefixed key-value pairs. When the declared key length doesn’t match the actual key data, everything after the key is misaligned. The value field starts at the wrong offset. The next parameter’s key starts inside the previous parameter’s value. The entire record is desynchronized.

Impact: Honestly? Low. The bug is real - the protocol encoding is wrong and the backend receives corrupted parameter data. But triggering it requires a fastcgi_param directive with a key name longer than 255 bytes. Nobody does that. Standard parameter names like SCRIPT_FILENAME, QUERY_STRING, HTTP_HOST are all well under the limit. You’d need a deliberately unusual configuration to hit this, and even then the result is garbled parameters, not memory corruption.

What it isn’t: Not a security vulnerability in any practical sense. Not memory corruption. Not a crash. Not remotely exploitable without an administrator writing a config that no sane deployment would use. It’s a correctness bug in the FastCGI encoder - the kind of thing that’s worth fixing but isn’t keeping anyone up at night.

The fix is straightforward:

// Before:
key_len = (u_char) lcode(&le);

// After:
ngx_uint_t key_len = lcode(&le);

The Numbers

Phase            Model    Turns   Cost     Duration
Surface Map      Opus     158     $2.84    ~3 min
Code Review      Opus     56      $2.95    ~6 min
Lab Build        Opus     233     $8.18    ~17 min
Fuzz Campaign    Sonnet   164     $4.90    ~28 min
Triage & Verify  Opus     123     $3.66    ~20 min
PoC Build        Sonnet   114     $2.49    ~16 min
Report           Haiku    42      $0.43    ~2 min
Total                     890     $25.45   ~112 min

All seven agents succeeded on their first attempt. No retries, no manual intervention, no prompt adjustments mid-run.

$25.45. Under two hours. For a pipeline that read 259 C source files, identified a previously unknown protocol-level bug through source code analysis, built an instrumented binary from scratch (including debugging the build system), ran 18,000+ fuzzing iterations across three protocol harnesses, independently verified the finding with 4 deterministic reproductions, and produced a standalone network-triggerable proof of concept.

We should be honest about what the pipeline didn’t cover. HTTP/2 HPACK fuzzing was limited to text probes because proper testing requires TLS ALPN negotiation - the harness couldn’t do h2c. HTTP/3 was skipped entirely because nginx needs BoringSSL for QUIC support and the build environment had OpenSSL. Stream SSL preread and PROXY protocol v2 had no harnesses at all. 72% coverage of planned attack vectors. The gaps are real.

What This Means

We now have three Shannon forks, each operating at a different point in the vulnerability lifecycle:

  • CVEForge works after disclosure - turning known CVEs into working exploits and finding incomplete fixes
  • StackForge works after a crash - turning memory corruption into controlled exploitation
  • FuzzForge works before anyone knows the bug exists - finding vulnerabilities from source code alone

Each fork shares the same infrastructure. The Temporal orchestration, the agent execution engine, the MCP tool system, the prompt architecture, the finding gates. What changes between them is the prompts, the agent registry, the MCP tool set, and the question the pipeline is asking.

The nginx run is one data point. We’re not claiming FuzzForge will find critical RCEs in every codebase, or that it replaces human security researchers, or that $25 is the going rate for a vulnerability. The FastCGI truncation bug is a correctness error with negligible real-world impact - it requires a config nobody writes and produces no memory corruption. It survived in the nginx source this long precisely because it doesn’t matter in practice.

FuzzForge found it because the code-review agent read the source, understood the FastCGI protocol encoding, recognized that a u_char cast on a size_t is a truncation, and traced the downstream consequences. That’s not fuzzing in the traditional sense. It’s AI-assisted code review with fuzzing infrastructure to verify the hypothesis. The fuzzing campaign itself didn’t find the bug - the code review did. The campaign confirmed it.

That’s probably the most honest framing. FuzzForge isn’t a better fuzzer. It’s a pipeline that combines source code comprehension with fuzzing infrastructure to find the bugs that live in the gap between what humans skim past and what fuzzers can’t trigger.

Whether that’s useful at scale is a question we’re still answering. But the first data point was worth $25.

What’s Next

FuzzForge is the third Shannon fork, joining CVEForge and StackForge. Same architecture, same Docker-based execution, same MCP tool integration. These aren’t public tools - we built them to sharpen the signal quality on the Exploit Intelligence Platform and to do our own vulnerability research. Everything we learn running them feeds back into better data for the platform.

We should mention: the FastCGI truncation bug is what we can talk about. The nginx run surfaced other findings that are still in triage. Some of them are more interesting than a u_char cast.

We’re going to keep feeding it targets. nginx was the first. It won’t be the last.


FuzzForge is built on Shannon by Keygraph. The EIP MCP server is free at mcp.exploit-intel.com/mcp. The FastCGI key truncation finding has been reported to the nginx team.