exploit-forge
Fully autonomous, multidisciplinary exploit development driven by orchestrated AI agents.
Autonomous Exploit Research
Give it a target. Get a verified exploit - lab, code, and documentation included.
⏳ The Problem
A skilled security researcher takes days to weeks to analyze a vulnerability, build a test environment, write an exploit, and prove it works. The process is manual, expensive, and doesn't scale.
🤖 What exploit-forge Does
78 AI agents work together - researching the vulnerability, building Docker labs, writing exploit code, and verifying it against both vulnerable and patched targets. Fully autonomous.
✅ The Result
A complete, verified exploit package - working PoC, Docker lab, and technical writeup - delivered in minutes, not weeks. No human in the loop. Nine pipelines cover every domain.
The Scale of exploit-forge
A single TypeScript monorepo that consolidates 9 autonomous security research pipelines, 78 AI agents, and 18,000 lines of tool infrastructure into one unified platform.
Each pipeline deploys 6-13 autonomous Claude agents that execute specialized security research phases in sequence - from reconnaissance to verified exploit delivery. ShannonForge additionally runs its five vulnerability tracks in parallel.
The High-Skill Research Bottleneck
Exploit development is one of the most demanding disciplines in security. It requires deep expertise across multiple domains, tools, and architectures - and even the best researchers hit a ceiling.
🧩 Multidisciplinary Complexity
One CVE needs Ghidra and GDB. The next needs WinDbg on a remote VM. The next needs a Playwright browser session. Each demands a different skillset, different tools, different architectures - and hours of debugging to get right.
🦄 The Unicorn Problem
A researcher who can reverse engineer binaries, write ROP chains, pentest web apps, diff Windows patches, AND decompile Android APKs? They barely exist. And even they can only work on one thing at a time.
🌐 Every Stack, Every Framework
The target could be Node, ASP.NET, PHP, Flask, Spring, Rails - any stack. A human researcher might master two or three. The AI reads them all equally, pulling API docs, source code, and framework internals on the fly.
exploit-forge's answer: 78 specialized agents across 9 disciplines. No skill gaps. No context switching. No ceiling.
Domain-Specific Security Expertise
Each pipeline is purpose-built for a different security domain - from known CVEs to undiscovered zero-days. Same core engine, different specialized agents and tools.
CVE Advisory to Proof of Concept
Input a CVE ID. Get a verified exploit with Docker lab, PoC code, and documentation.
Intel
Query EIP, clone repo, identify versions
Analysis
Read fix diff, trace vuln path
Lab Build
Generate Dockerfile, build containers
PoC Dev
Write exploit, test on vulnerable target
Verify
Confirm patch blocks exploit
Report
Synthesize evidence into README
QA
Quality review of all deliverables
Binary Exploitation
Reverse engineering meets autonomous exploitation. Ghidra for static analysis, GDB for runtime debugging, pwntools for exploit development.
Intel
Fingerprint binary, enumerate inputs
Reverse
Map control + data flow
Sink
Analyze dangerous functions
Harness
Build test harness + lab
PoC
Execute, measure impact
Report
Document mechanics
Bypass
Attempt patch bypass
QA
Validate reliability
Zero-Day Research
Discovers unknown vulnerabilities through automated fuzz campaigns with coverage-guided mutation. 11 agents - the most complex pipeline.
🗺️ Surface Mapping
Identify attack surface, entry points, dangerous sinks. Build call graphs and trace data flow from inputs to sinks.
🔎 Variant Hunting
Search for similar code patterns and related CVEs. Find variants of known vulnerability classes in new locations.
🧪 Fuzz Campaigns
Run AFL++/libFuzzer with coverage feedback, ASAN/UBSAN sanitizers. Auto-triage crashes by exploitability.
Stack Overflow Exploitation
The only pipeline with built-in quality gates between stages. It must prove a crash before analyzing control, and must prove register control before attempting exploitation.
Research
Query EIP, identify vuln class and target
Env Setup
Configure build tools and dependencies
Lab Build
Compile target, build Docker lab
Crash PoC
Prove crash with ASAN or GDB evidence
Control
Map registers, calculate offsets
Exploit
ROP chains, shellcode, pwntools
Validate
Confirm exploit reliability
Report
Document exploit mechanics
Bypass
Attempt patch bypass
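The stage ordering rule for this pipeline can be sketched as a small precondition check. This is a minimal illustration, not the project's code: the stage names and the `StageEvidence` shape are hypothetical, and a real gate would presumably read evidence from deliverables rather than booleans.

```typescript
// Hypothetical names: GatedStage and StageEvidence are illustrative only.
type GatedStage = "crash_poc" | "control" | "exploit";

interface StageEvidence {
  crashProven: boolean;           // e.g. backed by ASAN or GDB output
  registerControlProven: boolean; // e.g. offsets mapped and confirmed
}

// A stage may run only when every earlier gate has been satisfied.
function canRun(stage: GatedStage, evidence: StageEvidence): boolean {
  switch (stage) {
    case "crash_poc":
      return true;
    case "control":
      return evidence.crashProven;
    case "exploit":
      return evidence.crashProven && evidence.registerControlProven;
  }
}
```

With this shape, a run that has a proven crash but no register control can enter the Control stage and is held back from the Exploit stage.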
Patch Tuesday Analysis
Every month Microsoft patches hundreds of binaries. WinForge diffs them automatically - finding what changed, why it matters, and whether it's exploitable.
Patch Intel
Identify target KB, acquire binaries
Diff Analysis
ghidriff binary diffing, function-level changes
Lab Setup
Provision QEMU/KVM Windows VM
PoC Dev
Write exploit targeting the diff
Verify
WinDbg remote debugging session
Report
Document patch analysis
Bypass
Attempt patch bypass
QA
Validate deliverables
Web Application Pentesting
The largest pipeline - 13 agents. Performs a full web application security assessment with 5 vulnerability types analyzed in parallel, each with its own Playwright browser agent.
Phase 1: Reconnaissance
Pre-Recon analyzes source code for attack surfaces. Recon performs full site mapping with Playwright. Two agents feed all five parallel tracks.
Phase 2: Vulnerability Analysis
Five agents run in parallel - SQLi, XSS, Authentication Bypass, SSRF, and Authorization Bypass. Each gets its own browser instance.
Phase 3: Exploitation
Only vuln types with confirmed findings proceed to exploitation. Each exploit agent gets a dedicated Playwright session to prove the finding with evidence.
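The fan-out and filter between Phase 2 and Phase 3 might look roughly like the sketch below. All names here (`VulnType`, `Finding`, `Analyzer`, `runAnalysisPhase`, `selectForExploitation`) are hypothetical, and the analyzer is a stand-in for an agent that would drive its own browser session.

```typescript
// Hypothetical types and function names; not taken from the exploit-forge codebase.
type VulnType = "sqli" | "xss" | "authn" | "ssrf" | "authz";

interface Finding {
  type: VulnType;
  confirmed: boolean;
  evidence: string[];
}

const TRACKS: VulnType[] = ["sqli", "xss", "authn", "ssrf", "authz"];

// Stand-in for one Phase 2 analysis agent.
type Analyzer = (type: VulnType) => Promise<Finding>;

// Run all five tracks concurrently; a failing track yields null
// instead of aborting the other four.
async function runAnalysisPhase(analyze: Analyzer): Promise<Finding[]> {
  const settled = await Promise.all(
    TRACKS.map((track) => analyze(track).catch(() => null)),
  );
  return settled.filter((f): f is Finding => f !== null);
}

// Only confirmed findings that carry evidence move on to exploitation.
function selectForExploitation(findings: Finding[]): VulnType[] {
  return findings
    .filter((f) => f.confirmed && f.evidence.length > 0)
    .map((f) => f.type);
}
```

Catching errors per track is a design assumption: it keeps one broken track from discarding the results of the others.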
WordPress Security Research
Purpose-built for WordPress plugin and theme vulnerabilities. Audits source code, maps hook and filter dependencies, builds a local WordPress lab, and verifies findings end-to-end.
Intake
Identify target plugin/theme, fetch source
Dep Map
Map hooks, filters, AJAX handlers, REST routes
Code Audit
Security review - sinks, guards, callbacks
Lab Build
Docker WordPress + MySQL + target plugin
Exploit
Verify findings against live WP instance
Report
Full writeup with reproduction steps
Android Application Assessment
Takes an APK file and tears it apart - decompiling with APKTool and JADX, hunting for hardcoded secrets, mapping API endpoints, and validating findings against live infrastructure.
Intake
Fetch APK, validate target
Unpack
APKTool + JADX decompilation
Secrets
Scan for keys, tokens, credentials
Endpoints
Map API calls, URLs, backends
Validate
Test findings against live targets
Lab Build
Docker test environment
Exploit
Verify exploitability
Report
Document all findings
Exploit QA and Release Prep
The pipeline that makes everything publishable. Takes raw exploit output from other forges - scrubs internal paths, fixes Docker configs, normalizes documentation, and verifies reproducibility.
Docker Fix
Repair and normalize Dockerfiles
Scrub
Remove all internal paths and IPs
Generate
Create standardized README and docs
Verify
Build and test from clean checkout
Audit Fix
Fix issues found in verification
QA
Quality gate check
QA Fix
Address QA findings
Final QA
Final release approval
Modular Exploitation Framework
Write the engine once. Plug in a new security domain with ~2,000 lines. Every pipeline inherits retry logic, cost tracking, audit trails, and MCP tooling for free.
Execution Flow
One command triggers a cascade - Docker containers spin up, Temporal orchestrates the workflow, Claude agents execute sequentially, and verified deliverables land in the workspace.
Temporal Workflow Engine
Every pipeline run is a deterministic Temporal workflow - fault-tolerant, resumable, and fully auditable.
🔄 Retry Presets
Production: 5min → 30min backoff, 50 max attempts
Testing: 10s → 30s, 5 attempts
Subscription: 5min → 6hr, 100 attempts
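The three presets can be written down as plain configuration. The sketch below assumes each arrow means "initial retry interval → maximum retry interval" and that delays double between attempts; the source states neither, and the field names are hypothetical (they loosely follow the concepts in Temporal's retry policy).

```typescript
// Hypothetical config shape. backoffCoefficient = 2 is an assumption.
interface RetryPreset {
  initialIntervalMs: number;
  maximumIntervalMs: number;
  maximumAttempts: number;
  backoffCoefficient: number;
}

type PresetName = "production" | "testing" | "subscription";

const SECOND = 1000;
const MINUTE = 60 * SECOND;
const HOUR = 60 * MINUTE;

const RETRY_PRESETS: Record<PresetName, RetryPreset> = {
  production: { initialIntervalMs: 5 * MINUTE, maximumIntervalMs: 30 * MINUTE, maximumAttempts: 50, backoffCoefficient: 2 },
  testing: { initialIntervalMs: 10 * SECOND, maximumIntervalMs: 30 * SECOND, maximumAttempts: 5, backoffCoefficient: 2 },
  subscription: { initialIntervalMs: 5 * MINUTE, maximumIntervalMs: 6 * HOUR, maximumAttempts: 100, backoffCoefficient: 2 },
};

// Delay before the n-th retry (1-based): exponential growth, capped at the maximum.
function retryDelayMs(preset: RetryPreset, retryNumber: number): number {
  const raw = preset.initialIntervalMs * Math.pow(preset.backoffCoefficient, retryNumber - 1);
  return Math.min(raw, preset.maximumIntervalMs);
}

// True while the attempt budget (counting the first attempt) is not exhausted.
function shouldRetry(preset: RetryPreset, attemptsSoFar: number): boolean {
  return attemptsSoFar < preset.maximumAttempts;
}
```

Under these assumptions the production preset waits 5, 10, and 20 minutes before its first three retries and 30 minutes for every retry after that.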
📊 State Tracking
Current phase, completed agents, per-agent cost/turns/duration, advisory warnings, and human review flags - all persisted in workflow state.
⏸️ Resume Support
Workflows pause and resume from the last checkpoint. Workspaces persist by default - use CLEAN=true for fresh runs.
Agent Turn Execution
Each agent is a multi-turn Claude conversation augmented with specialized tools. The executor handles retries, cost tracking, and model selection - so agents focus on the security research.
Claude Executor
Manages multi-turn Claude conversations. Handles MCP tool dispatch, streaming responses, retry logic with exponential backoff, spending cap detection, and API error recovery.
Prompt Manager
Template interpolation with {{CVE_ID}}, {{WORKSPACE_PATH}} variables. Include directives for shared partials. 177 prompt template files across all pipelines.
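As an illustration of the template step, here is a minimal sketch of `{{NAME}}` substitution with shared partials. The `@include(name)` directive syntax is invented for this example, since the source does not show how include directives are written, and partials are held in memory rather than loaded from template files.

```typescript
// Hypothetical helper names; the "@include(...)" syntax is an assumption.
type Vars = Record<string, string>;

// Replace every {{NAME}} with its value; fail loudly on unknown names
// so a prompt never ships with an unfilled placeholder.
function interpolate(template: string, vars: Vars): string {
  return template.replace(/\{\{\s*([A-Z0-9_]+)\s*\}\}/g, (_match, name: string) => {
    const value = vars[name];
    if (value === undefined) {
      throw new Error("Missing template variable: " + name);
    }
    return value;
  });
}

// Expand include directives recursively, with a depth limit to stop cycles.
function expandIncludes(template: string, partials: Vars, depth: number = 0): string {
  if (depth > 8) {
    throw new Error("Include depth exceeded (possible cycle)");
  }
  return template.replace(/@include\(([\w./-]+)\)/g, (_match, name: string) => {
    const partial = partials[name];
    if (partial === undefined) {
      throw new Error("Unknown partial: " + name);
    }
    return expandIncludes(partial, partials, depth + 1);
  });
}

// Includes are expanded first so variables inside partials are filled too.
function render(template: string, vars: Vars, partials: Vars): string {
  return interpolate(expandIncludes(template, partials), vars);
}
```

Throwing on a missing variable is a design choice for this sketch; a production prompt manager could equally log a warning and continue.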
Multi-Tier Models
Small (Haiku) for lightweight tasks, Medium (Sonnet) for standard agent work, Large (Opus) for complex reasoning. Per-agent model selection optimizes cost vs. capability.
Metrics & Billing
Per-agent tracking of turns used, USD cost, and wall-clock duration. Aggregated into workflow summary. Billing cap detection auto-pauses if spending limits are hit.
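Per-agent metrics and the workflow summary could be modeled along these lines. The interfaces and the `summarize` function are hypothetical, and the spending cap is treated here as a simple local budget; the real billing cap detection may work differently (for example by reacting to API billing errors).

```typescript
// Hypothetical shapes: none of these names come from the exploit-forge codebase.
type ModelTier = "small" | "medium" | "large";

interface AgentMetrics {
  agent: string;
  tier: ModelTier;
  turns: number;
  costUsd: number;
  durationMs: number;
}

interface WorkflowSummary {
  totalTurns: number;
  totalCostUsd: number;
  totalDurationMs: number;
  costByTier: Record<ModelTier, number>;
  shouldPause: boolean;
}

// Aggregate per-agent numbers and flag the run once the cap is reached.
function summarize(metrics: AgentMetrics[], spendingCapUsd: number): WorkflowSummary {
  const summary: WorkflowSummary = {
    totalTurns: 0,
    totalCostUsd: 0,
    totalDurationMs: 0,
    costByTier: { small: 0, medium: 0, large: 0 },
    shouldPause: false,
  };
  for (const m of metrics) {
    summary.totalTurns += m.turns;
    summary.totalCostUsd += m.costUsd;
    summary.totalDurationMs += m.durationMs;
    summary.costByTier[m.tier] += m.costUsd;
  }
  summary.shouldPause = summary.totalCostUsd >= spendingCapUsd;
  return summary;
}
```

Breaking cost out by tier makes it easy to see whether moving an agent from the large tier to the medium tier would materially change the cost of a run.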
Deliverables & Evidence
Every agent produces a predefined set of deliverables - markdown reports and JSON decision gates that downstream agents read to decide whether to proceed.
📄 Markdown Reports
- Intel brief
- Vulnerability analysis
- Lab build report
- PoC verification report
- Final README
🚦 JSON Gates
- Exploitability gate (proceed/block)
- Runtime gate (env ready?)
- Finding gate (findings sufficient?)
- Confidence scores + blockers
📦 Workspace Layout
- source/ - target code
- lab/ - Docker builds
- deliverables/ - reports + PoCs
- artifacts/ - evidence
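A downstream agent's "read the gate, then decide" step might be sketched as follows. The field names (`gate`, `decision`, `confidence`, `blockers`) and the 0.5 default threshold are assumptions for illustration; the source only says gates carry a proceed/block decision, confidence scores, and blockers.

```typescript
// Hypothetical gate schema; the real JSON layout is not shown in the source.
type GateKind = "exploitability" | "runtime" | "finding";
type GateDecision = "proceed" | "block";

interface Gate {
  gate: GateKind;
  decision: GateDecision;
  confidence: number; // assumed to be in the range 0..1
  blockers: string[];
}

// Parse and validate a gate file's contents before trusting it.
function parseGate(raw: string): Gate {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Gate must be a JSON object");
  }
  const g = data as Record<string, unknown>;
  const kind = g["gate"];
  const decision = g["decision"];
  const confidence = g["confidence"];
  const blockers = g["blockers"];
  if (kind !== "exploitability" && kind !== "runtime" && kind !== "finding") {
    throw new Error("Unknown gate kind");
  }
  if (decision !== "proceed" && decision !== "block") {
    throw new Error("Decision must be proceed or block");
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new Error("Confidence must be a number between 0 and 1");
  }
  if (!Array.isArray(blockers) || !blockers.every((b) => typeof b === "string")) {
    throw new Error("Blockers must be an array of strings");
  }
  return {
    gate: kind as GateKind,
    decision: decision as GateDecision,
    confidence,
    blockers: blockers as string[],
  };
}

// Proceed only on an explicit "proceed" with no blockers and enough confidence.
function shouldProceed(gate: Gate, minConfidence: number = 0.5): boolean {
  return gate.decision === "proceed" && gate.blockers.length === 0 && gate.confidence >= minConfidence;
}
```

Validating the file strictly matters here because the gate is written by one agent and consumed by another; a malformed gate should stop the pipeline rather than be read as "proceed".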
MCP Tool Infrastructure
18,181 lines of MCP server code. 10 in-repo servers plus external integrations - all via stdio transport.
Shared Tools
- save_deliverable
- write_file + verification
- run_in_repo
- run_exploit_test
- exploit_claim_gate
Pipeline-Specific
- WinForge: 18 tools (WinDbg, VM SSH, ghidriff)
- APKForge: 14 tools (APKTool, JADX, secrets)
- BinForge: 8 tools (GDB, checksec, crash)
- ZeroForge: 8 tools (fuzz, sanitizer, triage)
External MCPs
- EIP server (vuln intelligence)
- Playwright (5 browser agents)
- Ghidra + GDB agents
- SharkMCP (packet capture)
Docker Image Hierarchy
11 Dockerfiles organized as a hierarchy. Shared base image with pipeline-specific layers adding domain tools.
base.Dockerfile
Ubuntu 22.04 + Node.js 22 + Docker CLI + build tools, git, ripgrep, jq, cmake, Python 3
base-jvm.Dockerfile
Base + Java 21 + Ghidra headless - for BinForge and StackForge reverse engineering
Pipeline Layers
- CVEForge: GDB, strace, pwntools, checksec
- BinForge: Ghidra, GDB, ROPgadget, capstone
- ZeroForge: AFL++, libFuzzer, ASAN/UBSAN
- WinForge: ghidriff, QEMU guest tools
New Domains
- ShannonForge: Playwright, Chromium
- APKForge: APKTool, JADX, aapt
- WPForge: WordPress CLI, PHP, semgrep
- LabForge: Docker-in-Docker for cleanup
The ./forge CLI
One unified command interface for all 9 pipelines. Dispatches to per-pipeline bash scripts that handle Docker Compose orchestration.
Commands
- start - Launch a pipeline workflow
- stop - Stop containers
- logs - Tail live workflow output
- status - Structured run status (--json)
- workspaces - List workspace dirs
Examples
- ./forge cveforge start CVE=CVE-2026-28296
- ./forge binforge start TARGET=/path/to/binary
- ./forge shannonforge start URL=https://target.com
- ./forge apkforge start APK=/path/to/app.apk
- ./forge zeroforge start TARGET_REPO=git://...
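The argument shape used in these examples (`<pipeline> <command> KEY=VALUE ...`) can be sketched as a small parser. This is illustrative only: the actual CLI dispatches to bash scripts, the `parseForgeArgs` function is hypothetical, and it assumes every command takes a pipeline name first, which the source shows only for `start`.

```typescript
// Hypothetical parser; the real ./forge CLI is implemented as bash scripts.
const COMMANDS = ["start", "stop", "logs", "status", "workspaces"] as const;
type Command = (typeof COMMANDS)[number];

interface ForgeInvocation {
  pipeline: string;
  command: Command;
  params: Record<string, string>;
  flags: string[];
}

function parseForgeArgs(argv: string[]): ForgeInvocation {
  const [pipeline, command, ...rest] = argv;
  if (!pipeline || !command) {
    throw new Error("usage: ./forge <pipeline> <command> [KEY=VALUE ...]");
  }
  if ((COMMANDS as readonly string[]).indexOf(command) === -1) {
    throw new Error("unknown command: " + command);
  }
  const params: Record<string, string> = {};
  const flags: string[] = [];
  for (const arg of rest) {
    if (arg.indexOf("--") === 0) {
      flags.push(arg);
      continue;
    }
    // Split on the first "=" only, so values such as URLs keep their own "=".
    const eq = arg.indexOf("=");
    if (eq <= 0) {
      throw new Error("expected KEY=VALUE, got: " + arg);
    }
    params[arg.slice(0, eq)] = arg.slice(eq + 1);
  }
  return { pipeline, command: command as Command, params, flags };
}
```

For example, `parseForgeArgs(["cveforge", "start", "CVE=CVE-2026-28296"])` yields a `params` object with a single `CVE` key.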
ShannonForge: 5 Parallel Tracks
The only pipeline with true parallel agent execution - 5 vulnerability types analyzed simultaneously, each with a dedicated Playwright browser agent.
Production Deployment
Two dedicated servers - EU and US - run pipelines around the clock. Temporal manages state, Docker isolates execution, and daily encrypted backups protect everything.
Automated Provisioning
Two-phase provisioning - install.sh (13 stages, 30-60 min) + setup.sh (6 stages, 10-15 min). From bare metal to fully operational in under 2 hours.
install.sh - 53 KB, 13 stages
- Base packages + Docker CE + Node.js 22
- Java 21 + Ghidra + ghidriff
- QEMU/KVM + libvirt (WinForge VMs)
- Clone 15+ repos + build all MCP servers
- Temporal + PostgreSQL + Docker images
- Windows VM provisioning (ISOs + QCOW2)
setup.sh - 42 KB, 6 stages
- System tools + Homebrew
- Systemd services (Paperclip, forge-browser)
- MCP infrastructure wiring
- B2 backup system + GPG keys
- AI tool settings (Claude Code, Codex)
- Shell config + environment
The Exploit Claim Gate
A critical MCP tool that prevents agents from lying about exploit success. Compares the claimed impact against actual evidence and downgrades or blocks inflated claims.
Claimed Impact
Agent claims "RCE achieved" or "authentication bypass confirmed" based on its analysis and PoC execution results.
Evidence Check
Gate compares claim against actual exploit test output - exit codes, signals, stdout/stderr, timeout state. Does the evidence support the claim?
Verdict
Approved - evidence matches claim.
Downgraded - partial evidence, reduced severity.
Blocked - no supporting evidence.
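The three verdicts can be illustrated with a small decision function over the test output fields named above. This is a sketch under stated assumptions, not the gate's real logic: the `expectedMarker` idea (a string the lab plants and a successful run is expected to print) and every type and function name are hypothetical.

```typescript
// Hypothetical types; the real exploit_claim_gate inputs are not shown in the source.
type Verdict = "approved" | "downgraded" | "blocked";

interface ExploitTestResult {
  exitCode: number | null;
  signal: string | null; // e.g. "SIGSEGV" when the target crashed
  stdout: string;
  stderr: string;
  timedOut: boolean;
}

interface Claim {
  impact: string;         // e.g. "RCE achieved"
  expectedMarker: string; // assumption: proof string a successful run must print
}

function evaluateClaim(claim: Claim, result: ExploitTestResult): Verdict {
  // A run that never finished proves nothing.
  if (result.timedOut) {
    return "blocked";
  }
  const markerSeen = claim.expectedMarker.length > 0 && result.stdout.indexOf(claim.expectedMarker) !== -1;
  // Full evidence: clean exit and the proof marker in the output.
  if (markerSeen && result.exitCode === 0) {
    return "approved";
  }
  // Partial evidence: the marker with a failing exit, or only a crash signal.
  if (markerSeen || result.signal !== null) {
    return "downgraded";
  }
  return "blocked";
}
```

The key property is that the verdict depends only on recorded test output, never on the agent's own description of what happened.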
exploit-intel.com
The public face of the Exploit Intelligence Platform - vulnerability search, exploit rankings, stats dashboards, labs, and an MCP server for AI integration.
Forge Archive Browser
Every run, every agent turn, every dollar spent - searchable and browsable. Full visibility into what the AI agents are doing and thinking.
Connected Systems
exploit-forge doesn't operate in isolation - it's wired into the entire Exploit Intel ecosystem. Vulnerability intelligence feeds in, results flow out to the company workflow, and the founder monitors everything from Telegram.
🔍 EIP Server
Query vulnerability intelligence - CVE details, exploit rankings, EPSS scores. Powers the intel phase of every CVE-based pipeline.
📋 Paperclip AI
Company workflow orchestration. CEO and ForgeRunner agents query pipeline status, trigger runs, and monitor progress via forge-ops MCP.
📱 Telegram
Real-time notifications for pipeline phase transitions and completions. The founder monitors runs from their phone.
🐙 GitHub
Clone target repositories, checkout specific commits/tags. SSH key authentication for private repos.
🗄️ PostgreSQL
forge_archive database stores audit logs, agent metrics, and deliverable metadata. Ingested via scripts for historical analysis.
☁️ Backblaze B2
Daily encrypted backups of forge data, audit logs, and workspace artifacts. GPG-encrypted at rest.
By the Numbers
The scale of what a small team can build when AI agents do the heavy lifting - from a $2 CVEForge run to a $40 zero-day research campaign.
CVEForge run cost
ZeroForge run cost
Max turns per agent
The Full Stack
From TypeScript and Temporal at the core to Ghidra, GDB, and AFL++ at the edges - every tool a security researcher needs, orchestrated by AI.
TypeScript
ES2022 strict mode
Temporal.io
v1.15 workflows
Claude SDK
v0.2.38 agent runtime
Docker
11 images · Compose
MCP
18K LOC · 10 servers
Ghidra
Headless analysis
GDB
Runtime debugging
AFL++
Coverage fuzzing
Real-World Impact
exploit-forge doesn't just generate reports - it finds real vulnerabilities and produces verified, reproducible exploit code.
Highlight Reel
A selection of findings - from Windows kernel exploits to vendor patch bypasses completed in 75 minutes or less.
Windows Kernel Use-After-Free
0-day
IOCP race condition in the Windows kernel. WinForge identified the vulnerability through binary diffing of Patch Tuesday updates, then built a working exploit on a QEMU/KVM Windows VM.
CVE-2026-24289
V8 JIT Type Confusion
Patch Bypass
Type confusion in Chrome's Maglev JIT compiler. After the vendor patched the original finding, exploit-forge's bypass agent found a way around the fix - in 75 minutes.
CVE-2026-3910
systemd Privilege Escalation
Patch Bypass
Local privilege escalation via D-Bus in systemd-machined. The vendor's patch was incomplete - exploit-forge identified and bypassed it in 72 minutes, autonomously.
CVE-2026-4105
nginx Protocol Desync
0-day
Previously unknown FastCGI protocol desynchronization bugs in nginx, discovered by ZeroForge through automated coverage-guided fuzzing with AFL++ and sanitizer feedback.
ZeroForge Discovery
What Makes exploit-forge Special
Four architectural decisions that make the difference between a proof-of-concept and a production system.
Autonomous Security Research
78 AI agents across 9 specialized pipelines - from CVE advisory to zero-day discovery - orchestrated by Temporal, powered by Claude, delivering verified exploit research at scale.
One CLI. One monorepo. Autonomous exploit research.