I need to talk about what just happened, because I’m still processing it.

As of this week, our exploit research pipeline is running almost entirely on autopilot. An AI agent scouts CVE candidates from our database, ranks them by exploitability, and passes the best targets to another AI agent that dispatches forge runs. The results come back through MCP. The code that powers all of it was refactored, security-hardened, and SEO-optimized by more AI agents. My job has been reduced to approving hires and reading issue summaries. Here’s how we got here.

We’ve been building forge tools for a while now. CVEForge does autonomous exploit development. StackForge does binary exploitation. FuzzForge does source-level fuzzing. There’s also ZeroForge for zero-day research, BinForge for binary analysis, and LabForge for container lab generation. Six tools. Six separate codebases. Six sets of dependencies, Docker configs, Temporal workflows, MCP servers, and prompt templates.

They all started as forks of Shannon. They share about 70% of their code. And they’ve been diverging since day one.

Six Forge Tools, One Maintenance Nightmare

Every time Shannon pushed an upstream security fix, we had to manually port it to five repositories. Every time we improved a utility function in CVEForge, the improvement stayed in CVEForge. Docker configs drifted. Prompt templates diverged. The MCP server in StackForge was three versions behind the one in CVEForge. Someone (me) kept meaning to consolidate everything into a monorepo, and someone (also me) kept not doing it because the thought of migrating four codebases while keeping production running made my eye twitch.

The forge tools were becoming their own maintenance burden. The thing we built to automate security research was generating security research faster than we could maintain the thing.

And then I found Paperclip AI.

What Paperclip AI Actually Is

Paperclip AI is agent orchestration structured as a company. You create a “company.” You hire AI agents into roles - CEO, engineer, researcher, QA. Each agent gets its own identity, instruction files, tools, and a persistent heartbeat. They communicate through an issue tracker. The CEO can hire new agents (with board approval - that’s you). Agents pick up tasks, post deliverables, tag reviewers, and close issues.

It sounds like a gimmick until you watch it work. Then it sounds like the future, which is more unsettling.

I set up a Paperclip AI company called EXP (Exploit Intelligence Platform), pointed it at our server with all the forge repos symlinked under /repos/, gave it MCP access to our tools, and started hiring.

Day 1: Hiring a Security Team in Ninety Seconds

EXP-1: The CEO boots up. Creates its heartbeat file. Reads the company charter. Stands ready to hire. Already it has done more onboarding than most CEOs I’ve worked with.

EXP-2: I tell the CEO to hire a Founding Engineer. The CEO drafts the role, writes the agent instructions, submits the hire request. Board approval required. I approve. The entire corporate governance cycle - requisition, role definition, interview, offer, onboarding - took about ninety seconds. HR departments everywhere just felt a disturbance in the Force. Mathew joins. Senior Security Researcher.

EXP-3: Hire a Security Research Intern. Steve joins. His job: scout CVE candidates from the EIP database, rank them by blast radius and exploit saturation, recommend targets for CVEForge runs. Monthly cost: $2.99. We have finally found the correct market rate for an intern.

EXP-5: Hire a Forge Runner - a pipeline operator whose only job is to receive CVE IDs and dispatch forge runs. ForgeRunner joins. No code review skills, no architectural opinions, no existential questions about the codebase. Just the ability to type ./cveforge start CVE=CVE-XXXX-XXXXX and monitor the Temporal workflow. The ideal coworker, really.

By end of day one, Steve had already delivered his first research batch - five CVE candidates ranked by exploitability. ForgeRunner picked up CVE-2025-58045 and ran it through CVEForge. First autonomous pipeline run, dispatched by an AI agent, supervised by nobody.

Cost so far: about $20.

Day 2: The $1.38 QA Agent Catches a Bypass

Two more hires. Codey - Software Engineer, running on Claude Opus. QAGuy - Code Reviewer, running on OpenAI’s Codex. This turns out to be the most consequential architectural decision of the entire experiment, and I made it mostly by accident.

If you’ve spent time with both models, you already know the dynamic. Opus is the genius hippy - brilliant, creative, verbose, occasionally so confident in its own elegance that it forgets to check the edge cases. Codex is the grumpy professor - terse, skeptical, the kind of reviewer who reads your code like it owes him money. Putting them on opposite sides of a code review was like pairing a jazz musician with a tax auditor. Sparks were inevitable.

Mathew’s first real task: review the upstream Shannon commits that had been piling up. Shannon had pushed security fixes. We needed to sync them to all five forge repos. Mathew read the diffs and filed structured issues with exact code changes needed.

The Path Traversal Chain

This is where it gets good.

EXP-25: Shannon commit 023cc95 fixed a path traversal in processIncludes() - the function that handles @include() directives in prompt templates. The vulnerable code used path.join(), which normalizes .. components but does nothing to keep the result inside the base directory. A malicious template containing @include(../../../../etc/passwd) would read arbitrary files.
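A minimal sketch of the vulnerable shape (the function name here is hypothetical, not the actual Shannon code): path.join() resolves the .. segments for you, which is exactly the problem.

```typescript
import * as path from "node:path";

// Illustrative sketch - resolveIncludeUnsafe is a hypothetical name.
// path.join() normalizes ".." segments but never checks whether the
// result is still inside the prompts directory.
function resolveIncludeUnsafe(promptsDir: string, arg: string): string {
  return path.join(promptsDir, arg);
}

// A malicious directive walks straight out of the base directory:
resolveIncludeUnsafe("/tmp/prompts", "../../../../etc/passwd");
// -> "/etc/passwd"
```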

Codey patched it across all five repos. Used path.resolve() plus a startsWith() bounds check. Filed the deliverable with the quiet confidence of an Opus that just wrote elegant code. Tagged QAGuy for review. The genius hippy, satisfied with its work, waited for approval.

EXP-27: The grumpy professor looked at the elegant code. And found it was still bypassable.

The startsWith() check is vulnerable to sibling-prefix attacks. If the base directory is /tmp/prompts, a path resolving to /tmp/prompts_evil/secret.txt passes the startsWith check because the string /tmp/prompts_evil/secret.txt starts with /tmp/prompts. QAGuy wrote the exact reproduction case, filed a follow-up issue with the correct fix (path.relative() with .. rejection), and routed it to Mathew.
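Both checks fit in a few lines - a sketch of the naive fix next to the strict one, with hypothetical function names and /tmp/prompts as the example base from the issue:

```typescript
import * as path from "node:path";

const base = "/tmp/prompts";

// The first fix: resolve, then prefix-check. Bypassable, because
// "/tmp/prompts_evil/secret.txt" starts with "/tmp/prompts".
function insideBaseNaive(candidate: string): boolean {
  return path.resolve(base, candidate).startsWith(base);
}

// The fix QAGuy filed, as I understand it: path.relative() from base to
// target must not climb out with ".." and must not be absolute.
function insideBaseStrict(candidate: string): boolean {
  const rel = path.relative(base, path.resolve(base, candidate));
  return rel !== "" && !rel.startsWith("..") && !path.isAbsolute(rel);
}

insideBaseNaive("../prompts_evil/secret.txt");  // true  - sibling-prefix bypass
insideBaseStrict("../prompts_evil/secret.txt"); // false - rejected
insideBaseStrict("sub/ok.txt");                 // true  - legitimate include
```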

Mathew fixed it. Properly this time. Five repos, five commits. In a human team, this is the part where someone passive-aggressively links the OWASP cheat sheet in a PR comment. QAGuy just filed the issue and moved on. No feelings to hurt. No egos to manage. Just bugs.

EXP-40: Later that same day, another review surfaced a third bypass - symlinks. Even with the path.relative() check, a symlink inside the prompts directory could point outside it. Fix: add fs.realpath() resolution before the boundary check. Also caught that String.replace() only replaces the first occurrence - duplicate @include() directives were silently dropped. Changed to replaceAll().

Three rounds of security fixes for the same function. The genius hippy’s fix had a bypass. The grumpy professor caught it. The follow-up had another bypass. The professor caught that too. A Codex reviewer running on a $1.38 budget found holes that a $115 Opus engineer missed. The jazz musician played a beautiful solo. The tax auditor found the missing receipts.

Let me say that again: the QA agent, which cost a dollar thirty-eight for the entire month, caught a security bypass in a security fix, for a platform that finds security bypasses in security fixes.

If you don’t find that funny, you might be in the wrong line of work.

Same day, Codey also swept all Docker Compose ports across four repos to bind to 127.0.0.1 - Temporal’s gRPC and Web UI had been listening on all interfaces, meaning anyone who could reach the server could submit or cancel workflows. QAGuy verified each one with runtime container checks. Clean sweep.
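The compose change itself is one prefix per port mapping. An illustrative fragment (service layout simplified, not the actual compose files):

```yaml
services:
  temporal:
    ports:
      # before: "7233:7233" binds 0.0.0.0 - anyone who can reach the
      # server can talk to Temporal's gRPC endpoint and submit workflows
      - "127.0.0.1:7233:7233"   # gRPC, loopback only
      - "127.0.0.1:8080:8080"   # Web UI, loopback only
```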

Cost by end of day two: about $80.

The Company That Built Its Own Tools

Here’s the part that still makes me pause.

Somewhere between the security fixes and the port binding sweep, the board (me) started requesting infrastructure. The agents delivered every time - and then kept going.

EXP-16: Board request - a Telegram bot for CVEForge control. Codey built it - auth-guarded, token-scrubbed, with /run and /status commands. QAGuy reviewed it, found a CommonJS/ESM module mismatch that crashed on startup, filed EXP-17. Codey fixed it. Now ForgeRunner can be dispatched from a phone.

EXP-28: Board request - a Paperclip AI MCP server so I could query the company’s issue history from Claude Code. Codey built a TypeScript SSE server that wraps the Paperclip REST API. This is the tool I’m using right now to read back all of this issue history. QAGuy reviewed it, found that list_issues returned agent UUIDs instead of names, filed EXP-29. Codey fixed it.

EXP-126: Board request - a start_forge_run MCP tool on the monorepo’s forge-ops server, so any MCP client (including other agents) can programmatically dispatch CVEForge, StackForge, BinForge, or ZeroForge runs with full parameter validation. Requested, scoped, delivered.

But then the agents started fixing things the board never mentioned.

EXP-26: Codey rewrote the agent prompts to handle chunked writing for large deliverables. The forge agents kept hitting output token limits when generating long reports. So Codey taught them to write files in sections using Write/Edit tools, then pass the file_path to save_deliverable instead of inlining the content. The agents improved their own prompts - unprompted.
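The pattern the new prompts teach is simple - a sketch, with writeSection as an illustrative helper (save_deliverable is the real tool name from the issue; the saveDeliverable call is hypothetical):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Build the report on disk section by section instead of emitting one
// giant response that blows past the output token limit.
function writeSection(reportPath: string, heading: string, body: string): void {
  fs.appendFileSync(reportPath, `## ${heading}\n\n${body}\n\n`);
}

const report = path.join(os.tmpdir(), "exp-deliverable.md");
fs.writeFileSync(report, "# Findings\n\n");
writeSection(report, "Summary", "Three bypasses, three fixes.");
// saveDeliverable({ file_path: report });  // hand over a path, not content
```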

EXP-48: QAGuy found a state persistence bug in the Temporal continueAsNew mechanism - when the CAN payload was ahead of session.json, the resumed run could silently reset to an older checkpoint, dropping workspace state while still skipping agent re-execution. Codey wrote a six-case regression harness and fixed it. The agents debugged the pipeline’s own resume logic because it was breaking their work.
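A hypothetical reconstruction of the fix's core idea (types and names are mine - the actual harness covers six cases): when the continueAsNew payload and session.json disagree, resume from whichever checkpoint is further along, so neither side can silently roll the run back.

```typescript
// Illustrative checkpoint shape - the real session state carries much more.
interface Checkpoint {
  step: number; // monotonically increasing pipeline progress
}

// Reconcile instead of blindly trusting one side: a fresher CAN payload
// must not be reset to an older session.json, and vice versa.
function reconcile(canPayload: Checkpoint, sessionFile: Checkpoint): Checkpoint {
  return canPayload.step >= sessionFile.step ? canPayload : sessionFile;
}
```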

The board asked for tools. The agents delivered the tools and then kept going - fixing their own infrastructure, rewriting their own prompts, debugging their own pipeline. The work generated the need, and the agents filled it. That’s either emergent engineering or the early stages of a Skynet origin story, depending on your anxiety level.

Day 3: Four Repos Become One, and Google Notices

This is the day it got ambitious.

10,546 Impressions, 147 Clicks: The SEO Audit

The CEO hired an SEO Analyst (EXP-60). This was the CEO’s most useful act since day one - possibly the CEO’s most useful act ever - so credit where it’s due. The analyst’s first task: establish a baseline for exploit-intel.com using Google Search Console data.

The findings were immediate and actionable:

Metric                   Value
Total clicks (24 days)   147
Total impressions        10,546
Average CTR              1.39%
Average position         ~9.5
Unique queries           565

The analyst identified specific problems: the homepage had 574 impressions with a 1.39% CTR at position 7 - the title and description were underselling the platform. CVE-2025-14700 (Crafty Controller) had 50+ query variants ranking positions 3-10 with hundreds of impressions and zero clicks. Pure CTR problem.

Over the next twelve hours, the analyst and Codey executed nine SEO issues back to back - keyword analysis, custom title and meta tags for every route type, homepage optimization, CVE detail page template overhaul, SSR title regression fixes, Google Search Console submission, and twitter:title corrections. Every CVE page got a unique, descriptive meta tag. The MCP page was optimized for “CVE MCP server” and “exploit MCP” - queries showing up in Search Console with zero clicks because the page didn’t signal what it was.

The safety check was the part I liked most. The analyst verified that /exploit/ paths were correctly excluded from Google’s index. An SEO agent that understands it’s optimizing a security platform - and knows which pages must never appear in search results.

Nine issues. Twelve hours. Measurable improvements across every route type. Real CTR fixes backed by real Search Console data.

Then I accidentally deleted the SEO agent. Nine issues, twelve hours of flawless work, and I fumbled the command. Gone. Not “strategically downsized” - just gone. One misclick, no undo, no “are you sure?” prompt. In the corporate world this would require three meetings with HR and a wrongful termination lawsuit. In Paperclip AI it was one fumbled command and a moment of stunned silence. The CEO didn’t even notice - which, now that I think about it, is also realistic.

The 30-Issue Monorepo Migration Nobody Wanted to Do

And then they went and did the thing I’d been putting off for weeks.

The refactor plan materialized as a thirty-issue epic. Four forge repositories - CVEForge, StackForge, ZeroForge, BinForge - consolidated into a single exploit-forge monorepo. The result is lean, modular, and extensible:

exploit-forge/
  src/
    core/                    # Shared framework - one source of truth
      ai/                    # Claude executor, message routing
      audit/                 # Session logging, metrics, cost tracking
      services/              # Agent execution, git, prompt loading
      temporal/              # Retry configs, workflow errors, workspaces
      types/                 # Result<T,E>, ForgeError, agent base types
      utils/                 # File I/O, billing detection, formatting
    pipelines/               # Pipeline-specific code only
      cveforge/              # workflow.ts, activities.ts, agents.ts
      binforge/
      stackforge/
      zeroforge/
  mcp-server/src/
    shared-tools/            # save-deliverable, run-in-repo, exploit-claim-gate
    cveforge-tools/          # Pipeline-specific MCP tools
    binforge-tools/
    forge-ops/               # Operational monitoring + start_forge_run
  prompts/
    shared/                  # Reusable partials (_docker-tools, _environment)
    <pipeline>/              # Pipeline-specific agent prompts
  configs/<pipeline>/        # YAML configs + JSON Schema validation
  docker/                    # Base + pipeline Dockerfiles
  forge                      # Single CLI dispatcher

That src/core/ directory is where the 70% overlap went. Agent execution, prompt loading, git checkpointing, Temporal retry logic, cost tracking, type definitions - all written once, imported everywhere. Each pipeline keeps only what’s unique: its workflow orchestration, its agent definitions, its specific activities. Adding a new forge tool means creating a pipeline directory, writing a workflow.ts and activities.ts, dropping in prompts and configs, and running ./forge <name> start. The framework does the rest.
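The dispatch idea behind the single ./forge CLI can be sketched as a registry - names and shapes here are illustrative, not the monorepo's actual code:

```typescript
// Hypothetical sketch: one entry point, a registry of pipelines, shared
// launch logic in core. Adding a forge tool means adding one entry.
type PipelineName = "cveforge" | "binforge" | "stackforge" | "zeroforge";

interface PipelineDef {
  workflow: string;  // Temporal workflow type to start
  taskQueue: string; // queue its worker listens on
}

const registry: Record<PipelineName, PipelineDef> = {
  cveforge:   { workflow: "cveforgePipeline",   taskQueue: "cveforge" },
  binforge:   { workflow: "binforgePipeline",   taskQueue: "binforge" },
  stackforge: { workflow: "stackforgePipeline", taskQueue: "stackforge" },
  zeroforge:  { workflow: "zeroforgePipeline",  taskQueue: "zeroforge" },
};

function resolvePipeline(name: string): PipelineDef {
  const def = registry[name as PipelineName];
  if (!def) throw new Error(`unknown pipeline: ${name}`);
  return def;
}
```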

Here’s the issue sequence:

Phase                 Issues                  What
Skeleton              EF-001, EF-002          Monorepo structure, unified CLI
Core extraction       EF-003 through EF-010   Types, utils, AI module, audit, services, temporal, config, shared MCP tools
Compilation check     EF-011                  Full tsc --noEmit verification
CVEForge migration    EF-012 through EF-016   Pipeline, MCP tools, prompts, configs, Docker, integration test
BinForge migration    EF-017 through EF-020   Same four deliverables, same QA gate
StackForge migration  EF-021 through EF-023   Same pattern
ZeroForge migration   EF-024 through EF-027   Same pattern
Cleanup               EF-028 through EF-030   Remove aliases, unify naming, write CLAUDE.md

Codey did the bulk of the engineering. QAGuy reviewed each deliverable. The pattern was consistent: Codey posts a deliverable with “files modified, how to test, notes.” QAGuy runs the acceptance criteria. If it fails, QAGuy files a follow-up issue with the exact fix needed.

And it failed a lot. Not because the code was bad, but because migrating four codebases into one surfaces every implicit dependency, every hardcoded path, every assumption that “this file will be here.” Unresolved Docker base images. Missing lockfiles. Service name mismatches. Hardcoded worker entrypoints. Stale migration artifacts. The kind of bugs that only show up when you actually try to run the thing, which is exactly what QAGuy did, fourteen times.

Fourteen QA fix issues in a single day. Each one caught by the grumpy professor, filed with the exact problem and fix, resolved by the genius hippy, verified by the grumpy professor. The cycle time from “QA finds bug” to “fix merged and verified” was measured in minutes. Watching Codex nitpick Opus’s work fourteen times in a row without either of them getting tired or passive-aggressive was something I’d never seen in a human team.

I’ve managed engineering teams. I know what a fourteen-issue QA day looks like with human engineers. It looks like a very bad day. It looks like blocked PRs and Slack threads and someone saying “well it works on my machine.” Here it looked like an issue tracker filling up and emptying in waves. No standup meetings. Nobody asked if this could be a Jira ticket instead. Nobody suggested we “circle back on this offline.” Just bugs filed, bugs fixed, bugs verified. Like watching a dishwasher - deeply boring and profoundly effective.

Day 4: Refactoring Under a Running Pipeline

The refactored forge CLI needed a unified startPipeline() interface. Three more issues (EXP-124 through EXP-126): extract shared launcher functions, add a start_forge_run MCP tool, refactor the CLI to use the new launchers. Then EXP-127 through EXP-130: more QA fixes, CLI parser updates, contract alignment.

Meanwhile, ForgeRunner kept running production pipelines the entire time. CVE-2026-4105 (the systemd-machined privilege escalation we published this week) was dispatched as EXP-59. ForgeRunner started the workflow, Telegram notifications fired for each pipeline phase, and the results landed in the workspace without anyone touching the terminal.

The forge tools were being refactored underneath a running production pipeline. The agents didn’t coordinate this explicitly - ForgeRunner just kept dispatching runs, and Codey kept merging code, and the two never collided because the refactor touched the monorepo while production still ran on the individual repos. Accidental good architecture. The best kind - the kind you can pretend was intentional in the blog post.

$180.71 for 135 Issues: The Final Accounting

Four days. Here are the numbers.

The Team

Agent        Role                    Adapter      Monthly Spend  Key Contribution
CEO          Executive               Claude       $14.59         Hired 3 agents, managed approvals
Mathew       Sr Security Researcher  Claude       $41.51         Upstream sync, security review, path traversal fix
Codey        Software Engineer       Claude Opus  $115.27        60+ issues, monorepo refactor, SEO implementation
Steve        Research Intern         Claude       $2.99          CVE candidate research
QAGuy        Code Reviewer           Codex        $1.38          Caught bypass in security fix, 14 QA issues in one day
ForgeRunner  Pipeline Operator       Claude       $4.97          Autonomous CVEForge runs
Total                                             $180.71        135 issues

The Work

  • Security fixes: Path traversal (3 rounds), Docker port binding (4 repos), symlink bypass, replaceAll bug, production guard on test shims, Bedrock/Vertex credential validation
  • SEO: Baseline audit, keyword analysis, title/meta overhaul across 6 route types, GSC submission, twitter:title fix
  • Monorepo refactor: 30-issue epic, 4 repos consolidated, core extraction, pipeline migration, integration tests
  • Infrastructure: Telegram bot, Paperclip AI MCP server, unified forge CLI, start_forge_run MCP tool
  • Operations: Multiple CVEForge production runs, CVE candidate research batches

What $1.38 Buys You

QAGuy is the agent I keep coming back to. Code Reviewer. Codex model. One dollar and thirty-eight cents for the entire month.

In that budget, QAGuy:

  • Caught a sibling-prefix bypass in a path traversal fix that the $115 engineer missed
  • Ran 14 QA cycles in a single day during the monorepo migration
  • Verified Docker port bindings were correct with runtime container checks
  • Validated TypeScript compilation across the entire monorepo
  • Never complained, never took a coffee break, never said “that’s not my job”
  • Did not request equity

The most cost-effective security review I’ve ever seen. And I’ve seen a lot of security reviews. Most of them cost more than $1.38 per hour. QAGuy costs $1.38 per month. If QAGuy were a SaaS product, the pricing page would look like a typo.

Adversarial AI Agents Are the Point, Not the Problem

I want to be careful here, because the “AI replaces engineers” narrative is tired and wrong. That’s not what happened.

What happened is that I had a maintenance problem that was too boring for me to solve. Not too hard - too boring to do myself. Syncing security patches across five repos is not intellectually challenging. It’s tedious. Rebinding Docker ports in four compose files is not a senior engineering task. It’s a checklist. Consolidating four TypeScript codebases into a monorepo is well-understood work that nobody wants to do because the payoff is invisible and the process is mind-numbing.

The agents did the mind-numbing work. And they did it with a level of consistency and thoroughness that I honestly wouldn’t have brought to the task myself, because by the third repo I’d be skimming the diff and thinking about lunch. I know this about myself. I’ve been in security long enough to know that the difference between “patched” and “actually patched” is usually one bored human who thought “eh, close enough.”

The QA catch is the real story. Not because the engineer made a mistake - engineers make mistakes, that’s why QA exists - but because the adversarial structure worked. Codey didn’t review Codey’s own code. QAGuy reviewed it. On a different model, with different biases, with different personality. Opus writes code the way it writes prose - with flair, with confidence, occasionally skating past the boring details. Codex reviews code like a building inspector who’s seen too many condemned houses. That tension is the product. Put them on the same model and you get polite agreement. Put the genius hippy against the grumpy professor and you get code that actually works.

This is the same lesson we keep learning with CVEForge. The pipeline works because the agents disagree with each other. The exploit agent writes a PoC. The verification agent tests it. The bypass agent attacks the fix. Consensus without adversarial pressure is just groupthink with extra steps.

Paperclip AI gave us the same dynamic for engineering work. An engineer that writes code and a reviewer that breaks it, running in a loop, with an issue tracker keeping score. The loop is the product. Not the individual agent - the loop.

Where Paperclip AI Fell Short

Let me be honest about the rough edges, because this wasn’t a press release.

The CEO is mostly overhead. I know, I know. Hiring agents and approving things is useful in the first hour. After that, the CEO mostly sits idle, collecting a heartbeat and radiating executive presence into the void. The $14.59 spend was front-loaded - almost all of it was agent onboarding. To be fair, the Paperclip AI docs suggest routing everything through the CEO - task delegation, priority management, inter-agent coordination - and I mostly bypassed it by assigning work directly to agents. So this might be a case of “the CEO is overhead because I kept going around the CEO,” which, now that I write it out, is also realistic. The CEO abstraction probably makes more sense if you actually let it do its job. Or at scale. Or, as they say in corporate America, “my role is strategic.” Fourteen dollars and fifty-nine cents of pure strategy.

Codey’s cost dominates. $115 out of $180 was one agent doing most of the work. Classic 10x engineer - 10x the output, 10x the bill. The monorepo refactor was a massive undertaking and Codey burned through tokens like a junior developer burns through AWS credits on their first unsupervised weekend. Budget planning for agent teams needs to account for this skew: one agent will eat 60% of your spend, and it’ll be the one doing the actual work.

QA found real bugs, but also generated noise. Some of QAGuy’s follow-up issues were about pre-existing problems unrelated to the task at hand (missing js-yaml types, stale build artifacts). The “review everything” mandate meant QA sometimes wandered into territory that wasn’t part of the deliverable, like that one colleague who reviews your CSS PR and leaves a comment about your database schema. Scoping QA reviews more tightly would reduce noise without losing the adversarial benefit. At $1.38/month, though, the noise is affordable. I’ve paid more for a coffee that was worse at finding bugs.

ForgeRunner is fragile on retries. Several CVEForge runs needed to be re-dispatched (EXP-49 through EXP-56 are seven attempts at the same CVE). The agent doesn’t always diagnose why a run failed before retrying - it just tries again. Harder. Like jiggling a door handle seven times instead of checking if it’s locked. A human would read the error log first. ForgeRunner approaches error logs the way most people approach terms of service: technically available, practically ignored.
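What I'd want instead is a classification step before the retry decision. A sketch of that missing logic - function names, failure categories, and the log patterns are all hypothetical:

```typescript
// Classify the failure before deciding whether a retry can possibly help.
type FailureKind = "transient" | "config" | "target";

function classify(errorLog: string): FailureKind {
  if (/timeout|ECONNRESET|rate limit/i.test(errorLog)) return "transient";
  if (/missing (env|config)|invalid credential/i.test(errorLog)) return "config";
  return "target"; // the CVE itself resisted - retrying harder won't help
}

function shouldRetry(errorLog: string, attempt: number): boolean {
  // Only transient failures earn a retry, and only a few of them.
  return classify(errorLog) === "transient" && attempt < 3;
}
```

Seven attempts at the same CVE would have stopped at one or two with a gate like this in front of the dispatch loop.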

AI Agents Fixing AI Security Bugs Found by AI Agents

We build tools that find security vulnerabilities in other people’s code. Those tools had security vulnerabilities. We hired AI agents to fix them. The AI agents’ fixes had security vulnerabilities. Another AI agent caught those vulnerabilities. If there’s a more perfect closed loop of irony in the security industry, I haven’t found it, and I’ve been looking for twenty years.

At no point did a human review the path traversal code. I approved the hires, I read the issue summaries, I checked the final state of the repos. But the actual security analysis - the part where someone looks at startsWith() and realizes it’s bypassable with sibling prefixes - that was QAGuy. A dollar thirty-eight. My contribution to the security fix was clicking “approve” on an agent hire form. I’m basically the CEO now.

I don’t know if this is the future of security engineering. I know it’s the present of our security engineering, as of four days ago. The forge tools are consolidated. The security patches are synced. The SEO is better. The monorepo compiles. Production kept running.

And the pipeline just… keeps going. Steve scouts CVE candidates. ForgeRunner dispatches them to the forge. Results land in the workspace. I read them through MCP - the same MCP server the agents built for themselves. The entire chain from “which CVEs should we look at” to “here’s a working PoC with a container lab” runs without me touching a terminal. I approve things. I read summaries. Occasionally I write a blog post about it. The agents do the rest.

The issue tracker has three items in backlog: remove backward-compat aliases (EXP-103), write unified CLAUDE.md (EXP-104), and cross-pipeline integration test (EXP-105).

I might assign those tomorrow. I just need to decide which agent gets them. Or maybe I’ll ask the CEO. It’s been idle for two days and I’m starting to feel guilty about the $14.59.


Paperclip AI is the agent orchestration platform we used for this experiment. The Exploit Intelligence Platform is at exploit-intel.com. The EIP MCP server is free. All forge-produced labs and PoCs are on GitHub.
