370K CVEs. 105K exploits. EPSS scores, KEV status, AI analysis, Nuclei templates, the whole stack. Our database has the answers to most questions a security team would ask.

But every interface we’d built for EIP - a CLI for terminal people, an API for automation, an MCP server for AI agents, a full web search with filters across every dimension we track - assumed you already knew what you were looking for. They’re power tools for people who think in CVE IDs.

The most common question in security isn’t a structured query. It’s something like “what Metasploit modules dropped this month?” or “is there a working exploit for that Fortinet thing?” or just “what should I worry about?”

Our database doesn’t speak natural language. It speaks SQL. So we decided to teach it.

You Can’t RAG a Postgres Table

The first thing you learn about Retrieval-Augmented Generation is that you need documents. Not database rows - documents. Self-contained chunks of text that mean something on their own.

Our exploit data lives across ten Postgres tables. A vulnerability has metadata in one, affected products in another, references in a third, exploits in a fourth - each exploit with its own author links, code files, AI analysis JSON, GitHub stats. Ask the database about CVE-2024-3400 and you get a join across half the schema. An embedding model looks at that and has the equivalent of a panic attack.

So we wrote an exporter. For each CVE that has analyzed exploits, it renders a single markdown file - a complete intelligence briefing:

  • Vulnerability metadata: CVE ID, severity, CVSS, EPSS, CWE, KEV status, ransomware use
  • Description and affected products
  • Author intelligence: who wrote the exploits, their track record, attribution sources
  • Exploit details: source, language, quality tier, release date, platform
  • Actual code snippets: extracted from the exploit archives on disk
  • LLM analysis summary: attack type, complexity, reliability, MITRE ATT&CK mapping

Each file is one CVE, and each file stands alone. You could hand it to a human analyst and they’d have everything they need. That’s the point - if the document makes sense to a person, it’ll make sense to a retrieval system.

We split the output into two folders: code/ for documents that contain executable exploit code (Python scripts, Metasploit modules, Nuclei templates), and intel/ for everything else (writeups, advisories, vulnerability descriptions without runnable code). This distinction turns out to matter a lot for retrieval.
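The routing decision itself is mechanical. A minimal sketch, assuming a simple extension-based heuristic (the real exporter can consult the analyzed exploit records directly; the names here are ours):

```javascript
// Hypothetical sketch: route an exported CVE document into code/ or intel/
// based on whether its exploit files contain something runnable.
const CODE_EXTENSIONS = new Set([".py", ".rb", ".c", ".yaml", ".sh", ".pl"]);

function targetFolder(exploitFiles) {
  const hasRunnableCode = exploitFiles.some((name) =>
    CODE_EXTENSIONS.has(name.slice(name.lastIndexOf("."))));
  return hasRunnableCode ? "code/" : "intel/";
}
```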

First batch: 1,000 CVEs. The highest-severity vulnerabilities with analyzed exploits, rendered into markdown. About 15MB of structured intelligence.

The Platform That Does It All (Except When It Doesn’t)

We chose Cloudflare AI Search for the RAG pipeline. It’s a managed service - you upload documents to an R2 bucket, point AI Search at it, and it handles chunking, embedding, indexing, retrieval, reranking, and response generation. One API call:

const answer = await env.AI.autorag("eip-cve-search").aiSearch({
  query: message,
  max_num_results: 6,
  reranking: { enabled: true },
  stream: false
});

No vector database to manage. No embedding pipeline to maintain. No chunk-and-store logic to debug. For a team that would rather obsess over exploit data than infrastructure plumbing, this was the right call.

We uploaded the 1,000 documents, waited for indexing, and ran our first query. It went beautifully. Then it went horribly.

“ProFTPD backdoor remote code execution.”

Score: 0.999. The correct document. Full exploit details, Metasploit module, rootkit patch analysis. Beautiful.

“CVE-2010-20103.”

Score: 0.023. Wrong document entirely. It returned a completely unrelated CVE.

Same vulnerability. Same document in the index. One query found it perfectly, the other failed catastrophically. The score didn’t just drop - it fell off a cliff.

The Identity Problem

This was the moment the whole thing clicked - and almost the moment we gave up.

CVE-2010-20103 is the ProFTPD backdoor. They’re the same thing. But to an embedding model, “CVE-2010-20103” is just a string of characters - it carries no semantic weight the way “ProFTPD backdoor remote code execution” does. The model can’t look at “2010-20103” and understand that it refers to a supply chain compromise of an FTP server.

This is the fundamental gap in pure semantic search: identifiers aren’t concepts. CVE IDs, EDB numbers, GHSA references - these are the primary keys of our entire industry. Security professionals think in CVE IDs. They paste them into search bars, Slack channels, ticket systems. And vector search can’t match them.

And we made it worse. Three mistakes, compounding like interest on a bad loan:

Wrong embedding model. The default was qwen3-embedding-0.6b - a small model that wasn’t even on Cloudflare’s supported list. It worked, technically. Like a bicycle technically works for a cross-country move. A 0.6B parameter model doesn’t have the capacity to develop meaningful representations of identifier strings.

Wrong chunk size. We set chunks to 2,048 tokens. The embedding model accepts 512 tokens. Do the math: seventy-five percent of each chunk was invisible to the model - stored in the index, but truncated before it ever reached the embedding. Ask “what’s the EPSS score for this CVE?” and the retriever returns a chunk that’s mostly Metasploit module source code. Helpful.

No hybrid search. Pure vector search. No keyword matching. The one feature that would have trivially matched “CVE-2010-20103” as a literal string in the document text - and we didn’t turn it on. Three settings, all wrong. Not our finest hour.

Getting It Right

We rebuilt the instance from scratch. New settings:

  • Embedding model: bge-large-en-v1.5 - 1,024 dimensions, 512 input tokens, top-tier English retrieval benchmarks, free on Cloudflare Workers AI
  • Chunk size: 512 tokens with 15% overlap - each chunk fits entirely within the model’s context window
  • Hybrid search: Enabled - vector similarity plus BM25 keyword matching
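To make the overlap concrete, here’s a sketch of fixed-size chunking with 15% overlap - our illustration of the arithmetic, not Cloudflare’s internal implementation (AI Search handles chunking for you):

```javascript
// Sketch: split a token sequence into 512-token chunks that overlap by 15%.
// Each chunk fits entirely inside the embedding model's context window.
function chunkTokens(tokens, size = 512, overlap = Math.floor(512 * 0.15)) {
  const stride = size - overlap; // 512 - 76 = 436 tokens between chunk starts
  const chunks = [];
  for (let start = 0; start < tokens.length; start += stride) {
    chunks.push(tokens.slice(start, start + size));
    if (start + size >= tokens.length) break; // tail already covered
  }
  return chunks;
}
```

A document that previously produced two 2,048-token chunks (~4,096 tokens) now yields around ten overlapping 512-token chunks, each fully visible to the model.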

Same documents. Same R2 bucket. Ten minutes of indexing.

“CVE-2010-20103.”

Score: 0.989. Correct document. First result.

The chunk size change alone was dramatic. At 512 tokens, our average document produces about 9 chunks instead of 2. The vulnerability metadata section gets its own chunk. The exploit code gets its own chunks. When someone asks about EPSS scores, the retriever finds the metadata chunk - not a blob of Ruby code that happens to be in the same document.

But the real win was hybrid search. BM25 does exact string matching. “CVE-2010-20103” appears literally in the document text, and keyword search finds it trivially. Vector search handles the semantic queries (“buffer overflow in FTP server”), keyword search handles the identifiers. Together, they cover everything.
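Under the hood, a hybrid system has to merge two ranked lists into one. Reciprocal rank fusion is one common way to do it - this sketch just illustrates the idea; Cloudflare’s exact fusion method is internal to the service:

```javascript
// Sketch: reciprocal rank fusion (RRF). Each document earns 1/(k + rank + 1)
// from every list it appears in; documents ranked well by either retriever
// float to the top of the merged list. k = 60 is the conventional default.
function rrfFuse(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```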

The Helpful Feature That Broke Everything

AI Search has a query rewriting feature. Before searching, it passes your query through an LLM that “improves” it - expanding abbreviations, adding context, rephrasing for better retrieval.

Sounds great. Here’s what it did to “CVE-2010-20103”: it rewrote it into something completely different. We never saw what - the rewritten query isn’t returned in the API response, which is its own kind of fun to debug - but the retrieval results were catastrophically wrong. Scores dropped from 0.989 to 0.02.

The rewriter was trying to be helpful. It saw something that looked like an identifier and decided it knew better. It did not.

We added intent detection in our Worker. When someone asks a question that’s primarily about a specific CVE - “What is CVE-2024-3400?” or just “CVE-2024-3400” - we disable query rewriting and send the identifier straight through. For semantic queries like “critical Fortinet vulnerabilities with public exploits,” rewriting stays on.
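The check itself is a few lines in the Worker. A hedged sketch - the function names and the four-word threshold are ours, and we’re assuming AI Search’s `rewrite_query` option is what toggles the rewriter (check the current docs before copying this):

```javascript
// Sketch: detect queries that are primarily a CVE lookup. For those, we pass
// rewrite_query: false to aiSearch() so the identifier goes through untouched.
const CVE_PATTERN = /\bCVE-\d{4}-\d{4,}\b/i;

function isIdentifierLookup(query) {
  const match = query.match(CVE_PATTERN);
  if (!match) return false;
  // If little remains once the CVE ID is removed, treat it as a lookup.
  const residue = query.replace(CVE_PATTERN, "").trim();
  return residue.split(/\s+/).filter(Boolean).length <= 4;
}
```

A long investigative question that merely mentions a CVE still counts as a semantic query, so rewriting stays on for it.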

This was a recurring theme throughout the build: knowing when to turn the smart features off. A lookup (“CVE-2024-3400”) and a search (“critical Fortinet vulnerabilities with public exploits”) are fundamentally different operations. Treating them the same way breaks one of them, and it’s always the one your users care about most.

The Wall of Silence

First generation model: Llama 4 Scout, running on Cloudflare Workers AI. Free. Fast. Good enough for testing.

We asked: “What new Metasploit exploits have come out?”

The response started beautifully. Structured, detailed, listing four exploits with CVE IDs, descriptions, module filenames, and release dates. Then, mid-sentence:

“These exploits were –”

And nothing. The response just stopped. Not an error. Not a timeout. The model hit its output token limit and silently truncated.

This is the kind of thing that makes a prototype feel broken even when the retrieval is perfect. The right documents were found. The right context was assembled. But the answer was literally cut short.

We switched to Gemini 2.5 Flash via Cloudflare’s AI Gateway. Connected a Google AI Studio API key, changed one configuration field, redeployed. Same query, complete response. Structured, grounded, finished.

The difference between a model that stops mid-sentence and one that finishes is the difference between a demo you show friends and a product you’d actually use.

What It Feels Like Now

The chat lives at exploit-intel.com/chat. Password-gated for now - we’re scaling the document index from 1,000 to 100,000 CVEs before opening it up.

The experience is simple: you ask a question, you get an answer grounded in real exploit intelligence data. Not hallucination - retrieval. Every response includes source links back to the full CVE pages on the platform.

Ask “latest Metasploit modules” and you get a structured list of recent modules with CVE IDs, descriptions, and release dates. Ask “CVE-2024-3400” and you get a full briefing on the Palo Alto PAN-OS vulnerability, complete with exploit quality assessment and remediation context. Ask “critical CVEs with high EPSS scores” and you get a prioritized list sorted by real exploitation probability.

When the system doesn’t know - when no trusted sources meet the quality threshold - it says so. No confident-sounding fabrication. This matters more than any other feature we built. A security intelligence tool that hallucinates is actively dangerous. Better to say “I don’t know” than to invent a CVE score that someone patches infrastructure based on.

What We Learned

RAG isn’t a feature you bolt on. It’s a craft, and a humbling one. Here’s what we’d tell anyone building one:

Your documents matter more than your model. We spent more time on the markdown exporter - deciding what to include, how to structure it, where to split code from intel - than on any other component. A mediocre embedding model over well-structured documents beats a great model over messy data.

Chunk size is the silent killer. This isn’t a parameter you set and forget. It depends on your embedding model’s input limit, your document structure, and the kinds of questions people ask. Getting it wrong doesn’t throw an error. It just quietly makes everything slightly worse, and you blame the model, the data, the alignment of the planets - everything except the one number you set wrong on day one.

Identifiers need keyword search. If your domain has primary keys that humans use as lookup terms - CVE IDs, product codes, ticket numbers - pure vector search will fail on them. Hybrid search isn’t optional. It’s the baseline.

Know when to turn the smart features off. Query rewriting, automatic reformulation, intent inference - these features exist for good reasons. But they assume your queries are natural language. When the query is an identifier, the smartest thing the system can do is get out of the way.

Managed services are worth it. We considered building the pipeline ourselves - Vectorize for vectors, D1 for keyword search, manual embedding, custom RRF fusion. We would have had more control. We also would have had more code to maintain, more failure modes to debug, and less time to spend on the actual data. Cloudflare AI Search handles the infrastructure. We focus on making the documents good.

What’s Next

We’re scaling the document index to 100,000 CVEs - every vulnerability in the platform that has analyzed exploits. The export pipeline is already running. The upload is idempotent (SHA256 manifest, only uploads changed files), so scaling is a matter of turning up the --limit parameter and waiting.

At 100K documents, the chatbot will cover the same breadth as the MCP server and the API - just through natural language instead of structured queries. Different interface, same intelligence.

If you want to try it: exploit-intel.com/chat. It’s in closed beta while we scale the index - drop us a line at dev at exploit-intel dot com and we’ll send you access. We’d love to hear what you ask it. Especially the questions that break it. Those are the best ones.