Production RAG on Cloudflare Without LangChain
A 200-line RAG pipeline on Cloudflare Workers + Vectorize + D1 — the five primitives that matter, why frameworks rot at the wrong layer, and how to keep retrieval debuggable in production.
Every RAG tutorial in 2026 starts the same way: import { ChatOpenAI } from "@langchain/openai". Two weeks later, in production, half of those teams are rewriting their pipeline from scratch, usually because the framework abstracts over the wrong layer, hides the parts that actually need debugging, and adds dependencies that move faster than they should.
We’ve built RAG pipelines for clients without LangChain. They land at around 200 lines of TypeScript, run on Cloudflare Workers, store vectors in Vectorize, persist source documents in D1 or R2, and stay legible six months later when someone has to debug why a retrieval missed.
This is how to build them, and why we think it’s the better default for production.
The five primitives that matter
A retrieval-augmented generation system is not a framework problem. It’s a pipeline of five operations, each of which is small and well-understood:
- Chunk — split source documents into retrievable units.
- Embed — turn each chunk into a vector.
- Store — persist chunk + vector + metadata, indexed for similarity search.
- Retrieve — given a query, embed it and return the top-K matching chunks.
- Prompt — assemble retrieved chunks into a model prompt and stream the answer.
Frameworks abstract these into “chains” or “agents” or “graphs” — useful vocabulary for tutorials, harmful vocabulary in production. Each primitive has different failure modes, different cost profiles, and different observability requirements. Wrapping all five in one abstraction means when retrieval is broken, you’re debugging the framework, not the operation.
Why we don’t use LangChain in production
To be clear: LangChain is fine for prototyping. The complaint is specific.
It abstracts over the wrong layer. The hard parts of RAG aren’t “call OpenAI” or “embed text” — those are HTTP requests. The hard parts are: chunking strategy for your specific corpus, retrieval scoring tuning, query rewriting heuristics, and the eval harness that catches regressions. LangChain doesn’t help with any of those — they require domain understanding of your data.
It moves fast in incompatible ways. A LangChain upgrade between minor versions has, repeatedly, broken production pipelines for teams we’ve talked to. The framework is in heavy iteration. Pinning a version means you stop getting fixes; upgrading means you’re on a treadmill.
It hides the parts you most need to log. When a retrieval surfaces the wrong chunk, you want to see the embedding, the similarity scores, and the metadata of the candidates that lost. Most framework code makes this either impossible or an exercise in console.log-debugging through three layers of inheritance.
It pulls in 200 dependencies. None of which you needed.
What you actually need is five SDK calls and a couple hundred lines of code that you wrote and understand.
The Cloudflare stack for RAG
Each of the five primitives maps to one Cloudflare product:
| Primitive | Cloudflare service | Why |
|---|---|---|
| Chunk | Workers (your code) | It’s pure CPU work that runs inside the Worker itself, so there is no extra service to provision or pay for. |
| Embed | Workers AI (@cf/baai/bge-base-en-v1.5) or OpenAI | Workers AI is local-to-Worker, fast, and free for low volume. OpenAI’s text-embedding-3-small is more accurate. |
| Store | Vectorize | Native vector DB on Cloudflare; no external service. |
| Retrieve | Vectorize query() | Same database, sub-50 ms query at the edge. |
| Prompt | Workers AI Llama / Anthropic Claude / OpenAI GPT | Whichever fits the cost/quality bar. We default to Claude. |
Source documents live in D1 (if structured) or R2 (if PDFs / large text blobs). Vectorize stores only the vectors and metadata — D1 / R2 are the source of truth for the actual chunk text.
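The pipeline's D1 side needs a chunks table with id, doc_id, and text columns. A minimal sketch of a schema that would fit, plus helpers for the docId#index ID convention; the column types, the index, and the names CHUNKS_SCHEMA, chunkId, and parseChunkId are assumptions for illustration, not the template's actual migration.

```typescript
// Assumed schema for the D1 `chunks` table. `id` is the primary key so that
// re-ingesting a document can overwrite rows instead of duplicating them.
const CHUNKS_SCHEMA = `
CREATE TABLE IF NOT EXISTS chunks (
  id     TEXT PRIMARY KEY,
  doc_id TEXT NOT NULL,
  text   TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS chunks_doc_id ON chunks (doc_id);
`;

// Chunk IDs are `${docId}#${index}`; the same ID is used in D1 and Vectorize.
function chunkId(docId: string, idx: number): string {
  return `${docId}#${idx}`;
}

// Split on the LAST '#', so document IDs may themselves contain '#'.
function parseChunkId(id: string): { docId: string; idx: number } {
  const at = id.lastIndexOf('#');
  if (at < 0) throw new Error(`not a chunk id: ${id}`);
  return { docId: id.slice(0, at), idx: Number(id.slice(at + 1)) };
}
```

Keeping one ID scheme across both stores is what makes the later hydration step a single lookup by primary key.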
This stack means: one wrangler config, one platform for billing, one observability surface, no external integrations to break.
A 200-line RAG pipeline
The whole thing fits in one file. Annotated.
// src/rag.ts
import type { D1Database, Vectorize, Ai } from '@cloudflare/workers-types';

interface Env {
  AI: Ai;
  VEC: Vectorize;
  DB: D1Database;
  ANTHROPIC_API_KEY: string;
}
// ─── 1. Chunk ──────────────────────────────────────────────────────────────
// Naive chunker. Replace with a corpus-aware version when retrieval quality
// matters more than implementation simplicity.
function chunk(text: string, size = 800, overlap = 100): string[] {
  // A step of zero or less would never advance the loop below.
  if (overlap >= size) throw new RangeError('overlap must be smaller than size');
  const out: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    out.push(text.slice(i, i + size));
  }
  return out;
}
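// ─── 2. Embed ──────────────────────────────────────────────────────────────
async function embed(env: Env, texts: string[]): Promise<number[][]> {
  // BGE base = 768 dims, free at low volume on Workers AI. Batched because one
  // call accepts a limited number of texts (100 assumed; check the model page).
  const out: number[][] = [];
  for (let i = 0; i < texts.length; i += 100) {
    const batch = texts.slice(i, i + 100);
    const res = await env.AI.run('@cf/baai/bge-base-en-v1.5', { text: batch });
    out.push(...res.data);
  }
  return out;
}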
// ─── 3. Store ──────────────────────────────────────────────────────────────
async function ingest(env: Env, docId: string, text: string): Promise<void> {
  const chunks = chunk(text);
  const vectors = await embed(env, chunks);
  // Persist source-of-truth chunks in D1. OR REPLACE keeps re-ingesting the
  // same docId idempotent (given a unique id column), matching the upsert below.
  const stmt = env.DB.prepare(
    'INSERT OR REPLACE INTO chunks (id, doc_id, text) VALUES (?, ?, ?)'
  );
  await env.DB.batch(
    chunks.map((c, i) => stmt.bind(`${docId}#${i}`, docId, c))
  );
  // Persist vectors + minimal metadata in Vectorize.
  await env.VEC.upsert(
    chunks.map((_, i) => ({
      id: `${docId}#${i}`,
      values: vectors[i],
      metadata: { docId, idx: i },
    }))
  );
}
// ─── 4. Retrieve ───────────────────────────────────────────────────────────
async function retrieve(env: Env, query: string, topK = 5) {
  const [q] = await embed(env, [query]);
  const result = await env.VEC.query(q, { topK, returnMetadata: 'all' });
  // Hydrate chunk text from D1 (Vectorize stores only vectors + metadata).
  const ids = result.matches.map(m => m.id);
  // An empty index yields no matches, and `IN ()` is a SQL syntax error.
  if (ids.length === 0) return [];
  const rows = await env.DB.prepare(
    `SELECT id, text FROM chunks WHERE id IN (${ids.map(() => '?').join(',')})`
  ).bind(...ids).all<{ id: string; text: string }>();
  const byId = new Map(rows.results.map(r => [r.id, r.text]));
  // Keep Vectorize's ranking order; D1 returns rows in arbitrary order.
  return result.matches.map(m => ({
    id: m.id,
    score: m.score,
    text: byId.get(m.id) ?? '',
  }));
}
// ─── 5. Prompt ─────────────────────────────────────────────────────────────
async function answer(env: Env, query: string): Promise<Response> {
  const hits = await retrieve(env, query);
  const context = hits
    .map((h, i) => `[${i + 1}] (score=${h.score.toFixed(3)})\n${h.text}`)
    .join('\n\n');
  const sys = `You answer using ONLY the numbered context below. If the answer is not present, say so. Cite sources as [1], [2], etc.\n\n${context}`;
  const res = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'x-api-key': env.ANTHROPIC_API_KEY,
      'anthropic-version': '2023-06-01',
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      system: sys,
      messages: [{ role: 'user', content: query }],
      stream: true,
    }),
  });
  // Surface upstream failures instead of relaying an error body as SSE.
  if (!res.ok || !res.body) {
    return new Response(await res.text(), { status: 502 });
  }
  // Stream the SSE response back to the client unchanged.
  return new Response(res.body, {
    headers: { 'content-type': 'text/event-stream' },
  });
}
// ─── Worker entrypoint ─────────────────────────────────────────────────────
export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const url = new URL(req.url);
    if (url.pathname === '/ingest' && req.method === 'POST') {
      const { docId, text } = await req.json<{ docId: string; text: string }>();
      await ingest(env, docId, text);
      return new Response('ok');
    }
    if (url.pathname === '/ask' && req.method === 'POST') {
      const { query } = await req.json<{ query: string }>();
      return answer(env, query);
    }
    return new Response('not found', { status: 404 });
  },
};
That’s the whole pipeline. Two endpoints (/ingest, /ask), five primitives, no framework.
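Because /ask relays Anthropic's server-sent events unchanged, the caller has to pull the answer text out of the event stream itself. A minimal sketch of that parsing, assuming the Messages API streaming format in which text arrives in content_block_delta events carrying a text_delta. The name extractText is hypothetical, and the function expects complete SSE lines; a real client would buffer partial lines across network chunks.

```typescript
// Hypothetical client-side helper: extract answer text from raw SSE lines.
interface StreamEvent {
  type: string;
  delta?: { type: string; text?: string };
}

function extractText(sse: string): string {
  let out = '';
  for (const line of sse.split('\n')) {
    // Only `data:` lines carry JSON payloads; `event:` lines and blanks are skipped.
    if (!line.startsWith('data:')) continue;
    const payload = line.slice('data:'.length).trim();
    if (!payload) continue;
    let evt: StreamEvent;
    try {
      evt = JSON.parse(payload) as StreamEvent;
    } catch {
      continue; // ignore malformed or partial lines
    }
    if (evt.type === 'content_block_delta' && evt.delta?.type === 'text_delta') {
      out += evt.delta.text ?? '';
    }
  }
  return out;
}
```

Relaying the raw stream keeps the Worker trivial; the trade-off is that every client needs a small parser like this one.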
What this gives you
Observability. Every chunk has an ID. Every retrieval surfaces scores. Every Anthropic call is one fetch — log it directly. Workers Logs captures everything for free.
Debuggability. When a retrieval is wrong, you can run /ask against the same query, look at result.matches, and see exactly which chunks scored highest and why. There’s no abstraction in the way.
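One way to make those scores visible is to emit a structured record per retrieval. The sketch below is an assumption about how that could look, not part of the pipeline above: traceRetrieval and its field names are hypothetical, and sorting by descending score assumes a cosine or dot-product index, where higher means closer.

```typescript
// Hypothetical helper: build one JSON-serialisable log record per retrieval.
interface Hit {
  id: string;
  score: number;
  text: string;
}

interface RetrievalTrace {
  query: string;
  topId: string | null;
  // Gap between the best and second-best score; a small margin suggests a
  // near-tie that is worth inspecting by hand.
  margin: number | null;
  candidates: { id: string; score: number; preview: string }[];
}

function traceRetrieval(query: string, hits: Hit[], previewLen = 80): RetrievalTrace {
  const sorted = [...hits].sort((a, b) => b.score - a.score);
  return {
    query,
    topId: sorted[0]?.id ?? null,
    margin: sorted.length >= 2 ? sorted[0].score - sorted[1].score : null,
    candidates: sorted.map(h => ({
      id: h.id,
      score: h.score,
      preview: h.text.slice(0, previewLen),
    })),
  };
}

// In the Worker: console.log(JSON.stringify(traceRetrieval(query, hits)));
```

Logging the losing candidates alongside the winner is the point: a wrong answer is usually explained by what ranked second.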
Cost. Workers AI embeddings are free up to 10,000 neurons / day. Vectorize bills on stored and queried vector dimensions rather than per vector, so storage cost scales with vector count × 768 for BGE base; check the current pricing page for rates. D1 reads are essentially free at this scale. The most expensive line item is Claude, and you control how much context you stuff into each call.
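Migrations. When you decide BGE base isn’t accurate enough and you want OpenAI’s text-embedding-3-small, you rewrite the body of embed(). The one catch is dimensionality: text-embedding-3-small returns 1536 dimensions by default against BGE base’s 768, and a Vectorize index has fixed dimensions, so you also create a new index and re-ingest. No framework rewrite.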
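A sketch of what the swapped-in embedding call might look like, assuming OpenAI's /v1/embeddings endpoint and an API key passed in by the caller. The names embedOpenAI and parseEmbeddings are hypothetical; the dimension check is there so that a mismatch with the Vectorize index fails loudly at ingest time rather than at query time.

```typescript
// Shape of the parts of the embeddings response this sketch relies on.
interface EmbeddingResponse {
  data: { index: number; embedding: number[] }[];
}

// Order vectors by their `index` field and verify every vector's length.
function parseEmbeddings(res: EmbeddingResponse, expectedDims: number): number[][] {
  const sorted = [...res.data].sort((a, b) => a.index - b.index);
  for (const d of sorted) {
    if (d.embedding.length !== expectedDims) {
      throw new Error(`expected ${expectedDims} dims, got ${d.embedding.length}`);
    }
  }
  return sorted.map(d => d.embedding);
}

async function embedOpenAI(
  apiKey: string,
  texts: string[],
  expectedDims = 1536, // default output size of text-embedding-3-small
): Promise<number[][]> {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      authorization: `Bearer ${apiKey}`,
      'content-type': 'application/json',
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: texts }),
  });
  if (!res.ok) throw new Error(`embedding request failed: ${res.status}`);
  return parseEmbeddings((await res.json()) as EmbeddingResponse, expectedDims);
}
```

Vectors from two different models are not comparable, so the old and new indexes should not be mixed during the cutover.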
The parts a framework can’t help with
Once the pipeline is in place, the work that actually matters happens at three points:
Chunking strategy. Naive 800-char windows are a starting point, not a destination. Real corpora benefit from semantic chunking: split on headings, keep lists together, preserve table structure. This is corpus-specific and is the single biggest lever on retrieval quality.
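As an illustration of the heading-based approach, here is a minimal sketch for Markdown-like input. It is one possible starting point under stated assumptions, not a general solution: chunkByHeading and Section are hypothetical names, headings are assumed to use # syntax, and a single paragraph longer than maxChars is kept whole rather than cut mid-sentence.

```typescript
// Hypothetical heading-aware chunker for Markdown-like text.
interface Section {
  heading: string; // nearest heading above the chunk, '' if none
  text: string;
}

function chunkByHeading(doc: string, maxChars = 800): Section[] {
  const sections: Section[] = [];
  let heading = '';
  let buf: string[] = [];

  const flush = () => {
    const body = buf.join('\n').trim();
    buf = [];
    if (!body) return;
    // Oversized sections are split on blank lines, so a list or table whose
    // rows are separated by single newlines stays in one chunk.
    let current = '';
    for (const para of body.split(/\n{2,}/)) {
      if (current && current.length + para.length + 2 > maxChars) {
        sections.push({ heading, text: current });
        current = '';
      }
      current = current ? `${current}\n\n${para}` : para;
    }
    if (current) sections.push({ heading, text: current });
  };

  for (const line of doc.split('\n')) {
    const m = /^#{1,6}\s+(.*)$/.exec(line);
    if (m) {
      flush(); // close the previous section under its own heading
      heading = m[1].trim();
    } else {
      buf.push(line);
    }
  }
  flush();
  return sections;
}
```

Storing the heading next to each chunk also gives the prompt step something to cite beyond a bare chunk number.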
Query rewriting. “How do I cancel?” and “cancellation policy” should retrieve the same chunks. A small Claude call that rewrites the user’s query into 3–5 retrieval-friendly variants, followed by a union of the per-variant results, dramatically improves recall, at the cost of one extra LLM call and a few extra Vectorize queries per request.
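The merge step is the only part that needs new logic. A minimal sketch, assuming each variant has already been run through a retrieval function that returns id, score, and text, and that higher scores are better; mergeHits is a hypothetical name, and keeping the maximum score per chunk is one reasonable choice among several (reciprocal rank fusion is a common alternative).

```typescript
interface Hit {
  id: string;
  score: number;
  text: string;
}

// Union result lists from several query variants: keep each chunk once,
// with the best score any variant gave it, then re-rank and truncate.
function mergeHits(lists: Hit[][], topK = 5): Hit[] {
  const best = new Map<string, Hit>();
  for (const list of lists) {
    for (const hit of list) {
      const prev = best.get(hit.id);
      if (!prev || hit.score > prev.score) best.set(hit.id, hit);
    }
  }
  return [...best.values()].sort((a, b) => b.score - a.score).slice(0, topK);
}

// Possible use in the Worker, given `variants: string[]` from the rewrite call:
//   const lists = await Promise.all(variants.map(v => retrieve(env, v)));
//   const hits = mergeHits(lists);
```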
Eval harness. A list of 30–100 known queries with expected answers. Run them on every model change, every chunking change, every prompt change. Track precision and recall. This is what frameworks definitely don’t give you, and what every production RAG system needs.
We treat the eval harness as deploy-blocking — if precision drops more than 5% on the suite, the deploy stops. That habit is worth more than every framework feature combined.
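A sketch of the scoring and the gate, under a few assumptions: each test case lists the chunk IDs a correct retrieval should return (rather than free-text answers), retrieval results have been collected per query beforehand, and "5%" is read as 0.05 absolute precision. The names evaluate, shouldBlockDeploy, EvalCase, and EvalResult are hypothetical.

```typescript
interface EvalCase {
  query: string;
  expectedIds: string[]; // chunk IDs a correct retrieval should return
}

interface EvalResult {
  precision: number; // mean over cases of (relevant retrieved / retrieved)
  recall: number;    // mean over cases of (relevant retrieved / expected)
}

// `retrieved` maps each query to the chunk IDs the pipeline returned for it.
function evaluate(cases: EvalCase[], retrieved: Map<string, string[]>): EvalResult {
  let p = 0;
  let r = 0;
  for (const c of cases) {
    const got = retrieved.get(c.query) ?? [];
    const expected = new Set(c.expectedIds);
    const relevant = got.filter(id => expected.has(id)).length;
    p += got.length ? relevant / got.length : 0;
    r += expected.size ? relevant / expected.size : 0;
  }
  const n = cases.length || 1; // avoid dividing by zero on an empty suite
  return { precision: p / n, recall: r / n };
}

// Deploy gate: block when precision falls more than `maxDrop` below baseline.
// Treating the threshold as absolute is an assumption; use a ratio if preferred.
function shouldBlockDeploy(baseline: EvalResult, current: EvalResult, maxDrop = 0.05): boolean {
  return baseline.precision - current.precision > maxDrop;
}
```

In CI this would run against a preview deployment, with the baseline read from the last passing run.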
When LangChain is fine
We’re not anti-LangChain in absolute terms. It’s a great surface for tutorials, prototypes, and exploratory work where you want to swap models in 30 seconds. If you’re building a one-off internal tool that nobody is going to maintain past next quarter, the dependency cost is irrelevant.
The objection is specifically against shipping it to production for systems that need to live for years. Frameworks at the application layer rot. Frameworks at the infrastructure layer (Cloudflare Workers, Anthropic SDK, OpenAI SDK) are stable enough that you can build durable software on top of them.
The template, deploy-ready
The full code above is published as an open-source Cloudflare Workers template:
github.com/setkernel/cf-rag-template: fork it, run npm install && npx wrangler deploy, and you have a working RAG endpoint in about three minutes. MIT licensed.
It includes the complete pipeline, D1 migrations, Vectorize index setup, and four endpoints (/ingest, /ask, /forget, /health) — production-hardened, not just the essay snippet.
What we’d build for you
If you’re building a RAG-shaped product and want it to last past v1, this is exactly the work we do — scoped in writing, priced per project. The deliverable is the pipeline above (adapted to your corpus), an eval harness, and a runbook for swapping models when the price-quality landscape moves.
Write a brief — two paragraphs is enough.