AI
Workers AI vs OpenAI: A Cost-Quality Matrix at Low Volume
Most production AI features default to OpenAI by reflex. They shouldn't. A real comparison of Workers AI, OpenAI, and Anthropic Claude across the four tasks that actually show up in client engagements — embeddings, generation, transcription, classification — with real numbers and a routing strategy that uses each for what it's good at.
The default for most production AI features in 2026 is “wire it to OpenAI”. The reflex is so strong that teams pick a vendor before they pick a model, and pick a model before they understand what the feature actually needs.
We’ve watched this play out enough times across client work to have a strong opinion: most production AI features should be 80% Workers AI, 15% Claude, 5% OpenAI — and the 5% is for the specific places OpenAI is genuinely irreplaceable. The reflex defaults to it the wrong way around.
This essay lays out the four tasks that show up in nearly every engagement, the real cost and latency numbers at low volume, and a routing layer that picks the right model per task instead of locking in one vendor.
The four tasks that show up everywhere
After the framing, you can ignore most of the AI vendor catalogue. Production engagements concentrate on these:
- Embeddings — turn text into vectors for retrieval, deduplication, classification.
- Generation — answer a question, summarise, transform text.
- Classification & extraction — schema-guided structured output.
- Transcription — audio to text, often with diarization.
Image generation, fine-tuning, vision-LLM, voice-cloning, agents — these come up too, but they’re niche. The four above are the bread-and-butter. Get them right and the rest is variation.
The matrix at low volume
“Low volume” here means under 1 million requests per month — which is the band where almost every SetKernel engagement starts. Numbers below are blended across April 2026 published rates and our own measurements; treat them as directional, not exact.
1. Embeddings (768-dim, English text)
| Model | Cost (per 1M tokens) | Latency p50 from Cloudflare Worker | Quality (MTEB) |
|---|---|---|---|
Workers AI @cf/baai/bge-base-en-v1.5 | Free to first 10k neurons/day, then trivial | 18 ms (in-Worker) | 64.2 |
OpenAI text-embedding-3-small (1536-dim) | $0.02 | 320 ms | 62.3 |
OpenAI text-embedding-3-large (3072-dim) | $0.13 | 380 ms | 64.6 |
Workers AI @cf/baai/bge-large-en-v1.5 | Free / trivial | 24 ms | 65.2 |
Verdict. Use Workers AI’s BGE for 95% of cases. It runs in-Worker (no external HTTP), so 18 ms instead of 320 ms; it costs effectively nothing at low volume; and it scores within a point of OpenAI’s flagship on MTEB benchmarks. The remaining 5% is when your retrieval quality is bottlenecked by embedding fidelity and you’ve measured the gain — at which point spend the money on text-embedding-3-large. Don’t guess.
2. Generation (long-form, instruction-following)
| Model | Cost (input / output per 1M tokens) | Latency p50 (first token, Worker → API) | Notes |
|---|---|---|---|
Workers AI @cf/meta/llama-3.3-70b-instruct-fp8-fast | Free to first 10k neurons/day | 180 ms | In-Worker, no external API hop. |
| Anthropic Claude Haiku 4.5 | $1 / $5 | 380 ms | Cheap, fast, surprisingly capable. |
| Anthropic Claude Sonnet 4.6 | $3 / $15 | 540 ms | Best non-frontier reasoning. |
| Anthropic Claude Opus 4.7 (1M ctx) | $15 / $75 | 720 ms | Frontier — long context worth the price. |
| OpenAI GPT-4o | $2.50 / $10 | 460 ms | Solid but rarely the best at any specific task. |
| OpenAI GPT-4o-mini | $0.15 / $0.60 | 290 ms | Cheap. Quality dips on multi-step reasoning. |
Verdict. Workers AI’s Llama 3.3 70B handles ~70% of generation tasks at SetKernel — anything that fits “follow this template, fill in these slots, produce this output.” Claude Sonnet 4.6 owns the next 25% — anything reasoning-heavy, multi-step, or where output quality compounds (writing, code review, decisions with stakes). Opus 4.7 takes the remaining 5% — long-context corpus analysis, complex agent loops, situations where one wrong answer is expensive. OpenAI is rarely first choice for generation in 2026. It’s neither cheapest nor best at anything specific; it’s average across the board, which is why teams default to it without thinking.
3. Classification & extraction (schema-guided JSON output)
| Model | Approach | Reliability of structured output |
|---|---|---|
Workers AI @cf/meta/llama-3.3-70b-instruct-fp8-fast + JSON mode | Native | ~94% valid first try |
| Anthropic Claude Sonnet 4.6 + tool-use | Native, very reliable | ~99.5% valid first try |
| OpenAI GPT-4o + Structured Outputs (strict mode) | Native, deterministic | 100% (provider-validated) |
| OpenAI GPT-4o-mini + Structured Outputs | Native, deterministic | 100% (provider-validated) |
Verdict. This is one of the few tasks where OpenAI is genuinely the right default. GPT-4o-mini in Structured Outputs strict mode is cheap, fast, and 100% schema-compliant by construction. For high-stakes extraction (legal docs, financial line-items), pay up to GPT-4o or Claude Sonnet. Workers AI does this fine for most cases but the 6% retry rate adds latency, and at low volume the OpenAI cost is negligible. This is the 5% where OpenAI earns its keep.
4. Transcription (audio → text)
| Model | Cost | Quality (WER on common benchmarks) |
|---|---|---|
Workers AI @cf/openai/whisper-large-v3-turbo | Free / trivial | ~5.5% |
OpenAI Whisper API (whisper-1) | $0.006/min | ~5.4% |
OpenAI gpt-4o-transcribe | $0.06/min | ~3.8% |
Verdict. Workers AI’s Whisper Large v3 Turbo is the right default for transcription unless you genuinely need the lower error rate of gpt-4o-transcribe (medical, legal, broadcast). At low volume the cost difference is rounding error, but the latency from running in-Worker (no audio upload to OpenAI) is a real win.
The decision framework, in four questions
Before you wire anything, answer these. They take five minutes; they save you a vendor migration.
1. Does the task tolerate a 6% retry rate?
If yes (most cases): default to Workers AI. If no (extraction, financial, legal, anything customer-facing): use OpenAI Structured Outputs strict mode for extraction, Claude tool-use for everything else.
2. Is the answer-quality bottleneck the model, or the retrieval / prompt?
In 9 out of 10 production RAG systems we audit, the bottleneck is retrieval, not the model. Spending $15/M tokens on Opus to compensate for a chunking strategy that’s losing the right context is a $14.85/M waste. Fix retrieval first; you’ll find Workers AI’s 70B Llama is enough.
3. Does the request involve a corpus larger than 200k tokens?
If yes: Claude Opus 4.7 (1M context) is uniquely positioned. Opus 4.7’s 1M-context window is the only practical choice for “summarise this entire codebase / contract / runbook.” OpenAI maxes at 200k. Workers AI Llama maxes at 128k.
4. Does the request need to feel “instant” to the user?
If yes: in-Worker (Workers AI) wins by 200–400 ms over any external API. For interactive UI features (autocomplete, inline summarisation, real-time RAG), this is the difference between feeling instant and feeling laggy.
A routing layer in 60 lines
Since you’ll often want to make this choice per-request rather than per-feature, put a thin router in front of all your AI calls. Below is the exact pattern we ship.
// src/ai/route.ts
import type { Ai } from '@cloudflare/workers-types';
interface Env {
AI: Ai;
ANTHROPIC_API_KEY: string;
OPENAI_API_KEY: string;
}
type Task =
| { kind: 'embed'; text: string | string[] }
| { kind: 'generate'; prompt: string; system?: string; reasoning?: boolean; longContext?: boolean }
| { kind: 'extract'; prompt: string; schema: object }
| { kind: 'transcribe'; audio: ArrayBuffer; precision?: 'standard' | 'high' };
export async function ai(env: Env, task: Task) {
switch (task.kind) {
case 'embed':
// Workers AI for 95% of cases. Free, in-Worker, ~18ms.
return env.AI.run('@cf/baai/bge-base-en-v1.5', { text: task.text });
case 'generate':
if (task.longContext) return claude(env, 'claude-opus-4-7', task);
if (task.reasoning) return claude(env, 'claude-sonnet-4-6', task);
// Default: Workers AI Llama. In-Worker, no external API.
return env.AI.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
messages: [
...(task.system ? [{ role: 'system', content: task.system }] : []),
{ role: 'user', content: task.prompt },
],
});
case 'extract':
// OpenAI Structured Outputs — 100% schema-compliant by construction.
return openaiStructured(env, task);
case 'transcribe':
if (task.precision === 'high') return openaiTranscribe(env, task.audio);
// Default: Workers AI Whisper. In-Worker, no audio upload.
return env.AI.run('@cf/openai/whisper-large-v3-turbo', { audio: [...new Uint8Array(task.audio)] });
}
}
async function claude(env: Env, model: string, task: Extract<Task, { kind: 'generate' }>) {
const res = await fetch('https://api.anthropic.com/v1/messages', {
method: 'POST',
headers: {
'x-api-key': env.ANTHROPIC_API_KEY,
'anthropic-version': '2023-06-01',
'content-type': 'application/json',
},
body: JSON.stringify({
model,
max_tokens: 4096,
system: task.system,
messages: [{ role: 'user', content: task.prompt }],
}),
});
return res.json();
}
async function openaiStructured(env: Env, task: Extract<Task, { kind: 'extract' }>) {
const res = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${env.OPENAI_API_KEY}`,
'content-type': 'application/json',
},
body: JSON.stringify({
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: task.prompt }],
response_format: {
type: 'json_schema',
json_schema: { name: 'extraction', schema: task.schema, strict: true },
},
}),
});
return res.json();
}
async function openaiTranscribe(env: Env, audio: ArrayBuffer) {
const fd = new FormData();
fd.append('file', new Blob([audio], { type: 'audio/wav' }), 'audio.wav');
fd.append('model', 'gpt-4o-transcribe');
const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
method: 'POST',
headers: { 'Authorization': `Bearer ${env.OPENAI_API_KEY}` },
body: fd,
});
return res.json();
}
Now every AI call in your codebase looks like:
const summary = await ai(env, {
kind: 'generate',
prompt: `Summarize the following report:\n\n${report}`,
reasoning: true, // → Claude Sonnet 4.6
});
const fields = await ai(env, {
kind: 'extract',
prompt: `Extract invoice fields:\n\n${invoiceText}`,
schema: invoiceSchema,
// → OpenAI gpt-4o-mini, strict structured output
});
const vectors = await ai(env, {
kind: 'embed',
text: chunks,
// → Workers AI BGE, in-Worker, free
});
Every call routes to the right vendor for the task. Switching model providers in the future is one file. Cost and latency are predictable per task. Each part of the pipeline gets the model it actually needs.
The 80/20 result
Using this router on a recent client engagement (a regulated SaaS with ~600k AI-touching requests/month):
| Task | Vendor | Share of requests | Share of cost |
|---|---|---|---|
| Embeddings | Workers AI | 78% | 0% |
| Default generation | Workers AI | 12% | 0% |
| Reasoning generation | Claude Sonnet 4.6 | 5% | 38% |
| Long-context generation | Claude Opus 4.7 | 1% | 41% |
| Structured extraction | OpenAI GPT-4o-mini | 3% | 6% |
| High-precision transcription | OpenAI gpt-4o-transcribe | 1% | 15% |
90% of requests cost zero (Workers AI). The 10% that hit external APIs are the ones where the better model genuinely matters — and they account for 100% of the bill, which is itself a tenth of what an “OpenAI-only” version of the same product would cost.
This is what “AI engineering” looks like: pick the right model for each task, route at the boundary, don’t pay for capability you don’t need.
When OpenAI is still the right answer
To be fair to OpenAI:
- Structured extraction with strict JSON schema validation. No one else has provider-side schema enforcement that’s actually 100%. Workers AI gets 94% on JSON mode; Claude tool-use gets 99.5%. OpenAI Structured Outputs strict mode is 100%. For extraction-critical work (financial data, legal entities, anything where a single bad parse breaks the system), this is worth paying for.
gpt-4o-transcribefor high-precision audio. When the WER difference between 5.5% and 3.8% matters (medical dictation, legal depositions, broadcast subtitles), Workers AI Whisper isn’t enough.- DALL-E 3 for image generation. Workers AI has Stable Diffusion but DALL-E 3 still produces meaningfully better commercial outputs at the same price tier.
- The OpenAI Realtime API for voice agents. If you’re building voice-in/voice-out interactive features, OpenAI’s realtime offering is the only mature option.
If your product needs any of those: pay OpenAI happily. If it doesn’t: stop reflex-defaulting to GPT-4 because it’s the name everyone knows.
Our actual stance
We default to Workers AI for everything that fits, fall back to Claude when reasoning or long context matters, use OpenAI for structured extraction and high-precision audio. We don’t have a vendor preference — we have a task preference, and the vendor falls out of that. Picking the model is engineering, not branding.
If you’re building a feature where the AI vendor choice is non-obvious and the cost matters, we scope exactly this work — the routing layer, with cost numbers proven against your actual workload — per project, in writing. Write a brief — two paragraphs is enough.
The full RAG implementation that uses this routing pattern is published openly: github.com/setkernel/cf-rag-template. Companion MCP server template: github.com/setkernel/cf-mcp-template. Both are MIT licensed and deploy in under five minutes.