Which AI Model Is Good Enough For What

The real question isn't which model is best, since best almost always means most expensive. It's which model is good enough for a given job, and where paying for a smarter one stops buying you better results.

Twegs
Notes on AI workflows, tools, and automation

Ask which AI model to use and you'll get benchmark links, brand loyalty, and whatever the person answering already runs. None of it maps to your specific task.

The real problem is the question. "Which model is best" treats AI like a phone purchase. Best almost always means most expensive. And for most automation tasks, you're paying 20x for output a reader or a downstream workflow can't tell apart from what a much cheaper model would have produced.

The better question: which model is good enough for this specific job, and at what point does paying more stop buying you better output? That's what this piece is about. Not a ranking. A reasoning method that works even as the version numbers change.

Pricing pulled from provider pages, Artificial Analysis, and llm-stats.com in June 2026. Numbers shift. The framework doesn't.


The mental model: the diminishing-returns curve

Think of any task as having a curve. Cost on the x-axis. Useful quality of output on the y-axis. The curve rises steeply at first. Better model, noticeably better output. Then it flattens.

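[Figure: the diminishing-returns curve. X-axis: model cost. Y-axis: useful quality. Labels: budget, frontier, and a marked flatten point, past which you are "paying more, can't tell the difference."]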

Past that flatten point, you're paying more for output your reader, your workflow, or your customer can't tell apart from the cheaper version. The flatten point isn't in the same place for every task. Classifying a support ticket flattens fast. Reasoning through a multi-step agentic loop does not flatten for a long time.

Here's the heuristic I keep coming back to: If you can write a rule to check whether the output is correct, a cheap model is probably good enough. If correctness needs judgment you can't easily verify, that's where a bigger model earns its price.

Classification: checkable. Translation: mostly checkable. "Is this email warm but professional?" Not easily checkable. "Does this code pass its tests?" Checkable. "Will it hold up under edge cases you haven't tested yet?" Not.

Write that test on a sticky note and put it next to every new task you automate. It predicts your optimal tier better than any benchmark leaderboard, because benchmarks measure what researchers care about measuring, not whether your particular workflow output was good enough to ship.
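The quickest way to run the sticky-note test is to try writing the rule. Here is a minimal sketch for a ticket classifier. The label set and the function name are hypothetical, made up for illustration rather than taken from any real workflow:

```python
from typing import Optional

# Hypothetical label set for a support-ticket classifier (illustrative only).
ALLOWED_LABELS = {"billing", "bug", "feature_request", "account", "other"}


def check_label(model_output: str, expected: Optional[str] = None) -> bool:
    """Return True if the output is exactly one allowed label.

    If a hand-assigned `expected` label is supplied, the output must also match it.
    """
    label = model_output.strip().lower()
    if label not in ALLOWED_LABELS:
        return False
    return expected is None or label == expected


print(check_label("Billing\n"))                   # well-formed label
print(check_label("Probably billing, I think"))   # free text, fails the rule
```

If a rule like this fits in a dozen lines, the task is probably on the checkable side of the line. If you can't get past the function signature, it probably isn't.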


Tiers, not brands

The market has sorted itself into four tiers. Within each tier, brands compete hard. But tier matters more than brand for most decisions.

Frontier models (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Grok 4) earn their price on long-chain reasoning and judgment calls with no clear rubric: in a multi-step agent loop, one bad inference corrupts everything downstream. At $8–30/M output tokens, they're genuinely expensive, so use them where you'd actually notice the upgrade. If you can't articulate what broke when you ran the mid-tier version, you don't need the frontier one.

Mid-tier (Claude Sonnet 4.6, Gemini 3.5 Flash, Grok 4.1 Fast, DeepSeek V3) is where most production volume should land. These handle synthesis, polished copy, bounded code, and structured reasoning at input prices from $0.20 to $3/M. For most quality-sensitive tasks the output is close enough to frontier that you won't notice the difference, and the cost gap is real money at volume.

Budget (Claude Haiku 4.5, Gemini Flash-Lite, GPT-4o mini, Qwen Flash, Mistral Small) handles tasks with checkable outputs: extraction, classification, reformatting, translation, at pennies per million tokens. Running these on a frontier model is paying for output that isn't better.

Open-weight models (Llama 4, Qwen3, DeepSeek V3/R1, Kimi k2) run with no per-token fee on your own hardware, or cheaply through hosted providers like Together, Fireworks, or Groq. The cost math often wins at high volume, though the trade-off is infrastructure overhead and, for some deployments, data residency questions worth talking through with whoever owns your compliance requirements.


What it actually costs: 30 models on one chart

Blended cost = input $/M + 3 × output $/M, that is, the price of 1M input tokens plus 3M output tokens, so output pricing dominates the figure. Log scale. Prices as of June 2026.

[Chart: blended cost by model, log scale, June 2026, color-coded by tier (Budget, Mid-tier, Frontier, Open-weight). Values shown, cheapest first:]

| Model | Blended cost |
| --- | --- |
| Qwen Flash | $0.20 |
| GLM-Z1-Air | $0.24 |
| Ministral 8B | $0.40 |
| GLM-4-Plus | $0.56 |
| Gemini 1.5 Flash | $0.98 |
| Gemini 2.0 Flash-Lite | $0.98 |
| Mistral Small 3 | $1.00 |
| Grok 4.1 Fast | $1.70 |
| Llama 3.3 70B | $1.80 |
| GPT-4o mini | $1.95 |
| Llama 4 Scout | $1.97 |
| Llama 4 Maverick | $2.75 |
| Qwen3 235B | $2.86 |
| DeepSeek V3 | $3.57 |
| Qwen-Plus | $4.00 |
| Qwen-Max | $6.40 |
| DeepSeek R1 | $7.12 |
| Kimi k2 | $7.65 |
| Amazon Nova Pro | $10.40 |
| o4-mini | $14.30 |
| Claude Haiku 4.5 | $16.00 |
| Mistral Large 2 | $20.00 |
| o3 | $26.00 |
| Gemini 3.5 Flash | $28.50 |
| Cohere Command R+ | $32.50 |
| Gemini 3.1 Pro | $38.00 |
| Grok 4 | $48.00 |
| Claude Sonnet 4.6 | $48.00 |
| Claude Opus 4.8 | $80.00 |
| GPT-5.5 | $95.00 |

A few things jump out.

The frontier tier is genuinely expensive: Claude Opus 4.8's blended cost is roughly 400x Qwen Flash's, and GPT-5.5 is higher still at roughly 475x. For any task where a budget model gets it right, that gap is pure waste. Nobody is getting 400x better output.
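If you want to redo this arithmetic for your own shortlist, here is a minimal sketch of the blended-cost formula. The function and variable names are mine, and the per-token prices are the ones listed in the master comparison table later in this piece. Note that the ratios quoted above use the chart's rounded $0.20 figure for Qwen Flash. From the unrounded prices they come out a little higher, at about 410x and 487x.

```python
def blended_cost(input_per_m: float, output_per_m: float) -> float:
    """Blended cost as defined in this article: input $/M plus 3x output $/M."""
    return input_per_m + 3 * output_per_m


# (input $/M, output $/M), copied from the article's master comparison table.
PRICES = {
    "Qwen Flash": (0.015, 0.06),
    "Claude Opus 4.8": (5.0, 25.0),
    "GPT-5.5": (5.0, 30.0),
}

blended = {name: blended_cost(*prices) for name, prices in PRICES.items()}
baseline = blended["Qwen Flash"]

for name, cost in blended.items():
    # Ratio of each model's blended cost to the cheapest model's.
    print(f"{name}: ${cost:.3f} blended, {cost / baseline:.0f}x Qwen Flash")
```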

The mid-tier is crowded, with Gemini 3.5 Flash, Cohere Command R+, DeepSeek V3, and several others competing in overlapping price ranges for often-comparable output quality. Good news if you're building at volume.

DeepSeek R1 also stands out: reasoning-class output at prices that compete with budget tiers from other brands. If data residency isn't a constraint, it's worth a head-to-head test before defaulting to a US-based provider.


The task ladder

Tasks run from "cheapest model is fine" up to "pay for the flagship or expect regrets."

Cheap is enough

These tasks have a right-or-wrong output you can verify with a simple check. Tagging an inbound email with one of six intent categories. Extracting a company name and phone number from a block of text. Translating a product description into French. Reformatting a JSON object into CSV. Summarizing a single document into three bullet points.

In an n8n workflow, this is the node that reads a webhook payload, pulls out the relevant fields, and routes the ticket to the right queue. A Haiku-class or Qwen Flash model does this for cents per thousand calls. The output is either right or it isn't, and you can write a test to check. The upgrade to mid-tier adds nothing here, and the upgrade to frontier borders on absurd. If you're running a frontier model on field extraction, that's money leaving your account every hour for nothing.
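"You can write a test to check" can be taken literally. Below is a rough sketch of such a check for the company-name-and-phone example. The field names, the phone pattern, and the function name are all assumptions for illustration. A real workflow would use whatever schema its extraction node emits.

```python
import re

# Loose, illustrative phone pattern: optional "+", then 7-20 digits/separators.
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,20}$")


def check_extraction(source_text: str, extracted: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the output passed."""
    problems = []
    company = str(extracted.get("company") or "")
    phone = str(extracted.get("phone") or "")

    # An extracted name that never appears in the source is likely invented.
    if not company or company not in source_text:
        problems.append("company missing or not found in source text")
    if not PHONE_RE.match(phone):
        problems.append("phone missing or malformed")
    return problems


source = "Call Acme Corp at +1 (555) 010-2030 for details."
print(check_extraction(source, {"company": "Acme Corp", "phone": "+1 (555) 010-2030"}))
print(check_extraction(source, {"company": "Acme Inc", "phone": "n/a"}))
```

The substring check is deliberately crude. It exists to catch a made-up company name, the failure mode that matters most on this rung.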

The thing people miss: this rung covers more tasks than most people think. A lot of "AI strategy" inside companies is really just over-engineering cheap lookups.

The mid-tier sweet spot

Now the task needs bounded judgment. Synthesizing three documents into a coherent summary that someone will read. Writing a first draft of a customer email that needs to sound like a person wrote it. Code that solves a clear problem with test coverage. Deciding whether a news story is relevant to your product category. Structuring research notes into a format that matches a template.

This is the n8n node that reads a competitor's blog post and a changelog and produces a PMM update. You can read the output in 30 seconds and know if it's good. You can't write a binary test for "sounds like a person," but the quality floor is obvious when it fails. Claude Sonnet 4.6 or Gemini 3.5 Flash handles this well, at a fraction of frontier cost. And if you're running high volume, DeepSeek V3 at $0.27/M input is worth profiling seriously.

The sweet spot here is recognizing that "bounded" is doing a lot of work. If you have a clear success criterion and the reasoning chain is short, mid-tier is almost always enough.

Pay for the flagship

Chained reasoning where step two depends on step one, and step three depends on step two. Judgment calls with no rubric: is this contract clause actually problematic? Is this competitor move a real threat or noise? Long-horizon agentic loops that need to stay coherent across many tool calls. Code that has to be correct in edge cases you haven't written tests for yet.

A concrete example: an agent that reviews a contract, flags risk clauses, cross-references your standard terms, and drafts a redline. Every step depends on the previous one. A hallucinated clause interpretation at step two corrupts the rest of the output. You might not catch it until it's in front of a lawyer. Frontier models aren't immune to errors, but they're better at flagging uncertainty, which matters more than raw accuracy on the straightforward tasks.

The myth worth puncturing: the newest model is not always the right default. A well-prompted Claude Sonnet often beats a poorly-specified Claude Opus. The prompt and the task structure matter as much as the model tier. Start by getting the task right, then decide whether you need more model.

| Task class | Tier | Why |
| --- | --- | --- |
| Classification, tagging, routing | Budget | Checkable output; cost scales with volume |
| Extraction from structured text | Budget | Deterministic; errors are visible |
| Translation, reformatting | Budget | General domain; rare edge cases acceptable |
| Single-document summarization | Budget to Mid | Compression is easy; synthesis is harder |
| Multi-source synthesis | Mid | Judgment required; output readable in 30s |
| Polished customer-facing copy | Mid | Quality floor matters; not safety-critical |
| Code with test coverage | Mid | Works for bounded, checkable problems |
| Routing with ambiguous signals | Mid | Soft classification; quality matters |
| Long-chain reasoning | Frontier | One bad step poisons everything downstream |
| Ambiguous judgment calls | Frontier | Correctness can't be checked against a rule |
| Long-horizon agentic loops | Frontier | Coherence across many tool calls |
| Safety-critical code | Frontier | Can't test every production edge case |

What this looks like in marketing

Marketing teams run a wide range of AI tasks. The mistake most make is treating them all the same: defaulting to frontier across the board when that's expensive and usually overkill, or going budget everywhere and hitting quality problems exactly where they hurt.

Five real workflows, and which tier fits each.

Lead enrichment and tagging. You have a list of inbound leads. You want each one categorized by industry, company size, and likely use case based on the form fill and company website. This is classification against structured inputs with a checkable output; sample 20 rows and you'll see immediately if the tagging is right. GPT-4o mini or Gemini 2.0 Flash-Lite handle it at a fraction of a cent per lead.

Outreach email drafts. Personalized cold emails based on a lead's role, company, and recent news. The output is customer-facing, and a bad draft either gets ignored or damages the brand, but this isn't hard judgment; it's polished prose from defined inputs. Claude Sonnet 4.6 or Gemini 3.5 Flash handles it, and the cost per email is low enough that volume isn't a concern at normal outreach scale.

Competitive intelligence brief. You drop in five analyst reports, three competitor blog posts, and a batch of customer call transcripts. You want a brief that synthesizes positioning shifts and flags anything that should change your messaging. This is multi-source synthesis, and the hard part is figuring out what to act on. Gemini 3.5 Flash or Mistral Large 2 handles it well: the document set rarely exceeds what mid-tier context windows can hold, and a human reviews the output before it changes anything.

Strategic threat assessment. New entrant in the market. You need an analysis of how they compare to your positioning, which customer segments they're likely to pull, and what your response options are. The inputs are ambiguous, the reasoning chain is long, and a wrong conclusion has real business consequences. Use Claude Opus 4.8 or Gemini 3.1 Pro — you're running this once or twice a quarter, so the cost is irrelevant.

SEO content at volume. You're generating 50 product description variants, 200 FAQ answers, or a batch of location pages. The output is templated, the quality bar is "good enough not to hurt rankings," and the volume makes cost the main factor, so budget tier is right: Gemini 2.0 Flash-Lite or Qwen Flash. Run quality checks on a sample rather than paying for per-token quality you won't notice at this scale.
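Sampled QA doesn't need much machinery. Here is a minimal sketch: pull a random sample, run a check on each item, report the pass rate. The default sample size of 20 echoes the lead-tagging example above. The `looks_publishable` check is a hypothetical stand-in for whatever rule fits your content.

```python
import random
from typing import Callable, Sequence


def sample_pass_rate(
    outputs: Sequence[str],
    check: Callable[[str], bool],
    sample_size: int = 20,
    seed: int = 0,
) -> float:
    """Check a random sample of outputs and return the fraction that pass."""
    if not outputs:
        raise ValueError("no outputs to sample")
    rng = random.Random(seed)  # fixed seed so a QA run is repeatable
    sample = rng.sample(list(outputs), min(sample_size, len(outputs)))
    return sum(1 for item in sample if check(item)) / len(sample)


def looks_publishable(text: str) -> bool:
    """Hypothetical quality floor: long enough, no leftover placeholder text."""
    lowered = text.lower()
    return len(text) >= 20 and "todo" not in lowered and "lorem ipsum" not in lowered


pages = ["A short but complete product description."] * 200
print(sample_pass_rate(pages, looks_publishable))
```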

Marketing workflow decision table

| Workflow | Tier | Why | Estimated cost |
| --- | --- | --- | --- |
| Lead enrichment / tagging | Budget | Checkable output, high volume | ~$0.10 per 1,000 leads |
| Outreach email drafts | Mid | Customer-facing, quality floor matters | ~$0.70 per 100 emails |
| Competitive intelligence brief | Mid | Synthesis, reviewed by human | ~$0.04 per brief |
| Strategic threat assessment | Frontier | Ambiguous, long chain, high stakes | ~$0.40 per report |
| SEO content at volume | Budget | Templated, cost-sensitive, sampled QA | ~$0.02 per 100 pages |

Volume tasks go budget, quality-sensitive drafting goes mid, and judgment calls with real consequences go frontier. Where teams overspend is running mid-tier models on tagging jobs. Where they underspend is using budget models on customer-facing copy or strategic analysis where errors don't show up until something goes wrong.


The master comparison table

Input and output prices per million tokens. Open-weight models show typical hosted-provider rates, not self-hosting infrastructure costs. June 2026.

| Model | Brand | Tier | Best for | Input $/M | Output $/M | Frays at |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.8 | Anthropic | Frontier | Multi-step agents, judgment calls | $5 | $25 | Anything with a checkable output |
| GPT-5.5 | OpenAI | Frontier | General reasoning, coding | $5 | $30 | Structured extraction, simple summaries |
| Grok 4 | xAI | Frontier | Reasoning, long context | $3 | $15 | High-volume commodity tasks |
| Gemini 3.1 Pro | Google | Frontier | Long context, multimodal, code | $2 | $12 | Short extraction and routing tasks |
| o3 | OpenAI | Frontier | Hard math, chain-of-thought | $2 | $8 | Standard drafting, tagging |
| Claude Sonnet 4.6 | Anthropic | Mid | Code, synthesis, copy | $3 | $15 | Classification, basic reformatting |
| Gemini 3.5 Flash | Google | Mid | Speed-quality balance, 1M context | $1.50 | $9 | Safety-critical judgment |
| Cohere Command R+ | Cohere | Mid | RAG workloads, retrieval | $2.50 | $10 | High-volume budget tasks |
| Mistral Large 2 | Mistral | Mid | EU data residency, code | $2 | $6 | High-volume budget tasks |
| o4-mini | OpenAI | Mid | Cheap reasoning tasks | $1.10 | $4.40 | Non-reasoning extraction |
| Qwen-Max | Alibaba | Mid | Code, structured output | $1.60 | $1.60 | Ambiguous judgment |
| Amazon Nova Pro | Amazon | Mid | AWS-native, enterprise | $0.80 | $3.20 | Complex reasoning chains |
| DeepSeek R1 | DeepSeek | Mid | Reasoning at mid-tier prices | $0.55 | $2.19 | Tasks needing no reasoning chain |
| Grok 4.1 Fast | xAI | Mid | Fast, cheap, 2M context | $0.20 | $0.50 | Complex agentic loops |
| DeepSeek V3 | DeepSeek | Mid | General tasks, remarkable value | $0.27 | $1.10 | Frontier-level judgment |
| GLM-4-Plus | Z.AI | Mid | Code, Chinese-market tasks | $0.14 | $0.14 | Nuanced cross-cultural judgment |
| Kimi k2 | Moonshot | Open | Code, math, long context | $0.15 | $2.50 | Unstructured creative tasks |
| Qwen3 235B | Alibaba | Open | Open-weight flagship (MoE) | $0.22 | $0.88 | Same ceiling as hosted frontier |
| Llama 4 Maverick | Meta | Open | General open-weight flagship | $0.20 | $0.85 | Long-horizon agents |
| Llama 4 Scout | Meta | Open | Long context (10M tokens) | $0.17 | $0.60 | Judgment-heavy tasks |
| Llama 3.3 70B | Meta | Open | Established, widely supported | $0.18 | $0.54 | Frontier-level tasks |
| Claude Haiku 4.5 | Anthropic | Budget | Fast extraction, classification | $1 | $5 | Multi-step reasoning |
| GPT-4o mini | OpenAI | Budget | Volume tasks with quality floor | $0.15 | $0.60 | Long-chain reasoning |
| Qwen-Plus | Alibaba | Budget | General budget tier | $0.40 | $1.20 | Complex reasoning |
| Mistral Small 3 | Mistral | Budget | EU budget, fast | $0.10 | $0.30 | Judgment-heavy tasks |
| Ministral 8B | Mistral | Budget | Edge/embedded, ultra-cheap | $0.10 | $0.10 | Any quality-sensitive task |
| Gemini 1.5 Flash | Google | Budget | Prior-gen, stable price | $0.075 | $0.30 | Anything needing nuance |
| Gemini 2.0 Flash-Lite | Google | Budget | Ultra-fast, ultra-cheap | $0.075 | $0.30 | Any synthesis task |
| GLM-Z1-Air | Z.AI | Budget | Budget thinking variant | $0.06 | $0.06 | Complex chains |
| Qwen Flash | Alibaba | Budget | Commodity tasks, near-zero cost | $0.015 | $0.06 | Any quality-sensitive task |

The budget creep problem

Nobody tells you this when a new model generation drops: "budget" is a moving target, and it has been creeping upward for two years.

Claude Haiku 4.5 costs $1/M input, meaningfully more than the original Haiku. Gemini Flash is cheap, but Flash with thinking mode enabled is not. When providers upgrade their budget tiers, the marketing says "same price, smarter model," and sometimes that's true; other times the price quietly moved, or the cheap option now has a thinking surcharge that kicks in on the tasks you actually care about.

The counter-move: don't auto-upgrade budget nodes. A prior-generation model on a stable, calibrated workflow is often better than a new model you haven't profiled against your actual distribution of inputs. Gemini 1.5 Flash still exists. If your extraction node is working, there's no reason to move it until you've run a side-by-side comparison on real examples from your queue, not a benchmark from a lab.
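A side-by-side comparison can be a short script. In this sketch, `current` and `candidate` are hypothetical callables that wrap whatever API calls your workflow makes, and `examples` is a list of real inputs with the answers you already trust. None of it is tied to a particular provider SDK.

```python
from typing import Callable, Sequence, Tuple


def side_by_side(
    examples: Sequence[Tuple[str, str]],
    current: Callable[[str], str],
    candidate: Callable[[str], str],
) -> dict:
    """Run both models over (input, expected) pairs and summarize the results."""
    if not examples:
        raise ValueError("need at least one labeled example")
    current_hits = 0
    candidate_hits = 0
    regressions = []  # inputs the current model gets right and the candidate gets wrong
    for text, expected in examples:
        current_ok = current(text) == expected
        candidate_ok = candidate(text) == expected
        current_hits += current_ok
        candidate_hits += candidate_ok
        if current_ok and not candidate_ok:
            regressions.append(text)
    total = len(examples)
    return {
        "current_accuracy": current_hits / total,
        "candidate_accuracy": candidate_hits / total,
        "regressions": regressions,
    }
```

The `regressions` list is the part to read first. Two models can post the same accuracy and still fail on different inputs, and it's the newly broken cases that bite a workflow you had already calibrated.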

Run the math at real volume. Say your workflow processes 10M tokens a day, split 7M input and 3M output.

At Gemini 2.0 Flash-Lite: 7 × $0.075 + 3 × $0.30 = $1.43/day, about $43/month.

At Claude Sonnet 4.6: 7 × $3 + 3 × $15 = $66/day, about $1,980/month.

For a classification task, that $1,937/month gap is cost with no return. Nobody in your company would notice if you switched. For a multi-step reasoning task where Sonnet 4.6 actually produces better output, the math might justify it. But you have to run the comparison first, on your data, not on someone else's benchmark.
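The same arithmetic as a small sketch you can rerun with your own volumes. The function name is mine, the prices are the ones quoted above, and a 30-day month is assumed, which is what the monthly figures above imply.

```python
def daily_cost(input_m: float, output_m: float, input_price: float, output_price: float) -> float:
    """Daily spend in dollars: token volumes in millions times price per million."""
    return input_m * input_price + output_m * output_price


DAYS_PER_MONTH = 30  # assumption, matching the monthly figures in the text

# 10M tokens a day, split 7M input and 3M output.
flash_lite_daily = daily_cost(7, 3, 0.075, 0.30)
sonnet_daily = daily_cost(7, 3, 3.0, 15.0)
monthly_gap = (sonnet_daily - flash_lite_daily) * DAYS_PER_MONTH

print(f"Flash-Lite: ${flash_lite_daily:.2f}/day")
print(f"Sonnet 4.6: ${sonnet_daily:.2f}/day")
print(f"Monthly gap: ${monthly_gap:,.0f}")
```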


A decision shortcut

For any new automation task, run through this before picking a model:

  • Can you write a rule to check whether the output is correct? Yes: start at budget tier.
  • Does the task synthesize multiple sources, or generate copy a person will read and judge? Yes: move to mid.
  • Does the output chain into other steps where one wrong answer corrupts everything that follows? Yes: frontier.
  • Is the volume high? Yes: pressure-test the next tier down at scale before committing. The quality difference often doesn't survive contact with real inputs.
  • Data residency or compliance constraint? Yes: that narrows the field first, then apply the above.

Start cheap, test on real inputs, and upgrade only when you can name the specific failure mode you're fixing.
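The checklist above can be written down as a few lines of logic. This is one possible reading, not a definitive rule engine. The class and field names are invented for the sketch, and the fallback for a task that answers "no" to everything is my assumption, since the checklist doesn't cover that case. The compliance question is left out because it filters the list of candidate models rather than picking a tier.

```python
from dataclasses import dataclass

TIERS = ["budget", "mid", "frontier"]


@dataclass
class Task:
    rule_checkable: bool = False            # can a rule verify the output?
    synthesis_or_judged_copy: bool = False  # multi-source synthesis or copy a person will judge
    feeds_later_steps: bool = False         # one wrong answer corrupts what follows
    high_volume: bool = False


def starting_tier(task: Task) -> str:
    """Pick a starting tier; later checklist items override earlier ones."""
    if task.feeds_later_steps:
        return "frontier"
    if task.synthesis_or_judged_copy:
        return "mid"
    if task.rule_checkable:
        return "budget"
    # Assumption: uncheckable output with no other signal starts at mid.
    return "mid"


def tiers_to_test(task: Task) -> list[str]:
    """At high volume, also pressure-test the next tier down."""
    tier = starting_tier(task)
    index = TIERS.index(tier)
    if task.high_volume and index > 0:
        return [TIERS[index - 1], tier]
    return [tier]


print(tiers_to_test(Task(synthesis_or_judged_copy=True, high_volume=True)))
```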


The skill that doesn't expire

Model versions change every few months. Prices shift. New entrants show up with a better cost story. Any article that just says "use model X" is out of date before you finish reading it.

The curve and the heuristic don't go stale. If you can verify the output against a rule, you probably don't need to pay for judgment. If you can't verify it, you probably do. That test still works when half the models in this table have been replaced by successors with different names and version numbers.

Use the table to get oriented today. Use the mental model to stay calibrated next quarter.
