How to Read LLM API Pricing Pages Without Getting Burned
Last updated: 2026-05-21 · 9 min read
Part of the toksum.dev Guides series.
Why pricing pages are confusing by design
LLM API pricing pages are not written for engineers making production cost decisions. They are marketing documents first, reference material second. The headline number — typically the input price for the flagship model — is chosen to look competitive. The details that actually determine your bill are buried in footnotes, separate tabs, or in some cases not on the page at all. Understanding the structure of these pages is a prerequisite to building reliable cost models for your product.
The single most important thing to internalize before reading any pricing page is that LLM costs are not uniform. There are at least four distinct rates that may apply to a single API request: standard input, standard output, cached input reads, and cached input writes. On top of that, batch APIs introduce a separate rate schedule. If you average these together or pick the wrong rate for your workload, your cost estimate will be wrong by a factor of two to five — not by a few percent.
This guide walks through each pricing dimension you are likely to encounter, explains the math clearly, and gives you a checklist for verifying any provider's pricing before you commit budget to a model.
Per-1K vs per-1M: the unit convention shift
Until approximately 2023, most providers (including OpenAI) quoted prices per 1,000 tokens. The industry has since moved to per 1,000,000 tokens (per-1M) as the standard unit, because the per-1K numbers had become small enough to be confusing: $0.002 per 1K tokens is harder to reason about than $2.00 per 1M tokens. The math is identical — multiply your token count by the rate and divide by the appropriate unit — but mixing up the unit in application code produces a 1,000× billing error. Always check which unit a pricing page is using. The safest approach is to normalize everything to per-1M tokens in your internal cost model before doing any arithmetic.
Some providers still quote certain tiers or older models in per-1K format, particularly in legacy documentation pages that have not been updated. If you see a rate like "$0.0015" with no unit qualifier, assume per-1K and verify before using it. Rate confusions in billing code are one of the most common sources of dramatic over- or under-budget surprises at invoice time.
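As a minimal sketch of the normalization step described above, the helper below converts any quoted rate to USD per 1M tokens before cost arithmetic happens. The function names and unit labels (`per_1k`, `per_1m`) are illustrative assumptions, not part of any provider SDK.

```python
# Hypothetical helper names; unit labels are an assumed internal convention.
TOKENS_PER_UNIT = {"per_1k": 1_000, "per_1m": 1_000_000}


def to_per_million(rate_usd: float, unit: str) -> float:
    """Convert a quoted rate to USD per 1M tokens."""
    if unit not in TOKENS_PER_UNIT:
        raise ValueError(f"unknown pricing unit: {unit!r}")
    return rate_usd * 1_000_000 / TOKENS_PER_UNIT[unit]


def cost_usd(tokens: int, rate_per_million: float) -> float:
    """Cost of `tokens` tokens at a rate already normalized to per-1M."""
    return tokens * rate_per_million / 1_000_000


# $0.002 per 1K tokens and $2.00 per 1M tokens are the same price.
legacy = to_per_million(0.002, "per_1k")
modern = to_per_million(2.00, "per_1m")
print(f"legacy={legacy:.2f} modern={modern:.2f}")
```

Rejecting unknown units outright, rather than defaulting to one, is a deliberate choice here: a silent default is exactly how a 1,000× error slips into billing code.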
Input vs output asymmetry: the number that controls your bill
Every major LLM API charges significantly more for output tokens than input tokens. The ratio varies by provider and model, but the typical range is 3× to 5× more expensive for output. For GPT-4o, the standard rates are $2.50/1M input and $10.00/1M output, a 4× ratio. For Claude 3.5 Sonnet, it is $3.00/1M input and $15.00/1M output, a 5× ratio. For Gemini 1.5 Pro on prompts longer than 128K tokens, the ratio is similar.
This asymmetry has a direct and large effect on your cost model depending on your workload type. A summarization product that takes a 10,000-token document and returns a 200-token summary has a 50:1 input-to-output token ratio: your input cost dominates, and optimizing prompt compression pays off more than reducing output length. A code generation product that takes a 500-token specification and returns 3,000 tokens of code has a 1:6 token ratio. Because each output token also costs 4× to 5× more, the output cost per request is roughly 24× to 30× the input cost at the rates above, and the main lever for cost reduction is constraining output verbosity or using a cheaper model for generation.
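The two workloads can be priced side by side with a few lines of arithmetic. This is a sketch using the GPT-4o standard rates quoted earlier; the `Rates` class and `request_cost` function are hypothetical names introduced for illustration. Note that the cost ratio differs from the token ratio, because output tokens carry the higher rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens. Hypothetical container, not a provider type."""
    input_per_m: float
    output_per_m: float


def request_cost(rates: Rates, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD for one request."""
    input_cost = input_tokens * rates.input_per_m / 1_000_000
    output_cost = output_tokens * rates.output_per_m / 1_000_000
    return input_cost, output_cost


GPT_4O = Rates(input_per_m=2.50, output_per_m=10.00)

workloads = {
    "summarization": (10_000, 200),    # 50:1 input-to-output tokens
    "code generation": (500, 3_000),   # 1:6 input-to-output tokens
}

for name, (inp, out) in workloads.items():
    input_cost, output_cost = request_cost(GPT_4O, inp, out)
    print(f"{name}: input ${input_cost:.5f}, output ${output_cost:.5f}")
```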
When you see a single "price" quoted for a model on a comparison site or in a blog post, ask immediately: is that the input price, the output price, or some blended average? Blended averages are nearly useless for cost planning because the input-to-output ratio varies by one or two orders of magnitude across different use cases. Always model input and output separately. Use our token counter to get accurate counts before building your cost model.
A practical rule of thumb: for most chat and assistant workloads, output tokens account for 60–80% of the total bill even though they represent a smaller fraction of total token volume. This is counterintuitive until you work through the math, but it means that any optimization targeting output — shorter responses, streaming with early termination, response caching at the application layer — tends to deliver more savings than input optimizations.
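To see the rule of thumb in numbers, the sketch below computes the output share of the bill for a single chat turn. The token counts (1,000 in, 500 out) are an assumed example rather than measured data, and the function name is hypothetical.

```python
def output_share_of_bill(
    input_tokens: int,
    output_tokens: int,
    input_per_m: float,
    output_per_m: float,
) -> float:
    """Fraction of a request's cost that comes from output tokens."""
    input_cost = input_tokens * input_per_m / 1_000_000
    output_cost = output_tokens * output_per_m / 1_000_000
    total = input_cost + output_cost
    if total == 0:
        raise ValueError("request has no billable tokens")
    return output_cost / total


# Assumed chat turn at GPT-4o standard rates: output is one third of the
# tokens but about two thirds of the cost.
share = output_share_of_bill(1_000, 500, input_per_m=2.50, output_per_m=10.00)
print(f"output share of bill: {share:.0%}")
```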
Prompt caching: the biggest line item most teams miss
Prompt caching is a mechanism where the provider stores the key-value (KV) computations for a portion of your input prompt, so that subsequent requests reusing that same prefix do not need to recompute it. The savings are dramatic: Anthropic charges $3.75/1M for cache writes but only $0.30/1M for cache reads — a 90% reduction compared to the $3.00/1M standard input rate. OpenAI's automatic caching charges $1.25/1M for cache-hit input tokens, down from $2.50/1M standard — a 50% reduction.
The catch is that prompt caching is invisible on most pricing pages unless you know to look for it. Anthropic's pricing page has a dedicated section for it under each model's rate card. OpenAI calls it "cached input" and it appears as a separate line in API usage responses. Google's Gemini API calls the feature "context caching" and charges a storage fee per cached token per hour in addition to the reduced read rate. Each provider has different rules about minimum cache size, cache lifetime, and whether caching is explicit (Anthropic, Google) or automatic (OpenAI).
For any workload with a large, reusable system prompt (a customer support bot with a knowledge base, a coding assistant with a codebase context, a document Q&A system with a fixed corpus), prompt caching can be the largest single cost lever available. A 50,000-token system prompt reused 1,000 times per day costs $150/day in standard Anthropic input tokens. With caching, and assuming requests arrive often enough that the cache never expires, the same workload costs one write ($0.1875) plus 999 reads ($14.99) per day: roughly $15/day instead of $150/day. That is a difference of about $4,000/month from a single API parameter change. If traffic is bursty and the cache expires between requests, each expiry adds another write at the higher write rate. See our dedicated guide on prompt caching ROI for the full worked math.
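The worked example above can be reproduced with the sketch below, which uses the Anthropic rates quoted in this section. The `writes_per_day` parameter is an assumption added for illustration: it models how many times per day the cache has to be rebuilt, and the best case is a single write.

```python
def daily_cost_uncached(prompt_tokens: int, requests_per_day: int, input_per_m: float) -> float:
    """Daily cost of resending the shared prefix at the standard input rate."""
    return prompt_tokens * requests_per_day * input_per_m / 1_000_000


def daily_cost_cached(
    prompt_tokens: int,
    requests_per_day: int,
    write_per_m: float,
    read_per_m: float,
    writes_per_day: int = 1,  # assumption: cache stays warm all day
) -> float:
    """Daily cost when the prefix is cached: some writes, the rest reads."""
    if not 0 < writes_per_day <= requests_per_day:
        raise ValueError("writes_per_day must be between 1 and requests_per_day")
    reads = requests_per_day - writes_per_day
    write_cost = prompt_tokens * writes_per_day * write_per_m / 1_000_000
    read_cost = prompt_tokens * reads * read_per_m / 1_000_000
    return write_cost + read_cost


uncached = daily_cost_uncached(50_000, 1_000, input_per_m=3.00)
cached = daily_cost_cached(50_000, 1_000, write_per_m=3.75, read_per_m=0.30)
print(f"uncached ${uncached:.2f}/day, cached ${cached:.2f}/day")
print(f"30-day difference: ${(uncached - cached) * 30:.2f}")
```

Raising `writes_per_day` shows how quickly the savings erode when the cache expires between bursts of traffic.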
When reading a pricing page, always check: (1) does the provider offer prompt caching? (2) is it explicit or automatic? (3) what is the cache lifetime? (4) what is the minimum cacheable prefix size? Missing any of these details means you may be paying standard input rates when cache reads were available.
Batch discounts and where to find them
All three major providers — OpenAI, Anthropic, and Google — offer a batch processing API that cuts both input and output costs by 50% in exchange for accepting up to 24-hour turnaround time. This discount is straightforwardly named the "Batch API" or "Batch mode" on their respective pricing pages, but it is often listed on a separate tab or section rather than alongside the standard rates for each model.
OpenAI's batch pricing appears on the main pricing page in a toggle that switches between "Standard" and "Batch" columns. Anthropic's batch pricing is listed in a separate section of the model's rate card, below the standard and cache rates. Google's batch pricing for Gemini appears under the "Batch" tab on the AI Studio pricing page. If you are comparing models across providers and only looking at the first table you see, you are very likely missing the batch rates entirely.
The batch discount applies to the same models as the standard API — you are not being routed to a smaller or lower-quality model. The only difference is SLA: results are guaranteed within 24 hours instead of seconds. For workloads that are genuinely asynchronous — nightly analytics, bulk embedding generation, content classification, data labeling — the batch API is a straightforward 50% cost reduction with no quality tradeoff. See our Batch API guide for a detailed breakdown of when to use it.
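In practice only part of a workload can tolerate batch turnaround. The sketch below estimates the blended bill when a given fraction of spend moves to batch at the 50% discount described above. The function name and the example figures are assumptions for illustration.

```python
BATCH_DISCOUNT = 0.50  # the 50% reduction described in this section


def cost_with_batch(standard_cost_usd: float, batch_fraction: float) -> float:
    """Bill after moving `batch_fraction` of spend (0.0 to 1.0) to batch."""
    if not 0.0 <= batch_fraction <= 1.0:
        raise ValueError("batch_fraction must be between 0 and 1")
    realtime_part = standard_cost_usd * (1.0 - batch_fraction)
    batch_part = standard_cost_usd * batch_fraction * (1.0 - BATCH_DISCOUNT)
    return realtime_part + batch_part


# Assumed example: $1,000/month at standard rates, 40% of it asynchronous.
print(f"${cost_with_batch(1_000.0, 0.40):.2f}")
```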
Hidden fees: rate limits, tier upgrades, and per-character quirks
The rates on a pricing page are not the only costs associated with using a provider. There are several categories of hidden or semi-hidden costs that regularly surprise engineering teams at scale.
Rate limit tier upgrades. All providers enforce rate limits measured in requests per minute (RPM) and tokens per minute (TPM). When your workload saturates these limits, your options are to wait, implement retry-with-backoff logic, or upgrade to a higher usage tier. Higher usage tiers often require a minimum monthly spend commitment (typically $50–$500 depending on the provider and tier). This commitment is not visible on the main pricing page; you find it when you hit the rate limit wall and read the upgrade documentation. Budget for tier upgrade costs if your workload has bursty traffic patterns.
Per-character pricing on older Google endpoints. Some older Gemini API documentation and certain Google Cloud Vertex AI endpoints express pricing in per-1K characters rather than per-1M tokens. Since the average English word is about 5 characters and roughly 1.3 tokens (a token is about 0.75 words), the conversion factor is not obvious. If you see a per-character rate that looks very small (like $0.000125), verify whether you are using the right unit before coding that into a cost estimator.
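A rough conversion from a per-1K-character rate to a per-1M-token rate might look like the sketch below. The characters-per-token figure is an assumption the caller must supply. The value of 4.0 used in the example is only a coarse approximation for English text, so measure it with the provider's tokenizer on your own data before relying on the result.

```python
def per_1k_chars_to_per_1m_tokens(rate_per_1k_chars: float, chars_per_token: float) -> float:
    """Convert USD per 1K characters to an approximate USD per 1M tokens."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    rate_per_char = rate_per_1k_chars / 1_000
    rate_per_token = rate_per_char * chars_per_token
    return rate_per_token * 1_000_000


# Assumed 4.0 characters per token; a measured value will differ by
# language and content type.
approx = per_1k_chars_to_per_1m_tokens(0.000125, chars_per_token=4.0)
print(f"about ${approx:.2f} per 1M tokens")
```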
System prompt tokens. Every API request includes not just the user message but the system prompt, conversation history, and any injected tool schemas. These all count as input tokens and appear in your bill, but they are invisible in your application's user-facing token count. A system prompt of 2,000 tokens added to every request costs about $5/day, or roughly $150/month, per 1,000 daily requests at GPT-4o standard rates. That is large enough to matter, and it is easy to forget when setting budget alerts based on user-message length alone.
Tool/function schema tokens. When you use function calling or tool use, the tool definitions you send in each request count as input tokens. A complex tool schema with five function definitions and detailed parameter descriptions can add 500–1,500 tokens to every request. At scale, this is a meaningful cost that does not appear in your prompt's character count.
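Fixed per-request overhead of this kind is easy to price once you know its token count. The sketch below is a minimal estimate assuming a 30-day month; the function name and the 1,000-token schema in the example are illustrative assumptions.

```python
def monthly_overhead_cost(
    overhead_tokens_per_request: int,
    requests_per_day: int,
    input_per_m: float,
    days: int = 30,  # assumed billing month length
) -> float:
    """Monthly cost of tokens attached to every request (system prompt, tool schemas)."""
    monthly_tokens = overhead_tokens_per_request * requests_per_day * days
    return monthly_tokens * input_per_m / 1_000_000


# Assumed example: a 1,000-token tool schema on 1,000 requests per day
# at the GPT-4o standard input rate of $2.50/1M.
print(f"${monthly_overhead_cost(1_000, 1_000, input_per_m=2.50):.2f}/month")
```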
How to verify a price changed without redeploying
LLM providers change prices with varying frequency and notice periods. OpenAI has historically given 30-day notice for price increases but has also introduced new pricing tiers mid-month. Anthropic has periodically revised rates as models evolve. Hard-coding prices in application logic is therefore a reliability and accuracy risk: your billing estimates become stale without any code change, and you may not notice until a monthly invoice surprises you.
The more robust approach is to store prices in a configuration layer that can be updated without a code deploy. Options include a database table with provider, model, rate type, and value columns; a config file fetched from a CDN or object store at startup; or a third-party pricing API like the one powering this site. Your application code reads the rate at runtime rather than baking it in as a constant.
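One way to sketch the config-file option: rates live in a JSON document keyed by provider, model, and rate type, and application code looks them up at runtime. The schema, field names, and function names below are assumptions for illustration, and the JSON is inlined only to keep the example self-contained. In a real system it would be loaded from wherever the config is stored.

```python
import json

# Assumed schema. The values are the GPT-4o rates quoted in this guide.
PRICING_JSON = """
[
  {"provider": "openai", "model": "gpt-4o", "rate_type": "input", "usd_per_million": 2.50},
  {"provider": "openai", "model": "gpt-4o", "rate_type": "cached_input", "usd_per_million": 1.25},
  {"provider": "openai", "model": "gpt-4o", "rate_type": "output", "usd_per_million": 10.00}
]
"""

RateTable = dict[tuple[str, str, str], float]


def load_rates(raw: str) -> RateTable:
    """Parse the pricing document into a lookup table."""
    return {
        (row["provider"], row["model"], row["rate_type"]): float(row["usd_per_million"])
        for row in json.loads(raw)
    }


def lookup_rate(rates: RateTable, provider: str, model: str, rate_type: str) -> float:
    """Fetch a rate, failing loudly instead of guessing a default."""
    key = (provider, model, rate_type)
    if key not in rates:
        raise LookupError(f"no rate configured for {key}")
    return rates[key]


rates = load_rates(PRICING_JSON)
print(lookup_rate(rates, "openai", "gpt-4o", "output"))
```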
On top of that, build a monitoring job that runs nightly and compares the live provider pricing page (or API, if available) against your stored rates. When a rate changes, the monitor fires an alert to your billing or engineering Slack channel. This closes the loop: you find out about price changes within 24 hours rather than at invoice time. The monitor does not need to be sophisticated — a simple HTTP fetch of a structured pricing endpoint, diffed against yesterday's snapshot, is sufficient for most teams.
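The diff step of such a monitor can be very small. The sketch below compares two snapshots and returns human-readable change lines. Fetching the live rates and posting to Slack are left out because they depend on your stack. The snapshot shape (a flat mapping from a rate key to USD per 1M tokens) and the key format are assumptions.

```python
def diff_rates(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """Describe every rate that was added, removed, or changed."""
    changes = []
    for key in sorted(previous.keys() | current.keys()):
        old, new = previous.get(key), current.get(key)
        if old == new:
            continue
        if old is None:
            changes.append(f"{key}: added at {new}")
        elif new is None:
            changes.append(f"{key}: removed (was {old})")
        else:
            changes.append(f"{key}: {old} -> {new}")
    return changes


# Hypothetical snapshots; the changed value is made up for the example.
yesterday = {"openai/gpt-4o/input": 2.5, "openai/gpt-4o/output": 10.0}
today = {"openai/gpt-4o/input": 2.5, "openai/gpt-4o/output": 12.0}

for line in diff_rates(yesterday, today):
    print(line)
```

An empty list means nothing changed, so the nightly job can alert only when the result is non-empty.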
For quick sanity checks on current model pricing without writing any monitoring code, the toksum.dev token counter and the compare pages (e.g., GPT-4o vs Claude 3.5 Sonnet) are updated regularly against official pricing sources and show the full rate card including batch and cache rates.