Batch API: The 50% LLM Discount You're Probably Not Using

Last updated: 2026-05-21 · 10 min read

Part of the toksum.dev Guides series.

What the batch API actually is

Every major LLM provider offers a batch processing API that cuts your per-token cost in half. OpenAI calls it the Batch API. Anthropic calls it the Message Batches API. Google offers it as batch mode on Vertex AI for Gemini models. In all three cases, the mechanism is the same: instead of sending individual requests and waiting for individual responses in real time, you upload a file containing many requests at once, the provider processes them at lower priority over the next several hours, and you retrieve a result file when they are done. The discount for accepting this latency is 50% on both input and output tokens across all three providers.

This is not a beta feature or a tier that requires special approval. It is a documented, stable part of each provider's API surface, available to any paying customer. The reason most engineering teams are not using it is simply lack of awareness: the batch APIs are not prominent in provider marketing, and the default tutorial experience for every LLM is the synchronous chat completion endpoint, which is the wrong tool for a large class of real-world workloads.

The core mechanics are straightforward. You prepare a JSON Lines file (JSONL) where each line is a self-contained request object in the same format as you would send to the standard API. You upload the file via the batch endpoint. The provider returns a batch job ID. You poll the batch status endpoint or set up a webhook until the job reports completion, typically within 1–6 hours and guaranteed within 24 hours. You then retrieve the result file, which contains one response object per input request, matched by a custom ID you supply. Failed requests appear as error entries in the result file rather than causing the entire batch to fail.

Which providers offer it and how they differ

OpenAI Batch API supports GPT-4o, GPT-4o Mini, GPT-4 Turbo, and most embedding models. You submit a JSONL file where each line contains a custom ID, the HTTP method ("POST"), the endpoint path, and the request body. The batch API uses the same message format as the chat completions endpoint, so existing prompt logic transfers directly. OpenAI allows up to 50,000 requests and 200 MB per batch file. Results are returned in a JSONL file with the same structure. The 50% discount applies to both input and output tokens, making batch GPT-4o $1.25/$5.00 per 1M input/output versus $2.50/$10.00 standard.

Anthropic Message Batches API supports Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus, and other current models. The request format mirrors the standard Messages API, including support for vision inputs and tool use. Anthropic allows up to 10,000 requests per batch. Batch pricing for Claude 3.5 Sonnet is $1.50/$7.50 per 1M input/output versus $3.00/$15.00 standard — a 50% reduction. Anthropic also supports prompt caching within batch requests, which can stack on top of the batch discount for workloads with reusable prefixes.

Google Vertex AI batch predictions support Gemini 1.5 Pro, Gemini 1.5 Flash, and Gemini 2.x models. Google's batch mode is configured slightly differently from OpenAI and Anthropic: you create a batch prediction job pointing at a BigQuery table or Cloud Storage bucket as input and output. The format is more integrated with Google Cloud infrastructure, which is an advantage if you are already running a GCP data pipeline and a friction point if you are not. Google's batch discount is also 50% on token costs.

All three providers use per-batch-job processing, not per-request. You do not pay a flat fee per batch submission — costs are purely token-based at the discounted rate. There is no premium for small batches and no additional discount for very large ones.

When to use batch API

The right workloads for batch API share a common property: the output is not needed in real time by a human or by another system that is actively waiting. The 24-hour SLA is a soft upper bound — in practice most batches complete in one to six hours — but you need to be genuinely indifferent to that window. Any latency requirement tighter than an hour disqualifies batch API from consideration.

Nightly analytics and summarization. If you summarize the previous day's user conversations, customer tickets, or log data every night before business hours, the entire pipeline runs in a window where batch is natural. You submit the batch at midnight, retrieve results at 4 AM, and load summaries into your dashboard before people arrive at work. The workload is identical to what you would run synchronously but at half the cost.

Bulk embedding generation. When you add new documents to a RAG knowledge base, you need to generate embeddings for each chunk. This is an embarrassingly parallel batch workload — each chunk's embedding is independent of every other — and there is no user waiting for the result. Batch embedding generation is one of the most common and highest-ROI uses of the batch API at scale.

Content classification and moderation. Classifying user-generated content for toxicity, spam, category tags, or sentiment is typically a background job. Content goes into a queue, gets classified in bulk, and the classification results are applied to moderation decisions made by human reviewers or automated rules. The whole pipeline tolerates hours of latency, and the token volumes can be very high, making batch classification economically attractive.

Data labeling and annotation. Using LLMs to generate training labels, entity extraction, or structured annotations on large datasets is a canonical batch workload. You have a static corpus, you need to process every item, and you read the results when the job finishes. Batch API is the correct tool.

SEO and content generation pipelines. Generating meta descriptions, product descriptions, FAQ entries, or other structured content for a large catalog is a batch job. You run it once when the catalog changes, store the results, and serve them statically. No user is waiting for the LLM to respond at page load time.

When NOT to use batch API

Batch API is categorically incompatible with interactive or time-sensitive workloads. The following use cases should always use the standard synchronous API regardless of cost pressure.

Chat and conversational assistants. Any product where a human is typing a message and waiting for a reply requires real-time response. Even a 10-second delay is unacceptable for most chat UX; a multi-hour delay makes the feature non-functional. Use the standard streaming API for chat.

Agentic workflows with live tool execution. LLM agents that call external APIs, browse the web, write and run code, or take actions in the real world need the model's decision within seconds. The batch API has no support for mid-batch tool calls — each request in the batch is self-contained and stateless. Agents with real-time decision loops require the standard API.

Search and retrieval augmentation. If your product uses an LLM to rerank search results or synthesize an answer from retrieved documents while the user waits for a search results page, you are constrained by user-acceptable search latency (typically under 2 seconds). Batch API does not apply.

Streaming UX. Any feature that streams partial responses character-by-character to give the user a sense of progress is inherently real-time. The batch API returns complete responses in a result file — there is no streaming mode for batch jobs.

The actual math: savings at 10K, 100K, and 1M jobs per month

To make the savings concrete, consider a content classification workload using Claude 3.5 Sonnet. Each request processes one piece of user-generated content: 800 input tokens (content plus a classification system prompt) and 100 output tokens (a structured label). The workload runs nightly and tolerates overnight processing.

At 10,000 jobs per month (333 per day, a small startup scale):

At 100,000 jobs per month (3,333 per day, a growing product):

At 1,000,000 jobs per month (33,333 per day, a scaled product):

The percentage savings are constant (always 50%) but the absolute dollar impact scales linearly with volume. At high volumes, the engineering cost of implementing batch API — typically a few days for a clean integration — pays back in the first month. For the GPT-4o vs Claude 3.5 Sonnet comparison, batch pricing significantly changes the competitive cost picture and is worth modeling explicitly for any workload above 100K monthly requests.

How to wire batch API into a production pipeline

A production-grade batch pipeline needs four components: a job builder, a submitter, a poller, and a result processor. Here is the architecture pattern that works reliably at scale.

Job builder. A scheduled job (cron, Airflow DAG, or cloud scheduler) runs at your desired frequency — nightly is common — queries your data store for all items that need processing since the last run, and serializes them into a JSONL file. Each line gets a stable custom ID that maps back to the source record (e.g., a database primary key). The custom ID is how you join results back to your records without relying on order. Do not rely on positional order in the result file — use the custom ID for matching.

Submitter. Uploads the JSONL file to the provider's batch endpoint and stores the returned batch job ID in your database alongside the run metadata (submission timestamp, record count, expected completion window). Record the batch ID persistently — if your poller crashes, you need to be able to recover the job ID to check status.

Poller. A separate process (or the same job run with exponential backoff) checks the batch status endpoint every 30–60 minutes. When the status transitions to "completed" or "failed", it downloads the result file. Design this as an idempotent operation: downloading the same result file twice should not corrupt your database. A simple "processed" flag on the batch record in your database prevents double-processing.

Result processor. Parses the result JSONL, maps each response back to the source record via the custom ID, writes the extracted data to your database, and marks the source records as processed. Log any per-request errors separately for retry logic or human review. A failed individual request should not block the rest of the batch results from being written.

For retry logic, do not re-submit failed requests in the same batch job immediately. Collect failed custom IDs, investigate the error types (rate limit exceeded, content policy, malformed input), fix the root cause, and submit a smaller follow-up batch with only the failed items. This keeps your retry logic clean and audit-able.

One important operational note: do not submit a new batch before the previous one completes if you are processing the same dataset. Overlapping batches on the same records create race conditions in your result processor. Use your database's "in progress" flag on batch records to gate new submissions.

Frequently asked questions

Which providers offer a batch API with a 50% discount?
As of May 2026, OpenAI, Anthropic, and Google all offer batch processing APIs with a 50% discount on both input and output tokens. OpenAI calls it the Batch API, Anthropic calls it the Message Batches API, and Google offers batch mode on Gemini via Vertex AI. All three accept JSON Lines files of requests, process them asynchronously, and guarantee results within 24 hours.
Can I use batch API for user-facing chat?
No. Batch API is strictly for asynchronous workloads. Responses are returned in a result file hours later, not in real time. Any feature that requires a model response before the user can proceed — chat, search, live content generation — requires the standard synchronous API. Batch API is appropriate for background pipelines, nightly jobs, and bulk processing tasks.
What happens if a batch job fails partway through?
Each request in a batch is processed independently. If individual requests fail (due to content policy, token limits, or model errors), they appear as error entries in the result file while successful requests are returned normally. The batch itself does not fail as a unit. You should build result parsing logic that handles partial failures, retries failed requests in a follow-up batch, and logs error types for monitoring.
Is there a minimum or maximum batch size?
OpenAI allows up to 50,000 requests per batch file and up to 200 MB file size. Anthropic allows up to 10,000 requests per batch. Google Vertex AI batch limits depend on the specific quota granted to your project. There is no enforced minimum — you can submit a batch with a single request, though the economics only make sense at higher volumes where the processing overhead per job is amortized.

Related

Guide

Prompt Caching ROI

Stack prompt caching on top of batch discounts for the lowest possible token cost on Anthropic.

Guide

How to Read LLM Pricing Pages

Decode all pricing dimensions — input/output asymmetry, caching, and batch tabs — without getting burned.

Compare

GPT-4o vs Claude 3.5 Sonnet

See how batch pricing shifts the competitive cost picture between the two flagship models.