Migrating from OpenAI to Anthropic Without Breaking Production
Last updated: 2026-05-21 · 12 min read
Part of the toksum.dev Guides series.
Switching your production LLM workload from OpenAI to Anthropic is not a drop-in swap. The two providers have different API shapes, different tokenizers, different prompt conventions, different tool use formats, different streaming event schemas, and different approaches to prompt caching. Teams that attempt a naive migration — changing the API key and endpoint URL while keeping everything else the same — encounter broken requests, degraded output quality, and inaccurate cost models. This guide walks through every significant difference with concrete code-level guidance for handling each one safely.
Before starting, read the cost comparison for your target model pair. Depending on your workload's input/output ratio and whether you can use prompt caching, the economics may shift significantly in either direction. The GPT-4o vs Claude 3.5 Sonnet cost comparison with batch and cache modeling is a good starting point for the financial case.
SDK differences: anthropic vs openai npm packages
The OpenAI Node.js SDK is installed as openai and initialized with an API key. The Anthropic SDK is installed as @anthropic-ai/sdk and uses an Anthropic client class. The method for creating a message completion is client.messages.create() on Anthropic versus client.chat.completions.create() on OpenAI. These are not interchangeable — you need to update every call site in your codebase.
If you want to minimize refactoring, consider using the Vercel AI SDK (ai package) or LangChain, both of which abstract provider differences behind a common interface and support both OpenAI and Anthropic as backends. This approach trades some provider-specific feature access (like Anthropic's explicit cache_control) for simpler provider switching. If you need full access to Anthropic-specific features like explicit prompt caching or beta features, use the native SDK directly.
Authentication also differs at the header level. OpenAI uses Authorization: Bearer YOUR_API_KEY. Anthropic uses x-api-key: YOUR_API_KEY plus an anthropic-version header specifying the API version (e.g., 2023-06-01). The Anthropic SDK handles both headers automatically — but if you are making raw HTTP calls (e.g., from a language without an official SDK), you must include both. The anthropic-version header is also useful for observability: log it alongside the request-id response header for support ticket traceability.
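For raw HTTP callers, the header difference can be captured in one small helper. This is a minimal sketch: buildAuthHeaders is a hypothetical name, and the default version string is the example value quoted above, not a recommendation.

```typescript
// Hypothetical helper: returns the raw HTTP headers each provider expects.
type LlmProvider = "openai" | "anthropic";

function buildAuthHeaders(
  provider: LlmProvider,
  apiKey: string,
  anthropicVersion: string = "2023-06-01", // example value from the text above
): Record<string, string> {
  if (provider === "openai") {
    return {
      "content-type": "application/json",
      authorization: `Bearer ${apiKey}`,
    };
  }
  // Anthropic needs both the key header and the version header.
  return {
    "content-type": "application/json",
    "x-api-key": apiKey,
    "anthropic-version": anthropicVersion,
  };
}
```

Keeping the version string as a parameter makes it easy to log the exact value you sent alongside the request-id.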
Request shape: system message, messages array, max_tokens
The most immediate structural difference between the two APIs is how the system prompt is represented. In OpenAI's Chat Completions API, the system prompt is the first element of the messages array, formatted as {"role": "system", "content": "..."}. In Anthropic's Messages API, the system prompt is a top-level system string parameter, entirely separate from the messages array. The messages array in Anthropic's API contains only user and assistant turns.
Anthropic also requires a max_tokens parameter on every request; it is not optional. OpenAI treats max_tokens as optional, and when it is omitted the output length is bounded only by the model's own limits. If you forget to set max_tokens on an Anthropic request, the API returns a validation error. Set it to a conservative but sufficient value for your expected output length, which also serves as a cost guard against runaway generation.
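The two structural changes described above (hoisting the system prompt and making max_tokens mandatory) can be sketched as a pure conversion function. The type and function names here are hypothetical, and the sketch assumes plain string message content only.

```typescript
// Simplified shapes; real requests support richer content than plain strings.
type OpenAIChatMessage = { role: "system" | "user" | "assistant"; content: string };
type AnthropicTurn = { role: "user" | "assistant"; content: string };
type AnthropicRequestBody = {
  model: string;
  max_tokens: number;
  system?: string;
  messages: AnthropicTurn[];
};

function toAnthropicRequest(
  model: string,
  messages: OpenAIChatMessage[],
  maxTokens: number,
): AnthropicRequestBody {
  // Fail locally instead of waiting for the API's validation error.
  if (!(maxTokens > 0) || Math.floor(maxTokens) !== maxTokens) {
    throw new Error("max_tokens must be a positive integer");
  }
  const systemParts: string[] = [];
  const turns: AnthropicTurn[] = [];
  for (const m of messages) {
    if (m.role === "system") {
      systemParts.push(m.content); // hoisted out of the messages array
    } else {
      turns.push({ role: m.role, content: m.content });
    }
  }
  const body: AnthropicRequestBody = { model, max_tokens: maxTokens, messages: turns };
  if (systemParts.length > 0) body.system = systemParts.join("\n\n");
  return body;
}
```

Joining multiple system messages with a blank line is a design choice of this sketch; if your OpenAI prompts interleave system messages mid-conversation, you will need to decide case by case where that text belongs.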
The model identifier strings are different. OpenAI uses identifiers like gpt-4o, gpt-4o-mini. Anthropic uses identifiers like claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022. Anthropic model identifiers include a date suffix that pins a specific model version; when Anthropic releases an updated version, they use a new date suffix so existing integrations do not change behavior. Store model identifiers in configuration rather than hardcoding, so you can update them without touching application logic when a newer model version is released.
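One way to keep identifiers out of application logic is an alias table. The alias names (default, fast) and the resolveModel helper are hypothetical; the identifiers are the ones named above. In a real deployment this table would more likely be loaded from configuration than declared inline.

```typescript
type ModelProvider = "openai" | "anthropic";

// Hypothetical alias table; application code refers only to the alias.
const MODEL_ALIASES: Record<string, Record<ModelProvider, string>> = {
  default: { openai: "gpt-4o", anthropic: "claude-3-5-sonnet-20241022" },
  fast: { openai: "gpt-4o-mini", anthropic: "claude-3-5-haiku-20241022" },
};

function resolveModel(alias: string, provider: ModelProvider): string {
  const entry = MODEL_ALIASES[alias];
  if (!entry) throw new Error(`Unknown model alias: ${alias}`);
  return entry[provider];
}
```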
Response shape: content blocks and stop reasons
The OpenAI response object uses choices[0].message.content to access the text response and choices[0].finish_reason for the stop reason. Anthropic's response uses a content array of content blocks, where a text response is at content[0].text, and the stop reason is a top-level stop_reason field with values like end_turn, max_tokens, stop_sequence, or tool_use.
The content block structure is important for tool use responses. When Claude decides to use a tool, the response content array contains a tool_use block alongside or instead of a text block. Your response parsing code must iterate over content blocks and handle each type, rather than blindly accessing content[0].text. The stop_reason of tool_use signals that the model is requesting a tool call and the conversation should continue with a tool result message.
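A parser that iterates over blocks instead of reading content[0].text might look like the sketch below. The block types are simplified to the two discussed here, and the function names are hypothetical. The stop-reason mapping is only useful if your internal code still expects OpenAI-style finish_reason values.

```typescript
// Simplified: only the two block types discussed in this section.
type ClaudeBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

type ParsedReply = {
  text: string;
  toolCalls: { id: string; name: string; input: unknown }[];
};

function parseContentBlocks(content: ClaudeBlock[]): ParsedReply {
  const parsed: ParsedReply = { text: "", toolCalls: [] };
  for (const block of content) {
    switch (block.type) {
      case "text":
        parsed.text += block.text;
        break;
      case "tool_use":
        parsed.toolCalls.push({ id: block.id, name: block.name, input: block.input });
        break;
      default:
        // Unknown block types are skipped in this sketch; log them in real code.
        break;
    }
  }
  return parsed;
}

// Maps Anthropic stop_reason values onto OpenAI-style finish_reason values.
function toOpenAIFinishReason(stopReason: string): string {
  switch (stopReason) {
    case "end_turn":
    case "stop_sequence":
      return "stop";
    case "max_tokens":
      return "length";
    case "tool_use":
      return "tool_calls";
    default:
      return stopReason; // pass unknown values through unchanged
  }
}
```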
Token usage is reported in response.usage.input_tokens and response.usage.output_tokens on Anthropic, which map directly to the billing units. When prompt caching is active, the usage object also includes cache_creation_input_tokens and cache_read_input_tokens, which are reported separately from input_tokens rather than folded into it. Log all of these fields for cost tracking, since they are the authoritative source for billing reconciliation.
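As a rough illustration of how these fields feed a cost model, the sketch below prices one response from its usage object. The function and type names are hypothetical, prices are passed in by the caller because they vary by model and change over time, and the sketch assumes the cache fields are counted separately from input_tokens.

```typescript
type AnthropicUsage = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// USD per million tokens; supplied by the caller, never hardcoded here.
type PricePerMillion = { input: number; output: number; cacheWrite: number; cacheRead: number };

function estimateCostUsd(usage: AnthropicUsage, price: PricePerMillion): number {
  const cacheWrite = usage.cache_creation_input_tokens ?? 0;
  const cacheRead = usage.cache_read_input_tokens ?? 0;
  const total =
    usage.input_tokens * price.input +
    usage.output_tokens * price.output +
    cacheWrite * price.cacheWrite +
    cacheRead * price.cacheRead;
  return total / 1_000_000;
}
```

Summing this per request and comparing the total against your invoice is a simple reconciliation check.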
Prompt format conventions: XML tags vs JSON mode
Prompt engineering conventions differ significantly between the two providers, and using OpenAI-optimized prompts on Claude without adaptation typically degrades output quality. Claude is trained to respond well to XML-style delimiters for separating logical sections of a prompt. Using tags like <instructions>, <context>, <examples>, and <output_format> helps Claude identify and follow each section's instructions more reliably than markdown headers or paragraph breaks alone.
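A small builder keeps the tag structure consistent across prompts. This is a sketch only: xmlSection and buildXmlPrompt are hypothetical helpers, and the tag names are the ones listed above.

```typescript
// Wraps one logical section of the prompt in an XML-style tag.
function xmlSection(tagName: string, body: string): string {
  return `<${tagName}>\n${body.trim()}\n</${tagName}>`;
}

function buildXmlPrompt(parts: {
  instructions: string;
  context: string;
  examples?: string;
  outputFormat: string;
}): string {
  const sections = [
    xmlSection("instructions", parts.instructions),
    xmlSection("context", parts.context),
  ];
  // The examples section is optional and omitted entirely when not provided.
  if (parts.examples) sections.push(xmlSection("examples", parts.examples));
  sections.push(xmlSection("output_format", parts.outputFormat));
  return sections.join("\n\n");
}
```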
OpenAI's GPT-4o is optimized for markdown formatting and supports a response_format: {type: "json_object"} or JSON Schema mode that constrains the model's output to valid JSON at the API level. Anthropic does not have an equivalent API-enforced JSON mode — you must prompt Claude to produce JSON (via an XML output format tag and few-shot examples) and validate the output in your application code. Claude is reliable enough at following JSON instructions that this works well in practice, but it is not the same as a hard API-level enforcement. Plan for a JSON validation and retry loop in your parsing code.
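The validation and retry loop can be kept independent of any SDK by injecting the model call. In this sketch, generateJson, callModel, and isValid are all hypothetical names; callModel stands in for whatever function sends the prompt and returns the raw text.

```typescript
// Retries until the model's text parses as JSON and passes the caller's check.
async function generateJson<T>(
  callModel: (attempt: number) => Promise<string>,
  isValid: (value: unknown) => value is T,
  maxAttempts: number = 3,
): Promise<T> {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(attempt);
    try {
      const parsed: unknown = JSON.parse(raw);
      if (isValid(parsed)) return parsed;
      lastError = "parsed JSON did not match the expected shape";
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`No valid JSON after ${maxAttempts} attempts: ${lastError}`);
}
```

Passing the attempt number to callModel lets you append the previous failure to the prompt on retries if you find that improves compliance. Keep maxAttempts small, since every retry is billed.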
For structured data extraction use cases, Anthropic's tool use feature is often a better approach than prompting for JSON: define your desired output schema as a tool input schema, instruct Claude to call the tool with the extracted data (setting tool_choice to that tool forces the call), and parse the structured tool_use block from the response. This gives the model an explicit JSON Schema to target and integrates well with the content block response model. Still validate the returned input in your application code, because the schema guides generation and should not be treated as a hard guarantee.
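The extraction pattern can be sketched as plain request data plus a reader for the response. The tool name record_invoice, its schema, and the readExtraction helper are hypothetical; the request fields (tools, tool_choice, input_schema) follow the shapes described in this guide.

```typescript
// Hypothetical tool whose only purpose is to carry the extracted fields.
const extractInvoiceTool = {
  name: "record_invoice",
  description: "Record the fields extracted from an invoice.",
  input_schema: {
    type: "object",
    properties: {
      vendor: { type: "string" },
      total: { type: "number" },
    },
    required: ["vendor", "total"],
  },
};

// Request body sketch: tool_choice forces Claude to call this specific tool.
const extractionRequest = {
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 512,
  tools: [extractInvoiceTool],
  tool_choice: { type: "tool", name: extractInvoiceTool.name },
  messages: [{ role: "user", content: "Invoice from Acme Corp, total due 1250.00 USD." }],
};

type ExtractionBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

// Returns the input of the first matching tool_use block; validate it afterwards.
function readExtraction(content: ExtractionBlock[], toolName: string): unknown {
  for (const block of content) {
    if (block.type === "tool_use" && block.name === toolName) return block.input;
  }
  throw new Error(`No tool_use block for ${toolName}`);
}
```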
Tokenizer differences, vision inputs, and function calling
As covered in detail in the token counting myths guide, Claude and GPT-4o use different tokenizers with different vocabularies, so the same text produces different token counts on each. For typical English prose the gap is often in the 10–20% range, and for code and non-English text it varies more widely. Treat the direction and size of the difference as something to measure, not assume: recalculate your token budgets and cost models using Anthropic's tokenizer (or the count_tokens API) on a representative sample of your actual production content as part of the migration.
Vision inputs work differently on the two platforms. Both support image inputs in the messages array, but the field names and encoding options differ. OpenAI accepts image_url content items with a URL or base64 data URI. Anthropic accepts image content blocks with a source object that specifies the media type and base64 data, or a URL. Both support JPEG, PNG, GIF, and WebP. Anthropic does not support OpenAI's "detail: low/high/auto" image resolution control — Claude processes images at its own internally determined resolution.
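Converting an OpenAI image item into an Anthropic image block is mostly a matter of unpacking the data URI. The sketch below uses simplified shapes and a hypothetical toAnthropicImage helper; the detail setting is dropped because there is no counterpart for it.

```typescript
type OpenAIImagePart = {
  type: "image_url";
  image_url: { url: string; detail?: "low" | "high" | "auto" };
};

type AnthropicImageBlock = {
  type: "image";
  source:
    | { type: "base64"; media_type: string; data: string }
    | { type: "url"; url: string };
};

function toAnthropicImage(part: OpenAIImagePart): AnthropicImageBlock {
  const url = part.image_url.url;
  // Data URIs carry the media type and payload that Anthropic wants as separate fields.
  const match = /^data:(image\/(?:jpeg|png|gif|webp));base64,(.+)$/.exec(url);
  const mediaType = match?.[1];
  const data = match?.[2];
  if (mediaType && data) {
    return { type: "image", source: { type: "base64", media_type: mediaType, data } };
  }
  if (url.indexOf("data:") === 0) {
    throw new Error("Unsupported data URI (expected base64 JPEG, PNG, GIF, or WebP)");
  }
  // Plain URLs pass through; image_url.detail is intentionally discarded.
  return { type: "image", source: { type: "url", url } };
}
```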
Tool/function calling requires a dedicated migration step. OpenAI's function calling uses a functions array (legacy) or a tools array where each tool has type: "function" and a function object with name, description, and parameters (JSON Schema). Anthropic's tool use has each tool at the top level of the tools array with name, description, and input_schema (JSON Schema). Tool results are returned differently too: OpenAI uses a role: "tool" message that references the call through tool_call_id; Anthropic uses a role: "user" message containing a tool_result content block that references the call through tool_use_id.
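Both conversions are mechanical and can be sketched as two small functions. The type and function names are hypothetical and the shapes are simplified to string results.

```typescript
type JsonSchemaObject = Record<string, unknown>;

type OpenAIToolDef = {
  type: "function";
  function: { name: string; description?: string; parameters: JsonSchemaObject };
};
type AnthropicToolDef = { name: string; description?: string; input_schema: JsonSchemaObject };

// Flattens the nested function object and renames parameters to input_schema.
function toAnthropicTool(tool: OpenAIToolDef): AnthropicToolDef {
  const fn = tool.function;
  const converted: AnthropicToolDef = { name: fn.name, input_schema: fn.parameters };
  if (fn.description !== undefined) converted.description = fn.description;
  return converted;
}

type OpenAIToolResultMessage = { role: "tool"; tool_call_id: string; content: string };
type AnthropicToolResultMessage = {
  role: "user";
  content: { type: "tool_result"; tool_use_id: string; content: string }[];
};

// The id passed here must be the id from Claude's tool_use block for this call.
function toAnthropicToolResult(msg: OpenAIToolResultMessage): AnthropicToolResultMessage {
  return {
    role: "user",
    content: [{ type: "tool_result", tool_use_id: msg.tool_call_id, content: msg.content }],
  };
}
```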
Streaming, prompt caching wiring, and observability
Both providers support Server-Sent Events (SSE) streaming, but the event names and payload shapes differ. OpenAI streams data: {"choices": [{"delta": {"content": "..."}}]} chunks. Anthropic streams typed events: message_start, content_block_start, content_block_delta (text arrives in delta.text), content_block_stop, message_delta, and message_stop. The message_delta event near the end of the stream carries the stop_reason and the output token usage; input token usage arrives earlier, in message_start. Use the Anthropic SDK's stream() method to avoid hand-parsing SSE events.
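If you do consume already-parsed events yourself, the accumulation logic is roughly as follows. The event types below are deliberately simplified (only the fields this sketch reads), and accumulateStream is a hypothetical name.

```typescript
// Simplified event shapes: only the fields used below are modeled.
type StreamChunk =
  | { type: "message_start"; message: { usage: { input_tokens: number } } }
  | { type: "content_block_start"; index: number }
  | { type: "content_block_delta"; index: number; delta: { type: string; text?: string } }
  | { type: "content_block_stop"; index: number }
  | { type: "message_delta"; delta: { stop_reason: string | null }; usage: { output_tokens: number } }
  | { type: "message_stop" };

function accumulateStream(events: StreamChunk[]) {
  let text = "";
  let stopReason: string | null = null;
  let inputTokens = 0;
  let outputTokens = 0;
  for (const ev of events) {
    switch (ev.type) {
      case "message_start":
        inputTokens = ev.message.usage.input_tokens;
        break;
      case "content_block_delta":
        // Only text deltas are appended; other delta types are ignored here.
        if (ev.delta.type === "text_delta" && ev.delta.text !== undefined) text += ev.delta.text;
        break;
      case "message_delta":
        stopReason = ev.delta.stop_reason;
        outputTokens = ev.usage.output_tokens;
        break;
      default:
        break; // start/stop markers need no handling in this sketch
    }
  }
  return { text, stopReason, inputTokens, outputTokens };
}
```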
To wire in prompt caching, add cache_control to the system prompt or to specific content blocks in the messages array. The system parameter accepts an array of content blocks (not just a string) when you need to add cache_control to the system prompt: system: [{type: "text", text: "...", cache_control: {type: "ephemeral"}}]. After enabling caching, verify it is working by checking that response.usage.cache_read_input_tokens is non-zero on requests after the first. If it is always zero, the prefix may be below the minimum cacheable length (1,024 tokens on most models, higher on some smaller ones), the token sequence may be changing between requests, or the cache entry may have expired between calls. See the prompt caching ROI guide for diagnostics and economics.
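The wiring and the verification step can be sketched as two helpers. cachedSystem and cacheStatus are hypothetical names; the block shape is the one shown above.

```typescript
type SystemTextBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Wraps a static system prompt so it is eligible for caching.
function cachedSystem(staticPrompt: string): SystemTextBlock[] {
  return [{ type: "text", text: staticPrompt, cache_control: { type: "ephemeral" } }];
}

type CacheUsageFields = {
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// Classifies one response: "hit" read from cache, "write" populated it, "none" did neither.
function cacheStatus(usage: CacheUsageFields): "hit" | "write" | "none" {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return "hit";
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return "write";
  return "none";
}
```

Emitting cacheStatus as a metric makes a silent regression (for example, a timestamp accidentally added to the cached prefix) visible as a drop in the hit rate.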
For observability, log the request-id header from every Anthropic API response. It plays the same role as OpenAI's x-request-id response header and is the identifier you need when contacting Anthropic support about a specific request. Log it alongside your own trace ID, the model name, the anthropic-version header value, input tokens, output tokens, cache tokens, and stop reason. This gives you a complete audit trail for cost reconciliation and debugging.
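A log record covering the fields listed above could be assembled like this. The AuditEntry shape and buildAuditEntry are hypothetical; getHeader is an injected lookup so the sketch works with any HTTP client.

```typescript
type AuditEntry = {
  traceId: string;
  requestId: string | null;
  model: string;
  anthropicVersion: string;
  inputTokens: number;
  outputTokens: number;
  cacheCreationTokens: number;
  cacheReadTokens: number;
  stopReason: string | null;
};

// Only the response fields this sketch reads.
type LoggedResponseBody = {
  model: string;
  stop_reason: string | null;
  usage: {
    input_tokens: number;
    output_tokens: number;
    cache_creation_input_tokens?: number | null;
    cache_read_input_tokens?: number | null;
  };
};

function buildAuditEntry(
  traceId: string,
  anthropicVersion: string,
  getHeader: (name: string) => string | null,
  body: LoggedResponseBody,
): AuditEntry {
  return {
    traceId,
    requestId: getHeader("request-id"),
    model: body.model,
    anthropicVersion,
    inputTokens: body.usage.input_tokens,
    outputTokens: body.usage.output_tokens,
    cacheCreationTokens: body.usage.cache_creation_input_tokens ?? 0,
    cacheReadTokens: body.usage.cache_read_input_tokens ?? 0,
    stopReason: body.stop_reason,
  };
}
```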
Gradual rollout pattern
Never migrate 100% of production traffic at once. LLM migrations introduce too many variables — prompt effectiveness, output format changes, latency differences, and model behavior differences — to validate safely in a single cut-over. Use a percentage-based traffic split that you can roll back in minutes if quality degrades.
The recommended pattern: implement a thin provider abstraction layer that accepts your normalized internal request format (model alias, messages, parameters) and routes to either OpenAI or Anthropic based on a feature flag or percentage split. Start with 1% of traffic to Claude, instrument quality metrics (output format compliance, downstream parsing error rates, user feedback signals), run for 48–72 hours, then step to 5%, 20%, 50%, and finally 100% if all metrics hold. At each step, compare per-provider cost from your usage logs to validate that your pre-migration cost model was accurate.
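The percentage split can be made deterministic by hashing a stable key, so a given user stays on the same provider as you step the percentage up. This is a sketch under stated assumptions: chooseProvider and stableHash32 are hypothetical names, the hash is an FNV-1a-style hash over UTF-16 code units (any stable hash would do), and the rollout steps are the ones suggested above.

```typescript
const ROLLOUT_STEPS_PERCENT = [1, 5, 20, 50, 100];

// FNV-1a-style 32-bit hash over UTF-16 code units; returns an unsigned integer.
function stableHash32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts to stay within 32 bits.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash >>> 0;
}

function chooseProvider(stableKey: string, anthropicPercent: number): "openai" | "anthropic" {
  if (!(anthropicPercent >= 0 && anthropicPercent <= 100)) {
    throw new RangeError("anthropicPercent must be between 0 and 100");
  }
  const bucket = stableHash32(stableKey) % 10_000; // 0..9999, 0.01% granularity
  return bucket < anthropicPercent * 100 ? "anthropic" : "openai";
}
```

Because the bucket for a key never changes, raising the percentage only moves additional keys to Anthropic, and setting it back to 0 is the rollback. Read the percentage from a feature flag rather than a constant so you can change it without a deploy.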
Maintain your OpenAI integration and credentials for at least 30 days after a full migration to Claude. This allows you to roll back instantly if a problem emerges in production that was not caught during the phased rollout. After 30 days with stable metrics, you can decommission the OpenAI path and clean up the abstraction layer.
Run your quality evaluation suite on at least 200–500 examples from your actual production distribution before starting the rollout. Public benchmarks measure average performance across diverse tasks; your product may specialize in a domain where one model has a clear advantage or disadvantage. There is no substitute for measuring on your own data before committing to a migration.
Related
Guide: Prompt Caching ROI. The full economics of wiring cache_control into your migrated Anthropic integration.
Guide: 5 Token Counting Myths. Why GPT and Claude tokens aren't interchangeable; recalculate your cost model after migrating.
Compare: GPT-4o vs Claude 3.5 Sonnet. Full cost comparison with batch and cache modeling, the financial case for your migration decision.