LLMEndpoint

LLM API Pricing Explained

Input tokens, output tokens, requests, caching, and the hidden cost drivers.

Last updated 2026-05-13. Pricing, model names, and provider policies change frequently.

Quick answer

LLM API pricing is usually driven by input tokens, output tokens, model tier, request volume, caching, and any platform or routing fees. Output tokens often cost more than input tokens, so verbose answers can be the biggest cost driver.

Use this guide when

You need a first budget before launch

Use this guide when you need to translate vague product usage into a range that finance, founders, or procurement can react to.

Token tables are not making the decision clearer

This article helps when provider price pages look precise but still do not tell you what your monthly bill will actually look like.

You are comparing official APIs with routers or cheap open-model routes

It is especially useful when you are weighing direct provider cost against marketplace, routing, or open-model tradeoffs.

Token pricing basics

Most providers charge per million input tokens and per million output tokens. Input tokens include your system prompt, user message, retrieved context, tool schemas, and conversation history. Output tokens are what the model generates.
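As a rough illustration of the per-million arithmetic, the sketch below computes the cost of a single request. The function name `request_cost`, the token counts, and the rates of $1.00 and $4.00 per million tokens are assumptions made up for this example, not any provider's actual prices.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Cost in dollars of one request, given per-million-token rates."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000


# Hypothetical input breakdown: everything sent to the model counts as input.
input_tokens = (
    1_500    # system prompt
    + 200    # user message
    + 6_000  # retrieved context
    + 800    # tool schemas
    + 1_500  # conversation history
)
output_tokens = 500  # what the model generates

# Hypothetical rates: $1.00 per million input tokens, $4.00 per million output tokens.
cost = request_cost(input_tokens, output_tokens, 1.00, 4.00)
print(f"{input_tokens} input + {output_tokens} output tokens -> ${cost:.4f}")
```

With these made-up numbers the user message is only 200 of the 10,000 input tokens, which is why estimating from the visible message alone tends to understate cost.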

Why monthly cost surprises teams

Costs grow when prompts include long retrieved documents, chat history is replayed every turn, agents call tools repeatedly, or users generate longer outputs than expected. Request count alone is not enough to estimate spend.
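To see why request count alone misleads, the sketch below counts the input tokens billed across one conversation when the full history is resent on every turn. It assumes no caching, truncation, or summarization, and the function name and token sizes are hypothetical.

```python
def conversation_input_tokens(turns: int, system_tokens: int,
                              user_tokens: int, reply_tokens: int) -> int:
    """Total input tokens billed over a conversation when the whole
    history is replayed on every turn."""
    total = 0
    history = 0
    for _ in range(turns):
        # Each request carries the system prompt, all prior turns,
        # and the new user message.
        total += system_tokens + history + user_tokens
        # After the reply, the user message and the reply join the history.
        history += user_tokens + reply_tokens
    return total


# Hypothetical sizes: 500-token system prompt, 100-token messages, 300-token replies.
naive = 10 * (500 + 100)  # assumes every turn costs the same as the first
actual = conversation_input_tokens(10, 500, 100, 300)
print(naive, actual)
```

Under these assumptions a ten-turn conversation bills 24,000 input tokens rather than the 6,000 a per-request estimate would suggest, and the gap widens as conversations get longer.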

Aggregator and gateway fees

Aggregators can simplify provider access, but you need to understand whether pricing is pass-through, marked up, credit-based, or plan-based. Always compare total cost, not only model token rates.
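One way to compare fee models on equal footing is to convert each into a total monthly cost for the same token spend. The sketch below is a simplified model: the function name, the percentages, and the flat plan fee are hypothetical, and real gateways may combine or structure these fees differently.

```python
def gateway_total(token_spend: float, markup_pct: float = 0.0,
                  credit_fee_pct: float = 0.0, plan_fee: float = 0.0) -> float:
    """Total monthly cost for a given upstream token spend.

    markup_pct:     percentage added to the provider's token rates
    credit_fee_pct: percentage charged on top of purchased credits
    plan_fee:       flat monthly subscription fee
    """
    marked_up = token_spend * (1 + markup_pct / 100)
    with_credit_fee = marked_up * (1 + credit_fee_pct / 100)
    return with_credit_fee + plan_fee


spend = 1_000.0  # hypothetical monthly token spend at provider list prices
options = {
    "pass-through": gateway_total(spend),
    "marked up 10%": gateway_total(spend, markup_pct=10),
    "credit fee 5%": gateway_total(spend, credit_fee_pct=5),
    "plan $200/month": gateway_total(spend, plan_fee=200),
}
for name, total in options.items():
    print(f"{name}: ${total:,.2f}")
```

Rerunning the comparison at several spend levels is useful because a flat plan fee shrinks as a share of the bill when volume grows, while a percentage fee does not.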

Example decision paths

Chatbot looks cheap until output grows

A product team may estimate chatbot cost from short test prompts, then discover that the real cost drivers are long assistant answers and history replayed on every conversation turn.

RAG system hides prompt cost

A retrieval app may focus on the model price and miss the fact that long retrieved context and repeated system instructions account for most of the bill.
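A per-component breakdown makes this kind of hidden prompt cost visible. The sketch below splits one request's cost across its parts. The component names, token counts, and rates are hypothetical, and it assumes at least one nonzero token count.

```python
def cost_shares(input_components: dict[str, int], output_tokens: int,
                input_rate_per_m: float, output_rate_per_m: float) -> dict[str, float]:
    """Fraction of one request's cost attributable to each input
    component and to the generated output."""
    costs = {name: tokens * input_rate_per_m / 1_000_000
             for name, tokens in input_components.items()}
    costs["output"] = output_tokens * output_rate_per_m / 1_000_000
    total = sum(costs.values())
    return {name: cost / total for name, cost in costs.items()}


# Hypothetical RAG request, priced at $1.00 (input) and $4.00 (output) per million tokens.
shares = cost_shares(
    {"system_instructions": 1_000, "retrieved_context": 6_000, "user_question": 200},
    output_tokens=300,
    input_rate_per_m=1.00,
    output_rate_per_m=4.00,
)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.0%}")
```

In this made-up case the retrieved context is roughly 71% of the request cost even though output tokens are priced four times higher, so trimming retrieval would save more than shortening answers.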

DeepSeek-V4 changes the cost baseline

DeepSeek-V4-Flash and the currently discounted DeepSeek-V4-Pro can make official-API pricing look very different from the older assumption that only third-party open-model routes are cheap.

Cheap provider still loses on total cost

A lower token rate can still become more expensive overall if retries, weak formatting, or longer responses create more product and engineering overhead.
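The retry effect can be approximated by dividing the cost of one attempt by the success rate. The sketch below assumes failed attempts are retried until one succeeds and that each attempt fails independently with the same probability. The per-attempt costs and failure rates are hypothetical.

```python
def cost_per_successful_request(cost_per_attempt: float, failure_rate: float) -> float:
    """Expected cost of getting one usable response.

    With independent attempts, the expected number of attempts per
    success is 1 / (1 - failure_rate).
    """
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return cost_per_attempt / (1 - failure_rate)


# Hypothetical comparison: a cheaper route that often returns unusable
# output versus a pricier route that rarely needs a retry.
cheap = cost_per_successful_request(0.008, failure_rate=0.30)
premium = cost_per_successful_request(0.010, failure_rate=0.02)
print(f"cheap route:   ${cheap:.4f} per usable response")
print(f"premium route: ${premium:.4f} per usable response")
```

This covers token spend only. Engineering time spent handling malformed output and the latency added by retries are extra costs that the formula does not capture.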

Provider examples to compare

| Provider | Category | Supported models | OpenAI-compatible | Starting price | Context | Tool calling | Vision | Streaming | Status | Trust |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI | Official APIs | GPT, reasoning models, embeddings, image | Yes | Budget to premium GPT tiers | Short to very long, model based | Yes | Yes | Yes | Available | 12/15 |
| Anthropic | Official APIs | Claude, Claude Haiku, Claude Sonnet, Claude Opus | No | Mid to premium Claude tiers | Long context options | Yes | Yes | Yes | Available | 10/15 |
| DeepSeek | Official APIs | DeepSeek-V4-Flash, DeepSeek-V4-Pro | Yes | Low-cost flash to discounted pro tiers | 1M context, up to 384K output | Yes | No | Yes | Available | 11/15 |
| Google Gemini | Official APIs | Gemini, embedding models, multimodal models | Yes | Low-cost flash to premium tiers | Short to million-token-class options | Yes | Yes | Yes | Available | 11/15 |
| DeepInfra | Inference Providers | Llama, Qwen, DeepSeek-V4, Mistral | Yes | Often low for open models | Broad open-model range, model specific | No | Yes | Yes | Available | 10/15 |
| OpenRouter | LLM API Aggregators | GPT, Claude, Gemini, DeepSeek-V4 | Yes | Varies by model route | Model dependent across upstream routes | No | Yes | Yes | Available | 11/15 |

Recommended next step

Use the calculator with realistic token sizes, then compare cheaper alternatives only if quality remains acceptable.
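If you want to sanity-check the same arithmetic by hand, a minimal sketch is to price a low, expected, and high usage scenario and report the range. The `Scenario` fields, the traffic figures, and the rates below are all hypothetical placeholders for your own measurements.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    requests_per_month: int
    input_tokens: int   # average input tokens per request
    output_tokens: int  # average output tokens per request


def monthly_cost(s: Scenario, input_rate_per_m: float,
                 output_rate_per_m: float) -> float:
    """Monthly token spend in dollars for one usage scenario."""
    per_request = (s.input_tokens * input_rate_per_m
                   + s.output_tokens * output_rate_per_m) / 1_000_000
    return s.requests_per_month * per_request


# Hypothetical scenarios, priced at $1.00 (input) and $4.00 (output) per million tokens.
scenarios = {
    "low": Scenario(50_000, 2_000, 300),
    "expected": Scenario(100_000, 4_000, 500),
    "high": Scenario(200_000, 8_000, 1_000),
}
budget = {name: monthly_cost(s, 1.00, 4.00) for name, s in scenarios.items()}
for name, dollars in budget.items():
    print(f"{name}: ${dollars:,.0f}/month")
```

A range like this is usually easier to defend than a single number, because it shows how far the bill moves when traffic and token sizes grow together.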

FAQ

Why are output tokens often more expensive?

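Prompt tokens can be processed in parallel, but generated tokens are produced one at a time, so each output token ties up more inference time and capacity. Providers commonly price generated tokens higher than prompt tokens for that reason.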

Is price per million tokens enough to compare providers?

No. You also need to weigh quality, latency, retry rate, context needs, support, and platform fees.

How can I reduce cost without hurting quality?

Shorten prompts, cache repeated context, route simple tasks to cheaper models, and measure quality with evals.
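Two of these levers are easy to put rough numbers on. The sketch below estimates input cost when part of the prompt is billed at a discounted cached rate, and the average cost per request when a share of traffic is routed to a cheaper model. The function names, rates, and traffic split are hypothetical. Cache pricing differs by provider, and some also charge for writing to the cache, which this sketch ignores.

```python
def input_cost_with_cache(cached_tokens: int, uncached_tokens: int,
                          input_rate_per_m: float, cached_rate_per_m: float) -> float:
    """Input cost of one request when cached tokens get a discounted rate."""
    return (cached_tokens * cached_rate_per_m
            + uncached_tokens * input_rate_per_m) / 1_000_000


def blended_cost(simple_share: float, cheap_cost: float, premium_cost: float) -> float:
    """Average cost per request when a share of traffic goes to a cheaper model."""
    return simple_share * cheap_cost + (1 - simple_share) * premium_cost


# Hypothetical: an 8,000-token repeated prefix plus 500 fresh tokens,
# at $1.00 per million input tokens and $0.10 per million cached tokens.
no_cache = input_cost_with_cache(0, 8_500, 1.00, 0.10)
with_cache = input_cost_with_cache(8_000, 500, 1.00, 0.10)

# Hypothetical: 60% of requests are simple enough for a $0.002 model,
# and the rest stay on a $0.010 model.
routed = blended_cost(0.6, cheap_cost=0.002, premium_cost=0.010)

print(f"input cost without cache: ${no_cache:.4f}")
print(f"input cost with cache:    ${with_cache:.4f}")
print(f"average routed cost:      ${routed:.4f}")
```

Neither estimate says anything about quality, so the savings only count if evals confirm that the cheaper route still produces acceptable answers.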