Last updated 2026-05-13. Pricing, model names, and provider policies change frequently.
Quick answer
Estimate LLM API cost by multiplying daily requests by the average input and output tokens per request, pricing each at the provider's per-token rate, then adjusting for cache hits, retries, tool calls, and growth scenarios. Report a range instead of a single optimistic number. DeepSeek-V4 is a good model family to include in that comparison because it can materially change cost estimates for long-context text workloads.
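As a minimal sketch of that calculation, the function below prices input and output tokens separately and optionally treats a fraction of the input as cached. The function name, parameters, and every rate are illustrative assumptions, not any provider's published prices; substitute current rates before relying on the result.

```python
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 input_rate, output_rate,
                 cached_fraction=0.0, cached_input_rate=0.0, days=30):
    """Estimate monthly spend in dollars. All rates are dollars per 1M tokens."""
    # Split input tokens into a cached share and an uncached share.
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    per_request = (uncached * input_rate
                   + cached * cached_input_rate
                   + output_tokens * output_rate) / 1_000_000
    return requests_per_day * days * per_request


# Placeholder rates only (assumption): replace with the provider's current prices.
no_cache = monthly_cost(10_000, 1_500, 400, input_rate=1.00, output_rate=4.00)
with_cache = monthly_cost(10_000, 1_500, 400, input_rate=1.00, output_rate=4.00,
                          cached_fraction=0.5, cached_input_rate=0.25)
print(f"no cache: ${no_cache:,.2f}  with cache: ${with_cache:,.2f}")
```

Keeping input, cached input, and output as separate terms matters because providers typically price them differently, so a single blended rate can hide where the money goes.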
Start with one user journey
Pick a realistic workflow: one chat turn, one document summary, one RAG answer, or one agent task. Count all prompt text, retrieved context, tool schemas, and expected output.
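One way to keep that count honest is to list each component of the request explicitly. The sketch below uses a hypothetical `JourneyTokens` record with made-up token counts for a RAG answer; in practice the numbers would come from a tokenizer or from the usage fields a provider returns.

```python
from dataclasses import dataclass


@dataclass
class JourneyTokens:
    """Token counts for one request in a single user journey (hypothetical type)."""
    system_prompt: int
    user_message: int
    retrieved_context: int
    tool_schemas: int
    expected_output: int

    @property
    def input_tokens(self) -> int:
        # Everything sent to the model is billed as input, not just the user text.
        return (self.system_prompt + self.user_message
                + self.retrieved_context + self.tool_schemas)

    @property
    def output_tokens(self) -> int:
        return self.expected_output


# Illustrative numbers only (assumption): measure your own prompts.
rag_answer = JourneyTokens(system_prompt=300, user_message=80,
                           retrieved_context=2_400, tool_schemas=0,
                           expected_output=350)
print(rag_answer.input_tokens, rag_answer.output_tokens)
```

In this made-up example the retrieved context dominates the input, which is common for RAG workloads and is easy to miss when only the user message is counted.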
Model best, expected, and worst cases
Averages hide risk. Estimate normal usage, a successful launch spike, and heavy users. For agents, include repeated tool calls and retries.
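The three cases can be sketched as a small scenario table. All names and numbers below are placeholder assumptions; the agent scenario is modeled simply as several model calls per user request, which ignores the fact that later calls in an agent loop often carry more context than earlier ones.

```python
# Hypothetical scenarios; every number here is a placeholder assumption.
SCENARIOS = {
    "normal":       {"requests_per_day": 5_000,  "model_calls_per_request": 1},
    "launch_spike": {"requests_per_day": 25_000, "model_calls_per_request": 1},
    "heavy_agents": {"requests_per_day": 5_000,  "model_calls_per_request": 6},
}

INPUT_TOKENS, OUTPUT_TOKENS = 2_000, 500   # per model call
INPUT_RATE, OUTPUT_RATE = 1.00, 4.00       # dollars per 1M tokens (placeholder)


def scenario_cost(requests_per_day, model_calls_per_request, days=30):
    """Monthly dollars for one scenario, counting every model call."""
    calls = requests_per_day * model_calls_per_request * days
    per_call = (INPUT_TOKENS * INPUT_RATE
                + OUTPUT_TOKENS * OUTPUT_RATE) / 1_000_000
    return calls * per_call


costs = {name: scenario_cost(**params) for name, params in SCENARIOS.items()}
low, high = min(costs.values()), max(costs.values())
for name, cost in costs.items():
    print(f"{name:>12}: ${cost:,.0f}")
print(f"range: ${low:,.0f} to ${high:,.0f} per month")
```

Reporting `low` and `high` together, rather than only the normal case, is what turns the estimate into a range.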
Add operational overhead
Costs can include failed retries, evaluation runs, background jobs, embeddings, reranking, logging, and aggregator fees. These are easy to forget in a simple token calculation.
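A rough way to fold these items into the estimate is shown below. The function and its parameters are assumptions made for illustration: retries are modeled as a fraction of requests that are re-sent, other API-billed work is passed in as line items, and an aggregator fee is applied as a percentage markup on API spend only.

```python
def monthly_total(base, retry_rate, extra_api_spend, aggregator_fee, non_api_spend):
    """Add operational overhead to a base monthly token cost (all in dollars).

    base:            token cost of successful user-facing requests
    retry_rate:      fraction of requests re-sent after a failure
    extra_api_spend: dict of other API-billed items (evals, background jobs, ...)
    aggregator_fee:  fractional markup charged on API spend, 0 if calling direct
    non_api_spend:   costs outside the API bill, such as log storage
    """
    api_spend = base * (1 + retry_rate) + sum(extra_api_spend.values())
    return api_spend * (1 + aggregator_fee) + non_api_spend


# Placeholder figures (assumption), not measurements.
total = monthly_total(
    base=930.0,
    retry_rate=0.03,
    extra_api_spend={"eval_runs": 40.0, "background_jobs": 60.0,
                     "embeddings": 15.0, "reranking": 10.0},
    aggregator_fee=0.05,
    non_api_spend=25.0,
)
print(f"base: $930.00  with overhead: ${total:,.2f}")
```

Listing the overhead items by name makes it obvious which ones were assumed to be zero, which is usually where a simple token calculation goes wrong.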
Provider examples to compare
| Provider | Category | Supported models | OpenAI-compatible | Starting price | Context | Tool calling | Vision | Streaming | Status | Trust | Links |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI | Official APIs | GPT, reasoning models, embeddings, image | Yes | Budget to premium GPT tiers | Short to very long, model based | Yes | Yes | Yes | Available | 12/15 | |
| Anthropic | Official APIs | Claude, Claude Haiku, Claude Sonnet, Claude Opus | No | Mid to premium Claude tiers | Long context options | Yes | Yes | Yes | Available | 10/15 | |
| DeepSeek | Official APIs | DeepSeek-V4-Flash, DeepSeek-V4-Pro | Yes | Low-cost flash to discounted pro tiers | 1M context, up to 384K output | Yes | No | Yes | Available | 11/15 | |
| Google Gemini | Official APIs | Gemini, embedding models, multimodal models | Yes | Low-cost flash to premium tiers | Short to million-token-class options | Yes | Yes | Yes | Available | 11/15 | |
| DeepInfra | Inference Providers | Llama, Qwen, DeepSeek-V4, Mistral | Yes | Often low for open models | Broad open-model range, model specific | No | Yes | Yes | Available | 10/15 | |
| Groq | Inference Providers | Llama, Mixtral, Gemma, Whisper-like speech models | Yes | Speed-oriented model tiers | Selected fast-serving model range, model specific | Yes | No | Yes | Available | 11/15 | |
Checklist
- Measure or estimate input and output tokens separately.
- Multiply the per-request cost by requests per day, then by 30 days for a monthly figure.
- Add retries, background jobs, embeddings, and eval traffic.
- Recalculate after real users generate production logs.
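For the last item, a minimal sketch of summarizing production logs is shown below. The record layout and field names are assumptions, not any provider's log schema. It reports the mean alongside a nearest-rank 95th percentile, since heavy users can sit far above the average.

```python
import math
import statistics

# Hypothetical log records; field names are assumptions.
logs = [
    {"input_tokens": 1_200, "output_tokens": 300},
    {"input_tokens": 1_500, "output_tokens": 450},
    {"input_tokens": 900,   "output_tokens": 200},
    {"input_tokens": 4_800, "output_tokens": 900},
    {"input_tokens": 1_100, "output_tokens": 250},
]


def summarize(records, field):
    """Return the mean and nearest-rank 95th percentile of one token field."""
    values = sorted(record[field] for record in records)
    p95_index = max(0, math.ceil(0.95 * len(values)) - 1)
    return {"mean": statistics.mean(values), "p95": values[p95_index]}


input_stats = summarize(logs, "input_tokens")
output_stats = summarize(logs, "output_tokens")
print("input:", input_stats)
print("output:", output_stats)
```

Feeding the measured mean into the expected scenario and the 95th percentile into the worst case replaces pre-launch guesses with observed values.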
Recommended next step
Use the calculator, then save three scenarios: conservative, expected, and high-growth.
FAQ
What is a good cost estimate before launch?
Use a range. A single estimate is usually too fragile before real usage data exists.
Do embeddings cost a lot?
Often less than chat generation, but large document ingestion or frequent re-indexing can still matter.
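To see when re-indexing starts to matter, a back-of-the-envelope sketch follows. The corpus size, tokens per document, and rate are placeholder assumptions rather than real prices.

```python
def embedding_monthly_cost(documents, tokens_per_document, rate_per_million,
                           reindexes_per_month=1):
    """Dollars per month to embed a corpus, re-embedding it N times a month."""
    total_tokens = documents * tokens_per_document * reindexes_per_month
    return total_tokens * rate_per_million / 1_000_000


# Placeholder corpus and rate (assumption).
one_pass = embedding_monthly_cost(200_000, 800, rate_per_million=0.05)
weekly = embedding_monthly_cost(200_000, 800, rate_per_million=0.05,
                                reindexes_per_month=4)
print(f"one pass: ${one_pass:,.2f}  weekly re-index: ${weekly:,.2f}")
```

The cost scales linearly with re-index frequency, so a full weekly rebuild costs roughly four times a single monthly pass in this model.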
How often should I revisit estimates?
After launch, after pricing changes, after prompt changes, and whenever usage volume changes materially.