Most teams building AI features get surprised by their first invoice. They tested with a few hundred requests, the numbers looked fine, then they hit 50,000 requests in month two and the bill tripled their projections. The formula is simple — the mistake is almost always not measuring actual token counts before estimating costs.
The Core Formula for AI Cost Per Request
Every AI API call has two cost components: input tokens (everything you send to the model) and output tokens (what the model returns). The formula for cost per request is:
Cost per request = (Input tokens × Input price per token) + (Output tokens × Output price per token)
Since pricing is quoted per million tokens, you divide by 1,000,000:
Cost per request = (Input tokens / 1,000,000 × Input MTok price) + (Output tokens / 1,000,000 × Output MTok price)
Scaling to cost per 1,000 requests multiplies by 1,000. Combined with the division by 1,000,000, that leaves a net division by 1,000:
Cost per 1K requests = [(Avg input tokens × Input MTok price) + (Avg output tokens × Output MTok price)] / 1,000
This is the number that goes into your product cost model. Run this calculation before you write the integration code, not after your first production invoice.
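The formula can be sketched as two small Python functions. The function names and the workload in the usage lines are illustrative assumptions, not figures from any provider.

```python
def cost_per_request(input_tokens, output_tokens, input_price_mtok, output_price_mtok):
    """Dollar cost of one request, with prices quoted per million tokens (MTok)."""
    input_cost = input_tokens / 1_000_000 * input_price_mtok
    output_cost = output_tokens / 1_000_000 * output_price_mtok
    return input_cost + output_cost


def cost_per_1k_requests(avg_input_tokens, avg_output_tokens, input_price_mtok, output_price_mtok):
    """Dollar cost of 1,000 requests at the given average token counts."""
    return 1_000 * cost_per_request(
        avg_input_tokens, avg_output_tokens, input_price_mtok, output_price_mtok
    )


if __name__ == "__main__":
    # Hypothetical workload: 1,000 input tokens, 250 output tokens,
    # priced at $1.00 input / $4.00 output per MTok.
    print(f"${cost_per_request(1_000, 250, 1.00, 4.00):.5f} per request")
    print(f"${cost_per_1k_requests(1_000, 250, 1.00, 4.00):.2f} per 1K requests")
```

Keeping the per-request function separate makes it easy to reuse the same calculation for single calls, averages, and percentile cases.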
Worked Example: Customer Support Summarization
Imagine you’re building a feature that summarizes customer support tickets and suggests a resolution category. A typical prompt might look like:
- System prompt: 500 tokens (instructions, category list, examples)
- Customer message: 200 tokens (average ticket length)
- Total input: 700 tokens
The model output — a summary plus a category — is typically around 150 tokens.
Running this on GPT-5 ($2.50 input / $15.00 output per MTok):
- Input cost per request: 700 / 1,000,000 × $2.50 = $0.00175
- Output cost per request: 150 / 1,000,000 × $15.00 = $0.00225
- Total cost per request: $0.004
- Cost per 1,000 requests: $4.00
At 10,000 tickets per month, that’s $40/month. Reasonable for a serious feature.
Now run the same calculation on GPT-4.1 Mini ($0.40 input / $1.60 output per MTok):
- Input: 700 / 1,000,000 × $0.40 = $0.00028
- Output: 150 / 1,000,000 × $1.60 = $0.00024
- Total: $0.00052 per request
- Cost per 1K requests: $0.52
The cheaper model handles the same task at roughly 1/8th the cost. For a classification task with well-structured inputs, the quality gap is often minimal. That’s the calculation worth doing before you default to a frontier model.
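The comparison above can be reproduced with a short script. This is a minimal sketch: the prices are the ones quoted in this article and should be checked against current provider pricing, and the dictionary and function names are assumptions made for illustration.

```python
# Prices per million tokens as quoted in the worked example (input, output).
PRICES_PER_MTOK = {
    "GPT-5": (2.50, 15.00),
    "GPT-4.1 Mini": (0.40, 1.60),
}

INPUT_TOKENS = 700   # 500-token system prompt + 200-token ticket
OUTPUT_TOKENS = 150  # summary plus category


def cost_per_1k(input_tokens, output_tokens, input_price, output_price):
    """Cost of 1,000 requests: tokens x MTok price, divided by 1,000."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000


costs = {
    model: cost_per_1k(INPUT_TOKENS, OUTPUT_TOKENS, *prices)
    for model, prices in PRICES_PER_MTOK.items()
}

for model, cost in costs.items():
    print(f"{model}: ${cost:.2f} per 1K requests")

ratio = costs["GPT-5"] / costs["GPT-4.1 Mini"]
print(f"GPT-5 costs about {ratio:.1f}x as much for this workload")
```

Adding another model is a one-line change to the price table, which makes it cheap to rerun the comparison whenever prices or prompt sizes change.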
Use the free AI Token Counter to paste your actual system prompt and a representative message, get the exact token count, and run this formula with real numbers instead of guesses.
The Input-to-Output Ratio Changes Everything
The single biggest source of cost estimation error is misunderstanding the input-to-output ratio for your specific use case. Since output tokens cost 4–10x more than input tokens, a generation-heavy task is fundamentally different from an extraction task.
Extraction tasks (classify, tag, extract structured data): typically 85–95% input, 5–15% output. Input price dominates. Choose the cheapest model that achieves acceptable accuracy.
Summarization tasks (condense long documents): typically 80–90% input, 10–20% output. Still input-dominant, but output cost becomes meaningful when your model is verbose.
Generation tasks (write content, draft responses, create copy): typically 30–50% input, 50–70% output. Output price becomes the dominant factor. A model with cheap input but expensive output can surprise you here.
Conversation tasks (multi-turn chat): the ratio shifts each turn as conversation history grows. By turn 5, a chat session that started with a 200-token message might have 2,000 tokens of input just from accumulated history. Model costs can increase 3–5x over a long session compared to a fresh request.
Measuring the actual ratio for your task is worth doing once. Run 50–100 representative requests, log input and output token counts, and calculate your real ratio. Everything downstream of that — model selection, pricing estimates, budget forecasts — becomes more accurate.
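One way to turn logged counts into a ratio is sketched below. The sample pairs are hypothetical stand-ins for your own logs, and the function name is an assumption. The sketch reports both the token share and the cost share, since the two can diverge sharply when output is priced higher than input.

```python
from statistics import mean

# Hypothetical (input_tokens, output_tokens) pairs from a small test run.
samples = [(720, 140), (650, 160), (810, 155), (690, 130), (705, 165)]


def token_profile(samples, input_price_mtok, output_price_mtok):
    """Summarize average token counts and how cost splits between input and output."""
    avg_in = mean(s[0] for s in samples)
    avg_out = mean(s[1] for s in samples)
    in_cost = avg_in * input_price_mtok
    out_cost = avg_out * output_price_mtok
    return {
        "avg_input_tokens": avg_in,
        "avg_output_tokens": avg_out,
        "input_token_share": avg_in / (avg_in + avg_out),
        "output_cost_share": out_cost / (in_cost + out_cost),
    }


if __name__ == "__main__":
    # Prices are the GPT-5 figures quoted in this article ($2.50 / $15.00 per MTok).
    profile = token_profile(samples, 2.50, 15.00)
    print(f"Input share of tokens: {profile['input_token_share']:.0%}")
    print(f"Output share of cost:  {profile['output_cost_share']:.0%}")
```

With these made-up samples, input is most of the tokens while output is still more than half of the cost, which is why the ratio is worth measuring rather than assuming.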
Building a Monthly Cost Projection
Once you have cost per request (or per 1,000 requests), the monthly projection formula is:
Monthly cost = (Daily request volume × 30 × Cost per request)
Or equivalently:
Monthly cost = (Monthly requests / 1,000) × Cost per 1K requests
For a realistic annual budget, add three multipliers that experienced teams consistently find necessary:
- Growth buffer (+25%): Usage grows as more users discover the feature. Plan for it.
- Infrastructure overhead (+30%): Orchestration, monitoring, error handling, rate limiting logic — these add real API calls that your initial estimate doesn’t include.
- Experimentation budget (+15%): You’ll test new models, optimize prompts, run A/B tests. Budget this as a line item rather than letting it appear as an unplanned overage.
Added together (1 + 0.25 + 0.30 + 0.15), these bring the realistic annual budget to roughly 1.7× your base calculation. Teams that skip these multipliers consistently underestimate actual spend.
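The projection and buffers can be sketched as follows. It assumes the three buffers are added together rather than compounded, which is what the 1.7× figure implies. The constant and function names are illustrative.

```python
# Buffers from the list above, treated as additive (an assumption of this sketch).
GROWTH_BUFFER = 0.25
INFRASTRUCTURE_OVERHEAD = 0.30
EXPERIMENTATION_BUDGET = 0.15


def monthly_cost(monthly_requests, cost_per_1k_requests):
    """Base monthly spend from request volume and cost per 1,000 requests."""
    return monthly_requests / 1_000 * cost_per_1k_requests


def budgeted_annual_cost(base_monthly_cost):
    """Twelve months of base spend, scaled by the combined buffers."""
    multiplier = 1 + GROWTH_BUFFER + INFRASTRUCTURE_OVERHEAD + EXPERIMENTATION_BUDGET
    return base_monthly_cost * 12 * multiplier


if __name__ == "__main__":
    # Support-ticket example from earlier: 10,000 requests/month at $4.00 per 1K.
    base = monthly_cost(10_000, 4.00)
    print(f"Base monthly cost: ${base:.2f}")
    print(f"Budgeted annual cost: ${budgeted_annual_cost(base):.2f}")
```

If you prefer to compound the buffers instead, multiply the three factors (1.25 × 1.30 × 1.15) and expect a somewhat higher figure than 1.7×.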
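A rough benchmark from NMM student projects: a B2B SaaS feature handling 50,000 requests per month with 1,500 average input tokens and 400 average output tokens costs approximately $200–250/month on GPT-4.1 Mini, versus $1,800–2,100/month on GPT-5. These figures are well above what the base formula gives at the list prices used above (about $62/month and $488/month for that workload), so read them as all-in spend that includes the kind of overhead described in the next section. Same feature, same quality for extraction work, and roughly an 8–9x cost difference.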
Five Factors That Inflate Real-World Costs
The formula gives you a floor, not a ceiling. Here’s what adds to the theoretical number:
1. System prompt size. A 2,000-token system prompt gets charged on every single request. On 100,000 monthly requests, that’s 200 million tokens of input just from your system prompt. Prompt caching makes this economical — cached input from OpenAI costs $0.25/MTok versus $2.50/MTok standard, a 90% reduction. If your system prompt is large and static, caching it is the highest-leverage cost optimization available.
2. Reasoning tokens. If you use a reasoning model like o3, o4-mini, or DeepSeek R1, the model generates internal “thinking” tokens that count toward output cost. These are often hidden from the response but very visible on your bill. A reasoning call that returns 500 tokens of visible output might have generated 3,000 tokens of internal reasoning charged at output rates.
3. Retry logic. A 5% error rate with automatic retries means roughly 5% more API calls than your base estimate. A 15% error rate on a cheaper model eats into the savings from its lower per-token rates, and can erase them when the price gap between models is small.
4. Context accumulation in conversations. Multi-turn applications where you include the full conversation history grow in cost with every turn. A conversation at turn 10 sends 9 turns of history as input on that call. Design truncation or summarization logic to cap context size.
5. Streaming overhead. Some implementations stream token-by-token for real-time UX. Streaming doesn’t change your token count, but if your implementation sends partial response confirmations or keeps connections open, check that your proxy layer isn’t adding overhead.
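The first four factors can each be roughed out with a few lines of arithmetic. This is a sketch under simplifying assumptions, each noted in the code: every request hits the prompt cache, failures are independent and retried until success, and each conversation turn adds a fixed number of tokens. All function names are hypothetical.

```python
def monthly_system_prompt_cost(prompt_tokens, monthly_requests, price_per_mtok):
    """Input cost of resending the same system prompt on every request.

    Passing the cached price assumes every request is a cache hit.
    """
    return prompt_tokens * monthly_requests / 1_000_000 * price_per_mtok


def billed_output_tokens(visible_tokens, reasoning_tokens):
    """Reasoning tokens are billed at output rates on top of the visible output."""
    return visible_tokens + reasoning_tokens


def expected_calls_per_success(error_rate):
    """Average API calls per successful result.

    Assumes failures are independent and each one is retried until it succeeds.
    """
    return 1 / (1 - error_rate)


def input_tokens_at_turn(turn, system_tokens, user_tokens, reply_tokens):
    """Input size on a given turn when the full history is resent every time.

    Assumes every user message and every reply has the same fixed size.
    """
    history = (turn - 1) * (user_tokens + reply_tokens)
    return system_tokens + history + user_tokens


if __name__ == "__main__":
    # Factor 1: 2,000-token prompt, 100,000 requests, standard vs cached price.
    print(monthly_system_prompt_cost(2_000, 100_000, 2.50))  # 500.0
    print(monthly_system_prompt_cost(2_000, 100_000, 0.25))  # 50.0
    # Factor 2: 500 visible tokens plus 3,000 reasoning tokens.
    print(billed_output_tokens(500, 3_000))  # 3500
    # Factor 3: call inflation at 5% and 15% error rates.
    print(f"{expected_calls_per_success(0.05):.3f}")
    print(f"{expected_calls_per_success(0.15):.3f}")
    # Factor 4: 200-token user messages with hypothetical 250-token replies.
    print(input_tokens_at_turn(5, 0, 200, 250))  # 2000
```

Real caching and retry behavior is messier than this (cache misses, backoff limits, partial failures), so treat the outputs as rough adjustments to the base formula rather than exact figures.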
Count Your Tokens in 30 Seconds
The most common error in AI cost planning is estimating token counts instead of measuring them. “It’s probably about 500 tokens” is a rough guess that can be off by 3–4x depending on prompt structure, language, whitespace, and special characters.
The free AI Token Counter lets you paste your exact system prompt and a representative user message, then shows you the precise token count, word and character equivalents, and a side-by-side cost estimate across GPT-5, GPT-4.1 Mini, Claude Sonnet 4, Gemini 2.5 Flash, and others. Run it on your 10th-percentile, median, and 90th-percentile request sizes to understand your cost distribution — not just your average case.
Once you have real token counts, the formula above gives you a defensible cost projection you can actually take to a budget meeting or product roadmap discussion.
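Once you have measured counts, the percentile view can be computed with Python's standard library. This is a minimal sketch: the function names are assumptions, and the "inclusive" method is chosen on the assumption that your samples already contain your smallest and largest realistic requests.

```python
from statistics import quantiles


def request_cost(input_tokens, output_tokens, input_price_mtok, output_price_mtok):
    """Dollar cost of one request at per-million-token prices."""
    return (input_tokens * input_price_mtok + output_tokens * output_price_mtok) / 1_000_000


def cost_percentiles(samples, input_price_mtok, output_price_mtok):
    """Return the 10th, 50th, and 90th percentile cost per request.

    samples is a list of (input_tokens, output_tokens) pairs; at least two are needed.
    """
    costs = [request_cost(i, o, input_price_mtok, output_price_mtok) for i, o in samples]
    # n=10 yields nine cut points: the 10%, 20%, ..., 90% deciles.
    deciles = quantiles(costs, n=10, method="inclusive")
    return {"p10": deciles[0], "p50": deciles[4], "p90": deciles[8]}


if __name__ == "__main__":
    # Hypothetical measured (input_tokens, output_tokens) pairs.
    measured = [(520, 90), (610, 120), (700, 150), (740, 160), (950, 210), (1800, 400)]
    # Prices are the GPT-4.1 Mini figures quoted in this article.
    for name, cost in cost_percentiles(measured, 0.40, 1.60).items():
        print(f"{name}: ${cost:.5f} per request")
```

A wide gap between the median and the 90th percentile is a sign that a few long requests drive a disproportionate share of the bill.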
Frequently Asked Questions
How do I get my average token counts if I haven’t built the feature yet? Manually assemble 10–20 representative prompts the way your application would send them — system prompt plus realistic user inputs. Run them through a token counter to get counts. This takes 20–30 minutes and gives you a much better estimate than guessing. For output tokens, ask the model to complete a handful of sample requests and log what comes back.
Does the model temperature setting affect my token costs? Not directly. Temperature controls randomness in the output; it has no effect on input token counts or on how tokens are priced. A higher temperature might produce slightly longer or shorter responses as a side effect of different word choices, but the effect is noise-level small compared to prompt design decisions.
Is batch processing always cheaper? Yes, if your use case tolerates latency. OpenAI’s Batch API processes requests asynchronously (results within 24 hours) at 50% off standard pricing. Anthropic offers a similar batch discount. For any non-real-time task — overnight report generation, background enrichment, scheduled summaries — batch processing halves your effective per-token cost.
How do I log token usage per request in production? Every major provider returns token usage in the API response. OpenAI’s Chat Completions API, for example, returns usage.prompt_tokens and usage.completion_tokens in the response object. Log these to your analytics store (Datadog, Mixpanel, your own database) and you’ll have real cost attribution per feature, user, and request type within a week of deployment.
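A logging helper for this might look like the sketch below. It takes the usage data as a plain mapping with the two field names mentioned above. How you extract that mapping from a particular SDK's response object, and where the record is shipped, are left out. The function name, the feature label, and the record layout are all assumptions.

```python
import json
import time


def usage_record(feature, usage, input_price_mtok, output_price_mtok):
    """Build one log record from a usage mapping with prompt/completion counts."""
    prompt_tokens = usage["prompt_tokens"]
    completion_tokens = usage["completion_tokens"]
    cost = (
        prompt_tokens * input_price_mtok + completion_tokens * output_price_mtok
    ) / 1_000_000
    return {
        "ts": time.time(),             # when the request was logged
        "feature": feature,            # hypothetical label for cost attribution
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": round(cost, 6),
    }


if __name__ == "__main__":
    # Usage values here mirror the support-ticket example, priced at the
    # GPT-4.1 Mini rates quoted in this article.
    record = usage_record(
        "ticket_summary", {"prompt_tokens": 700, "completion_tokens": 150}, 0.40, 1.60
    )
    print(json.dumps(record))
```

Storing the computed cost next to the raw counts means you can still recalculate later if prices change, while dashboards can sum the cost field directly.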
What’s a reasonable cost target per AI-assisted action for a B2B SaaS product? A rough benchmark from NMM experience: most B2B SaaS teams price their product so AI costs represent under 10–15% of revenue per user. If your plan charges $50/user/month, keeping AI costs under $5–7.50/user/month is a healthy target. That translates to roughly 1,000–3,000 AI actions per user per month at $0.002–0.005 per action, depending on model tier.