Your AI Inference Bill Grows Every Sprint Without New Traffic: It's Your System Prompts

Input token consumption climbs 15%+ month over month while traffic stays flat: slow enough to look like organic growth, fast enough to compound to roughly 435% growth over a year. The culprit is never in a ticket. It's in the system prompt.

Every sprint, your team makes the AI better. They add a few-shot example that improves output quality. They append a safety instruction to prevent an edge-case failure. They expand the RAG chunk size to give the model more context. They inject an extra section of conversation history to make responses more coherent. None of these changes get a cost ticket. None of them trigger a deploy alert. And every one of them adds tokens to every single request your application makes, forever.

At 10,000 requests per day, adding 200 tokens to a system prompt adds 2 million extra input tokens per day. At Claude Haiku rates, that is roughly $0.50/day, or $15/month, from a single sprint's worth of improvements. Repeat across four sprints and twelve feature additions over a quarter, and a 200-token system prompt has grown to 2,000 tokens. The cost footprint is 10× what it was at launch. Nothing in your deployment pipeline measured it.

The pattern is slow enough to evade standard spike detection. A 15% month-over-month cost increase doesn't look alarming in isolation; it looks like organic business growth. But 15% MoM compounds to roughly 435% annualized growth (1.15^12 ≈ 5.35×). A workload that cost $500/month in January reaches about $2,326/month by December (eleven compounding steps) with no increase in request volume. The entire growth is in per-request token consumption, not in the number of users or API calls. The compute leg (EC2, Fargate, Lambda) stays flat the whole time. Tokens grow; servers don't. That divergence is the fingerprint.
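The per-request arithmetic above can be sketched as a small helper. This is a minimal illustration, not a pricing tool: the function name is hypothetical, and the $0.25 per million input tokens rate is an assumption back-derived from the $0.50/day figure, so substitute your model's actual rate.

```python
def added_prompt_cost(extra_tokens: int, requests_per_day: int,
                      usd_per_million_input_tokens: float,
                      days: int = 30) -> dict:
    """Estimate the recurring cost of adding tokens to a system prompt."""
    extra_tokens_per_day = extra_tokens * requests_per_day
    usd_per_day = extra_tokens_per_day / 1_000_000 * usd_per_million_input_tokens
    return {
        "extra_tokens_per_day": extra_tokens_per_day,
        "usd_per_day": usd_per_day,
        "usd_per_period": usd_per_day * days,
    }


# 200 extra tokens at 10,000 requests/day, assumed $0.25 per million input tokens.
estimate = added_prompt_cost(200, 10_000, 0.25)
print(estimate)
```

Running the same helper against each prompt change in a sprint gives a per-change cost figure that can sit next to the quality argument in review.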

The Billing Signal

A single billing snapshot for a system-prompt-bloated workload is structurally indistinguishable from a healthy one. The pattern only becomes visible when you compare two 30-day windows side by side.

| FOCUS / CUR field | Healthy workload (stable prompt) | Context window creep (growing prompt) |
| --- | --- | --- |
| servicename | Amazon Bedrock | Amazon Bedrock (identical) |
| x_usagetype | USE1-AmazonBedrock-InputTokens:... | USE1-AmazonBedrock-InputTokens:... (identical usage type) |
| chargecategory | Usage | Usage (identical) |
| consumedquantity | Stable (e.g., 80.0 = 80K tokens/day) | Trending: 80.0 → 92.0 → 105.8 → 121.7 over four 30-day windows |
| billedcost | Flat with traffic patterns | Rising 15% MoM with flat request count |
| resourceid | Inference profile ARN or model ARN | Same ARN: the model hasn't changed, the prompt has |
| Request count (CloudWatch) | Tracking user traffic | Flat: growth is tokens-per-request, not requests-per-day |

The detection condition: two 30-day windows are compared per (subaccountid, resourceid). Prior window: [today − 60d, today − 30d). Current window: [today − 30d, today). The signal fires when prior_cost exceeds $10 and current_cost / prior_cost exceeds 1.15. This catches the slow, sprint-aligned growth pattern that spike detectors miss entirely.
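As a minimal sketch, the same condition can be written in a few lines of Python. The function names are hypothetical; the $10 floor and the 1.15 ratio are the values stated above.

```python
from datetime import date, timedelta


def windows(today: date):
    """Return (prior, current) half-open [start, end) 30-day windows."""
    prior = (today - timedelta(days=60), today - timedelta(days=30))
    current = (today - timedelta(days=30), today)
    return prior, current


def creep_signal(prior_cost: float, current_cost: float,
                 min_prior_cost: float = 10.0,
                 growth_ratio: float = 1.15) -> bool:
    """True when current-window cost exceeds prior-window cost by the ratio."""
    if prior_cost <= min_prior_cost:
        return False  # too small to evaluate; also avoids dividing by zero
    return current_cost / prior_cost > growth_ratio


prior, current = windows(date(2026, 3, 31))
print(prior, current, creep_signal(100.0, 122.0))
```

The minimum-cost floor keeps tiny workloads, where a few dollars of change is a large percentage, from flooding the result set.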

The false positive to rule out first: A model upgrade (Haiku → Sonnet, Sonnet → Opus) also causes per-request cost to jump without traffic growth. Rule this out by checking whether the model identifier in resourceid or x_usagetype changed in the same 30-day window. If the model changed, the cost jump is expected. If the model is the same and cost is still growing, the tokens are in the prompt.
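One way to sketch this rule-out, assuming you have already extracted the set of model identifiers seen in each window (how you parse them out of resourceid or x_usagetype depends on your data, and the identifiers below are placeholders):

```python
def explain_cost_jump(prior_model_ids: set[str], current_model_ids: set[str]) -> str:
    """Separate an expected model-change jump from suspected prompt growth."""
    if prior_model_ids != current_model_ids:
        return "model_changed"  # cost jump is expected; not prompt growth
    return "same_model"         # same model, rising cost: inspect the prompt


# Placeholder identifiers; real values come from resourceid or x_usagetype.
print(explain_cost_jump({"model-a"}, {"model-a", "model-b"}))
print(explain_cost_jump({"model-a"}, {"model-a"}))
```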

Real Incidents

These incidents are sourced from published engineering post-mortems and cost analyses. None are fabricated.

$8,400/month → $800/month: CI/CD code-review agent (TrueFoundry, 2026): A 50-engineer org ran AI code review on every pull request. The monthly Bedrock bill reached $8,400. Root cause: a 50,000-token security policy manual was injected into every prompt regardless of whether the PR had anything to do with security. The manual had been added sprint by sprint as the team encountered new policy edge cases, and nobody measured its cost per invocation. Replacing the full injection with conditional retrieval (injecting only the sections relevant to the PR diff) cut the bill to under $800, a 90% reduction from prompt scope alone. Source: TrueFoundry, "Agentic Token Explosion in CI/CD" (2026).

2,500-line YAML system prompts at 20,000 tokens each (ProjectDiscovery, 2026): An agentic security scanner accumulated system prompts that grew to 2,500 lines of YAML across sprint iterations, over 20,000 tokens per invocation. On a 40-step scan task, that prompt was re-sent 40 times, consuming 800,000 tokens for the prompt alone before a single byte of scan output. A single complex task reached 60 million total tokens. The team implemented prompt caching, achieving 91.8% cache-hit rates and cutting costs 59%. The cache-hit rate proved the prompt was far larger than any single step needed: the bloat had accumulated beyond what was functionally necessary for each step. Source: ProjectDiscovery, "How We Cut LLM Costs by 59% With Prompt Caching" (2026).

Annual AI budget burned in four months: growing context at enterprise scale (Uber via Investing.com, 2026): As Claude Code adoption at a 5,000-engineer organization jumped from 32% to 84%, the entire annual AI budget was consumed in four months. Monthly API cost per engineer reached $500–$2,000. A significant driver was growing context window consumption: tool schemas, conversation history, and codebase context grew per session, not per engineer. Total cost grew faster than headcount because token consumption per session was climbing. Source: Investing.com, "The AI Token Pricing Crisis Behind OpenAI and Anthropic's Revenue Race" (2026).

10× prompt growth over six months, $1,500/month in pure bloat (Adaline, industry pattern): A documented pattern: a 200-token system prompt grows to 2,000 tokens over six months of sprint iterations, each change individually reasonable, collectively unmeasured. At 10,000 requests per day, that 1,800-token growth adds 18 million extra input tokens daily, approximately $50/day of pure prompt overhead at GPT-4o rates, or $1,500/month. No deploy was tagged with a cost impact. No ticket tracked the accumulation. The team discovered it only when they first measured their prompt's token count against the original version. Source: Adaline, "LLM Cost Optimization: Token Efficiency, Caching, and Prompt Design" (2026).


If you built it: what to look for and how to fix it

For platform engineers, ML engineers, and AI application developers who built or maintain the inference workload.

The bill doesn't show you which system prompt caused the growth. Bedrock charges at the invocation level, and the invocation record doesn't include your prompt text. Here is how to isolate the cost increase and prevent the next sprint from making it worse.

Step 1: Confirm cost is growing faster than requests

In AWS Cost Explorer: set Service = Amazon Bedrock, Group by = Usage Type. Look for rows matching your workload's input token usage type (USE1-AmazonBedrock-InputTokens:anthropic.claude-...). Compare this month to last month. If billedcost is up but your application's request count (from your own logs or CloudWatch) is flat, per-request cost is growing. That's the signal.

Divide total monthly input token cost by total monthly invocations. If that ratio is trending up month over month with no model upgrade, your prompts are growing.
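A minimal sketch of that ratio check, using hypothetical monthly figures (cost rising 15% a month against a flat invocation count). The function names are illustrative only.

```python
def cost_per_invocation(monthly_costs, monthly_invocations):
    """Return input-token cost per invocation for each month."""
    return [cost / calls for cost, calls in zip(monthly_costs, monthly_invocations)]


def rising_every_month(series):
    """True when each value is strictly higher than the one before it."""
    return all(later > earlier for earlier, later in zip(series, series[1:]))


# Hypothetical figures: cost rises 15% a month while invocations stay flat.
costs = [500.00, 575.00, 661.25]
invocations = [300_000, 300_000, 300_000]
ratios = cost_per_invocation(costs, invocations)
print(ratios, rising_every_month(ratios))
```

If invocations had grown in step with cost, the ratio would stay flat and the check would not fire, which is the distinction between traffic growth and prompt growth.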

Step 2: Measure your prompt size today

Most teams have never measured their system prompt token count after launch. Use your SDK's token-counting endpoint, which doesn't invoke the model:

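import anthropic

# Placeholder: load the system prompt your application actually sends.
YOUR_SYSTEM_PROMPT = "..."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.count_tokens(
    model="claude-haiku-4-5-20251001",
    system=YOUR_SYSTEM_PROMPT,
    messages=[{"role": "user", "content": "hello"}],
)
# The count includes the short placeholder user message, so it slightly
# overstates the system prompt on its own.
print(f"Input tokens (system prompt + placeholder message): {response.input_tokens}")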

Set a token budget for your system prompt and fail CI if the prompt exceeds it. This is the single most effective intervention: it makes future additions a deliberate, reviewed cost decision rather than an invisible config change that accumulates sprint by sprint.
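A CI gate along these lines could look like the sketch below. The budget value, the function names, and the idea of reading the prompt from a file are all assumptions for illustration; only the count_tokens call mirrors the snippet above. The SDK import is kept inside the counting function so the budget comparison can be exercised without network access.

```python
TOKEN_BUDGET = 1_500  # hypothetical budget; pick one that fits your workload


def count_prompt_tokens(system_prompt: str, model: str) -> int:
    """Count input tokens via the Anthropic SDK without invoking the model."""
    import anthropic  # imported lazily so check_budget works offline

    client = anthropic.Anthropic()
    response = client.messages.count_tokens(
        model=model,
        system=system_prompt,
        messages=[{"role": "user", "content": "hello"}],
    )
    return response.input_tokens


def check_budget(token_count: int, budget: int = TOKEN_BUDGET) -> int:
    """Return a process exit code: 0 within budget, 1 over budget."""
    if token_count > budget:
        print(f"FAIL: prompt uses {token_count} tokens, budget is {budget}")
        return 1
    print(f"OK: prompt uses {token_count} tokens, budget is {budget}")
    return 0


def main(prompt_path: str, model: str = "claude-haiku-4-5-20251001") -> int:
    """Read the prompt file, count its tokens, and compare to the budget."""
    with open(prompt_path, encoding="utf-8") as handle:
        return check_budget(count_prompt_tokens(handle.read(), model))
```

In a pipeline, a wrapper script would pass the result of main(...) to sys.exit so a non-zero code fails the job. Raising the budget then becomes an explicit, reviewable change.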

The detection SQL (FOCUS-native, QB17)

If you have FOCUS-formatted billing exports (AWS Data Exports, FOCUS 1.2), this query identifies workloads where input token cost grew more than 15% month over month. Two 30-day windows are compared per (subaccountid, resourceid):

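-- Prior window: [today - 60d, today - 30d)
WITH prior_window AS (
    SELECT
        subaccountid,
        resourceid,
        SUM(billedcost)       AS prior_cost,
        SUM(consumedquantity) AS prior_tokens_k
    FROM focus_billing
    WHERE servicename    LIKE '%Bedrock%'
      -- Match the input-token usage type shown in the billing signal table
      AND x_usagetype   LIKE '%InputTokens%'
      AND chargecategory = 'Usage'
      -- Assumes this table labels non-correction rows 'Regular';
      -- adjust if your export leaves ChargeClass null for regular charges
      AND chargeclass    = 'Regular'
      AND CAST(chargeperiodstart AS DATE) >= DATE_ADD('day', -60, CURRENT_DATE)
      AND CAST(chargeperiodstart AS DATE) <  DATE_ADD('day', -30, CURRENT_DATE)
    GROUP BY 1, 2
),
-- Current window: [today - 30d, today)
current_window AS (
    SELECT
        subaccountid,
        resourceid,
        SUM(billedcost)       AS current_cost,
        SUM(consumedquantity) AS current_tokens_k
    FROM focus_billing
    WHERE servicename    LIKE '%Bedrock%'
      AND x_usagetype   LIKE '%InputTokens%'
      AND chargecategory = 'Usage'
      AND chargeclass    = 'Regular'
      AND CAST(chargeperiodstart AS DATE) >= DATE_ADD('day', -30, CURRENT_DATE)
      -- Exclude the partial current day so both windows cover 30 full days
      AND CAST(chargeperiodstart AS DATE) <  CURRENT_DATE
    GROUP BY 1, 2
)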
SELECT
    c.subaccountid,
    c.resourceid,
    ROUND(c.current_cost, 2)                                    AS current_month_cost,
    ROUND(p.prior_cost, 2)                                      AS prior_month_cost,
    ROUND(c.current_tokens_k, 1)                                AS current_tokens_k,
    ROUND(p.prior_tokens_k, 1)                                  AS prior_tokens_k,
    ROUND((c.current_tokens_k - p.prior_tokens_k)
          / NULLIF(p.prior_tokens_k, 0) * 100.0, 1)            AS token_growth_pct,
    ROUND(c.current_cost - p.prior_cost, 2)                    AS monthly_cost_delta
FROM current_window c
JOIN prior_window   p
  ON c.subaccountid = p.subaccountid
 AND c.resourceid   = p.resourceid
WHERE p.prior_cost > 10.0
  AND c.current_cost / NULLIF(p.prior_cost, 0) > 1.15
ORDER BY monthly_cost_delta DESC

Reading the results: monthly_cost_delta is the extra spend this month versus last. token_growth_pct confirms whether the growth is in tokens (prompt bloat) or cost-only (model price change). A workload with 80%+ token growth and flat request count has a prompt that expanded, not a traffic event. A workload with 15% cost growth but 0% token growth is likely a pricing change from a model tier shift; check resourceid for a model identifier change.
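The reading rules above can be sketched as a small classifier for rows the query has already flagged. The request-growth input has to come from your own logs or CloudWatch, and the 2-point tolerance is an assumption, not a value from the query.

```python
def interpret_row(token_growth_pct: float, request_growth_pct: float,
                  tolerance_pct: float = 2.0) -> str:
    """Classify a flagged workload by where its cost growth came from."""
    if abs(token_growth_pct) <= tolerance_pct:
        # Cost rose but tokens did not: check for a model or price change.
        return "price_or_model_change"
    if token_growth_pct - request_growth_pct > tolerance_pct:
        # Tokens grew faster than requests: tokens per request went up.
        return "prompt_growth"
    # Tokens grew roughly in line with requests.
    return "traffic"


print(interpret_row(80.0, 0.0))
print(interpret_row(0.0, 0.0))
print(interpret_row(20.0, 19.0))
```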

What to do differently next sprint

Add a token count check to your PR template for any change that touches a system prompt. Make the before/after count visible in the review. Consider conditional injection for policy content: instead of the full safety manual in every prompt, retrieve only the sections relevant to the current request; the TrueFoundry team cut costs 90% with this change. For large, stable system prompt sections, evaluate prompt caching: with Anthropic's cache_control prefix caching, cache reads are billed at a fraction of the normal input rate while cache writes carry a premium, so the savings depend on a high cache-hit rate. Caching reduces the cost of the cached portion; it does not eliminate it.
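Conditional injection can be sketched as follows. Keyword matching stands in here for whatever retrieval you actually use, and the policy sections, keywords, and function name are all hypothetical.

```python
# Hypothetical policy sections, each with the keywords that make it relevant.
POLICY_SECTIONS = {
    "secrets": (("password", "api_key", "secret"), "Policy: never commit credentials."),
    "sql": (("select ", "insert ", "execute("), "Policy: use parameterized queries."),
}


def build_system_prompt(base_prompt: str, request_text: str,
                        sections: dict = POLICY_SECTIONS) -> str:
    """Append only the policy sections whose keywords appear in the request."""
    lowered = request_text.lower()
    relevant = [text for keywords, text in sections.values()
                if any(keyword in lowered for keyword in keywords)]
    return "\n\n".join([base_prompt, *relevant])


prompt = build_system_prompt("You are a code reviewer.", "+ API_KEY = 'abc'")
print(prompt)
```

A request that matches no section gets only the base prompt, so the token cost of the policy manual is paid only when it is relevant.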


If you watch the bill: how to detect this at scale

For FinOps practitioners, cloud finance analysts, and platform teams monitoring spend across multiple accounts.

Context window creep is a multi-account problem. A single workload growing 15% MoM may be noise. Across 20 accounts each with several AI workloads, a cluster of 15%+ MoM growers is a routing signal worth sending to the teams responsible for them.

Multi-account detection approach

Run the detection SQL above against your centralized FOCUS data lake. The query groups by subaccountid and resourceid, so you see every workload that crossed the 15% threshold across your entire organization in a single result set. Sort by monthly_cost_delta DESC to triage the highest-value remediations first.

The 15% threshold is intentional. Below 10%, the signal is indistinguishable from billing rounding and measurement noise. Above 30%, this is more likely a real traffic event, a model upgrade, or a runaway inference incident (QB15: a different pattern with a different fix). The 15–25% range is the characteristic shape of prompt bloat: slow enough to go unnoticed month to month, fast enough to compound materially over a quarter.
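For routing at scale, these bands can be sketched as a lookup. The text does not say how to treat growth between 10% and 15% or between 25% and 30%, so labeling those ranges "ambiguous" is an assumption, as are the label names.

```python
def growth_band(mom_growth_pct: float) -> str:
    """Map month-over-month cost growth (in percent) to a triage band."""
    if mom_growth_pct < 10:
        return "noise"               # indistinguishable from rounding and noise
    if mom_growth_pct > 30:
        return "other_cause_likely"  # traffic event, model upgrade, or runaway
    if 15 <= mom_growth_pct <= 25:
        return "prompt_bloat_band"   # characteristic shape of prompt growth
    return "ambiguous"               # 10-15% and 25-30%: not specified above


for pct in (5, 12, 22, 27, 45):
    print(pct, growth_band(pct))
```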

Monthly impact math

At $500/month per AI workload and 15% MoM growth, the 12-month trajectory:

| Month | Single workload cost | Delta vs. Month 1 | 10 workloads |
| --- | --- | --- | --- |
| 1 | $500 | - | $5,000 |
| 3 | $661 | +$161 | $6,610 |
| 6 | $1,006 | +$506 | $10,060 |
| 12 | $2,326 | +$1,826 | $23,260 |

At 10 workloads with this growth pattern, the organization is spending about $23,260/month by month 12 on a workload portfolio that started at $5,000/month: an $18,260/month overrun from prompt bloat alone, with no new users and no new features deployed. For workloads starting at $1,000/month, double all figures.
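The trajectory can be recomputed for any starting cost and growth rate with a short sketch like this one. It assumes month 1 is the starting cost and each later month compounds once; the function name is hypothetical.

```python
def trajectory(start_cost: float, monthly_growth: float, months: int) -> list:
    """Monthly cost where month 1 is start_cost and each later month compounds."""
    return [start_cost * (1 + monthly_growth) ** step for step in range(months)]


costs = trajectory(500.0, 0.15, 12)
for month in (1, 3, 6, 12):
    single = costs[month - 1]
    # Columns: month, single workload cost, delta vs. month 1, ten workloads
    print(month, round(single, 2), round(single - costs[0], 2), round(single * 10, 2))
```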

How to report this to the team that caused it

The team is not aware this is happening. Developers don't get billed for adding a few-shot example. The framing that works: "your inference workload grew 22% last month while your request count held flat. That's a token-per-request increase, not a traffic increase. Do you know if the system prompt changed this sprint?" Almost always, it did. The developer who changed it usually has a good reason; they just didn't know the cost impact.

Give them the monthly_cost_delta figure and the inference profile ARN. That's enough to trace back to the workload and the sprint that caused it. The conversation ends with a token budget added to the next sprint's definition of done, not a blame discussion.


If you own the outcome: the governance gap and how to close it

For engineering managers, VP Engineering, and CTOs who see a growing Bedrock line item that doesn't match user growth.

Your teams are making good engineering decisions. Adding few-shot examples improves output quality. Expanding RAG context reduces hallucination rates. Longer safety instructions reduce harmful outputs. Every individual change has a defensible rationale. The problem is that none of these changes include a cost measurement, and none are reversible once they're merged into the system prompt that the whole team treats as "just configuration."

Why standard dashboards miss it

AWS Cost Anomaly Detection is typically configured with service-level monitors. A 15% month-over-month increase on a workload that costs $500/month is a $75 change, well below any anomaly threshold in an org spending thousands on Bedrock monthly. The cost compounds month after month before the absolute value is large enough to surface in any alert. By the time it's alarming, it's been accruing for six to twelve months.
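To see how long a compounding increase can stay under an absolute alert threshold, consider this rough sketch. The threshold values are hypothetical and are not defaults of any AWS service; the function name is illustrative.

```python
def months_until_alert(start_cost: float, monthly_growth: float,
                       alert_delta: float, max_months: int = 120):
    """Months until a single month-over-month increase exceeds alert_delta.

    Returns None if the threshold is never crossed within max_months.
    """
    cost = start_cost
    for month in range(1, max_months + 1):
        increase = cost * monthly_growth
        if increase > alert_delta:
            return month
        cost += increase
    return None


# Hypothetical thresholds on the monthly increase, in dollars.
for threshold in (80.0, 150.0):
    print(threshold, months_until_alert(500.0, 0.15, threshold))
```

The higher the absolute threshold relative to the workload's starting cost, the longer the growth runs before anything fires.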

Your FinOps dashboard shows Bedrock growing at a rate you might attribute to team growth or new AI feature adoption. The per-workload analysis that would reveal token consumption growing faster than requests is not in the default view of any standard cost monitoring tool. Total spend looks like organic growth. It is not.

The governance gap: no token budget at prompt authoring time

System prompts are treated as code (versioned, reviewed, deployed) but not as cost. There is no token budget, no cost gate, no measurement in the merge process. The structural gap is not technical; it's a process gap. The fix is a single policy: every system prompt has a token budget, enforced as a CI gate, before the feature that depends on it ships.

State of FinOps 2026 reports that 98% of organizations now actively manage AI spend (up from 31% two years prior). Token trend monitoring, specifically per-workload MoM token consumption growth, is the top identified gap in current FinOps tooling. The industry has not yet built the equivalent of a cost-per-query metric for prompt authoring. Until it does, a CI gate is the only mechanism that catches this before production.

Decision framework: fix now vs. accept

Fix now if: the detection query returns workloads with more than 25% MoM token growth, or if the sum of monthly_cost_delta across flagged workloads exceeds the cost of a developer-day. The fix (prompt audit, token count measurement, CI gate) takes one to two days. Left alone, the waste keeps compounding every month.

Accept if: the workload costs less than $50/month, the growth is well below 15%, or the quality improvement from prompt expansion demonstrably offsets the cost via a measurable A/B test, reduced human review time, or lower hallucination rate in a tracked eval suite. Accepting context window creep is a valid choice when the cost is small and the quality gain is real and measured. It is not a valid choice by default.
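The framework can be sketched as a triage function. Several choices here are assumptions rather than part of the framework: "well below 15%" is read as under 10%, accept conditions are checked before fix-now conditions, anything matching neither is labeled "monitor", and the developer-day cost is a number you supply.

```python
def triage(monthly_cost: float, token_growth_pct: float,
           flagged_delta_total: float, developer_day_cost: float,
           quality_gain_measured: bool = False) -> str:
    """Return 'accept', 'fix_now', or 'monitor' for a flagged workload."""
    # Accept: small workload, low growth, or a measured quality gain.
    if monthly_cost < 50 or token_growth_pct < 10 or quality_gain_measured:
        return "accept"
    # Fix now: fast token growth, or total waste above a developer-day.
    if token_growth_pct > 25 or flagged_delta_total > developer_day_cost:
        return "fix_now"
    return "monitor"


# Hypothetical inputs: $500/month workload, 30% token growth, $800 developer-day.
print(triage(500.0, 30.0, 100.0, 800.0))
```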


Fix checklist


Find out what else is hiding in your AI cloud bill

Context window creep is one of several AI spend patterns that standard FinOps tools don't detect, either because they accumulate below general anomaly thresholds, grow too slowly for spike detectors, or require per-workload trend analysis rather than service-level totals. The DropInFinOps free assessment takes 2 minutes and shows you which patterns your current billing setup is positioned to catch and which ones are accumulating undetected.
