Runaway AI Inference: How a Prompt Loop at 2 AM Becomes a $4,200 Weekend Bill

Your Bedrock inference was $7/day last week. This morning it's $132/day. No AWS alert fired. The agent that caused it stopped running six hours ago.

When an AI agent loop fails to terminate โ€” or accumulates context faster than it clears it โ€” token consumption doesn't grow linearly. It grows quadratically. A 20-step agent that re-injects its full conversation history on every iteration sends more input tokens on step 20 than it sent on steps 1 through 10 combined. By the time daily billing data reflects the spike, the damage is done.

This isn't a billing disguise problem like orphaned Knowledge Base OCU charges. The costs appear exactly where you'd expect โ€” under Amazon Bedrock, inference token usage. The problem is that no standard FinOps alert watches for a statistical spike in that specific meter. AWS Cost Anomaly Detection operates at the service level. A team with $50K/month in EC2 won't have a Bedrock-specific anomaly threshold configured. The bill arrives. Nobody connected it to the weekend agent run.

The detection logic is straightforward: compare the 3-day average inference spend against the 23-day baseline (days โˆ’30 to โˆ’7) and flag accounts where the recent average exceeds the baseline by more than 2 standard deviations. This self-calibrating threshold means high-variance accounts require a larger absolute spike to trigger, while low-variance accounts fire on smaller deviations. At 2ฯƒ, roughly 97.7% of normal days don't produce a false positive. When this query runs against validated billing data, it returns spike_ratio=18.04ร— and ฯƒ=819.57. That is the shape of a real runaway inference event.

Why Token Costs Compound Faster Than You Expect

Real-world agents use significantly more tokens per task than an equivalent chat session โ€” often 10โ€“20ร— more, and higher in failure scenarios. The mechanism is context accumulation: each iteration of an agent loop sends the full conversation history as input tokens. A 20-step agent with an average of 10,000 tokens per tool call response accumulates approximately 210,000 input tokens by step 20 โ€” versus 20,000 for 20 independent chat calls. That is a 10.5ร— amplification factor from architecture alone, before accounting for retry behavior.

Retry amplification compounds the problem. When a tool call fails and the agent retries with a 20% per-step failure rate, token consumption approximately doubles relative to a clean run (TianPan.co, 2026). Each failed tool call retries with the full accumulated context โ€” the retry itself adds tokens, and the next step starts from a larger context window than a clean run would have produced. Combined with context accumulation, real retry-heavy workloads can reach 20ร— or more relative to a simple chat session of equivalent scope.

AWS Bedrock billing appears in Cost and Usage Reports with up to a 24-hour delay. By the time the spike is visible in billing data, a loop running since midnight has already run for 12+ hours. Daily CUR granularity is the correct tool for catching these events โ€” not because it is fast, but because it is the only billing signal available.

Real Incidents

These incidents are sourced from public post-mortems, practitioner analyses, and community threads. None are fabricated.

$15 in under 10 minutes (r/AI_Agents, 2026): A developer agent loop stuck retrying with no spend cap. No manual intervention โ€” the loop terminated when it hit an API rate limit. The $15 would have been $150 at 10ร— the retry budget. The developer only noticed when the next API call failed.

$4,200 over a long weekend (LeanOps, 2026): A developer left an autonomous refactoring run (Cursor/Claude Code) running unattended across a three-day weekend. No per-session spend cap, no token limit on context window management, no alert on Bedrock-specific spend. Discovered Monday morning. The baseline had been ~$20/day.

~$26,100/month โ€” bug-triage agent, 35-engineer SaaS team (LeanOps, 2026): An automated bug-triage agent routing all tickets through an Opus-class model consumed 30% of the team's $87,000/month AI bill. No prompt caching, no model routing to cheaper tiers for simpler tasks, no per-ticket token cap. The team didn't notice because the absolute Bedrock spend growth was gradual โ€” until it was dominant.

$16,000โ€“$50,000 in 5 hours (r/Claude via relayplane.com, 2026): A single user session accumulated 1.67 billion tokens in 5 hours from context accumulation in a loop. Cost estimate ranges from $16K to $50K depending on model tier mix. Discovered via API dashboard, not billing alerts.

$15,000 total over 8 months (r/ClaudeAI, 2026): A single developer's daily Claude Code usage accumulated 10 billion tokens over 8 months with no usage tracking or per-day budget. No organization-level spend caps were in place. Monthly line items appeared unremarkable individually; the cumulative total emerged only when the developer reconciled annual cloud spend.

Retry amplification analysis (TianPan.co, 2026-04-16): A structured analysis of LLM agent retry behavior found that a 20% per-step failure rate approximately doubles the token bill per task completed. Each failed tool call retries with the full accumulated context โ€” the retry itself adds tokens, and the next step starts from a larger context window than a clean run would have produced.


If you built it: what to look for and how to fix it

For platform engineers, ML engineers, and AI application developers responsible for inference-heavy workloads.

The bill doesn't tell you which agent caused the spike. Bedrock charges at the model level, not the session level. Here is how to trace a spike back to its source and prevent the next one.

Step 1 โ€” Confirm the spike is in inference, not storage

In AWS Cost Explorer: set Service = Amazon Bedrock, Group by = Usage Type. Look for rows with InvokeModelInference or InvokeModelStreamingInference in the usage type โ€” or token-type suffixes like -input-tokens and -output-tokens. If the spike is in these rows and your compute (EC2, Fargate) spend is flat, that's the runaway inference fingerprint: token explosion without server scale-out. The two meters decouple โ€” more tokens does not mean more servers.

If the spike is disproportionately in output tokens, look for agent loops generating verbose reasoning chains. Output tokens cost 3โ€“5ร— input tokens on frontier models โ€” an agent producing 2,000-token step summaries on each iteration multiplies cost faster than one consuming long context.

Step 2 โ€” Identify the workload

Switch Cost Explorer to Group by = Resource (requires resource-level cost allocation). Look for the inference profile ARN pattern:

arn:aws:bedrock:{region}:{account}:inference-profile/

Cross-reference against your deployed AI applications. If you have multiple services calling Bedrock, the inference profile ARN or subAccountId narrows it to the responsible application team.

Via CLI โ€” query recent high-token invocations (requires model invocation logging enabled):

# List in-progress invocation jobs
aws bedrock list-model-invocation-jobs --status-equals InProgress

# If logging to CloudWatch is configured, filter for high-token events:
aws logs filter-log-events \
  --log-group-name /aws/bedrock/modelinvocations \
  --start-time $(date -d '3 days ago' +%s000) \
  --filter-pattern "{ $.totalTokens > 100000 }"

The detection SQL (FOCUS-native, QB15)

If you have AWS FOCUS-formatted billing exports (AWS Data Exports, FOCUS 1.2), this query identifies accounts where Bedrock inference spending has spiked more than 2 standard deviations above the 23-day rolling baseline:

-- QB15: Runaway AI Inference โ€” statistical spike detection
-- Fires when 3-day avg inference spend > baseline_mean + 2 ร— baseline_ฯƒ
-- Self-calibrating: high-variance accounts need a larger absolute spike to trigger
WITH inference_daily AS (
    SELECT
        subaccountid,
        CAST(chargeperiodstart AS DATE) AS charge_date,
        SUM(billedcost)                 AS daily_cost
    FROM focus_billing
    WHERE servicename LIKE '%Bedrock%'
      AND x_usagetype LIKE '%InvokeModel%'
      AND chargecategory = 'Usage'
      AND chargeclass    = 'Regular'
      AND CAST(chargeperiodstart AS DATE) >= DATE_ADD('day', -30, CURRENT_DATE)
    GROUP BY 1, 2
),
baseline AS (
    SELECT
        subaccountid,
        AVG(daily_cost)        AS avg_daily_cost,
        STDDEV_POP(daily_cost) AS stddev_daily_cost
    FROM inference_daily
    WHERE charge_date >= DATE_ADD('day', -30, CURRENT_DATE)
      AND charge_date <  DATE_ADD('day', -7,  CURRENT_DATE)
    GROUP BY subaccountid
),
recent AS (
    SELECT
        subaccountid,
        AVG(daily_cost) AS recent_avg_daily_cost,
        MAX(daily_cost) AS recent_max_daily_cost
    FROM inference_daily
    WHERE charge_date >= DATE_ADD('day', -3, CURRENT_DATE)
    GROUP BY subaccountid
)
SELECT
    r.subaccountid,
    ROUND(r.recent_avg_daily_cost, 4)                                          AS recent_avg_daily_cost,
    ROUND(b.avg_daily_cost, 4)                                                 AS baseline_avg_daily_cost,
    ROUND(b.stddev_daily_cost, 4)                                              AS baseline_stddev,
    ROUND(r.recent_avg_daily_cost / NULLIF(b.avg_daily_cost, 0), 4)           AS spike_ratio,
    ROUND(
        (r.recent_avg_daily_cost - b.avg_daily_cost)
        / NULLIF(b.stddev_daily_cost, 0),
        2
    )                                                                          AS sigma_distance
FROM recent r
JOIN baseline b ON r.subaccountid = b.subaccountid
WHERE r.recent_avg_daily_cost > b.avg_daily_cost + 2.0 * b.stddev_daily_cost
  AND b.avg_daily_cost > 1.0
ORDER BY spike_ratio DESC
LIMIT 25

Reading the results: spike_ratio is how many times above normal the current 3-day average is. sigma_distance is the statistical severity โ€” 2.0 is the threshold, 5.0+ is severe. A result with spike_ratio=18.0 and sigma_distance=820 means the account's inference spend is 18ร— its normal daily average, well clear of any reasonable interpretation of normal variability.

Why 2ฯƒ, not a fixed ratio: Compute workloads have predictable daily cost patterns. Inference spending is far more variable โ€” a team running a batch embedding job once a week has a high-variance baseline; a team with a steady RAG pipeline has a low-variance baseline. The same absolute spike ($50/day โ†’ $200/day) means very different things for each team. A fixed ratio produces false positives on legitimate high-variability workloads. 2ฯƒ self-calibrates: each account's threshold is derived from its own history.

The billing fields (FOCUS / CUR)

FOCUS / CUR FieldValue for Bedrock Inference
servicename Amazon Bedrock
x_usagetype {region}-{model}-input-tokens, {region}-{model}-output-tokens, {region}-{model}-cache-read-input-token-count, {region}-{model}-cache-write-input-token-count
x_operation InvokeModelInference (standard), InvokeModelStreamingInference (streaming). Available on accounts using January 2026+ CUR format. For older formats: filter on x_usagetype LIKE '%token%' as fallback.
chargecategory Usage
chargeclass Regular
Temporal profile during runaway Sustained elevation for 1โ€“3 days on inference rows. Compute rows stay flat โ€” token explosion โ‰  server scale-out. This decoupling is the causal fingerprint.

How to fix it at the application level

  1. Set MaxTokens on every InvokeModel call. This is the single highest-leverage control. Without it, a frontier model generates as many output tokens as it decides to โ€” there is no AWS-side enforcement. Set a per-call token budget appropriate to the task (2,000 for a summary, 500 for a classification). This is a code change, not an infrastructure change.
  2. Add an iteration budget to agent loops. Every agent loop must have a max_iterations or max_tool_calls guard. When the budget is exceeded, the loop forces a summary of progress and exits โ€” rather than retrying indefinitely with a growing context window. 20โ€“50 iterations is a reasonable starting range for most task types.
  3. Enable prompt caching for repeated context. Bedrock prompt caching (cache_control: ephemeral) reduces cost on cached input tokens by approximately 90%. For agents that re-inject a system prompt, knowledge base results, or conversation history on every iteration, this is a 40โ€“60% cost reduction with no quality trade-off. Implement it on any context block that repeats across iterations.
  4. Route simple tasks to lower-tier models. Opus-class models are appropriate for multi-step reasoning. Classification, extraction, and summarization tasks can run on Haiku or Nova Lite at 10โ€“20% of the cost. Implement model routing at the application layer โ€” not all agent steps need frontier intelligence.
  5. Manage context window growth explicitly. Naive context accumulation is the quadratic cost driver. Implement rolling summaries: after N turns, replace the raw history with a condensed summary. This caps input token cost at a linear rather than quadratic rate.

If you watch the bill: how to detect this at scale

For FinOps practitioners and cloud finance analysts responsible for multi-account AI spend visibility.

The challenge with runaway inference for FinOps is that the absolute dollar amounts can look small relative to total cloud spend โ€” until they don't. A team spending $50K/month on compute won't be alarmed by a Bedrock line item growing from $200 to $2,000/month over two sprints. By the time it's $20,000/month, the pattern is established and the team has no idea what changed.

Multi-account detection approach

The QB15 statistical query above is designed to run against a consolidated billing export. Run it monthly at minimum, or weekly if your organization has active AI application development. Any row returned represents an account where recent inference spending is statistically anomalous relative to that account's own baseline โ€” regardless of absolute dollar amount. A team spending $7/day that spikes to $132/day fires the query just as reliably as a team spending $700/day that spikes to $12,000/day.

In Cost Explorer: Group by = Service, filter to Amazon Bedrock. Set a 3-month date range. Look for accounts where Bedrock spend trend shows a non-linear slope โ€” gradual growth is usually legitimate; a step-change within a 1โ€“2 week window warrants investigation. Click into the account, Group by = Usage Type, to isolate whether the spike is in input or output tokens. Output token spikes point to verbose reasoning chains; input token spikes point to context accumulation.

Dollar impact math

The compounding economics make early detection disproportionately valuable:

The 2ฯƒ threshold self-calibrates per account, which matters at scale. An account running batch embedding jobs with high week-to-week variance needs a larger absolute spike to trigger than a steady-state RAG pipeline account. Running the same fixed-threshold alert across all accounts produces false positives on variable workloads and misses real spikes on steady ones. The statistical approach eliminates this problem.

Setting up the alert

AWS Budgets: Create a cost budget scoped to Service = Amazon Bedrock. Set threshold at 2ร— the previous month's Bedrock spend for each account. This catches month-over-month step changes even when the absolute dollar amount is below your general anomaly threshold. Configure email + SNS notification to the account owner team.

AWS Cost Anomaly Detection: Create a monitor on the Amazon Bedrock service with a spend threshold of $50/day above baseline. This uses AWS's own ML-based anomaly model and can detect a spike within 24โ€“48 hours of it appearing in CUR data โ€” the fastest available signal given daily billing granularity.

For organizations on AWS Organizations: run the QB15 query against the consolidated payer account's FOCUS export monthly. Any account returning a result gets a FinOps alert and a request for root cause description within 48 hours.

Escalation template

When you find the spike and trace it to a team:

We identified an anomalous Bedrock inference spend spike in [account name] over the past [N] days. Your account's 3-day average inference cost is $[recent_avg]/day โ€” [spike_ratio]ร— above your 23-day baseline of $[baseline_avg]/day.

This is consistent with an AI agent loop that failed to terminate cleanly, a missing MaxTokens limit, or a session running unattended with no spend cap. Action needed: review your Bedrock invocation logs for the period [start_date] to [end_date], identify the workload responsible, and confirm it has been stopped or bounded. If this was a legitimate workload, let us know so we can update the baseline threshold for your account.

Tracking going forward

The root metric to track is cost per output unit โ€” not total Bedrock spend. A team whose AI feature handles 10ร— more requests at the same token cost is doing well. A team whose token consumption grows faster than their feature usage is accumulating technical debt in the form of context bloat. Ask for tokens-per-request alongside cost reports; that's the signal that surfaces context accumulation before it becomes a billing anomaly.


If you own the outcome: the governance gap and how to close it

For Engineering Managers, VPs of Engineering, and CTOs who need root cause and a process fix.

What happened, in plain English

Your team deployed an AI agent โ€” a feature that calls a large language model in a loop to complete a multi-step task. That agent ran without a spending limit. When it encountered a problem (a tool call failure, an ambiguous state, a large input file), it retried. Each retry included the full conversation history from all prior steps. The context window grew. Token costs compounded. By the time daily billing data reflected the spike, the agent had been running for 24โ€“48 hours.

No AWS alert fired because the alert configuration covered total cloud spend โ€” not Bedrock inference specifically. The Bedrock line item was too small to breach a general anomaly threshold at first. By the time it was visible in aggregate spend, it had already run for multiple billing cycles.

This is not a mistake in the traditional sense. The engineers who built the agent were building a feature, not a cost center. Token pricing feels small at the per-call level. The compounding effect of context accumulation at 50-step agent loops is not intuitive until you have seen a runaway incident. But the result โ€” real dollars spent on a loop that delivered no value โ€” is the same regardless of intent.

The governance gap

Three things failed simultaneously:

  1. No per-session or per-user spend cap. AWS Bedrock has no native per-session token budget enforcement. Without application-level controls (MaxTokens per call, max iterations per session, daily per-user cost cap), there is no ceiling on what a single agent run can cost. This is different from compute, where instance types set an hourly ceiling. AI inference has no equivalent guardrail unless you build one.
  2. No Bedrock-specific anomaly alert. A team running $50K/month in EC2 will have general budget alerts โ€” but those thresholds were calibrated for compute spend. A $500/day Bedrock spike is unremarkable against a $1,600/day EC2 baseline. The alert needed for AI spend is a Bedrock-specific budget at the account level, calibrated to the expected inference workload volume. That alert almost certainly doesn't exist if your team added AI features incrementally without revisiting the alerting configuration.
  3. No engineering policy for AI agent iteration budgets. There is a wide body of engineering practice around memory management, connection pool limits, and retry budgets for traditional software. None of that culture has translated to AI agent development. Teams building agent loops for the first time have no framework for "what's a reasonable number of tool calls per task" โ€” so they don't set a limit, and the agent loops until it succeeds or the user closes the tab.

The policy fix

Three changes close the gap:

  1. Mandate MaxTokens and iteration budgets at the engineering level. Any code that calls Bedrock's InvokeModel must pass a MaxTokens parameter. Any agent loop must have a max_iterations guard. These should be code review requirements, not suggestions. Add a linting rule or PR template checklist item that flags Bedrock invocations without token limits.
  2. Create a Bedrock-specific budget alert per account. Threshold: 2ร— last month's Bedrock spend, or $100/day absolute, whichever is lower for dev/test accounts. Recipient: the team owning that account's AI workloads. This takes 10 minutes to configure and will catch every runaway inference incident within 24โ€“48 hours of it starting.
  3. Add cost-per-request to AI feature observability. Every AI-backed feature should track tokens consumed per request, and cost per request, alongside latency and error rate. Make it a first-class metric in your dashboards. When cost-per-request trends upward while request volume is flat, that is the early warning signal for context window creep or model routing drift โ€” before it reaches the billing anomaly threshold.

The decision

For agents already running: audit whether MaxTokens is set on every invocation and whether iteration budgets are enforced. This is a code audit, not an infrastructure change. It can be completed in a sprint without disrupting any feature.

For the alerting gap: set up the Bedrock budget alert today. The cost is zero. The return is catching the next incident within 48 hours instead of at month-end billing review.

For cost-per-request observability: this is a one-sprint engineering investment. The alternative is discovering cost problems through billing anomalies โ€” always reactively, always after the spend has already happened.

What good looks like

Every Bedrock InvokeModel call in production has an explicit MaxTokens limit. Every agent loop has a documented max-iterations budget, enforced in code. Cost-per-request and tokens-per-request are tracked in your standard observability stack alongside latency and error rate. A Bedrock-specific cost anomaly alert fires within 24 hours of any account exceeding 2ร— its inference baseline. When a developer finishes an autonomous agent session, they don't wonder what it cost โ€” the per-session spend estimate is in the development environment logs before they close the tab.


Fix checklist

  1. Set MaxTokens on every Bedrock InvokeModel call. Per-call token limit is the only application-level ceiling on inference cost. Require it in code review โ€” flag any InvokeModel call without it.
  2. Add iteration budgets to all agent loops. Maximum tool calls per session, with a forced summary-and-stop when the budget is exceeded. 20โ€“50 iterations is a reasonable starting range for most task types.
  3. Enable Bedrock prompt caching. For agents that re-inject system prompts or knowledge base results, cache_control: ephemeral cuts cached input token cost by ~90%. Implement on any context block that repeats across iterations.
  4. Implement model routing. Route classification, extraction, and summarization steps to Haiku or Nova Lite. Reserve Sonnet/Opus for reasoning-intensive steps. A single tier decision point at the application layer cuts token cost by 50โ€“80% on mixed-task agents.
  5. Create a Bedrock budget alert per account. Threshold: 2ร— last month's Bedrock spend or $100/day, whichever is lower for non-production accounts. Configure SNS + email to the owning team. 10 minutes to set up; catches every runaway incident within 24โ€“48 hours.
  6. Track cost-per-request as a first-class metric. Add tokens consumed and estimated cost per request to your AI feature observability. Trending upward without request volume growth = context accumulation or model routing drift โ€” investigate before it reaches the billing anomaly threshold.
  7. Run the QB15 detection query monthly. Against your consolidated FOCUS billing export. Any account returning a row gets a FinOps review within 48 hours. This is the systematic backstop for all the controls above โ€” it catches incidents that slip through.

Find out what else is hiding in your AI cloud bill

Runaway inference is one of several AI spend patterns that standard FinOps tools don't detect โ€” either because they cross service boundaries, accumulate below general anomaly thresholds, or require statistical baselines rather than fixed rules. The DropInFinOps free assessment takes 2 minutes and shows you which patterns your current billing setup is positioned to catch โ€” and which ones are accumulating undetected.

Take the free assessment โ†’