Runaway AI Inference: How a Prompt Loop at 2 AM Becomes a $4,200 Weekend Bill

Your Bedrock inference was $7/day last week. This morning it's $132/day. No AWS alert fired. The agent that caused it stopped running six hours ago.

When an AI agent loop fails to terminate — or accumulates context faster than it clears it — token consumption doesn't grow linearly. It grows quadratically. A 20-step agent that re-injects its full conversation history on every iteration sends more input tokens on step 20 than it sent on steps 1 through 10 combined. By the time daily billing data reflects the spike, the damage is done.

This isn't a billing disguise problem like orphaned Knowledge Base OCU charges. The costs appear exactly where you'd expect — under Amazon Bedrock, inference token usage. The problem is that no standard FinOps alert watches for a statistical spike in that specific meter. AWS Cost Anomaly Detection operates at the service level. A team with $50K/month in EC2 won't have a Bedrock-specific anomaly threshold configured. The bill arrives. Nobody connected it to the weekend agent run.

The detection logic is straightforward: compare the 3-day average inference spend against the 23-day baseline (days −30 to −7) and flag accounts where the recent average exceeds the baseline by more than 2 standard deviations. This self-calibrating threshold means high-variance accounts require a larger absolute spike to trigger, while low-variance accounts fire on smaller deviations. At 2σ, roughly 97.7% of normal days don't produce a false positive. When this query runs against validated billing data, it returns spike_ratio=18.04× and σ=819.57. That is the shape of a real runaway inference event.

Why Token Costs Compound Faster Than You Expect

Real-world agents use significantly more tokens per task than an equivalent chat session — often 10–20× more, and higher in failure scenarios. The mechanism is context accumulation: each iteration of an agent loop sends the full conversation history as input tokens. A 20-step agent with an average of 10,000 tokens per tool call response accumulates approximately 210,000 input tokens by step 20 — versus 20,000 for 20 independent chat calls. That is a 10.5× amplification factor from architecture alone, before accounting for retry behavior.

Retry amplification compounds the problem. When a tool call fails and the agent retries with a 20% per-step failure rate, token consumption approximately doubles relative to a clean run (TianPan.co, 2026). Each failed tool call retries with the full accumulated context — the retry itself adds tokens, and the next step starts from a larger context window than a clean run would have produced. Combined with context accumulation, real retry-heavy workloads can reach 20× or more relative to a simple chat session of equivalent scope.

AWS Bedrock billing appears in Cost and Usage Reports with up to a 24-hour delay. By the time the spike is visible in billing data, a loop running since midnight has already run for 12+ hours. Daily CUR granularity is the correct tool for catching these events — not because it is fast, but because it is the only billing signal available.

Real Incidents

These incidents are sourced from public post-mortems, practitioner analyses, community threads, and mainstream reporting on enterprise AI spend. None are fabricated.

Uber exhausts its entire 2026 AI coding budget by April (TechCrunch, 2026-06-02): Uber's CTO confirmed the company burned through its full-year 2026 AI coding budget in four months, driven by enterprise-wide Claude Code and Cursor adoption — monthly API costs ran $500–$2,000 per engineer with no per-session cap. Uber's response was a blunt one: a flat $1,500/month cap per employee, per agentic coding tool.

~$500 million in one month, no usage limits configured (reported by Axios, 2026-05-28): An unnamed enterprise ran up roughly $500 million in Claude spend in a single month after never configuring per-user spend limits on employee licenses. The controls existed — they were simply never turned on.

$40,000 in tokens, one engineer, one month (TechCrunch, quoting Faros AI CEO Vitaly Gordon, 2026-06-05): A CTO told Gordon that a single engineer had run up $40,000 in token spend in a month — and that the CTO genuinely didn't know whether to shut it down or tell the rest of the team to do the same. That ambiguity is the tell: without a cost-per-output baseline, heavy spend and heavy productivity look identical on the bill.

GitHub Copilot moves to token-metered billing, June 1, 2026 (GitHub; session-cost estimate via ChatForest, 2026-06-02): Copilot's shift from flat per-seat pricing to token-metered credits means a single agentic session can burn $30–$40 — three to four times a Pro subscriber's entire $10/month credit allowance, gone before lunch. A fixed subscription line item became a variable one overnight.

Cursor renewal comes back 4–5× higher (TechCrunch, 2026-06-05): A Priceline employee described a routine Cursor contract renewal landing at 4–5× the prior price. Same pattern as a Bedrock inference spike — usage grew faster than anyone was tracking it, and the bill only became visible at the renewal cliff, not incrementally.

Per-developer token consumption up ~18.6× in nine months (Jellyfish research, via TechCrunch, 2026-06-05): Jellyfish's head of research found agentic features — not simple chat usage — drove per-developer token consumption up roughly 18.6× over three quarters. The heaviest users were about twice as productive as light users, but consumed roughly 10× the tokens to get there. Raw token spend, without a cost-per-output baseline, is a poor productivity proxy.

$15 in under 10 minutes (r/AI_Agents, 2026): A developer agent loop stuck retrying with no spend cap. No manual intervention — the loop terminated when it hit an API rate limit. The $15 would have been $150 at 10× the retry budget. The developer only noticed when the next API call failed.

$4,200 over a long weekend (LeanOps, 2026): A developer left an autonomous refactoring run (Cursor/Claude Code) running unattended across a three-day weekend. No per-session spend cap, no token limit on context window management, no alert on Bedrock-specific spend. Discovered Monday morning. The baseline had been ~$20/day.

~$26,100/month — bug-triage agent, 35-engineer SaaS team (LeanOps, 2026): An automated bug-triage agent routing all tickets through an Opus-class model consumed 30% of the team's $87,000/month AI bill. No prompt caching, no model routing to cheaper tiers for simpler tasks, no per-ticket token cap. The team didn't notice because the absolute Bedrock spend growth was gradual — until it was dominant.

$16,000–$50,000 in 5 hours (r/Claude via relayplane.com, 2026): A single user session accumulated 1.67 billion tokens in 5 hours from context accumulation in a loop. Cost estimate ranges from $16K to $50K depending on model tier mix. Discovered via API dashboard, not billing alerts.

$15,000 total over 8 months (r/ClaudeAI, 2026): A single developer's daily Claude Code usage accumulated 10 billion tokens over 8 months with no usage tracking or per-day budget. No organization-level spend caps were in place. Monthly line items appeared unremarkable individually; the cumulative total emerged only when the developer reconciled annual cloud spend.

Retry amplification analysis (TianPan.co, 2026-04-16): A structured analysis of LLM agent retry behavior found that a 20% per-step failure rate approximately doubles the token bill per task completed. Each failed tool call retries with the full accumulated context — the retry itself adds tokens, and the next step starts from a larger context window than a clean run would have produced.

If you built it: what to look for and how to fix it

For platform engineers, ML engineers, and AI application developers responsible for inference-heavy workloads.

The bill doesn't tell you which agent caused the spike. Bedrock charges at the model level, not the session level. Here is how to trace a spike back to its source and prevent the next one.

Step 1 — Confirm the spike is in inference, not storage

In AWS Cost Explorer: set Service = Amazon Bedrock, Group by = Usage Type. Look for rows with InvokeModelInference or InvokeModelStreamingInference in the usage type — or token-type suffixes like -input-tokens and -output-tokens. If the spike is in these rows and your compute (EC2, Fargate) spend is flat, that's the runaway inference fingerprint: token explosion without server scale-out. The two meters decouple — more tokens does not mean more servers.

If the spike is disproportionately in output tokens, look for agent loops generating verbose reasoning chains. Output tokens cost 3–5× input tokens on frontier models — an agent producing 2,000-token step summaries on each iteration multiplies cost faster than one consuming long context.

Step 2 — Identify the workload

Switch Cost Explorer to Group by = Resource (requires resource-level cost allocation). Look for the inference profile ARN pattern:

arn:aws:bedrock:{region}:{account}:inference-profile/

Cross-reference against your deployed AI applications. If you have multiple services calling Bedrock, the inference profile ARN or subAccountId narrows it to the responsible application team.

Via CLI — query recent high-token invocations (requires model invocation logging enabled):

# List in-progress invocation jobs
aws bedrock list-model-invocation-jobs --status-equals InProgress

# If logging to CloudWatch is configured, filter for high-token events:
aws logs filter-log-events \
  --log-group-name /aws/bedrock/modelinvocations \
  --start-time $(date -d '3 days ago' +%s000) \
  --filter-pattern "{ $.totalTokens > 100000 }"

The detection SQL (FOCUS-native, QB15)

If you have AWS FOCUS-formatted billing exports (AWS Data Exports, FOCUS 1.2), this query identifies accounts where Bedrock inference spending has spiked more than 2 standard deviations above the 23-day rolling baseline:

-- QB15: Runaway AI Inference — statistical spike detection
-- Fires when 3-day avg inference spend > baseline_mean + 2 × baseline_σ
-- Self-calibrating: high-variance accounts need a larger absolute spike to trigger
WITH inference_daily AS (
    SELECT
        subaccountid,
        CAST(chargeperiodstart AS DATE) AS charge_date,
        SUM(billedcost)                 AS daily_cost
    FROM focus_billing
    WHERE servicename LIKE '%Bedrock%'
      AND x_usagetype LIKE '%InvokeModel%'
      AND chargecategory = 'Usage'
      AND chargeclass    = 'Regular'
      AND CAST(chargeperiodstart AS DATE) >= DATE_ADD('day', -30, CURRENT_DATE)
    GROUP BY 1, 2
),
baseline AS (
    SELECT
        subaccountid,
        AVG(daily_cost)        AS avg_daily_cost,
        STDDEV_POP(daily_cost) AS stddev_daily_cost
    FROM inference_daily
    WHERE charge_date >= DATE_ADD('day', -30, CURRENT_DATE)
      AND charge_date <  DATE_ADD('day', -7,  CURRENT_DATE)
    GROUP BY subaccountid
),
recent AS (
    SELECT
        subaccountid,
        AVG(daily_cost) AS recent_avg_daily_cost,
        MAX(daily_cost) AS recent_max_daily_cost
    FROM inference_daily
    WHERE charge_date >= DATE_ADD('day', -3, CURRENT_DATE)
    GROUP BY subaccountid
)
SELECT
    r.subaccountid,
    ROUND(r.recent_avg_daily_cost, 4)                                          AS recent_avg_daily_cost,
    ROUND(b.avg_daily_cost, 4)                                                 AS baseline_avg_daily_cost,
    ROUND(b.stddev_daily_cost, 4)                                              AS baseline_stddev,
    ROUND(r.recent_avg_daily_cost / NULLIF(b.avg_daily_cost, 0), 4)           AS spike_ratio,
    ROUND(
        (r.recent_avg_daily_cost - b.avg_daily_cost)
        / NULLIF(b.stddev_daily_cost, 0),
        2
    )                                                                          AS sigma_distance
FROM recent r
JOIN baseline b ON r.subaccountid = b.subaccountid
WHERE r.recent_avg_daily_cost > b.avg_daily_cost + 2.0 * b.stddev_daily_cost
  AND b.avg_daily_cost > 1.0
ORDER BY spike_ratio DESC
LIMIT 25

Reading the results: spike_ratio is how many times above normal the current 3-day average is. sigma_distance is the statistical severity — 2.0 is the threshold, 5.0+ is severe. A result with spike_ratio=18.0 and sigma_distance=820 means the account's inference spend is 18× its normal daily average, well clear of any reasonable interpretation of normal variability.

Why 2σ, not a fixed ratio: Compute workloads have predictable daily cost patterns. Inference spending is far more variable — a team running a batch embedding job once a week has a high-variance baseline; a team with a steady RAG pipeline has a low-variance baseline. The same absolute spike ($50/day → $200/day) means very different things for each team. A fixed ratio produces false positives on legitimate high-variability workloads. 2σ self-calibrates: each account's threshold is derived from its own history.

The billing fields (FOCUS / CUR)

FOCUS / CUR Field	Value for Bedrock Inference
`servicename`	`Amazon Bedrock`
`x_usagetype`	`{region}-{model}-input-tokens`, `{region}-{model}-output-tokens`, `{region}-{model}-cache-read-input-token-count`, `{region}-{model}-cache-write-input-token-count`
`x_operation`	`InvokeModelInference` (standard), `InvokeModelStreamingInference` (streaming). Available on accounts using January 2026+ CUR format. For older formats: filter on `x_usagetype LIKE '%token%'` as fallback.
`chargecategory`	`Usage`
`chargeclass`	`Regular`
Temporal profile during runaway	Sustained elevation for 1–3 days on inference rows. Compute rows stay flat — token explosion ≠ server scale-out. This decoupling is the causal fingerprint.

How to fix it at the application level

Set MaxTokens on every InvokeModel call. This is the single highest-leverage control. Without it, a frontier model generates as many output tokens as it decides to — there is no AWS-side enforcement. Set a per-call token budget appropriate to the task (2,000 for a summary, 500 for a classification). This is a code change, not an infrastructure change.
Add an iteration budget to agent loops. Every agent loop must have a max_iterations or max_tool_calls guard. When the budget is exceeded, the loop forces a summary of progress and exits — rather than retrying indefinitely with a growing context window. 20–50 iterations is a reasonable starting range for most task types.
Enable prompt caching for repeated context. Bedrock prompt caching (cache_control: ephemeral) reduces cost on cached input tokens by approximately 90%. For agents that re-inject a system prompt, knowledge base results, or conversation history on every iteration, this is a 40–60% cost reduction with no quality trade-off. Implement it on any context block that repeats across iterations.
Route simple tasks to lower-tier models. Opus-class models are appropriate for multi-step reasoning. Classification, extraction, and summarization tasks can run on Haiku or Nova Lite at 10–20% of the cost. Implement model routing at the application layer — not all agent steps need frontier intelligence.
Manage context window growth explicitly. Naive context accumulation is the quadratic cost driver. Implement rolling summaries: after N turns, replace the raw history with a condensed summary. This caps input token cost at a linear rather than quadratic rate.

If you watch the bill: how to detect this at scale

For FinOps practitioners and cloud finance analysts responsible for multi-account AI spend visibility.

The challenge with runaway inference for FinOps is that the absolute dollar amounts can look small relative to total cloud spend — until they don't. A team spending $50K/month on compute won't be alarmed by a Bedrock line item growing from $200 to $2,000/month over two sprints. By the time it's $20,000/month, the pattern is established and the team has no idea what changed.

Multi-account detection approach

The QB15 statistical query above is designed to run against a consolidated billing export. Run it monthly at minimum, or weekly if your organization has active AI application development. Any row returned represents an account where recent inference spending is statistically anomalous relative to that account's own baseline — regardless of absolute dollar amount. A team spending $7/day that spikes to $132/day fires the query just as reliably as a team spending $700/day that spikes to $12,000/day.

In Cost Explorer: Group by = Service, filter to Amazon Bedrock. Set a 3-month date range. Look for accounts where Bedrock spend trend shows a non-linear slope — gradual growth is usually legitimate; a step-change within a 1–2 week window warrants investigation. Click into the account, Group by = Usage Type, to isolate whether the spike is in input or output tokens. Output token spikes point to verbose reasoning chains; input token spikes point to context accumulation.

Dollar impact math

The compounding economics make early detection disproportionately valuable:

A $7/day baseline that spikes to $132/day and runs for 3 days = $375 incident before discovery
The same event, undiscovered for 2 weeks: $2,600
An unmanaged bug-triage agent at 35 engineers: $26,100/month in preventable spend
At 10 accounts with unmanaged AI agents, running the statistical query catches each account independently — regardless of whether you know which team owns it

The 2σ threshold self-calibrates per account, which matters at scale. An account running batch embedding jobs with high week-to-week variance needs a larger absolute spike to trigger than a steady-state RAG pipeline account. Running the same fixed-threshold alert across all accounts produces false positives on variable workloads and misses real spikes on steady ones. The statistical approach eliminates this problem.

Setting up the alert

AWS Budgets: Create a cost budget scoped to Service = Amazon Bedrock. Set threshold at 2× the previous month's Bedrock spend for each account. This catches month-over-month step changes even when the absolute dollar amount is below your general anomaly threshold. Configure email + SNS notification to the account owner team.

AWS Cost Anomaly Detection: Create a monitor on the Amazon Bedrock service with a spend threshold of $50/day above baseline. This uses AWS's own ML-based anomaly model and can detect a spike within 24–48 hours of it appearing in CUR data — the fastest available signal given daily billing granularity.

For organizations on AWS Organizations: run the QB15 query against the consolidated payer account's FOCUS export monthly. Any account returning a result gets a FinOps alert and a request for root cause description within 48 hours.

Escalation template

When you find the spike and trace it to a team:

We identified an anomalous Bedrock inference spend spike in [account name] over the past [N] days. Your account's 3-day average inference cost is $[recent_avg]/day — [spike_ratio]× above your 23-day baseline of $[baseline_avg]/day.

This is consistent with an AI agent loop that failed to terminate cleanly, a missing MaxTokens limit, or a session running unattended with no spend cap. Action needed: review your Bedrock invocation logs for the period [start_date] to [end_date], identify the workload responsible, and confirm it has been stopped or bounded. If this was a legitimate workload, let us know so we can update the baseline threshold for your account.

Tracking going forward

The root metric to track is cost per output unit — not total Bedrock spend. A team whose AI feature handles 10× more requests at the same token cost is doing well. A team whose token consumption grows faster than their feature usage is accumulating technical debt in the form of context bloat. Ask for tokens-per-request alongside cost reports; that's the signal that surfaces context accumulation before it becomes a billing anomaly.

If you own the outcome: the governance gap and how to close it

For Engineering Managers, VPs of Engineering, and CTOs who need root cause and a process fix.

What happened, in plain English

Your team deployed an AI agent — a feature that calls a large language model in a loop to complete a multi-step task. That agent ran without a spending limit. When it encountered a problem (a tool call failure, an ambiguous state, a large input file), it retried. Each retry included the full conversation history from all prior steps. The context window grew. Token costs compounded. By the time daily billing data reflected the spike, the agent had been running for 24–48 hours.

No AWS alert fired because the alert configuration covered total cloud spend — not Bedrock inference specifically. The Bedrock line item was too small to breach a general anomaly threshold at first. By the time it was visible in aggregate spend, it had already run for multiple billing cycles.

This is not a mistake in the traditional sense. The engineers who built the agent were building a feature, not a cost center. Token pricing feels small at the per-call level. The compounding effect of context accumulation at 50-step agent loops is not intuitive until you have seen a runaway incident. But the result — real dollars spent on a loop that delivered no value — is the same regardless of intent.

The governance gap

This is no longer a problem confined to a single developer's weekend. In 2026, entire companies have hit the same wall at enterprise scale: Uber exhausted its full-year AI coding budget by April; one unnamed enterprise ran up roughly $500 million in a single month on Claude with no per-user limits configured; per-developer token consumption is up ~18.6× in nine months industry-wide. The governance gap that turns a runaway loop into a five-figure weekend bill is the same gap that turns it into an eight- or nine-figure one — the difference is only which account it happens in and how long it runs before someone looks.

Three things failed simultaneously:

No per-session or per-user spend cap. AWS Bedrock has no native per-session token budget enforcement. Without application-level controls (MaxTokens per call, max iterations per session, daily per-user cost cap), there is no ceiling on what a single agent run can cost. This is different from compute, where instance types set an hourly ceiling. AI inference has no equivalent guardrail unless you build one.
No Bedrock-specific anomaly alert. A team running $50K/month in EC2 will have general budget alerts — but those thresholds were calibrated for compute spend. A $500/day Bedrock spike is unremarkable against a $1,600/day EC2 baseline. The alert needed for AI spend is a Bedrock-specific budget at the account level, calibrated to the expected inference workload volume. That alert almost certainly doesn't exist if your team added AI features incrementally without revisiting the alerting configuration.
No engineering policy for AI agent iteration budgets. There is a wide body of engineering practice around memory management, connection pool limits, and retry budgets for traditional software. None of that culture has translated to AI agent development. Teams building agent loops for the first time have no framework for "what's a reasonable number of tool calls per task" — so they don't set a limit, and the agent loops until it succeeds or the user closes the tab.

The policy fix

Three changes close the gap:

Mandate MaxTokens and iteration budgets at the engineering level. Any code that calls Bedrock's InvokeModel must pass a MaxTokens parameter. Any agent loop must have a max_iterations guard. These should be code review requirements, not suggestions. Add a linting rule or PR template checklist item that flags Bedrock invocations without token limits.
Create a Bedrock-specific budget alert per account. Threshold: 2× last month's Bedrock spend, or $100/day absolute, whichever is lower for dev/test accounts. Recipient: the team owning that account's AI workloads. This takes 10 minutes to configure and will catch every runaway inference incident within 24–48 hours of it starting.
Add cost-per-request to AI feature observability. Every AI-backed feature should track tokens consumed per request, and cost per request, alongside latency and error rate. Make it a first-class metric in your dashboards. When cost-per-request trends upward while request volume is flat, that is the early warning signal for context window creep or model routing drift — before it reaches the billing anomaly threshold.

The decision

For agents already running: audit whether MaxTokens is set on every invocation and whether iteration budgets are enforced. This is a code audit, not an infrastructure change. It can be completed in a sprint without disrupting any feature.

For the alerting gap: set up the Bedrock budget alert today. The cost is zero. The return is catching the next incident within 48 hours instead of at month-end billing review.

For cost-per-request observability: this is a one-sprint engineering investment. The alternative is discovering cost problems through billing anomalies — always reactively, always after the spend has already happened.

What good looks like

Every Bedrock InvokeModel call in production has an explicit MaxTokens limit. Every agent loop has a documented max-iterations budget, enforced in code. Cost-per-request and tokens-per-request are tracked in your standard observability stack alongside latency and error rate. A Bedrock-specific cost anomaly alert fires within 24 hours of any account exceeding 2× its inference baseline. When a developer finishes an autonomous agent session, they don't wonder what it cost — the per-session spend estimate is in the development environment logs before they close the tab.

Fix checklist

Set MaxTokens on every Bedrock InvokeModel call. Per-call token limit is the only application-level ceiling on inference cost. Require it in code review — flag any InvokeModel call without it.
Add iteration budgets to all agent loops. Maximum tool calls per session, with a forced summary-and-stop when the budget is exceeded. 20–50 iterations is a reasonable starting range for most task types.
Enable Bedrock prompt caching. For agents that re-inject system prompts or knowledge base results, cache_control: ephemeral cuts cached input token cost by ~90%. Implement on any context block that repeats across iterations.
Implement model routing. Route classification, extraction, and summarization steps to Haiku or Nova Lite. Reserve Sonnet/Opus for reasoning-intensive steps. A single tier decision point at the application layer cuts token cost by 50–80% on mixed-task agents.
Create a Bedrock budget alert per account. Threshold: 2× last month's Bedrock spend or $100/day, whichever is lower for non-production accounts. Configure SNS + email to the owning team. 10 minutes to set up; catches every runaway incident within 24–48 hours.
Track cost-per-request as a first-class metric. Add tokens consumed and estimated cost per request to your AI feature observability. Trending upward without request volume growth = context accumulation or model routing drift — investigate before it reaches the billing anomaly threshold.
Run the QB15 detection query monthly. Against your consolidated FOCUS billing export. Any account returning a row gets a FinOps review within 48 hours. This is the systematic backstop for all the controls above — it catches incidents that slip through.

Sources

TechCrunch, "Uber caps employee AI spending after blowing through budget in four months", 2026-06-02
TechCrunch, "The token bill comes due: Inside the industry scramble to manage AI's runaway costs", 2026-06-05 (Faros AI / Vitaly Gordon $40K engineer anecdote, Jellyfish 18.6× consumption stat, Priceline/Cursor renewal figure)
Futurism, "Unfortunate Company Accidentally Blows Half a Billion Dollars on Claude in One Month", reporting on Axios's original 2026-05-28 story
GitHub Blog, "GitHub Copilot is moving to usage-based billing", effective 2026-06-01
ChatForest, "GitHub Copilot's Token Billing Is Live", 2026-06-02 ($30–$40 per-session cost estimate)
TianPan.co, retry amplification analysis, 2026-04-16

Find out what else is hiding in your AI cloud bill

Runaway inference is one of several AI spend patterns that standard FinOps tools don't detect — either because they cross service boundaries, accumulate below general anomaly thresholds, or require statistical baselines rather than fixed rules. The DropInFinOps free assessment takes 2 minutes and shows you which patterns your current billing setup is positioned to catch — and which ones are accumulating undetected.

Take the free assessment →

Runaway AI Inference: How a Prompt Loop at 2 AM Becomes a $4,200 Weekend Bill

Runaway AI Inference: How a Prompt Loop at 2 AM Becomes a $4,200 Weekend Bill

Why Token Costs Compound Faster Than You Expect

Real Incidents

If you built it: what to look for and how to fix it

Step 1 — Confirm the spike is in inference, not storage

Step 2 — Identify the workload

The detection SQL (FOCUS-native, QB15)

The billing fields (FOCUS / CUR)

How to fix it at the application level

If you watch the bill: how to detect this at scale

Multi-account detection approach

Dollar impact math

Setting up the alert

Escalation template

Tracking going forward

If you own the outcome: the governance gap and how to close it

What happened, in plain English

The governance gap

The policy fix

The decision

What good looks like

Fix checklist

Sources

More from our guides

What is FinOps?

Common AWS Cost Mistakes

Practical AWS Lambda Automations

Runaway AI Inference: How a Prompt Loop at 2 AM Becomes a $4,200 Weekend Bill

Runaway AI Inference: How a Prompt Loop at 2 AM Becomes a $4,200 Weekend Bill

Why Token Costs Compound Faster Than You Expect

Real Incidents

If you built it: what to look for and how to fix it

Step 1 — Confirm the spike is in inference, not storage

Step 2 — Identify the workload

The detection SQL (FOCUS-native, QB15)

The billing fields (FOCUS / CUR)

How to fix it at the application level

If you watch the bill: how to detect this at scale

Multi-account detection approach

Dollar impact math

Setting up the alert

Escalation template

Tracking going forward

If you own the outcome: the governance gap and how to close it

What happened, in plain English

The governance gap

The policy fix

The decision

What good looks like

Fix checklist

Sources

More from our guides

What is FinOps?

Common AWS Cost Mistakes

Practical AWS Lambda Automations

Privacy & Cookie Notice

1. What We Collect

2. How We Use Your Data

3. Where Your Data Lives

4. Legal Basis for Processing

5. Data Retention

6. Cookies & Tracking

7. Your Rights (GDPR)

8. Security

9. Updates