When Cloud Costs Won't Come Back Down: Detecting Persistent Runaway Patterns

Your ECS cluster ran hot Tuesday night. It is still running hot on Friday. No AWS alert fired on Wednesday, Thursday, or Friday, because spike detectors only fire on new spikes. Persistence requires its own detector.

A cost spike is one failure mode. The spike that does not come back down is a different one, and it requires different detection. By the time a runaway event has persisted for four days, the spike detection window has closed. The resource has normalized at its new, wrong baseline. Standard anomaly alerts have stopped firing. The cost is accruing silently.

Two distinct onset profiles reach the same outcome. The first is a step-function: an ECS group scales to 10× normal in four minutes and stays there. A spike detector fires on day 3. If nobody fixes it, a persistence detector fires on day 4 and beyond: the "it's still broken" signal. The second profile is a gradual ramp: RDS storage autoscales up month by month, each increment small enough to dismiss as noise, accumulating for four months before anyone investigates. No spike detector ever fires. The persistence detector is the only mechanism that catches this class.

The FOCUS billing signature of both profiles looks identical by day 5: the same ResourceId, the same RegionId, elevated BilledCost for 4 or more of the last 7 days. The detection logic does not care how the cost reached its current level, only that it has been sustained.

Spike Detector vs. Runaway Detector

Detector | Fires when | Time horizon | What it means
Spike (QB2) | 3-day average > 2× 30-day baseline | Last 3 vs last 30 days | Abrupt recent jump; may be transient or ongoing
Runaway (QB3) | 4+ of last 7 days > 1.5× baseline | Last 7 days vs days −30 to −7 | Sustained; elevated for days, has not self-corrected

When both detectors fire on the same resource: "This had an abrupt cost jump and the root cause has not been addressed." When QB3 fires alone, without QB2, the signal is different: "This resource has been quietly running above baseline for days or months. There was no spike to alert on." That is the gradual-ramp class, and it is the class that accumulates the most undetected waste.
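The two rules in the table can be sketched for a single resource as follows. This is a minimal illustration, not the production detectors: it assumes a list of 30 daily costs (oldest first, ending yesterday), assumes the QB2 30-day baseline includes the most recent 3 days, and the function names are hypothetical.

```python
from statistics import mean

def spike_fires(daily_costs):
    """QB2-style rule: mean of the last 3 days > 2x the mean of the last 30 days.
    Assumption: the 30-day window includes the 3 most recent days."""
    last_30 = daily_costs[-30:]
    return mean(last_30[-3:]) > 2 * mean(last_30)

def runaway_fires(daily_costs):
    """QB3-style rule: 4+ of the last 7 days > 1.5x the mean of the 23 days before them."""
    last_30 = daily_costs[-30:]
    baseline = mean(last_30[:-7])  # days -30 to -7, excluding the recent week
    high_days = sum(1 for cost in last_30[-7:] if cost > 1.5 * baseline)
    return high_days >= 4

def classify(daily_costs):
    """Combine both detectors into the four cases described above."""
    spike, runaway = spike_fires(daily_costs), runaway_fires(daily_costs)
    if spike and runaway:
        return "spike and runaway"
    if runaway:
        return "runaway only"
    if spike:
        return "spike only"
    return "neither"

sustained_step = [100] * 25 + [1000] * 5    # jumped 5 days ago and stayed up
quietly_elevated = [100] * 23 + [170] * 7   # a week at 1.7x, never a sharp jump
transient_spike = [100] * 28 + [1000] * 2   # jumped 2 days ago

print(classify(sustained_step))
print(classify(quietly_elevated))
print(classify(transient_spike))
```

The quietly elevated series is the case of interest: at 1.7× baseline it never clears the spike rule, yet it exceeds the persistence rule on all seven recent days.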

Real Incidents

These incidents are sourced from public post-mortems, practitioner retrospectives, and engineering case studies. None are fabricated.

[Step-function] $47,000 ECS overnight (Medium, 2025): An ECS autoscaling group hit a traffic spike with a misconfigured maximum capacity setting. The group scaled to hundreds of instances in 4 minutes, from 11:42 PM to 11:46 PM. The service team discovered it the next morning. The autoscaling policy had no maximum that reflected what the infrastructure could actually absorb.

[Step-function] $72,000 Google Cloud Run overnight (The Register): A Cloud Run deployment with no spend cap reached $72,000 overnight when an unintended test load caused the service to autoscale without bound. No billing alert was configured. The Cloud Run billing dashboard carries a 24+ hour update lag, so the team discovered the damage on Monday morning. The event was over before the billing data confirmed it had happened.

[Step-function] Lambda recursion: $10,000+ per hour (FinOps Foundation): A Lambda function was modified to call itself under certain conditions without a circuit breaker. The recursion ran for hours before hitting account-level concurrency limits. For step-function events this severe, the spike detector fires on day 3; QB3 then confirms on day 4+ if the root cause was not addressed.

[Step-function] Lambda 6× overnight: $42,000 (DEV Community): An unchecked Lambda feedback loop ran overnight, producing a 6× spike against the prior-day baseline. It was not discovered until a cost management review, hours after the loop had completed. The billing delay meant that by the time the cost was visible, the event was over.

[Gradual-ramp] RDS storage autoscaling: $12,000 over 4 months (Medium Engineering Playbook): RDS storage autoscaling scales up in response to real growth, correctly. The problem is that storage autoscaling only scales up, never down. Month 1: an $11.50 cost increase, dismissed as noise. Month 2: $23 more, attributed to data growth. Month 3: $150 more, finally investigated. By then the monthly bill had climbed from $450/month to $1,190/month, about 2.6× the original baseline and well above QB3's 1.5× threshold. No spike detector ever fired, because the cost arrived gradually enough that no single week triggered a spike alert. This is the pure gradual-ramp case: QB3 is the only detector that catches it.


If you built it: what to look for and how to fix it

For platform engineers, DevOps engineers, and infrastructure owners responsible for the resources that ran hot.

The FOCUS billing data tells you a resource has been elevated, but not which configuration decision caused it. Here is how to trace the pattern to its root cause by resource type, and what to change to prevent the next one.

The billing fields (FOCUS)

FOCUS Field | Normal baseline | During runaway
BilledCost | $X/day, stable | $1.5X to $10X/day, sustained for 4+ of 7 days
ConsumedQuantity | Q units/day, stable | 1.5Q to 10Q units/day; cost and quantity rise together
ResourceId | Existing, known resource | Same resource, not a new ID. New IDs in new regions indicate a compromised instance (QB9), not a runaway
PricingCategory | Standard or OnDemand | Unchanged: the rate is the same. More usage is the signal, not a rate change
RegionId | Your normal operating regions | Same region; this is how QB3 discriminates from compromised-instance billing patterns

Tracing by resource type

ECS / EC2 autoscaling: In AWS Cost Explorer, filter to Service = Amazon EC2, Group by = Instance Type, daily granularity. Look for a sudden increase in a specific instance type count that does not correspond to a deployment event. Check Auto Scaling Group activity history in the EC2 console; scale-out events near the onset date confirm the cause. The fix: add a realistic MaxSize to every Auto Scaling Group that reflects actual infrastructure capacity, not the cloud provider default.

Lambda recursion / feedback loops: Filter Cost Explorer to Service = AWS Lambda, Group by = Resource. Cross-reference with CloudWatch invocation metrics: a runaway Lambda shows a concurrent spike in both cost and invocation count. Check CloudWatch Logs Insights for recursive call patterns around the onset date:

fields @timestamp, @message
| filter @message like /Recursion detected/ or @message like /self-invoke/
| stats count() by bin(1h)

The fix: set reserved concurrency on any Lambda that could be triggered by its own output, whether through its own SQS queue, an SNS topic, or a downstream Lambda. Reserved concurrency is a hard per-function cap on Lambda billing damage; without it, a loop runs until it reaches the account-level concurrency limit.
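As a sketch of what that fix can look like in code, the snippet below builds the arguments for Lambda's PutFunctionConcurrency API and applies them with boto3. The function name "order-processor" and the limit of 10 are hypothetical placeholders; choose a limit from the function's observed peak concurrency.

```python
def reserved_concurrency_request(function_name, limit):
    """Build keyword arguments for Lambda's PutFunctionConcurrency call."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return {
        "FunctionName": function_name,
        "ReservedConcurrentExecutions": limit,
    }

def apply_reserved_concurrency(function_name, limit):
    """Apply the cap. Requires boto3 and AWS credentials with permission to update the function."""
    import boto3  # imported lazily so the helper above can be used without AWS access

    client = boto3.client("lambda")
    return client.put_function_concurrency(
        **reserved_concurrency_request(function_name, limit)
    )

# Hypothetical function name and limit, for illustration only.
request = reserved_concurrency_request("order-processor", 10)
print(request)
```

Note that a limit of 0 throttles every invocation, which can serve as an emergency stop while a loop is being investigated.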

RDS storage autoscaling drift: This is the gradual-ramp class. Pull a 90-day daily cost chart for the specific RDS instance in Cost Explorer: a slow upward slope over weeks or months with no step-change is the signature. Check current allocated storage versus actual data size in RDS Console → Storage. Storage autoscaling never scales back down, and allocated storage cannot be reduced in place, so the mitigation is a manual resize (in practice, a migration to a smaller volume) after growth stabilizes, combined with a monthly review of allocated-versus-used ratios for any RDS instance with autoscaling enabled.
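The monthly allocated-versus-used review can be reduced to a small check like the one below. It is a sketch under stated assumptions: the input dictionaries, their field names, and the 2.0 ratio cutoff are all hypothetical, and the storage figures would come from wherever you already collect RDS metrics.

```python
def overprovisioned(instances, max_ratio=2.0):
    """Return (instance id, allocated/used ratio) for instances above max_ratio,
    largest ratio first. Instances reporting zero used storage are skipped."""
    flagged = []
    for inst in instances:
        if inst["used_gib"] <= 0:
            continue
        ratio = inst["allocated_gib"] / inst["used_gib"]
        if ratio > max_ratio:
            flagged.append((inst["id"], round(ratio, 2)))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical inventory for illustration.
inventory = [
    {"id": "orders-db", "allocated_gib": 1000, "used_gib": 300},
    {"id": "users-db", "allocated_gib": 200, "used_gib": 150},
]
print(overprovisioned(inventory))
```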

The detection query (FOCUS-native, QB3)

If you have FOCUS-formatted billing exports, this query identifies resources where 4 or more of the last 7 days exceeded the 23-day baseline by more than 50%:

-- QB3: Runaway cost acceleration, count-based persistence detector
-- Baseline: days -30 to -7 (excludes last 7 days; prevents runaway contaminating its own baseline)
-- Fires when 4+ of last 7 days have daily_cost > avg_baseline_cost * 1.5
WITH daily_costs AS (
    SELECT
        resourceid,
        servicename,
        subaccountid,
        CAST(chargeperiodstart AS DATE) AS charge_date,
        SUM(billedcost)                 AS daily_cost
    FROM focus_billing
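    WHERE chargecategory = 'Usage'
      -- FOCUS 1.0 leaves ChargeClass NULL for non-correction rows; 'Regular' is kept for exports that populate it
      AND (chargeclass IS NULL OR chargeclass = 'Regular')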
      AND CAST(chargeperiodstart AS DATE) >= DATE_ADD('day', -30, CURRENT_DATE)
    GROUP BY 1, 2, 3, 4
),
baseline AS (
    SELECT
        resourceid,
        servicename,
        subaccountid,
        AVG(daily_cost) AS avg_baseline_cost
    FROM daily_costs
    WHERE charge_date >= DATE_ADD('day', -30, CURRENT_DATE)
      AND charge_date <  DATE_ADD('day', -7,  CURRENT_DATE)
    GROUP BY 1, 2, 3
    HAVING AVG(daily_cost) > 0.01
),
recent AS (
    SELECT
        d.resourceid,
        d.servicename,
        d.subaccountid,
        b.avg_baseline_cost,
        SUM(CASE WHEN d.daily_cost > b.avg_baseline_cost * 1.5 THEN 1 ELSE 0 END) AS high_days,
        AVG(d.daily_cost)                                                          AS avg_recent_cost
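    FROM daily_costs d
    JOIN baseline b ON d.resourceid   = b.resourceid
                   AND d.servicename  = b.servicename
                   AND d.subaccountid = b.subaccountid
    WHERE d.charge_date >= DATE_ADD('day', -7, CURRENT_DATE)
      AND d.charge_date <  CURRENT_DATE  -- exclude today's partial data so the window is exactly 7 days
    GROUP BY 1, 2, 3, 4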
)
SELECT
    resourceid,
    servicename,
    subaccountid,
    ROUND(avg_baseline_cost, 4)                                       AS avg_baseline_cost,
    ROUND(avg_recent_cost, 4)                                         AS avg_recent_cost,
    ROUND(avg_recent_cost / NULLIF(avg_baseline_cost, 0), 2)         AS cost_ratio,
    high_days
FROM recent
WHERE high_days >= 4
ORDER BY avg_recent_cost DESC
LIMIT 25

Reading the results: cost_ratio is how many times above the 23-day baseline the recent 7-day average is. high_days is the count of individual days that exceeded the 1.5× threshold; anything ≥ 4 fires. A cost_ratio=3.5 with high_days=7 means the past week averaged 3.5× baseline and every day exceeded the 1.5× threshold. A cost_ratio=1.6 with high_days=4 is a lower-severity runaway, barely crossing the threshold on exactly 4 days.

Key design decisions in the query:


If you watch the bill: how to detect this at scale

For FinOps practitioners and cloud finance analysts responsible for multi-account cost visibility and anomaly escalation.

The challenge with runaway cost acceleration for FinOps is that it looks like stability to most detection tools. After the initial spike, the elevated cost is consistent: not growing, not dropping. That flat-but-elevated pattern is exactly what AWS Cost Anomaly Detection normalizes away as its rolling baseline catches up to the new level. By week 3, the alert has silenced itself. The cost keeps running.
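The self-silencing effect can be illustrated with a toy detector. This is not AWS Cost Anomaly Detection's actual model, which is not described here; it is a stand-in that compares each day to the mean of the previous 14 days, and both the 14-day window and the 1.5× threshold are assumptions chosen for illustration.

```python
from statistics import mean

def alert_days(daily_costs, window=14, threshold=1.5):
    """Return the 0-indexed days on which cost exceeds threshold x the trailing-window mean."""
    fired = []
    for day in range(window, len(daily_costs)):
        baseline = mean(daily_costs[day - window:day])  # rolling baseline of prior days
        if daily_costs[day] > threshold * baseline:
            fired.append(day)
    return fired

# 30 normal days at $100, then a step to $300 that never comes back down.
costs = [100.0] * 30 + [300.0] * 30
fired = alert_days(costs)
print(fired)
```

With these assumed parameters the alert fires for the first seven days after the step and then stops, even though cost remains at 3× the original level for the rest of the series.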

The AWS Cost Anomaly Detection gap

AWS Cost Anomaly Detection is a useful first layer, but it has structural limitations that leave the sustained-cost class undetected:

The QB3 check is the complement to AWS CAD, not a replacement. AWS CAD catches day-of explosive events faster. QB3 catches the sustained pattern that AWS CAD normalizes away, and is the only detector for the gradual-ramp class.

Multi-account detection approach

Run the QB3 query above against your consolidated FOCUS billing export weekly. Any resource returning a row is in a sustained runaway state: 4 or more of the last 7 days exceeded its own 23-day baseline by more than 50%. Because the threshold is relative to each resource's own baseline, a small workload fires on a small absolute increase, while a large workload needs a proportionally larger sustained elevation to trigger.

In Cost Explorer without FOCUS exports: set date range to 90 days, Group by = Resource, daily granularity. Resources showing a step-up plateau (flat → elevated → still elevated) are the step-function onset class. Resources showing a slow upward slope visible only at 90-day zoom are the gradual-ramp class. The two patterns require different root-cause investigations.

Dollar impact math
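A simple way to size a runaway from the QB3 output is to multiply the gap between avg_recent_cost and avg_baseline_cost by the number of days the resource has been elevated. The sketch below assumes the recent average is representative of the whole elevated period; the function names and the dollar figures are hypothetical.

```python
def excess_cost(avg_baseline_cost, avg_recent_cost, days_elevated):
    """Spend above baseline accumulated over the elevated period."""
    daily_excess = max(avg_recent_cost - avg_baseline_cost, 0.0)
    return daily_excess * days_elevated

def projected_monthly_excess(avg_baseline_cost, avg_recent_cost, days_in_month=30):
    """Excess spend if the resource stays at its recent level for a full month."""
    return excess_cost(avg_baseline_cost, avg_recent_cost, days_in_month)

# Hypothetical resource: $40/day baseline, $100/day over the last week.
so_far = excess_cost(40.0, 100.0, 7)
per_month = projected_monthly_excess(40.0, 100.0)
print(so_far, per_month)
```

The projected monthly figure is usually the more persuasive number in an escalation, because it states what inaction costs rather than what has already been spent.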

Distinguishing onset profiles in the results

When a resource fires QB3, the 90-day cost chart tells you which onset class you are dealing with:
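A rough way to make that call programmatically is to ask how much of the total rise happened in a single day. The heuristic below is an illustrative assumption rather than part of QB3: it compares the largest day-over-day jump to the difference between the first and last weekly averages, and the 0.5 cutoff is arbitrary.

```python
from statistics import mean

def onset_profile(daily_costs, step_share=0.5):
    """Label a cost series as step-function or gradual-ramp.
    Expects at least 14 daily values, oldest first."""
    total_rise = mean(daily_costs[-7:]) - mean(daily_costs[:7])
    if total_rise <= 0:
        return "no sustained rise"
    largest_jump = max(b - a for a, b in zip(daily_costs, daily_costs[1:]))
    if largest_jump / total_rise >= step_share:
        return "step-function"
    return "gradual-ramp"

step = [100.0] * 60 + [300.0] * 30            # flat, then a plateau
ramp = [100.0 + 2.0 * i for i in range(90)]   # slow upward slope

print(onset_profile(step))
print(onset_profile(ramp))
```

Noisy workloads will need smoothing before a check like this is reliable; the 90-day chart remains the primary evidence.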

Escalation template

We identified a sustained cost anomaly on [resource ID] in [account name]: [high_days] of the last 7 days exceeded 1.5× its 23-day baseline, and the 7-day average was [cost_ratio]× that baseline. This resource has been running above normal for at least [N] days without returning to baseline.

This is consistent with an autoscaling event that was not corrected, a configuration change that increased steady-state resource consumption, or storage and data growth that has not been reviewed since the initial scaling event. Action needed: identify what changed on or before [onset_date], confirm whether the elevated cost is expected, and either remediate or update the expected baseline for this resource.
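If the weekly scan is automated, the opening sentence of the template can be filled directly from a QB3 result row. The sketch below assumes each row is a dictionary keyed by the query's output column names; the account name lookup and the example values are hypothetical.

```python
def escalation_message(row, account_name):
    """Render the first sentence of the escalation from one QB3 result row."""
    return (
        f"We identified a sustained cost anomaly on {row['resourceid']} "
        f"in {account_name}: {row['high_days']} of the last 7 days exceeded "
        f"1.5x its 23-day baseline, and the 7-day average was "
        f"{row['cost_ratio']}x that baseline."
    )

# Hypothetical result row for illustration.
row = {"resourceid": "i-0abc123", "high_days": 6, "cost_ratio": 2.4}
message = escalation_message(row, "payments-prod")
print(message)
```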


If you own the outcome: the governance gap and how to close it

For Engineering Managers, VPs of Engineering, and CTOs who need root cause and a process change.

What happened, in plain English

One of two things occurred. Either a resource configuration failed suddenly (an autoscaling group with no upper bound scaled out during a traffic event and never came back down) and the team received a spike alert that was dismissed, deprioritized, or never acted on. Or a resource configuration drifted gradually: storage autoscaled in response to real growth, month by month, each increment too small to trigger an alert, until four months of accumulated cost became undeniable.

Both failure modes share the same root cause: the infrastructure was configured for availability and performance, not for cost control. An ECS autoscaling group with no maximum capacity is correctly configured for reliability: it will never fail to scale. It is incorrectly configured for cost: it will never stop scaling either. RDS storage autoscaling is correctly configured for availability: the database will never run out of storage. It has no native mechanism to scale back down, so the only mitigation is a periodic manual review that most teams do not have.

The result in both cases is a resource that is functioning correctly by every operational metric (availability is fine, latency is fine, error rates are fine) while billing data reflects a cost that is 2× or 3× what it should be, accumulating every day.

Why existing process did not catch it

Two mechanisms failed simultaneously:

  1. Spike alerting is not persistence alerting. AWS Cost Anomaly Detection and budget alerts fire when something spikes. They do not fire when something that already spiked continues to run elevated. The alert fires once, on day 1 or day 2, and then silences itself as the elevated cost becomes the new rolling baseline. Day 4, day 10, and day 30 of elevated spend look identical to the ML model: no new anomaly, nothing to alert on. Persistence requires a separate instrument.
  2. Monthly billing reviews catch totals, not sustained patterns. A monthly review showing EC2 spend up 40% triggers an investigation. A monthly review showing EC2 spend up 8%, because the runaway started on day 22 of the billing period, produces no escalation. By the time the full elevated month appears in the next review cycle, 30 days of waste have already been realized. The gradual-ramp RDS case shows the same effect: no single monthly delta was large enough to trigger escalation until month 4.
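The masking effect in the second point is simple arithmetic, sketched below. The inputs are hypothetical: a resource that accounts for 25% of a service's spend doubles in cost on day 22 of a 30-day billing period.

```python
def monthly_uplift(cost_ratio, onset_day, share_of_spend=1.0, days_in_month=30):
    """Fractional increase in a month's total when a resource holding share_of_spend
    of the bill runs at cost_ratio x baseline from onset_day (1-indexed) to month end."""
    elevated_days = days_in_month - onset_day + 1
    return share_of_spend * (cost_ratio - 1) * elevated_days / days_in_month

first_month = monthly_uplift(2.0, onset_day=22, share_of_spend=0.25)
next_month = monthly_uplift(2.0, onset_day=1, share_of_spend=0.25)
print(f"{first_month:.1%} then {next_month:.1%}")
```

Under these assumptions the first review shows an increase of 7.5%, easy to wave through, and the full 25% only becomes visible a month later.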

The governance gap

The gap is not in alerting sophistication. It is in what is being measured. Spike alerting and persistence alerting are different instruments for different failure modes:

An organization with the first but not the second has an undetected class of spend: every incident that persists past day 3, and the entire gradual-ramp class where no spike ever occurred. Both produce real accumulated waste that is invisible to standard alerting until it appears in a monthly summary, after it has already happened.

The policy fix

  1. Autoscaling bounds as a required IaC field. Every Auto Scaling Group, Cloud Run service, and Lambda function in production must have a defined maximum capacity or concurrency limit in the infrastructure-as-code definition. Make this a pull request requirement: any resource that manages autoscaling without a maximum should not pass code review. This is a preventive control: with a maximum in place, a runaway cannot scale without bound regardless of traffic events.
  2. Weekly persistence scan as a FinOps review item. The QB3 query runs weekly against the consolidated billing export. Any resource with 4+ of 7 days above 1.5× baseline gets an escalation ticket to the resource owner within 24 hours. This is the detective control: it catches the runaways that slip past the preventive controls, and it is the only mechanism that catches the gradual-ramp class.
  3. Spend caps on all non-production environments. Any development, test, or staging environment without a daily spend cap is a liability. AWS Budgets can trigger an SNS notification, or a Lambda-based shutdown, when a daily threshold is crossed. A $50/day threshold on dev environments is trivial next to a single undetected runaway that costs $47,000, roughly 940 times that daily amount.
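The first policy item lends itself to an automated check in CI. The sketch below is illustrative only: it assumes resource definitions have already been parsed into dictionaries, and the resource type names and limit field names are hypothetical, not the schema of any particular IaC tool.

```python
# Hypothetical mapping from resource type to the field that bounds its scale.
REQUIRED_LIMIT = {
    "auto_scaling_group": "max_size",
    "cloud_run_service": "max_instances",
    "lambda_function": "reserved_concurrency",
}

def missing_limits(resources):
    """Return one violation message per scalable resource that lacks its limit field."""
    violations = []
    for res in resources:
        field = REQUIRED_LIMIT.get(res["type"])
        if field is not None and res.get(field) is None:
            violations.append(f"{res['name']}: missing {field}")
    return violations

# Hypothetical parsed definitions for illustration.
definitions = [
    {"type": "auto_scaling_group", "name": "web-asg", "max_size": 20},
    {"type": "lambda_function", "name": "queue-worker"},
    {"type": "s3_bucket", "name": "logs"},
]
print(missing_limits(definitions))
```

A pipeline step can fail the pull request whenever the returned list is non-empty, which turns the review rule into an enforced one.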

What good looks like

Every production Auto Scaling Group has a documented maximum that reflects actual infrastructure capacity. Every non-production environment has a daily spend cap in AWS Budgets. A weekly FinOps review includes a QB3 scan of consolidated billing data. When a resource fires the persistence detector, it receives an escalation ticket within 24 hours, not at month-end review. Mean time to detection for runaway events is under 7 days. The RDS storage class of incident (four months of gradual cost growth before investigation) does not recur because the weekly scan catches the pattern in week 2, not month 4.


Fix checklist

  1. Set autoscaling MaxSize on every Auto Scaling Group. Required field in IaC: any ASG without a documented maximum fails code review. An unbounded max is a billing risk with no technical ceiling.
  2. Set Lambda reserved concurrency on any function that could be triggered by its own queue, topic, or a downstream Lambda. Reserved concurrency is a hard per-function cap on Lambda billing damage from a recursion or feedback loop.
  3. Audit RDS instances with storage autoscaling enabled. For any instance showing gradual cost growth over 60+ days, compare allocated storage to actual data size. Reclaim over-provisioned storage on the next maintenance window.
  4. Configure spend caps on all non-production environments. AWS Budgets at $50/day absolute for dev/test accounts. Takes 10 minutes to configure. Catches a $47,000 ECS runaway within hours.
  5. Configure AWS Cost Anomaly Detection as a complementary first layer. Set a monitor on EC2, Lambda, and RDS with a per-account $50/day-above-baseline threshold. AWS CAD catches explosive step-function events faster than daily billing aggregation. QB3 then confirms the event became sustained; together they cover both onset profiles.
  6. Run the QB3 persistence query weekly against your FOCUS billing export. Any resource with high_days >= 4 gets an escalation ticket within 24 hours. This is the systematic backstop for all the controls above, and the only detection mechanism for the gradual-ramp class.

See if this pattern is in your billing data

Runaway cost acceleration is one of several billing patterns that standard alerting misses, either because it normalizes away after the initial spike, or because it arrives gradually enough to stay below any single-period threshold. The DropInFinOps free assessment takes 2 minutes and tells you which anomaly patterns your current setup is positioned to catch, and which ones are accumulating undetected.
