How to Detect a Cost Spike Before It Becomes a Bill: The 3-Day Baseline Method
The most expensive cloud incidents are the ones nobody catches in real time. An autoscaling group absorbs a DDoS attack and scales to 2,000 instances. A Lambda function calls itself recursively until the account throttles it. A developer runs a load test against the wrong environment. In each case, the billing data shows the damage, but billing exports typically arrive 8–24 hours after the spend occurs.
The 3-day baseline method is a detection approach suited to this data latency: compare the average daily cost over the last 3 days against the average daily cost over the prior 30 days. If the recent average is more than double the baseline, something changed, and it is still recent enough to investigate.
The Detection Logic
The spike detector fires when:
avg_3d_cost > avg_30d_baseline_cost × 2.0
Where:
- avg_3d_cost = mean daily cost over the last 3 days (days −3 to 0)
- avg_30d_baseline_cost = mean daily cost over the 30 days prior to the 3-day window (days −30 to −3, baseline window)
The baseline excludes the last 3 days deliberately: if a spike has been running for a week, including it in the baseline would pull the baseline up and suppress the ratio. The separation keeps the baseline clean.
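The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the detector's actual implementation: the function name `detect_spike`, the oldest-to-newest list input, and the choice to take the baseline as the 30 days immediately before the 3-day window are all assumptions made for the example.

```python
from statistics import mean

RECENT_DAYS = 3     # detection window
BASELINE_DAYS = 30  # assumed: the 30 days immediately before the detection window
THRESHOLD = 2.0     # fire when the recent average exceeds 2x the baseline average


def detect_spike(daily_costs, threshold=THRESHOLD):
    """Return (fires, ratio) for daily costs ordered oldest to newest."""
    needed = RECENT_DAYS + BASELINE_DAYS
    if len(daily_costs) < needed:
        raise ValueError(f"need at least {needed} days of cost data")

    # The baseline window stops where the recent window starts, so a spike
    # in the last 3 days cannot inflate its own baseline.
    recent = daily_costs[-RECENT_DAYS:]
    baseline = daily_costs[-needed:-RECENT_DAYS]

    avg_3d_cost = mean(recent)
    avg_30d_baseline_cost = mean(baseline)
    if avg_30d_baseline_cost <= 0:
        # No meaningful baseline (for example a brand-new resource).
        return False, None

    ratio = avg_3d_cost / avg_30d_baseline_cost
    return ratio > threshold, ratio


stable = [100.0] * 30
print(detect_spike(stable + [100.0, 100.0, 100.0]))  # (False, 1.0)
print(detect_spike(stable + [100.0, 500.0, 600.0]))  # (True, 4.0)
```

In the second call only two of the three recent days are elevated, yet the 3-day average (400) is already four times the baseline, so the rule fires.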
Secondary output fields (informational):
- Cost acceleration ratio: the 3-day/30-day multiple. At 5×, the resource is costing 5 times its normal rate. At 10×, something serious has happened.
- Spike strength (z-score): how many standard deviations above the baseline mean. A resource with volatile baseline costs (high stddev) needs a larger absolute spike to produce the same z-score as a stable resource. High spike strength on a normally flat resource is a more actionable signal than the same ratio on an already-volatile one.
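A short sketch of how spike strength might be computed. The exact formula is not given here, so this assumes the z-score compares the 3-day average against the baseline mean and uses the population standard deviation of the baseline days; `spike_strength` is a hypothetical name.

```python
from statistics import mean, pstdev


def spike_strength(baseline_costs, recent_costs):
    """Z-score of the recent average against the baseline distribution."""
    baseline_mean = mean(baseline_costs)
    baseline_std = pstdev(baseline_costs)  # population stddev (an assumption)
    if baseline_std == 0:
        return None  # perfectly flat baseline: the z-score is undefined
    return (mean(recent_costs) - baseline_mean) / baseline_std


flat = [100.0, 102.0] * 15      # baseline mean 101, stddev 1
volatile = [60.0, 142.0] * 15   # baseline mean 101, stddev 41
recent = [303.0, 303.0, 303.0]  # 3x the baseline mean in both cases

print(spike_strength(flat, recent))      # 202.0
print(spike_strength(volatile, recent))  # about 4.93
```

Both resources show the same 3× acceleration ratio, but the flat one scores roughly 40 times higher on spike strength, which is the ranking behavior described above.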
Threshold Calibration: Why 2×, Not Less
AWS Cost Anomaly Detection (CAD) uses a default threshold of approximately 40% deviation above the 7-day trend, with a $100 minimum. The spike detector (QB2) requires 2× (100% above baseline), which is significantly stricter. This is intentional:
| Detector | Threshold | Behavior |
|---|---|---|
| AWS Cost Anomaly Detection (gradual) | ~10–20% above 7-day trend | Very sensitive: fires on warm Tuesdays after quiet Mondays |
| Spike detector (QB2) | 100% above 30-day baseline (2×) | High-confidence: fires only on genuine anomalies worth investigating |
| AWS CAD (sudden spike) | ~300% within 24 hours | Extreme events only: misses sustained mid-magnitude spikes |
A 2× spike is always actionable. An 8% deviation above a weekly trend is often a Tuesday. The 2× threshold trades recall (it misses moderate anomalies) for precision (every result is worth a look).
What Appears in FOCUS Billing Data
| FOCUS Field | Baseline (normal) | Spike period |
|---|---|---|
| BilledCost | $X/day, stable | $2X–$10X+ per day; step function on spike onset day |
| ConsumedQuantity | Q units/day, stable | 2Q–10Q units/day; quantity rises proportionally with cost |
| ResourceId | Existing resource | Same resource: more of the same thing, not a new thing |
| UsageType | e.g., BoxUsage:t3.medium, Lambda-GB-Second | Same UsageType: the rate is unchanged, the quantity is the signal |
| PricingCategory | Standard or OnDemand | Standard: rate unchanged, volume up |
The key discriminant: in a genuine usage spike, both BilledCost and ConsumedQuantity rise together. If BilledCost rises but ConsumedQuantity stays flat, that is a pricing change, not a usage spike. The spike detector fires on usage spikes specifically: more of the same thing, at the same rate, in a volume you did not plan for.
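The discriminant can be expressed as a small classifier over the two FOCUS fields. This is a sketch under assumptions: the inputs are mean daily BilledCost and ConsumedQuantity for the baseline and recent windows, and both the `tolerance` parameter (how far a ratio may drift and still count as "flat" or "same rate") and the label strings are hypothetical choices for illustration.

```python
def classify_change(base_cost, base_qty, recent_cost, recent_qty,
                    threshold=2.0, tolerance=0.25):
    """Label a cost change as a usage spike, a pricing change, or neither."""
    if base_cost <= 0 or base_qty <= 0 or recent_qty <= 0:
        return "insufficient-data"

    cost_ratio = recent_cost / base_cost  # BilledCost multiple
    qty_ratio = recent_qty / base_qty     # ConsumedQuantity multiple

    if cost_ratio <= threshold:
        return "no-spike"
    if abs(qty_ratio - 1.0) <= tolerance:
        return "pricing-change"  # cost up, quantity flat
    if abs(cost_ratio / qty_ratio - 1.0) <= tolerance:
        return "usage-spike"     # cost and quantity rose together, rate unchanged
    return "mixed"               # both the rate and the quantity moved


print(classify_change(100, 1000, 500, 5000))  # usage-spike
print(classify_change(100, 1000, 300, 1050))  # pricing-change
print(classify_change(100, 1000, 150, 1500))  # no-spike
```

Dividing the cost multiple by the quantity multiple gives the change in effective unit rate, so a value near 1.0 means the price per unit did not move.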
Multi-Service Spikes
The most damaging real-world spikes are multi-service events. An autoscaling runaway generates simultaneous cost increases across compute (BoxUsage), load balancer (LoadBalancerUsage), and network (DataTransfer-Out-Bytes) legs. Each leg is a separate billing resource with its own ResourceId. The spike detector fires on each independently.
When multiple resources spike simultaneously in the same account and the same time window, that cluster is a stronger signal than any single resource spike. The combined picture (compute + LB + network all at 5× in the same 3-day window) is autoscaling, not coincidence.
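One way to surface such clusters is to group the per-resource results by account and onset day. The sketch below assumes each spike is a dict with hypothetical keys (`account_id`, `resource_id`, `service`, `onset_date`, `ratio`) and treats "same time window" as "same onset date", which is a simplification.

```python
from collections import defaultdict


def find_spike_clusters(spikes, min_services=2):
    """Group resource-level spikes by (account, onset date).

    A group counts as a cluster when it spans at least `min_services`
    distinct services, e.g. compute + load balancer + network.
    """
    groups = defaultdict(list)
    for spike in spikes:
        groups[(spike["account_id"], spike["onset_date"])].append(spike)

    clusters = []
    for (account_id, onset_date), members in sorted(groups.items()):
        services = sorted({m["service"] for m in members})
        if len(services) >= min_services:
            clusters.append({
                "account_id": account_id,
                "onset_date": onset_date,
                "services": services,
                "resource_ids": sorted(m["resource_id"] for m in members),
            })
    return clusters


# Illustrative data only; identifiers and ratios are made up.
spikes = [
    {"account_id": "111111111111", "resource_id": "i-0abc", "service": "compute",
     "onset_date": "2025-08-01", "ratio": 5.2},
    {"account_id": "111111111111", "resource_id": "alb-web", "service": "load-balancer",
     "onset_date": "2025-08-01", "ratio": 4.8},
    {"account_id": "111111111111", "resource_id": "nat-1", "service": "network",
     "onset_date": "2025-08-01", "ratio": 5.5},
    {"account_id": "222222222222", "resource_id": "fn-report", "service": "lambda",
     "onset_date": "2025-08-03", "ratio": 2.4},
]

for cluster in find_spike_clusters(spikes):
    print(cluster["account_id"], cluster["onset_date"], cluster["services"])
# 111111111111 2025-08-01 ['compute', 'load-balancer', 'network']
```

The isolated Lambda spike in the second account still fires on its own; it is simply not promoted to a cluster.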
Real-World Incidents
DDoS-triggered autoscaling: $120,000 (InfoQ, August 2025): A DDoS attack triggered autoscaling on an EKS cluster, which scaled to approximately 2,000 instances in an attempt to absorb the traffic. The attack and the autoscaling ran simultaneously for 72 hours before the security team mitigated the DDoS. Cost: $120,000. The compute, load balancer, and network legs all showed the same step-function spike in billing data on day 1 of the attack.
Fintech vulnerable endpoint → autoscaling attack: A fintech application's vulnerable API endpoint was discovered by automated scanners and exploited to trigger computationally expensive operations. The endpoint was not rate-limited. Autoscaling absorbed the load for 48 hours before the security team mitigated it. The billing spike (compute, load balancer, and network all moving together) was visible in billing data within one billing day of the attack beginning. The multi-leg spike profile, as opposed to a single resource spiking in isolation, is the fingerprint of an externally driven autoscaling event.
Lambda recursion: $10,000+ per hour (Vantage blog): A Lambda function introduced a recursive call path without a circuit breaker. At 100× normal invocation volume for a full billing day, the daily cost appears as a 100× spike against the 30-day baseline. The spike detector fires on the billing day aggregate. The event was contained by hitting account-level concurrency limits, not by detection.
RDS autoscaling misconfiguration (AWS case study): The opposite scenario, a 2–3× sustained spike that builds slowly. This is the boundary case: it fires the spike detector if the 3-day window catches the transition, and then the runaway persistence detector (QB3) confirms it has persisted for 4+ days. Together the two detectors cover the full range from explosive events to slow accumulation.
What the Spike Detector Does NOT Catch
Knowing the boundaries matters as much as knowing what fires:
- Sub-day spikes that self-recover within hours: a Lambda that runs at 2× for 3 hours on a high-baseline service may not move the daily cost aggregate above the 2× threshold. The daily aggregation granularity means fast spikes that recover within one billing day are only detectable if their absolute dollar impact is large enough to dominate the day's total.
- Gradual ramp-ups: if cost grows about 1% per day over a month, the 3-day window and the 30-day baseline move together. The ratio settles near 1.2, far short of the 2× threshold. Slow escalation belongs to the runaway persistence detector (QB3) and the data transfer misconfiguration pattern, not the spike detector.
- Data egress creep: real egress explosion incidents ($200/month to $3,500/month over 8 months) are slow ramps, not step spikes. The spike detector is not the right tool for them.
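The first two blind spots come down to simple arithmetic, sketched below. Both helpers are hypothetical and make simplifying assumptions: `ramp_ratio` models a steady compounding daily growth rate over the 33 days the detector sees, and `diluted_daily_multiple` assumes spend is otherwise uniform across the 24 hours of a billing day.

```python
def ramp_ratio(daily_growth, recent_days=3, baseline_days=30):
    """Recent/baseline ratio produced by a steady compounding ramp."""
    total_days = baseline_days + recent_days
    costs = [(1.0 + daily_growth) ** day for day in range(total_days)]
    baseline = costs[:baseline_days]
    recent = costs[baseline_days:]
    return (sum(recent) / recent_days) / (sum(baseline) / baseline_days)


def diluted_daily_multiple(spike_multiple, spike_hours):
    """Whole-day cost multiple when a spike lasts only part of the day."""
    normal_hours = 24 - spike_hours
    return (normal_hours + spike_hours * spike_multiple) / 24


print(round(ramp_ratio(0.01), 2))      # 1.17
print(round(ramp_ratio(0.02), 2))      # 1.37
print(diluted_daily_multiple(2.0, 3))  # 1.125
```

A 3-hour burst at 2× lifts that day's total by only 12.5%, and averaging over the 3-day window dilutes it further. How close a ramp gets to the threshold depends on its growth rate, so it is worth running `ramp_ratio` with the rates seen in your own data.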
Fix Checklist
- Investigate the 3-day window first: what changed? New deployment, traffic event, configuration change? Start with AWS CloudTrail events, ECS service events, or Lambda invocation logs from the days that show the spike. The billing spike onset day is the investigation starting point.
- Set autoscaling max capacity at realistic bounds: every Auto Scaling Group should have a MaxSize that cannot exceed your infrastructure budget. An unlimited max is not a reliability feature; it is an uncapped cost exposure.
- Add Lambda reserved concurrency: set reserved concurrency on functions that process user-generated events or that could theoretically be recursively triggered. Reserved concurrency limits the blast radius of a recursion bug.
- Enable AWS Cost Anomaly Detection as a complementary alert: AWS CAD evaluates spend several times a day, so it can flag an explosive event before a 3-day average of daily billing moves, although its underlying Cost Explorer data can itself lag by up to 24 hours. Use it for early alerting; use the 3-day baseline method for higher-confidence, lower-noise investigation.
- Rate-limit and throttle all external-facing endpoints: DDoS-driven autoscaling incidents are preventable at the API layer. AWS WAF, API Gateway throttling, and application-level rate limiting prevent autoscaling from absorbing attack traffic that should be blocked at the edge.
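The two capacity guardrails in this checklist map onto two boto3 calls, `update_auto_scaling_group` and `put_function_concurrency`. The sketch below passes the client in as an argument so it can be exercised without AWS credentials; the wrapper names, the `RecordingClient` stand-in, and the example resource names and limits are all hypothetical.

```python
def cap_asg_max_size(autoscaling_client, group_name, max_size):
    """Cap an Auto Scaling group's MaxSize."""
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MaxSize=max_size,
    )


def reserve_lambda_concurrency(lambda_client, function_name, limit):
    """Set reserved concurrency so a runaway function cannot scale past `limit`."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


class RecordingClient:
    """Stand-in for a boto3 client: records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return record


# Dry run: inspect the requests that would be made.
dry_run = RecordingClient()
cap_asg_max_size(dry_run, "web-asg", 40)
reserve_lambda_concurrency(dry_run, "ingest-events", 50)
for call in dry_run.calls:
    print(call)

# Against a real account (requires boto3 and credentials):
#   import boto3
#   cap_asg_max_size(boto3.client("autoscaling"), "web-asg", 40)
#   reserve_lambda_concurrency(boto3.client("lambda"), "ingest-events", 50)
```

The right values for `max_size` and `limit` depend on your budget and traffic profile; the numbers here are placeholders.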