Automating AWS Cost Alerts with Terraform
Three alert tiers that cover what the others miss: service budgets for AI-specific floors, anomaly detection for behavioral changes, and a zero-threshold security sensor for new regions.
A single monthly budget alert on total account spend is the most common FinOps automation and the least useful one. It fires when 80% of the month's budget is spent, at which point the damage is done and 2-3 weeks of billing may remain. It catches nothing that happened fast. It catches nothing that happened quietly. And it routes to whoever set it up, often someone who has since moved on.
Effective cost alerting requires three tiers that address different failure modes: absolute thresholds for known cost floors (catches the thing that should be $0 but isn't), behavioral anomaly detection for unexpected pattern changes (catches the thing that grew too fast), and a zero-threshold sensor for new services and regions (catches the thing that should never have existed). Each tier catches different problems. None of the three is redundant with the others.
All three can be deployed in a single terraform apply.
The Foundation: SNS Topic and Routing
Every alert tier publishes to SNS. Build the shared SNS infrastructure first; all alert resources reference it.
variable "alert_email" {
description = "Email address for cost and security billing alerts"
type = string
}
variable "environment" {
description = "Environment name used in resource naming"
type = string
default = "prod"
}
variable "monthly_account_budget_usd" {
description = "Total monthly account spend threshold in USD"
type = number
default = 5000
}
# Shared SNS topic: all three alert tiers publish here
resource "aws_sns_topic" "cost_alerts" {
name = "cost-alerts-${var.environment}"
tags = {
Environment = var.environment
Purpose = "cost-and-security-billing-alerts"
ManagedBy = "terraform"
}
}
# Email subscription: replace with a Slack webhook or PagerDuty for ops teams
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.cost_alerts.arn
protocol = "email"
endpoint = var.alert_email
}
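# Allow AWS Budgets, Cost Anomaly Detection, and CloudWatch alarms to publish to this topic
resource "aws_sns_topic_policy" "budgets_publish" {
arn = aws_sns_topic.cost_alerts.arn
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowBudgetsPublish"
Effect = "Allow"
Principal = {
Service = "budgets.amazonaws.com"
}
Action = "SNS:Publish"
Resource = aws_sns_topic.cost_alerts.arn
},
{
Sid = "AllowCostAnomalyPublish"
Effect = "Allow"
Principal = {
Service = "costalerts.amazonaws.com"
}
Action = "SNS:Publish"
Resource = aws_sns_topic.cost_alerts.arn
},
{
# Needed by the Tier 3 billing alarm once a custom policy replaces the default one
Sid = "AllowCloudWatchAlarmsPublish"
Effect = "Allow"
Principal = {
Service = "cloudwatch.amazonaws.com"
}
Action = "SNS:Publish"
Resource = aws_sns_topic.cost_alerts.arn
}
]
})
}
output "sns_topic_arn" {
value = aws_sns_topic.cost_alerts.arn
description = "SNS topic ARN for cost alert subscriptions"
}
The SNS topic policy needs a statement for each publishing service: budgets.amazonaws.com for the budget alerts in Tier 1, costalerts.amazonaws.com for the Cost Anomaly Detection subscription in Tier 2, and cloudwatch.amazonaws.com for the billing alarm in Tier 3. The third is easy to overlook, because attaching a custom topic policy replaces the default policy that otherwise lets alarms in the same account publish. A missing principal causes silent delivery failures: the budget fires but the notification never arrives.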
Tier 1: Service-Specific Budgets for AI Cost Floors
General account budgets alert when total spend crosses a threshold. Service-specific budgets alert when a specific service crosses a threshold, which is where AI cost floors live. An OpenSearch Serverless OCU charge of $50 in a month is invisible in a $5,000 total account budget. In a $50 OpenSearch Serverless budget, it fires within about five days of an orphaned Knowledge Base collection appearing.
# OpenSearch Serverless budget: catches orphaned Bedrock KB collections
# $50 threshold = ~5 days of minimum OCU billing for one idle collection
resource "aws_budgets_budget" "opensearch_serverless" {
name = "opensearch-serverless-${var.environment}"
budget_type = "COST"
limit_amount = "50"
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "Service"
values = ["Amazon OpenSearch Service"]
}
cost_filter {
name = "UsageType"
# Confirm the exact UsageType strings in Cost Explorer; they are often region-prefixed
values = ["SearchOCU", "IndexingOCU"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
}
}
# Bedrock inference budget: baseline for detecting runaway AI spend
# Set this to 2x your expected monthly inference cost
resource "aws_budgets_budget" "bedrock_inference" {
name = "bedrock-inference-${var.environment}"
budget_type = "COST"
limit_amount = var.bedrock_monthly_budget_usd
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "Service"
values = ["Amazon Bedrock"]
}
# Alert at 80% of budget (actual): leaves time to investigate before the limit is hit
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.alert_email]
}
# Alert at 100% (actual): escalate via SNS
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
}
}
variable "bedrock_monthly_budget_usd" {
description = "Monthly Bedrock inference budget in USD. Set to 2x your current monthly average spend."
type = number
default = 100
}
# SageMaker budget: catches idle endpoints accumulating 24/7 instance hours
resource "aws_budgets_budget" "sagemaker" {
name = "sagemaker-${var.environment}"
budget_type = "COST"
limit_amount = var.sagemaker_monthly_budget_usd
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "Service"
values = ["Amazon SageMaker"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
}
# Forecasted alert: SageMaker endpoint costs are predictable, so catch overruns early
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = [var.alert_email]
}
}
variable "sagemaker_monthly_budget_usd" {
description = "Expected monthly SageMaker spend. Set to 0 if no active endpoints are planned."
type = number
default = 0
}
# Total account budget: backstop for anything not caught by service budgets
resource "aws_budgets_budget" "account_total" {
name = "account-total-${var.environment}"
budget_type = "COST"
limit_amount = tostring(var.monthly_account_budget_usd)
limit_unit = "USD"
time_unit = "MONTHLY"
notification {
comparison_operator = "GREATER_THAN"
threshold = 50
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = [var.alert_email]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.alert_email]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
}
}
Threshold rationale for the OpenSearch budget: $50 catches an idle Knowledge Base collection within 4-5 days of creation (floor cost is $11.52/day). Setting it to $0 produces too many false positives in accounts that legitimately use OpenSearch for non-Bedrock workloads. Setting it to $200 means a collection runs for 17+ days before alerting. $50 is the practical midpoint: early enough to limit waste, high enough to suppress noise.
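The threshold arithmetic above can be checked with a short calculation. This is a minimal sketch: it assumes the $11.52/day floor quoted here stays constant and ignores any lag between a charge being incurred and AWS Budgets evaluating it. `first_alert_day` and `spend_at_alert` are hypothetical helpers for illustration, not part of any AWS tooling.

```python
import math

DAILY_FLOOR_USD = 11.52  # idle-collection floor cost per day, as quoted in the text


def first_alert_day(threshold_usd: float, daily_cost_usd: float = DAILY_FLOOR_USD) -> int:
    """Return the first whole day on which cumulative spend exceeds the threshold.

    Mirrors a GREATER_THAN notification at 100% of the budget: spend must be
    strictly above the limit before the alert fires.
    """
    if daily_cost_usd <= 0:
        raise ValueError("daily_cost_usd must be positive")
    return math.floor(threshold_usd / daily_cost_usd) + 1


def spend_at_alert(threshold_usd: float, daily_cost_usd: float = DAILY_FLOOR_USD) -> float:
    """Cumulative spend in USD on the day the alert first fires."""
    return round(first_alert_day(threshold_usd, daily_cost_usd) * daily_cost_usd, 2)


if __name__ == "__main__":
    for threshold in (50, 200):
        day = first_alert_day(threshold)
        print(f"${threshold} budget: alert on day {day}, "
              f"after ${spend_at_alert(threshold):.2f} of spend")
```

Under these assumptions the $50 budget fires on day 5 and the $200 budget on day 18, which matches the "4-5 days" and "17+ days" figures in the rationale.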
Tier 2: Cost Anomaly Detection for Behavioral Changes
Budgets catch absolute thresholds. They do not catch behavioral changes: a service whose spend doubles but stays under budget, or a new spend pattern that appears in a previously zero-spend service. Cost Anomaly Detection uses ML to learn your spending baseline and alert when it detects a statistically significant deviation.
# Service-level anomaly monitor: watches all AWS services for unusual spend
resource "aws_ce_anomaly_monitor" "service_level" {
name = "service-level-monitor-${var.environment}"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE"
tags = {
Environment = var.environment
ManagedBy = "terraform"
}
}
# AI-specific anomaly monitor: scoped to AI/ML services only
# Catches subtle changes in AI spend that get lost in account-level noise
resource "aws_ce_anomaly_monitor" "ai_services" {
name = "ai-services-monitor-${var.environment}"
monitor_type = "CUSTOM"
monitor_specification = jsonencode({
And = null
Not = null
Or = null
CostCategories = null
Dimensions = {
Key = "SERVICE"
Values = [
"Amazon Bedrock",
"Amazon SageMaker",
"Amazon OpenSearch Service",
"Amazon Rekognition",
"Amazon Comprehend",
"Amazon Textract"
]
MatchOptions = ["EQUALS"]
}
Tags = null
})
tags = {
Environment = var.environment
ManagedBy = "terraform"
}
}
# Subscription: alert when anomaly impact is at least $50 absolute AND at least 20% relative
# Requiring both prevents alert noise from large-percentage anomalies on tiny spend
resource "aws_ce_anomaly_subscription" "immediate" {
name = "cost-anomaly-alerts-${var.environment}"
frequency = "IMMEDIATE"
monitor_arn_list = [
aws_ce_anomaly_monitor.service_level.arn,
aws_ce_anomaly_monitor.ai_services.arn,
]
subscriber {
type = "SNS"
address = aws_sns_topic.cost_alerts.arn
}
threshold_expression {
and {
dimension {
key = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
values = ["50"]
match_options = ["GREATER_THAN_OR_EQUAL"]
}
}
and {
dimension {
key = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
values = ["20"]
match_options = ["GREATER_THAN_OR_EQUAL"]
}
}
}
tags = {
Environment = var.environment
ManagedBy = "terraform"
}
}
Important constraints on Cost Anomaly Detection: the ML model requires at least 10 days of historical billing data to establish a baseline before it can fire. A new account or a new service within an existing account will not produce anomaly alerts during this learning window. For new AI services, Tier 1 budget alerts carry the load during the first 10 days.
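The dual threshold in the subscription can be read as a simple predicate: an anomaly is forwarded only when both conditions hold. The sketch below illustrates that reading. It assumes the percentage impact is (actual - expected) / expected, and the handling of a zero expected spend is a further assumption made for illustration; `passes_thresholds` is a hypothetical name, not an AWS API.

```python
def passes_thresholds(expected_usd: float, actual_usd: float,
                      min_abs_usd: float = 50.0, min_pct: float = 20.0) -> bool:
    """Return True when an anomaly meets BOTH the absolute and relative thresholds."""
    impact_usd = actual_usd - expected_usd
    if impact_usd < min_abs_usd:
        # Fails the absolute test, e.g. a 100% jump from $2 to $4
        return False
    if expected_usd <= 0:
        # Assumption: with no expected spend the relative impact is treated as unbounded
        return True
    impact_pct = impact_usd / expected_usd * 100
    return impact_pct >= min_pct


if __name__ == "__main__":
    cases = [
        (2.0, 4.0),        # +100% but only $2: suppressed
        (1000.0, 1060.0),  # +$60 but only 6%: suppressed
        (100.0, 160.0),    # +$60 and +60%: forwarded
    ]
    for expected, actual in cases:
        print(expected, actual, passes_thresholds(expected, actual))
```

The first two cases show why neither threshold alone is enough: a percentage-only rule would page on the $2 change, and an absolute-only rule would page on routine variation in a large service.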
Tier 3: Zero-Threshold Security Sensor
Neither budgets nor anomaly detection is designed to catch a new region appearing in your billing data with no prior history. Budgets require a threshold, and you cannot set a budget for "any spend in a region this account has never used before." Cost Anomaly Detection requires a baseline, and a region that has never had spend has no baseline to deviate from.
The zero-threshold security sensor has two parts: a CloudWatch billing alarm on total estimated charges as an account-level backstop, and a scheduled Lambda function that checks billing data for regions with no prior history. The EstimatedCharges metric can be broken down by service or linked account but not by region, which is why the region check needs the Lambda. Any charge in a region that has never appeared in your account is an immediate escalation: not an alert, an incident.
# CloudWatch billing alarm on total estimated charges: account-level backstop
# Note: billing metrics are only published in us-east-1, regardless of where your workloads run
# Assumes a provider alias aws.us_east_1 is configured elsewhere. Alarm actions target SNS topics
# in the alarm's own region, so the cost_alerts topic must also live in us-east-1.
resource "aws_cloudwatch_metric_alarm" "estimated_charges" {
provider = aws.us_east_1 # Billing metrics only published in us-east-1
alarm_name = "estimated-charges-${var.environment}"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
metric_name = "EstimatedCharges"
namespace = "AWS/Billing"
period = 86400 # 24 hours; EstimatedCharges is a month-to-date running total
statistic = "Maximum"
threshold = var.daily_charge_alert_usd
alarm_description = "Month-to-date estimated AWS charges exceeded threshold"
treat_missing_data = "notBreaching"
dimensions = {
Currency = "USD"
}
alarm_actions = [aws_sns_topic.cost_alerts.arn]
}
variable "daily_charge_alert_usd" {
description = "Month-to-date estimated charge threshold in USD for the CloudWatch billing alarm (EstimatedCharges is cumulative, not a per-day figure)"
type = number
default = 200
}
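# Lambda function for new-region detection: the zero-threshold security sensor
# Triggered on a schedule; queries FOCUS billing data via Athena for new region/account combos
# Assumes data.archive_file.new_region_detector, aws_iam_role.new_region_detector, and the
# billing_* variables are defined elsewhere in the configuration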
resource "aws_lambda_function" "new_region_detector" {
filename = data.archive_file.new_region_detector.output_path
function_name = "new-region-detector-${var.environment}"
role = aws_iam_role.new_region_detector.arn
handler = "handler.lambda_handler"
runtime = "python3.12"
timeout = 60
environment {
variables = {
ATHENA_DATABASE = var.billing_athena_database
ATHENA_TABLE = var.billing_athena_table
ATHENA_WORKGROUP = var.billing_athena_workgroup
RESULTS_BUCKET = var.billing_results_bucket
SNS_TOPIC_ARN = aws_sns_topic.cost_alerts.arn
}
}
tags = {
Environment = var.environment
Purpose = "security-billing-sensor"
ManagedBy = "terraform"
}
}
# Run the new-region detector every 6 hours
resource "aws_cloudwatch_event_rule" "new_region_detector" {
name = "new-region-detector-schedule-${var.environment}"
schedule_expression = "rate(6 hours)"
}
resource "aws_cloudwatch_event_target" "new_region_detector" {
rule = aws_cloudwatch_event_rule.new_region_detector.name
target_id = "NewRegionDetector"
arn = aws_lambda_function.new_region_detector.arn
}
resource "aws_lambda_permission" "allow_eventbridge" {
statement_id = "AllowEventBridgeInvoke"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.new_region_detector.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.new_region_detector.arn
}
The Lambda function body (not shown in full here) runs the new-region detection query from the CUR Deep Dive article against your FOCUS billing data in Athena: any RegionId appearing in the last 6 hours that has no history in the prior 90 days triggers an SNS publish. The query scans only recent partitions, keeping Athena costs to cents per execution. At 6-hour intervals, a cryptomining campaign that launches at midnight is caught by the 6 AM scan, limiting blast radius to about 6 hours of GPU spend instead of the roughly 24 hours implied by AWS's native anomaly detection latency.
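The comparison at the heart of the detector can be sketched in plain Python. This is not the query from the CUR Deep Dive article and not the author's Lambda body. It is an illustrative stand-in that assumes the billing rows have already been fetched as (RegionId, charge period start) pairs, and it leaves out the Athena call and the SNS publish. `find_new_regions` and `build_alert` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Tuple

Row = Tuple[str, datetime]  # (RegionId, charge period start), assumed row shape


def find_new_regions(rows: Iterable[Row], now: datetime,
                     recent_hours: int = 6, lookback_days: int = 90) -> List[str]:
    """Return regions billed in the recent window with no charges in the prior lookback."""
    recent_cutoff = now - timedelta(hours=recent_hours)
    history_cutoff = now - timedelta(days=lookback_days)
    recent, known = set(), set()
    for region, started in rows:
        # Skip region-less (global) charges and rows outside the lookback window
        if not region or started < history_cutoff or started > now:
            continue
        if started >= recent_cutoff:
            recent.add(region)
        else:
            known.add(region)
    return sorted(recent - known)


def build_alert(regions: List[str]) -> str:
    """Format the message body a handler could publish to the alert topic."""
    return "New region(s) with billed charges: " + ", ".join(regions)


if __name__ == "__main__":
    now = datetime(2025, 1, 10, 6, 0, tzinfo=timezone.utc)
    rows = [
        ("us-east-1", now - timedelta(days=30)),
        ("us-east-1", now - timedelta(hours=1)),
        ("ap-southeast-2", now - timedelta(hours=2)),
    ]
    new = find_new_regions(rows, now)
    if new:
        print(build_alert(new))
```

Keeping the comparison in a pure function makes it testable without AWS credentials. A real handler would feed it rows from the Athena results and publish `build_alert(...)` to the topic named in `SNS_TOPIC_ARN`.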
Deploying the Full Stack
# Initialize and validate
terraform init
terraform validate
# Preview what will be created (review before applying)
terraform plan -var="alert_email=your-team@example.com" \
-var="environment=prod" \
-var="monthly_account_budget_usd=5000" \
-var="bedrock_monthly_budget_usd=200" \
-var="sagemaker_monthly_budget_usd=500" \
-var="daily_charge_alert_usd=300"
# Deploy
terraform apply -var="alert_email=your-team@example.com" \
-var="environment=prod" \
-var="monthly_account_budget_usd=5000" \
-var="bedrock_monthly_budget_usd=200" \
-var="sagemaker_monthly_budget_usd=500" \
-var="daily_charge_alert_usd=300"
After terraform apply completes:
- Confirm the SNS email subscription. AWS sends a confirmation email to alert_email, and the subscription is inactive until confirmed. Check your inbox within 5 minutes of apply completing.
- Verify budget creation. AWS Console > AWS Budgets: confirm all four budgets appear with the correct service filters and thresholds.
- Wait 10 days before trusting anomaly detection. The ML model requires historical data. Tier 1 budgets are effective immediately; Tier 2 anomaly detection builds its baseline over the first 10 days.
- Test the CloudWatch alarm. AWS Console > CloudWatch > Alarms: select the estimated-charges alarm, choose Actions > Edit, temporarily lower the threshold below your current estimated charges, and confirm the alarm transitions to ALARM state and SNS fires. Reset the threshold after confirming.
What Each Tier Catches
| Alert tier | What it catches | What it misses | Response time |
|---|---|---|---|
| Tier 1: Service budgets | Orphaned KB collections, idle SageMaker endpoints, any service exceeding its monthly cap | Behavioral changes within budget, new services with no budget | Within days of threshold crossing |
| Tier 2: Anomaly detection | Unexpected spend increases, new patterns in known services, cost growth that outpaces baseline | New regions with no history, zero-to-nonzero spend on unknown services | Within 24-48 hours of pattern emerging (after 10-day learning period) |
| Tier 3: Security sensor | New region in any account, credential compromise, shadow IT deployments, cryptomining onset | Anomalies within known regions on known services | Within 6 hours of first charge appearing |
The three tiers are complements, not alternatives. A cryptomining campaign that stays within budget in a known region will be caught by Tier 2 (behavioral anomaly). An orphaned Knowledge Base in a low-spend account that never crosses the anomaly detection threshold will be caught by Tier 1. A new region deployment, legitimate or otherwise, is caught by Tier 3 within hours, before any other tier could fire.