The FinOps Starter Kit: A Sequenced Checklist for AI-Era Cloud Spend
Order matters. Tagging before alerts. FOCUS before detection. Security sensors before monthly reviews. This is the right sequence, and why skipping steps creates the gaps that cost you later.
Most FinOps checklists are a flat list. Enable CUR. Add tags. Set budget alerts. The problem is that some of these steps are prerequisites for others, and doing them out of order either produces useless results or requires rework. A budget alert set before tagging is enforced routes to nobody when it fires. An anomaly detection rule built on CUR data will need to be rebuilt when you migrate to FOCUS. A monthly review cadence is retrospective; daily behavioral queries are operational. The sequence determines whether you are catching problems before or after they compound.
This checklist is sequenced by dependency. Each phase builds on the previous one. Skip a phase and the ones after it deliver partial value at best.
Phase 1: Billing Data Foundation (Do This First, Everything Depends on It)
- Enable FOCUS 1.2 exports via AWS Data Exports. Not CUR: FOCUS. AWS Data Exports → Create export → FOCUS 1.2 with AWS columns → Hourly → Parquet. CUR 2.0 is still available, but FOCUS is the schema that supports cross-service AI detection queries and multi-cloud normalization. This is the data layer every subsequent step depends on. Takes 30 minutes. Do it before anything else.
  Why first: Every detection query, every dashboard, every alert threshold depends on having clean structured billing data in a queryable format. Building alerts on Cost Explorer summaries instead of raw FOCUS data means you are working with a pre-filtered view that hides the cross-service patterns where AI waste lives.
- Create the Athena pipeline. S3 bucket → Glue crawler → Athena table → workgroup. See the ETL Architecture article for the Terraform. Run the baseline sanity query (total cost by service, last 30 days) to confirm data flows. This step turns the raw export files into something queryable.
- Confirm FOCUS field availability. Run: SELECT DISTINCT x_usagetype FROM focus_billing LIMIT 50. Confirm SearchOCU, InvokeModelInference, and NatGateway-Bytes appear if those services are in use. If x_usagetype is empty or missing, the export is not FOCUS 1.2; it may be CUR 2.0, which uses a different field name. Fix before proceeding.
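The field-availability check above can be automated once the query results are in hand. The following is a minimal sketch, assuming the distinct x_usagetype values have already been fetched from Athena into a Python list; the function name and the sample values are hypothetical, and the substring match is an assumption about how region prefixes appear in usage types.

```python
def missing_usage_types(found, expected_markers):
    """Return the expected markers that match none of the usage types found.

    found: values returned by SELECT DISTINCT x_usagetype (may contain None or "").
    expected_markers: substrings to look for, e.g. "SearchOCU".
    """
    found_clean = [u for u in found if u]  # drop NULL and empty values
    return [
        marker
        for marker in expected_markers
        if not any(marker.lower() in usage.lower() for usage in found_clean)
    ]


# Hypothetical query result, for illustration only.
sample = ["USE1-SearchOCU", "USE1-NatGateway-Bytes", None, ""]
markers = ["SearchOCU", "InvokeModelInference", "NatGateway-Bytes"]
missing = missing_usage_types(sample, markers)
print("Missing usage types:", missing)
```

An empty result list from the query makes every marker show up as missing, which is the signal that the export is probably not the FOCUS schema. A marker that is missing on its own may simply mean the service is not in use.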
Phase 2: Tagging Policy (Before Budget Alerts: Alerts Without Tags Route to Nobody)
- Define your required tag keys. Minimum: owner, project, environment. Add cost-center if you do chargeback. Keep it to 3-5 keys; more than that creates compliance fatigue and reduces enforcement rates. Document the allowed values for each key (especially environment: prod/staging/dev, not freeform).
- Enforce tags at provisioning time via IaC. Terraform and CDK both support required tag variables. Any resource created through IaC gets tags at creation. No console Quick-create for production or staging resources: this is the single rule that prevents the majority of untagged AI resource incidents. SageMaker notebooks, Bedrock Knowledge Bases, and OpenSearch Serverless collections created via console bypass all tagging enforcement.
- Deploy the AI tagging enforcer Lambda. The function from the Lambda Automations article scans Bedrock KBs, OpenSearch Serverless collections, and SageMaker endpoints for missing required tag keys and publishes a daily SNS report. This catches Quick-create resources within 24 hours of provisioning. Deploy this before setting up budget alerts; untagged resources cannot be attributed when alerts fire.
- Backfill tags on existing untagged resources. Run: SELECT resourceid, servicename, SUM(billedcost) FROM focus_billing WHERE (tags IS NULL OR tags = '{}') AND chargecategory = 'Usage' GROUP BY resourceid, servicename ORDER BY SUM(billedcost) DESC LIMIT 50. The parentheses matter: without them, AND binds tighter than OR and the Usage filter applies only to the empty-tags case. Fix the top 50 by spend; these are your highest-risk unattributed resources.
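The tagging policy above reduces to a small validation rule: every required key present and non-empty, and constrained keys limited to their allowed values. Here is a minimal sketch of that rule, assuming tags arrive as a plain dict. The function name and the violation strings are hypothetical and are not taken from the enforcer Lambda in the Lambda Automations article.

```python
REQUIRED_TAG_KEYS = ("owner", "project", "environment")
# Allowed values from the policy above; extend per key as needed.
ALLOWED_VALUES = {"environment": {"prod", "staging", "dev"}}


def tag_violations(tags, required=REQUIRED_TAG_KEYS, allowed=ALLOWED_VALUES):
    """Return a list of policy violations for one resource's tags.

    tags: dict of tag key -> value, or None for an untagged resource.
    """
    tags = tags or {}
    problems = []
    for key in required:
        value = tags.get(key)
        if value is None or not str(value).strip():
            problems.append(f"missing:{key}")
        elif key in allowed and value not in allowed[key]:
            problems.append(f"invalid:{key}={value}")
    return problems


# A freeform environment value is flagged even though the key is present.
print(tag_violations({"owner": "ml-team", "project": "rag", "environment": "production"}))
```

Checking allowed values in the same pass as missing keys is a design choice: it keeps freeform values such as "production" from silently fragmenting cost reports.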
Phase 3: Cost Alerts (AI-Specific Tiers, Not Just Total Account Budget)
- OpenSearch Serverless OCU budget: $50/month threshold. Catches an orphaned Bedrock KB collection within 4-5 days of creation. Any OCU charges in an account that does not deliberately run OpenSearch Serverless for a non-Bedrock workload are almost certainly from a Knowledge Base. Service filter: Amazon OpenSearch Service, usage type: SearchOCU/IndexingOCU. See the Terraform Alerts article for the exact HCL.
  Why this specific threshold: $50 is ~4-5 days of floor billing. Low enough to catch quickly, high enough to suppress false positives from legitimate small OpenSearch workloads.
- Bedrock inference budget: 2× your current monthly average. Set this to twice whatever Bedrock currently costs per month. As you scale AI features, update it quarterly. This catches runaway inference (retry storms, misconfigured agents) before it doubles your AI spend.
- SageMaker endpoint budget: based on your active endpoint inventory. Count active endpoints × instance type cost × 730 hours/month. Set the budget at that number. Any overage means an endpoint was added without authorization or an existing endpoint was upgraded without updating the budget.
- Total account budget: three thresholds. 50% forecasted (early warning), 80% actual (investigate now), 100% actual (escalate). Route 100% actual to SNS → your incident channel, not just email. See Terraform Alerts article for the multi-threshold aws_budgets_budget resource.
- AWS Cost Anomaly Detection: service-level and AI-scoped monitors. Deploy aws_ce_anomaly_monitor for all services plus a scoped monitor for AI services (Bedrock, SageMaker, OpenSearch). Set the subscription threshold at $50 absolute AND 20% relative to suppress noise from small spend. Takes 10 days to build a baseline; Tier 1 budgets carry the load in the meantime.
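The budget arithmetic in this phase is simple enough to script, so the numbers can be regenerated whenever the endpoint inventory changes. The sketch below assumes hourly prices are supplied by the caller; the instance type name and the price used in the example are placeholders, not real AWS pricing. The anomaly function only illustrates the "absolute AND relative" gating described above and is not how AWS computes impact internally.

```python
HOURS_PER_MONTH = 730


def sagemaker_endpoint_budget(endpoint_counts, hourly_price):
    """Monthly budget: active endpoints x hourly instance cost x 730 hours.

    endpoint_counts: instance type -> number of active endpoints.
    hourly_price: instance type -> USD per hour (caller-supplied assumption).
    """
    return sum(
        count * hourly_price[instance_type] * HOURS_PER_MONTH
        for instance_type, count in endpoint_counts.items()
    )


def bedrock_budget(recent_monthly_costs):
    """Twice the average of recent monthly Bedrock costs."""
    return 2 * sum(recent_monthly_costs) / len(recent_monthly_costs)


def anomaly_should_notify(impact_usd, expected_usd, abs_threshold=50.0, rel_threshold=0.20):
    """True only when both the absolute and the relative threshold are met."""
    if expected_usd <= 0:
        # No baseline to compare against: fall back to the absolute threshold.
        return impact_usd >= abs_threshold
    return impact_usd >= abs_threshold and impact_usd / expected_usd >= rel_threshold


# Placeholder inventory and price, for illustration only.
print(sagemaker_endpoint_budget({"ml.example.large": 2}, {"ml.example.large": 0.5}))
print(bedrock_budget([100, 200, 300]))
print(anomaly_should_notify(impact_usd=60, expected_usd=1000))
```

Requiring both conditions is what suppresses noise: a $60 deviation on $1,000 of expected spend clears the absolute bar but not the relative one, so it stays quiet.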
Phase 4: Security Billing Sensors (Zero-Threshold, Not Spend-Threshold)
- New-region zero-threshold alarm. Any AWS region that has never appeared in an account's billing history is an immediate investigation: not a budget concern, a security event. Deploy the new-region Lambda from the Lambda Automations article (6-hour schedule, Cost Explorer query comparing last 48 hours against 90-day history). Add SUPPRESSED_REGIONS for any planned new region deployments. This is the cheapest, highest-signal security control you can add.
- New-service zero-threshold alert. Same pattern as new-region: any ServiceName appearing for the first time in an account. Particularly valuable for catching AI services provisioned without authorization; a SageMaker endpoint or Bedrock agent created outside your IaC process appears as a new service in accounts that have never used those services.
- CloudWatch billing alarm for estimated charges. The backstop for anything the behavioral sensors miss. Set at your expected daily spend × 1.5. Requires the us-east-1 provider in Terraform; billing metrics are only published in us-east-1 regardless of account region.
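Both zero-threshold sensors share one piece of logic: a set difference between what billed recently and what has ever billed before, minus anything explicitly suppressed. The sketch below shows only that comparison. It assumes the two lists have already been pulled from billing data, and that the history window excludes the recent window (otherwise a new region would appear in its own history). The function name is hypothetical and is not the Lambda from the Lambda Automations article.

```python
def first_seen(recent, history, suppressed=()):
    """Return values billed recently that never appear in history.

    recent: regions (or service names) with charges in the last 48 hours.
    history: regions (or service names) seen in the prior 90 days,
             assumed NOT to include the recent window.
    suppressed: planned additions to ignore, e.g. SUPPRESSED_REGIONS.
    """
    return sorted(set(recent) - set(history) - set(suppressed))


# Hypothetical data: ap-southeast-3 has never billed in this account.
recent_regions = ["us-east-1", "ap-southeast-3"]
history_regions = ["us-east-1", "eu-west-1"]

print(first_seen(recent_regions, history_regions))
print(first_seen(recent_regions, history_regions, suppressed=["ap-southeast-3"]))
```

The same function covers the new-service alert by passing service names instead of regions. Any non-empty result is the alert; there is no spend threshold to tune.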
Phase 5: Behavioral Detection Queries (The Layer Standard Tools Miss)
- Orphaned KB detection query (daily). SUM(billedcost WHERE servicename LIKE '%OpenSearch%' AND x_usagetype LIKE '%OCU%') cross-joined against SUM(billedcost WHERE servicename LIKE '%Bedrock%' AND x_usagetype LIKE '%InvokeModel%') per account per 30-day window. Fire when OCU cost is more than 10× inference cost and OCU cost exceeds $50. This is QB18 in the detection library.
- Idle SageMaker endpoint detection (weekly). SageMaker endpoint-hours in billing cross-referenced against CloudWatch Invocations metric. Any endpoint with 7+ days of billing and zero CloudWatch invocations is a deletion candidate.
- Egress-to-compute ratio query (weekly). DataTransfer-Out-Bytes cost growing faster than total compute cost in the same account. Ratio inversion (egress > 20% of compute cost) flags for review; it could be architectural (NAT Gateway misconfiguration) or security (exfiltration).
- Commitment utilization drift (monthly). Reserved Instance and Savings Plan rows where PricingCategory = 'Committed': check that ConsumedQuantity is tracking against committed quantity. Significant under-utilization on a commitment you are still paying for is an RI/SP waste signal.
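The firing conditions for the first three queries above can be written as small predicates that run over the aggregated query output. This is a minimal sketch under stated assumptions: the per-account cost totals and per-endpoint figures are already computed upstream (by Athena and CloudWatch), and all function names are hypothetical rather than taken from the detection library.

```python
def orphaned_kb_suspected(ocu_cost, inference_cost, ratio=10.0, floor_usd=50.0):
    """OCU cost exceeds the floor AND is more than `ratio` times inference cost."""
    return ocu_cost > floor_usd and ocu_cost > ratio * inference_cost


def idle_endpoints(billed_days, invocations, min_days=7):
    """Endpoints billed for at least `min_days` with zero recorded invocations.

    billed_days: endpoint name -> days with billing in the window.
    invocations: endpoint name -> invocation count (missing means zero).
    """
    return sorted(
        name
        for name, days in billed_days.items()
        if days >= min_days and invocations.get(name, 0) == 0
    )


def egress_ratio_flag(egress_cost, compute_cost, max_ratio=0.20):
    """Flag when egress exceeds `max_ratio` of compute cost."""
    if compute_cost <= 0:
        # Egress with no compute spend at all is worth a look on its own.
        return egress_cost > 0
    return egress_cost / compute_cost > max_ratio


# Hypothetical 30-day totals for one account.
print(orphaned_kb_suspected(ocu_cost=120.0, inference_cost=0.0))
print(idle_endpoints({"a": 9, "b": 9, "c": 3}, {"b": 12}))
print(egress_ratio_flag(egress_cost=30.0, compute_cost=100.0))
```

Keeping the thresholds as parameters makes it easy to tune them per account without touching the queries that feed them.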
Phase 6: Dashboards and Reporting (Last: Needs Clean Data From Phases 1-3)
- Daily cost by service chart. The operational chart: look at it every morning. Catches spikes, new services, and drift. See the Trend Charts article for the Athena query and Grafana configuration.
- OCU vs inference ratio chart. The AI waste indicator: the visual form of the orphaned KB detection query. A flat OCU line above a zero inference line is the unmistakable signature.
- Security heatmap panel. New region / new service activity in the last 7 days. Any lit cell is an investigation, not a review item.
- AI spend as % of total (monthly). The executive metric: is AI spend growing proportionately to value, or outpacing it? The FinOps Foundation benchmark: 18% at AI-forward enterprises. Know where you are relative to that.
The Ordering Principle
Every phase in this checklist is a prerequisite for the next. Data foundation before alerts (alerts without data are guesses). Tags before alerts (alerts without tags route to nobody). Cost alerts before security sensors (different cadence and threshold logic; don't mix them). Behavioral queries before dashboards (dashboards visualize what queries already compute). Do them in this order and each step compounds the previous one. Do them out of order and you will revisit them.
Find out which phases you have already completed
The DropInFinOps free assessment takes 2 minutes and maps your current billing setup against all six phases, showing where you are in the sequence and which gaps are costing you the most right now.