What LLM-Assisted FinOps Actually Looks Like in Production

Most "AI-powered" cost tools do attribution. The useful version does detection — finding what's wrong, not just what's expensive.

AI cloud spend has become a first-tier budget concern faster than most organizations built the tooling to manage it. Every major cloud platform now offers AI services with billing models that differ fundamentally from compute: token-based pricing, per-OCU floors, invocation-based models with no clear ceiling. Teams that understand EC2 and RDS billing discover that inference, vector stores, and agent loops follow completely different cost dynamics — and that the dashboards they built for compute don't catch what's wrong with AI spend.

The problem is not the dashboards. Every major FinOps platform now has an "AI cost" tab. CloudZero breaks down inference cost by model, team, and unit. Apptio shows you which accounts are driving AI spend. Vantage lets you filter by Bedrock service. All of them will tell you how much you spent and where it went.

None of them will tell you what's wrong with it.

That is a different problem. And it is where language models — used correctly — actually change what is possible.

The Gap: Attribution vs. Detection

Attribution tools answer: "Who spent this money, and on what?" Detection answers: "Why is this cost pattern wrong, and what should someone do about it?" These require different architectures. Attribution is a grouping and labeling problem. Detection is a pattern-matching and reasoning problem.

Consider two real incidents from 2025:

$5,000 for a single query (Snowflake Cortex AI): A data team ran a Cortex Functions query against a table with 1.18 billion records. The cost was nearly $5,000 in credits — for a single execution. Unlike warehouse compute costs, Cortex AI charges are token-based, not time-based. The team had no AI-specific resource monitors. No alert fired. The bill appeared at month-end under a service line they had never seen before. An attribution tool would have shown the charge under "Snowflake Cortex" after the fact. A detection system would have flagged that a single AI function call consumed more credits in one execution than the team's entire prior month of AI spend — and would have escalated it before the query finished.

400% weekend spend spike from a missing parameter: A team's inference cost jumped from $600/day to $2,400/day over a single weekend. The cause: a missing max_retries cap on an LLM API wrapper caused a retry storm when the upstream model returned occasional rate-limit errors. The billing rows were clean — correct model, correct service, no anomalous usage type. The signal was purely in the cost magnitude relative to the prior 3-day baseline. A rules-based alerting system set to fire on absolute thresholds did not catch it because the daily cost was still within the team's monthly budget. A pattern-based detection system would have flagged 4× baseline on a weekend as an immediate anomaly requiring explanation.

In both cases, the billing data had the signal. The gap was in asking the right question and explaining what it meant.

Why the Billing Dataset Is Ideal LLM Input

Language models are often misapplied to cost data — asked to find patterns in raw numbers rather than explain patterns that have already been found. The effective architecture inverts this: detection happens in SQL, narration happens in the LLM. This produces better results because:

SQL is deterministic. A query that computes ocu_cost_30d / bedrock_inference_cost_30d and returns 14.3 is a fact. Asking an LLM to "find cost anomalies" from raw billing data is asking it to hallucinate patterns from numbers it cannot reliably reason about arithmetically.
Billing data is high-cardinality structured data. A FOCUS billing export for a medium-sized AWS account can contain 10–50 million rows per month. No LLM context window handles this raw. But the output of a behavioral query — "account X has $347 in OCU charges and $12 in Bedrock inference in the same 30-day window" — fits in a few hundred tokens and contains everything needed for an accurate explanation.
The "why" requires cross-service context. LLMs excel at holding multiple related facts simultaneously: "The OpenSearch OCU charge has the naming prefix bedrock-knowledge-base-, there are no active Bedrock Knowledge Bases in this account, and the OCU billing has been perfectly flat for 22 days." A human reading those three facts reaches the correct conclusion immediately. An LLM given those same structured facts reaches the same conclusion — reliably, at scale, across 50 accounts at once.

The Briefing Builder Pattern

The architecture that makes LLM-assisted FinOps useful in production is the briefing builder — a three-stage pipeline that separates detection, context assembly, and narration.

Stage 1 — Behavioral query layer (SQL)

A library of behavioral queries runs against the FOCUS billing dataset. Each query is designed to surface a specific pattern — not a cost spike, but a billing signature. The distinction matters:

Threshold-based alert	Behavioral query
"OpenSearch spend exceeded $200 this month"	"OpenSearch OCU charges present with zero corresponding Bedrock inference in same account/period — ratio: 14.3×"
"Bedrock spend up 40% week-over-week"	"Inference cost for `InvokeModel` increased 40% while token count held flat — cost-per-token ratio shifted, model may have changed"
"New service detected: Amazon SageMaker"	"New `ServiceName` value in account with no prior history — zero-threshold rule triggered, new region: us-west-2"

Behavioral queries use the FOCUS field set: ServiceName, x_UsageType, ConsumedQuantity, BilledCost, SubAccountId, ChargeClass, ResourceId. They run on a schedule — typically hourly for security-adjacent patterns, daily for cost patterns. Each query either returns zero rows (no anomaly) or a result set that becomes structured context.

Stage 2 — Context assembly

When a behavioral query fires, the result set is assembled into a structured context block. The goal is precision, not completeness — the LLM receives exactly the facts needed to explain and recommend, nothing more. A context block for the orphaned Knowledge Base pattern looks like:

{
  "pattern": "orphaned_kb_ocu",
  "severity": "High",
  "account": "prod-ai-team",
  "ocu_cost_30d": 347.52,
  "bedrock_inference_cost_30d": 12.40,
  "waste_ratio": 28.0,
  "ocu_collection_count": 1,
  "resource_id": "bedrock-knowledge-base-a3f7c2d1",
  "billing_start": "2026-04-01",
  "signal": "OpenSearch Serverless OCU billing present with near-zero Bedrock inference in same account/period",
  "behaviors": [
    "ocu_cost_exceeds_inference_10x",
    "bedrock_inference_absent",
    "orphaned_collection_post_kb_delete"
  ]
}

This context is 200 tokens. It contains all the signal the LLM needs: what pattern fired, the dollar magnitude, the ratio that triggered it, the resource identifier, and the behavioral labels that name the mechanism. The LLM does not need to see the raw billing rows. It needs the structured output of the detection layer.

Stage 3 — LLM narration

The LLM receives a system prompt that anchors it to the cloud cost domain and defines output format, plus the structured context block. Its job is to produce an explanation a platform engineer and a VP of Engineering can both read:

What is happening, in plain English
Why it happened (the mechanism — not a guess)
What the dollar impact is
What the fix is, with specific steps
Whether anything else in the account is likely affected

The output is not a dashboard widget. It is a briefing — a paragraph or two that a human can act on immediately, with resource IDs, dollar amounts, and console steps embedded inline. The engineer who reads it knows exactly what to do in the next 10 minutes.

What Makes a Good LLM Briefing

The quality of the briefing is determined entirely by the quality of the structured context, not by the LLM model. The same behavioral query output fed to different LLMs produces comparable results — because the facts are the same. This is the key design insight: invest in the detection layer, not in the model.

A good briefing has four properties:

Named resource IDs. "The OpenSearch Serverless collection bedrock-knowledge-base-a3f7c2d1" — not "an OpenSearch collection." Engineers use the resource ID to find the resource. Without it, the briefing is a report, not an action item.
Dollar anchors. "$347 over the past 30 days, $11.52/day, billing since 2026-04-08" — not "significant OpenSearch charges." Dollar precision enables prioritization and makes the case to stakeholders in one sentence.
The mechanism, not just the symptom. "AWS does not cascade the deletion when a Bedrock Knowledge Base is removed — the OpenSearch collection must be deleted separately from the OpenSearch console" — not "the collection may be orphaned." The engineer should understand why this happened, not just what happened.
Specific fix steps. "Console → Amazon OpenSearch Service → Serverless → Collections → delete bedrock-knowledge-base-a3f7c2d1" — not "investigate the collection." The briefing is done when a new hire could execute the fix without asking anyone.

A bad briefing — one that lacks resource IDs, uses hedged language, or explains the symptom without the mechanism — is worse than no briefing. It generates alert fatigue. Engineers stop reading them.

The Security-Adjacent Case

The same briefing builder architecture applies to security billing signals. A compromised cloud account leaves a cost fingerprint before it leaves a log fingerprint: GPU-class compute appears in an account with no prior GPU usage, data egress spikes in a region the account has never written to, new EC2 instance types appear that match known cryptomining profiles.

These patterns are behavioral — not absolute thresholds — and they require cross-service cross-account reasoning that rules-based alerting cannot perform. A behavioral query that checks: "new ServiceName + new Region combination in the last 24 hours, for any account with no prior history of this combination" — is a zero-false-positive security signal. It fires on the first dollar of a cryptomining campaign, from a data source that is generated by AWS's billing systems rather than the compromised instance itself. An attacker scrubbing CloudTrail or disabling GuardDuty has no reason to think about the billing audit trail — it runs on a different control plane and cannot be suppressed from within the account.

The LLM receives the same structured context — new service, new region, first timestamp, dollar amount, instance type — and produces a briefing that reads: "A GPU-class EC2 instance (p3.8xlarge) appeared in us-west-2 at 02:14 UTC. This account has no prior GPU spend and no prior us-west-2 activity. The cost profile matches cryptocurrency mining workloads. Immediate action: verify this instance was authorized, and if not, terminate it and rotate the credentials used to launch it."

One pipeline. One detection architecture. Two use cases — cost optimization and security response — reading the same FOCUS billing dataset.

Why FOCUS Input Quality Determines Output Quality

The briefing builder depends on FOCUS-formatted billing data, not the legacy CUR format. The difference is significant:

Field need	CUR (legacy)	FOCUS 1.0
Cross-service pattern detection	`product/serviceName` — inconsistent casing, no standard values	`ServiceName` — standardized across providers
Usage type for AI billing	`lineItem/UsageType` — format varies by service	`x_UsageType` — `SearchOCU`, `InvokeModelInference` — queryable
Resource attribution	Partial — some services omit ResourceId	`ResourceId` — required field, present for all billable resources
Multi-cloud normalization	AWS-only schema	Same schema across AWS, Azure, GCP

A behavioral query that joins on ServiceName LIKE '%OpenSearch%' and ServiceName LIKE '%Bedrock%' runs cleanly against FOCUS data. The equivalent CUR query requires service-code lookups, inconsistent field formats, and account-level joins that break under consolidated billing. The upstream data quality is what makes the detection layer reliable — and therefore what makes the LLM briefings trustworthy.

What This Is Not

LLM-assisted FinOps is not a chatbot that answers questions about your bill. That is a different product — one that is useful for exploration but not for operational alerting. A chatbot requires a human to know what question to ask. A briefing builder fires when it finds a pattern, without anyone asking.

It is also not "AI that finds anomalies." The AI does not find anomalies. The behavioral queries find anomalies. The AI explains what the queries found, in language that both an engineer and a VP can act on. That distinction matters for two reasons: first, it means the detection is auditable — you can read the SQL and understand exactly what it fires on. Second, it means the LLM never hallucinates a cost anomaly that does not exist in the billing data, because it is not in the anomaly-finding loop.

Implementation Checklist

Enable FOCUS-formatted billing exports. AWS Data Exports (the CUR successor), not legacy CUR. FOCUS 1.0 schema is the input format the behavioral query layer expects. Configure a daily export to S3 with Parquet format.
Build the query layer before the LLM layer. Write behavioral SQL queries that fire on specific patterns. Validate each against real billing data before wiring the LLM. If the query does not produce the right result rows, the briefing will not be useful regardless of the model.
Structure the context, do not dump raw data. The LLM context block should contain pattern name, dollar amounts, resource IDs, behavioral labels, and the first/last timestamps. Nothing else. 200–500 tokens per anomaly is the target.
Define output format in the system prompt. Specify: resource ID required, dollar amount required, mechanism required, fix steps required. LLMs without output constraints produce inconsistent briefings — consistent briefings are what prevent alert fatigue.
Add the security zero-threshold rule. New ServiceName + new Region combination with no prior account history = immediate alert. This fires on cryptomining, credential compromise, and shadow IT deployments — from a data source the attacker has no reason to tamper with, independent of the log pipeline a sophisticated attacker might disrupt.
Run behavioral queries on a schedule, not on demand. Hourly for security patterns. Daily for cost patterns. The gap between when a pattern starts and when someone notices it is where the damage accumulates — an orphaned KB collection costs $11.52/day, every day until deletion.

See which billing patterns your setup is already positioned to catch

The DropInFinOps free assessment takes 2 minutes and maps your current billing export setup against the behavioral query library — showing which anomaly patterns are detectable today and which ones require a FOCUS migration or additional instrumentation.

Take the free assessment →

AI in FinOps

What LLM-Assisted FinOps Actually Looks Like in Production

The Gap: Attribution vs. Detection

Why the Billing Dataset Is Ideal LLM Input

The Briefing Builder Pattern

Stage 1 — Behavioral query layer (SQL)

Stage 2 — Context assembly

Stage 3 — LLM narration

What Makes a Good LLM Briefing

The Security-Adjacent Case

Why FOCUS Input Quality Determines Output Quality

What This Is Not

Implementation Checklist

More from our guides

What is FinOps?

Common AWS Cost Mistakes

Practical AWS Lambda Automations

AI in FinOps

What LLM-Assisted FinOps Actually Looks Like in Production

The Gap: Attribution vs. Detection

Why the Billing Dataset Is Ideal LLM Input

The Briefing Builder Pattern

Stage 1 — Behavioral query layer (SQL)

Stage 2 — Context assembly

Stage 3 — LLM narration

What Makes a Good LLM Briefing

The Security-Adjacent Case

Why FOCUS Input Quality Determines Output Quality

What This Is Not

Implementation Checklist

More from our guides

What is FinOps?

Common AWS Cost Mistakes

Practical AWS Lambda Automations

Privacy & Cookie Notice

1. What We Collect

2. How We Use Your Data

3. Where Your Data Lives

4. Legal Basis for Processing

5. Data Retention

6. Cookies & Tracking

7. Your Rights (GDPR)

8. Security

9. Updates