SageMaker Real-Time Inference Endpoints Bill When Nobody Is Calling Them

Your endpoint is InService. Invocations this week: 0. The $1,014/month charge is for a model that hasn't served a request since the demo.

Amazon SageMaker real-time inference endpoints cannot scale to zero. The compute instance runs from the moment you call CreateEndpoint until the moment you call DeleteEndpoint — regardless of whether the model receives 0 requests or 10,000. There is no sleep mode. There is no idle timeout. There is no automatic shutdown.

The billing model was designed for production APIs that need constant low-latency availability. The problem is that teams also use real-time endpoints for temporary work: stakeholder demos, A/B test challenger variants, experimental models, post-hackathon prototypes. When the demo ends and the project moves on, the endpoint stays running. The meter stays running with it.

What makes idle endpoints hard to catch: the billing row for a zero-traffic endpoint is structurally identical to the billing row for a fully-loaded one. Both show AmazonSageMaker as the service, USE1-Hosting:ml.g4dn.xlarge as the usage type, and the same per-hour rate. A FinOps tool scanning for cost anomalies finds nothing unusual — the endpoint has been billing at a flat rate since day one. The waste only becomes visible when you join the billing data with the invocation count from a second system: CloudWatch.

The Billing Signal

FOCUS / CUR Field	Active endpoint (10,000 req/day)	Idle endpoint (0 req/day)
`ServiceName`	`Amazon SageMaker`	`Amazon SageMaker` — identical
`x_UsageType`	`USE1-Hosting:ml.g4dn.xlarge`	`USE1-Hosting:ml.g4dn.xlarge` — identical
`x_Operation`	`RunInstance`	`RunInstance` — identical
`BilledCost` per hour	$0.736 (ml.g4dn.xlarge)	$0.736 — identical. The instance bills regardless of requests.
`ConsumedQuantity`	1.0 Hrs	1.0 Hrs — identical. One instance-hour regardless of invocations.
Invocation count	~417 req/hr (in CloudWatch)	0 (in CloudWatch) — this field does not exist in CUR/FOCUS
Temporal cost profile	Flat 24/7 (instance billing)	Flat 24/7 — no anomaly visible in billing alone

The detection condition: SageMaker Hosting usage billed for 7+ consecutive days, with fewer than 100 total invocations in that window across all variants. The billing row looks normal. The invocation count — which lives in CloudWatch, not CUR — is what reveals the waste.
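The detection condition can be written as a small predicate. This is a minimal sketch, not a reference implementation: the `EndpointWindow` record and its field names are hypothetical, and it assumes the billed-day count (from CUR/FOCUS) and the per-variant invocation totals (from CloudWatch) have already been collected.

```python
from dataclasses import dataclass

# Thresholds taken from the detection condition in the text.
MIN_BILLED_DAYS = 7
MAX_INVOCATIONS = 100


@dataclass
class EndpointWindow:
    """Hypothetical record describing one endpoint over one lookback window."""
    endpoint_name: str
    consecutive_billed_days: int   # days with SageMaker Hosting usage in billing data
    invocations_by_variant: dict   # variant name -> total Invocations from CloudWatch


def is_idle(window: EndpointWindow) -> bool:
    """True when the endpoint billed for 7+ days with fewer than 100 invocations in total."""
    total = sum(window.invocations_by_variant.values())
    return window.consecutive_billed_days >= MIN_BILLED_DAYS and total < MAX_INVOCATIONS
```

Summing across variants matters: an A/B endpoint with a busy champion and a silent challenger is not idle as a whole, even though one variant is.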

Real Incidents

These incidents are sourced from AWS re:Post threads, FinOps practitioner analysis, and published engineering post-mortems. None are fabricated.

~$136/day — demo endpoint left after experimentation (AWS re:Post): A user discovered a SageMaker endpoint that had been running unattended for weeks after an experimentation project wrapped. The rate was consistent with an ml.g5.12xlarge at $5.67/hr, or about $136/day. The endpoint was InService with a green status indicator. No alerts had fired. The cost had been accruing at a flat rate from day one.

~$81/day — ml.g5.4xlarge that couldn't be immediately deleted (AWS re:Post): A user attempting to clean up a development endpoint encountered a deletion failure: the endpoint was stuck in a transient state during a deployment update. At $3.39/hr, the endpoint billed about $81/day while the deletion was pending. The user had considered the endpoint "already gone."

Thousands of dollars — endpoint created and forgotten (AWS re:Post): A user reported an endpoint created for initial model validation that was never used in production and never deleted. The exact instance type was not reported, but the accumulated charge reached thousands of dollars before the billing alert triggered. The endpoint had zero production invocations from creation to discovery.

$3,000–$8,000/month — enterprise idle endpoint accumulation (CloudFix + CloudWise analysis): Across enterprise accounts, idle SageMaker endpoints from past projects — demos, concluded A/B tests, deprecated model versions — accumulate into significant monthly waste. Multiple endpoints at GPU rates, each individually unremarkable, sum to thousands per month. The pattern is chronic: no single event causes it, and no existing alert catches it.

40–50% compute cost reduction — EagleView geospatial AI migration (AWS inference cost optimization): EagleView was running GPU endpoints for a geospatial AI workload that did not require sub-second latency. Switching to SageMaker Async Inference, which scales to zero when the queue is empty, eliminated all idle-period charges. The 40–50% reduction came entirely from not paying for endpoints during the hours and days when no inference was needed.

Doesn't Cost Optimization Hub Already Flag This?

As of June 2026, AWS Compute Optimizer generates idle-resource recommendations for SageMaker endpoints, surfaced through Cost Optimization Hub. The criteria: an endpoint with zero invocations over a 14-day lookback window gets a recommendation to verify whether it's still needed, consider scale-to-zero for inference components, or delete it. This is a real, useful native capability, and it closes part of the gap this article describes.

Three things are still true after this update:

It's opt-in, not on by default. Idle recommendations only populate after an account or AWS Organization has opted into Compute Optimizer. An account that hasn't opted in gets nothing — the endpoint keeps billing, invisible, same as before.

The lookback window is twice as long as the detection condition in this article. AWS's own idle criteria wait 14 consecutive days of zero invocations before generating a recommendation. The 7-day CloudWatch alarm in Step 4 below catches the same endpoint a week earlier — on a $1,014/month ml.g5.xlarge, that's roughly $237 more that AWS's native signal lets burn before it says anything.

It's a recommendation you have to go find, not a push alert. Cost Optimization Hub surfaces the finding in a console and API — nobody gets paged. The endpoint sits there as a routine line item until someone opens the hub, filters to SageMaker, and reads the recommendation. The CloudWatch alarm in Step 4 pushes to SNS the moment the threshold is crossed; Cost Optimization Hub is a report you check, not an alert that finds you.

Net effect: if an organization has opted into Compute Optimizer and someone is checking Cost Optimization Hub regularly, idle SageMaker endpoints are less likely to run for months undiscovered. If it hasn't opted in — still the default state for an unconfigured account — nothing has changed.
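The cost of the longer lookback window is plain arithmetic. The sketch below reproduces the roughly $237 figure, using the ml.g5.xlarge rate quoted in this article; the function name is illustrative.

```python
HOURS_PER_DAY = 24


def delay_cost(hourly_rate: float, extra_days: int) -> float:
    """Dollars billed while waiting `extra_days` longer before anyone is told."""
    return round(hourly_rate * HOURS_PER_DAY * extra_days, 2)


# 14-day native lookback versus the 7-day alarm, at $1.408/hr.
print(delay_cost(1.408, 14 - 7))
```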

If you built it: what to look for and how to fix it

For platform engineers, ML engineers, and DevOps engineers who deployed a SageMaker real-time inference endpoint.

The SageMaker console shows your endpoint as InService. That status means the endpoint is functioning correctly — it is not an error indicator. It does not mean the endpoint has recent traffic. Here is how to trace the charge and decide what to do.

Step 1 — Find the endpoint in Cost Explorer

In AWS Cost Explorer: set Service = Amazon SageMaker, group by Usage Type. Look for rows labeled USE1-Hosting:ml.g4dn.xlarge, USE1-Hosting:ml.g5.xlarge, or any other Hosting:{instance-type} pattern. Each row aggregates all hosting hours for that instance type in that region, so a single row can cover more than one endpoint. If you see entries you cannot immediately attribute to a current production workload, continue to Step 2.
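If you export the Cost Explorer rows, the usage type can be split mechanically. This is a sketch under one assumption: the pattern is inferred only from the usage types quoted in this article (an optional region prefix, then `Hosting:`, then the instance type), so other formats may exist in your bill.

```python
import re

# Inferred from examples such as "USE1-Hosting:ml.g4dn.xlarge".
# Treating the region prefix as optional is an assumption.
HOSTING_USAGE = re.compile(
    r"^(?:(?P<region>[A-Z0-9]+)-)?Hosting:(?P<instance>ml\.[a-z0-9]+\.[a-z0-9]+)$"
)


def parse_hosting_usage(usage_type: str):
    """Return (region_prefix, instance_type), or None if this is not a Hosting row."""
    match = HOSTING_USAGE.match(usage_type)
    if match is None:
        return None
    return match.group("region"), match.group("instance")
```

Rows that return None (training, notebooks, storage) can be ignored for this check.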

Step 2 — Check invocations in CloudWatch

In CloudWatch Metrics: navigate to SageMaker → Endpoint Metrics. Select the endpoint name. Pull the Invocations metric for the last 7 days as a daily sum. If the graph is flat at zero — or shows a few scattered single-digit values — the endpoint has no real traffic. That is your idle endpoint.

Alternatively, from the CLI:

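# Invocations is published per variant, so both dimensions are required.
# Replace YOUR-VARIANT-NAME with the variant from the endpoint config (often AllTraffic).
# The date flags below are BSD/macOS syntax; on GNU/Linux use: date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name Invocations \
  --dimensions Name=EndpointName,Value=YOUR-ENDPOINT-NAME Name=VariantName,Value=YOUR-VARIANT-NAME \
  --start-time "$(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Sum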

If every Sum in the output is 0, or the Datapoints list is empty (CloudWatch returns no data point for a period in which the endpoint published no metric), the endpoint is idle.
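The same check can be scripted across variants. The sketch below assumes the Invocations metric is published per EndpointName and VariantName pair, and it takes the CloudWatch client as a parameter (something that behaves like `boto3.client("cloudwatch")`) so it can be exercised without AWS access. The function name and the `variant_names` argument are illustrative.

```python
from datetime import datetime, timedelta, timezone


def total_invocations(cloudwatch, endpoint_name, variant_names, days=7, now=None):
    """Sum the Invocations metric across variants for the last `days` days."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    total = 0.0
    for variant in variant_names:
        response = cloudwatch.get_metric_statistics(
            Namespace="AWS/SageMaker",
            MetricName="Invocations",
            Dimensions=[
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant},
            ],
            StartTime=start,
            EndTime=end,
            Period=86400,          # one data point per day
            Statistics=["Sum"],
        )
        # Days with no data are absent from Datapoints, so an idle endpoint sums to 0.
        total += sum(point["Sum"] for point in response.get("Datapoints", []))
    return total
```

Variant names can be read from the endpoint's configuration; a total below your threshold over the window marks the endpoint as a deletion candidate.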

Step 3 — Delete the endpoint

If you confirm the endpoint has no current traffic and no scheduled use:

aws sagemaker delete-endpoint --endpoint-name YOUR-ENDPOINT-NAME

Deletion is immediate and irreversible. The billing stops at the hour boundary after deletion. Deleting on day 15 of a month saves approximately 50% of that month's charge. There is no partial-month credit — you pay for every hour up to deletion and nothing after.

If the model might be needed again, consider switching to Serverless Inference (sub-5-second latency tolerance) or Async Inference (latency-insensitive batch workloads). Both scale to zero during idle periods. For net-new endpoints, AWS's Inference Components architecture (re:Invent 2024) supports MinInstanceCount: 0 with an automatic idle timeout — this enables scale-to-zero on real-time endpoints built using the newer deployment model. Classic real-time endpoint configurations (the most common pattern and the default in most AWS documentation) still require a minimum of one instance and must be explicitly deleted to stop billing. See the fix checklist at the end of this article.

Step 4 — Set a CloudWatch alarm so this doesn't happen again

AWS does not alarm on idle endpoints by default. Add one:

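# VARIANT_NAME is the production variant to watch (often AllTraffic); create one alarm per variant.
aws cloudwatch put-metric-alarm \
  --alarm-name "SageMaker-${ENDPOINT_NAME}-idle-7d" \
  --metric-name Invocations \
  --namespace AWS/SageMaker \
  --dimensions "Name=EndpointName,Value=${ENDPOINT_NAME}" "Name=VariantName,Value=${VARIANT_NAME}" \
  --statistic Sum \
  --period 86400 \
  --evaluation-periods 7 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:${ACCOUNT_ID}:your-alert-topic" \
  --treat-missing-data breaching

This alarms after 7 consecutive days with fewer than 1 invocation/day. treat-missing-data breaching ensures that a completely dark endpoint (zero CloudWatch data points, not just zero values) also triggers the alarm. Both dimensions matter: the Invocations metric is published per variant, so an alarm keyed on EndpointName alone matches no data and, with missing data treated as breaching, would fire for every endpoint.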
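To create the alarm for many endpoints, the settings can be built once in code. This sketch only assembles the keyword arguments for boto3's `put_metric_alarm`; the helper name is hypothetical, and the VariantName dimension reflects the assumption that Invocations is published per variant.

```python
def idle_alarm_params(endpoint_name, variant_name, sns_topic_arn, idle_days=7):
    """Build put_metric_alarm keyword arguments for an idle-endpoint alarm."""
    return {
        "AlarmName": f"SageMaker-{endpoint_name}-idle-{idle_days}d",
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 86400,                 # daily buckets
        "EvaluationPeriods": idle_days,  # consecutive idle days before alarming
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching", # no data at all also counts as idle
        "AlarmActions": [sns_topic_arn],
    }
```

With credentials configured, the call would look like `boto3.client("cloudwatch").put_metric_alarm(**idle_alarm_params(name, variant, topic_arn))`.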

If you watch the bill: how to detect this at scale

For FinOps practitioners, cloud finance analysts, and platform teams monitoring spend across multiple accounts.

The per-account Cost Explorer approach catches single-account waste. At multi-account scale, idle endpoints accumulate across dozens of accounts — each instance individually small enough to go unnoticed, collectively significant. The right detection approach starts with a query against your FOCUS or CUR data lake.

Detection query (FOCUS/CUR — Athena)

WITH endpoint_daily AS (
    SELECT
        subaccountid,
        resourceid,
        CAST(chargeperiodstart AS DATE)           AS charge_date,
        SUM(billedcost)                           AS daily_cost,
        SUM(consumedquantity)                     AS endpoint_hours,
        SUM(COALESCE(x_invocation_count, 0))      AS daily_invocations
    FROM your_focus_table
    WHERE servicename LIKE '%SageMaker%'
      AND x_usagetype LIKE '%Hosting%'
      AND chargecategory = 'Usage'
      AND chargeclass IS NULL  -- FOCUS populates ChargeClass only for corrections
      AND CAST(chargeperiodstart AS DATE) >= DATE_ADD('day', -7, CURRENT_DATE)
    GROUP BY 1, 2, 3
),
endpoint_summary AS (
    SELECT
        subaccountid,
        resourceid,
        SUM(daily_cost)                           AS endpoint_cost_7d,
        SUM(daily_invocations)                    AS total_invocations_7d,
        ROUND(SUM(daily_cost) * (30.0 / 7.0), 2) AS projected_monthly_cost
    FROM endpoint_daily
    GROUP BY 1, 2
)
SELECT
    subaccountid,
    resourceid,
    ROUND(endpoint_cost_7d, 2)    AS cost_last_7d,
    total_invocations_7d,
    projected_monthly_cost
FROM endpoint_summary
WHERE endpoint_cost_7d > 10.0
  AND total_invocations_7d < 100
ORDER BY endpoint_cost_7d DESC

Note on x_invocation_count: Standard CUR/FOCUS data does not include invocation counts for SageMaker real-time endpoints — instance-hour billing has no per-request rows. This field must be pre-populated by joining CUR data with CloudWatch invocation metrics before loading into your data lake. If you are using native CUR without CloudWatch enrichment, replace the invocation filter with a manual CloudWatch check for any endpoint that shows SageMaker Hosting cost with a flat, unvarying 24/7 cost profile. The flat-cost-profile heuristic catches ~80% of idle endpoints before the CloudWatch join is needed.
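The enrichment step is a left join of billing rows to CloudWatch totals. The sketch below shows the shape of that join in plain Python. The row layout and key names are hypothetical, and it assumes the CloudWatch counts have already been mapped to the same resource identifier that appears in the billing data (billing rows typically carry an ARN, while CloudWatch is keyed by endpoint name).

```python
from collections import defaultdict


def enrich_with_invocations(billing_rows, invocation_rows):
    """Attach x_invocation_count to each billing row, defaulting to 0 when no metric exists."""
    counts = defaultdict(int)
    for row in invocation_rows:
        # Several variants may report for the same endpoint and day, so accumulate.
        counts[(row["resource_id"], row["charge_date"])] += row["invocations"]
    return [
        {**row, "x_invocation_count": counts.get((row["resource_id"], row["charge_date"]), 0)}
        for row in billing_rows
    ]
```

Defaulting missing keys to 0 is what makes a completely silent endpoint visible: it has billing rows but no CloudWatch rows at all.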

Dollar impact math

A single idle ml.g5.xlarge runs $1.408/hr. That is $33.79/day, $1,014/month. At 10 accounts each with one forgotten demo endpoint:

Instance type	Rate/hr	Rate/month	10 accounts
ml.g4dn.xlarge	$0.736	$530	$5,300/mo
ml.g5.xlarge	$1.408	$1,014	$10,140/mo
ml.g5.4xlarge	$2.030	$1,462	$14,620/mo
ml.g5.12xlarge	$7.090	$5,105	$51,050/mo

On-demand SageMaker real-time inference hosting rates, US East (N. Virginia), verified against current AWS pricing as of July 2026. Rates change over time — some of these instance types have gotten cheaper since this article was first published, others more expensive.
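The table can be reproduced from the hourly rates. This sketch uses the rates quoted above and a 30-day month (720 hours); the ten-account column multiplies the rounded monthly figure, which is how the table appears to have been built.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, as in the table

# Hourly rates as quoted in this article; check current pricing before relying on them.
RATES = {
    "ml.g4dn.xlarge": 0.736,
    "ml.g5.xlarge": 1.408,
    "ml.g5.4xlarge": 2.030,
    "ml.g5.12xlarge": 7.090,
}


def monthly_cost(hourly_rate: float) -> int:
    """Whole-dollar cost of one instance running 24/7 for a 30-day month."""
    return round(hourly_rate * HOURS_PER_MONTH)


for name, rate in RATES.items():
    per_endpoint = monthly_cost(rate)
    print(f"{name}: ${per_endpoint:,}/mo, 10 accounts ${per_endpoint * 10:,}/mo")
```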

GPU endpoints at enterprise scale accumulate quickly. The $3,000–$8,000/month figures in the incident section reflect multi-endpoint accumulation across a single enterprise account — not an edge case.

How to report this to the team that caused it

The team usually has no idea. The endpoint was created for a real, legitimate reason — a demo, a test, an A/B experiment. They finished what they were doing, moved on, and never thought about the endpoint again. The framing that works: "you have an endpoint from [project name] still running. It's at $X/month with zero traffic. Do you need it?" Most of the time the answer is no, and the delete happens that day.

The framing that doesn't work: "you're wasting cloud budget." That creates defensiveness. You want action, not a discussion. Give them the endpoint name, the dollar figure, and one yes/no question.

If you own the outcome: the governance gap and how to close it

For engineering managers, VP Engineering, and CTOs who see a SageMaker line item that nobody can explain.

Your ML team did nothing wrong. They deployed an endpoint to show stakeholders something working. The demo went well. The endpoint stayed running because SageMaker does not have an auto-shutdown mechanism, and nobody's process included a step for "delete the demo endpoint afterward."

The governance gap has three parts:

1. No lifecycle policy for inference endpoints. EC2 instances can be scheduled off. RDS can be stopped. Lambda costs nothing at zero invocations. SageMaker real-time endpoints are different: they run continuously from creation to deletion, and deletion must be explicit. Most teams have a process for provisioning endpoints, and no process for decommissioning them.

2. The billing doesn't reveal the problem. Your FinOps dashboard shows SageMaker spend at a flat, expected level. No spike, no anomaly. The cost has been accruing since the endpoint was created, so it appears in every monthly baseline. Only when you enrich billing data with CloudWatch invocation counts does the waste become visible. Standard FinOps tools don't do this join by default.

3. Ownership evaporates when projects close. The engineer who created the endpoint for the Q3 demo has since moved on to next year's Q1 project. The endpoint still carries their team tag, but nobody is actively watching it. The cost attribution is correct. The ownership is orphaned.

The decision

For most organizations, the right answer is a two-part fix: (1) delete all endpoints that have had zero invocations for 7+ days with no current project association, and (2) implement a tagging and lifecycle policy that prevents this from recurring.

The delete step has a clear financial case. At GPU instance rates, a 30-day cleanup cycle on idle endpoints typically recovers 10–20% of total SageMaker spend in organizations that don't currently have this process. The savings are immediate — the month you run the cleanup, the bill drops. The month after, it stays lower if the lifecycle policy is in place.

The policy step closes the governance gap. Every endpoint needs an ExpiresOn tag at creation. A Lambda function checks invocations weekly and escalates to the tagged owner before auto-deleting. AWS published a reference implementation for this pattern in their ML blog. The implementation is a few hours of work. The ongoing cost of not having it is measured in GPU-hours per month, indefinitely.
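A tag check is the simplest piece of that policy to automate. The sketch below validates the Owner, Project, and ExpiresOn tags named in the fix checklist. It assumes ExpiresOn holds an ISO date (YYYY-MM-DD), which is a convention chosen here for illustration, not something AWS prescribes.

```python
from datetime import date

REQUIRED_TAGS = ("Owner", "Project", "ExpiresOn")


def tag_problems(tags: dict, today: date) -> list:
    """Return human-readable problems with an endpoint's lifecycle tags."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    expires = tags.get("ExpiresOn")
    if expires:
        try:
            if date.fromisoformat(expires) < today:
                problems.append("ExpiresOn is in the past")
        except ValueError:
            problems.append("ExpiresOn is not an ISO date (YYYY-MM-DD)")
    return problems
```

An empty list means the endpoint has an owner to contact and a date after which it may be challenged.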

The alternative — accepting idle endpoint waste as a cost of doing ML — is a choice some organizations make deliberately. It is rational if the cost is small relative to engineering time. It becomes irrational when the total reaches thousands per month and still has no line of sight in the standard tooling.

Fix checklist

Delete idle endpoints immediately. If invocations have been near zero for 7+ days and there is no active use case, delete the endpoint now: aws sagemaker delete-endpoint --endpoint-name NAME. Billing stops at the next hour boundary. Deleting on day 15 saves 50% of the month.
Add a CloudWatch Invocations alarm. AWS does not alert on idle endpoints by default. Set an alarm on the Invocations metric for 7 consecutive days below 1 invocation/day with treat-missing-data breaching. Route to endpoint owner via SNS.
Evaluate Serverless Inference for non-latency-critical workloads. SageMaker Serverless Inference scales to zero automatically and costs $0 when idle. Trade-off: 100–500ms cold start latency on the first request. For internal tools and non-interactive use cases, the cold start is acceptable and the billing model is dramatically more appropriate.
Evaluate Async Inference for batch-capable workloads. SageMaker Async Inference scales to zero when the queue is empty. For any workload that can tolerate seconds-to-minutes of processing latency — document processing, batch scoring, non-real-time model calls — async inference eliminates idle-period charges entirely.
Enforce tagging at endpoint creation. Require Owner, Project, and ExpiresOn tags via a Service Control Policy. An untagged endpoint has no owner to contact when the invocations drop to zero. A tagged endpoint can be decommissioned automatically or with a single escalation email.
Implement an endpoint lifecycle Lambda. AWS published a reference implementation in "Identify idle endpoints in Amazon SageMaker" (ML blog, 2023). The pattern: a Lambda runs weekly, checks CloudWatch Invocations for all endpoints, sends a warning to the tagged owner after 3 days of zero traffic, and auto-deletes after 14 days with no response. Three hours to implement. Prevents the pattern from recurring.
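The decision logic of such a Lambda fits in a few lines. This is a sketch of the warn-then-delete pattern described in the last checklist item, not AWS's reference implementation. It assumes both thresholds are measured in consecutive zero-traffic days and that an owner response simply suppresses further action; the function and constant names are illustrative.

```python
WARN_AFTER_IDLE_DAYS = 3     # first warning to the tagged owner
DELETE_AFTER_IDLE_DAYS = 14  # auto-delete if the owner never responded


def lifecycle_action(idle_days: int, owner_responded: bool) -> str:
    """Return "none", "warn", or "delete" for one endpoint on one weekly run."""
    if idle_days < WARN_AFTER_IDLE_DAYS or owner_responded:
        return "none"
    if idle_days >= DELETE_AFTER_IDLE_DAYS:
        return "delete"
    return "warn"
```

Keeping the decision separate from the AWS calls makes it easy to dry-run the policy against last month's data before letting it delete anything.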

Sources

Announcing Six New Idle Resource Recommendations in AWS Compute Optimizer — AWS Cloud Financial Management blog, June 2026.
Viewing idle resource recommendations — AWS Compute Optimizer documentation, idle criteria per resource (including SageMaker endpoints).
Identifying opportunities with Cost Optimization Hub — AWS Cost Management documentation, supported resources list.
Amazon SageMaker AI Pricing — current on-demand hosting rates, verified July 2026.

