SageMaker Real-Time Inference Endpoints Bill When Nobody Is Calling Them

Your endpoint is InService. Invocations this week: 0. The $724/month charge is for a model that hasn't served a request since the demo.

Amazon SageMaker real-time inference endpoints cannot scale to zero. The compute instance runs from the moment you call CreateEndpoint until the moment you call DeleteEndpoint, regardless of whether the model receives 0 requests or 10,000. There is no sleep mode. There is no idle timeout. There is no automatic shutdown.

The billing model was designed for production APIs that need constant low-latency availability. The problem is that teams also use real-time endpoints for temporary work: stakeholder demos, A/B test challenger variants, experimental models, post-hackathon prototypes. When the demo ends and the project moves on, the endpoint stays running. The meter stays running with it.

What makes idle endpoints hard to catch: the billing row for a zero-traffic endpoint is structurally identical to the billing row for a fully-loaded one. Both show AmazonSageMaker as the service, USE1-Hosting:ml.g4dn.xlarge as the usage type, and the same per-hour rate. A FinOps tool scanning for cost anomalies finds nothing unusual: the endpoint has been billing at a flat rate since day one. The waste only becomes visible when you join the billing data with the invocation count from a second system: CloudWatch.

The Billing Signal

FOCUS / CUR Field | Active endpoint (10,000 req/day) | Idle endpoint (0 req/day)
ServiceName | Amazon SageMaker | Amazon SageMaker (identical)
x_UsageType | USE1-Hosting:ml.g4dn.xlarge | USE1-Hosting:ml.g4dn.xlarge (identical)
x_Operation | RunInstance | RunInstance (identical)
BilledCost per hour | $0.736 (ml.g4dn.xlarge) | $0.736 (identical; the instance bills regardless of requests)
ConsumedQuantity | 1.0 Hrs | 1.0 Hrs (identical; one instance-hour regardless of invocations)
Invocation count | ~417 req/hr (in CloudWatch) | 0 (in CloudWatch); this field does not exist in CUR/FOCUS
Temporal cost profile | Flat 24/7 (instance billing) | Flat 24/7; no anomaly visible in billing alone

The detection condition: SageMaker Hosting usage billed for 7+ consecutive days, with fewer than 100 total invocations in that window across all variants. The billing row looks normal. The invocation count, which lives in CloudWatch rather than CUR, is what reveals the waste.
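The detection condition can be written down as a small predicate. This is a minimal sketch, not part of any AWS API: the `EndpointWindow` type and its field names are hypothetical, and the thresholds (7 days, 100 invocations) are simply the ones stated above.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointWindow:
    """Hypothetical record describing one endpoint over a lookback window."""
    endpoint_name: str
    days_billed: int  # consecutive days with SageMaker Hosting charges
    invocations_by_variant: dict = field(default_factory=dict)  # variant -> count


def is_idle(window, min_days=7, max_invocations=100):
    """Return True when the endpoint has billed for at least min_days
    and received fewer than max_invocations across all variants."""
    total = sum(window.invocations_by_variant.values())
    return window.days_billed >= min_days and total < max_invocations
```

Summing across variants matters for A/B endpoints: a challenger variant with zero traffic does not make the endpoint idle if the champion variant is still serving requests.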

Real Incidents

These incidents are sourced from AWS re:Post threads, FinOps practitioner analysis, and published engineering post-mortems. None are fabricated.

~$136/day - demo endpoint left after experimentation (AWS re:Post): A user discovered a SageMaker endpoint that had been running unattended for weeks after an experimentation project wrapped. The rate was consistent with an ml.g5.12xlarge at $5.67/hr, or about $136/day. The endpoint was InService with a green status indicator. No alerts had fired. The cost had been accruing at a flat rate from day one.

~$81/day - ml.g5.4xlarge that couldn't be immediately deleted (AWS re:Post): A user attempting to clean up a development endpoint encountered a deletion failure: the endpoint was stuck in a transient state during a deployment update. At $3.39/hr, the endpoint billed $81/day while the deletion was pending. The user had considered the endpoint "already gone."

Thousands of dollars - endpoint created and forgotten (AWS re:Post): A user reported an endpoint created for initial model validation that was never used in production and never deleted. The exact instance type was not reported, but the accumulated charge reached thousands of dollars before the billing alert triggered. The endpoint had zero production invocations from creation to discovery.

$3,000–$8,000/month - enterprise idle endpoint accumulation (CloudFix + CloudWise analysis): Across enterprise accounts, idle SageMaker endpoints from past projects (demos, concluded A/B tests, deprecated model versions) accumulate into significant monthly waste. Multiple endpoints at GPU rates, each individually unremarkable, sum to thousands per month. The pattern is chronic: no single event causes it, and no existing alert catches it.

40–50% compute cost reduction - EagleView geospatial AI migration (AWS inference cost optimization): EagleView was running GPU endpoints for a geospatial AI workload that did not require sub-second latency. Switching to SageMaker Async Inference, which scales to zero when the queue is empty, eliminated all idle-period charges. The 40–50% reduction came entirely from not paying for endpoints during the hours and days when no inference was needed.


If you built it: what to look for and how to fix it

For platform engineers, ML engineers, and DevOps engineers who deployed a SageMaker real-time inference endpoint.

The SageMaker console shows your endpoint as InService. That status means the endpoint is functioning correctly; it is not an error indicator. It does not mean the endpoint has recent traffic. Here is how to trace the charge and decide what to do.

Step 1: Find the endpoint in Cost Explorer

In AWS Cost Explorer: set Service = Amazon SageMaker, group by Usage Type. Look for rows labeled USE1-Hosting:ml.g4dn.xlarge, USE1-Hosting:ml.g5.xlarge, or any other Hosting:{instance-type} pattern. Each row aggregates every endpoint instance of that type in that region, so a single row can hide more than one endpoint. If you see entries you cannot immediately attribute to a current production workload, continue to Step 2.
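If you export the grouped Cost Explorer results, the Hosting rows can be picked out programmatically. The sketch below is illustrative only: the function names are hypothetical, the input is assumed to be a plain `{usage_type: cost}` mapping you have already built, and the pattern assumes usage types look like the examples above (an optional region prefix, then `Hosting:`, then the instance type).

```python
import re

# Optional region prefix such as "USE1-", then "Hosting:", then the instance type.
_HOSTING = re.compile(
    r"^(?:(?P<region>[A-Z0-9]+)-)?Hosting:(?P<instance>ml\.[a-z0-9]+\.[a-z0-9]+)$"
)


def parse_hosting_usage_type(usage_type):
    """Return (region_prefix, instance_type) for a Hosting usage type, else None."""
    match = _HOSTING.match(usage_type)
    if match is None:
        return None
    return match.group("region"), match.group("instance")


def hosting_rows(cost_by_usage_type):
    """Keep only Hosting rows and return (instance_type, cost), largest cost first."""
    rows = []
    for usage_type, cost in cost_by_usage_type.items():
        parsed = parse_hosting_usage_type(usage_type)
        if parsed is not None:
            rows.append((parsed[1], cost))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Training, notebook, and processing usage types fall through as `None`, which leaves only the always-on inference charges to investigate.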

Step 2: Check invocations in CloudWatch

In CloudWatch Metrics: navigate to SageMaker → Endpoint Metrics. Select the endpoint name. Pull the Invocations metric for the last 7 days as a daily sum. If the graph is flat at zero, or shows a few scattered single-digit values, the endpoint has no real traffic. That is your idle endpoint.

Alternatively, from the CLI:

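# -v-7d is BSD/macOS date syntax; on GNU/Linux use: date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ
# Invocations is published per variant, so pass VariantName as well (often AllTraffic).
aws cloudwatch get-metric-statistics \
  --namespace AWS/SageMaker \
  --metric-name Invocations \
  --dimensions Name=EndpointName,Value=YOUR-ENDPOINT-NAME Name=VariantName,Value=YOUR-VARIANT-NAME \
  --start-time $(date -u -v-7d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 86400 \
  --statistics Sum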

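If every Sum in the output is 0, or the Datapoints list is empty (CloudWatch returns no datapoint for a period in which no metric data was published), the endpoint is idle. An empty result can also come from a mismatched dimension, so confirm the dimension values are spelled correctly before drawing that conclusion.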
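The same check can be scripted across many endpoints. The following is a sketch under stated assumptions: `cloudwatch` is expected to be a boto3 CloudWatch client (for example `boto3.client("cloudwatch")`), the function name is hypothetical, and the variant name has to be supplied by the caller (`AllTraffic` is a common default, but check your endpoint configuration).

```python
from datetime import datetime, timedelta, timezone


def invocations_last_n_days(cloudwatch, endpoint_name, variant_name, days=7):
    """Sum the daily Invocations datapoints for one endpoint variant.

    Periods with no published data return no datapoint at all,
    so an empty Datapoints list is counted as zero invocations.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="Invocations",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=start,
        EndTime=end,
        Period=86400,  # one datapoint per day
        Statistics=["Sum"],
    )
    return int(sum(point["Sum"] for point in response.get("Datapoints", [])))
```

Passing the client in as an argument keeps the function easy to exercise with a stub, and lets one session be reused across accounts or regions.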

Step 3: Delete the endpoint

If you confirm the endpoint has no current traffic and no scheduled use:

aws sagemaker delete-endpoint --endpoint-name YOUR-ENDPOINT-NAME

Deletion is immediate and irreversible. The billing stops at the hour boundary after deletion. Deleting on day 15 of a month saves approximately 50% of that month's charge. There is no partial-month credit: you pay for every hour up to deletion and nothing after.

If the model might be needed again, consider switching to Serverless Inference (sub-5-second latency tolerance) or Async Inference (latency-insensitive batch workloads). Both scale to zero during idle periods. For net-new endpoints, AWS's Inference Components architecture (re:Invent 2024) supports MinInstanceCount: 0 with an automatic idle timeout, which enables scale-to-zero on real-time endpoints built using the newer deployment model. Classic real-time endpoint configurations (the most common pattern and the default in most AWS documentation) still require a minimum of one instance and must be explicitly deleted to stop billing. See the fix checklist at the end of this article.

Step 4: Set a CloudWatch alarm so this doesn't happen again

AWS does not alarm on idle endpoints by default. Add one:

# Invocations is published per variant; set VARIANT_NAME to the endpoint's variant (often AllTraffic).
aws cloudwatch put-metric-alarm \
  --alarm-name "SageMaker-${ENDPOINT_NAME}-idle-7d" \
  --metric-name Invocations \
  --namespace AWS/SageMaker \
  --dimensions "Name=EndpointName,Value=${ENDPOINT_NAME}" "Name=VariantName,Value=${VARIANT_NAME}" \
  --statistic Sum \
  --period 86400 \
  --evaluation-periods 7 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:${ACCOUNT_ID}:your-alert-topic" \
  --treat-missing-data breaching

This alarms after 7 consecutive days with fewer than 1 invocation/day. treat-missing-data breaching ensures that a completely dark endpoint (zero CloudWatch data points, not just zero values) also triggers the alarm.
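To apply the same alarm to every endpoint rather than one at a time, the parameters can be generated in code. This is a sketch, not a reference implementation: the function name is hypothetical, the keys mirror the boto3 `put_metric_alarm` parameters, and the guard reflects the CloudWatch limit that the period multiplied by the number of evaluation periods may not exceed seven days.

```python
SECONDS_PER_DAY = 86400
MAX_EVALUATION_WINDOW = 7 * SECONDS_PER_DAY  # CloudWatch upper bound for daily periods


def idle_alarm_kwargs(endpoint_name, variant_name, sns_topic_arn, days=7):
    """Build keyword arguments for cloudwatch.put_metric_alarm(**kwargs)."""
    if days < 1 or days * SECONDS_PER_DAY > MAX_EVALUATION_WINDOW:
        raise ValueError("days must be between 1 and 7")
    return {
        "AlarmName": f"SageMaker-{endpoint_name}-idle-{days}d",
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": SECONDS_PER_DAY,
        "EvaluationPeriods": days,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        # Days with no datapoints at all count as breaching, i.e. idle.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }
```

A caller would then run something like `boto3.client("cloudwatch").put_metric_alarm(**idle_alarm_kwargs(name, variant, topic_arn))` for each endpoint and variant pair it finds.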


If you watch the bill: how to detect this at scale

For FinOps practitioners, cloud finance analysts, and platform teams monitoring spend across multiple accounts.

The per-account Cost Explorer approach catches single-account waste. At multi-account scale, idle endpoints accumulate across dozens of accounts, each instance individually small enough to go unnoticed, collectively significant. The right detection approach starts with a query against your FOCUS or CUR data lake.

Detection query (FOCUS/CUR, Athena)

WITH endpoint_daily AS (
    SELECT
        subaccountid,
        resourceid,
        CAST(chargeperiodstart AS DATE)           AS charge_date,
        SUM(billedcost)                           AS daily_cost,
        SUM(consumedquantity)                     AS endpoint_hours,
        SUM(COALESCE(x_invocation_count, 0))      AS daily_invocations
    FROM your_focus_table
    WHERE servicename LIKE '%SageMaker%'
      AND x_usagetype LIKE '%Hosting%'
      AND chargecategory = 'Usage'
      AND chargeclass = 'Regular'
      AND CAST(chargeperiodstart AS DATE) >= DATE_ADD('day', -7, CURRENT_DATE)
    GROUP BY 1, 2, 3
),
endpoint_summary AS (
    SELECT
        subaccountid,
        resourceid,
        SUM(daily_cost)                           AS endpoint_cost_7d,
        SUM(daily_invocations)                    AS total_invocations_7d,
        ROUND(SUM(daily_cost) * (30.0 / 7.0), 2) AS projected_monthly_cost
    FROM endpoint_daily
    GROUP BY 1, 2
)
SELECT
    subaccountid,
    resourceid,
    ROUND(endpoint_cost_7d, 2)    AS cost_last_7d,
    total_invocations_7d,
    projected_monthly_cost
FROM endpoint_summary
WHERE endpoint_cost_7d > 10.0
  AND total_invocations_7d < 100
ORDER BY endpoint_cost_7d DESC

Note on x_invocation_count: Standard CUR/FOCUS data does not include invocation counts for SageMaker real-time endpoints, because instance-hour billing has no per-request rows. This field must be pre-populated by joining CUR data with CloudWatch invocation metrics before loading into your data lake. If you are using native CUR without CloudWatch enrichment, replace the invocation filter with a manual CloudWatch check for any endpoint that shows SageMaker Hosting cost with a flat, unvarying 24/7 cost profile. The flat-cost-profile heuristic catches ~80% of idle endpoints before the CloudWatch join is needed.
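The enrichment step itself is a keyed join. The sketch below shows one way it might look before the rows are loaded into the data lake; the function name and the shape of both inputs are assumptions (billing rows as dicts carrying `resourceid` and `charge_date`, invocation totals keyed by the same pair), and only the `x_invocation_count` column name comes from the query above.

```python
def enrich_with_invocations(billing_rows, invocations_by_key):
    """Attach x_invocation_count to each billing row.

    billing_rows: iterable of dicts with at least "resourceid" and "charge_date".
    invocations_by_key: dict mapping (resourceid, charge_date) -> invocation count,
        built separately from CloudWatch. Missing keys mean no datapoints, i.e. zero.
    """
    enriched = []
    for row in billing_rows:
        key = (row["resourceid"], row["charge_date"])
        enriched.append({**row, "x_invocation_count": invocations_by_key.get(key, 0)})
    return enriched
```

Defaulting missing keys to zero matches how CloudWatch behaves for endpoints with no traffic, and it keeps the `COALESCE` in the query as a safety net rather than the primary mechanism.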

Dollar impact math

A single idle ml.g5.xlarge runs $1.006/hr. That is $24.14/day, $724/month. At 10 accounts each with one forgotten demo endpoint:

Instance type | Rate/hr | Rate/month | 10 accounts
ml.g4dn.xlarge | $0.736 | $530 | $5,300/mo
ml.g5.xlarge | $1.006 | $724 | $7,240/mo
ml.g5.4xlarge | $3.388 | $2,439 | $24,390/mo
ml.g5.12xlarge | $5.672 | $4,084 | $40,840/mo
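The monthly figures follow from the hourly rate and a 720-hour month (30 days of 24 hours), which is the convention these numbers match. A short sketch of the arithmetic, with hypothetical function names and the hourly rates copied from the table:

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours

HOURLY_RATES = {
    "ml.g4dn.xlarge": 0.736,
    "ml.g5.xlarge": 1.006,
    "ml.g5.4xlarge": 3.388,
    "ml.g5.12xlarge": 5.672,
}


def monthly_cost(hourly_rate):
    """Cost of one always-on instance for a 720-hour month, rounded to whole dollars."""
    return round(hourly_rate * HOURS_PER_MONTH)


def fleet_monthly_cost(hourly_rate, accounts=10):
    """Monthly cost when each account has one forgotten endpoint of this type."""
    return monthly_cost(hourly_rate) * accounts
```

Swapping in your own negotiated or regional rates is a one-line change to `HOURLY_RATES`.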

GPU endpoints at enterprise scale accumulate quickly. The $3,000–$8,000/month figures in the incident section reflect multi-endpoint accumulation across a single enterprise account, and they are not an edge case.

How to report this to the team that caused it

The team usually has no idea. The endpoint was created for a real, legitimate reason: a demo, a test, an A/B experiment. They finished what they were doing, moved on, and never thought about the endpoint again. The framing that works: "you have an endpoint from [project name] still running. It's at $X/month with zero traffic. Do you need it?" Most of the time the answer is no, and the delete happens that day.

The framing that doesn't work: "you're wasting cloud budget." That creates defensiveness. You want action, not a discussion. Give them the endpoint name, the dollar figure, and one yes/no question.


If you own the outcome: the governance gap and how to close it

For engineering managers, VP Engineering, and CTOs who see a SageMaker line item that nobody can explain.

Your ML team did nothing wrong. They deployed an endpoint to show stakeholders something working. The demo went well. The endpoint stayed running because SageMaker does not have an auto-shutdown mechanism, and nobody's process included a step for "delete the demo endpoint afterward."

The governance gap has three parts:

1. No lifecycle policy for inference endpoints. EC2 instances can be scheduled off. RDS can be stopped. Lambda costs nothing at zero invocations. SageMaker real-time endpoints are different: they run continuously from creation to deletion, and deletion must be explicit. Most teams have a process for provisioning endpoints, and no process for decommissioning them.

2. The billing doesn't reveal the problem. Your FinOps dashboard shows SageMaker spend at a flat, expected level. No spike, no anomaly. The cost has been accruing since the endpoint was created, so it appears in every monthly baseline. Only when you enrich billing data with CloudWatch invocation counts does the waste become visible. Standard FinOps tools don't do this join by default.

3. Ownership evaporates when projects close. The engineer who created the endpoint for the Q3 demo has moved on to next year's Q1 project. The endpoint has their team tag, but nobody is actively watching it. The cost attribution is correct. The ownership is orphaned.

The decision

For most organizations, the right answer is a two-part fix: (1) delete all endpoints that have had zero invocations for 7+ days with no current project association, and (2) implement a tagging and lifecycle policy that prevents this from recurring.

The delete step has a clear financial case. At GPU instance rates, a 30-day cleanup cycle on idle endpoints typically recovers 10–20% of total SageMaker spend in organizations that don't currently have this process. The savings are immediate: the month you run the cleanup, the bill drops. The month after, it stays lower if the lifecycle policy is in place.

The policy step closes the governance gap. Every endpoint needs an ExpiresOn tag at creation. A Lambda function checks invocations weekly and escalates to the tagged owner before auto-deleting. AWS published a reference implementation for this pattern in their ML blog. The implementation is a few hours of work. The ongoing cost of not having it is measured in GPU-hours per month, indefinitely.
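The decision logic inside such a Lambda can be small. The sketch below is an illustration only, not the AWS reference implementation: it assumes `ExpiresOn` holds an ISO date (`YYYY-MM-DD`), and the function name, the `grace_days` parameter, and the three action strings are all hypothetical.

```python
from datetime import date


def lifecycle_action(tags, invocations_7d, today, grace_days=7):
    """Decide what a weekly lifecycle check should do with one endpoint.

    Returns "keep", "notify_owner", or "delete".
    """
    expires_raw = tags.get("ExpiresOn")
    if expires_raw is None:
        return "notify_owner"  # untagged endpoint: ask the owner to add an expiry
    expires = date.fromisoformat(expires_raw)
    if today <= expires:
        return "keep"
    if invocations_7d > 0:
        return "notify_owner"  # expired but still serving traffic: ask to extend
    if (today - expires).days <= grace_days:
        return "notify_owner"  # expired and idle: escalate before deleting
    return "delete"
```

Keeping the decision separate from the AWS calls means the policy can be reviewed and unit-tested on its own, with the actual notification and `delete-endpoint` call wired in around it.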

The alternative, accepting idle endpoint waste as a cost of doing ML, is a choice some organizations make deliberately. It is rational if the cost is small relative to engineering time. It becomes irrational when the total reaches thousands per month and still has no line of sight in the standard tooling.


Fix checklist


Find out what else is hiding in your AI cloud bill

Idle SageMaker endpoints are one of several AI spend patterns that standard FinOps tools miss, either because they require a billing-plus-CloudWatch join to detect, accumulate below general anomaly thresholds, or show no deviation from an endpoint's historical cost baseline. The DropInFinOps free assessment takes 2 minutes and shows you which patterns your current billing setup is positioned to catch, and which ones are accumulating undetected.
