ML, anomaly detection, and forecasting
TensorCost’s ML surface covers four jobs:
- Anomaly detection across GPU metrics and AI spend (statistical baseline + optional ML layer).
- Burn-rate alerting against budget hierarchies (in flight).
- Cost forecasting at tenant, application, and team scope.
- Recommendations — the four shipped Bedrock recommenders, plus their GPU-side counterparts.
This page is the customer-facing tour. Engineers and integrators looking for algorithm details can dig into ADR-0011 (event store) and the per-recommender source under `apps-new/services/ai-service/src/recommenders/`.
Anomaly detection — statistical layer
Every metric stream — GPU utilization, memory, temperature, daily cost, hourly cost, agent invocation count, cache-hit rate — runs through three independent statistical detectors:
| Method | Trigger |
|---|---|
| Z-score | `\|value − μ\| > 3σ` against a rolling baseline |
| IQR | Value falls outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] |
| Rate-of-change | Δ exceeds 3σ of historical Δ distribution |
An anomaly is confirmed when at least 2 of 3 methods agree (composite confidence ≥ 0.5). Day-of-week seasonal baselines avoid false positives on predictable patterns (training jobs that always spike Tuesday morning, etc.).
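As a hedged sketch of how the 2-of-3 vote can be composed (the detector names follow the table above; everything else, including the simplified quantile, is illustrative rather than the shipped ai-service code):

```ts
type DetectorVote = { method: "z_score" | "iqr" | "rate_of_change"; anomalous: boolean };

function zScoreVote(value: number, baseline: number[]): DetectorVote {
  const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
  const sigma = Math.sqrt(baseline.reduce((a, b) => a + (b - mean) ** 2, 0) / baseline.length);
  return { method: "z_score", anomalous: sigma > 0 && Math.abs(value - mean) > 3 * sigma };
}

function iqrVote(value: number, baseline: number[]): DetectorVote {
  const sorted = [...baseline].sort((a, b) => a - b);
  const q = (p: number) => sorted[Math.floor(p * (sorted.length - 1))]; // simplified quantile
  const iqr = q(0.75) - q(0.25);
  return { method: "iqr", anomalous: value < q(0.25) - 1.5 * iqr || value > q(0.75) + 1.5 * iqr };
}

function rateOfChangeVote(value: number, baseline: number[]): DetectorVote {
  const deltas = baseline.slice(1).map((v, i) => v - baseline[i]);
  const mean = deltas.reduce((a, b) => a + b, 0) / deltas.length;
  const sigma = Math.sqrt(deltas.reduce((a, b) => a + (b - mean) ** 2, 0) / deltas.length);
  const delta = value - baseline[baseline.length - 1];
  return { method: "rate_of_change", anomalous: sigma > 0 && Math.abs(delta) > 3 * sigma };
}

// Confirmed when at least 2 of 3 detectors agree (composite confidence >= 0.5).
function isConfirmed(value: number, baseline: number[]): boolean {
  const votes = [zScoreVote, iqrVote, rateOfChangeVote].map(v => v(value, baseline));
  return votes.filter(v => v.anomalous).length / votes.length >= 0.5;
}
```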
Every detection writes a row to the analysis audit log with the inputs, scores, decision, and reason — useful for tuning thresholds and for SOC 2 processing-integrity evidence.
Anomaly detection — ML layer (opt-in)
Customers on growth and enterprise plans can enable a per-tenant Isolation Forest layer that composes with the statistical scores.
| Model | Source | What it catches |
|---|---|---|
| `gpu_anomaly` | Per-reading GPU metrics + cyclical time + idle duration | Stuck workloads, hot GPUs, idle-but-running, sudden drops |
| `ai_spend_anomaly` | Hourly AI-spend aggregates — cost, request count, tokens, latency, unique models, rolling 24h mean | Cost spikes, runaway inference loops, new-provider surges |
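A minimal sketch of what “per-reading GPU metrics + cyclical time” can look like as a `gpu_anomaly` feature vector (the field names and exact feature set are assumptions, not the shipped schema):

```ts
// Illustrative reading shape; not the actual TensorCost schema.
interface GpuReading {
  utilizationPct: number;
  memoryPct: number;
  temperatureC: number;
  idleMinutes: number;
  timestamp: Date;
}

function toFeatureVector(r: GpuReading): number[] {
  const hour = r.timestamp.getUTCHours() + r.timestamp.getUTCMinutes() / 60;
  const day = r.timestamp.getUTCDay();
  return [
    r.utilizationPct,
    r.memoryPct,
    r.temperatureC,
    r.idleMinutes,
    // Cyclical encoding so 23:00 and 01:00 land near each other.
    Math.sin((2 * Math.PI * hour) / 24),
    Math.cos((2 * Math.PI * hour) / 24),
    Math.sin((2 * Math.PI * day) / 7),
    Math.cos((2 * Math.PI * day) / 7),
  ];
}
```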
Training
`MLTrainingJob` runs daily. For each tenant + model type, a retrain triggers on:

| Trigger | Reason label |
|---|---|
| No active model | `no_active_model` |
| Active model older than `ML_RETRAIN_INTERVAL_DAYS` (default 7) | `model_stale_{N}d` |
| False-positive rate over the last window > 20% | `high_fp_rate_{N}pct` |
| Manual trigger from the ML models page | `manual` |
The minimum sample count defaults to 1,000. For `ai_spend_anomaly`, one sample = one filled hour bucket, so a 14-day lookback caps at 336 samples (14 × 24); new tenants either need to accumulate hours or operators can set `ML_MIN_TRAINING_POINTS=200` until volume builds.
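A hedged sketch of the per-(tenant, model type) retrain decision, mirroring the trigger table above (the helper and its names are illustrative; the manual trigger arrives via the admin page, not this check):

```ts
interface ActiveModel { trainedAt: Date; falsePositiveRate: number }

const RETRAIN_INTERVAL_DAYS = Number(process.env.ML_RETRAIN_INTERVAL_DAYS ?? 7);
const FP_RATE_LIMIT = 0.2; // 20% false-positive ceiling

function retrainReason(model: ActiveModel | null, now = new Date()): string | null {
  if (!model) return "no_active_model";
  const ageDays = Math.floor((now.getTime() - model.trainedAt.getTime()) / 86_400_000);
  if (ageDays > RETRAIN_INTERVAL_DAYS) return `model_stale_${ageDays}d`;
  if (model.falsePositiveRate > FP_RATE_LIMIT)
    return `high_fp_rate_${Math.round(model.falsePositiveRate * 100)}pct`;
  return null; // no retrain needed today
}
```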
Inference and ensemble
For each incoming metric:
- The statistical scorer runs.
- If `ml-enabled` is on and the model is loaded, the sidecar predicts.
- The composite score is a weighted average — `ML_ENSEMBLE_WEIGHT` defaults to 0.2 (20% ML, 80% statistical).
- An alert fires when the composite crosses the severity threshold.
The ML score never fires an alert on its own; the statistical layer always runs.
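The ensemble step itself is small. A sketch matching the description above (the function name is illustrative; `ML_ENSEMBLE_WEIGHT` is the documented knob):

```ts
const ML_ENSEMBLE_WEIGHT = Number(process.env.ML_ENSEMBLE_WEIGHT ?? 0.2);

function compositeScore(statScore: number, mlScore: number | null): number {
  // No ML score (disabled, model not loaded, breaker open):
  // the statistical score stands alone.
  if (mlScore === null) return statScore;
  // Default 0.2 => 20% ML, 80% statistical.
  return ML_ENSEMBLE_WEIGHT * mlScore + (1 - ML_ENSEMBLE_WEIGHT) * statScore;
}
```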
Circuit breaker
All sidecar calls are wrapped in a circuit breaker. After 3 consecutive failures (errors or 30s timeouts), the breaker opens for 60s and detection falls back to the statistical layer. The ML models admin page surfaces breaker state — CLOSED, HALF_OPEN, or OPEN.
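A minimal circuit-breaker sketch using the documented thresholds (3 consecutive failures open the breaker for 60s; a 30s timeout counts as a failure); the class and helper names are illustrative:

```ts
type BreakerState = "CLOSED" | "HALF_OPEN" | "OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < 60_000) return fallback();
      this.state = "HALF_OPEN"; // after 60s, let one probe request through
    }
    try {
      const result = await withTimeout(fn(), 30_000);
      this.state = "CLOSED";
      this.failures = 0;
      return result;
    } catch {
      if (++this.failures >= 3 || this.state === "HALF_OPEN") {
        this.state = "OPEN";
        this.openedAt = Date.now();
      }
      return fallback(); // statistical layer keeps running
    }
  }
}

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error("timeout")), ms)),
  ]);
}
```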
Runaway-loop and retry-storm detection
For agent workloads — where one bug can burn $10,000 overnight — TensorCost ships a dedicated runaway-loop detector alongside the generic anomaly detection. It triggers on:
- Sudden invocation-count spike (>3σ from the per-agent baseline).
- Unusually long invocation chains (same `conversation_id` exceeding N calls).
- Cost-per-user-session spike vs the rolling baseline.
Alert payloads include the top 5 offending request IDs and a one-click “Pause agent” action where the customer has wired in their orchestrator. Alerts route through PagerDuty, Slack, email, or webhook, per configuration.
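Two of the three triggers reduce to simple threshold checks. A hedged sketch (the baseline fields and the default chain length N are assumptions):

```ts
interface AgentBaseline { meanPerHour: number; sigmaPerHour: number }

// Trigger 1: invocation count > 3 sigma above the per-agent baseline.
function isInvocationSpike(countThisHour: number, b: AgentBaseline): boolean {
  return b.sigmaPerHour > 0 && countThisHour > b.meanPerHour + 3 * b.sigmaPerHour;
}

// Trigger 2: same conversation_id exceeding N calls (N = 50 is assumed).
function isChainTooLong(callsInConversation: number, maxChain = 50): boolean {
  return callsInConversation > maxChain;
}
```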
Cost forecasting
`cost-service` produces 30-, 60-, and 90-day forecasts at three scopes:
- Tenant total
- Application
- Team
Forecasts are produced by an ensemble of seasonal-naïve and ARIMA-style models, blended by recent error rates. Each forecast carries a confidence interval and a “drift score” — high drift means the forecast is less reliable, typically because a workload pattern just changed (a new agent rolled out, a fine-tune started, a Bedrock provisioned-throughput contract activated).
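One reasonable reading of “blended by recent error rates” is inverse-error weighting; a sketch under that assumption (not necessarily the shipped blend):

```ts
// Weight each model by the inverse of its recent mean absolute error,
// so the model that has been more accurate lately dominates the blend.
// Assumes both MAEs are positive.
function blendForecasts(
  seasonalNaive: number[], // next-N-day forecast from the seasonal-naive model
  arima: number[],         // next-N-day forecast from the ARIMA-style model
  maeNaive: number,        // recent MAE of the seasonal-naive model
  maeArima: number,        // recent MAE of the ARIMA-style model
): number[] {
  const wNaive = (1 / maeNaive) / (1 / maeNaive + 1 / maeArima);
  return seasonalNaive.map((v, i) => wNaive * v + (1 - wNaive) * arima[i]);
}
```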
When the forecasted month-end exceeds the budget, a budget breach predicted event fires (see real-time events) and the configured channels are notified.
Recommendations
Recommendations are computed by per-domain recommenders. Four are shipped today against Bedrock; the same patterns apply to Azure OpenAI / Vertex / OpenAI / Anthropic as those adapters land.
Bedrock recommenders (shipped)
| Recommender | What it surfaces |
|---|---|
| Model routing | Prompts routed to flagship models that cluster like efficient-model traffic — proposes Opus → Haiku, GPT-4o → GPT-4o-mini, etc., with sample request IDs and projected savings |
| Prompt cache | Repeated prefix patterns where Bedrock prompt caching would cut input-token cost ~90%, with the exact code/SDK change |
| Provisioned-throughput break-even | On-demand vs Provisioned Throughput math per (model, application) — flags both under- and over-commitment |
| Runaway-loop | The detector above, surfacing as recommendations + alerts |
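For the provisioned-throughput recommender, the core break-even test can be sketched as follows; the rates and the linear on-demand cost model are assumptions, not Bedrock list prices:

```ts
// Utilization above which the flat PT rate beats paying on demand.
// Assumes on-demand cost scales linearly with utilization.
function breakEvenUtilization(onDemandPerHourAtFullLoad: number, ptPerHour: number): number {
  return ptPerHour / onDemandPerHourAtFullLoad;
}

function flag(observedUtilization: number, threshold: number): string {
  return observedUtilization < threshold
    ? "over-commitment risk: on-demand would be cheaper at this utilization"
    : "under-commitment: on-demand spend exceeds the flat PT rate";
}

// Example with assumed numbers: if driving the commitment at full load
// would cost $40/hr on demand and PT costs a flat $28/hr, PT wins above
// 70% utilization.
const threshold = breakEvenUtilization(40, 28); // 0.7
flag(0.55, threshold); // "over-commitment risk: ..."
```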
Each recommendation carries:
- A specific dollar impact estimate.
- The evidence (sample request IDs, cost breakdown, A/B plan).
- An accept / dismiss-with-reason / snooze action.
Acceptance writes a row to the savings ledger; verified savings populate after the 30-day verification window. Verification methodology is documented in the Savings methodology PDF (downloadable from the dashboard).
GPU recommenders
| Recommender | What it surfaces |
|---|---|
| Idle-instance auto-stop | EC2 / GKE instances below the idle threshold for N minutes |
| MIG slice rightsizing | A100/H100/H200/B200 MIG profiles that don’t match observed load |
| Spot blending | Training jobs with checkpointing that could move to spot |
| RI / Savings Plan / Committed Use | Multi-year commitment optimization across your contract surface |
| GPU-type swap | “Your workload runs ~3.3× faster on H100 at 2× the hourly cost — net save ~40%” (see the worked example below) |
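The arithmetic behind the GPU-type swap copy, as a worked sketch (the numbers are illustrative):

```ts
// Total job cost scales by (price ratio) / (speedup): the same job
// finishes in 1/speedup of the time at priceRatio times the hourly rate.
// A swap only saves money when speedup exceeds the price ratio.
function netSavings(speedup: number, priceRatio: number): number {
  return 1 - priceRatio / speedup;
}

netSavings(3.33, 2); // ≈ 0.40 → "net save ~40%"
netSavings(1.3, 2);  // ≈ -0.54 → a mere 30% speedup at 2× the rate costs MORE
```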
Live training and recommendation events
Every job and recommender publishes Socket.IO events:
- `ml.training.started`, `ml.training.completed`, `ml.training.failed`
- `recommendation.created`, `recommendation.accepted`, `recommendation.dismissed`
- `forecast.generated`, `forecast.budget_breach_predicted`
- `anomaly.detected`, `anomaly.resolved`
- `runaway_loop.detected`, `runaway_loop.paused`
See real-time events for the full catalog and consumption patterns.
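A minimal consumption sketch with `socket.io-client` (the endpoint and auth shape are assumptions; the event names are from the catalog above):

```ts
import { io } from "socket.io-client";

// Endpoint and auth scheme are placeholders; see real-time events for
// the actual connection details.
const socket = io("https://app.tensorcost.example", {
  auth: { token: process.env.TENSORCOST_API_TOKEN },
});

socket.on("ml.training.completed", (payload) => {
  console.log("model retrained:", payload);
});

socket.on("forecast.budget_breach_predicted", (payload) => {
  console.log("breach predicted:", payload);
});
```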
The ML models admin page
Admins on plans with the ML layer get an ML models page that shows:
- Whether `ml-enabled` is active for the tenant.
- Active models per type with version, sample count, accuracy metrics.
- Ensemble weight.
- Circuit breaker state.
- Per-model retraining status with reason labels.
- A “Train now” button per model.
Live training progress streams in via Socket.IO so the page reflects starts, completions, and failures without a page reload.
Data minimization
The recommenders see model IDs, token counts, latency, and (where the customer enabled invocation-log capture) request IDs and hashes of prompts/responses. Raw prompts and responses are never stored. Customers concerned about content can disable request-metadata capture entirely and still get model/cost/latency-level recommendations.
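A sketch of what hash-only capture can look like, assuming a SHA-256 digest over the raw text (the record shape is illustrative, not the actual capture schema):

```ts
import { createHash } from "node:crypto";

// Store a digest of the prompt, never the prompt itself.
function promptDigest(prompt: string): string {
  return createHash("sha256").update(prompt, "utf8").digest("hex");
}

// Illustrative shape of what a recommender would see for one request:
const record = {
  requestId: "req_123",                 // captured only when enabled
  modelId: "anthropic.claude-3-haiku",  // model ID, token counts, latency
  inputTokens: 812,
  latencyMs: 430,
  promptHash: promptDigest("..."),      // hash only; raw text never stored
};
```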