ML, anomaly detection, and forecasting
TensorCost’s ML surface covers four jobs:
- Anomaly detection across GPU metrics and AI spend (statistical baseline + optional ML layer).
- Burn-rate alerting against budget hierarchies (in flight).
- Cost forecasting at tenant, application, and team scope.
- Recommendations — the four shipped Bedrock recommenders, plus their GPU-side counterparts.
apps-new/services/ai-service/src/recommenders/.
Anomaly detection — statistical layer
Every metric stream (GPU utilization, memory, temperature, daily cost, hourly cost, agent invocation count, cache-hit rate) runs through three independent statistical detectors:
| Method | Trigger |
|---|---|
| Z-score | `\|value - μ\| > 3σ` against a rolling baseline |
| IQR | Value falls outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] |
| Rate-of-change | Δ exceeds 3σ of historical Δ distribution |
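
The three triggers in the table can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not TensorCost's implementation: the function names, the use of the sample standard deviation, the default quartile method of `statistics.quantiles`, and centering the rate-of-change check on the mean historical Δ are all choices made here for the example.

```python
from statistics import mean, stdev, quantiles

def zscore_anomaly(history, value, k=3.0):
    """True when |value - mean| exceeds k sample standard deviations of the baseline."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) > k * sigma

def iqr_anomaly(history, value, k=1.5):
    """True when value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(history) < 4:
        return False
    q1, _, q3 = quantiles(history, n=4)  # assumption: default 'exclusive' method
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

def rate_of_change_anomaly(history, value, k=3.0):
    """True when the newest delta is more than k sigma away from the mean historical delta."""
    deltas = [b - a for a, b in zip(history, history[1:])]
    if len(deltas) < 2:
        return False
    delta = value - history[-1]
    return abs(delta - mean(deltas)) > k * stdev(deltas)

def detect(history, value):
    """Run the three detectors independently and report each verdict."""
    return {
        "zscore": zscore_anomaly(history, value),
        "iqr": iqr_anomaly(history, value),
        "rate_of_change": rate_of_change_anomaly(history, value),
    }
```

Calling `detect(history, value)` with a rolling window of recent readings returns one boolean per method, which keeps the detectors independent as described above.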
Anomaly detection — ML layer (opt-in)
Customers on growth and enterprise plans can enable a per-tenant Isolation Forest layer that composes with the statistical scores.
| Model | Source | What it catches |
|---|---|---|
| `gpu_anomaly` | Per-reading GPU metrics + cyclical time + idle duration | Stuck workloads, hot GPUs, idle-but-running, sudden drops |
| `ai_spend_anomaly` | Hourly AI-spend aggregates: cost, request count, tokens, latency, unique models, rolling 24h mean | Cost spikes, runaway inference loops, new-provider surges |
Training
MLTrainingJob runs daily. For each tenant + model type, retrain triggers on:
| Trigger | Reason label |
|---|---|
| No active model | no_active_model |
| Active model older than ML_RETRAIN_INTERVAL_DAYS (default 7) | model_stale_{N}d |
| False-positive rate over the last window > 20% | high_fp_rate_{N}pct |
| Manual trigger from the ML models page | manual |
For `ai_spend_anomaly`, one sample is one filled hour bucket, so a 14-day lookback caps at 336 samples (14 × 24). New tenants need to accumulate hours first, or operators can set `ML_MIN_TRAINING_POINTS=200` until volume builds.
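
The retrain triggers and the hourly sample cap can be expressed as a small decision function. The sketch below is hypothetical: the function names, the precedence (table order, first match wins), and the reading of `{N}` as whole days of model age or a rounded percentage are assumptions, not documented behavior of `MLTrainingJob`.

```python
from datetime import datetime, timedelta, timezone

def retrain_reason(trained_at, fp_rate, manual=False, *, now,
                   interval_days=7, fp_threshold=0.20):
    """Return a reason label for retraining, or None when no trigger applies.

    trained_at: training time of the active model, or None if there is none.
    fp_rate:    false-positive rate over the last window, as a fraction (0.25 = 25%).
    """
    if trained_at is None:
        return "no_active_model"
    age = now - trained_at
    if age > timedelta(days=interval_days):
        return f"model_stale_{age.days}d"       # assumption: N = whole days of age
    if fp_rate > fp_threshold:
        return f"high_fp_rate_{round(fp_rate * 100)}pct"  # assumption: N = rounded percent
    if manual:
        return "manual"
    return None

def max_hourly_samples(lookback_days):
    """Upper bound on ai_spend_anomaly samples: one per hour bucket in the lookback."""
    return lookback_days * 24
```

For example, `max_hourly_samples(14)` gives the 336-sample cap mentioned above, which is why a minimum of 200 training points is reachable only after a tenant has filled several days of hour buckets.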
Inference and ensemble
For each incoming metric:
- The statistical scorer runs.
- If `ml-enabled` is on and the model is loaded, the sidecar predicts.
- The composite score is a weighted average: `ML_ENSEMBLE_WEIGHT` defaults to 0.2 (20% ML, 80% statistical).
- An alert fires when the composite crosses the severity threshold.
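
As a rough illustration of the weighted average, the sketch below assumes both scores are already normalized to the same range (for example 0 to 1) and that a missing ML score means the statistical score is used unchanged. The function names and the "greater than or equal" reading of "crosses" are assumptions for this example.

```python
def composite_score(stat_score, ml_score=None, ml_weight=0.2):
    """Blend the statistical and ML scores; ml_weight mirrors ML_ENSEMBLE_WEIGHT.

    When no ML score is available (layer disabled or model not loaded),
    the statistical score is returned as-is.
    """
    if ml_score is None:
        return stat_score
    return ml_weight * ml_score + (1.0 - ml_weight) * stat_score

def should_alert(score, severity_threshold):
    """Assumption: an alert fires once the composite reaches the threshold."""
    return score >= severity_threshold
```

With the default weight, a statistical score of 0.9 and an ML score of 0.4 combine to 0.2 × 0.4 + 0.8 × 0.9 = 0.80, so the statistical layer dominates the outcome.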
Circuit breaker
All sidecar calls are wrapped in a circuit breaker. After 3 consecutive failures or 30s timeouts, the breaker opens for 60s and detection falls back to the statistical layer. The ML models admin page surfaces breaker state: `CLOSED`, `HALF_OPEN`, or `OPEN`.
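
A minimal circuit breaker with these thresholds might look like the following. It is a sketch of the general pattern rather than the service's actual code: the class, its method names, and the half-open behavior (one trial call after the open period, success closes, failure reopens) are assumptions, and the 30s timeout is assumed to be enforced by the caller and surfaced as an exception.

```python
import time

class CircuitBreaker:
    """Wraps sidecar calls and falls back once too many consecutive calls fail."""

    def __init__(self, failure_threshold=3, open_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock            # injectable so the breaker can be tested without sleeping
        self.failures = 0             # consecutive failures since the last success
        self.opened_at = None         # time the breaker last opened, None while closed

    @property
    def state(self):
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.open_seconds:
            return "HALF_OPEN"        # open period elapsed: allow one trial call
        return "OPEN"

    def call(self, ml_predict, fallback):
        """Try the ML sidecar; use the statistical fallback when open or on failure."""
        if self.state == "OPEN":
            return fallback()
        try:
            result = ml_predict()     # a timeout is expected to raise here
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or reopen after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None         # success closes the breaker
        return result
```

Usage would be along the lines of `breaker.call(lambda: sidecar.predict(x), lambda: stat_score)`, where `sidecar.predict` and `stat_score` stand in for whatever the caller has on hand.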
Runaway-loop and retry-storm detection
For agent workloads, where one bug can burn $10,000 overnight, TensorCost ships a dedicated runaway-loop detector alongside the generic anomaly detection. It triggers on:
- Sudden invocation-count spike (>3σ from the per-agent baseline).
- Unusually long invocation chains (same `conversation_id` exceeding `N` calls).
- Cost-per-user-session spike vs the rolling baseline.
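
The three triggers can be combined into one check per agent. The sketch below is illustrative only: the function and signal names are made up for the example, spikes are treated as upward deviations, the chain limit `N` is passed in by the caller because its value is not stated here, and reusing the 3σ rule for the session-cost spike is an assumption.

```python
from statistics import mean, stdev

def _spike(baseline, value, k):
    """True when value is more than k sample standard deviations above the baseline mean."""
    if len(baseline) < 2:
        return False
    return value > mean(baseline) + k * stdev(baseline)

def runaway_signals(invocation_baseline, invocations,
                    chain_length, max_chain_length,
                    session_cost_baseline, session_cost, k=3.0):
    """Return the names of the runaway-loop triggers that fired for one agent."""
    signals = []
    if _spike(invocation_baseline, invocations, k):
        signals.append("invocation_spike")
    if chain_length > max_chain_length:      # calls sharing one conversation_id
        signals.append("long_chain")
    if _spike(session_cost_baseline, session_cost, k):   # assumption: same k-sigma rule
        signals.append("session_cost_spike")
    return signals
```

An empty list means none of the triggers fired; any non-empty result would be a candidate for an alert or recommendation.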
Cost forecasting
cost-service produces 30-, 60-, and 90-day forecasts at three scopes:
- Tenant total
- Application
- Team
Recommendations
Recommendations are computed by per-domain recommenders. Four are shipped today against Bedrock; the same patterns apply to Azure OpenAI / Vertex / OpenAI / Anthropic as those adapters land.
Bedrock recommenders (shipped)
| Recommender | What it surfaces |
|---|---|
| Model routing | Prompts routed to flagship models that cluster like efficient-model traffic — proposes Opus → Haiku, GPT-4o → GPT-4o-mini, etc., with sample request IDs and projected savings |
| Prompt cache | Repeated prefix patterns where Bedrock prompt caching would cut input-token cost ~90%, with the exact code/SDK change |
| Provisioned-throughput break-even | On-demand vs Provisioned Throughput math per (model, application) — flags both under- and over-commitment |
| Runaway-loop | The detector above, surfacing as recommendations + alerts |
Each recommendation comes with:
- A specific dollar impact estimate.
- The evidence (sample request IDs, cost breakdown, A/B plan).
- An accept / dismiss-with-reason / snooze action.
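
For the provisioned-throughput break-even row, the underlying comparison is what the same traffic would cost under each pricing mode for one (model, application) pair. The sketch below uses hypothetical function names, labels, and prices, and assumes provisioned capacity is billed per model unit per hour; it is not the recommender's actual logic.

```python
def provisioned_monthly_cost(model_units, hourly_rate_per_unit, hours_in_month=730):
    """Cost of committed capacity, assuming hourly billing per model unit."""
    return model_units * hourly_rate_per_unit * hours_in_month

def break_even_verdict(on_demand_cost, provisioned_cost, currently_provisioned):
    """Compare both pricing modes for the same traffic over the same period.

    Returns (label, dollar_impact):
      under_committed: paying on-demand although provisioned would be cheaper
      over_committed:  paying for provisioned although on-demand would be cheaper
      ok:              the current mode is already the cheaper one
    """
    impact = abs(on_demand_cost - provisioned_cost)
    if not currently_provisioned and provisioned_cost < on_demand_cost:
        return ("under_committed", impact)
    if currently_provisioned and on_demand_cost < provisioned_cost:
        return ("over_committed", impact)
    return ("ok", 0.0)
```

With made-up numbers, two model units at $20 per hour cost $29,200 over a 730-hour month, so an application spending $40,000 on-demand would be flagged as under-committed with a $10,800 impact estimate.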
GPU recommenders
| Recommender | What it surfaces |
|---|---|
| Idle-instance auto-stop | EC2 / GKE instances below the idle threshold for N minutes |
| MIG slice rightsizing | A100/H100/H200/B200 MIG profiles that don’t match observed load |
| Spot blending | Training jobs with checkpointing that could move to spot |
| RI / Savings Plan / Committed Use | Multi-year commitment optimization across your contract surface |
| GPU-type swap | “Your workload runs 30% faster on H100 at 2× cost, net save 40%” |
Live training and recommendation events
Every job and recommender publishes Socket.IO events:
- `ml.training.started`, `ml.training.completed`, `ml.training.failed`
- `recommendation.created`, `recommendation.accepted`, `recommendation.dismissed`
- `forecast.generated`, `forecast.budget_breach_predicted`
- `anomaly.detected`, `anomaly.resolved`
- `runaway_loop.detected`, `runaway_loop.paused`
The ML models admin page
Admins on plans with the ML layer get an ML models page that shows:
- Whether `ml-enabled` is active for the tenant.
- Active models per type with version, sample count, accuracy metrics.
- Ensemble weight.
- Circuit breaker state.
- Per-model retraining status with reason labels.
- A train now button per model.