ML, anomaly detection, and forecasting
TensorCost’s ML surface covers four jobs:
- Anomaly detection across GPU metrics and AI spend (statistical baseline + optional ML layer).
- Burn-rate alerting against budget hierarchies (in flight).
- Cost forecasting at tenant, application, and team scope.
- Recommendations — the four shipped Bedrock recommenders, plus their GPU-side counterparts.
This page is the customer-facing tour. Engineers and integrators looking for algorithm details can dig into ADR-0011 (event store) and the per-recommender source under `apps-new/services/ai-service/src/recommenders/`.
Anomaly detection — statistical layer
Every metric stream — GPU utilization, memory, temperature, daily cost, hourly cost, agent invocation count, cache-hit rate — runs through three independent statistical detectors:
| Method | Trigger |
|---|---|
| Z-score | `\|value − μ\| > 3σ` against a rolling baseline |
| IQR | Value falls outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] |
| Rate-of-change | Δ exceeds 3σ of historical Δ distribution |
An anomaly is confirmed when at least 2 of 3 methods agree (composite confidence ≥ 0.5). Day-of-week seasonal baselines avoid false positives on predictable patterns (training jobs that always spike Tuesday morning, etc.).
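As a hedged sketch of how the 2-of-3 vote can be composed (the detector names follow the table above; everything else, including the simplified quantile, is illustrative rather than the shipped ai-service code):

```ts
type DetectorVote = { method: "z_score" | "iqr" | "rate_of_change"; anomalous: boolean };

function zScoreVote(value: number, baseline: number[]): DetectorVote {
  const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
  const sigma = Math.sqrt(baseline.reduce((a, b) => a + (b - mean) ** 2, 0) / baseline.length);
  return { method: "z_score", anomalous: sigma > 0 && Math.abs(value - mean) > 3 * sigma };
}

function iqrVote(value: number, baseline: number[]): DetectorVote {
  const sorted = [...baseline].sort((a, b) => a - b);
  const q = (p: number) => sorted[Math.floor(p * (sorted.length - 1))]; // simplified quantile
  const iqr = q(0.75) - q(0.25);
  return { method: "iqr", anomalous: value < q(0.25) - 1.5 * iqr || value > q(0.75) + 1.5 * iqr };
}

function rateOfChangeVote(value: number, baseline: number[]): DetectorVote {
  const deltas = baseline.slice(1).map((v, i) => v - baseline[i]);
  const mean = deltas.reduce((a, b) => a + b, 0) / deltas.length;
  const sigma = Math.sqrt(deltas.reduce((a, b) => a + (b - mean) ** 2, 0) / deltas.length);
  const delta = value - baseline[baseline.length - 1];
  return { method: "rate_of_change", anomalous: sigma > 0 && Math.abs(delta) > 3 * sigma };
}

// Confirmed when at least 2 of 3 detectors agree (composite confidence >= 0.5).
function isConfirmed(value: number, baseline: number[]): boolean {
  const votes = [zScoreVote, iqrVote, rateOfChangeVote].map(v => v(value, baseline));
  return votes.filter(v => v.anomalous).length / votes.length >= 0.5;
}
```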
Every detection writes a row to the analysis audit log with the inputs, scores, decision, and reason — useful for tuning thresholds and for SOC 2 processing-integrity evidence.
Anomaly detection — ML layer (opt-in)
Customers on growth and enterprise plans can enable a per-tenant Isolation Forest layer that composes with the statistical scores.
| Model | Source | What it catches |
|---|---|---|
| `gpu_anomaly` | Per-reading GPU metrics + cyclical time + idle duration | Stuck workloads, hot GPUs, idle-but-running, sudden drops |
| `ai_spend_anomaly` | Hourly AI-spend aggregates — cost, request count, tokens, latency, unique models, rolling 24h mean | Cost spikes, runaway inference loops, new-provider surges |
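A minimal sketch of what “per-reading GPU metrics + cyclical time” can look like as a `gpu_anomaly` feature vector (the field names and exact feature set are assumptions, not the shipped schema):

```ts
// Illustrative reading shape; not the actual TensorCost schema.
interface GpuReading {
  utilizationPct: number;
  memoryPct: number;
  temperatureC: number;
  idleMinutes: number;
  timestamp: Date;
}

function toFeatureVector(r: GpuReading): number[] {
  const hour = r.timestamp.getUTCHours() + r.timestamp.getUTCMinutes() / 60;
  const day = r.timestamp.getUTCDay();
  return [
    r.utilizationPct,
    r.memoryPct,
    r.temperatureC,
    r.idleMinutes,
    // Cyclical encoding so 23:00 and 01:00 land near each other.
    Math.sin((2 * Math.PI * hour) / 24),
    Math.cos((2 * Math.PI * hour) / 24),
    Math.sin((2 * Math.PI * day) / 7),
    Math.cos((2 * Math.PI * day) / 7),
  ];
}
```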
Training
`MLTrainingJob` runs daily. For each tenant + model type, a retrain triggers on:

| Trigger | Reason label |
|---|---|
| No active model | `no_active_model` |
| Active model older than `ML_RETRAIN_INTERVAL_DAYS` (default 7) | `model_stale_{N}d` |
| False-positive rate over the last window > 20% | `high_fp_rate_{N}pct` |
| Manual trigger from the ML models page | `manual` |
The minimum sample count defaults to 1,000. For `ai_spend_anomaly`, one sample = one filled hour bucket, so a 14-day lookback caps at 336 samples (14 × 24); new tenants either need to accumulate hours or operators can set `ML_MIN_TRAINING_POINTS=200` until volume builds.
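A hedged sketch of the per-(tenant, model type) retrain decision, mirroring the trigger table above (the helper and its names are illustrative; the manual trigger arrives via the admin page, not this check):

```ts
interface ActiveModel { trainedAt: Date; falsePositiveRate: number }

const RETRAIN_INTERVAL_DAYS = Number(process.env.ML_RETRAIN_INTERVAL_DAYS ?? 7);
const FP_RATE_LIMIT = 0.2; // 20% false-positive ceiling

function retrainReason(model: ActiveModel | null, now = new Date()): string | null {
  if (!model) return "no_active_model";
  const ageDays = Math.floor((now.getTime() - model.trainedAt.getTime()) / 86_400_000);
  if (ageDays > RETRAIN_INTERVAL_DAYS) return `model_stale_${ageDays}d`;
  if (model.falsePositiveRate > FP_RATE_LIMIT)
    return `high_fp_rate_${Math.round(model.falsePositiveRate * 100)}pct`;
  return null; // no retrain needed today
}
```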
Inference and ensemble
For each incoming metric:
- The statistical scorer runs.
- If `ml-enabled` is on and the model is loaded, the sidecar predicts.
- The composite score is a weighted average — `ML_ENSEMBLE_WEIGHT` defaults to 0.2 (20% ML, 80% statistical).
- An alert fires when the composite crosses the severity threshold.
The ML score never fires an alert on its own; the statistical layer always runs.
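The ensemble step itself is small. A sketch matching the description above (the function name is illustrative; `ML_ENSEMBLE_WEIGHT` is the documented knob):

```ts
const ML_ENSEMBLE_WEIGHT = Number(process.env.ML_ENSEMBLE_WEIGHT ?? 0.2);

function compositeScore(statScore: number, mlScore: number | null): number {
  // No ML score (disabled, model not loaded, breaker open):
  // the statistical score stands alone.
  if (mlScore === null) return statScore;
  // Default 0.2 => 20% ML, 80% statistical.
  return ML_ENSEMBLE_WEIGHT * mlScore + (1 - ML_ENSEMBLE_WEIGHT) * statScore;
}
```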
Circuit breaker
All sidecar calls are wrapped in a circuit breaker. After 3 consecutive failures (errors or 30s timeouts), the breaker opens for 60s and detection falls back to the statistical layer. The ML models admin page surfaces breaker state — CLOSED, HALF_OPEN, or OPEN.
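A minimal circuit-breaker sketch using the documented thresholds (3 consecutive failures open the breaker for 60s; a 30s timeout counts as a failure); the class and helper names are illustrative:

```ts
type BreakerState = "CLOSED" | "HALF_OPEN" | "OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < 60_000) return fallback();
      this.state = "HALF_OPEN"; // after 60s, let one probe request through
    }
    try {
      const result = await withTimeout(fn(), 30_000);
      this.state = "CLOSED";
      this.failures = 0;
      return result;
    } catch {
      if (++this.failures >= 3 || this.state === "HALF_OPEN") {
        this.state = "OPEN";
        this.openedAt = Date.now();
      }
      return fallback(); // statistical layer keeps running
    }
  }
}

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error("timeout")), ms)),
  ]);
}
```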
Runaway-loop and retry-storm detection
For agent workloads — where one bug can burn $10,000 overnight — TensorCost ships a dedicated runaway-loop detector alongside the generic anomaly detection. It triggers on:
- Sudden invocation-count spike (>3σ from the per-agent baseline).
- Unusually long invocation chains (same `conversation_id` exceeding N calls).
- Cost-per-user-session spike vs the rolling baseline.
Alert payloads include the top 5 offending request IDs and a one-click “Pause agent” action where the customer has wired in their orchestrator. Alerts route through PagerDuty, Slack, email, or webhook, per configuration.
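Two of the three triggers reduce to simple threshold checks. A hedged sketch (the baseline fields and the default chain length N are assumptions):

```ts
interface AgentBaseline { meanPerHour: number; sigmaPerHour: number }

// Trigger 1: invocation count > 3 sigma above the per-agent baseline.
function isInvocationSpike(countThisHour: number, b: AgentBaseline): boolean {
  return b.sigmaPerHour > 0 && countThisHour > b.meanPerHour + 3 * b.sigmaPerHour;
}

// Trigger 2: same conversation_id exceeding N calls (N = 50 is assumed).
function isChainTooLong(callsInConversation: number, maxChain = 50): boolean {
  return callsInConversation > maxChain;
}
```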
Cost forecasting
`cost-service` produces 30-, 60-, and 90-day forecasts at three scopes:
- Tenant total
- Application
- Team
Forecasts are produced by an ensemble of seasonal-naïve and ARIMA-style models, blended by recent error rates. Each forecast carries a confidence interval and a “drift score” — high drift means the forecast is less reliable, typically because a workload pattern just changed (a new agent rolled out, a fine-tune started, a Bedrock provisioned-throughput contract activated).
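One reasonable reading of “blended by recent error rates” is inverse-error weighting; a sketch under that assumption (not necessarily the shipped blend):

```ts
// Weight each model by the inverse of its recent mean absolute error,
// so the model that has been more accurate lately dominates the blend.
// Assumes both MAEs are positive.
function blendForecasts(
  seasonalNaive: number[], // next-N-day forecast from the seasonal-naive model
  arima: number[],         // next-N-day forecast from the ARIMA-style model
  maeNaive: number,        // recent MAE of the seasonal-naive model
  maeArima: number,        // recent MAE of the ARIMA-style model
): number[] {
  const wNaive = (1 / maeNaive) / (1 / maeNaive + 1 / maeArima);
  return seasonalNaive.map((v, i) => wNaive * v + (1 - wNaive) * arima[i]);
}
```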
When the forecasted month-end exceeds the budget, a budget breach predicted event fires (see real-time events) and the configured channels are notified.
Recommendations
Recommendations are computed by per-domain recommenders. Four are shipped today against Bedrock; the same patterns apply to Azure OpenAI / Vertex / OpenAI / Anthropic as those adapters land.
Bedrock recommenders (shipped)
| Recommender | What it surfaces |
|---|---|
| Model routing | Prompts routed to flagship models that cluster like efficient-model traffic — proposes Opus → Haiku, GPT-4o → GPT-4o-mini, etc., with sample request IDs and projected savings |
| Prompt cache | Repeated prefix patterns where Bedrock prompt caching would cut input-token cost ~90%, with the exact code/SDK change |
| Provisioned-throughput break-even | On-demand vs Provisioned Throughput math per (model, application) — flags both under- and over-commitment |
| Runaway-loop | The detector above, surfacing as recommendations + alerts |
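For the provisioned-throughput recommender, the core break-even test can be sketched as follows; the rates and the linear on-demand cost model are assumptions, not Bedrock list prices:

```ts
// Utilization above which the flat PT rate beats paying on demand.
// Assumes on-demand cost scales linearly with utilization.
function breakEvenUtilization(onDemandPerHourAtFullLoad: number, ptPerHour: number): number {
  return ptPerHour / onDemandPerHourAtFullLoad;
}

function flag(observedUtilization: number, threshold: number): string {
  return observedUtilization < threshold
    ? "over-commitment risk: on-demand would be cheaper at this utilization"
    : "under-commitment: on-demand spend exceeds the flat PT rate";
}

// Example with assumed numbers: if driving the commitment at full load
// would cost $40/hr on demand and PT costs a flat $28/hr, PT wins above
// 70% utilization.
const threshold = breakEvenUtilization(40, 28); // 0.7
flag(0.55, threshold); // "over-commitment risk: ..."
```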
Each recommendation carries:
- A specific dollar impact estimate.
- The evidence (sample request IDs, cost breakdown, A/B plan).
- An accept / dismiss-with-reason / snooze action.
Acceptance writes a row to the savings ledger; verified savings populate after the 30-day verification window. Verification methodology is documented in the Savings methodology PDF (downloadable from the dashboard).
GPU recommenders
| Recommender | What it surfaces |
|---|---|
| Idle-instance auto-stop | EC2 / GKE instances below the idle threshold for N minutes |
| MIG slice rightsizing | A100/H100/H200/B200 MIG profiles that don’t match observed load |
| Spot blending | Training jobs with checkpointing that could move to spot |
| RI / Savings Plan / Committed Use | Multi-year commitment optimization across your contract surface |
| GPU-type swap | “Your workload runs ~3.3× faster on H100 at 2× the hourly cost — net save ~40%” (see the worked example below) |
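The arithmetic behind the GPU-type swap copy, as a worked sketch (the numbers are illustrative):

```ts
// Total job cost scales by (price ratio) / (speedup): the same job
// finishes in 1/speedup of the time at priceRatio times the hourly rate.
// A swap only saves money when speedup exceeds the price ratio.
function netSavings(speedup: number, priceRatio: number): number {
  return 1 - priceRatio / speedup;
}

netSavings(3.33, 2); // ≈ 0.40 → "net save ~40%"
netSavings(1.3, 2);  // ≈ -0.54 → a mere 30% speedup at 2× the rate costs MORE
```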
Live training and recommendation events
Every job and recommender publishes Socket.IO events:
- `ml.training.started`, `ml.training.completed`, `ml.training.failed`
- `recommendation.created`, `recommendation.accepted`, `recommendation.dismissed`
- `forecast.generated`, `forecast.budget_breach_predicted`
- `anomaly.detected`, `anomaly.resolved`
- `runaway_loop.detected`, `runaway_loop.paused`
See real-time events for the full catalog and consumption patterns.
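A minimal consumption sketch with `socket.io-client` (the endpoint and auth shape are assumptions; the event names are from the catalog above):

```ts
import { io } from "socket.io-client";

// Endpoint and auth scheme are placeholders; see real-time events for
// the actual connection details.
const socket = io("https://app.tensorcost.example", {
  auth: { token: process.env.TENSORCOST_API_TOKEN },
});

socket.on("ml.training.completed", (payload) => {
  console.log("model retrained:", payload);
});

socket.on("forecast.budget_breach_predicted", (payload) => {
  console.log("breach predicted:", payload);
});
```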
The ML models admin page
Admins on plans with the ML layer get an ML models page that shows:
- Whether `ml-enabled` is active for the tenant.
- Active models per type with version, sample count, accuracy metrics.
- Ensemble weight.
- Circuit breaker state.
- Per-model retraining status with reason labels.
- A “Train now” button per model.
Live training progress streams in via Socket.IO so the page reflects starts, completions, and failures without a page reload.
Data minimization
The recommenders see model IDs, token counts, latency, and (where the customer enabled invocation-log capture) request IDs and hashes of prompts/responses. Raw prompts and responses are never stored. Customers concerned about content can disable request-metadata capture entirely and still get model/cost/latency-level recommendations.
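A sketch of what hash-only capture can look like, assuming a SHA-256 digest over the raw text (the record shape is illustrative, not the actual capture schema):

```ts
import { createHash } from "node:crypto";

// Store a digest of the prompt, never the prompt itself.
function promptDigest(prompt: string): string {
  return createHash("sha256").update(prompt, "utf8").digest("hex");
}

// Illustrative shape of what a recommender would see for one request:
const record = {
  requestId: "req_123",                 // captured only when enabled
  modelId: "anthropic.claude-3-haiku",  // model ID, token counts, latency
  inputTokens: 812,
  latencyMs: 430,
  promptHash: promptDigest("..."),      // hash only; raw text never stored
};
```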