

ML, anomaly detection, and forecasting

TensorCost’s ML surface covers four jobs:
  1. Anomaly detection across GPU metrics and AI spend (statistical baseline + optional ML layer).
  2. Burn-rate alerting against budget hierarchies (in flight).
  3. Cost forecasting at tenant, application, and team scope.
  4. Recommendations — the four shipped Bedrock recommenders, plus their GPU-side counterparts.
This page is the customer-facing tour. Engineers and integrators looking for the algorithm details can dig into ADR-0011 (event store) and the per-recommender source under apps-new/services/ai-service/src/recommenders/.

Anomaly detection — statistical layer

Every metric stream — GPU utilization, memory, temperature, daily cost, hourly cost, agent invocation count, cache-hit rate — runs through three independent statistical detectors:
| Method | Trigger |
| --- | --- |
| Z-score | `\|value − μ\| > 3σ` against a rolling baseline |
| IQR | Value falls outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] |
| Rate-of-change | Δ exceeds 3σ of the historical Δ distribution |
An anomaly is confirmed when at least 2 of 3 methods agree (composite confidence ≥ 0.5). Day-of-week seasonal baselines avoid false positives on predictable patterns (training jobs that always spike Tuesday morning, etc.). Every detection writes a row to the analysis audit log with the inputs, scores, decision, and reason — useful for tuning thresholds and for SOC 2 processing-integrity evidence.
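The 2-of-3 vote can be sketched as follows. This is an illustrative outline of the three detectors and the voting rule described above, not the shipped implementation; helper names are hypothetical.

```typescript
// Illustrative sketch of the three statistical detectors and the
// 2-of-3 composite vote. Function names are hypothetical.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stddev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Detector 1: |value − μ| > 3σ against the rolling baseline.
function zScoreFlag(value: number, baseline: number[]): boolean {
  return Math.abs(value - mean(baseline)) > 3 * stddev(baseline);
}

// Detector 2: value outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR].
function iqrFlag(value: number, baseline: number[]): boolean {
  const sorted = [...baseline].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return value < q1 - 1.5 * iqr || value > q3 + 1.5 * iqr;
}

// Detector 3: latest Δ exceeds 3σ of the historical Δ distribution.
function rateOfChangeFlag(value: number, baseline: number[]): boolean {
  const deltas = baseline.slice(1).map((x, i) => x - baseline[i]);
  const latestDelta = value - baseline[baseline.length - 1];
  return Math.abs(latestDelta - mean(deltas)) > 3 * stddev(deltas);
}

// Confirmed when at least 2 of 3 agree (composite confidence 2/3 ≥ 0.5).
function isAnomaly(value: number, baseline: number[]): boolean {
  const votes = [zScoreFlag, iqrFlag, rateOfChangeFlag]
    .filter((detect) => detect(value, baseline)).length;
  return votes >= 2;
}
```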

Anomaly detection — ML layer (opt-in)

Customers on growth and enterprise plans can enable a per-tenant Isolation Forest layer that composes with the statistical scores.
| Model | Source | What it catches |
| --- | --- | --- |
| gpu_anomaly | Per-reading GPU metrics + cyclical time + idle duration | Stuck workloads, hot GPUs, idle-but-running, sudden drops |
| ai_spend_anomaly | Hourly AI-spend aggregates: cost, request count, tokens, latency, unique models, rolling 24h mean | Cost spikes, runaway inference loops, new-provider surges |

Training

MLTrainingJob runs daily. For each tenant + model type, retrain triggers on:
| Trigger | Reason label |
| --- | --- |
| No active model | `no_active_model` |
| Active model older than ML_RETRAIN_INTERVAL_DAYS (default 7) | `model_stale_{N}d` |
| False-positive rate over the last window > 20% | `high_fp_rate_{N}pct` |
| Manual trigger from the ML models page | `manual` |
Minimum samples default to 1,000. For ai_spend_anomaly, one sample = one filled hour bucket (so a 14-day lookback caps at 336 samples; new tenants need to accumulate hours, or operators set ML_MIN_TRAINING_POINTS=200 until volume builds).
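The trigger evaluation can be sketched as a single check per tenant + model type. The state shape and function name below are assumptions for illustration, not the MLTrainingJob source:

```typescript
// Illustrative retrain-trigger check mirroring the table above.
// The ModelState shape is an assumption, not the shipped schema.
interface ModelState {
  active: boolean;
  ageDays: number;
  falsePositiveRate: number; // over the last window, 0..1
}

const ML_RETRAIN_INTERVAL_DAYS = 7; // default per the docs

// Returns the reason label for a retrain, or null if none applies.
// (The `manual` trigger comes from the ML models page, not this check.)
function retrainReason(state: ModelState | null): string | null {
  if (!state || !state.active) return "no_active_model";
  if (state.ageDays > ML_RETRAIN_INTERVAL_DAYS) {
    return `model_stale_${state.ageDays}d`;
  }
  if (state.falsePositiveRate > 0.2) {
    return `high_fp_rate_${Math.round(state.falsePositiveRate * 100)}pct`;
  }
  return null;
}
```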

Inference and ensemble

For each incoming metric:
  1. The statistical scorer runs.
  2. If ml-enabled is on and the model is loaded, the sidecar predicts.
  3. The composite score is a weighted average — ML_ENSEMBLE_WEIGHT defaults to 0.2 (20% ML, 80% statistical).
  4. An alert fires when the composite crosses the severity threshold.
The ML score is never the sole trigger; the statistical layer always runs.
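The weighted average in step 3 is simple enough to show directly. A minimal sketch, assuming the documented default weight of 0.2:

```typescript
// Weighted ensemble of statistical and ML scores, as described above.
// ML_ENSEMBLE_WEIGHT defaults to 0.2 (20% ML, 80% statistical).
const ML_ENSEMBLE_WEIGHT = 0.2;

function compositeScore(statScore: number, mlScore: number | null): number {
  // If the ML layer is off or the model is not loaded, the
  // statistical score stands alone.
  if (mlScore === null) return statScore;
  return (1 - ML_ENSEMBLE_WEIGHT) * statScore + ML_ENSEMBLE_WEIGHT * mlScore;
}
```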

Circuit breaker

All sidecar calls are wrapped in a circuit breaker. After 3 consecutive failures or 30s timeouts, the breaker opens for 60s and detection falls back to the statistical layer. The ML models admin page surfaces breaker state — CLOSED, HALF_OPEN, or OPEN.
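The breaker behavior described above (3 consecutive failures, 60s cooldown, HALF_OPEN probe) can be sketched like this. The class shape and injectable clock are illustrative, not the shipped code:

```typescript
// Minimal circuit-breaker sketch matching the documented behavior:
// 3 consecutive failures open the breaker for 60s; after the cooldown
// it is HALF_OPEN until a success closes it again.
type BreakerState = "CLOSED" | "HALF_OPEN" | "OPEN";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 3,
    private readonly cooldownMs = 60_000,
    private readonly now: () => number = Date.now, // injectable for testing
  ) {}

  state(): BreakerState {
    if (this.failures < this.failureThreshold) return "CLOSED";
    return this.now() - this.openedAt >= this.cooldownMs ? "HALF_OPEN" : "OPEN";
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures === this.failureThreshold) this.openedAt = this.now();
  }

  // Callers skip the sidecar (statistical fallback) unless this is true.
  allowRequest(): boolean {
    return this.state() !== "OPEN";
  }
}
```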

Runaway-loop and retry-storm detection

For agent workloads — where one bug can burn $10,000 overnight — TensorCost ships a dedicated runaway-loop detector alongside the generic anomaly detection. It triggers on:
  • Sudden invocation-count spike (>3σ from the per-agent baseline).
  • Unusually long invocation chains (same conversation_id exceeding N calls).
  • Cost-per-user-session spike vs the rolling baseline.
Alert payloads include the top 5 offending request IDs and a one-click Pause agent action where the customer has wired in their orchestrator. Alerts route through PagerDuty, Slack, email, or webhook per configuration.
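The three triggers can be sketched as a single per-window check. Field names, the chain-length limit N, and the cost-spike multiplier below are illustrative assumptions:

```typescript
// Illustrative evaluation of the three runaway-loop triggers above.
// The window shape and thresholds are assumptions for this sketch.
interface AgentWindow {
  invocations: number;          // invocation count in the current window
  baselineMean: number;         // per-agent invocation baseline
  baselineStd: number;
  longestChain: number;         // max calls sharing one conversation_id
  costPerSession: number;       // current cost per user session
  baselineCostPerSession: number;
}

const MAX_CHAIN_LENGTH = 50;     // illustrative value of N
const COST_SPIKE_MULTIPLIER = 3; // illustrative spike threshold

function runawayTriggers(w: AgentWindow): string[] {
  const reasons: string[] = [];
  if (w.invocations > w.baselineMean + 3 * w.baselineStd) {
    reasons.push("invocation_spike"); // >3σ from the per-agent baseline
  }
  if (w.longestChain > MAX_CHAIN_LENGTH) {
    reasons.push("long_chain"); // same conversation_id exceeding N calls
  }
  if (w.costPerSession > COST_SPIKE_MULTIPLIER * w.baselineCostPerSession) {
    reasons.push("cost_per_session_spike");
  }
  return reasons;
}
```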

Cost forecasting

cost-service produces 30-, 60-, and 90-day forecasts at three scopes:
  • Tenant total
  • Application
  • Team
Forecasts are produced by an ensemble of seasonal-naïve and ARIMA-style models, blended by recent error rates. Each forecast carries a confidence interval and a “drift score” — high drift means the forecast is less reliable, typically because a workload pattern just changed (a new agent rolled out, a fine-tune started, a Bedrock provisioned-throughput contract activated). When the forecasted month-end exceeds the budget, a budget breach predicted event fires (see real-time events) and the configured channels are notified.
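One common way to blend forecasters "by recent error rates" is inverse-error weighting; the sketch below assumes that scheme (the actual blend is not specified here), with model names and shapes left abstract:

```typescript
// Hypothetical blend of per-model point forecasts by inverse recent
// mean-absolute-error. The weighting scheme is an assumption; the docs
// only say the ensemble is "blended by recent error rates".
function blendForecasts(
  forecasts: number[],  // one point forecast per model
  recentMae: number[],  // recent mean absolute error per model
): number {
  const weights = recentMae.map((e) => 1 / (e + 1e-9)); // lower error, higher weight
  const total = weights.reduce((a, b) => a + b, 0);
  return forecasts.reduce((acc, f, i) => acc + f * (weights[i] / total), 0);
}
```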

Recommendations

Recommendations are computed by per-domain recommenders. Four are shipped today against Bedrock; the same patterns apply to Azure OpenAI / Vertex / OpenAI / Anthropic as those adapters land.

Bedrock recommenders (shipped)

| Recommender | What it surfaces |
| --- | --- |
| Model routing | Prompts routed to flagship models that cluster like efficient-model traffic; proposes Opus → Haiku, GPT-4o → GPT-4o-mini, etc., with sample request IDs and projected savings |
| Prompt cache | Repeated prefix patterns where Bedrock prompt caching would cut input-token cost ~90%, with the exact code/SDK change |
| Provisioned-throughput break-even | On-demand vs Provisioned Throughput math per (model, application); flags both under- and over-commitment |
| Runaway-loop | The detector above, surfacing as recommendations + alerts |
Each recommendation carries:
  • A specific dollar impact estimate.
  • The evidence (sample request IDs, cost breakdown, A/B plan).
  • An accept / dismiss-with-reason / snooze action.
Acceptance writes a row to the savings ledger; verified savings populate after the 30-day verification window. Verification methodology is documented in the Savings methodology PDF (downloadable from the dashboard).
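The provisioned-throughput break-even comparison reduces to a monthly-cost inequality. The sketch below is illustrative: the prices, the 730-hour month, and the function shape are assumptions, not Bedrock list pricing or the shipped recommender:

```typescript
// Break-even sketch for on-demand vs provisioned throughput per
// (model, application). All numbers are placeholders.
function provisionedIsCheaper(
  monthlyTokens: number,
  onDemandPricePerKTokens: number, // $ per 1,000 tokens on demand
  provisionedHourlyRate: number,   // $ per hour of committed capacity
  hoursPerMonth = 730,
): boolean {
  const onDemandCost = (monthlyTokens / 1_000) * onDemandPricePerKTokens;
  const provisionedCost = provisionedHourlyRate * hoursPerMonth;
  return provisionedCost < onDemandCost;
}
```

Running the same comparison in both directions is what lets the recommender flag over-commitment (provisioned capacity costing more than the observed on-demand equivalent) as well as under-commitment.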

GPU recommenders

| Recommender | What it surfaces |
| --- | --- |
| Idle-instance auto-stop | EC2 / GKE instances below the idle threshold for N minutes |
| MIG slice rightsizing | A100/H100/H200/B200 MIG profiles that don’t match observed load |
| Spot blending | Training jobs with checkpointing that could move to spot |
| RI / Savings Plan / Committed Use | Multi-year commitment optimization across your contract surface |
| GPU-type swap | “Your workload runs 3× faster on H100 at 2× the hourly cost, a net ~33% saving” |
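The GPU-type swap arithmetic: if a job runs `speedup`× faster at `priceRatio`× the hourly price, its relative cost is priceRatio / speedup. A tiny illustrative helper (not the shipped recommender):

```typescript
// Net savings from moving a fixed job to a faster, pricier GPU.
// Relative cost = priceRatio / speedup; savings = 1 − relative cost.
function netSavingsPct(speedup: number, priceRatio: number): number {
  return (1 - priceRatio / speedup) * 100;
}
```

For example, 3× faster at 2× the hourly price gives a relative cost of 2/3, i.e. roughly a 33% saving; a negative result means the swap would cost more.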

Live training and recommendation events

Every job and recommender publishes Socket.IO events:
  • ml.training.started, ml.training.completed, ml.training.failed
  • recommendation.created, recommendation.accepted, recommendation.dismissed
  • forecast.generated, forecast.budget_breach_predicted
  • anomaly.detected, anomaly.resolved
  • runaway_loop.detected, runaway_loop.paused
See real-time events for the full catalog and consumption patterns.
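A consumer-side sketch: the event names above as a typed union with a minimal dispatcher. The transport is omitted so the example stays self-contained; wiring these handlers to the actual Socket.IO connection follows the real-time events docs, and the dispatcher shape here is an assumption:

```typescript
// The documented event names as a union type, plus a tiny dispatcher
// illustrating how a consumer might fan events out to handlers.
type TensorCostEvent =
  | "ml.training.started" | "ml.training.completed" | "ml.training.failed"
  | "recommendation.created" | "recommendation.accepted" | "recommendation.dismissed"
  | "forecast.generated" | "forecast.budget_breach_predicted"
  | "anomaly.detected" | "anomaly.resolved"
  | "runaway_loop.detected" | "runaway_loop.paused";

type Handler = (payload: unknown) => void;

function makeDispatcher() {
  const handlers = new Map<TensorCostEvent, Handler[]>();
  return {
    on(event: TensorCostEvent, handler: Handler): void {
      handlers.set(event, [...(handlers.get(event) ?? []), handler]);
    },
    emit(event: TensorCostEvent, payload: unknown): void {
      for (const h of handlers.get(event) ?? []) h(payload);
    },
  };
}
```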

The ML models admin page

Admins on plans with the ML layer get an ML models page that shows:
  • Whether ml-enabled is active for the tenant.
  • Active models per type with version, sample count, accuracy metrics.
  • Ensemble weight.
  • Circuit breaker state.
  • Per-model retraining status with reason labels.
  • A train now button per model.
Live training progress streams in via Socket.IO so the page reflects starts, completions, and failures without a page reload.

Data minimization

The recommenders see model IDs, token counts, latency, and (where the customer enabled invocation-log capture) request IDs and hashes of prompts/responses. Raw prompts and responses are never stored. Customers concerned about content can disable request-metadata capture entirely and still get model/cost/latency-level recommendations.
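A hash in place of content can be as simple as a SHA-256 digest, which lets repeated prompts be correlated (for example, by the prompt-cache recommender) without retaining the text. The exact scheme is not documented here; this is an illustrative sketch:

```typescript
// Illustrative prompt fingerprinting: store a SHA-256 digest instead of
// the raw prompt. The actual hashing scheme is not specified in the docs.
import { createHash } from "node:crypto";

function promptFingerprint(prompt: string): string {
  return createHash("sha256").update(prompt, "utf8").digest("hex");
}
```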