Agent installation
The unified TensorCost agent runs inside your compute footprint and streams GPU + workload metrics to TensorCost over a long-lived gRPC stream. This page is the customer-side install playbook.
Onboarding budget is ≤15 minutes from “I have a tenant” to “the dashboard is showing my fleet.” If a step here takes longer, page support: that’s a friction bug we want to fix.
What the agent does
- Detects EC2 / EKS / GKE / AKS / bare-metal context automatically via IMDSv2 (where available).
- Establishes a long-lived gRPC stream to TensorCost on TCP/50051 (TLS-terminated NLB).
- Authenticates with an HMAC-SHA256 signed AgentHello (tenant ID, agent ID, nonce, timestamp). The server enforces ±300s clock skew and Redis-backed nonce replay protection.
- Pushes GPU health, MIG topology, NVML samples, container/pod context, and Spot interruption events.
- Receives commands (start/stop/scale, drain, run-policy) only when the tenant admin has explicitly granted the agent execution scope.
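The signed hello described above can be illustrated with a minimal sketch. The field layout (newline-joined fields), the function names, and the in-memory nonce set standing in for Redis are all assumptions made for illustration; the actual AgentHello wire format is not specified on this page.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # the ±300s clock-skew window described above


def sign_hello(secret: bytes, tenant_id: str, agent_id: str,
               nonce: str, timestamp: int) -> str:
    """Hex HMAC-SHA256 over the hello fields (hypothetical canonical layout)."""
    message = "\n".join([tenant_id, agent_id, nonce, str(timestamp)]).encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def build_hello(secret: bytes, tenant_id: str, agent_id: str,
                now: Optional[int] = None) -> dict:
    """Agent side: attach a fresh random nonce and the current timestamp."""
    timestamp = int(time.time()) if now is None else now
    nonce = secrets.token_hex(16)
    return {
        "tenant_id": tenant_id,
        "agent_id": agent_id,
        "nonce": nonce,
        "timestamp": timestamp,
        "signature": sign_hello(secret, tenant_id, agent_id, nonce, timestamp),
    }


def verify_hello(secret: bytes, hello: dict, seen_nonces: set,
                 now: Optional[int] = None) -> bool:
    """Server side: check signature, then clock skew, then nonce reuse."""
    now = int(time.time()) if now is None else now
    expected = sign_hello(secret, hello["tenant_id"], hello["agent_id"],
                          hello["nonce"], hello["timestamp"])
    # Constant-time comparison avoids leaking how much of the digest matched.
    if not hmac.compare_digest(expected, hello["signature"]):
        return False
    if abs(now - hello["timestamp"]) > MAX_SKEW_SECONDS:
        return False
    if hello["nonce"] in seen_nonces:
        return False  # replayed hello
    # Stand-in for the Redis-backed store; a real one would expire entries.
    seen_nonces.add(hello["nonce"])
    return True
```

In this sketch a hello that verifies once is rejected on a second presentation, and any hello whose timestamp is more than 300 seconds from the verifier's clock is rejected even with a valid signature.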
Requirements
| Item | Detail |
|---|---|
| Outbound network | TCP/50051 to your regional gRPC NLB (e.g. grpc.us-east-1.tensorcost.com); TCP/443 to api.tensorcost.com for fallback HTTPS sync |
| Privileges | Read-only NVML access (host /usr/lib/x86_64-linux-gnu/nvidia or --gpus all in Docker); container or pod metadata; no write privileges required |
| Cloud IAM (optional) | Read-only role for cost-explorer / SageMaker / Bedrock pulls; covered by the same CFN stack as the role install |
| Container runtime | Docker, containerd, or systemd unit |
| Architecture | linux/amd64 and linux/arm64 images shipped |
Pick your install path
CloudFormation (one-click)
Best for AWS customers running EC2 / ECS. Deploys IAM role + external ID + ECS task definition.
Terraform module
Best for IaC shops. Maintained on the public registry under vaadh-labs/tensorcost-agent.
Helm chart
Best for EKS / GKE / AKS. DaemonSet with NVML host mounts.
Path A — CloudFormation one-click
Generate an external ID
In the TensorCost dashboard, open Integrations → Add agent and copy the auto-generated ExternalId. The wizard also shows your TenantId and the platform BackendAccountId.
Launch the stack
Click Launch stack in the wizard, or launch the same stack from the CLI.
The stack creates exactly one IAM role (TensorCost-AgentRunner-<ExternalId>) with the minimum read-only EC2 / CloudWatch / Cost Explorer policy. In OnboardingMode=Organization it also attaches a managed policy that enables jump-role access into member accounts via OrganizationAccountAccessRole.
Paste the role ARN back into the wizard
Click Validate. The wizard polls STS AssumeRole and a CloudWatch read for up to two minutes. Green = done.
Path B — Terraform module
Paste the value of module.tensorcost_agent.role_arn into the wizard’s Validate step.
Path C — Helm chart (Kubernetes)
The chart deploys a DaemonSet with NVML host mounts and tolerations for nvidia.com/gpu. Override per-cluster values with a values.yaml if you have non-default GPU resource names (e.g. amd.com/gpu).
How authentication works
- HMAC pepper is per-deployment (AWS Secrets Manager) and rotated quarterly per the secret rotation runbook.
- Per-key rotation: every credential refresh issues a new key_id; revoked keys are blacklisted at the verifier.
- The escape hatch GPU_GRPC_ALLOW_UNREGISTERED_AGENTS=true was removed; every agent must carry a verified hello.
0.0.0.0/0 at the NLB SG with documented compensating controls — see ADR-0010 (linked from SOC 2 readiness).
Verification — did the agent connect?
Open Agents
Sidebar → Agents. The new agent appears within five minutes with status='ok', last-seen timestamp, and detected platform (EC2 / EKS / bare-metal).
Confirm metrics flow
Sidebar → GPU fleet. Within 15 minutes you should see GPU utilization, memory, and temperature time series.
Configuration reference
The agent reads YAML (--config /etc/tensorcost/agent.yaml) or environment variables. Env vars take precedence.
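The precedence rule can be sketched as follows: for each setting, an environment variable wins, then the YAML value, then the documented default. This is an illustration only, not the agent's actual loader. The `resolve` and `lookup` helpers are hypothetical, and the YAML file is represented as an already-parsed nested dict to keep the example dependency-free. The three settings shown are taken from the Core table.

```python
import os
from typing import Any, Mapping, Optional

# (env var, dotted YAML key, default); a default of None marks a required setting.
SETTINGS = [
    ("TENSORCOST_TENANT_ID", "tenant.id", None),
    ("TENSORCOST_HEARTBEAT_INTERVAL", "agent.heartbeat_interval", "15s"),
    ("TENSORCOST_LOG_LEVEL", "logging.level", "info"),
]


def lookup(tree: Mapping[str, Any], dotted_key: str) -> Optional[Any]:
    """Walk a nested dict along a dotted key; return None if any part is missing."""
    node: Any = tree
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return node


def resolve(yaml_tree: Mapping[str, Any],
            env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve each setting: env var first, then YAML, then the default."""
    env = os.environ if env is None else env
    resolved = {}
    for env_name, yaml_key, default in SETTINGS:
        if env_name in env:
            value = env[env_name]
        else:
            value = lookup(yaml_tree, yaml_key)
            if value is None:
                value = default
        if value is None:
            raise ValueError(f"{yaml_key} is required (or set {env_name})")
        resolved[yaml_key] = value
    return resolved
```

With `logging.level: debug` in the YAML and `TENSORCOST_LOG_LEVEL=warn` in the environment, this sketch resolves the log level to `warn`.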
Core
| Env var | YAML key | Default | Purpose |
|---|---|---|---|
| TENSORCOST_TENANT_ID | tenant.id | (required) | Your tenant UUID |
| TENSORCOST_EXTERNAL_ID | tenant.external_id | (required) | The wizard-issued external ID |
| TENSORCOST_AGENT_NAME | agent.name | hostname | Human-readable name in the Agents view |
| TENSORCOST_GRPC_TARGET | grpc.target | regional default | gRPC endpoint, e.g. grpc.us-east-1.tensorcost.com:50051 |
| TENSORCOST_TLS_CA | grpc.tls_ca | system | Override CA bundle (rare) |
| TENSORCOST_HEARTBEAT_INTERVAL | agent.heartbeat_interval | 15s | Heartbeat cadence |
| TENSORCOST_LOG_LEVEL | logging.level | info | trace, debug, info, warn, error |
| OTEL_ENABLED | telemetry.otel.enabled | false | OpenTelemetry export |
| OTEL_EXPORTER_OTLP_ENDPOINT | telemetry.otel.endpoint | http://localhost:4317 | OTLP target |
NVML / GPU sampling
| Env var | YAML key | Default | Purpose |
|---|---|---|---|
| NVML_ENABLED | nvml.enabled | true (host install) | Enable local GPU sampling |
| NVML_SAMPLE_INTERVAL | nvml.sample_interval | 10s | Sampling cadence |
| NVML_IDLE_GPU_THRESHOLD | nvml.idle_gpu_threshold | 10 | Utilization % below this counts as idle |
| NVML_IDLE_DURATION | nvml.idle_duration | 120s | Time idle before flagging |
| MIG_ENABLED | mig.enabled | auto | Detect MIG slices on A100/H100/H200/B200 |
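To show how the idle threshold and idle duration interact, here is a minimal sketch of one plausible reading: a GPU is flagged once every sample has stayed below the threshold for at least the idle duration, and any sample at or above the threshold resets the timer. The `IdleDetector` class is hypothetical and the agent's real logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleDetector:
    threshold_pct: float = 10.0     # NVML_IDLE_GPU_THRESHOLD default
    idle_duration_s: float = 120.0  # NVML_IDLE_DURATION default
    _idle_since: Optional[float] = None

    def observe(self, timestamp_s: float, utilization_pct: float) -> bool:
        """Feed one sample; return True once the GPU has been idle long enough."""
        if utilization_pct >= self.threshold_pct:
            # A busy sample ends the idle streak.
            self._idle_since = None
            return False
        if self._idle_since is None:
            self._idle_since = timestamp_s
        return timestamp_s - self._idle_since >= self.idle_duration_s
```

At the default 10s sampling cadence, a streak that starts at t=0 is first flagged by the sample at t=120, which is the thirteenth consecutive low sample.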
Cloud (optional, when running outside the IAM-role context)
If the agent runs on EC2 with an instance profile, leave the cloud blocks empty: the agent picks the values up from IMDS. Override only when running in a foreign environment.
| Env var | YAML key | Purpose |
|---|---|---|
| AWS_ENABLED | clouds.aws.enabled | Enable AWS pulls |
| AWS_REGION | clouds.aws.region | Override default region |
| AZURE_ENABLED | clouds.azure.enabled | Enable Azure pulls |
| AZURE_SUBSCRIPTION_ID | clouds.azure.subscription_id | Subscription ID |
| GCP_ENABLED | clouds.gcp.enabled | Enable GCP pulls |
| GCP_PROJECT_ID | clouds.gcp.project_id | Project ID |
| K8S_ENABLED | clouds.kubernetes.enabled | Enable K8s pod/namespace tagging |
Upgrades
Upgrades are customer-driven: we don’t push agent binaries into your environment. The recommended upgrade path matches your install path.
Releases are published at github.com/vaadh-labs/tensorcost-agent/releases. Critical security patches are tagged and announced via your tenant’s notification channel.
Common day-1 failures
| Symptom | Likely cause | Remediation |
|---|---|---|
| STS AssumeRole AccessDenied (ExternalId) | Wizard’s ExternalId doesn’t match the CFN parameter | Re-deploy CFN with the wizard-shown ExternalId, or update the wizard connection row |
| STS AssumeRole AccessDenied (no ExternalId mention) | Default TensorCostBackendAccountId=000000000000 left in CFN | Re-deploy with the platform account ID shown in the wizard |
| agent gRPC handshake: HMAC mismatch | Pasted credentials with leading/trailing whitespace | Re-mint credentials, paste cleanly |
| agent shows status='unknown' for >30 min | Outbound TCP/50051 blocked by firewall / corporate proxy | Allowlist the regional gRPC NLB; verify curl -vI https://api.tensorcost.com/health works from the host |
| NVML not available | NVIDIA drivers missing or container lacks --gpus all / device plugin | Mount /usr/lib/x86_64-linux-gnu/nvidia read-only; on K8s, ensure the node has the NVIDIA device plugin |
| 0 events ingested for >24h | Agent registered but no metric emit (rare) | Run --self-test; check /var/log/tensorcost-agent.log for errors |
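For the blocked-port case in the table above, a quick way to separate a firewall problem from an agent problem is to attempt a plain TCP connection from the host. The sketch below uses only the Python standard library. The `can_reach` helper is hypothetical, and it checks TCP reachability only, not the TLS or gRPC handshake.

```python
import socket


def can_reach(host: str, port: int = 50051, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Example endpoint from the Requirements table; substitute your region's NLB.
    target = "grpc.us-east-1.tensorcost.com"
    print(f"{target}:50051 reachable: {can_reach(target)}")
```

If this returns False while the HTTPS health check succeeds, the likely culprit is an egress rule that allows TCP/443 but not TCP/50051.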
Data the agent collects
| Class | Cadence | Examples |
|---|---|---|
| Inventory | Every 5 min | Instance ID, type, region, GPU model, MIG profile, tags |
| GPU metrics | 10s NVML sample, batched every 60s | Utilization, memory, temperature, power, clock, ECC errors |
| Container/pod context | On change | Namespace, pod name, deployment, requested vs allocated GPU |
| Spot signals | On event | Interruption notice, hibernation, capacity rebalance |
| Health | Every 5 min | Uptime, queue depth, error counts |
/var/log/tensorcost-agent.log, capped at 100 MB).
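The GPU metrics cadence in the table above (a 10s NVML sample, batched every 60s) works out to six samples per batch. A minimal sketch of that grouping follows, assuming time-ordered samples and a hypothetical `batch_samples` helper; how the agent actually batches is not described on this page.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, GPU utilization %)


def batch_samples(samples: Iterable[Sample],
                  window_s: float = 60.0) -> Iterator[List[Sample]]:
    """Group time-ordered samples into consecutive windows of window_s seconds."""
    batch: List[Sample] = []
    window_start: Optional[float] = None
    for ts, value in samples:
        if window_start is None:
            window_start = ts
        if ts - window_start >= window_s:
            # The current window is full; emit it and start a new one here.
            yield batch
            batch, window_start = [], ts
        batch.append((ts, value))
    if batch:
        yield batch  # flush the final, possibly partial, window
```

Feeding three minutes of samples taken every 10 seconds yields three batches of six samples each.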