Agent installation

The unified TensorCost agent runs inside your compute footprint and streams GPU + workload metrics to TensorCost over a long-lived gRPC stream. This page is the customer-side install playbook.
Onboarding budget is ≤15 minutes from “I have a tenant” to “the dashboard is showing my fleet.” If a step here takes longer, page support — that’s a friction bug we want to fix.

What the agent does

  • Detects EC2 / EKS / GKE / AKS / bare-metal context automatically via IMDSv2 (where available).
  • Establishes a long-lived gRPC stream to TensorCost on TCP/50051 (TLS-terminated NLB).
  • Authenticates with HMAC-SHA256 signed AgentHello (tenant ID, agent ID, nonce, timestamp). Server enforces ±300s clock skew and Redis-backed nonce replay protection.
  • Pushes GPU health, MIG topology, NVML samples, container/pod context, and Spot interruption events.
  • Receives commands (start/stop/scale, drain, run-policy) only when the tenant admin has explicitly granted the agent execution scope.

Requirements

Item | Detail
Outbound network | TCP/50051 to your regional gRPC NLB (e.g. grpc.us-east-1.tensorcost.com); TCP/443 to api.tensorcost.com for fallback HTTPS sync
Privileges | Read-only NVML access (host /usr/lib/x86_64-linux-gnu/nvidia or --gpus all in Docker); container or pod metadata; no write privileges required
Cloud IAM (optional) | Read-only role for cost-explorer / SageMaker / Bedrock pulls; covered by the same CFN stack as the role install
Container runtime | Docker, containerd, or systemd unit
Architecture | linux/amd64 and linux/arm64 images shipped

Pick your install path

CloudFormation (one-click)

Best for AWS customers running EC2 / ECS. Deploys IAM role + external ID + ECS task definition.

Terraform module

Best for IaC shops. Maintained on the public registry under vaadh-labs/tensorcost-agent.

Helm chart

Best for EKS / GKE / AKS. DaemonSet with NVML host mounts.

Path A — CloudFormation one-click

1. Generate an external ID

In the TensorCost dashboard, open Integrations → Add agent and copy the auto-generated ExternalId. The wizard also shows your TenantId and the platform BackendAccountId.

2. Launch the stack

Click Launch stack in the wizard, or run:
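# aws cloudformation deploy only accepts a local --template-file, so fetch the template first.
# BACKEND_ACCOUNT_ID is the platform BackendAccountId shown in the wizard.
curl -fsSLo agent-stack.yml https://downloads.tensorcost.com/cfn/agent-stack.yml
aws cloudformation deploy \
  --template-file agent-stack.yml \
  --stack-name tensorcost-agent \
  --parameter-overrides \
      TenantId="$TENANT_ID" \
      ExternalId="$EXTERNAL_ID" \
      TensorCostBackendAccountId="$BACKEND_ACCOUNT_ID" \
      OnboardingMode=SingleAccount \
  --capabilities CAPABILITY_NAMED_IAM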
The stack creates exactly one IAM role (TensorCost-AgentRunner-<ExternalId>) with the minimum read-only EC2 / CloudWatch / Cost Explorer policy. In OnboardingMode=Organization it also attaches a managed policy that enables jump-role access into member accounts via OrganizationAccountAccessRole.

3. Paste the role ARN back into the wizard

Click Validate. The wizard polls an STS AssumeRole call and a CloudWatch read for up to two minutes. Green = done.

4. Confirm the agent connected

Open Agents in the sidebar. Within five minutes of a successful CFN deploy, your fleet should appear with status='ok'.

Path B — Terraform module

module "tensorcost_agent" {
  source     = "vaadh-labs/tensorcost-agent/aws"
  version    = "~> 1.0"

  tenant_id           = var.tensorcost_tenant_id
  external_id         = var.tensorcost_external_id
  onboarding_mode     = "SingleAccount"   # or "Organization"
  agent_image_tag     = "v1.4.2"

  enable_bedrock_read = true
  enable_cur_read     = true
  cur_bucket_name     = "my-cur-bucket"
}
Apply, then paste module.tensorcost_agent.role_arn into the wizard’s Validate step.

Path C — Helm chart (Kubernetes)

helm repo add tensorcost https://charts.tensorcost.com
helm repo update

helm upgrade --install tensorcost-agent tensorcost/agent \
  --namespace tensorcost --create-namespace \
  --set tenant.id=$TENANT_ID \
  --set tenant.externalId=$EXTERNAL_ID \
  --set image.tag=v1.4.2 \
  --set telemetry.otel.enabled=true
The chart ships as a DaemonSet with NVML host mounts and tolerations for nvidia.com/gpu. Override per-cluster values with a values.yaml if you have non-default GPU resource names (e.g. amd.com/gpu).

How authentication works

  • HMAC pepper is per-deployment (AWS Secrets Manager) and rotated quarterly per the secret rotation runbook.
  • Per-key rotation: every credential refresh issues a new key_id; revoked keys are blacklisted at the verifier.
  • The escape hatch GPU_GRPC_ALLOW_UNREGISTERED_AGENTS=true was removed; every agent must carry a verified hello.
For the full ingress design — including why we accept 0.0.0.0/0 at the NLB SG with documented compensating controls — see ADR-0010 (linked from SOC 2 readiness).
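
To make the handshake checks concrete, here is a minimal Python sketch of signing and verifying an AgentHello-style message. The field order, the "|" separator, the hex encoding, the helper names, and the in-memory nonce set (standing in for the Redis-backed store) are all assumptions for illustration; the agent's actual wire format is not documented on this page.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # server-side clock-skew tolerance stated on this page


def sign_hello(secret: bytes, tenant_id: str, agent_id: str,
               nonce: str, timestamp: int) -> str:
    """Hex HMAC-SHA256 over the hello fields (hypothetical field layout)."""
    message = "|".join([tenant_id, agent_id, nonce, str(timestamp)]).encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def make_hello(secret: bytes, tenant_id: str, agent_id: str,
               timestamp: Optional[int] = None) -> dict:
    """Build a signed hello with a fresh random nonce."""
    ts = int(time.time()) if timestamp is None else timestamp
    nonce = secrets.token_hex(16)
    return {
        "tenant_id": tenant_id,
        "agent_id": agent_id,
        "nonce": nonce,
        "timestamp": ts,
        "signature": sign_hello(secret, tenant_id, agent_id, nonce, ts),
    }


def verify_hello(secret: bytes, hello: dict, seen_nonces: set,
                 now: Optional[int] = None) -> bool:
    """Check clock skew, then nonce reuse, then the signature."""
    now = int(time.time()) if now is None else now
    if abs(now - hello["timestamp"]) > MAX_SKEW_SECONDS:
        return False  # outside the allowed clock skew
    if hello["nonce"] in seen_nonces:
        return False  # replayed hello
    expected = sign_hello(secret, hello["tenant_id"], hello["agent_id"],
                          hello["nonce"], hello["timestamp"])
    if not hmac.compare_digest(expected, hello["signature"]):
        return False  # signature mismatch
    seen_nonces.add(hello["nonce"])  # a real store would expire entries
    return True
```

One practical consequence of the skew check: a host whose clock has drifted by more than five minutes fails the handshake even with valid credentials, so it is worth confirming NTP sync before re-minting keys.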

Verification — did the agent connect?

1. Open Agents

Sidebar → Agents. The new agent appears within five minutes with status='ok', last-seen timestamp, and detected platform (EC2 / EKS / bare-metal).

2. Confirm metrics flow

Sidebar → GPU fleet. Within 15 minutes you should see GPU utilization, memory, and temperature time series.

3. Run a self-test

On the agent host:
docker exec tensorcost-agent /opt/tensorcost/bin/agent --self-test
The self-test exercises IMDS, HMAC signing, the gRPC handshake, and NVML, then reports a one-line PASS/FAIL.

Configuration reference

The agent reads YAML (--config /etc/tensorcost/agent.yaml) or environment variables. Env vars take precedence.

Core

Env var | YAML key | Default | Purpose
TENSORCOST_TENANT_ID | tenant.id | (required) | Your tenant UUID
TENSORCOST_EXTERNAL_ID | tenant.external_id | (required) | The wizard-issued external ID
TENSORCOST_AGENT_NAME | agent.name | hostname | Human-readable name in the Agents view
TENSORCOST_GRPC_TARGET | grpc.target | regional default | gRPC endpoint, e.g. grpc.us-east-1.tensorcost.com:50051
TENSORCOST_TLS_CA | grpc.tls_ca | system | Override CA bundle (rare)
TENSORCOST_HEARTBEAT_INTERVAL | agent.heartbeat_interval | 15s | Heartbeat cadence
TENSORCOST_LOG_LEVEL | logging.level | info | trace, debug, info, warn, error
OTEL_ENABLED | telemetry.otel.enabled | false | OpenTelemetry export
OTEL_EXPORTER_OTLP_ENDPOINT | telemetry.otel.endpoint | http://localhost:4317 | OTLP target
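
The precedence rule (environment variable first, then YAML, then the default) can be sketched in Python as follows. This is an illustration only, not the agent's own loader: the `SETTINGS` list, `lookup`, and `resolve` are hypothetical names, the YAML file is represented by an already-parsed nested dict, and how the agent treats an empty environment variable is not specified here.

```python
import os
from typing import Any, Mapping, Optional

# (env var, dotted YAML key, default) for a few Core settings from the table.
SETTINGS = [
    ("TENSORCOST_TENANT_ID", "tenant.id", None),             # required
    ("TENSORCOST_EXTERNAL_ID", "tenant.external_id", None),  # required
    ("TENSORCOST_HEARTBEAT_INTERVAL", "agent.heartbeat_interval", "15s"),
    ("TENSORCOST_LOG_LEVEL", "logging.level", "info"),
]


def lookup(config: Mapping[str, Any], dotted_key: str) -> Optional[Any]:
    """Walk a nested mapping using a dotted key such as 'tenant.id'."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return node


def resolve(yaml_config: Mapping[str, Any],
            environ: Mapping[str, str] = os.environ) -> dict:
    """Resolve each setting: env var first, then YAML, then the default."""
    resolved = {}
    for env_name, key, default in SETTINGS:
        if env_name in environ:
            resolved[key] = environ[env_name]
        else:
            from_yaml = lookup(yaml_config, key)
            resolved[key] = default if from_yaml is None else from_yaml
    return resolved
```

With a YAML file that sets tenant.id and an environment that also sets TENSORCOST_TENANT_ID, the environment value is the one that ends up in the resolved settings.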

NVML / GPU sampling

Env var | YAML key | Default | Purpose
NVML_ENABLED | nvml.enabled | true (host install) | Enable local GPU sampling
NVML_SAMPLE_INTERVAL | nvml.sample_interval | 10s | Sampling cadence
NVML_IDLE_GPU_THRESHOLD | nvml.idle_gpu_threshold | 10 | Utilization (%) below which a GPU counts as idle
NVML_IDLE_DURATION | nvml.idle_duration | 120s | Time idle before flagging
MIG_ENABLED | mig.enabled | auto | Detect MIG slices on A100/H100/H200/B200
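
As a rough illustration of how the idle threshold and idle duration interact, the Python sketch below flags a GPU once its utilization has stayed under the threshold for the full duration. The `IdleDetector` class is hypothetical, and two details are assumptions: that "below" is a strict comparison, and that any single sample at or above the threshold resets the idle timer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdleDetector:
    threshold_pct: float = 10.0      # NVML_IDLE_GPU_THRESHOLD default
    idle_duration_s: float = 120.0   # NVML_IDLE_DURATION default
    _idle_since: Optional[float] = None

    def observe(self, timestamp_s: float, utilization_pct: float) -> bool:
        """Feed one sample; return True once the GPU has been idle long enough."""
        if utilization_pct >= self.threshold_pct:
            self._idle_since = None  # busy sample resets the idle window
            return False
        if self._idle_since is None:
            self._idle_since = timestamp_s  # first idle sample starts the window
        return timestamp_s - self._idle_since >= self.idle_duration_s
```

Under these assumptions and the default 10s sample interval, a GPU that goes quiet is flagged on the thirteenth consecutive idle sample, 120 seconds after the first one.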

Cloud (optional, when running outside the IAM-role context)

If the agent runs on EC2 with an instance profile, leave the cloud blocks empty — IMDS picks them up. Override only when running in a foreign environment.
Env var | YAML key | Purpose
AWS_ENABLED | clouds.aws.enabled | Enable AWS pulls
AWS_REGION | clouds.aws.region | Override default region
AZURE_ENABLED | clouds.azure.enabled | Enable Azure pulls
AZURE_SUBSCRIPTION_ID | clouds.azure.subscription_id | Subscription ID
GCP_ENABLED | clouds.gcp.enabled | Enable GCP pulls
GCP_PROJECT_ID | clouds.gcp.project_id | Project ID
K8S_ENABLED | clouds.kubernetes.enabled | Enable K8s pod/namespace tagging

Upgrades

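Upgrades are customer-driven: we don’t push agent binaries into your environment. The recommended upgrade path matches your install path. For a CloudFormation install, change AgentImageTag and pass UsePreviousValue=true for every other parameter you set at install time (update-stack takes --parameters, not --parameter-overrides):
aws cloudformation update-stack \
  --stack-name tensorcost-agent \
  --use-previous-template \
  --parameters \
      ParameterKey=AgentImageTag,ParameterValue=v1.5.0 \
      ParameterKey=TenantId,UsePreviousValue=true \
      ParameterKey=ExternalId,UsePreviousValue=true \
      ParameterKey=TensorCostBackendAccountId,UsePreviousValue=true \
      ParameterKey=OnboardingMode,UsePreviousValue=true \
  --capabilities CAPABILITY_NAMED_IAM
For Terraform or Helm installs, the same version bump applies to agent_image_tag or image.tag respectively, followed by a re-apply.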
We publish release notes for every agent version at github.com/vaadh-labs/tensorcost-agent/releases. Critical security patches are tagged and announced via your tenant’s notification channel.

Common day-1 failures

Symptom | Likely cause | Remediation
STS AssumeRole AccessDenied (ExternalId) | Wizard’s ExternalId doesn’t match the CFN parameter | Re-deploy CFN with the wizard-shown ExternalId, or update the wizard connection row
STS AssumeRole AccessDenied (no ExternalId mention) | Default TensorCostBackendAccountId=000000000000 left in CFN | Re-deploy with the platform account ID shown in the wizard
agent gRPC handshake: HMAC mismatch | Pasted credentials with leading/trailing whitespace | Re-mint credentials, paste cleanly
agent shows status='unknown' for >30 min | Outbound TCP/50051 blocked by firewall / corporate proxy | Allowlist the regional gRPC NLB; verify curl -vI https://api.tensorcost.com/health works from the host
NVML not available | NVIDIA drivers missing or container lacks --gpus all / device plugin | Mount /usr/lib/x86_64-linux-gnu/nvidia read-only; on K8s, ensure the node has the NVIDIA device plugin
0 events ingested for >24h | Agent registered but no metric emit (rare) | Run --self-test; check /var/log/tensorcost-agent.log for errors
When stuck, page support via your shared Slack Connect channel, or email support@tensorcost.com with the agent log and the connection’s sync-history drawer.
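
The curl check in the table only exercises TCP/443, so for the status='unknown' case it can help to probe TCP/50051 directly before contacting support. The Python sketch below is one way to do that with the standard library; `can_reach` is a hypothetical helper, not part of the agent. A successful connect shows only that the port is reachable, not that TLS or the gRPC handshake will succeed through an intercepting proxy.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run from the agent host, `can_reach("grpc.us-east-1.tensorcost.com", 50051)` returning False while `can_reach("api.tensorcost.com", 443)` returns True points at an egress rule that allows HTTPS but blocks the gRPC port. Substitute your own regional endpoint for the example hostname.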

Data the agent collects

Class | Cadence | Examples
Inventory | Every 5 min | Instance ID, type, region, GPU model, MIG profile, tags
GPU metrics | 10s NVML sample, batched every 60s | Utilization, memory, temperature, power, clock, ECC errors
Container/pod context | On change | Namespace, pod name, deployment, requested vs allocated GPU
Spot signals | On event | Interruption notice, hibernation, capacity rebalance
Health | Every 5 min | Uptime, queue depth, error counts
The agent is stateless — restart at any time without data loss. Outbound traffic is TLS-encrypted to the regional NLB. No customer data is written to disk except the rotating local log (/var/log/tensorcost-agent.log, capped at 100 MB).
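
For a sense of what the GPU metrics cadence in the table implies, a 10s sample interval with a 60s batch window yields six samples per batch. The Python sketch below shows one simple way such batching could work; `SampleBatcher` is a hypothetical illustration, and the agent's real buffering and flush logic may differ.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp_s, utilization_pct)


class SampleBatcher:
    """Buffer samples and hand back a batch once the flush window has elapsed."""

    def __init__(self, flush_interval_s: float = 60.0) -> None:
        self.flush_interval_s = flush_interval_s
        self._buffer: List[Sample] = []
        self._window_start: Optional[float] = None

    def add(self, sample: Sample) -> Optional[List[Sample]]:
        """Add one sample; return the completed batch when a window closes."""
        ts = sample[0]
        if self._window_start is None:
            self._window_start = ts
        if ts - self._window_start >= self.flush_interval_s:
            # Window elapsed: emit what was buffered and start a new window
            # with the current sample.
            batch, self._buffer = self._buffer, [sample]
            self._window_start = ts
            return batch
        self._buffer.append(sample)
        return None
```

In this sketch the buffer lives only in memory, which fits the stateless description above.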