Agent installation

The unified TensorCost agent runs inside your compute footprint and streams GPU + workload metrics to TensorCost over a long-lived gRPC stream. This page is the customer-side install playbook.

Onboarding budget is ≤15 minutes from “I have a tenant” to “the dashboard is showing my fleet.” If a step here takes longer, page support — that’s a friction bug we want to fix.

What the agent does

Detects EC2 / EKS / GKE / AKS / bare-metal context automatically via IMDSv2 (where available).
Establishes a long-lived gRPC stream to TensorCost on TCP/50051 (TLS-terminated NLB).
Authenticates with an HMAC-SHA256-signed AgentHello (tenant ID, agent ID, nonce, timestamp). The server rejects any hello whose timestamp is more than 300 s from its own clock, and a Redis-backed nonce store blocks replays.
Pushes GPU health, MIG topology, NVML samples, container/pod context, and Spot interruption events.
Receives commands (start/stop/scale, drain, run-policy) only when the tenant admin has explicitly granted the agent execution scope.
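
The handshake rules in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual wire format: the newline-joined message layout, the `sign_hello` and `HelloVerifier` names, and the in-memory set standing in for the Redis nonce store are all assumptions.

```python
import hashlib
import hmac
import secrets
import time

MAX_SKEW_SECONDS = 300  # the 300 s window stated in the list above


def sign_hello(secret, tenant_id, agent_id, nonce, timestamp):
    """Hex HMAC-SHA256 over the hello fields (field order is an assumption)."""
    message = "\n".join([tenant_id, agent_id, nonce, str(timestamp)]).encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class HelloVerifier:
    """Server-side checks: clock skew, signature, then nonce replay."""

    def __init__(self, secret):
        self._secret = secret
        self._seen_nonces = set()  # stand-in for the Redis-backed nonce store

    def verify(self, tenant_id, agent_id, nonce, timestamp, signature, now=None):
        now = time.time() if now is None else now
        if abs(now - timestamp) > MAX_SKEW_SECONDS:
            return "rejected: clock skew"
        expected = sign_hello(self._secret, tenant_id, agent_id, nonce, timestamp)
        # Constant-time comparison avoids leaking how many characters matched.
        if not hmac.compare_digest(expected, signature):
            return "rejected: HMAC mismatch"
        # Record the nonce only after the signature checks out, so unsigned
        # traffic cannot fill the store.
        if nonce in self._seen_nonces:
            return "rejected: replayed nonce"
        self._seen_nonces.add(nonce)
        return "accepted"


if __name__ == "__main__":
    secret = secrets.token_bytes(32)
    nonce = secrets.token_hex(16)
    ts = int(time.time())
    sig = sign_hello(secret, "tenant-123", "agent-abc", nonce, ts)
    verifier = HelloVerifier(secret)
    print(verifier.verify("tenant-123", "agent-abc", nonce, ts, sig))  # accepted
    print(verifier.verify("tenant-123", "agent-abc", nonce, ts, sig))  # rejected: replayed nonce
```

A host whose clock drifts past the window fails the first check regardless of how good its credentials are, so keeping NTP healthy on agent hosts matters.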

Requirements

Item	Detail
Outbound network	TCP/50051 to your regional gRPC NLB (e.g. `grpc.us-east-1.tensorcost.com`); TCP/443 to `api.tensorcost.com` for fallback HTTPS sync
Privileges	Read-only NVML access (host `/usr/lib/x86_64-linux-gnu/nvidia` or `--gpus all` in Docker); container or pod metadata; no write privileges required
Cloud IAM (optional)	Read-only role for cost-explorer / SageMaker / Bedrock pulls; covered by the same CFN stack as the role install
Runtime	Docker, containerd, or a systemd unit
Architecture	linux/amd64 and linux/arm64 images shipped

Pick your install path

CloudFormation (one-click)

Best for AWS customers running EC2 / ECS. Deploys IAM role + external ID + ECS task definition.

Terraform module

Best for IaC shops. Maintained on the public registry under vaadhlabs/tensorcost-agent.

Helm chart

Best for EKS / GKE / AKS. DaemonSet with NVML host mounts.

Path A — CloudFormation one-click

Generate an external ID

In the TensorCost dashboard, open Integrations → Add agent and copy the auto-generated ExternalId. The wizard also shows your TenantId and the platform BackendAccountId.

Launch the stack

Click Launch stack in the wizard, or run:

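# `aws cloudformation deploy` takes a local template file, so download it first.
curl -fsSLo agent-stack.yml https://downloads.tensorcost.com/cfn/agent-stack.yml

# BACKEND_ACCOUNT_ID is the platform BackendAccountId shown in the wizard.
aws cloudformation deploy \
  --template-file agent-stack.yml \
  --stack-name tensorcost-agent \
  --parameter-overrides \
      TenantId="$TENANT_ID" \
      ExternalId="$EXTERNAL_ID" \
      TensorCostBackendAccountId="$BACKEND_ACCOUNT_ID" \
      OnboardingMode=SingleAccount \
  --capabilities CAPABILITY_NAMED_IAM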

The stack creates exactly one IAM role (TensorCost-AgentRunner-<ExternalId>) with the minimum read-only EC2 / CloudWatch / Cost Explorer policy. In OnboardingMode=Organization it also attaches a managed policy that enables jump-role access into member accounts via OrganizationAccountAccessRole.
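
For orientation, a role that a vendor account assumes with an external ID normally carries a trust policy shaped like the one built below. The stack's real template is not reproduced on this page, so treat this as the standard AWS cross-account pattern rather than the exact policy the stack creates; `build_trust_policy`, the account-root principal, and the placeholder IDs are assumptions.

```python
import json


def build_trust_policy(backend_account_id, external_id):
    """Trust policy letting one AWS account assume the role, gated on an external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumption: the whole backend account is trusted as principal.
                "Principal": {"AWS": f"arn:aws:iam::{backend_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The caller must present this exact external ID.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; use the IDs shown in the wizard.
    policy = build_trust_policy("111122223333", "example-external-id")
    print(json.dumps(policy, indent=2))
```

Both values have to match what the platform presents when it calls AssumeRole, which is why a wrong external ID or a wrong backend account ID each surface as an AccessDenied in the day-1 failures table.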

Paste the role ARN back into the wizard

Click Validate. The wizard polls an STS AssumeRole call and a CloudWatch read for up to two minutes. Green = done.

Confirm the agent connected

Open Agents in the sidebar. Within five minutes of a successful CFN deploy, your fleet should appear with status='ok'.

Path B — Terraform module

module "tensorcost_agent" {
  source     = "vaadhlabs/tensorcost-agent/aws"
  version    = "~> 1.0"

  tenant_id           = var.tensorcost_tenant_id
  external_id         = var.tensorcost_external_id
  onboarding_mode     = "SingleAccount"   # or "Organization"
  agent_image_tag     = "v1.4.2"

  enable_bedrock_read = true
  enable_cur_read     = true
  cur_bucket_name     = "my-cur-bucket"
}

Apply, then paste module.tensorcost_agent.role_arn into the wizard’s Validate step.

Path C — Helm chart (Kubernetes)

helm repo add tensorcost https://charts.tensorcost.com
helm repo update

helm upgrade --install tensorcost-agent tensorcost/agent \
  --namespace tensorcost --create-namespace \
  --set tenant.id=$TENANT_ID \
  --set tenant.externalId=$EXTERNAL_ID \
  --set image.tag=v1.4.2 \
  --set telemetry.otel.enabled=true

The chart ships as a DaemonSet with NVML host mounts and tolerations for nvidia.com/gpu. Override per-cluster values with a values.yaml if you have non-default GPU resource names (e.g. amd.com/gpu).

How authentication works

The HMAC pepper is per-deployment, stored in AWS Secrets Manager, and rotated quarterly per the secret rotation runbook.
Keys rotate individually: every credential refresh issues a new key_id, and revoked keys are denylisted at the verifier.
The escape hatch GPU_GRPC_ALLOW_UNREGISTERED_AGENTS=true was removed; every agent must carry a verified hello.

For the full ingress design — including why we accept 0.0.0.0/0 at the NLB SG with documented compensating controls — see ADR-0010 (linked from SOC 2 readiness).

Verification — did the agent connect?

Open Agents

Sidebar → Agents. The new agent appears within five minutes with status='ok', last-seen timestamp, and detected platform (EC2 / EKS / bare-metal).

Confirm metrics flow

Sidebar → GPU fleet. Within 15 minutes you should see GPU utilization, memory, and temperature time series.

Run a self-test

On the agent host:

docker exec tensorcost-agent /opt/tensorcost/bin/agent --self-test

The self-test exercises IMDS, HMAC signing, the gRPC handshake, and NVML, then reports a one-line PASS/FAIL.

Configuration reference

The agent reads YAML (--config /etc/tensorcost/agent.yaml) or environment variables. Env vars take precedence.
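
The precedence rule can be sketched as follows. This is an illustration of "env var, then YAML, then default", not the agent's loader: `resolve` and `lookup` are hypothetical helpers, the nested dict stands in for a parsed agent.yaml, and only three rows of the Core table below are included.

```python
import os

# (env var, YAML key, default) taken from the Core table; None marks a required value.
SETTINGS = [
    ("TENSORCOST_TENANT_ID", "tenant.id", None),
    ("TENSORCOST_HEARTBEAT_INTERVAL", "agent.heartbeat_interval", "15s"),
    ("TENSORCOST_LOG_LEVEL", "logging.level", "info"),
]


def lookup(yaml_config, dotted_key):
    """Walk a nested dict (as a YAML parser would produce) by dotted key."""
    node = yaml_config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def resolve(yaml_config, env=None):
    """Return effective settings: env var first, then YAML, then default."""
    env = os.environ if env is None else env
    resolved = {}
    for env_var, yaml_key, default in SETTINGS:
        value = env.get(env_var)
        if value is None:
            value = lookup(yaml_config, yaml_key)
        if value is None:
            value = default
        if value is None:
            raise ValueError(f"{env_var} / {yaml_key} is required")
        resolved[yaml_key] = value
    return resolved


if __name__ == "__main__":
    parsed_yaml = {"tenant": {"id": "tenant-123"}, "logging": {"level": "debug"}}
    # The env var overrides the YAML value for logging.level.
    print(resolve(parsed_yaml, env={"TENSORCOST_LOG_LEVEL": "warn"}))
```

In practice this means a stray environment variable in a container spec silently wins over the mounted file, which is worth checking first when a YAML change appears to have no effect.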

Core

Env var	YAML key	Default	Purpose
`TENSORCOST_TENANT_ID`	`tenant.id`	(required)	Your tenant UUID
`TENSORCOST_EXTERNAL_ID`	`tenant.external_id`	(required)	The wizard-issued external ID
`TENSORCOST_AGENT_NAME`	`agent.name`	hostname	Human-readable name in the Agents view
`TENSORCOST_GRPC_TARGET`	`grpc.target`	regional default	gRPC endpoint, e.g. `grpc.us-east-1.tensorcost.com:50051`
`TENSORCOST_TLS_CA`	`grpc.tls_ca`	system	Override CA bundle (rare)
`TENSORCOST_HEARTBEAT_INTERVAL`	`agent.heartbeat_interval`	`15s`	Heartbeat cadence
`TENSORCOST_LOG_LEVEL`	`logging.level`	`info`	`trace`, `debug`, `info`, `warn`, `error`
`OTEL_ENABLED`	`telemetry.otel.enabled`	`false`	OpenTelemetry export
`OTEL_EXPORTER_OTLP_ENDPOINT`	`telemetry.otel.endpoint`	`http://localhost:4317`	OTLP target

NVML / GPU sampling

Env var	YAML key	Default	Purpose
`NVML_ENABLED`	`nvml.enabled`	`true` (host install)	Enable local GPU sampling
`NVML_SAMPLE_INTERVAL`	`nvml.sample_interval`	`10s`	Sampling cadence
`NVML_IDLE_GPU_THRESHOLD`	`nvml.idle_gpu_threshold`	`10`	Utilization% below this counts as idle
`NVML_IDLE_DURATION`	`nvml.idle_duration`	`120s`	Time idle before flagging
`MIG_ENABLED`	`mig.enabled`	`auto`	Detect MIG slices on A100/H100/H200/B200
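
To make the two idle settings concrete, here is a minimal sketch of how a threshold and a duration combine. The agent's actual rule is not documented here; this version assumes a GPU is flagged once every consecutive sample has stayed strictly below the threshold for at least the idle duration. `Sample` and `idle_flagged_at` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float    # seconds since some reference point
    utilization: float  # GPU utilization in percent


def idle_flagged_at(samples, threshold=10.0, idle_duration=120.0):
    """Return the timestamp at which the GPU is first flagged idle, or None.

    Samples are assumed to be ordered by timestamp.
    """
    streak_start = None
    for sample in samples:
        if sample.utilization < threshold:
            if streak_start is None:
                streak_start = sample.timestamp
            if sample.timestamp - streak_start >= idle_duration:
                return sample.timestamp
        else:
            # Any busy sample resets the idle streak.
            streak_start = None
    return None


if __name__ == "__main__":
    # One sample every 10 s, all at 5% utilization, from t=0 to t=120.
    quiet = [Sample(float(t), 5.0) for t in range(0, 130, 10)]
    print(idle_flagged_at(quiet))  # 120.0
```

With the defaults, a single busy sample inside the window restarts the clock, so short bursts every couple of minutes keep a GPU from ever being flagged under this reading.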

Cloud (optional, when running outside the IAM-role context)

If the agent runs on EC2 with an instance profile, leave the cloud blocks empty; the agent reads region and credentials from IMDS. Override only when running in a foreign environment.

Env var	YAML key	Purpose
`AWS_ENABLED`	`clouds.aws.enabled`	Enable AWS pulls
`AWS_REGION`	`clouds.aws.region`	Override default region
`AZURE_ENABLED`	`clouds.azure.enabled`	Enable Azure pulls
`AZURE_SUBSCRIPTION_ID`	`clouds.azure.subscription_id`	Subscription ID
`GCP_ENABLED`	`clouds.gcp.enabled`	Enable GCP pulls
`GCP_PROJECT_ID`	`clouds.gcp.project_id`	Project ID
`K8S_ENABLED`	`clouds.kubernetes.enabled`	Enable K8s pod/namespace tagging

Upgrades

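Upgrades are customer-driven: we don’t push agent binaries into your environment. Upgrade with the same tool you installed with. For a CloudFormation install, set the new image tag and pass UsePreviousValue=true for every other stack parameter, because update-stack falls back to the template default for any parameter you omit:

aws cloudformation update-stack \
  --stack-name tensorcost-agent \
  --use-previous-template \
  --parameters \
      ParameterKey=AgentImageTag,ParameterValue=v1.5.0 \
      ParameterKey=TenantId,UsePreviousValue=true \
      ParameterKey=ExternalId,UsePreviousValue=true \
      ParameterKey=TensorCostBackendAccountId,UsePreviousValue=true \
      ParameterKey=OnboardingMode,UsePreviousValue=true \
  --capabilities CAPABILITY_NAMED_IAM

For a Terraform install, bump agent_image_tag and re-apply. For a Helm install, re-run the helm upgrade --install command with the new image.tag.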

We publish release notes for every agent version at github.com/vaadhlabs/tensorcost-agent/releases. Critical security patches are tagged and announced via your tenant’s notification channel.

Common day-1 failures

Symptom	Likely cause	Remediation
`STS AssumeRole AccessDenied (ExternalId)`	Wizard’s `ExternalId` doesn’t match the CFN parameter	Re-deploy CFN with the wizard-shown ExternalId, or update the wizard connection row
`STS AssumeRole AccessDenied (no ExternalId mention)`	Default `TensorCostBackendAccountId=000000000000` left in CFN	Re-deploy with the platform account ID shown in the wizard
`agent gRPC handshake: HMAC mismatch`	Pasted credentials with leading/trailing whitespace	Re-mint credentials, paste cleanly
`agent shows status='unknown' for >30 min`	Outbound TCP/50051 blocked by firewall / corporate proxy	Allowlist the regional gRPC NLB on TCP/50051; `curl -vI https://api.tensorcost.com/health` from the host confirms only the TCP/443 fallback path
`NVML not available`	NVIDIA drivers missing or container lacks `--gpus all` / device plugin	Mount `/usr/lib/x86_64-linux-gnu/nvidia` read-only; on K8s, ensure the node has the NVIDIA device plugin
`0 events ingested for >24h`	Agent registered but no metric emit (rare)	Run `--self-test`; check `/var/log/tensorcost-agent.log` for errors
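
The firewall row above is the easiest one to check from the host itself. A minimal sketch, assuming plain TCP reachability is what you want to test: `can_connect` is a hypothetical helper built on the standard library's `socket.create_connection`, and a successful connect says nothing about TLS or the gRPC handshake.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Call it with your regional endpoint, for example `can_connect("grpc.us-east-1.tensorcost.com", 50051)`. If that returns False while the HTTPS check against `api.tensorcost.com` succeeds, the likely culprit is egress filtering on port 50051 rather than general connectivity.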

When stuck, page support via your shared Slack Connect channel, or email support@tensorcost.com with the agent log and the connection’s sync-history drawer.

Data the agent collects

Class	Cadence	Examples
Inventory	Every 5 min	Instance ID, type, region, GPU model, MIG profile, tags
GPU metrics	10s NVML sample, batched every 60s	Utilization, memory, temperature, power, clock, ECC errors
Container/pod context	On change	Namespace, pod name, deployment, requested vs allocated GPU
Spot signals	On event	Interruption notice, hibernation, capacity rebalance
Health	Every 5 min	Uptime, queue depth, error counts
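
The GPU metrics cadence in the table works out as follows. This is plain arithmetic on the stated 10 s sample and 60 s batch intervals; the assumption that one batch covers every GPU on the host, and the `cadence` helper itself, are illustrative.

```python
SECONDS_PER_DAY = 86_400


def cadence(gpu_count, sample_interval_s=10, batch_interval_s=60):
    """Sample and batch counts implied by the NVML cadence in the table."""
    samples_per_gpu_per_batch = batch_interval_s // sample_interval_s
    return {
        "samples_per_gpu_per_batch": samples_per_gpu_per_batch,
        "samples_per_batch": gpu_count * samples_per_gpu_per_batch,
        "batches_per_day": SECONDS_PER_DAY // batch_interval_s,
        "samples_per_day": gpu_count * (SECONDS_PER_DAY // sample_interval_s),
    }


if __name__ == "__main__":
    # An 8-GPU host: 6 samples per GPU per batch, 48 per batch,
    # 1440 batches and 69120 samples per day.
    print(cadence(gpu_count=8))
```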

The agent is stateless — restart at any time without data loss. Outbound traffic is TLS-encrypted to the regional NLB. No customer data is written to disk except the rotating local log (/var/log/tensorcost-agent.log, capped at 100 MB).
