SOC 2 readiness and security posture

TensorCost is built and operated by Vaadh Labs. This page describes our customer-facing security posture: the controls in place today, the compliance certifications in flight, and the architecture details an enterprise security reviewer cares about.
SOC 2 Type I is in progress and targets completion in Q3 2026; Type II follows after a 6–9 month observation window. ISO 27001 is on the 2027 roadmap and FedRAMP Moderate is on the 2028 roadmap. The trust portal at trust.tensorcost.com is still being built; for now, request the security package directly from security@tensorcost.com.

Compliance posture at a glance

| Framework | Status | Target |
| --- | --- | --- |
| SOC 2 Type I | In progress | Q3 2026 |
| SOC 2 Type II | Scheduled | Q1 2027 (6–9 months post Type I) |
| ISO 27001 | Roadmap | 2027 |
| HIPAA BAA | On request | Available for Enterprise tier |
| FedRAMP Moderate | Roadmap | 2028 |
| Penetration test | Annual + on-major-release | Latest summary on request |

Trust service criteria — what we have, what’s in flight

Security (mandatory)

In place:
  • Multi-tenant Row Level Security on Postgres. 21 tables enforced today, target 50; rolling out wave-by-wave, with runAsBypass discipline enforced in code review until the planned lint rule lands. See the RLS architecture below.
  • JWT auth with refresh tokens; Cognito federation for browser SSO; SAML / OIDC for enterprise SSO.
  • Three built-in roles (member, admin, owner); RBAC primitive shared across REST, gRPC, and MCP.
  • HMAC-SHA256 signing plus Redis-backed nonce replay protection on every agent gRPC stream (see agent ingress).
  • TLS 1.3 in transit for every customer-facing endpoint (REST, WS, gRPC).
  • AES-256 encryption at rest (RDS, S3, EBS); per-tenant KMS-derived keys for sensitive credential storage.
  • AWS STS AssumeRole + external ID for cross-account reads; 15-minute temp credentials, never persisted.
  • Read-only IAM on customer accounts. No bedrock:Invoke*, s3:Put*, or logs:Put* actions anywhere.
  • Customer-owned S3 buckets for raw Bedrock invocation logs. We never copy raw logs to our account.
  • Redaction at ingestion — raw prompts and responses are hashed; only hashes enter ai_spend_events. See redaction at ingestion.
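  • Rate limiting at the gateway. Per-tenant limiting on the gRPC ingress is in flight (ADR-0010 CC-4); the interim control is the NLB connection-per-source-IP soft limit.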
  • CloudTrail, GuardDuty, AWS Config enabled across the platform AWS organization.
  • Dependabot, secret scanning, required PR reviews + status checks on every repo.
In flight:
  • Trust portal scaffolding at trust.tensorcost.com.
  • Per-tenant gRPC token-bucket rate limit (ADR-0010 CC-4).
  • mTLS upgrade path for the agent ingress (post-Envoy migration).
  • AI-service RLS rollout (next wave: 17 tables across the ai.* schema).
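This page does not specify how the in-flight per-tenant token-bucket limit works. As a rough illustration of the general technique only, a token bucket keyed by tenant might look like the sketch below. The class name, capacity, and refill rate are hypothetical and are not taken from ADR-0010.

```python
import time
from typing import Dict, Optional, Tuple


class TenantTokenBucket:
    """Token bucket keyed by tenant ID (illustrative; parameters are assumptions)."""

    def __init__(self, capacity: float, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        # tenant_id -> (tokens remaining, timestamp of the last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, tenant_id: str, now: Optional[float] = None) -> bool:
        """Consume one token for tenant_id; return False when the bucket is empty."""
        now = time.monotonic() if now is None else now
        tokens, last = self._state.get(tenant_id, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
        if tokens < 1.0:
            self._state[tenant_id] = (tokens, now)
            return False
        self._state[tenant_id] = (tokens - 1.0, now)
        return True
```

Because each tenant has its own bucket, one noisy tenant exhausts only its own budget. A production version would need shared state across instances, which this in-process sketch does not attempt.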

Availability

| Tier | SLA | DR |
| --- | --- | --- |
| Free / design partner | Best-effort | |
| Growth | 99.9% | RPO 24h, RTO 4h |
| Enterprise | 99.95% | RPO 1h, RTO 1h |
  • RDS Multi-AZ with automated backups (35-day retention).
  • Redis ElastiCache cluster mode with replicas.
  • ALB cross-zone load balancing; NLB targets in three AZs.
  • Per-region deployment in us-east-1 (default), with EU-region rollout planned.
  • Public status page at status.tensorcost.com (embed coming).
  • Documented incident response procedure with PagerDuty rotation.
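For reference, the SLA percentages above translate into a downtime budget by simple arithmetic. The sketch below assumes a 30-day measurement window, which is an illustrative assumption; the contractual window is not stated on this page.

```python
def downtime_budget_minutes(sla_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted per window at the given SLA percentage."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - sla_percent) / 100.0


for tier, sla in (("Growth", 99.9), ("Enterprise", 99.95)):
    print(f"{tier}: {downtime_budget_minutes(sla):.1f} minutes per 30 days")
```

Under that assumption, 99.9% allows about 43.2 minutes per 30 days and 99.95% allows about 21.6 minutes.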

Confidentiality

  • All tenant data scoped by tenant_id and enforced by RLS.
  • Sensitive config values (webhook URLs, routing keys, cloud credentials) encrypted at the column level using a per-tenant KMS-derived key. Surfaced as masked values in API responses; full values only via decrypt-scoped endpoints.
  • Customizable data retention (30–365 days for raw metrics; 90–730 days for recommendations; 7 years for audit).
  • Tenant offboarding flow with 30-day soft-delete; hard delete preserves audit-trail rows. See tenant offboarding.
  • NDA template signed by every employee and contractor with production access.

Processing integrity

  • Analysis audit log captures every anomaly-detection decision: baseline statistics used, current value, each method’s score, composite confidence, classification.
  • Inference feedback records every recommendation outcome (accepted / rejected / modified) with reason + post-action measurements; feeds back into ML training.
  • Action queue tracks every enforcement action through pending → approved → executing → completed/failed with full audit trail.
  • Event persistence via the event store (ADR-0011) — cross-instance durable event ledger.
  • Daily ingest reconciliation — agent-emitted vs backend-received counts, with a delta alert.
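The daily reconciliation in the last bullet amounts to comparing two per-tenant counters and flagging large gaps. A minimal sketch follows, assuming counts are already aggregated per tenant. The function name and the 1% threshold are hypothetical; the real alert threshold is not documented here.

```python
from typing import Dict, List


def reconcile_counts(
    emitted: Dict[str, int],
    received: Dict[str, int],
    max_relative_delta: float = 0.01,  # hypothetical alert threshold
) -> List[str]:
    """Return tenant IDs whose emitted and received counts differ beyond the threshold."""
    alerts: List[str] = []
    for tenant_id in sorted(set(emitted) | set(received)):
        sent = emitted.get(tenant_id, 0)
        got = received.get(tenant_id, 0)
        baseline = max(sent, got)
        if baseline == 0:
            continue  # nothing emitted or received, so nothing to compare
        if abs(sent - got) / baseline > max_relative_delta:
            alerts.append(tenant_id)
    return alerts
```

Taking the union of both key sets means a tenant that appears on only one side is still checked, so a fully dropped stream shows up as a 100% delta.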

Privacy

  • Privacy policy and DPA template on request.
  • Personal data collected: email, name, IP / user-agent on auth, timezone preference.
  • Subprocessor list available on request and republished annually.
  • Cookie policy on the marketing site.
  • DSAR process documented; DSARs handled within 30 days.

Multi-tenant RLS

Every customer is a tenant; every row in every table that holds tenant data carries a tenant_id column and is protected by Postgres Row Level Security.
```sql
-- Pattern applied across 21 tables today, target 50
ALTER TABLE cost.savings_ledger ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation
  ON cost.savings_ledger
  USING (tenant_id = current_setting('app.tenant_id')::uuid);
```
Application code sets app.tenant_id per request via a Sequelize hook; cross-tenant reads are unreachable by construction. The only path that bypasses RLS is runAsBypass(tenantId, fn) from @tensorcost/db-utils, used by:
  • gRPC handlers (after the agent’s tenantId is verified by HMAC).
  • Cron-driven sweeps (anomaly detection, recommenders).
  • Cross-service joins.
A RUN-AS-BYPASS-LINT rule is on the way; until then, code review enforces the discipline. A botched RLS migration silently returns zero rows, which during a customer demo is indistinguishable from a feature that “doesn’t work” — so each table’s rollout requires both the migration and a wrap+verify pass on every reader before merge.
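To make the "wrap + verify" idea concrete, here is a small in-memory model of the policy's visibility rule with a verification helper. It is a sketch, not the actual Sequelize hook or migration tooling. `tenant_scope`, `read_ledger`, and `verify_reader` are hypothetical names, and the model simply returns no rows when no tenant is set.

```python
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Callable, Iterator, List, NamedTuple, Optional


class LedgerRow(NamedTuple):
    tenant_id: uuid.UUID
    savings_usd: float


# Stand-in for the app.tenant_id session setting.
_tenant: ContextVar[Optional[uuid.UUID]] = ContextVar("app_tenant_id", default=None)


@contextmanager
def tenant_scope(tenant_id: uuid.UUID) -> Iterator[None]:
    """Set the tenant for the duration of a request, then restore the old value."""
    token = _tenant.set(tenant_id)
    try:
        yield
    finally:
        _tenant.reset(token)


def read_ledger(table: List[LedgerRow]) -> List[LedgerRow]:
    """Mimic the USING clause: only rows matching the current tenant are visible."""
    current = _tenant.get()
    return [row for row in table if row.tenant_id == current]


def verify_reader(
    reader: Callable[[List[LedgerRow]], List[LedgerRow]],
    table: List[LedgerRow],
    tenant_id: uuid.UUID,
) -> bool:
    """Verify pass: a wrapped reader sees at least one row, and only its own rows."""
    with tenant_scope(tenant_id):
        rows = reader(table)
    return bool(rows) and all(row.tenant_id == tenant_id for row in rows)
```

The non-empty check is the important half. A reader that was never wrapped returns zero rows in this model, which is the silent failure described above, so the helper treats an empty result for a seeded tenant as a failure.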

Agent ingress

Per ADR-0010, the unified GPU agent connects to TensorCost over a long-lived gRPC stream on TCP/50051, fronted by an AWS Network Load Balancer with TLS-only listeners. The task security group accepts 0.0.0.0/0 on TCP/50051; four compensating controls (CC-1 to CC-3 in place, CC-4 in flight) offset that exposure:
| Control | Implementation |
| --- | --- |
| CC-1: HMAC + replay guard | Every AgentHello carries a tenant-bound HMAC-SHA256, ±300s skew window, and a Redis-backed nonce reuse rejection (SETNX nonce:{tenantId}:{keyId}:{hex(nonce)} with 600s TTL). Verified with timingSafeEqual against a per-deployment Secrets-Manager pepper. |
| CC-2: Per-stream tenant binding | The verified AgentHello populates session.tenantId; every DB write runs inside runAsBypass with the resolved tenantId so RLS policies cannot leak cross-tenant. |
| CC-3: TLS-only at the NLB | Plaintext gRPC connections fail before reaching the handler. mTLS is the planned upgrade path once we have an Envoy / ingress gateway. |
| CC-4: Per-tenant per-stream rate limit | In flight; tracked as a follow-up. Today’s compensating control is the NLB connection-per-source-IP soft limit. |
The escape hatch GPU_GRPC_ALLOW_UNREGISTERED_AGENTS=true was removed; every agent must carry a verified hello.
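The CC-1 checks (skew window, constant-time MAC comparison, nonce reuse rejection) can be sketched as follows. This is an illustration under stated assumptions, not the production handler. The message layout and function names are hypothetical, an in-memory dictionary stands in for Redis, the pepper step is omitted, and `hmac.compare_digest` plays the role of `timingSafeEqual`.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional

SKEW_SECONDS = 300       # ±300s skew window from CC-1
NONCE_TTL_SECONDS = 600  # nonce key TTL from CC-1


class InMemoryNonceStore:
    """Stand-in for a Redis set-if-absent with expiry."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def set_if_absent(self, key: str, ttl: float, now: float) -> bool:
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # nonce already seen within its TTL
        self._expires_at[key] = now + ttl
        return True


def sign_hello(secret: bytes, tenant_id: str, key_id: str,
               timestamp: int, nonce: bytes) -> bytes:
    # Hypothetical message layout; including tenant_id binds the MAC to the tenant.
    message = f"{tenant_id}|{key_id}|{timestamp}|{nonce.hex()}".encode()
    return hmac.new(secret, message, hashlib.sha256).digest()


def verify_hello(secret: bytes, tenant_id: str, key_id: str, timestamp: int,
                 nonce: bytes, signature: bytes, store: InMemoryNonceStore,
                 now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    if abs(now - timestamp) > SKEW_SECONDS:
        return False  # outside the allowed clock skew
    expected = sign_hello(secret, tenant_id, key_id, timestamp, nonce)
    if not hmac.compare_digest(expected, signature):
        return False  # constant-time comparison failed
    # Record the nonce only after the MAC verifies, so unauthenticated
    # callers cannot fill the nonce store.
    key = f"nonce:{tenant_id}:{key_id}:{nonce.hex()}"
    return store.set_if_absent(key, NONCE_TTL_SECONDS, now)
```

The 600s TTL is twice the skew window, so a nonce stays recorded for longer than any timestamp paired with it can remain valid.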

Redaction at ingestion

Raw prompts and responses never enter our storage.
  • Bedrock invocation logs are read with read-only access from customer-owned S3 buckets (or CloudWatch Logs).
  • The ingestion parser computes prompt_hash and response_hash (SHA-256) and writes only those hashes.
  • Unit tests assert no raw prompt or response text escapes the parser.
  • Customers concerned about content can disable request-metadata capture entirely and still receive model/cost/latency-level attribution and recommendations.
  • Anonymized aggregates that feed our public benchmark report are gated behind explicit per-tenant consent.
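The hashing step described above can be sketched in a few lines. Only `prompt_hash` and `response_hash` come from this page; the other field names and the input record shape are assumptions made for illustration.

```python
import hashlib
from typing import Any, Dict


def sha256_hex(text: str) -> str:
    """Hex-encoded SHA-256 of the UTF-8 bytes of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redact_invocation(record: Dict[str, Any]) -> Dict[str, Any]:
    """Build an event from an allow-list of fields plus hashes, never the raw text."""
    return {
        "model_id": record["model_id"],            # hypothetical field name
        "input_tokens": record["input_tokens"],    # hypothetical field name
        "output_tokens": record["output_tokens"],  # hypothetical field name
        "prompt_hash": sha256_hex(record["prompt"]),
        "response_hash": sha256_hex(record["response"]),
    }
```

Building the output from an allow-list, rather than deleting keys from the input, means an unexpected new field in the log format is dropped by default instead of leaking through.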

Customer onboarding — least-privilege IAM

The CFN onboarding stack ships with the minimum read-only IAM policy required. Modes:
| Mode | What the role can do | Where the stack deploys |
| --- | --- | --- |
| SingleAccount (default) | Read-only Bedrock + CloudWatch + EC2 + Cost Explorer in one account | The single account being monitored |
| Organization | All of SingleAccount plus sts:AssumeRole into OrganizationAccountAccessRole (or AWSControlTowerExecution) per member account | The Organization’s management account |
Variants supported in the wizard:
  • Consolidated billing with separate payer: payer-account CUR read + member-account Bedrock log read.
  • SCP-restricted environments: RolePathPrefix parameter for Organizations that mandate a custom path.
  • AWS Control Tower / Landing Zone: defaults to AWSControlTowerExecution jump role.
  • Cross-account CUR: separate bucket-policy snippet shipped as a distinct artifact (security teams routinely review these independently).
Full onboarding flow with day-by-day expectations, top-5 day-1 failure remediations, and Organization-mode verification snippets lives in our internal customer-onboarding runbook (request via your Slack Connect channel).
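A security reviewer who wants to confirm the read-only claim can scan the stack's IAM policy document for write-style actions. The sketch below is one way to do that and is not a tool we ship. The function name is hypothetical, the sample actions in the usage note are ordinary AWS read actions chosen for illustration, and `NotAction` statements are not handled.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List

# Forbidden pattern -> one concrete action it covers, used to catch broad wildcards.
FORBIDDEN = {
    "bedrock:invoke*": "bedrock:invokemodel",
    "s3:put*": "s3:putobject",
    "logs:put*": "logs:putlogevents",
}


def _as_list(value: Any) -> List[Any]:
    return value if isinstance(value, list) else [value]


def forbidden_actions(policy: Dict[str, Any]) -> List[str]:
    """Return allowed actions in a policy document that grant a forbidden capability."""
    hits: List[str] = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action", [])):
            name = action.lower()  # IAM action names are case-insensitive
            for pattern, example in FORBIDDEN.items():
                # Either the action falls under the pattern, or the action is
                # itself a wildcard broad enough to cover the example action.
                if fnmatchcase(name, pattern) or fnmatchcase(example, name):
                    hits.append(action)
                    break
    return hits
```

For example, a statement allowing `s3:GetObject` and `logs:GetLogEvents` yields an empty list, while one allowing `s3:*` is flagged because the wildcard covers `s3:PutObject`.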

Secret rotation

Per the secret-rotation runbook, all platform secrets rotate on a documented cadence:
| Secret | Cadence | Owner | Mechanism |
| --- | --- | --- | --- |
| Agent HMAC pepper (AGENT_HMAC_PEPPER) | Quarterly | Platform | Secrets Manager rotation lambda; old + new accepted for 24h overlap |
| Database master credentials | Quarterly | Platform | RDS-managed rotation |
| Cognito signing keys | Annually | Platform | Cognito-managed |
| LaunchDarkly SDK keys | On suspected compromise + annually | Platform | LD console + Secrets Manager |
| Customer agent credentials | Customer-driven | Customer | POST /v1/identity/agent-credentials for new key, then revoke old |
| Customer integration credentials | 90 days suggested | Customer | POST /v1/integration/connections/:id/rotate-secret |
Customer-facing secrets (agent HMAC keys, OpenAI API keys we hold, etc.) are stored encrypted with a per-tenant KMS-derived key in integration.connection_secret (RLS-enforced).
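The 24-hour overlap for the agent HMAC pepper means a verifier accepts either the new or the previous value for a limited time after rotation. A minimal sketch of that acceptance rule follows. It assumes, for illustration only, that the pepper is used directly as an HMAC-SHA256 key; how the pepper actually enters the computation is not documented here, and the function names are hypothetical.

```python
import hashlib
import hmac
from typing import Optional

OVERLAP_SECONDS = 24 * 60 * 60  # 24h overlap from the rotation table


def mac(pepper: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 of message keyed by pepper (assumed usage)."""
    return hmac.new(pepper, message, hashlib.sha256).digest()


def verify_during_rotation(
    message: bytes,
    signature: bytes,
    current_pepper: bytes,
    previous_pepper: Optional[bytes],
    rotated_at: float,
    now: float,
) -> bool:
    """Accept the current pepper always, and the previous one only inside the overlap."""
    if hmac.compare_digest(mac(current_pepper, message), signature):
        return True
    in_overlap = previous_pepper is not None and (now - rotated_at) <= OVERLAP_SECONDS
    return in_overlap and hmac.compare_digest(mac(previous_pepper, message), signature)
```

Checking the current pepper first keeps the common path to a single MAC computation; the previous pepper is consulted only on a mismatch and only while the overlap is open.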

Tenant offboarding

A documented state machine — active → offboarding_pending → offboarding_archive → deleted — covers every churn or GDPR Art. 17 request:
  1. Soft delete (30 days): tenant moves to offboarding_pending. Ingestion stops, dashboards become read-only, recommendations freeze. Data is retained for 30 days to allow recovery.
  2. Archive (7 days): tenant moves to offboarding_archive. A signed export bundle (CSV + JSON) is delivered to the tenant’s offboarding contact.
  3. Hard delete: every tenant-scoped row is purged via chunked DELETE per schema (large tables like ai.ai_spend_events are deleted in batches to avoid bloat). Customer-owned S3 buckets are not touched; those are the customer’s to manage.
  4. Audit-trail preservation: audit.* rows survive offboarding indefinitely; required for compliance and legal-hold preservation.
Customer-side teardown (revoking the IAM role, disabling Bedrock invocation logging, removing CFN stacks) is a customer responsibility; we ship a runbook and a Slack Connect handoff.
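The offboarding state machine above can be modelled as a small transition table with dwell times. This is a sketch for illustration. The function names are hypothetical, and recovery from offboarding_pending back to active is not modelled because this page does not define that transition.

```python
from datetime import datetime, timedelta
from enum import Enum


class TenantState(Enum):
    ACTIVE = "active"
    OFFBOARDING_PENDING = "offboarding_pending"
    OFFBOARDING_ARCHIVE = "offboarding_archive"
    DELETED = "deleted"


# state -> (minimum time in that state, state entered afterwards)
DWELL = {
    TenantState.OFFBOARDING_PENDING: (timedelta(days=30), TenantState.OFFBOARDING_ARCHIVE),
    TenantState.OFFBOARDING_ARCHIVE: (timedelta(days=7), TenantState.DELETED),
}


def request_offboarding(state: TenantState) -> TenantState:
    """Start offboarding; only an active tenant can enter the flow."""
    if state is not TenantState.ACTIVE:
        raise ValueError(f"cannot start offboarding from {state.value}")
    return TenantState.OFFBOARDING_PENDING


def advance(state: TenantState, entered_at: datetime, now: datetime) -> TenantState:
    """Move to the next state once the dwell time has elapsed; otherwise stay put."""
    if state not in DWELL:
        return state  # active needs an explicit request; deleted is terminal
    dwell, target = DWELL[state]
    return target if now - entered_at >= dwell else state
```

Keeping the request step separate from the timed steps means an active tenant never drifts into offboarding on a timer; only the 30-day and 7-day windows advance automatically.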

Penetration testing

| Cadence | Scope |
| --- | --- |
| Annual | Full external + authenticated app + API |
| On major release | Targeted (the new surface) |
| On request | Customer-driven re-test; latest summary available |
We use independent third-party testers and share the executive summary with enterprise prospects under NDA.

Background checks and security training

  • Background checks on every team member with production access.
  • Annual security awareness training (KnowBe4) for all staff.
  • Quarterly tabletop incident-response exercises.

What an enterprise prospect typically asks for

  • Security questionnaire (SIG / CAIQ) — completed on request, typically 48–72 hours.
  • Architecture diagrams at three levels (exec, platform-lead, security-reviewer). Available on request.
  • Subprocessor list and DPA. Available on request.
  • SOC 2 bridge letter. Issued while the Type I audit is under way (Q2–Q3 2026).
  • Penetration-test executive summary — under NDA.
  • Sample audit-trail export — under NDA.
Email security@tensorcost.com for the security package or to start a security review.

Reporting a vulnerability

Please report suspected vulnerabilities by emailing security@tensorcost.com. Include:
  • A description of the issue and potential impact.
  • Reproduction steps (PoC, logs, screenshots).
  • Affected versions / commit hashes.
  • Suggested mitigations if any.
We acknowledge receipt within 3 business days and provide a status update within 7 business days after triage. Please do not open public issues for suspected vulnerabilities. We coordinate disclosure with the reporter and publish security advisories with remediation details when fixes ship.