SOC 2 readiness and security posture
TensorCost is built and operated by Vaadh Labs. This page describes our customer-facing security posture: the controls in place today, the compliance certifications in flight, and the architecture details an enterprise security reviewer cares about.
Our SOC 2 Type I audit is in progress, targeting completion in Q3 2026, with Type II following after a 6–9 month observation window. ISO 27001 is on the 2027 roadmap and FedRAMP Moderate is on the 2028 roadmap. The trust portal at trust.tensorcost.com is still being built; for now, request the security package directly via security@tensorcost.com.
Compliance posture at a glance
| Framework | Status | Target |
|---|---|---|
| SOC 2 Type I | In progress | Q3 2026 |
| SOC 2 Type II | Scheduled | Q1 2027 (6–9 months post Type I) |
| ISO 27001 | Roadmap | 2027 |
| HIPAA BAA | On request | Available for Enterprise tier |
| FedRAMP Moderate | Roadmap | 2028 |
| Penetration test | Annual + on-major-release | Latest summary on request |
Trust service criteria — what we have, what’s in flight
Security (mandatory)
In place:
- Multi-tenant Row Level Security on Postgres. 21 tables enforced today, target 50; rolling out wave-by-wave with a runAsBypass discipline lint. See the RLS architecture below.
- JWT auth with refresh tokens; Cognito federation for browser SSO; SAML / OIDC for enterprise SSO.
- Three built-in roles (member, admin, owner); RBAC primitive shared across REST, gRPC, and MCP.
- HMAC-SHA256 signing plus Redis-backed nonce replay protection on every agent gRPC stream (see agent ingress).
- TLS 1.3 in transit for every customer-facing endpoint (REST, WS, gRPC).
- AES-256 encryption at rest (RDS, S3, EBS); per-tenant KMS-derived keys for sensitive credential storage.
- AWS STS AssumeRole + external ID for cross-account reads; 15-minute temp credentials, never persisted.
- Read-only IAM on customer accounts. No bedrock:Invoke, no s3:Put, no logs:Put anywhere.
- Customer-owned S3 buckets for raw Bedrock invocation logs. We never copy raw logs to our account.
- Redaction at ingestion: raw prompts and responses are hashed; only hashes enter ai_spend_events. See redaction at ingestion.
- Rate limiting at the gateway and per-tenant on the gRPC ingress.
- CloudTrail, GuardDuty, AWS Config enabled across the platform AWS organization.
- Dependabot, secret scanning, required PR reviews + status checks on every repo.
In flight:
- Trust portal scaffolding at trust.tensorcost.com.
- Per-tenant gRPC token-bucket rate limit (ADR-0010 CC-4).
- mTLS upgrade path for the agent ingress (post-Envoy migration).
- AI-service RLS rollout (next wave: 17 tables across the ai.* schema).
Availability
| Tier | SLA | DR |
|---|---|---|
| Free / design partner | Best-effort | — |
| Growth | 99.9% | RPO 24h, RTO 4h |
| Enterprise | 99.95% | RPO 1h, RTO 1h |
- RDS Multi-AZ with automated backups (35-day retention).
- Redis ElastiCache cluster mode with replicas.
- ALB cross-zone load balancing; NLB targets in three AZs.
- Per-region deployment in us-east-1 (default), with EU-region rollout planned.
- Public status page at status.tensorcost.com (embed coming).
- Documented incident response procedure with PagerDuty rotation.
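To make the SLA percentages in the table above concrete, the following sketch converts an availability target into the downtime it permits. The 30-day measurement period and the function name are assumptions for illustration; this page does not define how the SLA window is measured.

```typescript
// Allowed downtime implied by an availability SLA.
// ASSUMPTION: a 30-day measurement period; the page does not specify one.
const MINUTES_PER_30_DAYS = 30 * 24 * 60; // 43,200 minutes

function allowedDowntimeMinutes(
  slaPercent: number,
  periodMinutes: number = MINUTES_PER_30_DAYS,
): number {
  if (slaPercent <= 0 || slaPercent > 100) {
    throw new RangeError(`SLA must be in (0, 100], got ${slaPercent}`);
  }
  // Unavailable fraction is 1 - SLA; round to 2 decimals to hide float noise.
  const minutes = periodMinutes * (1 - slaPercent / 100);
  return Math.round(minutes * 100) / 100;
}

console.log(allowedDowntimeMinutes(99.9));  // Growth tier: 43.2 minutes per 30 days
console.log(allowedDowntimeMinutes(99.95)); // Enterprise tier: 21.6 minutes per 30 days
```

The RPO and RTO figures are separate commitments and are not derived from this calculation.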
Confidentiality
- All tenant data scoped by tenant_id and enforced by RLS.
- Sensitive config values (webhook URLs, routing keys, cloud credentials) encrypted at the column level using a per-tenant KMS-derived key. Surfaced as masked values in API responses; full values only via decrypt-scoped endpoints.
- Customizable data retention (30–365 days for raw metrics; 90–730 days for recommendations; 7 years for audit).
- Tenant offboarding flow with 30-day soft-delete; hard delete preserves audit-trail rows. See tenant offboarding.
- NDA template signed by every employee and contractor with production access.
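The customizable retention ranges listed above could be enforced with a small validator like the one below. This is a sketch only: the type and function names are hypothetical, and only the numeric ranges (30–365 days for raw metrics, 90–730 days for recommendations) come from this page. Audit retention is fixed at 7 years, so it is not configurable here.

```typescript
// Retention bounds in days, taken from the list above.
// ASSUMPTION: the names below are illustrative, not a real TensorCost API.
type RetentionKind = "rawMetrics" | "recommendations";

const RETENTION_BOUNDS: Record<RetentionKind, { min: number; max: number }> = {
  rawMetrics: { min: 30, max: 365 },
  recommendations: { min: 90, max: 730 },
};

// Returns the value unchanged when it is valid; throws otherwise.
function validateRetentionDays(kind: RetentionKind, days: number): number {
  const { min, max } = RETENTION_BOUNDS[kind];
  if (!Number.isInteger(days) || days < min || days > max) {
    throw new RangeError(
      `${kind} retention must be a whole number of days in [${min}, ${max}], got ${days}`,
    );
  }
  return days;
}
```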
Processing integrity
- Analysis audit log captures every anomaly-detection decision: baseline statistics used, current value, each method’s score, composite confidence, classification.
- Inference feedback records every recommendation outcome (accepted / rejected / modified) with reason + post-action measurements; feeds back into ML training.
- Action queue tracks every enforcement action through pending → approved → executing → completed/failed with full audit trail.
- Event persistence via the event store (ADR-0011) — cross-instance durable event ledger.
- Daily ingest reconciliation — agent-emitted vs backend-received counts, with a delta alert.
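The daily ingest reconciliation in the last bullet can be sketched as a comparison of two counts per tenant. The field names and the 0.5% tolerance are assumptions for illustration; the actual alert threshold is not stated on this page.

```typescript
// Compare what agents report they emitted with what the backend stored.
// ASSUMPTION: field names and the default tolerance are illustrative.
interface DailyCounts {
  tenantId: string;
  agentEmitted: number;
  backendReceived: number;
}

interface ReconciliationResult {
  tenantId: string;
  delta: number;      // emitted minus received; positive means events went missing
  deltaRatio: number; // |delta| relative to the emitted count
  alert: boolean;
}

function reconcile(counts: DailyCounts, tolerance = 0.005): ReconciliationResult {
  const delta = counts.agentEmitted - counts.backendReceived;
  // Avoid dividing by zero: if nothing was emitted, any received event is a full mismatch.
  const deltaRatio =
    counts.agentEmitted === 0
      ? (counts.backendReceived === 0 ? 0 : 1)
      : Math.abs(delta) / counts.agentEmitted;
  return { tenantId: counts.tenantId, delta, deltaRatio, alert: deltaRatio > tolerance };
}
```

A relative threshold keeps high-volume tenants from alerting on a handful of late events, while a small tenant that loses a few events out of a few hundred still trips the alert.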
Privacy
- Privacy policy and DPA template on request.
- Personal data collected: email, name, IP / user-agent on auth, timezone preference.
- Subprocessor list available on request and republished annually.
- Cookie policy on the marketing site.
- DSAR process documented; DSARs handled within 30 days.
Multi-tenant RLS
Every customer is a tenant; every row in every table that holds tenant data carries a tenant_id column and is protected by Postgres Row Level Security.
-- Pattern applied across 21 tables today, target 50
ALTER TABLE cost.savings_ledger ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation
ON cost.savings_ledger
USING (tenant_id = current_setting('app.tenant_id')::uuid);
Application code sets app.tenant_id per request via a Sequelize hook; cross-tenant reads are unreachable by construction. The only path that bypasses RLS is runAsBypass(tenantId, fn) from @tensorcost/db-utils, used by:
- gRPC handlers (after the agent’s tenantId is verified by HMAC).
- Cron-driven sweeps (anomaly detection, recommenders).
- Cross-service joins.
A RUN-AS-BYPASS-LINT rule is on the way; until then, code review enforces the discipline. A botched RLS migration silently returns zero rows, which during a customer demo is indistinguishable from a feature that “doesn’t work” — so each table’s rollout requires both the migration and a wrap+verify pass on every reader before merge.
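The per-request tenant scoping that the policy above relies on can be sketched as follows. This is not the Sequelize hook or the runAsBypass helper from @tensorcost/db-utils; withTenant and SqlClient are hypothetical names, and the sketch assumes a client that exposes a parameterized query method. set_config is a standard Postgres function, and passing true as its third argument limits the setting to the current transaction.

```typescript
// Minimal client shape so the sketch runs without a database driver.
// ASSUMPTION: a real pg or Sequelize connection would be adapted to this interface.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical helper: runs `fn` inside a transaction with app.tenant_id set.
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  fn: (c: SqlClient) => Promise<T>,
): Promise<T> {
  if (!UUID_RE.test(tenantId)) throw new Error("tenantId must be a UUID");
  await client.query("BEGIN");
  try {
    // is_local = true scopes the setting to this transaction, so a pooled
    // connection cannot carry one tenant's id into the next request.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

With the setting in place, a query such as SELECT * FROM cost.savings_ledger returns only rows whose tenant_id matches, because the tenant_isolation policy filters the rest.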
Agent ingress
Per ADR-0010, the unified GPU agent connects to TensorCost over a long-lived gRPC stream on TCP/50051, fronted by an AWS Network Load Balancer with TLS-only listeners. The task security group accepts 0.0.0.0/0 on TCP/50051, so the port is reachable from any address. Four compensating controls mitigate that exposure; three are in place today and the fourth (CC-4) is in flight:
| Control | Implementation |
|---|---|
| CC-1: HMAC + replay guard | Every AgentHello carries a tenant-bound HMAC-SHA256, ±300s skew window, and a Redis-backed nonce reuse rejection (SETNX nonce:{tenantId}:{keyId}:{hex(nonce)} with 600s TTL). Verified with timingSafeEqual against a per-deployment Secrets-Manager pepper. |
| CC-2: Per-stream tenant binding | The verified AgentHello populates session.tenantId; every DB write runs inside runAsBypass with the resolved tenantId so RLS policies cannot leak cross-tenant. |
| CC-3: TLS-only at the NLB | Plaintext gRPC connections fail before reaching the handler. mTLS is the planned upgrade path once we have an Envoy / ingress gateway. |
| CC-4: Per-tenant per-stream rate limit | In flight; tracked as a follow-up. Today’s compensating control is the NLB connection-per-source-IP soft limit. |
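
The CC-1 check in the table above can be sketched as below. Several details are assumptions for illustration: the AgentHello field names, the canonical string that is signed, and the use of a single per-key secret (how the Secrets Manager pepper enters the computation is not described here). An in-memory map stands in for the Redis SETNX call so the sketch runs on its own.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Stand-in for Redis "set if absent with TTL"; returns true only on first use.
class InMemoryNonceStore {
  private readonly seen = new Map<string, number>(); // key -> expiry (epoch ms)

  setIfAbsent(key: string, ttlSeconds: number, nowMs: number): boolean {
    const expiry = this.seen.get(key);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.seen.set(key, nowMs + ttlSeconds * 1000);
    return true;
  }
}

// ASSUMPTION: field names are illustrative, not the real protobuf schema.
interface AgentHello {
  tenantId: string;
  keyId: string;
  timestampMs: number;
  nonceHex: string;
  macHex: string;
}

const SKEW_MS = 300_000; // ±300s skew window, from the table
const NONCE_TTL_S = 600; // 600s TTL, from the table

// ASSUMPTION: the newline-joined canonical string is a placeholder format.
function signHello(h: Omit<AgentHello, "macHex">, key: Buffer): string {
  return createHmac("sha256", key)
    .update(`${h.tenantId}\n${h.keyId}\n${h.timestampMs}\n${h.nonceHex}`)
    .digest("hex");
}

function verifyHello(h: AgentHello, key: Buffer, store: InMemoryNonceStore, nowMs: number): boolean {
  if (Math.abs(nowMs - h.timestampMs) > SKEW_MS) return false;
  const expected = Buffer.from(signHello(h, key), "hex");
  const given = Buffer.from(h.macHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return false;
  // Record the nonce only after the MAC verifies, so unauthenticated
  // traffic cannot fill the store.
  return store.setIfAbsent(`nonce:${h.tenantId}:${h.keyId}:${h.nonceHex}`, NONCE_TTL_S, nowMs);
}
```

The 600s nonce TTL covers the full ±300s acceptance window, so a captured hello cannot be replayed after its nonce expires: by then its timestamp is outside the skew window.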
The escape hatch GPU_GRPC_ALLOW_UNREGISTERED_AGENTS=true was removed; every agent must carry a verified hello.
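CC-4 is still in flight, and the in-flight list names a token bucket as the intended mechanism. A minimal per-tenant token bucket might look like the sketch below. The class name, capacity, and refill rate are assumptions; the production design is tracked in ADR-0010 and may differ.

```typescript
// Per-tenant token bucket. Each tenant starts with a full bucket that
// refills continuously up to `capacity`.
// ASSUMPTION: names and parameters are illustrative only.
class TenantTokenBucket {
  private readonly buckets = new Map<string, { tokens: number; lastMs: number }>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // Returns true and deducts `cost` tokens if the tenant has enough budget.
  tryConsume(tenantId: string, nowMs: number, cost = 1): boolean {
    const b = this.buckets.get(tenantId) ?? { tokens: this.capacity, lastMs: nowMs };
    // Refill for the time elapsed since the last call; ignore clock regressions.
    const elapsedS = Math.max(0, nowMs - b.lastMs) / 1000;
    b.tokens = Math.min(this.capacity, b.tokens + elapsedS * this.refillPerSecond);
    b.lastMs = nowMs;
    const allowed = b.tokens >= cost;
    if (allowed) b.tokens -= cost;
    this.buckets.set(tenantId, b);
    return allowed;
  }
}
```

Keying the bucket on the verified tenantId rather than the source IP means one noisy tenant cannot exhaust another tenant's budget, even when both sit behind the same NAT.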
Redaction at ingestion
Raw prompts and responses never enter our storage.
- Bedrock invocation logs are read from customer-owned S3 buckets (or CloudWatch Logs) using read-only access.
- The ingestion parser computes prompt_hash and response_hash (SHA-256) and writes only those hashes.
- Unit tests assert no raw prompt or response text escapes the parser.
- Customers concerned about content can disable request-metadata capture entirely and still receive model/cost/latency-level attribution and recommendations.
- Anonymized aggregates that feed our public benchmark report are gated behind explicit per-tenant consent.
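The hashing step described above can be sketched as follows. The record shapes and field names are assumptions for illustration; only the use of SHA-256 and the rule that hashes, not raw text, are persisted come from this page.

```typescript
import { createHash } from "node:crypto";

// ASSUMPTION: simplified shapes; the real log and event schemas are not shown here.
interface RawInvocation {
  modelId: string;
  prompt: string;
  response: string;
  inputTokens: number;
  outputTokens: number;
}

interface SpendEvent {
  modelId: string;
  promptHash: string;
  responseHash: string;
  inputTokens: number;
  outputTokens: number;
}

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

function redact(raw: RawInvocation): SpendEvent {
  // Build the output field by field instead of spreading `raw`, so a newly
  // added raw-text field cannot reach storage by accident.
  return {
    modelId: raw.modelId,
    promptHash: sha256Hex(raw.prompt),
    responseHash: sha256Hex(raw.response),
    inputTokens: raw.inputTokens,
    outputTokens: raw.outputTokens,
  };
}
```

A unit test in the spirit of the one mentioned above would serialize the event and assert that neither the prompt nor the response text appears in it.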
Customer onboarding — least-privilege IAM
The CFN onboarding stack ships with the minimum read-only IAM policy required. Modes:
| Mode | What the role can do | Where the stack deploys |
|---|---|---|
| SingleAccount (default) | Read-only Bedrock + CloudWatch + EC2 + Cost Explorer in one account | The single account being monitored |
| Organization | All of SingleAccount plus sts:AssumeRole into OrganizationAccountAccessRole (or AWSControlTowerExecution) per member account | The Organization’s management account |
Variants supported in the wizard:
- Consolidated billing with separate payer: payer-account CUR read + member-account Bedrock log read.
- SCP-restricted environments: RolePathPrefix parameter for Organizations that mandate a custom path.
- AWS Control Tower / Landing Zone: defaults to the AWSControlTowerExecution jump role.
- Cross-account CUR: separate bucket-policy snippet shipped as a distinct artifact (security teams routinely review these independently).
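For reviewers who want to see what the cross-account access looks like, the sketch below builds an STS AssumeRole request and the matching role trust policy. The field names follow the AWS STS and IAM formats, and 900 seconds is the shortest session STS allows, which matches the 15-minute credentials described earlier. The account IDs, role name, session name, and helper function names are placeholders, not values from the CFN stack.

```typescript
// Request shape mirrors the AWS STS AssumeRole parameters.
interface AssumeRoleParams {
  RoleArn: string;
  RoleSessionName: string;
  ExternalId: string;
  DurationSeconds: number;
}

// ASSUMPTION: role name, path, and session name are placeholders.
function buildAssumeRoleParams(
  customerAccountId: string,
  roleName: string,
  externalId: string,
  rolePath = "/", // e.g. "/custom/path/" when a RolePathPrefix is mandated
): AssumeRoleParams {
  if (!/^\d{12}$/.test(customerAccountId)) throw new Error("AWS account IDs are 12 digits");
  return {
    RoleArn: `arn:aws:iam::${customerAccountId}:role${rolePath}${roleName}`,
    RoleSessionName: "tensorcost-read",
    ExternalId: externalId,
    DurationSeconds: 900, // 15 minutes
  };
}

// Trust policy the customer-side role would carry: only the platform account
// may assume it, and only when it presents the agreed external ID.
function buildTrustPolicy(platformAccountId: string, externalId: string) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: { AWS: `arn:aws:iam::${platformAccountId}:root` },
        Action: "sts:AssumeRole",
        Condition: { StringEquals: { "sts:ExternalId": externalId } },
      },
    ],
  };
}
```

The external ID condition guards against the confused-deputy problem: another party that learns the role ARN still cannot have the platform assume it on their behalf without the tenant-specific ID.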
Full onboarding flow with day-by-day expectations, top-5 day-1 failure remediations, and Organization-mode verification snippets lives in our internal customer-onboarding runbook (request via your Slack Connect channel).
Secret rotation
Per the secret-rotation runbook, all platform secrets rotate on a documented cadence:
| Secret | Cadence | Owner | Mechanism |
|---|---|---|---|
| Agent HMAC pepper (AGENT_HMAC_PEPPER) | Quarterly | Platform | Secrets Manager rotation lambda; old + new accepted for 24h overlap |
| Database master credentials | Quarterly | Platform | RDS-managed rotation |
| Cognito signing keys | Annually | Platform | Cognito-managed |
| LaunchDarkly SDK keys | On suspected compromise + annually | Platform | LD console + Secrets Manager |
| Customer agent credentials | Customer-driven | Customer | POST /v1/identity/agent-credentials for new key, then revoke old |
| Customer integration credentials | 90 days suggested | Customer | POST /v1/integration/connections/:id/rotate-secret |
Customer-facing secrets (agent HMAC keys, OpenAI API keys we hold, etc.) are stored encrypted with a per-tenant KMS-derived key in integration.connection_secret (RLS-enforced).
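The 24-hour overlap for the agent HMAC pepper in the table above can be sketched as a verifier that accepts the previous pepper only for a limited time after rotation. This is an illustration under assumptions: the type and function names are hypothetical, and using the pepper directly as the HMAC key is a simplification of however the production code combines it with per-agent keys.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// ASSUMPTION: illustrative shape; Secrets Manager version staging is not modeled.
interface PepperVersion {
  value: Buffer;
  activatedAtMs: number;
}

const OVERLAP_MS = 24 * 60 * 60 * 1000; // 24h overlap, from the table

// `versions` is ordered newest first. The previous pepper is accepted only
// while the overlap window after the newest pepper's activation is open.
function acceptedPeppers(versions: PepperVersion[], nowMs: number): Buffer[] {
  const [current, previous] = versions;
  if (!current) return [];
  const accepted = [current.value];
  if (previous && nowMs - current.activatedAtMs < OVERLAP_MS) accepted.push(previous.value);
  return accepted;
}

function verifyWithRotation(
  message: string,
  macHex: string,
  versions: PepperVersion[],
  nowMs: number,
): boolean {
  const given = Buffer.from(macHex, "hex");
  let ok = false;
  // Check every accepted pepper without early exit, so timing does not
  // reveal which version matched.
  for (const pepper of acceptedPeppers(versions, nowMs)) {
    const expected = createHmac("sha256", pepper).update(message).digest();
    if (given.length === expected.length && timingSafeEqual(given, expected)) ok = true;
  }
  return ok;
}
```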
Tenant offboarding
A documented state machine — active → offboarding_pending → offboarding_archive → deleted — covers every churn or GDPR Art. 17 request:
- Soft delete (30 days): tenant moves to offboarding_pending. Ingestion stops, dashboards become read-only, recommendations freeze. Data is retained for 30 days to allow recovery.
- Archive (7 days): tenant moves to offboarding_archive. A signed export bundle (CSV + JSON) is delivered to the tenant’s offboarding contact.
- Hard delete: every tenant-scoped row is purged via chunked DELETE per schema (large schemas like ai.ai_spend_events are deleted in batches to avoid bloat). Customer-owned S3 buckets are not touched; those are the customer’s to manage.
- Audit-trail preservation: audit.* rows survive offboarding indefinitely; required for compliance and legal-hold preservation.
Customer-side teardown (revoking the IAM role, disabling Bedrock invocation logging, removing CFN stacks) is a customer responsibility; we ship a runbook and a Slack Connect handoff.
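The offboarding state machine described in this section can be sketched as a transition table with minimum dwell times. The state names and the 30-day and 7-day periods come from this page; the function names and the explicit recovery transition back to active are assumptions based on the statement that data is retained to allow recovery.

```typescript
type TenantState = "active" | "offboarding_pending" | "offboarding_archive" | "deleted";

// Forward path through the lifecycle; `deleted` is terminal.
const NEXT_STATE: Record<TenantState, TenantState | null> = {
  active: "offboarding_pending",
  offboarding_pending: "offboarding_archive",
  offboarding_archive: "deleted",
  deleted: null,
};

// Minimum days a tenant must spend in a state before moving forward.
const MIN_DAYS_IN_STATE: Record<TenantState, number> = {
  active: 0,
  offboarding_pending: 30, // soft-delete window
  offboarding_archive: 7,  // archive window
  deleted: 0,
};

function advance(state: TenantState, daysInState: number): TenantState {
  const next = NEXT_STATE[state];
  if (next === null) throw new Error(`${state} is a terminal state`);
  if (daysInState < MIN_DAYS_IN_STATE[state]) {
    throw new Error(`${state} requires ${MIN_DAYS_IN_STATE[state]} days, only ${daysInState} elapsed`);
  }
  return next;
}

// ASSUMPTION: recovery is modeled as a direct return to `active`, and only
// from the soft-delete state.
function recover(state: TenantState): TenantState {
  if (state !== "offboarding_pending") throw new Error(`cannot recover from ${state}`);
  return "active";
}
```

Encoding the dwell times next to the transitions keeps the hard delete from running before the recovery and export windows have elapsed.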
Penetration testing
| Cadence | Scope |
|---|---|
| Annual | Full external + authenticated app + API |
| On major release | Targeted (the new surface) |
| On request | Customer-driven re-test; latest summary available |
We use independent third-party testers and publish the executive summary to enterprise prospects under NDA.
Background checks and security training
- Background checks on every team member with production access.
- Annual security awareness training (KnowBe4) for all staff.
- Quarterly tabletop incident-response exercises.
What an enterprise prospect typically asks for
- Security questionnaire (SIG / CAIQ) — completed on request, typically 48–72 hours.
- Architecture diagrams at three levels (exec, platform-lead, security-reviewer). Available on request.
- Subprocessor list and DPA. Available on request.
- SOC 2 bridge letter (issued while the Type I audit is under way, Q2-Q3 2026).
- Penetration-test executive summary — under NDA.
- Sample audit-trail export — under NDA.
Email security@tensorcost.com for the security package or to start a security review.
Reporting a vulnerability
Please report suspected vulnerabilities by emailing security@tensorcost.com. Include:
- A description of the issue and potential impact.
- Reproduction steps (PoC, logs, screenshots).
- Affected versions / commit hashes.
- Suggested mitigations if any.
We acknowledge receipt within 3 business days and provide a status update within 7 business days after triage. Please do not open public issues for suspected vulnerabilities. We coordinate disclosure with the reporter and publish security advisories with remediation details when fixes ship.