Observability
The GoCloudera backend and agent both instrument themselves with OpenTelemetry for distributed tracing, correlate logs with trace IDs, and handle SIGTERM with a bounded graceful shutdown. This page describes what you get out of the box and how to point it at your observability stack.
Distributed tracing
Tracing is OFF by default on self-hosted installs and ON in managed staging + production. Enable with a single env var.
Backend (Node.js)
Set OTEL_ENABLED=true. The backend loads the OpenTelemetry SDK before Express initializes, so auto-instrumentation can patch the HTTP, Express, gRPC, Redis, and PostgreSQL clients.
| Env var | Default | Purpose |
|---|---|---|
| OTEL_ENABLED | false | Master toggle. |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4318 | OTLP/HTTP endpoint. In ECS this is the ADOT sidecar running on localhost. |
| OTEL_SERVICE_NAME | gpu-dashboard-api | Service label that appears in your tracing UI. |
The default endpoint points at localhost because the recommended deployment shape is an ADOT Collector sidecar in the same ECS task that forwards to AWS X-Ray, Datadog, Honeycomb, or any other OTLP-compatible backend.
If the @opentelemetry/* packages are missing from the image for some reason, the tracing module logs a warning and degrades to a no-op, and the server still starts: there is no hard dependency.
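For reference, a minimal sketch of what that bootstrap looks like (module and variable names are illustrative, not the actual backend source). The SDK is only started when OTEL_ENABLED=true, and missing packages fall through to the no-op branch:

```typescript
// tracing.ts (illustrative name): load this before Express is imported so the
// auto-instrumentations can patch http, express, grpc, redis and pg.
export function startTracing() {
  if (process.env.OTEL_ENABLED !== 'true') return null; // master toggle

  try {
    // Lazy requires: if the @opentelemetry/* packages are missing from the image,
    // fall through to the catch block and keep the server bootable.
    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
    const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

    const sdk = new NodeSDK({
      serviceName: process.env.OTEL_SERVICE_NAME ?? 'gpu-dashboard-api',
      traceExporter: new OTLPTraceExporter({
        // OTLP/HTTP traces endpoint; the default points at the ADOT sidecar on localhost.
        url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318'}/v1/traces`,
      }),
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();
    return sdk;
  } catch (err) {
    console.warn('OpenTelemetry packages not available; tracing is a no-op', err);
    return null;
  }
}
```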
Agent (Python)
Set OTEL_ENABLED=true on the agent process. The agent auto-instruments grpc (agent→backend stream) and requests (HTTP fallback sync), so every call is a span.
| Env var | Default | Purpose |
|---|---|---|
| OTEL_ENABLED | false | Master toggle. |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP/gRPC endpoint (Python uses gRPC by default). |
| OTEL_SERVICE_NAME | unified-gpu-agent | Service label. |
Agent spans propagate trace context into the gRPC metadata, so a request initiated by the dashboard can be followed all the way through: Express route → backend service → gRPC push to agent → agent execution → command result back over the stream.
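Under the hood this is standard W3C trace context (a traceparent entry) carried in the call metadata, and the auto-instrumentation injects and extracts it for you. For a custom client, manual injection looks roughly like this; a sketch, assuming @grpc/grpc-js on the Node side:

```typescript
import { context, propagation } from '@opentelemetry/api';
import { Metadata } from '@grpc/grpc-js';

// Write the W3C traceparent header into outgoing gRPC metadata so the peer
// continues the same trace. Auto-instrumentation normally does this for you.
function metadataWithTraceContext(): Metadata {
  const metadata = new Metadata();
  propagation.inject(context.active(), metadata, {
    set: (carrier, key, value) => carrier.set(key, value),
  });
  return metadata;
}
```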
Structured logs + trace correlation
Backend logs are JSON (Winston) with one object per log line. When OpenTelemetry is active, every log line is automatically tagged with the current trace_id and span_id:
{
  "level": "info",
  "service": "gpu-dashboard-api",
  "timestamp": "2026-04-16 16:41:38",
  "message": "MLTrainingJob: retraining gpu_anomaly for tenant …",
  "trace_id": "7a0db7e045…",
  "span_id": "2e757533…"
}
This lets you jump directly from a log line in CloudWatch (or Datadog, Loki, etc.) to the full distributed trace in your tracing UI.
Agent logs use the same pattern when OpenTelemetry is enabled — each log line includes trace_id / span_id where applicable.
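On the backend side, the correlation boils down to a small Winston format that reads the active span from @opentelemetry/api. A sketch, not the exact logger configuration the backend ships with:

```typescript
import winston from 'winston';
import { trace } from '@opentelemetry/api';

// Stamp the active trace/span IDs onto every log record when a span is in scope.
const traceContext = winston.format((info) => {
  const span = trace.getActiveSpan();
  if (span) {
    const { traceId, spanId } = span.spanContext();
    info.trace_id = traceId;
    info.span_id = spanId;
  }
  return info;
});

export const logger = winston.createLogger({
  format: winston.format.combine(
    traceContext(),
    winston.format.timestamp(),
    winston.format.json(),
  ),
  defaultMeta: { service: process.env.OTEL_SERVICE_NAME ?? 'gpu-dashboard-api' },
  transports: [new winston.transports.Console()],
});
```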
Log levels
The backend reads LOG_LEVEL at startup:
| Level | What you see |
|---|---|
| error | Only failures. |
| warn | Failures + non-fatal issues (e.g. “Redis not available, event system running in local-only mode”). |
| info | Default. Normal startup, scheduled job summaries, request/response lines. |
| debug | Everything including Event published, Processing event, individual metric sync counts. |
Dev environments default to debug; staging and prod default to info.
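The selection logic amounts to a fallback chain. A sketch only; keying the default off NODE_ENV is an assumption of this example, not something the backend is documented to do:

```typescript
// An explicit LOG_LEVEL always wins; otherwise fall back to an environment default.
const level =
  process.env.LOG_LEVEL ??
  (process.env.NODE_ENV === 'development' ? 'debug' : 'info');
```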
Graceful shutdown
Fargate sends SIGTERM and gives the container 30 seconds before SIGKILL. The backend’s shutdown path respects that budget:
- A 30-second watchdog is armed (tunable via SHUTDOWN_TIMEOUT_MS).
- A re-entry guard prevents a second SIGTERM from corrupting the sequence — it forces an immediate exit(1) instead.
- Each shutdown step runs inside its own try/catch so one hung step doesn’t block the others:
  - Clear the staleness-check interval timer
  - Stop the ML training scheduler
  - Close the gRPC server (drains in-flight streams)
  - Flush + close the LaunchDarkly client
  - Shut down the event system (Redis pub/sub)
  - Close the database connection pool
- After all steps complete, exit(0). If the watchdog fires first, exit(1) with a log line explaining which step hung.
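Put together, the pattern looks roughly like this; handle names and step details are illustrative, not the actual backend code:

```typescript
// Handles created elsewhere in the app (illustrative names and shapes).
declare const stalenessTimer: NodeJS.Timeout;
declare const mlScheduler: { stop(): void };
declare const grpcServer: { tryShutdown(cb: () => void): void };
declare const ldClient: { flush(): Promise<void>; close(): void };
declare const eventSystem: { shutdown(): Promise<void> };
declare const sequelize: { close(): Promise<void> };

let shuttingDown = false;

process.on('SIGTERM', async () => {
  // Re-entry guard: a second SIGTERM during shutdown exits immediately.
  if (shuttingDown) process.exit(1);
  shuttingDown = true;

  // Watchdog: if a step hangs past the budget, exit(1) instead of stalling until SIGKILL.
  const budget = Number(process.env.SHUTDOWN_TIMEOUT_MS ?? 30_000);
  const watchdog = setTimeout(() => {
    console.error('Graceful shutdown timed out');
    process.exit(1);
  }, budget);

  // Each step is isolated so one failing or hanging step does not block the rest.
  const steps: Array<[string, () => void | Promise<unknown>]> = [
    ['staleness timer', () => clearInterval(stalenessTimer)],
    ['ml training scheduler', () => mlScheduler.stop()],
    ['grpc server', () => new Promise<void>((done) => grpcServer.tryShutdown(() => done()))],
    ['launchdarkly client', async () => { await ldClient.flush(); ldClient.close(); }],
    ['event system', () => eventSystem.shutdown()],
    ['database pool', () => sequelize.close()],
  ];
  for (const [name, run] of steps) {
    try {
      await run();
    } catch (err) {
      console.error(`Shutdown step failed: ${name}`, err);
    }
  }

  clearTimeout(watchdog);
  process.exit(0);
});
```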
The agent uses the same pattern: on SIGTERM it stops the gRPC stream, drains the send queue to disk atomically, sends a final health: shutting_down report, and exits.
Self-hosted: typical deployment shapes
AWS ECS + X-Ray via ADOT
Add the AWS Distro for OpenTelemetry Collector as a sidecar container in the same task definition. Point the backend at http://localhost:4318 (already the default). The ADOT sidecar forwards traces to X-Ray and logs to CloudWatch without any code changes.
Kubernetes + any OTLP-compatible collector
Deploy the OpenTelemetry Collector as a DaemonSet or as a sidecar. Point OTEL_EXPORTER_OTLP_ENDPOINT at the collector’s service DNS. The gpu-agent Helm chart exposes telemetry.otel.exporterEndpoint for this.
Datadog / Honeycomb / Grafana Cloud
Any OTLP-compatible SaaS works. Configure the collector (or point the backend directly) at the SaaS’s OTLP endpoint with the appropriate API key headers.
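For example, pointing the backend's exporter directly at Honeycomb would look roughly like this sketch; the endpoint and header name are Honeycomb's, so substitute your vendor's values:

```typescript
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Direct-to-SaaS export: OTLP/HTTP endpoint plus the vendor's API key header.
const exporter = new OTLPTraceExporter({
  url: 'https://api.honeycomb.io/v1/traces',
  headers: { 'x-honeycomb-team': process.env.HONEYCOMB_API_KEY ?? '' },
});
```

If you would rather keep the key out of code, the standard OTEL_EXPORTER_OTLP_HEADERS environment variable is read by the OTLP exporters themselves.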
What’s instrumented
- HTTP — every incoming Express request is a root span with method, route, status code, duration.
- gRPC — agent↔backend bidirectional streams show up as long-lived spans with per-message child spans.
- Database — every Sequelize query produces a span with the SQL statement.
- Redis — every cache get / set / brPop is a span.
- HTTP outbound — every axios / fetch / requests call (agent→cloud APIs, backend→ML sidecar) is a span.
Anomaly-detection training runs produce a single trace that covers: MLTrainingJob orchestrator → feature extraction SQL → HTTP call to Python sidecar → ModelArtifact write. That’s usually the most useful trace to look at when training is failing.
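That trace hangs together because the orchestrator opens one parent span and the auto-instrumented SQL, HTTP, and database calls nest under it as children. A sketch with illustrative function names, not the actual MLTrainingJob code:

```typescript
import { trace } from '@opentelemetry/api';

// Illustrative stand-ins for the real orchestrator steps.
declare function extractFeatures(tenantId: string): Promise<unknown>;
declare function callTrainingSidecar(features: unknown): Promise<unknown>;
declare function saveModelArtifact(model: unknown): Promise<void>;

const tracer = trace.getTracer('ml-training');

// One parent span per training run; the Sequelize, HTTP, and DB child spans
// created by auto-instrumentation nest underneath it automatically.
async function retrainForTenant(tenantId: string) {
  return tracer.startActiveSpan('MLTrainingJob.retrain', async (span) => {
    try {
      span.setAttribute('tenant.id', tenantId);
      const features = await extractFeatures(tenantId);   // feature extraction SQL
      const model = await callTrainingSidecar(features);  // HTTP call to the Python sidecar
      await saveModelArtifact(model);                     // ModelArtifact write
      return model;
    } finally {
      span.end();
    }
  });
}
```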