Observability
GoCloudera backend and agent both instrument themselves with OpenTelemetry for distributed tracing, correlate logs with trace IDs, and handle SIGTERM with a bounded graceful shutdown. This page describes what you get out of the box and how to point it at your observability stack.
Distributed tracing
Tracing is OFF by default on self-hosted installs and ON in managed staging + production. Enable with a single env var.
Backend (Node.js)
Set OTEL_ENABLED=true. The backend loads the OpenTelemetry SDK before Express initializes, so auto-instrumentation can patch the HTTP, Express, gRPC, Redis, and PostgreSQL clients.
| Env var | Default | Purpose |
|---|---|---|
| OTEL_ENABLED | false | Master toggle. |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4318 | OTLP/HTTP endpoint. In ECS this is the ADOT sidecar running on localhost. |
| OTEL_SERVICE_NAME | gpu-dashboard-api | Service label that appears in your tracing UI. |
The endpoint defaults to localhost because the recommended deployment shape is an ADOT Collector sidecar in the same ECS task, which forwards to AWS X-Ray, Datadog, Honeycomb, or any other OTLP-compatible backend.
If the @opentelemetry/* packages are missing from the image, the tracing module degrades to a no-op, logs a message, and the server still starts. Tracing is not a hard dependency.
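The toggle and fallback behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the backend's actual module: the function name startTracing and its return shape are hypothetical, and the sketch assumes the @opentelemetry/sdk-node and @opentelemetry/auto-instrumentations-node packages, leaving exporter setup (including OTEL_EXPORTER_OTLP_ENDPOINT) to the SDK's own environment-variable handling.

```javascript
// Hypothetical bootstrap. Call it before requiring Express so that
// auto-instrumentation can patch modules as they are loaded.
function startTracing(env = process.env) {
  if (env.OTEL_ENABLED !== "true") {
    return { enabled: false, reason: "OTEL_ENABLED is not 'true'" };
  }
  try {
    // Loaded lazily inside try/catch: a missing package must not stop startup.
    const { NodeSDK } = require("@opentelemetry/sdk-node");
    const {
      getNodeAutoInstrumentations,
    } = require("@opentelemetry/auto-instrumentations-node");

    const sdk = new NodeSDK({
      serviceName: env.OTEL_SERVICE_NAME || "gpu-dashboard-api",
      instrumentations: [getNodeAutoInstrumentations()],
      // No explicit exporter: assumed to be configured by the SDK from
      // OTEL_EXPORTER_OTLP_ENDPOINT and related environment variables.
    });
    sdk.start();
    return { enabled: true, sdk };
  } catch (err) {
    // Degrade to a no-op and say why, instead of crashing the server.
    console.warn(`tracing disabled: ${err.message}`);
    return { enabled: false, reason: err.message };
  }
}
```

With this shape, the entry point would call startTracing() on its first line and carry on regardless of the result.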
Agent (Python)
Set OTEL_ENABLED=true on the agent process. The agent auto-instruments grpc (agent→backend stream) and requests (HTTP fallback sync), so every call is a span.
| Env var | Default | Purpose |
|---|---|---|
| OTEL_ENABLED | false | Master toggle. |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP/gRPC endpoint (Python uses gRPC by default). |
| OTEL_SERVICE_NAME | unified-gpu-agent | Service label. |
Structured logs + trace correlation
Backend logs are JSON (Winston) with one object per log line. When OpenTelemetry is active, log lines are automatically tagged with the current trace_id and span_id where applicable.
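As a rough sketch of what that correlation amounts to, the function below builds one JSON log line and adds trace_id and span_id only when a span context is available. The function name formatLogLine and the timestamp, level, and message field names are assumptions for illustration; the real backend does this through Winston. In a live process the span context would typically come from the active span via the @opentelemetry/api package.

```javascript
// Hypothetical formatter: one JSON object per log line, with trace
// fields attached only when a span context is present.
function formatLogLine(level, message, spanContext, now = new Date()) {
  const record = {
    timestamp: now.toISOString(),
    level,
    message,
  };
  if (spanContext) {
    record.trace_id = spanContext.traceId;
    record.span_id = spanContext.spanId;
  }
  return JSON.stringify(record);
}

// Example with made-up IDs in the W3C trace-context format
// (32 hex characters for the trace, 16 for the span).
const exampleLine = formatLogLine("info", "request completed", {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
});
console.log(exampleLine);
```

Searching your log store for a trace_id then returns every line emitted while that trace was active.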
Log levels
The backend reads LOG_LEVEL at startup:
| Level | What you see |
|---|---|
| error | Only failures. |
| warn | Failures + non-fatal issues (e.g. “Redis not available, event system running in local-only mode”). |
| info | Default. Normal startup, scheduled job summaries, request/response lines. |
| debug | Everything, including “Event published”, “Processing event”, and individual metric sync counts. |
debug, staging and prod default to info.
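The level table is cumulative: each level shows everything the levels above it show. A minimal sketch of that ordering, with hypothetical helper names resolveLogLevel and isEnabled, and assuming that an unset or unrecognized LOG_LEVEL falls back to info:

```javascript
// The four documented levels, from most to least severe.
const LEVELS = ["error", "warn", "info", "debug"];

// Read LOG_LEVEL once at startup; fall back to "info" (assumed behaviour
// for unset or unrecognized values).
function resolveLogLevel(env = process.env) {
  const requested = (env.LOG_LEVEL || "").toLowerCase();
  return LEVELS.includes(requested) ? requested : "info";
}

// A message is emitted when its level is at least as severe as the
// configured level.
function isEnabled(configured, level) {
  return LEVELS.indexOf(level) <= LEVELS.indexOf(configured);
}
```

For example, with LOG_LEVEL=info, isEnabled("info", "warn") is true and isEnabled("info", "debug") is false.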
Graceful shutdown
Fargate sends SIGTERM and gives the container 30 seconds before SIGKILL. The backend’s shutdown path respects that budget:
- A 30-second watchdog is armed (tunable via SHUTDOWN_TIMEOUT_MS).
- A re-entry guard prevents a second SIGTERM from corrupting the sequence; it forces an immediate exit(1) instead.
- Each shutdown step runs inside its own try/catch so one hung step doesn’t block the others:
  - Clear the staleness-check interval timer
  - Stop the ML training scheduler
  - Close the gRPC server (drains in-flight streams)
  - Flush + close the LaunchDarkly client
  - Shut down the event system (Redis pub/sub)
  - Close the database connection pool
- After all steps complete, exit(0). If the watchdog fires first, exit(1) with a log line explaining which step hung.
health: shutting_down report, and exits.
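The backend sequence above (watchdog, re-entry guard, isolated steps) can be sketched as follows. This is a simplified illustration, not the backend's code: createShutdown, its options, and the step names in the usage comment are hypothetical. Note that in this sketch try/catch isolates steps that throw or reject, while a step that never settles is covered by the watchdog.

```javascript
// Hypothetical factory. `steps` is an ordered list of [name, fn] pairs;
// `exit` and `log` are injectable so the logic can be tested.
function createShutdown({
  steps,
  timeoutMs = Number(process.env.SHUTDOWN_TIMEOUT_MS) || 30000,
  exit = process.exit,
  log = console,
}) {
  let shuttingDown = false;

  return async function shutdown(signal) {
    // Re-entry guard: a second signal forces an immediate exit(1).
    if (shuttingDown) {
      log.error(`${signal} received during shutdown, forcing exit`);
      return exit(1);
    }
    shuttingDown = true;

    // Watchdog: if the steps overrun the budget, report which one hung.
    let current = "(not started)";
    const watchdog = setTimeout(() => {
      log.error(`shutdown timed out during step: ${current}`);
      exit(1);
    }, timeoutMs);

    for (const [name, fn] of steps) {
      current = name;
      try {
        await fn(); // each step is isolated from the others
      } catch (err) {
        log.error(`shutdown step failed: ${name}: ${err.message}`);
      }
    }

    clearTimeout(watchdog);
    exit(0);
  };
}

// Usage sketch (step bodies are placeholders):
// const shutdown = createShutdown({ steps: [
//   ["grpc", () => closeGrpcServer()],
//   ["db", () => closeDbPool()],
// ] });
// process.on("SIGTERM", () => shutdown("SIGTERM"));
```

Injecting exit and log keeps the sequence testable without terminating the test process.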
Self-hosted: typical deployment shapes
AWS ECS + X-Ray via ADOT
Add the AWS Distro for OpenTelemetry Collector as a sidecar container in the same task definition. Point the backend at http://localhost:4318 (already the default). The ADOT sidecar forwards traces to X-Ray and logs to CloudWatch without any code changes.
Kubernetes + any OTLP-compatible collector
Deploy the OpenTelemetry Collector as a DaemonSet or as a sidecar. Point OTEL_EXPORTER_OTLP_ENDPOINT at the collector’s service DNS. The gpu-agent Helm chart exposes telemetry.otel.exporterEndpoint for this.
Datadog / Honeycomb / Grafana Cloud
Any OTLP-compatible SaaS works. Configure the collector (or point the backend directly) at the SaaS’s OTLP endpoint with the appropriate API key headers.
What’s instrumented
- HTTP: every incoming Express request is a root span with method, route, status code, and duration.
- gRPC: agent↔backend bidirectional streams show up as long-lived spans with per-message child spans.
- Database: every Sequelize query produces a span with the SQL statement.
- Redis: every cache get/set/brPop is a span.
- HTTP outbound: every axios/fetch/requests call (agent→cloud APIs, backend→ML sidecar) is a span.