Observability
The GoCloudera backend and agent both instrument themselves with OpenTelemetry for distributed tracing, correlate logs with trace IDs, and handle SIGTERM with a bounded graceful shutdown. This page describes what you get out of the box and how to point it at your observability stack.
Distributed tracing
Tracing is OFF by default on self-hosted installs and ON in managed staging + production. Enable with a single env var.
Backend (Node.js)
Set OTEL_ENABLED=true. The backend loads the OpenTelemetry SDK before Express initializes, so auto-instrumentation can patch the HTTP, Express, gRPC, Redis, and PostgreSQL clients.
| Env var | Default | Purpose |
|---|---|---|
| OTEL_ENABLED | false | Master toggle. |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4318 | OTLP/HTTP endpoint. In ECS this is the ADOT sidecar running on localhost. |
| OTEL_SERVICE_NAME | gpu-dashboard-api | Service label that appears in your tracing UI. |
The default endpoint points at localhost because the recommended deployment shape is an ADOT Collector sidecar in the same ECS task that forwards to AWS X-Ray, Datadog, Honeycomb, or any other OTLP-compatible backend.
If the @opentelemetry/* packages are missing from the image for some reason, the tracing module logs a warning and degrades to a no-op, and the server still starts: there is no hard dependency.
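For reference, a minimal sketch of what that bootstrap looks like (module and variable names are illustrative, not the actual backend source). The SDK is only started when OTEL_ENABLED=true, and missing packages fall through to the no-op branch:

```typescript
// tracing.ts (illustrative name): load this before Express is imported so the
// auto-instrumentations can patch http, express, grpc, redis and pg.
export function startTracing() {
  if (process.env.OTEL_ENABLED !== 'true') return null; // master toggle

  try {
    // Lazy requires: if the @opentelemetry/* packages are missing from the image,
    // fall through to the catch block and keep the server bootable.
    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
    const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

    const sdk = new NodeSDK({
      serviceName: process.env.OTEL_SERVICE_NAME ?? 'gpu-dashboard-api',
      traceExporter: new OTLPTraceExporter({
        // OTLP/HTTP traces endpoint; the default points at the ADOT sidecar on localhost.
        url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318'}/v1/traces`,
      }),
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();
    return sdk;
  } catch (err) {
    console.warn('OpenTelemetry packages not available; tracing is a no-op', err);
    return null;
  }
}
```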
Agent (Python)
Set OTEL_ENABLED=true on the agent process. The agent auto-instruments grpc (agent→backend stream) and requests (HTTP fallback sync), so every call is a span.
| Env var | Default | Purpose |
|---|---|---|
| OTEL_ENABLED | false | Master toggle. |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP/gRPC endpoint (Python uses gRPC by default). |
| OTEL_SERVICE_NAME | unified-gpu-agent | Service label. |
Agent spans propagate trace context into the gRPC metadata, so a request initiated by the dashboard can be followed all the way through: Express route → backend service → gRPC push to agent → agent execution → command result back over the stream.
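Under the hood this is standard W3C trace context (a traceparent entry) carried in the call metadata, and the auto-instrumentation injects and extracts it for you. For a custom client, manual injection looks roughly like this; a sketch, assuming @grpc/grpc-js on the Node side:

```typescript
import { context, propagation } from '@opentelemetry/api';
import { Metadata } from '@grpc/grpc-js';

// Write the W3C traceparent header into outgoing gRPC metadata so the peer
// continues the same trace. Auto-instrumentation normally does this for you.
function metadataWithTraceContext(): Metadata {
  const metadata = new Metadata();
  propagation.inject(context.active(), metadata, {
    set: (carrier, key, value) => carrier.set(key, value),
  });
  return metadata;
}
```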
Structured logs + trace correlation
Backend logs are JSON (Winston) with one object per log line. When OpenTelemetry is active, every log line is automatically tagged with the current trace_id and span_id:
{
  "level": "info",
  "service": "gpu-dashboard-api",
  "timestamp": "2026-04-16 16:41:38",
  "message": "MLTrainingJob: retraining gpu_anomaly for tenant …",
  "trace_id": "7a0db7e045…",
  "span_id": "2e757533…"
}
This lets you jump directly from a log line in CloudWatch (or Datadog, Loki, etc.) to the full distributed trace in your tracing UI.
Agent logs use the same pattern when OpenTelemetry is enabled — each log line includes trace_id / span_id where applicable.
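On the backend side, the correlation boils down to a small Winston format that reads the active span from @opentelemetry/api. A sketch, not the exact logger configuration the backend ships with:

```typescript
import winston from 'winston';
import { trace } from '@opentelemetry/api';

// Stamp the active trace/span IDs onto every log record when a span is in scope.
const traceContext = winston.format((info) => {
  const span = trace.getActiveSpan();
  if (span) {
    const { traceId, spanId } = span.spanContext();
    info.trace_id = traceId;
    info.span_id = spanId;
  }
  return info;
});

export const logger = winston.createLogger({
  format: winston.format.combine(
    traceContext(),
    winston.format.timestamp(),
    winston.format.json(),
  ),
  defaultMeta: { service: process.env.OTEL_SERVICE_NAME ?? 'gpu-dashboard-api' },
  transports: [new winston.transports.Console()],
});
```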
Log levels
The backend reads LOG_LEVEL at startup:
| Level | What you see |
|---|---|
| error | Only failures. |
| warn | Failures + non-fatal issues (e.g. “Redis not available, event system running in local-only mode”). |
| info | Default. Normal startup, scheduled job summaries, request/response lines. |
| debug | Everything including Event published, Processing event, individual metric sync counts. |
Dev environments default to debug; staging and prod default to info.
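The selection logic amounts to a fallback chain. A sketch only; keying the default off NODE_ENV is an assumption of this example, not something the backend is documented to do:

```typescript
// An explicit LOG_LEVEL always wins; otherwise fall back to an environment default.
const level =
  process.env.LOG_LEVEL ??
  (process.env.NODE_ENV === 'development' ? 'debug' : 'info');
```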
Graceful shutdown
Fargate sends SIGTERM and gives the container 30 seconds before SIGKILL. The backend’s shutdown path respects that budget:
- A 30-second watchdog is armed (tunable via SHUTDOWN_TIMEOUT_MS).
- A re-entry guard prevents a second SIGTERM from corrupting the sequence — it forces an immediate exit(1) instead.
- Each shutdown step runs inside its own try/catch so one hung step doesn’t block the others:
  - Clear the staleness-check interval timer
  - Stop the ML training scheduler
  - Close the gRPC server (drains in-flight streams)
  - Flush + close the LaunchDarkly client
  - Shut down the event system (Redis pub/sub)
  - Close the database connection pool
- After all steps complete, exit(0). If the watchdog fires first, exit(1) with a log line explaining which step hung.
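Put together, the pattern looks roughly like this; handle names and step details are illustrative, not the actual backend code:

```typescript
// Handles created elsewhere in the app (illustrative names and shapes).
declare const stalenessTimer: NodeJS.Timeout;
declare const mlScheduler: { stop(): void };
declare const grpcServer: { tryShutdown(cb: () => void): void };
declare const ldClient: { flush(): Promise<void>; close(): void };
declare const eventSystem: { shutdown(): Promise<void> };
declare const sequelize: { close(): Promise<void> };

let shuttingDown = false;

process.on('SIGTERM', async () => {
  // Re-entry guard: a second SIGTERM during shutdown exits immediately.
  if (shuttingDown) process.exit(1);
  shuttingDown = true;

  // Watchdog: if a step hangs past the budget, exit(1) instead of stalling until SIGKILL.
  const budget = Number(process.env.SHUTDOWN_TIMEOUT_MS ?? 30_000);
  const watchdog = setTimeout(() => {
    console.error('Graceful shutdown timed out');
    process.exit(1);
  }, budget);

  // Each step is isolated so one failing or hanging step does not block the rest.
  const steps: Array<[string, () => void | Promise<unknown>]> = [
    ['staleness timer', () => clearInterval(stalenessTimer)],
    ['ml training scheduler', () => mlScheduler.stop()],
    ['grpc server', () => new Promise<void>((done) => grpcServer.tryShutdown(() => done()))],
    ['launchdarkly client', async () => { await ldClient.flush(); ldClient.close(); }],
    ['event system', () => eventSystem.shutdown()],
    ['database pool', () => sequelize.close()],
  ];
  for (const [name, run] of steps) {
    try {
      await run();
    } catch (err) {
      console.error(`Shutdown step failed: ${name}`, err);
    }
  }

  clearTimeout(watchdog);
  process.exit(0);
});
```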
The agent uses the same pattern: on SIGTERM it stops the gRPC stream, drains the send queue to disk atomically, sends a final health: shutting_down report, and exits.
Self-hosted: typical deployment shapes
AWS ECS + X-Ray via ADOT
Add the AWS Distro for OpenTelemetry Collector as a sidecar container in the same task definition. Point the backend at http://localhost:4318 (already the default). The ADOT sidecar forwards traces to X-Ray and logs to CloudWatch without any code changes.
Kubernetes + any OTLP-compatible collector
Deploy the OpenTelemetry Collector as a DaemonSet or as a sidecar. Point OTEL_EXPORTER_OTLP_ENDPOINT at the collector’s service DNS. The gpu-agent Helm chart exposes telemetry.otel.exporterEndpoint for this.
Datadog / Honeycomb / Grafana Cloud
Any OTLP-compatible SaaS works. Configure the collector (or point the backend directly) at the SaaS’s OTLP endpoint with the appropriate API key headers.
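For example, pointing the backend's exporter directly at Honeycomb would look roughly like this sketch; the endpoint and header name are Honeycomb's, so substitute your vendor's values:

```typescript
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Direct-to-SaaS export: OTLP/HTTP endpoint plus the vendor's API key header.
const exporter = new OTLPTraceExporter({
  url: 'https://api.honeycomb.io/v1/traces',
  headers: { 'x-honeycomb-team': process.env.HONEYCOMB_API_KEY ?? '' },
});
```

If you would rather keep the key out of code, the standard OTEL_EXPORTER_OTLP_HEADERS environment variable is read by the OTLP exporters themselves.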
What’s instrumented
- HTTP — every incoming Express request is a root span with method, route, status code, duration.
- gRPC — agent↔backend bidirectional streams show up as long-lived spans with per-message child spans.
- Database — every Sequelize query produces a span with the SQL statement.
- Redis — every cache get / set / brPop is a span.
- HTTP outbound — every axios / fetch / requests call (agent→cloud APIs, backend→ML sidecar) is a span.
Anomaly-detection training runs produce a single trace that covers: MLTrainingJob orchestrator → feature extraction SQL → HTTP call to Python sidecar → ModelArtifact write. That’s usually the most useful trace to look at when training is failing.
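That trace hangs together because the orchestrator opens one parent span and the auto-instrumented SQL, HTTP, and database calls nest under it as children. A sketch with illustrative function names, not the actual MLTrainingJob code:

```typescript
import { trace } from '@opentelemetry/api';

// Illustrative stand-ins for the real orchestrator steps.
declare function extractFeatures(tenantId: string): Promise<unknown>;
declare function callTrainingSidecar(features: unknown): Promise<unknown>;
declare function saveModelArtifact(model: unknown): Promise<void>;

const tracer = trace.getTracer('ml-training');

// One parent span per training run; the Sequelize, HTTP, and DB child spans
// created by auto-instrumentation nest underneath it automatically.
async function retrainForTenant(tenantId: string) {
  return tracer.startActiveSpan('MLTrainingJob.retrain', async (span) => {
    try {
      span.setAttribute('tenant.id', tenantId);
      const features = await extractFeatures(tenantId);   // feature extraction SQL
      const model = await callTrainingSidecar(features);  // HTTP call to the Python sidecar
      await saveModelArtifact(model);                     // ModelArtifact write
      return model;
    } finally {
      span.end();
    }
  });
}
```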