Observability

Nimbus has no metrics sidecar, agent, or exporter process. Every diagnostic surface is compiled into the single binary and takes one of three shapes: a pull-based JSON endpoint, a structured log event on stdout, or a record in an append-only audit file. This page explains how those surfaces are wired and what each one is for. For the operational walkthrough (tokens, curl commands, response fields), see the operators’ observability guide.

One public probe, everything else gated

The HTTP router (crates/nimbus-server/src/router.rs) is composed from route groups with different trust levels, and the diagnostic surfaces split cleanly across two of them:

The public router carries exactly one diagnostic route: /health. Its handler (crates/nimbus-server/src/http/metadata.rs) returns {"ok":true} unconditionally — no credential, no tenant, no storage read. It is a liveness probe for load balancers, nothing more.
The local admin router carries every /debug/* route. Before a request reaches a handler it passes through the middleware chain in crates/nimbus-server/src/local_server/middleware.rs: the request path is classified into a route family, browser Origin headers are checked against the loopback allowlist, the admin credential is extracted, and a fail-closed gate rejects anything not authorized. Every gate decision — allow or deny — is also recorded in the audit log described below.

The route-family vocabulary itself (health, debug, native_api, deploy_admin, the per-adapter families, …) lives in the nimbus-operator crate, so the server, the gate, and the audit log all agree on what kind of request they are looking at. The broader credential model is covered in Auth and trust.
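The order of checks in the admin middleware can be sketched as a single fail-closed function. This is only an illustration of the shape, not the code in middleware.rs: the `RouteFamily` variants shown, the `/api/` prefix, the loopback host list, the deny reason strings, and the plain string comparison of the credential are all assumptions made for the example.

```rust
/// Hypothetical subset of the route-family vocabulary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RouteFamily {
    Health,
    Debug,
    NativeApi,
    Unknown,
}

/// Classify a request path into a route family (illustrative prefixes only).
pub fn classify(path: &str) -> RouteFamily {
    if path == "/health" {
        RouteFamily::Health
    } else if path.starts_with("/debug/") {
        RouteFamily::Debug
    } else if path.starts_with("/api/") {
        RouteFamily::NativeApi
    } else {
        RouteFamily::Unknown
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny(&'static str),
}

/// True when the Origin header names a loopback host (assumed allowlist).
fn is_loopback_origin(origin: &str) -> bool {
    let rest = match origin
        .strip_prefix("http://")
        .or_else(|| origin.strip_prefix("https://"))
    {
        Some(rest) => rest,
        None => return false,
    };
    // Separate the host from an optional ":port" suffix.
    let host = if rest.starts_with('[') {
        match rest.find(']') {
            Some(end) if rest[end + 1..].is_empty() || rest[end + 1..].starts_with(':') => {
                &rest[..=end]
            }
            _ => return false,
        }
    } else {
        rest.split(':').next().unwrap_or("")
    };
    matches!(host, "localhost" | "127.0.0.1" | "[::1]")
}

/// Fail-closed gate: the only path to `Allow` is passing every check.
pub fn gate(
    path: &str,
    origin: Option<&str>,
    credential: Option<&str>,
    expected: &str,
) -> Decision {
    if classify(path) != RouteFamily::Debug {
        return Decision::Deny("not_admin_route");
    }
    if let Some(origin) = origin {
        if !is_loopback_origin(origin) {
            return Decision::Deny("origin_not_allowed");
        }
    }
    match credential {
        Some(token) if token == expected => Decision::Allow,
        Some(_) => Decision::Deny("bad_credential"),
        None => Decision::Deny("missing_credential"),
    }
}
```

Because `gate` returns a `Decision` rather than calling a handler, the caller can record both outcomes before any handler runs. A production gate would also compare credentials in constant time.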

The debug surfaces

Five admin-gated routes expose the binary’s internal state, all handled in crates/nimbus-server/src/http/metadata.rs:

Route	What it reports
`/debug/license/status`	License source, kind, entitlements, and live monthly-active-user accounting
`/debug/encryption/status`	Whether encryption at rest is on, which storage families are protected, and which key provider is configured
`/debug/runtime/metrics`	Effective runtime limits plus live invocation counters, per runtime lane
`/debug/tenants/{tenant_id}/engine/metrics`	The per-tenant engine diagnostics snapshot
`/debug/tenants/{tenant_id}/consistency`	An on-demand consistency verification report

Two design choices are worth noting. First, /debug/runtime/metrics always returns 200 with a stable shape: its fields are null until a deployment is active, so a freshly started server never surfaces a spurious error to the operator UI. Second, the tenant-scoped routes validate the tenant id and enter the engine under an operator-class tenant isolation context tagged with the diagnostic surface, so diagnostic reads are attributed and policed like any other operator access rather than exempted from it.
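The stable-shape behaviour can be pictured with optional fields that serialize to null instead of disappearing. The sketch below is hand-rolled to stay dependency-free, and the two field names are hypothetical placeholders, not the real response schema.

```rust
/// Hypothetical two-field subset of the runtime metrics payload.
pub struct RuntimeMetrics {
    pub max_concurrent_invocations: Option<u64>,
    pub invocations_started: Option<u64>,
}

/// Render an optional counter as a JSON number or `null`.
fn json_number(value: Option<u64>) -> String {
    value.map_or_else(|| "null".to_string(), |n| n.to_string())
}

/// Every key is always present; only the values change once a deployment is active.
pub fn render(metrics: &RuntimeMetrics) -> String {
    format!(
        "{{\"max_concurrent_invocations\":{},\"invocations_started\":{}}}",
        json_number(metrics.max_concurrent_invocations),
        json_number(metrics.invocations_started),
    )
}
```

A consumer can therefore parse the same set of keys on every poll and treat null as "no deployment yet" rather than as an error.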

What the per-tenant snapshot aggregates

The engine diagnostics endpoint is a thin transport wrapper: the server asks the engine, and the engine assembles a TenantEngineDiagnosticsSnapshot (crates/nimbus-engine/src/tenant.rs) from the live per-tenant runtime. Its groups map one-to-one onto the stages a write or read passes through:

Mutation admission — the write admission gate: queue depth against capacity, age of the oldest queued mutation, admitted versus shed counts, and the current load-shedding phase. This is where overload shows up first.
Mutation journal — causal write progress through assignment, durable append, storage application, contiguous publication, and reader visibility. Six ordered heads and four adjacent-phase lags distinguish which phase owns a backlog; the snapshot reconciles concurrent observations without advancing production state. The group also includes pending responses and worker health (running, start and restart counts, failures).
Subscription delivery — the reactive fan-out queue: depth, worker health, and coalescing counters showing how many commits were batched per wakeup. This tells you whether live queries are keeping up with write volume.
Materialized read surface and serving snapshot manager — the read path: how many tables and documents are resident in memory against their capacities, and how many versioned snapshots are retained, pinned by in-flight reads, or pruned. Together they describe read-path cache pressure.
Query planning — counters splitting query executions by plan shape: full scan, single-field index, or composite index, for both plain and paginated queries. A rising full-scan count is the signal to add an index.
Provider write pipeline — present for PostgreSQL and MySQL tenants: the configured and maximum observed in-flight provider-operation depth, journal records versus journal statements, batch attempts, cancellations observed after batch admission, admitted-operation errors, and cumulative admitted-operation time. The operation count does not include SQL statements issued internally while applying an individual durable record. PostgreSQL’s elapsed value covers its append/apply pair; MySQL’s covers its batched journal operation, so the values are not cross-adapter end-to-end latency measures. Other transaction setup, validation, apply, and commit failures remain in their owning diagnostics. PostgreSQL may report a depth of two because it overlaps ordered work on one transaction connection; MySQL deliberately remains at one. A record-to-statement ratio near one under nontrivial batches means writes are no longer collapsing as expected, while growing pipeline errors or cancellations distinguish failed admitted work from an ordinary healthy backlog. The counters are tenant-local and do not include SQL text, error strings, or identifiers.
libSQL replica freshness — present only when the tenant is served from an embedded libSQL replica: the sequence the replica must reach, what it has applied, and which barrier path recent reads took (crates/nimbus-storage/src/libsql/freshness.rs).

The snapshot is a point-in-time read of live counters — collecting it does not pause the tenant.
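Two of the signals described above are ratios a collector derives from raw counters rather than fields the snapshot reports directly. The following is a minimal sketch of that arithmetic; the struct and field names are assumptions for illustration and do not mirror TenantEngineDiagnosticsSnapshot, and the delta helper assumes the counters are cumulative between polls.

```rust
/// Hypothetical query-planning counters taken from one snapshot.
pub struct QueryPlanCounters {
    pub full_scan: u64,
    pub single_field_index: u64,
    pub composite_index: u64,
}

/// Share of query executions that fell back to a full scan, if any ran.
pub fn full_scan_share(c: &QueryPlanCounters) -> Option<f64> {
    let total = c.full_scan + c.single_field_index + c.composite_index;
    if total == 0 {
        None
    } else {
        Some(c.full_scan as f64 / total as f64)
    }
}

/// Full scans added between two polls; saturates if the counter was reset.
pub fn full_scan_delta(earlier: &QueryPlanCounters, later: &QueryPlanCounters) -> u64 {
    later.full_scan.saturating_sub(earlier.full_scan)
}

/// Journal records per journal statement; values near 1.0 under
/// nontrivial batches suggest writes are not collapsing.
pub fn records_per_statement(records: u64, statements: u64) -> Option<f64> {
    if statements == 0 {
        None
    } else {
        Some(records as f64 / statements as f64)
    }
}
```

Returning `None` for an empty denominator keeps an idle tenant from looking like a tenant with a zero ratio.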

The consistency verifier

The consistency route runs an active check rather than reading counters. The verifier (crates/nimbus-engine/src/verification.rs) fingerprints the tenant’s authoritative storage snapshot, the shadow materializer, and the embedded replica; each digest covers schema, documents, and scheduled executions at a sequence point. It then checks the journal bootstrap cut and reports every invariant violation as a structured mismatch naming both sides. A healthy tenant returns ok: true with an empty mismatch list. This exists because the engine maintains several derived views of the same journal (see Engine and the mutation path), and divergence between them must be detectable on demand, not just asserted in tests.
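The idea of a mismatch that names both sides can be sketched as a field-by-field comparison of digests. The types, view labels, and field names below are hypothetical, digests are reduced to plain integers, and the journal bootstrap cut check is omitted; the real verifier in verification.rs is not reproduced here.

```rust
/// Hypothetical per-view fingerprint taken at a sequence point.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ViewDigest {
    pub view: &'static str,
    pub sequence: u64,
    pub schema: u64,
    pub documents: u64,
    pub scheduled: u64,
}

/// One invariant violation, naming the two views that disagree.
#[derive(Debug, PartialEq, Eq)]
pub struct Mismatch {
    pub left: &'static str,
    pub right: &'static str,
    pub field: &'static str,
}

pub struct Report {
    pub ok: bool,
    pub mismatches: Vec<Mismatch>,
}

/// Compare each derived view against the authoritative digest.
pub fn verify(authoritative: &ViewDigest, derived: &[ViewDigest]) -> Report {
    let mut mismatches = Vec::new();
    for view in derived {
        let checks = [
            ("sequence", authoritative.sequence, view.sequence),
            ("schema", authoritative.schema, view.schema),
            ("documents", authoritative.documents, view.documents),
            ("scheduled", authoritative.scheduled, view.scheduled),
        ];
        for &(field, left, right) in &checks {
            if left != right {
                mismatches.push(Mismatch {
                    left: authoritative.view,
                    right: view.view,
                    field,
                });
            }
        }
    }
    Report { ok: mismatches.is_empty(), mismatches }
}
```

Collecting every mismatch instead of stopping at the first one matches the "reports every invariant violation" behaviour described above.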

Structured logging

The binary initializes tracing with a stdout formatter at process start (crates/nimbus-bin/src/main.rs). Filtering uses the standard RUST_LOG variable, parsed as a target=level directive list (the tracing-subscriber Targets filter), defaulting to info when unset. There is no Nimbus-specific log configuration layer: crate-level targets like nimbus_server and nimbus_engine are the filtering knobs.

Latency budgets

Rather than exporting timing histograms, hot paths carry budgeted segment timers that stay silent until a budget is exceeded. Two mirrored modules define them:

crates/nimbus-server/src/latency.rs — transport-side segments: server.auth (10 ms), server.storage (50 ms), server.runtime (100 ms).
crates/nimbus-engine/src/engine/latency.rs — engine-side segments: engine.tenant_load (50 ms), engine.wait_visibility (25 ms), engine.query_prepare (5 ms), engine.query_execute (50 ms), engine.query_cache (5 ms).

When a segment overruns, a WARN event is emitted with structured fields: the stable segment name, elapsed_ms, and budget_ms. The timers finish on drop, so an early return or error path still reports. The result is a log stream that is quiet at the default level but yields an alertable, low-volume latency signal without any metrics infrastructure.
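The finish-on-drop behaviour is a standard Rust guard pattern, sketched below under some assumptions. `BudgetTimer`, `Overrun`, and the in-memory sink are hypothetical names for this example; the real modules emit a WARN event through tracing, which is replaced here by pushing a record into a `RefCell<Vec<_>>` so the sketch needs only the standard library.

```rust
use std::cell::RefCell;
use std::time::{Duration, Instant};

/// What the sketch records on overrun; mirrors the three documented fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Overrun {
    pub segment: &'static str,
    pub elapsed_ms: u128,
    pub budget_ms: u128,
}

/// Guard that measures a segment and reports only if the budget is exceeded.
pub struct BudgetTimer<'a> {
    segment: &'static str,
    budget: Duration,
    started: Instant,
    sink: &'a RefCell<Vec<Overrun>>,
}

impl<'a> BudgetTimer<'a> {
    pub fn start(
        segment: &'static str,
        budget: Duration,
        sink: &'a RefCell<Vec<Overrun>>,
    ) -> Self {
        BudgetTimer { segment, budget, started: Instant::now(), sink }
    }
}

impl Drop for BudgetTimer<'_> {
    /// Runs on every exit path, including early returns and `?` errors.
    fn drop(&mut self) {
        let elapsed = self.started.elapsed();
        if elapsed > self.budget {
            self.sink.borrow_mut().push(Overrun {
                segment: self.segment,
                elapsed_ms: elapsed.as_millis(),
                budget_ms: self.budget.as_millis(),
            });
        }
    }
}
```

A call site binds the guard to a named local such as `let _timer = BudgetTimer::start(...)`. Binding to a bare `_` would drop the guard immediately and measure nothing.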

The access audit log

Independently of stdout logging, every request that reaches an admin-gated route family is recorded in a JSONL audit file, whether authorization succeeds or fails. The writer (crates/nimbus-operator/src/audit.rs) serializes one record per line with the timestamp, route family, tenant id when one can be attributed, the auth scope and credential method, the success flag, the request origin, and the reason string. The file is created with owner-only (0600) permissions and lives under the platform state directory (logs/access.jsonl, resolved by crates/nimbus-operator/src/paths.rs).

The middleware emits these records at the gate, not in the handlers — so a denied request is audited even though no handler ever ran, and the audit trail cannot be bypassed by a handler bug. Tenant attribution works across adapters: the writer extracts tenant ids from native and Convex paths and from Firestore gRPC metadata headers, so a Firebase client’s admin-surface attempt is attributed to its project.
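An append-only JSONL writer with owner-only permissions can be sketched with the standard library alone on Unix. The record below carries only a subset of the documented fields, the JSON key names are assumptions, and the hand-written string escaping stands in for whatever serializer audit.rs actually uses.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Hypothetical subset of an audit record.
pub struct AuditRecord<'a> {
    pub timestamp: &'a str,
    pub route_family: &'a str,
    pub tenant_id: Option<&'a str>,
    pub success: bool,
    pub reason: &'a str,
}

/// Minimal JSON string escaping so each record stays on one line.
pub fn json_string(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for ch in value.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Serialize one record as a single newline-terminated JSON line.
pub fn to_jsonl(record: &AuditRecord<'_>) -> String {
    let tenant = record
        .tenant_id
        .map_or_else(|| "null".to_string(), json_string);
    format!(
        "{{\"timestamp\":{},\"route_family\":{},\"tenant_id\":{},\"success\":{},\"reason\":{}}}\n",
        json_string(record.timestamp),
        json_string(record.route_family),
        tenant,
        record.success,
        json_string(record.reason),
    )
}

/// Append a record, creating the file with mode 0600 if it does not exist.
pub fn append(path: &Path, record: &AuditRecord<'_>) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .mode(0o600)
        .open(path)?;
    file.write_all(to_jsonl(record).as_bytes())
}
```

The mode passed to `OpenOptions` applies only when the file is created. An existing file keeps whatever permissions it already has.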

What does not exist yet

Nimbus currently has no Prometheus endpoint and no OpenTelemetry exporter. There is no /metrics route, and no Nimbus crate depends on a metrics or tracing exporter (an OpenTelemetry dependency appears only transitively inside the forked JavaScript runtime’s telemetry crate; the server neither configures nor exposes it). The supported model today is:

scrape the /debug/* JSON endpoints with your own collector,
ship stdout (or the systemd journal) to your log pipeline, and
tail access.jsonl for security-relevant events.

If you need Prometheus-format metrics now, a small adapter that polls /debug/runtime/metrics and the per-tenant snapshot is the intended integration point, since both return stable JSON shapes.
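The rendering half of such an adapter might look like the sketch below. It assumes the adapter has already polled the JSON endpoints and flattened the response into samples; the `Sample` type, the `lane` label, and the metric name in the usage note are hypothetical, and the HTTP polling and JSON parsing are left out because they need crates outside the standard library.

```rust
/// One flattened counter taken from a /debug JSON response (hypothetical shape).
pub struct Sample<'a> {
    /// Must already be a valid Prometheus metric name.
    pub name: &'a str,
    pub lane: &'a str,
    /// `None` models a JSON null, for example before a deployment is active.
    pub value: Option<u64>,
}

/// Escape a label value per the Prometheus text exposition format.
fn escape_label(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Render samples as Prometheus text lines, skipping null values.
pub fn render_prometheus(samples: &[Sample<'_>]) -> String {
    let mut out = String::new();
    for sample in samples {
        if let Some(value) = sample.value {
            out.push_str(&format!(
                "{}{{lane=\"{}\"}} {}\n",
                sample.name,
                escape_label(sample.lane),
                value
            ));
        }
    }
    out
}
```

For example, a sample named `nimbus_runtime_invocations_total` with lane `default` and value 42 renders as `nimbus_runtime_invocations_total{lane="default"} 42`. Skipping nulls means a freshly started server exposes no series rather than misleading zeros.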

Where to go next

Operators’ observability guide — the hands-on walkthrough.
Server and transport — where the router and middleware live.
Auth and trust — the credential model behind the admin gate.
Tenancy — what a tenant is, and why diagnostics are tenant-scoped.