Telemetry

Culture ships with first-class OpenTelemetry support: traces for every IRC command and event, W3C trace context carried across federation via a new IRCv3 tag, and a local collector pattern that keeps Culture’s surface small.

This page covers the Foundation + Server Tracing release (culture 8.2.0), Federation Trace-Context Relay (culture 8.3.0), the Metrics Pillar (culture 8.4.0), the Audit JSONL Sink (culture 8.5.0), Harness-side OTEL (culture 8.6.0), and Bot Instrumentation (culture 8.7.0).

What you get in 8.2.0

A single PRIVMSG from a connected client produces a trace with these spans:

irc.command.PRIVMSG           (root, or child of client-supplied traceparent)
├── irc.privmsg.dispatch      (target + body attributes)
│   └── irc.privmsg.deliver.channel OR irc.privmsg.deliver.dm
│       └── irc.event.emit    (from IRCd.emit_event)
└── irc.client.process_buffer (wraps Message.parse + dispatch)

Every span is tagged with:

service.name=culture.agentirc (or your override)
service.instance.id=<server_name>

What you get in 8.3.0

Federation trace-context relay: a single trace_id now spans every hop of a federated message — client → originating server → S2S relay → receiving server → bot/skill — with each hop contributing its own span.

New spans added in 8.3.0:

irc.client.session — wraps Client.handle() for the connection lifetime. Attributes: irc.client.remote_addr, irc.client.nick (set after NICK).
irc.join, irc.part — wrap _handle_join / _handle_part. Attributes: irc.channel, irc.client.nick.
irc.s2s.session — wraps ServerLink.handle() for the link lifetime. Attributes: s2s.direction (inbound/outbound), s2s.peer (set once handshake completes).
irc.s2s.<VERB> — per-verb span on every inbound S2S message. Attributes: irc.command, culture.trace.origin=remote, culture.federation.peer=<peer>. On invalid traceparent: culture.trace.dropped_reason ∈ {malformed, too_long}.
irc.s2s.relay — wraps ServerLink.relay_event for outbound relay. Attributes: event.type, s2s.peer.

The irc.s2s.relay span is the per-hop re-sign anchor: every outbound federation line carries this span’s traceparent on the wire, never the inbound peer’s traceparent verbatim. This produces a parent-per-hop span tree mirroring the federation topology. See tracing.md for the wire-level example.

New public helpers in culture.telemetry:

context_from_traceparent(tp: str) -> Context — build an OTEL context from a W3C traceparent string. Caller MUST validate tp first (e.g. via extract_traceparent_from_tags).
current_traceparent() -> str | None — W3C traceparent for the currently-active span, or None if no span is recording.

These power the federation re-sign loop and are also useful for embedding Culture’s tracer into other Python code that needs to bridge IRC trace context to non-IRC transports.

What you get in 8.4.0

The metrics pillar lands: 15 server-side instruments registered once via init_metrics(config) (called from IRCd.__init__ next to init_telemetry). When telemetry.enabled: true and metrics_enabled: true, the SDK exports every metrics_export_interval_ms (default 10s) to your collector via OTLP/gRPC. Five categories:

Message flow:

culture.irc.bytes_sent — Counter, By. Labels: direction=c2s|s2c|s2s.
culture.irc.bytes_received — Counter, By. Labels: direction.
culture.irc.message.size — Histogram, By. Labels: verb, direction.
culture.privmsg.delivered — Counter. Labels: kind=channel|dm (channel-only carries channel=<name>).

Events:

culture.events.emitted — Counter. Labels: event.type, origin=local|federated.
culture.events.render.duration — Histogram, ms. Labels: event.type. Measures total time inside IRCd.emit_event (skill hooks + bot dispatch + surfacing).

Federation:

culture.s2s.messages — Counter (inbound only in 8.4.0). Labels: verb, direction=inbound, peer.
culture.s2s.relay_latency — Histogram, ms. Labels: event.type, peer.
culture.s2s.links_active — UpDownCounter. Labels: peer, direction=inbound|outbound.
culture.s2s.link_events — Counter. Labels: peer, event=connect|disconnect|auth_fail|backfill_start|backfill_complete.

Clients & sessions:

culture.clients.connected — UpDownCounter. Labels: kind=human (Plan 5/6 will refine to bot/harness).
culture.client.session.duration — Histogram, s. Labels: kind.
culture.client.command.duration — Histogram, ms. Labels: verb (uppercase).

Trace-context hygiene:

culture.trace.inbound — Counter. Labels: result=valid|missing|malformed|too_long, peer (empty for client-side dispatch). Closes Plan 2’s deferred metric.

When telemetry or metrics are disabled, the SDK is not installed and instruments are bound to OTEL’s proxy meter — call sites can instrument.add(...) / .record(...) unconditionally without guards.

init_metrics(config) returns a MetricsRegistry dataclass — every instrument above is a typed attribute on it (e.g. registry.irc_bytes_sent, registry.events_emitted, registry.s2s_links_active). The IRCd instance carries self.metrics: MetricsRegistry, so call sites use self.server.metrics.<instrument>. Future plans (audit / harness / bots) extend the same registry rather than spawning parallel ones.

Configuration

Telemetry is off by default. Enable it in ~/.culture/server.yaml:

telemetry:
  enabled: true
  service_name: culture.agentirc
  otlp_endpoint: http://localhost:4317
  otlp_protocol: grpc
  otlp_timeout_ms: 5000
  otlp_compression: gzip
  traces_enabled: true
  traces_sampler: parentbased_always_on
  metrics_enabled: true
  metrics_export_interval_ms: 10000
  audit_enabled: true
  audit_dir: ~/.culture/audit
  audit_max_file_bytes: 268435456
  audit_rotate_utc_midnight: true
  audit_queue_depth: 10000

enabled: false (default) → no SDK init, no export, no overhead. Traceparent tags on inbound messages are still parsed and validated (for the future mitigation metric), but no spans are created.
traces_sampler: parentbased_always_on → accept upstream sampling decisions via W3C traceparent flags; sample everything otherwise. Alternative: parentbased_traceidratio:0.1 for 10% sampling, or always_off to fully suppress.

Standard OpenTelemetry env vars override YAML: OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_SAMPLER.

Running a local collector

Install otelcol-contrib from https://github.com/open-telemetry/opentelemetry-collector-releases/releases. Start with the template at docs/agentirc/otelcol-template.yaml:

otelcol-contrib --config=docs/agentirc/otelcol-template.yaml

The template ships with a debug exporter — traces print to stdout. Swap in Tempo, Loki, Grafana Cloud, Honeycomb, or any OTLP-compatible backend by editing the exporters: section.

Trace context over IRC

When telemetry is enabled and a span is active, outbound client messages carry two IRCv3 tags:

culture.dev/traceparent — W3C traceparent header value.
culture.dev/tracestate — W3C tracestate (optional).

Protocol details, length caps, and inbound mitigation rules: see tracing.md (lives under culture/ in the repo; Jekyll excludes that path from the published site).

What you get in 8.5.0

The audit JSONL sink lands: every event flowing through IRCd.emit_event (PRIVMSG, JOIN, PART, ROOMCREATE, …) is appended as a single JSON object to ~/.culture/audit/<server>-<YYYY-MM-DD>.jsonl. Each line carries the full event payload, the active OTEL trace_id / span_id for cross-pillar joins, the actor (nick + kind + remote_addr), and the target (channel or DM). PARSE_ERROR lines from Client._process_buffer are also captured.

Highlights:

Always on by default. audit_enabled: true is independent of telemetry.enabled — even with OTEL fully off, the JSONL still writes. Set audit_enabled: false to disable.
Bounded async queue + dedicated writer task. audit_queue_depth: 10000 (configurable). On overflow, the record is dropped and culture.audit.writes{outcome=error} increments — dropping is preferable to blocking the event loop.
Daily rotation on UTC midnight + size cap at 256 MiB. Same-day size rotations get .1, .2, … suffixes.
0600 file mode, 0700 directory mode — admin-only by construction.
Stable schema. See audit.md for the record format and additive-only compat policy.

New audit metrics (extend the Plan 3 MetricsRegistry):

culture.audit.writes — Counter. Labels: outcome=ok|error. Increments on every record write attempt.
culture.audit.queue_depth — UpDownCounter. Currently-queued records waiting to flush.

New operator guide: docs/agentirc/audit.md — where files live, how to inspect with jq, how to disable, manual pruning recipe.

What you get in 8.6.0

Harness-side OTEL lands: every agent backend — claude, codex, copilot, and acp — now emits three spans and four LLM-focused metrics alongside the server-side culture.* instruments.

New spans per harness process:

harness.irc.connect — wraps the TCP connect to the IRC server.
harness.irc.message.handle — wraps each inbound message. If the message carries a valid culture.dev/traceparent IRCv3 tag, this span becomes a child of the server’s irc.event.emit span — closing the cross-process gap.
harness.llm.call — wraps the backend LLM call. Attributes: harness.backend, harness.model, outcome (success/error/timeout).

New metrics (per-backend HarnessMetricsRegistry, independent of the server’s MetricsRegistry):

culture.harness.llm.tokens.input — Counter, labels: backend, model, harness.nick.
culture.harness.llm.tokens.output — Counter, same labels.
culture.harness.llm.call.duration — Histogram (ms), labels: backend, model, outcome.
culture.harness.llm.calls — Counter, labels: backend, model, outcome.

Token-usage caveats: codex (#298) and copilot (#299) do not currently expose token counts, so culture.harness.llm.tokens.input/output stay at zero for those backends. Duration and call-count metrics work for all four.

All four backends are instrumented identically. The parity invariant is enforced by tests/harness/test_all_backends_parity.py.

For full configuration details, the per-backend service.name table, the end-to-end test recipe, and a list of what’s deferred, see the operator guide at docs/agentirc/harness-telemetry.html.

What you get in 8.7.0

Bot instrumentation lands. Event-triggered dispatch and webhook-triggered dispatch each get their own span tree, both stitched into the same trace as the upstream client/server activity that triggered them.

Event-trigger path:

irc.command.PRIVMSG (or any event-emitting verb)
└── irc.event.emit
    └── bot.event.dispatch  (one per matched bot)
        └── bot.run         (Bot.handle body)
            └── irc.privmsg.deliver.channel | .dm

Webhook path:

HTTP POST /<bot_name>      (auto-instrumented inbound span)
└── bot.run                (Bot.handle body)
    └── irc.privmsg.deliver.*

New spans:

bot.event.dispatch — one per matched bot inside BotManager.on_event. Attributes: bot.name, event.type. Status ERROR if Bot.handle raises.
bot.run — wraps Bot.handle for both event and webhook paths. Attributes: bot.name, optional bot.run.empty_message=True if the rendered message is empty.

The webhook HTTP server is auto-instrumented via opentelemetry-instrumentation-aiohttp-server. Each request produces an inbound HTTP server span (verb + path + status code) that becomes the parent of bot.run. If the caller sends a traceparent header, the request joins their existing trace.

New metrics:

culture.bot.invocations — Counter. Labels: bot, event.type, outcome=success|error. Increments only after the bot has matched the filter and been started — filter rejections and startup failures are logged but not counted.
culture.bot.webhook.duration — Histogram, s. Labels: bot, status_class=2xx|3xx|4xx|5xx. Recorded by a per-request middleware on the webhook listener. bot=_unrouted is used for /health and other non-bot paths so the histogram never silently mis-attributes.

Bots add no new wire protocol surface. Trace context flows in via the existing IRCv3 culture.dev/traceparent tag for events and via the standard W3C traceparent HTTP header for webhooks.

What’s not in 8.7.0

The design spec at docs/superpowers/specs/2026-04-24-otel-observability-design.md covers the full three-pillar scope. These pieces remain deferred:

Outbound webhook delivery instrumentation (opentelemetry-instrumentation-aiohttp-client) — bots currently make no outbound HTTP calls; will be added when that feature lands.
Outbound culture.s2s.messages (records inbound only — outbound needs a clean verb-extraction site).
OTEL Logs export of audit records (best-effort duplicate; JSONL stays source of truth either way).
audit-prune CLI for retention; operators prune manually in v1.
Federated lifecycle audit (JOIN/PART/QUIT on the receiver side) — gap tracked in #296. Federated message events DO produce audit records correctly.

Each will get an entry under “What you get in <version>” as it lands.