Muthur

AI-powered Kubernetes monitoring. muthur receives alerts from any number of Kubernetes clusters, enriches them with logs and metrics, evaluates them with Claude to produce a structured root-cause analysis, and delivers rich notifications to Discord, Telegram, Slack, PagerDuty, or generic webhooks.

The name comes from MU/TH/UR 6000, the ship's computer from Alien.

  • muthur — the central AI brain. One instance per home cluster.
  • muthur-collector — a lightweight per-cluster agent that receives AlertManager webhooks, enriches them, and forwards protobuf payloads to muthur.

Both components are written in Go and distributed as Helm charts.

Motivation

Default Prometheus AlertManager integrations deliver alerts as terse labels with no context:

[FIRING] KubePodCrashLooping
pod: api-server-6b9c7d5f4-xk2qp
namespace: production

You get an alert name and a pod name. That's it. To actually understand what broke you open Grafana, find the right dashboard, pull up the pod's logs in Loki, try to correlate a metric spike with a log line, and manually write down what you think the root cause is. Every alert costs 5–10 minutes of context switching before you even start fixing anything.

muthur automates that first step. Every alert arrives in Discord (or wherever) already enriched with:

  • Root cause — one-sentence summary derived from logs and metrics
  • Evidence — specific log lines or metric trends supporting the conclusion
  • Recommended action — a starting point for remediation
  • A clickable Grafana deep link pre-filtered to the alert's namespace and pod
  • A severity-coloured embed so critical stands out at a glance

The operator's first minute on an alert goes from "open three tabs" to "read the message". That's the entire value proposition.

Architecture

muthur is designed to be multi-tenant across clusters. One central instance receives forwarded alerts from any number of per-cluster collectors.

graph LR
    subgraph CLUSTER_A[Remote cluster A]
        AMa[AlertManager] -->|webhook| COLa[muthur-collector]
        COLa -.->|query| LOKIa[Loki]
        COLa -.->|query| PROMa[Prometheus]
        COLa -.->|lookup| K8Sa[K8s API]
    end

    subgraph CLUSTER_B[Remote cluster B]
        AMb[AlertManager] -->|webhook| COLb[muthur-collector]
        COLb -.->|query| LOKIb[Loki]
    end

    subgraph HOME[Home cluster]
        MUTHUR[muthur]
        COLh[muthur-collector] -->|in-cluster| MUTHUR
        AMh[AlertManager] -->|webhook| COLh
    end

    COLa ==>|protobuf over HTTPS| MUTHUR
    COLb ==>|protobuf over HTTPS| MUTHUR

    MUTHUR -->|evaluate| CLAUDE[Anthropic Claude API]
    MUTHUR -->|silence| AMh
    MUTHUR -->|notify| DISCORD[Discord]
    MUTHUR -->|notify| SLACK[Slack]
    MUTHUR -->|notify| PD[PagerDuty]

Collectors in remote clusters reach muthur over its public ingress. The collector that runs in the same cluster as muthur itself skips the ingress entirely and uses the internal service URL — it's faster and sidesteps any edge layer (Cloudflare, WAF, corporate proxies).

Every collector ships its own cluster_id and a pre-shared token. muthur validates both on ingest. A compromised collector can only spoof alerts from its own cluster.
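That ingest check can be sketched as follows, assuming a static cluster-to-token map; the names `tokens` and `validate` are illustrative, not muthur's actual code:

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// tokens maps cluster_id -> expected pre-shared token.
var tokens = map[string]string{
	"cluster-prod": "s3cret-prod",
}

// validate checks that the presented token matches the one registered
// for the payload's cluster_id, comparing in constant time so timing
// differences don't leak information about the expected value.
func validate(clusterID, presented string) bool {
	expected, ok := tokens[clusterID]
	if !ok {
		return false // unknown cluster: reject outright
	}
	return subtle.ConstantTimeCompare([]byte(expected), []byte(presented)) == 1
}

func main() {
	fmt.Println(validate("cluster-prod", "s3cret-prod")) // matching token
	fmt.Println(validate("cluster-prod", "wrong"))       // wrong token
	fmt.Println(validate("cluster-dev", "s3cret-prod"))  // unknown cluster
}
```

Because the token is looked up by the payload's `cluster_id`, a stolen token only ever authorises payloads claiming that one cluster, which is exactly the blast-radius property described above.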

Alert lifecycle

A firing alert traverses the following sequence. Both sides of the hop between collector and muthur are asynchronous — handlers return immediately and do the heavy lifting in goroutines, so AlertManager's 10 s webhook timeout is never a concern regardless of how slow Claude or downstream notifiers happen to be.

sequenceDiagram
    participant P as Prometheus
    participant AM as AlertManager
    participant C as muthur-collector
    participant M as muthur
    participant CL as Claude
    participant D as Discord

    P->>AM: rule fires
    AM->>C: POST /webhook
    C-->>AM: 200 OK (immediate)

    Note over C: goroutine
    C->>C: resolve target (k8s API)
    C->>C: fetch logs (Loki)
    C->>C: fetch metrics (Prometheus)
    C->>C: redact PII

    C->>M: POST /ingest (protobuf)
    M-->>C: 202 Accepted (immediate)

    Note over M: goroutine
    M->>M: dedup check
    M->>CL: evaluate alert
    CL-->>M: { root_cause, evidence, action }
    M->>M: route by cluster_id / severity
    M->>D: webhook (rich embed)

When the alert clears, AlertManager sends a resolved webhook which follows the same path but skips Claude evaluation and dedup entirely — muthur emits a short "alert cleared" notification in green through the same receivers, closing the loop visually for the operator.

muthur (central brain)

The central component is responsible for:

  • Authentication — per-cluster tokens validated against cluster_id
  • Deduplication — SHA256-keyed sliding window with configurable TTL, so the same alert repeating every 30 seconds doesn't produce 100 notifications
  • Evaluation — structured JSON output from Claude: severity, root cause, evidence, recommended action, optional silence request
  • Routing — AlertManager-style first-match rules by severity, cluster_id, alert_name, namespace
  • Notification delivery — one goroutine per receiver per alert, failures logged but never block other deliveries
  • AlertManager silence integration — when Claude flags an alert as known transient noise, muthur can optionally POST a silence back to AlertManager to stop the retriggering

Receivers are defined AlertManager-style: named instances with per-instance config, referenced from routing rules. Multiple receivers of the same type are allowed — you can have one Discord webhook for ops, another for audit, and a third for dev, and route each alert to exactly the channels that matter. Sensitive values (webhook URLs, API tokens) are mounted as files rather than injected as env vars, so they never show up in /proc/<pid>/environ or in the environment block of a crash dump.

muthur-collector (per-cluster agent)

The collector runs in every monitored cluster. Its job is to resolve, enrich, redact, and forward — no AI calls, no business logic.

Target resolution. From AlertManager labels, the collector figures out what the alert is actually about using eight strategies in order: pod label, deployment label, statefulset, daemonset, node, persistentvolumeclaim, namespace, or unknown. For deployments and statefulsets it resolves all pods via the Kubernetes API label selector and fetches aggregate metrics.
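The strategy chain amounts to a first-match walk over the alert's labels. The sketch below uses the conventional AlertManager label names; the exact labels and helper names are assumptions, not lifted from the collector's source:

```go
package main

import "fmt"

// resolveTarget walks the label strategies in the priority order
// described above and returns the first kind that matches.
func resolveTarget(labels map[string]string) (kind, name string) {
	strategies := []struct{ kind, label string }{
		{"pod", "pod"},
		{"deployment", "deployment"},
		{"statefulset", "statefulset"},
		{"daemonset", "daemonset"},
		{"node", "node"},
		{"persistentvolumeclaim", "persistentvolumeclaim"},
		{"namespace", "namespace"},
	}
	for _, s := range strategies {
		if v, ok := labels[s.label]; ok && v != "" {
			return s.kind, v
		}
	}
	return "unknown", "" // eighth strategy: nothing recognisable
}

func main() {
	kind, name := resolveTarget(map[string]string{
		"namespace": "production",
		"pod":       "api-server-6b9c7d5f4-xk2qp",
	})
	fmt.Println(kind, name) // pod outranks namespace in the chain
}
```

Ordering matters: a pod label almost always coexists with a namespace label, so the more specific kinds must be tried first or every alert would degrade to namespace-level enrichment.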

Log and metric enrichment. For resolved pods the collector queries Loki for recent log lines and Prometheus for a time series around the alert timestamp. Both enrichment sources are optional and non-fatal: a failed fetch logs a warning and the pipeline continues. Either can be disabled entirely via LOKI_ENABLED=false / PROMETHEUS_ENABLED=false for clusters that lack one of them.

PII redaction. 20 built-in regex patterns scrub logs before forwarding:

| Category    | Patterns |
| ----------- | -------- |
| PII         | email, phone, SSN, street address |
| Network     | IPv4, IPv6 |
| Credentials | Bearer tokens, JWT, AWS access/secret keys, generic API keys, PEM private keys, password= fields |
| Financial   | credit card (Visa, Mastercard, Amex, Discover), IBAN |
| Identifiers | UUID |

Redaction stats (total lines, redacted lines, total replacements) travel with the payload so the operator can see how much content was touched without the content itself being exposed. Claude is explicitly instructed never to reconstruct original values from redacted placeholders.
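Two of the categories sketched as Go regexes, with the replacement count feeding the stats; the real pattern set is larger and more careful, and the placeholder strings are illustrative:

```go
package main

import (
	"fmt"
	"regexp"
)

// redactors pairs each pattern with its placeholder. Only the email
// and IPv4 categories are shown here.
var redactors = []struct {
	re          *regexp.Regexp
	placeholder string
}{
	{regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`), "[EMAIL]"},
	{regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`), "[IPV4]"},
}

// redactLine scrubs one log line and reports how many replacements
// were made, so the stats can travel without the original content.
func redactLine(line string) (string, int) {
	total := 0
	for _, r := range redactors {
		total += len(r.re.FindAllString(line, -1))
		line = r.re.ReplaceAllString(line, r.placeholder)
	}
	return line, total
}

func main() {
	out, n := redactLine("login failed for alice@example.com from 10.0.3.7")
	fmt.Println(out, n)
}
```

Counting matches before replacing keeps the stats exact even when one line triggers several categories.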

Notification formats

Each receiver channel uses its native rich format. Severity maps to colour consistently across all channels.

| Severity | Colour | Meaning |
| -------- | ------ | ------- |
| critical | red    | production-impacting, page someone |
| warning  | orange | degraded but not down |
| info     | blue   | notable event |
| resolved | green  | alert cleared, no action needed |

| Channel   | Format |
| --------- | ------ |
| Discord   | Embed with coloured left bar, title as clickable Grafana deep link, fields for cluster / severity / namespace / target, root cause in description, evidence and action as separate fields, timestamp in footer |
| Telegram  | HTML parse mode with bold headings, `<code>` for values, inline Grafana link, link previews disabled |
| Slack     | Block Kit attachment with coloured bar, header block, section with fields, section with mrkdwn analysis, context block with timestamp |
| PagerDuty | Events API v2 — trigger events with full analysis in custom_details, component = pod name, automatic resolve events reusing dedup_key when the alert clears |
| Webhook   | Structured JSON payload with status, severity, cluster / alert / namespace / target, full analysis object, labels map — designed for integration with arbitrary downstream systems |

Security model

  • Per-cluster tokens. Each collector has its own token. muthur validates that X-Collector-Token matches the expected value for the payload's cluster_id. A compromised cluster cannot forge alerts from another.
  • File-mounted secrets. All sensitive values — Anthropic API key, collector tokens, notifier credentials — are mounted into the pod as files under /secrets/. Never passed as environment variables.
  • PII redaction before forwarding. The collector redacts logs before they leave the cluster boundary, so even if a forwarded payload is intercepted there's nothing sensitive in it.
  • Read-only root filesystem, non-root user, dropped capabilities on both components' pods.

Deployment

Both components are distributed as Helm charts from https://vojtechpastyrik.github.io/charts.

Minimal collector values for a cluster with Loki and Prometheus available:

config:
  clusterId: cluster-prod
  lokiEnabled: true
  lokiUrl: http://loki-gateway.logging
  prometheusUrl: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
  grafanaBaseUrl: https://grafana.example.com

externalSecrets:
  enabled: true
  remoteSecretPath: muthur-collector

Minimal muthur values with one Discord receiver and a catch-all route:

ingress:
  enabled: true
  host: muthur.example.com

receivers:
  - name: ops-discord
    type: discord
    secretKeys:
      webhook_url: ops-discord-webhook

routing:
  rules:
    - name: all
      match: {}
      receivers: [ops-discord]

collectors:
  - clusterId: cluster-prod
    tokenSecretKey: collector-token-cluster-prod

externalSecrets:
  enabled: true
  remoteSecretPath: muthur
  collectorTokenKeys:
    - collector-token-cluster-prod
  receiverSecretKeys:
    - ops-discord-webhook

Full configuration reference (all values, all env vars, all receiver types) lives in the chart READMEs in the repositories — this page is an architectural overview, not a reference manual.

Adding a new cluster

  1. Deploy the collector with its own clusterId and grafanaBaseUrl.
  2. Create a secret store item with central-agent-url and central-agent-token fields.
  3. On muthur, add the cluster to collectors and its token key to externalSecrets.collectorTokenKeys.
  4. Add a matching field to muthur's secret store item.
  5. Optionally add a routing rule for the new cluster_id.

No code changes, no image rebuild. Everything is chart values and secrets.

Repositories

Both charts are released on tag and published to vojtechpastyrik.github.io/charts.