Muthur¶
AI-powered Kubernetes monitoring. muthur receives alerts from any number of Kubernetes clusters, enriches them with logs and metrics, evaluates them with Claude to produce a structured root-cause analysis, and delivers rich notifications to Discord, Telegram, Slack, PagerDuty, or generic webhooks.
The name comes from MU/TH/UR 6000, the ship's computer from Alien.
- muthur — the central AI brain. One instance per home cluster.
- muthur-collector — a lightweight per-cluster agent that receives AlertManager webhooks, enriches them, and forwards protobuf payloads to muthur.
Both components are written in Go and distributed as Helm charts.
Motivation¶
Default Prometheus AlertManager integrations deliver alerts as terse labels with no context: an alert name and a pod name. That's it. To actually understand what broke you open Grafana, find the right dashboard, pull up the pod's logs in Loki, try to correlate a metric spike with a log line, and manually write down what you think the root cause is. Every alert costs 5–10 minutes of context switching before you even start fixing anything.
muthur automates that first step. Every alert arrives in Discord (or wherever) already enriched with:
- Root cause — one-sentence summary derived from logs and metrics
- Evidence — specific log lines or metric trends supporting the conclusion
- Recommended action — a starting point for remediation
- A clickable Grafana deep link pre-filtered to the alert's namespace and pod
- A severity-coloured embed so critical stands out at a glance
The operator's first minute on an alert goes from "open three tabs" to "read the message". That's the entire value proposition.
Architecture¶
muthur is designed to be multi-tenant across clusters. One central instance receives forwarded alerts from any number of per-cluster collectors.
```mermaid
graph LR
  subgraph CLUSTER_A[Remote cluster A]
    AMa[AlertManager] -->|webhook| COLa[muthur-collector]
    COLa -.->|query| LOKIa[Loki]
    COLa -.->|query| PROMa[Prometheus]
    COLa -.->|lookup| K8Sa[K8s API]
  end
  subgraph CLUSTER_B[Remote cluster B]
    AMb[AlertManager] -->|webhook| COLb[muthur-collector]
    COLb -.->|query| LOKIb[Loki]
  end
  subgraph HOME[Home cluster]
    MUTHUR[muthur]
    COLh[muthur-collector] -->|in-cluster| MUTHUR
    AMh[AlertManager] -->|webhook| COLh
  end
  COLa ==>|protobuf over HTTPS| MUTHUR
  COLb ==>|protobuf over HTTPS| MUTHUR
  MUTHUR -->|evaluate| CLAUDE[Anthropic Claude API]
  MUTHUR -->|silence| AMh
  MUTHUR -->|notify| DISCORD[Discord]
  MUTHUR -->|notify| SLACK[Slack]
  MUTHUR -->|notify| PD[PagerDuty]
```
Collectors in remote clusters reach muthur over its public ingress. The collector that runs in the same cluster as muthur itself skips the ingress entirely and uses the internal service URL — it's faster and sidesteps any edge layer (Cloudflare, WAF, corporate proxies).
Every collector ships its own cluster_id and a pre-shared token. muthur
validates both on ingest. A compromised collector can only spoof alerts
from its own cluster.
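That ingest check can be sketched in a few lines of Go. The function name, token map, and header handling below are illustrative, not muthur's actual code; in the real deployment the tokens come from file-mounted secrets rather than a hardcoded map.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// expectedTokens maps cluster_id -> pre-shared token. Hardcoded here for
// the sketch; muthur loads these from file-mounted secrets.
var expectedTokens = map[string]string{
	"cluster-prod": "s3cret-prod",
}

// authenticate accepts a payload only when the presented token matches the
// one registered for the claimed cluster_id, so a leaked token from one
// cluster cannot be used to spoof alerts from another.
func authenticate(clusterID, token string) bool {
	want, ok := expectedTokens[clusterID]
	if !ok {
		return false // unknown cluster: reject outright
	}
	// constant-time comparison avoids leaking token bytes via timing
	return subtle.ConstantTimeCompare([]byte(want), []byte(token)) == 1
}

func main() {
	fmt.Println(authenticate("cluster-prod", "s3cret-prod")) // true
	fmt.Println(authenticate("cluster-prod", "wrong"))       // false
	fmt.Println(authenticate("cluster-dev", "s3cret-prod"))  // false: token valid, wrong cluster
}
```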
Alert lifecycle¶
A firing alert traverses the following sequence. Both sides of the hop between collector and muthur are asynchronous — handlers return immediately and do the heavy lifting in goroutines, so AlertManager's 10 s webhook timeout is never a concern regardless of how slow Claude or downstream notifiers happen to be.
```mermaid
sequenceDiagram
  participant P as Prometheus
  participant AM as AlertManager
  participant C as muthur-collector
  participant M as muthur
  participant CL as Claude
  participant D as Discord
  P->>AM: rule fires
  AM->>C: POST /webhook
  C-->>AM: 200 OK (immediate)
  Note over C: goroutine
  C->>C: resolve target (k8s API)
  C->>C: fetch logs (Loki)
  C->>C: fetch metrics (Prometheus)
  C->>C: redact PII
  C->>M: POST /ingest (protobuf)
  M-->>C: 202 Accepted (immediate)
  Note over M: goroutine
  M->>M: dedup check
  M->>CL: evaluate alert
  CL-->>M: { root_cause, evidence, action }
  M->>M: route by cluster_id / severity
  M->>D: webhook (rich embed)
```
When the alert clears, AlertManager sends a resolved webhook which follows
the same path but skips Claude evaluation and dedup entirely — muthur emits
a short "alert cleared" notification in green through the same receivers,
closing the loop visually for the operator.
muthur (central brain)¶
The central component is responsible for:
- Authentication — per-cluster tokens validated against cluster_id
- Deduplication — SHA256-keyed sliding window with configurable TTL, so the same alert repeating every 30 seconds doesn't produce 100 notifications
- Evaluation — structured JSON output from Claude: severity, root cause, evidence, recommended action, optional silence request
- Routing — AlertManager-style first-match rules by severity, cluster_id, alert_name, namespace
- Notification delivery — one goroutine per receiver per alert; failures are logged but never block other deliveries
- AlertManager silence integration — when Claude flags an alert as known transient noise, muthur can optionally POST a silence back to AlertManager to stop the retriggering
Receivers are defined AlertManager-style: named instances with per-instance
config, referenced from routing rules. Multiple receivers of the same type
are allowed — you can have one Discord webhook for ops, another for audit,
and a third for dev, and route each alert to exactly the channels that
matter. Sensitive values (webhook URLs, API tokens) are mounted as files
rather than injected as env vars, so they never surface in
/proc/<pid>/environ or in a dump of the process environment.
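First-match routing over those four keys reduces to a short loop. The rule type below is a sketch — field names mirror the match keys described above, but the real configuration schema lives in the chart values.

```go
package main

import "fmt"

// rule is an AlertManager-style routing rule: empty match fields act as
// wildcards, and evaluation is strictly first-match-wins.
type rule struct {
	name      string
	severity  string
	clusterID string
	alertName string
	namespace string
	receivers []string
}

// route returns the receivers of the first rule matching the alert,
// or nil when no rule matches.
func route(rules []rule, severity, clusterID, alertName, namespace string) []string {
	match := func(want, got string) bool { return want == "" || want == got }
	for _, r := range rules {
		if match(r.severity, severity) && match(r.clusterID, clusterID) &&
			match(r.alertName, alertName) && match(r.namespace, namespace) {
			return r.receivers // first match wins; later rules are ignored
		}
	}
	return nil
}

func main() {
	rules := []rule{
		{name: "crit-prod", severity: "critical", clusterID: "cluster-prod",
			receivers: []string{"pagerduty", "ops-discord"}},
		{name: "all", receivers: []string{"ops-discord"}}, // catch-all
	}
	fmt.Println(route(rules, "critical", "cluster-prod", "PodCrashLooping", "web")) // [pagerduty ops-discord]
	fmt.Println(route(rules, "warning", "cluster-dev", "HighLatency", "api"))       // [ops-discord]
}
```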
muthur-collector (per-cluster agent)¶
The collector runs in every monitored cluster. Its job is to resolve, enrich, redact, and forward — no AI calls, no business logic.
Target resolution. From AlertManager labels, the collector figures out what the alert is actually about using eight strategies in order: pod label, deployment label, statefulset, daemonset, node, persistentvolumeclaim, namespace, or unknown. For deployments and statefulsets it resolves all pods via the Kubernetes API label selector and fetches aggregate metrics.
Log and metric enrichment. For resolved pods the collector queries Loki
for recent log lines and Prometheus for a time series around the alert
timestamp. Both enrichment sources are optional and non-fatal: a failed
fetch logs a warning and the pipeline continues. Either can be disabled
entirely via LOKI_ENABLED=false / PROMETHEUS_ENABLED=false for clusters
that lack one of them.
PII redaction. 20 built-in regex patterns scrub logs before forwarding:
| Category | Patterns |
|---|---|
| PII | email, phone, SSN, street address |
| Network | IPv4, IPv6 |
| Credentials | Bearer tokens, JWT, AWS access/secret keys, generic API keys, PEM private keys, password= fields |
| Financial | credit card (Visa, Mastercard, Amex, Discover), IBAN |
| Identifiers | UUID |
Redaction stats (total lines, redacted lines, total replacements) travel with the payload so the operator can see how much content was touched without the content itself being exposed. Claude is explicitly instructed never to reconstruct original values from redacted placeholders.
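A reduced sketch of pattern-based redaction with stats, using two simplified stand-in regexes in place of the real twenty:

```go
package main

import (
	"fmt"
	"regexp"
)

// Two illustrative categories; the production set covers 20 patterns and
// uses more careful regexes than these.
var patterns = map[string]*regexp.Regexp{
	"EMAIL": regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`),
	"IPV4":  regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`),
}

// stats travels with the payload so the operator sees how much was touched.
type stats struct {
	total, redacted, replacements int
}

// redact replaces every match with a category placeholder and counts
// total lines, lines touched, and individual replacements.
func redact(lines []string) ([]string, stats) {
	var st stats
	out := make([]string, len(lines))
	for i, line := range lines {
		st.total++
		touched := false
		for name, re := range patterns {
			line = re.ReplaceAllStringFunc(line, func(string) string {
				st.replacements++
				touched = true
				return "[REDACTED_" + name + "]"
			})
		}
		if touched {
			st.redacted++
		}
		out[i] = line
	}
	return out, st
}

func main() {
	out, st := redact([]string{
		"login from alice@example.com at 10.0.0.7",
		"readiness probe ok",
	})
	fmt.Println(out[0]) // login from [REDACTED_EMAIL] at [REDACTED_IPV4]
	fmt.Printf("%d lines, %d redacted, %d replacements\n", st.total, st.redacted, st.replacements)
}
```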
Notification formats¶
Each receiver channel uses its native rich format. Severity maps to colour consistently across all channels.
| Severity | Colour | Meaning |
|---|---|---|
| critical | red | production-impacting, page someone |
| warning | orange | degraded but not down |
| info | blue | notable event |
| resolved | green | alert cleared, no action needed |
| Channel | Format |
|---|---|
| Discord | Embed with coloured left bar, title as clickable Grafana deep link, fields for cluster / severity / namespace / target, root cause in description, evidence and action as separate fields, timestamp in footer |
| Telegram | HTML parse mode with bold headings, <code> for values, inline Grafana link, link previews disabled |
| Slack | Block Kit attachment with coloured bar, header block, section with fields, section with mrkdwn analysis, context block with timestamp |
| PagerDuty | Events API v2 — trigger events with full analysis in custom_details, component = pod name, automatic resolve events reusing dedup_key when the alert clears |
| Webhook | Structured JSON payload with status, severity, cluster / alert / namespace / target, full analysis object, labels map — designed for integration with arbitrary downstream systems |
Security model¶
- Per-cluster tokens. Each collector has its own token. muthur validates that X-Collector-Token matches the expected value for the payload's cluster_id. A compromised cluster cannot forge alerts from another.
- File-mounted secrets. All sensitive values — Anthropic API key, collector tokens, notifier credentials — are mounted into the pod as files under /secrets/, never passed as environment variables.
- PII redaction before forwarding. The collector redacts logs before they leave the cluster boundary, so even if a forwarded payload is intercepted there's nothing sensitive in it.
- Read-only root filesystem, non-root user, dropped capabilities on both components' pods.
Deployment¶
Both components are distributed as Helm charts from
https://vojtechpastyrik.github.io/charts.
Minimal collector values for a cluster with Loki and Prometheus available:
```yaml
config:
  clusterId: cluster-prod
  lokiEnabled: true
  lokiUrl: http://loki-gateway.logging
  prometheusUrl: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
  grafanaBaseUrl: https://grafana.example.com
externalSecrets:
  enabled: true
  remoteSecretPath: muthur-collector
```
Minimal muthur values with one Discord receiver and a catch-all route:
```yaml
ingress:
  enabled: true
  host: muthur.example.com
receivers:
  - name: ops-discord
    type: discord
    secretKeys:
      webhook_url: ops-discord-webhook
routing:
  rules:
    - name: all
      match: {}
      receivers: [ops-discord]
collectors:
  - clusterId: cluster-prod
    tokenSecretKey: collector-token-cluster-prod
externalSecrets:
  enabled: true
  remoteSecretPath: muthur
  collectorTokenKeys:
    - collector-token-cluster-prod
  receiverSecretKeys:
    - ops-discord-webhook
```
Full configuration reference (all values, all env vars, all receiver types) lives in the chart READMEs in the repositories — this page is an architectural overview, not a reference manual.
Adding a new cluster¶
- Deploy the collector with its own clusterId and grafanaBaseUrl.
- Create a secret store item with central-agent-url and central-agent-token fields.
- On muthur, add the cluster to collectors and its token key to externalSecrets.collectorTokenKeys.
- Add a matching field to muthur's secret store item.
- Optionally add a routing rule for the new cluster_id.
No code changes, no image rebuild. Everything is chart values and secrets.
Repositories¶
- github.com/VojtechPastyrik/muthur — central brain
- github.com/VojtechPastyrik/muthur-collector — per-cluster agent
Both charts are released on tag and published to vojtechpastyrik.github.io/charts.