Coroot: Architecture
Detailed component breakdown, deployment topologies, and data flow diagrams for Coroot.
Component Architecture
flowchart TB
subgraph K8s["Kubernetes Cluster"]
subgraph Agents["Data Plane (DaemonSet + Deployment)"]
NA["coroot-node-agent\n(eBPF DaemonSet)\nPer node"]
CA["coroot-cluster-agent\n(Deployment)\nDatabase discovery"]
end
subgraph Control["Control Plane"]
OP["Coroot Operator\n(Lifecycle management)"]
CR["Coroot CR\n(Custom Resource)"]
end
subgraph Server["Coroot Server (StatefulSet)"]
direction TB
InspEng["Inspection Engine\n18+ auto-inspections"]
AIRCA["AI RCA Engine\n(pattern detection)"]
SvcMap["Service Map Builder\n(eBPF topology)"]
SLOEng["SLO Engine\n(error budget tracking)"]
APIGW["API / Web UI\n(port 8080)"]
end
subgraph Storage["Storage (StatefulSet / External)"]
Prom["Prometheus / VM / Mimir\n(metrics)"]
CH["ClickHouse\n(logs, traces, profiles)"]
end
end
subgraph External["Optional"]
OTEL["OTel SDK\n(app-level traces)"]
LLM["LLM API\n(Enterprise AI RCA)"]
end
NA -->|"metrics, traces,\nlogs, profiles"| Server
CA -->|"DB metrics\n(pg_stat, INFO)"| Server
OTEL -->|OTLP| Server
Server -->|"remote_write /\nPromQL"| Prom
Server -->|"clickhouse-native"| CH
AIRCA -.->|"API"| LLM
OP -->|"reconcile"| CR
CR -->|"manages"| Agents
CR -->|"manages"| Server
CR -->|"manages"| Storage
style Server fill:#1565c0,color:#fff
style Agents fill:#2e7d32,color:#fff
style Storage fill:#e65100,color:#fff
18 Built-In Inspections
Coroot continuously runs 18 automated inspections against every discovered service. The table groups them by category and lists what each category checks:
| Category | Inspections |
|---|---|
| SLOs | Availability SLO, Latency SLO |
| Instances | Pod restarts, unavailable replicas |
| CPU | CPU throttling, CPU usage near limits |
| GPU | GPU utilization, memory usage |
| Memory | OOM kills, memory near limits |
| Storage | Disk usage, I/O latency |
| Network | Connection errors, DNS failures, TCP retransmits |
| Logs | Error log rate spikes, warning patterns |
| Runtime | JVM heap/GC, .NET GC, Python GIL contention |
| Databases | Postgres, MySQL, MongoDB, Redis, Memcached health |
| Deployments | Rollout tracking, canary detection |
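To make the SLO row concrete, here is a minimal Python sketch of what an availability inspection with error-budget tracking might compute for a single evaluation window. It is an illustration only: the function name `check_availability_slo`, the `SloStatus` type, the 99.9% objective, and the request counts are all hypothetical and are not taken from Coroot's implementation.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    availability: float     # fraction of successful requests in the window
    budget_consumed: float  # share of the error budget used (1.0 = exhausted)
    breached: bool          # True when availability is below the objective


def check_availability_slo(total: int, failed: int, objective: float = 0.999) -> SloStatus:
    """Evaluate an availability SLO over one window (hypothetical helper)."""
    if total <= 0:
        # No traffic in the window: nothing violated, no budget consumed.
        return SloStatus(availability=1.0, budget_consumed=0.0, breached=False)

    availability = (total - failed) / total
    # The error budget expressed as a number of requests allowed to fail.
    allowed_failures = total * (1.0 - objective)
    if allowed_failures > 0:
        budget_consumed = failed / allowed_failures
    else:
        # A 100% objective leaves no budget: any failure exhausts it.
        budget_consumed = float("inf") if failed else 0.0
    return SloStatus(availability, budget_consumed, availability < objective)


# Assumed sample window: 100,000 requests, 250 of them failed.
status = check_availability_slo(total=100_000, failed=250)
print(status.breached, round(status.budget_consumed, 2))
```

With these assumed numbers the window allows roughly 100 failed requests, so 250 failures consume about 2.5 times the budget and the objective is breached. A real engine would evaluate this over rolling windows rather than a single one.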
Deployment Topologies
Single-Cluster (Standard)
flowchart LR
subgraph Cluster["K8s Cluster"]
NA1["node-agent<br/>(node 1)"]
NA2["node-agent<br/>(node 2)"]
NAN["node-agent<br/>(node N)"]
CA["cluster-agent"]
CS["Coroot Server"]
CH["ClickHouse<br/>(2 shards × 2 replicas)"]
Prom["Prometheus / VM"]
end
NA1 --> CS
NA2 --> CS
NAN --> CS
CA --> CS
CS --> CH
CS --> Prom
Multi-Cluster (Hub and Spoke)
flowchart TB
subgraph Central["Central Cluster"]
CS["Coroot Server\n(full install)"]
CH["ClickHouse"]
Prom["Prometheus / VM"]
end
subgraph Remote1["Remote Cluster 1"]
NA_R1["node-agents"]
CA_R1["cluster-agent"]
end
subgraph Remote2["Remote Cluster 2"]
NA_R2["node-agents"]
CA_R2["cluster-agent"]
end
NA_R1 -->|"agentsOnly=true"| CS
CA_R1 --> CS
NA_R2 -->|"agentsOnly=true"| CS
CA_R2 --> CS
CS --> CH
CS --> Prom
style Central fill:#1565c0,color:#fff
Sequence: Incident Detection → RCA
sequenceDiagram
participant App as Application
participant Kernel as Linux Kernel
participant Agent as node-agent (eBPF)
participant Server as Coroot Server
participant Insp as Inspection Engine
participant RCA as AI RCA
participant Alert as Alert Channel
Kernel->>Agent: eBPF events (TCP, DNS, disk)
Agent->>Server: Metrics + traces + logs
Server->>Insp: Run 18 built-in inspections
Insp->>Insp: SLO breach detected
Insp->>RCA: Trigger root cause analysis
RCA->>RCA: Walk dependency graph
RCA->>RCA: Correlate metrics ↔ traces ↔ logs
RCA->>RCA: Rank root causes
RCA->>Alert: Send alert with RCA summary
Note over Alert: Slack / PagerDuty / Webhook
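The "walk dependency graph" and "rank root causes" steps in the sequence above can be sketched in Python as follows. This is not Coroot's algorithm: the function `rank_root_causes`, the service names, and the deepest-failing-dependency-first heuristic are assumptions chosen for illustration, and the metric, trace, and log correlation step is omitted.

```python
from collections import deque


def rank_root_causes(deps, failing, start):
    """Breadth-first walk from the service whose SLO was breached.

    deps:    mapping of service -> list of services it calls (hypothetical data)
    failing: set of services with at least one failed inspection
    start:   the service where the SLO breach was detected
    Returns the failing services reachable from start, deepest first.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        svc = queue.popleft()
        for dep in deps.get(svc, []):
            if dep not in depth:  # visit each service once, at its shortest distance
                depth[dep] = depth[svc] + 1
                queue.append(dep)

    candidates = [s for s in depth if s in failing]
    # Assumed heuristic: a failing service further down the call chain is
    # more likely the origin than the callers that merely observe it.
    return sorted(candidates, key=lambda s: (-depth[s], s))


# Hypothetical service map and inspection results.
deps = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "postgres"],
    "catalog": ["postgres"],
    "payments": [],
    "postgres": [],
}
failing = {"frontend", "checkout", "postgres"}

print(rank_root_causes(deps, failing, "frontend"))
# prints ['postgres', 'checkout', 'frontend']
```

Here `postgres` ranks first because it is the failing service furthest from the breached `frontend`, while healthy services such as `catalog` and `payments` are traversed but not reported.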