# Ceph — How It Works
RADOS object store, CRUSH data placement, OSD write path, and recovery mechanics.
## RADOS Architecture
```mermaid
flowchart TB
    subgraph Clients["Client Access"]
        RBD["RBD Client\n(block)"]
        RGW["RGW\n(S3/Swift)"]
        CephFS_C["CephFS Client\n(POSIX)"]
    end
    subgraph RADOS["RADOS (Core)"]
        PG["Placement Groups\n(PG)"]
        CRUSH_A["CRUSH Map\n(deterministic placement)"]
    end
    subgraph Daemons["Cluster Daemons"]
        MON["MON ×3+\n(quorum, cluster map)"]
        MGR["MGR ×2\n(metrics, dashboard)"]
        OSD1["OSD 1"]
        OSD2["OSD 2"]
        OSD3["OSD 3"]
        OSDN["OSD N"]
        MDS_D["MDS ×2+\n(CephFS metadata)"]
    end
    Clients --> RADOS
    RADOS --> Daemons
    CRUSH_A --> PG
    PG --> OSD1
    PG --> OSD2
    PG --> OSD3
    style RADOS fill:#c62828,color:#fff
    style Daemons fill:#1565c0,color:#fff
```
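All three client types in the diagram ultimately speak the same librados protocol to RADOS. As a concrete illustration, here is a minimal sketch using the python3-rados bindings; the pool name `mypool` is a hypothetical example, not something Ceph creates for you.

```python
# Minimal librados sketch: a client talks to RADOS directly.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()  # fetches the cluster map from the MONs

ioctx = cluster.open_ioctx("mypool")  # I/O context bound to one pool
try:
    # The client hashes the object name, runs CRUSH against the
    # cluster map, and sends the write straight to the primary OSD.
    ioctx.write_full("hello-object", b"hello rados")
    print(ioctx.read("hello-object"))
finally:
    ioctx.close()
    cluster.shutdown()
```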
## CRUSH Algorithm
CRUSH (Controlled Replication Under Scalable Hashing) is the algorithm that sets Ceph apart: every client computes data placement itself from the cluster map, so reads and writes need no central lookup table or metadata broker in the data path.
```mermaid
flowchart LR
    Object["Object ID"] --> Hash["Hash\n(CRUSH)"]
    Hash --> PG_C["Placement Group\n(PG = hash mod num_pgs)"]
    PG_C --> CRUSH_C["CRUSH Rules\n(rack/host/OSD failure domains)"]
    CRUSH_C --> OSD_Set["OSD Set\n{OSD.4, OSD.17, OSD.29}"]
    style CRUSH_C fill:#e65100,color:#fff
```
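The object-to-PG step is plain arithmetic, which the sketch below reproduces. Note that Ceph itself hashes object names with rjenkins1 and uses a "stable mod" so PG counts can grow cleanly; the generic hash here is a stand-in for illustration, not Ceph's actual function.

```python
# Illustrative sketch of the object -> PG step only.
import hashlib

def object_to_pg(object_name: str, pg_num: int) -> int:
    """Map an object name onto one of pg_num placement groups."""
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
    return h % pg_num  # PG = hash mod num_pgs

# CRUSH then maps the PG id, via the CRUSH map and rule, to an OSD set.
print(object_to_pg("hello-object", pg_num=128))
```

To see the real mapping on a live cluster, `ceph osd map <pool> <object>` prints the PG id and the acting OSD set for any object name.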
## Write Path
```mermaid
sequenceDiagram
    participant Client as Client
    participant Primary as Primary OSD
    participant Replica1 as Replica OSD 1
    participant Replica2 as Replica OSD 2
    participant Journal as WAL/DB (BlueStore)
    Client->>Primary: Write object
    Primary->>Journal: Write to WAL (journal)
    par Replicate
        Primary->>Replica1: Forward write
    and
        Primary->>Replica2: Forward write
    end
    Replica1-->>Primary: ACK
    Replica2-->>Primary: ACK
    Primary-->>Client: Write complete
    Note over Primary: All replicas confirmed<br/>before client ACK (strong consistency)
```
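From the client's perspective, the whole sequence collapses into a single acknowledged operation: each replica persists the write to its own BlueStore WAL before ACKing the primary, and librados does not report success until the full acting set has committed. A minimal async sketch using the python3-rados bindings; pool and object names are hypothetical.

```python
# Client-side view of the replicated write, using the async API.
# wait_for_complete() returns only after the primary and all
# replicas have the write, matching the diagram's single ACK.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mypool")
try:
    comp = ioctx.aio_write_full("hello-object", b"payload")
    comp.wait_for_complete()  # blocks until the cluster-wide ACK
    print("acknowledged by the full acting set")
finally:
    ioctx.close()
    cluster.shutdown()
```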
## BlueStore (Default OSD Backend)
| Component | Role |
|---|---|
| BlockDevice | Raw block device (HDD/SSD/NVMe) — no filesystem |
| RocksDB | Object metadata store |
| WAL | Write-ahead log (best on fast NVMe) |
| DB | RocksDB data (best on SSD) |
| Data | Object data (can be HDD) |
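Whether a given OSD actually runs BlueStore (and what its metadata reports) can be checked programmatically. A small sketch using python3-rados to issue the `osd metadata` monitor command; OSD id 0 is just an example.

```python
# Query an OSD's metadata through the monitors and report its backend.
import json
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    cmd = json.dumps({"prefix": "osd metadata", "id": 0, "format": "json"})
    ret, outbuf, errs = cluster.mon_command(cmd, b"")
    if ret == 0:
        meta = json.loads(outbuf)
        print(meta.get("osd_objectstore"))  # "bluestore" on modern clusters
finally:
    cluster.shutdown()
```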
## Data Protection
| Method | Overhead | Speed | Use Case |
|---|---|---|---|
| Replication (3×) | 200% | Fast writes | Hot data, databases |
| Erasure Coding (4+2) | 50% | Slower writes, fast reads | Cold data, archives |
| FastEC (Tentacle) | 50% | Improved small I/O | General purpose |
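The overhead figures in the table follow directly from the layout: n-way replication stores n full copies, while k+m erasure coding stores m parity chunks for every k data chunks. A quick sketch of the arithmetic:

```python
# Back-of-envelope capacity math for the table above.
def usable_fraction_replica(n: int) -> float:
    return 1.0 / n        # 3x replication -> 33% usable (200% overhead)

def usable_fraction_ec(k: int, m: int) -> float:
    return k / (k + m)    # EC 4+2 -> 67% usable (50% overhead)

print(f"3x replication: {usable_fraction_replica(3):.0%} usable")
print(f"EC 4+2:         {usable_fraction_ec(4, 2):.0%} usable")
```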