# Operations

## Deployment & Typical Setup

### Quick Dev Setup (All-in-One Docker)

The fastest way to start with LGTM is a single Docker image that contains all components:
```shell
docker run --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  --rm -ti grafana/otel-lgtm
```
- Grafana UI: http://localhost:3000 (admin/admin)
- OTLP gRPC: localhost:4317
- OTLP HTTP: localhost:4318
Includes: OTel Collector, Prometheus, Loki, Tempo, Pyroscope, Grafana — all pre-wired.
### Production Setup (Kubernetes)

For production, deploy each component independently via Helm:
```shell
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Deploy in order: storage backends first, then Grafana
helm install mimir grafana/mimir-distributed -n monitoring -f mimir-values.yaml
helm install loki grafana/loki -n monitoring -f loki-values.yaml
helm install tempo grafana/tempo-distributed -n monitoring -f tempo-values.yaml
helm install pyroscope grafana/pyroscope -n monitoring -f pyroscope-values.yaml
helm install alloy grafana/alloy -n monitoring -f alloy-values.yaml
helm install grafana grafana/grafana -n monitoring -f grafana-values.yaml
```
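Once the charts are up, Grafana needs data sources pointing at each backend's query endpoint. A minimal provisioning sketch for `grafana-values.yaml` — the service names and ports below are illustrative and depend on your chart values:

```yaml
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Mimir
        type: prometheus
        url: http://mimir-nginx.monitoring.svc/prometheus     # illustrative service name
      - name: Loki
        type: loki
        url: http://loki-gateway.monitoring.svc               # illustrative service name
      - name: Tempo
        type: tempo
        url: http://tempo-query-frontend.monitoring.svc:3100  # illustrative service name
```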
## Production Readiness Checklist

### Configuration & Optimal Tuning

#### Label Strategy (CRITICAL for Loki)

The #1 operational pitfall is label cardinality. Follow these rules:
| Label Type | Good ✅ | Bad ❌ |
|---|---|---|
| Static metadata | `namespace`, `pod`, `job`, `env` | `user_id`, `request_id`, `ip_address` |
| Bounded values | `status_code` (200, 404, 500) | `timestamp`, `trace_id` |
| Grouping | `team`, `region`, `cluster` | `url_path` (unbounded) |
Target: Keep active label streams < 10,000 per tenant for optimal performance.
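Stream counts multiply across label values, so it is worth doing the arithmetic before adding a label. A back-of-envelope sketch (the counts are illustrative):

```shell
# Each unique label combination is one Loki stream; counts multiply.
namespaces=10
pods=50
levels=3    # e.g. info, warn, error
echo $(( namespaces * pods * levels ))   # 1500 active streams
```

Adding a label with even 20 distinct values to this set would yield 1,500 × 20 = 30,000 streams, well past the 10,000-stream target.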
#### Retention Configuration

| Component | Config Key | Recommended Default |
|---|---|---|
| Mimir | `limits.compactor_blocks_retention_period` | 13 months (metrics) |
| Loki | `limits_config.retention_period` | 30 days (logs) |
| Tempo | `compactor.compaction.block_retention` | 14–30 days (traces) |
| Pyroscope | retention config | 14 days (profiles) |
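As a sketch, the Loki and Tempo keys above look like this in the respective config files (values match the recommended defaults; exact placement varies by chart and version):

```yaml
# loki.yaml (excerpt)
limits_config:
  retention_period: 720h        # 30 days
---
# tempo.yaml (excerpt)
compactor:
  compaction:
    block_retention: 720h       # 30 days
```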
#### Per-Tenant Limits (Multi-Tenancy)

Set per-tenant limits to prevent noisy neighbors:
```yaml
# Mimir overrides
overrides:
  tenant-alpha:
    max_global_series_per_user: 500000
    ingestion_rate: 50000               # samples/sec
    max_fetched_series_per_query: 100000
  tenant-beta:
    max_global_series_per_user: 100000
    ingestion_rate: 10000
```

```yaml
# Loki overrides
overrides:
  tenant-alpha:
    max_global_streams_per_user: 10000
    ingestion_rate_mb: 10
    max_query_length: 720h
```
### Reliability & Scaling

#### Scaling Decision Matrix
| Symptom | Component to Scale | How |
|---|---|---|
| Slow metric queries | Mimir queriers | Add querier replicas |
| Write backpressure on metrics | Mimir ingesters | Add ingester replicas |
| Slow log search | Loki queriers | Add querier replicas, check label cardinality |
| Log ingestion lag | Loki ingesters | Add ingester replicas, increase limits |
| Slow trace search | Tempo queriers | Add querier replicas |
| Cache miss rate > 20% | Memcached | Add memcached replicas, increase memory |
| Object storage latency | All | Verify same-AZ deployment, enable caching |
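In the Helm-based deployment above, scaling usually means raising replica counts in the values file and upgrading the release. A sketch for Mimir (the key names are illustrative; check your chart's values schema):

```yaml
# mimir-values.yaml (excerpt, illustrative)
querier:
  replicas: 6     # scale out for slow metric queries
ingester:
  replicas: 5     # scale out for write backpressure
```

Apply with `helm upgrade mimir grafana/mimir-distributed -n monitoring -f mimir-values.yaml`.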
#### High Availability Requirements
| Component | HA Mechanism | Minimum Replicas |
|---|---|---|
| Ingesters (all backends) | Replication factor (RF=3 recommended) | 3 |
| Distributors | Stateless, load-balanced | 2+ |
| Queriers | Stateless, load-balanced | 2+ |
| Compactors | Leader election (single active) | 1 (with standby) |
| Store-gateways (Mimir) | Sharded by blocks | 2+ |
| Query frontends | Stateless, request splitting | 2+ |
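For the RF=3 recommendation, the replication factor is a property of the ingester ring. A sketch for the `mimir-distributed` chart — the key path is an assumption to verify against your chart version:

```yaml
# mimir-values.yaml (excerpt, illustrative)
mimir:
  structuredConfig:
    ingester:
      ring:
        replication_factor: 3   # each series is written to 3 ingesters
ingester:
  replicas: 3                   # run at least RF ingester replicas
```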
### Cost

#### Cost Drivers
| Factor | Primary Driver | Optimization |
|---|---|---|
| Object storage | Data volume × retention | Set retention policies, use lifecycle rules, compress |
| Compute (ingesters) | Ingestion rate | Right-size, use spot/preemptible nodes |
| Compute (queriers) | Query volume and complexity | Recording rules, caching, query limits |
| Network (cross-AZ) | Cross-AZ traffic between components | Co-locate in a single AZ or use VPC endpoints |
| Memcached | Cache size × hit ratio | Size to achieve > 80% hit rate |
#### Cost at Scale
| Scale | Metrics (active series) | Logs (GB/day) | Traces (spans/day) | Est. Monthly, Self-Hosted |
|---|---|---|---|---|
| Small | 100k | 10 GB | 5M | $200–500 |
| Medium | 1M | 100 GB | 50M | $1,000–3,000 |
| Large | 10M | 1 TB | 500M | $5,000–15,000 |
| Enterprise | 100M+ | 10 TB+ | 5B+ | $20,000–100,000+ |
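The object-storage slice of these estimates can be sanity-checked with simple arithmetic. A sketch for the Medium tier's log volume (the unit price is an assumption, roughly S3 Standard):

```shell
gb_per_day=100          # Medium tier log volume
retention_days=30       # Loki retention from the table above
cents_per_gb_month=2    # assumed object-storage price

stored_gb=$(( gb_per_day * retention_days ))
echo "stored: ${stored_gb} GB"                                            # stored: 3000 GB
echo "storage: \$$(( stored_gb * cents_per_gb_month / 100 ))/month"       # storage: $60/month
```

Storage is typically the small slice; ingester and querier compute drives most of the monthly total.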
#### Cost Optimization Strategies
- Recording rules — precompute expensive PromQL queries in Mimir
- Adaptive Metrics (Grafana Cloud) — automatically drop unused metrics
- Adaptive Logs (Grafana Cloud) — automatically reduce noisy log volumes
- Sampling — head-based or tail-based trace sampling to reduce Tempo costs
- Log pipeline filtering — drop debug/info logs in Alloy before they reach Loki
- Single-AZ deployment — eliminate cross-AZ network costs (accept reduced availability)
- Object storage lifecycle rules — transition old data to cheaper tiers (S3 IA/Glacier)
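For the log-pipeline filtering point, a minimal Alloy sketch that drops debug-level lines before they reach Loki — the component labels and the matching expression are illustrative:

```alloy
loki.process "drop_debug" {
  // Drop any line matching the debug-level marker before shipping.
  stage.drop {
    expression = "level=debug"
  }

  // Surviving lines are forwarded to the Loki write component.
  forward_to = [loki.write.default.receiver]
}
```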
### Security

#### Authentication & Multi-Tenancy
- Deploy an auth proxy (NGINX, Envoy, or an API gateway) in front of all backends
- The proxy authenticates users and injects `X-Scope-OrgID` based on verified identity
- Never expose backends directly without authentication
- Use per-tenant limits to prevent resource exhaustion
- All inter-component communication should use mTLS in production
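An illustrative NGINX sketch of the proxy pattern: authenticate first, then inject `X-Scope-OrgID` from the verified identity. The auth endpoint, header variable, and upstream name are assumptions:

```nginx
location / {
    # Delegate authentication to a subrequest (ngx_http_auth_request_module).
    auth_request /oauth2/auth;
    auth_request_set $tenant $upstream_http_x_auth_request_user;

    # Backends trust this header, so only the proxy may set it.
    proxy_set_header X-Scope-OrgID $tenant;
    proxy_pass http://loki-gateway.monitoring.svc;
}
```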
#### Network Security
- Use NetworkPolicies in Kubernetes to restrict pod-to-pod communication
- Only Alloy should talk to backend Distributors
- Only Query Frontends should be exposed to Grafana
- Object storage should be accessed via VPC endpoints (no public internet)
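A NetworkPolicy sketch for the "only Alloy talks to backend Distributors" rule — the pod labels are illustrative and must match the ones your charts emit:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-distributor-ingress
  namespace: monitoring
spec:
  # Applies to Loki distributor pods.
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
      app.kubernetes.io/component: distributor
  policyTypes:
    - Ingress
  ingress:
    # Allow traffic only from Alloy pods.
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes io/name: alloy
```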
## Best Practices

### Instrumentation
- Use OpenTelemetry everywhere — standardize on OTLP as the protocol
- Inject trace IDs into logs — this is the foundation of log-trace correlation
- Set resource attributes — `service.name`, `deployment.environment`, `k8s.pod.name` on every signal
- Use auto-instrumentation first — the Java agent, Python auto-instrumentation, eBPF for Go
- Add manual spans for critical business logic that auto-instrumentation misses
- Sample in production — head-based (simple) or tail-based (captures errors and slow requests) sampling
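The payoff of trace-ID injection is Grafana's derived fields: a regex over the log line turns the ID into a clickable link to Tempo. A Loki data source provisioning sketch, assuming logs print `trace_id=<hex>` (the Tempo UID is an assumption):

```yaml
datasources:
  - name: Loki
    type: loki
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          url: '$${__value.raw}'      # $$ escapes $ in provisioning files
          datasourceUid: tempo        # UID of your Tempo data source
```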
### Operations
- Monitor the monitoring — deploy a separate, smaller "meta-monitoring" LGTM stack that monitors the primary stack
- Use Grafana mixins — pre-built dashboards for Mimir, Loki, and Tempo internals
- Label governance — enforce labeling standards to prevent cardinality explosions
- Test with load — use `k6` with the Grafana extension to load-test the stack before production
- GitOps everything — keep dashboards, alerts, data sources, and Helm values in version control
## Common Issues & Playbook
| Symptom | Likely Cause | Fix |
|---|---|---|
| "too many outstanding requests" | Ingester overwhelmed | Scale ingesters, increase per-tenant limits |
| "max streams limit reached" (Loki) | High label cardinality | Reduce label cardinality, drop high-cardinality labels in Alloy |
| "context deadline exceeded" | Slow object storage or oversized query | Enable caching, add query limits, check AZ placement |
| Exemplars not showing | Mimir not storing exemplars | Enable exemplar storage in Mimir, verify app instrumentation |
| Trace-to-logs not working | Missing trace ID in logs | Verify the OTel SDK injects `trace_id` into log output |
| Derived fields not clickable | Regex doesn't match | Test the regex against actual log lines, verify the Loki data source config |
| High memory on ingesters | WAL too large or too many active series/streams | Increase ingester memory, tune WAL flush interval |
| Slow TraceQL queries | Large time range or low selectivity | Narrow the time range, add specific attribute filters |
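For the exemplars issue: in Mimir, exemplar storage stays off until the per-tenant exemplar limit is raised above zero. A sketch (the limit value is illustrative):

```yaml
# Mimir limits (excerpt)
limits:
  max_global_exemplars_per_user: 100000   # 0 disables exemplar storage
```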
## Monitoring & Troubleshooting
| Component | Metric | What It Tells You |
|---|---|---|
| All | `*_request_duration_seconds` | Internal API latency |
| Ingesters | `*_ingester_memory_series` / `*_live_entries` | In-memory load |
| Distributors | `*_distributor_received_samples_total` | Ingestion throughput |
| Queriers | `*_querier_request_duration_seconds` | Query latency |
| Compactors | `*_compactor_runs_completed_total` | Compaction health |
| Object storage | `*_thanos_objstore_bucket_operation_duration_seconds` | Storage latency |
### Grafana Mixins

Pre-built monitoring dashboards for each LGTM component:

- Mimir: `grafana/mimir` → `operations/mimir-mixin/`
- Loki: `grafana/loki` → `production/loki-mixin/`
- Tempo: `grafana/tempo` → `operations/tempo-mixin/`