# Operations

## Deployment & Typical Setup

### Quick Dev Setup (All-in-One Docker)

The fastest way to start with LGTM is a single Docker image that contains all components:
```shell
docker run --name lgtm \
  -p 3000:3000 \
  -p 4317:4317 \
  -p 4318:4318 \
  --rm -ti grafana/otel-lgtm
```
- Grafana UI: http://localhost:3000 (admin/admin)
- OTLP gRPC: localhost:4317
- OTLP HTTP: localhost:4318
Includes: OTel Collector, Prometheus, Loki, Tempo, Pyroscope, Grafana — all pre-wired.
### Production Setup (Kubernetes)

For production, deploy each component independently via Helm:
```shell
# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Deploy in order: storage backends first, then Grafana
helm install mimir grafana/mimir-distributed -n monitoring -f mimir-values.yaml
helm install loki grafana/loki -n monitoring -f loki-values.yaml
helm install tempo grafana/tempo-distributed -n monitoring -f tempo-values.yaml
helm install pyroscope grafana/pyroscope -n monitoring -f pyroscope-values.yaml
helm install alloy grafana/alloy -n monitoring -f alloy-values.yaml
helm install grafana grafana/grafana -n monitoring -f grafana-values.yaml
```
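Once the charts are up, Grafana needs data sources pointing at each backend's query endpoint. A minimal provisioning sketch for `grafana-values.yaml` — the service names and ports below are illustrative and depend on your chart values:

```yaml
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Mimir
        type: prometheus
        url: http://mimir-nginx.monitoring.svc/prometheus     # illustrative service name
      - name: Loki
        type: loki
        url: http://loki-gateway.monitoring.svc               # illustrative service name
      - name: Tempo
        type: tempo
        url: http://tempo-query-frontend.monitoring.svc:3100  # illustrative service name
```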
## Production Readiness Checklist

### Configuration & Optimal Tuning

#### Label Strategy (CRITICAL for Loki)

The #1 operational pitfall is label cardinality. Follow these rules:
| Label Type | Good ✅ | Bad ❌ |
|---|---|---|
| Static metadata | `namespace`, `pod`, `job`, `env` | `user_id`, `request_id`, `ip_address` |
| Bounded values | `status_code` (200, 404, 500) | `timestamp`, `trace_id` |
| Grouping | `team`, `region`, `cluster` | `url_path` (unbounded) |
Target: Keep active label streams < 10,000 per tenant for optimal performance.
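Stream counts multiply across label values, so it is worth doing the arithmetic before adding a label. A back-of-envelope sketch (the counts are illustrative):

```shell
# Each unique label combination is one Loki stream; counts multiply.
namespaces=10
pods=50
levels=3    # e.g. info, warn, error
echo $(( namespaces * pods * levels ))   # 1500 active streams
```

Adding a label with even 20 distinct values to this set would yield 1,500 × 20 = 30,000 streams, well past the 10,000-stream target.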
#### Retention Configuration

| Component | Config Key | Recommended Default |
|---|---|---|
| Mimir | `limits.compactor_blocks_retention_period` | 13 months (metrics) |
| Loki | `limits_config.retention_period` | 30 days (logs) |
| Tempo | `compactor.compaction.block_retention` | 14–30 days (traces) |
| Pyroscope | retention config | 14 days (profiles) |
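As a sketch, the Loki and Tempo keys above look like this in the respective config files (values match the recommended defaults; exact placement varies by chart and version):

```yaml
# loki.yaml (excerpt)
limits_config:
  retention_period: 720h        # 30 days
---
# tempo.yaml (excerpt)
compactor:
  compaction:
    block_retention: 720h       # 30 days
```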
#### Per-Tenant Limits (Multi-Tenancy)

Set per-tenant limits to prevent noisy neighbors:
```yaml
# Mimir overrides
overrides:
  tenant-alpha:
    max_global_series_per_user: 500000
    ingestion_rate: 50000               # samples/sec
    max_fetched_series_per_query: 100000
  tenant-beta:
    max_global_series_per_user: 100000
    ingestion_rate: 10000
```

```yaml
# Loki overrides
overrides:
  tenant-alpha:
    max_global_streams_per_user: 10000
    ingestion_rate_mb: 10
    max_query_length: 720h
```
### Reliability & Scaling

#### Scaling Decision Matrix
| Symptom | Component to Scale | How |
|---|---|---|
| Slow metric queries | Mimir queriers | Add querier replicas |
| Write backpressure on metrics | Mimir ingesters | Add ingester replicas |
| Slow log search | Loki queriers | Add querier replicas, check label cardinality |
| Log ingestion lag | Loki ingesters | Add ingester replicas, increase limits |
| Slow trace search | Tempo queriers | Add querier replicas |
| Cache miss rate > 20% | Memcached | Add memcached replicas, increase memory |
| Object storage latency | All | Verify same-AZ deployment, enable caching |
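In the Helm-based deployment above, scaling usually means raising replica counts in the values file and upgrading the release. A sketch for Mimir (the key names are illustrative; check your chart's values schema):

```yaml
# mimir-values.yaml (excerpt, illustrative)
querier:
  replicas: 6     # scale out for slow metric queries
ingester:
  replicas: 5     # scale out for write backpressure
```

Apply with `helm upgrade mimir grafana/mimir-distributed -n monitoring -f mimir-values.yaml`.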
#### High Availability Requirements
| Component | HA Mechanism | Minimum Replicas |
|---|---|---|
| Ingesters (all backends) | Replication factor (RF=3 recommended) | 3 |
| Distributors | Stateless, load-balanced | 2+ |
| Queriers | Stateless, load-balanced | 2+ |
| Compactors | Leader election (single active) | 1 (with standby) |
| Store-gateways (Mimir) | Sharded by blocks | 2+ |
| Query frontends | Stateless, request splitting | 2+ |
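For the RF=3 recommendation, the replication factor is a property of the ingester ring. A sketch for the `mimir-distributed` chart — the key path is an assumption to verify against your chart version:

```yaml
# mimir-values.yaml (excerpt, illustrative)
mimir:
  structuredConfig:
    ingester:
      ring:
        replication_factor: 3   # each series is written to 3 ingesters
ingester:
  replicas: 3                   # run at least RF ingester replicas
```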
### Cost

#### Cost Drivers
| Factor | Primary Driver | Optimization |
|---|---|---|
| Object storage | Data volume × retention | Set retention policies, use lifecycle rules, compress |
| Compute (ingesters) | Ingestion rate | Right-size, use spot/preemptible nodes |
| Compute (queriers) | Query volume and complexity | Recording rules, caching, query limits |
| Network (cross-AZ) | Cross-AZ traffic between components | Co-locate in a single AZ or use VPC endpoints |
| Memcached | Cache size × hit ratio | Size to achieve > 80% hit rate |
#### Cost at Scale
| Scale | Metrics (active series) | Logs (GB/day) | Traces (spans/day) | Est. Monthly, Self-Hosted |
|---|---|---|---|---|
| Small | 100k | 10 GB | 5M | $200–500 |
| Medium | 1M | 100 GB | 50M | $1,000–3,000 |
| Large | 10M | 1 TB | 500M | $5,000–15,000 |
| Enterprise | 100M+ | 10 TB+ | 5B+ | $20,000–100,000+ |
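The object-storage slice of these estimates can be sanity-checked with simple arithmetic. A sketch for the Medium tier's log volume (the unit price is an assumption, roughly S3 Standard):

```shell
gb_per_day=100          # Medium tier log volume
retention_days=30       # Loki retention from the table above
cents_per_gb_month=2    # assumed object-storage price

stored_gb=$(( gb_per_day * retention_days ))
echo "stored: ${stored_gb} GB"                                            # stored: 3000 GB
echo "storage: \$$(( stored_gb * cents_per_gb_month / 100 ))/month"       # storage: $60/month
```

Storage is typically the small slice; ingester and querier compute drives most of the monthly total.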
#### Cost Optimization Strategies
- Recording rules — precompute expensive PromQL queries in Mimir
- Adaptive Metrics (Grafana Cloud) — automatically drop unused metrics
- Adaptive Logs (Grafana Cloud) — automatically reduce noisy log volumes
- Sampling — head-based or tail-based trace sampling to reduce Tempo costs
- Log pipeline filtering — drop debug/info logs in Alloy before they reach Loki
- Single-AZ deployment — eliminate cross-AZ network costs (accept reduced availability)
- Object storage lifecycle rules — transition old data to cheaper tiers (S3 IA/Glacier)
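For the log-pipeline filtering point, a minimal Alloy sketch that drops debug-level lines before they reach Loki — the component labels and the matching expression are illustrative:

```alloy
loki.process "drop_debug" {
  // Drop any line matching the debug-level marker before shipping.
  stage.drop {
    expression = "level=debug"
  }

  // Surviving lines are forwarded to the Loki write component.
  forward_to = [loki.write.default.receiver]
}
```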
### Security

#### Authentication & Multi-Tenancy
- Deploy an auth proxy (NGINX, Envoy, or an API gateway) in front of all backends
- The proxy authenticates users and injects `X-Scope-OrgID` based on verified identity
- Never expose backends directly without authentication
- Use per-tenant limits to prevent resource exhaustion
- All inter-component communication should use mTLS in production
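An illustrative NGINX sketch of the proxy pattern: authenticate first, then inject `X-Scope-OrgID` from the verified identity. The auth endpoint, header variable, and upstream name are assumptions:

```nginx
location / {
    # Delegate authentication to a subrequest (ngx_http_auth_request_module).
    auth_request /oauth2/auth;
    auth_request_set $tenant $upstream_http_x_auth_request_user;

    # Backends trust this header, so only the proxy may set it.
    proxy_set_header X-Scope-OrgID $tenant;
    proxy_pass http://loki-gateway.monitoring.svc;
}
```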
#### Network Security
- Use NetworkPolicies in Kubernetes to restrict pod-to-pod communication
- Only Alloy should talk to backend Distributors
- Only Query Frontends should be exposed to Grafana
- Object storage should be accessed via VPC endpoints (no public internet)
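A NetworkPolicy sketch for the "only Alloy talks to backend Distributors" rule — the pod labels are illustrative and must match the ones your charts emit:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-distributor-ingress
  namespace: monitoring
spec:
  # Applies to Loki distributor pods.
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
      app.kubernetes.io/component: distributor
  policyTypes:
    - Ingress
  ingress:
    # Allow traffic only from Alloy pods.
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes io/name: alloy
```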
## Best Practices

### Instrumentation
- Use OpenTelemetry everywhere — standardize on OTLP as the protocol
- Inject trace IDs into logs — this is the foundation of log-trace correlation
- Set resource attributes — `service.name`, `deployment.environment`, `k8s.pod.name` on every signal
- Use auto-instrumentation first — the Java agent, Python auto-instrumentation, eBPF for Go
- Add manual spans for critical business logic that auto-instrumentation misses
- Sample in production — head-based (simple) or tail-based (captures errors and slow requests) sampling
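The payoff of trace-ID injection is Grafana's derived fields: a regex over the log line turns the ID into a clickable link to Tempo. A Loki data source provisioning sketch, assuming logs print `trace_id=<hex>` (the Tempo UID is an assumption):

```yaml
datasources:
  - name: Loki
    type: loki
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          url: '$${__value.raw}'      # $$ escapes $ in provisioning files
          datasourceUid: tempo        # UID of your Tempo data source
```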
### Operations
- Monitor the monitoring — deploy a separate, smaller "meta-monitoring" LGTM stack that monitors the primary stack
- Use Grafana mixins — pre-built dashboards for Mimir, Loki, and Tempo internals
- Label governance — enforce labeling standards to prevent cardinality explosions
- Test with load — use `k6` with the Grafana extension to load-test the stack before production
- GitOps everything — keep dashboards, alerts, data sources, and Helm values in version control
## Common Issues & Playbook
| Symptom | Likely Cause | Fix |
|---|---|---|
| "too many outstanding requests" | Ingester overwhelmed | Scale ingesters, increase per-tenant limits |
| "max streams limit reached" (Loki) | High label cardinality | Reduce label cardinality, drop high-cardinality labels in Alloy |
| "context deadline exceeded" | Slow object storage or oversized query | Enable caching, add query limits, check AZ placement |
| Exemplars not showing | Mimir not storing exemplars | Enable exemplar storage in Mimir, verify app instrumentation |
| Trace-to-logs not working | Missing trace ID in logs | Verify the OTel SDK injects `trace_id` into log output |
| Derived fields not clickable | Regex doesn't match | Test the regex against actual log lines, verify the Loki data source config |
| High memory on ingesters | WAL too large or too many active series/streams | Increase ingester memory, tune WAL flush interval |
| Slow TraceQL queries | Large time range or low selectivity | Narrow the time range, add specific attribute filters |
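For the exemplars issue: in Mimir, exemplar storage stays off until the per-tenant exemplar limit is raised above zero. A sketch (the limit value is illustrative):

```yaml
# Mimir limits (excerpt)
limits:
  max_global_exemplars_per_user: 100000   # 0 disables exemplar storage
```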
## Monitoring & Troubleshooting
| Component | Metric | What It Tells You |
|---|---|---|
| All | `*_request_duration_seconds` | Internal API latency |
| Ingesters | `*_ingester_memory_series` / `*_live_entries` | In-memory load |
| Distributors | `*_distributor_received_samples_total` | Ingestion throughput |
| Queriers | `*_querier_request_duration_seconds` | Query latency |
| Compactors | `*_compactor_runs_completed_total` | Compaction health |
| Object storage | `*_thanos_objstore_bucket_operation_duration_seconds` | Storage latency |
### Grafana Mixins

Pre-built monitoring dashboards for each LGTM component:

- Mimir: `grafana/mimir` → `operations/mimir-mixin/`
- Loki: `grafana/loki` → `production/loki-mixin/`
- Tempo: `grafana/tempo` → `operations/tempo-mixin/`