# Building an Observability Stack for SRE
Observability is the cornerstone of Site Reliability Engineering. Without proper visibility into your systems, you’re flying blind. Here’s how I approach building an observability stack from scratch.
## The Three Pillars

Observability rests on three pillars, each providing a different perspective on system behavior:

- **Metrics** - Numerical measurements over time (latency, error rates, throughput)
- **Logs** - Discrete events with contextual information
- **Traces** - End-to-end request flows across services
## Metrics with Prometheus

Prometheus is the de facto standard for metrics collection in cloud-native environments. Key considerations:
### Recording Rules

Pre-compute expensive queries to improve dashboard performance:

```yaml
groups:
  - name: sli_rules
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      - record: job:http_requests:error_rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)
```
### Alerting Rules

Alert on symptoms, not causes:

```yaml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_requests:error_rate > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% SLO"
          description: "{{ $labels.job }} error rate is {{ $value | humanizePercentage }}"
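A single static threshold can page too late on a fast outage and too eagerly on a slow one. A common refinement is a multiwindow burn-rate alert; the sketch below applies the pattern from the Google SRE Workbook to the same 1% error budget, with the 14.4x factor and window lengths taken from that pattern and the raw `http_requests_total` metric assumed from the recording rules above:

```yaml
groups:
  - name: slo_burn_alerts
    rules:
      # Sketch: page only when the budget burns ~14.4x faster than
      # sustainable (exhausting a 30-day budget in about two days),
      # and only if the short window agrees the burn is ongoing.
      - alert: ErrorBudgetFastBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[1h])) by (job)
            / sum(rate(http_requests_total[1h])) by (job)) > (14.4 * 0.01)
          and
          (sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job)) > (14.4 * 0.01)
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than sustainable"
```

The short window acts as a reset: once the incident is mitigated, the 5m ratio drops and the alert clears without waiting for the 1h window to drain.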
## SLIs, SLOs, and Error Budgets

Define what “reliable” means for your services:
| Service | SLI | SLO | Error Budget (30d) |
|---|---|---|---|
| API | Request success rate | 99.9% | 43.2 minutes |
| API | p99 latency < 200ms | 99.5% | 3.6 hours |
| Web | Page load < 2s | 99% | 7.2 hours |
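The budget column follows directly from the SLO: the allowed failure fraction multiplied by the measurement window. For the 99.9% API target over 30 days:

```latex
\text{budget} = (1 - \text{SLO}) \times T
             = (1 - 0.999) \times 30 \times 24 \times 60\,\text{min}
             = 43.2\,\text{min}
```

The same arithmetic gives 3.6 hours at 99.5% and 7.2 hours at 99%.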
### Error Budget Policy

When the error budget is exhausted:
- Freeze feature releases until budget recovers
- Prioritize reliability work in the sprint
- Conduct thorough postmortems for budget-burning incidents
- Increase monitoring coverage in affected areas
## Dashboards with Grafana

Effective dashboards follow the USE Method (Utilization, Saturation, Errors) for resources and the RED Method (Rate, Errors, Duration) for request-driven services:
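As a sketch of the RED method in practice, a service row needs only three queries. The panel layout below is hypothetical (it is not literal Grafana JSON), and the raw `http_requests_total` rate expression is an assumption about metric names; the other two reuse the recording rules defined earlier:

```yaml
# Hypothetical RED panel targets for one service row.
panels:
  - title: Request rate (Rate)
    expr: sum(rate(http_requests_total[5m])) by (job)
  - title: Error ratio (Errors)
    expr: job:http_requests:error_rate
  - title: p99 latency (Duration)
    expr: job:http_request_duration_seconds:p99
```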
### Dashboard Hierarchy

- **Overview dashboard** - High-level service health (SLO status)
- **Service dashboards** - Per-service detailed metrics
- **Infrastructure dashboards** - Node, pod, and cluster metrics
- **Debug dashboards** - Deep-dive panels for incident investigation
## Log Aggregation with Loki

Loki provides cost-effective log aggregation that integrates with Grafana:

```yaml
# Promtail configuration
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```
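If your pods emit structured JSON logs, Promtail pipeline stages can lift fields into labels at scrape time. A sketch, assuming the log lines carry a `level` field (keep promoted labels low-cardinality, since every label combination creates a new Loki stream):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    # ...service discovery and relabel config as above...
    pipeline_stages:
      # Parse each line as JSON and extract the assumed "level" field.
      - json:
          expressions:
            level: level
      # Promote the extracted value to a stream label.
      - labels:
          level:
```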
## Incident Response

Good observability accelerates incident response:

1. **Alert fires** - On-call engineer receives notification
2. **Triage** - Check overview dashboard for impact scope
3. **Investigate** - Correlate metrics, logs, and traces
4. **Mitigate** - Apply quick fix to restore service
5. **Resolve** - Deploy permanent fix
6. **Postmortem** - Document learnings and action items
## Conclusion
Observability is not a one-time setup; it’s a practice that evolves with your systems. Start with the basics, iterate on your dashboards, and continuously refine your alerts. The goal is to detect and resolve issues before your users notice them.