Building an Observability Stack for SRE

Observability is the cornerstone of Site Reliability Engineering. Without proper visibility into your systems, you’re flying blind. Here’s how I approach building an observability stack from scratch.

The Three Pillars

Observability rests on three pillars, each providing a different perspective on system behavior:

  • Metrics - Numerical measurements over time (latency, error rates, throughput)
  • Logs - Discrete events with contextual information
  • Traces - End-to-end request flows across services
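To make the three pillars concrete, here is a minimal sketch (in Python, with illustrative field names rather than any specific library's API) of how a single failed HTTP request might surface in each one:

```python
import json
import time

# One request, three perspectives. Field names are illustrative.
request = {"method": "GET", "path": "/api/users", "status": 500, "duration_s": 0.182}

# Metric: a numeric aggregate over many requests — here, a counter
# keyed by status class, losing per-request detail but cheap to store.
metrics = {}
status_class = f"{request['status'] // 100}xx"
key = ("http_requests_total", status_class)
metrics[key] = metrics.get(key, 0) + 1

# Log: a discrete event carrying the full context of this one request.
log_line = json.dumps({"ts": time.time(), "level": "error", **request})

# Trace: the same request as a span, with identifiers that link it to
# the other spans in the end-to-end request flow.
span = {"trace_id": "abc123", "span_id": "def456",
        "name": "GET /api/users", "duration_s": request["duration_s"],
        "status": "error"}
```

The point of the sketch: a metric tells you *that* 5xx responses are rising, the log tells you *what* one failure looked like, and the trace tells you *where* in the request path it happened.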

Metrics with Prometheus

Prometheus is the de facto standard for metrics collection in cloud-native environments. Key considerations:

Recording Rules

Pre-compute expensive queries to improve dashboard performance:

groups:
  - name: sli_rules
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))

      - record: job:http_requests:error_rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)

Alerting Rules

Alert on symptoms, not causes:

groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_requests:error_rate > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% SLO"
          description: "{{ $labels.job }} error rate is {{ $value | humanizePercentage }}"

SLIs, SLOs, and Error Budgets

Define what “reliable” means for your services:

| Service | SLI                  | SLO   | Error Budget (30d) |
|---------|----------------------|-------|--------------------|
| API     | Request success rate | 99.9% | 43.2 minutes       |
| API     | p99 latency < 200ms  | 99.5% | 3.6 hours          |
| Web     | Page load < 2s       | 99%   | 7.2 hours          |
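The error-budget column follows directly from the SLO: the budget is simply the fraction of the window you are allowed to fail. A quick sketch of the arithmetic, assuming a 30-day window:

```python
# Error budget = (1 - SLO) x window length.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def error_budget_minutes(slo: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Allowed 'bad' minutes over the window for a given availability SLO."""
    return (1 - slo) * window_minutes

print(round(error_budget_minutes(0.999), 1))       # → 43.2 (minutes)
print(round(error_budget_minutes(0.995) / 60, 1))  # → 3.6 (hours)
print(round(error_budget_minutes(0.99) / 60, 1))   # → 7.2 (hours)
```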

Error Budget Policy

When the error budget is exhausted:

  1. Freeze feature releases until budget recovers
  2. Prioritize reliability work in the sprint
  3. Conduct thorough postmortems for budget-burning incidents
  4. Increase monitoring coverage in affected areas
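Deciding when to invoke this policy is easier to reason about in terms of burn rate: how many times faster than sustainable the current error rate is consuming the budget. A small sketch of that calculation (illustrative numbers, not tied to the alert rules above):

```python
# Burn rate = current error rate / budget rate (1 - SLO).
# A burn rate of 1.0 exhausts the budget exactly at the end of the
# window; higher values exhaust it proportionally sooner.

def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the budget is being consumed, as a multiple of sustainable."""
    return error_rate / (1 - slo)

# 1% errors against a 99.9% SLO burns budget 10x faster than sustainable,
# i.e. a 30-day budget would be gone in about 3 days.
print(round(burn_rate(0.01, 0.999), 1))  # → 10.0
```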

Dashboards with Grafana

Effective dashboards follow the USE Method (Utilization, Saturation, Errors) for infrastructure resources and the RED Method (Rate, Errors, Duration) for request-serving services:

Dashboard Hierarchy

  1. Overview dashboard - High-level service health (SLO status)
  2. Service dashboards - Per-service detailed metrics
  3. Infrastructure dashboards - Node, pod, and cluster metrics
  4. Debug dashboards - Deep-dive panels for incident investigation
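The RED signals behind these dashboards reduce to simple arithmetic over per-request samples. In practice they come from PromQL queries against histograms, but a plain-Python sketch (with made-up sample data) shows what each panel is computing:

```python
# Illustrative RED computation over one 60-second window of
# (status_code, duration_seconds) samples. In production these
# numbers come from Prometheus, not application code.
samples = [
    (200, 0.05), (200, 0.07), (500, 0.30), (200, 0.04), (503, 0.25),
]
window_s = 60

# Rate: requests per second over the window.
rate = len(samples) / window_s

# Errors: fraction of requests that failed (5xx).
errors = sum(1 for status, _ in samples if status >= 500) / len(samples)

# Duration: a high percentile of latency (nearest-rank p99 here).
durations = sorted(d for _, d in samples)
p99 = durations[min(len(durations) - 1, int(0.99 * len(durations)))]

print(rate, errors, p99)
```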

Log Aggregation with Loki

Loki provides cost-effective log aggregation that integrates with Grafana:

# Promtail configuration
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace

Incident Response

Good observability accelerates incident response:

  1. Alert fires - On-call engineer receives notification
  2. Triage - Check overview dashboard for impact scope
  3. Investigate - Correlate metrics, logs, and traces
  4. Mitigate - Apply quick fix to restore service
  5. Resolve - Deploy permanent fix
  6. Postmortem - Document learnings and action items

Conclusion

Observability is not a one-time setup; it’s a practice that evolves with your systems. Start with the basics, iterate on your dashboards, and continuously refine your alerts. The goal is to detect and resolve issues before your users notice them.