# Building an Observability Stack for SRE
Observability is the cornerstone of Site Reliability Engineering. Without proper visibility into your systems, you’re flying blind. Here’s how I approach building an observability stack from scratch.
## The Three Pillars

Observability rests on three pillars, each providing a different perspective on system behavior:

- **Metrics** - Numerical measurements over time (latency, error rates, throughput)
- **Logs** - Discrete events with contextual information
- **Traces** - End-to-end request flows across services
## Metrics with Prometheus

Prometheus is the de facto standard for metrics collection in cloud-native environments. Key considerations:
### Recording Rules

Pre-compute expensive queries to improve dashboard performance:

```yaml
groups:
  - name: sli_rules
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      - record: job:http_requests:error_rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)
```
### Alerting Rules

Alert on symptoms, not causes:

```yaml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_requests:error_rate > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% SLO"
          description: "{{ $labels.job }} error rate is {{ $value | humanizePercentage }}"
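A single static threshold can page too late on a fast outage and too eagerly on a slow one. A common refinement is a multiwindow burn-rate alert; the sketch below applies the pattern from the Google SRE Workbook to the same 1% error budget, with the 14.4x factor and window lengths taken from that pattern and the raw `http_requests_total` metric assumed from the recording rules above:

```yaml
groups:
  - name: slo_burn_alerts
    rules:
      # Sketch: page only when the budget burns ~14.4x faster than
      # sustainable (exhausting a 30-day budget in about two days),
      # and only if the short window agrees the burn is ongoing.
      - alert: ErrorBudgetFastBurn
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[1h])) by (job)
            / sum(rate(http_requests_total[1h])) by (job)) > (14.4 * 0.01)
          and
          (sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job)) > (14.4 * 0.01)
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than sustainable"
```

The short window acts as a reset: once the incident is mitigated, the 5m ratio drops and the alert clears without waiting for the 1h window to drain.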
## SLIs, SLOs, and Error Budgets

Define what “reliable” means for your services:
| Service | SLI | SLO | Error Budget (30d) |
|---|---|---|---|
| API | Request success rate | 99.9% | 43.2 minutes |
| API | p99 latency < 200ms | 99.5% | 3.6 hours |
| Web | Page load < 2s | 99% | 7.2 hours |
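The budget column follows directly from the SLO: the allowed failure fraction multiplied by the measurement window. For the 99.9% API target over 30 days:

```latex
\text{budget} = (1 - \text{SLO}) \times T
             = (1 - 0.999) \times 30 \times 24 \times 60\,\text{min}
             = 43.2\,\text{min}
```

The same arithmetic gives 3.6 hours at 99.5% and 7.2 hours at 99%.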
### Error Budget Policy

When the error budget is exhausted:
- Freeze feature releases until budget recovers
- Prioritize reliability work in the sprint
- Conduct thorough postmortems for budget-burning incidents
- Increase monitoring coverage in affected areas
## Dashboards with Grafana

Effective dashboards follow the USE Method (Utilization, Saturation, Errors) for resources and the RED Method (Rate, Errors, Duration) for request-driven services:
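As a sketch of the RED method in practice, a service row needs only three queries. The panel layout below is hypothetical (it is not literal Grafana JSON), and the raw `http_requests_total` rate expression is an assumption about metric names; the other two reuse the recording rules defined earlier:

```yaml
# Hypothetical RED panel targets for one service row.
panels:
  - title: Request rate (Rate)
    expr: sum(rate(http_requests_total[5m])) by (job)
  - title: Error ratio (Errors)
    expr: job:http_requests:error_rate
  - title: p99 latency (Duration)
    expr: job:http_request_duration_seconds:p99
```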
### Dashboard Hierarchy

- **Overview dashboard** - High-level service health (SLO status)
- **Service dashboards** - Per-service detailed metrics
- **Infrastructure dashboards** - Node, pod, and cluster metrics
- **Debug dashboards** - Deep-dive panels for incident investigation
## Log Aggregation with Loki

Loki provides cost-effective log aggregation that integrates with Grafana:

```yaml
# Promtail configuration
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```
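If your pods emit structured JSON logs, Promtail pipeline stages can lift fields into labels at scrape time. A sketch, assuming the log lines carry a `level` field (keep promoted labels low-cardinality, since every label combination creates a new Loki stream):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    # ...service discovery and relabel config as above...
    pipeline_stages:
      # Parse each line as JSON and extract the assumed "level" field.
      - json:
          expressions:
            level: level
      # Promote the extracted value to a stream label.
      - labels:
          level:
```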
## Incident Response

Good observability accelerates incident response:

1. **Alert fires** - On-call engineer receives notification
2. **Triage** - Check overview dashboard for impact scope
3. **Investigate** - Correlate metrics, logs, and traces
4. **Mitigate** - Apply quick fix to restore service
5. **Resolve** - Deploy permanent fix
6. **Postmortem** - Document learnings and action items
## Conclusion
Observability is not a one-time setup; it’s a practice that evolves with your systems. Start with the basics, iterate on your dashboards, and continuously refine your alerts. The goal is to detect and resolve issues before your users notice them.