April 15, 2026  ·  12 min read

An observability baseline before the first incident

Designing observability during an incident is too late. The team is already guessing, dashboards disagree, and the logs are either too noisy or too thin to explain what changed. This is the baseline I want in place before traffic and operational complexity force the issue.

Pick the one path that actually matters

Do not begin by instrumenting everything. Pick a single path the business cares about and make it observable end to end. Once that one works, every other path follows the same pattern almost for free.

For an HTTP service, that path might be POST /payments, POST /checkout, or a queue worker that publishes settlements. Pick one and commit to it.

That path should let an engineer answer four questions in under a minute:

  1. Is the service healthy from the user's point of view?
  2. Which dependency or downstream system is slow or failing?
  3. Did the error start after a deployment or traffic change?
  4. How do I move from the symptom to the responsible code path?

If your baseline cannot answer those four, the team does not have observability yet. It only has tooling. The point of all the work below is to make those answers obvious.

One path, instrumented end-to-end, beats ten paths instrumented half-way.

Service-level metrics, chosen on purpose

Infrastructure metrics matter, but application metrics catch real pain earlier. Define a small, opinionated set first, and pick labels and buckets so that incident triage gets faster instead of just noisier.

The baseline I reach for on every HTTP service is six metrics, then nothing:

A basic prometheus_client setup in a Python service gets most of the foundation in place. The hard part is not exporting metrics; it's choosing labels and buckets that match the questions you actually ask during an incident.

import time
from flask import Flask, Response, request
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Histogram,
    generate_latest,
)

registry = CollectorRegistry()

http_duration = Histogram(
    'http_request_duration_seconds',
    'Duration of HTTP requests in seconds',
    labelnames=('route', 'method', 'status_code'),
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 5),
    registry=registry,
)

app = Flask(__name__)

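@app.before_request
def _start_timer():
    # perf_counter is monotonic, so clock adjustments cannot skew durations.
    request._start = time.perf_counter()

@app.after_request
def _record_duration(response):
    start = getattr(request, '_start', None)
    if start is None:
        # The timer hook did not run; skip the observation rather than raise.
        return response
    duration = time.perf_counter() - start
    # Label with the route template. Unmatched requests (404s) share one
    # value so arbitrary paths cannot inflate label cardinality.
    route = request.url_rule.rule if request.url_rule else 'unmatched'
    http_duration.labels(route, request.method, str(response.status_code)).observe(duration)
    return response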

@app.get('/metrics')
def metrics():
    return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)

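Two things are easy to get wrong. First, route templating: label requests with the route template (/users/<id> in Flask), never the raw path /users/12345, or your label cardinality explodes. Second, buckets: place the boundaries around the latencies you actually care about, such as your alert thresholds, rather than accepting the Prometheus defaults.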
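
To see why bucket placement matters, it helps to look at how a quantile is estimated from a histogram. The sketch below is a simplified, stdlib-only re-implementation of the linear interpolation that Prometheus documents for histogram_quantile; it is not the Prometheus source. The bucket bounds come from the snippet above, while the cumulative counts are made up for illustration.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with math.inf as the last bound.
    """
    if not 0 < q <= 1:
        raise ValueError('q must be in (0, 1]')
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: the best available
                # answer is the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical cumulative counts for the buckets defined above.
observed = [
    (0.05, 50), (0.1, 70), (0.25, 85), (0.5, 90),
    (1, 94), (2, 98), (5, 100), (math.inf, 100),
]
p95 = histogram_quantile(0.95, observed)
```

With these counts the 95th observation lands in the bucket between 1 s and 2 s, so the estimate is interpolated across a full second. If you plan to alert on a threshold inside that range, adding a boundary near the threshold makes the estimate far more trustworthy.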

Structured logs, tagged with the request

If logs are still plain strings, every incident becomes a text-search problem. Structured logs with stable fields let you join across the stack on request id, route, deploy, or customer, without guessing.

A logger like structlog is usually enough. The shape matters more than the library: every line carries a request id, a route, a method, and a service name.

import logging
import os
import uuid
import structlog
from flask import g, request

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt='iso'),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(
        getattr(logging, os.environ.get('LOG_LEVEL', 'INFO').upper())
    ),
)

base_log = structlog.get_logger().bind(
    service='payments-api',
    environment=os.environ.get('APP_ENV'),
)

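@app.before_request
def _attach_logger():
    # Reuse the caller's request id when present so logs join across services.
    request_id = request.headers.get('x-request-id') or str(uuid.uuid4())
    g.request_id = request_id
    # Bind the route template, not the raw path, so the field stays
    # low-cardinality and lines up with the metric label.
    route = request.url_rule.rule if request.url_rule else 'unmatched'
    g.log = base_log.bind(
        request_id=request_id,
        route=route,
        method=request.method,
    )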

@app.after_request
def _propagate_request_id(response):
    response.headers['x-request-id'] = g.get('request_id', '')
    return response

Now every log line in the request flow can be tied back to a single user, deploy window, or customer report without guessing. Pair this with a log pipeline that indexes on request_id and you've collapsed the “where did this start” problem to a single search.
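
As a rough illustration of that single search, here is a stdlib-only sketch that filters JSON log lines by request id. The function name events_for_request and the sample lines are hypothetical; a real pipeline would run the equivalent query in its log store instead.

```python
import json

def events_for_request(raw_lines, request_id):
    """Return parsed log events that carry the given request_id."""
    matches = []
    for raw in raw_lines:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            # Skip anything that is not JSON, such as a stray traceback line.
            continue
        if isinstance(event, dict) and event.get('request_id') == request_id:
            matches.append(event)
    return matches

# Made-up lines in the shape the JSON renderer above would produce.
sample_lines = [
    '{"event": "charge started", "request_id": "req-1", "route": "/payments"}',
    '{"event": "charge started", "request_id": "req-2", "route": "/payments"}',
    'Traceback (most recent call last):',
    '{"event": "charge failed", "request_id": "req-1", "route": "/payments"}',
]
trail = events_for_request(sample_lines, 'req-1')
```

Here trail holds the two events for req-1 in the order they were logged, which is the whole story of that request in one pass.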

A log line without context is a guess wearing a timestamp.

Trace the boundaries where latency compounds

Tracing earns its weight at the seams: database queries, HTTP calls to upstream APIs, queue producers and consumers, third-party services. The goal is not collecting every span. The goal is answering: where did the latency accumulate?

For AWS workloads, the spans I always want instrumented:

Use OpenTelemetry where you can. Most trace backends already accept its wire format (OTLP), which avoids lock-in. Sampling keeps cost in check: head-based sampling at around 10% is fine for most services, and you can raise the rate for paths that need every trace.
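
Head-based sampling means the keep-or-drop decision is made once, when the trace starts, and every service that sees the same trace id reaches the same answer. The sketch below shows that idea with the standard library only. In practice you would configure the ratio-based sampler your OpenTelemetry SDK provides rather than write your own; the names DEFAULT_RATIO, ROUTE_RATIOS, and should_sample are hypothetical.

```python
import secrets

DEFAULT_RATIO = 0.10                 # keep roughly 10% of traces
ROUTE_RATIOS = {'/payments': 1.0}    # hypothetical override: keep every trace

_LOW_64_BITS = (1 << 64) - 1

def sample_ratio_for(route):
    """Look up the sampling ratio for a route, falling back to the default."""
    return ROUTE_RATIOS.get(route, DEFAULT_RATIO)

def should_sample(trace_id, ratio):
    """Decide deterministically from the trace id alone.

    Comparing the low 64 bits against a threshold means any process that
    sees the same trace id makes the same decision.
    """
    threshold = int(ratio * (1 << 64))
    return (trace_id & _LOW_64_BITS) < threshold

def new_trace_id():
    """Generate a random 128-bit trace id."""
    return secrets.randbits(128)

trace_id = new_trace_id()
keep_checkout = should_sample(trace_id, sample_ratio_for('/checkout'))
keep_payment = should_sample(trace_id, sample_ratio_for('/payments'))
```

Because the decision depends only on the trace id and the ratio, raising the ratio for one route never drops a trace that the lower ratio would have kept.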

A trace without context is just a flame chart. Tag your spans with the same fields your logs carry (request_id, route, customer_id when you have it), and the trace becomes the join key between metrics and logs.
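
One way to keep logs and spans from drifting apart is to build the shared fields in a single place and hand the same dictionary to both. This is a minimal sketch under that assumption: request_context is a hypothetical helper, and the key names mirror the ones bound in the logging snippet earlier.

```python
from typing import Optional

def request_context(request_id: str, route: str,
                    customer_id: Optional[str] = None) -> dict:
    """Build the fields that both log lines and spans should carry."""
    fields = {
        'request_id': request_id,
        'route': route,
        'customer_id': customer_id,
    }
    # Drop fields that are not known yet instead of emitting nulls.
    return {key: value for key, value in fields.items() if value is not None}

ctx = request_context('req-1', '/payments')
# The same dictionary can then feed both sides, for example:
#   g.log = base_log.bind(**ctx)   # structlog
#   span.set_attributes(ctx)       # OpenTelemetry span, if you use one
```

Your tracing setup may prefer namespaced attribute keys; the point is only that both sides derive from one source.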

Alert on user pain, not infrastructure noise

A small, opinionated alert set tied to service behaviour beats a wall of CPU and disk pages every time. The bar is simple: every page wakes someone for a reason that maps to a customer experience.

The alerts I want on day one:

A Prometheus-style rule for latency is enough to start:

groups:
  - name: payments-api
    rules:
      - alert: HighRequestLatencyP95
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
          ) > 1.2
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p95 latency is above 1.2s for 10 minutes"

That is more actionable than a random CPU spike that never affected a customer. Two rules of thumb worth internalising:

First, page on symptoms, and only on sustained ones. A one-minute spike is a signal; a ten-minute degradation is a customer.

Second, every alert needs a linked dashboard and a linked runbook, even if the runbook is one paragraph. A page without somewhere to look is a page that gets snoozed.
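
The "sustained" rule can be made concrete with a small sketch. It mimics, in simplified form, what a for: clause does in a Prometheus-style rule: the condition must hold at every evaluation across the whole window before the alert fires. The function name sustained_breach and the sample data are hypothetical.

```python
def sustained_breach(samples, threshold, hold_seconds):
    """Return True if value stays above threshold for hold_seconds.

    samples: list of (timestamp_seconds, value), sorted by time, one entry
    per rule evaluation.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
            if timestamp - breach_started >= hold_seconds:
                return True
        else:
            # Any healthy evaluation resets the clock.
            breach_started = None
    return False

# One evaluation per minute; p95 latency in seconds.
one_minute_spike = [(t, 2.0 if t == 300 else 0.4) for t in range(0, 900, 60)]
ten_minute_degradation = [(t, 1.5) for t in range(0, 660, 60)]

spike_pages = sustained_breach(one_minute_spike, 1.2, 600)
degradation_pages = sustained_breach(ten_minute_degradation, 1.2, 600)
```

The spike never pages because the next evaluation is healthy again, while the degradation pages once it has held for the full ten minutes.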

Boring dashboards that answer one question

The first screen of a service dashboard should answer the four operational questions before the engineer scrolls. If a dashboard needs three tabs and a query language refresher, it is too slow for incident response.

What lives on the first screen:

What does not belong on the first screen: pod CPU, garbage collection counters, container restart rates. They're useful once you already suspect the issue is below the application layer, but they waste attention in the first thirty seconds of a page.

Two small habits make dashboards age well: annotate every deploy, and put a links panel at the top with the runbook, the source repo, and the on-call rotation. That single panel removes most of the “wait, where do I look” cost during a real incident.

Observability is part of system design

Treat the baseline as part of the same conversation as retries, timeouts, idempotency, rate limits, and deployment safety. It is not polish you add later when the team has time.

A service that ships without observability is a service the team will be afraid to change. That fear shows up everywhere: longer review cycles, slower deploys, more meetings about “risk.” Wiring the baseline in early is a one-time cost that pays compounding interest in confidence.

The standard I care about most: when production gets noisy, can an engineer move from symptom to responsible code path without guessing? If the answer is yes, the service is in a much better operational position than its uptime number alone would suggest.

A system you can observe is a system you can change. Without that, every deploy is a small act of faith.

The pre-launch checklist

Everything above, compressed to a single page. Use it before the service ships, then revisit it every quarter.