Pick the one path that actually matters
Do not begin by instrumenting everything. Pick a single path the business cares about and make it observable end to end. Once that one works, every other path follows the same pattern almost for free.
For an HTTP service, that path might be POST /payments, POST /checkout, or a queue worker that publishes settlements. Pick one and commit to it.
That path should let an engineer answer four questions in under a minute:
- Is the service healthy from the user's point of view?
- Which dependency or downstream system is slow or failing?
- Did the error start after a deployment or traffic change?
- How do I move from the symptom to the responsible code path?
If your baseline cannot answer those four, the team does not have observability yet. It only has tooling. The point of all the work below is to make those answers obvious.
One path, instrumented end-to-end, beats ten paths instrumented halfway.
Service-level metrics, chosen on purpose
Infrastructure metrics matter, but application metrics catch real pain earlier. Define a small, opinionated set first, and pick labels and buckets so that incident triage gets faster instead of just noisier.
The baseline I reach for on every HTTP service is six metrics, then nothing:
- request count (per route, per status code)
- error rate (4xx vs 5xx, separately)
- request latency (histogram, not average)
- dependency latency (per downstream)
- queue depth for any async path
- retry count for workers
A basic prometheus_client setup in a Python service gets most of the foundation in place. The hard part is not exporting metrics; it's choosing labels and buckets that match the questions you actually ask during an incident.
import time

from flask import Flask, Response, request
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Histogram,
    generate_latest,
)

registry = CollectorRegistry()

# Request latency histogram, labelled by templated route, method, and status.
http_duration = Histogram(
    'http_request_duration_seconds',
    'Duration of HTTP requests in seconds',
    labelnames=('route', 'method', 'status_code'),
    buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 5),
    registry=registry,
)

app = Flask(__name__)


@app.before_request
def _start_timer():
    # Stash the start time on the request so after_request can read it.
    request._start = time.perf_counter()


@app.after_request
def _record_duration(response):
    duration = time.perf_counter() - request._start
    # Label with the route template, never the raw path. Unmatched requests
    # (404s) share one label so scanners cannot inflate cardinality.
    route = request.url_rule.rule if request.url_rule else 'unmatched'
    http_duration.labels(route, request.method, str(response.status_code)).observe(duration)
    return response


@app.get('/metrics')
def metrics():
    # Expose the registry in the Prometheus text format.
    return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)
Two things are easy to get wrong. First, route templating: label with /users/:id, never /users/12345, or your label cardinality explodes. Second, buckets: pick them around the latencies you actually care about, not the Prometheus defaults.
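To make the bucket point concrete, here is a minimal sketch of the linear interpolation that a quantile estimate over histogram buckets relies on, roughly what histogram_quantile does inside a bucket. The sample latencies, the 1.2 s target, and the helper names (cumulative_counts, estimate_quantile) are assumptions made up for illustration; this is not Prometheus's own code.

```python
import math


def cumulative_counts(samples, bounds):
    """Count samples <= each upper bound, plus a final +Inf bucket."""
    edges = list(bounds) + [math.inf]
    return edges, [sum(1 for s in samples if s <= edge) for edge in edges]


def estimate_quantile(q, edges, counts):
    """Estimate a quantile by interpolating linearly inside one bucket.

    Assumes at least one finite bound and 0 <= q <= 1.
    """
    total = counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (edge, count) in enumerate(zip(edges, counts)):
        if count >= rank:
            break
    if math.isinf(edge):
        # The quantile sits in the +Inf bucket: report the highest finite bound.
        return edges[-2]
    lower = edges[i - 1] if i > 0 else 0.0
    below = counts[i - 1] if i > 0 else 0
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    return lower + (edge - lower) * (rank - below) / in_bucket


# Hypothetical traffic: 90 fast requests and 10 slow ones. True p95 is 1.1 s.
samples = [0.3] * 90 + [1.1] * 10

coarse = (0.05, 0.1, 0.25, 0.5, 1, 2, 5)       # no boundary near the target
fine = (0.05, 0.1, 0.25, 0.5, 1, 1.2, 2, 5)    # boundary at the 1.2 s target

coarse_p95 = estimate_quantile(0.95, *cumulative_counts(samples, coarse))
fine_p95 = estimate_quantile(0.95, *cumulative_counts(samples, fine))

print(round(coarse_p95, 3))  # 1.5
print(round(fine_p95, 3))    # 1.1
```

With the coarse layout the estimate lands at 1.5 s even though no request took longer than 1.1 s, which would trip a 1.2 s threshold for nothing. A boundary at the target keeps the estimate honest exactly where the decision is made.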
Structured logs, tagged with the request
If logs are still plain strings, every incident becomes a text-search problem. Structured logs with stable fields let you join across the stack on request id, route, deploy, or customer, without guessing.
A logger like structlog is usually enough. The shape matters more than the library: every line carries a request id, a route, a method, and a service name.
import logging
import os
import uuid

import structlog
from flask import g, request

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt='iso'),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    # Drop anything below LOG_LEVEL (default INFO) before it is rendered.
    wrapper_class=structlog.make_filtering_bound_logger(
        getattr(logging, os.environ.get('LOG_LEVEL', 'INFO').upper())
    ),
)

# Fields every log line carries, regardless of the request.
base_log = structlog.get_logger().bind(
    service='payments-api',
    environment=os.environ.get('APP_ENV'),
)


# `app` is the Flask app created in the metrics snippet above.
@app.before_request
def _attach_logger():
    # Reuse the caller's request id when present, otherwise mint one.
    request_id = request.headers.get('x-request-id') or str(uuid.uuid4())
    g.request_id = request_id
    g.log = base_log.bind(
        request_id=request_id,
        route=request.path,
        method=request.method,
    )


@app.after_request
def _propagate_request_id(response):
    # Echo the id back so clients and upstream proxies can quote it.
    response.headers['x-request-id'] = g.get('request_id', '')
    return response
Now every log line in the request flow can be tied back to a single user, deploy window, or customer report without guessing. Pair this with a log pipeline that indexes on request_id and you've collapsed the “where did this start” problem to a single search.
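Code that runs outside a Flask request, such as a queue worker, has no g to hang the logger on. One possible way to keep the request id flowing there is a context variable plus a logging filter. This is a stdlib-only sketch, not the structlog setup above; the logger name payments.worker, the id req-123, and the class names are hypothetical.

```python
import contextvars
import io
import json
import logging

# Holds the current request id for whatever task or thread is running.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request id onto every record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with stable field names."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "request_id": record.request_id,
            "service": "payments-api",
        })


def build_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("payments.worker")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger(stream)

# A worker would set this from the id carried in the message it consumes.
token = request_id_var.set("req-123")
log.info("settlement published")
request_id_var.reset(token)

print(stream.getvalue().strip())
```

The call sites stay clean: nothing has to pass the id around by hand, and the field name matches what the HTTP layer emits.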
A log line without context is a guess wearing a timestamp.
Trace the boundaries where latency compounds
Tracing earns its weight at the seams: database queries, HTTP calls to upstream APIs, queue producers and consumers, third-party services. The goal is not collecting every span. The goal is answering: where did the latency accumulate?
For AWS workloads, the spans I always want instrumented:
- inbound HTTP requests
- PostgreSQL or DynamoDB calls
- SQS publish and consume paths
- outbound calls to payment, auth, or notification providers
- cross-service HTTP between your own services
Use OpenTelemetry where you can. Most trace backends already accept its wire format, OTLP, so it avoids lock-in. Sampling keeps cost in check; head-based sampling at around 10% is fine for most services, and you can raise it for paths that need every trace.
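The sampling idea can be sketched without any tracing library. This is an illustration of a head-based, ratio-style decision with a per-route override, assuming a 32-character hex trace id; the ROUTE_RATIOS table and the should_sample helper are hypothetical, and a real setup would lean on the sampler that ships with your tracing SDK.

```python
import secrets

DEFAULT_RATIO = 0.10                 # about 10% of traces by default
ROUTE_RATIOS = {"/payments": 1.0}    # hypothetical: keep every payment trace


def should_sample(trace_id_hex, route):
    """Decide at the start of a trace whether to keep it.

    The decision depends only on the trace id and the route, so every
    service that sees the same id and ratio reaches the same answer.
    """
    ratio = ROUTE_RATIOS.get(route, DEFAULT_RATIO)
    # Treat the low 64 bits of the id as a uniform number in [0, 2**64).
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < ratio * 2**64


trace_id = secrets.token_hex(16)     # 128-bit id as 32 hex characters
print(should_sample(trace_id, "/payments"))   # True
print(should_sample(trace_id, "/health"))
```

Deriving the decision from the id rather than from a fresh random draw matters: it keeps a trace whole instead of sampling some of its spans and dropping others.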
A trace without context is just a flame chart. Tag your spans with the same fields your logs carry (request_id, route, customer_id when you have it), and the trace becomes the join key between metrics and logs.
Alert on user pain, not infrastructure noise
A small, opinionated alert set tied to service behaviour beats a wall of CPU and disk pages every time. The bar is simple: every page wakes someone for a reason that maps to a customer experience.
The alerts I want on day one:
- sustained increase in 5xx responses
- p95 latency above the agreed threshold
- queue lag or queue depth beyond normal range
- worker retry spikes
- per-dependency failure rate when one upstream is unhealthy
A Prometheus-style rule for latency is enough to start:
groups:
  - name: payments-api
    rules:
      - alert: HighRequestLatencyP95
        expr: histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)) > 1.2
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p95 latency is above 1.2s for 10 minutes"
That rule is more actionable than a random CPU spike that never affected a customer. Two rules of thumb worth internalising:
First, page on symptoms, and only on sustained ones. A one-minute spike is a signal; a ten-minute degradation is a customer.
Second, every alert needs a linked dashboard and a linked runbook, even if the runbook is one paragraph. A page without somewhere to look is a page that gets snoozed.
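The "sustained, not spiky" rule is easy to state and easy to get wrong, so here is a small sketch of the idea. It loosely mirrors what a for: clause buys you and is not the alerting engine's actual logic; the sustained_breach helper, the timestamps, and the values are made up for illustration.

```python
from datetime import datetime, timedelta


def sustained_breach(samples, threshold, hold=timedelta(minutes=10)):
    """Return True if the value has stayed above threshold for at least `hold`.

    `samples` is a list of (timestamp, value) pairs in ascending time order.
    Any sample at or below the threshold resets the clock.
    """
    breach_started = None
    for ts, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = ts
        else:
            breach_started = None
    if breach_started is None:
        return False
    return samples[-1][0] - breach_started >= hold


t0 = datetime(2024, 1, 1, 12, 0)

# One bad minute in an otherwise healthy window: a signal, not a page.
spike = [(t0 + timedelta(minutes=m), 2.0 if m == 3 else 0.4) for m in range(12)]

# p95 above 1.2 s for ten straight minutes: a customer.
degraded = [(t0 + timedelta(minutes=m), 1.5) for m in range(11)]

print(sustained_breach(spike, 1.2))     # False
print(sustained_breach(degraded, 1.2))  # True
```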
Boring dashboards that answer one question
The first screen of a service dashboard should answer the four operational questions before the engineer scrolls. If a dashboard needs three tabs and a query language refresher, it is too slow for incident response.
What lives on the first screen:
- traffic, error rate, and latency over the last 24h with deploy markers
- per-route breakdown to spot which endpoint is responsible
- per-dependency latency to spot when one upstream is slow
- queue depth and worker retry rate, if the service is async
What does not belong on the first screen: pod CPU, garbage collection counters, container restart rates. They're useful once you already suspect the issue is below the application layer, but they waste attention in the first thirty seconds of a page.
Two small habits make dashboards age well: annotate every deploy, and put a links panel at the top with the runbook, the source repo, and the on-call rotation. That single panel removes 80% of the “wait, where do I look” cost during a real incident.
Observability is part of system design
Treat the baseline as part of the same conversation as retries, timeouts, idempotency, rate limits, and deployment safety. It is not polish you add later when the team has time.
A service that ships without observability is a service the team will be afraid to change. That fear shows up everywhere: longer review cycles, slower deploys, more meetings about “risk.” Wiring the baseline in early is a one-time cost that pays compounding interest in confidence.
The standard I care about most: when production gets noisy, can an engineer move from symptom to responsible code path without guessing? If the answer is yes, the service is in a much better operational position than its uptime number alone would suggest.
A system you can observe is a system you can change. Without that, every deploy is a small act of faith.
The pre-launch checklist
Everything above, compressed to a single page. Use it before the service ships, then revisit it every quarter.
- Pick one path. Instrument the business-critical path end-to-end before touching anything else.
- Six metrics. Requests, errors, latency, dependency latency, queue depth, retries. Then stop.
- Templated routes. Always /users/:id, never /users/12345. Cardinality survives.
- SLO-shaped buckets. Choose histogram buckets around your real targets, not the defaults.
- Structured logs. JSON. request_id, route, method, service. Always.
- Request id flows. Header in, header out, attached to every log and span.
- Trace boundaries. HTTP, DB, queues, third parties. Sample, don't starve.
- Symptom alerts. 5xx, p95, queue lag. Sustained, not spiky. CPU is not a page.
- Linked runbook. Every alert points to a dashboard and a paragraph of guidance.
- Above-the-fold. A handful of tiles answer the four questions before any scrolling.
- Deploy markers. Annotate every deploy on every service dashboard. Cheap, decisive.
- Definition of done. Observability ships with the service. Not in a follow-up ticket.