Prometheus

Prometheus is the de facto standard monitoring system for Kubernetes. It scrapes metrics from targets over HTTP, stores them as time series in a local TSDB, evaluates recording and alerting rules, and sends firing alerts to Alertmanager, which handles notification delivery.

The Kubernetes ecosystem is built around Prometheus. kube-state-metrics, node-exporter, kubelet, etcd, CoreDNS, and nearly every major CNCF project expose Prometheus-compatible metrics. This means your monitoring stack is mostly wiring existing endpoints together rather than building instrumentation from scratch.

Architecture

flowchart TD
    subgraph Data Sources
        App[Application\n/metrics endpoint]
        KSM[kube-state-metrics]
        NE[node-exporter]
        Kubelet[kubelet\ncadvisor metrics]
    end
    subgraph Prometheus
        Scraper[Scrape engine\npull-based]
        TSDB[(Local TSDB\n15d default)]
        Rules[Rule evaluator\nrecording + alerting]
    end
    subgraph Alerting
        AM[Alertmanager\nrouting + dedup]
        Slack[Slack]
        PD[PagerDuty]
        Email[Email]
    end
    subgraph Long-term
        Thanos[Thanos / Cortex\n/ VictoriaMetrics]
    end

    App --> Scraper
    KSM --> Scraper
    NE --> Scraper
    Kubelet --> Scraper
    Scraper --> TSDB
    TSDB --> Rules
    Rules --> AM
    AM --> Slack
    AM --> PD
    AM --> Email
    TSDB --> Thanos

Prometheus is pull-based - it fetches metrics from each target on a schedule, rather than waiting for targets to push data to it. The implication: Prometheus needs network access to every target, but targets don't need to know where Prometheus lives.
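
For reference, a minimal hand-written scrape configuration shows the pull model in its simplest form; the job name and target address are illustrative, and in Kubernetes the Operator normally generates this configuration from ServiceMonitors (see below):

# prometheus.yml - minimal static scrape config (illustrative target)
global:
  scrape_interval: 15s                 # how often Prometheus pulls from each target

scrape_configs:
  - job_name: api                      # becomes the "job" label on every scraped series
    metrics_path: /metrics
    static_configs:
      - targets: ["api.production.svc:8080"]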

Data model

Every metric is a time series identified by a name and a set of key-value labels:

http_requests_total{method="POST", status="200", handler="/api/users"} 1234 @timestamp

The four metric types:

Type      | Behavior                           | Use for
Counter   | monotonically increasing           | requests, errors, bytes transferred
Gauge     | can go up or down                  | current connections, memory, queue depth
Histogram | samples observations into buckets  | request duration, response size
Summary   | pre-calculated quantiles           | client-side latency percentiles (less flexible than histograms)

Prefer histograms over summaries for latency. Histogram quantiles are calculated at query time from raw bucket data, so you can change the quantile you care about after collection. Summary quantiles are fixed at instrumentation time.
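
As a sketch of that flexibility, the same bucket series can answer any quantile at query time - here two recording rules (covered later) compute p50 and p99 from identical inputs; the rule names are illustrative:

groups:
  - name: latency.rules
    rules:
      - record: job:http_request_duration_seconds:p50
        expr: |
          histogram_quantile(0.50,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

A summary would have required choosing those quantiles at instrumentation time.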

PromQL

PromQL is a functional query language for selecting and aggregating time series.

Instant and range vectors

# instant vector  -  current value of all time series matching the selector
http_requests_total{status="200"}

# range vector  -  all samples in the last 5 minutes
http_requests_total{status="200"}[5m]

Rate and increase

Always use rate() on counters, not raw values. Counters reset on restart; rate() handles resets correctly.

# per-second rate of HTTP requests over last 5 minutes
rate(http_requests_total[5m])

# total requests in the last hour (useful for SLO burn rate)
increase(http_requests_total[1h])

Aggregation

# total request rate across all pods in the production namespace
sum(rate(http_requests_total{namespace="production"}[5m]))

# request rate per handler, across all pods
sum by (handler) (rate(http_requests_total{namespace="production"}[5m]))

# 99th percentile latency
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
)

Essential Kubernetes queries

# CPU usage per pod as a fraction of the limit
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
  / sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, namespace)

# Memory usage vs request
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  / sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory"})

# Pod restart rate (last 1h)
increase(kube_pod_container_status_restarts_total[1h]) > 0

# Containers whose last termination was OOMKilled
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

# Nodes not ready
kube_node_status_condition{condition="Ready", status="true"} == 0

# Deployment rollout progress
kube_deployment_status_replicas_available / kube_deployment_spec_replicas

Prometheus Operator

The Prometheus Operator is the standard way to run Prometheus in Kubernetes. It introduces CRDs that let you manage Prometheus, Alertmanager, and their configuration as Kubernetes objects.

flowchart LR
    PO[Prometheus Operator] --> |watches| SM[ServiceMonitor]
    PO --> |watches| PM[PodMonitor]
    PO --> |watches| PR[PrometheusRule]
    PO --> |generates config| Prom[Prometheus]
    SM --> |scrape target| Svc[Service]
    PM --> |scrape target| Pod[Pod]
    PR --> |loaded as| Alert[Alerting rules]
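
The selectors that tie these objects together live on the Prometheus custom resource itself. A minimal sketch - names and label values are illustrative, and match the team: platform label used in the examples below:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 2
  retention: 15d
  serviceAccountName: prometheus
  serviceMonitorSelector:          # picks up ServiceMonitors labeled team: platform
    matchLabels:
      team: platform
  podMonitorSelector: {}           # empty selector matches every PodMonitor
  ruleSelector:                    # picks up PrometheusRules labeled team: platform
    matchLabels:
      team: platform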

ServiceMonitor

Tells Prometheus which services to scrape:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server
  namespace: monitoring
  labels:
    team: platform          # must match Prometheus.spec.serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: api              # matches Service labels
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
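
The port: metrics above refers to a named port on the matched Service, so that Service needs to declare one. A matching sketch with illustrative names:

apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  labels:
    app: api                 # matched by the ServiceMonitor's selector
spec:
  selector:
    app: api
  ports:
    - name: metrics          # must match ServiceMonitor endpoints[].port
      port: 8080
      targetPort: 8080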

PodMonitor

Scrapes pods directly, without requiring a Service:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: monitoring
spec:
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      monitoring: "true"
  podMetricsEndpoints:
    - port: metrics
      interval: 30s

PrometheusRule

Define alerting and recording rules as a Kubernetes resource:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
  labels:
    team: platform
spec:
  groups:
    - name: api.rules
      interval: 1m
      rules:
        - record: job:http_requests:rate5m
          expr: sum(rate(http_requests_total[5m])) by (job)

        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job) > 0.05
          for: 5m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "High error rate on {{ $labels.job }}"
            description: "Error rate is {{ $value | humanizePercentage }} for job {{ $labels.job }}"

Recording rules

Recording rules pre-compute expensive queries and store the result as a new time series. Use them for:

  • queries used in dashboards (run once, read many times)
  • high-cardinality aggregations referenced in alerts
  • multi-step alert expressions

rules:
  - record: namespace:container_cpu_usage:rate5m
    expr: |
      sum by (namespace) (
        rate(container_cpu_usage_seconds_total{container!=""}[5m])
      )

Name recording rules with the convention level:metric:operations - it makes the hierarchy obvious.
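
For example, a cluster-level rule can be built directly from the namespace-level rule above, and the names make that chain readable (the new rule name is illustrative):

rules:
  - record: cluster:container_cpu_usage:rate5m
    expr: sum(namespace:container_cpu_usage:rate5m)   # aggregates the recorded series above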

Alertmanager

Alertmanager handles routing, deduplication, inhibition, and silencing of alerts from Prometheus.

Routing tree

route:
  group_by: ["alertname", "namespace"]
  group_wait: 30s       # wait before sending the first notification
  group_interval: 5m    # wait before sending an update for an existing group
  repeat_interval: 4h   # wait before re-sending a notification for an alert that is still firing
  receiver: slack-platform

  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      continue: false

    - match_re:
        namespace: "^finance-.*"
      receiver: slack-finance
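
The receivers referenced in the route are defined alongside it. A sketch with illustrative channels and placeholder credentials:

receivers:
  - name: slack-platform
    slack_configs:
      - channel: "#platform-alerts"
        api_url: https://hooks.slack.com/services/XXX    # illustrative placeholder
        send_resolved: true
  - name: slack-finance
    slack_configs:
      - channel: "#finance-alerts"
        api_url: https://hooks.slack.com/services/YYY    # illustrative placeholder
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>         # illustrative placeholder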

Inhibition

Suppress lower-severity alerts when a higher-severity alert is firing for the same target:

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["namespace", "job"]

This prevents alert floods when a service is completely down - you get one critical alert, not ten warnings about symptoms.

Silencing

Silence alerts during planned maintenance:

amtool silence add alertname="HighErrorRate" namespace="production" \
  --duration 2h \
  --comment "Planned maintenance window"
amtool silence query
amtool silence expire <id>

kube-prometheus-stack

The kube-prometheus-stack Helm chart is the standard way to deploy the full monitoring stack:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f values.yaml

It bundles Prometheus Operator, Prometheus, Alertmanager, Grafana, kube-state-metrics, node-exporter, and a set of default recording rules and dashboards.
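
A minimal values.yaml sketch to pair with the install command above - the keys follow the chart's layout, the values are illustrative:

prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
    # pick up ServiceMonitors/PodMonitors regardless of Helm release labels
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

alertmanager:
  alertmanagerSpec:
    replicas: 2

grafana:
  adminPassword: change-me          # illustrative; source from a Secret in practice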

Long-term storage

Prometheus's local TSDB has a default 15-day retention window. For longer retention and multi-cluster federation:

Thanos - queries multiple Prometheus instances and stores data in object storage (S3, GCS, Azure Blob). The sidecar mode attaches to each Prometheus and uploads completed blocks.
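
With the Prometheus Operator, the Thanos sidecar can be enabled on the Prometheus resource itself; a sketch, assuming the bucket configuration lives in a Secret (names are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: main
  namespace: monitoring
spec:
  retention: 15d
  thanos:
    objectStorageConfig:       # Secret holding the Thanos object-storage config
      name: thanos-objstore
      key: objstore.yml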

VictoriaMetrics - drop-in Prometheus-compatible replacement with better compression, faster ingestion, and built-in clustering. Simpler operationally than Thanos.

Cortex / Mimir - horizontally scalable, multi-tenant Prometheus-compatible storage. Mimir is the successor to Cortex, backs Grafana Cloud's hosted metrics, and is a common choice in large organizations.

Cardinality management

High cardinality destroys Prometheus performance. The most common causes:

  • labels with unbounded values: user IDs, request IDs, IP addresses, pod names with hashes
  • recording every HTTP path as a label (use pattern matching or drop high-cardinality paths)
  • short-lived jobs pushing to Pushgateway without cleanup

# Find high-cardinality metrics
curl -s 'http://prometheus:9090/api/v1/label/__name__/values' | jq '.data | length'
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))' \
  | jq '.data.result[] | {metric: .metric.__name__, count: .value[1]}'

Drop unnecessary labels and whole metric families at scrape time with metricRelabelings in the ServiceMonitor.
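
A sketch of that, dropping an unbounded label and an entire noisy metric at scrape time - the label and metric names are illustrative:

endpoints:
  - port: metrics
    metricRelabelings:
      - action: labeldrop                # drop a high-cardinality label from every series
        regex: request_id
      - sourceLabels: [__name__]         # drop an entire metric family
        regex: http_request_size_bytes_bucket
        action: drop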