Prometheus¶
Prometheus is the de facto standard monitoring system for Kubernetes. It scrapes metrics from targets over HTTP, stores them as time series in a local database, evaluates recording and alerting rules, and forwards firing alerts to Alertmanager, which handles notification delivery.
The Kubernetes ecosystem is built around Prometheus. kube-state-metrics, node-exporter, kubelet, etcd, CoreDNS, and nearly every major CNCF project expose Prometheus-compatible metrics. This means your monitoring stack is mostly wiring existing endpoints together rather than building instrumentation from scratch.
Architecture¶
flowchart TD
subgraph Data Sources
App[Application\n/metrics endpoint]
KSM[kube-state-metrics]
NE[node-exporter]
Kubelet[kubelet\ncadvisor metrics]
end
subgraph Prometheus
Scraper[Scrape engine\npull-based]
TSDB[(Local TSDB\n15d default)]
Rules[Rule evaluator\nrecording + alerting]
end
subgraph Alerting
AM[Alertmanager\nrouting + dedup]
Slack[Slack]
PD[PagerDuty]
Email[Email]
end
subgraph Long-term
Thanos[Thanos / Cortex\n/ VictoriaMetrics]
end
App --> Scraper
KSM --> Scraper
NE --> Scraper
Kubelet --> Scraper
Scraper --> TSDB
TSDB --> Rules
Rules --> AM
AM --> Slack
AM --> PD
AM --> Email
TSDB --> Thanos
Prometheus is pull-based - it fetches metrics from targets on a schedule, unlike push-based systems such as StatsD or Graphite where clients send data in. The implications: Prometheus needs network access to every target, and targets don't need to know where Prometheus lives.
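A practical payoff of the pull model: Prometheus records a synthetic `up` series for every scrape target (1 = scrape succeeded, 0 = failed), so dead or unreachable targets are directly queryable:

```promql
# targets that failed their most recent scrape
up == 0
```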
Data model¶
Every metric is a time series identified by a name and a set of key-value labels:
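For example, one series produced by an instrumented service might look like this (label values are illustrative):

```
http_requests_total{method="POST", handler="/api/orders", status="500"}  1027
```

The metric name plus the complete label set identifies the series; a different value for any label is a different series.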
The four metric types:
| Type | Behavior | Use for |
|---|---|---|
| Counter | monotonically increasing | requests, errors, bytes transferred |
| Gauge | can go up or down | current connections, memory, queue depth |
| Histogram | samples observations into buckets | request duration, response size |
| Summary | pre-calculated quantiles client-side | latency percentiles (less flexible than histograms) |
Prefer histograms over summaries for latency. Histogram quantiles are calculated at query time from raw bucket data, so you can change the quantile you care about after collection. Summary quantiles are fixed at instrumentation time.
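To make "calculated at query time" concrete, here is a rough Python sketch of the linear interpolation `histogram_quantile` applies to cumulative buckets (a simplified illustration, not the actual Prometheus implementation):

```python
def histogram_quantile(q, buckets):
    """Approximate quantile q from cumulative histogram buckets,
    mimicking the linear interpolation PromQL's histogram_quantile uses.
    buckets: ascending list of (upper_bound, cumulative_count)."""
    rank = q * buckets[-1][1]          # target rank among all observations
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # quantile falls in the +Inf bucket: return last finite bound
                return prev_bound
            # linear interpolation within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# 100 observations: 50 at or under 0.1s, 90 at or under 0.5s, all under +Inf
buckets = [(0.1, 50), (0.5, 90), (float("inf"), 100)]
print(histogram_quantile(0.90, buckets))   # → 0.5
print(histogram_quantile(0.99, buckets))   # → 0.5 (p99 falls in the +Inf bucket)
```

Because only raw bucket counts are stored, the same data answers any quantile - which is exactly why summaries, with their fixed client-side quantiles, are less flexible.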
PromQL¶
PromQL is a functional query language for selecting and aggregating time series.
Instant and range vectors¶
# instant vector - current value of all time series matching the selector
http_requests_total{status="200"}
# range vector - all samples in the last 5 minutes
http_requests_total{status="200"}[5m]
Rate and increase¶
Always use rate() on counters, not raw values. Counters reset on restart; rate() handles resets correctly.
# per-second rate of HTTP requests over last 5 minutes
rate(http_requests_total[5m])
# total requests in the last hour (useful for SLO burn rate)
increase(http_requests_total[1h])
Aggregation¶
# total request rate across all pods in the production namespace
sum(rate(http_requests_total{namespace="production"}[5m]))
# request rate per handler, across all pods
sum by (handler) (rate(http_requests_total{namespace="production"}[5m]))
# 99th percentile latency
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
)
Essential Kubernetes queries¶
# CPU usage per pod as a fraction of its limit (1.0 = at the limit)
sum by (pod, namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  / sum by (pod, namespace) (kube_pod_container_resource_limits{resource="cpu"})
# Memory usage vs request (aggregated so label sets match on both sides)
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  / sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory"})
# Pods with container restarts in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0
# Containers whose most recent termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
# Nodes not ready
kube_node_status_condition{condition="Ready", status="true"} == 0
# Deployment rollout progress
kube_deployment_status_replicas_available / kube_deployment_spec_replicas
Prometheus Operator¶
The Prometheus Operator is the standard way to run Prometheus in Kubernetes. It introduces CRDs that let you manage Prometheus, Alertmanager, and their configuration as Kubernetes objects.
flowchart LR
PO[Prometheus Operator] --> |watches| SM[ServiceMonitor]
PO --> |watches| PM[PodMonitor]
PO --> |watches| PR[PrometheusRule]
PO --> |generates config| Prom[Prometheus]
SM --> |scrape target| Svc[Service]
PM --> |scrape target| Pod[Pod]
PR --> |loaded as| Alert[Alerting rules]
ServiceMonitor¶
Tells Prometheus which services to scrape:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server
  namespace: monitoring
  labels:
    team: platform # must match Prometheus.spec.serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: api # matches Service labels
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
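The `port: metrics` endpoint refers to a named port on the selected Service. A Service this monitor would pick up might look like this (a sketch; names and port numbers are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  labels:
    app: api            # matched by the ServiceMonitor's selector
spec:
  selector:
    app: api
  ports:
    - name: metrics     # matched by endpoints[].port
      port: 9090
      targetPort: 9090
```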
PodMonitor¶
Scrapes pods directly, without requiring a Service:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-jobs
  namespace: monitoring
spec:
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      monitoring: "true"
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
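Here `port: metrics` names a container port rather than a Service port, so matching pods must declare it. An illustrative pod template fragment (image name is a placeholder):

```yaml
# pod template fragment, e.g. from a Job or Deployment
metadata:
  labels:
    monitoring: "true"          # matched by the PodMonitor's selector
spec:
  containers:
    - name: worker
      image: example/worker:1.0
      ports:
        - name: metrics         # matched by podMetricsEndpoints[].port
          containerPort: 9090
```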
PrometheusRule¶
Define alerting and recording rules as a Kubernetes resource:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-alerts
  namespace: monitoring
  labels:
    team: platform
spec:
  groups:
    - name: api.rules
      interval: 1m
      rules:
        - record: job:http_requests:rate5m
          expr: sum(rate(http_requests_total[5m])) by (job)
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
              / sum(rate(http_requests_total[5m])) by (job) > 0.05
          for: 5m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "High error rate on {{ $labels.job }}"
            description: "Error rate is {{ $value | humanizePercentage }} for job {{ $labels.job }}"
Recording rules¶
Recording rules pre-compute expensive queries and store the result as a new time series. Use them for:

- queries used in dashboards (run once, read many times)
- high-cardinality aggregations referenced in alerts
- multi-step alert expressions
rules:
  - record: namespace:container_cpu_usage:rate5m
    expr: |
      sum by (namespace) (
        rate(container_cpu_usage_seconds_total{container!=""}[5m])
      )
Name recording rules with the convention level:metric:operations - it makes the hierarchy obvious.
Alertmanager¶
Alertmanager handles routing, deduplication, inhibition, and silencing of alerts from Prometheus.
Routing tree¶
route:
  group_by: ["alertname", "namespace"]
  group_wait: 30s       # wait before sending the first notification for a new group
  group_interval: 5m    # wait before sending an update for an existing group
  repeat_interval: 4h   # wait before re-notifying for a still-firing alert
  receiver: slack-platform
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      continue: false
    - match_re:
        namespace: "^finance-.*"
      receiver: slack-finance
Inhibition¶
Suppress lower-severity alerts when a higher-severity alert is firing for the same target:
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["namespace", "job"]
This prevents alert floods when a service is completely down - you get one critical alert, not ten warnings about symptoms.
Silencing¶
Silence alerts during planned maintenance:
amtool silence add alertname="HighErrorRate" namespace="production" \
--duration 2h \
--comment "Planned maintenance window"
amtool silence query
amtool silence expire <id>
kube-prometheus-stack¶
The kube-prometheus-stack Helm chart is the standard way to deploy the full monitoring stack:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
-f values.yaml
It bundles Prometheus Operator, Prometheus, Alertmanager, Grafana, kube-state-metrics, node-exporter, and a set of default recording rules and dashboards.
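One common surprise: by default the chart's Prometheus only discovers ServiceMonitors and PodMonitors that carry the Helm release label. A minimal values.yaml sketch that relaxes this and extends retention (key names are from the chart's documented values; verify against your chart version):

```yaml
prometheus:
  prometheusSpec:
    retention: 30d
    # discover all ServiceMonitors/PodMonitors, not only the chart's own
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
```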
Long-term storage¶
Prometheus's local TSDB has a default 15-day retention window. For longer retention and multi-cluster federation:
Thanos - queries multiple Prometheus instances and stores data in object storage (S3, GCS, Azure Blob). The sidecar mode attaches to each Prometheus and uploads completed blocks.
VictoriaMetrics - drop-in Prometheus-compatible replacement with better compression, faster ingestion, and built-in clustering. Simpler operationally than Thanos.
Cortex / Mimir - horizontally scalable, multi-tenant Prometheus-compatible storage. Mimir is Grafana Labs' successor to Cortex and backs Grafana Cloud; a common choice in large multi-tenant deployments.
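For the Thanos sidecar route, each Prometheus needs an object storage configuration; a minimal S3 sketch (bucket and endpoint are placeholders):

```yaml
# objstore.yml passed to the Thanos sidecar
type: S3
config:
  bucket: metrics-blocks
  endpoint: s3.us-east-1.amazonaws.com
  # credentials typically come from env vars or IAM roles
```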
Cardinality management¶
High cardinality destroys Prometheus performance. The most common causes:
- labels with unbounded values: user IDs, request IDs, IP addresses, pod names with hashes
- recording every HTTP path as a label (use pattern matching or drop high-cardinality paths)
- short-lived jobs pushing to Pushgateway without cleanup
# Count distinct metric names
curl -s 'http://prometheus:9090/api/v1/label/__name__/values' | jq '.data | length'
# Top 10 metrics by series count (query must be URL-encoded)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))' \
  | jq '.data.result[] | {metric: .metric.__name__, count: .value[1]}'
Drop unnecessary labels at ingestion time with metricRelabelings in the ServiceMonitor; plain relabelings act on targets before the scrape, while metric relabelings rewrite the scraped samples themselves.
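For example, a ServiceMonitor endpoint can strip an unbounded label or drop noisy series before ingestion (label and path values are illustrative):

```yaml
endpoints:
  - port: metrics
    metricRelabelings:
      - action: labeldrop       # remove this label from every series
        regex: request_id
      - action: drop            # drop whole series whose path matches
        sourceLabels: [path]
        regex: /healthz.*
```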