
Collect and send PostgreSQL query performance metrics via pg_stat_statements #175

@iAziz786


Summary

Add query performance monitoring by collecting aggregate stats from pg_stat_statements and sending them as a new postgresql.statements metric type to the selfhost control plane. This enables users to identify slow queries, track execution trends, and set alerts on query performance thresholds.

Problem

Currently, hostlink collects system-level PostgreSQL metrics (connections, cache hit ratio, TPS, replication lag) but has no visibility into which queries are slow or resource-intensive. Users cannot answer:

  • Which queries take the most total time?
  • Which queries have the highest average execution time?
  • Which queries are called most frequently and might benefit from optimization?

Approach

Inspired by New Relic's query performance monitoring: collect aggregate pg_stat_statements data per query fingerprint, use delta-based collection (per-interval, not cumulative), and report the top 20 queries per tick.

Requirements

Hostlink (this repo)

Domain — new metric type

  • Add MetricTypePostgreSQLStatements = "postgresql.statements" constant in domain/metrics/metrics.go alongside existing constants (MetricTypeSystem, MetricTypeNetwork, MetricTypePostgreSQLDatabase, MetricTypeStorage)
  • Add QuerySample struct in domain/metrics/metrics.go:
    type QuerySample struct {
        QueryID             string  `json:"query_id"`
        Query               string  `json:"query"`
        Calls               int64   `json:"calls"`
        TotalExecTimeMs     float64 `json:"total_exec_time_ms"`
        MeanExecTimeMs      float64 `json:"mean_exec_time_ms"`
        MinExecTimeMs       float64 `json:"min_exec_time_ms"`
        MaxExecTimeMs       float64 `json:"max_exec_time_ms"`
        Rows                int64   `json:"rows"`
        SharedBlksHit       int64   `json:"shared_blks_hit"`
        SharedBlksRead      int64   `json:"shared_blks_read"`
        TempBlksWritten     int64   `json:"temp_blks_written"`
        WALBytes            int64   `json:"wal_bytes"`
    }
  • Add PostgreSQLStatementMetrics wrapper struct:
    type PostgreSQLStatementMetrics struct {
        Queries []QuerySample `json:"queries"`
    }

Collector — update interface and implementation

  • Extend pgmetrics.Collector interface in internal/pgmetrics/collector.go with a new method: CollectStatements(credential.Credential) (metrics.PostgreSQLStatementMetrics, error) (avoids breaking the existing Collect signature)
  • Add internal state for statement delta tracking to pgmetrics struct: lastStatements map[string]pgStatementStats and lastStatementsTime time.Time
  • Implement collectStatementMetrics():
    SELECT queryid::text, query, calls,
           total_exec_time, min_exec_time, max_exec_time, mean_exec_time,
           rows, shared_blks_hit, shared_blks_read, temp_blks_written, wal_bytes
    FROM pg_stat_statements
    WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database())
    ORDER BY total_exec_time DESC
    LIMIT 20
  • Delta-based calculation: store previous snapshot keyed by queryid, compute per-interval deltas for calls, total_exec_time, recompute mean_exec_time from deltas
  • First collection returns empty Queries slice — establishes baseline
  • Gracefully skip if pg_stat_statements extension is not available (check error code, log warning, return empty metrics — do NOT fail the push)

Pusher — wire into metricspusher

  • Update NewWithDependencies() in app/services/metrics/metrics.go to accept the extended pgmetrics.Collector (already the dependency — just needs to call the new method)
  • In Push(), call mp.metricscollector.CollectStatements(cred) after existing PG collection
  • Append MetricSet{Type: MetricTypePostgreSQLStatements, Metrics: stmtMetrics} to metricSets
  • Include even when Queries is empty (signals extension available but no activity)

Tests

  • Unit tests for delta calculation: first collection → empty, subsequent → correct deltas, query appearing/disappearing between collections, stats reset
  • Unit test for graceful handling when extension is missing (mock returns error, verify empty result returned without propagation)
  • Integration test with PostgreSQL container (requires shared_preload_libraries = 'pg_stat_statements') — verify actual queries appear in results

Selfhost (separate PR)

Database — new table

  • Migration for agent_query_samples table:
    create_table :agent_query_samples do |t|
      t.string   :agent_pid,       null: false
      t.string   :query_id
      t.text     :query_text
      t.bigint   :calls
      t.float    :total_exec_time_ms
      t.float    :mean_exec_time_ms
      t.float    :min_exec_time_ms
      t.float    :max_exec_time_ms
      t.bigint   :rows
      t.bigint   :shared_blks_hit
      t.bigint   :shared_blks_read
      t.bigint   :temp_blks_written
      t.bigint   :wal_bytes
      t.datetime :timestamp
      t.timestamps
    end
  • Indexes: (agent_pid, timestamp), (agent_pid, query_id, timestamp)
  • Foreign key: agent_pid → agents.pid (follows the same pattern as agent_heartbeats)

Model

  • AgentQuerySample model in app/models/agent_query_sample.rb:
    class AgentQuerySample < ApplicationRecord
      belongs_to :agent, foreign_key: :agent_pid, primary_key: :pid
    end
  • Add has_many :agent_query_samples to Agent model

Ingestion — heartbeats_controller#create

  • Add branch inside the metric_sets.each loop: when metric_set["type"] == "postgresql.statements", iterate over metric_set["metrics"]["queries"] array and create AgentQuerySample records instead of AgentHeartbeat
  • Continue creating AgentHeartbeat for all other metric types (no change to existing flow)
  • ActionCable broadcast: include postgresql.statements metric set in the broadcast payload (same as other types — no special handling needed)

API — query performance endpoint

  • New route: GET /organizations/:id/query_performance?agent_pid=...&start_date=...&end_date=...&sort_by=total_exec_time|mean_exec_time|calls&limit=20
  • New action in organizations_controller (follows the pattern of the existing instance_metrics action at line 155)
  • Query: AgentQuerySample.where(agent_pid:).where(timestamp: range).order(sort_by => :desc).limit(limit)
  • Group by query_id for time-series view of individual query trends

Alerting

  • Add to AlertRule::VALID_METRICS in app/models/alert_rule.rb:
    "postgresql.statements" => %w[mean_exec_time_ms max_exec_time_ms calls]
  • Add when "postgresql.statements" branch in EvaluationService#query_heartbeat (app/services/alerting/evaluation_service.rb:107)
  • Add query_statement_metrics(instance) method: query latest AgentQuerySample per query_id, extract the alert's metric_name from the sample with highest mean_exec_time_ms (alert on worst query, not average across all)

Health check — optional

  • In InstanceHealthCheckJob, optionally check if postgresql.statements data is flowing (staleness check like other metric types)

Architecture

pg_stat_statements ──SQL──► hostlink/pgmetrics ──delta──► metric payload
                                                               │
                                          POST /api/v1/agents/{id}/metrics
                                                               ▼
                                                        selfhost controller
                                                               │
                                              ┌────────────────┼────────────────┐
                                              │                │                │
                                              ▼                ▼                ▼
                                        AgentHeartbeat   AgentQuerySample   ActionCable
                                        (existing)       (NEW - per query   (real-time
                                        system/net/      fingerprint)       broadcast)
                                        pg.database/             │
                                        storage                  │
                                              ┌──────────────────┘
                                              │
                                    ┌─────────┼─────────────┐
                                    ▼         ▼             ▼
                              Alert rules  Query perf    Dashboard
                              (mean_time   API endpoint  (top N slow,
                               > threshold) (historical) sparklines)

Wire changes summary

  • domain/metrics/metrics.go: add MetricTypePostgreSQLStatements constant, QuerySample struct, PostgreSQLStatementMetrics struct
  • internal/pgmetrics/collector.go: add CollectStatements() to the Collector interface, implement with delta tracking
  • app/services/metrics/metrics.go: call CollectStatements() in Push(), append the new MetricSet
  • app/jobs/metricsjob/metricsjob.go: no change (already calls mp.Push(cred), which handles everything)

Deferred (v2+)

  • Individual slow query samples (pg_stat_activity snapshots for queries exceeding threshold)
  • Wait time analysis (blk_read_time, blk_write_time — requires track_io_timing)
  • Query plan analysis (EXPLAIN execution)
  • Blocking/lock analysis (pg_stat_activity + pg_locks join)

    Labels

    enhancement (New feature or request), monitoring (Monitoring and observability)
