pg_gather: PostgreSQL Snapshot Tool Every DBA Should Use

Monday, 29 June 2026

pg_gather: PostgreSQL Snapshot Tool Every DBA Should Use

Assume Its 2 AM. Alerts are firing. CPU is maxed out, connections are piling up, and the application team is already asking for an ETA.

You SSH into the server, open psql, and start your usual routine - check pg_stat_activity, look at locks, scan logs, maybe run a few custom queries you’ve built over the years. Fifteen minutes in, you still don’t have the full picture.

This is exactly where things break down in production - - not because PostgreSQL lacks visibility, but because the data is scattered.

That is where pg_gather changes the game.

Instead of manually stitching together information from multiple system views, logs, and configs, pg_gather captures everything in one go and presents it as a structured report. Its not just a tool its a repeatable diagnostic workflow.

In this article, I will walk through how pg_gather actually fits into real-world DBA operations, when it helps, when it doesnt, and how to use it effectively during incidents - not after they are over.

Why DBAs Struggle During Incidents

In PostgreSQL, all the data you need is already there:

pg_stat_activity shows sessions
pg_locks shows contention
pg_stat_bgwriter hints at checkpoint pressure
Logs tell the story of failures

But here is the problem — they don't tell the story together.

In production, I’ve seen this pattern repeatedly:

One DBA checks locks
Another checks CPU
Someone else scans logs
Nobody correlates it fast enough

By the time you connect the dots, the issue is gone.

What pg_gather Actually Solves

Think of pg_gather as a point-in-time forensic capture. Instead of running 20+ commands manually, you run one:


./pg_gather -d prod_db -o incident_$(date +%s).html

It provides all the necessary information such as,

Full configuration snapshot
Active sessions and wait events
Lock chains (who is blocking whom)
Query statistics (if enabled)
Vacuum and bloat signals
OS-level metrics

The key benefit is not just automation - its consistency.

Every DBA on your team collects the same dataset, every time.

What Gets Captured (And Why It Matters)

Session & Lock Visibility

When a system slows down, it’s usually not CPU - it’s contention. Use the below query.

SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state != 'idle';

pg_gather expands this into - Blocking trees, Wait chains & Duration analysis

Sometimes, in the production system - problem is often not the slow query - its the one holding the lock.

Configuration Snapshot

One underrated cause of performance issues is bad configuration.

pg_gather captures:

shared_buffers
work_mem
autovacuum settings
connection limits

Useful when Comparing configs across environments (dev vs prod mismatch is common).

Vacuum and Bloat Analysis

Autovacuum is one of those things you ignore… until it breaks.

pg_gather highlights:

Dead tuples
Vacuum lag
Table bloat indicators

Real issue I hv seen: A blocked autovacuum due to a long transaction - resulted in → table bloat → sudden performance drop.

Query-Level Insights (Critical)

Without pg_stat_statements, you are basically blind. One should enable it.

# CREATE EXTENSION pg_stat_statements;
# alter system set shared_preload_libraries = 'pg_stat_statements'



Once done, restart the PostgreSQL service and now pg_gather can show:

Top queries by total time
Frequency vs latency
Execution patterns

Note that : It requires restart, but worth it in every serious production system.

Running pg_gather Safely in Production

A common concern: Will this impact my database?

Short answer: Minimal, but not zero.

Best practice:


nice -n 10 ./pg_gather -d db_name > gather.log 2>&1

Why this matters:

Lower CPU priority
No interference with workload
Full audit trail

Reality check:
On very large systems, even read queries can create noticeable load. Avoid running during peak spikes unless necessary.

When to Run pg_gather (Timing Matters)

This is where most teams get it wrong.

Running it after the issue occurred is not useful. Because the data is gone.
If it has run during the issue, that the 'Goldmine'.

Because it Captures most important information such as : Active locks , Running queries, Wait events

It is advisable, to run it multiple times during a long incident to track progression.

PostgreSQL vs Oracle: Diagnostic Approach

As someone who has worked extensively on both - here is short terminology graph.

Installation and Execution

# Install git
sudo apt install git -y

# Clone repo
git clone https://github.com/jobinau/pg_gather.git
cd pg_gather

# Make executable
chmod +x pg_gather

# Run
./pg_gather -d postgres

Quick Takeaways

pg_gather is a snapshot tool, not monitoring
Run it during incidents, not after
Enables faster root cause analysis
Works best with pg_stat_statements enabled
Safe for production when used carefully
Standardizes troubleshooting across teams
Great for audits and baseline comparisons

DBA Takeaways

Most DBAs waste time collecting data, not analyzing it
Lock chains are often the hidden root cause
Autovacuum issues show up only when it’s too late
Teams without standard diagnostics struggle during incidents
Historical snapshots are extremely valuable for recurring issues

Recommendation:
Maintain a shared repository of pg_gather reports - it also becomes your internal knowledge base.

MINI CASE STUDY: Lock Contention Nightmare

Problem: Application slowdown, API latency jumped from 200ms to 6 seconds.

Symptoms : High CPU + Connection spikes + Queries stuck in wait state

pg_gather Findings:

One transaction holding lock for 18 minutes
Downstream queries waiting
Autovacuum blocked

Root Cause: Uncommitted transaction from batch job.

Fix: SELECT pg_terminate_backend(<pid>);

Outcome:

System recovered in minutes
Without pg_gather, took ~40 minutes to diagnose

Conclusion

pg_gather is one of those tools you don't realize you need until you are deep into a production issue. PostgreSQL gives you excellent visibility - but it expects you to assemble the puzzle yourself. That works when you have time. It does not work when systems are burning.

By standardizing how diagnostics are captured, pg_gather removes guesswork and speeds up decision-making. It doesn’t replace monitoring tools, and it doesn’t give you historical trends like enterprise systems — but for point-in-time analysis, it’s incredibly effective.

If you’re managing production PostgreSQL systems and still relying on manual queries during incidents, you’re adding unnecessary delay to your troubleshooting process.

Run it in your environment. Capture baseline reports. Use it during your next incident.

That’s when it proves its real value.

FAQs

1. Does pg_gather impact performance?
Minimal impact, but noticeable on very large systems. Use nice to reduce priority.

2. Can it replace monitoring tools?
No. It’s a snapshot tool, not real-time monitoring.

3. Do I need superuser access?
Recommended. Without it, some data will be missing.

4. Is pg_stat_statements mandatory?
No, but without it you lose query-level insights.

5. Can I run it on managed cloud databases?
Yes, but access restrictions may limit data collection.

Have you ever been stuck diagnosing a PostgreSQL issue where data was scattered everywhere? Do you rely on custom scripts, or something like pg_gather?

Curious to hear how others handle incident diagnostics in production.

Source : https://github.com/jobinau/pg_gather

https://www.percona.com/blog/what-is-in-pg_gather-version-33/

Learn DBA : A Life Long Learning Experience

Monday, 29 June 2026