Assume Its 2 AM. Alerts are firing. CPU is maxed out, connections are piling up, and the application team is already asking for an ETA.
You SSH into the server, open psql, and start your usual routine - check pg_stat_activity, look at locks, scan logs, maybe run a few custom queries you’ve built
over the years. Fifteen minutes in, you still don’t have the full
picture.
This is exactly where things break down in production - - not because PostgreSQL lacks visibility, but because the data is scattered.
That is where pg_gather changes the game.
Instead of manually stitching together information from multiple system views, logs, and configs, pg_gather captures everything in one go and presents it as a structured report. Its not just a tool its a repeatable diagnostic workflow.
In this article, I will walk through how pg_gather actually fits into real-world DBA operations, when it helps, when it doesnt, and how to use it effectively during incidents - not after they are over.
Why DBAs Struggle During Incidents
In PostgreSQL, all the data you need is already there:
-
pg_stat_activityshows sessions -
pg_locksshows contention -
pg_stat_bgwriterhints at checkpoint pressure - Logs tell the story of failures
But here is the problem — they don't tell the story together.
In production, I’ve seen this pattern repeatedly:
- One DBA checks locks
- Another checks CPU
- Someone else scans logs
- Nobody correlates it fast enough
By the time you connect the dots, the issue is gone.
What pg_gather Actually Solves
Think of pg_gather as a point-in-time forensic capture. Instead of running 20+ commands manually, you run one:
./pg_gather -d prod_db -o incident_$(date +%s).html
It provides all the necessary information such as,
- Full configuration snapshot
- Active sessions and wait events
- Lock chains (who is blocking whom)
- Query statistics (if enabled)
- Vacuum and bloat signals
- OS-level metrics
The key benefit is not just automation - its consistency.
Every DBA on your team collects the same dataset, every time.
What Gets Captured (And Why It Matters)
Session & Lock Visibility
When a system slows down, it’s usually not CPU - it’s contention. Use the below query.
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state != 'idle';pg_gather expands this into - Blocking trees, Wait chains & Duration analysis
Sometimes, in the production system - problem is often not the slow query - its the one holding the lock.
Configuration Snapshot
One underrated cause of performance issues is bad configuration.
pg_gather captures:
-
shared_buffers -
work_mem - autovacuum settings
- connection limits
Useful when Comparing configs across environments (dev vs prod mismatch is common).
Vacuum and Bloat Analysis
Autovacuum is one of those things you ignore… until it breaks.
pg_gather highlights:
- Dead tuples
- Vacuum lag
- Table bloat indicators
Real issue I hv seen: A blocked autovacuum due to a long transaction - resulted in → table bloat → sudden performance drop.
Query-Level Insights (Critical)
Without pg_stat_statements, you are basically blind. One should enable it.
# CREATE EXTENSION pg_stat_statements; # alter system set shared_preload_libraries = 'pg_stat_statements'
Once done, restart the PostgreSQL service and now pg_gather can show:
- Top queries by total time
- Frequency vs latency
- Execution patterns
Note that : It requires restart, but worth it in every serious production system.
Running pg_gather Safely in Production
A common concern: Will this impact my database?
Short answer: Minimal, but not zero.
Best practice:
nice -n 10 ./pg_gather -d db_name > gather.log 2>&1
Why this matters:
- Lower CPU priority
- No interference with workload
- Full audit trail
Reality check:
On very large systems, even read
queries can create noticeable load. Avoid running during peak spikes unless
necessary.
When to Run pg_gather (Timing Matters)
This is where most teams get it wrong.
- Running it after the issue occurred is not useful. Because the data is gone.
- If it has run during the issue, that the 'Goldmine'.
Because it Captures most important information such as : Active locks , Running queries, Wait events
It is advisable, to run it multiple times during a long incident to track progression.
PostgreSQL vs Oracle: Diagnostic Approach
As someone who has worked extensively on both - here is short terminology graph.
# Install git
sudo apt install git -y
# Clone repo
git clone https://github.com/jobinau/pg_gather.git
cd pg_gather
# Make executable
chmod +x pg_gather
# Run
./pg_gather -d postgres
# Install git
sudo apt install git -y
# Clone repo
git clone https://github.com/jobinau/pg_gather.git
cd pg_gather
# Make executable
chmod +x pg_gather
# Run
./pg_gather -d postgres
Quick Takeaways
- pg_gather is a snapshot tool, not monitoring
- Run it during incidents, not after
- Enables faster root cause analysis
- Works best with pg_stat_statements enabled
- Safe for production when used carefully
- Standardizes troubleshooting across teams
- Great for audits and baseline comparisons
DBA Takeaways
- Most DBAs waste time collecting data, not analyzing it
- Lock chains are often the hidden root cause
- Autovacuum issues show up only when it’s too late
- Teams without standard diagnostics struggle during incidents
- Historical snapshots are extremely valuable for recurring issues
Recommendation:
Maintain a shared repository of
pg_gather reports - it also becomes your internal knowledge base.
MINI CASE STUDY: Lock Contention Nightmare
Problem: Application slowdown, API latency jumped from 200ms to 6 seconds.
Symptoms : High CPU + Connection spikes + Queries stuck in wait state
pg_gather Findings:
- One transaction holding lock for 18 minutes
- Downstream queries waiting
- Autovacuum blocked
Root Cause: Uncommitted transaction from batch job.
Fix: SELECT pg_terminate_backend(<pid>);
Outcome:
- System recovered in minutes
- Without pg_gather, took ~40 minutes to diagnose
Conclusion
pg_gather is one of those tools you don't realize you need until you are deep into a production issue. PostgreSQL gives you excellent visibility - but it expects you to assemble the puzzle yourself. That works when you have time. It does not work when systems are burning.
By standardizing how diagnostics are captured, pg_gather removes guesswork and speeds up decision-making. It doesn’t replace monitoring tools, and it doesn’t give you historical trends like enterprise systems — but for point-in-time analysis, it’s incredibly effective.
If you’re managing production PostgreSQL systems and still relying on manual queries during incidents, you’re adding unnecessary delay to your troubleshooting process.
Run it in your environment. Capture baseline reports. Use it during your next incident.
That’s when it proves its real value.
FAQs
1. Does pg_gather impact performance?
Minimal impact, but noticeable on
very large systems. Use nice to reduce priority.
2. Can it replace monitoring tools?
No. It’s a snapshot tool, not
real-time monitoring.
3. Do I need superuser access?
Recommended. Without it, some data
will be missing.
4. Is pg_stat_statements mandatory?
No, but without it you lose
query-level insights.
5. Can I run it on managed cloud databases?
Yes, but access restrictions may
limit data collection.
Have you ever been stuck diagnosing a PostgreSQL issue where data was scattered everywhere? Do you rely on custom scripts, or something like pg_gather?
Curious to hear how others handle incident diagnostics in production.
Source : https://github.com/jobinau/pg_gather
https://www.percona.com/blog/what-is-in-pg_gather-version-33/
No comments:
Post a Comment