Thursday, 4 June 2026

PostgreSQL VACUUM: Bloat, Autovacuum and Real Fixes

    A PostgreSQL table can grow quietly for weeks before anyone notices. The application team says they already deleted old data. Storage still looks high. Queries are touching more blocks than expected. Autovacuum is running, but the table does not seem to become smaller. This is where many DBAs first realize that DELETE in PostgreSQL is not the same as physically removing rows from the table file.

PostgreSQL uses MVCC, so old row versions remain inside the table until VACUUM can clean them. This is normal behavior, not a bug. The problem starts when dead tuples grow faster than VACUUM can remove them, or when long-running transactions prevent cleanup. Then you get table bloat, index bloat, stale statistics, poor plans, unnecessary I/O, and sometimes transaction ID wraparound pressure.

This article walks through PostgreSQL VACUUM from a production DBA angle. We will create a realistic test table, generate churn, compare table size before and after VACUUM, tune autovacuum for high-write tables, monitor running VACUUM jobs, and discuss when VACUUM FULL, REINDEX, partitioning, or pg_repack makes more sense.


DELETE Does Not Mean Space Is Gone

In PostgreSQL, when a row is updated or deleted, the old version is not immediately removed from disk. PostgreSQL keeps it because another transaction may still need to see that older version for read consistency.

That is MVCC doing its job.

The cleanup happens later through VACUUM. VACUUM removes dead row versions when they are no longer visible to any active transaction. It also cleans dead index entries and updates visibility information.

A simple delete may look harmless:

DELETE FROM dba_lab.order_events
WHERE created_at < now() - interval '90 days';

But physically, the table may still occupy almost the same space after this delete. Regular VACUUM usually marks the freed space as reusable inside PostgreSQL. It does not always return that space to the operating system.

This is one of the first PostgreSQL mindset shifts for Oracle DBAs. In Oracle, you think in terms of undo, segments, extents, high water mark, shrink, and space advisor. In PostgreSQL, you have to think about heap tuples, dead row versions, visibility, autovacuum, and bloat.


Practical DBA use case

After a purge job, do not only check the row count. Check dead tuples and relation size.

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'order_events';

Remember that n_live_tup and n_dead_tup are estimates. They are still useful for trend monitoring, but for exact inspection you can use extensions like pgstattuple where permitted.

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('dba_lab.order_events');


Production caveat

VACUUM may find dead tuples but still not remove them if an old transaction is holding the visibility horizon.

Check old transactions:

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    xact_start,
    now() - xact_start AS xact_age,
    backend_xmin,
    left(query, 120) AS query_text
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start;

An idle-in-transaction session from an application pool can hold back cleanup much longer than expected.


Build a Realistic Test Table

A clean lab helps explain the behavior without touching production. The goal is not to create a perfect benchmark. The goal is to reproduce a common DBA situation: insert data, update some rows, delete some rows, then observe what VACUUM actually changes.

Create a schema:

CREATE SCHEMA IF NOT EXISTS dba_lab;

Create a table that looks closer to a real application event table:

CREATE TABLE dba_lab.order_events
(
    event_id     BIGSERIAL PRIMARY KEY,
    tenant_id    INTEGER NOT NULL,
    event_type   TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'NEW',
    payload      TEXT NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at TIMESTAMPTZ
)
WITH (fillfactor = 90);

Create a few practical indexes:

CREATE INDEX idx_order_events_tenant_created
ON dba_lab.order_events (tenant_id, created_at);

CREATE INDEX idx_order_events_status
ON dba_lab.order_events (status);

CREATE INDEX idx_order_events_event_type
ON dba_lab.order_events (event_type);

Load around 1.2 million rows:

INSERT INTO dba_lab.order_events
(
    tenant_id,
    event_type,
    status,
    payload,
    created_at
)
SELECT
    (random() * 2000)::int + 1,
    (ARRAY['LOGIN', 'PAYMENT', 'API_CALL', 'BATCH_JOB'])[1 + floor(random() * 4)::int],
    'NEW',
    repeat(md5(g::text), 24),
    now() - (g || ' seconds')::interval
FROM generate_series(1, 1200000) AS g;

Run analyze:

ANALYZE dba_lab.order_events;

Check size:

SELECT
    pg_size_pretty(pg_relation_size('dba_lab.order_events')) AS table_size,
    pg_size_pretty(pg_indexes_size('dba_lab.order_events')) AS index_size,
    pg_size_pretty(pg_total_relation_size('dba_lab.order_events')) AS total_size;

Sample realistic output:
 table_size | index_size | total_size
------------+------------+------------
 1184 MB    | 156 MB     | 1340 MB

The exact number will differ depending on PostgreSQL version, block size, payload length, index structure, fillfactor, and storage.


Why this test is better

Instead of deleting only the highest IDs, we will create scattered churn. This is closer to real production, where updates and deletes happen across many pages, not always neatly at the end of the table.


Create Churn: Updates and Scattered Deletes

First, update a portion of the table:

UPDATE dba_lab.order_events
SET
    status = 'PROCESSED',
    processed_at = now()
where event_id % 4 = 0 ;

This updates around 300,000 rows. In PostgreSQL, an update creates a new row version and leaves the old version behind until VACUUM can clean it.


Now delete scattered rows:

DELETE FROM dba_lab.order_events
WHERE event_id % 7 = 0;
This removes around 171,000 rows logically.


Now,. Run analyze:

ANALYZE VERBOSE dba_lab.order_events;

Check table statistics:

SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'order_events';

Sample output:

 relname      | n_live_tup | n_dead_tup | last_analyze
--------------+------------+------------+-------------------------------
 order_events | 1028450    | 441720     | 2026-05-30 14:20:44.121+08

Do not expect the numbers to match exactly. PostgreSQL statistics are estimates unless you use deeper inspection tools.

SELECT
    pg_size_pretty(pg_relation_size('dba_lab.order_events')) AS table_size,
    pg_size_pretty(pg_indexes_size('dba_lab.order_events')) AS index_size,
    pg_size_pretty(pg_total_relation_size('dba_lab.order_events')) AS total_size;

Check size again:

 table_size | index_size | total_size
------------+------------+------------
 1428 MB    | 226 MB     | 1654 MB

Why did size increase after updates and deletes?

Because updates created new row versions. Deletes marked rows as dead. Indexes also had to track old and new entries. This is exactly how table and index bloat starts.


Run Regular VACUUM and Read the Result Correctly

SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    vacuum_count,
    autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'order_events';

Now run:

VACUUM VERBOSE ANALYZE dba_lab.order_events;


Check table statistics again:

relname | n_live_tup | n_dead_tup | last_vacuum | vacuum_count

--------------+------------+------------+------------------------------+--------------
 order_events | 1029000    | 0          | 2026-05-30 14:27:18.921+08   | 1

Now check size:

SELECT
    pg_size_pretty(pg_relation_size('dba_lab.order_events')) AS table_size,
    pg_size_pretty(pg_indexes_size('dba_lab.order_events')) AS index_size,
    pg_size_pretty(pg_total_relation_size('dba_lab.order_events')) AS total_size;

Sample output after regular VACUUM:
 table_size | index_size | total_size
------------+------------+------------
 1416 MB    | 219 MB     | 1635 MB

This is the realistic part.

Dead tuples are gone, but the table size did not drop dramatically. That does not mean VACUUM failed. It means the freed space is mostly inside the table and can be reused by future inserts or updates.

If your workload continues inserting into the same table, this is fine. PostgreSQL can reuse that internal free space. If your business expectation is to return disk space to the OS immediately, regular VACUUM is not the tool for that.


What to check in VACUUM VERBOSE output

Look for lines related to:

tuples: removed, remain, dead but not yet removable
index scans
pages removed
pages remain
dead item identifiers removed
WAL usage
buffer usage


Pay special attention to: dead but not yet removable

If this is high, VACUUM found dead tuples but could not remove them. That usually points to old snapshots, long-running transactions, replication slot xmin, prepared transactions, or standby feedback behavior.


Monitor VACUUM While It Runs

For large tables, do not run VACUUM and wait blindly. Use pg_stat_progress_vacuum.

SELECT
    p.pid,
    a.usename,
    a.application_name,
    p.datname,
    p.relid::regclass AS relation_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    round(
        100.0 * p.heap_blks_scanned / NULLIF(p.heap_blks_total, 0),
        2
    ) AS scanned_pct,
    p.index_vacuum_count,
    p.max_dead_tuples,
    p.num_dead_tuples
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a
  ON a.pid = p.pid
ORDER BY p.pid;

Common phases include:

initializing
scanning heap
vacuuming indexes
vacuuming heap
cleaning up indexes
truncating heap
performing final cleanup

 

This view is for normal VACUUM activity. Do not depend on it for VACUUM FULL.

Check wait events for the VACUUM session:

SELECT
    pid,
    wait_event_type,
    wait_event,
    state,
    query
FROM pg_stat_activity
WHERE query ILIKE 'vacuum%';

Check blockers:
SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    blocking.state AS blocking_state
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks
  ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid <> blocked_locks.pid
JOIN pg_stat_activity blocking
  ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

Production caveat

VACUUM is designed to work with normal database activity, but it is not free. It consumes I/O, CPU, memory, and can generate WAL. If a huge cleanup runs during peak traffic, users may notice.

For PostgreSQL 14 and above, check WAL activity:

SELECT
    wal_records,
    wal_fpi,
    pg_size_pretty(wal_bytes) AS wal_bytes,
    stats_reset
FROM pg_stat_wal;


Autovacuum Tuning for High-Churn Tables

Autovacuum is not the enemy. It is usually the reason PostgreSQL keeps running without manual maintenance every day.

But default autovacuum settings are generic. A large event table with frequent updates and deletes should not use the same cleanup behavior as a small lookup table.

The basic vacuum trigger is based on:

autovacuum_vacuum_threshold
+
autovacuum_vacuum_scale_factor * table_size

By default, the scale factor can be too relaxed for large tables. For a table with 20 million rows, waiting for 20 percent churn may mean millions of dead tuples before autovacuum becomes aggressive enough.

For a hot table, tune at table level:

ALTER TABLE dba_lab.order_events SET
(
    autovacuum_vacuum_threshold = 10000,
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_threshold = 10000,
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_vacuum_cost_limit = 2000,
    autovacuum_vacuum_cost_delay = 2
);

Check table-level options:

SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    c.reloptions
FROM pg_class c
JOIN pg_namespace n
  ON n.oid = c.relnamespace
WHERE n.nspname = 'dba_lab'
  AND c.relname = 'order_events';

How to identify tables needing tuning

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0),
        2
    ) AS dead_tuple_pct,
    last_autovacuum,
    autovacuum_count
FROM pg_stat_user_tables
WHERE n_live_tup + n_dead_tup > 0
ORDER BY dead_tuple_pct DESC
LIMIT 20;

Production caveat

Do not apply aggressive autovacuum settings everywhere. More frequent vacuuming means more background I/O. Tune the top few high-churn tables first, monitor the effect, then adjust.

For insert-heavy tables in newer PostgreSQL versions, also review insert-triggered autovacuum parameters such as:

SHOW autovacuum_vacuum_insert_threshold;
SHOW autovacuum_vacuum_insert_scale_factor;

These are useful for append-only tables where freezing and visibility map maintenance still matter.


Index Bloat Needs Separate Attention

VACUUM cleans dead index entries, but it does not always compact indexes the way DBAs expect. A heavily updated table may have indexes that remain much larger than necessary.

Check index size and usage:

SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'order_events'
ORDER BY pg_relation_size(indexrelid) DESC;

If the index is bloated and still required, rebuild it carefully.

REINDEX INDEX CONCURRENTLY dba_lab.idx_order_events_status;

For all indexes on a table:

REINDEX TABLE CONCURRENTLY dba_lab.order_events;


Production caveat

REINDEX CONCURRENTLY is safer for availability than normal REINDEX, but it still needs extra disk space and time. Always check free space before starting.

SELECT pg_size_pretty(pg_relation_size('dba_lab.idx_order_events_status')) AS index_size;

At and at OS level: df -h 

Also remember that unused indexes create unnecessary VACUUM work. Every  update and delete has to maintain more index structures.


Find low-usage indexes:

SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
Do not drop indexes only based on one snapshot. Validate over a business
    cycle.


Oracle vs PostgreSQL: Same Pain, Different Internals

Oracle and PostgreSQL both provide read consistency, but they do it differently.

In Oracle, old versions are managed through undo. After large deletes, DBAs often think about undo retention, segment size, high water mark, shrink space, and reclaiming extents.

Example Oracle checks:

SELECT
    owner,
    segment_name,
    segment_type,
    bytes / 1024 / 1024 AS size_mb
FROM dba_segments
WHERE owner = 'APPUSER'
ORDER BY bytes DESC;

Shrink example:

ALTER TABLE appuser.order_events ENABLE ROW MOVEMENT;
ALTER TABLE appuser.order_events SHRINK SPACE;

In PostgreSQL, the old row versions remain in the heap until VACUUM removes them. The equivalent DBA thought process is different:

VACUUM VERBOSE ANALYZE dba_lab.order_events;

Then validate:

SELECT
    pg_size_pretty(pg_relation_size('dba_lab.order_events')) AS table_size,
    pg_size_pretty(pg_indexes_size('dba_lab.order_events')) AS index_size,
    pg_size_pretty(pg_total_relation_size('dba_lab.order_events')) AS total_size;

DBA mindset shift

Do not manage PostgreSQL like Oracle.

In PostgreSQL:

  • Regular VACUUM is normal maintenance

  • Autovacuum tuning is workload-specific

  • Long transactions can block cleanup

  • Index bloat may need REINDEX

  • Physical shrink usually needs rewrite-style operations

  • Partitioning is often the best purge strategy





Quick Takeaways

  • DELETE does not immediately remove physical row versions in PostgreSQL.

  • Regular VACUUM removes dead tuples and makes space reusable, but may not shrink the table file.

  • VACUUM FULL can reclaim physical space, but it rewrites the table and needs a strong lock.

  • Long-running transactions can prevent VACUUM from removing dead tuples.

  • Autovacuum defaults are not always suitable for large high-churn tables.

  • Index bloat needs separate monitoring and sometimes REINDEX CONCURRENTLY.

  • Partitioning is often better than massive recurring deletes.


Lessons From the Field: DBA Takeaways

Most VACUUM-related incidents do not start with someone saying, “VACUUM is failing.” They start with symptoms.

  • Storage alerts.
  • Slow reports.
  • High I/O.
  • Replication lag.

Autovacuum workers running for hours.

A purge job that deleted millions of rows but did not reduce filesystem usage.

The first DBA mistake is assuming that VACUUM failed because table size did not shrink. In many case s, VACUUM did exactly what it was supposed to do. It removed dead tuples and made space reusable. It just did not return that space to the OS.

The second mistake is tuning autovacuum globally without knowing which tables are causing the problem. A few high-churn tables usually create most of the pain. Tune those tables directly.

The third mistake is ignoring long transactions. One idle transaction can hold the xmin horizon and make autovacuum look ineffective.

The fourth mistake is forgetting indexes. A table may look acceptable, but indexes may have grown badly due to repeated updates and deletes.

The fifth mistake is running VACUUM FULL too casually. It gives visible space back, but it can block application workload and create replication lag.

A practical weekly DBA check should include:

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0),
        2
    ) AS dead_tuple_pct,
    last_autovacuum,
    autovacuum_count
FROM pg_stat_user_tables
WHERE n_live_tup + n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 20;

Also check old transactions:
SELECT
    pid,
    usename,
    state,
    now() - xact_start AS xact_age,
    backend_xmin,
    left(query, 100) AS query_text
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

Preventing bloat is much easier than cleaning it under pressure.


Conclusion

PostgreSQL VACUUM is not just a maintenance command. It is part of how PostgreSQL survives real OLTP workload.

Every update and delete can leave old row versions behind. That is normal MVCC behavior. The production risk begins when cleanup cannot keep up, when long transactions hold old snapshots, when indexes grow silently, or when teams expect DELETE to behave like an immediate file shrink operation.

The practical approach is simple. Monitor dead tuples. Track autovacuum activity. Watch long-running transactions. Review relation and index size trends. Tune autovacuum at table level for high-churn objects. Use regular VACUUM for routine cleanup. Use REINDEX CONCURRENTLY when indexes are the real problem. Use VACUUM FULL only when physical space must be returned and a maintenance window is available.

For large purge workloads, design matters more than cleanup. If you delete millions of old rows every month, partitioning may be a better solution than repeatedly creating dead tuples and then trying to clean them later.

Oracle DBAs moving into PostgreSQL should not map every concept directly. PostgreSQL bloat management has its own rules. Once you understand those rules, VACUUM becomes less mysterious and more predictable.

Review your busiest tables. Check which ones generate the most dead tuples. Confirm whether autovacuum is keeping up. Test table-level tuning before production rollout. Bloat is easier to prevent than to repair during an incident.


FAQs

1. Why did my PostgreSQL table not shrink after VACUUM?

Regular VACUUM removes dead tuples and marks space reusable inside the table. It does not always return disk space to the operating system. For physical shrink, you may need VACUUM FULL, CLUSTER, pg_repack, or partition-level cleanup.

2. Is VACUUM FULL safe for production?

It can be used in production only with planning. VACUUM FULL rewrites the table and requires a strong lock, so it can block application sessions. It also generates WAL and may create replication lag.

3. How do I know if autovacuum is not keeping up?

Check n_dead_tup, last_autovacuum, and autovacuum_count in pg_stat_user_tables. If dead tuples keep increasing and autovacuum is infrequent or long-running, the table may need tuning.

4. Can long-running queries stop VACUUM cleanup?

Yes. Long-running transactions, idle-in-transaction sessions, old snapshots, prepared transactions, and some replication-related xmin conditions can prevent VACUUM from removing dead tuples.

5. Should I tune autovacuum globally or per table?

Start per table. Most environments have a few high-churn tables causing most bloat. Table-level settings are safer than aggressively changing autovacuum behaviour for the entire database.


Lastly .. 



Have you seen a PostgreSQL purge job finish successfully but storage usage remain almost unchanged?

Was the real issue dead tuples, long transactions, index bloat, replication lag, or simply wrong expectations from DELETE?

Share your production experience. VACUUM problems are usually learned during real incidents, not from syntax alone.




No comments:

Post a Comment