Sunday, 20 July 2025

RAC Patching Prerequisites Most DBAs Ignore

Most RAC patching failures do not happen because somebody forgot the actual patching command.

They happen because the environment was already unhealthy before patching even started.

A CRS resource was unstable for weeks. ASM diskgroup usage was already close to critical. One node had intermittent packet loss. OPatch inventory was inconsistent. Backup validation had never been tested. Somebody skipped conflict analysis because "last quarter's patch worked fine."

Then patch night arrives.

What should have been a rolling maintenance activity suddenly turns into a bridge call involving storage, Linux, network, middleware, and management asking why one RAC node refuses to rejoin the cluster.

That is why experienced RAC DBAs pay more attention to prerequisites than the actual patch steps.

The patching procedure itself is usually documented well. The dangerous part is everything surrounding it.

This article focuses on the operational side of RAC patch preparation that interview questions rarely explore deeply enough. Not checklist-level theory, but the practical validations that prevent ugly midnight failures during GI and database patching.


Why RAC Patching Fails More Often Than Expected

RAC environments amplify small infrastructure problems.

A standalone database with minor issues may continue operating for months without visible impact. RAC behaves differently. Clusterware, interconnect communication, OCR, voting disks, ASM dependencies, service relocation, CRS resource management, and rolling operations all depend on stability across nodes.

Patching stresses all of those layers simultaneously.

During patching, you are effectively testing:

  • Cluster resiliency
  • Node communication
  • CRS health
  • Inventory consistency
  • ASM availability
  • Storage latency
  • Service failover behavior
  • Startup sequencing
  • Database recovery behavior

That is why experienced DBAs never trust a RAC environment simply because applications are currently online.

A cluster surviving workload is not the same as a cluster surviving maintenance.


Start With Inventory and OPatch Validation

One of the most overlooked failures in RAC patching is OPatch incompatibility.

DBAs often download the patch, copy it to the server, and immediately start patching without validating whether the OPatch version supports the patch bundle.

Then patching aborts halfway with inventory errors or unsupported version messages.

Always verify OPatch first.

$ORACLE_HOME/OPatch/opatch version

Compare the version with the patch README requirements.
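
The comparison is easy to get wrong by eye when the fifth digit group differs in length, so it can be scripted. The following is a minimal sketch, assuming GNU sort (for -V) and assuming the first line of the opatch version output has the form "OPatch Version: <version>". The helper names version_ge and installed_opatch_version are hypothetical, and the minimum version must come from your patch README.

```shell
# Sketch only: helper names are illustrative, not part of OPatch.

# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort).
version_ge() {
    local installed="$1" required="$2"
    [ "$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n 1)" = "$required" ]
}

# Prints the installed OPatch version.
# Assumes a line of the form "OPatch Version: <version>" in the output.
installed_opatch_version() {
    "$ORACLE_HOME/OPatch/opatch" version | awk -F': ' '/^OPatch Version/ {print $2; exit}'
}
```

A call such as version_ge "$(installed_opatch_version)" "<minimum from README>" || echo "Upgrade OPatch first" would then gate the rest of a pre-check script. Run it once per home, since the Grid home and database homes carry separate OPatch copies.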

Also validate the inventory itself.

$ORACLE_HOME/OPatch/opatch lsinventory

If inventory output already shows warnings before patching, stop immediately.

Do not assume patching will "fix" inventory inconsistencies. It usually makes recovery harder.

In large RAC environments with multiple Oracle homes, inventory corruption becomes surprisingly common after years of clone operations, abandoned homes, failed one-off patches, or partial GI upgrades.


Space Checks Matter More Than People Think

Patch staging failures are common in RAC systems running on aging filesystems.

DBAs check Oracle Home space but forget:

  • Grid Infrastructure home
  • Central inventory
  • /tmp
  • patch staging filesystem
  • backup mount locations

Oracle recommendations often mention requiring several times the patch size in free space. That estimate is not exaggerated.

During rollback scenarios, temporary extraction, inventory updates, and backup copies can consume much more space than expected.

Check thoroughly: run df -h against every one of these locations, on every node.

Pay close attention to:

  • ORACLE_HOME
  • GRID_HOME
  • oraInventory
  • ACFS filesystems
  • shared mount points

A full filesystem during GI patching can leave CRS partially upgraded, which is one of the worst recovery situations to handle under pressure.
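
The space check above can be turned into a repeatable pre-flight step. This is a rough sketch, assuming a POSIX df and a safety multiplier that you choose yourself. The helper names required_kb, free_kb, and check_space are hypothetical, and several paths may share one filesystem, in which case the same free space is counted more than once.

```shell
# Sketch only: helper names and the multiplier are assumptions.

# Required free space in KB for a patch of $1 KB with safety multiplier $2.
required_kb() {
    echo $(( $1 * $2 ))
}

# Free space in KB on the filesystem holding path $1.
# df -P keeps each filesystem on a single line; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Usage: check_space <needed_kb> <path>...
# Prints one line per path; returns non-zero if any path is short on space.
check_space() {
    local need="$1" rc=0 path have
    shift
    for path in "$@"; do
        have=$(free_kb "$path")
        if [ "$have" -lt "$need" ]; then
            echo "LOW  $path: ${have} KB free, ${need} KB wanted"
            rc=1
        else
            echo "OK   $path: ${have} KB free"
        fi
    done
    return "$rc"
}
```

With the unzipped patch size from du -sk, a call like check_space "$(required_kb "$patch_kb" 4)" "$ORACLE_HOME" "$GRID_HOME" /tmp covers the main locations on one node. It needs to be repeated on every node, because local filesystems fill up independently.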


Backups Are Meaningless Without Validation

Most teams say they have backups.

Far fewer know whether recovery actually works.

Before RAC patching, experienced DBAs validate three different things:

  1. Database recoverability
  2. Archive log continuity
  3. Oracle Home recoverability

RMAN backup alone is not enough.

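Run validation checks. VALIDATE DATABASE reads the live datafiles for block corruption; to confirm that the backups themselves are usable, RESTORE DATABASE VALIDATE reads the backup pieces without actually restoring anything.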

RMAN> VALIDATE DATABASE;

Also validate archive logs.

RMAN> CROSSCHECK ARCHIVELOG ALL;
RMAN> VALIDATE ARCHIVELOG ALL;

Corruption discovered after patching is operationally brutal because now you are troubleshooting both recovery and patching simultaneously.

For Oracle Home and Grid Home backups, many DBAs still rely on filesystem tar backups because they are fast to restore when a rollback is needed.

Example:

tar -cvpf grid_home_backup.tar $GRID_HOME
tar -cvpf oracle_home_backup.tar $ORACLE_HOME

These backups have saved many DBAs during failed rollback scenarios where OPatch rollback itself became unstable.
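
A tar file that cannot be read is no better than no backup, so it is worth listing the archive once before the maintenance window. The sketch below assumes nothing beyond a standard tar; verify_tar is a hypothetical helper name. It only proves the archive is readable end to end, not that every file in the home was captured.

```shell
# Sketch only: verify_tar is an illustrative helper name.

# Succeeds when the archive can be listed in full and contains at least one entry.
# A missing archive fails here, and a truncated or corrupt one typically does too,
# because tar returns non-zero.
verify_tar() {
    local archive="$1" listing
    listing=$(tar -tf "$archive" 2>/dev/null) || return 1
    [ -n "$listing" ]
}
```

For example, verify_tar grid_home_backup.tar || echo "Grid Home backup is not usable" can run right after the backup completes, while there is still time to take it again.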


Cluster Health Checks Before Patching

This is the step most interview answers oversimplify.

"Run health checks." What exactly are you checking?

Real RAC health verification usually includes:


CRS Status

crsctl stat res -t

You are looking for:

  • intermittent OFFLINE resources
  • UNKNOWN states
  • restart loops
  • node-specific failures

A resource already unstable before patching becomes much harder to diagnose afterward.
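
On clusters with many resources, the interesting lines are easy to miss in the full listing. As a rough sketch, the output can be filtered for the state keywords that deserve a second look. The helper name suspect_crs_lines is hypothetical, and the keyword list is an assumption about what you consider suspicious.

```shell
# Sketch only: suspect_crs_lines is an illustrative helper name.

# Reads `crsctl stat res -t` output on stdin and prints every line that
# mentions a state worth reviewing. Returns 0 if at least one line matched.
suspect_crs_lines() {
    grep -E -w 'OFFLINE|UNKNOWN|INTERMEDIATE'
}
```

Usage would be crsctl stat res -t | suspect_crs_lines. Some resources are legitimately OFFLINE in a healthy cluster, so the result is a review list rather than a verdict. Capturing it on several different days before the window is what exposes the intermittent cases.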


Cluster Verification Utility

cluvfy comp healthcheck -collect cluster -bestpractice

CVU output often reveals ignored issues:

  • multicast failures
  • interconnect latency
  • time synchronization drift
  • permission inconsistencies
  • OCR accessibility warnings

Many DBAs ignore warnings because the cluster is currently operational.

That is dangerous during rolling patching.


ASM Health

SELECT name, state, type, total_mb, free_mb
FROM v$asm_diskgroup;

Watch for:

  • rebalance operations
  • nearly full diskgroups
  • offline disks
  • high ASM alert log activity

Patching while ASM is already degraded increases restart risk significantly.
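
"Nearly full" is easier to act on as a percentage. The following sketch assumes the query output has been reduced to three whitespace-separated columns (name, total_mb, free_mb), and it uses an 85 percent default threshold that is purely an assumption. The helper name full_diskgroups is hypothetical. For normal or high redundancy diskgroups, raw free_mb overstates what is really usable, so treat the result as a first filter only.

```shell
# Sketch only: full_diskgroups and the 85% default are assumptions.

# Reads "NAME TOTAL_MB FREE_MB" rows on stdin and prints each diskgroup whose
# used percentage is at or above the threshold in $1 (default 85).
# Header rows and rows with a zero total are skipped.
full_diskgroups() {
    awk -v limit="${1:-85}" '
        NF >= 3 && $2 + 0 > 0 {
            used = 100 * ($2 - $3) / $2
            if (used >= limit) printf "%s %.1f%% used\n", $1, used
        }'
}
```

For instance, piping the rows "DATA 1000 100" and "FRA 1000 900" through full_diskgroups 85 reports only DATA, at 90.0% used.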


Conflict Detection Is Mandatory

Skipping conflict analysis is one of the fastest ways to ruin a maintenance window.

Always run:

$ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -ph ./

Or:

opatchauto apply <patch_location> -analyze

The -analyze mode catches many issues before downtime begins.

Particularly dangerous areas include:

  • one-off patches
  • JVM patches
  • OJVM conflicts
  • Data Guard environments with inconsistent patch levels
  • GI/database version mismatches

Some conflicts only appear on specific nodes because inventory drift exists between RAC members.

That is why node-to-node inventory consistency matters.
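
One way to check that consistency is to save the opatch lspatches output from each node to a file and compare the patch IDs. This is a minimal sketch, assuming each patch line has the form "<patch_id>;<description>". The helper name patch_drift is hypothetical.

```shell
# Sketch only: patch_drift is an illustrative helper name.

# Usage: patch_drift <node1_listing> <node2_listing>
# Prints patch IDs that appear in exactly one of the two listings.
# IDs only in the first file are unindented; IDs only in the second are tab-indented.
patch_drift() {
    local a b
    a=$(mktemp) && b=$(mktemp) || return 2
    # Keep only the numeric patch ID before the first ";" on each line.
    cut -d';' -f1 "$1" | grep -E '^[0-9]+$' | sort -u > "$a"
    cut -d';' -f1 "$2" | grep -E '^[0-9]+$' | sort -u > "$b"
    # comm -3 suppresses lines common to both files, leaving only the differences.
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

Empty output means both nodes report the same patch IDs for that home. Any line printed is drift to resolve before the window, and the check should be repeated for the Grid home and for each database home.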


Invalid Objects and Registry Validation

Before patching, compile invalid objects.

@?/rdbms/admin/utlrp.sql

Then review component status.

SELECT comp_name, status, version
FROM dba_registry;

DBAs sometimes ignore invalid components because applications seem unaffected.

During patching, however, invalid Java components, XML DB issues, or partially upgraded registry entries can cause datapatch failures later.

Datapatch problems are notoriously painful because binary patching may succeed while SQL patching fails afterward.

That creates version inconsistencies between binaries and database metadata.


Service Relocation Before Rolling Patching

Good RAC patching minimizes application disruption.

Before patching a node, services are usually relocated away from it.

Example:

srvctl relocate service -db PRODDB -service APP_SVC -oldinst PROD1 -newinst PROD2

This sounds simple until you discover:

  • connection pools not reconnecting
  • FAN events not functioning correctly
  • application affinity hardcoded to one node
  • overloaded surviving node
  • CPU spikes after relocation

Many environments discover their failover design weaknesses only during maintenance windows.

That is why smart DBAs test service relocation long before actual patch night.


Scheduler Jobs and Crontab Control

Leaving batch jobs running during RAC patching creates unnecessary instability.

Common offenders:

  • ETL jobs
  • shell-based exports
  • stats gathering
  • custom monitoring scripts
  • filesystem cleanup jobs
  • application cron activity

Disable when appropriate.

EXEC DBMS_SCHEDULER.DISABLE('JOB_NAME');

Also review OS cron entries with crontab -l for every OS user that runs jobs against the cluster (oracle, grid, root, and application accounts).

Some outages blamed on patching were actually caused by overlapping maintenance scripts restarting services unexpectedly during CRS operations.
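
Where cron entries do need to be paused, commenting them out with a recognizable marker makes the change easy to reverse after the window. The sketch below works on crontab text passed through stdin; the marker string and the helper names disable_cron_lines and enable_cron_lines are hypothetical.

```shell
# Sketch only: the marker and helper names are assumptions.

# Prefixes every active line (not blank, not already a comment) with a marker.
# Variable assignments such as MAILTO=... are commented out as well, which is harmless.
disable_cron_lines() {
    sed -E 's/^([[:space:]]*[^#[:space:]].*)$/#PATCH_HOLD# \1/'
}

# Removes the marker again, restoring the original entries.
enable_cron_lines() {
    sed 's/^#PATCH_HOLD# //'
}
```

A typical flow would save the current table with crontab -l > cron.before, install the paused version with disable_cron_lines < cron.before | crontab -, and reinstall cron.before once patching is complete. Keeping the untouched copy on disk matters more than the marker itself.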


Test the Patch Outside Production

This sounds obvious, but the main question is not whether the patch installs successfully.

The real value of staging tests is discovering operational side effects:

  • startup delays
  • datapatch runtime
  • clusterware restart behavior
  • application reconnection timing
  • service failback issues
  • monitoring false alarms
  • ASM rebalance behavior
  • backup agent failures

Even a small sandbox RAC reveals patterns that documentation never mentions.


Production Failure Scenario: GI Patch Left CRS Half Online

One environment had a two-node RAC where patching failed after the first node reboot.

Symptoms included:

  • CRS stack partially online
  • ASM running
  • database resources offline
  • node eviction messages
  • CSS communication instability

The initial assumption was patch corruption.

Actual root cause was filesystem exhaustion inside the Grid Home during inventory update.

The patch had technically succeeded on one node but failed during final inventory synchronization.

Because no Grid Home backup had been taken immediately before patching, rollback became extremely complicated.

Recovery required:

  • restoring Grid Home from backup
  • rebuilding inventory pointers
  • re-running rootcrs scripts
  • OCR validation
  • CRS startup sequencing corrections

The outage lasted far longer than the patching itself.

The painful lesson: filesystem validation would have prevented the entire incident.



Conclusion

RAC patching is rarely dangerous because of the documented procedure.

It becomes dangerous because environments accumulate hidden operational debt over time.

Inventory inconsistencies. Weak backups. Unstable CRS resources. ASM pressure. Untested failover. Forgotten one-off patches. Silent corruption. Broken monitoring assumptions.

Experienced DBAs treat prerequisites as the actual patching work.

The patch command itself is usually the easiest part of the night.

The environments that patch smoothly are not necessarily the ones with the best automation.

They are the ones where DBAs continuously validate cluster health long before maintenance begins.


FAQs

Should RAC patching always be rolling?

Not necessarily. Some patches must be applied non-rolling, depending on GI version, patch type, or known issues. Always check the patch README carefully.


Is opatchauto always safer than manual patching?

Not always.

opatchauto simplifies orchestration but can hide failures behind automation layers. Experienced DBAs still validate each underlying component manually.


Why does datapatch fail even after successful binary patching?

Usually because of invalid components, dictionary inconsistencies, Java issues, or registry problems already existing before maintenance.


How much free space should exist before patching?

Many DBAs use a practical minimum of 3-4x patch size across Oracle Home, Grid Home, and staging areas.

More is safer during rollback situations.


Why validate backups if RMAN jobs already succeed daily?

Because successful backups do not guarantee successful restores.

Recovery validation and corruption checks are separate operational responsibilities.


Should scheduler jobs always be disabled?

Not always, but long-running jobs, ETL pipelines, exports, and automation scripts should be reviewed carefully before maintenance begins.
