Most RAC patching failures do not happen because somebody forgot the actual patching command.
They happen because the environment was already unhealthy before patching even started.
A CRS resource was unstable for weeks. ASM diskgroup usage was already close to critical. One node had intermittent packet loss. OPatch inventory was inconsistent. Backup validation had never been tested. Somebody skipped conflict analysis because "last quarter's patch worked fine."
Then patch night arrives.
What should have been a rolling maintenance activity suddenly turns into a bridge call involving storage, Linux, network, middleware, and management asking why one RAC node refuses to rejoin the cluster.
That is why experienced RAC DBAs pay more attention to prerequisites than to the actual patch steps.
The patching procedure itself is usually documented well. The dangerous part is everything surrounding it.
This article focuses on the operational side of RAC patch preparation that interview questions rarely explore deeply enough. Not checklist-level theory, but the practical validations that prevent ugly midnight failures during GI and database patching.
Why RAC Patching Fails More Often Than Expected
RAC environments amplify small infrastructure problems.
A standalone database with minor issues may continue operating for months without visible impact. RAC behaves differently. Clusterware, interconnect communication, OCR, voting disks, ASM dependencies, service relocation, CRS resource management, and rolling operations all depend on stability across nodes.
Patching stresses all of those layers simultaneously.
During patching, you are effectively testing:
- Cluster resiliency
- Node communication
- CRS health
- Inventory consistency
- ASM availability
- Storage latency
- Service failover behavior
- Startup sequencing
- Database recovery behavior
That is why experienced DBAs never trust a RAC environment simply because applications are currently online.
A cluster surviving workload is not the same as a cluster surviving maintenance.
Start With Inventory and OPatch Validation
One of the most overlooked failures in RAC patching is OPatch incompatibility.
DBAs often download the patch, copy it to the server, and immediately start patching without validating whether the OPatch version supports the patch bundle.
Then patching aborts halfway with inventory errors or unsupported version messages.
Always verify OPatch first.
$ORACLE_HOME/OPatch/opatch version
Compare the version with the patch README requirements.
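The comparison is easy to get wrong by eye, because dotted versions do not sort as plain strings. Below is a minimal sketch of a field-by-field check. It assumes `opatch version` prints a line of the form `OPatch Version: <dotted version>`; the function names are hypothetical, and the minimum version has to come from the README of the patch you are applying.

```shell
# Extract the version string from `opatch version` output read on stdin.
# Assumes a line such as "OPatch Version: 12.2.0.1.37".
opatch_version_from_output() {
    awk -F': *' '/^OPatch Version/ { print $2; exit }'
}

# Return 0 if dotted version $1 >= dotted version $2.
# Fields are compared numerically, left to right; missing fields count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0; yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Usage (the minimum version here is a placeholder, not a real requirement):
#   current=$("$ORACLE_HOME/OPatch/opatch" version | opatch_version_from_output)
#   version_ge "$current" "12.2.0.1.30" || echo "OPatch $current is too old"
```

Run the same check against the Grid home's OPatch as well, since the two homes carry separate copies.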
Also validate the inventory itself.
$ORACLE_HOME/OPatch/opatch lsinventory
If inventory output already shows warnings before patching, stop immediately.
Do not assume patching will "fix" inventory inconsistencies. It usually makes recovery harder.
In large RAC environments with multiple Oracle homes, inventory corruption becomes surprisingly common after years of clone operations, abandoned homes, failed one-off patches, or partial GI upgrades.
Space Checks Matter More Than People Think
Patch staging failures are common in RAC systems running on aging filesystems.
DBAs check Oracle Home space but forget:
- Grid Infrastructure home
- Central inventory
- /tmp
- patch staging filesystem
- backup mount locations
Oracle recommendations often mention requiring several times the patch size in free space. That estimate is not exaggerated.
During rollback scenarios, temporary extraction, inventory updates, and backup copies can consume much more space than expected.
Check every one of these locations with df -h before staging the patch, not just the Oracle Home.
Pay close attention to:
- ORACLE_HOME
- GRID_HOME
- oraInventory
- ACFS filesystems
- shared mount points
A full filesystem during GI patching can leave CRS partially upgraded, which is one of the worst recovery situations to handle under pressure.
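A space check like this is worth scripting so it runs the same way on every node. The sketch below is one possible shape, not an Oracle-provided tool: the function name and the default multiplier of 3 are assumptions, and the patch size is expected in KB (for example from `du -sk` on the unzipped patch directory).

```shell
# Return 0 if the filesystem holding $1 has at least ($2 * $3) KB available.
#   $1 = path to check (e.g. "$ORACLE_HOME")
#   $2 = patch size in KB
#   $3 = safety multiplier (assumed default: 3)
check_free_space() {
    target=$1; patch_kb=$2; mult=${3:-3}
    # df -P prints one POSIX-format line per filesystem; column 4 is available KB.
    free_kb=$(df -Pk "$target" | awk 'NR == 2 { print $4 }')
    need_kb=$((patch_kb * mult))
    if [ "$free_kb" -ge "$need_kb" ]; then
        echo "OK   $target: $free_kb KB free, $need_kb KB wanted"
    else
        echo "LOW  $target: $free_kb KB free, $need_kb KB wanted"
        return 1
    fi
}

# Usage:
#   patch_kb=$(du -sk /path/to/unzipped/patch | awk '{ print $1 }')
#   for d in "$ORACLE_HOME" "$GRID_HOME" /tmp; do
#       check_free_space "$d" "$patch_kb" 3 || exit 1
#   done
```

The function reports on the filesystem that contains the path, so two homes on the same mount point compete for the same free space and should be budgeted together.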
Backups Are Meaningless Without Validation
Most teams say they have backups.
Far fewer know whether recovery actually works.
Before RAC patching, experienced DBAs validate three different things:
- Database recoverability
- Archive log continuity
- Oracle Home recoverability
RMAN backup alone is not enough.
Run validation checks.
RMAN> VALIDATE DATABASE;
Also validate archive logs.
RMAN> CROSSCHECK ARCHIVELOG ALL;
RMAN> VALIDATE ARCHIVELOG ALL;
Corruption discovered after patching is operationally brutal because now you are troubleshooting both recovery and patching simultaneously.
For Oracle Home and Grid Home backups, many DBAs still rely on filesystem tar backups because they are fast during rollback situations.
Example:
tar -cvpf grid_home_backup.tar $GRID_HOME
tar -cvpf oracle_home_backup.tar $ORACLE_HOME
These backups have saved many DBAs during failed rollback scenarios where OPatch rollback itself became unstable.
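A tar file that was never read back is the same kind of unvalidated backup discussed above. Here is a minimal sketch that archives a home and immediately lists the archive to confirm it is readable. The function name is hypothetical, and the sketch assumes it runs as a user able to read every file in the home (the Grid home contains root-owned files, so that usually means root).

```shell
# Archive a home directory and confirm the archive can be read back.
#   $1 = directory to back up, $2 = archive file to create
backup_home() {
    src=$1; archive=$2
    # -C stores paths relative to the parent directory of the home.
    tar -cpf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Listing the archive reads it end to end; a truncated file fails here.
    tar -tf "$archive" >/dev/null || return 1
    echo "verified: $archive"
}

# Usage:
#   backup_home "$GRID_HOME"   /backup/grid_home_backup.tar   || exit 1
#   backup_home "$ORACLE_HOME" /backup/oracle_home_backup.tar || exit 1
```

Listing proves the archive is readable, not that a restore produces a working home; that still needs an occasional restore test on a non-production host.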
Cluster Health Checks Before Patching
This is the step most interview answers oversimplify.
"Run health checks." What exactly are you checking?
Real RAC health verification usually includes:
CRS Status
crsctl stat res -t
You are looking for:
- intermittent OFFLINE resources
- UNKNOWN states
- restart loops
- node-specific failures
A resource already unstable before patching becomes much harder to diagnose afterward.
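Scanning the tabular output by eye works on a small cluster but gets unreliable at scale. The sketch below filters the non-tabular form instead, assuming that `crsctl stat res` without `-t` prints `NAME=`, `TARGET=` and `STATE=` lines per resource. The function name is hypothetical and the rule is a deliberately conservative heuristic, so it may over-report resources that are intentionally offline on one node.

```shell
# Read `crsctl stat res` output on stdin and print resources that look unhealthy:
# any UNKNOWN or INTERMEDIATE state, or a target of ONLINE with a state of OFFLINE.
crs_suspect_resources() {
    awk -F= '
        $1 == "NAME"   { name = $2 }
        $1 == "TARGET" { target = $2 }
        $1 == "STATE"  {
            state = $2
            if (state ~ /UNKNOWN|INTERMEDIATE/ ||
                (target ~ /ONLINE/ && state ~ /OFFLINE/))
                print name ": " state
        }'
}

# Usage:
#   crsctl stat res | crs_suspect_resources
```

Capturing this output a few times in the days before the window helps separate a resource that is flapping from one that is simply stopped.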
Cluster Verification Utility
cluvfy comp healthcheck -collect cluster -bestpractice
CVU output often reveals ignored issues:
- multicast failures
- interconnect latency
- time synchronization drift
- permission inconsistencies
- OCR accessibility warnings
Many DBAs ignore warnings because the cluster is currently operational.
That is dangerous during rolling patching.
ASM Health
SELECT name, state, type, total_mb, free_mb
FROM v$asm_diskgroup;
Watch for:
- rebalance operations
- nearly full diskgroups
- offline disks
- high ASM alert log activity
Patching while ASM is already degraded increases restart risk significantly.
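"Nearly full" is easier to act on as a number. As a rough sketch, the helper below turns the total_mb and free_mb columns from the query above into a percentage and prints diskgroups at or above a threshold. It assumes the query output has been reduced to whitespace-separated `name total_mb free_mb` lines (for example with headings switched off in SQL*Plus); the function name and the 90 percent threshold are illustrative.

```shell
# Read "name total_mb free_mb" lines on stdin;
# print diskgroups whose used space is at or above $1 percent.
asm_usage_over() {
    awk -v limit="$1" '
        NF >= 3 && $2 > 0 {
            used = ($2 - $3) * 100 / $2
            if (used >= limit) printf "%s %.1f%% used\n", $1, used
        }'
}

# Usage (threshold is an example, not a recommendation):
#   asm_usage_over 90 < diskgroups.txt
```

free_mb does not account for mirroring, so on normal or high redundancy diskgroups the usable_file_mb column of v$asm_diskgroup gives a more honest picture.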
Conflict Detection Is Mandatory
Skipping conflict analysis is one of the fastest ways to ruin a maintenance window.
Always run:
$ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -ph ./
Or:
opatchauto apply -analyze
The -analyze mode catches many issues before downtime begins.
Particularly dangerous areas include:
- one-off patches
- JVM patches
- OJVM conflicts
- Data Guard environments with inconsistent patch levels
- GI/database version mismatches
Some conflicts only appear on specific nodes because inventory drift exists between RAC members.
That is why node-to-node inventory consistency matters.
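One way to check for drift is to capture `opatch lspatches` from the same home on each node and compare the patch IDs. The sketch below assumes each patch line starts with a numeric ID followed by a semicolon; the function name and file names are hypothetical.

```shell
# Print patch IDs that appear in only one of two `opatch lspatches` outputs.
#   $1, $2 = files holding the output captured on each node
patch_drift() {
    ids1=$(mktemp); ids2=$(mktemp)
    # Keep only lines that begin with a numeric patch ID followed by ";".
    grep -E '^[0-9]+;' "$1" | cut -d';' -f1 | sort -u > "$ids1"
    grep -E '^[0-9]+;' "$2" | cut -d';' -f1 | sort -u > "$ids2"
    # comm -3 hides common lines: column 1 is unique to $1, column 2 (tab-indented) to $2.
    comm -3 "$ids1" "$ids2" | awk -F'\t' '
        $1 != "" { print "first only: " $1 }
        $1 == "" { print "second only: " $2 }'
    rm -f "$ids1" "$ids2"
}

# Usage:
#   "$ORACLE_HOME/OPatch/opatch" lspatches > /tmp/patches_node1.txt   # on node 1
#   "$ORACLE_HOME/OPatch/opatch" lspatches > /tmp/patches_node2.txt   # on node 2
#   patch_drift /tmp/patches_node1.txt /tmp/patches_node2.txt
```

Empty output means both nodes report the same patch IDs for that home. Repeat the comparison for the Grid home.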
Invalid Objects and Registry Validation
Before patching, compile invalid objects.
@?/rdbms/admin/utlrp.sql
Then review component status.
SELECT comp_name, status, version
FROM dba_registry;
DBAs sometimes ignore invalid components because applications seem unaffected.
During patching, however, invalid Java components, XML DB issues, or partially upgraded registry entries can cause datapatch failures later.
Datapatch problems are notoriously painful because binary patching may succeed while SQL patching fails afterward.
That creates version inconsistencies between binaries and database metadata.
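To turn the registry review into a gate, a script can refuse to continue when any component is not VALID. This is only a sketch: it assumes the dba_registry output has been spooled as `COMP_NAME|STATUS` lines (for example with a `|` column separator in SQL*Plus), and the function name is hypothetical.

```shell
# Read "COMP_NAME|STATUS" lines on stdin and print components that are not VALID.
# Returns 1 if any were found so the calling script can stop.
registry_not_valid() {
    awk -F'|' '
        NF >= 2 {
            # Trim the padding SQL*Plus adds around column values.
            gsub(/^[ \t]+|[ \t]+$/, "", $1)
            gsub(/^[ \t]+|[ \t]+$/, "", $2)
            if ($2 != "VALID") { print $1 ": " $2; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }'
}

# Usage:
#   registry_not_valid < registry.txt || echo "Review components before patching"
```

Not every non-VALID status is an error (a component can be deliberately switched off), so treat the output as a review list, not an automatic verdict.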
Service Relocation Before Rolling Patching
Good RAC patching minimizes application disruption.
Before patching a node, services are usually relocated away from it.
Example:
srvctl relocate service -db PRODDB -service APP_SVC -oldinst PROD1 -newinst PROD2
This sounds simple until you discover:
- connection pools not reconnecting
- FAN events not functioning correctly
- application affinity hardcoded to one node
- overloaded surviving node
- CPU spikes after relocation
Many environments discover their failover design weaknesses only during maintenance windows.
That is why smart DBAs test service relocation long before actual patch night.
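For a node running many services, generating the relocate commands first and reviewing them is safer than typing them live. The sketch below only prints commands in the same form as the example above; the function name is hypothetical, and the service list is assumed to be one name per line.

```shell
# Print (do not run) one srvctl relocate command per service name read on stdin.
#   $1 = database unique name, $2 = instance being patched, $3 = target instance
relocate_plan() {
    db=$1; from=$2; to=$3
    while IFS= read -r svc; do
        [ -n "$svc" ] || continue   # skip blank lines
        echo "srvctl relocate service -db $db -service $svc -oldinst $from -newinst $to"
    done
}

# Usage:
#   relocate_plan PRODDB PROD1 PROD2 < services_on_prod1.txt > relocate.sh
#   # review relocate.sh, then run it as the Oracle software owner
```

Keeping the plan as a file also leaves a record of what was moved, which makes the failback after patching easier to reverse.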
Scheduler Jobs and Crontab Control
Leaving batch jobs running during RAC patching creates unnecessary instability.
Common offenders:
- ETL jobs
- shell-based exports
- stats gathering
- custom monitoring scripts
- filesystem cleanup jobs
- application cron activity
Disable when appropriate.
EXEC DBMS_SCHEDULER.DISABLE('JOB_NAME');
Also review the OS cron entries of every relevant user with crontab -l.
Some outages blamed on patching were actually caused by overlapping maintenance scripts restarting services unexpectedly during CRS operations.
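Crontabs on long-lived servers are mostly comments and dead entries, which makes the live ones easy to miss. Below is a small sketch that filters `crontab -l` output down to the lines that will actually run. The function name is hypothetical, and the user names in the usage example are assumptions about a typical install.

```shell
# Print active crontab entries from stdin, skipping blank lines,
# comments, and environment assignments such as MAILTO=dba.
active_cron_entries() {
    grep -Ev '^[[:space:]]*(#|$)' |
        grep -Ev '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*[[:space:]]*='
}

# Usage (as root; adjust the user list to your environment):
#   for u in oracle grid root; do
#       echo "== $u"
#       crontab -l -u "$u" 2>/dev/null | active_cron_entries
#   done
```

This covers per-user crontabs only; jobs under /etc/cron.d and similar directories need a separate look.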
Test the Patch Outside Production
This sounds obvious, but the point of a test run is not simply to confirm that the patch installs successfully.
The real value of staging tests is discovering operational side effects:
- startup delays
- datapatch runtime
- clusterware restart behavior
- application reconnection timing
- service failback issues
- monitoring false alarms
- ASM rebalance behavior
- backup agent failures
Even a small sandbox RAC reveals patterns that documentation never mentions.
Production Failure Scenario: GI Patch Left CRS Half Online
One environment had a two-node RAC where patching failed after the first node reboot.
Symptoms included:
- CRS stack partially online
- ASM running
- database resources offline
- node eviction messages
- CSS communication instability
The initial assumption was patch corruption.
The actual root cause was filesystem exhaustion inside the Grid Home during the inventory update.
The patch had technically succeeded on one node but failed during final inventory synchronization.
Because no Grid Home backup existed, rollback became extremely complicated.
Recovery required:
- restoring Grid Home from backup
- rebuilding inventory pointers
- re-running rootcrs scripts
- OCR validation
- CRS startup sequencing corrections
The outage lasted far longer than the patching itself.
The painful lesson: filesystem validation would have prevented the entire incident.
Conclusion
RAC patching is rarely dangerous because of the documented procedure.
It becomes dangerous because environments accumulate hidden operational debt over time.
Inventory inconsistencies. Weak backups. Unstable CRS resources. ASM pressure. Untested failover. Forgotten one-off patches. Silent corruption. Broken monitoring assumptions.
Experienced DBAs treat prerequisites as the actual patching work.
The patch command itself is usually the easiest part of the night.
The environments that patch smoothly are not necessarily the ones with the best automation.
They are the ones where DBAs continuously validate cluster health long before maintenance begins.
FAQs
Should RAC patching always be rolling?
Not necessarily. Some patches require non-rolling application depending on GI version, patch type, or known issues. Always verify README documentation carefully.
Is opatchauto always safer than manual patching?
Not always.
opatchauto simplifies orchestration but can hide failures behind automation layers. Experienced DBAs still validate each underlying component manually.
Why does datapatch fail even after successful binary patching?
Usually because of invalid components, dictionary inconsistencies, Java issues, or registry problems already existing before maintenance.
How much free space should exist before patching?
Many DBAs use a practical minimum of 3-4x patch size across Oracle Home, Grid Home, and staging areas.
More is safer during rollback situations.
Why validate backups if RMAN jobs already succeed daily?
Because successful backups do not guarantee successful restores.
Recovery validation and corruption checks are separate operational responsibilities.
Should scheduler jobs always be disabled?
Not always, but long-running jobs, ETL pipelines, exports, and automation scripts should be reviewed carefully before maintenance begins.