Sunday, 26 April 2026

ORA-01017 in RAC 12c and Above? Stepwise Permission Fix & Cause Identification

  As an Oracle DBA, few things are more frustrating than a sudden loss of remote connectivity right after a routine SYS password reset. You type in the credentials, and bam -- ORA-01017 greets you, even though your local connections work fine. In production RAC environments, this isn’t just about a mistyped password.

Recently, I faced a tricky scenario in an Oracle 19c RAC setup with proper role separation between the grid and oracle OS users. What seemed like a simple password mismatch quickly unraveled into a multi-layered “permission deadlock,” involving rogue listeners, contaminated IPC sockets, and GPnP directory access issues. It took a careful, stepwise approach to restore connectivity across all nodes without compromising the cluster.



Saturday, 18 April 2026

Oracle RAC Load Balancing Internals Explained

 Most RAC clusters look healthy until the workload shifts suddenly.

A reporting job starts hammering one node. Connection pools keep sending sessions to the same instance. CPU climbs, gc waits spike, application response times become unpredictable, and suddenly everyone starts blaming storage, SQL plans, or the network.

But many times, the real problem sits in the RAC connection routing layer itself.

I have seen large RAC environments where all nodes were technically UP, yet one instance was drowning while another sat nearly idle. The cluster wasn’t failing. The load balancing strategy was.

Oracle RAC load balancing is often misunderstood because people assume SCAN alone magically distributes workload intelligently. It does not.

There are multiple layers involved:

  • client-side balancing
  • listener-based balancing
  • runtime advisory feedback
  • FAN/ONS notifications
  • connection pool behavior
  • service-level workload management

And if even one layer is misconfigured, RAC can behave very differently from what DBAs expect.

This article breaks down how Oracle RAC actually distributes workload internally, where the common operational failures happen, and what DBAs should monitor before imbalance becomes an outage.


RAC Load Balancing Is Not One Feature

Oracle RAC does not use a single load balancing mechanism.

Instead, workload distribution happens across several independent layers that cooperate with one another.

At a high level, the flow works roughly like this: the client picks an address from its list (client-side balancing), the SCAN listener redirects the request toward the least-loaded node listener (server-side balancing), and advisory metrics flow back through FAN/ONS so that intelligent connection pools can keep adapting at runtime.

The important thing here: Not every client participates equally.

A SQL*Plus connection behaves differently from a JDBC UCP pool. OCI behaves differently from a third-party app server. Some clients understand Runtime Load Balancing. Others only perform basic random distribution.

That distinction matters heavily during failovers and uneven workload conditions.


Client-Side Load Balancing: The Most Misunderstood Layer

Most people encounter RAC balancing through the TNS entry:

(DESCRIPTION=
  (LOAD_BALANCE=ON)
  (FAILOVER=ON)
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan1)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan2)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan3)(PORT=1521))
  )
  (CONNECT_DATA=
    (SERVICE_NAME=prodapp)
  )
)

DBAs often assume this means Oracle intelligently chooses the least loaded node. It does not.

With client-side load balancing enabled, the client simply picks one SCAN listener randomly.

That’s all.

No CPU awareness.

No session awareness.

No response time awareness.

 

This layer exists mainly to distribute initial connection attempts.

In smaller systems, that may be sufficient. In busy OLTP systems with connection storms, it becomes inadequate quickly.

I have seen environments where:

  • one node received most pooled reconnects after firewall resets
  • app pools pinned themselves unintentionally
  • connection reuse caused severe skew
  • middle-tier pools ignored runtime advisory updates entirely

The RAC cluster looked balanced from the CRS perspective, but actual application traffic was heavily concentrated.

That is where server-side intelligence becomes critical.


Server-Side Load Balancing: Where RAC Starts Making Decisions

Once the SCAN listener receives the request, Oracle can redirect the connection toward a better target instance.

This is Server-Side Load Balancing (SSLB).

Instead of random placement, the listener evaluates advisory metrics published by RAC instances.

The workflow looks like this:

Client
  ↓
SCAN Listener
  ↓
Node Listener
  ↓
Best Candidate RAC Instance

This depends on several moving pieces functioning correctly:

  • SCAN listeners
  • remote_listener configuration
  • dynamic service registration
  • LREG background process
  • service metrics publication

A surprising number of RAC environments have partial failures here.

One node stops dynamically registering correctly. Remote listener entries become stale. DNS inconsistencies appear. SCAN VIP relocation behaves unexpectedly after patching.

Then connections start concentrating unevenly.

You can validate registration health quickly:

show parameter remote_listener;

Check service registration:

lsnrctl status

Or directly from CRS:

srvctl status service -d PROD

Watch for services missing from one node.

That usually shows up long before application teams notice imbalance.
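
You can also confirm this from inside the database. A minimal check, reusing the prodapp service from the earlier TNS example, looks like this:

-- which instances is the service currently registered on?
SELECT inst_id, name
FROM   gv$active_services
WHERE  name = 'prodapp'
ORDER  BY inst_id;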


The Load Balancing Advisory: RAC's Real Intelligence Layer

The real intelligence inside RAC balancing comes from the Load Balancing Advisory (LBA).

This is the component many DBAs never investigate deeply.

Each RAC instance continuously publishes performance metrics including:

  • active sessions
  • service response times
  • throughput
  • CPU pressure
  • service quality metrics

These metrics feed into Oracle Clusterware and are propagated through ONS.

RAC essentially keeps scoring instance health continuously.

The listener then uses these scores when deciding where to place new sessions.
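
The advisory output itself is visible from SQL. A rough check against the same prodapp service (a lower goodness value generally means a more attractive placement target):

-- per-instance load balancing advisory scores
SELECT inst_id, service_name, goodness, delta
FROM   gv$servicemetric
WHERE  service_name = 'prodapp'
ORDER  BY inst_id;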


Typical values:

  • GOAL (runtime load balancing goal): SERVICE_TIME for response-time-sensitive OLTP, THROUGHPUT for batch-style work, NONE to opt the service out of the advisory
  • CLB_GOAL (connection load balancing goal): SHORT for pools that establish connections frequently, LONG for long-lived dedicated sessions

A badly chosen service goal can create subtle imbalance.

For example:

  • batch workloads using SHORT goals
  • OLTP services configured for THROUGHPUT
  • mixed workloads sharing the same service

These issues rarely fail dramatically. They fail slowly through latency drift and uneven resource pressure.
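
When a goal does need correcting, srvctl is the usual route. A rough sketch against the PROD database and prodapp service used earlier:

# set the runtime and connection load balancing goals on the service
srvctl modify service -db PROD -service prodapp -rlbgoal SERVICE_TIME -clbgoal SHORT

# confirm the current settings
srvctl config service -db PROD -service prodapp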


ONS and FAN: The Parts That Usually Break Quietly

Oracle Notification Service (ONS) is responsible for distributing Fast Application Notification (FAN) events.

This becomes essential for:

  • Fast Connection Failover (FCF)
  • Runtime Connection Load Balancing (RCLB)
  • adaptive pool balancing
  • fast dead connection cleanup

Without healthy ONS communication, many application pools behave badly during failures.

Common symptoms:

  • application hangs after node failure
  • stale pooled sessions
  • reconnect storms
  • long TCP timeout waits
  • uneven pool distribution
  • connection spikes after VIP relocation

DBAs frequently validate database health but forget ONS entirely.

That becomes dangerous during failovers.

You can inspect ONS configuration:

srvctl config nodeapps

And verify FAN-related events inside alert logs and clusterware logs. One subtle issue I have seen repeatedly is firewalls silently dropping ONS traffic between middleware and cluster nodes. Everything appears healthy until failover occurs.

Then the app hangs for minutes because pools never received FAN notifications.
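
A rough way to catch that class of problem early is to test the ONS remote port from the middleware hosts themselves. The hostnames below are placeholders and the port is only the common default; use whatever srvctl config nodeapps actually reports:

# from an app server, confirm the ONS remote port is reachable on each node
nc -vz racnode1 6200
nc -vz racnode2 6200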


Runtime Connection Load Balancing Changes Everything

Runtime Connection Load Balancing (RCLB) works differently from initial listener balancing.

This operates inside intelligent connection pools such as:

  • UCP
  • OCI pools
  • JDBC RAC-aware pools

Instead of merely routing initial sessions, the pool continuously adapts using RAC advisory updates.

That means:

  • overloaded instances receive fewer new requests
  • unhealthy nodes lose pooled traffic gradually
  • idle sessions get drained intelligently
  • reconnects prefer healthier instances

This is where RAC becomes truly adaptive.

Without RCLB, connection balancing is mostly static.

With RCLB enabled correctly, RAC reacts dynamically to workload shifts.

Unfortunately, many applications claim RAC support while ignoring these capabilities entirely.

The result:  RAC nodes stay alive, but workload distribution becomes operationally ugly.


What Happens During Node Failure

This is where RAC architecture earns its reputation.

Suppose NODE2 crashes unexpectedly.

The flow typically looks like this:

NODE2 Failure
  ↓
VIP Relocation
  ↓
ONS/FAN Notification
  ↓
Connection Pools Mark Sessions Dead
  ↓
SCAN Redirects New Connections
  ↓
Remaining Nodes Absorb Workload

  1. If configured properly, failover is extremely fast.
  2. If not, applications may sit on dead TCP sessions waiting for timeout expiration.

That distinction matters enormously during outages.

A properly tuned RAC failover may recover within seconds.

A poorly configured one can create cascading application failures lasting several minutes.
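
One mitigation, sketched here with illustrative timeout values added to the same descriptor shown earlier, is to bound connect attempts so that new connections fail fast instead of hanging on a dead address:

(DESCRIPTION=
  (CONNECT_TIMEOUT=5)
  (TRANSPORT_CONNECT_TIMEOUT=3)
  (RETRY_COUNT=3)
  (LOAD_BALANCE=ON)
  (FAILOVER=ON)
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan1)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan2)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan3)(PORT=1521))
  )
  (CONNECT_DATA=
    (SERVICE_NAME=prodapp)
  )
)

Note that these settings only protect new connection attempts; already-established dead sessions still depend on FAN/FCF or TCP keepalive tuning.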


RAC Load Balancing Problems That Appear at Scale

Small RAC systems often mask configuration mistakes; large systems expose them brutally.

Common scaling problems include:

Session Stickiness

Connection pools reuse existing sessions heavily.

Even if RAC redistributes new connections properly, old sessions may stay concentrated forever.


Service Misplacement

DBAs sometimes place reporting and OLTP services on the same preferred instances.

Then batch spikes crush latency-sensitive traffic.

Better design:

  • separate services
  • separate service goals
  • workload isolation
  • explicit preferred/available instance definitions


Connection Storms

After network interruptions, thousands of clients reconnect simultaneously.

This creates:

  • listener pressure
  • CPU spikes
  • authentication storms
  • excessive logon triggers
  • library cache contention

RAC balancing helps, but only partially.

The real fix often involves application-side reconnect throttling.


RAC Internals Still Matter: GCS and GES

Even perfect session balancing cannot save poorly distributed data access patterns.

RAC nodes constantly synchronize:

  • data blocks
  • row locks
  • cache ownership

This happens through:

  • GCS (Global Cache Service)
  • GES (Global Enqueue Service)

Poor application partitioning creates excessive interconnect traffic.

Then you start seeing waits like:

gc cr request
gc buffer busy acquire
gc current block busy
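
A rough cross-instance check of how much time is going into those waits (cumulative since instance startup, time_waited in centiseconds):

-- global cache wait time per instance and event
SELECT inst_id, event, total_waits, time_waited
FROM   gv$system_event
WHERE  event LIKE 'gc%'
ORDER  BY time_waited DESC;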

At that point, the problem is no longer connection balancing.

It becomes workload locality. I have seen applications scale from 2-node RAC to 8-node RAC and actually perform worse because the workload generated enormous block shipping overhead.

More nodes do not automatically mean better scalability. Sometimes they increase contention dramatically.


Production Failure Scenario

One environment I supported had a 4-node RAC cluster handling heavy OLTP traffic.

Symptoms looked strange:

  • NODE1 constantly above 85% CPU
  • NODE3 and NODE4 mostly idle
  • application latency spikes during business hours
  • SCAN listeners healthy
  • no obvious database errors

Initial suspicion focused on SQL tuning.

But session distribution showed almost 70% of connections pinned to NODE1.
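
A quick way to see that kind of skew, counting only user sessions:

-- user session count per instance
SELECT inst_id, COUNT(*) AS sessions
FROM   gv$session
WHERE  type = 'USER'
GROUP  BY inst_id
ORDER  BY inst_id;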

Root cause turned out to be:

  • JDBC pools not configured for RCLB
  • ONS blocked partially by firewall rules
  • old stale pool behavior after previous maintenance

The cluster itself was healthy.

The middleware simply stopped reacting to RAC advisory updates.

Fix involved:

  • correcting ONS ports
  • enabling FAN subscriptions
  • validating UCP runtime balancing
  • draining existing pools gradually

Within hours, workload normalized across all nodes.

No SQL tuning was needed at all.


DBA Insights Most Teams Learn Too Late

SCAN Is Not Intelligent Balancing

SCAN helps availability and routing. It is not workload intelligence by itself.


Connection Pools Matter More Than DBAs Think

Many balancing failures originate outside the database. The middleware layer often decides how well RAC actually behaves.


ONS Failures Hide Quietly

ONS issues rarely trigger obvious alarms. But they destroy failover responsiveness.


Session Counts Alone Are Misleading

Equal session distribution does not mean equal workload. One node may run heavier SQL while another handles mostly idle sessions.

Always correlate:

  • CPU
  • DB time
  • service metrics
  • response times
  • gc waits


RAC Cannot Fix Bad Application Design

Hot block contention remains hot block contention. RAC sometimes amplifies bad locality patterns instead of solving them.


Conclusion

Oracle RAC load balancing is far more than SCAN listeners and a TNS parameter.

Real balancing depends on cooperation between:

  • listeners
  • services
  • ONS
  • FAN
  • advisory metrics
  • connection pools
  • application behavior

When all layers work together, RAC handles workload shifts remarkably well.

When one layer breaks, imbalance develops quietly until users start feeling it.

The dangerous part is that most RAC balancing failures do not crash the cluster. They degrade performance gradually through skewed workload distribution, stale pools, and slow failover behavior.


Experienced DBAs learn to monitor the connection path itself, not just the database health.

Because many RAC incidents begin long before the instance actually goes down.




Saturday, 4 April 2026

Oracle Data Pump Migrations: What Breaks in Real Upgrades

Most DBAs discover the real complexity of Oracle database migrations only after the first large-scale Export/Import cutover goes sideways. On paper, Oracle Data Pump looks straightforward. Export the database, import it into a newer release, validate the objects, switch applications, done.

Reality is usually messier.