Most RAC clusters look healthy until the workload shifts
suddenly.
A reporting job starts hammering one node. Connection pools keep sending
sessions to the same instance. CPU climbs, gc waits spike, application
response times become unpredictable, and suddenly everyone starts blaming
storage, SQL plans, or the network.
But many times, the real problem sits in the RAC connection routing layer
itself.
I have seen large RAC environments where all nodes were technically UP, yet
one instance was drowning while another sat nearly idle. The cluster wasn’t
failing. The load balancing strategy was.
Oracle RAC load balancing is often misunderstood because people assume SCAN
alone magically distributes workload intelligently. It does not.
There are multiple layers involved:
- client-side balancing
- listener-based balancing
- runtime advisory feedback
- FAN/ONS notifications
- connection pool behavior
- service-level workload management
And if even one layer is misconfigured, RAC can behave very differently
from what DBAs expect.
This article breaks down how Oracle RAC actually distributes workload
internally, where the common operational failures happen, and what DBAs
should monitor before imbalance becomes an outage.
RAC Load Balancing Is Not One Feature
Oracle RAC does not use a single load balancing mechanism.
Instead, workload distribution happens across several independent layers
that cooperate.
At a high level, the architecture looks roughly like this:
Client-Side Balancing
↓
Listener-Based Balancing
↓
Runtime Advisory Feedback (FAN/ONS)
↓
Connection Pool Behavior
↓
Service-Level Workload Management
The important thing here: Not every client participates equally.
A SQL*Plus
connection behaves differently from a JDBC UCP pool. OCI behaves differently
from a third-party app server. Some clients understand Runtime Load
Balancing. Others only perform basic random distribution.
That distinction matters heavily during failovers and uneven workload
conditions.
Client-Side Load Balancing: The Most Misunderstood Layer
Most people encounter RAC balancing through the TNS entry:
(DESCRIPTION=
  (LOAD_BALANCE=ON)
  (FAILOVER=ON)
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan1)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan2)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan3)(PORT=1521))
  )
  (CONNECT_DATA=
    (SERVICE_NAME=prodapp)
  )
)
DBAs often assume this means Oracle intelligently chooses the least loaded
node. It does not.
With client-side load balancing enabled, the client simply picks one SCAN
listener randomly.
That’s all.
No CPU awareness.
No session awareness.
No response time awareness.
This layer exists mainly to distribute initial connection attempts.
In smaller systems, that may be sufficient. In busy OLTP systems with
connection storms, it becomes inadequate quickly.
I have seen environments where:
- one node received most pooled reconnects after firewall resets
- app pools pinned themselves unintentionally
- connection reuse caused severe skew
- middle-tier pools ignored runtime advisory updates entirely
The RAC cluster looked balanced from a CRS perspective, but actual
application traffic was heavily concentrated.
That is where server-side intelligence becomes critical.
Server-Side Load Balancing: Where RAC Starts Making Decisions
Once the SCAN listener receives the request, Oracle can redirect the
connection toward a better target instance.
This is Server-Side Load Balancing (SSLB).
Instead of random placement, the listener evaluates advisory metrics
published by RAC instances.
The workflow looks like this:
Client
↓
SCAN Listener
↓
Node Listener
↓
Best Candidate RAC Instance
This depends on several moving pieces functioning correctly:
- SCAN listeners
- remote_listener configuration
- dynamic service registration
- the LREG background process
- service metrics publication
A surprising number of RAC environments have partial failures here.
One node stops dynamically registering correctly. Remote listener entries
become stale. DNS inconsistencies appear. SCAN VIP relocation behaves
unexpectedly after patching.
Then connections start concentrating unevenly.
You can validate registration health quickly:
Check service registration:
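A minimal check, assuming the default SCAN listener name LISTENER_SCAN1
(run it on the node currently hosting that SCAN VIP):

lsnrctl services LISTENER_SCAN1

Every database service should show handlers from each instance expected
to offer it.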
Or directly from CRS:
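For example, reusing the prodapp service from the TNS entry above and a
hypothetical database name PRODDB:

srvctl status service -db PRODDB -service prodapp
srvctl config service -db PRODDB -service prodapp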
Watch for services missing from one node.
That usually shows up long before application teams notice imbalance.
The Load Balancing Advisory: RAC's Real Intelligence Layer
The real intelligence inside RAC balancing comes from the Load Balancing
Advisory (LBA).
This is the component many DBAs never investigate deeply.
Each RAC instance continuously publishes performance metrics
including:
- active sessions
- service response times
- throughput
- CPU pressure
- service quality metrics
These metrics feed into Oracle Clusterware and are propagated through
ONS.
RAC essentially scores instance health continuously.
The listener then uses these scores when deciding where to place new
sessions.
Typical values:
- CLB_GOAL = SHORT: connection placement driven by advisory data
- CLB_GOAL = LONG: placement driven by session counts, suited to long-lived connections
- GOAL = SERVICE_TIME: the advisory optimizes for response time
- GOAL = THROUGHPUT: the advisory optimizes for completed work per second
- GOAL = NONE: the service publishes no advisory data at all
A badly chosen service goal can create subtle imbalance.
For example:
- batch workloads using SHORT goals
- OLTP services configured for THROUGHPUT
- mixed workloads sharing the same service
These issues rarely fail dramatically. They fail slowly through latency
drift and uneven resource pressure.
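Both goals are set per service. A hedged example, reusing the prodapp
service name and the hypothetical PRODDB database from earlier (12c+
srvctl syntax):

srvctl modify service -db PRODDB -service prodapp -clbgoal SHORT -rlbgoal SERVICE_TIME

Here -clbgoal SHORT makes listeners place connections using advisory
data, while -rlbgoal SERVICE_TIME tells the advisory to optimize for
response time rather than raw throughput.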
ONS and FAN: The Parts That Usually Break Quietly
Oracle Notification Service (ONS) is responsible for distributing Fast
Application Notification (FAN) events.
This becomes essential for:
- Fast Connection Failover (FCF)
- Runtime Connection Load Balancing (RCLB)
- adaptive pool balancing
- fast dead connection cleanup
Without healthy ONS communication, many application pools behave badly
during failures.
Common symptoms:
- application hangs after node failure
- stale pooled sessions
- reconnect storms
- long TCP timeout waits
- uneven pool distribution
- connection spikes after VIP relocation
DBAs frequently validate database health but forget ONS entirely.
That becomes dangerous during failovers.
You can inspect ONS configuration:
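In a standard Grid Infrastructure setup, ONS runs as part of nodeapps:

srvctl config nodeapps
srvctl status nodeapps

Check the ONS local and remote ports (commonly 6100 and 6200) and
confirm the middleware is actually configured to reach the remote port.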
And verify FAN-related events inside alert logs and Clusterware logs.
One subtle issue I have seen repeatedly is firewalls silently dropping
ONS traffic between middleware and cluster nodes. Everything appears
healthy until failover occurs.
Then the app hangs for minutes because pools never received FAN
notifications.
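A quick way to catch that class of problem is to probe the ONS remote
port from the middleware host itself (assuming the default remote port
6200 and a hypothetical host name dbnode01):

nc -vz dbnode01 6200

If the probe fails while srvctl reports ONS healthy, a firewall in
between is the likely culprit.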
Runtime Connection Load Balancing Changes Everything
Runtime Connection Load Balancing (RCLB) works differently from initial
listener balancing.
This operates inside intelligent connection pools such as:
- UCP
- OCI pools
- JDBC RAC-aware pools
Instead of merely routing initial sessions, the pool continuously adapts
using RAC advisory updates.
That means:
- overloaded instances receive fewer new requests
- unhealthy nodes lose pooled traffic gradually
- idle sessions get drained intelligently
- reconnects prefer healthier instances
This is where RAC becomes truly adaptive.
Without RCLB, connection balancing is mostly static.
With RCLB enabled correctly, RAC reacts dynamically to workload
shifts.
Unfortunately, many applications claim RAC support while ignoring these
capabilities entirely.
The result: RAC nodes stay alive, but workload distribution becomes operationally
ugly.
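A quick way to check whether the database side is even publishing what
RCLB needs (standard DBA_SERVICES columns; the name filter just hides
internal services):

SELECT name, goal, clb_goal, aq_ha_notification
FROM dba_services
WHERE name NOT LIKE 'SYS%';

A service with GOAL = NONE publishes no advisory data, so no pool can
balance on it regardless of middleware configuration.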
What Happens During Node Failure
This is where RAC architecture earns its reputation.
Suppose NODE2 crashes unexpectedly.
The flow typically looks like this:
NODE2 Failure
↓
VIP Relocation
↓
ONS/FAN Notification
↓
Connection Pools Mark Sessions Dead
↓
SCAN Redirects New Connections
↓
Remaining Nodes Absorb Workload
- If configured properly, failover is extremely fast.
- If not, applications may sit on dead TCP sessions waiting for timeout
expiration.
That distinction matters enormously during outages.
A properly tuned RAC failover may recover within seconds.
A poorly configured one can create cascading application failures lasting
several minutes.
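Much of that difference is client-side timeout tuning. A minimal sketch
of the connect-time knobs involved (these are standard Oracle Net
parameters; the values are illustrative, not recommendations):

(DESCRIPTION=
  (CONNECT_TIMEOUT=5)
  (TRANSPORT_CONNECT_TIMEOUT=3)
  (RETRY_COUNT=3)
  (ADDRESS_LIST= ... )
  (CONNECT_DATA=(SERVICE_NAME=prodapp))
)

Without explicit limits like these, clients can sit inside OS-level TCP
timeouts for minutes before trying another address.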
RAC Load Balancing Problems That Appear at Scale
Small RAC systems often mask configuration mistakes; large systems expose them brutally.
Common scaling problems include:
Session Stickiness
Connection pools reuse existing sessions heavily.
Even if RAC redistributes new connections properly, old sessions may stay
concentrated forever.
Service Misplacement
DBAs sometimes place reporting and OLTP services on the same preferred
instances.
Then batch spikes crush latency-sensitive traffic.
Better design, as in the example below:
- separate services
- separate service goals
- workload isolation
- explicit preferred/available instance definitions
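A hedged sketch of that separation (hypothetical database and instance
names; 12c+ srvctl syntax):

srvctl add service -db PRODDB -service oltp_svc -preferred PRODDB1,PRODDB2 -available PRODDB3
srvctl add service -db PRODDB -service rpt_svc -preferred PRODDB3,PRODDB4 -available PRODDB1

Batch spikes on rpt_svc now land on its own preferred instances instead
of crushing latency-sensitive OLTP traffic.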
Connection Storms
After network interruptions, thousands of clients reconnect
simultaneously.
This creates:
- listener pressure
- CPU spikes
- authentication storms
- excessive logon triggers
- library cache contention
RAC balancing helps, but only partially.
The real fix often involves application-side reconnect throttling.
RAC Internals Still Matter: GCS and GES
Even perfect session balancing cannot save poorly distributed data access
patterns.
RAC nodes constantly synchronize:
- data blocks
- row locks
- cache ownership
This happens through:
- GCS (Global Cache Service)
- GES (Global Enqueue Service)
Poor application partitioning creates excessive interconnect traffic.
Then you start seeing waits like:
- gc buffer busy acquire / gc buffer busy release
- gc cr block busy
- gc current block busy
At that point, the problem is no longer connection balancing.
It becomes workload locality.
I have seen applications scale from 2-node RAC to 8-node RAC and
actually perform worse because the workload generated enormous block
shipping overhead.
More nodes do not automatically mean better scalability. Sometimes they increase contention dramatically.
Production Failure Scenario
One environment I supported had a 4-node RAC cluster handling heavy OLTP
traffic.
Symptoms looked strange:
- NODE1 constantly above 85% CPU
- NODE3 and NODE4 mostly idle
- application latency spikes during business hours
- SCAN listeners healthy
- no obvious database errors
Initial suspicion focused on SQL tuning.
But session distribution showed almost 70% of connections pinned to
NODE1.
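That kind of skew is easy to confirm directly with a simple
distribution count against gv$session:

SELECT inst_id, service_name, COUNT(*) AS sessions
FROM gv$session
WHERE type = 'USER'
GROUP BY inst_id, service_name
ORDER BY inst_id, service_name;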
Root cause turned out to be:
- JDBC pools not configured for RCLB
- ONS partially blocked by firewall rules
- stale pool behavior left over from previous maintenance
The cluster itself was healthy.
The middleware simply stopped reacting to RAC advisory updates.
The fix involved:
- correcting ONS ports
- enabling FAN subscriptions
- validating UCP runtime balancing
- draining existing pools gradually
Within hours, workload normalized across all nodes.
No SQL tuning was needed at all.
DBA Insights Most Teams Learn Too Late
SCAN Is Not Intelligent Balancing
SCAN helps availability and routing. It is not workload intelligence by itself.
Connection Pools Matter More Than DBAs Think
Many balancing failures originate outside the database. The middleware layer often decides how well RAC actually behaves.
ONS Failures Hide Quietly
ONS issues rarely trigger obvious alarms. But they destroy failover responsiveness.
Session Counts Alone Are Misleading
Equal session distribution does not mean equal workload. One node may run heavier SQL while another handles mostly idle
sessions.
Always correlate (a starting query follows the list):
- CPU
- DB time
- service metrics
- response times
- gc waits
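A hedged starting point for that correlation, using the advisory's own
view (GOODNESS is the composite score the listener sees; lower is
better):

SELECT inst_id, service_name, goodness, cpupercall, dbtimepercall
FROM gv$servicemetric
ORDER BY service_name, inst_id;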
RAC Cannot Fix Bad Application Design
Hot block contention remains hot block contention. RAC sometimes
amplifies bad locality patterns instead of solving them.
Conclusion
Oracle RAC load balancing is far more than SCAN listeners and a TNS
parameter.
Real balancing depends on cooperation between:
- listeners
- services
- ONS
- FAN
- advisory metrics
- connection pools
- application behavior
When all layers work together, RAC handles workload shifts remarkably
well.
When one layer breaks, imbalance develops quietly until users start feeling
it.
The dangerous part is that most RAC balancing failures do not crash the
cluster. They degrade performance gradually through skewed workload
distribution, stale pools, and slow failover behavior.
Experienced DBAs learn to monitor the connection path itself, not just the
database health.
Because many RAC incidents begin long before the instance actually goes
down.