Saturday, 9 May 2026

RMAN Restartable Backups Explained for Production DBAs

Most DBAs have seen this at least once. You wake up, check the overnight backup report, and RMAN failed halfway through a 14 TB database backup because the backup filesystem filled up, a network mount disconnected, or one RAC node crashed during the run.



Sunday, 26 April 2026

ORA-01017 in RAC 12c and above ? Stepwise Permission Fix & Cause identification

  As an Oracle DBA, few things are more frustrating than a sudden loss of remote connectivity right after a routine SYS password reset. You type in the credentials, and bam -- ORA-01017 greets you, even though your local connections work fine. In production RAC environments, this isn’t just about a mistyped password.

Recently, I faced a tricky scenario in an Oracle 19c RAC setup with proper role separation between the grid and oracle OS users. What seemed like a simple password mismatch quickly unraveled into a multi-layered “permission deadlock,” involving rogue listeners, contaminated IPC sockets, and GPnP directory access issues. It took a careful, stepwise approach to restore connectivity across all nodes without compromising the cluster.



Saturday, 18 April 2026

Oracle RAC Load Balancing Internals Explained

 Most RAC clusters look healthy until the workload shifts suddenly.

A reporting job starts hammering one node. Connection pools keep sending sessions to the same instance. CPU climbs, gc waits spike, application response times become unpredictable, and suddenly everyone starts blaming storage, SQL plans, or the network.

But many times, the real problem sits in the RAC connection routing layer itself.

I have seen large RAC environments where all nodes were technically UP, yet one instance was drowning while another sat nearly idle. The cluster wasn’t failing. The load balancing strategy was.

Oracle RAC load balancing is often misunderstood because people assume SCAN alone magically distributes workload intelligently. It does not.

There are multiple layers involved:

  • client-side balancing
  • listener-based balancing
  • runtime advisory feedback
  • FAN/ONS notifications
  • connection pool behavior
  • service-level workload management

And if even one layer is misconfigured, RAC can behave very differently from what DBAs expect.

This article breaks down how Oracle RAC actually distributes workload internally, where the common operational failures happen, and what DBAs should monitor before imbalance becomes an outage.


RAC Load Balancing Is Not One Feature

Oracle RAC does not use a single load balancing mechanism.

Instead, workload distribution happens across several independent layers that cooperate together.

The architecture looks roughly like this:




and at a high level : 



The important thing here: Not every client participates equally.

A SQL*Plus connection behaves differently from a JDBC UCP pool. OCI behaves differently from a third-party app server. Some clients understand Runtime Load Balancing. Others only perform basic random distribution.

That distinction matters heavily during failovers and uneven workload conditions.


Client-Side Load Balancing: The Most Misunderstood Layer

Most people encounter RAC balancing through the TNS entry:

(DESCRIPTION=
  (LOAD_BALANCE=ON)
  (FAILOVER=ON)
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan1)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan2)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=scan3)(PORT=1521))
  )
  (CONNECT_DATA=
    (SERVICE_NAME=prodapp)
  )
)

DBAs often assume this means Oracle intelligently chooses the least loaded node. It does not.

With client-side load balancing enabled, the client simply picks one SCAN listener randomly.

That’s all.

No CPU awareness.

No session awareness.

No response time awareness.

 

This layer exists mainly to distribute initial connection attempts.

In smaller systems, that may be sufficient. In busy OLTP systems with connection storms, it becomes inadequate quickly.

I hve seen environments where:

  • one node received most pooled reconnects after firewall resets
  • app pools pinned themselves unintentionally
  • connection reuse caused severe skew
  • middle-tier pools ignored runtime advisory updates entirely

The RAC cluster looked balanced from CRS perspective, but actual application traffic was heavily concentrated.

That is where server-side intelligence becomes critical.


Server-Side Load Balancing: Where RAC Starts Making Decisions

Once the SCAN listener receives the request, Oracle can redirect the connection toward a better target instance.

This is Server-Side Load Balancing (SSLB).

Instead of random placement, the listener evaluates advisory metrics published by RAC instances.

The workflow looks like this:

Client

SCAN Listener

Node Listener

Best Candidate RAC Instance

This depends on several moving pieces functioning correctly:

  • SCAN listeners
  • remote_listener configuration
  • dynamic service registration
  • LREG background process
  • service metrics publication

A surprising number of RAC environments have partial failures here.

One node stops dynamically registering correctly. Remote listener entries become stale. DNS inconsistencies appear. SCAN VIP relocation behaves unexpectedly after patching.

Then connections start concentrating unevenly.

You can validate registration health quickly:

show parameter remote_listener;

Check service registration:

lsnrctl status

Or directly from CRS:

srvctl status service -d PROD

Watch for services missing from one node.

That usually shows up long before application teams notice imbalance.


The Load Balancing Advisory: RAC's Real Intelligence Layer

The real intelligence inside RAC balancing comes from the Load Balancing Advisory (LBA).

This is the component many DBAs never investigate deeply.

Each RAC instance continuously publishes performance metrics including:

  • active sessions
  • service response times
  • throughput
  • CPU pressure
  • service quality metrics

These metrics feed into Oracle Clusterware and are propagated through ONS.

RAC essentially keeps scoring instance health continuously.

The listener then uses these scores when deciding where to place new sessions.


Typical values:


A badly chosen service goal can create subtle imbalance.

For example:

  • batch workloads using SHORT goals
  • OLTP services configured for THROUGHPUT
  • mixed workloads sharing same service

These issues rarely fail dramatically. They fail slowly through latency drift and uneven resource pressure.


ONS and FAN: The Parts That Usually Break Quietly

Oracle Notification Service (ONS) is responsible for distributing Fast Application Notification (FAN) events.

This becomes essential for:

  • Fast Connection Failover (FCF)
  • Runtime Connection Load Balancing (RCLB)
  • adaptive pool balancing
  • fast dead connection cleanup

Without healthy ONS communication, many application pools behave badly during failures.

Common symptoms:

  • application hangs after node failure
  • stale pooled sessions
  • reconnect storms
  • long TCP timeout waits
  • uneven pool distribution
  • connection spikes after VIP relocation

DBAs frequently validate database health but forget ONS entirely.

That becomes dangerous during failovers.

You can inspect ONS configuration:

srvctl config nodeapps

And verify FAN-related events inside alert logs and clusterware logs. One subtle issue I have seen repeatedly is that the firewalls silently dropping ONS traffic between middleware and cluster nodes. Everything appears healthy until failover occurs.

Then the app hangs for minutes because pools never received FAN notifications.


Runtime Connection Load Balancing Changes Everything

Runtime Connection Load Balancing (RCLB) works differently from initial listener balancing.

This operates inside intelligent connection pools such as:

  • UCP
  • OCI pools
  • JDBC RAC-aware pools

Instead of merely routing initial sessions, the pool continuously adapts using RAC advisory updates.

That means:

  • overloaded instances receive fewer new requests
  • unhealthy nodes lose pooled traffic gradually
  • idle sessions get drained intelligently
  • reconnects prefer healthier instances

This is where RAC becomes truly adaptive.

Without RCLB, connection balancing is mostly static.

With RCLB enabled correctly, RAC reacts dynamically to workload shifts.

Unfortunately, many applications claim RAC support while ignoring these capabilities entirely.

The result:  RAC nodes stay alive, but workload distribution becomes operationally ugly.


What Happens During Node Failure

This is where RAC architecture earns its reputation.

Suppose NODE2 crashes unexpectedly.

The flow typically looks like this:

NODE2 Failure

VIP Relocation

ONS/FAN Notification

Connection Pools Mark Sessions Dead

SCAN Redirects New Connections

Remaining Nodes Absorb Workload

  1. If configured properly, failover is extremely fast.
  2. If not, applications may sit on dead TCP sessions waiting for timeout expiration.

That distinction matters enormously during outages.

A properly tuned RAC failover may recover within seconds.

A poorly configured one can create cascading application failures lasting several minutes.


RAC Load Balancing Problems That Appear at Scale

Small RAC systems often mask configuration mistakes. Whereas,  Large systems expose them brutally.

Common scaling problems include:

Session Stickiness

Connection pools reuse existing sessions heavily.

Even if RAC redistributes new connections properly, old sessions may stay concentrated forever.


Service Misplacement

DBAs sometimes place reporting and OLTP services on same preferred instances.

Then batch spikes crush latency-sensitive traffic.

Better design:

  • separate services
  • separate service goals
  • workload isolation
  • explicit preferred/available instance definitions


Connection Storms

After network interruptions, thousands of clients reconnect simultaneously.

This creates:

  • listener pressure
  • CPU spikes
  • authentication storms
  • excessive logon triggers
  • library cache contention

RAC balancing helps, but only partially.

The real fix often involves application-side reconnect throttling.


RAC Internals Still Matter: GCS and GES

Even perfect session balancing cannot save poorly distributed data access patterns.

RAC nodes constantly synchronize:

  • data blocks
  • row locks
  • cache ownership

This happens through:

  • GCS (Global Cache Service)
  • GES (Global Enqueue Service)

Poor application partitioning creates excessive interconnect traffic.

Then you start seeing waits like:

gc cr request
gc buffer busy acquire
gc current block busy

At that point, the problem is no longer connection balancing.

It becomes workload locality. I hve seen applications scale from 2-node RAC to 8-node RAC and actually perform worse because the workload generated enormous block shipping overhead.

More nodes do not automatically mean better scalability. Sometimes they increase contention dramatically.


Production Failure Scenario

One environment I supported had a 4-node RAC cluster handling heavy OLTP traffic.

Symptoms looked strange:

  • NODE1 constantly above 85% CPU
  • NODE3 and NODE4 mostly idle
  • application latency spikes during business hours
  • SCAN listeners healthy
  • no obvious database errors

Initial suspicion focused on SQL tuning.

But session distribution showed almost 70% of connections pinned to NODE1.

Root cause turned out to be:

  • JDBC pools not configured for RCLB
  • ONS blocked partially by firewall rules
  • old stale pool behavior after previous maintenance

The cluster itself was healthy.

The middleware simply stopped reacting to RAC advisory updates.

Fix involved:

  • correcting ONS ports
  • enabling FAN subscriptions
  • validating UCP runtime balancing
  • draining existing pools gradually

Within hours, workload normalized across all nodes.

No SQL tuning was needed at all.


DBA Insights Most Teams Learn Too Late

SCAN Is Not Intelligent Balancing

SCAN helps availability and routing. It is not workload intelligence by itself.


Connection Pools Matter More Than DBAs Think

Many balancing failures originate outside the database. The middleware layer often decides how good RAC actually behaves.


ONS Failures Hide Quietly

ONS issues rarely trigger obvious alarms. But they destroy failover responsiveness.


Session Counts Alone Are Misleading

Equal session distribution does not mean equal workload. One node may run heavier SQL while another handles mostly idle sessions.

Always correlate:

  • CPU
  • DB time
  • service metrics
  • response times
  • gc waits


RAC Cannot Fix Bad Application Design

Hot block contention remains hot block contention. RAC sometimes amplifies bad locality patterns instead of solving them


Conclusion

Oracle RAC load balancing is far more than SCAN listeners and a TNS parameter.

Real balancing depends on cooperation between:

  • listeners
  • services
  • ONS
  • FAN
  • advisory metrics
  • connection pools
  • application behavior

When all layers work together, RAC handles workload shifts remarkably well.

When one layer breaks, imbalance develops quietly until users start feeling it.

The dangerous part is that most RAC balancing failures do not crash the cluster. They degrade performance gradually through skewed workload distribution, stale pools, and slow failover behavior.


Experienced DBAs learn to monitor the connection path itself, not just the database health.

Because many RAC incidents begin long before the instance actually goes down.




Saturday, 4 April 2026

Oracle Data Pump Migrations: What Breaks in Real Upgrades

Most DBAs discover the real complexity of Oracle database migrations only after the first large-scale Export/Import cutover goes sideways. On paper, Oracle Data Pump looks straightforward. Export the database, import it into a newer release, validate the objects, switch applications, done.

Reality is usually messier.



Tuesday, 31 March 2026

PostgreSQL Backup Guide: pg_dump vs pg_dumpall

 If you have ever been in a production outage where a single schema was corrupted or a developer dropped the wrong table, you already know one thing ; your backup strategy is only as good as your restore flexibility.

In PostgreSQL, logical backups using pg_dump and pg_dumpall are your first line of defense for granular recovery. Unlike physical backups, they give you the ability to restore specific objects, migrate databases across environments, and even troubleshoot data inconsistencies without touching the entire cluster.



Friday, 20 March 2026

Essential Oracle Database Keywords for DBAs - Part 2

When you are preparing for an Oracle DBA interview — or troubleshooting a production issue at 2 AM — definitions alone are not enough. You need clarity. You need context. And most importantly, you need to understand how Oracle behaves under pressure.



Sunday, 15 March 2026

When Oracle Uses PGA Instead of SGA for Large Table Scans

Most Oracle DBAs learn early in their careers that the System Global Area (SGA) is the primary memory structure used for caching data blocks. The assumption is simple: data blocks are read from disk, placed in the buffer cache, and reused by future sessions.

However, that assumption doesn’t always hold true in production.

In many real-world workloads , especially analytics queries, reporting jobs, or large batch processing-- . Oracle intentionally bypasses the buffer cache and reads data directly into the Program Global Area (PGA). This behaviour surprises many DBAs when they notice unusual wait events like direct path read, or when frequently queried tables seem to generate repeated disk I/O.



Monday, 9 March 2026

Adding an Instance to Oracle RAC: Step-by-Step Guide for DBAs

 Expanding your Oracle Real Application Clusters (RAC) database by adding a new instance can be intimidating, especially if you want to ensure zero downtime and seamless integration. Whether you're introducing a new node or simply adding an instance to an existing one, the process involves careful planning, configuration, and execution. In this guide, we’ll walk you through every step, from preparing the environment to verifying the new instance’s operation. By the end of this article, you will have a clear understanding of both DBCA and SRVCTL methods for adding instances, best practices for shared storage and network setup, and tips to avoid common pitfalls.



Thursday, 5 March 2026

Oracle Utilities for DBAs: Complete Guide to Essential Tools

 Managing an Oracle Database efficiently requires mastering a variety of utilities designed for administration, data movement, performance tuning, and troubleshooting. Whether you’re performing backups, moving data, monitoring performance, or tuning SQL queries, Oracle provides tools to streamline every task. In this guide, we’ll cover the most commonly used Oracle utilities, grouped by functionality, with practical examples and best practices for real-world DBA scenarios.



Monday, 2 March 2026

Understanding Oracle 19c DML Internals for OLTP Performance

For Oracle DBAs managing high-volume OLTP (Online Transaction Processing) systems, understanding how core DML (Data Manipulation Language) operations function under the hood is essential. SELECT, INSERT, UPDATE, and DELETE statements are the backbone of any database, but their internal mechanics—parsing, execution, undo/redo generation, buffer cache interactions, and transaction control—can significantly impact system performance.



Saturday, 28 February 2026

Oracle 26ai ASC: Fast Loads Without HCC Pain

 Most DBAs who run large Oracle data warehouses already know the trade-off.

The ETL team wants maximum direct-path load speed because nightly ingestion windows keep shrinking. The storage team wants Hybrid Columnar Compression (HCC) everywhere because storage costs are exploding. Meanwhile, DBAs get stuck in the middle trying to balance ingest performance, archive growth, Smart Scan efficiency, backup footprint, and query response times.



Oracle Backup Success Story : Predictable Backups, Confident Restores - Oracle Features for Modern VLDBs

 When we talk about database growth, we usually celebrate it. But growth without backup redesign is silent risk.

This was the story of a 60.5 TB Oracle production database, where individual datafiles had grown between 500GB to 800GB+, and backups were quietly destabilizing the entire ecosystem.

What started as a "long backup" issue turned into something much deeper.



Friday, 27 February 2026

Essential Oracle Database Keywords Every DBA Should Know - Part 1

 Whether you are just stepping into the world of Oracle databases or have years of experience managing complex environments, understanding the foundational keywords and concepts is crucial. Oracle databases come with a rich ecosystem of terms, from memory structures and wait events to transaction control and performance monitoring. This guide walks you through essential Oracle database keywords, explaining each term in plain language with practical examples. You’ll learn not only what these concepts mean but also how they impact daily database operations, troubleshooting, and performance tuning. By mastering these terms, freshers gain a strong starting point, while seasoned DBAs can refresh and refine their knowledge. 

In this first installment, we cover critical keywords ranging from Buffer, Cache, and Parsing, to Data Pump and SQL Plan Baselines, giving you a solid foundation for Oracle administration.



Monday, 23 February 2026

SQL Query Tuning in Oracle: A Practical Guide for DBAs

 If you're an Oracle DBA, you already know this feeling: a message pops up — “The application is slow.” No context. No logs. Just urgency.

And more often than not, the root cause comes down to a poorly performing SQL query.

SQL tuning in Oracle isn’t just about adding an index or running the SQL Tuning Advisor. It’s about following a structured, evidence-based approach that eliminates guesswork. Over the years, I’ve realized that the biggest difference between average and effective SQL query tuning lies in discipline — knowing what to check, in what order, and why.



Monday, 16 February 2026

Oracle Data Archiving Best Practices Guide

As an Oracle DBA, you already know this feeling - the database keeps growing, storage keeps expanding, backups take longer, maintenance windows shrink, and suddenly performance complaints start coming in.

Handling large data volumes isn't just about adding more disks or increasing SGA. It's about implementing Oracle Data Archiving best practices that balance performance, cost, compliance, and scalability.



Monday, 9 February 2026

How to Boost Oracle Data Pump Performance for Faster IMPDP Operations

 Oracle Data Pump (IMPDP) is a powerful utility for moving data between Oracle databases, but large-scale imports and exports can often be slow and resource-intensive. Whether you’re managing a standalone database or handling complex LOB-heavy schemas, improving Data Pump performance is key to reducing downtime and ensuring smooth operations. In this guide, we’ll explore practical strategies to enhance IMPDP performance, including parallelism, network-based imports, LOB optimizations, and buffer tuning.



Wednesday, 4 February 2026

VLDB Backup Optimization – RMAN Concepts Revision for Interviews

 

Q 1: What is the main challenge when backing up a multi-terabyte Oracle database with very large datafiles?



Monday, 26 January 2026

Republic Day Reflections for DBAs: Oracle, PostgreSQL, MSSQL, and the Art of Governance

As India celebrates 26th January - Republic Day, it’s a perfect moment for DBAs to draw parallels between national governance and database management. Just as the Constitution defines rules, responsibilities, and structures for our nation, robust databases rely on architecture, policies, and governance to thrive.



Friday, 23 January 2026

Oracle Hybrid Partitioned Tables in Real Systems

 Most large Oracle environments eventually hit the same wall.

The database keeps growing, storage costs climb every quarter, backup windows become unpredictable, and suddenly DBAs are spending more time managing historical data than supporting the actual application workload. 

Usually the first reaction is archive-and-delete.

Then the reporting team complains because last year's data disappeared.



Monday, 12 January 2026

All Important Things About Oracle RAC Every DBA Should Know

  In today's always-on digital world, databases are no longer just data stores—they are the backbone of business continuity. Whether it’s a bank processing millions of transactions per second, a pharma company maintaining regulatory compliance, or an eCommerce platform surviving flash sales, downtime is simply not an option. This is exactly where All Important Things About Oracle RAC become critical for anyone working with enterprise databases.



Monday, 5 January 2026

Classic vs Integrated Capture in Oracle GoldenGate: Key Differences Explained

 Oracle GoldenGate is a robust solution for real-time data replication, offering flexibility to suit a wide range of enterprise database environments. One of the fundamental decisions when implementing GoldenGate is choosing between Classic Capture and Integrated Capture. While both approaches serve the same purpose—capturing database changes for replication—they differ significantly in performance, scalability, and compatibility with modern Oracle features such as RAC, multitenant architecture, and TDE (Transparent Data Encryption).



Monday, 24 November 2025

Oracle Data Pump Features: 11g vs 19c – Key Differences DBAs Should Know

 If you have worked with Oracle databases, you are likely familiar with Oracle Data Pump, the high-performance utility for exporting and importing database objects. Over the years, it has evolved significantly, especially from Oracle 11g to 19c. Understanding the differences is crucial for DBAs planning migrations, upgrades, or performance optimizations.



Monday, 17 November 2025

Reducing RPO and Managing Recovery Time for Oracle Bigfile Tablespaces

 Managing Oracle Bigfile Tablespaces can seem daunting when it comes to backup and recovery. With a single datafile potentially exceeding hundreds of terabytes, restoring after corruption or failure may appear time-consuming. However, modern Oracle features like RMAN incremental backups, block change tracking, ASM striping, and flashback technologies allow DBAs to reduce both Recovery Point Objective (RPO) and Recovery Time Objective (RTO) efficiently.



Thursday, 14 August 2025

Celebrating Freedom with PostgreSQL: A Tribute on India's Independence Day

 As India marks 78 years of freedom, it’s a perfect moment to draw some  parallels between our nation’s journey and our favorite open-source database, PostgreSQL. This Independence Day, let’s explore how the spirit of freedom and innovation reflects in both our national history and PostgreSQL’s capabilities.


Sunday, 10 August 2025

Step-by-Step Guide: Configuring Yum Repository in Linux for Oracle DBAs

 For Oracle DBAs managing Linux systems, configuring Yum repositories is essential for streamlining package management tasks like installation, updates, and dependency resolution. In this guide, I’ll walk you through setting up a Yum repository in a few simple steps, ensuring a smooth package management experience.