When we talk about database growth, we usually celebrate it. But growth without backup redesign is silent risk.
This was the story of a 60.5 TB Oracle production database, where individual datafiles had grown to between 500 GB and 800 GB+, and backups were quietly destabilizing the entire ecosystem.
What started as a "long backup" issue turned into something much deeper.
🔹 The Tipping Point
Our environment looked structured:
✔ Weekly Level 0 & Daily Level 1 incremental backup
✔ Archivelog backup scheduled for every 6 hours
✔ Parallelism configured
✔ Backups to NFS
✔ ASM layout optimized
Yet full and incremental backups were running 8–10 hours.
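For context, the standing setup above can be sketched as RMAN configuration and scheduled commands. This is a reconstruction for illustration; the NFS path and format string are assumptions, not the actual values from our environment:

```
# One-time configuration (path/format are illustrative assumptions)
CONFIGURE DEVICE TYPE DISK PARALLELISM 8 BACKUP TYPE TO BACKUPSET;
CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT '/nfs/rman/%d_%T_%U';

# Weekly
BACKUP INCREMENTAL LEVEL 0 DATABASE;
# Daily
BACKUP INCREMENTAL LEVEL 1 DATABASE;
# Every 6 hours (scheduled externally, e.g. via cron or OEM)
BACKUP ARCHIVELOG ALL;
```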
And here's where it became critical: the archivelog backup scheduled every 6 hours started failing, because RMAN does not allow overlapping backup operations on the same database.
🔹 Technical Root Cause
We investigated channel behaviour, and here is what we discovered:
- Parallelism was configured, but one huge 780 GB datafile monopolized a single channel.
- The other channels went idle, wasting resources.
- Backup pieces grew too large, causing restore delays and storage alerts.
- Archivelog backups were skipped, so the FRA kept filling and operational risk increased.
Now it was not just a performance problem - it was a recovery-integrity problem too.
🔹 Multisection Backups (SECTION SIZE)
We redesigned the backup like this:
BACKUP INCREMENTAL LEVEL 1 CUMULATIVE SECTION SIZE 64G DATABASE;
Why 64G?
The largest datafile was ≈ 800 GB and parallelism was 8.
Mathematically: 800 ÷ 8 ≈ 100 GB per channel.
But we intentionally chose 64G for better workload balancing and more granular scheduling.
A common approach is:
SECTION SIZE = Largest Datafile Size ÷ Parallelism ÷ 1.5 (or 2 for safety)
- Largest Datafile Size → in bytes/GB
- Parallelism → number of RMAN channels configured
- 1.5–2 factor → gives extra granularity and prevents "long-tail" issues
Example:
Largest datafile = 800 GB, Parallelism = 8
SECTION SIZE = 800 ÷ 8 ÷ 1.5 ≈ 66 GB
Now that 800 GB file was split into ~12–13 sections, and all 8 channels stayed busy.
There was no idle time, and backup duration dropped significantly.
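While a multisection backup runs, per-channel progress can be watched from the standard v$session_longops view. This is a sketch of a common monitoring query; it assumes totalwork is populated for the RMAN rows:

```sql
-- Each active RMAN channel reports a row while it works on a section
SELECT sid, opname, sofar, totalwork,
       ROUND(sofar / totalwork * 100, 1) AS pct_done
FROM   v$session_longops
WHERE  opname LIKE 'RMAN%'
AND    totalwork > 0
AND    sofar <> totalwork;
```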
RMAN calculates section offsets (a start_block and end_block for each section), and channels process sections independently; RMAN metadata tracks which sections are done.

🔹 Stabilizing Output with MAXPIECESIZE
We also noticed backup pieces were becoming huge ~200GB+.
That's risky in a multi-terabyte environment.
So we configured:
CONFIGURE CHANNEL DEVICE TYPE DISK MAXPIECESIZE 128G;
Now:
- Backup pieces were capped at 128 GB
- This gave us not only predictable file sizing but also easier storage management and faster restore testing
MAXPIECESIZE should be larger than, or at least equal to, the section size; otherwise RMAN may fail to create the backup piece.
- If not explicitly set, MAXPIECESIZE is effectively unlimited, bounded only by the OS/filesystem file-size limits.
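To verify that pieces actually land under the cap, recent piece sizes can be checked from the v$backup_piece view. A sketch, assuming you query the target database with sufficient privileges:

```sql
-- Recent backup piece sizes in GB; with a 128G cap, all should be <= 128
SELECT piece#, handle,
       ROUND(bytes / 1024 / 1024 / 1024, 1) AS size_gb
FROM   v$backup_piece
WHERE  status = 'A'
ORDER BY completion_time DESC;
```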
🔹 Parallelism Considerations
In later iterations, we also tuned RMAN parallelism, which further reduced backup time.
RMAN uses multiple channels to process datafiles or sections.
Each channel runs as a separate OS-level process, reading datafile blocks and writing backup sets.
Parallelism should be sized against:
- Available storage bandwidth and CPU capacity
- I/O latency and network throughput (for NFS/cloud)
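For a one-off run, channels can also be allocated explicitly inside a RUN block instead of relying on the configured parallelism. The channel names below are illustrative:

```
RUN {
  # Eight explicit disk channels, matching the configured parallelism
  ALLOCATE CHANNEL c1 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c2 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c3 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c4 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c5 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c6 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c7 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c8 DEVICE TYPE DISK;
  BACKUP INCREMENTAL LEVEL 1 CUMULATIVE SECTION SIZE 64G DATABASE;
}
```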
Backup window reduced from 10 hours → 3 hours 15 minutes.
But more importantly, archivelog backups stopped failing, FRA stabilized, and restore testing became predictable.
🔹 Technical Areas We Evaluated during Iterations
From a DBA + Backup Team perspective, we assessed:
- I/O throughput capacity (via AWR)
- ASM diskgroup latency
- NFS mount performance (together with the backup/storage teams)
- CPU usage during compression (AWR/ASH)
- Channel allocation behaviour (comparing runs)
- Restore validation timing
- FRA growth trends + archivelog generation rate
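Two of those checks, FRA pressure and archivelog generation rate, map to simple queries against standard dynamic performance views. A sketch; the DEST_ID = 1 filter is an assumption to avoid double counting multiplexed destinations:

```sql
-- FRA space pressure by file type
SELECT file_type, percent_space_used, percent_space_reclaimable
FROM   v$recovery_area_usage;

-- Daily archivelog generation in GB (assumes primary log destination is 1)
SELECT TRUNC(completion_time) AS day,
       ROUND(SUM(blocks * block_size) / 1024 / 1024 / 1024, 1) AS gb
FROM   v$archived_log
WHERE  dest_id = 1
GROUP BY TRUNC(completion_time)
ORDER BY day;
```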
Optimization wasn't done in isolation.
It was done holistically.
🔹 Why Benchmarking Matters
Tuning RMAN is not a one-time configuration. It is an iterative process. When you benchmark and monitor properly:
- You reduce operational risk
- You validate storage capability
- You detect bottlenecks early
- You build reusable tuning standards
- You create a reference architecture for future databases
In fact, this exercise became a blueprint for other multi-terabyte databases in our environment.
Instead of reacting later, we proactively optimized future systems during growth planning.
🔹 Key Takeaways
- Large datafiles require a multisection strategy
- Parallelism must align with storage bandwidth
- MAXPIECESIZE improves operational stability
- Long-running backups can silently break archivelog schedules
- Benchmarking after change is non-negotiable.
- Each VLDB has different I/O throughput, CPU, and network characteristics
🔹 Final Thought
Optimization is not about aggressive settings. It is about controlled, measured, and validated improvement. When done correctly, backup tuning doesn't just reduce runtime. It restores operational confidence.
And sometimes, the happiest outcome isn't the 3-hour backup. It's the fact that no one is seeing "Log Backup Skipped" alerts anymore. :)
That is when you know the ecosystem is healthy.