As an Oracle DBA, few things are more frustrating than a sudden loss of remote connectivity right after a routine SYS password reset. You type in the credentials, and bam -- ORA-01017 greets you, even though your local connections work fine. In production RAC environments, this isn’t just about a mistyped password.
Recently, I faced a tricky scenario in an Oracle 19c RAC setup with proper role separation between the grid and oracle OS users. What seemed like a simple password mismatch quickly unraveled into a multi-layered “permission deadlock,” involving rogue listeners, contaminated IPC sockets, and GPnP directory access issues. It took a careful, stepwise approach to restore connectivity across all nodes without compromising the cluster.
In this article, I will walk you through:
- The root causes behind ORA-01017 and ORA-29780 in RAC environments.
- How OS-level permission issues can block database-to-cluster communication.
- A practical three-step resolution strategy covering root, grid, and oracle users.
- Key takeaways and real-world DBA insights that save hours of troubleshooting.
By the end, you will understand why such deadlocks occur and how to methodically restore full RAC functionality.
Understanding the Permission Deadlock
When you encounter ORA-01017 immediately after a SYS password reset, it is often more than a credential mismatch. In RAC setups, remote connectivity depends on a “trust chain” between the Database, ASM, and Grid Infrastructure. Breaking any link can cascade into errors like:
ORA-01017: invalid username/password; logon denied
ORA-17503: ksfdopn:2 Failed to open file
ORA-27300: OS system dependent operation failed with status 13
ORA-29780: unable to connect to GPnP daemon
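To make the mapping from error code to layer concrete, here is a minimal triage sketch. The helper name triage_ora_errors and the hint wording are my own assumptions, not an Oracle tool; the only hard fact encoded is that status 13 is errno EACCES (permission denied) on Linux.

```shell
# Sketch: read log text on stdin and print which layer to inspect first.
# triage_ora_errors is a hypothetical helper name used for illustration.
triage_ora_errors() {
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *ORA-01017*) echo "ORA-01017: check password file and listener ownership" ;;
            *ORA-17503*) echo "ORA-17503: check access to the file stored in ASM" ;;
            *ORA-27300*"status 13"*) echo "ORA-27300: status 13 is EACCES, check OS permissions" ;;
            *ORA-29780*) echo "ORA-29780: check read access to the GPnP directory" ;;
        esac
    done
}

# Example (log path is a placeholder):
#   triage_ora_errors < /path/to/alert_or_trace.log
```

Lines that contain none of the four codes produce no output, so the function can be pointed at a whole alert log.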
Typical Root Causes
- Rogue Listener Process: If a listener is manually started under the daemon OS user instead of grid, it lacks membership in asmdba and cannot access ASM password files. The listener seems alive but silently blocks remote connections.
- Contaminated IPC Socket Files: The rogue listener creates socket files in /var/tmp/.oracle owned by daemon. These stale files prevent the legitimate Grid Infrastructure listener from starting correctly.
- GPnP Directory Permissions: The oracle user must have read access to the Grid Plug and Play (GPnP) directory. Missing permissions here trigger ORA-29780, as the database cannot locate cluster resources.
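The first root cause is easy to confirm from the process table. The sketch below assumes the Grid Infrastructure owner is called grid, as in this article; rogue_listeners is a hypothetical helper name. It only reads text, so it changes nothing on the node.

```shell
# rogue_listeners EXPECTED_USER
# Reads "user pid command args..." lines on stdin and prints "user pid"
# for every tnslsnr process that is NOT owned by EXPECTED_USER.
# Matching on the command field avoids false hits on grep or awk itself.
rogue_listeners() {
    awk -v expected="$1" '$3 ~ /(^|\/)tnslsnr$/ && $1 != expected { print $1, $2 }'
}

# Example on a cluster node (procps syntax, one -o per column):
#   ps -e -o user= -o pid= -o args= | rogue_listeners grid
```

An empty result means every running listener belongs to the expected owner; any line printed is a candidate rogue process.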
Image Idea: Diagram showing Database → ASM → GPnP → Listener flow, highlighting blocked connections caused by OS-level permission issues.
Step-By-Step Resolution
In my experience, solving this type of deadlock requires strict sequencing: root user cleanup, grid-managed listener restoration, and oracle password resync.
Step 1: OS & Socket Cleanup (root)
Clear the environment and ensure permissions are correct:
# Grant oracle user read access to GPnP
chmod -R 755 /u01/app/12.1.0/grid_1/gpnp
# Fix SetUID bits for Oracle binaries
chmod 6751 /u01/app/oracle/product/12.1.0/db_1/bin/oracle
chmod 6751 /u01/app/12.1.0/grid_1/bin/oracle
# Remove stale IPC socket files
rm -f /var/tmp/.oracle/s#*
rm -f /tmp/.oracle/s#*
rm -f /var/tmp/.oracle/sLISTENER
# Reset temporary directory permissions
chmod 01777 /var/tmp/.oracle
chmod 01777 /tmp/.oracle
This step ensures that stale files and incorrect permissions do not block listener startup.
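Before deleting anything, I find it useful to preview which socket files actually belong to the wrong owner. This is a read-only sketch under two assumptions: GNU stat is available (as on most Linux RAC nodes), and the legitimate owner is grid. The function name list_foreign_sockets is hypothetical.

```shell
# list_foreign_sockets DIR EXPECTED_OWNER
# Prints "owner path" for each entry in DIR whose name starts with "s"
# and whose owner differs from EXPECTED_OWNER. Deletes nothing.
list_foreign_sockets() {
    dir=$1
    expected_owner=$2
    [ -d "$dir" ] || return 0
    for f in "$dir"/s*; do
        [ -e "$f" ] || continue            # glob matched nothing
        owner=$(stat -c '%U' "$f") || continue
        if [ "$owner" != "$expected_owner" ]; then
            printf '%s %s\n' "$owner" "$f"
        fi
    done
}

# Example:
#   list_foreign_sockets /var/tmp/.oracle grid
#   list_foreign_sockets /tmp/.oracle grid
```

If the preview lists only daemon-owned files, the rm commands above are removing exactly what the rogue listener left behind.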
Step 2: Network Layer Restoration (grid)
Never start listeners manually in a role-separated RAC; always use srvctl.
# Stop rogue listener
lsnrctl stop LISTENER
# Start RAC-managed listener via Clusterware
srvctl start listener -n srv1
# Verify TCP endpoint
lsnrctl status LISTENER
Confirm no rogue processes remain:
ps -ef | grep tnslsnr
This step hands control back to Clusterware and ensures the listener runs with proper permissions.
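To confirm that Clusterware, not a manual start, now owns the listener, the srvctl status output can be checked per node. This sketch assumes the status line has the form "Listener LISTENER is running on node(s): srv1,srv2"; verify the wording on your release before relying on it. listener_running_on is a hypothetical helper name.

```shell
# listener_running_on NODE
# Reads "srvctl status listener" output on stdin and succeeds (exit 0)
# only if NODE appears in the "is running on node(s):" line.
listener_running_on() {
    awk -v node="$1" '
        /is running on node\(s\):/ {
            sub(/.*node\(s\):[ ]*/, "")
            n = split($0, nodes, /[ ,]+/)
            for (i = 1; i <= n; i++) if (nodes[i] == node) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Example (as the grid user):
#   srvctl status listener | listener_running_on srv1 && echo "managed listener is up on srv1"
```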
Image Idea: Diagram comparing rogue listener vs RAC-managed listener startup paths.
Step 3: Password Synchronization (oracle)
Once communication is restored, recreate the password file in ASM and sync the instance:
# Recreate password file
orapwd dbuniquename=rac password=oracle file=+DATA format=12 force=y
# Sync SYS password across instances
sqlplus / as sysdba
ALTER USER SYS IDENTIFIED BY oracle;
Finally, test remote connectivity on all nodes:
sqlplus sys/oracle@rac1 as sysdba
sqlplus sys/oracle@rac2 as sysdba
Connectivity should now be fully restored.
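With more than two nodes, the remote test is easier to run as a loop. The sketch below separates the loop from the probe so the loop can be exercised without a database. check_aliases, sys_probe, and the SYS_PW environment variable are all hypothetical names; the probe assumes sqlplus is on the PATH and that a successful login prints the value 1 on its own line.

```shell
# check_aliases PROBE ALIAS...
# Runs PROBE once per TNS alias, prints "OK alias" or "FAIL alias",
# and returns non-zero if any alias failed.
check_aliases() {
    probe=$1
    shift
    rc=0
    for tns_alias in "$@"; do
        if "$probe" "$tns_alias" >/dev/null 2>&1; then
            echo "OK $tns_alias"
        else
            echo "FAIL $tns_alias"
            rc=1
        fi
    done
    return $rc
}

# Hypothetical probe: one login attempt (-L), silent mode (-S).
# Note: a password on the command line is visible in the process list.
sys_probe() {
    printf 'select 1 from dual;\nexit\n' |
        sqlplus -L -S "sys/${SYS_PW}@$1 as sysdba" | grep -q '^ *1$'
}

# Example:
#   SYS_PW=... check_aliases sys_probe rac1 rac2
```

Keeping the probe pluggable also makes it simple to swap in a wallet-based login, which avoids exposing the SYS password at all.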
Quick Takeaways
- ORA-01017 in RAC often hides OS-level permission issues.
- Check listener ownership, stale IPC sockets, and GPnP directory permissions first.
- Never manually start listeners in a role-separated RAC; always use srvctl.
- Clean environment and correct SetUID bits prevent future deadlocks.
- Sync password files only after the listener is restored, so remote logons stop failing with ORA-01017.
Lessons From the Field: DBA Takeaways
Many RAC troubleshooting guides focus solely on database commands, but the real culprit is often at the OS level. In production:
- Stale socket files in /var/tmp/.oracle silently block listener processes.
- Incorrect binary permissions or GPnP access can mimic complex database failures.
- Manual listener starts are a common operational mistake; a manually started listener may appear functional locally but silently fail cluster-wide.
- Monitoring listener processes, directory permissions, and ASM accessibility proactively can prevent hours of downtime.
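Proactive monitoring of permissions can be as simple as comparing actual modes against an expected list. This is a minimal sketch assuming GNU stat; audit_modes is a hypothetical helper, and the expected values in the example are the ones used in Step 1 of this article, not a universal baseline.

```shell
# audit_modes
# Reads "expected_octal_mode path" pairs on stdin and prints one
# MISMATCH line per path whose mode differs (or "missing" if absent).
# Returns non-zero if any mismatch was found.
audit_modes() {
    rc=0
    while read -r expected path; do
        [ -n "$path" ] || continue
        actual=$(stat -c '%a' "$path" 2>/dev/null) || actual=missing
        if [ "$actual" != "$expected" ]; then
            echo "MISMATCH $path expected=$expected actual=$actual"
            rc=1
        fi
    done
    return $rc
}

# Example, using the paths and modes from Step 1:
#   printf '%s\n' \
#     '6751 /u01/app/oracle/product/12.1.0/db_1/bin/oracle' \
#     '6751 /u01/app/12.1.0/grid_1/bin/oracle' \
#     '1777 /var/tmp/.oracle' | audit_modes
```

Run from cron, a non-zero exit status is enough to raise an alert before a permission drift turns into a connectivity outage.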
Mini Case Study: Real RAC Outage
In one 12c RAC cluster, a password reset triggered ORA-01017 on all nodes. Local sqlplus / as sysdba worked, but remote apps failed. Investigation revealed a daemon-owned listener and stale /var/tmp/.oracle sockets. By following the three-step approach above, the environment was restored in under 45 minutes, preventing what could have been a multi-hour production outage.
Conclusion
ORA-01017 and ORA-29780 errors in RAC environments are rarely just about credentials. OS-level permission misconfigurations, rogue listener processes, and GPnP access issues are common hidden causes. By methodically cleaning sockets, restoring listener control via Clusterware, and synchronizing passwords, you can restore full connectivity quickly and safely.
For DBAs, the lesson is clear: check the OS and cluster layers before diving into database troubleshooting. Test configurations regularly, monitor socket files and listener ownership, and enforce strict role separation. A small misstep in permissions can masquerade as a catastrophic database failure.
FAQs
1. Why do I get ORA-01017 immediately after SYS password reset?
It’s often due to stale listener processes, IPC socket files, or GPnP permission issues rather than a wrong password.
2. Can I manually start a listener in a RAC environment?
Not safely; manual starts can break the trust chain. Always use srvctl.
3. What are GPnP permissions and why are they important?
Grid Plug and Play (GPnP) directories store cluster configuration data. Missing read access for oracle can block resource discovery.
4. How do I clean stale IPC sockets?
Remove files starting with s# or sLISTENER in /var/tmp/.oracle and /tmp/.oracle, then reset directory permissions.
5. How can I prevent this problem in production?
Regularly audit listener ownership, SetUID bits, GPnP permissions, and monitor stale socket files.
