Thursday, July 31, 2014

ORA-00942 During upgrade to 11.2.0.3 or later on Windows

Problem
On the Windows platform (32-bit or 64-bit), while upgrading a 10.2.0.4 (or earlier) database to 11.2.0.3 or later with a bundle patch applied, an ORA-00942 error appears during the process, regardless of whether the upgrade is done manually or using DBUA. This error can safely be ignored; let the upgrade process run to the finish.

[Screenshot: ORA-00942 error dialog shown during the upgrade]
After the upgrade finishes, you can query DBA_REGISTRY to confirm that all components are valid.
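For example, a quick post-upgrade check (component names and count vary with the options installed):

```sql
-- List all registry components with their version and status;
-- after a successful upgrade, every component should show VALID
SELECT comp_id, comp_name, version, status
FROM   dba_registry
ORDER  BY comp_id;
```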

Reason for the Problem
This is a known issue that appears on the Windows platform while upgrading a database to an 11.2.0.3 (or later) home that has a bundle patch applied. If you look at the log file of the upgrade process (oracle_server.log), you will see the following error, which is related to Database Replay.

########################################################
update sys.WRR$_REPLAY_DIVERGENCE set cap_file_id = file_id;
update sys.WRR$_REPLAY_DIVERGENCE set cap_file_id = file_id
*
ERROR at line 1:
ORA-00942: table or view does not exist


commit;

Commit complete.

########################################################

Solution
Once the upgrade finishes and DBA_REGISTRY shows all components as VALID, there is nothing to worry about. The next step is to manually fix the issue that occurred during the upgrade. Run the following scripts as SYSDBA to recreate the Database Replay objects:

cd %oracle_home%\rdbms\admin
sqlplus "/ as sysdba"

SQL> @catnowrr.sql
SQL> @catwrr.sql
SQL> exit

Wednesday, July 23, 2014

ORA-29702 While Starting 10.2.0.4 Instance(s) on 11.2.0.3 RAC/Clusterware

I was in the process of upgrading my 3-node 10.2.0.4 RAC database to 11.2.0.3 RAC on Windows 2008. Since we had new hardware, we decided to build a new 11.2.0.3 RAC, create a 10.2.0.4 standby on it, perform a switchover, and then immediately upgrade the switched-over database. To accomplish this, we followed these steps:

1) Build a new 11.2.0.3 RAC by installing 11.2.0.3 GI and RDBMS
2) Install 10.2.0.4 on this new RAC
3) Build standby database on this new RAC setup
4) Perform switchover to this new RAC standby
5) Upgrade 10.2.0.4 to 11.2.0.3
6) Perform any post upgrade tasks

Problem
On step 3, while building the standby database, the instances failed to start with the error "ORA-29702: error occurred in Cluster Group Service operation".

Reason
In Oracle 11.2 RAC there is a concept of "pinning" nodes, and by default RAC nodes are not pinned. 10g is an older version in which nodes are effectively pinned by default (keeping consistent configuration information), whereas 11.2 behaves differently: a pre-11g database requires the nodes to be pinned, and this is why the error was appearing.

Solution
Pinning the nodes resolved this issue. The following steps show how to check the current node configuration and how to pin the nodes.

#####Getting information

%GRID_HOME%\bin\olsnodes -t
node01 Unpinned
node02 Unpinned
node03 Unpinned

#####Pinning the nodes by executing following from node1
%GRID_HOME%\bin\crsctl pin css -n node01 node02 node03
#####Getting information
%GRID_HOME%\bin\olsnodes -t
node01 1 Pinned
node02 2 Pinned
node03 3 Pinned

After pinning the nodes, I was able to start all standby instances.

Monday, July 21, 2014

ORA-04030: Out of Process Memory

This error occurs because of a shortage of memory available for a new session's server process when it tries to connect to the database.

Windows 32-bit
This error is more common on 32-bit Windows systems. On Windows, Oracle session processes run as threads under the main oracle.exe process, and on 32-bit Windows a single OS process can be allocated a maximum of 2G; hence, when oracle.exe approaches 2G, it starts throwing ORA-04030.

In today's era, few 32-bit servers are left. Anyone still using these old servers will probably be replacing them with 64-bit ones very soon.

When Does this Error Come?
Suppose we have allocated 1.5G of RAM to the SGA; that leaves about 500 MB available for user sessions (remember the 2G maximum for the whole oracle.exe process). As more and more sessions connect to the database, we can observe in Task Manager that the memory consumption of oracle.exe keeps increasing. Once this allocation approaches 2G, the error starts appearing.


The explanation above concerns 32-bit servers, but this error can also occur in 64-bit environments, on Windows as well as Unix-based systems. The reason on 64-bit platforms is the same: the OS cannot allocate memory for the PGA of a new incoming session, so the session fails to connect to the database and ORA-04030 is raised. On Linux systems, you should also check the limits for the oracle user with "ulimit -a", as well as the memory-related kernel parameters set in /etc/sysctl.conf. Follow the Oracle documentation for the appropriate settings.

Solution
1. Increase the amount of RAM (unless it is a 32-bit system, which is restricted to a maximum of 4G RAM and 2G per process).
2. On 32-bit Windows, decrease the size of the SGA so that more sessions can connect to the database without failing with this error.
3. Switch to a Shared Server configuration to conserve memory.
4. On 64-bit platforms, make sure you calculate the memory allocation properly. Suppose you have 64G of total memory: leave around 30% (about 20G) for the OS. The remaining 44G should be distributed between SGA plus PGA_AGGREGATE_TARGET and the process memory of each Oracle session that will connect to the database. If I expect 300 sessions connected simultaneously, I usually reserve 10 MB per session, which means 3G reserved for all Oracle sessions; the rest (44 - 3 = 41G) can then be used for the SGA and PGA_AGGREGATE_TARGET.
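When diagnosing which sessions are consuming process memory, a query along these lines can help (a sketch; exact column availability varies slightly by version):

```sql
-- Per-session PGA consumption, largest consumers first
SELECT s.sid,
       s.username,
       ROUND(p.pga_used_mem  / 1024 / 1024, 1) AS pga_used_mb,
       ROUND(p.pga_alloc_mem / 1024 / 1024, 1) AS pga_alloc_mb
FROM   v$session s
       JOIN v$process p ON p.addr = s.paddr
ORDER  BY p.pga_alloc_mem DESC;
```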

Monday, July 14, 2014

ORA-4031: unable to allocate "#XXXX#" bytes of shared memory

Most commonly this error occurs because the SHARED POOL is too small, which is indicated within the error message itself, as follows:

ERROR at line 1:
ORA-04031: unable to allocate 4018 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","kglsim heap")

Reason
When an object (a PL/SQL unit or SQL statement) is loaded into the shared pool, Oracle needs a contiguous chunk of free space large enough to hold the whole object; if it cannot find one, it throws the ORA-4031 error. The cause is fragmentation of the shared pool: free space may be available, but it is not contiguous, so the object cannot be loaded into a single larger free chunk of memory.

It is worth mentioning that "free space" does not necessarily mean never-used space. Free space means memory that is either genuinely free (never used) or available for reuse, as listed on the LRU (Least Recently Used) list. The LRU list helps Oracle reuse memory blocks that have not been "in use" for a longer period compared to other objects in the pool.

Temporary Solution
Issue the following statement to flush the entire shared pool.

SQL> ALTER SYSTEM FLUSH SHARED_POOL;

Permanent Solution
Increase the amount of memory available to the shared pool by raising one of the following parameters (depending on the memory management you are using):

SHARED_POOL_SIZE
SGA_TARGET
MEMORY_TARGET

Other Memory Structures
Besides the shared pool, you may face the same ORA-4031 if sufficient free memory is not available in the other pools, such as the Java pool, Streams pool, or large pool. In each case, take a close look at the memory requirements of the pool named in the error message.
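To see how much free memory each SGA pool currently reports, a query like this can be used (a sketch; "free memory" is reported per pool in v$sgastat):

```sql
-- Free memory currently reported in each SGA pool
SELECT pool, ROUND(bytes / 1024 / 1024, 1) AS free_mb
FROM   v$sgastat
WHERE  name = 'free memory'
ORDER  BY pool;
```

Note that, because of the fragmentation described above, a pool can show free memory here and still raise ORA-4031 for a large contiguous request.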

Monday, July 7, 2014

ORA-00376: "file#" cannot be read at this time.

I have experienced this error message (which contains the file# of the problem datafile) several times, but only on the Windows platform, though it may occur on any platform. After this error, all data in the mentioned datafile is no longer available, and the status of the datafile becomes "RECOVER" if we run this query:

select name,status from v$datafile where file#=<file# in error message>; 

I have not been able to find the cause of why the file spontaneously went into "RECOVER" state, but the solution to this problem is quite simple.
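To see all datafiles that need media recovery, along with the reason Oracle reports, v$recover_file can be queried (a sketch):

```sql
-- Datafiles marked as needing media recovery, with the reported reason
SELECT file#, online_status, error, change#
FROM   v$recover_file;
```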

Solution
If you act immediately when the error appears, you can run the RECOVER command right away to solve the problem. Oracle performs recovery using the online redo log files or, if needed, available archived log files, rolling the datafile's SCN forward to the current SCN to make it consistent. After recovery, bring the datafile online.

SQL> RECOVER DATAFILE <file#>;
SQL> ALTER DATABASE DATAFILE  <file#> ONLINE;

Note: If you delay recovery for some time, the information in the redo logs may be overwritten; the recover command may then prompt for the archived log required for the recovery, and you would need to restore that archived log file for recovery to proceed.
Moreover, if the database is in NOARCHIVELOG mode and the redo log group required for this recovery has been overwritten, you will not be able to recover the datafile.

Friday, July 4, 2014

ORA-12154: TNS:could not resolve the connect identifier specified in RAC (with Server Side Load Balancing)

"ORA-12154: TNS:could not resolve the connect identifier specified" is a very generic and common error message; faced while connecting to an Oracle database. If you "google" this error, you would find a lot of different reasons and solutions for this error message; and I will not repeat any of them, rather I would explain a different problem (which I faced recently), and the solution for it.

Problem
On my 2-node (10.2.0.4) RAC, everything worked fine until I enabled server-side load balancing (which lets the listener forward client connections to the least-loaded RAC node to keep the load balanced). After enabling it, I started receiving "ORA-12154: TNS:could not resolve the connect identifier specified" on my application server, but only at random. It was quite surprising that connections failed randomly rather than always. My TNS entries were as follows. The first entry is the one actually used; the remaining entries kept working because they connect directly to the instance named in the INSTANCE_NAME parameter, so server-side load balancing does not apply to them.


######################################
PRODDB.WORLD, PRODDB = 
(DESCRIPTION = 
(ADDRESS = (PROTOCOL = TCP)(HOST = XXX.XXX.XX.10)(PORT = 1521)) 
(ADDRESS = (PROTOCOL = TCP)(HOST = XXX.XXX.XX.11)(PORT = 1521)) 
(LOAD_BALANCE = yes) 
(CONNECT_DATA = 
(SERVER = DEDICATED) 
(SERVICE_NAME = proddb) 
) 
) 

PRODDB1.WORLD, PRODDB1 = 
(DESCRIPTION = 
(ADDRESS = (PROTOCOL = TCP)(HOST = XXX.XXX.XX.10)(PORT = 1521)) 
(LOAD_BALANCE = yes) 
(CONNECT_DATA = 
(SERVER = DEDICATED) 
(SERVICE_NAME = proddb)
(INSTANCE_NAME = proddb1)

) 
) 

PRODDB2.WORLD, PRODDB2 = 
(DESCRIPTION = 
(ADDRESS = (PROTOCOL = TCP)(HOST = XXX.XXX.XX.11)(PORT = 1521)) 
(LOAD_BALANCE = yes) 
(CONNECT_DATA = 
(SERVER = DEDICATED) 
(SERVICE_NAME = proddb)
(INSTANCE_NAME = proddb2)

) 
) 


#####################################

Root Cause
Investigation revealed that the host file of my application server did not contain entries for both RAC nodes. After adding entries for both nodes to the application server's host file, the problem was resolved (there was no DNS server in my environment).
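For illustration, the added entries looked something like the following (the hostnames here are hypothetical; use your actual node names and addresses):

```
# Application server hosts file: map each RAC node's hostname to its address
XXX.XXX.XX.10   racnode01
XXX.XXX.XX.11   racnode02
```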


How the Entries in the HOST file Made the Difference
In a RAC environment with server-side load balancing enabled, when a connection request arrives at a RAC node (say node1) and the listener decides to forward the connection to another node in the RAC (say node2), it passes the hostname of node2 back to the client so the client can continue the connection against node2. If the client system can resolve node2's hostname, either through DNS or its host file, the connection succeeds; otherwise it returns "ORA-12154".


Moral of the Story
Always make sure that all systems (servers) involved in your Oracle database environment have each other's entries in their host files or are resolvable through DNS. As a best practice, ALWAYS add the entries to the host file. This also applies to the client side: if you want to connect to the database server from a client, it is recommended to add the server's entries to the client's host file.

Tuesday, July 1, 2014

ORA-03113: end-of-file on communication channel

There can be several reasons for this error being returned to a session; for example, the session was killed, a network disconnection occurred, or the database was shut down. Sometimes none of these has happened, yet end users still complain of being disconnected from the database with this error from time to time, and one very common reason for this is a firewall.

If there is a firewall between our client (or application server) and the database server, the firewall may have the ability to drop a connection when it senses no activity on that connection for a specified amount of time. How the firewall monitors the connection, and what makes it kill the connection, varies from manufacturer to manufacturer and model to model.

SQLNET.EXPIRE_TIME
To avoid these "unwanted" disconnections, sqlnet.expire_time can be added to the sqlnet.ora file on the database server. The value of this parameter is in minutes, and it causes the database server to send a small probe packet to each connected client at that interval. This packet traffic makes the firewall treat the connection as active, so it does not drop it.

We can initially set it to a value of 10 minutes, which can be altered in coordination with the network admin, who can help find out the firewall settings. Add the following line to $ORACLE_HOME/network/admin/sqlnet.ora

############

SQLNET.EXPIRE_TIME = 10

############

The actual purpose of this parameter is to detect client sessions that have exited while their server process is still intact, so that the server process can be terminated and its resources released. For details about this parameter, see the SQLNET.EXPIRE_TIME section of the Oracle Net Services documentation.