We all believed that a 4-node Oracle 11.2.0.2 RAC on Linux is the the most solid and reliable environment; it should never go down, until we hit the Bug 12899768 - Processed messages remain in Queue causing space issues (Doc ID 12899768.8). This bug is only relevant when using Advanced Queuing which we don't use, but the Advanced Queuing was turned on and for some reason it was messed up. According to Oracle, this bug affects both 11.2.0.2 and 11.2.0.3. Actually, it is a latent bug; it may not occur at all, but can occur any time. When it happened in our system, we first experienced slowness of the database, then the database became unacceptable. In the alert log, the following error message come out repetitively:
Thu Dec 12 17:02:16 2013
Errors in file /u01/app/oracle/diag/rdbms/racdb/racdb1/trace/racdb1_q00n_81253.trc:
ORA-01405: fetched column value is NULL
ORA-01405: fetched column value is NULL
When trying to connect to the database, this error might be seen:
ORA-00257 Archiver error connect internal only
In fact, the Oracle RAC didn't go all the way down because of the bug, however because the FRA storage area runs out of space, the system cannot take new connections, thus to the end users, the system is completely down, which makes DBA look bad.
Uisng crsctl to check Oracle RAC, you might see at least one node is in INTERMEDIATE status with detail info "Stuck Archiver":
ora.sblract.db
1 ONLINE INTERMEDIATE racdb01 Stuck Archiver
2 ONLINE INTERMEDIATE racdb02 Stuck Archiver
3 ONLINE ONLINE racdb03 Open
4 ONLINE ONLINE racdb03 Open
To fix this problem, you have to log in to RMAN using sys account and manually delete archive logs from FRA in ASM storage, then the system could be back to normal. However, since the real issue of Advanced Queuing has not been addressed yet, the archive logs are still generated like crazy and in couple of hours the storage will be full again and the database will be unreachable.
To get rid of this bug, Oracle provides the following work-around steps (Doc ID 1162862.1):
1. Stop all the queues in the relevant queue table
[oracle@racdb01 ~]$ sqlplus / as sysdba
SQL> set line 200
SQL> select
owner, name, queue_type, enqueue_enabled, dequeue_enabled
from
dba_queues
where
owner='SYS' and
queue_table='SYS$SERVICE_METRICS_TAB' and
queue_type='NORMAL_QUEUE';
2 3 4 5 6 7 8
OWNER NAME QUEUE_TYPE ENQUEUE DEQUEUE
------------------------------ ------------------------------ -------------------- ------- -------
SYS SYS$SERVICE_METRICS NORMAL_QUEUE YES YES
SQL>
SQL> exec dbms_aqadm.stop_queue(queue_name => 'SYS.SYS$SERVICE_METRICS');
PL/SQL procedure successfully completed.
SQL>
SQL> select inst_id, value from gv$parameter where name like 'event';
INST_ID
----------
VALUE
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3
2
1
SQL>
SQL> alter system set events '10852 trace name context off';
System altered.
2. Delete the unnecessary/corrupted entries
SQL> delete from sys.aq$_SYS$SERVICE_METRICS_TAB_L where msgid ='00000000000000000000000000000000';
6488 rows deleted.
SQL> commit;
Commit complete.
SQL>
3. Purge the PROCESSED messages for all consumers as reported in the AQ$ view
(This step takes five or more minutes.)
SQL> DECLARE
2 po dbms_aqadm.aq$_purge_options_t;
3 BEGIN
4 po.block := TRUE;
5 DBMS_AQADM.PURGE_QUEUE_TABLE(
6 queue_table => 'SYS.SYS$SERVICE_METRICS_TAB',
7 purge_condition => 'qtview.queue = ''SYS.AQ$SYS$SERVICE_METRICS_TAB''
8 and qtview.msg_state = ''PROCESSED''',
9 purge_options => po);
10 END;
11 /
PL/SQL procedure successfully completed.
SQL> commit;
Commit complete.
4. Run below query to make sure rows were deleted
SQL> select
count(*)
from
SYS.aq$_SYS$SERVICE_METRICS_TAB_L
where
msgid ='00000000000000000000000000000000'; 2 3 4 5 6
COUNT(*)
----------
0
SQL>
5. Enable / Re-enable Event 10852
Once the above query returns 0 row, enable (or re-enable if using already) event 10852 executing below commands which will disable the new 11.2 dequeue functionality and revert back to pre 11.2 where the dequeue log table is not used:
SQL> alter system set events '10852 trace name context forever, level 16384';
System altered.
SQL> alter system set event='10852 trace name context forever, level 16384' scope=spfile;
System altered.
SQL>
6. Re-start the queue again
SQL> exec dbms_aqadm.start_queue(queue_name => 'SYS.SYS$SERVICE_METRICS')
PL/SQL procedure successfully completed.
SQL>
Note: To permanently fix this bug, apply back-port for bug 12899768 if available for your RDBMS 11.2 version, or upgrade to 12.1 where bug is fixed.
References
Troubleshooting High Redo Generation Issues (Doc ID 782935.1)
ORA-01405: Fetched Column Value Is Null For An Advanced Queue (Doc ID 1162862.1)
No comments:
Post a Comment