Pages

Oracle Archiving Generates Excessive Logs to Fill Up Storage Space

We all believed that a 4-node Oracle 11.2.0.2 RAC on Linux is the the most solid and reliable environment; it should never go down, until we hit the Bug 12899768 - Processed messages remain in Queue causing space issues (Doc ID 12899768.8). This bug is only relevant when using Advanced Queuing which we don't use, but the Advanced Queuing was turned on and for some reason it was messed up.  According to Oracle, this bug affects both 11.2.0.2 and 11.2.0.3.  Actually, it is a latent bug; it may not occur at all, but can occur any time.  When it happened in our system, we first experienced slowness of the database, then the database became unacceptable.  In the alert log, the following error message come out repetitively:

Thu Dec 12 17:02:16 2013
Errors in file /u01/app/oracle/diag/rdbms/racdb/racdb1/trace/racdb1_q00n_81253.trc:
ORA-01405: fetched column value is NULL
ORA-01405: fetched column value is NULL

When trying to connect to the database, this error might be seen:

ORA-00257 Archiver error connect internal only

In fact, the Oracle RAC didn't go all the way down because of the bug, however because the FRA storage area runs out of space, the system cannot take new connections, thus to the end users, the system is completely down, which makes DBA look bad.

Uisng crsctl to check Oracle RAC, you might see at least one node is in INTERMEDIATE status with detail info "Stuck Archiver":

ora.sblract.db
      1     ONLINE  INTERMEDIATE racdb01   Stuck Archiver
      2     ONLINE  INTERMEDIATE racdb02   Stuck Archiver
      3     ONLINE  ONLINE       racdb03   Open
      4     ONLINE  ONLINE       racdb03   Open

To fix this problem, you have to log in to RMAN using sys account and manually delete archive logs from FRA in ASM storage, then the system could be back to normal.  However, since the real issue of Advanced Queuing has not been addressed yet, the archive logs are still generated like crazy and in couple of hours the storage will be full again and the database will be unreachable.

To get rid of this bug, Oracle provides the following work-around steps (Doc ID 1162862.1):

1. Stop all the queues in the relevant queue table

[oracle@racdb01 ~]$ sqlplus / as sysdba
SQL> set line 200
SQL> select
     owner, name, queue_type, enqueue_enabled, dequeue_enabled
   from
     dba_queues
   where
     owner='SYS'  and
     queue_table='SYS$SERVICE_METRICS_TAB' and
     queue_type='NORMAL_QUEUE';

  2    3    4    5    6    7    8
OWNER                          NAME                           QUEUE_TYPE           ENQUEUE DEQUEUE
------------------------------ ------------------------------ -------------------- ------- -------
SYS                            SYS$SERVICE_METRICS            NORMAL_QUEUE           YES     YES

SQL>
SQL> exec dbms_aqadm.stop_queue(queue_name => 'SYS.SYS$SERVICE_METRICS');
PL/SQL procedure successfully completed.

SQL>
SQL> select inst_id, value from gv$parameter where name like 'event';

   INST_ID
----------
VALUE
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
         3
         2
         1

SQL>
SQL> alter system set events  '10852 trace name context off';
System altered.

2. Delete the unnecessary/corrupted entries

SQL> delete from sys.aq$_SYS$SERVICE_METRICS_TAB_L where msgid ='00000000000000000000000000000000';

6488 rows deleted.

SQL> commit;
Commit complete.
SQL>

3. Purge the PROCESSED messages for all consumers as reported in the AQ$ view

(This step takes five or more minutes.)

SQL> DECLARE
  2        po dbms_aqadm.aq$_purge_options_t;
  3     BEGIN
  4        po.block := TRUE;
  5        DBMS_AQADM.PURGE_QUEUE_TABLE(
  6         queue_table     => 'SYS.SYS$SERVICE_METRICS_TAB',
  7         purge_condition => 'qtview.queue =  ''SYS.AQ$SYS$SERVICE_METRICS_TAB''
  8         and qtview.msg_state = ''PROCESSED''',
  9         purge_options   => po);
 10     END;
 11     /

PL/SQL procedure successfully completed.
SQL> commit;
Commit complete.

4. Run below query to make sure rows were deleted

SQL> select
     count(*)
   from
     SYS.aq$_SYS$SERVICE_METRICS_TAB_L
   where
     msgid ='00000000000000000000000000000000';  2    3    4    5    6

  COUNT(*)
----------
         0

SQL>

5. Enable / Re-enable Event 10852

Once the above query returns 0 row, enable (or re-enable if using already) event 10852 executing below commands which will disable the new 11.2 dequeue functionality and revert back to pre 11.2 where the dequeue log table is not used:

SQL> alter system set events '10852 trace name context forever, level 16384';
System altered.

SQL> alter system set event='10852 trace name context forever, level 16384' scope=spfile;
System altered.
SQL>

6. Re-start the queue again

SQL> exec dbms_aqadm.start_queue(queue_name => 'SYS.SYS$SERVICE_METRICS')

PL/SQL procedure successfully completed.
SQL>

Note: To permanently fix this bug, apply back-port for bug 12899768 if available for your RDBMS 11.2 version, or upgrade to 12.1 where bug is fixed.

References

Troubleshooting High Redo Generation Issues (Doc ID 782935.1)

ORA-01405: Fetched Column Value Is Null For An Advanced Queue (Doc ID 1162862.1)

No comments:

Post a Comment