Wednesday, November 11, 2009

creating user defined metric for monitoring job durations

In this production system there are a number of database jobs (not Scheduler jobs) running at different time intervals. Some of these jobs run every five minutes while others run every hour. There are also some batch jobs scheduled to run every night.

There was a custom-developed Oracle9i script to monitor every job duration and warn the DBA if there was a broken job or a job running for more than three hours. As this database has been upgraded to 10g, we can now use "Notification Rules" and "User Defined Metrics" to monitor the database jobs. This approach gives us historical data collection as well as the standard notification method used by all other EM Console messages. We also no longer need to compile the old package on every production database; instead, the User Defined Metric can be applied to all of the databases by including the metric in the monitoring templates.

"User Defined Metrics" page can be reached from a link at the bottom of the main page of the Enterprise Manager DB Console just under the Related Links section shown below.



When you click the "User-Defined Metrics" link on the main page of the EM DB Console, the existing User Defined Metrics are listed.



You can manage your own custom metrics by using the buttons at the top right. You can simply select one of the User Defined Metrics and click the "View Data" button.



When you click the "View Data" button, the Metric Value History of the selected metric is displayed.



Again from the main page of "User-Defined Metrics" you can click the "Create" button to create a brand new User Defined Metric. Below are the sample creation page values I used for the metric UDM_LONG_RUNNING_JOBS. This metric calculates the job durations and notifies the DBA if there are jobs running for more than 180 minutes.



This metric has a string value, and the collection SQL of the metric is as follows. If the return value contains the string "DURATION", the EM Console treats it as exceeding a Critical Threshold and raises a Critical Alert. The metric collection runs every 5 minutes, and a single occurrence is sufficient to send a Critical Alert. (Note that the sample below uses a 5-minute duration threshold for testing; for the 180-minute rule described above, the threshold in the CASE expression would be 180.)


select RETURN from
(
  select 'DURATION: ' || round((sysdate - this_date) * 24 * 60, 2) || ' minutes JOB: ' || job || ' -> ' || substr(what, 1, 50) as RETURN
  from
  (
    select
      case when round((sysdate - this_date) * 24 * 60, 2) >= 5 then 'TRUE' else 'FALSE' end as return,
      round((sysdate - this_date) * 24 * 60, 2) as minutes,
      this_date,
      sysdate,
      job,
      schema_user,
      last_date,
      next_date,
      total_time,
      interval,
      what
    from dba_jobs
    where broken = 'N'
      and last_date is not null
    order by next_date asc
  )
  where return = 'TRUE'
  group by round((sysdate - this_date) * 24 * 60, 2), job, substr(what, 1, 50)
  UNION
  -- the UNION's sort places 'DURATION...' rows before the 'OK' row
  select 'OK' as RETURN from dual
)
where rownum = 1


One last thing: if you want an e-mail notification when a Critical Alert is triggered by this User Defined Metric, you should add this metric to the notification rules, which are on the Preferences page of the EM Grid Control.






Wednesday, November 4, 2009

what is datapump doing in my system tablespace

One of our datapump export jobs was continuously ending unsuccessfully with the following error.


Job "SYSTEM"."SYS_EXPORT_FULL_06" stopped due to fatal error at 01:42:36


I decided to investigate the log file, and it seems there is a lack of available space in the SYSTEM tablespace; the datapump job goes into a resumable wait and fails after 2 hours (the default resumable timeout value). The error lines are as follows.


ORA-39171: Job is experiencing a resumable wait.
ORA-01653: unable to extend table SYSTEM.SYS_EXPORT_FULL_06 by 128 in tablespace SYSTEM
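While the job is suspended like this, the wait is visible in DBA_RESUMABLE; a quick check along these lines can show the suspended statement and its timeout before the two hours run out (a sketch):

```sql
-- Suspended resumable statements, with the error that caused the wait
select name, status, timeout, error_msg
from dba_resumable
where status = 'SUSPENDED';
```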


I thought I should investigate the root cause of the problem. First I wanted to know which objects are using the SYSTEM tablespace; listing the largest segments would be handy.


select 
segment_name, bytes/1024/1024 as MB from dba_segments 
where 
tablespace_name='SYSTEM' 
order by bytes desc; 

/*
SEGMENT_NAME                MB
------------                --
SYS_LOB0001316900C00039$$   3072
SYS_LOB0001344581C00039$$   3054
SYS_LOB0001343732C00039$$   3029
SYS_LOB0001320071C00039$$   2375
SYS_LOB0001344712C00039$$   1216
C_OBJ#                      816
I_COL1                      745
I_COL2                      484
*/


What are these SYSTEM-generated segments? They are allocating lots of space here.


select 
owner, segment_name, segment_type, tablespace_name, bytes/1024/1024 as MB 
from 
dba_segments 
where 
segment_name = 'SYS_LOB0001344712C00039$$';

/*
OWNER   SEGMENT_NAME                SEGMENT_TYPE    TABLESPACE_NAME MB
-----   ------------                ------------    --------------- --
SYSTEM  SYS_LOB0001344712C00039$$   LOBSEGMENT      SYSTEM          1216
*/

select 
owner, table_name, column_name, segment_name 
from 
dba_lobs 
where segment_name = 'SYS_LOB0001344712C00039$$';

/*
OWNER   TABLE_NAME          COLUMN_NAME     SEGMENT_NAME
-----   ----------          -----------     ------------
SYSTEM  SYS_EXPORT_FULL_05  XML_CLOB        SYS_LOB0001344712C00039$$
*/ 


Now it is clear. The datapump jobs locked themselves out because of tables created by the jobs themselves (the master tables of the datapump jobs). It seems that when a job cannot end successfully (or fails unexpectedly), datapump cannot tidy up after itself. Let me see the datapump jobs existing in the database.


select * from dba_datapump_jobs;
select * from dba_datapump_sessions;

/*
OWNER_NAME  JOB_NAME            OPERATION       JOB_MODE        STATE
----------  --------            --------        --------        -----       
SYSTEM      SYS_EXPORT_FULL_04  EXPORT          FULL            NOT RUNNING 
SYSTEM      SYS_EXPORT_FULL_05  EXPORT          FULL            NOT RUNNING
*/


That is what I guessed: the jobs are not running and there are no sessions attached to them. I decided to drop the tables to return some free space to the SYSTEM tablespace.


drop table SYSTEM.SYS_EXPORT_FULL_04;

select 
segment_name, bytes/1024/1024 as MB from dba_segments 
where 
tablespace_name='SYSTEM' 
order by bytes desc; 

select to_char(trunc(sum(bytes)/1024/1024),'99,999,999') as FREE_MB from dba_free_space where tablespace_name='SYSTEM';

/*
FREE_MB
------
14,644
*/


Now it is OK; the 20 GB SYSTEM tablespace now has more than 14 GB of free space. Let me investigate the problem further so as not to fall into the same trap again. When I checked the available space on the filesystem the datapump job writes to, there was little space left on the disks, so an unexpected job failure is quite possible.

Meanwhile, as far as I know, if you are using datapump in parallel mode you should produce more than one dumpfile. You can append the %U substitution variable to the file name, and datapump will automatically create one file for every parallel thread. But in this case only one file was specified while 6 parallel threads were trying to be used. I don't know if this is the real cause of the problem, but I will definitely remove the parallel clause from the parameter file.

There is another option here that I had not tried before: compression. I am not sure about this, but maybe the compression of the METADATA is done in the SYSTEM tablespace, or it somehow needs some space in the database. That would look silly, because a compression implemented that way would not be very efficient, but I still don't want to use this parameter, so I also removed the compression parameter from the parameter file.


full=y
compression=METADATA_ONLY
directory=DATA_PUMP_DIR
dumpfile=fusion_export_full.exp
logfile=fusion_export_full.log
parallel=6
userid=system/@
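For reference, if the parallel clause were kept, the dumpfile line would need the %U substitution variable so that each worker gets its own file. A sketch of such a parameter file (file names are illustrative):

```text
full=y
directory=DATA_PUMP_DIR
dumpfile=fusion_export_full_%U.dmp
logfile=fusion_export_full.log
parallel=6
```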




Wednesday, October 28, 2009

DRCP vs Oracle Shared Server

While I was hanging around the Oracle Database 11g new features documents on OTN, I found something interesting that gets little coverage in the Oracle 11g New Features documentation: Database Resident Connection Pooling (DRCP). I was surprised, because this feature is very handy and usable in a lot of environments, especially if you serve your data to more than one application running on more than one application server, with an enormous number of connections.

Before Oracle Database 11g we all used the Oracle Shared Server architecture to lower the number of server-side user processes and their overhead on server resources. With 11g a similar architecture is now available, called Database Resident Connection Pooling (DRCP).

Quoted from the related document linked at the end of this post:


"In one of our test scenarios on a 2GB system, DRCP was able to support 10 times the number of connections compared to shared servers, and 20 times the number of connections compared to dedicated servers."


The question: is DRCP a replacement for Oracle Shared Server?
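For reference, this is roughly how DRCP is switched on and used; a minimal sketch, assuming the default pool and illustrative host/service names:

```sql
-- Run as SYSDBA: start the default pool (SYS_DEFAULT_CONNECTION_POOL).
-- The pool state persists across instance restarts.
exec DBMS_CONNECTION_POOL.START_POOL();

-- Clients then request a pooled server instead of a dedicated one,
-- either with (SERVER=POOLED) in the tnsnames.ora connect descriptor
-- or with the EZConnect form: sqlplus scott/tiger@//dbhost:1521/orcl:POOLED
```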

Related:
Database Resident Connection Pooling (DRCP) Oracle Database 11g Technical White Paper

Wednesday, October 14, 2009

Some Notes on Oracle Streams Table Replication

In this test, replication is set up from an Oracle Database version 11.1.0.7 to an 11.1.0.6 database on a virtual machine. This is not a performance test but only a functional test of Streams' basic abilities. The test is done on a table replication basis; maybe I can publish a schema replication based test in another article.

Some preparation is needed first, and I assume both databases are in archivelog mode. Actually, the documentation says it is enough for only the source database to archive its redo logs, which is reasonable for capturing the changes without missing any. But my databases are both in archivelog mode, so this requirement is covered.
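Verifying the log mode is quick (a simple check, run on each database):

```sql
-- Must return ARCHIVELOG, at least on the source database
select log_mode from v$database;
```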


--check if there are unsupported column types.
select * from DBA_STREAMS_UNSUPPORTED;

--create a tablespace on the source database for test purposes
create tablespace TS_STREAMS_SRC DATAFILE 'D:\oracledb11g\oradata\orcl\TS_STREAMS_SRC_01.dbf'
SIZE 10M AUTOEXTEND ON NEXT 10M MAXSIZE 100M EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;

--create a tablespace on the destination database for test purposes
create tablespace TS_STREAMS_DEST DATAFILE 'D:\oracledb11g\oradata\orcl\TS_STREAMS_DEST_01.dbf'
SIZE 10M AUTOEXTEND ON NEXT 10M MAXSIZE 100M EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;

--creating the streams admin user on both of the databases.
create user U_STREAMS_ADM identified by password
DEFAULT TABLESPACE TS_STREAMS_SRC
QUOTA UNLIMITED ON TS_STREAMS_SRC;

--streams user should have dba privilege
grant connect,resource,dba to U_STREAMS_ADM;
exec DBMS_STREAMS_AUTH.GRANT_ADMIN_PRIVILEGE('U_STREAMS_ADM',TRUE);

--create database links on the databases to access both from SOURCE to DESTINATION and from DESTINATION to SOURCE

--here ORCL_SOURCE is the tns alias for the SOURCE database and ORCL_DESTINATION is the tns alias for the DESTINATION database


create database link ORCL.SOURCE.LOCAL connect to U_STREAMS_ADM identified by password using 'ORCL_SOURCE';

create database link ORCL.DESTINATION.LOCAL connect to U_STREAMS_ADM identified by password using 'ORCL_DESTINATION';

--a second destination should be defined on the source database to send the redo logs to the destination database
alter system set LOG_ARCHIVE_DEST_2='SERVICE=ORCL ASYNC NOREGISTER VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=ORCL';

-- creating a table replication with MAINTAIN_TABLES
-- with this configuration CAPTURE and PROPAGATION processes are on the source database
-- APPLY process is on the DESTINATION database

DECLARE
tables DBMS_UTILITY.UNCL_ARRAY;
BEGIN
tables(1) := 'scott.dept';
DBMS_STREAMS_ADM.MAINTAIN_TABLES(
table_names => tables,
source_directory_object => NULL,
destination_directory_object => NULL,
source_database => 'ORCL.SOURCE.LOCAL',
destination_database => 'ORCL.DESTINATION.LOCAL',
capture_name => 'CAPTURE_TABLE_GRP1',
capture_queue_name => 'QUEUE_CAPT_TABLE_GRP1',
capture_queue_user => 'U_STREAMS_ADM',
propagation_name => 'PROPAGATE_TABLE_GRP1',
apply_name => 'APPLY_TABLE_GRP1',
apply_queue_name => 'QUEUE_APPLY_TABLE_GRP1',
apply_queue_user => 'U_STREAMS_ADM',
perform_actions => TRUE,
script_directory_object => NULL,
script_name => NULL,
bi_directional => FALSE,
include_ddl => TRUE,
instantiation => DBMS_STREAMS_ADM.INSTANTIATION_TABLE_NETWORK);
END;
/


The procedure ran smoothly, while I was half expecting it to end with an exception. If there is a problem with the instantiation of the replication, the following set of commands helps you clear the incomplete configuration so that you can run MAINTAIN_TABLES again.


SELECT * FROM DBA_RECOVERABLE_SCRIPT_ERRORS;
execute DBMS_STREAMS_ADM.RECOVER_OPERATION('A6B753822B9C4C30A5CD87B9571ACF03','PURGE');
exec DBMS_STREAMS_ADM.REMOVE_STREAMS_CONFIGURATION();

--The insert tests just worked fine as follows.
insert into SCOTT.DEPT values (60,'IT-DEVELOPMENT','IST');
insert into SCOTT.DEPT values (70,'IT-STORAGE','IST');
insert into SCOTT.DEPT values (80,'IT-BACKUP','IST');
insert into SCOTT.DEPT values (90,'IT-WINDOWS','IST');
insert into SCOTT.DEPT values (95,'IT-UNIX','IST');
insert into SCOTT.DEPT values (96,'IT-SECURITY','IST');
commit;

--Find out the configuration and status of the processes
select * from dba_capture;
select * from dba_propagation;
select * from dba_apply;
select * from v$streams_capture;


Let's see if the configuration is flexible enough to easily add a new table rule to the existing set. I followed the documentation with the following commands, in order.


--on source database
exec DBMS_CAPTURE_ADM.STOP_CAPTURE('CAPTURE_TABLE_GRP1');
exec DBMS_PROPAGATION_ADM.STOP_PROPAGATION('PROPAGATE_TABLE_GRP1');

--on the destination database
exec DBMS_APPLY_ADM.STOP_APPLY('APPLY_TABLE_GRP1');

--add new rule to the CAPTURE process
begin
DBMS_STREAMS_ADM.ADD_TABLE_RULES(
TABLE_NAME =>'SCOTT.EMP',
STREAMS_TYPE => 'CAPTURE',
STREAMS_NAME => 'CAPTURE_TABLE_GRP1',
SOURCE_DATABASE => 'ORCL.SOURCE.LOCAL',
QUEUE_NAME => 'QUEUE_CAPT_TABLE_GRP1',
INCLUDE_DML => TRUE,
INCLUDE_DDL => TRUE,
--DDL_RULE_NAME and DML_RULE_NAME are OUT parameters; left at their defaults
INCLUSION_RULE => TRUE,
AND_CONDITION => NULL);
end;
/

--add new rule to the PROPAGATION process
begin
DBMS_STREAMS_ADM.ADD_TABLE_PROPAGATION_RULES(
TABLE_NAME => 'SCOTT.EMP',
STREAMS_NAME => '', --if a propagation exists on source_queue_name, the existing propagation is used
SOURCE_QUEUE_NAME => 'U_STREAMS_ADM.QUEUE_CAPT_TABLE_GRP1',
DESTINATION_QUEUE_NAME => 'U_STREAMS_ADM.QUEUE_APPLY_TABLE_GRP1@ORCL.DESTINATION.LOCAL',
INCLUDE_DML => TRUE,
INCLUDE_DDL => TRUE,
SOURCE_DATABASE => 'ORCL.SOURCE.LOCAL',
INCLUSION_RULE => TRUE,
AND_CONDITION => NULL,
QUEUE_TO_QUEUE => TRUE --TRUE=queue_to_queue FALSE=queue_to_dblink
);
end;
/

--now the last thing is to add a new rule to the APPLY process
begin
DBMS_STREAMS_ADM.ADD_TABLE_RULES(
TABLE_NAME =>'SCOTT.EMP',
STREAMS_TYPE => 'APPLY',
STREAMS_NAME => 'APPLY_TABLE_GRP1',
SOURCE_DATABASE => 'ORCL.SOURCE.LOCAL',
QUEUE_NAME => 'QUEUE_APPLY_TABLE_GRP1',
INCLUDE_DML => TRUE,
INCLUDE_DDL => TRUE,
--DDL_RULE_NAME and DML_RULE_NAME are OUT parameters; left at their defaults
INCLUSION_RULE => TRUE,
AND_CONDITION => NULL);
end;
/


We can now check whether the new rules were added, and the conditions of the rules, using the following views. We cannot see any new process configuration, since I added the rules to the existing APPLY, PROPAGATION and CAPTURE processes.

DBA_RULES
DBA_RULE_SETS
DBA_STREAMS_RULES
DBA_RULE_SET_RULES

If you did something wrong while adding the rules, you can easily delete them by running the following procedures, which are under DBMS_RULE_ADM. Be sure about what you are deleting by selecting from DBA_STREAMS_RULES; you don't want to delete a rule associated with a working process.


--To delete a rule you should first remove it from its rule set
DBMS_RULE_ADM.REMOVE_RULE
DBMS_RULE_ADM.DELETE_RULE
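A sketch of the two calls; the rule and rule set names below are illustrative (take the real ones from a DBA_STREAMS_RULES lookup):

```sql
begin
  -- first detach the rule from the rule set that references it
  DBMS_RULE_ADM.REMOVE_RULE(
    rule_name     => 'U_STREAMS_ADM.EMP123',
    rule_set_name => 'U_STREAMS_ADM.RULESET$_15');
  -- then drop the rule itself
  DBMS_RULE_ADM.DELETE_RULE(rule_name => 'U_STREAMS_ADM.EMP123');
end;
/
```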


I added the rules, and now I should instantiate the table(s) manually. The MAINTAIN_TABLES procedure performs this instantiation automatically for you, but if you add the rules explicitly then you need to instantiate the tables yourself.


--set the instantiation SCN of the SOURCE table/tables
DECLARE
iscn NUMBER; --Variable to hold instantiation SCN value
BEGIN
iscn := DBMS_FLASHBACK.GET_SYSTEM_CHANGE_NUMBER();
DBMS_APPLY_ADM.SET_TABLE_INSTANTIATION_SCN(
source_object_name => 'scott.emp',
source_database_name => 'ORCL.SOURCE.LOCAL',
instantiation_scn => iscn);
END;
/

--Create the table on the DESTINATION database.
create table scott.emp as select * from scott.emp@ORCL.SOURCE.LOCAL;

--The last thing is not to forget SUPPLEMENTAL LOGGING for the replicated table (on the source database) so the changes can be captured and applied.
ALTER TABLE scott.emp ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

--After checking all the configuration again, I can now start the processes I stopped.
-- on source database.

exec DBMS_CAPTURE_ADM.START_CAPTURE('CAPTURE_TABLE_GRP1');
exec DBMS_PROPAGATION_ADM.START_PROPAGATION('PROPAGATE_TABLE_GRP1');

--On the destination database.
exec DBMS_APPLY_ADM.START_APPLY('APPLY_TABLE_GRP1');


It seems everything worked fine, as the Oracle documentation stated. Adding rules to the working processes of Streams is easy, except that you have to instantiate the tables manually (via create table .. as .., export/import, or expdp/impdp) when you add the rules manually. Now I want to go further and alter a rule of the existing configuration to capture only a subset of the changes.

Rules can be seen by selecting from DBA_RULES, DBA_RULE_SETS, DBA_RULE_SET_RULES or DBA_STREAMS_RULES, which is the one I prefer to use. One should be careful about which rule to alter: there are DML and DDL rules for every process, and the rule being altered should be associated with the correct CAPTURE process (since there can be more than one capture process).

In my case the rule I am going to alter is BONUS85, as follows. Stopping the processes before any change in the configuration is a good idea; without stopping and starting the processes, the new configuration cannot be applied to the existing Streams processes.


--On source database
exec DBMS_CAPTURE_ADM.STOP_CAPTURE('CAPTURE_TABLE_GRP1');
exec DBMS_PROPAGATION_ADM.STOP_PROPAGATION('PROPAGATE_TABLE_GRP1');

--On the destination database
exec DBMS_APPLY_ADM.STOP_APPLY('APPLY_TABLE_GRP1');

--Altering a rule capturing a subset of the data in the table.
begin
dbms_rule_adm.alter_rule('BONUS85',
':dml.get_object_owner()=''SCOTT'' AND :dml.get_object_name()=''BONUS''
AND :dml.is_null_tag()=''Y'' AND :dml.get_source_database_name() = ''ORCL.SOURCE.LOCAL''
AND (:dml.get_value(''NEW'',''ENAME'').AccessVarchar2()=''ERGEM'')');
end;
/

--On source database
exec DBMS_CAPTURE_ADM.START_CAPTURE('CAPTURE_TABLE_GRP1');
exec DBMS_PROPAGATION_ADM.START_PROPAGATION('PROPAGATE_TABLE_GRP1');

--On the destination database
exec DBMS_APPLY_ADM.START_APPLY('APPLY_TABLE_GRP1');

--Related views for streams monitoring
select * from v$streams_apply_coordinator;
select * from v$streams_apply_reader;
select * from v$streams_apply_server;
select * from v$streams_capture;
select * from v$streams_pool_advice;
select * from v$streams_transaction;
select * from v$streams_monitor;


In conclusion, Oracle Streams is easy to configure and use. It seems flexible: the configuration can be changed easily, and existing rules can be altered to capture a subset of the changes. I did not test the performance, but the functional tests look OK.

Of course there are a lot of options you can use: downstream capture processes, synchronous captures, parallel capture/apply processes and so on. The Oracle documentation has sufficient information about Streams.

Related:
Oracle Streams Replication Administrator's Guide
Oracle Streams Concepts and Administration
PSOUG Streams Demo



Friday, May 15, 2009

SQL Server detected a logical consistency-based I/O error

After migrating a SQL Server 2005 SP2 32-bit database to an SP3 64-bit environment, some queries started to fail with the following error in the SQL Server error logs.


Message:
SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0x0; actual signature: 0x5be5d620). It occurred during a read of page (1:718204) in database ID 7 at offset 0x0000015eaf8000 in file 'E:\MSSQL\DATA\SMPPGatewayBEP.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.


It is obvious that an I/O inconsistency happened, maybe during the backup or the restore operation. I decided to execute DBCC CHECKDB('SMPPGatewayBEP'), but the server generated the following output.


Msg 8967, Level 16, State 216, Line 1
An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
DBCC results for 'SMPPGatewayBEP'.
Msg 8921, Level 16, State 1, Line 1
Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
CHECKDB found 0 allocation errors and 0 consistency errors in database 'SMPPGatewayBEP'.
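Before repairing, SQL Server's msdb.dbo.suspect_pages table can also help scope the damage, since the engine logs torn and checksum-failed pages there as it hits them (a sketch):

```sql
-- event_type 3 = torn page; database_id should match the failing database
select database_id, file_id, page_id, event_type, error_count, last_update_date
from msdb.dbo.suspect_pages
order by last_update_date desc;
```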


After a little search on MSDN, I decided to set the database into single user mode and attempt a repair by running:


alter database SMPPGatewayBEP SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DBCC CHECKDB('SMPPGatewayBEP', REPAIR_ALLOW_DATA_LOSS)


Now it started to repair the data pages by reading the log pages. There are thousands of lines of output, so I include only the head and tail parts.


DBCC results for 'SMPPGatewayBEP'.
Service Broker Msg 9675, State 1: Message Types analyzed: 14.
Service Broker Msg 9676, State 1: Service Contracts analyzed: 6.
Service Broker Msg 9667, State 1: Services analyzed: 3.
Service Broker Msg 9668, State 1: Service Queues analyzed: 3.
Service Broker Msg 9669, State 1: Conversation Endpoints analyzed: 0.
Service Broker Msg 9674, State 1: Conversation Groups analyzed: 0.
Service Broker Msg 9670, State 1: Remote Service Bindings analyzed: 0.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:352523) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:352524) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:352525) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:352526) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:352527) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:719092) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:719093) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.

...
..
.

Could not repair this error.
Could not repair this error.
Could not repair this error.
Could not repair this error.
The error has been repaired.
The error has been repaired.
The error has been repaired.
Could not repair this error.
Could not repair this error.
Could not repair this error.
Could not repair this error.
Could not repair this error.
Could not repair this error.
The error has been repaired.
CHECKDB found 0 allocation errors and 1857 consistency errors not associated with any single object.
CHECKDB fixed 0 allocation errors and 973 consistency errors not associated with any single object.
DBCC results for 'sys.sysrowsetcolumns'.
There are 716 rows in 8 pages for object "sys.sysrowsetcolumns".
DBCC results for 'sys.sysrowsets'.
There are 101 rows in 1 pages for object "sys.sysrowsets".
DBCC results for 'sysallocunits'.
There are 115 rows in 2 pages for object "sysallocunits".
DBCC results for 'sys.sysfiles1'.
There are 2 rows in 1 pages for object "sys.sysfiles1".
DBCC results for 'sys.syshobtcolumns'.
There are 716 rows in 9 pages for object "sys.syshobtcolumns".
DBCC results for 'sys.syshobts'.
There are 101 rows in 1 pages for object "sys.syshobts".
DBCC results for 'sys.sysftinds'.
There are 0 rows in 0 pages for object "sys.sysftinds".
DBCC results for 'sys.sysserefs'.
There are 115 rows in 1 pages for object "sys.sysserefs".
DBCC results for 'sys.sysowners'.
There are 15 rows in 1 pages for object "sys.sysowners".
DBCC results for 'sys.sysprivs'.
There are 135 rows in 1 pages for object "sys.sysprivs".
DBCC results for 'sys.sysschobjs'.
There are 134 rows in 4 pages for object "sys.sysschobjs".
DBCC results for 'sys.syscolpars'.
There are 629 rows in 13 pages for object "sys.syscolpars".
DBCC results for 'sys.sysnsobjs'.
There are 1 rows in 1 pages for object "sys.sysnsobjs".
DBCC results for 'sys.syscerts'.
There are 0 rows in 0 pages for object "sys.syscerts".

...
..
.

Repair: The page (1:718716) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:718717) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:718718) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:718719) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:719048) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:719049) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:719050) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:719051) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:719052) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:719053) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:719054) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).
Repair: The page (1:719055) has been deallocated from object ID 174623665, index ID 7, partition ID 72057594160414720, alloc unit ID 72057594168016896 (type In-row data).

...
..
.

The error has been repaired.
The error has been repaired.
The error has been repaired.
There are 7270488 rows in 35601 pages for object "WSLog".
CHECKDB found 0 allocation errors and 8 consistency errors in table 'WSLog' (object ID 398624463).
CHECKDB fixed 0 allocation errors and 8 consistency errors in table 'WSLog' (object ID 398624463).
DBCC results for 'SettingsInbound'.
There are 1 rows in 1 pages for object "SettingsInbound".
DBCC results for 'SettingsDelivered'.
There are 1 rows in 1 pages for object "SettingsDelivered".
DBCC results for 'sysdiagrams'.
There are 0 rows in 0 pages for object "sysdiagrams".
DBCC results for 'ThreadMonitoring'.
There are 0 rows in 0 pages for object "ThreadMonitoring".
DBCC results for 'Settings'.
There are 1 rows in 1 pages for object "Settings".
DBCC results for 'SMPPIncoming'.
There are 95 rows in 4 pages for object "SMPPIncoming".
DBCC results for 'BlankMessages'.
There are 0 rows in 0 pages for object "BlankMessages".
DBCC results for 'TestTrace'.
There are 2134061 rows in 46701 pages for object "TestTrace".
DBCC results for 'ServiceParams'.
There are 18 rows in 1 pages for object "ServiceParams".
DBCC results for 'sys.queue_messages_1977058079'.
There are 0 rows in 0 pages for object "sys.queue_messages_1977058079".
DBCC results for 'sys.queue_messages_2009058193'.
There are 0 rows in 0 pages for object "sys.queue_messages_2009058193".
DBCC results for 'sys.queue_messages_2041058307'.
There are 0 rows in 0 pages for object "sys.queue_messages_2041058307".
CHECKDB found 0 allocation errors and 6622 consistency errors in database 'SMPPGatewayBEP'.
CHECKDB fixed 0 allocation errors and 5738 consistency errors in database 'SMPPGatewayBEP'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (SMPPGatewayBEP, repair_allow_data_loss).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.


After one hour, the summary finally showed that most of the page errors were repaired. Maybe I should have run the same DBCC CHECKDB a second time to be sure the remaining errors were gone, but we did not have to wait for that: after setting the database back into multi-user mode and backing it up, users were able to run their reports again.


alter database SMPPGatewayBEP SET MULTI_USER;




Monday, February 23, 2009

Problem with adding a new node to RAC after ocr and voting disk change

After replacing the OCR and voting disks with raw devices different from the ones used at installation time, I later faced a problem while adding a new node to my existing 2-node cluster.

After the first step of adding a new node to the cluster (cloning my CRS_HOME, the Oracle RAC cluster home, and running all the scripts), the last step in the documentation was running root.sh in the new node's CRS_HOME. But it failed with an unexpected error:


root@radbank04 # ./root.sh
"/dev/rdsk/c6t600601607D731F006596A6CD2CE9DD11d0s0" does not exist. Create it before proceeding.
Make sure that this file is shared across cluster nodes.
1

I was pretty sure the raw device definitions were OK on the new node, so I decided to examine root.sh to see what it runs in the background. root.sh consists of two other scripts, rootinstall and rootconfig.

root@radbank04 # more root.sh
#!/bin/sh
/oracle/product/crs10g/install/rootinstall
/oracle/product/crs10g/install/rootconfig

I found the following settings at the very beginning of the rootconfig script.

SILENT=false
ORA_CRS_HOME=/oracle/product/crs10g
CRS_ORACLE_OWNER=oracle
CRS_DBA_GROUP=oinstall
CRS_VNDR_CLUSTER=false
CRS_OCR_LOCATIONS=/dev/rdsk/c6t600601607D731F00BAE716BB2CE9DD11d0s0,/dev/rdsk/c6t600601607D731F006496A6CD2CE9DD11d0s0
CRS_CLUSTER_NAME=crs
CRS_HOST_NAME_LIST=radbank02,1,radbank03,2
CRS_NODE_NAME_LIST=radbank02,1,radbank03,2
CRS_PRIVATE_NAME_LIST=radbank02-priv1,1,radbank03-priv1,2
CRS_LANGUAGE_ID='AMERICAN_AMERICA.WE8ISO8859P1'
CRS_VOTING_DISKS=/dev/rdsk/c6t600601607D731F006596A6CD2CE9DD11d0s0
CRS_NODELIST=radbank02,radbank03
CRS_NODEVIPS='radbank02/radbank02-vip/255.255.255.0/bge0,radbank03/radbank03-vip/255.255.255.0/bge0'

Somehow (my second "somehow" in this 10gR2 RAC environment) the script still referenced the old OCR and voting disks; the disks I had swapped in earlier were completely different. I decided to edit rootconfig, replace the old OCR and voting disk paths with the new ones as follows, and rerun it. I do not know whether Oracle recommends this, but it was the only option I had at hand.

SILENT=false
ORA_CRS_HOME=/oracle/product/crs10g
CRS_ORACLE_OWNER=oracle
CRS_DBA_GROUP=oinstall
CRS_VNDR_CLUSTER=false
#CRS_OCR_LOCATIONS=/dev/rdsk/c6t600601607D731F00BAE716BB2CE9DD11d0s0,/dev/rdsk/c6t600601607D731F006496A6CD2CE9DD11d0s0
CRS_OCR_LOCATIONS=/dev/rdsk/c6t600601607D731F00AE6F711488FEDD11d0s0,/dev/rdsk/c6t600601607D731F00AF6F711488FEDD11d0s0
CRS_CLUSTER_NAME=crs
CRS_HOST_NAME_LIST=radbank02,1,radbank03,2
CRS_NODE_NAME_LIST=radbank02,1,radbank03,2
CRS_PRIVATE_NAME_LIST=radbank02-priv1,1,radbank03-priv1,2
CRS_LANGUAGE_ID='AMERICAN_AMERICA.WE8ISO8859P1'
#CRS_VOTING_DISKS=/dev/rdsk/c6t600601607D731F006596A6CD2CE9DD11d0s0
CRS_VOTING_DISKS=/dev/rdsk/c6t600601607D731F00A8833C0377FEDD11d0s0,/dev/rdsk/c6t600601607D731F00A6833C0377FEDD11d0s0,/dev/rdsk/c6t600601607D731F00A7833C0377FEDD11d0s0
CRS_NODELIST=radbank02,radbank03
CRS_NODEVIPS='radbank02/radbank02-vip/255.255.255.0/bge0,radbank03/radbank03-vip/255.255.255.0/bge0'
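For future node additions, the same substitution could be scripted instead of edited by hand. A minimal sed sketch, using a two-line stand-in for the real file (only one of the three new voting disks is shown for brevity; point CONF at $ORA_CRS_HOME/install/rootconfig, after backing it up, to do it for real):

```shell
#!/bin/sh
# Sketch: replace the stale CRS_OCR_LOCATIONS and CRS_VOTING_DISKS lines in
# rootconfig. A stand-in sample file is used here instead of the real one.
CONF=./rootconfig.sample
cat > "$CONF" <<'EOF'
CRS_OCR_LOCATIONS=/dev/rdsk/c6t600601607D731F00BAE716BB2CE9DD11d0s0,/dev/rdsk/c6t600601607D731F006496A6CD2CE9DD11d0s0
CRS_VOTING_DISKS=/dev/rdsk/c6t600601607D731F006596A6CD2CE9DD11d0s0
EOF

NEW_OCR=/dev/rdsk/c6t600601607D731F00AE6F711488FEDD11d0s0,/dev/rdsk/c6t600601607D731F00AF6F711488FEDD11d0s0
NEW_VOTE=/dev/rdsk/c6t600601607D731F00A8833C0377FEDD11d0s0

# Rewrite whole assignment lines so the old values do not have to be quoted.
sed -e "s|^CRS_OCR_LOCATIONS=.*|CRS_OCR_LOCATIONS=$NEW_OCR|" \
    -e "s|^CRS_VOTING_DISKS=.*|CRS_VOTING_DISKS=$NEW_VOTE|" \
    "$CONF" > "$CONF.new"

# Verify the replacement before putting the file back in place.
grep -E '^CRS_(OCR_LOCATIONS|VOTING_DISKS)=' "$CONF.new"
```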

root@radbank04 # /oracle/product/crs10g/install/rootconfig

Checking to see if Oracle CRS stack is already configured
OCR LOCATIONS = /dev/rdsk/c6t600601607D731F00AE6F711488FEDD11d0s0,/dev/rdsk/c6t600601607D731F00AF6F711488FEDD11d0s0
OCR backup directory '/oracle/product/crs10g/cdata/crs' does not exist. Creating now
Setting the permissions on OCR backup directory
Setting up NS directories
Oracle Cluster Registry configuration upgraded successfully
WARNING: directory '/oracle/product' is not owned by root
WARNING: directory '/oracle' is not owned by root
clscfg: EXISTING configuration version 3 detected.
clscfg: version 3 is 10G Release 2.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node :
node 1: radbank02 radbank02-priv1 radbank02
node 2: radbank03 radbank03-priv1 radbank03
clscfg: Arguments check out successfully.

NO KEYS WERE WRITTEN. Supply -force parameter to override.
-force is destructive and will destroy any previous cluster
configuration.
Oracle Cluster Registry for cluster has already been initialized
Startup will be queued to init within 30 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
CSS is active on these nodes.
radbank02
radbank03
radbank04
CSS is active on all nodes.

root@radbank04 # ./crs_stat -t

Name Type Target State Host
------------------------------------------------------------
ora.monsc.db application ONLINE ONLINE radbank03
ora....c1.inst application ONLINE ONLINE radbank02
ora....c2.inst application ONLINE ONLINE radbank03
ora.montc.db application ONLINE ONLINE radbank02
ora....ntc1.cs application ONLINE ONLINE radbank02
ora....c1.inst application ONLINE ONLINE radbank02
ora....tc1.srv application ONLINE ONLINE radbank02
ora....ntc2.cs application ONLINE ONLINE radbank03
ora....c2.inst application ONLINE ONLINE radbank03
ora....tc2.srv application ONLINE ONLINE radbank03
ora....ntc3.cs application ONLINE ONLINE radbank03
ora....tc1.srv application ONLINE ONLINE radbank02
ora....tc2.srv application ONLINE ONLINE radbank03
ora....SM1.asm application ONLINE ONLINE radbank02
ora....02.lsnr application ONLINE ONLINE radbank02
ora....02.lsnr application ONLINE ONLINE radbank02
ora....k02.gsd application ONLINE ONLINE radbank02
ora....k02.ons application ONLINE ONLINE radbank02
ora....k02.vip application ONLINE ONLINE radbank02
ora....SM2.asm application ONLINE ONLINE radbank03
ora....03.lsnr application ONLINE ONLINE radbank03
ora....03.lsnr application ONLINE ONLINE radbank03
ora....k03.gsd application ONLINE ONLINE radbank03
ora....k03.ons application ONLINE ONLINE radbank03
ora....k03.vip application ONLINE ONLINE radbank03
ora....k04.gsd application ONLINE ONLINE radbank04
ora....k04.ons application ONLINE ONLINE radbank04
ora....k04.vip application ONLINE ONLINE radbank04


The new node is now part of the existing cluster. The next step is adding the ORACLE_HOME (database installation), the instances, and the services.

Resources:

Oracle® Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide 10g Release 2 (10.2)

Add/Remove ASM disks and asm_power_limit

Before going live on production, our storage team decided to reconfigure their storage and RAID sets for better performance. So I gave the raw devices back from ASM and, of course, used the opportunity to test the asm_power_limit parameter.

The following listing shows how to list ASM disks, add and remove disks from a disk group, and watch the rebalance operation through the v$asm_operation view.

Increasing the init parameter asm_power_limit makes ASM rebalance operations significantly faster. The difference is obvious when you set the parameter to 10 (instead of the default of 1) and then run the same operations that trigger a rebalance. Since this was not yet an online system, we could let ASM use all available I/O for rebalancing; on a production system such an aggressive setting can cause unwanted I/O performance degradation for the application.
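As a rough sanity check on the v$asm_operation figures further down, EST_MINUTES is approximately the remaining work divided by the current rate, i.e. (EST_WORK - SOFAR) / EST_RATE. A small awk sketch with the numbers from the first rebalance row:

```shell
# EST_MINUTES ~= (EST_WORK - SOFAR) / EST_RATE
# Figures taken from the first v$asm_operation row in the listing below.
awk 'BEGIN {
    sofar = 7198; est_work = 12996; est_rate = 260
    printf "%d minutes left\n", (est_work - sofar) / est_rate
}'
# prints: 22 minutes left
```

This matches the EST_MINUTES column Oracle reports (22) for that row.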


SQL> select group_number,state,name,total_mb from v$asm_disk;

GROUP_NUMBER STATE    NAME                   TOTAL_MB
------------ -------- -------------------- ----------
0 NORMAL                               924
0 NORMAL                               924
0 NORMAL                               924
0 NORMAL                               924
1 NORMAL   DATA_0009                614300
1 NORMAL   DATA_0008                614290
1 NORMAL   DATA_0002                614290
1 NORMAL   DATA_0001                614290
1 NORMAL   DATA_0000                614290
1 NORMAL   DATA_0007                614290
1 NORMAL   DATA_0006                614290
1 NORMAL   DATA_0005                614290
1 NORMAL   DATA_0004                614290
1 NORMAL   DATA_0003                614290

14 rows selected.

SQL> alter diskgroup DATA drop disk DATA_0000;

Diskgroup altered.

SQL> alter diskgroup DATA drop disk DATA_0001;

Diskgroup altered.

SQL> select group_number,state,name,total_mb,label,path from v$asm_disk;

GROUP_NUMBER STATE    NAME              TOTAL_MB LABEL PATH
------------ -------- --------------- ---------- ----- --------------------------------------------------
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F00A8833C0377FEDD11d0s0
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F00A7833C0377FEDD11d0s0
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F00A6833C0377FEDD11d0s0
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F006596A6CD2CE9DD11d0s0
1 NORMAL   DATA_0009           614300       /dev/rdsk/c6t600601607D731F00D994060263E8DD11d0s0
1 NORMAL   DATA_0008           614290       /dev/rdsk/c6t600601607D731F00D894060263E8DD11d0s0
1 NORMAL   DATA_0002           614290       /dev/rdsk/c6t600601607D731F006A37A4E0A0E7DD11d0s0
1 DROPPING DATA_0001           614290       /dev/rdsk/c6t600601607D731F006937A4E0A0E7DD11d0s0
1 DROPPING DATA_0000           614290       /dev/rdsk/c6t600601607D731F006837A4E0A0E7DD11d0s0
1 NORMAL   DATA_0007           614290       /dev/rdsk/c6t600601607D731F007608DB3DA0E7DD11d0s0
1 NORMAL   DATA_0006           614290       /dev/rdsk/c6t600601607D731F007508DB3DA0E7DD11d0s0
1 NORMAL   DATA_0005           614290       /dev/rdsk/c6t600601607D731F007408DB3DA0E7DD11d0s0
1 NORMAL   DATA_0004           614290       /dev/rdsk/c6t600601607D731F007308DB3DA0E7DD11d0s0
1 NORMAL   DATA_0003           614290       /dev/rdsk/c6t600601607D731F007208DB3DA0E7DD11d0s0


SQL> select * from v$asm_operation;

OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES
----- ---- ---------- ---------- ---------- ---------- ---------- -----------
REBAL RUN          10         10       7198      12996        260          22
REBAL RUN          10         10       7679      12979        242          21
REBAL RUN          10         10       9286      12930        201          18
REBAL RUN          10         10      11647      12899        237           5

SQL> select * from v$asm_operation;

no rows selected

SQL> select group_number,state,name,total_mb,label,path from v$asm_disk;

GROUP_NUMBER STATE    NAME              TOTAL_MB LABEL PATH
------------ -------- --------------- ---------- ----- --------------------------------------------------
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F00A8833C0377FEDD11d0s0
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F00A7833C0377FEDD11d0s0
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F00A6833C0377FEDD11d0s0
0 NORMAL                       614290       /dev/rdsk/c6t600601607D731F006937A4E0A0E7DD11d0s0
0 NORMAL                       614290       /dev/rdsk/c6t600601607D731F006837A4E0A0E7DD11d0s0
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F006596A6CD2CE9DD11d0s0
1 NORMAL   DATA_0009           614300       /dev/rdsk/c6t600601607D731F00D994060263E8DD11d0s0
1 NORMAL   DATA_0008           614290       /dev/rdsk/c6t600601607D731F00D894060263E8DD11d0s0
1 NORMAL   DATA_0002           614290       /dev/rdsk/c6t600601607D731F006A37A4E0A0E7DD11d0s0
1 NORMAL   DATA_0007           614290       /dev/rdsk/c6t600601607D731F007608DB3DA0E7DD11d0s0
1 NORMAL   DATA_0006           614290       /dev/rdsk/c6t600601607D731F007508DB3DA0E7DD11d0s0
1 NORMAL   DATA_0005           614290       /dev/rdsk/c6t600601607D731F007408DB3DA0E7DD11d0s0
1 NORMAL   DATA_0004           614290       /dev/rdsk/c6t600601607D731F007308DB3DA0E7DD11d0s0
1 NORMAL   DATA_0003           614290       /dev/rdsk/c6t600601607D731F007208DB3DA0E7DD11d0s0


SQL> alter diskgroup DATA add disk '/dev/rdsk/c6t600601607D731F006837A4E0A0E7DD11d0s0';

Diskgroup altered.

SQL> select group_number,state,name,total_mb,label,path from v$asm_disk;

GROUP_NUMBER STATE    NAME              TOTAL_MB LABEL PATH
------------ -------- --------------- ---------- ----- --------------------------------------------------
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F00A8833C0377FEDD11d0s0
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F00A7833C0377FEDD11d0s0
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F00A6833C0377FEDD11d0s0
0 NORMAL                       614290       /dev/rdsk/c6t600601607D731F006937A4E0A0E7DD11d0s0
0 NORMAL                          924       /dev/rdsk/c6t600601607D731F006596A6CD2CE9DD11d0s0
1 NORMAL   DATA_0009           614300       /dev/rdsk/c6t600601607D731F00D994060263E8DD11d0s0
1 NORMAL   DATA_0008           614290       /dev/rdsk/c6t600601607D731F00D894060263E8DD11d0s0
1 NORMAL   DATA_0002           614290       /dev/rdsk/c6t600601607D731F006A37A4E0A0E7DD11d0s0
1 NORMAL   DATA_0007           614290       /dev/rdsk/c6t600601607D731F007608DB3DA0E7DD11d0s0
1 NORMAL   DATA_0006           614290       /dev/rdsk/c6t600601607D731F007508DB3DA0E7DD11d0s0
1 NORMAL   DATA_0005           614290       /dev/rdsk/c6t600601607D731F007408DB3DA0E7DD11d0s0
1 NORMAL   DATA_0004           614290       /dev/rdsk/c6t600601607D731F007308DB3DA0E7DD11d0s0
1 NORMAL   DATA_0003           614290       /dev/rdsk/c6t600601607D731F007208DB3DA0E7DD11d0s0
1 NORMAL   DATA_0000           614290       /dev/rdsk/c6t600601607D731F006837A4E0A0E7DD11d0s0


SQL> show parameter asm_power_limit;

NAME                            TYPE        VALUE
------------------------------- ----------- -------------------------
asm_power_limit                 integer     1

SQL> alter system set asm_power_limit=10;

System altered.

SQL> show parameter asm_power_limit;

NAME                            TYPE        VALUE
------------------------------- ----------- -------------------------
asm_power_limit                 integer     10

SQL>