When an ASM disk fails, ASM behaves differently depending on the redundancy of the diskgroup. If the diskgroup has EXTERNAL REDUNDANCY, it keeps working as long as redundancy is provided at the external RAID level. If there is no RAID at the external level, the diskgroup is dismounted immediately. The disk then has to be repaired or replaced, the diskgroup may need to be dropped and re-created, and the data on it has to be recovered.
For NORMAL and HIGH redundancy diskgroups, the behavior is a little different. When a disk becomes corrupted or goes missing in a NORMAL/HIGH redundancy diskgroup, an error is reported in the alert log file and the disk goes OFFLINE, as we can see in the output of the query below, which I ran after starting my ASM disk failure test. To simulate the failure, I simply unplugged from the storage a disk that belonged to an ASM diskgroup with NORMAL redundancy.
col name format a8
col header_status format a7
set lines 2000
col path format a10
select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN         1200 OFFLINE MISSING
Here we see the value 1200 under the REPAIR_TIMER column; this is the time in seconds after which the disk will be dropped automatically. It is derived from a diskgroup attribute called DISK_REPAIR_TIME, which I discuss below.
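Since REPAIR_TIMER is reported in seconds, it can be easier to read as minutes. The query below is a small sketch that lists only the disks whose timer is running; the column alias minutes_left is my own name, not something ASM provides.

```sql
-- List disks whose repair timer is counting down.
-- minutes_left is an illustrative alias: REPAIR_TIMER (seconds) / 60.
select name,
       mode_status,
       repair_timer,
       round(repair_timer / 60) as minutes_left
from   v$asm_disk
where  repair_timer > 0;
```

For the DATA4 disk above, 1200 seconds works out to 20 minutes.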
In 10g, a missing disk was dropped immediately, and a REBALANCE operation started right away, with ASM redistributing the extents across the remaining disks in the diskgroup to restore redundancy.
DISK_REPAIR_TIME
Starting with 11g, Oracle provides a diskgroup attribute called DISK_REPAIR_TIME, with a default value of 3.6 hours. It means that when a disk goes missing, ASM does not drop it immediately but waits for the disk to come back online or be replaced. This helps in scenarios where a disk is unplugged accidentally, or a storage server/SAN gets disconnected or rebooted, leaving an ASM diskgroup without one or more disks. While the disk(s) remain unavailable, ASM keeps track of the extents that should have been written to the missing disk(s), and starts writing them as soon as the disk(s) are back online (this feature is called fast mirror resync). If the disk(s) do not come back online within the DISK_REPAIR_TIME threshold, they are dropped and a rebalance starts.
FAILGROUP_REPAIR_TIME
Starting with 12c, another attribute can be set for the diskgroup: FAILGROUP_REPAIR_TIME, with a default value of 24 hours. It is similar to DISK_REPAIR_TIME, but applies to a whole failgroup. In Exadata, all disks of a storage server can belong to one failgroup (so that a mirror copy of an extent is never written to a disk on the same storage server), and this attribute is quite handy when a complete storage server is taken down for maintenance or for some other reason.
The following shows how to set the values of the diskgroup attributes explained above.
SQL> col name format a30
SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               3.6h
failgroup_repair_time          24.0h

SQL> alter diskgroup data set attribute 'disk_repair_time'='1h';
Diskgroup altered.

SQL> alter diskgroup data set attribute 'failgroup_repair_time'='10h';
Diskgroup altered.

SQL> select name,value from v$asm_attribute where group_number=3 and name like '%repair_time%';

NAME                           VALUE
------------------------------ --------------------
disk_repair_time               1h
failgroup_repair_time          10h
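The queries above filter on a hard-coded group_number. As a sketch of an alternative, the attributes can be joined to v$asm_diskgroup so that each value is shown next to its diskgroup name; the column aliases are my own choice.

```sql
-- Show the repair-time attributes for every mounted diskgroup by name,
-- instead of looking up the group_number first.
select g.name  as diskgroup,
       a.name  as attribute,
       a.value as value
from   v$asm_attribute a
join   v$asm_diskgroup g
       on g.group_number = a.group_number
where  a.name like '%repair_time%'
order  by g.name, a.name;
```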
ORA-15042
If a disk is offline or missing from an ASM diskgroup, ASM may not mount the diskgroup automatically during an instance restart. In this case, we might need to mount the diskgroup manually with the FORCE option.
SQL> alter diskgroup data mount;
alter diskgroup data mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "3" is missing from group number "2"

SQL> alter diskgroup data mount force;
Diskgroup altered.
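After a forced mount, it is worth confirming that the diskgroup is really mounted and how many of its disks are still offline. The following is a minimal sketch using v$asm_diskgroup; it assumes an 11g or later instance, where the OFFLINE_DISKS column is available.

```sql
-- Check the mount state and the number of offline disks per diskgroup.
select name,
       state,
       type,
       offline_disks
from   v$asm_diskgroup;
```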
Monitoring the REPAIR_TIMER
After a disk goes offline, the clock starts ticking, and the value of REPAIR_TIMER can be monitored to see how much time remains to make the disk available again before it is dropped automatically.
SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          649 OFFLINE MISSING

--We can confirm that no rebalance has started yet by using the following query
SQL> select * from v$asm_operation;
no rows selected
If we are able to make this disk available again, or replace it, before DISK_REPAIR_TIME lapses, we can bring the disk back online. Please note that we need to bring it ONLINE manually.
SQL> alter diskgroup data online disk data4;
Diskgroup altered.

SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4               NORMAL   UNKNOWN          465 SYNCING CACHED

--Syncing is in progress, and hence no rebalance would occur.
SQL> select * from v$asm_operation;
no rows selected

--After some time, everything would become normal.
SQL> select name,path,state,header_status,REPAIR_TIMER,mode_status,mount_status from v$asm_disk;

NAME     PATH       STATE    HEADER_ REPAIR_TIMER MODE_ST MOUNT_S
-------- ---------- -------- ------- ------------ ------- -------
DATA1    ORCL:DATA1 NORMAL   MEMBER             0 ONLINE  CACHED
DATA2    ORCL:DATA2 NORMAL   MEMBER             0 ONLINE  CACHED
DATA3    ORCL:DATA3 NORMAL   MEMBER             0 ONLINE  CACHED
DATA4    ORCL:DATA4 NORMAL   MEMBER             0 ONLINE  CACHED
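When more than one disk has gone offline, for example after a storage server reboot, naming each disk can be tedious. As a sketch, ALTER DISKGROUP also accepts ONLINE ALL and ONLINE DISKS IN FAILGROUP; the failgroup name fg1 below is hypothetical and has to be replaced with a real failgroup of the diskgroup.

```sql
-- Bring every offline disk of the diskgroup back online in one statement.
alter diskgroup data online all;

-- Or bring back only the disks of one failgroup.
-- fg1 is a placeholder name, not taken from the test system above.
alter diskgroup data online disks in failgroup fg1;
```

Only one of the two statements is needed; which one fits depends on whether the outage affected a single failgroup or disks spread across the diskgroup.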
If the same disk cannot be made available or replaced, either ASM drops the disk automatically after DISK_REPAIR_TIME has lapsed, or we drop the ASM disk manually. A rebalance occurs after the disk is dropped.
Since the disk status is OFFLINE, we need to use the FORCE option to drop the disk. After the disk is dropped, a rebalance starts and can be monitored from the V$ASM_OPERATION view.
SQL> alter diskgroup data drop disk data4;
alter diskgroup data drop disk data4
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15084: ASM disk "DATA4" is offline and cannot be dropped.

SQL> alter diskgroup data drop disk data4 force;
Diskgroup altered.

SQL> select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE DONE           9         42         42
           2 REBAL COMPACT   RUN            9          1          0
Later we can replace the faulty disk and add the new disk back into this diskgroup. Adding the disk back initiates a rebalance once again.
SQL> alter diskgroup data add disk 'ORCL:DATA4';
Diskgroup altered.

SQL> select group_number,operation,pass,state,power,sofar,est_work from v$asm_operation;

GROUP_NUMBER OPERA PASS      STATE      POWER      SOFAR   EST_WORK
------------ ----- --------- ----- ---------- ---------- ----------
           2 REBAL RESYNC    DONE           9          0          0
           2 REBAL REBALANCE RUN            9         37       2787
           2 REBAL COMPACT   WAIT           9          1          0
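While the rebalance runs, V$ASM_OPERATION also reports an estimated rate and remaining time, and the rebalance power can be changed on the fly. The following is a sketch only; the power value 8 is an arbitrary example and should be chosen according to how much I/O the system can spare.

```sql
-- Follow the progress of the rebalance, including the estimated
-- rate and the estimated minutes remaining.
select group_number,
       operation,
       state,
       power,
       sofar,
       est_work,
       est_rate,
       est_minutes
from   v$asm_operation;

-- Change the power of the rebalance for the diskgroup.
-- 8 is an example value, not a recommendation.
alter diskgroup data rebalance power 8;
```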