Accusys ExaRAID GUI User Manual

Troubleshooting

2. Hard disks are offline unexpectedly

The RAID controller takes a hard disk offline when the hard disk still
fails to respond after the full-cycle error recovery procedure has
completed. This can happen when the hard disk is permanently dead because
of an internal component failure, in which case all data on that hard disk
is lost. A hard disk is also taken offline when its reserved space for
meta-data (RAID configurations) cannot be written, which means the
reserved space for bad block reallocation in the hard disk is full. This
is unlikely to happen because two copies of the meta-data are kept, and a
hard disk is taken offline only when neither copy can be accessed.
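
The two offline conditions described above can be summarized in a short
Python sketch. This is only an illustration of the stated logic, not the
controller's actual firmware; the DiskState type and its field names are
hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DiskState:
    """Hypothetical snapshot of one disk; field names are illustrative only."""
    responds_after_recovery: bool        # answered after full-cycle error recovery
    metadata_copy_ok: Tuple[bool, bool]  # accessibility of the two meta-data copies


def should_take_offline(disk: DiskState) -> bool:
    """Return True when either offline condition described in the text holds."""
    unresponsive = not disk.responds_after_recovery
    # One inaccessible meta-data copy is tolerated; both must be inaccessible.
    metadata_lost = not any(disk.metadata_copy_ok)
    return unresponsive or metadata_lost
```

A disk with one inaccessible meta-data copy stays online in this sketch;
it goes offline only when both copies are inaccessible or the disk stops
responding.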

An offline hard disk might also be only transiently down because of a
disk controller firmware lockup or mechanical instability. In this case,
the hard disk is still accessible and you may reuse it, but it might fail
again.

3. Verify hard disk health status

To determine whether a hard disk has actually failed, run a SMART check or
DST (Device Self-Test) on the hard disks in question. It is also advisable
to check the number of bad blocks and warning events reported by the RAID
controller (see 2.8.1 Hard disks on page 2-57). Another indicator of a
hard disk's health is its I/O response time, because an
out-of-specification response time can be caused by abnormal error
recovery procedures (see 2.11.1 Hard disks on page 2-81).
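
The indicators above can be combined into a simple triage checklist. The
following Python sketch assumes the figures are copied by hand from the
GUI pages; the DiskReport type, its field names, and all threshold values
are hypothetical and should be replaced with limits from your drive's
specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiskReport:
    """Hypothetical per-disk figures; not an API of the RAID controller."""
    name: str
    bad_blocks: int
    warning_events: int
    avg_response_ms: float
    self_test_passed: bool  # result of the SMART check or DST


# Illustrative limits only (assumptions, not vendor values).
MAX_BAD_BLOCKS = 0
MAX_WARNING_EVENTS = 0
MAX_RESPONSE_MS = 100.0


def health_concerns(report: DiskReport) -> List[str]:
    """List every indicator that exceeds its illustrative limit."""
    concerns = []
    if not report.self_test_passed:
        concerns.append("self-test failed")
    if report.bad_blocks > MAX_BAD_BLOCKS:
        concerns.append(f"{report.bad_blocks} bad blocks")
    if report.warning_events > MAX_WARNING_EVENTS:
        concerns.append(f"{report.warning_events} warning events")
    if report.avg_response_ms > MAX_RESPONSE_MS:
        concerns.append("response time out of specification")
    return concerns
```

An empty list means none of the indicators exceeded its limit; it does not
prove that the hard disk is healthy.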

4. Adjust hard disk settings

Adjusting the hard disk-related settings can help accommodate abnormal
hard disk behavior (see 2.8.1 Hard disks on page 2-57). The following are
some common workarounds:

• Extend Disk I/O Timeout to accommodate slow disk operation

• Increase Disk I/O Retry Time to try I/O more times before giving up

• Reduce Transfer Speed to mitigate bad signal quality of disks

• Disable I/O Queue to avoid problematic NCQ support of disks

• Disable Disk Standby Mode to avoid problematic sleep support of disks

• Extend Disk Access Delay Time to allow longer time for disk spin-up
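
The workaround list above can be kept as a small lookup table for quick
reference during troubleshooting. In this Python sketch the setting names
are taken from the list, while the symptom labels and the function name
are hypothetical; the settings themselves are still changed in the GUI.

```python
from typing import Dict, Iterable, List

# Setting names come from the list above; the symptom keys are
# hypothetical labels chosen for this sketch.
WORKAROUNDS: Dict[str, str] = {
    "slow disk operation": "Extend Disk I/O Timeout",
    "I/O abandoned too early": "Increase Disk I/O Retry Time",
    "bad signal quality": "Reduce Transfer Speed",
    "problematic NCQ support": "Disable I/O Queue",
    "problematic sleep support": "Disable Disk Standby Mode",
    "slow disk spin-up": "Extend Disk Access Delay Time",
}


def suggest_workarounds(symptoms: Iterable[str]) -> List[str]:
    """Return the workaround for each recognized symptom, in input order."""
    return [WORKAROUNDS[s] for s in symptoms if s in WORKAROUNDS]
```

Unrecognized symptoms are skipped silently, so an empty result means only
that none of the listed workarounds matched.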

5. Check failure of multiple hard disks

If multiple hard disks are taken offline at the same time, the cause could
be a system-level problem. For example, if the hard disks in a JBOD system
go offline unexpectedly, poor cabling in the SAS or FC expansion chain may
be the cause. In addition, poor ventilation, an unstable power supply, or
hardware quality issues could also take multiple hard disks offline. In
case there are multiple failed hard disks