5. If you know which file system is causing the problem, start that file system; otherwise, start all file systems, but wait until each file system goes to the started state before starting the next file system.
If all file systems start normally, you can skip the remaining steps. If the MDS service crashes again after you start a file system, continue with the remaining steps.
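For example, assuming a file system named data (a placeholder name) and assuming that the start filesystem command takes the same form as the stop filesystem filesystem_name command used in the next step, you would enter the following on the administration server and then wait for the file system to go to the started state:

    start filesystem data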
6. If the MDS server crashes again, stop all file systems (you can enter the stop filesystem filesystem_name command on the administration server while you wait for the MDS server to reboot). The stop filesystem filesystem_name command may appear to fail; however, because the MDS service will not start when the MDS server is rebooted, the file system is effectively stopped.
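For example, for a file system named data (a placeholder name), you would enter the following on the administration server:

    stop filesystem data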
7. Unmount all Lustre file systems on every client node. If you cannot unmount a file system on a client node, reboot the client node.
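For example, assuming a file system mounted at /mnt/data on the client (a placeholder mount point), either of the following commands, run as root on the client node, unmounts it; the second form unmounts every mounted Lustre file system on the node:

    umount /mnt/data
    umount -a -t lustre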
8. Check that the /proc/fs/lustre file does not exist on any client node. If the file exists on a client node, reboot the client node.
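For example, the following one-line check, run on a client node, reports whether the entry is still present (any equivalent check, such as ls /proc/fs/lustre, serves the same purpose):

    test -e /proc/fs/lustre && echo "/proc/fs/lustre still present; reboot this client"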
9. Attempt to mount the file system on one client node, as shown in the example after this step. The mount operation may hang or fail with an Input/output error message.
• If the mount operation hangs, do not abort the operation; wait for ten minutes.
• If the mount operation fails, repeat the mount attempt every few minutes.
If the mount operation succeeds, the MDS service will go to the running state within ten minutes. If the MDS server crashes again, or the MDS service fails to go to the running state after ten minutes or after repeated attempts to mount the file system, contact your HP Customer Support representative for more information.
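For example, assuming the client normally mounts the file system at /mnt/data through an existing entry in its /etc/fstab (both the mount point and the fstab entry are assumptions about your client configuration), you would enter the following on the client node:

    mount /mnt/data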
When you have finished this procedure, if the mount operation succeeded and the MDS service is in the running state, enable the administration server using the enable server server_name command.
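For example, assuming the server to be enabled is named south1 (a placeholder name), you would enter:

    enable server south1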
9.40 Rebuilding logical drives after disk failures
If you see messages similar to the following in the EVL logs for a server attached to SFS20 storage, it is
possible that a logical drive on an SFS20 array has failed:
cciss(105,48): cmd 43d86420 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d86c78 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d874d0 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d87d28 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d88580 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d88dd8 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d89630 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d89e88 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
For such a failure to occur, at least three disks (if ADG redundancy is used) or two disks (if RAID5 redundancy is used) in the logical drive must have failed at the same time. (ADG redundancy can survive the simultaneous failure of two disks in a logical drive; RAID5 redundancy can survive the failure of only one.) Such a failure is catastrophic, and it is highly unlikely that you will be able to access the original data stored on the LUN associated with the logical drive.
If a logical drive fails as described here, use the show array array_number command to examine the status of the individual disks on the array. (For information on using the show array array_number command, see Section 4.5.) Replace the failed disks (see Section 8.1 for procedures for replacing hardware components).
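For example, to examine the status of the disks in array 7 (a placeholder array number), you would enter:

    show array 7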