
5. If you know which file system is causing the problem, start that file system; otherwise, start all file systems, but wait until each file system goes to the started state before starting the next file system.

If all file systems start normally, you can skip the remaining steps. If the MDS service crashes again after you start a file system, continue with the remaining steps.
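
For example, assuming a start filesystem command that mirrors the stop filesystem syntax quoted in step 6, and a hypothetical file system named data1, you might enter the following on the administration server, confirming the started state before moving on to the next file system (both command forms here are sketches based on that syntax, not verified against your installed CLI):

    start filesystem data1
    show filesystem data1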

6. If the MDS server crashes again, stop all file systems (you can enter the stop filesystem filesystem_name command on the administration server while you wait for the MDS server to reboot). The stop filesystem filesystem_name command may appear to fail; however, because the MDS service will not start when the MDS server is rebooted, the file system is effectively stopped.
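
As a sketch, using the same hypothetical file system name data1, the command quoted above would be entered on the administration server for each file system in turn:

    stop filesystem data1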

7. Unmount all Lustre file systems on every client node. If you cannot unmount a file system on a client node, reboot the client node.
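
For example, assuming the Lustre file system is mounted at /mnt/lustre on the clients (the mount point is an illustrative assumption; substitute your site's actual mount point), run the following as root on each client node:

    # Unmount the Lustre client mount; if this fails or hangs, reboot the client node.
    umount /mnt/lustre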

8. Check that the /proc/fs/lustre file does not exist on any client node. If the file exists on a client node, reboot the client node.
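
A minimal check, run as root on each client node (the warning text is illustrative):

    # Report if the Lustre /proc entry is still present; reboot any client where it is.
    test -e /proc/fs/lustre && echo "/proc/fs/lustre still exists; reboot this client node"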

9. Attempt to mount one client node.

The mount operation may hang or fail with an Input/output error message.

If the mount operation hangs, do not abort the operation; wait for ten minutes.

If the mount operation fails, repeat the mount attempt every few minutes.

If the mount operation succeeds, the MDS service will go to the running state within ten minutes. If the MDS server crashes again, or the MDS service fails to go to the running state after ten minutes or after repeated attempts to mount the file system, contact your HP Customer Support representative for more information.
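
As a sketch, assuming the client's Lustre mount point is defined in /etc/fstab and is /mnt/lustre (both assumptions for illustration), the mount attempt on the chosen client node would be:

    # May hang (wait ten minutes) or fail with an Input/output error (retry every few minutes).
    mount /mnt/lustre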

When you have finished this procedure, if the mount operation succeeded and the MDS service is in the running state, enable the administration server using the enable server server_name command.
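
For example, with a hypothetical administration server name of south1:

    enable server south1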

9.40 Rebuilding logical drives after disk failures

If you see messages similar to the following in the EVL logs for a server attached to SFS20 storage, it is possible that a logical drive on an SFS20 array has failed:

cciss(105,48): cmd 43d86420 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d86c78 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d874d0 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d87d28 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d88580 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d88dd8 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d89630 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4
cciss(105,48): cmd 43d89e88 ctlr_info = 0x482c0000 has CHECK CONDITION, sense key = 0x4

For such a failure to occur, at least three disks (if ADG redundancy is used) or two disks (if RAID5 redundancy is used) in the logical drive must have failed at the same time. Such a failure is catastrophic, and it is highly unlikely that you will be able to access the original data stored on the LUN associated with the logical drive.

If a logical drive fails as described here, use the show array array_number command to examine the status of the individual disks on the array. (For information on using the show array array_number command, see Section 4.5.) Replace the failed disks (see Section 8.1 for procedures for replacing hardware components).
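
For example, to examine the disks in array 3 (a hypothetical array number), enter:

    show array 3

The output shows the status of each disk in the array, so you can identify which disks have failed before replacing them.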