
HP StorageWorks Scalable File Share User Manual


9.38 The MDS service fails with an ASSERTION(ino == inode->i_ino) message

In rare circumstances, the MDS service encounters a bug (caused by a client node) during the recovery process. This causes the server where the MDS service is running (normally the MDS server) to crash with an LBUG error. When this happens, events similar to the following are displayed in the event log:

LustreError: 11691:0:(mds_open.c:1013:mds_open()) uuid 92ea84f7-3a40-ac0a-74b5-4e0be3d3e3a3

LustreError: 11691:0:(mds_open.c:1016:mds_open()) ASSERTION(ino == inode->i_ino) failed

The value shown at the end of the first of the two events (in this example, 92ea84f7-3a40-ac0a-74b5-4e0be3d3e3a3) is the UUID of the client node that caused the problem. To identify that client node, enter a command similar to the following on each client node:

# lctl device_list | grep 92ea84f7

When you have identified the client node that caused the problem, reboot the client node. After the client node has rebooted, the MDS service can be successfully restarted.
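The lookup described in this section could be scripted along the following lines. This is a minimal sketch, not part of the HP SFS software: the function names extract_client_uuid and client_has_uuid are hypothetical, and the sketch assumes that the uuid event appears on a single line of text. Only the lctl device_list command is taken from this guide.

```shell
#!/bin/sh
# Hypothetical helper: read event log text on standard input and print the
# client UUID from the first "mds_open()) uuid <UUID>" event found.
# Assumes the event is on one line, as in the example in this section.
extract_client_uuid() {
    sed -n 's/.*mds_open()) uuid \([0-9a-f-]\{36\}\).*/\1/p' | head -n 1
}

# Hypothetical helper: run on a client node. Succeeds (exit status 0) if the
# output of "lctl device_list" mentions the UUID. Only the first eight
# characters are matched, as in the grep example in this section.
client_has_uuid() {
    prefix=$(printf '%s' "$1" | cut -c1-8)
    lctl device_list | grep -q "$prefix"
}
```

For example, if the two events have been copied into a file (events.txt is a placeholder name), extract_client_uuid < events.txt prints the UUID, and client_has_uuid with that value as its argument can then be run on each client node in turn. The node on which it succeeds is the one to reboot.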

9.39 The MDS service repeatedly crashes with an LBUG error

In rare circumstances, the MDS service encounters a bug during the recovery process. This causes the server where the MDS service is running (normally the MDS server) to crash with an LBUG error. However, when the MDS service attempts to fail over to the peer server (usually the administration server), the peer server also crashes. In the meantime, the first server reboots, but may crash again when the MDS service fails back to the server. This cycle can continue indefinitely.

If this happens, perform the following steps:

1. Disable the administration server; this stops the administration server from repeatedly crashing. For example, to disable the south1 server, enter the following command on any server in the HP SFS system. You must specify the force=yes option with the command, because the server may be rebooting when you attempt to execute the command:

sfs> disable server south1 force=yes

The administration server may crash one more time before this command takes effect, but will then reboot and become stable.

2. Examine the event log (using the sfsmgr show log command) to determine the cause of the LBUG error. An event with LBUG in the text describes the point at which the service failed. Just before this event, there will probably be a LustreError message reporting an assert.

An LBUG error can be caused by a number of problems; search this guide and the HP StorageWorks Scalable File Share Release Notes to see if the problem that caused the LBUG error on your system is a known problem and if further information is provided on dealing with the problem. In particular, see Section 9.38 of this guide, which deals with one specific known problem that can cause an LBUG error.

At this point, the MDS service may recover normally on the MDS server, and no further action may be needed. However, it is possible that the MDS server may continue to crash with the same LBUG error. If this happens, continue with the remaining steps in this section; do not enable the administration server until you have completed these steps.

3. Stop all file systems.

4. Reboot the MDS server.
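For the event log examination in step 2, a filter like the one below may help to pick out the relevant events. It is a minimal sketch under stated assumptions: the function name lbug_with_preceding_error is hypothetical, and the sketch assumes that the event log text has been saved to a file with one event per line. It prints each event that contains LBUG, preceded by the most recent LustreError event seen before it.

```shell
#!/bin/sh
# Hypothetical helper: read event log text from the named files (or from
# standard input if no file is given). For every line containing "LBUG",
# print the most recent earlier LustreError line, then the LBUG line itself.
lbug_with_preceding_error() {
    awk '
        /LBUG/        { if (prev != "") print prev; print; prev = ""; next }
        /LustreError/ { prev = $0 }
    ' "$@"
}
```

For example, lbug_with_preceding_error events.txt (where events.txt is a placeholder file name) prints the failure point and the assert message most likely associated with it, which can then be compared with the known problems described in this guide and in the release notes.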