HP StorageWorks Scalable File Share User Manual

Troubleshooting file systems

NOTE:

An MDS service is considered to be a client of an OST service; as a result, the number of recoverable clients shown in the message should be the total number of client nodes plus one for the MDS service. However, it is possible for this count to exceed this value; for example, when a client node is reset or crashes while the Lustre file system is mounted, and then later mounts the Lustre file system again, Lustre counts the rebooted client mount operation as a completely new client connection.
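
The count described in this note can be checked with a small calculation. The following is a minimal sketch that assumes an OST service and a known number of client nodes; the function names are hypothetical and are not part of Lustre or HP SFS.

```python
# Sketch only: hypothetical helper names, not a Lustre or HP SFS interface.

def expected_ost_recoverable_clients(client_nodes):
    """Expected count for an OST service: every client node plus the MDS service."""
    return client_nodes + 1


def extra_client_connections(reported, client_nodes):
    """Return how far the reported count exceeds the expected count.

    A non-zero result may correspond to client nodes that were reset or
    crashed and later mounted the file system again, because each such
    mount is counted as a new client connection.
    """
    return max(0, reported - expected_ost_recoverable_clients(client_nodes))


print(expected_ost_recoverable_clients(4))   # 5
print(extra_client_connections(6, 4))        # 1
```

A result of zero from extra_client_connections means the reported count does not exceed the expected value.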

As explained in Section 3.8, a service remains in the recovering state until one of the following events occurs:

• A client node attempts to reconnect to the service due to communications failure processing.

• A client node attempts to mount the Lustre file system.

When one of these events occurs, the service starts the recovery timer, which determines how long the service waits for client nodes to reconnect to it before giving up and completing recovery processing. The recovery timer is set to 300 seconds by default; this value is 1.5 times the client communications timeout, which defaults to 200 seconds.
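
The relationship between the two default values can be written out as a short calculation. This is a sketch only: the 1.5 factor and the 200-second default are taken from the paragraph above, the function name is hypothetical, and applying the same factor to a non-default timeout is an assumption rather than documented behavior.

```python
# Sketch only: values come from the text above; the helper name is hypothetical.
DEFAULT_CLIENT_TIMEOUT_S = 200   # default client communications timeout (seconds)
RECOVERY_FACTOR = 1.5            # recovery timer is 1.5 times the client timeout


def recovery_timer_seconds(client_timeout_s=DEFAULT_CLIENT_TIMEOUT_S):
    """Return the recovery timer length for a given client communications timeout."""
    return RECOVERY_FACTOR * client_timeout_s


print(recovery_timer_seconds())  # 300.0
```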

The following message shows the recovery timer starting:

Lustre: 32590:0:(ldlm_lib.c:757:target_start_recovery_timer())
south-ost12: starting recovery timer (300s)

If the client message is a reconnect attempt, a message similar to the following is reported, indicating how many client nodes have yet to attempt to reconnect:

Lustre: 32593:0:(ldlm_lib.c:1029:target_queue_final_reply()) south-ost12:
4 recoverable clients remain

Alternatively, if the client message is a new connection attempt, a message similar to the following is reported:

LustreError: 3953:0:(ldlm_lib.c:645:target_handle_connect())
denying connection for new client 4bda07a5-18f2-44c7-815b-2da148c06203:
2 clients in recovery for 110s
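
When scanning a large log, the three message types shown so far can be picked out with regular expressions. The following is a minimal sketch that assumes each message has been joined onto a single line (the examples in this section are wrapped for display); the pattern names and the parse_recovery_line function are hypothetical and are not part of Lustre.

```python
import re

# Sketch only: patterns are derived from the example messages in this section.
PATTERNS = {
    "timer_started": re.compile(r"starting recovery timer \((\d+)s\)"),
    "clients_remaining": re.compile(r"(\d+) recoverable clients remain"),
    "new_client_denied": re.compile(
        r"denying connection for new client (\S+):\s*"
        r"(\d+) clients in recovery for (\d+)s"
    ),
}


def parse_recovery_line(line):
    """Return (event_name, captured_fields) for a recovery message, else None."""
    for event, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return event, match.groups()
    return None


line = ("Lustre: 32593:0:(ldlm_lib.c:1029:target_queue_final_reply()) "
        "south-ost12: 4 recoverable clients remain")
print(parse_recovery_line(line))  # ('clients_remaining', ('4',))
```

The captured fields are returned as strings; for the denied-connection message they are the client identifier, the number of clients in recovery, and the seconds value exactly as printed in the message.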

If some clients are no longer available to reconnect to the service, the recovery timer eventually expires, and messages similar to the following are reported:

LustreError: 0:0:(ldlm_lib.c:901:target_recovery_expired())
recovery timed out, aborting
LustreError: 4672:0:(ldlm_lib.c:887:target_abort_recovery()) south-ost12:
recovery period over; disconnecting unfinished clients.
LustreError: 4672:0:(genops.c:814:class_disconnect_stale_exports())
south-ost12: disconnecting 1 stale clients
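
To find out whether recovery ended because the timer expired, and how many clients were disconnected as a result, a log excerpt can be summarized as follows. This is a minimal sketch that assumes each message is on a single line and matches the wording of the examples above; the function name is hypothetical.

```python
import re

# Sketch only: the matched text is taken from the example messages above.
STALE_RE = re.compile(r"disconnecting (\d+) stale clients")


def recovery_timeout_summary(lines):
    """Return (timed_out, stale_clients) for an iterable of log lines."""
    timed_out = False
    stale_clients = 0
    for line in lines:
        if "recovery timed out, aborting" in line:
            timed_out = True
        match = STALE_RE.search(line)
        if match:
            stale_clients += int(match.group(1))
    return timed_out, stale_clients


log = [
    "LustreError: 0:0:(ldlm_lib.c:901:target_recovery_expired()) "
    "recovery timed out, aborting",
    "LustreError: 4672:0:(genops.c:814:class_disconnect_stale_exports()) "
    "south-ost12: disconnecting 1 stale clients",
]
print(recovery_timeout_summary(log))  # (True, 1)
```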

When all clients have succeeded in reconnecting, or the recovery timer has expired, the service reports a message similar to the following if there are any recovered clients:

Lustre: 32608:0:(ldlm_lib.c:784:target_finish_recovery()) south-ost12:
sending delayed replies to recovered clients

There is also one instance of the following message for each recovered client:

Lustre: 32608:0:(ldlm_lib.c:617:target_finish_recovery())
@@@ delayed: req@8124e400 x4108/t0 o400->@:-1 lens 64/64 ref 0 fl
Interpret:/1/0 rc 0/0

Any pending incomplete operations that are recorded between the service and the recovered clients also result in messages similar to the following:

Lustre: 18684:0:(llog_obd.c:135:cat_cancel_cb())
processing log 0x50012:ff95be06 at index 269 of catalog 0x50008

Finally, the service reports the following messages to indicate that it has completed recovery and is ready to act as an MDS or OST service:

ost12: running (was recovering on south7)