HP StorageWorks Scalable File Share User Manual
Troubleshooting file systems
NOTE:
An MDS service is considered to be a client of an OST service; as a result, the number of
recoverable clients shown in the message should be the total number of client nodes plus one for
the MDS service. However, it is possible for this count to exceed this value; for example, when a
client node is reset or crashes while the Lustre file system is mounted, and then later mounts the
Lustre file system again, Lustre counts the rebooted client mount operation as a completely new
client connection.
2. As explained in Section 3.8, a service remains in the recovering state until one of the following
events occurs:
• A client node attempts to reconnect to the service as part of communications failure processing.
• A client node attempts to mount the Lustre file system.
When one of these events occurs, the recovery timer starts. This timer determines how long the service
waits for client nodes to reconnect before it gives up and completes recovery processing. The
recovery timer is set to 300 seconds by default; this value is 1.5 times the client communications
timeout, which defaults to 200 seconds.
The following message shows the recovery timer starting:
Lustre: 32590:0:(ldlm_lib.c:757:target_start_recovery_timer())
south-ost12: starting recovery timer (300s)
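The relationship between the two timeouts can be checked with a short script. The following Python sketch is illustrative only: the 1.5 factor and the 200-second default are taken from the text above, while the function names, the constant names, and the regular expression are assumptions made for this example and are not part of Lustre or HP SFS. It assumes that the log message has been joined onto a single line.

```python
import re

# Values taken from the text above; the names are hypothetical.
RECOVERY_FACTOR = 1.5
DEFAULT_CLIENT_TIMEOUT_S = 200

# Assumed pattern for the timer-start message shown above.
TIMER_RE = re.compile(r"(\S+): starting recovery timer \((\d+)s\)")


def expected_recovery_timer(client_timeout_s=DEFAULT_CLIENT_TIMEOUT_S):
    """Return the recovery timer (in seconds) implied by a client timeout."""
    return int(client_timeout_s * RECOVERY_FACTOR)


def parse_timer_start(message):
    """Return (service, seconds) from a timer-start message, or None."""
    match = TIMER_RE.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

For example, passing the message shown above to parse_timer_start() yields the service name and the timer value, which can then be compared with expected_recovery_timer() to confirm that the service is using the default setting.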
3. If the client message is a reconnect attempt, a message similar to the following is reported,
indicating how many client nodes have yet to attempt to reconnect:
Lustre: 32593:0:(ldlm_lib.c:1029:target_queue_final_reply()) south-ost12:
4 recoverable clients remain
Alternatively, if the client message is a new connection attempt, a message similar to the following is
reported:
LustreError: 3953:0:(ldlm_lib.c:645:target_handle_connect())
denying connection for new client 4bda07a5-18f2-44c7-815b-2da148c06203:
2 clients in recovery for 110s
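When scanning a large log, it can help to separate these two kinds of message automatically. The following Python sketch shows one possible way to do so. The regular expressions are assumptions based only on the two sample messages above, the function name is hypothetical, and the seconds field is returned exactly as reported, without interpretation.

```python
import re

# Assumed patterns, derived from the sample messages above.
REMAIN_RE = re.compile(
    r"(?P<service>\S+):\s+(?P<clients>\d+) recoverable clients remain"
)
DENY_RE = re.compile(
    r"denying connection for new client (?P<uuid>[0-9a-f-]+):\s+"
    r"(?P<clients>\d+) clients in recovery for (?P<seconds>\d+)s"
)


def classify_client_message(message):
    """Classify a recovery message as a reconnect or a denied new connection.

    Returns a dict describing the message, or None if neither pattern matches.
    """
    match = DENY_RE.search(message)
    if match:
        return {
            "kind": "new_connection",
            "uuid": match.group("uuid"),
            "clients": int(match.group("clients")),
            "seconds": int(match.group("seconds")),
        }
    match = REMAIN_RE.search(message)
    if match:
        return {
            "kind": "reconnect",
            "service": match.group("service"),
            "clients": int(match.group("clients")),
        }
    return None
```

Both patterns allow for the line wrap that appears in the samples above, so the function accepts a message either as one line or as wrapped text.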
4. If some clients are no longer available to reconnect to the service, the recovery timer eventually
expires, and messages similar to the following are reported:
LustreError: 0:0:(ldlm_lib.c:901:target_recovery_expired())
recovery timed out, aborting
LustreError: 4672:0:(ldlm_lib.c:887:target_abort_recovery()) south-ost12:
recovery period over; disconnecting unfinished clients.
LustreError: 4672:0:(genops.c:814:class_disconnect_stale_exports())
south-ost12: disconnecting 1 stale clients
5. When all clients have reconnected, or the recovery timer has expired, the service reports a
message similar to the following if there are any recovered clients:
Lustre: 32608:0:(ldlm_lib.c:784:target_finish_recovery()) south-ost12:
sending delayed replies to recovered clients
There is also one instance of the following message for each recovered client:
Lustre: 32608:0:(ldlm_lib.c:617:target_finish_recovery())
@@@ delayed: req@8124e400 x4108/t0 o400-><?>@<?>:-1 lens 64/64 ref 0 fl
Interpret:/1/0 rc 0/0
6. Any pending incomplete operations that are recorded between the service and the recovered clients
also result in messages similar to the following:
Lustre: 18684:0:(llog_obd.c:135:cat_cancel_cb())
processing log 0x50012:ff95be06 at index 269 of catalog 0x50008
7. Finally, the service reports the following message to indicate that it has completed recovery
and is ready to act as an MDS or OST service:
ost12: running (was recovering on south7)
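The sequence of messages described in the steps above can be summarized by a small script when reviewing a long log file. The following Python sketch is one possible approach, not an HP SFS or Lustre tool. The function name and the summary keys are hypothetical, and the matching relies only on the message fragments quoted in the steps above, with each message assumed to occupy a single line.

```python
import re

# Assumed pattern for the stale-client message shown in step 4.
STALE_RE = re.compile(r"disconnecting (\d+) stale clients")


def summarize_recovery(log_lines):
    """Summarize recovery progress from an iterable of log lines."""
    summary = {
        "timer_started": False,   # step 2: recovery timer started
        "timed_out": False,       # step 4: recovery timer expired
        "stale_clients": 0,       # step 4: clients disconnected as stale
        "replies_sent": False,    # step 5: delayed replies sent
        "completed": False,       # step 7: service is running
    }
    for line in log_lines:
        if "starting recovery timer" in line:
            summary["timer_started"] = True
        elif "recovery timed out, aborting" in line:
            summary["timed_out"] = True
        elif "sending delayed replies to recovered clients" in line:
            summary["replies_sent"] = True
        elif "running (was recovering on" in line:
            summary["completed"] = True
        else:
            match = STALE_RE.search(line)
            if match:
                summary["stale_clients"] += int(match.group(1))
    return summary
```

A summary in which timed_out is set and stale_clients is greater than zero indicates that recovery ended because the timer expired rather than because every client reconnected.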