HP StorageWorks Scalable File Share User Manual

Managing email alerts

lustre_bug

Purpose: Alerts you when a fault occurs in the Lustre software.

Email alert filter: facility=kern && data contains "LustreError" && data contains "LBUG"

Action required: The server where the fault occurred normally reboots automatically. If this does not happen, reboot the server. See also Section 9.39 for information on handling LBUG errors on the MDS node.

ost_out_of_space

Purpose: Alerts you when the percentage of used space on an OST service exceeds the value of the ost_critical_size attribute (see Section 5.11 for more information). The alert has a throttle period of 86400 set by default.

Email alert filter: facility=storage && data contains "OST out of space critical condition"

Action required: Delete files from the file system to prevent this problem.

raid_degraded

Purpose: Alerts you when a LUN that is a component of a mirrored LUN fails.

Email alert filter: facility=kern && data contains "raid degraded"

Action required: See Section 9.33.3 (SFS20 arrays) or Section 9.35 (EVA4000 arrays).

restart_fs

Purpose: Alerts you to situations where a system parameter or other condition has changed. It is normal for this alert to be triggered while you are following the procedures described in Chapter 7 to change system parameters. This alert can also be triggered if a hardware change occurs that changes the file system configuration.

Email alert filter: facility=lustre && data contains "Please restart filesystem"

Action required: Review Chapter 7 and Chapter 8 to verify that you have followed the correct procedure to change a system parameter or hardware component. See also Section 4.1.1. Stop and then restart all file systems. Client nodes must remount all file systems.

server_down

Purpose: Alerts you when a server is not in the running state.

Email alert filter: facility=server && data contains "Down"

Action required: If the server has crashed, it will normally reboot automatically. If this does not happen, reboot the server.

service_lun

Purpose: Alerts you when a service LUN is not available.

Email alert filter: data contains "IO error reading quorum partition"

Action required: The most common cause of this failure is that the array where the LUN is located is hung. Power cycle the array where the LUN is located and then reboot the servers attached to the array.

sm_down

Purpose: Alerts you when communication within the HP SFS system is slowing down. This problem occurs occasionally when the system is under heavy load.

Email alert filter: data contains "SM non-responsive"

Action required: If this problem occurs infrequently, it can be ignored. However, if the problem occurs frequently, it may indicate a problem with the SFS20 array where the service LUN for the specified server is located. Use the show array and the show array array_number commands to check the SFS20 disks for errors.

Table 6-4 Default email alerts

Columns: Email Alert Name, Purpose, Email Alert Filter, Action Required
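
The filters in Table 6-4 all follow the same pattern: an optional facility test combined, using &&, with one or more "data contains" tests. The following Python sketch illustrates how such a filter could be evaluated against an event. It is not the HP SFS implementation. The AlertFilter class, the matching_alerts function, and the sample event text are hypothetical names introduced only for illustration. The sketch also assumes that matching is a case-sensitive substring test and that filter strings wrapped across lines in the table are joined with single spaces.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AlertFilter:
    """Hypothetical model of one row of Table 6-4 (not an HP SFS API)."""
    name: str
    facility: Optional[str]        # None: the filter does not test the facility
    substrings: Tuple[str, ...]    # every substring must appear in the event data

    def matches(self, facility: str, data: str) -> bool:
        # All conditions are joined with &&, so every test must succeed.
        if self.facility is not None and facility != self.facility:
            return False
        # Assumption: "data contains" is a case-sensitive substring test.
        return all(s in data for s in self.substrings)


# Filter strings copied from Table 6-4; wrapped lines joined with one space.
DEFAULT_FILTERS = [
    AlertFilter("lustre_bug", "kern", ("LustreError", "LBUG")),
    AlertFilter("ost_out_of_space", "storage",
                ("OST out of space critical condition",)),
    AlertFilter("raid_degraded", "kern", ("raid degraded",)),
    AlertFilter("restart_fs", "lustre", ("Please restart filesystem",)),
    AlertFilter("server_down", "server", ("Down",)),
    AlertFilter("service_lun", None, ("IO error reading quorum partition",)),
    AlertFilter("sm_down", None, ("SM non-responsive",)),
]


def matching_alerts(facility: str, data: str) -> List[str]:
    """Return the names of all default alerts whose filter matches the event."""
    return [f.name for f in DEFAULT_FILTERS if f.matches(facility, data)]


if __name__ == "__main__":
    # Hypothetical event text, for illustration only.
    print(matching_alerts("kern", "LustreError: 1234 LBUG"))
```

Because service_lun and sm_down have no facility test in the table, the sketch lets them match an event from any facility as long as the data string is present.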