8 job accounting, 9 fault tolerance, 10 security – HP XC System 2.x Software User Manual
Example 6-8: Reporting Reasons for Downed, Drained, and Draining Nodes
$ sinfo -R
REASON            NODELIST
Memory errors     dev[0,5]
Not Responding    dev8
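The two-column report above can be filtered by reason with a small shell helper. This is a minimal sketch, not part of HP XC or SLURM: the function name nodes_for_reason is hypothetical, and it assumes the two-column REASON/NODELIST layout shown in Example 6-8, with the node list as the last whitespace-separated field on each row.

```shell
#!/bin/sh
# nodes_for_reason REASON (hypothetical helper)
# Reads `sinfo -R` style output on stdin and prints the node list of
# every row whose reason matches REASON exactly.
nodes_for_reason() {
    awk -v want="$1" '
        NR == 1 { next }          # skip the REASON/NODELIST header row
        NF >= 2 {
            nodes = $NF           # assumption: last field is the node list
            $NF = ""              # drop it, leaving only the reason words
            sub(/[ \t]+$/, "")    # trim the separator left behind
            if ($0 == want) print nodes
        }
    '
}
```

For the output shown in Example 6-8, piping the report through the helper, as in sinfo -R | nodes_for_reason "Not Responding", would print dev8.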
6.8 Job Accounting
HP XC System Software provides an extension to SLURM for job accounting. The sacct
command displays job accounting data in a variety of forms for your analysis. Job accounting
data is stored in a log file; the sacct command filters that log file to report on your jobs,
jobsteps, status, and errors. See your system administrator if job accounting is not configured
on your system.
You can find detailed information on the sacct command and job accounting data in the
sacct(1) manpage.
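As a minimal sketch of querying accounting data for a single job, the wrapper below checks that sacct is available before calling it. The function name show_job_accounting is hypothetical, and the -j option (select a job by ID) is assumed to be supported by the sacct installed on your system; confirm the option in the sacct(1) manpage before relying on it.

```shell
#!/bin/sh
# show_job_accounting JOBID (hypothetical helper, not part of HP XC or SLURM)
# Prints accounting data for one job, or explains why it cannot.
show_job_accounting() {
    if [ -z "$1" ]; then
        echo "usage: show_job_accounting JOBID" >&2
        return 2
    fi
    # Bail out early if no sacct command can be found.
    if ! command -v sacct >/dev/null 2>&1; then
        echo "sacct not found; see your system administrator about job accounting" >&2
        return 1
    fi
    # Assumption: this sacct accepts -j to select a job by ID.
    sacct -j "$1"
}
```

The early check mirrors the advice above: when job accounting is not configured, the helper reports that instead of failing with an unexplained error.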
6.9 Fault Tolerance
SLURM can handle a variety of failure modes without terminating workloads, including crashes
of the node running the SLURM controller. User jobs may be configured to continue execution
despite the failure of one or more nodes on which they are executing (refer to Section 6.4.5.1 for
further information). The command controlling a job may detach from and reattach to the parallel
tasks at any time. A node allocated to a job is available for reuse as soon as the job(s) allocated
to that node terminate. If some nodes fail to complete job termination in a timely fashion because
of hardware or software problems, only the scheduling of those tardy nodes will be affected.
6.10 Security
SLURM has a simple security model:
• Any user of the system can submit jobs to execute. Any user can cancel his or her own jobs.
  Any user can view SLURM configuration and state information.
• Only privileged users can modify the SLURM configuration, cancel any job, or
  perform other restricted activities. Privileged users in SLURM include root users and
  SlurmUser (as defined in the SLURM configuration file).
  If permission to modify SLURM configuration is required by others, set-uid programs may
  be used to grant specific permissions to specific users.
SLURM accomplishes security by means of communication authentication, job authentication,
and user authorization.
Refer to SLURM documentation for further information about SLURM security features.