9.7 Job Accounting, 9.8 Fault Tolerance, 9.9 Security – HP XC System 3.x Software User Manual
Example 9-8 Reporting Reasons for Downed, Drained, and Draining Nodes
$ sinfo -R
REASON              NODELIST
Memory errors       n[0,5]
Not Responding      n8
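The report in Example 9-8 can also be filtered in a script. The following is a minimal sketch, not part of HP XC or SLURM: the function name nodes_with_reason is hypothetical, and it assumes the output layout shown above, where the first line is a header and the last whitespace-separated field on each line is the node list.

```shell
# Hypothetical helper: print the node list of every entry whose reason
# contains the given text (case-insensitive). Reads `sinfo -R` style
# output on standard input.
nodes_with_reason() {
    awk -v pat="$1" '
        NR == 1 { next }                  # skip the header line
        {
            nodelist = $NF                # last field is the node list
            reason = $0
            # strip the node list, leaving the reason text
            sub(/[ \t]+[^ \t]+[ \t]*$/, "", reason)
            if (index(tolower(reason), tolower(pat)) > 0) print nodelist
        }'
}

# Sample input copied from Example 9-8.
sample_output='REASON NODELIST
Memory errors n[0,5]
Not Responding n8'

printf '%s\n' "$sample_output" | nodes_with_reason "memory"   # prints n[0,5]
```

On a live system, you would pipe the real report into the helper instead of the sample text, for example: sinfo -R | nodes_with_reason "memory"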
9.7 Job Accounting
HP XC System Software provides an extension to SLURM for job accounting. The sacct command
displays job accounting data in a variety of forms for your analysis. Job accounting data is stored
in a log file; the sacct command filters that log file to report on your jobs, job steps, status, and
errors. See your system administrator if job accounting is not configured on your system.
By default, only the superuser is allowed to access the job accounting data. To grant all system
users read access to this data, the superuser must change the permissions on the jobacct.log
file, as follows:
# chmod a+r /hptc_cluster/slurm/job/jobacct.log
You can find detailed information on the sacct command and job accounting data in the sacct(1)
manpage.
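Before you run sacct, you may want to confirm that your account can read the accounting log. The following is a minimal sketch: the function name check_jobacct_access is hypothetical, and the default path is the one shown in this section, which may differ on your system.

```shell
# Hypothetical helper: report whether the current user can read the
# job accounting log. The default path is the one given in this section.
check_jobacct_access() {
    log_file="${1:-/hptc_cluster/slurm/job/jobacct.log}"
    if [ ! -e "$log_file" ]; then
        echo "missing: $log_file"
        return 2
    elif [ -r "$log_file" ]; then
        echo "readable: $log_file"
        return 0
    else
        echo "not readable: $log_file"
        return 1
    fi
}

# Check the default location and suggest a next step on failure.
check_jobacct_access || echo "See your system administrator about job accounting access."
```

If the file is reported as missing or not readable, job accounting may not be configured, or the superuser may not have granted read access as described above.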
9.8 Fault Tolerance
SLURM can handle a variety of failure modes without terminating workloads, including crashes
of the node running the SLURM controller. User jobs may be configured to continue execution
despite the failure of one or more nodes on which they are executing. The command controlling
a job may detach from and reattach to the parallel tasks at any time. A node allocated to a job is
available for reuse as soon as the job(s) allocated to that node terminate. If some nodes fail to
complete job termination in a timely fashion because of hardware or software problems, only
the scheduling of those tardy nodes is affected.
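One way to request that a job continue after a node failure is the srun --no-kill option. The following is a minimal sketch, assuming the srun command on your system supports --no-kill (check the srun(1) manpage); the wrapper name launch_no_kill, the DRY_RUN variable, and the program ./my_app are hypothetical and used only for illustration.

```shell
# Hypothetical wrapper: launch a program with srun so that the job is
# not terminated automatically if one of its allocated nodes fails.
# Set DRY_RUN=1 to print the command instead of running it.
launch_no_kill() {
    nodes="$1"
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "srun --no-kill -N $nodes $*"
    else
        srun --no-kill -N "$nodes" "$@"
    fi
}

# Show the command that would be run for a four-node job.
DRY_RUN=1 launch_no_kill 4 ./my_app   # prints: srun --no-kill -N 4 ./my_app
```

The application itself must still be able to tolerate the loss of a node; the option only keeps SLURM from terminating the job.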
9.9 Security
SLURM has a simple security model:
• Any user of the system can submit jobs to execute. Any user can cancel his or her own jobs.
  Any user can view SLURM configuration and state information.
• Only privileged users can modify the SLURM configuration, cancel other users' jobs, or
  perform other restricted activities. Privileged users in SLURM include root users and
  SlurmUser (as defined in the SLURM configuration file).
  If permission to modify SLURM configuration is required by others, set-uid programs may
  be used to grant specific permissions to specific users.
SLURM accomplishes security by means of communication authentication, job authentication,
and user authorization.
For further information about SLURM security features, see the SLURM documentation.