9.8 Fault Tolerance, 9.9 Security – HP XC System 3.x Software User Manual
Page 82

# chmod a+r /hptc_cluster/slurm/job/jobacct.log
You can find detailed information on the sacct command and job accounting data in the sacct(1) manpage.
9.8 Fault Tolerance
SLURM can handle a variety of failure modes without terminating workloads, including crashes of the
node running the SLURM controller. User jobs may be configured to continue execution despite the failure
of one or more nodes on which they are executing. The command controlling a job may detach from and
reattach to the parallel tasks at any time. Nodes allocated to a job are available for reuse as soon as the job(s)
allocated to those nodes terminate. If some nodes fail to complete job termination in a timely fashion because
of hardware or software problems, only the scheduling of those tardy nodes will be affected.
9.9 Security
SLURM has a simple security model:
• Any user of the system can submit jobs to execute. Any user can cancel his or her own jobs. Any user
can view SLURM configuration and state information.
• Only privileged users can modify the SLURM configuration, cancel other users' jobs, or perform other
restricted activities. Privileged users in SLURM include root users and SlurmUser (as defined in
the SLURM configuration file).
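The cancel rule in the two items above can be written as a small decision function. This is an illustrative sketch only, not SLURM code: the function name may_cancel and its three arguments are hypothetical, and it compares user names as a simplification, treating the name "root" as the root user.

```shell
#!/bin/sh
# Illustrative model of the cancel rule described above; not part of SLURM.
# Usage: may_cancel REQUESTER JOB_OWNER SLURM_USER
# Returns 0 (allowed) if the requester is root, the configured SlurmUser,
# or the owner of the job; returns 1 (denied) otherwise.
may_cancel() {
    requester=$1
    job_owner=$2
    slurm_user=$3

    # Privileged users (root and SlurmUser) may cancel any job.
    if [ "$requester" = "root" ] || [ "$requester" = "$slurm_user" ]; then
        return 0
    fi

    # Any user may cancel his or her own jobs.
    if [ "$requester" = "$job_owner" ]; then
        return 0
    fi

    # Everyone else is denied.
    return 1
}
```

For example, with a hypothetical SlurmUser named slurm, may_cancel alice alice slurm succeeds, while may_cancel bob alice slurm fails because bob is neither privileged nor the job owner.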
If permission to modify SLURM configuration is required by others, set-uid programs may be used to
grant specific permissions to specific users.
SLURM enforces security through communication authentication, job authentication, and user
authorization.
For further information about SLURM security features, see the SLURM documentation.