Fault Tolerance and Security – HP XC System 3.x Software User Manual


# chmod a+r /hptc_cluster/slurm/job/jobacct.log

You can find detailed information on the sacct command and job accounting data in the sacct(1) manpage.
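As an illustration of what the chmod a+r command above does, the following minimal Python sketch adds read permission for the owner, the group, and all other users to a file while leaving its other permission bits unchanged. The function name add_read_for_all is hypothetical, and the demonstration uses a temporary file rather than the actual job accounting log.

```python
import os
import stat
import tempfile


def add_read_for_all(path: str) -> int:
    """Mimic `chmod a+r path`: add read permission for user, group, and others.

    Returns the resulting permission bits. The name is illustrative only.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    new_mode = current | stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    os.chmod(path, new_mode)
    return new_mode


if __name__ == "__main__":
    # Demonstrate on a throwaway file, not on the real accounting log.
    fd, demo_path = tempfile.mkstemp()
    os.close(fd)
    try:
        os.chmod(demo_path, 0o600)  # start as owner read/write only
        print(oct(add_read_for_all(demo_path)))
    finally:
        os.remove(demo_path)
```

Because the existing bits are combined with the read bits rather than overwritten, write and execute permissions stay as they were.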

Fault Tolerance

SLURM can handle a variety of failure modes without terminating workloads, including crashes of the node
running the SLURM controller. User jobs may be configured to continue execution despite the failure of one
or more nodes on which they are executing. The command controlling a job may detach and reattach from
the parallel tasks at any time. A node allocated to a job is available for reuse as soon as the job(s) allocated
to that node terminate. If some nodes fail to complete job termination in a timely fashion because of hardware
or software problems, only the scheduling of those tardy nodes will be affected.

Security

SLURM has a simple security model:

Any user of the system can submit jobs to execute. Any user can cancel his or her own jobs. Any user
can view SLURM configuration and state information.

Only privileged users can modify the SLURM configuration, cancel other users' jobs, or perform other
restricted activities. Privileged users in SLURM include root users and SlurmUser (as defined in the
SLURM configuration file).
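The two rules above can be sketched as a small authorization check. This is a simplified model written for illustration, not SLURM's implementation or API: the Job class, the action names, and the assumed SlurmUser value "slurm" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the account named by SlurmUser in the configuration file.
SLURM_USER = "slurm"

# Actions that the text says any user may perform.
UNRESTRICTED_ACTIONS = {"submit", "view"}


@dataclass(frozen=True)
class Job:
    job_id: int
    owner: str


def is_privileged(user: str, slurm_user: str = SLURM_USER) -> bool:
    """Privileged users are root and the configured SlurmUser."""
    return user == "root" or user == slurm_user


def may_perform(user: str, action: str, job: Optional[Job] = None) -> bool:
    """Return True if `user` may perform `action` under the model above."""
    if action in UNRESTRICTED_ACTIONS:
        return True
    if action == "cancel":
        if job is None:
            raise ValueError("cancel requires a job")
        # Users may cancel their own jobs; privileged users may cancel any job.
        return user == job.owner or is_privileged(user)
    # Configuration changes and any other restricted activity.
    return is_privileged(user)


if __name__ == "__main__":
    job = Job(job_id=101, owner="alice")
    print(may_perform("alice", "cancel", job))
    print(may_perform("bob", "cancel", job))
    print(may_perform("root", "modify_config"))
```

Treating every unlisted action as restricted keeps the sketch conservative: an action is allowed for ordinary users only if the text explicitly grants it.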

If other users need permission to modify the SLURM configuration, set-uid programs may be used to grant
specific permissions to specific users.

SLURM enforces security through communication authentication, job authentication, and user
authorization.

Refer to the SLURM documentation for further information about SLURM security features.

