
• The bsub command is used to submit jobs to LSF.
• The bjobs command provides information on batch jobs.
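For example, a simple submission and status query might look like the following sketch; the slot count, job name, and script name are placeholders rather than values from this manual:

$ bsub -n 4 -J myjob ./myjob.sh
$ bjobs

Here bsub requests four slots for the hypothetical script myjob.sh, and bjobs then reports the status of your batch jobs.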
10.2 Overview of LSF-HPC Integrated with SLURM
LSF-HPC was integrated with SLURM for the HP XC system to merge the scalable and efficient resource
management of SLURM with the extensive scheduling capabilities of LSF-HPC. In this integration:
•
SLURM manages the compute resources.
•
LSF-HPC performs the job management.
SLURM extends the parallel capabilities of LSF-HPC with its own fast parallel launcher (which is integrated
with HP-MPI), full parallel I/O and signal support, and parallel job accounting capabilities.
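As a brief illustration of this integration, an MPI program built with HP-MPI can be launched through the SLURM launcher; the executable name below is a placeholder, and the exact options accepted depend on your HP-MPI and SLURM versions:

$ mpirun -srun -n 4 ./hello_world

The -srun option directs HP-MPI's mpirun to hand the parallel launch to SLURM's srun.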
Managing the compute resources of the HP XC system with SLURM means that the LSF-HPC daemons
run only on one HP XC node and can present the HP XC system as a single LSF-HPC host. As a result:
• All the nodes are configured as LSF-HPC Client Hosts; every node is able to access LSF-HPC. You can submit jobs from any node in the HP XC system.
• The lshosts and bhosts commands only list one host that represents all the resources of the HP XC system.
LSF-HPC integrated with SLURM obtains resource information about the HP XC system. This information is consolidated, and key values such as the total number of cores and the maximum memory available across all nodes become the characteristics of the single HP XC “LSF Execution Host”. Additional resource information from SLURM, such as pre-configured node “features”, is noted and processed during scheduling through the external SLURM scheduler for LSF-HPC.
Integrating LSF-HPC with SLURM on HP XC systems provides you with a parallel launch command to distribute and manage parallel tasks efficiently. The SLURM srun command offers considerable flexibility for specifying resource requirements across an HP XC system; for example, you can:
• Request contiguous nodes
• Execute only one task per node
• Request nodes with specific features
This flexibility is preserved in LSF-HPC through the external SLURM scheduler, which is discussed in more detail in “LSF-SLURM External Scheduler” (page 88). The sketch that follows shows srun options corresponding to the requests above.
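The following commands are illustrations only; the task counts, node counts, feature name (dualcore), and executable name are hypothetical, and you should consult the srun(1) manpage for the options supported by your SLURM version:

$ srun --contiguous -n 16 ./a.out      # request contiguous nodes
$ srun -N 4 -n 4 ./a.out               # execute only one task per node (4 tasks on 4 nodes)
$ srun --constraint=dualcore ./a.out   # request nodes with a specific feature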
A SLURM partition named lsf is used to manage LSF-HPC jobs. Thus:
• You can view information about this partition with the sinfo command.
• The total number of cores listed by the lshosts and bhosts commands for that host should be equal to the total number of cores assigned to the SLURM lsf partition.
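For example, you can compare the two views with the following commands; the -p option restricts the sinfo report to the named partition, and the output columns are described in the respective manpages:

$ sinfo -p lsf
$ lshosts
$ bhosts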
When a job is submitted and the resources are available, LSF-HPC creates a properly sized SLURM allocation and adds several standard LSF environment variables to the environment in which the job is to be run. The following two environment variables are also added:
SLURM_JOBID
This environment variable is created so that subsequent srun commands make use of the SLURM allocation created by LSF-HPC for the job. This variable can be used by a job script to query information about the SLURM allocation, as shown here:
$ squeue --jobs $SLURM_JOBID
“Translating SLURM and LSF-HPC JOBIDs” describes the relationship between the SLURM_JOBID and the LSF-HPC JOBID.
SLURM_NPROCS
This environment variable passes along the total number of tasks requested with the bsub -n command to all subsequent srun commands. User scripts can override this value with the srun -n command, but the new value must be less than or equal to the original number of requested tasks.
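The following job script is a minimal, hypothetical sketch of how these variables might be used; it assumes the script is submitted with a command such as bsub -n 8 ./myjob.sh, and my_parallel_app is a placeholder for your own program:

#!/bin/sh
# Report the SLURM allocation that LSF-HPC created for this job.
squeue --jobs $SLURM_JOBID
# Launch the parallel tasks within that allocation; srun picks up
# SLURM_JOBID and SLURM_NPROCS from the environment, so no -n option
# is needed unless you want fewer tasks than were requested with bsub -n.
srun ./my_parallel_app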
The differences described in the HP XC System Software documentation take precedence over descriptions in the LSF documentation from Platform Computing Corporation. See “Differences Between LSF-HPC and LSF-HPC Integrated with SLURM” and the lsf_diff(1) manpage for more information on the subtle differences between LSF-HPC and LSF-HPC integrated with SLURM.