This bsub command requests four cores (from the -n4 option of the bsub command) spread across four nodes (from the -ext "SLURM[nodes=4]" option); the job is launched on those cores. The following script, myscript, runs the job:
#!/bin/sh
hostname                    # runs locally on the node where the job is dispatched
srun hostname               # runs hostname on every node in the SLURM allocation
mpirun -srun ./hellompi     # launches the MPI program hellompi across the allocation
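For reference, a submission matching this description might look like the following sketch; only the -n4 and -ext "SLURM[nodes=4]" options come from the description above, and the output file option is an assumption:

bsub -n4 -ext "SLURM[nodes=4]" -o myscript.out ./myscript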
3. LSF-HPC schedules the job and monitors the state of the resources (compute nodes) in the SLURM lsf partition. When the LSF-HPC scheduler determines that the required resources are available, LSF-HPC allocates those resources in SLURM and obtains a SLURM job identifier (jobID) that corresponds to the allocation.
In this example, four cores spread over four nodes (n1, n2, n3, and n4) are allocated for myscript, and the SLURM job ID of 53 is assigned to the allocation.
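While the job runs, the underlying SLURM allocation can be examined with standard SLURM commands using the SLURM job ID (53 in this example); these commands are an illustrative sketch, and their exact output depends on your SLURM configuration:

squeue -j 53            # show the state and node list of SLURM job 53
scontrol show job 53    # show the full details of the allocation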
4. LSF-HPC prepares the user environment for the job on the LSF execution host node and dispatches the job with the job_starter.sh script. This user environment includes standard LSF environment variables and two SLURM-specific environment variables: SLURM_JOBID and SLURM_NPROCS.
SLURM_JOBID is the SLURM job ID of the job. Note that this is not the same as the LSF-HPC jobID. "Translating SLURM and LSF-HPC JOBIDs" describes the relationship between the SLURM_JOBID and the LSF-HPC JOBID.
SLURM_NPROCS is the number of processes allocated.
These environment variables are intended for use by the user's job, whether explicitly (user scripts may use these variables as necessary) or implicitly (the srun commands in the user's job use these variables to determine the job's allocation of resources).
In this example, the value of SLURM_NPROCS is 4 and the value of SLURM_JOBID is 53.
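For example, a user script can read these variables directly; a line such as the following (illustrative) could be added to myscript to record the allocation it is running under:

echo "SLURM job ${SLURM_JOBID} with ${SLURM_NPROCS} processes"   # prints: SLURM job 53 with 4 processes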
5. The user job myscript begins execution on compute node n1.
The first line in myscript is the hostname command. It executes locally and returns the name of the node, n1.
6. The second line in the myscript script is the srun hostname command. The srun command in myscript inherits SLURM_JOBID and SLURM_NPROCS from the environment and executes the hostname command on each compute node in the allocation.
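The difference between the two forms can be summarized as follows; the annotated output is what this example produces, assuming the allocation of nodes n1 through n4:

hostname          # no srun: runs only on the local node and prints n1
srun hostname     # srun picks up the inherited SLURM_JOBID and SLURM_NPROCS
                  # and prints one hostname per allocated node: n1, n2, n3, n4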
7. The output of the hostname tasks (n1, n2, n3, and n4) is aggregated back to the srun launch command (shown as dashed lines in the figure) and is ultimately returned to the srun command in the job starter script, where it is collected by LSF-HPC.
The last line in myscript is the mpirun -srun ./hellompi command. The srun command inside the mpirun command inherits the SLURM_JOBID and SLURM_NPROCS environment variables from the environment and executes hellompi on each of the allocated compute nodes, n1, n2, n3, and n4.
The output of the hellompi tasks is aggregated back to the srun launch command, where it is collected by LSF-HPC.
When the job finishes, LSF-HPC cancels the SLURM allocation, which frees the compute nodes for use by
another job.
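After the job completes, you can confirm that the resources were released; these checks are illustrative, and note that bjobs takes the LSF-HPC job ID, not the SLURM job ID 53:

squeue        # SLURM job 53 is no longer listed
bjobs         # LSF-HPC reports the job as DONE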
Notes About Using LSF-HPC in the HP XC Environment
This section provides additional information to note about using LSF-HPC in the HP XC environment.
Job Startup and Job Control
When LSF-HPC starts a SLURM job, it sets SLURM_JOBID to associate the job with the SLURM allocation.
While a job is running, all operating-system-enforced resource limits supported by LSF-HPC are enforced, including the core limit, CPU time limit, data limit, file size limit, memory limit, and stack limit. If the user kills a job, LSF-HPC propagates the signal to the entire job, including the job file running on the local node and all tasks running on remote nodes.
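For example, a job can be signaled or killed through LSF-HPC and the signal reaches all of its remote tasks; the job ID below is hypothetical:

bkill 1234            # terminate LSF-HPC job 1234 and all of its tasks
bkill -s TERM 1234    # send SIGTERM to the job, including tasks on remote nodes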