HP XC System 2.x Software User Manual
The squeue command can report on jobs in the job queue according to their state; valid states are pending, running, completing, completed, failed, timeout, and node_fail. Example 6-3 uses the squeue command to report on failed jobs.
Example 6-3: Reporting on Failed Jobs in the Queue
$ squeue --state=FAILED
JOBID PARTITION     NAME     USER  ST   TIME  NODES NODELIST
   59      amt1 hostname     root   F   0:00      0
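The listing in Example 6-3 can be fed to other tools. The following is a minimal sketch, not part of the manual, that pulls the JOBID column out of squeue-style output. The helper name job_ids is hypothetical, and the sketch assumes the column layout shown in Example 6-3 (one header line, job ID in the first column).

```shell
# job_ids is a hypothetical helper, not a SLURM command.
# It reads squeue-style output on stdin, skips the header line,
# and prints the first column (JOBID) of every non-empty row.
job_ids() {
    awk 'NR > 1 && NF > 0 { print $1 }'
}

# Sample input copied from Example 6-3. On a live system one would
# pipe the real command instead:  squeue --state=FAILED | job_ids
job_ids <<'EOF'
JOBID PARTITION     NAME     USER  ST   TIME  NODES NODELIST
   59      amt1 hostname     root   F   0:00      0
EOF
```

Reading from standard input keeps the helper independent of how the listing was produced, so the same function works for any state passed to --state.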
6.6 Killing Jobs with the scancel Command
The scancel command cancels a pending or running job or job step. It can also be used to send a specified signal to all processes on all nodes associated with a job. Only job owners or administrators can cancel jobs.
Example 6-4 kills job 415 and all of its job steps.
Example 6-4: Killing a Job by Its JobID
$ scancel 415
Example 6-5 cancels all pending jobs.
Example 6-5: Cancelling All Pending Jobs
$ scancel --state=PENDING
Example 6-6 sends the TERM signal to terminate job steps 421.2 and 421.3.
Example 6-6: Sending a Signal to a Job
$ scancel --signal=TERM 421.2 421.3
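Because scancel acts immediately, it can help to preview the command before running it. The following is a minimal sketch under stated assumptions: signal_steps and the DRY_RUN variable are hypothetical names introduced here, not SLURM features, and only the --signal option shown in Example 6-6 is used.

```shell
# signal_steps is a hypothetical wrapper, not part of SLURM.
# Usage: signal_steps SIGNAL JOBID.STEPID...
# By default it only prints the scancel command for each job step;
# set DRY_RUN=0 to actually invoke scancel.
signal_steps() {
    sig=$1
    shift
    for step in "$@"; do
        case $step in
            *.*) ;;  # looks like jobid.stepid, accept it
            *)
                # A bare job ID would address the whole job, so skip it.
                echo "skipping $step: not a jobid.stepid" >&2
                continue
                ;;
        esac
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "scancel --signal=$sig $step"
        else
            scancel --signal="$sig" "$step"
        fi
    done
}

# Preview the commands corresponding to Example 6-6.
signal_steps TERM 421.2 421.3
```

The wrapper refuses arguments without a step suffix so that a mistyped ID cannot signal an entire job by accident; that check is a design choice of this sketch, not a restriction imposed by scancel.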
6.7 Getting System Information with the sinfo Command
The sinfo command reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options. sinfo displays a summary of available partition and node (not job) information, such as partition names, nodes per partition, and CPUs per node.
Example 6-7: Using the sinfo Command (No Options)
$ sinfo
PARTITION AVAIL TIMELIMIT NODES  STATE NODELIST
lsf          up  infinite     1  down* n15
lsf          up  infinite     2   idle n[14,16]
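The default sinfo listing is easy to filter in a script. The following is a minimal sketch, assuming the six-column layout shown in Example 6-7; the helper name nodes_in_state is hypothetical and not part of SLURM. It strips a trailing asterisk from the STATE column so that an entry such as down* still matches down.

```shell
# nodes_in_state is a hypothetical helper, not a SLURM command.
# Usage: sinfo | nodes_in_state STATE
# Reads sinfo-style output on stdin, skips the header line, and prints
# the NODELIST column (6th) of rows whose STATE column (5th) equals
# STATE after removing any trailing "*".
nodes_in_state() {
    awk -v want="$1" 'NR > 1 {
        state = $5
        sub(/\*$/, "", state)
        if (state == want) print $6
    }'
}

# Sample input copied from Example 6-7.
nodes_in_state down <<'EOF'
PARTITION AVAIL TIMELIMIT NODES  STATE NODELIST
lsf          up  infinite     1  down* n15
lsf          up  infinite     2   idle n[14,16]
EOF
```

The node list is printed exactly as sinfo reports it, so a compressed range such as n[14,16] is passed through unexpanded.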