Collecting Data on Multiple Nodes
This section describes the tasks you must perform to collect data on multiple nodes, and includes
an example using HP-LSF, SLURM, and MPI.
Consolidating and Synchronizing Data
If you are collecting performance data from all nodes in your job allocation, you can consolidate
the HPCPI data in one database and in one epoch. By default, each hpcpid daemon starts a new
epoch. To consolidate and synchronize the data, follow these steps:
1. Select and create a directory for the HPCPI database that is shared by all nodes in the cluster.
   You must also have write permission for the directory (the HPCPI daemon uses your user ID when writing data to the database). For additional requirements, see “Requirements for the HPCPI Database Directory” (page 36).
2. Set the HPCPIDB environment variable to the selected database, or specify the database using the -db option in the hpcpid daemon and the HPCPI utilities.
3. On one node (such as the current login node), create a new epoch by entering the following command:
   % hpcpid -create-epoch
   This command also terminates the hpcpid daemon after creating the epoch. This enables you to start the daemon using the same command on all nodes in the following step.
4. Start the daemon on the nodes you want to monitor so that each daemon uses the existing epoch (the epoch you created in step 3) with the following command:
   % hpcpid -epoch
   You can use this command in a job file that is executed on all nodes, as shown in “Example Using HP-LSF, SLURM, and MPI” (page 70) and in the fragment that follows this list.
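For example, the following is a minimal job-file fragment that starts one daemon per allocated node. This is a sketch only; the srun options and the SLURM_NNODES variable are assumptions about your SLURM configuration, and the -db option can be omitted if HPCPIDB is set in the job environment.

# Start one hpcpid daemon on each allocated node; each daemon joins the
# epoch created in step 3 because that epoch already exists in the database.
srun -N $SLURM_NNODES -n $SLURM_NNODES hpcpid -epoch -db $HPCPIDB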
Selecting Output Data for Specific Systems
By default, the hpcpiprof, hpcpilist, and hpcpitopcounts utilities search for profile files
in all system subdirectories in the epoch. In a cluster environment with a consolidated HPCPI
database and synchronized epochs, the utilities find profile files for multiple systems, and display
aggregate data values. This is useful for analyzing performance data for the cluster as a single
entity.
To view data from individual nodes, use the -hosts option with the hpcpiprof, hpcpilist, or hpcpitopcounts utility, as described in “Selecting Data by System” (page 51).
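For example, to limit the report to two nodes (a sketch only; the node names are placeholders, and the exact form of the -hosts argument is described in “Selecting Data by System” (page 51)):

% hpcpiprof -hosts n1,n2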
Example Using HP-LSF, SLURM, and MPI
This section contains an example with commands and files used to run an MPI application using
HP-LSF and SLURM, and to collect HPCPI data from all nodes in the job allocation.
NOTE:
SLURM syntax and operation are subject to change. The contents of this example are
provided only as guidelines.
RMS users can use RMS commands and mechanisms to start and control HPCPI.
Creating the Common HPCPI Directory and Epoch
On one system, create the HPCPI directory. The directory must be accessible by all nodes in the
cluster. Then create an epoch. For example:
% setenv HPCPIDB ~/hpcpidb
% mkdir -p $HPCPIDB
% hpcpid -create-epoch
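The job file for this example then starts the daemons and runs the MPI application across the allocation. The following is a minimal sketch, assuming an LSF-HPC bsub submission and SLURM's srun launcher; the script name, application name, and processor count are hypothetical placeholders.

#!/bin/csh
# hpcpi_job.csh (hypothetical job script)
# Start one hpcpid daemon per allocated node, joining the existing epoch
srun -N $SLURM_NNODES -n $SLURM_NNODES hpcpid -epoch
# Run the MPI application on all allocated processors
srun -n $SLURM_NPROCS ./a.out

Submit the job file with a command such as:

% bsub -n 4 -I ./hpcpi_job.csh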