How the automatic restart works, Related events, “Node Failed to Rejoin SRD on Start-up” event – HP Matrix Operating Environment Software User Manual

NOTE:

If a vpar is borrowing cores from other vpars when it loses contact with its SRD, those
borrowed cores might be separated from the SRD. If the vpar might be down for an extended
time, check that the SRD has reformed without that vpar and that it has enough cores to meet
its commitments. If not, try using vparmodify to reclaim some of the cores. (With the vpar
down, you will not be able to modify it locally, and only some versions of HP-UX Virtual Partitions
allow you to easily modify a remote vpar.)

Similarly, if an npar has several active cores (due to Instant Capacity) when it loses contact with
its SRD, you might have to manually size the npar to reclaim those cores for npars still in the
SRD. For more information, see the Instant Capacity documentation.

How the automatic restart works

When a managed node boots, the gWLM agent (gwlmagent) starts automatically if
GWLM_AGENT_START is set to 1 in the file /etc/rc.config.d/gwlmCtl. The agent then checks
the file /etc/opt/gwlm/deployed.config to determine its CMS. Next, it attempts to contact
the CMS to have the CMS re-deploy its view of the SRD. If the CMS cannot be contacted, the
SRD in the deployed.config file is deployed as long as all nodes agree.
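
As a quick check of the first step above, the following is a minimal sketch of a shell function that reports whether the agent is set to start at boot. Only the file path and the variable name come from this section. The function name gwlm_autostart_enabled, its return codes, and the assumption that the file holds plain VAR=value lines are illustrative and not part of gWLM.

```shell
#!/bin/sh
# Sketch only: gwlm_autostart_enabled is a hypothetical helper, not a gWLM tool.
# Assumes the control file holds shell-style VAR=value assignments.
# Returns 0 if GWLM_AGENT_START is 1, 1 if it is anything else or unset,
# and 2 if the file cannot be read.

gwlm_autostart_enabled() {
    ctl_file="${1:-/etc/rc.config.d/gwlmCtl}"

    [ -r "$ctl_file" ] || return 2

    # Ignore commented-out lines; if the variable is assigned more than once,
    # keep the last assignment, as sourcing the file would.
    value=$(sed -n 's/^[[:space:]]*GWLM_AGENT_START=\([^#[:space:]]*\).*/\1/p' "$ctl_file" | tail -n 1)

    # Drop optional surrounding quotes before comparing.
    value=$(printf '%s' "$value" | tr -d "\"'")

    [ "$value" = "1" ]
}
```

Called with no argument, the function reads /etc/rc.config.d/gwlmCtl. Passing a path as the first argument lets you try it against a copy of the file.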

In general, when an SRD is disrupted by a node going down, by a CMS going down, or by
network communication issues, gWLM attempts to reform the SRD. gWLM maintains the
concept of a cluster for the nodes in an SRD. In a cluster, one node is a master and the other nodes
are nonmasters. If the master node loses contact with the rest of the SRD, the rest of the SRD can
continue without it, as a partial cluster, by unanimously agreeing on a new master. If a nonmaster
loses communication with the rest of the SRD, the resulting partial cluster continues operation
without the lost node. The master simply omits the missing node until it becomes available again.

You can use the gwlmstatus command to monitor availability. It can tell you whether any hosts
are unable to rejoin a node's SRD as well as whether hosts in the SRD are nonresponsive. For
more information, see gwlmstatus(1M).

NOTE:

Attempts to reform SRDs might time out, leaving no SRD deployed and consequently
no management of resource allocations. If this occurs, see the VSE Management Software Release
Notes and follow the actions suggested in the section titled “Data Missing in Real-time
Monitoring.”

Related events

You can configure the following HP SIM events regarding this automatic restart feature:

• Node Failed to Rejoin SRD on Start-up

• SRD Reformed with Partial Set of Nodes

• SRD Communication Issue

For information on enabling and viewing these events, refer to Optimize→Global Workload
Manager→Events.

You can then view these events using the Event Lists item in the left pane of HP SIM.

The following sections explain how to handle some of the events.

“Node Failed to Rejoin SRD on Start-up” event

If you see the event “Node Failed to Rejoin SRD on Start-up”:

1. Restart the gwlmagent on each managed node in the affected SRD:

   # /opt/gwlm/bin/gwlmagent --restart

2. Verify the agent rejoined the SRD by monitoring the Shared Resource Domain View in HP
   SIM or by using the gwlm monitor command.
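
If the SRD has many managed nodes, the restart step can be scripted. The sketch below assumes that you can reach each node as root over ssh, which this manual does not state. The function name restart_gwlm_agents, the DRY_RUN variable, and the host names are hypothetical. Only the gwlmagent command line is taken from the procedure above. By default the function only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: restart_gwlm_agents and DRY_RUN are illustrative names.
# Assumes root ssh access to every managed node in the affected SRD.

restart_gwlm_agents() {
    for node in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (the default): show the command instead of running it.
            echo "ssh root@$node /opt/gwlm/bin/gwlmagent --restart"
        else
            ssh "root@$node" /opt/gwlm/bin/gwlmagent --restart ||
                echo "restart failed on $node" >&2
        fi
    done
}

# Example with hypothetical host names; prints the commands only.
restart_gwlm_agents node1.example.com node2.example.com
```

Set DRY_RUN=0 before calling the function to run the restarts for real, then verify that each agent rejoined the SRD as described in the procedure.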
