NVIDIA UFM High-Availability User Guide

Monitoring and Troubleshooting


Check UFM Status 

Run the below command on the master node:

systemctl status ufm-enterprise.service


Check HA Status 

Run the below commands:

ufm_ha_cluster status 
pcs status


Check DRBD Status 

Run the below command:


ufm_ha_cluster status 


Show DRBD Resource 

Run the below command: 

drbdadm sh-resources 


Show DRBD Disk State 

Run the below command: 

drbdadm dstate ha_data 


Show DRBD Role 

Run the below command: 

drbdadm role ha_data 


Show DRBD Connectivity 

Run the below command: 

drbdadm cstate ha_data 
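
The DRBD checks above can be gathered into one summary. The following is a minimal sketch, assuming the resource is named ha_data as in this guide and that the node reports DRBD 8.4-style connection state names; `describe_cstate` and `drbd_summary` are hypothetical helper names and are not part of UFM or DRBD.

```shell
#!/usr/bin/env bash
# Sketch: summarize DRBD role, disk state, and connection state for one
# resource, using only the drbdadm subcommands shown in this guide.

# Hypothetical helper: map a connection state to a short hint.
describe_cstate() {
    case "$1" in
        Connected)    echo "ok: peers are connected" ;;
        WFConnection) echo "waiting: this node is waiting for the peer" ;;
        StandAlone)   echo "disconnected: may indicate split-brain" ;;
        *)            echo "other: $1" ;;
    esac
}

# Hypothetical helper: print the three states for a resource (default ha_data).
drbd_summary() {
    local res="${1:-ha_data}"
    local cs
    echo "role:   $(drbdadm role "$res")"
    echo "disk:   $(drbdadm dstate "$res")"
    cs="$(drbdadm cstate "$res")"
    echo "cstate: $cs ($(describe_cstate "$cs"))"
}

# Query DRBD only when drbdadm is installed on this host.
if command -v drbdadm >/dev/null 2>&1; then
    drbd_summary ha_data
fi
```

The script only reads state; it does not change the DRBD configuration.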


Split-Brain Recovery 

For an automated HA solution, it is recommended to configure STONITH agents to kill (power off) a peer node.

Step 1: 
Manually choose the node whose data modifications will be discarded. 

This node is called the split-brain victim. Choose wisely; all modifications made on the victim will be lost! When in doubt, back up the victim’s data before you continue. 

When running a Pacemaker cluster, you can enable maintenance mode first: 

ufm_ha_cluster enable-maintain

If the split-brain victim is in the Primary role, bring down all applications using this resource. 
Now, switch the victim to the Secondary role: 

victim# ufm_ha_cluster reset standby

Resync starts automatically if the survivor is in the WFConnection connection state. If the split-brain survivor is still in the StandAlone connection state, reconnect it: 

survivor# ufm_ha_cluster reset master

The resynchronization from the survivor (SyncSource) to the victim (SyncTarget) now starts immediately. No full sync is initiated; instead, all modifications on the victim are overwritten by the survivor’s data, and modifications made on the survivor are applied to the victim. 
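
The decision on the survivor node can be sketched as a small helper. This is an illustration only, assuming the ha_data resource name used in this guide; `survivor_next_step` is a hypothetical function, not part of ufm_ha_cluster, and it only prints a suggestion without changing anything on the node.

```shell
#!/usr/bin/env bash
# Sketch: suggest the survivor's next step from its DRBD connection state.
# survivor_next_step is a hypothetical helper introduced for illustration.
survivor_next_step() {
    case "$1" in
        WFConnection) echo "wait: resync starts once the victim reconnects" ;;
        StandAlone)   echo "reconnect: run 'ufm_ha_cluster reset master' on the survivor" ;;
        SyncSource)   echo "syncing: resynchronization to the victim is in progress" ;;
        Connected)    echo "connected: confirm disk state with 'drbdadm dstate ha_data'" ;;
        *)            echo "inspect: unexpected state '$1'" ;;
    esac
}

# Query DRBD only when drbdadm is installed; this prints advice and runs
# no recovery command itself.
if command -v drbdadm >/dev/null 2>&1; then
    survivor_next_step "$(drbdadm cstate ha_data)"
fi
```

Printing the suggested command instead of running it keeps the choice of victim and survivor a manual decision, as the procedure above requires.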

Communication Timeout During HA Configuration

During the HA configuration phase, you may encounter connectivity errors such as 'Error: Unable to communicate with <master/standby IP>' or connection timeouts, even when server connectivity appears fine. In that case, check the ypbind service, as it may be affecting communication.
Stop the ypbind service on both the master and standby nodes, then configure HA. After the configuration succeeds, start the ypbind service again.

systemctl stop ypbind
# configure HA
systemctl start ypbind
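
The stop, configure, and start sequence can be wrapped so that ypbind is restarted only if it was running beforehand. The following is a minimal sketch; `run_without_service` is a hypothetical helper name, and the HA configuration command itself is left as a placeholder because it depends on your setup.

```shell
#!/usr/bin/env bash
# Sketch: temporarily stop a systemd service while a command runs, then
# start the service again only if it was active before.
# run_without_service is a hypothetical helper introduced for illustration.
run_without_service() {
    local svc="$1"
    shift
    local was_active=0
    local rc=0

    # Remember whether the service was running before touching it.
    if systemctl is-active --quiet "$svc"; then
        was_active=1
        systemctl stop "$svc"
    fi

    # Run the wrapped command and keep its exit status.
    "$@" || rc=$?

    # Restore the service even if the wrapped command failed.
    if [ "$was_active" -eq 1 ]; then
        systemctl start "$svc"
    fi
    return "$rc"
}

# Usage (run on both master and standby; replace the placeholder with
# your actual HA configuration command):
#   run_without_service ypbind <your HA configuration command>
```

Because the helper records the initial state, a node on which ypbind was never running is left unchanged.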



 
