When you want to mount an NFS file system on a node of an HACMP cluster, there are a couple of items you need to check before it will work:
- Make sure the hostname and IP address of the HACMP node are resolvable and return the correct output by running:
# nslookup [hostname]
# nslookup [ip-address]
- The next thing you will want to check, on the NFS server, is whether the node names of your HACMP cluster nodes are correctly added to the /etc/exports file. If they are, run:
# exportfs -va
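For reference, this is roughly what the relevant /etc/exports line on the NFS server might look like. This is a sketch: /export/data and the node names node01 and node02 are placeholders, and the exact export options available depend on your AIX level.

```shell
# Hypothetical /etc/exports entry on the NFS server; /export/data,
# node01 and node02 are placeholders for your file system and nodes.
EXPORT_LINE='/export/data -rw=node01:node02,root=node01:node02'
echo "$EXPORT_LINE"
# Append a line like this to /etc/exports, then re-export with:
# exportfs -va
```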
- The last, and tricky, item you will want to check is whether a service IP label is defined as an IP alias on the same adapter as your node's hostname, e.g.:
# netstat -nr
Routing tables
Destination      Gateway          Flags   Refs   Use      If    Exp   Groups

Route Tree for Protocol Family 2 (Internet):
default          10.251.14.1      UG      4      180100   en1   -     -
10.251.14.0      10.251.14.50     UHSb    0      0        en1   -     -
10.251.14.50     127.0.0.1        UGHS    3      791253   lo0   -     -
The example above shows you that the default gateway is defined on the en1 interface. The next command shows you where your Service IP label lives:
# netstat -i
Name   Mtu     Network     Address            Ipkts     Ierrs   Opkts
en1    1500    link#2      0.2.55.d3.75.77    2587851   0       940024
en1    1500    10.251.14   node01             2587851   0       940024
en1    1500    10.251.20   serviceip          2587851   0       940024
lo0    16896   link#1                         1912870   0       1914185
lo0    16896   127         loopback           1912870   0       1914185
lo0    16896   ::1                            1912870   0       1914185
As you can see, the Service IP label (called "serviceip" in the example above) is defined on en1. In that case, for NFS to work, you also want to add "serviceip" to the /etc/exports file on the NFS server and re-run "exportfs -va". You should also make sure that hostname "serviceip" resolves to an IP address correctly (and, of course, that the IP address resolves to the correct hostname) on both the NFS server and the client.
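A quick way to sanity-check name resolution for the service IP label is to look up its forward entry on both machines and compare. Here is a minimal sketch that checks an /etc/hosts-style file; it is a simplification (it ignores DNS), and the "serviceip" label is just the example name from the text.

```shell
# Look up the IP address for a label in an /etc/hosts-style file.
# Skips comment lines; prints the address of the first matching entry.
check_label() {
    label=$1
    hostsfile=${2:-/etc/hosts}
    awk -v l="$label" '$1 !~ /^#/ {
        for (i = 2; i <= NF; i++)
            if ($i == l) { print $1; exit }
    }' "$hostsfile"
}
# Run this on both the NFS server and the client, e.g.:
# check_label serviceip
```

If the two machines print different addresses (or nothing at all), fix name resolution before troubleshooting NFS any further.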
Topics: AIX, EMC, PowerHA / HACMP, SAN, Storage, System Admin↑
Missing disk method in HACMP configuration
Issue when trying to bring up a resource group: For example, the hacmp.out log file contains the following:
cl_disk_available[187] cl_fscsilunreset fscsi0 hdiskpower1 false
cl_fscsilunreset[124]: openx(/dev/hdiskpower1, O_RDWR, 0, SC_NO_RESERVE): Device busy
cl_fscsilunreset[400]: ioctl SCIOLSTART id=0X11000 lun=0X1000000000000 : Invalid argument
To resolve this, you will have to make sure that the SCSI reset disk method is configured in HACMP. For example, when using EMC storage:
Make sure the emcpowerreset utility is present as /usr/lpp/EMC/Symmetrix/bin/emcpowerreset.
Then add a new custom disk method:
- Enter into the SMIT fastpath for HACMP "smitty hacmp".
- Select Extended Configuration.
- Select Extended Resource Configuration.
- Select HACMP Extended Resources Configuration.
- Select Configure Custom Disk Methods.
- Select Add Custom Disk Methods.
Change/Show Custom Disk Methods
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Disk Type (PdDvLn field from CuDv) disk/pseudo/power
* New Disk Type [disk/pseudo/power]
* Method to identify ghost disks [SCSI3]
* Method to determine if a reserve is held [SCSI_TUR]
* Method to break reserve [/usr/lpp/EMC/Symmetrix/bin/emcpowerreset]
Break reserves in parallel true
* Method to make the disk available [MKDEV]
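Before filling in the SMIT screen, it is worth verifying that the reset binary actually exists and is executable on every node. A small sketch (the EMC path is the one from the text; the helper name is hypothetical):

```shell
# Verify that a disk reset method binary is present and executable.
check_reset_method() {
    if [ -x "$1" ]; then
        echo "ok: $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}
# On an EMC PowerPath system you would run:
# check_reset_method /usr/lpp/EMC/Symmetrix/bin/emcpowerreset
```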
In order to keep users and all their related settings and crontab files synchronized, here's a script that you can use to do this for you:
sync.ksh
HACMP is capable of using an alternative MAC address in combination with its service address. So, how do you set this MAC address without HACMP, just using the command line? (This can come in handy if you wish to configure the service address on a system without having to start HACMP.)
# ifconfig enX down
# ifconfig enX detach
# chdev -l entX -a use_alt_addr=yes
# chdev -l entX -a alt_addr=0x00xxxxxxxxxx
# ifconfig enX xxx.xxx.xxx.xxx
# ifconfig enX up
And if you wish to remove it again:
# ifconfig enX down
# ifconfig enX detach
# chdev -l entX -a use_alt_addr=no
# chdev -l entX -a alt_addr=0x00000000000
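Since chdev will happily store a malformed value, a quick format check on the alternate address before running the commands above can save a retry. This is a sketch: the 0x-plus-12-hex-digits format matches the alt_addr examples above, and the chdev call itself is AIX-only, so it is only shown as a comment.

```shell
# Validate an alternate MAC address: "0x" followed by 12 hex digits.
valid_alt_addr() {
    echo "$1" | grep -Eq '^0x[0-9a-fA-F]{12}$'
}
# On AIX you would then run, for example:
# valid_alt_addr 0x00123456789a && chdev -l entX -a alt_addr=0x00123456789a
```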
Topics: PowerHA / HACMP↑
QHA
The standard tool for cluster monitoring is clstat, which comes along with PowerHA SystemMirror/HACMP. Clstat is rather slow with its updates, and sometimes the required clinfo daemon needs restarting to get it operational, so it is, well, not perfect. There's an alternative script that is also easy to use, written by PowerHA/HACMP guru Alex Abderrazag. This script shows you the correct PowerHA/HACMP status, along with adapter and volume group information. It works fine on HACMP 5.2 through 7.2. You can download it here: qha. This is version 9.06. For the latest version, check www.lpar.co.uk.
This tiny but effective tool accepts the following flags:
- -n (show network interface info)
- -N (show interface info and active HBOD)
- -v (show shared online volume group info)
- -l (log to /tmp/qha.out)
- -e (show running events if cluster is unstable)
- -m (show status of monitor app servers if present)
- -1 (exit after first iteration)
- -c (CAA SAN / Disk Comms)
# qha -nev
It's useful to put "qha" in /usr/es/sbin/cluster/utilities, as that path is usually already defined in $PATH, so you can run qha from anywhere.
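The install can be sketched as follows (paths per the text; the commands assume you saved the downloaded script as ./qha, and are shown as comments since they only make sense on a cluster node):

```shell
# Copy the downloaded script into the cluster utilities directory,
# which is normally already in $PATH on HACMP/PowerHA systems:
# cp ./qha /usr/es/sbin/cluster/utilities/qha
# chmod 755 /usr/es/sbin/cluster/utilities/qha
# qha -nev
```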
A description of the possible cluster states:
- ST_INIT: cluster configured and down
- ST_JOINING: node joining the cluster
- ST_VOTING: Inter-node decision state for an event
- ST_RP_RUNNING: cluster running recovery program
- ST_BARRIER: clstrmgr waiting at the barrier statement
- ST_CBARRIER: clstrmgr is exiting recovery program
- ST_UNSTABLE: cluster unstable
- NOT_CONFIGURED: HA installed but not configured
- RP_FAILED: event script failed
- ST_STABLE: cluster services are running with managed resources (stable cluster) or cluster services have been "forced" down with resource groups potentially in the UNMANAGED state (HACMP 5.4 only)
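If you want to pick up one of the states above in your own script rather than via qha, the cluster manager reports it through lssrc. A minimal sketch of extracting it; it assumes the "Current state:" line that lssrc -ls clstrmgrES prints, and the parser reads stdin so it can be tested without a cluster:

```shell
# Extract the ST_* state from "lssrc -ls clstrmgrES" output on stdin.
get_cluster_state() {
    awk -F': *' '/Current state:/ { print $2; exit }'
}
# On a cluster node you would run:
# lssrc -ls clstrmgrES | get_cluster_state
```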
Some user accounts, mostly service accounts, may create a lot of email messages, for example when a lot of commands are run by the cron daemon for a specific user. There are a couple of ways to deal with this:
1. Make sure no unnecessary emails are sent at all
To avoid receiving messages from the cron daemon, always redirect the output of commands in crontabs to a file or to /dev/null, and make sure to redirect STDERR as well:
0 * * * * /path/to/command > /path/to/logfile 2>&1
1 * * * * /path/to/command > /dev/null 2>&1
2. Make sure the commands in the crontab actually exist
An entry in a crontab with a command that does not exist will generate an email message from the cron daemon to the user, informing the user about this issue. This may occur on HACMP clusters where crontab files are synchronized on all HACMP nodes. They need to be synchronized on all the nodes, just in case a resource group fails over to a standby node. However, the file systems containing the commands may not be available on all nodes at all times. To get around that, test if the command exists first:
0 * * * * [ -x /path/to/command ] && /path/to/command > /path/to/logfile 2>&1
3. Clean up the email messages regularly
Another way of dealing with this is to add a cron entry to the user's crontab that cleans out the mailbox every night. For example, the next command deletes all but the last 1000 messages from a user's mailbox:
0 * * * * echo d1-$(let num="$(echo f|mail|tail -1|awk '{print $2}')-1000";echo $num)|mail >/dev/null
4. Forward the email to the user
Very effective: create a .forward file in the user's home directory to forward all email messages to the user. If the user starts receiving many, many emails, he/she will surely do something about it when it gets annoying.
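The one-liner in item 3 is dense, but its core is just computing which range of old messages to delete. A sketch of that arithmetic (in practice the message count comes from the mail command as shown above; here it is a plain parameter so the logic stands alone):

```shell
# Given the total number of mailbox messages, print the "d1-N" mail
# command that deletes everything except the newest `keep` messages.
trim_range() {
    total=$1
    keep=${2:-1000}
    if [ "$total" -le "$keep" ]; then
        return 1    # nothing to delete
    fi
    echo "d1-$((total - keep))"
}
```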
Topics: LVM, PowerHA / HACMP, System Admin↑
VGDA out of sync
With HACMP, you can run into the following error during a verification/synchronization:
WARNING: The LVM time stamp for shared volume group: testvg is inconsistent
with the time stamp in the VGDA for the following nodes: host01
To correct the above condition, run verification & synchronization with
"Automatically correct errors found during verification?" set to either 'Yes'
or 'Interactive'. The cluster must be down for the corrective action to run.
This can happen when you've added additional space to a logical volume or file system from the command line instead of through the smitty hacmp menus. But you certainly don't want to take down the entire HACMP cluster to get rid of this message.
First of all, you don't have to: the cluster will fail over fine anyway, even without these VGDAs being in sync. Still, it is an annoying warning that you would like to get rid of.
Have a look at your shared logical volumes. By using the lsattr command, you can see if they are actually in sync or not:
host01 # lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue
/test:1:809:jfs2:y:
host02 # lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue
/test:1:806:jfs2:y:
Well, there you have it. One host reports testlv having a size of 809 LPs, the other says it's 806. Not good. You will run into this when you've used the extendlv and chfs commands to increase the size of a shared file system. You should have used the smitty menu.
The good thing is, HACMP will sync the VGDAs if you perform some kind of logical volume operation through the smitty hacmp menus. So, either increase the size of a shared logical volume through the smitty menu by just one LP (and, of course, also increase the size of the corresponding file system); or create an additional shared logical volume of just one LP through smitty, and remove it again afterwards.
When you've done that, simply re-run the verification/synchronization, and you'll notice that the warning message is gone. Make sure you run the lsattr command again on your shared logical volumes on all the nodes in your cluster to confirm.
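When confirming on all nodes, the size in LPs is the third colon-separated field of the lsattr output shown above. A small sketch to pull it out (the parser reads the lsattr line on stdin so it can be tested standalone):

```shell
# Print the size (in LPs) from "lsattr -Z: ... -Fvalue" output, i.e.
# the third field of label:copies:size:type:strictness lines.
lv_size() {
    awk -F: '{ print $3 }'
}
# On each node you would run something like:
# lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue | lv_size
```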
HACMP automatically runs a verification every night, usually around midnight. With a very simple command, you can check the status of this verification run:
# tail -10 /var/hacmp/log/clutils.log 2>/dev/null|grep detected|tail -1
If this shows a return code of 0, the cluster verification ran without any errors. Anything else, you'll have to investigate. You can use this command on all your HACMP clusters, allowing you to verify your HACMP cluster status every day.
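If you want to script that daily check, the same filter can be wrapped in a small function. The log path and the "detected" keyword are taken from the one-liner above; the function reads log content from stdin so it can be exercised without a cluster:

```shell
# Print the most recent "detected" line from clutils.log content on stdin.
last_verification_line() {
    grep detected | tail -1
}
# On a cluster node:
# tail -10 /var/hacmp/log/clutils.log 2>/dev/null | last_verification_line
```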
With the following smitty menu you can change the time when the auto-verification runs and if it should produce debug output or not:
# smitty clautover.dialog
You can check with:
# odmget HACMPcluster
# odmget HACMPtimersvc
Be aware that if you change the runtime of the auto-verification, you have to synchronize the cluster afterwards to update the other nodes in the cluster.
When you're using HACMP, you usually have multiple network adapters installed and thus multiple network interfaces to deal with. If AIX configured the default gateway on the wrong interface (such as your management interface instead of the boot interface), you may want to change this so network traffic isn't sent over the management interface. Here's how you can do this:
First, stop HACMP or do a take-over of the resource groups to another node; this will avoid any problems with applications when you start fiddling with the network configuration.
Then open a virtual terminal window to the host from your HMC. Otherwise you would lose the connection as soon as you drop the current default gateway.
Now you need to determine where your current default gateway is configured. You can do this by typing:
# lsattr -El inet0
# netstat -nr
The lsattr command will show you the current default gateway route and the netstat command will show you the interface it is configured on. You can also check the ODM:
# odmget -q"attribute=route" CuAt
Now, delete the default gateway like this:
# lsattr -El inet0 | awk '$2 ~ /hopcount/ { print $2 }' | read GW
# chdev -l inet0 -a delroute=${GW}
If you would now use the route command to specify the default gateway on a specific interface, like this:
# route add 0 [ip address of default gateway: xxx.xxx.xxx.254] -if enX
you would have a working entry for the default gateway. But... the route command does not change anything in the ODM. As soon as your system reboots, the default gateway is gone again. Not a good idea.
A better solution is to use the chdev command:
# chdev -l inet0 -a addroute=net,-hopcount,0,,0,[ip address of default gateway]
This will set the default gateway to the first interface available.
To specify the interface use:
# chdev -l inet0 -a addroute=net,-hopcount,0,if,enX,,0,[ip address of default gateway]
Substitute the correct interface for enX in the command above.
If you previously used the route add command, and after that you use chdev to enter the default gateway, then this will fail. You have to delete it first by using route delete 0, and then give the chdev command.
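The addroute value format is easy to get wrong, so a tiny helper to build it can be handy. A sketch, assuming the field layout net,-hopcount,0,if,<interface>,,0,<gateway> from the chdev examples above (the helper name is hypothetical):

```shell
# Build the chdev addroute attribute value for a default route
# on a specific interface.
mk_addroute() {
    ifc=$1
    gw=$2
    echo "net,-hopcount,0,if,${ifc},,0,${gw}"
}
# Usage on AIX:
# chdev -l inet0 -a addroute=$(mk_addroute en1 10.251.14.1)
```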
Afterwards, check if the new default gateway is properly configured:
# lsattr -El inet0
# odmget -q"attribute=route" CuAt
And of course, try to ping the IP address of the default gateway and some outside address. Now reboot your system and check if the default gateway remains configured on the correct interface. And start up HACMP again!
- clstat - show cluster state and substate; needs clinfo.
- cldump - SNMP-based tool to show cluster state.
- cldisp - similar to cldump, perl script to show cluster state.
- cltopinfo - list the local view of the cluster topology.
- clshowsrv -a - list the local view of the cluster subsystems.
- clfindres (-s) - locate the resource groups and display status.
- clRGinfo -v - locate the resource groups and display status.
- clcycle - rotate some of the log files.
- cl_ping - a cluster ping program with more arguments.
- clrsh - cluster rsh program that takes cluster node names as arguments.
- clgetactivenodes - which nodes are active?
- get_local_nodename - what is the name of the local node?
- clconfig - check the HACMP ODM.
- clRGmove - online/offline or move resource groups.
- cldare - sync/fix the cluster.
- cllsgrp - list the resource groups.
- clsnapshotinfo - create a large snapshot of the HACMP configuration.
- cllscf - list the network configuration of an HACMP cluster.
- clshowres - show the resource group configuration.
- cllsif - show network interface information.
- cllsres - show short resource group information.
- lssrc -ls clstrmgrES - list the cluster manager state.
- lssrc -ls topsvcs - show heartbeat information.
- cllsnode - list a node-centric overview of the HACMP configuration.


