When you want to mount an NFS file system on a node of an HACMP cluster, there are a couple of items you need to check before it will work:
- Make sure the hostname and IP address of the HACMP node are resolvable and return the correct output by running:
# nslookup [hostname]
# nslookup [ip-address]
- The next thing you will want to check, on the NFS server, is whether the node names of your HACMP cluster nodes are correctly added to the /etc/exports file. If they are, run:
# exportfs -va
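For reference, this is roughly what the relevant /etc/exports line on the NFS server might look like. This is a sketch: /export/data and the node names node01 and node02 are placeholders, and the exact export options available depend on your AIX level.

```shell
# Hypothetical /etc/exports entry on the NFS server; /export/data,
# node01 and node02 are placeholders for your file system and nodes.
EXPORT_LINE='/export/data -rw=node01:node02,root=node01:node02'
echo "$EXPORT_LINE"
# Append a line like this to /etc/exports, then re-export with:
# exportfs -va
```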
- The last, and tricky, item you will want to check is whether a service IP label is defined as an IP alias on the same adapter as your node's hostname, e.g.:
# netstat -nr
Routing tables
Destination      Gateway          Flags   Refs   Use      If    Exp   Groups

Route Tree for Protocol Family 2 (Internet):
default          10.251.14.1      UG      4      180100   en1   -     -
10.251.14.0      10.251.14.50     UHSb    0      0        en1   -     -
10.251.14.50     127.0.0.1        UGHS    3      791253   lo0   -     -
The example above shows you that the default gateway is defined on the en1 interface. The next command shows you where your Service IP label lives:
# netstat -i
Name   Mtu     Network     Address            Ipkts     Ierrs   Opkts
en1    1500    link#2      0.2.55.d3.75.77    2587851   0       940024
en1    1500    10.251.14   node01             2587851   0       940024
en1    1500    10.251.20   serviceip          2587851   0       940024
lo0    16896   link#1                         1912870   0       1914185
lo0    16896   127         loopback           1912870   0       1914185
lo0    16896   ::1                            1912870   0       1914185
As you can see, the Service IP label (called "serviceip" in the example above) is defined on en1. In that case, for NFS to work, you also want to add "serviceip" to the /etc/exports file on the NFS server and re-run "exportfs -va". You should also make sure that hostname "serviceip" resolves to an IP address correctly (and, of course, that the IP address resolves to the correct hostname) on both the NFS server and the client.
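A quick way to sanity-check name resolution for the service IP label is to look up its forward entry on both machines and compare. Here is a minimal sketch that checks an /etc/hosts-style file; it is a simplification (it ignores DNS), and the "serviceip" label is just the example name from the text.

```shell
# Look up the IP address for a label in an /etc/hosts-style file.
# Skips comment lines; prints the address of the first matching entry.
check_label() {
    label=$1
    hostsfile=${2:-/etc/hosts}
    awk -v l="$label" '$1 !~ /^#/ {
        for (i = 2; i <= NF; i++)
            if ($i == l) { print $1; exit }
    }' "$hostsfile"
}
# Run this on both the NFS server and the client, e.g.:
# check_label serviceip
```

If the two machines print different addresses (or nothing at all), fix name resolution before troubleshooting NFS any further.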
Topics: AIX, EMC, PowerHA / HACMP, SAN, Storage, System Admin↑
Missing disk method in HACMP configuration
Issue when trying to bring up a resource group: For example, the hacmp.out log file contains the following:
cl_disk_available[187] cl_fscsilunreset fscsi0 hdiskpower1 false
cl_fscsilunreset[124]: openx(/dev/hdiskpower1, O_RDWR, 0, SC_NO_RESERVE): Device busy
cl_fscsilunreset[400]: ioctl SCIOLSTART id=0X11000 lun=0X1000000000000 : Invalid argument
To resolve this, you will have to make sure that the SCSI reset disk method is configured in HACMP. For example, when using EMC storage:
Make sure the emcpowerreset utility is present as /usr/lpp/EMC/Symmetrix/bin/emcpowerreset.
Then add a new custom disk method:
- Enter into the SMIT fastpath for HACMP "smitty hacmp".
- Select Extended Configuration.
- Select Extended Resource Configuration.
- Select HACMP Extended Resources Configuration.
- Select Configure Custom Disk Methods.
- Select Add Custom Disk Methods.
Change/Show Custom Disk Methods
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
* Disk Type (PdDvLn field from CuDv) disk/pseudo/power
* New Disk Type [disk/pseudo/power]
* Method to identify ghost disks [SCSI3]
* Method to determine if a reserve is held [SCSI_TUR]
* Method to break reserve [/usr/lpp/EMC/Symmetrix/bin/emcpowerreset]
Break reserves in parallel true
* Method to make the disk available [MKDEV]
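Before filling in the SMIT screen, it is worth verifying that the reset binary actually exists and is executable on every node. A small sketch (the EMC path is the one from the text; the helper name is hypothetical):

```shell
# Verify that a disk reset method binary is present and executable.
check_reset_method() {
    if [ -x "$1" ]; then
        echo "ok: $1"
        return 0
    else
        echo "missing: $1"
        return 1
    fi
}
# On an EMC PowerPath system you would run:
# check_reset_method /usr/lpp/EMC/Symmetrix/bin/emcpowerreset
```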
In order to keep users and all their related settings and crontab files synchronized, here's a script that you can use to do this for you:
sync.ksh
HACMP is capable of using an alternative MAC address in combination with its service address. So, how do you set this MAC address without HACMP, just using the command line? (This can come in handy if you wish to configure the service address on a system without having to start HACMP.)
# ifconfig enX down
# ifconfig enX detach
# chdev -l entX -a use_alt_addr=yes
# chdev -l entX -a alt_addr=0x00xxxxxxxxxx
# ifconfig enX xxx.xxx.xxx.xxx
# ifconfig enX up
And if you wish to remove it again:
# ifconfig enX down
# ifconfig enX detach
# chdev -l entX -a use_alt_addr=no
# chdev -l entX -a alt_addr=0x00000000000
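Since chdev will happily store a malformed value, a quick format check on the alternate address before running the commands above can save a retry. This is a sketch: the 0x-plus-12-hex-digits format matches the alt_addr examples above, and the chdev call itself is AIX-only, so it is only shown as a comment.

```shell
# Validate an alternate MAC address: "0x" followed by 12 hex digits.
valid_alt_addr() {
    echo "$1" | grep -Eq '^0x[0-9a-fA-F]{12}$'
}
# On AIX you would then run, for example:
# valid_alt_addr 0x00123456789a && chdev -l entX -a alt_addr=0x00123456789a
```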
Topics: PowerHA / HACMP↑
QHA
The standard tool for cluster monitoring is clstat, which comes along with PowerHA SystemMirror/HACMP. Clstat is rather slow with its updates, and sometimes the required clinfo daemon needs restarting to get it operational, so it is, well, not perfect. There's an alternative script that is also easy to use, written by PowerHA/HACMP guru Alex Abderrazag. This script shows you the correct PowerHA/HACMP status, along with adapter and volume group information. It works fine on HACMP 5.2 through 7.2. You can download it here: qha. This is version 9.06. For the latest version, check www.lpar.co.uk.
This tiny but effective tool accepts the following flags:
- -n (show network interface info)
- -N (show interface info and active HBOD)
- -v (show shared online volume group info)
- -l (log to /tmp/qha.out)
- -e (show running events if cluster is unstable)
- -m (show status of monitor app servers if present)
- -1 (exit after first iteration)
- -c (CAA SAN / Disk Comms)
# qha -nev
It's useful to put "qha" in /usr/es/sbin/cluster/utilities, as that path is usually already defined in $PATH, so you can run qha from anywhere.
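The install can be sketched as follows (paths per the text; the commands assume you saved the downloaded script as ./qha, and are shown as comments since they only make sense on a cluster node):

```shell
# Copy the downloaded script into the cluster utilities directory,
# which is normally already in $PATH on HACMP/PowerHA systems:
# cp ./qha /usr/es/sbin/cluster/utilities/qha
# chmod 755 /usr/es/sbin/cluster/utilities/qha
# qha -nev
```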
A description of the possible cluster states:
- ST_INIT: cluster configured and down
- ST_JOINING: node joining the cluster
- ST_VOTING: Inter-node decision state for an event
- ST_RP_RUNNING: cluster running recovery program
- ST_BARRIER: clstrmgr waiting at the barrier statement
- ST_CBARRIER: clstrmgr is exiting recovery program
- ST_UNSTABLE: cluster unstable
- NOT_CONFIGURED: HA installed but not configured
- RP_FAILED: event script failed
- ST_STABLE: cluster services are running with managed resources (stable cluster) or cluster services have been "forced" down with resource groups potentially in the UNMANAGED state (HACMP 5.4 only)
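If you want to pick up one of the states above in your own script rather than via qha, the cluster manager reports it through lssrc. A minimal sketch of extracting it; it assumes the "Current state:" line that lssrc -ls clstrmgrES prints, and the parser reads stdin so it can be tested without a cluster:

```shell
# Extract the ST_* state from "lssrc -ls clstrmgrES" output on stdin.
get_cluster_state() {
    awk -F': *' '/Current state:/ { print $2; exit }'
}
# On a cluster node you would run:
# lssrc -ls clstrmgrES | get_cluster_state
```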
Some user accounts, mostly service accounts, may create a lot of email messages, for example when a lot of commands are run by the cron daemon for a specific user. There are a couple of ways to deal with this:
1. Make sure no unnecessary emails are sent at all
To avoid receiving messages from the cron daemon, always redirect the output of commands in crontabs to a file or to /dev/null, and make sure to redirect STDERR as well:
0 * * * * /path/to/command > /path/to/logfile 2>&1
1 * * * * /path/to/command > /dev/null 2>&1
2. Make sure the commands in the crontab actually exist
An entry in a crontab with a command that does not exist will generate an email message from the cron daemon to the user, informing the user about this issue. This may occur on HACMP clusters where crontab files are synchronized on all HACMP nodes. They need to be synchronized on all the nodes, just in case a resource group fails over to a standby node. However, the file systems containing the commands may not be available on all nodes at all times. To get around that, test if the command exists first:
0 * * * * [ -x /path/to/command ] && /path/to/command > /path/to/logfile 2>&1
3. Clean up the email messages regularly
Another way of dealing with this is to add a cron entry to the user's crontab that cleans out the mailbox every night. For example, the next command deletes all but the last 1000 messages from a user's mailbox:
0 * * * * echo d1-$(let num="$(echo f|mail|tail -1|awk '{print $2}')-1000";echo $num)|mail >/dev/null
4. Forward the email to the user
Very effective: create a .forward file in the user's home directory to forward all email messages to the user. If the user starts receiving many, many emails, he/she will surely do something about it when it gets annoying.
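The one-liner in item 3 is dense, but its core is just computing which range of old messages to delete. A sketch of that arithmetic (in practice the message count comes from the mail command as shown above; here it is a plain parameter so the logic stands alone):

```shell
# Given the total number of mailbox messages, print the "d1-N" mail
# command that deletes everything except the newest `keep` messages.
trim_range() {
    total=$1
    keep=${2:-1000}
    if [ "$total" -le "$keep" ]; then
        return 1    # nothing to delete
    fi
    echo "d1-$((total - keep))"
}
```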
Topics: LVM, PowerHA / HACMP, System Admin↑
VGDA out of sync
With HACMP, you can run into the following error during a verification/synchronization:
WARNING: The LVM time stamp for shared volume group: testvg is inconsistent
with the time stamp in the VGDA for the following nodes: host01
To correct the above condition, run verification & synchronization with
"Automatically correct errors found during verification?" set to either 'Yes'
or 'Interactive'. The cluster must be down for the corrective action to run.
This can happen when you've added additional space to a logical volume or file system from the command line instead of through the smitty hacmp menus. But you certainly don't want to take down the entire HACMP cluster to get rid of this message.
First of all, you don't have to: the cluster will fail over fine anyway, even without these VGDAs being in sync. Still, it is an annoying warning that you would like to get rid of.
Have a look at your shared logical volumes. By using the lsattr command, you can see if they are actually in sync or not:
host01 # lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue
/test:1:809:jfs2:y:
host02 # lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue
/test:1:806:jfs2:y:
Well, there you have it. One host reports testlv having a size of 809 LPs, the other says it's 806. Not good. You will run into this when you've used the extendlv and chfs commands to increase the size of a shared file system. You should have used the smitty menu.
The good thing is, HACMP will sync the VGDAs if you perform some kind of logical volume operation through the smitty hacmp menus. So, either increase the size of a shared logical volume through the smitty menu by just one LP (and, of course, also increase the size of the corresponding file system); or create an additional shared logical volume of just one LP through smitty, and remove it again afterwards.
When you've done that, simply re-run the verification/synchronization, and you'll notice that the warning message is gone. Make sure you run the lsattr command again on your shared logical volumes on all the nodes in your cluster to confirm.
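When confirming on all nodes, the size in LPs is the third colon-separated field of the lsattr output shown above. A small sketch to pull it out (the parser reads the lsattr line on stdin so it can be tested standalone):

```shell
# Print the size (in LPs) from "lsattr -Z: ... -Fvalue" output, i.e.
# the third field of label:copies:size:type:strictness lines.
lv_size() {
    awk -F: '{ print $3 }'
}
# On each node you would run something like:
# lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue | lv_size
```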
HACMP automatically runs a verification every night, usually around midnight. With a very simple command, you can check the status of this verification run:
# tail -10 /var/hacmp/log/clutils.log 2>/dev/null|grep detected|tail -1
If this shows a return code of 0, the cluster verification ran without any errors. Anything else, you'll have to investigate. You can use this command on all your HACMP clusters, allowing you to verify your HACMP cluster status every day.
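If you want to script that daily check, the same filter can be wrapped in a small function. The log path and the "detected" keyword are taken from the one-liner above; the function reads log content from stdin so it can be exercised without a cluster:

```shell
# Print the most recent "detected" line from clutils.log content on stdin.
last_verification_line() {
    grep detected | tail -1
}
# On a cluster node:
# tail -10 /var/hacmp/log/clutils.log 2>/dev/null | last_verification_line
```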
With the following smitty menu you can change the time when the auto-verification runs and if it should produce debug output or not:
# smitty clautover.dialog
You can check with:
# odmget HACMPcluster
# odmget HACMPtimersvc
Be aware that if you change the runtime of the auto-verification, you have to synchronize the cluster afterwards to update the other nodes in the cluster.
When you're using HACMP, you usually have multiple network adapters installed and thus multiple network interfaces to deal with. If AIX configured the default gateway on the wrong interface (such as your management interface instead of the boot interface), you may want to change this so network traffic isn't sent over the management interface. Here's how you can do this:
First, stop HACMP or do a take-over of the resource groups to another node; this will avoid any problems with applications when you start fiddling with the network configuration.
Then open a virtual terminal window to the host from your HMC. Otherwise you would lose the connection as soon as you drop the current default gateway.
Now you need to determine where your current default gateway is configured. You can do this by typing:
# lsattr -El inet0
# netstat -nr
The lsattr command will show you the current default gateway route and the netstat command will show you the interface it is configured on. You can also check the ODM:
# odmget -q"attribute=route" CuAt
Now, delete the default gateway like this:
# lsattr -El inet0 | awk '$2 ~ /hopcount/ { print $2 }' | read GW
# chdev -l inet0 -a delroute=${GW}
If you would now use the route command to specify the default gateway on a specific interface, like this:
# route add 0 [ip address of default gateway: xxx.xxx.xxx.254] -if enX
you would have a working entry for the default gateway. But... the route command does not change anything in the ODM. As soon as your system reboots, the default gateway is gone again. Not a good idea.
A better solution is to use the chdev command:
# chdev -l inet0 -a addroute=net,-hopcount,0,,0,[ip address of default gateway]
This will set the default gateway to the first interface available.
To specify the interface use:
# chdev -l inet0 -a addroute=net,-hopcount,0,if,enX,,0,[ip address of default gateway]
Substitute the correct interface for enX in the command above.
If you previously used the route add command, and after that you use chdev to enter the default gateway, then this will fail. You have to delete it first by using route delete 0, and then give the chdev command.
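The addroute value format is easy to get wrong, so a tiny helper to build it can be handy. A sketch, assuming the field layout net,-hopcount,0,if,<interface>,,0,<gateway> from the chdev examples above (the helper name is hypothetical):

```shell
# Build the chdev addroute attribute value for a default route
# on a specific interface.
mk_addroute() {
    ifc=$1
    gw=$2
    echo "net,-hopcount,0,if,${ifc},,0,${gw}"
}
# Usage on AIX:
# chdev -l inet0 -a addroute=$(mk_addroute en1 10.251.14.1)
```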
Afterwards, check if the new default gateway is properly configured:
# lsattr -El inet0
# odmget -q"attribute=route" CuAt
And of course, try to ping the IP address of the default gateway and some outside address. Now reboot your system and check if the default gateway remains configured on the correct interface. And start up HACMP again!
- clstat - show cluster state and substate; needs clinfo.
- cldump - SNMP-based tool to show cluster state.
- cldisp - similar to cldump, perl script to show cluster state.
- cltopinfo - list the local view of the cluster topology.
- clshowsrv -a - list the local view of the cluster subsystems.
- clfindres (-s) - locate the resource groups and display status.
- clRGinfo -v - locate the resource groups and display status.
- clcycle - rotate some of the log files.
- cl_ping - a cluster ping program with more arguments.
- clrsh - cluster rsh program that takes cluster node names as arguments.
- clgetactivenodes - which nodes are active?
- get_local_nodename - what is the name of the local node?
- clconfig - check the HACMP ODM.
- clRGmove - online/offline or move resource groups.
- cldare - sync/fix the cluster.
- cllsgrp - list the resource groups.
- clsnapshotinfo - create a large snapshot of the HACMP configuration.
- cllscf - list the network configuration of an HACMP cluster.
- clshowres - show the resource group configuration.
- cllsif - show network interface information.
- cllsres - show short resource group information.
- lssrc -ls clstrmgrES - list the cluster manager state.
- lssrc -ls topsvcs - show heartbeat information.
- cllsnode - list a node-centric overview of the HACMP configuration.


