UNIX Health Check - System Admin

Tech Blog

These are blog entries written by the UNIX Health Check development team. Our team has extensive technical experience on both AIX and Red Hat systems, and we like to share our knowledge with our visitors.

Topics: Hardware, Installation, System Admin

Automating microcode discovery

You can run invscout to do a microcode discovery on your system. This generates a hostname.mup file, which you can upload to the microcode survey page on the IBM website to get a nice overview of the status of all firmware on your system.

So far, so good. But what if you have plenty of systems and want to automate this? Here's a script to do it. The script first uses wget to collect the latest catalog.mic file from the IBM website and distributes this catalog file to all the hosts you want to check. Then it runs invscout on all these hosts and collects the hostname.mup files. Finally, it concatenates these files into one large file and uses curl to do an HTTP POST that uploads the file to the IBM website, which generates a report from it.

So, what do you need?

You should have an AIX jump server that allows you to access the other hosts as user root through SSH, so you should have set up your SSH keys for user root.
This jump server must have access to the Internet.
You need to have wget and curl installed. Get them from the AIX Toolbox for Linux Applications.
Your servers should be AIX 5 or higher. It doesn't really work with AIX 4.
Optional: a web server, like Apache 2, would be nice, so you can drop the resulting HTML file on your website every day.
An entry in the root crontab to run this script every day.
A list of servers you want to check.
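
The requirements above can be verified before the first run. This is a minimal sketch: the helper name check_prereqs is hypothetical and not part of the script itself, while the command names and the server list path are the ones the script uses.

```shell
# check_prereqs: hypothetical helper, not part of the original script.
# Usage: check_prereqs <server list file> <command> [<command> ...]
# Returns 0 when the list is readable and every command is found in $PATH.
check_prereqs() {
    list=$1
    shift
    rc=0
    if [ ! -r "$list" ]; then
        echo "Cannot read server list: $list" >&2
        rc=1
    fi
    for cmd in "$@"; do
        # command -v prints nothing and fails when the command is unknown
        command -v "$cmd" >/dev/null 2>&1 || {
            echo "Missing command: $cmd" >&2
            rc=1
        }
    done
    return $rc
}
```

On the jump server it could be called as: check_prereqs /usr/local/etc/servers wget curl ssh scp invscout || exit 1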

Here's the script:

#!/bin/ksh

# script:  generate_survey.ksh
# purpose: To generate a microcode survey html file

# where is my list of servers located?
SERVERS=/usr/local/etc/servers

# what temporary folder will I use?
TEMP=/tmp/mup

# what is the invscout folder
INV=/var/adm/invscout

# what is the catalog.mic file location for invscout?
MIC=${INV}/microcode/catalog.mic

# if you have a webserver,
# where shall I put a copy of survey.html?
APA=/usr/local/apache2/htdocs

# who's the sender of the email?
FROM=microcode_survey@ibm.com

# who's the receiver of the email?
TO="your.email@address.com"

# what's the title of the email?
SUBJ="Microcode Survey"

# user check
USER=`whoami`
if [ "$USER" != "root" ];
then
    echo "Only root can run this script."
    exit 1;
fi

# create a temporary directory
rm -rf $TEMP 2>/dev/null
mkdir $TEMP 2>/dev/null
# stop if the temporary directory is not usable
cd $TEMP || exit 1

# get the latest catalog.mic file from IBM
# you need to have wget installed 
# and accessible in $PATH
# you can download this on:
# www-03.ibm.com
# /systems/power/software/aix/linux/toolbox/download.html
wget techsupport.services.ibm.com/server/mdownload/catalog.mic
# You could also use curl here, e.g.:
#curl techsupport.services.ibm.com/server/mdownload/catalog.mic -LO

# move the catalog.mic file to this server's invscout directory
mv $TEMP/catalog.mic $MIC

# remove any old mup files
echo Remove any old mup files from hosts.
for server in `cat $SERVERS` ; do
   echo "${server}"
   ssh $server "rm -f $INV/*.mup"
done

# distribute this file to all other hosts
for server in `cat $SERVERS` ; do
   echo "${server}"
   scp -p $MIC $server:$MIC
done

# run invscout on all these hosts
# this will create a hostname.mup file
for server in `cat $SERVERS` ; do
   echo "${server}"
   ssh $server invscout
done

# collect the hostname.mup files
for server in `cat $SERVERS` ; do
   echo "${server}"
   scp -p $server:$INV/*.mup $TEMP
done

# concatenate all hostname.mup files to one file
cat ${TEMP}/*mup > ${TEMP}/muppet.$$

# delete all the hostname.mup files
rm $TEMP/*mup

# upload the remaining file to IBM.
# you need to have curl installed for this
# you can download this on:
# www-03.ibm.com
# /systems/power/software/aix/linux/toolbox/download.html
# you can install it like this:
# rpm -ihv 
#    curl-7.9.3-2.aix4.3.ppc.rpm curl-devel-7.9.3-2.aix4.3.ppc.rpm
# more info on using curl can be found on: 
# http://curl.haxx.se/docs/httpscripting.html
# more info on uploading survey files can be found on:
# www14.software.ibm.com/webapp/set2/mds/fetch?pop=progUpload.html

# Sometimes, the IBM website will respond with an
# "Expectation Failed" error message. Loop the curl command until
# we get valid output.

stop="false"

while [ "$stop" = "false" ] ; do

   # note: no whitespace is allowed after the line continuation backslash
   curl -H Expect: -F mdsData=@${TEMP}/muppet.$$ -F sendfile="Upload file" \
      http://www14.software.ibm.com/webapp/set2/mds/mds \
      > ${TEMP}/survey.html

   # Test if we see Expectation Failed in the output
   mytest=`grep "Expectation Failed" ${TEMP}/survey.html`

   if [ -z "${mytest}" ] ; then
      stop="true"
   else
      # wait a bit before trying again
      sleep 10
   fi

done

# now it is very useful to have an apache2 webserver running
# so you can access the survey file
mv $TEMP/survey.html $APA

# tip: put in the crontab daily like this:
# 45 9 * * * /usr/local/sbin/generate_survey.ksh 1>/dev/null 2>&1

# mail the output
# need to make sure this is sent in html format
cat - ${APA}/survey.html <<HERE | sendmail -oi -t
From: ${FROM}
To: ${TO}
Subject: ${SUBJ}
Mime-Version: 1.0
Content-type: text/html
Content-transfer-encoding: 8bit

HERE

# clean up the mess
cd /tmp
rm -rf $TEMP
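
The upload loop in the script keeps retrying for as long as the IBM website answers with "Expectation Failed". A bounded variant is sketched here. The function name upload_with_retry and its arguments are assumptions for illustration. Like the original loop, it treats any output without the marker string as success.

```shell
# upload_with_retry: hypothetical bounded version of the retry loop.
# $1 = maximum number of attempts
# $2 = seconds to wait between attempts
# $3 = file that receives the output of the upload command
# remaining arguments = the upload command itself
upload_with_retry() {
    max=$1
    pause=$2
    out=$3
    shift 3
    try=1
    while [ "$try" -le "$max" ]; do
        "$@" > "$out"
        # success: the marker string is absent from the output
        grep "Expectation Failed" "$out" >/dev/null 2>&1 || return 0
        try=$((try + 1))
        # only wait when another attempt will follow
        [ "$try" -le "$max" ] && sleep "$pause"
    done
    return 1
}
```

In the script, the curl command line would be passed as the trailing arguments, for example with a limit of 5 attempts and a pause of 10 seconds.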

Topics: AIX, System Admin

Duplicate errpt entries

By default, AIX avoids logging duplicate errpt entries. You can see the default settings using smitty errdemon: duplicates are checked within a time interval of 10000 milliseconds (10 seconds), and the duplicate error maximum is set to 1000. An additional entry is made as soon as one of these limits is reached, whichever comes first: the duplicate time interval of 10 seconds or the duplicate error maximum of 1000.
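
The same settings can also be read without SMIT. This is a minimal sketch, assuming the error daemon binary is located at /usr/lib/errdemon and that its -l flag lists the current error log attributes; the wrapper name show_errdemon_settings is hypothetical.

```shell
# show_errdemon_settings: hypothetical wrapper around "errdemon -l".
# $1 = path to the errdemon binary (assumed default: /usr/lib/errdemon)
show_errdemon_settings() {
    errdemon_bin=${1:-/usr/lib/errdemon}
    if [ -x "$errdemon_bin" ]; then
        # lists the error log attributes, including the duplicate settings
        "$errdemon_bin" -l
    else
        echo "errdemon not found at $errdemon_bin" >&2
        return 1
    fi
}
```

On an AIX host, run show_errdemon_settings without arguments; on any other system the function simply reports that the binary is missing.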

Topics: AIX, System Admin

The default log file has been changed

You may encounter the following entry now and then in your errpt:

IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
573790AA 0528212209 I O RMCdaemon The default log file has been changed.

An example of such an entry is:

-----------------------------------------------------------------
LABEL: RMCD_INFO_2_ST
IDENTIFIER: 573790AA

Date/Time: Sun May 17 22:11:46 PDT 2009
Sequence Number: 8539
Machine Id: 00GB214D4C00
Node Id: blahblah
Class: O
Type: INFO
Resource Name: RMCdaemon

Description
The default log file has been changed.

Probable Causes
The current default log file has been renamed and a new log file created.

Failure Causes
The current log file has become too large.

Recommended Actions
No action is required.

Detail Data
DETECTING MODULE
RSCT,rmcd_err.c,1.17,512
ERROR ID
6e0tBL/GsC28/gQH/ne1K//...................
REFERENCE CODE

File name
/var/ct/IW/log/mc/default

This error report entry refers to the file /var/ct/IW/log/mc/default. When this file reaches 256 KB, it is renamed to default.last and a new file is created.

The following messages can be found in this file:

2610-217 Received 193 unrecognized messages in the last 10.183333 minutes. Service is rmc.

This message more or less means:

"2610-217 Received [count of unrecognized messages] unrecognized messages in the last [time] minutes. Service is [service_name].
Explanation:
The RMC daemon has received the specified number of unrecognized messages within the specified time interval. These messages were received on the UDP port, indicated by the specified service name, used for communication among RMC daemons. The most likely cause of this error is that this port number is being used by another application.

User Response:
Validate that the port number configured for use by the Resource Monitoring and Control daemon is only being used by the RMC daemon."

Check if something else is using the port of the RMC daemon:

# grep RMC /etc/services
rmc                      657/tcp                # RMC
rmc                      657/udp                # RMC
# lsof -i :657
COMMAND     PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
rmcd    1384574 root    3u  IPv6 0xf35f20      0t0  UDP *:rmc
rmcd    1384574 root   14u  IPv6 0xf2fd39      0t0  TCP *:rmc (LISTEN)
# netstat -Aan | grep 657
f1000600022fd398 tcp     0   0  *.657    *.*   LISTEN
f10006000635f200 udp     0   0  *.657    *.*
The socket 0x22fd008 is being held by proccess 1384574 (rmcd).

No, it is actually the RMC daemon that is using this port, so this is fine.

Start an IP trace to find out who's transmitting to this port:

# iptrace -a -d host1 -p 657 /tmp/trace.out
# ps -ef | grep iptrace
root 2040018 iptrace -a -d lawtest2 -p 657 /tmp/trace.out
# kill 2040018
iptrace: unload success!
# ipreport -n /tmp/trace.out > /tmp/trace.fmt

The IP trace report only shows messages from the RMC daemon of the HMC:

Packet Number 3
====( 458 bytes received on interface en4 )==== 12:12:34.927422418
ETHERNET packet : [14:5e:81:60:9d -> 14:5e:db:29:9a] type 800 (IP)
IP header breakdown:
        < SRC =    10.231.21.55 >  (hmc)
        < DST =    10.231.21.54 >  (host1)
        ip_v=4, ip_hl=20, ip_tos=0, ip_len=444, ip_id=0, ip_off=0 DF
        ip_ttl=64, ip_sum=f8ce, ip_p = 17 (UDP)
UDP header breakdown:
        
        [ udp length = 424 | udp checksum = 6420 ]
00000000     0b005001 f0fff0ff e81fd7bf 01000100   |..P.............|
00000010     ec9f95eb 85807522 02010000 05001100   |......u"........|
00000020     2f001543 a88ba597 4a03134a 50a00200   |/..C....J..JP...|
00000030     00000000 00000000 4ca00200 00000000   |........L.......|
00000040     85000010 00000000 01000000 45a34f3f   |............E.O?|
00000050     fe5dd3e7 3901eb8d 169826cb cc22d391   |.]..9.....&.."..|
00000060     e6045340 e2d4b997 1efc9b78 f0bfce77   |..S@.......x...w|
00000070     487cbbd9 21fda20c f5cf8920 53d2f55a   |H|..!...... S..Z|
00000080     2de3eb9d 62ba1eef 10b80598 e90f1918   |-...b...........|
00000090     9cd9c654 8fb26c66 2ba6f7f0 7d885d34   |...T..lf+...}.]4|
000000a0     aa8d9f39 d2cd7277 7a87b6aa 494bb728   |...9..rwz...IK.(|
000000b0     53dea666 65d92428 e2ad90ed 73869b8d   |S..fe.$(....s...|
000000c0     d1deb7b2 719c27c5 e643dfdf 50000000   |....q.'..C..P...|
000000d0     00000000 00000000 00000000 00000000   |................|
********
00000150     02007108 00000000 4a03134a 40000000   |..q.....J..J@...|
00000160     9c4670e2 7ec24946 de09ff13 f31c3647   |.Fp.~.IF......6G|
00000170     f2a41648 3ae78b97 cd4f0177 d4f83407   |...H:....O.w..4.|
00000180     37c6cdb0 4f089868 24b217b1 d37e9544   |7...O..h$....~.D|
00000190     371bd914 eb79725b ef68a79f d50b4dd5   |7....yr[.h....M.|

To start iptrace on LPAR, do:

# startsrc -s iptrace -a "-b -p 657 /tmp/iptrace.bin"

To turn on PRM trace, on LPAR do:

# /usr/sbin/rsct/bin/rmctrace -s ctrmc -a PRM=100

Monitor the /var/ct/3410054220/log/mc/default file on the LPAR and make sure you see new 2610-217 errors being logged after starting the trace. You may need to wait 10 minutes, since one 2610-217 entry is logged every 10 minutes. To monitor the default file, do:

# tail -f /var/ct/3410054220/log/mc/default

To stop iptrace, on LPAR do:

# stopsrc -s iptrace

To stop PRM trace, on LPAR do:

# /usr/sbin/rsct/bin/rmctrace -s ctrmc -a PRM=0

To format the iptraces, do:

# ipreport -rns /tmp/iptrace.bin > /tmp/ipreport.out

Collect ctsnap data, on LPAR do:

# ctsnap -x runrpttr

When analyzing the data you may find several nodeid's in the packets.

On the HMC side, you can run /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc to find out whether a node ID found in the packets is managed by the HMC. In this example that is 22758085eb959fec, which appears byte-reversed at offset 0x10 of the packet shown above. You need root access on the HMC to run this command; you can get a temporary password from IBM to use with the pesh command as the hscpe user to obtain this root access. The command lists the managed systems known to the HMC and their node IDs.

Then, on the actual LPARs, run /usr/sbin/rsct/bin/lsnodeid to determine the node ID of each LPAR. If you find any discrepancies between the node IDs listed on the HMC and the node IDs found on the LPARs, then that is what causes the errpt message about the change of the log file.
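
Once both lists have been collected, the comparison can be scripted. This is a minimal sketch, assuming you have saved two plain-text files with one "hostname nodeid" pair per line. That file layout and the function name compare_nodeids are assumptions for illustration; they are not the output format of the RSCT commands.

```shell
# compare_nodeids: hypothetical helper.
# $1 = file with "hostname nodeid" pairs as known to the HMC
# $2 = file with "hostname nodeid" pairs collected with lsnodeid on the LPARs
# Prints one line per host whose node ID differs or is unknown to the HMC.
compare_nodeids() {
    awk '
        NR == FNR     { hmc[$1] = $2; next }      # first file: the HMC view
        !($1 in hmc)  { print $1 ": not known to HMC"; next }
        hmc[$1] != $2 { print $1 ": HMC has " hmc[$1] ", LPAR reports " $2 }
    ' "$1" "$2"
}
```

Every host that is printed is a candidate for the recfgct procedure; no output means the two lists agree.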

To solve this, you have to recreate the RMC daemon databases on both the HMC and the LPARs that have this issue. On the HMC side, run:

# /usr/sbin/rsct/bin/rmcctrl -z
# /usr/sbin/rsct/bin/rmcctrl -A
# /usr/sbin/rsct/bin/rmcctrl -p

Then run /usr/sbin/rsct/install/bin/recfgct on the LPARs:

# /usr/sbin/rsct/install/bin/recfgct
0513-071 The ctcas Subsystem has been added.
0513-071 The ctrmc Subsystem has been added.
0513-059 The ctrmc Subsystem has been started.
Subsystem PID is 194568.
# /usr/sbin/rsct/bin/lsnodeid
6bcaadbe9dc8904f

Repeat this for every LPAR connected to the HMC. After that, you can run on the HMC again:

# /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc
# /usr/sbin/rsct/bin/lsrsrc IBM.ManagedNode Hostname UniversalId

After that, all you have to do is check on the LPARs if any messages are logged in 10 minute intervals:

# ls -als /var/ct/IW/log/mc/default

Topics: AIX, System Admin

Sdiff

A very useful command to compare 2 files is sdiff. Let's say you want to compare the lslpp output from 2 different hosts; sdiff -s then shows only the differences between the two files, next to each other:

# sdiff -s /tmp/a /tmp/b
                                  >  bos.loc.com.utf          5.3.9.0
                                  >  bos.loc.utf.EN_US        5.3.0.0
                                  >                                    
gskta.rte               7.0.3.27  |  gskta.rte               7.0.3.17
lum.base.cli             5.1.2.0  |  lum.base.cli             5.1.0.0
lum.base.gui             5.1.2.0  |  lum.base.gui             5.1.0.0
lum.msg.en_US.base.cli   5.1.2.0  |  lum.msg.en_US.base.cli   5.1.0.0
lum.msg.en_US.base.gui   5.1.2.0  |  lum.msg.en_US.base.gui   5.1.0.0
rsct.basic.sp           2.4.10.0  <
                                  <
rsct.compat.basic.sp    2.4.10.0  <
                                  <
rsct.compat.clients.sp  2.4.10.0  <
                                  <
rsct.opt.fence.blade    2.4.10.0  <
rsct.opt.fence.hmc      2.4.10.0  <
bos.clvm.enh             5.3.8.3  |  bos.clvm.enh            5.3.0.50
lum.base.cli             5.1.2.0  |  lum.base.cli             5.1.0.0
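
To try this without two real hosts, here is a minimal sketch using two small fileset lists. The file contents and the temporary file names are made up for illustration.

```shell
# Build two small, made-up fileset lists
a=/tmp/sdiff_demo_a.$$
b=/tmp/sdiff_demo_b.$$
printf '%s\n' 'gskta.rte 7.0.3.27' 'lum.base.cli 5.1.2.0' 'rsct.basic.sp 2.4.10.0' > "$a"
printf '%s\n' 'gskta.rte 7.0.3.17' 'lum.base.cli 5.1.2.0' > "$b"

# -s suppresses identical lines. sdiff exits non-zero when the files
# differ, so the exit status is deliberately ignored here.
result=$(sdiff -s "$a" "$b" || true)
echo "$result"

rm -f "$a" "$b"
```

The line that is identical in both files is left out; a changed line is marked with | and a line that only exists in the first file with <.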

Topics: AIX, LVM, System Admin

How to mount/unmount an ISO CD-ROM image as a local file system

To mount:

Build a logical volume (the size of an ISO image, better if a little bigger).
Create an entry in /etc/filesystems using that logical volume (LV), but set its Virtual File System (VFS) to cdrfs.
Create the mount point for this LV/ISO.
Copy the ISO image to the LV using dd.
Mount and work on it like a mounted CD-ROM.

The entry in /etc/filesystems should look like:

/IsoCD:
        dev = /dev/lv09
        vfs = cdrfs
        mount = false
        options = ro
        account = false
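
The mount steps can be sketched as a dry run that only prints the commands. The LV name lv09 and the mount point /IsoCD come from the stanza above. The volume group rootvg, the 64 MB physical partition size, the ISO path, the dd block size, and both helper names are assumptions for illustration.

```shell
# pps_needed: number of physical partitions needed to hold an image
# $1 = size of the ISO image in MB, $2 = physical partition size in MB
pps_needed() {
    # integer ceiling division, so the LV is never smaller than the image
    echo $(( ($1 + $2 - 1) / $2 ))
}

# print_iso_mount_plan: echo the commands instead of running them
# $1 = path to the ISO image, $2 = ISO size in MB, $3 = PP size in MB
print_iso_mount_plan() {
    echo "mklv -y lv09 rootvg $(pps_needed "$2" "$3")"
    echo "# add the /IsoCD stanza to /etc/filesystems by hand"
    echo "mkdir -p /IsoCD"
    echo "dd if=$1 of=/dev/lv09 bs=64k"
    echo "mount /IsoCD"
}

print_iso_mount_plan /tmp/image.iso 700 64
```

Printing the plan first makes it easy to check the LV size before anything is created.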

To unmount:

Unmount the file system.
Destroy the logical volume.

Topics: AIX, Backup & restore, Storage, System Admin

JFS2 snapshots

JFS2 file systems allow you to create file system snapshots. Creating a snapshot is actually creating a new file system, with a copy of the metadata of the original file system (the snapped FS). The snapshot (like a photograph) remains unchanged, so it's possible to back up the snapshot while the original data can be used (and changed!) by applications. When data on the original file system changes while a snapshot exists, the original data is first copied to the snapshot to keep the snapshot in a consistent state. These copies need space, so you have to create the snapshot with a specific size to allow updates while the snapshot exists. Usually 10% is enough. Database file systems are usually not a very good subject for snapshots, because database files change constantly while the database is active, causing a lot of copying of data from the original to the snapshot file system.

In order to have a snapshot you have to:

Create and mount a JFS2 file system (source FS). You can find it in SMIT as "enhanced" file system.
Create a snapshot of a size big enough to hold the changes of the source FS by issuing smitty crsnap. Once you have created this snapshot as a logical device or logical volume, there's a read-only copy of the data in source FS. You have to mount this device in order to work with this data.
Mount your snapshot device by issuing smitty mntsnap. You have to provide a directory name over which AIX will mount the snapshot. Once mounted, this device will be read-only.

Creating a snapshot of a JFS2 file system:

# snapshot -o snapfrom=$FILESYSTEM -o size=${SNAPSIZE}M

Where $FILESYSTEM is the mount point of your file system and $SNAPSIZE is the number of megabytes to reserve for the snapshot.
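
Following the 10% rule of thumb mentioned earlier, $SNAPSIZE can be derived from the size of the source file system. This is a minimal sketch, assuming the 10% refers to the size of the source file system and that you already know that size in megabytes. The function name snap_size_mb and the example mount point /data are hypothetical. The snapshot command is only printed, not executed.

```shell
# snap_size_mb: size to reserve for a snapshot, in MB
# $1 = size of the source file system in MB
# $2 = percentage to reserve (defaults to 10)
snap_size_mb() {
    pct=${2:-10}
    # round up, so that a small file system never ends up with size 0
    echo $(( ($1 * pct + 99) / 100 ))
}

FILESYSTEM=/data                  # example mount point (assumption)
SNAPSIZE=$(snap_size_mb 20480)    # a 20 GB file system
echo "snapshot -o snapfrom=$FILESYSTEM -o size=${SNAPSIZE}M"
```

For a file system that changes a lot while the snapshot exists, pass a higher percentage as the second argument.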

Check if a file system holds a snapshot:

# snapshot -q $FILESYSTEM

When the snapshot runs full, it is automatically deleted. Therefore, create it large enough to hold all changed data of the source FS.

Mounting the snapshot:

Create a directory:

# mkdir -p /snapshot$FILESYSTEM

Find the logical device of the snapshot:

# SNAPDEVICE=`snapshot -q $FILESYSTEM | grep -v ^Snapshots | grep -v ^Current | awk '{print $2}'`

Mount the snapshot:

# mount -v jfs2 -o snapshot $SNAPDEVICE /snapshot$FILESYSTEM

Now you can back up your data from the mount point you've just mounted.

When you're finished with the snapshot:

Unmount the snapshot filesystem:

# unmount /snapshot$FILESYSTEM

Remove the snapshot:

# snapshot -d $SNAPDEVICE

Remove the mount point:

# rm -rf /snapshot$FILESYSTEM

When you restore data from a snapshot, be aware that the backup of the snapshot is actually a different file system in your backup system, so you have to specify a restore destination to restore the data to.

Topics: AIX, Security, System Admin

Portmir

A very nice command to use when you either want to show someone remotely how to do something on AIX, or to allow a non-root user to have root access, is portmir.

First of all, you need 2 users logged into the system, you and someone else. Ask the other person to run the tty command in his/her telnet session and to tell you the result. For example:

user$ tty
/dev/pts/1

Next, start the portmirror in your own telnet session:

root# portmir -t /dev/pts/1

(Of course, fill in the correct number of your system; it won't be /dev/pts/1 all the time everywhere!)

Now every command on screen 1 is repeated on screen 2, and vice versa. You can both run commands on 1 screen.

You can stop it by running:

# portmir -o

If you're the root user and the other person temporarily requires root access to do something (and you can't solve it by giving the other user sudo access, hint, hint!), then you can su - to root in the portmir session, allowing the other person to have root access, while you can see what he/she is doing.

You may run into issues when you resize a screen, or if you use different types of terminals. Make sure you both have the same $TERM setting, for example xterm. If you resize the screen and the other person doesn't, you may need to run the tset and/or the resize commands.

Topics: AIX, Storage, System Admin

Burning AIX ISO files on CD

If you wish to put AIX files on a CD, you *COULD* use Windows. But, Windows files have certain restrictions on file length and permissions. Also, Windows can't handle files that begin with a dot, like ".toc", which is a very important file if you wish to burn installable filesets on a CD.

How do you solve this problem?

Put all files you wish to store on a CD in a separate directory, like: /tmp/cd
Create an ISO file of this directory. You'll need mkisofs to accomplish this. This is part of the AIX Toolbox for Linux. You can find it in /opt/freeware/bin.
# mkisofs -o /path/to/file.iso -r /tmp/cd
This will create a file called file.iso. Make sure you have enough storage space.
Transfer this file to a PC with a CD-writer in it.
Burn this ISO file to CD using Easy CD Creator or Nero.
The CD will be usable in any AIX CD-ROM drive.

Topics: AIX, System Admin

Printing to a file

To create a printer queue that dumps its contents to /dev/null:

# /usr/lib/lpd/pio/etc/piomkpq -A 'file' -p 'generic' -d '/dev/null' -D asc -q 'qnull'

This command will create a queue named "qnull", which dumps its output to /dev/null.

To print to a file, do exactly the same, but change /dev/null to the /complete/path/to/your/filename you'd like to print to, and give the queue its own name (the example below assumes a queue named qfile). Make sure the file you're printing to exists and has the proper access rights.

Now you can print to this file queue:

# lpr -Pqfile /etc/motd

and the contents of your print will be written to a file.

Topics: AIX, System Admin

Calculating dates in Korn Shell

Let's say you wish to calculate with dates within a Korn Shell script, for example "current date minus 7 days". How do you do it? There's a tiny C program that can do this for you, called ctimed. You can download it here: ctimed.tar.

The ctimed executable uses the UNIX Epoch time for its calculations. UNIX counts the number of seconds passed since January 1, 1970, 00:00.

So, how many seconds have passed since 1970?

# current=`./ctimed now`

This should give you a number well over 1 billion.

How many seconds is 1 week? (7 days, 24 hours a day, 60 minutes an hour, 60 seconds a minute):

# let week=7*24*60*60

Subtract one week from the current time:

# let aweekago="$current-$week"

Convert this into human readable format:

# ./ctimed $aweekago

You should get something like: Sat Sep 17 13:50:26 2005
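
The same calculation can be written as one short script. This sketch assumes that your date command supports +%s to print the epoch time, as a stand-in for ./ctimed now. The function name days_before is hypothetical. Converting the result back into a readable date is still left to ctimed.

```shell
# days_before: subtract a number of days from an epoch timestamp
# $1 = epoch time in seconds, $2 = number of days
days_before() {
    echo $(( $1 - $2 * 24 * 60 * 60 ))
}

current=$(date +%s)       # assumption: date supports +%s
aweekago=$(days_before "$current" 7)
echo "now:        $current"
echo "a week ago: $aweekago"
```

The value of $aweekago can then be passed to ./ctimed in the same way as above.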

