UNIX Health Check

Tech Blog

These are blog entries written by the UNIX Health Check development team. Our team has extensive technical experience on both AIX and Red Hat systems, and we like to share our knowledge with our visitors.

Topics: AIX, Installation, System Admin

Installation history

A very easy way to see what was installed recently on your system:

# lslpp -h

Topics: AIX, Installation, System Admin

Alternate disk install

It is very easy to clone your rootvg to another disk, for example for testing purposes. If you wish to install a piece of software without modifying the current rootvg, you can clone the rootvg to a new disk, start your system from that disk, and do the installation there. If it succeeds, you can keep using the new rootvg disk; if it doesn't, you can revert to the old rootvg disk as if nothing ever happened.

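First, make sure every logical volume in the rootvg has a name of 11 characters or less. The copied logical volumes get an "alt_" prefix and logical volume names are limited to 15 characters, so with longer names the alt_disk_copy command will fail.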
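A quick pre-flight check for this can be sketched as follows. It assumes the usual "lsvg -l rootvg" layout (a volume group line, a header line, then one logical volume per line with the name in the first column); the function name long_lv_names is made up for this example.

```shell
# Print logical volume names longer than 11 characters.
# Reads "lsvg -l rootvg" style output on stdin; the first two lines
# (volume group name and column header) are skipped.
long_lv_names() {
    awk 'NR > 2 && length($1) > 11 { print $1 }'
}

# Assumed usage on AIX:  lsvg -l rootvg | long_lv_names
```

If this prints nothing, all logical volume names are short enough.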

To create a copy on hdisk1, type:

# alt_disk_copy -d hdisk1

If you now restart your system from hdisk1, you will notice that the original rootvg has been renamed to old_rootvg. To delete this volume group (once you're satisfied with the new rootvg), type:

# alt_rootvg_op -X old_rootvg

A very good article about alternate disk installs can be found on developerWorks.

If you wish to copy a mirrored rootvg to two other disks, make sure to use quotes around the target disks. For example, to create a copy on disks hdisk4 and hdisk5, run:

# alt_disk_copy -d "hdisk4 hdisk5"

Topics: AIX, System Admin

How to set up the AIX 'boot debugger'

The AIX kernel has an "enter_dbg" variable that can be set at the beginning of boot processing, which causes all boot process output to be sent to the system console. In some cases, this data is useful for debugging boot issues. The procedure for setting up the boot debugger is as follows:

First: Preparing the system.

Set up KDB to present an initial debugger screen:

# bosboot -ad /dev/ipldevice -I

Reboot the server:

# shutdown -Fr

Setting up for Kernel boot trace:

When the debugger screen appears, set enter_dbg to the value we want to use:

************* Welcome to KDB *************
    Call gimmeabreak...
    Static breakpoint:
    .gimmeabreak+000000     tweq     r8,r8    r8=0000000A
    .gimmeabreak+000004      blr
<.kdb_init+0002C0> r3=0
    KDB(0)> mw enter_dbg
    enter_dbg+000000:  00000000  = 42
    xmdbg+000000:  00000000  = .
    KDB(0)> g

Now, detailed boot output will be displayed on the console.

If your system completes booting, you will want to turn enter_dbg off:

************* Welcome to KDB *************
    Call gimmeabreak...
    Static breakpoint:
    .gimmeabreak+000000     tweq     r8,r8    r8=0000000A
    .gimmeabreak+000004      blr
<.kdb_init+0002C0> r3=0
    KDB(0)> mw enter_dbg
    enter_dbg+000000:  00000042  = 0
    xmdbg+000000:  00000000  = .
    KDB(0)> g

When finished using the boot debugger, disable it by running:

# bosdebug -o
# bosboot -ad /dev/ipldevice

Topics: AIX, Hardware, Logical Partitioning

Release adapter after DLPAR

An adapter that was previously added to an LPAR usually cannot simply be removed again, because it is in use by the LPAR. Here's how to find and remove the devices involved on the LPAR:

First, run:

# lsslot -c pci

This lists the PCI slots, so you can find the adapter involved.

Then find the parent device of the adapter by running:

# lsdev -Cl [adapter] -F parent

(Fill in the correct adapter, e.g. fcs0).

Now, remove the parent device and all its children:

# rmdev -Rl [parentdevice] -d

For example:

# rmdev -Rl pci8 -d

Now you should be able to remove the adapter via the HMC from the LPAR.

If the adapter has to be replaced because it is broken, you need to power down the PCI slot in which the adapter is placed:

After issuing the "rmdev" command, run diag and go into "Task Selection", "Hot Plug Task", "PCI Hot Plug Manager", "Replace/Remove a PCI Hot Plug Adapter". Select the adapter and choose "remove".

After the adapter has been replaced (usually by an IBM technician), run cfgmgr to make the adapter known to the LPAR.

Topics: AIX, Networking, System Admin

SCP Stalls

If ssh through a firewall works perfectly, but scp of large files (for example mksysb images) stalls, there is a solution: add "-l 8192" to the scp command.

scp stalls because it greedily grabs as much network bandwidth as possible when it transfers files; any delay caused by the network switch or the firewall can then easily stall the TCP connection.

The option "-l 8192" limits the bandwidth of the scp session to 8192 Kbit/second (1 MB/second), which seems to be both safe and fast enough:

# scp -l 8192 SOURCE DESTINATION
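The arithmetic behind the 1 MB/second figure, as a small sketch: the -l option of scp takes Kbit/second, and eight bits make a byte. The function name is only illustrative.

```shell
# Convert an scp "-l" limit (Kbit/s) to KB/s using integer arithmetic.
kbit_to_kbyte() {
    echo $(( $1 / 8 ))
}

kbit_to_kbyte 8192    # prints 1024, i.e. 1024 KB/s = 1 MB/s
```

The same helper can be used to pick a different limit, for example when the link is known to be slower.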

Topics: AIX, Networking, System Admin

Map a socket to a process

Let's say you want to know what process is tying up port 25000:

# netstat -aAn | grep 25000
f100060020cf1398  tcp4  0  0  *.25000  *.*  LISTEN
f10006000d490c08  stream  0  0  f1df487f8  0  0  0  /tmp/.sapicm25000

So, now let's see what the process is:

# rmsock f100060020cf1398 tcpcb
The socket 0x20cf1008 is being held by proccess 1806748 (icman).
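The two steps above can be combined in a small helper. This is only a sketch: it assumes the column layout of the "netstat -aAn" output shown above (address first, protocol second, local address fifth), and the function name pcb_for_port is made up for this example.

```shell
# Print the PCB address (first column) of TCP sockets whose local
# address ends in ".<port>". Reads "netstat -aAn" output on stdin.
# Matching on the local-address column avoids false hits such as the
# /tmp/.sapicm25000 stream socket that a plain grep also returns.
pcb_for_port() {
    awk -v port="$1" '$2 ~ /^tcp/ && $5 ~ ("\\." port "$") { print $1 }'
}

# Assumed usage on AIX:
#   netstat -aAn | pcb_for_port 25000
#   rmsock <address printed above> tcpcb
```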

If you have lsof installed, you can get the same result with the lsof command:

# lsof -i :[PORT]

Example:

# lsof -i :5710
COMMAND     PID   USER   FD   TYPE     DEVICE  SIZE/OFF NODE NAME
oracle  2638066 oracle   18u  IPv4 0xf1b3f398 0t1716253  TCP host:5710

Topics: AIX, Installation, NIM, System Admin

How to migrate from p5 to p6

If your AIX level is below 5.3 TL06, the easiest way is to upgrade your current OS to at least TL06 (note that the required level depends on the configuration of the POWER6 system), then clone your server and install it on the new p6.

But if you want to avoid an outage, you can do the following using a NIM server (this is not an official IBM procedure, so IBM does not support it):

Create your mksysb resource, but do not create a SPOT from the mksysb.
Create an lpp_source and a SPOT at the minimum TL required (I used TL08).
When you run nim_bosinst, choose the mksysb and the SPOT you created. It will warn that the SPOT is not at the same level as the mksysb; just ignore this.
Do everything necessary to boot from NIM.
While the mksysb is being restored, there is a point where it cannot create the bootlist, because it detects that the OS level is not supported on the p6. It will ask whether to continue and fix it later via SMS, or to fix it right now.
Choose to fix it right now (this opens a shell). You will notice that oslevel still reports the level of the mksysb.
Create an NFS export on the NIM server (or another server) that holds the necessary TL, and mount it on the p6.
Perform the upgrade, change the bootlist, and exit the shell. The server will then boot at the new TL on the p6.

Topics: AIX, System Admin

Duplicate errpt entries

By default, AIX avoids logging duplicate errpt entries. You can see the default settings using smitty errdemon: duplicate entries are checked within a time interval of 10000 milliseconds (10 seconds), and the default duplicate error maximum is 1000. An additional entry is made as soon as either the 10-second duplicate time interval or the maximum of 1000 duplicates is reached, whichever comes first.

Topics: AIX, System Admin

The default log file has been changed

You may encounter the following entry now and then in your errpt:

IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
573790AA 0528212209 I O RMCdaemon The default log file has been changed.

An example of such an entry is:

-----------------------------------------------------------------
LABEL: RMCD_INFO_2_ST
IDENTIFIER: 573790AA

Date/Time: Sun May 17 22:11:46 PDT 2009
Sequence Number: 8539
Machine Id: 00GB214D4C00
Node Id: blahblah
Class: O
Type: INFO
Resource Name: RMCdaemon

Description
The default log file has been changed.

Probable Causes
The current default log file has been renamed and a new log file created.

Failure Causes
The current log file has become too large.

Recommended Actions
No action is required.

Detail Data
DETECTING MODULE
RSCT,rmcd_err.c,1.17,512
ERROR ID
6e0tBL/GsC28/gQH/ne1K//...................
REFERENCE CODE

File name
/var/ct/IW/log/mc/default

This error report entry refers to the file /var/ct/IW/log/mc/default. When this file reaches 256 KB, it is renamed to default.last and a new file is created.

The following messages can be found in this file:

2610-217 Received 193 unrecognized messages in the last 10.183333 minutes. Service is rmc.

This message more or less means:

"2610-217 Received [count of unrecognized messages] unrecognized messages in the last [time] minutes. Service is [service_name].
Explanation:
The RMC daemon has received the specified number of unrecognized messages within the specified time interval. These messages were received on the UDP port, indicated by the specified service name, used for communication among RMC daemons. The most likely cause of this error is that this port number is being used by another application.

User Response:
Validate that the port number configured for use by the Resource Monitoring and Control daemon is only being used by the RMC daemon."

Check if something else is using the port of the RMC daemon:

# grep RMC /etc/services
rmc                      657/tcp                # RMC
rmc                      657/udp                # RMC
# lsof -i :657
COMMAND     PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
rmcd    1384574 root    3u  IPv6 0xf35f20      0t0  UDP *:rmc
rmcd    1384574 root   14u  IPv6 0xf2fd39      0t0  TCP *:rmc (LISTEN)
# netstat -Aan | grep 657
f1000600022fd398 tcp     0   0  *.657    *.*   LISTEN
f10006000635f200 udp     0   0  *.657    *.*
# rmsock f1000600022fd398 tcpcb
The socket 0x22fd008 is being held by proccess 1384574 (rmcd).

No, it is actually the RMC daemon that is using this port, so this is fine.

Start an IP trace to find out who's transmitting to this port:

# iptrace -a -d host1 -p 657 /tmp/trace.out
# ps -ef | grep iptrace
root 2040018 iptrace -a -d host1 -p 657 /tmp/trace.out
# kill 2040018
iptrace: unload success!
# ipreport -n /tmp/trace.out > /tmp/trace.fmt

The IP trace report only shows messages from the RMC daemon of the HMC:

Packet Number 3
====( 458 bytes received on interface en4 )==== 12:12:34.927422418
ETHERNET packet : [14:5e:81:60:9d -> 14:5e:db:29:9a] type 800 (IP)
IP header breakdown:
        < SRC =    10.231.21.55 >  (hmc)
        < DST =    10.231.21.54 >  (host1)
        ip_v=4, ip_hl=20, ip_tos=0, ip_len=444, ip_id=0, ip_off=0 DF
        ip_ttl=64, ip_sum=f8ce, ip_p = 17 (UDP)
UDP header breakdown:
        
        [ udp length = 424 | udp checksum = 6420 ]
00000000     0b005001 f0fff0ff e81fd7bf 01000100   |..P.............|
00000010     ec9f95eb 85807522 02010000 05001100   |......u"........|
00000020     2f001543 a88ba597 4a03134a 50a00200   |/..C....J..JP...|
00000030     00000000 00000000 4ca00200 00000000   |........L.......|
00000040     85000010 00000000 01000000 45a34f3f   |............E.O?|
00000050     fe5dd3e7 3901eb8d 169826cb cc22d391   |.]..9.....&.."..|
00000060     e6045340 e2d4b997 1efc9b78 f0bfce77   |..S@.......x...w|
00000070     487cbbd9 21fda20c f5cf8920 53d2f55a   |H|..!...... S..Z|
00000080     2de3eb9d 62ba1eef 10b80598 e90f1918   |-...b...........|
00000090     9cd9c654 8fb26c66 2ba6f7f0 7d885d34   |...T..lf+...}.]4|
000000a0     aa8d9f39 d2cd7277 7a87b6aa 494bb728   |...9..rwz...IK.(|
000000b0     53dea666 65d92428 e2ad90ed 73869b8d   |S..fe.$(....s...|
000000c0     d1deb7b2 719c27c5 e643dfdf 50000000   |....q.'..C..P...|
000000d0     00000000 00000000 00000000 00000000   |................|
********
00000150     02007108 00000000 4a03134a 40000000   |..q.....J..J@...|
00000160     9c4670e2 7ec24946 de09ff13 f31c3647   |.Fp.~.IF......6G|
00000170     f2a41648 3ae78b97 cd4f0177 d4f83407   |...H:....O.w..4.|
00000180     37c6cdb0 4f089868 24b217b1 d37e9544   |7...O..h$....~.D|
00000190     371bd914 eb79725b ef68a79f d50b4dd5   |7....yr[.h....M.|

To start iptrace on the LPAR, run:

# startsrc -s iptrace -a "-b -p 657 /tmp/iptrace.bin"

To turn on the PRM trace on the LPAR, run:

# /usr/sbin/rsct/bin/rmctrace -s ctrmc -a PRM=100

Monitor the file /var/ct/3410054220/log/mc/default on the LPAR and make sure you see new 2610-217 errors after starting the trace. You may need to wait 10 minutes, since one 2610-217 entry is logged every 10 minutes. To monitor the default file, run:

# tail -f /var/ct/3410054220/log/mc/default

To stop iptrace on the LPAR, run:

# stopsrc -s iptrace

To stop the PRM trace on the LPAR, run:

# /usr/sbin/rsct/bin/rmctrace -s ctrmc -a PRM=0

To format the iptrace output, run:

# ipreport -rns /tmp/iptrace.bin > /tmp/ipreport.out

To collect ctsnap data on the LPAR, run:

# ctsnap -x runrpttr

When analyzing the data, you may find several node IDs in the packets.
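In the hex dump shown earlier, the eight bytes at offset 0x10 (ec9f95eb 85807522), read back to front, give 22758085eb959fec, which has the same form as the output of lsnodeid. That the node ID is stored byte-reversed at that position is an inference from this one trace, not documented behaviour. A small helper (hypothetical name) to do the reversal:

```shell
# Reverse the byte order of a hex string, e.g. as copied from an
# ipreport hex dump. Spaces between the groups are ignored.
reverse_hex_bytes() {
    hex=$(printf '%s' "$1" | tr -d ' ')
    out=""
    while [ -n "$hex" ]; do
        byte=$(printf '%s' "$hex" | cut -c1-2)   # first two hex digits
        hex=$(printf '%s' "$hex" | cut -c3-)     # remainder of the string
        out="$byte$out"                          # prepend, so order flips
    done
    printf '%s\n' "$out"
}

reverse_hex_bytes 'ec9f95eb 85807522'    # prints 22758085eb959fec
```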

On the HMC side, you can run /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc to find out whether a node ID found in the packets, here 22758085eb959fec, is managed by the HMC. This command lists the managed systems known to the HMC and their node IDs. You need root access on the HMC to run it; you can get a temporary password from IBM, which you use with the pesh command as the hscpe user to gain root access.

Then, on the actual LPARs, run /usr/sbin/rsct/bin/lsnodeid to determine the node ID of each LPAR. If you find any discrepancies between the node IDs listed on the HMC and the node IDs found on the LPARs, then that is what causes the errpt message about the change of the log file.

To solve this, you have to recreate the RMC daemon databases on both the HMC and the LPARs that have this issue. On the HMC side, run:

# /usr/sbin/rsct/bin/rmcctrl -z
# /usr/sbin/rsct/bin/rmcctrl -A
# /usr/sbin/rsct/bin/rmcctrl -p

Then run /usr/sbin/rsct/install/bin/recfgct on the LPARs:

# /usr/sbin/rsct/install/bin/recfgct
0513-071 The ctcas Subsystem has been added.
0513-071 The ctrmc Subsystem has been added.
0513-059 The ctrmc Subsystem has been started.
Subsystem PID is 194568.
# /usr/sbin/rsct/bin/lsnodeid
6bcaadbe9dc8904f

Repeat this for every LPAR connected to the HMC. After that, run the following on the HMC again:

# /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc
# /usr/sbin/rsct/bin/lsrsrc IBM.ManagedNode Hostname UniversalId

After that, all you have to do is check on the LPARs whether any messages are still being logged at 10-minute intervals:

# ls -als /var/ct/IW/log/mc/default
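Besides watching the size and timestamp of the file, you could count the 2610-217 messages in it. A minimal sketch, assuming the log is plain text as the "tail -f" usage above suggests; the function name is made up for this example.

```shell
# Count the lines in a log file that mention message 2610-217.
# Prints 0 if there are none.
count_2610_217() {
    awk '/2610-217/ { n++ } END { print n + 0 }' "$1"
}

# Assumed usage on the LPAR:
#   count_2610_217 /var/ct/IW/log/mc/default
```

Run it twice, more than 10 minutes apart: if the count no longer grows, the messages have stopped. Keep in mind that the count starts over when the file is rotated to default.last.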

Topics: AIX, System Admin

Sdiff

A very useful command to compare 2 files is sdiff. Let's say you want to compare the lslpp output from 2 different hosts; sdiff -s then shows the differences between the two files next to each other:

# sdiff -s /tmp/a /tmp/b
                                  >  bos.loc.com.utf          5.3.9.0
                                  >  bos.loc.utf.EN_US        5.3.0.0
                                  >                                    
gskta.rte               7.0.3.27  |  gskta.rte               7.0.3.17
lum.base.cli             5.1.2.0  |  lum.base.cli             5.1.0.0
lum.base.gui             5.1.2.0  |  lum.base.gui             5.1.0.0
lum.msg.en_US.base.cli   5.1.2.0  |  lum.msg.en_US.base.cli   5.1.0.0
lum.msg.en_US.base.gui   5.1.2.0  |  lum.msg.en_US.base.gui   5.1.0.0
rsct.basic.sp           2.4.10.0  <
                                  <
rsct.compat.basic.sp    2.4.10.0  <
                                  <
rsct.compat.clients.sp  2.4.10.0  <
                                  <
rsct.opt.fence.blade    2.4.10.0  <
rsct.opt.fence.hmc      2.4.10.0  <
bos.clvm.enh             5.3.8.3  |  bos.clvm.enh            5.3.0.50
lum.base.cli             5.1.2.0  |  lum.base.cli             5.1.0.0
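One way to produce comparable input files (not necessarily how /tmp/a and /tmp/b above were made) is to reduce the colon-separated output of "lslpp -Lc" to fileset name and level, sorted. The sketch below assumes the fileset is in the second field and the level in the third, and that header lines start with "#"; the function name is made up for this example.

```shell
# Reduce "lslpp -Lc" style output on stdin to sorted "fileset level" lines.
# Lines starting with "#" (the header) are skipped.
fileset_levels() {
    awk -F: '!/^#/ && NF >= 3 { print $2, $3 }' | LC_ALL=C sort
}

# Assumed usage, once on each host:
#   lslpp -Lc | fileset_levels > /tmp/a
```

Sorting both files the same way keeps sdiff from reporting differences that are only due to ordering.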
