Tech Blog

These are blog entries written by the UNIX Health Check development team. Our team has extensive technical experience on both AIX and Red Hat systems, and we like to share our knowledge with our visitors.

Topics: GPFS

GPFS introduction

GPFS is a concurrent file system. It is a product of IBM and is short for General Parallel File System. It is a high-performance shared-disk file system that provides fast data access from all nodes in a homogeneous or heterogeneous cluster of IBM UNIX servers running either the AIX or the Linux operating system.

All nodes in a GPFS cluster have the same GPFS journaled filesystem mounted, allowing multiple nodes to be active at the same time on the same data.

A specific use for GPFS is RAC, Oracle's Real Application Cluster. In a RAC cluster multiple instances are active (sharing the workload) and provide a near "Always-On" database operation. The Oracle RAC software relies on IBM's HACMP software to achieve high availability for the hardware and the AIX operating system. For storage it utilizes the concurrent file system GPFS.

Data availability

GPFS is fault tolerant and can be configured for continued access to data even if cluster nodes or storage systems fail. This is accomplished through robust clustering features and support for data replication. GPFS continuously monitors the health of the file system components. When failures are detected, appropriate recovery action is taken automatically. Extensive logging and recovery capabilities are provided, which maintain metadata consistency when application nodes holding locks or performing services fail. Data replication is available for journal logs, metadata and data. Replication allows for continuous operation even if a path to a disk, or a disk itself, fails. GPFS Version 3.2 further enhances clustering robustness with connection retries: if the LAN connection to a node fails, GPFS will automatically try to reestablish the connection before marking the node unavailable. This provides better uptime in environments experiencing network issues. Using these features along with a high availability infrastructure ensures a reliable enterprise storage solution.

GPFS interaction with AIX

GPFS is a means to provide a journaled filesystem that can be mounted on multiple nodes simultaneously. GPFS stripes the data across all disks that belong to that file system. GPFS takes a somewhat different approach to AIX volume groups and disks than we are used to; mirroring is also done in a different way.

A standard AIX setup has a device relationship that follows these rules: A volume group is created that holds one or more physical disks. A disk contains one or more logical volumes, or a logical volume may span multiple disks. There is a one-to-one relation between a logical volume and the filesystem it contains. With LVM mirroring, each logical partition of a logical volume is placed on two separate disks. This typical setup is shown in the figure below:

The original AIX filesystem structure.

In a SAN environment, this picture looks like this:

The AIX filesystem structure in a SAN environment.

Each GPFS volume group contains only 1 (one) physical disk. Each disk contains only 1 (one) logical volume. Each filesystem contains multiple logical volumes (one for each disk). LVM mirroring is not supported (there is only one disk in a volume group). This translates into the following picture:

The GPFS filesystem structure in a SAN environment.

In GPFS 2.3 the GPFS volumes are called Network Shared Disks (NSDs), each of which contains only one physical disk. No volume groups or logical volumes are created in this GPFS version. In clusters migrated from GPFS 2.2 to GPFS 2.3 you will still see volume groups and logical volumes, but only for the "old" disks. New disks and filesystems will be created without them.

Let's change the picture to a more "stack"-like representation. Here you see one GPFS filesystem that is made up of four separate disks. The AIX multipath software has created the hdisk and vpath devices.

At the AIX level GPFS creates a separate volume group for each disk, so 4 volume groups in total. GPFS fills each disk with one logical volume, so 4 logical volumes in total. These logical volumes are represented as disks in the GPFS configuration. These GPFS disks are used in the filesystem. A file stored in the filesystem is striped across the four disks (in 8 KB blocks). The command used to create the GPFS disks is mmcrlv.

The stacked GPFS filesystem structure.
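As a sketch of how such GPFS disks were created: the commands below are a hedged example only. The exact descriptor file syntax varies per GPFS release, and the vpath names and failure group numbers here are hypothetical.

```shell
# Hypothetical disk descriptor file, one line per disk.
# Approximate format: DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup
cat > /tmp/gpfsdisks <<'EOF'
vpath1:::dataAndMetadata:1
vpath2:::dataAndMetadata:1
vpath3:::dataAndMetadata:2
vpath4:::dataAndMetadata:2
EOF

# GPFS 2.2: create one volume group and one logical volume per disk.
mmcrlv -F /tmp/gpfsdisks
```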

Usually, small LUNs of only 17.5 GB are used instead of big LUNs (of, say, 400 GB), for performance reasons.

Mirroring versus replication

Traditional AIX mirroring on the logical volume level cannot be done in a typical GPFS device setup. The volume group holds only one disk, which is completely filled with one logical volume, so there is no possible destination for the second copy of the logical volume's logical partitions. GPFS provides replication as the alternative.

GPFS provides a structure called replication as a means of surviving a disk failure. At the file level you can specify how many copies of that file must be present in the filesystem (one or two). When you specify two copies, GPFS duplicates the file across two "failure groups". Setting replication at the file level is error-prone, as it can easily be forgotten. It is also possible to specify this globally at the filesystem level: set "Default number of replicas = 2" and "Maximum number of replicas = 2" on each GPFS filesystem, so that every file in all GPFS filesystems is automatically replicated. Keep in mind that replication stores both file copies in the same filesystem. Each file uses twice the amount of space, so the filesystem's free space drops twice as fast. An example: the free space in the filesystem is 15 MB. You want to save a file of 10 MB; the result is a FILE SYSTEM FULL error. The reason is that you need at least 20 MB of free space to hold both copies of the file!
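The space bookkeeping behind that example is simple enough to sketch in a few lines of generic shell arithmetic (this is an illustration, not a GPFS command):

```shell
# With replication set to 2, every file consumes twice its size on disk.
FREE_MB=15                     # free space reported by the filesystem
FILE_MB=10                     # size of the file we want to save
NEEDED_MB=$((FILE_MB * 2))     # two copies, one per failure group

if [ "$FREE_MB" -lt "$NEEDED_MB" ]; then
    echo "FILE SYSTEM FULL: need ${NEEDED_MB} MB, only ${FREE_MB} MB free"
else
    echo "OK: file fits"
fi
```

With the values above this prints the FILE SYSTEM FULL message, because 20 MB is needed and only 15 MB is free.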

Failure groups

GPFS groups disks into "failure groups". A failure group is a collection of disks that share a single point of failure (SPOF). In a SAN setup there is usually only one SPOF left for a disk: all disks are usually multipathed, so a single Host Bus Adapter (HBA) failure is no problem, and all systems can be connected to two separate SAN fabrics, so a fabric failure is also no problem. However, each disk is hosted by one ESS; when the ESS fails, all disks in that ESS fail with it. If you have a second ESS, you can protect against this failure by using failure groups. GPFS uses failure groups to prevent both replication copies of a file from failing at the same time. It does this by writing the two copies of a file to disks in separate failure groups.

Each file copy in a separate failure group.

In the example above you see that the file is written twice in the filesystem. One copy is striped across LUNs 1 and 2, and the other copy is striped across LUNs 3 and 4. When ESS1 fails, the second copy of the file is still completely usable on ESS2.

Striping

Large files in GPFS are divided into equal-sized blocks, and consecutive blocks are placed on different disks in a round-robin fashion. To minimize seek overhead, the block size is large (typically 256 KB). Large blocks have the advantage that they allow a large amount of data to be retrieved in a single I/O from each disk. GPFS stores small files (and the ends of large files) in smaller units called sub-blocks, which are as small as 1/32 of the size of a full block. Striping works best when disks have equal size and performance. This is why you should use one disk size for data storage in a filesystem; do not mix and match large and small LUNs.
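The sub-block size follows directly from the block size; as a quick check:

```shell
BLOCK_KB=256                      # typical GPFS block size in KB
SUB_BLOCK_KB=$((BLOCK_KB / 32))   # a sub-block is 1/32 of a full block
echo "sub-block size: ${SUB_BLOCK_KB} KB"   # prints: sub-block size: 8 KB
```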

GPFS transaction log

Just like JFS, GPFS is a journaled filesystem. GPFS records all metadata updates that affect file system consistency in a journal log. Each node has a separate log for each file system it mounts, stored in that file system. Because this log can be read by all other nodes, any node can perform recovery on behalf of a failed node; it is not necessary to wait for the failed node to come back to life. After a failure, file system consistency is restored quickly by simply re-applying all updates recorded in the failed node's log. Once the updates described by a log record have been written back to disk, the log record is no longer needed and can be discarded. Thus, logs can be of fixed size, because space in the log can be freed up at any time by flushing "dirty" metadata back to disk in the background.

GPFS data and metadata

The GPFS filesystem contains two types of data: data and metadata. "Data" means the actual files you want to store in the filesystem; this is the usable storage space. "Metadata" refers to all sorts of information used internally by GPFS. For each GPFS disk you can specify what it will contain: dataAndMetadata, metadataOnly, dataOnly or descOnly. "dataAndMetadata" is used for normal disks, so most disks in the system will have this designation. "descOnly" is used for "quorum busters".

GPFS filesystem descriptor quorum

There is a structure in GPFS called the filesystem descriptor (FSDesc) that is originally written to every disk in the filesystem, but is updated only on a subset of the disks as changes to the filesystem occur, such as adding or deleting disks. This subset is usually a set of three or five disks, depending on how many disks and failure groups are in the filesystem. The disks that constitute this subset can be found by reading any one of the FSDesc copies on any disk; the FSDesc may point to other disks where more up-to-date copies of the FSDesc are located.

To determine the correct filesystem configuration, a quorum of this subset of disks must be online, so that the most up-to-date FSDesc can be found. If there are three special disks, two of the three must be available. GPFS distributes the FSDesc copies across the failure groups. If there are only two failure groups, one failure group has two copies and the other failure group has one copy. In a scenario where one entire failure group disappears all at once: if the downed failure group holds only the single FSDesc copy (the minority of the quorum), everything stays up; if it holds the majority of the quorum, the FSDesc cannot be updated and the filesystem must be force-unmounted. If the disks fail one at a time, the FSDesc is moved to a new subset of disks by updating the two remaining copies and a copy on a new disk. However, if two of the three disks fail simultaneously, the FSDesc copies cannot be updated to reflect the new quorum configuration. In this case, the filesystem must be unmounted to preserve existing data integrity. To survive a single ESS failure in a dual-ESS configuration, there must therefore be a third failure group on an independent disk outside both ESSs (the so-called TieBreaker node, which holds one disk per filesystem containing the third FSDesc copy).
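The majority rule above can be sketched as a tiny shell function (an illustration of the quorum arithmetic only, not a GPFS interface):

```shell
# Of the FSDesc replica disks, a strict majority must remain online.
fsdesc_quorum() {
    total=$1    # number of FSDesc replica disks (usually 3 or 5)
    online=$2   # how many of them are still reachable
    if [ "$online" -gt $((total / 2)) ]; then
        echo "filesystem stays mounted"
    else
        echo "filesystem is force-unmounted"
    fi
}

fsdesc_quorum 3 2   # one FSDesc disk lost: prints "filesystem stays mounted"
fsdesc_quorum 3 1   # two lost at once: prints "filesystem is force-unmounted"
```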

The final picture will be:

Final GPFS on SAN picture.

Taking all things mentioned above in account, the final solution for a GPFS filesystem is:

All files in the filesystem are replicated across two failure groups on two nodes (preferably in two sites). This is controlled by the filesystem setting "default number of replicas = 2". The number of disks that hold data is the same at each of the two sites. The number of disks used for data has no practical limit; you will probably create multiple filesystems for other reasons than the disk limit. These disks also hold a copy of the metadata.

There is a third site with one disk per filesystem used as a quorum buster on the TieBreaker node. These disks hold no data or metadata, only a filesystem descriptor (FSDesc).
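Putting it together, creating such a filesystem could look roughly like this. This is a sketch only: check the mmcrfs documentation for your GPFS release, and note that the device name and descriptor file name below are made up.

```shell
# -m/-M: default/maximum number of metadata replicas
# -r/-R: default/maximum number of data replicas
mmcrfs /gpfs01 /dev/gpfs01 -F /tmp/gpfsdisks -m 2 -M 2 -r 2 -R 2
```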

GPFS software

For GPFS 2.2 the following filesets are installed on each node of the GPFS cluster:
  • mmfs.base.cmds
  • mmfs.base.rte
  • mmfs.gpfs.rte
  • mmfs.gpfsdocs.data
  • mmfs.msg.en_US
For GPFS 2.3 the following filesets are installed on each node of the GPFS cluster:
  • gpfs.base
  • gpfs.msg.en_US
  • gpfs.docs.data
For Oracle RAC using GPFS 2.3, installation of HACMP 5.2 (and RSCT) is required. This is specifically necessary for Oracle RAC, not for GPFS itself.

Topics: GPFS

GPFS & FSCK

There's a special GPFS command for the fsck of a GPFS filesystem: mmfsck.

It works much like a normal fsck. On a mounted filesystem it will only show the lost blocks; for all other checks and repairs, an unmount of the filesystem is necessary.

Topics: AIX, Backup & restore, Monitoring, Red Hat / Linux, Spectrum Protect

Report the end result of a TSM backup

A very easy way of getting a report from a backup is by using the POSTSchedulecmd entry in the dsm.sys file. Add the following entry to your dsm.sys file (which is usually located in /usr/tivoli/tsm/client/ba/bin or /opt/tivoli/tsm/client/ba/bin):

POSTSchedulecmd "/usr/local/bin/RunTsmReport"
This entry tells the TSM client to run the script /usr/local/bin/RunTsmReport as soon as it has completed its scheduled command. Now all you need is a script that creates a report from the dsmsched.log file, the file that is written to by the TSM scheduler:
#!/bin/bash
# Create a short report from the end of the TSM scheduler log and mail it.
TSMLOG=/tmp/dsmsched.log
WRKDIR=/tmp
echo "TSM Report from `hostname`" >> ${WRKDIR}/tsmc
# Only look at the last 100 lines of the scheduler log.
tail -100 ${TSMLOG} > ${WRKDIR}/tsma
# Find the (last) summary line; its line number anchors the report.
grep -n "Elapsed processing time:" ${WRKDIR}/tsma | tail -1 > ${WRKDIR}/tsmb
CT2=`cat ${WRKDIR}/tsmb | awk -F":" '{print $1}'`
((CT3 = $CT2 - 14))   # the report starts 14 lines before the summary line
((CT5 = $CT2 + 1 ))   # and ends 1 line after it
CT4=1
# Copy lines CT3 through CT5 into the report.
while read Line1 ; do
   if [ ${CT3} -gt ${CT4} ] ; then
      ((CT4 = ${CT4} + 1 ))
   else
      echo "${Line1}" >> ${WRKDIR}/tsmc
      ((CT4 = ${CT4} + 1 ))
      if [ ${CT4} -gt ${CT5} ] ; then
         break
      fi
   fi
done < ${WRKDIR}/tsma
mail -s "`hostname` Backup" email@address.com < ${WRKDIR}/tsmc
rm ${WRKDIR}/tsma ${WRKDIR}/tsmb ${WRKDIR}/tsmc

Topics: EMC, SAN, Storage

BCV issue with Solution Enabler

There is a known bug on AIX with Solutions Enabler, the software responsible for BCV backups: hdiskpower devices disappear when a server is rebooted, and you need to run the following command to make them come back. BCV devices are only visible on the target servers.

# /usr/lpp/EMC/Symmetrix/bin/mkbcv -a ALL
hdisk2 Available
hdisk3 Available
hdisk4 Available
hdisk5 Available
hdisk6 Available
hdisk7 Available
hdisk8 Available
hdiskpower1 Available
hdiskpower2 Available
hdiskpower3 Available
hdiskpower4 Available

Topics: EMC, SAN, Storage

Reset reservation bit

If you run into not being able to access an hdiskpowerX disk, you may need to reset the reservation bit on it:

# /usr/lpp/EMC/Symmetrix/bin/emcpowerreset fscsiX hdiskpowerX

Topics: EMC, SAN, Storage

EMC Grab

EMC Grab is a utility that is run locally on each host and gathers storage-specific information (driver versions, storage-technical details, etc.). The EMC Grab report is created as a zip file, which can be sent to EMC support.

You can download the "Grab Utility" from the EMC support website.

When you've downloaded EMC Grab and stored it in a temporary location on the server, such as /tmp/emc, untar it using:
tar -xvf *tar
Then run:
/tmp/emc/emcgrab/emcgrab.sh
The script is interactive and finishes after a couple of minutes.

Topics: Hardware, SAN, SDD, Storage

How-to replace a failing HBA using SDD storage

This is a procedure for replacing a failing HBA (fibre channel adapter) when used in combination with SDD storage:

  • Determine which adapter is failing (0, 1, 2, etcetera):
    # datapath query adapter
  • Check if there are dead paths for any vpaths:
    # datapath query device
  • Try to set a "degraded" adapter back to online using:
    # datapath set adapter 1 offline
    # datapath set adapter 1 online
    (that is, if adapter "1" is failing, replace it with the correct adapter number).
  • If the adapter is still in a "degraded" status, open a call with IBM. They will most likely require you to take a snap of the system and send the snap file to IBM for analysis, after which they will conclude whether or not the adapter needs to be replaced.
  • Involve the SAN storage team if the adapter needs to be replaced. They will have to update the WWN of the failing adapter when it is replaced with a new one that has a new WWN.
  • If the adapter needs to be replaced, wait for the IBM CE to be onsite with the new HBA adapter. Note the new WWN and supply that to the SAN storage team.
  • Remove the adapter:
    # datapath remove adapter 1
    (replace the "1" with the correct adapter that is failing).
  • Check if the vpaths now all have one less path:
    # datapath query device | more
  • De-configure the adapter (this will also de-configure all the child devices, so you won't have to do this manually), by running: diag, choose Task Selection, Hot Plug Task, PCI Hot Plug manager, Unconfigure a Device. Select the correct adapter, e.g. fcs1, set "Unconfigure any Child Devices" to "yes", and "KEEP definition in database" to "no". Hit ENTER.
  • Replace the adapter: Run diag and choose Task Selection, Hot Plug Task, PCI Hot Plug manager, Replace/Remove a PCI Hot Plug Adapter. Choose the correct device (be careful, you won't see the adapter name here, but only "Unknown", because the device was unconfigured).
  • Have the IBM CE replace the adapter.
  • Close any events on the failing adapter on the HMC.
  • Validate that the notification LED is now off on the system, if not, go back into diag, choose Task Selection, Hot Plug Task, PCI Hot Plug Manager, and Disable the attention LED.
  • Check the adapter firmware level using:
    # lscfg -vl fcs1
    (replace this with the actual adapter name).

    And if required, update the adapter firmware microcode. Validate if the adapter is still functioning correctly by running:
    # errpt
    # lsdev -Cc adapter
  • Have the SAN admin update the WWN.
  • Run:
    # cfgmgr -S
  • Check the adapter and the child devices:
    # lsdev -Cc adapter
    # lsdev -p fcs1
    # lsdev -p fscsi1
    (replace this with the correct adapter name).
  • Add the paths to the device:
    # addpaths
  • Check if the vpaths have all paths again:
    # datapath query device | more

Topics: EMC, Installation, SAN, Storage

EMC and MPIO

You can run into an issue with EMC storage on AIX systems using MPIO (No Powerpath) for your boot disks:

After installing the ODM definitions of EMC Symmetrix on your client system, the system won't boot any more and will hang with LED 554 (unable to find boot disk).

The boot hang (LED 554) is not caused by the EMC ODM package itself, but by the boot process not detecting a path to the boot disk if the first MPIO path does not correspond to the fscsiX driver instance where all hdisks are configured. Let me explain that in more detail:

Let's say we have an AIX system with four HBAs configured in the following order:

# lscfg -v | grep fcs
fcs2 (wwn 71ca) -> no devices configured behind this fscsi2 driver instance (path only configured in CuPath ODM table)
fcs3 (wwn 71cb) -> no devices configured behind this fscsi3 driver instance (path only configured in CuPath ODM table)
fcs0 (wwn 71e4) -> no devices configured behind this fscsi0 driver instance (path only configured in CuPath ODM table)
fcs1 (wwn 71e5) -> ALL devices configured behind this fscsi1 driver instance
Looking at the MPIO path configuration, here is what we have for the rootvg disk:
# lspath -l hdisk2 -H -F"name parent path_id connection status"
name   parent path_id connection                      status
hdisk2 fscsi0 0       5006048452a83987,33000000000000 Enabled
hdisk2 fscsi1 1       5006048c52a83998,33000000000000 Enabled
hdisk2 fscsi2 2       5006048452a83986,33000000000000 Enabled
hdisk2 fscsi3 3       5006048c52a83999,33000000000000 Enabled
The fscsi1 driver instance is the second path (path_id 1), so remove the other 3 paths, keeping only the path corresponding to fscsi1:
# rmpath -l hdisk2 -p fscsi0 -d
# rmpath -l hdisk2 -p fscsi2 -d
# rmpath -l hdisk2 -p fscsi3 -d
# lspath -l hdisk2 -H -F"name parent path_id connection status"
Afterwards, do a savebase to update the boot logical volume hd5. Set the bootlist to hdisk2 and reboot the host.

It will come up successfully, with no more LED 554 hang.

When checking the status of the rootvg disk, a new hdisk10 has been configured with the correct ODM definitions as shown below:
# lspv
hdisk10 0003027f7f7ca7e2 rootvg active
# lsdev -Cc disk
hdisk2 Defined   00-09-01 MPIO Other FC SCSI Disk Drive
hdisk10 Available 00-08-01 EMC Symmetrix FCP MPIO Raid6
To summarize: it is recommended to set up ONLY ONE path when installing AIX on a SAN disk, then install the EMC ODM package, reboot the host, and only after that add the other paths. By doing that, we ensure that the fscsiX driver instance used for the boot process has the hdisk configured behind it.

Topics: Monitoring, PowerHA / HACMP, Security

HACMP 5.4: How to change SNMP community name from default "public" and keep clstat working

HACMP 5.4 supports changing the default community name from "public" to something else. SNMP is used for clstat (clinfoES) communications. Using the "public" SNMP community name can be a security vulnerability, so changing it is advisable.

First, find out what version of SNMP you are using:

# ls -l /usr/sbin/snmpd
lrwxrwxrwx 1 root system 9 Sep 08 2008 /usr/sbin/snmpd -> snmpdv3ne
(In this case, it is using version 3).

Make a copy of your configuration file. It is located in /etc:
/etc/snmpd.conf <- Version 1
/etc/snmpdv3.conf <- Version 3
Edit the file and replace "public" wherever it is mentioned with your new community name. Make sure to use no more than 8 characters for the new community name.
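The edit itself can be scripted with sed. The demo below works on a scratch file with made-up sample lines; on AIX you would apply the same substitution to a backup copy of /etc/snmpdv3.conf and review the result before putting it in place:

```shell
NEW=newcomm                     # new community name (8 characters or less)
CONF=/tmp/snmpdv3.conf.demo     # scratch copy for this demo

# Two sample lines as they might appear in snmpdv3.conf (made up for the demo).
cat > "$CONF" <<'EOF'
COMMUNITY public    public     noAuthNoPriv 0.0.0.0 0.0.0.0 -
VACM_GROUP group1   SNMPv1     public -
EOF

# Replace every occurrence of "public" with the new community name.
sed "s/public/${NEW}/g" "$CONF" > "${CONF}.new"
grep -c "$NEW" "${CONF}.new"    # prints: 2
```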

Change subsystems and restart them:
# chssys -s snmpmibd -a "-c new"
# chssys -s hostmibd -a "-c new"
# chssys -s aixmibd -a "-c new"
# stopsrc -s snmpd
# stopsrc -s aixmibd
# stopsrc -s snmpmibd
# stopsrc -s hostmibd
# startsrc -s snmpd
# startsrc -s hostmibd
# startsrc -s snmpmibd
# startsrc -s aixmibd
Test using your localhost:
# snmpinfo -m dump -v -h localhost -c new -o /usr/es/sbin/cluster/hacmp.defs nodeTable
If the command hangs, something is wrong. Check the changes you made.

If everything works fine, perform the same change in the other node and test again. Now you can test from one server to the other using the snmpinfo command above.

If you need to back out, restore the original configuration file and restart the subsystems. Note that in this case we use empty double quotes, with no space between them:
# chssys -s snmpmibd -a ""
# chssys -s hostmibd -a ""
# chssys -s aixmibd -a ""
# stopsrc -s snmpd
# stopsrc -s aixmibd
# stopsrc -s snmpmibd
# stopsrc -s hostmibd
# startsrc -s snmpd
# startsrc -s hostmibd
# startsrc -s snmpmibd
# startsrc -s aixmibd
Okay, now make the change to clinfoES and restart it on both nodes:
# chssys -s clinfoES -a "-c new"
# stopsrc -s clinfoES
# startsrc -s clinfoES
Wait a few minutes and you should be able to use clstat again with the new community name.

Disclaimer: if any application other than clinfoES uses snmpd with the default community name, you should make changes to it as well. Check with your application team or software vendor.
