UNIX Health Check - System Admin

Tech Blog

These are blog entries written by the UNIX Health Check development team. Our team has extensive technical experience on both AIX and Red Hat systems, and we like to share our knowledge with our visitors.

Topics: AIX, System Admin

Bootinfo

To find out if your machine has a 64 or 32 bit architecture:

# bootinfo -y

To find out which kernel the system is running:

# bootinfo -K

You can also check the link /unix:

# ls -ald /unix

If the link points to unix_mp, the kernel is 32-bit; if it points to unix_64, the kernel is 64-bit.

To find out from which disk your system last booted:

# bootinfo -b

To find out the size of real memory:

# bootinfo -r

To display the hardware platform type:

# bootinfo -T
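The checks above can be combined into a small report. What follows is only a minimal sketch, assuming a POSIX shell. The function names kernel_bits and kb_to_mb are made up for this example, and the report part only runs on a system that actually has bootinfo:

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of AIX.

# Translate the target of the /unix link into the kernel bitness.
kernel_bits() {
    case "$1" in
        *unix_64) echo 64 ;;
        *unix_mp) echo 32 ;;
        *)        echo unknown ;;
    esac
}

# bootinfo -r reports real memory in KB; convert to whole MB.
kb_to_mb() {
    echo $(( $1 / 1024 ))
}

# Only attempt the report on a system that has bootinfo (AIX).
if command -v bootinfo >/dev/null 2>&1 ; then
    echo "Hardware:  $(bootinfo -y) bit"
    echo "Kernel:    $(bootinfo -K) bit"
    echo "Boot disk: $(bootinfo -b)"
    echo "Memory:    $(kb_to_mb "$(bootinfo -r)") MB"
    echo "Platform:  $(bootinfo -T)"
fi
```

To check the link as well, pass its target to the helper, for example: kernel_bits "$(ls -l /unix | awk '{print $NF}')".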

Topics: AIX, PowerHA / HACMP, System Admin

Email messages from the cron daemon

Some user accounts, mostly service accounts, may create a lot of email messages, for example when a lot of commands are run by the cron daemon for a specific user. There are a couple of ways to deal with this:

1. Make sure no unnecessary emails are sent at all

To avoid receiving messages from the cron daemon, always redirect the output of commands in crontabs to a file or to /dev/null. Make sure to redirect STDERR as well:

0 * * * * /path/to/command > /path/to/logfile 2>&1
1 * * * * /path/to/command > /dev/null 2>&1

2. Make sure the commands in the crontab actually exist

An entry in a crontab with a command that does not exist will generate an email message from the cron daemon to the user, informing the user about this issue. This may occur on HACMP clusters, where crontab files are synchronized on all HACMP nodes. They need to be synchronized on all the nodes, just in case a resource group fails over to a standby node. However, the file systems containing the commands may not be available on all the nodes at all times. To get around that, test if the command exists first:

0 * * * * [ -x /path/to/command ] && /path/to/command > /path/to/logfile 2>&1
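If many crontab entries need the same treatment, the two measures above can be combined in one helper. This is just a sketch: the function name run_quiet is made up, and unlike the entries above it appends to the log file instead of overwriting it.

```shell
#!/bin/sh
# Hypothetical helper, not an existing tool.
# Usage: run_quiet LOGFILE COMMAND [ARGS...]
run_quiet() {
    log=$1
    shift
    # Skip silently when the command is missing or not executable,
    # for example when its file system is mounted on another node.
    [ -x "$1" ] || return 0
    # Append both STDOUT and STDERR to the log file, so that
    # cron has no output left to put in an email message.
    "$@" >> "$log" 2>&1
}
```

Saved in a script that ends with run_quiet "$@" (say, a hypothetical /usr/local/bin/cronwrap), a crontab entry then only needs the log file and the command: /usr/local/bin/cronwrap /path/to/logfile /path/to/command.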

3. Clean up the email messages regularly

The last way of dealing with this is to add another cron entry to the user's crontab that cleans out the mailbox every night. For example, the following entry deletes all but the last 1000 messages from the user's mailbox:

0 0 * * * echo d1-$(let num="$(echo f|mail|tail -1|awk '{print $2}')-1000";echo $num)|mail >/dev/null
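The one-liner above takes the number of the last message, subtracts 1000, and hands a delete command for message 1 up to that number to mail. When the mailbox holds 1000 messages or fewer, the result is zero or negative, which is not a meaningful range. Here is a sketch of the same arithmetic with a guard for that case; the function name delete_range is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: print the mail command that deletes all but
# the newest KEEP messages, given the TOTAL number of messages.
delete_range() {
    total=$1
    keep=${2:-1000}
    # With KEEP messages or fewer there is nothing to delete, and
    # a range such as "d1--200" would not make sense to mail.
    if [ "$total" -gt "$keep" ] ; then
        echo "d1-$(( total - keep ))"
    fi
}
```

For a mailbox with 1500 messages, delete_range 1500 prints d1-500; for 800 messages it prints nothing, so nothing should be sent to mail in that case.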

4. Forward the email to the user

Very effective: create a .forward file in the user's home directory to forward all email messages to the user. If the user starts receiving many, many emails, he/she will surely do something about it when it gets annoying.

Topics: LVM, PowerHA / HACMP, System Admin

VGDA out of sync

With HACMP, you can run into the following error during a verification/synchronization:

WARNING: The LVM time stamp for shared volume group: testvg is inconsistent with the time stamp in the VGDA for the following nodes: host01

To correct the above condition, run verification & synchronization with "Automatically correct errors found during verification?" set to either 'Yes' or 'Interactive'. The cluster must be down for the corrective action to run.

This can happen when you've added additional space to a logical volume/file system from the command line instead of using the smitty hacmp menu. But you certainly don't want to take down the entire HACMP cluster just to get rid of this message.

First of all, you don't have to. The cluster will fail over nicely anyway, even without these VGDAs being in sync. Still, it is an annoying warning that you would like to get rid of.

Have a look at your shared logical volumes. By using the lsattr command, you can see if they are actually in sync or not:

host01 # lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue
/test:1:809:jfs2:y:

host02 # lsattr -Z: -l testlv -a label -a copies -a size -a type -a strictness -Fvalue
/test:1:806:jfs2:y:

Well, there you have it. One host reports testlv having a size of 806 LPs, the other says it's 809. Not good. You will run into this when you've used the extendlv and chfs commands to increase the size of a shared file system. You should have used the smitty menu.

The good thing is, HACMP will sync the VGDAs if you do some kind of logical volume operation through the smitty hacmp menu. So, either increase the size of a shared logical volume through the smitty menu by just one LP (and of course, also increase the size of the corresponding file system), or create an additional shared logical volume of just one LP through smitty, and then remove it again afterwards.

When you've done that, simply re-run the verification/synchronization, and you'll notice that the warning message is gone. Make sure you run the lsattr command again on your shared logical volumes on all the nodes in your cluster to confirm.
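Comparing the lsattr output by eye gets tedious with many shared logical volumes. Below is a minimal sketch of that comparison, using the two sample lines shown earlier. The function names lv_size and lv_in_sync are made up, and how you collect the output from each node (for example over ssh) is left open:

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# Extract the size in LPs (third colon-separated field) from a line
# in the format label:copies:size:type:strictness: as printed by
# the lsattr command used in this article.
lv_size() {
    echo "$1" | cut -d: -f3
}

# Succeed when two nodes report the same attributes for a logical volume.
lv_in_sync() {
    [ "$1" = "$2" ]
}

node1="/test:1:809:jfs2:y:"   # output captured on host01
node2="/test:1:806:jfs2:y:"   # output captured on host02

if lv_in_sync "$node1" "$node2" ; then
    echo "testlv is in sync"
else
    echo "testlv differs: $(lv_size "$node1") LPs versus $(lv_size "$node2") LPs"
fi
```

With the sample values, this reports a difference of 809 versus 806 LPs. After the fix, both nodes should return identical lines.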

Topics: AIX, Monitoring, System Admin

"Bootpd: Received short packet" messages on console

If you're receiving messages like these on your console:

Mar 9 11:47:29 daemon:notice bootpd[192990]: received short packet
Mar 9 11:47:31 daemon:notice bootpd[192990]: received short packet
Mar 9 11:47:38 daemon:notice bootpd[192990]: hardware address not found: E41F132E3D6C

Then it means that you have bootpd enabled on your server. There's nothing wrong with that. In fact, a NIM server, for example, requires you to have this enabled. However, these messages on the console can be annoying. There are systems on your network that are sending bootp requests (broadcasts). Your system is listening to these requests and trying to answer them. It looks in the bootptab configuration (file /etc/bootptab) to see if their MAC addresses are defined. When they aren't, you get these messages.

To solve this, either disable the bootpd daemon, or change the syslog configuration. If you don't need the bootpd daemon, then edit the /etc/inetd.conf file and comment out the entry for bootps. Then run:

# refresh -s inetd

If you do have a requirement for bootpd, then update the /etc/syslog.conf file and look for the entry that starts with daemon.notice:

#daemon.notice /dev/console
daemon.notice /nsr/logs/messages

By commenting out the daemon.notice entry for /dev/console, and instead adding an entry that logs to a file, you can avoid seeing these messages on the console. Now all you have to do is refresh the syslogd daemon:

# refresh -s syslogd
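Once the messages go to a file, you may still want to know which systems keep sending these bootp requests. A small sketch for that, assuming the log lines look like the ones shown above; the function name unknown_macs is made up:

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on standard input and print
# each hardware address that bootpd could not find, once.
unknown_macs() {
    awk '/bootpd.*hardware address not found:/ { print $NF }' | sort -u
}
```

Feed it the log file configured in /etc/syslog.conf, for example: unknown_macs < /nsr/logs/messages.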

Topics: Red Hat / Linux, System Admin

How to enable ntpd on Linux

This is a procedure to enable time synchronization (ntpd) on Linux (in this example, replace the IP address of the time server with the IP address of your time server):

Stop all applications on the server.
Check if you can access the time servers, e.g.:
# ntpdate -q 10.250.9.11
Check if the current timezone setting is correct by simply running the date command.
Set the correct time and date:
# ntpdate 10.250.9.11
Start the NTP daemon:
# service ntpd start
Check the status:
# service ntpd status
Check the time synchronization (it may take some time for the client to synchronize with its time server):
# ntpq -p
Make sure that ntpd is started at system restart:
# chkconfig ntpd on
# chkconfig --list | grep ntpd
Check the process:
# ps -ef | grep ntpd
Reboot the server:
# reboot
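In the ntpq -p output, the time server that ntpd has selected is marked with a leading asterisk. If you want to check for that from a script, here is a minimal sketch; the function name synced_peer is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: read "ntpq -p" output on standard input and
# print the peer that ntpd has selected for synchronization.
# ntpq marks that peer with a leading asterisk.
synced_peer() {
    awk '/^\*/ { print substr($1, 2) }'
}
```

Run it as: ntpq -p | synced_peer. As long as it prints nothing, the client has not selected a time server yet.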

Topics: Red Hat / Linux, System Admin

Enabling sendmail on Linux

Make sure the relay host, e.g. the Exchange server, allows incoming email from your Linux server.
Make sure no firewall is blocking SMTP traffic from the Linux host. You can use nmap for this purpose:
# nmap -sS smtp.server.com
(Replace "smtp.server.com" with the actual SMTP server hostname of your environment).
Check if the DNS configuration in /etc/resolv.conf is correct, and make sure you can resolve both the hostname and, in reverse, its IP address:
```
# nslookup hostname
# nslookup ipaddress
```
(use the IP address returned by the first DNS lookup on the hostname to reversely lookup the hostname by the IP address).
Make a copy of sendmail.mc and sendmail.cf in /etc/mail.

Edit sendmail.mc (add in the name of your SMTP server):

define(`confTRUSTED_USER', `root')dnl
define(`SMART_HOST', `esmtp:smtp.server.com')dnl
MASQUERADE_AS(`hostname.com')dnl
FEATURE(masquerade_envelope)dnl
FEATURE(masquerade_entire_domain)dnl

Then run:
# make -C /etc/mail
Edit sendmail.cf and modify the "C{E}" line: remove every user listed on that line, including root, so that mail sent from root gets masqueraded as well. Towards the bottom of the sendmail.cf file, there is a section for Ruleset 94. Make sure that "R$+" is followed by exactly ONE tab (not a space, and not multiple spaces or tabs):
```
SMasqEnv=94
R$+ $@ $>MasqHdr $1
```
Clean out /var/spool/clientmqueue and /var/spool/mqueue (there may be lots of OLD emails there, we may not want to send these anymore).
Then restart sendmail:
# service sendmail restart
(or "service sendmail start" if it isn't running yet; check the status with: "service sendmail status").
Make sure that sendmail is started at system restart:
# chkconfig sendmail on
# chkconfig --list sendmail
Open a "tail -f /var/log/maillog" so you can watch any syslog activity for mail (of course there should be a "mail.*" entry in /etc/syslog.conf directing output to /var/log/maillog for this to work).
Send a test email message:
# echo "test" | sendmail -v address@email.com
(and check that the email message is actually accepted for delivery in the verbose output).
Wait for the mail to arrive in your mailbox.
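The single tab in the Ruleset 94 line is easy to get wrong, because a tab and a space look the same in most editors. Here is a small sketch to check a rule line for it; the function name has_single_tab is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: succeed when a rule line has exactly one tab,
# and no space, directly after "R$+".
has_single_tab() {
    tab=$(printf '\t')
    printf '%s\n' "$1" | grep -q '^R[$]+'"$tab"'[^'"$tab"' ]'
}
```

Assuming the rule directly follows the SMasqEnv=94 line, as in the snippet above, you could call it like this: has_single_tab "$(grep -A1 '^SMasqEnv=94' /etc/mail/sendmail.cf | tail -1)" && echo OK.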

Topics: Storage, System Admin

Inodes without filenames

It will sometimes occur that a file system reports storage to be in use, while you're unable to find which file exactly is using that storage. This may occur when a process has used disk storage, and is still holding on to it, without the file actually being there anymore for whatever reason.

A good way to resolve such an issue is to reboot the server. This way, you'll be sure the process is killed, and the disk storage space is released. However, if you don't want to use such drastic measures, here's a little script that may help you find the process that may be responsible for an inode without a filename. Make sure you have lsof installed on your server.

#!/usr/bin/ksh

# Make sure to enter a file system to scan
# as the first argument to this script.
FILESYSTEM=$1
LSOF=/usr/sbin/lsof

if [ -z "$FILESYSTEM" ] ; then
  echo "Usage: $0 filesystem"
  exit 1
fi

# A for loop to get a list of all open inodes
# in the file system using lsof.
for i in $($LSOF -Fi "$FILESYSTEM" | grep '^i' | sed 's/^i//' | sort -u) ; do
  # Use find to list associated inode filenames;
  # -xdev keeps find within this file system.
  if [ -z "$(find "$FILESYSTEM" -xdev -inum $i)" ] ; then
    # If no filename can be found, the inode is a suspect:
    # check the lsof output for this inode.
    echo "Inode $i does not have an associated filename:"
    $LSOF "$FILESYSTEM" | grep -w -e "$i" -e COMMAND
  fi
done

Topics: EMC, SAN, Storage, System Admin

Recovering from dead EMC paths

If you run:

# powermt display dev=all

and you notice that there are "dead" paths, then these are the commands to run to set these paths back to "alive" again. Of course, do this only AFTER ensuring that any SAN-related issues are resolved.

To have PowerPath scan all devices and mark any dead devices as alive, if it finds that a device is in fact capable of doing I/O commands, run:

# powermt restore

To delete any dead paths, and to reconfigure them again:

# powermt reset
# powermt config

Or you could run:

# powermt check
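To find out from a script whether there are any dead paths at all, you can count them in the powermt output. This is only a sketch: the function name count_dead_paths is made up, and it simply counts lines that contain the word "dead" as a separate column.

```shell
#!/bin/sh
# Hypothetical helper: read "powermt display dev=all" output on
# standard input and print the number of lines reporting a dead path.
count_dead_paths() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "dead") { n++; break }
    }
    END { print n + 0 }'
}
```

Run it as: powermt display dev=all | count_dead_paths. A result of 0 means there is nothing to restore.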

Topics: System Admin

System administration best practices

System Administrators can be the worst kind of users on your system

The ideal computer system is a system that doesn't have any users on it, or isn't related to any user action. Why? Well, as long as users can't access the computer system or users can't create a load on a system, the system will run smoothly. But this isn't reality. Without users, there wouldn't be any computer systems. And without users, there wouldn't be system administrators. Although users can be idiots and amaze you with all the stupid things they do, it's your fellow system administrators you'll really have to watch out for, because they have root authority and can mess things up quite badly.

The people that install a system, should be responsible for the system

If you install a system, and know you'll have to manage it yourself once it's in production, you'll make sure the system is configured correctly. I've seen it many times: the people responsible for installing the system aren't responsible for the maintenance of the system. This creates a "throw-it-over-the-fence" effect: people installing a system really don't have a clue what kind of administration nightmares they've created and all the problems administrators run into during the most horrible hours of the day (usually Murphy prefers Sunday night, or when you're about to go to sleep). Make absolutely sure that once a system is installed, the same people have to manage it, at least during the first two months of production.

Poorly designed systems are difficult to administer

Take your time designing a system and learn about a specific application before implementing it. Rapid designs under high time pressure usually end up causing lots of problems during production (and they give your administrators a long-lasting headache). Make sure the documentation of a system is written before going into production. And, as an administrator, perform a health check before accepting to manage the newly installed system. Also, keep the previous best-practice tip in mind: if you haven't installed the system, you have no clue what you're getting into. And finally: during the implementation phase of a project, as a system administrator, you should be involved in the project, to be able to help design, install, configure and document the system.

Don't combine different information systems on one server

Sharing one operating system image by different information systems usually leads to problems, as these different information systems conflict in tuning parameters, backup windows, downtime slots, user information, peak usage, etcetera. Systems that do different tasks should be separated from each other to avoid dependencies. A grouping of information systems can be made by use, for example put databases together on one server. Or you can group them by product, for example put a single product and its database on one box. But NEVER, EVER put software from different vendors on 1 system!

Naming conventions should be easy

When choosing names for your hosts, printers, users etcetera, keep a few simple rules in mind:

Choose names that are easy to remember and are not too long (8 characters max).
Choose names NOT related to any department or other part of your organisation. Departments keep changing over time; by not naming your systems after departments, it will save you lots of time.
Choose names NOT related to any location. Locations also change frequently, when assets are moved around.
Never EVER reuse a name. Choose new names for new assets or users. When migrating from one host to another, don't use the same hostname, choose a new one. Rather use DNS aliases when you wish to keep a hostname. Don't configure hostname aliases on a network interface to keep using the old hostname. And for users and groups, always use a new UID and GID.
When a system has more than 1 network interface, choose hostnames related to each other: Service address: rembrandt; Boot-address: rembrandt_boot; Standby-address: rembrandt_standby; etcetera.
Choose a standard naming convention; don't change your naming convention.
Never use a hostname twice, even if the hosts are in separate networks.

Never enough disk space

Never give your users too much disk space at once. Users will always find a way to fill up all disk space. Give them small amounts at a time. Encourage your users to clean up their directories, before requesting new disk space. This will save you time, disk space, backup throughput and money.

Temporary space (like /tmp) is TEMPORARY. Make sure your users know that. Clean out temporary space every night and be ruthless about it! Applications can NEVER use temporary space in /tmp; applications should use separate file systems for storing temporary files.
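How ruthless you want to be is up to you. As a minimal sketch, assuming that files not modified for a number of days can safely go, a nightly cleanup could look like this. The function name clean_tmp and the default of 7 days are choices made for this example:

```shell
#!/bin/sh
# Hypothetical helper: delete regular files under DIR that have not
# been modified for more than DAYS days. -xdev keeps find from
# descending into other mounted file systems.
clean_tmp() {
    dir=$1
    days=${2:-7}
    find "$dir" -xdev -type f -mtime +"$days" -exec rm -f {} \;
}
```

Called as clean_tmp /tmp 1 from a nightly cron job, this removes every regular file in /tmp that has not been modified for more than a day. Try it on a test directory first.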

Put your static application data in a file system, SEPARATED from the changing data of an application. Usually every application should use at least 2 file systems: 1 for the application binaries and 1 for the data (and also log files). This will make sure your application file system will never run full.

Versioning

Keep as few versions of applications and operating system levels on your systems as possible. The fewer versions you have, the less you have to manage and the easier it becomes. Try to standardize on a small number of versions.

Only use supported versions of applications and operating systems. Check regularly which versions are supported. Upgrade in time, but not too fast! Usually an N-1 best practice should be used (always stay one version behind the released levels from vendors). Don't try to use the newest versions, as these versions usually suffer from all kinds of defects, yet to be discovered. This applies to application software, OS levels, service packs, firmware levels, etcetera.

Know what you're doing!

If you don't know EXACTLY what you're doing, just don't do it.

Get educated. Take time away from your boss for training. Being away from your work will make sure you won't be disturbed all the time during your studies. Switch your mobile off and tell everybody you're not available.
Take at least 2 courses a year with at least 40 hours of study each. An employer not wishing to pay for studies is not a good employer. IT is a fast-changing arena, so you have to keep up.
Take courses related to your work. Don't bother taking courses where you learn something you'll never use.
Before going on a course, read about the subject. Learning is a lot easier if you already know something about it.
Make sure you have all the prerequisites when taking a course.
After a course, actually use your newly gained knowledge.
Also, get certified. Good for your career, but also good for the understanding of a subject. Certification requires you to actually use a certain product for an extended period of time thoroughly and also requires you to read books or get training on the subject.
Don't do good-luck-certifications. Do your learning. Doing a certification 3 times on a single day, just to get certified, won't give you the needed knowledge.
At least 2 certifications a year!
Get a test system and try out your knowledge. Don't use production systems for testing. Also, make sure the test system isn't on the same network as your production system.
Write down what you're doing. It is always easier to look something up again than to guess what you did.
Create procedures and stick to procedures. Keep procedures short!

Keep it Simple

We have difficulty understanding systems as they become more complex. Complexity leads to more errors and greater maintenance effort. We want systems to be more understandable, more maintainable, more flexible, and less error-prone. System design is not a haphazard process. There are many factors to consider in any design effort. All design should be as simple as possible, but no simpler. This facilitates having a more easily understood, and easily maintained system. This is not to say that features, even internal features, should be discarded in the name of simplicity. Indeed, the more elegant designs are usually the more simple ones. Simple also does not mean "quick and dirty." In fact, it often takes a lot of thought and work over multiple iterations to simplify. Result: Systems that are more maintainable, understandable, and less error-prone.

Manage the information system as a whole

Don't just administer the operating system and its supporting hardware. The OS is always used to provide some kind of basis for an information system. You need to know the complete picture of the information system itself; its parts, the interfaces, etcetera, to understand the role of the OS in it. System Administration is a lot easier, knowing what the information system is used for.

Therefore, manage a system from the user's point of view. How will it affect the users if you change anything on the OS or on the underlying hardware level?

Backup, backup and backup!

Before doing anything on your system, make ABSOLUTELY sure you have a full, working backup of your system! Check this over and over again. A system should be backed up once every day. Determine what you should do when a backup fails. Determine how you should restore your system. Document your backup and restore procedures. And of course, test them at regular intervals, by restoring a backup on a separate system.

Last but not least

Did you know that companies are spending roughly 70 to 90 percent of their complete IT budgets on maintaining their systems? Knowing this, it's a huge responsibility to maintain the systems in the best possible manner!
