Sar

Do you have the proper patches on your system? With ANY problem, there really is no sense in chasing it very far if you are running on systems that are not properly patched. Updated programs and kernel fixes often make problems magically go away.

If you already know about sar, you may want to skip ahead to "What's Wrong?" now.

Your beginning tool for tracking down performance problems is "sar". Enable it on SCO Unix, if it isn't already enabled, with

    /usr/lib/sa/sar_enable -y

Ignore the warning about rebooting; it is not necessary. All that command does is uncomment entries for sa1 and sa2 in the sys and root crontabs. By default, SCO runs sar every twenty minutes during "working hours" and every hour otherwise; you should adjust the "sys" crontab to meet your specific needs.

The daily summary (the sa2 script run from root's crontab) creates summary files in /var/adm/sa. These will have names like "sar01", "sar02", etc.; the daily data files are named "sa01" through "sa31". The daily data files are binary data. If you wanted to examine memory statistics from the 15th of the month, you'd run

    sar -f /var/adm/sa/sa15 -r

The "sar" summary files are text. The "sar15" file is the output of running

    sar -f /var/adm/sa/sa15 -A > /var/adm/sa/sar15

and therefore can be viewed directly with "more", printed directly, or whatever.

On Red Hat Linux systems, sar data collection runs from /etc/cron.hourly/sysstat:

    #!/bin/sh
    # snapshot system usage every 10 minutes six times.
    umask 0022
    /usr/lib/sa/sa1 600 6 &

The daily summary is done by:

    #!/bin/sh
    # generate a daily summary of process accounting.
    umask 0022
    /usr/lib/sa/sa2 -A &

The sar files are found in /var/log/sa, and use the same naming as on SCO.

If you have been running sar for a month or more, you'll always have one month's worth of historical data. As daily files are overwritten by daily files from the following month, you don't have to be concerned with using up disk space.
Having this historical data lets you quickly decide if the current sar statistics represent an unusual condition.

The flags for sar vary between operating systems, so read the man page. On all systems, sar without any arguments gives you CPU usage, but even there the output will vary. Linux systems have a "nice" column that SCO Unix lacks, and SCO includes a useful "wio" column not found on Linux:

    Linux 2.4.9-12 (apl)    12/04/2001

    05:32:15 AM       CPU     %user     %nice   %system     %idle
    05:32:20 AM       all      0.00      0.00      0.20     99.80
    05:32:25 AM       all      0.00      0.00      0.00    100.00
    05:32:30 AM       all      4.20      0.00      1.00     94.80
    05:32:35 AM       all      4.40      0.00      0.60     95.00
    05:32:40 AM       all      0.00      0.00      0.00    100.00
    Average:          all      1.72      0.00      0.36     97.92

    SCO_SV scosysv 3.2v5.0.6 PentIII    12/04/2001

    05:48:33    %usr    %sys    %wio   %idle (-u)
    05:48:38       0       0       0     100
    05:48:43       0       0       0     100
    05:48:48       0       0       0     100
    05:48:53       0       0       0     100
    05:48:58       0       0       0     100
    Average        0       0       0     100

If you run sar without any numerical arguments, it will look for today's historical data (and complain if it can't find it). If you run it with numerical arguments, it samples what is happening now. The first argument is the time between samples (5 seconds is a good choice), the second is the number of samples. So "sar 5 2" gives two samples, 5 seconds apart.

What's Wrong?

So, the system is slow. Let's try to find out why. If it is the CPU that is pegged busy, it *may* be a runaway process that is eating CPU cycles. Do this:

    for x in 1 2 3 4 5
    do
        # sort on the third field (TIME), highest first;
        # on newer systems "+2" is spelled "-k 3"
        ps -e | sort -r +2 | head -5
        echo "==="
        sleep 5
    done

Look for a process whose TIME column has gone up by 3 to 5 seconds each time. If you have something like that, that's your problem, and you need to kill it. The TIME column is time on the CPU. Normally a process doesn't spend a great deal of time actually running: it's waiting for the disk, waiting for you to type something, etc. So something that gains 3 seconds or more in 5 seconds of wall time is usually suspect.
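
The "3 seconds in 5" test is easier to apply if you turn the TIME column into plain seconds first. Here's a minimal sketch, assuming your ps prints TIME as mm:ss, hh:mm:ss or dd-hh:mm:ss (check your own output, it varies); the function names time_to_secs and gained_at_least are made up for this example:

```shell
# Convert a ps TIME value (mm:ss, hh:mm:ss or dd-hh:mm:ss) to seconds.
# awk is used so that fields like "08" are treated as decimal numbers.
time_to_secs() {
    echo "$1" | awk -F'[-:]' '{
        mult[0] = 1; mult[1] = 60; mult[2] = 3600; mult[3] = 86400
        secs = 0
        # walk the fields right to left: seconds, minutes, hours, days
        for (i = NF; i >= 1; i--) secs += $i * mult[NF - i]
        print secs
    }'
}

# Succeed if the CPU time gained between two TIME readings
# is at least the given number of seconds.
# $1 = earlier TIME, $2 = later TIME, $3 = threshold in seconds
gained_at_least() {
    before=`time_to_secs "$1"`
    after=`time_to_secs "$2"`
    [ $((after - before)) -ge "$3" ]
}
```

For example, "gained_at_least 00:00:42 00:00:46 3" succeeds, because the process picked up 4 seconds of CPU time between the two readings.
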
Of course you need to understand what you are killing: you probably wouldn't want to kill the main Oracle database, for example.

If you kill the errant process and another copy of it pops right back to the top of the list, then you need to track down its parent:

    # for example, if process 15246 is the problem
    ps -p 15246 -o ppid

Of course, it may go further up the chain. Here's a script that traces back to init:

    # This works on SCO or Linux, just pass a process ID as an argument.
    MYPROC=$1
    NEXTPROC=$MYPROC
    # stop at process 0, or if ps returns nothing (the process is gone)
    while [ -n "$NEXTPROC" ] && [ "$NEXTPROC" != 0 ]
    do
        ps -lp $NEXTPROC
        MYPROC=$NEXTPROC
        # "ppid=" suppresses the header; tr strips the padding
        NEXTPROC=`ps -p $MYPROC -o "ppid=" | tr -d ' '`
    done

Sometimes you'll have a badly written network program that starts sucking resources when its client dies. If you can't get the supplier to fix it, you may want to write a script to track down and kill these things. One clue that might help: the difference between a good "xyz" process and a bad one might just be whether or not it has an attached tty. So, if you see this:

    5821 ?        00:00:42 xyz
    6689 ttyp0    00:00:08 xyz
    7654 ttyp1    00:00:12 xyz

it's probably the one with a "?" that will start accumulating time. So a script that watched for and killed those might look like this:

    set -f    # turn off shell expansion because of "?"
    ps -e | grep "xyz$" | while read line
    do
        set $line    # $1 is the PID, $2 is the tty
        [ "$2" = "?" ] && kill -9 $1
    done

If you can't do it that way, you have to get more clever, and watch for changing time:

    set -f
    mkdir -p /tmp/mystuff
    ps -e | grep "xyz$" | while read line
    do
        set $line
        ps -p $1 > /tmp/mystuff/first
        sleep 5    # adjust sleep as necessary
        ps -p $1 > /tmp/mystuff/second
        # if the two snapshots differ, the TIME column moved: kill it
        diff /tmp/mystuff/first /tmp/mystuff/second > /dev/null || kill -9 $1
    done

And even that may not be clever enough for your particular situation, so test and tread carefully. You may even need to do math on the time field to see what has really happened.

Another thing you may see is a process that has used a lot of time but isn't gaining time right now.
I've seen that many times where the process is "deliver", MMDF's mail delivery agent on SCO systems that aren't running sendmail. What happens is that for whatever reason (a root.lock file from a crash in /usr/spool/mail or a missing "sys" home directory), there are thousands of undelivered messages in the subdirectories of /usr/spool/mmdf/lock/home. The fix for that is simple if you don't care about the messages: rm -r all those directories and recreate them empty with the same ownership and permissions:

    cd /usr/spool/mmdf/lock/home
    /etc/rc2.d/P86mmdf stop
    DIRS=`ls`      # remember the directory names before removing them
    rm -r *
    mkdir $DIRS    # recreate them empty
    chown mmdf:mmdf *
    chmod 777 *
    cd /usr/spool/mail
    rm *.lock
    /etc/rc2.d/P86mmdf start

You'd then want to verify that mail is working normally and that whatever caused the problem isn't still happening. For example, if /usr/sys is missing, this problem will come right back again very quickly.

Another possibility is a program that is rapidly spawning off other programs. You should be able to see that in "ps -e". First, is the number of processes growing?

    ps -e | wc -l
    sleep 5
    ps -e | wc -l

Or, are there new processes briefly showing up at the end of the listing?

    ps -e | tail
    sleep 5
    ps -e | tail

In either case, you need to track down the parent and kill it.

Low Memory

If sar -r shows low memory or (worse) swapping, go buy more memory. That's going to be easy to spot on SCO's sar, but Linux is a bit harder.
Let's look at SCO first:

    SCO_SV scosysv 3.2v5.0.6 PentIII    12/06/2001

    11:37:06 freemem freeswp availrmem availsmem (-r)
    11:37:06  unix restarts
    11:40:00   52972  786000     56222    150408
    12:00:00   52903  786000     56234    150643
    12:20:00   52996  786000     56240    150723
    12:40:00   53018  786000     56240    150723
    13:00:00   53018  786000     56240    150723
    13:20:00   53018  786000     56240    150723
    13:40:00   53018  786000     56240    150723
    14:00:00   52885  786000     56231    150606
    14:20:00   52999  786000     56240    150723
    14:40:00   53016  786000     56240    150723
    15:00:00   53018  786000     56240    150723

This machine consistently has over 200 MB of memory available and unused (freemem pages are 4K each). Obviously no problem there, and in fact, if this is always the case (which you'd know from sar historical data), you may want to use some of that memory for disk buffers; see http://pcunix.com/Unixart/memory.html.
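
The "over 200 MB" figure is just the freemem page count multiplied by the page size. A quick sketch of the arithmetic, assuming the 4K pages mentioned above; the function name pages_to_mb is made up for this example:

```shell
# Convert a page count from "sar -r" into whole megabytes.
# $1 = number of pages
# $2 = page size in KB (defaults to 4, the page size stated for the SCO system)
pages_to_mb() {
    echo $(( $1 * ${2:-4} / 1024 ))
}
```

Running "pages_to_mb 53018" on the 15:00:00 freemem value prints 207, which is where "over 200 MB" comes from.
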