Hello Sun Managers:
Thanks to all who responded. I got 12 responses and tried almost all of the
suggestions. In the meantime I also upgraded the OS (from SunOS 4.1.3 to
SunOS 4.1.3_U1) and split the dump across two nights. The backup seems to be
working fine now, but I am not sure whether the problem is totally solved.
My original question was why rdump suddenly started taking forever to finish
(approx. 14 hours).
Here are the responses that I received:
-----------------------------------------------------------------------
-----------------------------------------------------------------------
perryh@pluto.rain.com (Perry Hutchison):
The only things I can think of are
* system load may have increased enough to slow down the data
to the point that the tape no longer streams, or
* the partition now has about 15 times as much data as it used to.
-----------------------------------------------------------------------
wallen@cogsci.UCSD.EDU (Mark R. Wallen):
I had a similar problem dumping a DECstation 3100
to a SUN. All of a sudden the dumps started taking
a *LONG* time. If you look at the trace that dump
prints out as it goes, the estimated time remaining normally
decreases as the percentage done rises. In this case, the time
kept increasing.
The problem was due to a poor ethernet connection,
specifically an AUI cable that plugged into the Sun.
Only dump with its large block (1k) transfers seemed
to trigger the problem, and just from one host!?!
----------------------------------------------------------------------
daniel@sar3.CANR.Hydro.Qc.CA (Daniel Hurtubise):
I would check to see if there are any read and/or write retries
from any of the disks. Use the dmesg command to check if there
are any abnormalities in the system.
If the dumps are not being done in single user mode, and you
are doing backups over the network, you may be hitting a
bottleneck somewhere. However, you mentioned that the problem
was happening only on a particular machine, so I would stick
with the disk theory.
----------------------------------------------------------------------
raoul@MIT.EDU (Nico Garcia):
Clean your tape drives, run an fsck of your partitions, and look for the
fragmentation data. Also check your network loads at dump times.
----------------------------------------------------------------------
weitzel@burke.com (David Weitzel):
In your dump script, you might do a "ps -auwx" and redirect it
to a log periodically. I would suspect something like a find
command is getting kicked off out of crontab or something.
Look in the log after the dumps are complete, and I would suspect
you will see the problem.
vmstat -S 2 10 is another good command.
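
A minimal sketch of the periodic logging idea above, assuming a POSIX-style
shell. The function names, the PS_CMD override, and the log path in the usage
note are hypothetical and not part of the original suggestion:

```shell
#!/bin/sh
# Append one timestamped process listing to the given log file.
# PS_CMD is a hypothetical override; it defaults to the BSD-style
# "ps -auwx" mentioned above.
snapshot() {
    log=$1
    {
        echo "=== $(date) ==="
        ${PS_CMD:-ps -auwx}
    } >> "$log" 2>&1
}

# Take COUNT snapshots, INTERVAL seconds apart.
monitor() {
    log=$1; interval=$2; count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        snapshot "$log"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

From the dump script, something like "monitor /var/tmp/dump-ps.log 300 200 &"
just before rdump starts would log a listing every five minutes; kill the
background job when the dump finishes and look through the log for unexpected
processes.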
----------------------------------------------------------------------
Gary.Richardson@proteon.com (Gary Richardson):
I had this problem a few weeks ago. I have a backup server on
one network that backs up MANY machines on many different subnets.
I had this ONE machine that would take as much time as your machine
did (approx. 10-15 hours for 800MB). I traced it down to the fact that the
guy who owned the machine had taken his UTP ethernet drop and plugged it
into a small UTP MAU-type box to get himself more ethernet ports.
Well, there must've been something wrong with that box. When I had
him plug his utp drop back into the Sun, the backup time went back
to normal.
So, I guess what I'm saying is that it might be worthwhile to see
where that machine is physically plugged into the network. You might
have a bad module that won't show any problems for normal use, but
as soon as you start trying to push large amounts of data thru it,
it backs up.
-----------------------------------------------------------------------
jerry@soul.ampex.com (Jerry Stachowski):
What happens when you rdump to a file on the same machine? Are there lots
of errors in /var/adm/messages on that machine?
-------------------------------------------------------------------------
markus@octavia.anu.edu.au (Markus Buchhorn):
You may have a problem file in the directory tree of that particular
rdump. A common problem is a named pipe - processes that try to read it
will try to read from the other end of this pipe (sorta like sucking
on /dev/null :-) ). This will basically take forever to finish. It is
possible that rdump trips across a file like this, sucks on it for 14 hours
and then either dies or gives up. Try something like 'find -type p' on
the filesystems you are rdump'ing.
There may be a similar-but-different problem, e.g. a circular link
somewhere. This can be harder to track down. Try writing a small script
which cat's every file to /dev/null, and wrap 'date' commands around it.
Run this in a spare window somewhere and keep an eye on it. When it suddenly
hangs/takes 14 hours to 'cat' the file you have your culprit... :-)
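
A rough sketch of both checks above, assuming a POSIX-style shell and a find
that supports -xdev. The function names are hypothetical. Unlike the
suggestion as written, the read pass is restricted to regular files so it
does not block on a named pipe; the pipes are listed separately instead:

```shell
#!/bin/sh
# List named pipes (FIFOs) under a directory.
# -xdev keeps find from crossing into other mounted filesystems.
list_fifos() {
    find "$1" -xdev -type p -print
}

# Read every regular file under a directory, logging a timestamp and
# the path before each read. If a read stalls, the last line in the
# log names the file being read at the time.
# Note: paths containing newlines are not handled by this loop.
read_all_files() {
    dir=$1; log=$2
    find "$dir" -xdev -type f -print | while IFS= read -r f; do
        echo "$(date) $f" >> "$log"
        cat "$f" > /dev/null 2>&1 || echo "unreadable: $f" >> "$log"
    done
    echo "$(date) done" >> "$log"
}
```

Run read_all_files against the mount point of the filesystem being dumped,
with a log file on a different filesystem, and watch the log with tail -f
from a spare window.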
---------------------------------------------------------------------------
---------------------------------------------------------------------------
Thanks again to all who responded. I really appreciate your help.
Kelly Liu
MSI NetSys Inc.
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:09:08 CDT