I'd like to thank all the people who wrote back with suggestions on my
problems getting our system to complete backups properly.  It's really
quite weird what has happened.  I apologize if I've left anyone out of
this summary.
The helpers:
Jon Burnard <jon.burnard@West.Sun.COM>
Colin Johnston <colinj@uk.psi.com>
www@cnp.tea.state.tx.us (William White)
Cyril Plisko <imp@Orbotech.Co.IL>
"Harris, Charles (PA62)" <Charles.Harris@ftw1.honeywell.com>
Lee Hughes <lhughes@anitesystems.co.uk>
gdh@rational.com (George Harrington)
Knut Hellebø <Knut.Hellebo@nho.hydro.com>
Tony.Tran@Ebay.Sun.COM (Tony Tran)
Quick recap:  
Our backup server had a disk added to it recently, then a week or so
later it started being very slow and not completing backups and
seeming to hang on a variety of clients.  The only changes recently
had been the new disk and patching Networker with the -010 patch for
Solaris. 
What we tried:
At first we thought it was a hardware problem, so we started pulling
drives out of the jukebox, shortening cables, etc.  We even went as
far as getting a new SCSI card for the server, shortening the cables
and removing the jukebox and all but one tape drive.  The problem was
still there.
I also tried backing out the -010 patches, going back to the -008 set
which had been working fine for months.  No change there either.  The
frustrating thing was that a small backup would run fine for a couple
of GB of data and then lock up towards the end, or at least after
writing for a while, so this made testing quite slow and painful as
well.
At this point we decided to move from the Ultra 140 to an old Sparc 10
we had sitting around, because the backup server was also the print
server for the building and had home directories for various users
and it wasn't nice to keep taking the system down on them.
The Solution:
We installed Solaris 2.5.1 with the latest patches and the st.conf file
from Legato.  We also re-installed Legato Networker 4.2.5 and patch
-010 from scratch and re-installed the jukebox configuration as well.
We were then able to move over the original /nsr/res/* files and not
lose client indexes or the media index.
Of course, this system worked like a charm once we got it up and
running.  It's running the same jukebox, with the same tape drives and
SCSI cables.  It's even running with the same SCSI cards from the old
Ultra 140 system.  
But now it works.  Argh!
I suspect that the problem might have been a mis-applied -010 patch
and the fact that the old jukebox config was really taken from the
original SunOS 4.1.4 system that we upgraded from Networker 4.0.2
about 7 months ago.  The conversion then went painfully, but did go
and did work for six plus months.
I did also follow the advice of the people and looked for network
problems on both the server and the clients.  I found one bad network
port and moved the client to another port and all works well now for
that client.  But one of the clients that *was* failing had a
perfectly good network port.  
This move to a dedicated server works out since it lets us dedicate a
machine to network backups, which we've really decided is a *good*
thing to do, since when it does go bad, you won't impact other users
or services as you work on it.
Below is my original message and the edited replies I got from various
people.  Thanks for the suggestions!
John
        John Stoffel - Senior Unix Systems Administrator - Fluent, Inc.
          jfs@fluent.com - http://www.fluent.com - 603-643-2600 x341
                             Kill your Television
----------------------------------------------------------------------
Date: Wed, 13 Aug 1997 12:23:23 -0400 (EDT)
From: John Stoffel <jfs@fluent.com>
To: networker@iphase.com
Subject: networker - Problems with: Networker 4.2.5-008, EXB-120, Solaris 2.5.1
>>>>> John Stoffel <jfs@fluent.com> posted to the networker list:
Note: I sent this to Sun-managers mailing list so it's pretty Sun
specific, but I've got some new networker specific questions at the
end.  
Hello fellow Managers,
We're running into a really *strange* problem here which we think
we've narrowed down to hardware, but we're not sure, so we're looking
to get more confirmation on where the problem is.
Background:
We run an UltraServer 140, Solaris 2.5.1, 256MB of RAM, two generic
SCSI controllers (esp0,1), one combined Fast SCSI (fas0) and 100base-T
(hme0).  It's the main backup server (120+ clients) and it's running
Networker 4.2.7 connected to an EXB-120 with 4 8500c 8mm tape drives
on esp1.  We're running ODS 4.0 to mirror and stripe disks for
disaster protection on the esp0 and fas0 SCSI buses.  We have the
following patches installed:
    Patch: 102580-13 Obsoletes: Packages: SUNWmd.2 4.0,REV=1.0,PATCH=13
    Patch: 103582-01 Obsoletes: Packages: SUNWcsr
    Patch: 103609-02 Obsoletes: Packages: SUNWcsr
    Patch: 103630-01 Obsoletes: Packages: SUNWcsr, SUNWcsu
    Patch: 103582-06 Obsoletes: Packages: SUNWcsr
    Patch: 103663-01 Obsoletes: Packages: SUNWcsu, SUNWhea
    Patch: 103594-03 Obsoletes: Packages: SUNWcsu
    Patch: 103680-01 Obsoletes: Packages: SUNWcsu
    Patch: 103699-01 Obsoletes: Packages: SUNWcsu
    Patch: 103817-01 Obsoletes: Packages: SUNWcsu
    Patch: 103683-01 Obsoletes: Packages: SUNWcsu
    Patch: 103738-02 Obsoletes: Packages: SUNWcsu
    Patch: 103959-03 Obsoletes: Packages: SUNWlps, SUNWlpu, SUNWscpu
    Patch: 103461-12 Obsoletes: Packages: SUNWmfrun
    Patch: 103686-01 Obsoletes: Packages: SUNWnisu
The disks and tapes look like this:
    esp0 is a "Generic SCSI" SCSI disk controller
        c0t0d0 (sd0) is a "SEAGATE-ST32151W-0160" 2.0 GB disk drive
        c0t1d0 (sd1) is a "SEAGATE-ST32151W-0160" 2.0 GB disk drive
        c0t2d0 (sd2) is a "SEAGATE-ST34371N-0338" 4.0 GB disk drive
        c0t3d0 (sd3) is a "SEAGATE-ST34371N-0338" 4.0 GB disk drive
        c0t6d0 (sd6) is a disk drive
    SUNW,fas0 is a "Sun FAS366" Fast SCSI disk controller
        c1t2d0 (sd16) is a "SEAGATE-ST15230N-0638" 4.0 GB disk drive
        c1t3d0 (sd17) is a "SEAGATE-ST15230N-0638" 4.0 GB disk drive
        c1t4d0 (sd18) is a "SEAGATE-ST34371N-0280" 4.0 GB disk drive
        c1t5d0 (sd19) is a "SEAGATE-ST34371N-0280" 4.0 GB disk drive
    rmt/2 (st8) is a "Exabyte EXB-8500 8mm Helical Scan (EXABYTE EXB-8500)" SCSI tape drive
    rmt/3 (st9) is a "Exabyte EXB-8500 8mm Helical Scan (EXABYTE EXB-8500)" SCSI tape drive
    rmt/0 (st11) is a "Exabyte EXB-8500 8mm Helical Scan (EXABYTE EXB-8500)" SCSI tape drive
    rmt/1 (st12) is a "Exabyte EXB-8500 8mm Helical Scan (EXABYTE EXB-8500)" SCSI tape drive
Problem:
Over the past week or so we have been having *terrible* problems with
the backups not completing.  They start out fine, running about 600
KB/sec on each drive and basically looking happy.  We'll then come in
the next morning and find the backups in a hung state where there will
be 0-100 KB/sec going to one or maybe two drives.  Looking at the
clients and the server, there's very little CPU utilization and very
little IO wait (using sar and iostat).  The only thing we can do is
stop the backup and see how many/few clients managed to back up.
The only recent change we've made is to add another disk to the esp0
controller, for a total of 5 disks.  About a month ago one of the tape
drives died and was replaced.
I disabled, disconnected and powered off that tape drive from the SCSI
bus as a test, but it still locked up.
Today we used tcpdump to watch the network traffic between the server
and one client (a 70+ GB file server), which are connected via
100base-T switch ports to a central hub.  At first there will be lots
of data dumping from the client with the TCP window nice and large at
64240 (we maximized this a while ago for performance reasons) and then
suddenly the TCP receive window gets smaller and smaller and goes to
0, so the client can't send any more data.  After a time (say 5-15
seconds), it bounces back and all is happy.  The client pushes a bunch
of data for a while, then the window starts shutting and then closes
again.
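The stall pattern described above can be pulled out of a capture
mechanically.  What follows is only a minimal sketch, assuming the
advertised window sizes have already been extracted from the tcpdump
output into (timestamp, window) pairs; the sample values are made up
for illustration and are not taken from the real trace.

```python
from typing import List, Tuple


def zero_window_stalls(samples: List[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Return (start, end) times of intervals where the advertised window was 0.

    `samples` is a time-ordered list of (timestamp_seconds, window_bytes).
    A stall starts at the first sample with window 0 and ends at the next
    sample with a non-zero window (or at the last sample if it never reopens).
    """
    stalls = []
    start = None
    for ts, window in samples:
        if window == 0 and start is None:
            start = ts                      # window just closed
        elif window > 0 and start is not None:
            stalls.append((start, ts))      # window reopened
            start = None
    if start is not None:                   # still closed at end of capture
        stalls.append((start, samples[-1][0]))
    return stalls


# Hypothetical samples: the window shrinks from 64240 to 0, then recovers.
example = [
    (0.0, 64240), (1.0, 32120), (2.0, 8760), (3.0, 0),
    (9.0, 0), (12.0, 64240), (20.0, 0), (21.0, 16384),
]

for begin, end in zero_window_stalls(example):
    print("window closed for %.1f seconds" % (end - begin))
```

Long or frequent stalls point at the receiver (here, the backup server)
not draining its socket buffer, which fits the suspicion that the
bottleneck is downstream of the network.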
We did a test where we copied 8MB or so of files across from the
client to the server via NFS and it worked fine.  Never a problem,
never did the TCP window get smaller or go to zero.
This led us to suspect that it's a jukebox/tape drive problem.  So
what we want to do now is run some stress tests on the tape drive,
just writing as much data as fast as we can to the drive and looking
at how the performance holds up over time.  
Are there any packages out there for doing this?  Or should I just end
up timing how long it takes to dd some data to each tape drive and do
the math myself?  
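For the do-it-yourself route, the timing and the math could look
something like the sketch below.  It is not a Legato or Sun tool, just
an illustration: the function names are made up, and the device path
you would pass (for example one of the rmt devices listed earlier) is
your own choice.  The demo at the bottom writes to a temporary file so
it is safe to run anywhere.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, block_size=64 * 1024):
    """Write at least total_bytes to path in fixed-size blocks.

    Returns (bytes_written, elapsed_seconds).  One random block is reused
    so the data is not trivially compressible like a stream of zeros.
    """
    block = os.urandom(block_size)
    written = 0
    start = time.monotonic()
    with open(path, "wb", buffering=0) as dev:   # unbuffered, like dd
        while written < total_bytes:
            written += dev.write(block)
    elapsed = time.monotonic() - start
    return written, elapsed


def kb_per_sec(nbytes, seconds):
    """Convert a byte count and a duration into KB/sec (1 KB = 1024 bytes)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return nbytes / 1024.0 / seconds


if __name__ == "__main__":
    # Demo against a scratch file; substitute a tape device path to test a drive.
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "fake_tape")
        nbytes, secs = timed_write(target, 4 * 1024 * 1024)
        if secs > 0:
            print("%d bytes in %.3f s = %.0f KB/sec" % (nbytes, secs, kb_per_sec(nbytes, secs)))
```

Running the same write repeatedly and logging the KB/sec figure each
time would show whether a drive's rate holds up or falls away from the
roughly 600 KB/sec seen at the start of a backup.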
We've looked at: sar, iostat, proctool, tapeexercise (legato) and top,
but none of them give stats on tape performance, which is what we're
looking for.
-------- networker specifics -------------
So has anyone seen this problem before?  
This morning I came in and found the backups paused again, so this
time I've completely removed the jukebox from the equation, with only
one tape drive hooked up and running along by itself on the bus.  And
_that_ isn't working either!  I'm going insane here trying to find a
solution.  
The next step is to pull the drive completely from the jukebox and put
it into a standalone case with *much* shorter SCSI cables and see how
that works.  
Thanks for any and all help,
John
        John Stoffel - Senior Unix Systems Administrator - Fluent, Inc.
          jfs@fluent.com - http://www.fluent.com - 603-643-2600 x341
                             Kill your Television
------------------------------
Date: Wed, 13 Aug 1997 09:50:40 -0700 (PDT)
From: Jon Burnard <jon.burnard@West.Sun.COM>
Hi John,
The tapes are on their own SCSI bus, not mixed with the disks, right?
This is good. I'm a little confused about mirroring fas and esp
drives. I would imagine that causes a few problems. I do not believe
that you should mix drive types in meta devices, let alone bus
speeds. I'd try metadetaching the slow half of the mirrors and see if
that helps.
-- 
+------------------------------------------------+
| Jon Burnard | jon@west.sun.com | 619 625 3749  |
|-------------+------------------+---------------|
| Sys Admin - Sun Microsystems Inc.              |
+------------------------------------------------+
------------------------------
Date: Wed, 13 Aug 1997 17:59:36 +0100 (BST)
From: Colin Johnston <colinj@uk.psi.com>
Hi John,
You might like to try the following: make sure the tape drive is on
its own dedicated SCSI bus, and make sure you increase the timeout
values.
I had a constant problem with a Sun DLT4000 hanging the system until a
separate SCSI card was used, and since then no problems.
Hope this helps :)
Colin
------------------------------
Date: Wed, 13 Aug 1997 13:05:27 -0500 (CDT)
From: www@cnp.tea.state.tx.us (William White)
John,
Our environment is different from yours, AIX server, but at the same
level of NetWorker.  We were having problems with miserable tape
performance and opened a problem with Legato.  They told us that our
jukebox (ATL 120) had not been tested with the new version of
NetWorker's drivers.  They had us install the -008 patch level and the
backlevel drivers.  I don't know which part fixed the problem, but I
think it was probably the drivers.  Do you think that could be part of
your problem?  Before changing to the backlevel (4.2.0) drivers, our
full backup and cloning for offsite storage was running almost all
week.  Sometimes it would not finish before the next week's backup was
scheduled to start.  Now it runs in 30 to 35 hours over the weekend.
I know this is a shot in the dark, so I'm sending it to you
privately.  If you figure out your problem, I hope you will post the
solution to the list.
Regards,
www.
-- 
(Signed) William W. White -- www@tenet.edu
(512) 475-3557 fax 463-8320
Texas Education Agency, Systems Support Division 612,
1701 Congress Avenue, Austin, Texas 78701
------------------------------
Date: Wed, 13 Aug 1997 21:06:10 +0300 (Israel Daylight Time)
From: Cyril Plisko <imp@Orbotech.Co.IL>
John,
>Are there any packages out there for doing this?  Or should I just end
>up timing how long it takes to dd some data to each tape drive and do
>the math myself?
Solaris 2.6 (due Aug 18) introduces new iostat capabilities, among
them measuring tape drive performance.
[stuff edited out]
Regards,
Cyril Plisko
------------------------------
Date: Wed, 13 Aug 1997 14:11:34 -0400
From: "Harris, Charles (PA62)" <Charles.Harris@ftw1.honeywell.com>
Hi,
I can't offer you much specific help but, personally, I'd look at the
Fast SCSI/Fast Ethernet card.  I doubt if the card itself or the bus
it's on is being legitimately overwhelmed, but I would suspect that the
Fast Ethernet port (hme0?) has intermittent problems communicating
through the bus.  Hope this doesn't clutter your inbin with useless
speculation.
charlie harris - unix sys and net admin
------------------------------
Date: Thu, 14 Aug 97 11:08:52 GMT
From: Lee Hughes <lhughes@anitesystems.co.uk>
[edited out my message]
Suggestions.
Have a look at netstat, see if you're dropping packets from any of
your interfaces.
Although you might not be overloading the processor (i.e. not running
at 100%), you will be creating a LOT of interrupts, seeing as you're
running SCSI at full speed and your ethernet card at 100 Mbit/s
(that's roughly 10 MB a second!).
Why is NFS working, you may ask?  Well, could be because that uses UDP
instead of TCP.
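As a rough sanity check on those numbers (a back-of-the-envelope sketch,
not a measurement): 100 Mbit/s works out to 12.5 MB/s before any
protocol overhead, so "about 10 MB a second" is a fair working figure.
The tape side uses the 600 KB/sec per drive and the four drives quoted
in the original message; 1 MB is taken as 1000 KB to keep the
arithmetic simple.

```python
def mbit_to_mbyte_per_sec(mbit_per_sec):
    """Convert a link speed in megabits/s to megabytes/s (8 bits per byte)."""
    return mbit_per_sec / 8.0


# Raw capacity of a 100 Mbit/s Ethernet link, before protocol overhead.
link_mb_per_sec = mbit_to_mbyte_per_sec(100)

# Figures from the original message: 4 tape drives at about 600 KB/sec each.
drives = 4
per_drive_kb_per_sec = 600
tape_mb_per_sec = drives * per_drive_kb_per_sec / 1000.0

# How much of the link the tape drives can actually consume.
fraction_of_link = tape_mb_per_sec / link_mb_per_sec

print("link: %.1f MB/s" % link_mb_per_sec)
print("all tape drives: %.1f MB/s" % tape_mb_per_sec)
print("tape demand is %.0f%% of the link" % (fraction_of_link * 100))
```

On these figures the drives can absorb only about a fifth of what the
network can deliver, so a sender on fast Ethernet will be throttled by
the receiver whenever the tapes are the slowest stage.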
Let me know how you get on!
Cheers,
Lee

Lee Hughes
Anite Systems, Space and Defence Division
3rd Floor, DAS House, Quayside, Temple Back
Bristol BS1 6NH, United Kingdom
Tel: + 44 (0) 117 927 7854
Fax: + 44 (0) 117 929 0917
Email: lhughes@anitesystems.co.uk
------------------------------
Date: Thu, 14 Aug 1997 11:24:01 -0700
From: gdh@rational.com (George Harrington)
I've seen lots of odd problems caused by corrupted indices.  I would
suggest shutting down networker and running nsrck -F (this will take a
very long time; if it doesn't complete in a few hours, you may need to
kill it and restart nsrck -F again, it will usually succeed on the
second round).  If you can afford to toss any file indexes before
running nsrck, do so, to shorten the run time.  After nsrck completes,
start networker with
nsrd&&nsrexecd&&nsrim -X
This has succeeded for me several times when networker began behaving
erratically.  Also, if you can trace a problem to a specific client or
clients, kill any save processes and nsrexecd on that client and
restart nsrexecd.
gdh
-- 
George Harrington -- System Administrator -- Rational Software
2800 San Tomas Expwy., Santa Clara, CA 95051-0951
gdh@rational.com 408-496-3878 fax 408-496-3636
------------------------------
Date: Fri, 15 Aug 1997 08:33:55 +0200
From: Knut Hellebø <Knut.Hellebo@nho.hydro.com>
Try getting patch #010 and see if this helps.  Lots of things in that
one, maybe something in there that solves your problem ...
-- 
******************************************************************
* Knut Hellebø                             | DAMN GOOD COFFEE !! *
* Norsk Hydro a.s                          | (and hot too)       *
* Phone: +47 55 996870, Fax: +47 55 996342 |                     *
* Cellular Phone: +47 93092402             |                     *
* E-mail: Knut.Hellebo@nho.hydro.com       | Dale Cooper, FBI    *
******************************************************************
------------------------------
Date: Wed, 13 Aug 1997 09:45:20 -0700
From: Tony.Tran@Ebay.Sun.COM (Tony Tran)
John,
Sounds like you have reached the SCSI cable limit of 3 meters, or about
10 ft.  This includes the external cables *and* internal cables.  I
never daisy-chain more than 4 SCSI devices on a SCSI disk controller
because performance will start degrading.  I suggest adding another
SCSI controller or putting the tape drives on another server, etc.
You can put the disks into a standalone case with much shorter SCSI
cables, but remember that the performance will greatly suffer.  I doubt
if you can use this server while Networker is running.
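Checking a bus against that limit is just a matter of adding up every
segment, internal ones included.  A minimal sketch follows; the 3 meter
figure is the one quoted in this message, while the segment names and
lengths are invented for illustration and are not measurements of the
actual jukebox.

```python
SCSI_BUS_LIMIT_M = 3.0  # limit quoted in the message above


def total_bus_length(segments_m):
    """Sum all cable segments (external and internal), in meters."""
    return sum(segments_m.values())


def over_limit(segments_m, limit_m=SCSI_BUS_LIMIT_M):
    """True if the summed cable length exceeds the limit."""
    return total_bus_length(segments_m) > limit_m


# Hypothetical segment lengths, for illustration only.
example = {
    "host adapter to jukebox": 1.8,
    "cabling inside jukebox": 1.5,
    "jukebox to terminator": 0.3,
}

print("total: %.1f m, over limit: %s" % (total_bus_length(example), over_limit(example)))
```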
You are a good candidate to buy a Sparc Storage Array (SSA).
Good luck, John
Tony Tran
------------------------------
End of this Digest ******************