SUMMARY: Sparc 10 Network Freeze

From: David Moline (drm@gcs.oz.au)
Date: Tue Apr 04 1995 - 19:18:49 CDT


Dear Sun-Managers

        I received a number of replies to my problem. They fell into
three categories, 1) use ifconfig to reset the interface, 2) apply SunOS
patch 101954-06, and 3) check out the physical/electrical side.

        As it turned out the main culprit was a faulty thin ethernet
transceiver. This transceiver would quite happily work when
everything on the network was good. However, when there was a network
problem, the transceiver would not detect and signal this back to the
Sun (that is once the network was broken, the Sun did not pick this fact
up and start spewing messages like "le0: No carrier - transceiver cable
problem?"). This explains the CPU load going through the roof, as the
Sun thinks it can still (and will) transmit packets. Once the faulty
transceiver was replaced, the problem went away, the Sun being able to
recover properly after an extended network outage.
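
A rough way to tell the two cases apart after the fact is to check whether the kernel ever logged the carrier warning during an outage. The sketch below is only an illustration: the function name is made up, and the log path in the usage note (/var/adm/messages) is an assumption about where syslog writes on the machine in question.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) if the given log file contains the le0 carrier
# warning quoted in this thread, fail otherwise.
# has_carrier_warning is a hypothetical helper name, not a system command.

has_carrier_warning() {
    # -i because the thread quotes the message as both "No carrier" and
    # "no carrier"; output is discarded, only the exit status matters.
    grep -i 'le0: no carrier' "$1" > /dev/null 2>&1
}
```

For example, `has_carrier_warning /var/adm/messages && echo "transceiver reported the outage"` prints the note only when the warning is present. A machine that froze without ever logging the warning would match the faulty-transceiver behaviour described above, though a missing message is a hint, not proof.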

        The ifconfig solution sounds nice, but unless there was a root
shell waiting for this problem to occur it is unlikely that it could be
tested. Finally, if we had kept the OS at SunOS 4.1.3_U1, I would have
applied and tried the patch 101954, however this is supposedly obsoleted
by SunOS 4.1.4 and there is no similar patch.

        Thank you again for your help. Original question and replies
follow.

Regards

---
David Moline (drm@gcs.oz.au) Graphics Computer Systems Pty Ltd, Australia
Ph: +61-3-888-8522   Fax: +61-3-808-9151
-----------------------------------
Original Question:
> I have encountered a problem with a number of Sparc 10 clones using
> both hyperSPARC and SuperSPARC cpus, and three OBP revisions, and
> running SunOS 4.1.3_U1 and 4.1.4 connected to a thin wire network via an
> AUI to thin transceiver via the Audio/AUI port.
> 
> 	The problem manifests itself when the network is disrupted for
> a period of time (for instance simply by disconnecting the network
> connection on a running machine). The machine has typically had a
> couple of xterminals running off of it doing fairly intensive work both
> X display and NFS requests from the machine to a central server as
> well as running a couple of apps (eg Interleaf) locally. While trying
> to investigate the problem, I have swapped in a number of different
> motherboards, and CPU modules with no real fix to the problem. On the
> console, running the performance meter (showing at least CPU, load and
> packets), the problem shows itself after the network is disrupted with
> the CPU and packet meters dropping down to zero, while the load meter
> starts to climb. After about 2 minutes of network disruption, the
> machine does not recover at all. It does not seem to send out any
> packets at all indicating (to me) that the network interface has
> frozen. On a couple of similarly configured workstations, the machine
> quite happily recovers very soon after the network is reconnected.
> About the only difference I have noted is that the machines that work
> will send messages to the console indicating network problems (eg le0:
> no carrier - transceiver cable problem?).
> 
> 	Having tried multiple boards and CPU modules, I do not believe
> it to be a hardware problem (but have not ruled that possibility out),
> but some kind of revision related problem. One combination that has
> continued to work since installation (and has been unchanged because of
> this) is a Sparc 10, with a dual 67MHz hypersparc CPU, OBP revision 2.8
> and running SunOS 4.1.4 without patches. Although this combination has
> failed on another workstation setup.

---------------------
Replies:

From: Rahul Dhesi <dhesi@rahul.net>
-----
I sent this to somebody, don't remember if it was you or somebody else making the same inquiry. I run the script given below from cron at intervals. This unfreezes the machine. I see these freezes on SSS-2 and SS-10 machines running SunOS 4.1.3_U1.

Rahul Dhesi <dhesi@rahul.net>

#! /bin/sh
# if pings fail, use ifconfig to bring le0 back up
ra=192.160.13.203
ifconfig=/usr/etc/ifconfig
ping=/usr/etc/ping
DOIT=eval

if $ping $ra 10 > /dev/null 2>&1; then
   :
else
   echo reset le0
   $DOIT $ifconfig le0 down
   $DOIT sleep 5
   $DOIT $ifconfig le0 up
   $DOIT sleep 10
   $DOIT $ifconfig le0 down
   $DOIT sleep 5
   $DOIT $ifconfig le0 up
fi
# == END ==

-------------------
From: mattias@txc.com (Mattias Zhabinskiy)
----
I had similar problems, but very rarely, about once in 2 months, and always when the workstation was heavily loaded. Sun support told me to install patch 101954 (readme file attached below). I'm not sure if it helped (it's too early to draw a conclusion), but I haven't had any problems since then. One more thing: I'm running SunOS 4.1.3_U1 and this patch is for SunOS 4.1.3_U1, but maybe they have a patch for SunOS 4.1.4 too.

------------------------------
From: mammino@roch803.mc.xerox.com (Joe Mammino)
----

I ran into the same problem with the net hanging when trying to ftp or backup large files (greater than 200 M) over the e-net.

I am running a Sparc10, 10-baseT, SunOS 4.1.3_U1.

I applied patch 101954-04...even though the readme states this is for the AUI interface, it did fix the problem -- the sun is using the 10-baseT interface.

Included below is the readme and attached is a compressed tar of the patch.

----------------------------------
From: don@alaska.opensys.COM (Don Lenamond)
----
If you have already tried replacing the system board on the machine that seems to hang on its network interface, then I believe you have eliminated the network controller/interface as being the culprit.

My next step is determining whether your problem is software (OS) related or a problem of your network physical layer.

I'd tackle the latter first. One simple but effective way to diagnose the physical layer characteristics of the troublesome system is to halt the system, and run the PROM level "net test" from the "ok" prompt. If you are having physical layer problems, the first test (loopback) should pass, but the external (second) test should fail. If the second test fails, first change the AUI cable and run the test again; if the results don't change, try switching transceivers. After this, if you still get failures on the external test, then you have a network cable problem.

On the other hand, if the "net test" results in failures on the internal (loopback) interface, then your system board/interface is more than likely faulty.
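
The decision procedure Don describes can be written down as a small lookup. This is only a sketch of the reasoning in this reply: the function name and the pass/fail argument convention are invented for illustration, the results still have to be read off the PROM test by hand, and the "both pass" branch is an inference from his suggestion to look at software next.

```shell
#!/bin/sh
# Sketch: map the two "net test" results to the next step suggested above.
# Usage: diagnose_net_test <loopback: pass|fail> <external: pass|fail>
# diagnose_net_test is a hypothetical helper, not a PROM or SunOS command.

diagnose_net_test() {
    loopback=$1
    external=$2
    if [ "$loopback" = fail ]; then
        # Internal loopback failing points at the machine itself.
        echo "system board/interface: more than likely faulty"
    elif [ "$external" = fail ]; then
        # Loopback passes, external fails: work outward along the wire.
        echo "physical layer: swap AUI cable, then transceiver, then suspect the network cable"
    else
        # Both pass (assumed branch): hardware looks fine at this level.
        echo "software: look at networking-related patches"
    fi
}
```

For instance, `diagnose_net_test pass fail` prints the physical-layer line, which is the case that matches a bad transceiver or cable.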

After this, I would look at any networking-related software patches that may apply. One that comes to mind with the mention of X-terms is the inetd patch. This usually shows itself on systems that are relatively fast, say above a SPARC 1. Look at patch-id 101618 (inetd) and the NFS Jumbo patch 100173.

My guess is you are encountering a physical layer problem with your network. Get rid of the coax and move to 10Base-T in the future if possible.

-------------------------
From: charles.mengel@lgi.com (Charles Mengel)
----
I had a similar problem in a very similar network setup - an SS10 w/ 8 Sun Xterms on thinnet. The adapter off of the back of the SS10 has a very poor physical connection. This may also be true of your clones. We replaced it and the problems went away.

---------------------------
From: whj@cs.washington.edu (Warren Jessop)
----
I saw your message (repeated below) a few weeks ago and have waited for the summary. I don't have the answer, but we have a Sun SS10/30 with 4 HyperSparc 90MHz cpus that has been exhibiting a similar type of network freeze. It's happened at least six times since the HyperSparc installation in January. The machine had been performing flawlessly since early 1993.

I haven't really noticed what precipitates the problem, but the effect is as if the net is dead, with messages on the console about inability to reach remote file servers and such. Someone is always logged into the console at the time, running X over the net, so I can't log in as root. I plan to try two things: 1) make a separate console on one of the serial ports, so I'll have a fighting chance of logging in as root and trying ifconfig, as suggested in Bob Kupiec's message, and 2) using the TP port instead of the AUI, when we get a HUB in the room where the system sits.
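
Since the ifconfig reset is hard to try once a machine has already frozen, it may help to rehearse it beforehand. The sketch below borrows the DOIT idiom from Rahul Dhesi's script but defaults to a dry run: with DOIT=echo the commands are only printed, and with DOIT=eval they are executed (which needs root). The function name, the IFCONFIG variable, and the dry-run default are assumptions added for illustration, not part of any reply.

```shell
#!/bin/sh
# Sketch: cycle a network interface down and up, dry-run by default.
# DOIT=echo prints each command; DOIT=eval actually runs it (as root).
DOIT=${DOIT:-echo}
IFCONFIG=${IFCONFIG:-/usr/etc/ifconfig}   # path taken from the script in the replies

# reset_iface is a hypothetical helper name.
reset_iface() {
    $DOIT $IFCONFIG "$1" down
    $DOIT sleep 5
    $DOIT $IFCONFIG "$1" up
}
```

Calling `reset_iface le0` with the defaults prints the three commands without touching the interface; setting `DOIT=eval` first performs the real reset.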


