Hi all. Sorry about the delayed summary - I was waiting to see if any further
clues would turn up. The question I asked was:
> We have a pretty serious problem here - two SS2s which will not
> reliably talk to the network.
>
> This subnet has on it two SS2s, 5 SS10s and about 100 X-terminals. Three of
> the SS10s are also on an ATM network. About two or three months ago, the
> SS2s started to exhibit _very_ large delays when talking to the network,
> which have continued to this day. It does not seem to be related to high
> collision or error rates.
>
> Some of the effects of this problem are: an exposed xterm running on one of
> them can sometimes take 3 or more minutes to redraw itself, and remote logins
> will sometimes freeze up. However, at other times the response is in the
> order of a few seconds. Often a process which tries to generate a lot of
> network traffic will freeze up quite quickly, and not respond again for
> several minutes.
> Importantly, when one process freezes up, other processes on the same
> machine can still use the network. Running netstat still shows quite a lot
> of traffic, and network logins are still accepted. Tracing the processes that
> freeze shows they are blocked waiting to write to the network. Netstat does
> not report any unusual statistics or number of errors.
Well, I haven't got any closer to solving this one, though it does seem to be
going away by itself to some extent.
gibian%typhoon@stars1.HANSCOM.AF.MIL (Marc Gibian) sent in a good analysis
of the problem based on his experiences, and said (in part):
> This sure sounds like a collision protocol idiosyncrasy I heard
> about. Apparently the low-level collision retry algorithm causes
> problems for certain combinations of SPARCstations. I seem to remember
> that SPARCstation 10s (and 20s) retry more quickly than SPARCstation
> 2s. This causes the 10s to become favored and the 2s to get locked
> out. The real solution is to reduce the traffic load on the particular
> LAN segment(s) on which the 2s reside.
In our case, we were in the process of reducing the traffic when the problem
started, but it still seems like the most likely scenario - perhaps the
change in traffic loads set it off.
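
For anyone unfamiliar with the retry behaviour Marc describes, here is a
rough toy model of it. It assumes the standard Ethernet truncated binary
exponential backoff (the random delay range doubles with each collision, is
capped at 2^10 slots, and the frame is dropped after 16 attempts) and two
hypothetical stations, A and B, which always have a frame queued. It does not
model any timing difference between SS10 and SS2 hardware; it only shows how
a station that has just won starts over with a fresh collision counter while
the loser keeps its larger one, so wins tend to come in runs. The names
backoff_slots and contend are made up for this illustration.

```python
import random

BACKOFF_CAP = 10     # backoff exponent stops growing after 10 collisions
MAX_ATTEMPTS = 16    # a frame is dropped after 16 failed attempts


def backoff_slots(collisions, rng):
    """Random delay, in slot times, after `collisions` collisions on one frame."""
    exponent = min(collisions, BACKOFF_CAP)
    return rng.randrange(2 ** exponent)   # uniform in 0 .. 2**exponent - 1


def contend(frames, rng):
    """Toy model: stations A and B always have a frame queued.

    Every round starts with a collision; the station that draws the
    shorter backoff sends its frame. Returns the list of winners in order.
    """
    collisions = {"A": 0, "B": 0}
    winners = []
    while len(winners) < frames:
        for station in collisions:
            if collisions[station] >= MAX_ATTEMPTS:
                collisions[station] = 0   # frame dropped, next one starts fresh
            collisions[station] += 1
        delay = {s: backoff_slots(n, rng) for s, n in collisions.items()}
        if delay["A"] == delay["B"]:
            continue                      # same slot chosen: they collide again
        winner = min(delay, key=delay.get)
        winners.append(winner)
        collisions[winner] = 0            # winner's next frame has a clean counter
        # The loser keeps its counter, so its next delay is drawn from a
        # wider range and it is more likely to lose again.
    return winners
```

Running something like print("".join(contend(40, random.Random()))) prints a
string of As and Bs; long runs of the same letter are the lock-out effect in
miniature. Real hardware differences of the kind Marc recalls would, if
present, skew this further toward one station.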
Anyway, thanks to the following people (in order of appearance):
Margarita Suarez <marg@columbia.edu>
sdr@rdga3.att.com (S. D. Raffensberger 500622500 (RD))
gibian%typhoon@stars1.HANSCOM.AF.MIL (Marc Gibian)
David Hawes <dhawes@dcs.qmw.ac.uk>
-Matthew
+--------------------------------------------------------------------------+
| Matthew Donaldson Email: matthew@cs.adelaide.edu.au |
| Computer Science Department Phone: +61 8 303 5583 _ |
| University of Adelaide Fax: +61 8 303 4366 John / \/ |
| South Australia 5005 Telex: UNIVAD AA89141 3:16 \_/\ |
+--------------------------------------------------------------------------+