SUMMARY: le0 giving spurious `no carrier' messages

From: Brendan Kehoe (brendan@cs.widener.edu)
Date: Wed Nov 13 1991 - 08:19:46 CST


I wrote:

     I'm pretty sure something's hosed with the system's ethernet
    interface, but I figured I'd take a shot in the dark on the off chance
    it's something obscure. :)
     My server, every 5-10 seconds, complains there's no carrier on its
    le0 ethernet, but continues to work fine (as do its clients, but with
    some periodic delays).
     I've checked nearly every cable in the state, and it seems to be
    okay; is my diagnosis above right, or is there something else I should
    check?

Answers came from:

        "Ric Anderson" <ric@cs.arizona.edu>
        duke@dsp.analog.com (Kerry Duke)
        holle@ASC.SLB.COM
        shankar@ulysses.att.com
        "Ric Anderson" <ric@cs.arizona.edu>
        brw@hertz.njit.edu (Brian White)
        bukys@cs.rochester.edu
        Joe Schmo (not from Kokomo) <deaes!kocrsw15!swlodin@iuvax.cs.indiana.edu>
        antonson@software.org (Todd S. Antonson)
        claude@genethon.genethon.fr (Claude Scarpelli)
        aldrich@sunrise.stanford.edu (Jeff Aldrich)
        proton!baumann@ucrmath.ucr.edu (Michael Baumann)
        admin@summa4.mv.com (Scott Babb)
        stevens@greece.ctr.columbia.edu (Andy Stevens)
        jimh@nsd.fmc.com (Jim Hendrickson x7348 M233)
        David Wiseman <magi@csd.uwo.ca>
        ian@whistler.sfu.ca (Ian Reddy)
        Byron Rakitzis <byron@archone.tamu.edu>
        Trent MacDougall <trent@cs.dal.ca>
        jfy@cis.ksu.edu (Joseph F. Young)
        darin@kaman.com (Darin S. Lory)
        randy@ncbi.nlm.nih.gov (Rand S. Huntzinger)
        kwthomas@nsslsun.nssl.uoknor.edu (Kevin W. Thomas)
        Peter Smith <peter@sucia.stanford.edu>
        mmzn@cs.rice.edu (Mark Mazina)
        poffen@sj.ate.slb.com (Russ Poffenberger)
        flp@tcs.com (Rick Preston)
        bernards@ecn.nl
        Konradin Stoehr <topaz@uni-augsburg.de>
        don@doug.med.utah.edu (Don Baune 581-6088 MIRL)

The consensus was that either the cable/transceiver was bad, the
network traffic was heavy, or something else on the network was
misbehaving. (In our case, it was a misconfigured bridge.)

Here are a few snippets:

--
From: bien@aero.org

Do you have a sniffer? We had the same problem and it turned out we had an NCD terminal that was spewing garbage to the net. When we fixed that, the errors went away.

--
From: holle@ASC.SLB.COM

You don't say what kind of server. If it's a 4/490, there is a Sun patch for that.

[BK: in our case, it was a Sparc2]

--
From: "Ric Anderson" <ric@cs.arizona.edu>

We get that on our sun4/490 when the network traffic load gets out of hand (like a packet storm). If you have a network monitor, you might want to look at the traffic rate when the error occurs.

This can also be caused by a client which is running an image that was deleted from the server. The client and server get into a "give me the block. There isn't any block" loop. The solution is to kill the offending process on the client or reboot the client.

Etherfind run from a third machine may be useful in detecting this kind of problem.
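
Something along these lines would watch just the traffic between the suspect client and the server (the -between predicate is from memory of the SunOS 4.x etherfind(8c) man page, and "client" and "server" are placeholder host names, so check the syntax against your own copy):

etherfind -between client server

A stuck client shows up as the same NFS request and error reply repeating endlessly between one pair of hosts.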

--
From: jimh@nsd.fmc.com (Jim Hendrickson x7348 M233)

The problem has to be between the transceiver and the ethernet chip (or the chips awfully close to it), assuming an AUI cable anyway. If you use thinnet, it could be beyond your machine.

If you have an external transceiver (thicknet or 10BASE-T), "no carrier" is independent of the coax or twisted pair; it'll be the transceiver, cable, board, or CONNECTORS, often the Sun-to-cable connection.

--
From: David Wiseman <magi@csd.uwo.ca>

Check the actual connection between the transceiver cable and the workstation. We have had a horrible time with ethernet connectors that have one too many lock washers. (ONE is too many.) We've also found that removing the slide connector completely sometimes does wonders for the connectivity of a workstation. The connectors fit together much more tightly when the locking hardware is missing...

--
From: ian@whistler.sfu.ca (Ian Reddy)

If your server is a SPARCstation (as the "le0" implies) and you're using a 10BASE-T connection, I'm not surprised. I've had the same message ever since my workstation went from 10BASE-2 to 10BASE-T (about a year now), and no one seems able to track the problem down. The message appears any time a large amount of traffic is going in or out and, as with you, does not seem to cause any obvious damage.

--
From: jfy@cis.ksu.edu (Joseph F. Young)

You might want to try running "netstat -i 1" and see what load of traffic you are getting, number of collisions, number of errors, etc. when this stuff happens.
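
For reference, "netstat -i 1" prints a running one-second display along these lines (the column layout is from memory of SunOS 4.x, and the figures below are invented). The thing to watch is the colls column: if it is a sizable fraction of output packets, the segment is overloaded or faulty.

            input   (le0)     output             input  (Total)     output
    packets errs  packets errs  colls    packets errs  packets errs  colls
    1042    0     978     0     51       1042    0     978     0     51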

[BK: Our collisions were through the roof; for 4M packets, 1M were collisions.]

--
From: darin@kaman.com (Darin S. Lory)

We (Kaman Sciences Corp.) have a Sun SparcStation 1+ that had the same problem. We had Sun come in and replace the system motherboard and sbus card. Our sparc had two ethernet interfaces, though. I shut down the second interface:

ifconfig le1 down

and it seemed to work. But there was another situation: the Sun was networked off a tertiary network. The primary was a 10Base-2, the secondary was a 10Base-T, and the tertiary network off of the second was a 10Base-2.

The signal could not propagate down to the third network. I replaced the 10Base-2 coax on the tertiary network, and that didn't work. The only solution was to connect it back to the 10Base-T.

I had the equipment sent back to Cabletron, and they sent me new stuff, but that created the same problem. Replaced all the wiring. Nope. I had to accept that the signal was deteriorating so much that it was only randomly making it to the Sun SparcStation.

--
From: randy@ncbi.nlm.nih.gov (Rand S. Huntzinger)

Check your network load. If your network is as heavily loaded as ours, you may find that you get the message when the network load is out of sight, at 25% or more of capacity.

--
From: kwthomas@nsslsun.nssl.uoknor.edu (Kevin W. Thomas)

Every once in a while, I see that on my le1. Disconnecting and reconnecting the cable at the CPU board has always fixed it for us.

--

And, finally, Konradin Stoehr <topaz@uni-augsburg.de> sent this summary from only two months ago (shoot!):

> From: Scott Babb <admin@summa4.mv.com>
> Subject: UPDATE: le0: No carrier...
> Date: Thu, 12 Sep 91 16:40:00 EDT
>
> First, my apologies for the long lead time on this update. I've received several responses, and I wanted to try most of them and, hopefully, post the solution to the problem. Unfortunately, I don't have a solution yet, but I think I'm on the right track. Here's my update of what's happened so far and where I'm going:
>
> BRIEF RECAP OF THE PROBLEM:
> (e-net = Ethernet. xcvr = transceiver.)
>
> Sun 4/330 file server with 2 e-net controllers gives
>
>     le0: No carrier - transceiver cable problem?
>
> messages as network load increases. Messages are only from the CPU board e-net controller, never on the second (ie1) e-net interface board. The 4/330 is serving files to both networks. When the errors show up, network clients get NFS and RPC timeouts, applications crash, etc. The server is connected to 3 Cabletron MRX 10BaseT hubs on the le0 interface and 2 MRX hubs on the ie1 interface. Network load is 10 diskful SS1/1+/2 and 10 PC-NFS machines.
>
> EXPLANATIONS/SUGGESTIONS FROM SUN:
>
> There is too much e-net traffic for the 4/330 to handle. Both of the interfaces are interrupting the CPU and it can't keep up, so packets are lost.
>
> Make sure that the SQE heartbeat on the 10baseT xcvr is off, to reduce network traffic. <<I turned SQE off: no effect. I left it off, because it should be off.>>
>
> Remove all non-essential NFS mounts and use the automounter instead. NFS mounts generate traffic because they periodically verify the mount with the server.
>
> Check to make sure that an individual node isn't generating a large amount of traffic due to a faulty connection, etc. <<netstat -i on all of the workstations didn't show any machine generating a large volume of traffic or collisions. I'm not sure how to deal with the PCs.>>
>
> COMMENTS/SUGGESTIONS FROM NETLAND:
>
> First, thanks to all who responded. It was good to hear that I'm not the only one who has this problem.
>
> Several people said that they were having similar problems, and Sun and/or their service people either didn't believe them or were unable to come up with a solution. From the responses, I gather that all Sun 4 systems exhibit this behavior. People who have seen this on machines ranging from 4/60s through 4/490s are:
> kfoster@orca.wv.tek.com (Ken Foster)
> tim@mdi.com (Tim Rosmus)
> acb@erhs50.ericsson.se (Sander Beenen)
> schoett@informatik.tu-muenchen.de (Oliver Schoett)
> coppins@arch.adelaide.edu.au (Simon Coppins)
> jjr@ace.ece.arizona.edu (Jeff Rodriguez)
> hayes!msieweke@uunet.uu.net (Mike Sieweke)
>
> hj413we@fire1.uni-duisburg.de (H. J. Weber) was getting the same error messages on a 4/60 with two e-net controllers. He found that he had a PC with a bad xcvr.
>
> bennett@keylime.reef.cis.ufl.edu (Paul Bennett) had a similar problem with a branch off a DEC 8-port repeater. He had a coax problem. The DC continuity of the coax looked OK, but a time-domain reflectometer showed a big impedance bump about 23 feet down one of his coax lines. The repeater saw the mismatch as a termination and shut down the loop.
>
> ray@am.dsir.govt.nz (Ray Brownrigg) got the messages continuously until he traced it to a bad card in a multiport repeater. His e-net was ugly on the thin-net side of the xcvr. Swapping to a different segment of e-net cured the problem.
>
> datri@lovecraft.convex.com (Anthony A. Datri) suggests removing the sliding clip and screws from the e-net connector on the back of the machine. He also says that a 4/490 is not necessarily the best solution for a file server. (I think that Auspex has an interesting box, Anthony, but we just became Sun Catalyst members, so the 4/490 is hard to beat on price :-)
>
> pwh@bradley.bradley.edu (Pete Hartman) experiences the same problem under heavy network load. He suggests watching the lights on the xcvrs to see if the load is high or there are lots of collisions. He says that not much can be done about it except to get more bandwidth or to subnet for more isolation.
>
> rpa@dsbc.icl.co.uk (Richard P Almeida) had this problem on several servers connected to a fanout box. The SQE on the xcvrs was turned on. Turning it off reduced the problem, but didn't eliminate it.
>
> todd@macsch.com (Todd Williams) fixed the problem in his SS2 with a motherboard swap. He believes that the message is generated when the software decides that the hardware isn't working fast enough, so it must be at least partly disconnected.
>
> bukys@cs.rochester.edu (???) suggests that it could be a periodic broadcast storm or a problem with an ethernet hub.
>
> THINGS THAT I TRIED:
>
> Our service people (Polaris Service, highly recommended) have swapped the CPU and the ie1 controllers, both to no avail. The service tech and I exchanged the le0 and ie1 feeds and clients at the wiring closets to rule out bad 10BaseT hubs. That didn't fix or move the problem. I've replaced the main thinnet xcvrs (and AUI cables) at the 4/330, also with no change. It doesn't appear to be defective hardware on the network backbones.
>
> Going on the Sun suggestion that the two interfaces are contending for the CPU, I tried moving all of the workstations to the le0 network to place the majority of traffic on that interface. My thinking was that the ie1 interface would be asking for less of the CPU's time, since it only had to deal with PC-NFS machines. The problem didn't get better or worse. This bothers me, because we ran all of the workstations and PCs on the le0 interface before we installed cabling, etc. for the ie1 network, and there was never any problem.
>
> I reconfigured the kernel on the 4/330. I was running the GENERIC 4.1.1 kernel. I created a new one which is GENERIC with 'maxusers' increased to 112, to give me more directory name cache space in the kernel. I also carved out everything that wasn't specific to a 4/330. After I built this kernel, I patched it with adb to increase the size of the buffer cache to 112. These two modifications are suggested in _Tuning the SPARCserver 490 for Optimal NFS Performance_ by Varun Mehta and Rajiv Khemani. The kernel patching script to increase the buffer cache size looks like this:
>
> #!/bin/sh
> adb -w /vmunix << EOF
> nbuf?W 0x70
> EOF
>
> I rebooted the 4/330 with the new kernel. Monitoring the cache stats shows that I am getting a much higher cache hit rate for both caches. Unfortunately, that didn't make the le0 "No carrier" problem go away.
>
> I've spent hours looking at the output from etherfind. I did notice that there were often heavy NFS writes to the 4/330 when the timeouts occurred. NFS packets are 8300 bytes long. The e-net drivers break this up into 6 e-net packets and blast them out at a rate that is *VERY* close to the maximum ethernet spec. I didn't find any particular workstation that was running away with the network. I did think that the large UDP (NFS) packets were placing a major load on the network, so I changed the fstabs on all of the workstations to mount their NFS filesystems with rsize=1024 and wsize=1024, so that the NFS packets are smaller than the maximum e-net packet size. This cut the frequency of the timeouts and error messages by about 75%, but they are still too frequent. The users also complained of slow file access.
>
> I looked at setting up the automounter, but it's a big job and I haven't had a free weekend to do it without impacting the users. I probably will set it up, just to make things easier when I need to modify the network mounts.
>
> I spoke to Dean Griffler (salesman) at Interphase Corporation (214/919-9000, uunet!iphase!griffler). The Interphase NC400 co-processor was mentioned in _Tuning the SPARCserver 490 for Optimal NFS Performance_ as the item which had the greatest positive impact on NFS performance. This is a paraphrase of his explanation of my problem:
>
> The Sun le0 controller is moderately efficient. The problem is with the ie1 controller. The ie1 is rather inefficient. It introduces a fairly heavy load on the CPU at times. When both the le0 and ie1 controllers are trying to serve files out to networks, the CPU can become bogged down with interrupts and start losing packets. Effectively, the ie1 interface slows down the le0 interface. In our situation (the 4/330 is also used for NIS, timesharing, mail, uucp, etc.) the net result could be that the combined useful bandwidth of the two networks is less than that of the le0 interface alone!
>
> Mr. Griffler suggested combining the two networks into a single network and driving it with an NC400 e-net co-processor to keep the performance up. The NC400 offloads much of the NFS processing from the CPU. The NC400 handles the IP, UDP, RPC/XDR and NFS protocols on-board. It also combines the multiple interrupts from an NFS request (6 interrupts per request) into a single CPU filesystem interrupt. According to measurements done at Sun (see Mehta & Khemani, above) the SS490 ie interface is capable of 200 NFSops/sec @ 59 ms/call. The NC400 came in at 290 NFSops/sec (e-net media saturation!) @ 37 ms/call. The NC400 also used about 1/2 of the CPU cycles to achieve equivalent NFS performance, and cut CPU context switches by an order of magnitude. It sounded like an overall win, so I wrote up a requisition for one. I'm waiting for the paperwork to go through.
>
> Interestingly, a footnote in _Tuning..._ states:
>
> "The Sun 452A is best used when there is a need to provide system access to multiple networks for gateway functions, rather than to provide the highest performance interface in a multi-network file server."
>
> Does this sound like an admission that you can't route packets and serve files to two networks with a standard Sun ie interface? Well, maybe...
>
> In Sun's defense, they didn't sell us the 4/330, and they never claimed that it would serve NFS to two networks. A Sun VAR did. Our other dealings with this VAR convinced us that they were "less than optimal", so I guess that I shouldn't be surprised.
>
> WHERE I'M GOING FROM HERE
>
> I'm waiting for the NC400. I'm still interested in any suggestions that netland might have, but the NC400 sounds like my best shot. If I find anything else that helps, then I'll post it. If not, then I'll still let you all know how the NC400 works out. If you have any experience with NC400s, I'm also interested. I hope that the wait and my tendency to run off at the keyboard haven't caused any undue problems ;-) Thanks for all of the advice.
>
> --Scott
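
Two footnotes on Scott's workarounds, for anyone trying the same thing. An /etc/fstab entry with the reduced NFS transfer sizes he describes would look something like this (the server name and paths here are placeholders; the rest is standard SunOS 4.x fstab syntax):

fileserver:/export/home /home nfs rw,hard,rsize=1024,wsize=1024 0 0

And after patching and rebooting, the new nbuf value can be read back from the running kernel to confirm the patch took effect (standard adb usage on SunOS 4.x, to the best of my recollection; 0x70 hex is 112 decimal):

echo "nbuf/D" | adb -k /vmunix /dev/mem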


