SUMMARY: etherfind and loss of IP packets (final)

From: J M Thompson (masato@access.digex.net)
Date: Sat Oct 09 1993 - 17:20:24 CDT


SINCE THERE IS A LARGE TIME LAG BETWEEN THIS MESSAGE AND THE PREVIOUS
MESSAGE, I HAVE INCLUDED THE PREVIOUS MESSAGE FOR REFERENCE.

It appears that Mr. Stern had the correct view. When the
original problem occurred and during my initial attempts
to recreate the problem, I missed the error message "out of mbufs:
packets dropped". The message can be found on the system console, if
you are close to it when starting etherfind, and in /var/adm/messages.
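
For anyone who wants to check for the same message, here is a minimal sketch of a log scan. The function name count_mbuf_drops is made up for illustration, and the search string is just the leading part of the message quoted above. The exact wording may differ on other releases.

```shell
#!/bin/sh
# Count log lines that report mbuf exhaustion.
# $1: log file to scan; defaults to /var/adm/messages, the file named above.
# count_mbuf_drops is a hypothetical helper name, not a system utility.
count_mbuf_drops() {
    log=${1:-/var/adm/messages}
    # grep -c prints the number of matching lines (0 when there are none)
    grep -c "out of mbufs" "$log"
}
```

Calling count_mbuf_drops with no argument scans /var/adm/messages. Any count above zero means the kernel logged the message at least once. Running netstat -m on the affected host should also show current mbuf usage, which may help confirm a shortage while it is happening.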

( I must have violated the principle that says the probability of
finding a critical error message is inversely proportional to the
number of hours since sleeping :-) )

After searching SunSolve and conferring with the Support Center, patch
100835-02 was identified as the likely candidate.

After applying the patch, the dropped-packet problem no longer
appears when we run etherfind.
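
One way to confirm a fix like this is to rerun the ping loop from the earlier message (quoted below) and scan file.out for summary lines showing non-zero loss. The following is a rough sketch only. The function name report_loss is hypothetical, and it assumes ping summary lines of the form "20 packets transmitted, 18 packets received, 10% packet loss".

```shell
#!/bin/sh
# Print each ping summary line whose packet-loss percentage is above zero.
# $1: file collected by the ping loop (file.out in the quoted message).
# report_loss is a hypothetical helper; the summary-line format is an assumption.
report_loss() {
    awk '/packet loss/ {
        # look for a field such as "10%" and test its numeric value
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+%$/ && $i + 0 > 0) { print; break }
    }' "$1"
}
```

If report_loss file.out prints nothing after a long run, no ping batch in that file reported loss.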

THE REMAINDER OF THIS MESSAGE CONTAINS THE PREVIOUS EMAIL ON THE TOPIC.

>
>The jury is still out regarding this problem :-( But I at least wanted
>to summarize what I have so far. This message is divided into the
>following sections: original problem description, responses received to date,
>some new information, and acknowledgements.
>
>
>ORIGINAL DESCRIPTION
>
>>I am trying to determine if an intermittent loss of IP packets
>>could be caused by running the etherfind utility.
>>
>>The situation that occurred today is that shortly after starting
>>an etherfind trace, we began experiencing intermittent loss of
>>IP packets to and from the system that had the etherfind
>>trace running. etherfind was started as follows:
>>
>>etherfind -i tr0 -v -x -l 256 -t \
>>between hostname1 and ipaddrhost2 >file.out
>>
>>The external symptoms included:
>>
>>o Excessively long response time for applications running on the
>> host with the etherfind trace.
>>
>>o ping commands to/from this system reported from 20 to 50 percent
>> of the packets being lost. Example of the ping command is
>>
>> ping -s another.host 1000
>>
>>o The intermittent IP packet loss problem continued even after
>> terminating the etherfind trace.
>>
>>The system configuration is SunOS 4.1.3 on a SUN690, single processor,
>>with a token ring interface.
>>
>>At our wits' end, we rebooted the SUN690 and the problem went away.
>>
>>To further confuse the issue, we had run an etherfind trace on a DIFFERENT
>>SUN690 without incident earlier in the day.
>>
>>Any help would be appreciated.
>
>
>SUMMARY OF RESPONSES
>
>Joel Shandelman writes:
>
>>Although not documented [to my knowledge], it is recommended that the
>>workstation acting as the sniffer/scope not monitor its own interface.
>>Sun Advanced Admin concurs with this as well. This doesn't explain very
>>well why the problem cleared up after a reboot but it still makes sense
>>that a sniffer/scope shouldn't monitor its own actions.
>
>Sumner K Hushing III writes:
>
>>etherfind opens the interface in promiscuous mode, which grabs any old
>>transaction that comes by. My experience with etherfind is that you
>>must use it on a system other than the one you are debugging, since
>>it will indeed affect operations. I'm surprised you had to reboot
>>to recover, though. My 4.1.3 Sparc10's would recover as soon as I
>>stopped etherfind.
>
>In the situation described by the original posting, I was having
>etherfind monitor its own interface. But I was able to also recreate
>the symptoms of the problem by running etherfind on a third
>box monitoring the traffic of two other boxes. (see NEW INFORMATION
>section for more details)
>
>Mike Raffety writes:
>
>>It CAN ... if the host is already fairly busy, and/or there's a LOT of
>>traffic for etherfind to capture.
>
>I can't be certain in the case of the original problem, but in the
>work to recreate the problem in a controlled environment, I was able
>to get the problem to reappear while monitoring ping traffic between
>two systems and at the same time invoking a telnet session from a PC.
>Other than the usual system processes, etherfind was the only process
>running on the system that was functioning as the monitor. And I don't
>think 'ping -s hostname 1000 20' repeated after a three-second delay
>and a single telnet session is that heavy a load. (see NEW INFORMATION
>section for additional details)
>
>Hal Stern writes:
>
>>you may be exhausting some kernel buffers when running
>>etherfind, and when you're done the system doesn't
>>recover because the buffers are leaked. check out
>>the various patches for leaking mbufs and exhausting
>>kernel memory on a 600MP.
>>
>>the fact that the problem is corrected after booting
>>makes it appear that you're running out of mbufs.
>>you run out, you start to drop packets. you'll
>>use a *ton* of mbufs running etherfind because
>>it goes and grabs every packet that it can
>
>I did searches at the WAIS server located at quake.think.com and reviewed
>the INDEX file for /pub/sun-info/sun-fixes located at sunsite.unc.edu
>and found references to the types of problems described above. But in
>the write-ups I found, I should have also encountered mbuf shortage
>messages written to the console or a panic situation. I did not
>encounter either symptom in the original problem or in the attempts
>to recreate it in a controlled environment.
>
>
>NEW INFORMATION
>
>Since the original problem occurred in the production environment,
>work to recreate the problem has been carried out in a separate
>environment for debugging purposes. In this environment I have
>
>SysA - Sparc10, SunOS 4.1.3, token ring
>SysB - Sparc10, SunOS 4.1.3, token ring
>SysC - Sparc2, SunOS 4.1.3, token ring
>PC1 - Compaq 486, MS/DOS 5.0, Windows 3.1, FTP, Inc. TCP/IP support,
> token ring
>
>All of the above systems are on the same 16 Mbps token ring subnet.
>
>I can get the problem symptoms to consistently reappear by doing
>the following:
>
>Execute the following on SysB
>
> while :
> do
> date | tee -a file.out
> ping -s SysC 1000 20 | grep "packet loss" | tee -a file.out
> sleep 3
> done
>
>Start etherfind on SysA as follows:
>
> etherfind -i tr0 -v -x -l 256 -t between SysB SysC >trace.file &
>
>AND start a telnet session from PC1 to SysB. Within 5 to 20 seconds of
>issuing the telnet command, the ping command begins to report dropped
>packets and response at the telnet session becomes poor.
>Terminating etherfind does not clear the problem. As soon as
>I reboot SysA, the ping command stops experiencing dropped packets.
>
>To make it even more interesting, if instead of starting a telnet
>session from PC1 to SysB, I start a telnet session from SysC to
>SysB, *nothing happens*. The problem symptoms do not appear.
>
>If I only run etherfind I do not experience any problems. I use
>the FTP, Inc. TCP/IP support software daily without any problems. The
>problem seems to arise only when both are active. I am
>now pursuing the problem with the respective software vendors.
>
>
>ACKNOWLEDGEMENTS
>
>I'd like to thank Joel Shandelman, Sumner K Hushing III, Mike Raffety
>and Hal Stern for their responses.
>

-- 
Jim Thompson                     
email: masato@access.digex.net
daytime phone: 703-759-8252    


