SUMMARY -- nres_gethostbyaddr/email & hopeless NFS, ftp's puzzler

From: George Planansky (george@tusk.med.harvard.EDU)
Date: Sun Nov 01 1992 - 22:31:00 CST


The original question:

For several days I have been getting these error messages on my Sun
SLC running SunOS 4.1.1 (with NIS, Makefile B=-b) when mail comes in:

syslog: nres_gethostbyaddr: ra.mcs.anl.gov != 140.221.10.118
syslog: nres_gethostbyaddr: nic.near.net != 192.52.71.4

But nslookup finds that these names do have these addresses.

  What does this mean? Could it be a clue to the following network
  problem, which is hurting our work and has done a number on
  net-news?

  We have lately had severe problems with file transfers with our Suns,
  via both ftp and NFS operations, to hosts that are on the other side
  of one or two routers (not under our control). Anything but the
  shortest file takes forever. PCs running NCSA ftp do just fine at
  the same time to the same machines. There's some evidence that the
  network is losing big packets, but small packets get through.

I. ANSWERS -- nres_gethostbyaddr

This is a known bug.

--------------
From the FAQ, #12:
12) What does the "nres_gethostbyaddr !=" error mean?
        This message is from "ypserv" and has been determined to be
        "harmless" (bug #1039839). Sun supplied a patch 100141-01 to
        quiet it, but the patched version appears to die silently at
        random times, so Sun now has a new patch, 100141-02.

---------------
You need to install patch 100141 on your DNS server(s); that takes
care of the problem.
Here is the patch README, which explains it.

===============================================================
Patch-ID# 100141-03
Keywords: NIS DNS nres_gethostbyaddr messages console
Synopsis: SunOS 4.1,4.1.1: nres_gethostbyaddr logs misleading messages to console
Date: 10-Oct-91

SunOS release: 4.1, 4.1.1

Unbundled Product:

Unbundled Release:

BugId's fixed with this patch: 1039839

Architectures for which this patch is available: sun3, sun4

Obsoleted by: 4.1.2

Problem Description:

        DNS used in conjunction with NIS may generate syslog messages
        to the console something like :
        nres_gethostbyaddr: some.name.org != its.correct.IP.addr

Install:

As root and for the correct architecture directory.

example: sun3

/usr/etc
mv /usr/etc/ypserv /usr/etc/ypserv.orig

#copy the new version to /usr/etc

cp sun3/ypserv /usr/etc/ypserv

chown root /usr/etc/ypserv
chmod 755 /usr/etc/ypserv

kill the ypserv and ypbind processes and restart them with
/usr/etc/ypbind &
/usr/etc/ypserv &
=======================================================

Another possibility to check:

----------------
This happened to us. We found that the couple of cases where it was
happening were caused by a missing trailing "." (period) in the forward
nameserver database (/var/named/named.hosts).

I.e., the entries should have had the name "ra.mcs.anl.gov."
but instead they had "ra.mcs.anl.gov"
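
The effect of the missing period can be sketched in a few lines. In a
BIND-style zone file, a name that does not end in "." is treated as
relative and has the zone origin appended. The Python below is only an
illustration of that rule, not the nameserver's code; the function name
and the origin "mcs.anl.gov." are assumptions made for the example.

```python
def qualify(name: str, origin: str) -> str:
    """Return the name a zone-file parser would record for `name`.

    A name ending in "." is absolute and is kept as written. Any other
    name is relative, so the zone origin is appended to it.
    """
    if name.endswith("."):
        return name
    return f"{name}.{origin}"


# Assumed origin, chosen only for illustration.
ORIGIN = "mcs.anl.gov."

print(qualify("ra.mcs.anl.gov.", ORIGIN))  # absolute name, left unchanged
print(qualify("ra.mcs.anl.gov", ORIGIN))   # relative name, origin appended
```

Under that assumed origin the second entry ends up with the domain
appended twice, so it no longer matches the name that the address
lookup returns.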

II. ANSWERS -- hung file transfers

These were caused by a bad fiber transceiver on a gateway Cisco
router, coupled with overlong incremental "back-off" delays in the BSD
TCP/IP code. Here are several replies that suggest alternatives to
check, plus a longer description of how the offending transceiver was
run to ground, mostly through Tim Baum's efforts here at HMS.

------------
Could it be that you are using PCroute, or a PC or the like, as a
router? Those usually can't keep up with a train of six 1.5-kbyte
UDP NFS packets.

------------
Losing your big packets is the problem. The PCs probably have a small
MTU, so they work fine, but the Suns have an MTU of about 1000 or
more. You need to plead with or nag your upstream sites to check that
their routers are configured properly.

------------
It's a timing parameter in Unix that says to delay 1 to 5 seconds
after a failure before retrying, which is ungodly. Try something in
milliseconds. I got this from a friend of mine but haven't tried
it here myself yet.

--------------
As far as your network problem goes, it sounds more like a hardware
problem with the network. I observed problems kind of like that, and it
turned out to be a loose transceiver.

----------------
How the transceiver was located:

We'd found that workstation NFS and ftp transfers of any but very
small files between several local HMS networks and the outside world
were hanging, as was rcp -- but NCSA ftp transfers from a PC
did not hang. Also, transfers in the other direction did not
hang. Traceroute showed the affected networks had a particular
router in common. However, the regular router performance data
that the network group was following showed no network problems
whatsoever.

Running etherfind during ftp transfers showed that Sun ftp and
NCSA ftp acted differently:

With Sun ftp on both ends, the transaction looks like

   send a bunch of packets
   acknowledge
   send a bunch of packets
   acknowledge

With NCSA on the receiving end, the transaction is

   send ONE packet
   acknowledge
   send ONE packet
   acknowledge

Packet loss occurred only when several packets were sent together; the
second packet usually was lost. But with NCSA telling the remote ftpd
to send only one packet at a time, every packet got through.

Using 'spray' to control packet size and delays, etherfind showed:

   (1) spray -c 10 -l 512 destination-host
       would always have a packet loss of around 50%
   (2) If I inserted any delay at all into the spray command
       (even -d 1, that's 1 microsecond) there would be no packet loss.
       So the packet loss really was dependent on the packets arriving
       in a burst.
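
The burst dependence can be mimicked with a toy model. The sketch below
is NOT a description of how the transceiver actually failed; it simply
assumes a device that loses a packet when it arrives less than some
minimum gap after a packet that was forwarded. The function names, the
1-microsecond gap, and the drop rule are all assumptions chosen to
reproduce the two observations above.

```python
def deliver(arrival_times_us, min_gap_us):
    """Return one boolean per packet: True if forwarded, False if lost.

    Assumed rule: a packet is lost when it arrives less than min_gap_us
    after the previous packet AND that previous packet was forwarded.
    """
    results = []
    prev_time = None
    prev_forwarded = False
    for t in arrival_times_us:
        too_close = prev_time is not None and (t - prev_time) < min_gap_us
        forwarded = not (too_close and prev_forwarded)
        results.append(forwarded)
        prev_time, prev_forwarded = t, forwarded
    return results


def loss_rate(count, spacing_us, min_gap_us):
    """Fraction of `count` evenly spaced packets that the model loses."""
    times = [i * spacing_us for i in range(count)]
    return deliver(times, min_gap_us).count(False) / count


# Ten packets back to back, then ten packets spaced 1 microsecond apart.
print(loss_rate(count=10, spacing_us=0, min_gap_us=1))
print(loss_rate(count=10, spacing_us=1, min_gap_us=1))
```

With these assumed values the back-to-back burst loses every second
packet, and the spaced stream loses none, which is the same shape as
the spray results.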

The offending fiber transceiver was identified by unplugging subnets
from the gateway Cisco routers involved and monitoring the result.
Strangely, the sub-networks *directly* attached to the router with the
bad transceiver had not reported file transfer problems:

[Network diagram not preserved in this archive. Its labels were
"subnets seemingly unaffected", "subnets with file transfer problems",
and "HMS subnets"; an "x" marked the bad transceiver.]

Discussion -- this bad transceiver had resulted in the following
symptoms, some of which had seemed contradictory:

   (1) Slow data transfer on UNIX workstations, but not on a
       portable PC used for network diagnostics.
   (2) No problem with ping even with large packet size.
   (3) No problem with NCSA ftp.
   (4) No problem with far-away networks (e.g. utexas.edu, ftp.uu.net).
   (5) No problem with spray (default settings)
   (6) No problem with outbound traffic, only inbound.

(2) (3) and (4) were explained by the timing of the
sequence of packets: 'ping' always inserts a 1-second delay between
packets; NCSA ftp inserted enough of a delay between packets to avoid
them being trashed; and the guess is that the same was true of transfers
from far-away networks.

(5) was explained because the problem doesn't occur with the default
packet size for 'spray' (it does show up with larger ones).

(1) is due to the Berkeley TCP/IP code, which apparently
has an algorithm to back off before retrying a lost transmission,
to avoid overloading a congested network. Each time a packet is lost,
the TCP/IP code in the kernel waits an incrementally longer delay before
retrying. The problem is that the increment is in SECONDS where it should
be more like milliseconds; so as a file transfer progressed it got to the
point where the delay was maybe 30 seconds between each packet!
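
To see how quickly such a back-off reaches tens of seconds, here is a
small arithmetic sketch. The 1-second starting delay, the doubling
factor, and the 64-second cap are assumptions made for illustration;
they were not read out of the SunOS kernel, and the function name is
made up for this example.

```python
def backoff_delays(initial, factor, cap, retries):
    """Return the wait before each of `retries` consecutive retries.

    The delay starts at `initial` seconds, is multiplied by `factor`
    after every loss, and is never allowed to exceed `cap` seconds.
    """
    delays = []
    delay = initial
    for _ in range(retries):
        delays.append(min(delay, cap))
        delay *= factor
    return delays


# Assumed parameters: start at 1 s, double on each loss, cap at 64 s.
delays = backoff_delays(initial=1.0, factor=2.0, cap=64.0, retries=8)
print(delays)        # per-retry waits, in seconds
print(sum(delays))   # total time spent waiting, in seconds
```

Under these assumptions the sixth consecutive loss already costs a
32-second wait, which is in line with the roughly 30 seconds per packet
seen late in a transfer.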

(6) is unexplained.

THANKS to:

Geert Jan de Groot <geertj@ica.philips.nl>
diekema@jdbbs.mi.org (Jon Diekema)
Anil.Katakam@att.com (Anil Katakam)
phillips@qualcomm.com (Marc Phillips)
ups!glenn@fourx.Aus.Sun.COM (Glenn Satchell)
jose.piquer@dcc.uchile.cl (Jo Piquer)
ft@circus.tv.tek.com (Fereydoun Tavangary)
blymn@awadi.com.AU (Brett Lymn)
bacon@mtu.edu (Jeff Bacon)
FENER@ULNA.BWH.HARVARD.EDU
satmb@gauss.med.harvard.EDU (Tim Baum)
pnta@WARREN.med.harvard.EDU (Patrick Nta)
marquard@bcmp.med.harvard.edu (John L. Marquardt)

