summary of 4.1 bad traps

From: Hal Stern - Consultant (halstern@sun.com)
Date: Wed Oct 17 1990 - 10:34:21 CDT


a while ago, i solicited input from people who were seeing bad trap
panics in SunOS 4.1. several months ago (end of june, i think),
i started collecting examples. now i have a summary and some
explanations for general consumption.

a bad trap is exactly what it sounds like - the kernel got a trap
(hardware or software interrupt) that it can't handle. there are
two major flavors: timeouts and data faults. timeouts aren't that
interesting, since they're usually hardware related (out-of-revision
memory boards failing to latch signals for the proper setup and
hold times, for example). data faults are more interesting, because
they are caused by data corruption of one sort or another:
a non-reentrant routine was re-entered, or a null pointer got dereferenced.
in the case of a null pointer, the traceback should show you an address
down in the first two pages of memory.
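
to see why a null pointer dereference lands at a low address: a load of
p->field through a null p touches an address equal to the member's offset
within the structure. a minimal sketch follows. the structure, the function
names and the page size passed in are all made up for illustration and are
not taken from the kernel sources.

```c
#include <stddef.h>

/* hypothetical structure standing in for some kernel data structure */
struct example {
    char pad[600];
    int  field;
};

/*
 * address touched by a load of p->field when p is null:
 * it is simply the offset of the member within the structure.
 */
unsigned long null_deref_addr(void)
{
    return (unsigned long) offsetof(struct example, field);
}

/*
 * does a fault address fall within the first two pages of memory?
 * pagesize is supplied by the caller, since it varies by machine.
 */
int in_first_two_pages(unsigned long addr, unsigned long pagesize)
{
    return addr < 2 * pagesize;
}
```

for example, in_first_two_pages(null_deref_addr(), 4096) is true, where
4096 is only an example page size. a fault address that low points at a
null pointer rather than at random corruption.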

there are 4 classes of bad traps that fell out of the mail i collected.

1. flush_windows()

this is peculiar to sparc machines and is caused by a race condition
in the fork/context switch code. in general, it looks like you either
see this one *a lot* or you don't ever see it. processes that have
huge stacks (eg, lots of local variables in procedure calls or deeply
nested calls) tend to be affected as well as shells that do an explicit
setting of the stack size limit as part of their initialization.

a patch is available.

2. streams (tty) read

if you are seeing "zs?: parity error ignored" messages around the time
of the panic, and panicking in the streams tty code (strq, strread, etc)
then you may be getting stung by the message itself. logging the message
actually drops the kernel priority for a small window of time; if you
have to handle more tty input during that window there is danger of
damaging streams data structures.

a patch should be available soon, although the best fix is to determine
the cause of the parity errors (line noise, poor grounding, device that
leaves half-frames between connect/disconnects, etc).

3. streams (tty) ioctl

i've only seen this in one place, and the user was doing
        while (1) {
                ioctl(0, FIONREAD, &r);
                read(0, buf, r);
        }
with a fire-hose-like input stream containing control characters.
the problem looks like the canonical input processing was nuking
characters as fast as the FIONREAD ioctl() was trying to count
them. the kernel panics somewhere in msgdsize().

fix: if you're reading raw (unprocessed) input from a tty device, turn
off canonical processing or pop the tty modules off of the stream.
use select() instead of FIONREAD unless you absolutely need to know
how many characters are in the stream.
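
a rough sketch of that fix, written against the POSIX termios and select()
interfaces rather than anything specific to 4.1. the function names
set_noncanonical() and read_when_ready() are made up for illustration.
the idea is to switch off canonical processing once, then let select() say
when there is input and let read() report how much it actually got,
instead of counting characters with FIONREAD first.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/select.h>
#include <termios.h>
#include <unistd.h>

/*
 * turn off canonical (line-at-a-time) input processing on a tty.
 * returns 0 on success, -1 on failure (for example, fd is not a tty).
 */
int set_noncanonical(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) < 0)
        return -1;
    t.c_lflag &= ~ICANON;   /* no line editing of the input */
    t.c_cc[VMIN] = 1;       /* read() returns once 1 byte is available */
    t.c_cc[VTIME] = 0;      /* no inter-character timer */
    return tcsetattr(fd, TCSANOW, &t);
}

/*
 * wait up to timeout_sec seconds for fd to become readable, then read
 * whatever is there (at most len bytes).
 * returns the byte count from read(), 0 on timeout, -1 on error.
 */
ssize_t read_when_ready(int fd, char *buf, size_t len, int timeout_sec)
{
    fd_set rfds;
    struct timeval tv;
    int n;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (n <= 0)
        return n;           /* 0 = timed out, -1 = select() failed */
    return read(fd, buf, len);
}
```

a caller would do set_noncanonical(0) once at startup and then loop on
read_when_ready(0, buf, sizeof buf, 1), using the return value as the
character count. note that a return of 0 can also mean end of file from
read(), so a real program would want to tell the two cases apart.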

4. ifconfig on non-ethernet device

the sunos 4.1 ifconfig has a very neat feature: it will display
the ethernet (MAC) address of a network interface. if you ifconfig
a non-ethernet device (eg, sync serial line), this may panic the
system *if* you have the NIT device present (ifconfig uses the NIT
interface to glean the address info).

fix: remove /dev/nit or take "options NIT" out of the kernel configuration.
if you're booting diskless clients, you can't do this: rarpd requires
the NIT device to be present. just say no to "ifconfig ifd0"

--hal stern
  sun microsystems
  northeast area consulting group
  halstern@sun.com


