SunOS 4.1 multi-user dump causes crashes (RESOLVED!)

From: Fuat C. Baran (fuat@cunixf.cc.columbia.edu)
Date: Thu Aug 09 1990 - 11:29:17 CDT


Summary [you can skip to the end if you already know the story]:

25-May-90:
  Upgraded from SunOS 4.0.1 to SunOS 4.1 on Sun-4/280s (with 1 ALM-II,
  2 Hitachi disks on a Xylogics 451 controller, 1 tape drive on a
  Xylogics 472 controller, and two 8 MB and one 32 MB memory boards).
  During the first post-upgrade multi-user full dump (logins disabled),
  the system crashed with:

    Memory Error Register 1d4<INTR,INTENA,CE_ENA,WBACKERR>
    DVMA=1, context=0, virtual address=fff3cfc0
    pme=0, physical address=fc0
    panic: writeback error
    syncing file system... {at this point it hangs and we have to reset
                             from the cpu board, though in one of the 20
                             or so crashes it saved a core image}

1-Jun-90:
  My first message to sun-spots/sun-managers. Got a few responses
  describing similar occurrences, but no suggested solution worked.

20-Jun-90:
  Frustrated by Sun's lack of responsiveness in looking into the
  problem (hardware support people worked hard, swapping boards,
  building test systems, etc. despite their suspicions that the
  problem was software related), I posted my second message to
  sun-spots/sun-managers, and received even more reports of similar
  problems, including one other site that received a similar brush-off
  ("multi-user dumps aren't supported").

31-Jul-90:
  After repeated calls to Sun and getting various managers involved
  and having the problem "escalated" even further, the problem was
  finally identified.

**********************************************************************
Fix:

Remove from /etc/fstab the line:

        /dev/xy0b swap swap rw 0 0

Apparently in SunOS 4.1, if /etc/fstab has an entry for the default
swap partition, then when you go multi-user and run swapon(8) the
default swap gets added a second time. This eventually crashes the
kernel when dump runs and forces the system to swap. This is an
unconfirmed theory (we are still waiting for our sources), but
removing the fstab entry stopped the system from crashing. We are now
back to daily multi-user incremental dumps on our systems. Now all we
have to do is get one of our machines, whose disk got trashed when a
faulty disk controller was swapped in during one of numerous
experiments, back into full service.
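
As a rough illustration of the fix, the Python sketch below scans
fstab-style text for a swap entry naming the default swap device and
returns the text with that line dropped. The function names and the
sample file contents are hypothetical; only the /dev/xy0b line comes
from this message. It assumes the usual whitespace-separated fstab
layout with the type in the third field.

```python
def is_swap_entry(line, device):
    """Return True if an fstab-style line is a swap entry for `device`.

    Assumes whitespace-separated fields: filesystem, directory, type, ...
    Anything after a '#' is treated as a comment and ignored.
    """
    fields = line.split("#", 1)[0].split()
    return len(fields) >= 3 and fields[0] == device and fields[2] == "swap"


def find_swap_entries(fstab_text, device):
    """Return (line_number, line) pairs for swap entries naming `device`."""
    return [
        (number, line.strip())
        for number, line in enumerate(fstab_text.splitlines(), start=1)
        if is_swap_entry(line, device)
    ]


def drop_swap_entries(fstab_text, device):
    """Return the fstab text with swap entries for `device` removed."""
    kept = [
        line
        for line in fstab_text.splitlines(keepends=True)
        if not is_swap_entry(line, device)
    ]
    return "".join(kept)


# Hypothetical sample; only the /dev/xy0b line is taken from the message.
SAMPLE_FSTAB = (
    "/dev/xy0a / 4.2 rw 1 1\n"
    "/dev/xy0b swap swap rw 0 0\n"
    "/dev/xy0g /usr 4.2 rw 1 2\n"
)
```

Calling find_swap_entries(SAMPLE_FSTAB, "/dev/xy0b") reports the
offending line, and drop_swap_entries(SAMPLE_FSTAB, "/dev/xy0b") gives
the text to review before replacing the real file. Entries for other
swap devices are left alone, since the theory above concerns only the
default swap partition.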

Thanks to everyone who responded with suggestions and reports of
similar occurrences. It helped put the pressure on Sun to get them to
look at the problem seriously.

                                                --Fuat

Internet: fuat@columbia.edu          U.S. MAIL: Columbia University
  BITNET: fuat@cunixf                           Center for Computing Activities
    UUCP: ...!rutgers!columbia!cunixf!fuat      712 Watson Labs, 612 W115th St.
   Phone: (212) 854-5128  Fax: (212) 662-6442   New York, NY 10025



This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:05:58 CDT