Summary [you can skip to the end if you already know the story]:
25-May-90:
Upgraded from SunOS 4.0.1 to SunOS 4.1 on our Sun-4/280s (with 1 ALM-II,
2 Hitachi disks on a Xylogics 451 controller, 1 tape drive on a
Xylogics 472 controller, and two 8 MB and one 32 MB memory boards). During
the first post-upgrade multi-user full dump (logins disabled), the system
crashed with:
Memory Error Register 1d4<INTR,INTENA,CE_ENA,WBACKERR>
DVMA=1, context=0, virtual address=fff3cfc0
pme=0, physical address=fc0
panic: writeback error
syncing file system... {at this point it hangs and we have to reset from
the CPU board, though in one of the 20 or so crashes it saved a core image}
1-Jun-90:
My first message to sun-spots/sun-managers. Got a few responses
describing similar occurrences, but no suggested solution worked.
20-Jun-90:
Frustrated by Sun's lack of responsiveness in looking into the
problem (hardware support people worked hard, swapping boards,
building test systems, etc., despite their suspicions that the
problem was software-related), I posted my second message to
sun-spots/sun-managers, and received even more reports of similar
problems, including one other site that received a similar brush-off
("multi-user dumps aren't supported").
31-Jul-90:
After repeated calls to Sun, getting various managers involved,
and having the problem "escalated" even further, the cause was
finally identified.
**********************************************************************
Fix:
Remove from /etc/fstab the line:
/dev/xy0b swap swap rw 0 0
Apparently in SunOS 4.1, if you have an fstab entry for the default
swap partition, then when you go multi-user and run swapon(8) the
default swap gets added a second time. This eventually leads to the kernel
crashing when dump runs and forces the system to swap. This is an
unconfirmed theory (we are still waiting for our sources), but
removing the fstab entry stopped the system from crashing. We are now
back to daily multi-user incremental dumps on our systems. Now all we
have to do is get one of our machines, whose disk got trashed when a
faulty disk controller was swapped in during one of numerous
experiments, back into full service.
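For anyone who wants to check several machines for the offending entry, here is a rough Python sketch of the idea. It assumes fstab fields are whitespace-separated in the order device, mount point, type, as in the line quoted above. The function names and every sample line other than the /dev/xy0b one are made up for illustration; it only reports or filters text and does not touch /etc/fstab itself.

```python
# Hypothetical helper: find fstab lines that list the default swap
# partition as a swap entry. Assumes whitespace-separated fields
# (device, mount point, type, ...) and '#' comments.
def find_swap_entries(fstab_text, default_swap="/dev/xy0b"):
    hits = []
    for lineno, raw in enumerate(fstab_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 3 and fields[0] == default_swap and fields[2] == "swap":
            hits.append((lineno, raw))
    return hits


# Return the fstab text with those entries removed (nothing is written to disk).
def without_swap_entries(fstab_text, default_swap="/dev/xy0b"):
    drop = {n for n, _ in find_swap_entries(fstab_text, default_swap)}
    kept = [l for n, l in enumerate(fstab_text.splitlines(), start=1)
            if n not in drop]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Sample contents are invented except for the /dev/xy0b line.
    sample = (
        "/dev/xy0a / 4.2 rw 1 1\n"
        "/dev/xy0b swap swap rw 0 0\n"
        "/dev/xy0g /usr 4.2 rw 1 2\n"
    )
    for lineno, line in find_swap_entries(sample):
        print("line %d: %s" % (lineno, line))
```

In practice you would read the real file's text, review what find_swap_entries reports, and edit /etc/fstab by hand; the default_swap argument should be whatever your default swap partition actually is.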
Thanks to everyone who responded with suggestions and reports of
similar occurrences. It helped put pressure on Sun to get them to
look at the problem seriously.
--Fuat
Internet: fuat@columbia.edu U.S. MAIL: Columbia University
BITNET: fuat@cunixf Center for Computing Activities
UUCP: ...!rutgers!columbia!cunixf!fuat 712 Watson Labs, 612 W115th St.
Phone: (212) 854-5128 Fax: (212) 662-6442 New York, NY 10025