I sent mail to the list Monday with the following question:
>  Last Monday we experienced a power dropout in our lab. Of course, it
>caused everything to crash. Everything seemed to come back up (well, the 
>superblock was corrupted on one drive but fsck fixed it). In the last week,
>twice I have come in in the morning to find our Sun3 at the PROM prompt
>with the message watchdog reset. One time I typed B and it booted fine, but
>the other time it failed its memory check. I had to cycle power to bring it
>back to life.
>  Questions:
>1)  What exactly is the watchdog?
>2)  Is this behaviour related to the power dropout?
>3)  Can this be fixed?
>        Thanks.
I received 10 replies.  It seems that the watchdog is a timer that checks
if instructions are executed within a certain amount of time. If it times
out, it assumes there is a problem and resets the CPU. Respondents suggested
a variety of causes, most often bad memory or other hardware faults, or a
trashed interrupt stack. Thanks to all who replied,
and especially to Koper Zangocyan <cc_koper@rcvie.co.at>, who sent a copy
of a Sun Tech Bulletin. I'm not sure what I'm going to do about the problem
except hope it doesn't reappear (it's not worth throwing money at a Sun3/160).
Thanks.
-------------------------------------------------------------------------
>From polk@ece.nps.navy.mil
>It's funny that you wrote about a "watchdog
>reset."  I just got one and I am formatting my
>disk.  My boss said it was a hardware problem
>that has something to do with the bus.
>
>I don't know the answer to your problem but, if
>this formatting doesn't work it probably means
>that we have a serious problem.
--------------------------------------------------------------------------
>From shandelm@jpmorgan.com
>
>I knew of a problem with 3/260 and 3/280's whereby if the ethernet cable was
>reattached while the machine was up, it would cause a watchdog reset. I forget
>the circumstances which cause this. My memory is a bit foggy now since the
>3/2xx days. I hope this helps somewhat.
--------------------------------------------------------------------------
>From glk@pls.amdahl.com
>I have an old Sun 3 that displays similar behavior.  I don't know exactly what
>the watchdog is.  The way it was explained to me, the machine gets itself into
>a confused state, at which point it decides it wants to start over with a clean
>slate.  It seems reasonable to assume the power problem probably had something
>to do with it.  My guess is memory problems, judging from your message and also
>my experience with my Sun 3.
--------------------------------------------------------------------------
>From cc_koper@rcvie.co.at
>Regarding your question about watchdog resets, I have found the following.
>It may not be 100% suitable for your case, but I am sure it will help.
>We also have had Sun-3's and they were unstable, especially wrt power.
>
>
>-----------------------------------------------------------------------------
>
>Collection: Software Technical Bulletins
>Document: 1056
>
>Year:   1989
>Month:  February
>Title:  Watchdog Resets for Kernel Debuggers
>-------------------------------------------------------------------------------
>
>Watchdog Resets for Kernel Debuggers
>-------- ------ --- ------ ---------
>
>This article is a discussion of  watchdog  resets  for  kernel  debuggers.   To
>understand  these  procedures  you must be familiar with kadb, 680x0 assembler,
>and have an understanding of  680x0  stack  frames.   Sun-4  systems  are  also
>discussed.
>
>What is a Watchdog Reset?
>---- -- - -------- -----
>
>The Sun-2, Sun-3, and Sun-4 system CPUs are capable of  halting  under  certain
>conditions.  This  can occur with the 680X0-based Sun-2 and Sun-3 machines when
>executing a halt (stop) instruction in either of the following circumstances:
>  - A bus error is received at times when the CPU is unable to handle it, due
>    to stack problems.
>  - When the stack is rendered unusable as a result of other circumstances,
>    the 680X0 cannot issue a bus error.
>
>The Sun-4 SPARC chip only halts when  it  receives  a  synchronous  trap  while
>already  servicing  a  trap,  a  situation  which  rarely  occurs.  Note  that
>asynchronous traps (such as interrupts) will not cause a halt.
>
>The watchdog reset message is produced by the ROM  monitor.  The  detection  of
>watchdog  resets  is  less accurate on some machines than others; sometimes the
>ROM decides they are power-on resets.
>
>By default, a system stops when a watchdog reset occurs. An EEPROM  option  can
>cause  an  automatic  reboot.  The most likely kernel bugs which cause watchdog
>resets either overflow or trash the interrupt stack; some hardware problems can
>also cause a watchdog reset.
>
>Watchdog resets occur when processor hardware stops, as follows:
>  - Sun-2 and Sun-3 software can stop with a stop #2?00 instruction.
>  - Software can cause a 68020 to stop by causing a bus error during exception
>    processing of a bus error, address error, reset, or certain portions of an
>    RTE instruction.
>  - Sun-4 machines will generate a watchdog reset if a synchronous trap occurs
>    while already servicing a trap.
>  - The most likely kernel bugs which cause watchdog resets either overflow or
>    trash the interrupt stack.
>  - Some hardware problems can also cause a watchdog reset.
>
>In order to recover from these conditions, Sun  has  built  what  is  called  a
>`watchdog  timer'  into  its systems.  If no instructions are executed within a
>certain amount of time (for whatever reason) a timer expires, and we reset  the
>CPU so the system can take steps to get running again.
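
[A toy model of the timer idea described above, in Python. This is only a
sketch and not Sun's implementation: the class name, the tick-based clock,
and the timeout of 3 ticks are all assumptions made up for illustration.
The countdown is reloaded whenever an instruction executes, and a reset is
counted when it runs out.]

```python
class WatchdogTimer:
    """Toy watchdog: a reset fires if no instruction runs before the timeout."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks  # hypothetical timeout, in clock ticks
        self.remaining = timeout_ticks
        self.resets = 0

    def instruction_executed(self):
        # Any executed instruction counts as progress and reloads the timer.
        self.remaining = self.timeout_ticks

    def tick(self):
        # One clock tick passes; returns True if this tick triggered a reset.
        self.remaining -= 1
        if self.remaining <= 0:
            self.resets += 1
            self.remaining = self.timeout_ticks
            return True
        return False


def run(trace, timeout_ticks=3):
    """trace: booleans, True = the CPU executed an instruction on that tick."""
    dog = WatchdogTimer(timeout_ticks)
    for executed in trace:
        if executed:
            dog.instruction_executed()
        dog.tick()
    return dog.resets
```

[For example, run([True, False, False, False]) models a CPU that halts after
one instruction; the timer then expires once.]
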
>
>What is the System State After a Watchdog Reset?
>---- -- --- ------ ----- ----- - -------- -----
>
>The ROM monitor will have an accurate picture of most of the processor state at
>the time of the crash.  The details are written up in the PROM monitor's trap.s
>module.  The ROM monitor attempts to preserve the processor  registers  and  so
>forth, but the following information will be lost:
>  - The Interrupt Stack Pointer (ISP).
>  - The PC.
>  - The Status Register (SR), including the supervisor/user flag, the `use
>    master stack pointer' flag, and the interrupt level.
>  - Segment map entry 0 (0th pmeg).
>  - Page map entry for g_resetaddr (g_resetmap).
>
>Gathering Information for Analysis
>--------- ----------- --- --------
>
>If a watchdog reset occurs at some random time, perform the following:
>
>       Use g4 to get a kernel stack trace.
>
>       Use g0 to get a dump (this occasionally fails after a watchdog reset).
>
>Using kadb to Debug a Watchdog Reset
>----- ---- -- ----- - -------- -----
>
>kadb is useful when debugging a reproducible watchdog reset. When using kadb to
>debug a watchdog reset, the following occurs:
>  - The kadb registers will be wrong.
>  - Note the boot PROMs, as it is possible the boot PROM's registers will be
>    invalid.
>  - Symbolic addresses will be constant, with or without kadb.
>  - Dynamically allocated kernel storage will move.
>
>To get a stack trace from kadb, perform the following.
>
>       a6 is the C frame pointer; it will usually be located somewhere near the
>       stack.   If  you  check  around  where a6 points, you can usually find a
>       frame-link address.
>
>       If you can find a frame-link address, addr$c will produce a useful stack
>       trace.
>
>Identifying the Causes of Watchdog Resets on Sun-2s and Sun-3s
>----------- --- ------ -- -------- ------ -- --- -- --- --- --
>
>Most watchdog resets involve interrupt stack problems of  some  sort,  such  as
>overflow, trashing, or unmapping.  Here are some hints for identifying these on
>Sun-2 and Sun-3 machines.
>
>To identify the cause of a watchdog reset, one generally needs  a  reproducible
>case.   kadb  can be used to obtain such a case. Therefore, load kadb and cause
>the crash.  Then, use the PROM monitor to list all the registers, and copy  the
>listed registers down.  Finally, start kadb with g fd00000.
>
>The overall strategy of this procedure is to determine the location of the last
>stack  frame.   Once  that  is  available, you use addr$c to get a stack trace,
>which will tell you what is active at the time.
>
>Check a6 against the range eintstack-2k <= a6 < eintstack. If it is some  value
>that  is  wildly  out  of range, the stack was probably trashed.  In this case,
>refer to `Finding Your Place in a Trashed  Stack',  below.   If  the  value  is
>reasonable,  but  near  the  low end of the range, refer to `Checking for Stack
>Overflow', below.
>
>If trying to read the stack gives you an error message, the stack was  probably
>unmapped.
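
[The a6 range check above can be sketched in Python as follows. This is an
illustration only: the function name, the return labels, and the 0x100-byte
cutoff for "near the low end" are assumptions, not part of the bulletin. It
also treats every out-of-range value as trashed, which is cruder than the
bulletin's "wildly out of range".]

```python
INTSTACK_GUARANTEED = 0x800  # the 2k range used in the bulletin's check
LOW_END_MARGIN = 0x100       # hypothetical cutoff for "near the low end"


def classify_a6(a6, eintstack):
    """Suggest which section of the bulletin applies to a saved a6 value."""
    low = eintstack - INTSTACK_GUARANTEED
    # The bulletin's range: eintstack-2k <= a6 < eintstack.
    if not (low <= a6 < eintstack):
        return "trashed"
    # In range, but close to the bottom of the guaranteed stack area.
    if a6 - low < LOW_END_MARGIN:
        return "check-overflow"
    return "plausible"
```

[With a made-up eintstack of 0x10000, an a6 of 0 is classified as "trashed"
and an a6 of 0xF810 as "check-overflow".]
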
>
>Finding Your Place in a Trashed Stack
>------- ---- ----- -- - ------- -----
>
>If a6 is zero or some small value, try working from the  highest-level  routine
>downward.
>
>If a6 is some unusually large value, try searching the  stack  for  that  value
>using the following commands:
>  eintstack-800,800/L unusually-large-value
>  eintstack-7fe,7fe/L unusually-large-value
>
>If these commands find some matches, try the following:
>  found-addr+4/p
>
>For the matches which show valid routine names, look on  the  stack  for  other
>pointers  of  the  form intstack+something.  Use these as arguments to $c; this
>may bring you to a valid stack frame.
>
>If the above commands fail to find an appropriate match, the  problem  requires
>further, independent investigation outside the scope of this article.
>
>Checking for Stack Overflow
>-------- --- ----- --------
>
>Take the value of a6 obtained from the PROM monitor  and  enter  the  following
>command:
>  a6-from-prom-monitor$c
>
>This will usually produce a valid stack trace.  Look at the prefix code of  the
>last routine named, and find the size of the routine's stack frame (on a 68020,
>this will be an argument to the linkw instruction).  Then enter the  following:
>  eintstack-addr of last stack frame+size of last stack frame = x
>
>If this number is more than 0x800, you have a stack overflow.
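
[The arithmetic above, written out as a small Python sketch. The function
names and the addresses in the example are made up for illustration; only
the formula and the 0x800 threshold come from the bulletin.]

```python
INTSTACK_GUARANTEED = 0x800  # 2k guaranteed on Sun-2/Sun-3, per the bulletin


def interrupt_stack_used(eintstack, last_frame_addr, last_frame_size):
    """Bytes in use: eintstack - addr of last frame + size of last frame."""
    return eintstack - last_frame_addr + last_frame_size


def has_overflowed(eintstack, last_frame_addr, last_frame_size):
    """True if the usage exceeds the guaranteed 2k interrupt stack."""
    used = interrupt_stack_used(eintstack, last_frame_addr, last_frame_size)
    return used > INTSTACK_GUARANTEED
```

[With a hypothetical eintstack of 0x0f010000, a last frame at 0x0f00f820 and
a frame size of 0x40, the usage is 0x7e0 + 0x40 = 0x820, which is over 0x800
and so counts as an overflow.]
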
>
>Interrupt Stack Sizes
>--------- ----- -----
>
>On Sun-2 and Sun-3 machines, the interrupt stack varies in size from 2k to 10k;
>you  are  guaranteed 2k.  On Sun-4 machines, the interrupt stack varies in size
>from 4k to 12k; you are guaranteed 4k. The interrupt stack begins at the  first
>page boundary following intstack.
>
>The stack-defining code in locore.s is deceptive, as it appears that the  stack
>size  is  2k  plus  the  page  size.   Actually, the first part of the stack is
>write-protected, since it follows the kernel in memory.  For  further  details,
>refer to the locore.s manual page.
>
>*******************************************************************************
>
>ONLINE SUPPORT SYSTEM (OSS),
>
>     Software Technical Bulletin (STB),
>
>     Produced by: Technical Information Services (TIS)
>
>Copyright (c) 1989, Sun Microsystems, Inc.  All Rights Reserved.   No  part  of
>this  work covered by copyright hereon may be reproduced or used in any form or
>by any means -- graphic, electronic,  or  mechanical,  including  photocopying,
>recording,  taping,  or  information  storage  and retrieval systems -- without
>permission of the copyright owner.
>
>RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the Government  is
>subject  to  restrictions as set forth in subparagraph (c)(1)(ii) of the Rights
>in Technical Data and Computer Software clause  at  DFARS  52.227-7013  and  in
>similar clauses in the FAR and NASA FAR Supplement.
--------------------------------------------------------------------------------
>From @a.gec-epl.co.uk:dunstan@gec-epl.co.uk
>
>As far as I know, it is a software watchdog. The message "Watchdog
>Reset" usually means that the kernel has found some inconsistency in
>the hardware it is addressing. This would stack up with your saying
>that it failed memory tests.
>
>It is possible that some small corruption still exists in your root
>filesystem - and it might be worth your while to reinstall the OS, but
>it sounds like hardware to me.
--------------------------------------------------------------------------------
>From hendefd@tech.duc.auburn.edu
>
>We experienced multiple watchdog resets with our 4/280.  It also had
>occasional problems with memory check and it turns out that we had a
>bad memory board.   Since it's real difficult and expensive to get a
>new one or have one repaired, we moved the faulty board to the top of
>the memory stack and ran fine for about 3 months.  Now, of course it
>died completely, but didn't cause a reset.
-------------------------------------------------------------------------------
>From ups!upstage!glenn@fourx.Aus.Sun.COM
>A watchdog reset occurs when the system panics for some reason, and
>then while it is handling the panic it panics again. Since it hasn't
>finished responding to the first one it cannot continue and gives a
>watchdog reset. Usually these are caused by a hardware failure. It
>sounds like it's time to call in the repair man, as you may need a new
>cpu or memory I think.
>
>Sometimes you can see the panic message by using dmesg after the system
>has booted (this won't work if you had to power cycle because dmesg
>prints out the kernel's message buffer from memory).
--------------------------------------------------------------------------------
>From ups!kalli!kevin@fourx.Aus.Sun.COM
>
>A watchdog reset is caused when the CPU halts.  A timer goes off to
>make sure the system doesn't just hang, hence the name watchdog. This
>is generally caused by double bus errors on the 68020 machines.  That
>in turn is usually because somebody hosed the stack (as in a badly
>written driver or system code), or the machine is a sick puppy and the
>memory isn't giving the right answers all the time.
>
>> 2)  Is this behaviour related to the power dropout?
>
>If it wasn't happening before, probably.
>
>> 3)  Can this be fixed?
>
>Have you run the diagnostics on the board yet?  They are not exhaustive,
>but they sometimes catch problems...