Summary of Crash Dump Analysis

From: matt (matt@mostar.com)
Date: Tue Jul 28 1998 - 04:48:44 CDT


Sun managers,

Many thanks to:
Steven Aizic <steven@yucc.yorku.ca>
Andreas Ehliar <ehliar@futurniture.se>
Jean-Philippe.LEROY@st.com
Wim Olivier <wim@na.co.za>
Ray Trzaska <rtrzaska@uk.mdis.com>
Andrew Ho <andrewh@isn.com>
Pam Skillman <pam.skillman@East.Sun.COM>
blymn@baea.com.au (Brett Lymn)

Original post attached below the summary.

Following the responses from the people above, I ran crash on the
two files (vmcore.0 and unix.0). Here is the procedure I followed
on the files; thanks, Ray and Pam.

    1) strings vmcore.0 | more
        At some point you will reach the kernel message buffer
        (msgbuf) that feeds /var/adm/messages. It is a ring buffer,
        so you have to guess where the most recent messages start,
        but you will probably recognise the text around the crash,
        such as BAD TRAP.

****I modified this: I ran strings vmcore.0 > stringsinfo, then
    opened the stringsinfo file in vi and searched for a BAD TRAP
    entry. It was there, and it gave some details about the
    actual crash.
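
The search through stringsinfo can also be done without an editor. The
following is only a sketch: the function name show_bad_trap is made up
for illustration, and it assumes an awk that accepts -v (a POSIX awk;
on Solaris of that era this probably means nawk rather than the old
/usr/bin/awk).

```shell
#!/bin/sh
# show_bad_trap FILE [CONTEXT]
# Hypothetical helper: print every line of FILE that contains "BAD TRAP",
# prefixed with its line number, plus up to CONTEXT following lines
# (default 10).
show_bad_trap() {
    file=$1
    context=${2:-10}
    awk -v n="$context" '
        /BAD TRAP/ { left = n + 1 }   # open (or reopen) a window at each match
        left > 0   { printf "%d: %s\n", NR, $0; left-- }
    ' "$file"
}

# Typical use, after creating the file as described above:
#   strings vmcore.0 > stringsinfo
#   show_bad_trap stringsinfo 15
```

Because msgbuf is a ring buffer, the lines after a match are not
guaranteed to be in chronological order, so treat the output as a
starting point for reading the file, not as a complete log.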

    2) netstat -d unix.0 vmcore.0

    3) nfsstat -n unix.0 vmcore.0

    4) arp -a unix.0 vmcore.0

    5) ipcs -a -C vmcore.0 -N unix.0

    6) crash -d vmcore.0 -n unix.0
        > help p
        > p -e
        >
Step 6 in more detail: use 'crash vmcore.0 unix.0'.
  The following commands are listed when you type ? at the > prompt,
  but Pam pointed out the more useful ones in her email:

stat           prints system statistics
user or u      prints the user structure for the designated process
               (You will usually see the command that was running at
                the time under 'PROCESS MISC' information).
proc or p      prints the process table
               (the process with a 'p' in the second field is the
                process that was running at the time of the crash)

****The proc command came in really handy, since it identified the
    process that was running at the time of the crash. This let me
    zero in on the problem, see below.
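
If the proc output has been saved to a file, the row Pam describes can
be picked out mechanically. This is a sketch under assumptions: the
function name running_at_crash and the file name procinfo are made up
for illustration, and the only thing it relies on is the rule quoted
above, that the second whitespace-separated field is 'p' for the
process that was running.

```shell
#!/bin/sh
# running_at_crash FILE
# Hypothetical helper: print only the rows of a saved proc listing whose
# second field is exactly "p" (the process running at the time of the
# crash, per the note above).
running_at_crash() {
    awk '$2 == "p"' "$1"
}

# Typical use, assuming the proc output was saved as procinfo:
#   running_at_crash procinfo
```

Comparing the second field exactly, instead of grepping for the letter
p, avoids matching every command name that happens to contain a 'p'.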

After performing all of the above, it seems that one of the
installed proprietary programs was making an RPC call when
the system took a power surge (even through the UPS), which
caused the PowerChute program to kick in.

This seems to have wiped the swap space, at least according
to the logs, which said "dump on /cotodos1" (my swap space).
I saved the information for future use in case it happens
again (in case the surge wasn't the problem and it turns out
to be a glitch in the proprietary software).

The crash program truly did help. I was able to identify the
process that was running, which will let me keep an eye on
that program and notice if it misbehaves again. I was also
able to find the BAD TRAP entry using the strings command.

Thanks for all the help, and if I wasn't clear in my summary
please feel free to email me for further information. The whole
process made me appreciate a good engineer; I hope I don't have
to do this sort of thing often. Also, thanks to Wim for offering
his time to look over the information.

Matt
matt@mostar.com

=========================================================================
                                ORIGINAL POST
=========================================================================
Sun managers,

I had a problem about one month ago where my Sun Ultra 1 running Solaris
2.5.1 with all patches seemed to have dumped the swap space and had
a CPU panic. At the suggestion of a few managers I enabled savecore
in /etc/rc2.d/S20sysetup. Enabling it allowed the system to save
information about this particular CPU panic to the proper directory.

I now have the following information from this CPU panic in its
proper directory, which, by the way, was created when I enabled
savecore.

   Three files:

      -rw-rw-rw- 1 root root 2 Jul 27 bounds
      -rw-r--r-- 1 root root 845992 Jul 27 unix.0
      -rw-r--r-- 1 root root 35332096 Jul 27 vmcore.0

Now, naturally, the problem is: what do I do with this information?
I realize there is a wonderful book out about Unix crash dump
analysis, which I plan to buy as soon as I return to the United
States in August. Unfortunately I do not have the convenience
of getting this book at my current location (I'm in the middle of
a foreign country which doesn't have many bookstores, let alone
English-language computer bookstores). So I was hoping that
any managers out there could give me some pointers on what I
should do.


