SUMMARY: 4/490 IPI problem

From: David Way (dpw@kate.as.utexas.edu)
Date: Thu Mar 30 1995 - 15:32:03 CST


Sun Managers,

On Monday, March 20, 1995, I wrote:

> Our SparcServer 490 has been crashing lately with a variety of error
> messages being written to /usr/adm/messages. The following, which
> appears during many of the reboots after a crash, seems to be a recurrent
> theme:
>
> Mar 20 12:53:34 astro vmunix: ipi_lookup: found wrong command for refnum
> Mar 20 12:53:34 astro vmunix: ipi_lookup: looking for 2b7, found 3b48 ...
> Mar 20 12:53:34 astro vmunix: is0: refnum lookup on success failed. refnum 2b7
> Mar 20 12:53:34 astro vmunix: ipi 0: missing interrupt. refnum 3b48
> Mar 20 12:53:34 astro vmunix: id000a: block 80 (80 abs): write: missing interrupt - attempting recovery
>
> Does this suggest a problem with an ipi disk (i.e. media error), or with the
> ipi controller? The first couple of lines imply that data read from the disk
> isn't what was expected, but the last line about the missing interrupt seems
> to point more to the ipi controller.

Just got one response:

--
From: mhr@internet.sbi.com (Michael Ringbauer)
Subject: Re: 4/490 IPI problem

David,

Just my $.02 for what it's worth. In dealing with IPIs and 490s in the past where they generated messages that weren't very descriptive, it's usually turned out to be the disk causing the problems.

I've swapped out the controllers in these types of situations, only to have to replace the disk the following day.

Check with your Sun engineer if you have a maintenance contract. If the controllers go bad the errors usually span all the disks on that controller. I don't know if that's the only disk you have on there.

The other thing to watch out for is that Sun changes rev levels on the IPI controllers every once in a while, so if you buy new disks or change OS levels you can end up a few revs behind the current one. Sun doesn't go out of its way to notify customers of upgraded revisions.

Check the obvious first: make sure all the cables are tight, then check the revision level of the controller. See if the drive has generated any other messages recently, such as Read/Retry or Conditional Success messages. These usually indicate that the heads are starting to misalign and that the drive will eventually need to be replaced or reformatted.

Mike
--

We never did figure out precisely what was causing the ipi messages, but shortly after I posted my message our machine started experiencing numerous other hardware problems. It's on Sun hardware maintenance, so we just started replacing the most likely things: cpu board, ipi board, ipi cables, power supply, cpu board (again). The last components replaced (cpu board and power supply) fixed it, after about 10 days' worth of constant ups and downs. It appears at this point that the ipi messages were symptoms of a separate hardware problem, since there seem to be no problems with the ipi disks now.
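
For anyone chasing something similar: Mike's point that a failing controller tends to show errors across every disk on it can be checked roughly by tallying kernel messages per disk. What follows is only a minimal Python sketch. The function name count_disk_errors is made up for illustration, and the disk-name pattern (id, three hex digits, a partition letter, as in "id000a" above) is an assumption based on the one device name in the log excerpt.

```python
import re
from collections import Counter

# Syslog prefix as seen in the quoted log lines:
#   "Mar 20 12:53:34 astro vmunix: <message>"
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+vmunix:\s+(.*)$")

# Assumption: IPI disk messages start with a name like "id000a:"
# (id + three hex digits + partition letter). Adjust if yours differ.
DISK_RE = re.compile(r"^(id[0-9a-f]{3})[a-h]:")


def count_disk_errors(lines):
    """Count kernel messages per IPI disk (partition letter dropped)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a vmunix line
        d = DISK_RE.match(m.group(1))
        if d:
            counts[d.group(1)] += 1
    return counts
```

Calling count_disk_errors(open("/usr/adm/messages")) returns a Counter keyed by disk name. If nearly all the counts land on one disk, that disk is the first suspect; if they are spread across every disk on the controller, the controller (or, as in our case, something else entirely) looks more likely. This only counts lines and makes no attempt to interpret them.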

Much obliged to Mike Ringbauer for his advice, even though, in retrospect, our problem turned out to be something else.

David.
--
David Way
McDonald Observatory/Astronomy Dept.- Univ. of Texas, Austin
(office) RLM 16.206
(voice) 471-7439
(internet) dpw@kate.as.utexas.edu
