Summary: Data integrity on 4/280

From: Doug Neuhauser (doug@seismo.gps.caltech.edu)
Date: Fri Mar 30 1990 - 18:43:57 CST


As I promised, here is the summary of the "Data Integrity Problems" on my
4/280 server. Briefly, the symptoms were:

1. System utilities (such as f77, cc, ld) will occasionally bomb with a
error such as "segmentation fault". If the exact same command is re-issued,
it will work.
2. Executable files created by loading the exact same .o files will
sometimes vary by 1 or more bytes. Files will be the same length, but a
"cmp -l" will provide differences like:
        1008607 10 110
3. Data files that are copied (via cp) may differ by 1 or more bytes.
4. No disk or memory errors logged.
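
For anyone reading the "cmp -l" line in item 2: cmp -l prints the byte
offset in decimal followed by the two differing byte values in octal. The
sketch below is my own illustration, not something from the original
investigation (the function names are made up). It decodes that one line
and shows which bits differ between the two values.

```python
def parse_cmp_l_line(line):
    """Parse one 'cmp -l' line: decimal byte offset, then two octal byte values."""
    offset_str, first_str, second_str = line.split()
    return int(offset_str), int(first_str, 8), int(second_str, 8)


def flipped_bits(first, second):
    """Return the bit positions (0 = least significant) where two bytes differ."""
    diff = first ^ second
    return [bit for bit in range(8) if diff & (1 << bit)]


# The sample line quoted in the symptom list above.
offset, first, second = parse_cmp_l_line("1008607 10 110")
print(offset, hex(first), hex(second), flipped_bits(first, second))
# prints: 1008607 0x8 0x48 [6]
```

So the quoted difference (octal 10 vs. 110) is a single flipped bit, bit 6,
at that offset.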

RESOLUTION:

1. I compared all files in /usr/bin and /usr/lib. They were the same.
2. Ran memory diagnostics - no errors.
3. Replaced the kernel with the kernel from the other similarly configured
4/280. Same problem. NO problems on the other 4/280.
4. Discovered missing P3 and P4 backplane jumpers on the slot for 1 of the
memory boards. Replaced the jumpers, and the problem still existed.
5. Removed Systech VPC (Versatec Printer/Plotter and Centronics Interface),
which is the only non-Sun maintained board. Problem still existed.
6. Swapped CPUs between the two 4/280s. The problem followed the CPU board!
7. Sun swapped the CPU board this morning, and I can no longer reproduce
the error.
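
An intermittent fault like this is easier to chase with a repeatable test.
The following is only a rough sketch of one way to do that, assuming a
modern Python is available. It is not the procedure used above, and the
function name and the default of 20 rounds are arbitrary. It copies a file
repeatedly and does a byte-by-byte comparison of each copy against the
original.

```python
import filecmp
import os
import shutil
import tempfile


def copy_and_compare(source_path, rounds=20):
    """Copy source_path `rounds` times; return the round numbers whose copy differs."""
    mismatches = []
    with tempfile.TemporaryDirectory() as scratch:
        for round_number in range(rounds):
            copy_path = os.path.join(scratch, "copy_%d" % round_number)
            shutil.copyfile(source_path, copy_path)
            # shallow=False forces a byte-by-byte comparison rather than
            # trusting file size and modification time.
            if not filecmp.cmp(source_path, copy_path, shallow=False):
                mismatches.append(round_number)
    return mismatches
```

A healthy machine should return an empty list for any readable file, e.g.
copy_and_compare("/some/large/file"). Note that reads may be served from the
buffer cache, so a clean run does not by itself prove the hardware is good.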

The various suggestions and comments were:

1. Paul Graham <pjg@acsu.buffalo.edu>
Overtaxed 451 controllers. Rumored that (2) 2361 double-eagles will
overtax the controllers and produce these results.

2. markm@bit.UUCP (Mark Morrissey)
Potential NFS problem. There is supposedly a ufs_inode patch and NFS
patch tape #2 that address problems. Often the problems are gross
corruptions, but not always.

3. vasey@mcc.com (Ron Vasey)
We just had a very similar problem on one of our 3/280s.
Most noticeable was the occasional added second bit to bytes of text
(i -> k, p -> r, etc.), generally caught as compiler errors, and the
tendency of FrameMaker files to corrupt and kill the owning process.
I had SUN replace the CPU board and disk controller, then memory and
another disk controller. I'd bet it was the disk controller (xy451),
although the symptoms and virtual absence of other disk corruption were
quite unusual. The problem has not recurred since replacement.

4. loral!sysadmin!jes@ucsd.edu (John E. Schimmel)
Having UDP checksums turned off is a possible source of data corruption
over NFS.

5. phaneuf@ireq.hydro.qc.ca (Daniel Phaneuf 514-652-8074)
Requested my results. Well, here they are.
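
A note on Ron Vasey's "added second bit" examples (i -> k, p -> r): in
ASCII both substitutions correspond to the bit with value 2 being set. The
short sketch below is an illustration only, with a made-up function name.
It applies that change to a string.

```python
def set_second_bit(text):
    """Return text with the value-2 bit (second-lowest bit) set in every character."""
    return "".join(chr(ord(ch) | 0x02) for ch in text)


print(set_second_bit("ip"))            # prints: kr
print(hex(ord("i") ^ ord("k")))        # prints: 0x2
```

This matches the single-bit nature of the differences seen on the 4/280.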

I am trying to see if Sun can track this board and give me a report on what
problems they found. If so, I'll let you know.
Thanks very much for all suggestions. Hope this helps someone else in the
future.
----------------------------------------------------------------------------
Doug Neuhauser Div. of Geological and Planetary Sciences
doug@seismo.gps.caltech.edu California Institute of Technology
818-356-3993 MS 252-21, Pasadena, CA 91125


