SUMMARY FILE CORRUPTION...(??VIRUS??)..SUMMARY

From: Randy Born M-50 Rm 266 (randy@ai.iit.nrc.ca)
Date: Wed Dec 15 1993 - 21:01:47 CST


 SUMMARY
*********
To: sun-managers@eecs.nwu.edu
Subject: FILE CORRUPTION...(??VIRUS??)..SUMMARY
Status: RO

It appears that the corruption was caused by NFS, which hopefully has now been
fixed by applying the NFS jumbo patch on servers and clients. I say hopefully
because we have been running 4.1.X for at least 2 years error free, except for
the one week in Nov 93 when this file corruption started mysteriously and just as
mysteriously stopped one week later.

Many replies/suggestions/advice below:

I SEND MY THANKS TO ALL WHO REPLIED/HELPED for their quick responses
and hopefully can now look forward to an uneventful coming XMAS HOLIDAY
(or is that just wishful thinking on the part of any Sys Admin??)

Lots of hints for you other sys admins should something like this strike!
Maybe better reading than a Stephen King novel should you want more sleepless
nights (NOT)

<< ORIGINAL POST >>
Further to an earlier post. User files are being corrupted as follows:

-file size remains unchanged.

-many (thousands of) characters are changed to null. These are always
sequential and always from somewhere in file to end of file.

- checksums of a good copy (from tape) and a corrupted copy differ (AS THEY SHOULD)

- any type of file can be affected.

- corrupt files appear at random throughout the user dirs. Sometimes all the
files in a sub-dir are affected and sometimes only one or a few.
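
The symptoms listed above can be checked mechanically when a good copy (for
example one restored from tape) is available. The following is only a sketch in
Python; the function name and the sample data are made up for illustration. It
reports whether the sizes match, where the first difference is, and whether
everything from that point to end of file is null.

```python
def describe_corruption(good: bytes, bad: bytes) -> dict:
    """Compare a known-good copy with a suspect copy of the same file."""
    same_size = len(good) == len(bad)
    # Index of the first differing byte, or None if no byte differs
    # within the length the two copies have in common.
    first_diff = next(
        (i for i, (g, b) in enumerate(zip(good, bad)) if g != b), None
    )
    # True when every byte from the first difference to end of file is NUL.
    null_tail = first_diff is not None and not any(bad[first_diff:])
    return {
        "same_size": same_size,
        "first_diff": first_diff,
        "null_tail": null_tail,
    }


if __name__ == "__main__":
    # Made-up sample data standing in for the two copies of a user file.
    good = b"line one\nline two\nline three\n"
    bad = good[:12] + b"\0" * (len(good) - 12)
    print(describe_corruption(good, bad))
```

If the good copy itself has a null byte exactly where the damage starts, the
reported offset will be slightly later than the true start of the null run.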

I'VE checked various ftp sites for a possible sun-patch-list to see if
any known patch for this symptom exists. None found.

Hopefully this expanded post may shed some light amongst my peer system mgrs/admins.
 
ANY HELP/advice/prayers or suggestions will be most appreciated!!

Has anybody experienced similar happenings and know the cause, or a patch# to cure it?

SunOS in use is 4.1.1 and user workstations are IPC's
Server of user nfs files is a 4/490 also running 4.1.1

(We are not knowingly doing anything special.)

                                                                Randy Born
                                                        System Admin trying to drain the swamp
                                                        which is rapidly rising...
<< End of ORIGINAL POST >>

------------------------------------------------------------------------------

Do you have an NC400 board in your 4/490? If so, you probably want
to turn UDP check sums on. The ether device will probably be ne0,
so you can look for it in /var/adm/messages or with the command dmesg.
Also, if you have that board, make sure to get patch 100762-02, as
it fixes a really nasty bug with oversized packets.

PS: Oh, by the way, to turn on checksums if you have an NC400,
use "sncnet -s udpcksum 1 ne0" and you should probably put it
in your rc.local or wherever the board is getting started.
Ben Taylor
s9ubxt@fnma.com
------------------------------------------------------------------------------


Jup !

We had a bunch of SLCs and ELCs automounting some old 3/50 systems.
The users ran jobs using stuff from the 3/50.
The strangest thing happened: programs only reading a file can corrupt that file.
On the physical drive nothing is wrong; just NFS went nuts.
Even worse, other workstations accessing the same part from the 3/50 will also
see a corrupted file. The problem goes away eventually if the directory is left alone
for a while and some other directories are listed with ls(1).
I guess if the NFS server buffering mechanism is flushed somehow, all things turn back
to normal. This only happens with files larger than 100 Kbyte.

This only happens with 3/50 NFS servers and SPARC clients.
First we thought it was caused by a Sun Fortran bug, but programs like colrm also seem
to be able to corrupt a read-only file. Applying the NFS jumbo patch did not
solve things. We have reported this to Sun, but they did not come up with a solution.

Ah, well, we have almost phased out all the 3/50s now, so the problem went away too :-)
I wonder if this is the same in your environment. That would mean that NFS is still a bit
braindead...

Greetings,

Marcel Bernards, UNIX & Net sysadm Netherlands Energy Research Foundation ECN
------------------------------------------------------------------------------

One thing you may check is that the partitions on the machine are not
overlapping. You can do this using "format" and the "print partitions"
option. Sometimes during the installation, two partitions may accidentally
be set to overlap due to an error in calculations. The symptoms would
be that it would work fine for a while but after a short time the disk
space used by both partitions would begin to overwrite one another causing
garbage files. Just recheck the partition starting cylinders and sizes
to make sure this is not the problem. Thanx.
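
Once the starting cylinders and sizes have been copied down from "format", the
overlap check itself is simple arithmetic. Here is a rough sketch in Python.
The table values are invented, and the function name and the "ignore"
parameter are illustrative only. Partition "c" is skipped by default on the
assumption that, as is the usual convention, it spans the whole disk and so
overlaps everything on purpose.

```python
from itertools import combinations


def overlapping_partitions(table, ignore=("c",)):
    """Return pairs of partition names whose cylinder ranges overlap.

    `table` maps a partition name to (start_cylinder, cylinder_count).
    Zero-length partitions and names listed in `ignore` are skipped.
    """
    ranges = {
        name: (start, start + count)
        for name, (start, count) in table.items()
        if count > 0 and name not in ignore
    }
    clashes = []
    for (a, (a0, a1)), (b, (b0, b1)) in combinations(sorted(ranges.items()), 2):
        # Half-open ranges [a0, a1) and [b0, b1) intersect.
        if a0 < b1 and b0 < a1:
            clashes.append((a, b))
    return clashes


if __name__ == "__main__":
    # Invented layout: "b" ends at cylinder 150 but "g" starts at 140.
    table = {
        "a": (0, 50),
        "b": (50, 100),
        "c": (0, 1000),
        "g": (140, 400),
        "h": (540, 460),
    }
    print(overlapping_partitions(table))
```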

==
Abdul Malik

------------------------------------------------------------------------------
There is DEFINITELY a nasty bug in SunOS 4.1.1 with NFS writing zeros.
I believe there is a patch for this in the big NFS patch for 4.1.1.
Sun will have the patch.
--------------------
Greg Gilley
ggilley@adobe.com
415-962-3862 (voice)

------------------------------------------------------------------------------
Sounds scary. Just a guess, but:

Check to make sure that your partitions are not overlapping slightly.
Use dkinfo or format to see your partition layout.

-Christopher

------------------------------------------------------------------------------

      A UNIX virus, from all I've read, is highly unlikely (although
  not impossible). If the problem is always null bits padded to the
  "normal" end of the file, I'd suspect:
             (1) NFS patches needed
             (2) Locking patches needed
             (3) Bad memory chip

--------------------------------------------------------------------
| Art Schoenstadt 0085P@NAVPGS.BITNET |
| Code MA/Zh (Math. Dept) 0085P@vm1.cc.nps.navy.mil |
| Naval Postgraduate School (408)-656-2662 |
| Monterey, CA 93943 "Are we having fun yet??" |
--------------------------------------------------------------------

If the disk in question has been recently re-partitioned, I'd check
to see whether there's an overlap between partitions. I did that
once (only once), and saw many of the same problems you're seeing.
It was a bear to track down.

... John

------------------------------------------------------------------------------
Hello,

that sounds like buggy hardware, e.g. a hard drive. Look in
/var/adm/messages for read retry/read failed/write failed messages. If
there are a lot of them then you have probably had some bad spots on your
disk. After running commands like format with surface repair or so,
all those unreadable disk sectors will turn to null bytes without
changing file size or timestamps. Unreadable sectors will affect spans
whose sizes are a multiple of 512 (1024?) bytes.
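
Counting those messages can be scripted. The sketch below is in Python and
simply looks for the three phrases named above; the exact wording the disk
driver logs is an assumption, as are the sample lines, so adjust the phrases
to whatever actually shows up in your log.

```python
from collections import Counter

# Phrases taken from the advice above; the driver's real wording may differ.
PHRASES = ("read retry", "read failed", "write failed")


def count_disk_errors(lines):
    """Count log lines containing each phrase, ignoring case."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for phrase in PHRASES:
            if phrase in lowered:
                counts[phrase] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines; on a real system pass an open file instead,
    # e.g. count_disk_errors(open("/var/adm/messages")).
    sample = [
        "Nov 20 03:12:01 server vmunix: xy0a: read retry",
        "Nov 20 03:12:02 server vmunix: xy0a: read failed",
        "Nov 20 03:15:40 server vmunix: xy0a: read retry",
        "Nov 20 04:00:00 server vmunix: le0: no carrier",
    ]
    print(dict(count_disk_errors(sample)))
```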

We have had that problem with a rather old Xylogics connected to a
sun4/260 running SunOS 4.1.1. That drive must be formatted twice per
year to avoid growing number of bad spots.

Hope this will help you out.

Andreas

------------------------------------------------------------------------------

I had a similar situation almost a year ago. It looked very much like
a virus, but it wasn't. See below...

Same symptom - no file size change. That should tell you right off
it's not a virus. Viruses attach themselves to programs, so there
would be file size changes in executables. It could be a worm, but it's
not a virus. (IMHO)

Now here's where it gets interesting. In my situation, only certain
characters were being changed. "A"'s were being changed to "G"s and such.
A board in one of our routers (a CISCO) had one bad chip that induced errors only
when certain bit-patterns (like the letter "A") were transmitted through it.
Replacing the board fixed the problem. Needless to say, it took several hours
to isolate the problem and the company I worked for at the time was in a panic
about it the entire time, which didn't help matters any. Management thought
for sure it was a virus, but it wasn't.

If the files being affected are static, that is, they are affected in-place, on your
disk, then the router problem I described is not likely the cause of your problem.
But if you're not certain the files are static, a bad router (or repeater, bridge,
etc.) is a good bet. While you're at it, check all your cabling in the affected
subnet.

Other thoughts - are the file corruption patterns consistent within files ?
Always the same section of the file and always the same character patterns ?
Or is it just random garbage ? I'm not quite sure where I'm going with this
other than to emphasize looking for patterns - it's what led us to the router problem
we had.
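
One way to look for such patterns, given a good copy and a corrupted copy of
the same file, is to tally which byte turned into which. This is a sketch in
Python with made-up sample data and an illustrative function name. A single
stuck or flipped bit in transit would tend to show up as the same XOR mask over
and over, whereas blocks of nulls would not.

```python
from collections import Counter


def substitution_patterns(good: bytes, bad: bytes) -> Counter:
    """Count (original, corrupted, xor) triples for every differing byte."""
    return Counter((g, b, g ^ b) for g, b in zip(good, bad) if g != b)


if __name__ == "__main__":
    # Made-up data imitating the "A" became "G" symptom described above.
    good = b"ALPHA BETA GAMMA"
    bad = good.replace(b"A", b"G")
    for (g, b, mask), n in substitution_patterns(good, bad).most_common():
        print(f"{chr(g)!r} -> {chr(b)!r}  xor={mask:08b}  count={n}")
```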

If the routers, etc. check out good, one *really disturbing* possibility is
a malicious cracker, either internal to your organization, or coming in
from outside - if you have audit logs, you might want to check them out -
but I'd check everything else first.

I hope these suggestions have helped. Summarize back to the group when
you get it nailed down - having been through a similar situation once before,
I'm really curious about the solution you come up with at your site.

In the interim, get a snorkel until the swamp drains and may the Force [Farce ? ;)]
be with you ! ;-)

--Randy Taylor
rtaylor@ait.nrl.navy.mil

------------------------------------------------------------------------------
I have had this problem, and I also solved this problem. Here is what worked
for me. (I was even using a 4/490 with 4.1.1!)

The 4/490 is a VME-backplane device, which is surprisingly sensitive about
the order of the boards on the bus. Start with making sure your boards
are in the order specified by the Backplane Configuration Guide. If they
aren't, then get them in order and run some experiments copying large files
to see if it fixes the problem.

For us, the disk controllers HAD to be closer to the cpu board on the bus
than the Backplane guide recommended. Specifically, they had to be closer
than our ALM. We ended up putting them as close as physically possible
(remember, the 490 has a divided backplane, so they won't work right next
to the cpu.)

Also, NFS under 4.1.1 is horrible. Upgrading to a newer release will
definitely help. (Doesn't 4.1.3 run on 4/490s?) Make sure the NFS mounts
to the clients are HARD mounts, for all writeable filesystems. NO SOFT
MOUNTS UNLESS YOU ARE MOUNTING A FILESYSTEM READ-ONLY or this can happen again.
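
A quick audit for this can be scripted on each client. The sketch below is in
Python and assumes the usual fstab layout of filesystem, mount point, type and
comma-separated options; the host names and paths in the sample are
hypothetical. It flags any NFS entry that is mounted soft without also being
read-only.

```python
def risky_nfs_mounts(fstab_text: str):
    """Return (filesystem, mount_point) for NFS entries that are soft and writable."""
    risky = []
    for line in fstab_text.splitlines():
        # Drop comments and surrounding whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 4 or fields[2] != "nfs":
            continue
        filesystem, mount_point, _fstype, options = fields[:4]
        opts = set(options.split(","))
        if "soft" in opts and "ro" not in opts:
            risky.append((filesystem, mount_point))
    return risky


if __name__ == "__main__":
    # Hypothetical fstab contents for illustration only.
    sample = """\
/dev/sd0a        /          4.2  rw            1 1
server:/home     /home      nfs  rw,soft,bg    0 0
server:/usr/man  /usr/man   nfs  ro,soft       0 0
server:/export   /export    nfs  rw,hard,intr  0 0
"""
    for filesystem, mount_point in risky_nfs_mounts(sample):
        print(f"soft and writable: {filesystem} on {mount_point}")
```

Automounter maps are not covered by this; their options would need the same
check separately.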

We applied NFS patches, but it didn't help. The 8k blocks of nulls kept
occurring until we moved the boards and upgraded the OS.

Good luck!

..Celeste Stokely

------------------------------------------------------------------------------
my guess is no virus--we have seen this occasionally
but never figured out why.

sender: ???? (accidentally deleted in haste to make summary) <RVB>

------------------------------------------------------------------------------
On the off chance that you were asleep during the last year:

Be sure to apply the NFS jumbo patch.
Make sure that your NFS mounts are hard mounts, not soft mounts.

Mark Anderson
----------------------------------------------------------
Offhand, I'd say try 4.1.3, if you can. I seem to recall something on
this in the release notes for 4.1.3 or 4.1.2.

Bikes on the road. Cars in the gutter.
Louis M. Brune ANDATACO
------------------------------------------------------------------------------

Are these files written to over NFS?

If so, are they soft-mounted on any clients, or has the server been hacked
to support async writes? Either of these is known to corrupt files.

Finally, do these nulls come in nice 8K blocks or not? If so, that is possibly
another indication of NFS woes.

(Of course, I could be completely barking up the wrong tree...)

Dave.

* David Mitchell, Systems Administrator, email: D.Mitchell@dcs.shef.ac.uk
------------------------------------------------------------------------------

This may or may not be applicable to your situation, but what the
heck.

You may want to investigate to ensure that all NFS patches are
installed. When 4.0 came out, I was the lucky person who got
to spend 19 months arguing with Sun about data corruption in
NFS. One of the forms of corruption was the changing of data
to NULLs in the file. If you look closely at your file, you
may find that the number of characters is within one memory
page size of the original. You should check to see if the
corruption begins on a page boundary as well.
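
Checking where the nulls start and how long they run is easy to script. The
following Python sketch lists long runs of null bytes and reports which block
sizes divide each starting offset. The sizes tried are the 512 bytes and 8K
mentioned in this thread plus 4096 as an assumed page size; the function names
and the sample data are illustrative only.

```python
def null_runs(data: bytes, min_len: int = 512):
    """Yield (offset, length) for each run of NUL bytes at least min_len long."""
    start = None
    for i, byte in enumerate(data):
        if byte == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                yield start, i - start
            start = None
    # A run that reaches end of file never sees a closing non-NUL byte.
    if start is not None and len(data) - start >= min_len:
        yield start, len(data) - start


def alignment(offset: int, sizes=(512, 4096, 8192)):
    """Return the block sizes from `sizes` that evenly divide offset."""
    return [size for size in sizes if offset % size == 0]


if __name__ == "__main__":
    # Made-up file image: 8K of data followed by 8K of nulls to end of file.
    data = b"x" * 8192 + b"\0" * 8192
    for offset, length in null_runs(data):
        print(offset, length, alignment(offset))
```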

My problems eventually resulted in the NFS jumbo patch, which
for some reason continues to live on in successive versions of
the OS.

Best of luck, corruption is a pain to track down.

--mark

---
Mark Morrissey                    Intel Corp.
------------------------------------------------------------------------------

I wouldn't look for patches as you probably have a hardware problem. Things you should consider:

(1) If you log into the server (4/490) and access files nothing happens, but get corruption when you access them from an NFS client, then you may have an ethernet chip gone bad. This happened on our 4/380 a few years ago. After maintenance replaced the entire CPU board, the problem went away.

(2) Are all of the corrupt files on one disk? If so, maybe the disk is going bad.

(3) Are all of the corrupt files on one controller? If so, maybe the controller is going bad.

I'm assuming you've rebooted your server so that "fsck" can attempt to clean things up.

Kevin W. Thomas

------------------------------------------------------------------------------
You should install the SunOS 4.1.1 jumbo NFS patch!! I'll bet the blocks of nulls tend to be in multiples of 512 bytes.

A virus is unlikely, a trojan more likely but have you double checked the partition tables?

I had this happen once long ago on an OLD disk on 4/110 running 4.0.3 that never got tracked down but haven't seen that kind of behaviour in years.

--
-dave fetrow- INTERNET: fetrow@biostat.washington.edu
------------------------------------------------------------------------------

That sounds very, very much like you have nfs running in dangerous mode somewhere. If you are unfamiliar with it, nfs service has two modes - 'safe' (synchronous writes) and 'unsafe' (the client fires a write and hopes that it actually gets to the disk).

You may also have read/write file systems mounted 'soft', which is unwise and explicitly warned against in both Sun and O'Reilly documentation, but which is an incredibly common error.

RichardT

------------------------------------------------------------------------------
this looks like a known problem, where files are corrupted on the client side.
check the NFS server to see if they are still intact on the server, and only
appear mangled on the client. in either case, install the NFS jumbo patch on
BOTH the clients and servers

--hal

------------------------------------------------------------------------------

*******************************************************************************
* Randy V. Born                          EMAIL: born@ai.iit.nrc.ca            *
* Technical Officer                             randy@ai.iit.nrc.ca           *
* National Research Council of Canada    PHONE: (613) 993-8549                *
* Montreal Road, Bldg M-50               FAX:   (613) 952-7151                *
* Ottawa, Ontario, Canada, K1A 0R8                                            *
*******************************************************************************