SUMMARY: 690MP problems and patches to 4.1.2

From: Ove Hansen (hansen@SCR.SLB.COM)
Date: Tue Jul 28 1992 - 13:39:38 CDT


A month ago I sent the message included at the end regarding my 690MP and problems
I had with it. A lot of people had solutions, a lot more had similar problems and
wanted a summary. Well, here it is, with thanks to everyone who suggested ways
of solving the problems:

button@alc.com, Mike.McCann@eng.clemson.edu, chrisb@learning.siemens.com,
stack@metaflow.com, hanh@mars.cse.fau.edu, glenn%kalli@fourx.Aus.Sun.COM,
ramey@jello.csc.ti.com, mdl@cypress.com

(Apologies to anyone I have unintentionally left out.)

The following four patches were recommended. The last two are the ones I
have applied to my system.

100475-01: SunOS 4.1.2: mmap system call on galaxy causes BAD TRAP
100495-01: SunOS 4.1.2: asynch I/O on a sun4m machine causes panics
100542-03: SunOS 4.1.2: IPI - Galaxy jumbo patch
100575-02: SunOS 4.1.2: MP machines do not perform as well as 4/4XX equivalent

The problem with the kernel with patch no. 100542-03 hanging on boot-up seems
to have been caused by the IPI disk cards. Mine were revision level 5, and
replacing them solved that problem. I believe the current revision level is 9.
If you have an old system that has been upgraded to an MP system, you should
definitely check the revision levels of your IPI cards. After Sun replaced the
cards, both patches worked fine: no more boot-up hangs or panics!

Another change I made was to remove two swap files (I had two raw swap partitions
and two swap files created with `mkfile') and re-partition my disks so that I
now have only two raw swap partitions, one on each IPI driver. I don't know
whether that solved any problems, but it could be significant.

When large processes exit, the whole system might freeze for several minutes.
Apparently Sun has proposed a patch for this, but until it is released the
following work-around can be used: turn off the swap-order code (which is on by
default and gives a minor speedup for extremely long-running processes). To turn
it off on a running system:

    echo swap_order/W0 | adb -w /vmunix /dev/kmem

I have put it in my rc.local; I hope that is the right place for it. The problem
apparently affects all SPARC 2 based systems, not just the 690MP. The work-around
is Sun-approved:

> Date: 26-May-92
>
> Description: 4M MACHINES HANG WHEN EXITING LARGE PROCESSES
>
> PROGRESS CHANGE:
>
> 05/26/92
> We have satisfied ourselves that the workaround is acceptable. Have the customer
> adb their kernel and set the global 'swap_order' to 0 (it's 1 by default). This
> will alleviate the problem. Engineering is working on a real fix, but this may
> take some time.

After I did this, the system hangs I used to experience disappeared.
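
For anyone who wants the work-around in rc.local, here is a minimal sketch of
how it could be wrapped. The function names `swap_order_request' and
`disable_swap_order' and the KERNEL/KMEM variables are made up for illustration;
only the adb request itself comes from the command above. It assumes a Bourne
shell with function support, and it skips the write if the kernel image is
missing rather than guessing.

```shell
# Hypothetical rc.local fragment: turn off the swap-order code at boot.
# KERNEL and KMEM default to the paths used in the adb command above.
KERNEL=${KERNEL:-/vmunix}
KMEM=${KMEM:-/dev/kmem}

# Print the adb request that writes the given value into swap_order.
swap_order_request() {
    echo "swap_order/W$1"
}

# Write 0 into swap_order in the running kernel, but only if the
# kernel image is actually there; otherwise report and return non-zero.
disable_swap_order() {
    if [ -f "$KERNEL" ]; then
        swap_order_request 0 | adb -w "$KERNEL" "$KMEM"
    else
        echo "swap_order work-around skipped: $KERNEL not found" >&2
        return 1
    fi
}
```

Calling `disable_swap_order' from rc.local would then apply the setting at each
boot. The current value can be inspected with
`echo swap_order/D | adb /vmunix /dev/kmem', which should show 0 once the
work-around is in effect.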

I am not sure exactly what fixed the problem with processes hanging in "D" state,
but with the changes described above this problem disappeared completely too,
and I have actually started to like my 690MP now...

Again, many thanks to everyone who replied, and I hope this small summary will be
useful for those who wanted it.

----------------------------------------------------------------------------
Ove Hansen e-mail : hansen@scr.slb.com
Schlumberger Cambridge Research Tel/fax: 0223-325246 / 0223-315486
P.O.Box 153, Cambridge CB3 0HG, England (International prefix for UK: 44)
============================================================================
>
> Which patches have people applied to SunOS 4.1.2's Multiprocessing bits
> and IPI disk control devices? And which problems have occurred and/or been
> solved by these? We've a 690MP with 2 CPUs, running 4.1.2 without any patches
> to the operating system whatsoever.
>
> At times our wonderful system grinds to a halt without any messages on the
> console or elsewhere, then suddenly springs to life again. Sometimes processes
> hang in "D" state (uninterruptible wait); usually they belong to an
> application called `Matlab', and this always happens on cpu 0. And `pstat -s'
> (after the system has run for a while) shows that the majority of the swap
> space is `reserved', while adding up the sizes of the processes from `ps'
> indicates that much more swap space should be free.
>
> Sun sent me a couple of patches: 100542 to take care of bugs in the IPI driver,
> and 100575 which improves the MP performance, and should take care of some
> system hangs, crashes etc. 100542 ought to be installed before 100575.
>
> So late one evening off I went and installed the first patch (100542 - IPI
> Galaxy jumbo patch) and booted the system with the new kernel (GENERIC with
> MAXUSERS increased to 128). At the point where it tried to become an NFS
> server (in rc.local), it panicked and died:
>
> panic on 1: free: freeing free frag
> syncing file systems...
>
> So I undid that patch, and applied the second patch. Now it booted up, but
> again, where it in rc.local tried to become an NFS server, exporting the file
> systems took several minutes (as opposed to 10 seconds before the patch) and
> I almost pressed <ctrl><break> as I thought the system hung. No error messages
> were given. The system appeared to run fine after the apparent hang, but I
> chickened out, gave it its old kernel, and rebooted again.
>
> Sun said that they are not aware of any problems with the patches, and
> recommended that I enable savecore and send them a tape with the panic dump.
> I'll do this when I can free up enough disk space (128 MB).
>
> If anyone should recognise any of my problems and have any solutions, I would
> be extremely grateful. Especially as we have ordered a Prestoserve card which
> supposedly needs the first patch. If anyone should be aware of further problems
> with the MP and have any recommendations - any info would be welcome. I'll
> summarise to anyone who wishes so (or here if many do.)
>
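
The swap comparison mentioned in the quoted message (process sizes from `ps'
against the figures from `pstat -s') could be roughed out as below. This is only
a sketch: the function name `sum_ps_sz' is made up, and it assumes `ps -aux'
style output with a header line and the SZ column as the fifth field, in
kilobytes. Shared text is counted once per process, so the total is an upper
estimate at best.

```shell
# Sum the SZ column (assumed: field 5, kilobytes) of `ps -aux'-style
# output read from stdin, skipping the header line.
sum_ps_sz() {
    awk 'NR > 1 { total += $5 } END { print total + 0 }'
}
```

Typical use would be `ps -aux | sum_ps_sz', with the result compared by hand
against what `pstat -s' reports.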


