The problem turned out to be several users who were writing 90 small
files every two minutes to a file server via NFS. This wouldn't have
been so bad, but the directories to which they were writing had
between 80,000 and 200,000 files in them. We have since converted them
to use a hierarchical approach and all is well in 670 land.
However, several people commented that they had seen similar problems,
so here is a summary of the hints and suggestions which I received.
Note also that there is a performance patch to SunOS 4.1.3_U1 (patch
#101508-01) which we also installed.
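
For readers wondering what a "hierarchical approach" can look like, here is a minimal sketch in Python. It is not the scheme we actually deployed; the function names, the use of a SHA-256 hash, and the two-level, two-character layout are all assumptions chosen for illustration. The idea is only that each file lands in a small subdirectory derived from its name, so no single directory grows to tens of thousands of entries.

```python
import hashlib
from pathlib import Path


def hashed_path(root: Path, filename: str, levels: int = 2, width: int = 2) -> Path:
    """Map a flat filename to a nested path such as root/ab/cd/filename.

    Hypothetical layout: each directory level is named after `width`
    hex characters taken from a hash of the filename.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, filename)


def write_small_file(root: Path, filename: str, data: bytes) -> Path:
    """Write `data` under the hashed subdirectory, creating it if needed."""
    target = hashed_path(root, filename)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

With two levels of two hex characters there are 256 * 256 = 65,536 leaf directories, so 200,000 files average about three per directory. A single level (256 directories) may already be enough in practice.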
--mark
---
Mark Morrissey          Computing Facilities Manager
markm@cse.ogi.edu       The Oregon Graduate Institute
(503) 690-1106          Department of Computer Science and Engineering
-----------------------------------------------------------------------------
The original:
We have a Sun 670 MP system with 4 processors running SunOS 4.1.3_U1. Periodically, the response time on this system gradually increases until it takes 2-3 minutes for a command such as w() or ps() to complete. This slowdown can occur over several hours or in a very short period of time. The slowdown appears to be unrelated to actual use at the time of the slowdown. Often times, the slowness will clear up without intervention.
---------------------------------------------------------------------------------
From: nlf@aluxpo.att.com
Date: Wed, 9 Feb 94 08:28:52 EST
Subject: Re: 670 MP problems
Mark,
About 1 year ago we were in the same boat as you with our 2 670s and 4 Cypress processors each. Spin lock was high and so forth. NFS could be your problem if nobody is on the machine; ours was that we had > 100 users on each machine.
Anyway, we did not want to migrate to Solaris 2, nor have we (though we are testing 2.3 and it is reasonable).
We solved our problem by replacing each dual processor Cypress with a single SS10/41 module so we have two 670/412s now. The machines fly and spin lock remains low. It did solve our problem and we've added numerous patches that Sun has put out for MP kernel bugs. The users were ecstatic.
I know that you are at a university and don't have the same $$ that AT&T has, but perhaps someone is trading in model 41s for model 51s and you can get some cheap...
Also - because of the way the kernel is, sometimes two processors will do a better job than putting in all four (at least that is what Sun says). We tried it as one of our experiments and it was much worse than four processors and we had to shut down after 5 hours and add the third and fourth processors.
Good Luck,
Nelson Fernandez
AT&T Bell Labs
nelson.fernandez@att.com
-------------------------------------------------------------------------
From: dougj@iplab.health.ufl.edu (Doug Jones)
Subject: Re: 670 MP problems
I had talked to a kernel guy at Sun in November about 4.1.3C compatibility with 4.1.3. At the time the guy mentioned 4.1.3U1, which was not in existence yet, and also mentioned off the record a patch that was being worked on at the request of some large corporate customers who wanted the 4.1.3 OS fixed to do MP correctly. He told me that it was in beta testing and that this was not officially acknowledged by Sun (as was the case with 4.1.3U1 in November).
You might log a call and see if Sun has a fix now available.
dougj
---------------------------------------------------------------------------------------
From: mcostel@kaman.com (Mark Costello)
Subject: Re: 670 MP problems
Hi Mark,
I have a client with (4) Cypress CPUs in a 670 on SunOS 4.1.3 and have not seen this behavior.
I have many seats running SunOS 5.x and have not found it painful. The challenge was to learn the 'new' way of doing things.
Good Luck in the direction you choose,
Mark Costello
---------------------------------------------------------------------------------------
From: shandelm@jpmorgan.com (Joel Shandelman FIMS Information Systems - 212-648-4480)
Subject: Re: 670 MP problems
See if you can dig up a Unix Review article under the Tested Mettle heading. They found that 4 CPUs in a 670/690MP can slow you down under certain circumstances. The article must have appeared about 12-18 months ago.
-- Joel
PS. Only Solaris-2.x REALLY knows what to do with multi-processors
---------------------------------------------------------------------------------------
From: Pat Cain (Denver) <pjc@denver.ssds.com>
Subject: Re: 670 MP problems
Mark:
I occasionally see slowing on our 690, also with 4 Cypress processors running 4.1.3 rev A.
After several hours of tracking, I've found that once a user that was running PINE exited, the slow response went away. I asked the user how they invoked pine, and they responded "I use an alias in my .cshrc". I looked in his .cshrc, and the alias was formed as follows:
alias p pine -i -z
If you run pine like this from a command line, it doesn't cause the problem. If you put QUOTES around the alias, the performance doesn't degrade.
Very, very odd, but true.
Please keep me informed when you find what you're looking for.
pjc
PS - my version of "top" (2.5) doesn't show "spin". What are you running?
---------------------------------------------------------------------------------------
From: wpmc!mother!cygan@uu5.psi.com (Linda Cygan)
Subject: Re: 670 MP problems
Do you have any Solaris 2.2 machines in your network? We had similar problems from the time I put 2.2 on several machines. Upgrading those machines to 2.3 has mysteriously fixed the problems, as well as NFS problems we were also having.
---------------------------------------------------------------------------------------
From: celeste@xs.com (Celeste Stokely)
Subject: Re: 670 MP problems
Any console error messages? Anything in /var/adm/messages about retries on disk blocks?
Just a thought on a different track than the one you're taking.
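
One way to run the check suggested above is a short Python filter over the log. This is only a sketch: the exact wording of disk retry messages varies by driver and OS release, so the pattern below (any word starting with "retry", "retries", or "retryable", in any case) is an assumption, and `find_retry_lines` and `scan_file` are hypothetical helper names.

```python
import re
from typing import Iterable, List

# Assumed pattern: matches "retry", "retries", "Retryable", etc.
RETRY_PATTERN = re.compile(r"retr(y|ies|yable)", re.IGNORECASE)


def find_retry_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a retry, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if RETRY_PATTERN.search(line)]


def scan_file(path: str = "/var/adm/messages") -> List[str]:
    """Scan a messages file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_retry_lines(handle)
```

Calling `scan_file()` on the file server and printing the result gives a quick list of candidate lines to read by hand.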
..Celeste Stokely
Unix System Administration Consultant, Stokely Consulting
EMAIL: celeste@xs.com
Voice Line: 415-967-6898 / FAX: 415-967-0160
USMAIL Address: Stokely Consulting
                211 Thompson Square / Mountain View CA 94043
--------------------------------------------------------------------------------------
From: stern@sunrise.East.Sun.COM (Hal Stern - NE Area Systems Engineer)
Subject: Re: 670 MP problems
if you're seeing that much spin, something is requesting kernel services and then hogging them. this sounds very, very much like a serial line that is going out of control -- are you seeing lots of gettys? are you/were you running something on the serial ports?
--hal
---------------------------------------------------------------------------------------
From: Thomas Hutton <hutton@SDSC.EDU>
Subject: Re: 670 MP problems
We saw this same problem on our 600 series with the Cypress chipset under 4.1.2 and 4.1.3. We upgraded from the Dual Cypress cards to Viking cards and the problem went away. Note that a single Viking is supposedly faster than a dual Cypress card.
Tom Hutton - San Diego Supercomputer Center
-------------------------------------------------------------------------------------
From: John DiMarco <jdd@db.toronto.edu>
Subject: Re: 670 MP problems
In list.sun-managers you write:
>We have a Sun 670 MP system with 4 processors running SunOS 4.1.3_U1.
>Periodically, the response time on this system gradually increases
>until it takes 2-3 minutes for a command such as w() or ps() to
>complete. This slowdown can occur over several hours or in a
>very short period of time. The slowdown appears to be unrelated
>to actual use at the time of the slowdown. Often times, the slowness
>will clear up without intervention.
Hmm, the only thing I can think of is to try applying the latest version of the Sun4m performance patch, 100726-12.
Regards,
John
--
John DiMarco                                    jdd@cdf.toronto.edu
Computing Disciplines Facility Systems Manager  jdd@cdf.utoronto.ca
University of Toronto                           EA201B,(416)978-1928
-----------------------------------------------------------------------------------------
From: Colin Hillman <colin@apsec.bt.com.au>
Subject: Slow 690s
I'd appreciate seeing your replies - we sometimes have that problem too, but we have an Oracle database and a user interface that's pretty CPU hungry!
Thanks in advance
Colin Hillman
colin@apsec.bt.com.au
BT Australasia - Asia Pacific Software Engineering Centre