Folks,
Here it is at long last! Apologies for the delay. The problem has only recently
been fixed.
Eventually, at Sun's request, we moved from Sol2.5 to Sol2.6. The effect was
dramatic. So far we've not seen any load problems, and can comfortably
accommodate 500+ imap daemons. At this load we have ~60% CPU idle, and quite
low paging activity.
Under Sol2.5 we hit very high smtx counts (~300), as shown by mpstat, at a load
of ~400 imap daemons. Now we're seeing typical smtx counts of 10-20 with >500
daemons. Response at the imap clients is excellent at this load.
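For anyone wanting to watch the same figure, here is a minimal sketch in Python that pulls the smtx column out of captured mpstat output. It assumes each sample starts with a header line containing "CPU" and "smtx" columns; the function names and the threshold of 200 are illustrative choices (picked to sit between the ~300 and 10-20 figures above), not anything Sun recommended.

```python
def smtx_per_cpu(mpstat_text):
    """Return {cpu_id: smtx} for the last sample found in mpstat output.

    Columns are located by header name, so the exact column order does
    not matter as long as 'CPU' and 'smtx' headers are present.
    """
    header = None
    result = {}
    for line in mpstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            header = fields
            result = {}  # a new sample starts: discard the previous one
            continue
        if header is None or len(fields) != len(header):
            continue  # not a data row we can interpret
        row = dict(zip(header, fields))
        result[int(row["CPU"])] = int(row["smtx"])
    return result


def busy_cpus(mpstat_text, threshold=200):
    """CPUs whose smtx count in the last sample exceeds the threshold.

    The default threshold is an arbitrary illustration, not a tuned value.
    """
    counts = smtx_per_cpu(mpstat_text)
    return sorted(cpu for cpu, smtx in counts.items() if smtx > threshold)
```

Feeding it the text captured from something like 'mpstat 10' and calling busy_cpus() gives the list of CPUs that are spinning on mutexes in the most recent interval.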
I suspect the next bottleneck will be disk I/O on /var/mail, where the disk is
now reaching ~40% busy as shown by iostat -x. But we can do something about
that for modest expenditure.
On Sol2.5 we had unsuccessfully tried:
- removing as much NIS activity as possible: the server was a NIS client and
  used NIS for everything in nsswitch.conf. Now it's just used for passwd and
  netgroup. That reduced TCP activity between it and the NIS server
  dramatically (reduced total TCP traffic by ~25%!)
- adding more swap space
- adding DNS services (secondary server)
- moving SMTP services onto another server
- twiddling TCP parameters
all to no avail.
It seems Sol2.5 may have some deficiencies for the kind of load our imap server
gets.  There's some evidence from other sites that this problem is NOT apparent
in Sol2.5.1.
People are smiling at me again - even my wife!
Thanks to all who replied:
From: arthur@spool1.mail.troy.psi.com
From: <Glenn.Satchell@uniq.com.au>
From: rali@meitca.com
From: Scott McDermott <scottm@kcls.org>
From: "Eric M. Stone" <erics@cdcna.com>
From: "Karl E. Vogel" <vogelke@c17mis.region2.wpafb.af.mil>
From: birger@Vest.Sdata.No (Birger A. Wathne)
From: Francis Liu <fxl@pulse.itd.uts.edu.au>
From: Bret Giddings <bret@essex.ac.uk>
From: Clive McDowell <C.McDowell@Queens-Belfast.AC.UK>
From: Kevin Worvill <K.Worvill@uea.ac.uk>
Several people suggested using a different imap daemon. This would have been
a very big change, as other imap implementations are thought not to be
compatible with the present imsp database. So we chose to take Sun's advice
and move to 2.6. It turned out to be a straightforward upgrade, but we hit
problems with the amd automounter, so we're now using the Solaris automountd.
Original query shown below. Thanks again to all.
Gordon.
-------------------------------------------------------------------------
Gordon Robertson,		Central Systems Manager,
                                Infrastructure Systems Division,
Tel  +44(0)224 273340		Directorate of Information Systems 
E-Mail : g.robertson@abdn.ac.uk	and Services,
                                Aberdeen University, Aberdeen AB24 3FX, U.K.
--------------------------------------------------------------------------
Folks,
We have an SS1000 with 4 CPUs, 384 Mbytes, 2 x SWIFT cards, and an FDDI
(SUNWnfr 4.0) card. There's a fast/wide SCSI disk on one of the SWIFT
cards (s4 below) and 2 x 1 Gbyte internal disks (s0, s1). Both hme interfaces
are connected.
Every now and again, when user load builds up, the system runs very slowly,
with the run queue getting very high (20 and up) and "sys" CPU time >90% as
shown by vmstat (see below).
This server runs Sol2.5 and provides imap, pop and sendmail services. The
biggest load is imap support - we see upwards of 400 imap processes when
busy. All is well up to about 420, then the next few cause the
problem. Then when a few drop off, we get back to normal.
I've checked the hme and fddi interfaces - the packet rates are modest
compared to some of our less powerful NFS servers.
When running normally with about 400 imaps, there's a fair bit of memory 
allocation activity as shown by vmstat, but response time is good when 
running such commands, and imap service is good. 
Here's some output from 'vmstat 10' showing the transition from normal
-> dreadful...
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s4 --   in   sy   cs us sy id
 0 0 0 410120  9136   0 443  6  0  0  0  0 18  1 10  0  328 7163  917 12 36 53
 0 0 0 411164  9532   0 442  5  0  0  0  0 19  0 16  0  282 2414  527  5 24 71
 0 0 0 413992 11692   0 309  2  0  0  0  0 15  0  4  0  180 1196  241  5 10 85
 0 0 0 415804 12896   0 342 28  0  0  0  0 12  1 12  0  228 1393  247  5 20 74
 0 0 0 413328 10900   0 443 23  0  0  0  0 14  0  7  0  230 2931 1030  6 33 61
 0 0 0 410708  8764   3 615 14 18 18  0  0 20  0 16  0  393 3785 1292 10 44 46
 1 0 0 404952  6496  21 800  4 221 458 0 101 24 0 10 0  475 4521 1441 13 58 29
 0 1 0 404480  7660  14 611 48  8  8  0  0 26  2 26  0  545 5390 1110 13 45 41
 0 1 0 408172  9108   0 610  5  0  0  0  0 19  0 16  0  451 4302 1277 11 53 36
 0 0 0 408872  9368   0 328  7  0  0  0  0 14  0  5  0  203 2610  855  5 28 67
 0 0 0 410960 10928   0 129  0  0  0  0  0 11  2 24  0  266 1066  187  2 18 80
 1 0 0 410424 10420   0 490 80  0  0  0  0 10  0 21  0  588 4003 1723  8 65 27
 9 0 0 408052  8180   1 428  0  8  8  0  0 15  0  1  0  688 3959 1775  8 91  1
 8 1 0 408456  7860   4 513 31 14 14  0  0 24  0 12  0  801 3782 1720  9 89  1
 10 1 0 407064 7268   6 551 38 26 26  0  0 24  1 12  0  822 3918 1777  8 90  2
 0 1 0 410616 10252   1 738  8  1  1  0  0 29  0 18  0  527 3326  878 16 37 46
 1 0 0 410272  9976   0 319  9  0  0  0  0 10  0  8  0  330 3533 1300  6 54 39
 5 0 0 408792  8388   0 434 65  0  0  0  0 17  2 15  0  713 4419 1903  9 88  3
 9 0 0 405100  6180   5 394  0 100 327 968 96 8 0 1  0  562 3640 1537  8 91  1
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s4 --   in   sy   cs us sy id
 18 1 0 401832 6640  20 608 43 182 259 4 31 24 0 27  0  816 4040 1606 12 87  1
 19 0 0 400652 6880   9 420  2 48 58  0  4 14  0  8  0  618 3480 1477  9 91  1
 16 0 0 403800 8852   0 310 16  0  0  0  0  9  2 17  0  746 3624 1613  9 90  1
 11 0 0 403556 8196   0 280 16  0  0  0  0  9  0 15  0  733 3674 1667  9 90  1
 11 0 0 406100 9012   0 230 31  0  0  0  0  6  0 22  0  612 2155 1273  6 92  3
 16 0 0 408300 10172  0 155 88  0  0  0  0  8  1 25  0  667 2679 1297  7 92  1
 13 0 0 409668 10308  0 215 18  0  0  0  0  4  0 34  0  775 2973 1481  9 89  2
 19 0 0 409568 10444  0 216 10  0  0  0  0  5  0  5  0  528 2689 1239  6 94  0
 12 0 0 410628 12008  0 211  6  0  0  0  0 37  0  6  0  770 4781 1604 14 85  1
Here's some really dreadful vmstats...
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s4 --   in   sy   cs us sy id
 29 0 0 444512 8080  18  79 32 11 11  0  0  2  0 25  0  565 2283 1180  5 94  1
 33 0 0 446080 8376  19  88 24 27 27  0  0  7  2 35  0  588 1900  946  6 91  3
 35 0 0 447876 9980   3 122 51  0  0  0  0  5  0 21  0  550 1943 1142  6 94  1
 30 0 0 449176 11032  0 111  0  0  0  0  0  1  0 10  0  423 1496  901  6 94  0
 28 1 0 448472 10240  0 200 258 0  0  0  0  4  0 33  0  632 2331 1244  8 89  3
 33 0 0 449104 9192   0 194 31  0  0  0  0  4  0 18  0  701 2768 1519 16 84  0
 30 0 0 447280 8496   0 354  0  0  0  0  0  4  0  3  0  585 2545 1287 18 82  0
 28 0 0 445916 7328   0 146  0  0  0  0  0  1  0  0  0  367 1574  850  5 95  0
 33 0 0 445008 6648   0 178  1  0  0 384 0  3  2  8  0  420 1967 1024  5 94  0
 20 0 0 443972 6344  14 119  0 116 177 432 37 3 0 18 0  570 2277 1221  5 95  1
 30 0 0 446844 8804   0 131  1  0  0  0  0  4  0 10  0  540 2221 1158  5 93  1
 25 0 0 446112 8248   0 225  2  0  0  0  0  3  0  8  0  534 2370 1173 13 87  0
 22 0 0 445832 8080   0 160  0  0  0  0  0  2  0 14  0  552 2223 1132  9 91  0
 17 0 0 445396 7684   0 241  2  0  0  0  0  3  0 30  0  758 3033 1468  9 91  0
 22 0 0 446264 8420   0  72  0  0  0  0  0  2  0  8  0  393 1978  962  4 95  0
 20 0 0 446012 8188   0 280  2  0  0  0  0  5  2 23  0  566 2389 1135  7 92  1
 23 0 0 445700 7996   0 168 24  0  0  0  0  4  0 16  0  673 3091 1482  7 93  0
 21 0 0 444876 6768   0 155 67  1  1  0  0  1  0 11  0  463 2239 1065 10 90  0
It seems I must have hit some hard limit somewhere for this change from
good -> poor performance to happen so suddenly.
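To pin down exactly which samples cross over, vmstat output in the format shown above can be filtered mechanically. The following is a rough Python sketch, not a tested tool: the function name and both thresholds (run queue of 8, 85% sys) are assumptions chosen for illustration. Because the header has two columns called "sy", the CPU sys figure is taken by position (second from the end) rather than by name.

```python
def overloaded_samples(vmstat_text, max_runq=8, max_sys=85):
    """Return (sample_index, run_queue, sys_pct) for overloaded samples.

    A sample counts as overloaded when the run queue ('r', first column)
    and CPU sys time (second-to-last column) both reach the thresholds.
    The thresholds are illustrative defaults only.
    """
    hits = []
    index = 0
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Skip blank lines and both header lines: data rows are all digits.
        if not fields or not all(f.isdigit() for f in fields):
            continue
        run_queue = int(fields[0])
        sys_pct = int(fields[-2])
        if run_queue >= max_runq and sys_pct >= max_sys:
            hits.append((index, run_queue, sys_pct))
        index += 1
    return hits
```

Run against a saved 'vmstat 10' log, the sample index multiplied by the interval gives a rough time offset for each overloaded period, which can then be lined up against the count of imap processes.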
In my /etc/system file I have 'set maxusers=256'. 
I had a look at 'kmastat' output from "crash" and it seems that perhaps
some buffer limits are low, e.g. ...
cache name            size avail total   in use    succeed fail
----------           ----- ----- ----- --------    ------- ----
kmem_alloc_8192       8192     4   929  7610368     351307    0
ptbl_kmcache           272     1   910   266240        909    0
pt_kmcache            4096     0   909  3723264        909    0
inode_cache            320    45  9036  3084288       9036    0
rnode_cache            376    19  4460  1826816      85615    0
...but I'm not sure how to interpret this properly. If I'm right, I seem to
be near the 'bufhwm' limit, which I think defaults to 2% of memory, and which
looks very near the "in use" value of kmem_alloc_8192. Should I set 'bufhwm'
and/or change maxusers?
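The comparison behind that guess can be checked with a few lines of Python. This is only the arithmetic, under stated assumptions: that 384 Mbytes means 384 x 1024 x 1024 bytes, that the 2% default is right, and that kmem_alloc_8192 is the relevant figure to compare against (which is exactly the part I'm unsure of). The names are made up for the example.

```python
PHYS_MEM_MB = 384                 # from the hardware description
KMEM_ALLOC_8192_IN_USE = 7610368  # bytes, "in use" column in the crash output


def default_bufhwm_bytes(phys_mem_mb, fraction=0.02):
    """Assumed default bufhwm: a fixed fraction of physical memory."""
    return int(phys_mem_mb * 1024 * 1024 * fraction)


limit = default_bufhwm_bytes(PHYS_MEM_MB)
ratio = KMEM_ALLOC_8192_IN_USE / limit

print(f"assumed default bufhwm: {limit} bytes ({limit // 1024} Kbytes)")
print(f"kmem_alloc_8192 in use: {ratio:.1%} of that")
```

On those assumptions the limit works out to about 8.05 million bytes, so the "in use" figure sits at roughly 94.5% of it, which is why the two numbers look suspiciously close.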
Can anyone suggest what the problem might be, or what to
investigate next, or even provide a cure (preferably without spending
money)?
Gordon.
 