[Summary] Diagnosis fun...

From: <Jason.Shatzkamer_at_cexp.com>
Date: Wed Feb 25 2004 - 07:26:57 EST

OK, some fantastic responses.....Darren, thank you, as always, for your
advice....both here, and on vrts, I always enjoy your posts....


The guy that hit the nail RIGHT ON THE HEAD this time was Frank
Smith....Frank, well done, I swear you defined my problematic system to a T
(see Frank's response below), and all based on some preliminary sar
output....VERY well done....

I suppose it's still too early to tell what kind of environment I'll see
when 500 users are pounding away at the system, so perhaps it's too soon for
a summary....that recognized, I will detail the steps I took throughout the
evening, the end result of which brought a system from 3% user, 60% system,
1% iowt, and 33% idle, to 67% user, 20% system, 18% iowt, and 2%
idle....these numbers are based on running multiple user-like loads via scripts....

I am aware that:

1. 2% idle time sucks, and more than likely the system is CPU bound, and can
benefit from some more hardware...
2. No script can emulate a true production user load

So no one yell at me that it's too soon to know for sure how the system will
behave during peak hours.....


1. I watched a historical peak of 4% user time rise to a never-before-seen
80% user time....
2. I watched the usr/kernel ratio completely pull a 180, which is truly what
I was after at this point
3. I watched as top went from a historical "2 or 3 on cpu" statistic, to "10
on cpu", on an 8 processor box
4. I watched mutex lock contention dive
5. I watched involuntary context switches dive
6. Cross calls are still high, but at least there is actual data processing
going on behind them
7. I watched reports that took 10-15 minutes to run, run in 1-2 minutes,
with hardly any impact on the kernel

My Approach:

1. High smtx in mpstat, combined with high cross calls, combined with high
system calls, combined with 3% user / 60% system numbers, told me quite
clearly that:
	A. Solaris was spending most of its time kernel thrashing
		a. Lots of cache transfers
	B. User processes were not getting enough CPU time to actually
process, by the time their giant process cache was pulled from other CPUs
		a. Very low user time
	C. Very much a problem accessing some shared resource inherent to
the app
		a. High smtx in mpstat
	D. App inherently has a HUGE dependence on system calls, issuing
forks, execs, lseeks, fopens, etc., at an alarming rate
		a. High number of system calls, high system time
2. Most of the symptoms seemed to have to do with CPU load (or lack
thereof), and cache thrashing
	A. Memory was not being touched
3. How can I reduce mutex lock contention, and at the same time drastically
increase the time the system spends handling user requests? So:
	A. Let's see if we can't make these processes stay on the cpu a
little bit longer
	B. Let's see if we can't shrink some of the caches where the mutex
lock contention is coming from
	C. Let's see if we can't take some advantage of the fact that the
app depends so heavily on the kernel (system calls), rather than try to
change its nature
		a. If each app request becomes a system call, then in
effect, each user request MUST be chaperoned by a kernel thread
		b. As such, maybe I should concentrate on streamlining my
kernel, and let the app ride the coattails of the more efficient kernel
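For anyone wanting to reproduce the diagnosis in 1.C, the smtx check is easy to script. A minimal sketch, run here against canned mpstat-style sample output (column positions match Solaris 8 mpstat, but the cutoff of 500 spins is just my illustrative threshold; in real life pipe in `mpstat 5` instead):

```shell
# Flag CPUs whose mutex spin count (smtx, field 10 of mpstat output) looks high.
# The canned sample below stands in for live `mpstat 5` output.
sample='CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   12   0  4210   512  210  830  120    5  912    0  9100    3  60   1  36
  1    8   0  3980   498  205  790  115    4  105    0  8800    4  58   2  36'
echo "$sample" | awk 'NR>1 && $10 > 500 {print "CPU " $1 " smtx=" $10 " (possible lock contention)"}'
# prints: CPU 0 smtx=912 (possible lock contention)
```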

That said, here is what I did, after hours of research and benchmark / load
testing:
1. I removed any /etc/system entries that expanded the size of IPC
parameters, specifically semaphores and message queues, and let them default
to whatever Solaris wants by nature
2. I replaced the default Solaris 8 TimeSharing dispatch table (dispadmin)
with a StarFire TimeSharing dispatch table
3. I changed the setting of LD_LIBRARY_PATH so that the alternate Solaris
thread libraries (/usr/lib/lwp) were used during dynamic linking of the
application code, rather than the Solaris default of /usr/lib.
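In case it helps anyone replay this, here is a sketch of those three changes as commands (tunable names, values, and the table file name are illustrative, not my exact edits):

```shell
# 1. /etc/system: commented out the enlarged IPC tunables so the Solaris
#    defaults apply again, i.e. removed lines like these (values illustrative):
#      set semsys:seminfo_semmni=2048
#      set msgsys:msginfo_msgmni=1024
#    (/etc/system changes need a reboot to take effect)

# 2. Dispatch table swap: save the current TS table, then load the Starfire
#    variant (the replacement table file itself is not shown here):
#      dispadmin -c TS -g -r 1000 > /tmp/ts_table.orig   # save current table
#      dispadmin -c TS -s /tmp/ts_table.starfire         # load replacement

# 3. Put the alternate thread library first so the runtime linker picks it up:
LD_LIBRARY_PATH=/usr/lib/lwp:${LD_LIBRARY_PATH:-/usr/lib}
export LD_LIBRARY_PATH
```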

And that's it!

Numbers changed like night and day...NEVER has anyone seen this system run
THIS application at the speed it ran tonight....hopefully, I will see the
same results with the typical user load....I will know for sure by lunch :-)

If there is no further summary, or HELP REQUEST ;-), then you can be sure
that today went well....

I've seen this problem (kernel intensive, single threaded app) mentioned on
the list many times....Frank Smith has obviously endured it once or
twice....maybe these techniques will prove useful to others....

Thanks to All, Again,....J.~


I doubt you'll find anything wrong with the hardware.  It looks like the
machine has plenty of idle time available, it isn't swapping, and the disk
latency is usually pretty low (except for one time in the afternoon).
   That would correspond with the snappy command line response you are
seeing.  The system time you are seeing is probably mostly VxFS, NFS, and
network connection service time.
    Your protection faults (pflt/s) and  validity faults (vflt/s) seem
somewhat high.  I'm not that familiar with jBASE, but my guess would be
there is serious contention for some shared resource internal to the app
that is causing all the processes to spend their time sleeping.  You seem to
have plenty of spare disk, memory, and CPU if only the process were able to
use it.
   Does this app work speedily during off hours and then crater as the
number of simultaneous users climbs past a certain point?  I suspect that it
does, but it is difficult to track down the bottleneck.  If the app has any
profiling support built in (or could be compiled in) that would narrow down
the problem, but it still may not be fixable.  A single-threaded app can
only do so much before it bogs itself down spending all its time managing
context switches and hardly any time doing actual work.
    Adding hardware may not help, other than faster CPUs.  More CPUs or RAM
won't help as you already have idle time and no swapping.  If you are lucky
you will find some unnecessary use of locks in the code that you can
remove.  Perhaps you can move parts of the app to a different machine.
Your disk I/O doesn't seem to be a real problem, but mounting your
filesystems with the noatime option can speed that up considerably
(assuming you don't need atime for something, like scripts that remove
unused files). Also, I noticed on the jBASE web site that if you are using
j1 through j4 jBASE:

> This method, however, needs a significant amount of administration and
> maintenance in a very volatile environment, where data is being added,
> removed or changed significantly over short periods of time.

I have no idea what that means exactly, but I assume it means there are
things you need to do on a regular basis to tune and optimize those files.
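On the noatime idea above: the change is just a mount option; a hypothetical example (device names and mount point are made up, and check that your VxFS release actually supports noatime):

```shell
# Hypothetical /etc/vfstab entry for a VxFS data filesystem with noatime:
#   /dev/vx/dsk/datadg/datavol  /dev/vx/rdsk/datadg/datavol  /data  vxfs  1  yes  noatime
# Or remount a live filesystem (same caveat on names):
#   mount -F vxfs -o remount,noatime /data
```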

Good luck,
Frank Smith


> Looking for:
> 1. Anyone see anything glaring as far as isolating problematic Solaris 
> subsystem (i.e. memory, cpu, etc.)

Not really.  The actual number of forks doesn't look *too* high, but I wonder
if the processes that are forking are really big.  Normally the memory
required should be copy on write, so that the fork itself shouldn't be too
heavyweight, but the high system time points me to look in that direction.

> 2. Can something like this be caused by a bad RAM chip? Bad CPU? (No 
> errors in system logs)

I doubt it.

> 3. Some advice in narrowing down definitive cause, troubleshooting 
> checklists, tools, general approach to finding the needle in the haystack?
> As always, thanks to all....any and all additional information is 
> available upon request....

Ugh..  yeah...  Nothing off the top of my head.  I don't know if you can see
mutex locks on the sar output.  Maybe a quick 'mpstat 5' or something to see
if those look out of place.

Darren Dunham

> 1. Database app written in PICK basic, jBASE to be specific 
> (obviously, this is a major CAUSE of the problem, just got to figure out
> where, exactly)
> 	Not multithreaded, LOTS of forks, assuming application code is the
> culprit, but in order to rewrite app, I need to identify which Solaris 
> subsystem is hurting
> 2. ~500 users connecting via telnet

I would run prstat for a bit, to see which processes are eating your CPU
time.  Probably your app, as fork() is an expensive operation...
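Seconding the prstat suggestion, here is a quick filter for the heavy hitters. The canned sample lines below stand in for real `prstat -s cpu 5 1` output, and the 10% cutoff is arbitrary:

```shell
# Print PID, process name, and CPU% for anything over 10% CPU.
# In practice, pipe live output: prstat -s cpu 5 1 | awk '...'
sample='  1234 jbase    45M  12M sleep  59 0  0:01:02  18% jsh/1
  2345 jbase    45M  12M cpu3   49 0  0:00:40  11% jsh/1
  3456 root     12M   8M sleep  59 0  0:00:01 0.1% nfsd/16'
echo "$sample" | awk '{ pct=$(NF-1); sub(/%/,"",pct); if (pct+0 > 10) print $1, $NF, $(NF-1) }'
# prints only the two jsh processes over 10% CPU
```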

> 1. Anyone see anything glaring as far as isolating problematic Solaris 
> subsystem (i.e. memory, cpu, etc.)

Bit hard to tell (need vmstat and/or mpstat output), but it looks like you
have a CPU starvation problem.  Have you set priority_paging?

> 2. Can something like this be caused by a bad RAM chip? Bad CPU? (No 
> errors in system logs)


> 3. Some advice in narrowing down definitive cause, troubleshooting 
> checklists, tools, general approach to finding the needle in the haystack?

I'm guessing that PICK Basic is horribly inefficient...

Rich Teer, SCNA, SCSA


Jason Shatzkamer, MCSE, SSA
Corporate Express Imaging
1096 E Newport Center Drive
Suite # 300
Deerfield Beach, FL 33442
(800) 828-9949 x5415
(954) 379-5415
http://imaging.cexp.com <http://imaging.cexp.com> 

sunmanagers mailing list
Received on Wed Feb 25 07:26:45 2004
