Kevin, Michael, Karl, Mike, Steve: thank you for your answers.

Original question: What are good iostat thresholds for detecting disk bottlenecks?

Answer: There was a general consensus about the 30 ms service time threshold,
but I'm still convinced that it isn't significant for big blocks (say 1 MB
I/Os). A simple sequential-read test (dd with a 1 MB block size) shows that
svc_t can exceed 30 ms with %b at 99% even for a single reading process (you
should agree that one single sequential-read process can't be suspected of
creating a disk bottleneck, right?).

As stated by Karl:

* The significant bottleneck threshold is %b (percent time disk busy) > 20%
  AND (20 ms < svc_t (service time) < 30 ms)
* The critical bottleneck threshold is %b (percent time disk busy) > 20%
  AND (svc_t (service time) > 30 ms)

As Karl gave a lot of other tuning advice, I have reproduced it at the end
of this message.

---
Sebastien DAUBIGNE
sdaubigne@bordeaux-bersol.sema.slb.com - (+33)5.57.26.56.36
SchlumbergerSema - SGS/DWH/Pessac

-----Original Message-----
From: Karl Vogel
Subject: Re: Disk contention

>> On Tue, 22 Jul 2003 18:49:22 +0200,
>> "DAUBIGNE Sebastien - BOR" said:

S> We have a Solaris 2.6/Oracle box which has poor throughput and a high
S> (from 50 to 100) number of I/O-busy processes (column "b" of vmstat).
S> CPU (50%) and memory (no paging) are OK, so I assume the poor throughput
S> is due to the disk side.

Maybe. I've included some other things to look at below.

First, *strongly* consider upgrading to Solaris 8: lots of throughput
improvements, and a different memory management scheme. We have an
Enterprise E450 with 1 GB of memory for our main system. Tuning took a
while because the information is spread out all over the planet, but it
runs pretty well now. Our /etc/system is below.

Your directory/inode cache (measured by the dnlc script below) should have
a hit rate of at least 90-95%.

Add "noatime,logging" to the mount options field in /etc/vfstab to get the
biggest performance and boot-time improvement. You might have to install a
patch to get logging capabilities under Solaris 2.6; this is probably the
single biggest improvement you can make.

S> Also, what is a good interval for iostat samples: 30 sec? 5 min?

I've read that 30 seconds is as low as you should go, because the kernel
counters aren't updated more often than that.

-- 
Karl Vogel                          I don't speak for the USAF or my company
vogelke at pobox dot com            http://www.pobox.com/~vogelke

If a nation expects to be ignorant and free in a state of civilization, it
expects what never was and never will be.
                                                        --Thomas Jefferson
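For reference, the single-process sequential-read test mentioned at the top
of this message can be reproduced with something like the sketch below. It
is not part of the original posts; the device path, read size, and sample
count are only placeholders.

#!/bin/sh
# Rough sketch of the single-process sequential-read test discussed above.
# /dev/rdsk/c2t1d0s0 is a placeholder; point it at a real, otherwise idle
# disk (reading the raw device needs root).

# Read 1 GB sequentially with a 1 MB block size, discarding the data.
dd if=/dev/rdsk/c2t1d0s0 of=/dev/null bs=1024k count=1024 &

# Meanwhile, take a few 30-second extended-statistics samples and watch
# svc_t and %b for the disk being read.
iostat -x 30 4

wait    # wait for dd to finish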
===========================================================================
#!/bin/sh
# dnlc: print directory/inode cache

PATH=/bin:/usr/bin
export PATH

cmd='
BEGIN   { fmt = "%-13s %9d %s\n" }
/:.../  { s = substr ($0, 30); printf fmt, $1, $2, s }
/perc/  { s = substr ($0, 30); printf fmt, " ", $1, s }
'

echo 'Directory/inode cache statistics'
echo '(See /usr/include/sys/dnlc.h for more information)'
echo

adb -k /dev/ksyms /dev/mem <<END | expand | awk "$cmd"
maxphys/D"Max physical request"
ufs_ninode/D"Inode cache size"
sq_max_size/D"Streams queue"
ncsize/D"Directory name cache size"
ncstats/D"# of cache hits that we used"
+/D"# of misses"
+/D"# of enters done"
+/D"# of enters tried when already cached"
+/D"# of long names tried to enter"
+/D"# of long names tried to look up"
+/D"# of times LRU list was empty"
+/D"# of purges of cache"
*ncstats%1000>a
*(ncstats+4)%1000>b
*(ncstats+14)%1000>c
<a+<b+<c>n
<a*0t100%<n=D"Hit rate percentage"
END

exit 0
===========================================================================
#!/bin/sh
# getkern: show predefined kernel tunables

kernelvars () {
adb -k /dev/ksyms /dev/mem << EFF | \
    awk '/^[a-zA-Z_-]+:/ { \
            if (!i) { i++; next } \
            if ($2 >= 0) { printf "%-20s %s\n",$1,$2; } next } \
         /^[a-z_-]+[ \t0-9a-f]+$/ { next } \
         { print }'
autoup/D
bufhwm/D
coredefault/D
desfree/E
fastscan/E
lotsfree/E
max_nprocs/D
maxpgio/E
maxphys/D
maxuprc/D
maxusers/D
minfree/E
nbuf/D
ncsize/D
nrnode/D
physmem/E
rlim_fd_cur/D
rlim_fd_max/D
slowscan/E
sq_max_size/D
swapfs_minfree/E
tune_t_fsflushr/D
tune_t_gpgslo/D
ufs_HW/D
ufs_LW/D
ufs_ninode/D
ufs_throttles/D
EFF
}

kernelvars
exit 0
===========================================================================
* $Id: etc-system,v 1.3 2001/07/26 20:39:55 vogelke Exp $
* $Source: /space/sitelog/newmis/RCS/etc-system,v $
*
* NAME:
*    /etc/system
*
* SYNOPSIS:
*    Tailors kernel variables at boot time.
*
* DESCRIPTION:
*    The most frequent changes are limited to the number of file
*    descriptors, because the socket API uses file descriptors for
*    handling internet connectivity. You may want to look at the hard
*    limit of file handles available to you. Proxies like Squid have
*    to count two to three descriptors for each request: an open
*    request descriptor, plus an open file and/or (depending on which
*    Squid you are using) an open forwarding request descriptor.
*    Similar calculations hold for other caches.
*
* WARNING:
*    Sun does not make any guarantees for the correct working of your
*    system if you use more than 4096 file descriptors. Programs like
*    fvwm (window manager) may have to be recompiled.
*
*    If you experience SEGV core dumps from your select(3c) system
*    call after increasing your file descriptors above 4096, you have
*    to recompile the affected programs. The select(3c) call is known
*    to Squid users for its bad temper concerning the maximum number
*    of file descriptors.
* -----------------------------------------------------------------------
* rlim_fd_cur
*    Since Solaris 8: default 256, no recommendations
*
*    This parameter defines the soft limit on the number of open files
*    you can have. Use values above 256 at your own risk, especially
*    if you are running old binaries. A value of 4096 may look
*    harmless enough, but may still break old binaries.
*
*    Another source discourages using more than 8192 file descriptors
*    and suggests using more processes if you need more than 4096 file
*    descriptors. On the other hand, an ISP of my acquaintance is
*    using 16384 descriptors to his satisfaction.
*
*    The predicate rlim_fd_cur <= rlim_fd_max must be fulfilled.
*
*    Please note that Squid only cares about the hard limit (next
*    item). With respect to the standard I/O library, you should not
*    raise the soft limit above 256: stdio can only use <= 256 FDs.
*    You can either use AT&T's sfio library, or use Solaris 64-bit
*    applications, which fix the stdio weakness.
*
*    Also note that RPC prior to Solaris 2.6 may break if more than
*    1024 FDs are available to it, and setting the soft limit at or
*    above 1024 breaks license server queries (first-hand experience).
*    Using 256 is really a strong recommendation.

set rlim_fd_cur = 256

* -----------------------------------------------------------------------
* rlim_fd_max
*    default 1024, recommended >= 4096
*
*    This parameter defines the hard limit on the number of open files
*    you can have. For Squid and most other servers, regardless of TCP
*    or UDP, the number of open file descriptors per user process is
*    among the most important parameters, since it is one limit on the
*    number of connections you can have in parallel.
*
*    You should consider a value of at least 2 * tcp_conn_req_max,
*    and you should provide at least 2 * rlim_fd_cur. The predicate
*    rlim_fd_cur <= rlim_fd_max must be fulfilled.
*
*    Use values above 1024 at your own risk. Sun does not make any
*    warranty for the workability of your system if you increase this
*    above 1024.

set rlim_fd_max = 1024

* -----------------------------------------------------------------------
* ufs_ninode
*    default 4323 = 17*maxusers + 90 (with maxusers = 249)
*
*    Specifies the size of the inode table. The actual value is
*    determined by the value of maxusers. A memory-resident inode is
*    used whenever an operation is performed on an entity in the file
*    system (files, directories, FIFOs, devices, Unix sockets, etc.),
*    and the inode read from disk is cached in case it is needed
*    again. ufs_ninode is the size at which UFS attempts to keep the
*    list of idle inodes: as active inodes become idle, if the number
*    of idle inodes rises above this limit, memory is reclaimed by
*    tossing out idle inodes.
*
*    Must be equal to ncsize.

set maxusers = 2048
set ufs_ninode = 512000

* -----------------------------------------------------------------------
* ncsize
*    default 4323 = 17*maxusers + 90 (with maxusers = 249)
*
*    Specifies the size of the directory name lookup cache (DNLC).
*    The DNLC caches recently accessed directory names and their
*    associated vnodes. Since UFS directory entries are stored
*    linearly on disk, locating a file name requires searching the
*    complete directory for each entry, and adding or creating a file
*    needs to ensure the uniqueness of the name within the directory,
*    again searching the complete directory. Therefore, entire
*    directories are cached in memory. For instance, a large directory
*    name lookup cache significantly helps NFS servers that have many
*    clients; on other systems the default is adequate. The default
*    value is determined by maxusers.
*
*    Every entry in the DNLC points to an entry in the inode cache, so
*    both caches should be sized together. The inode cache should be
*    at least as big as the DNLC cache; for best performance, it
*    should be the same size in the Solaris 2.4 through Solaris 8
*    operating environments.
*
*    Warning: Do not set ufs_ninode less than ncsize.
*    The ufs_ninode parameter limits the number of inactive inodes,
*    rather than the total number of active and inactive inodes. In
*    the Solaris 2.5.1 through Solaris 8 software environments,
*    ufs_ninode is automatically adjusted to be at least ncsize, so
*    tune ncsize to get the hit rate up and let the system pick the
*    default ufs_ninode.
*
*    I have heard from a few people who increase ncsize to 30000 when
*    using the Squid web cache. Imagine: a Squid installation uses 16
*    top-level directories and 256 second-level directories, so you'd
*    need over 4096 entries just for the directories. It looks as if
*    web caches and news servers which store data in files generated
*    from a hash need to increase this value for efficient access.
*
*    You can check the performance of your DNLC (its hit rate) with
*    the vmstat -s command. Please note that Solaris 7 re-implemented
*    the algorithm, and thus doesn't have the toolong entry any more:
*
*       $ vmstat -s
*       ...
*       1743348604 total name lookups (cache hits 95%)
*       32512 toolong
*
*    Up to Solaris 7, only names shorter than 30 characters are
*    cached, and names too long to be cached are reported. A cache
*    miss means that a disk I/O may be needed to read the directory
*    (though it might still be in the kernel buffer cache) when
*    traversing the path name components to get to a file. A hit rate
*    of less than 90 percent requires attention.
*
*    For an E450 with maxusers = 2048 and ~800,000 files:
*       the default ncsize = 128512 gives about a 90% hit rate;
*       setting ncsize = 262144 gives about a 94% hit rate.

set ncsize = 512000

* -----------------------------------------------------------------------
* tcp_conn_hash_size
*    default 512
*
*    This can be set to help address connection backlog. During high
*    connection rates, TCP data-structure kernel lookups can be
*    expensive and can slow down the server; increasing the size of
*    the hash table improves lookup efficiency. This is the kernel
*    hash table size for managing active TCP connections, and a larger
*    value makes searches far more efficient when there are many open
*    connections. On Solaris, this value is a power of two and can be
*    set as small as 256 or as large as 262144, as is typically used
*    in benchmarks. A larger tcp_conn_hash_size requires more memory,
*    but it is clearly worth the extra investment if many concurrent
*    connections are expected. The parameter can be set in the
*    /etc/system kernel configuration file, and the current size is
*    shown at the start of the read-only tcp_conn_hash display using
*    ndd.

set tcp:tcp_conn_hash_size = 32768

* -----------------------------------------------------------------------
* noexec_user_stack
*    Since Solaris 2.6: default 0; recommended: see CERT CA-98.06 or
*    DE-CERT. Limited to sun4[mud] platforms! Warning: this option
*    might crash some of your application software and endanger your
*    system's stability!
*
*    By default, Solaris 32-bit application stack memory areas are set
*    with read, write, and execute permissions, as specified in the
*    SPARC and Intel ABIs. Though many exploits prefer to modify the
*    program counter saved during a subroutine call, a program snippet
*    placed in the stack area can also be used to gain root access to
*    a system.
*
*    If the variable is set to a non-zero value, the stack defaults to
*    read and write, but not execute, permissions. Most programs, but
*    not all, will function correctly if the default stack permissions
*    exclude execute rights.
*    Attempts to execute code on the stack will kill the process with
*    a SIGSEGV signal and log a message at kern.notice. Programs which
*    rely on an executable stack must use the mprotect(2) function to
*    explicitly mark executable memory areas.
*
*    Refer to the System Administration Guide for more information on
*    this topic. Admins who don't want the report about executable
*    stacks can explicitly set the noexec_user_stack_log variable to
*    0. Also note that the 64-bit V9 ABI defaults to stacks without
*    execute permissions.

* set noexec_user_stack = 1
* Log attempted stack exploits.
* set noexec_user_stack_log = 1

* -----------------------------------------------------------------------
* Swap
*    The system keeps 128 MB (1/8th of memory) for swap.
*    Reduce that to 32 MB (4096 8K pages).

set swapfs_minfree = 4096

* -----------------------------------------------------------------------
* Network
*    Force the hme interface to 100 Mbps.

set hme:hme_adv_autoneg_cap = 0
set hme:hme_adv_100T4_cap = 0
set hme:hme_adv_100fdx_cap = 1
set hme:hme_adv_100hdx_cap = 1
set hme:hme_adv_10fdx_cap = 0
set hme:hme_adv_10hdx_cap = 0

* -----------------------------------------------------------------------
* Memory management
*
*    http://www.carumba.com/talk/random/tuning-solaris-checkpoint.txt
*    Tuning Solaris for FireWall-1
*    Rob Thomas <robt@cymru.com>
*    14 Aug 2000
*
*    On firewalls, it is not at all uncommon to have quite a bit of
*    physical memory. However, as the amount of physical memory is
*    increased, the amount of time the kernel spends managing that
*    memory also increases. During periods of high load, this may
*    decrease throughput.
*
*    To decrease the amount of memory fsflush scans during any scan
*    interval, we must modify the kernel variable autoup. The default
*    is 30. For firewalls with 128 MB of RAM or more, increase this
*    value. The end result is less time spent managing buffers and
*    more time spent servicing packets.

set autoup = 120

* -----------------------------------------------------------------------
*    http://www.sunperf.com/perfmontools.html
*
*    NETSTAT
*    One key indicator is nocanput being non-zero.
*
*       root# netstat -k hme0
*       hme0:
*       ipackets 228637416 ierrors 0 opackets 269844650 oerrors 0
*       collisions 0 defer 0 framing 0 crc 0 sqe 0 code_violations 0
*       len_errors 0 ifspeed 100000000 buff 0 oflo 0 uflo 0 missed 0
*       tx_late_collisions 0 retry_error 0 first_collisions 0
*       nocarrier 0 nocanput 62 allocbfail 0 runt 0 jabber 0 babble 0
*       tmd_error 0 tx_late_error 0
*       ...
*
*    If this is the case, your streams queue is too small. It should
*    be set to 400 per GB of memory; put a line like the following in
*    your /etc/system file (this example assumes 4 GB of RAM):
*
*       set sq_max_size = 1600

set sq_max_size = 400

* -----------------------------------------------------------------------
*    http://www.london-below.net/~adrianc/2002/cookbook.html
*    Recipe bufhwm: Large Active Filesystem (>> TB)
*    Tell-tale sign: small hit rate in the buffer cache
*    Fix: increase bufhwm
*    Drawback: may consume memory for little benefit
*    Created: July 19 2001
*
*    Tune the default bufhwm value if you have a small hit ratio on
*    the buffer cache during periods of high activity:
*
*       "sar -b 1 10" shows %rcache or %wcache < 90%
*
*    At most bufhwm KB of kernel memory is used to cache metadata
*    (e.g. block indirection data).
*    bufhwm defaults to 2% of system memory and cannot be more than
*    20%. The bufhwm configured on your system can be obtained with:
*
*       /usr/sbin/sysdef | grep bufhwm
*
*    The requirement for bufhwm should be:
*
*       'sum total of active filesystem size' / 2M
*
*    For a 100 GB filesystem, then, configure 50 MB of bufhwm kernel
*    memory and set bufhwm = 50000 (in units of KB). Our current
*    setting is about 20 MB:
*
*       me% /usr/sbin/sysdef | grep bufhwm
*       20725760   maximum memory allowed in buffer cache (bufhwm)
*
*    We're using 86 GB out of about 203 GB total, so use 50 MB.
*    Overall hits/lookups are around 98% according to netstat -k:
*
*       biostats:
*       buffer_cache_lookups 127637848 buffer_cache_hits 125365885
*       new_buffer_requests 0 waits_for_buffer_allocs 0
*       buffers_locked_by_someone 6131 duplicate_buffers_found 53

set bufhwm = 50000

* -----------------------------------------------------------------------
*    http://www.london-below.net/~adrianc/2002/cookbook.html
*    Recipe segmap_percent: Dedicated I/O server on a large dataset
*    Tell-tale sign: small segmap cache hit rate
*    Fix: increase segmap_percent
*
*    Only a portion of memory is readily mapped in the kernel in
*    "segmap" to be the target of an actual I/O. For a read or write
*    call, being in segmap or not can make a performance difference
*    of approximately 20%. Solaris 8 introduced a new kernel parameter
*    called segmap_percent that controls the size of segmap; segmap is
*    sized as a portion of free memory after boot, with a default of
*    12%.
*
*    On a dedicated I/O server it may be beneficial to increase this
*    value. This actually consumes little additional memory for segmap
*    structures (< 1%), but note that the segmap portion of the
*    filesystem cache is not considered free memory.
*
*    WARNING: setting this too high can result in a paging storm.

* set segmap_percent = 20

* -----------------------------------------------------------------------
*    http://www.london-below.net/~adrianc/2002/cookbook.html
*    Recipe ufs_HW: GBs of data written to a file
*    Tell-tale sign: ufs_throttles keeps increasing
*    Fix: increase ufs_HW
*    Created: July 19 2001
*
*    UFS keeps track, for each file, of the number of bytes being
*    written to disk (bytes in transit between the page cache and the
*    disks). When this amount exceeds the threshold ufs_HW, subsequent
*    write(2) calls are blocked until enough of the I/O operations
*    complete.
*
*    We can set the ufs_HW/ufs_LW parameters to values that should
*    limit the adverse condition:
*
*       ufs_HW should be set to many times maxphys
*       ufs_LW should be 2/3 of ufs_HW
*
*    When throttling happens, a process is blocked for a time on the
*    order of a physical write, say 0.01 s. This means a process can
*    achieve on the order of ufs_HW / 0.01 s, or 100 * ufs_HW, bytes
*    per second. The default of 384K throttles a process at around
*    38 MB/s.
*
*    Our ufs_HW was the default (384K); doubling it slowed down
*    throttling but didn't eliminate it.

set ufs:ufs_HW = 4194304
set ufs:ufs_LW = 2796202

* -----------------------------------------------------------------------
*    http://www.samag.com/documents/sam0213b/
*    Solaris 8 Performance Tuning
*    maxphys
*
*    The maxphys setting, often seen in conjunction with JNI and
*    Emulex HBAs, is the upper limit on the largest chunk of data that
*    can be sent down the SCSI path for any single request.
*    There are no real issues with increasing the value of this
*    variable to 8 MB (in /etc/system, set maxphys=8388608), as long
*    as your I/O subsystem can handle it. All current Fibre Channel
*    adapters are capable of supporting this, as are most Ultra/Wide
*    SCSI HBAs, such as those from Sun, Adaptec, QLogic, and Tekram.
*
*    Try 1 MB for now.

set maxphys = 1048576

===========================================================================
Notes from a Lotus Domino site running on Solaris

Disk bottlenecks are the most likely bottlenecks. Here are the thresholds
you should look for using the different monitoring tools.

VMSTAT

vmstat is one of the simplest and most useful tools because it reports
important data in the categories of CPU, memory utilization, and disk I/O.
To see the system activity for 3 seconds with a 1-second reporting
interval, use:

    vmstat 1 3

In the process (procs) group of statistics, there are two important
columns, r and b:

    r is the number of processes in the CPU run queue.
    b is the number of processes blocked for resources (I/O, paging, and
      so forth).

In the memory group of statistics, the important column is sr:

    sr is the number of pages scanned, and can be an indicator of a RAM
       shortage.

The cpu group of statistics gives a breakdown of the percentage usage of
CPU time. On MP systems, this is an average across all processors.

    us is the percentage of user CPU time.
    sy is the percentage of system CPU time.

The following is an example of the results of running vmstat 1 3. The r,
b, sr, us, and sy columns are most important.

 procs     memory            page              cpu
 r b w   swap   free  re  mf pi po fr de sr  us sy  id
 0 0 0 354696  10616   0   7  3  0  0  0  0  65 13  22
 0 0 0 368976   8104   0   9  0  0  0  0  0   0  1  99
 0 0 0 368976   8104   0   0  0  0  0  0  0   0  0 100

* A significant bottleneck threshold occurs if b (processes blocked for
  resources) approaches r (the number in the run queue).
* A critical bottleneck threshold occurs if b (processes blocked for
  resources) is equal to or greater than r (the number in the run queue).

IOSTAT

You can add the -x switch to provide extended statistics, which makes the
output more readable because each disk has its own line. You can also add
the -c switch to report the percentage of time the system has spent in
user mode, in system mode, waiting for I/O, and idling.

The following is an example of the results of running iostat -nxtc 30 3.
The service-time (wsvc_t/asvc_t), %b, us, sy, and wt columns are most
important.

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 fd0
    3.2    1.0   11.4    3.0  0.0  0.0    0.0    4.6   0   2 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t6d0
    0.1    1.5    1.1    5.6  0.0  0.0    0.0    6.9   0   1 c2t0d0
   22.7    0.2 2045.5    0.7  0.0  0.2    0.0    7.0   0  11 c2t1d0
    0.0    1.3    0.0    5.4  0.0  0.0    0.0    6.6   0   1 c2t2d0
    0.0    0.1    0.0    0.4  0.0  0.0    0.0    2.9   0   0 c3t0d0
    0.0    1.5    0.0    5.6  0.0  0.0    0.0    4.4   0   1 c3t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c3t3d0

%b is the percentage of time the disk is busy (transactions in progress).

The column I pay the most attention to is the service time (svc_t in plain
-x output; split into wsvc_t and asvc_t in the -n output above). This is
the average service time in milliseconds. A high number is a sign that the
disk is becoming a bottleneck; a rule of thumb is that more than 35 ms is
cause for investigation.

Large numbers in the r/s and w/s columns are an indication of too small a
block size. This could also be a poorly tuned application that is making
many small reads/writes instead of a few large reads/writes.
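A quick way to check for that pattern (a rough sketch, not part of the
original notes; it assumes the iostat -xn column layout shown above, with
the device name in the last column) is to derive the average transfer size
per request from the kr/s over r/s and kw/s over w/s ratios:

#!/bin/sh
# Rough sketch: print the average read/write size per request, in KB, for
# each active disk, using the second (30-second) sample only.
iostat -xn 30 2 | awk '
    /extended device statistics/ { sample++; next }   # one banner per sample
    /device$/ { next }                                # skip the column header
    sample >= 2 && ($1 + $2) > 0 {                    # skip since-boot sample
        rsz = 0; wsz = 0
        if ($1 > 0) rsz = $3 / $1                     # kr/s divided by r/s
        if ($2 > 0) wsz = $4 / $2                     # kw/s divided by w/s
        printf "%-10s avg read %7.1f KB   avg write %7.1f KB\n", $NF, rsz, wsz
    }'

Disks that show many requests per second but small average sizes deserve a
closer look at the application doing the I/O.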
The kr/s and kw/s columns give you a good indication of how much bandwidth
you are using. For a single Ultra Wide differential SCSI disk I would
expect to get 10 MB/s of throughput; for a correctly configured stripe, I
would expect 10 MB/s times the number of disks in the stripe. On reads
from RAID 5 you should get similar performance. On writes the cache will
help, and you should get close to the same performance as long as the
cache is not being overrun.

* The significant bottleneck threshold is %b (percent time disk busy) > 20%
  AND (20 ms < svc_t (service time) < 30 ms)
* The critical bottleneck threshold is %b (percent time disk busy) > 20%
  AND (svc_t (service time) > 30 ms)
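For day-to-day monitoring, those two thresholds can be scripted against the
classic iostat -x output. The sketch below is only an illustration of the
rule, not something from the original posts; the interval and sample count
are arbitrary, and the field positions assume the plain -x layout
(disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b).

#!/bin/sh
# Rough sketch: flag disks that cross the bottleneck thresholds quoted
# above. Samples extended disk statistics every 30 seconds, ten times; the
# first (since-boot) sample is ignored. Output appears when iostat
# finishes, because of pipe buffering.
INTERVAL=30
COUNT=10

iostat -x $INTERVAL $COUNT | awk '
    /statistics/       { sample++; next }    # banner line starts each sample
    /^(disk|device)/   { next }              # skip the column header
    sample >= 2 {
        svc = $8; busy = $10                 # svc_t and %b columns
        if (busy > 20 && svc > 30)
            printf "CRITICAL:    %-10s svc_t=%6.1f ms  %%b=%d\n", $1, svc, busy
        else if (busy > 20 && svc > 20)
            printf "significant: %-10s svc_t=%6.1f ms  %%b=%d\n", $1, svc, busy
    }'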