Summary : Capacity planning and performance measurement

From: Siddhartha Jain <sid_at_netmagicsolutions.com>
Date: Tue May 08 2001 - 16:44:28 EDT
Hi,

Here is what i posted :-
-------------------------------
I am looking for any papers or tools for Sun server capacity planning. For eg.
how much RAM and CPUs i need for a given application. Also, how do i
compare performance of different servers with different OSs. For eg.
performance differences between Solaris 2.6/7/8.
------------------------------------

Most pointed to this book :-

Configuration and Capacity Planning for Solaris Servers
by Brian L. Wong of Sun Microsystems
ISBN: 0-13-349952-9

and i got a good paper on Oracle/Solaris. Below are the responses :-

-------------------------------------------------------------------
John Malick wrote :-
There is a great book just on your concerns:

Configuration and Capacity Planning for Solaris Servers
by Brian L. Wong of Sun Microsystems
ISBN: 0-13-349952-9

Another book I found extremely useful is from Sun's performance GURU:

Sun Performance and Tuning
by Adrian Cockcroft and Richard Pettit from Sun
ISBN: 0-13-095249-4
--------------------------------------------------------------------
Kevin Buterbaugh wrote :-

I would recommend the following 3 books:  1) "Configuration and
Capacity Planning for Solaris Servers" by Brian Wong (the book I believe
you're referring to),  2)  "Sun Performance and Tuning, 2nd Edition" by
Adrian Cockcroft, and  3) "Solaris Internals" by Jim Mauro and Richard
McDougall.  I've found very few questions that one of those three books
couldn't answer.

     In addition, there's also some great articles on Sun's web site at
www.sun.com/sun-on-net/performance.  There's some great articles on RAID
and VM sizing, etc.

     We have several servers running Solaris 2.8 now (none running Solaris
2.7 and most running Solaris 2.6).  I have been extremely pleased with the
performance improvements in Solaris 2.8, especially relating to the new
memory architecture.  For example, I have a UE6000 (14 CPU's, 9.5 GB RAM)
that had an average scan rate of around 500 / 600 under Solaris 2.6.  Since
we upgraded it to Solaris 2.8, it's been 0!

     Hope this helps...

----------------------------------------------------------------------------
Karl Vogel  sent me this great paper for Oracle/Solaris sizing :-

Optimizing and Measuring the Solaris Kernel For Large Oracle Servers
by Mike Jaffee, Sun Microsystems

  The first part of the paper will discuss the basics of Solaris
  Internals that are relevant to the Oracle DBA along with tips to
  common technical questions and relevant header files.  The second
  part is quoted tuning information taken from Sun Experts.  The final
  part is a discussion of kernel memory allocation, how to measure it,
  and some things that can be done to prevent starvation.

Solaris Internals
  Sparc has two rings of execution.  The inner ring is for kernel
  functions and the outer ring is for user process functions.
  The process address space is virtual, and normally only part of a
  process is in physical memory.  The kernel stores the contents of
  the process address space in physical memory, on-disk files, and
  specially reserved swap areas.  Over time the kernel shuffles pages
  of the processes between physical memory and disk.  Each process has
  registers that are stored in the kernel and are placed in the hardware
  registers at run time.  A process must block if it is waiting for
  a resource and allow another process to run.

  The kernel allows each process a brief period of time, usually
  10 milliseconds, to run before performing a context switch.
  (Vahalia p.20-25)  On startup once the kernel is loaded, user
  processes can request system services from the kernel through the
  system call interface.  If the process misbehaves by dividing by
  zero or overflowing its stack, a hardware exception occurs, and the
  kernel intervenes, usually aborting the process.  Interrupts come
  from peripheral devices usually indicating a status change or
  I/O completion.  Two important processes that manage memory are
  the swapper and pagedaemon.  (Vahalia p.22-25)

  Each process has a virtual memory address space (VMA) that is
  translated to physical memory addresses by page tables.  This mapping
  is done by the chip's MMU.  (Tip - System panics can be either
  hardware or software related.  The MMU registers give helpful hints
  on what actually caused the panic.)  In addition to kernel and user
  mode, there is kernel and user space.  This refers to regions in
  virtual memory address space of the process.  There is only one
  kernel and many processes and hence every process must map in a
  single kernel address space.

  The kernel portion of the VMA maintains global data structures
  and some per process objects.  These can only be accessed by the
  kernel when the chip is running in kernel mode (ring 0).  Since the
  kernel is shared by all processes, kernel space must be protected
  from user-mode access.  This is done by requiring the processes
  to use the system call interface.  This requires the chip to go
  into kernel mode, transfer program control to the kernel, have the
  kernel execute system code instructions, then switch back to user
  mode and user control of the process.  (Vahalia p.22-23)

System Services
  Oracle uses many Solaris system services such as file and record
  locking, inter process communications, virtual memory, and process
  scheduling.  Common system calls are open, read, write, fcntl, kill,
  priocntl, plock, memcntl, sync.  Common signals are
  <pre>
SIGSEGV - usually means user stack overflow,
SIGBUS  - out of the process address space,
SIGTERM - user has "hung up" without exiting gracefully,
SIGUSR1 - defined signal for asynchronous events,
SIGKILL - kill process immediately no exceptions.
  </pre>

  Oracle uses file and record locking by setting read write locks on
  portions of a file.

  Any process can read a file that is locked but only the owner of
  the lock can update the file.  A write lock is sometimes called an
  exclusive lock and a read lock is sometimes called a shared lock.
  Process scheduling is usually managed very well by the kernel,
  however a slow job can be speeded up by the priocntl system call.
  (System Services Guide p.1-25)
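
  The read and write locks on portions of a file described above can
  be sketched with Python's fcntl.lockf, which wraps the same fcntl
  record-locking call.  This is only an illustration: the temporary
  file, the offsets, and the helper names are hypothetical, and it is
  not Oracle's own locking code.

```python
import fcntl
import os
import tempfile


def update_region(path, start, data):
    """Take an exclusive (write) lock on one byte range, update it, unlock."""
    with open(path, "r+b") as f:
        # LOCK_EX is the exclusive lock; it covers len(data) bytes from start.
        fcntl.lockf(f, fcntl.LOCK_EX, len(data), start)
        try:
            f.seek(start)
            f.write(data)
            f.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN, len(data), start)


def read_region(path, start, length):
    """Take a shared (read) lock on a byte range and return its contents."""
    with open(path, "rb") as f:
        fcntl.lockf(f, fcntl.LOCK_SH, length, start)
        try:
            f.seek(start)
            return f.read(length)
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN, length, start)


def demo():
    """Write six bytes under a write lock, then read them under a read lock."""
    fd, path = tempfile.mkstemp()  # hypothetical scratch file
    try:
        os.write(fd, b"0" * 64)
        os.close(fd)
        update_region(path, 16, b"ORACLE")
        return read_region(path, 16, 6)
    finally:
        os.remove(path)
```

  Locking only the bytes being changed lets other processes work on
  the rest of the file at the same time.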

  Jim Skeen of Sunsoft - "Oracle gets locked-down memory as a
  consequence of using intimate shared memory (ISM), not through plock.
  It controls sharing inside shared memory through latches, not memcntl
  or plock."  He also cautions against changing the priority of the
  Oracle processes: "This is something we in DBE actually strongly
  discourage.  Only the most daring and knowledgeable DBAs should
  attempt this.  The problem is that system threads can get starved
  if Oracle processes are not "well behaved" when running in real
  time class.  Oracle processes may easily hog a cpu for extended
  periods of time (time being measured in Unix quantums).  We in DBE
  have experimented with changing the dispatch table in useful/clever
  ways, to minimize the number of involuntary context switches.
  But Oracle processes still run in TS class."  (private letter Skeen)

Oracle Internals and Solaris System Services
  Mark Johnson of Oracle and Jim Skeen provide the following expert
  insight and information.  The system global area is defined as "One
  or more shared segments visible to all Oracle processes that are used
  to store precompiled SQL and PL/SQL (library cache), database buffers
  (buffer cache), and for interprocess communication" (Johnson).
  As far as process control - "Oracle does use semaphores, but latches
  are the usual synchronizing mechanism, as mutexes implemented as
  spin locks" (Johnson).  On the subject of locks "Oracle maintains
  database transaction integrity through use of database locks of
  various sorts--shared read, exclusive read, exclusive write, etc.
  These are implemented through database locks, not using Unix file
  locks.  Thus, the scope of a database lock can be limited to a
  single row in the database.  Or, the database may choose to lock a
  database page (which may be quite a bit smaller than a Unix page).
  Or, the database may choose to lock an entire database table (which
  may be composed of multiple database files, which in turn may or
  may not map into Unix files)."  (private letter Skeen).

  Oracle uses heavyweight processes that are in the shared memory
  portion of the process address space.  The DBWR (data buffer writer)
  process uses aio threads known as light weight processes (LWP).
  An LWP is a kernel-supported user thread that is based on kernel
  threads.  They are independently scheduled and share the address
  space of the process.  Vahalia's book has a nice discussion on LWPs.
  (Jaffee) Kernel Asynchronous I/O and Intimate Shared Memory are
  two key technologies used by Oracle on the Solaris platform.

  Asynchronous I/O is needed because a single blocking thread in
  a multi-threaded application causes all threads to wait until
  the thread wakes up.  What needs to happen is for the thread to
  issue an asynchronous I/O request and then pass control to another
  thread in the process.  Also heavy I/O is not efficient when done
  synchronously because of the large number of context switches that
  must occur every time a thread is blocked.  (Hyuck Yoo)

  Asynchronous I/O under Solaris is implemented two ways - under
  Solaris 2.3 it is using the library and under Solaris 2.4 and beyond
  it is in the file system layer of the kernel.  The library approach
  uses kernel-level threads where each I/O request is handled by a
  newly created kernel-level thread that acts synchronously (i.e.
  issuing read and write calls).  The library lives outside of the
  kernel and the kernel threads that perform the I/O are separate from
  the calling process.  The kernel approach is much more sophisticated
  and efficient.  The basic concept is to not maintain the queue in
  user space but to put the request directly into the device driver
  queue.  The biowait function is bypassed (which is the device driver
  equivalent to a blocking function) and the thread transfers control
  rather than sleep in the kernel.  The kernel has buffers with slots
  called AIO that maintain a listing of all I/O requests.  (Hyuck Yoo)

  Solaris has provided the ISM feature since 2.2.  The main feature
  of ISM is in addition to sharing the "memory" pages (like the
  normal shared memory), it also shares the page table entries for
  those pages (therefore, it's "intimate").  Another side feature,
  which is more important for this discussion, is that ISM also locks
  down the shared memory segment in real physical RAM.  Since the
  main purpose of ISM is for the DBMS products' buffer cache usage,
  this makes sense.  (Jaffee)

  Sharing page table entries solves the problem of page table stealing
  which is expensive because all the pages mapped in the stolen page
  table have to be flushed before being given to another process.
  This avoids the condition where the whole system may thrash as
  processes steal page tables from each other.  (H. Yoo)

  The design team created a new segment in the process address space
  called segshm so that they could create one set of page tables for a
  shared memory segment and share the page tables among the processes
  that attach that same shared memory.  In addition to saving page
  table allocation, sharing page tables has other advantages such
  as having a higher cache hit rate on memory map lookups because the
  tables are in a buffer cache rather than in memory.  It also reduces
  the amount of overhead incurred by the hardware address translation
  layer since it no longer needs to go through page tables for every
  process to monitor whether a page has been modified.  These are
  both huge savings and speed up the virtual memory paging algorithm
  within Solaris.  (H. Yoo)

IPC
  The Oracle RDBMS is a complex program that uses multiple cooperating
  processes that must communicate with each other and share resources.
  The kernel provides a mechanism in user space called inter
  process communication or IPC.  The processes operate in a shared
  memory segment such that if one process modifies data it will be
  immediately visible to the other processes.  Data transfer and
  event notifications occur between the various Oracle processes in
  the Oracle SGA.  Semaphores are used for Oracle's own locking and
  synchronization scheme.  Asynchronous events such as errors are
  reported to the processes using signals.  The default action for
  most signals from the kernel is to terminate the process, however
  the process may specify an alternate response by providing a signal
  handler function.  (Tip - Before installing the kernel jumbo patch
  read the readme file to see if there are any known signal problems
  with Oracle).  (Vahalia - p150)

  The relevant IPC system calls Oracle makes are shmget, semget,
  shmat, shmdt, shmctl, and semctl.  The ipc information is stored
  in the kernel with the ipc_perm structure.  shmget(key, size,flag)
  creates a portion of shared memory (which will be the size of the
  Oracle SGA) and shmat(shmid, shmaddr, shmflag) attaches the region
  to a virtual memory address of the process.  (shmsys is how Oracle
  sets up the intimate shared memory segment).  The structure of a
  shared memory segment includes access permission, segment size,
  the PID of the process performing last operation, and the memory
  map segment descriptor pointer as well as other fields.  (tip -
  sgabeg in the ksms.s file is a virtual address not physical address
  (0-0xffffffff = 2 GB).  Choose small beginning addresses for large
  SGAs.  Also watch out for 28 bit Sparc chips.  They have a smaller
  virtual address space.  Hal Stern notes "They're really not 28 bit chips,
  but instead the system architecture only passes 28 bits of virtual
  address space on to the memory bus."  [private letter])  Once attached
  the region may be accessed like any other memory location without
  requiring system calls to read or write data to it.  Hence shared
  memory is the fastest mechanism for processes to share data.  (Tip -
  don't be confused by the SZ field in ps -elf.  It is in 4 KB pages
  and represents shared memory in the case of Oracle.  For example
  Oracle may have 60 server processes in a shared memory segment
  all approximately 25000 4 KB pages.  A common misconception is to
  think that Oracle needs 60 X 4KB X 25000 = 6 GB of virtual memory.
  Those 60 processes are mainly using the shared memory region in
  the process address space).
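
  The arithmetic behind this tip can be checked with a short sketch.
  The 60 processes, 25000 pages, and 4 KB page size come from the tip;
  the figure of 24000 shared pages per process is an assumption made
  only for illustration.

```python
PAGE_KB = 4  # page size used by the SZ field, per the tip above


def naive_total_kb(n_procs, sz_pages):
    """The misconception: add up SZ for every process."""
    return n_procs * sz_pages * PAGE_KB


def shared_aware_total_kb(n_procs, sz_pages, shared_pages):
    """Count the shared segment once, plus each process's private pages."""
    private_pages = sz_pages - shared_pages
    return (shared_pages + n_procs * private_pages) * PAGE_KB


# 60 x 25000 x 4 KB, the "6 GB" figure from the tip.
naive = naive_total_kb(60, 25000)

# Assumed: 24000 of each process's 25000 pages are the shared segment.
realistic = shared_aware_total_kb(60, 25000, 24000)
```

  Under that assumption the real demand is a few hundred megabytes,
  because the shared segment is counted once rather than 60 times.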

  (Tip - shared memory pages are backed by swap space, not by a file.
  The absolute minimum swap must be at least the size of the SGA.)

  A process detaches the shared memory with shmdt(addr) and destroys
  the shared memory region completely with the IPC_RMID command of
  the shmctl system call.
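
  The create, attach, detach, and destroy life cycle can be sketched
  in Python.  Note the substitution: multiprocessing.shared_memory
  uses POSIX shared memory, not the System V shmget/shmat/shmdt/shmctl
  calls named above, so the comments only mark the analogous step.
  The function name and payload are hypothetical.

```python
from multiprocessing import shared_memory


def shared_memory_roundtrip(payload):
    """Write through one mapping of a segment and read through another."""
    # Create the segment and map it (analogous to shmget plus shmat).
    creator = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # A second handle attaches to the same segment by name (like shmat).
        attached = shared_memory.SharedMemory(name=creator.name)
        try:
            # A plain memory write through one mapping ...
            creator.buf[:len(payload)] = payload
            # ... is visible through the other with no read or write calls.
            seen = bytes(attached.buf[:len(payload)])
        finally:
            attached.close()   # detach (analogous to shmdt)
    finally:
        creator.close()        # detach the creator's mapping
        creator.unlink()       # destroy the segment (like shmctl IPC_RMID)
    return seen
```

  In a real deployment the two handles would live in different
  processes; a single process is used here to keep the sketch short.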

  The important commands are
  <pre>
  ipcs -b
      look at field SEGSZ for shared memory size in use

  sysdef -i and sysdef -i -n /dev/ksyms
      IPC and resource table definitions

  kill -9 <process id>
      terminate (no core file) a hung process

  kill -6 <process id>
      abort (core file) a hung Oracle process

  modload -p sys/shmsys at the command line or
  forceload: sys/shmsys in the system file
      may be needed if ipcs -b doesn't work correctly.
  </pre>

  This is because the kernel is dynamic meaning that file systems,
  drivers, and modules are loaded into memory when they are used,
  and the memory is returned if the module is no longer needed.
  (Vahalia - p155-158, p162-164)

  Semaphores are counters that are used by Oracle to monitor and
  control the availability of shared memory segments.  Typically the
  process initializes the semaphore with semget, assigns ownership
  of the semaphore with semctl , and then updates the semaphore
  with semop.  A process has to block until the semaphore operation
  has reached zero.  A semaphore structure contains the following
  information - semaphore value, the PID of the process that last
  performed successfully, the number of processes waiting for the
  semaphore to increase, and the number of processes waiting for the
  semaphore to reach zero.  (tip-ipc_perm and sem in ipc.h, sem.h)
  (System Services Guide - p68-77).

  Shared Memory and Semaphore Tunables in Solaris 2 relevant to Oracle.
  (Tip - semmnu = semmns = semmsl X semmni).  There is no harm in
  setting the numbers too high since the Oracle instance will only
  allocate semaphores and shared memory as needed.  The values are
  definitions not declarations.
  <pre>
  Name     Default   Min        Max         Reference             Suggested
  ____     _______   ___        ___         _________             ________
  shmmax   1048576   1048576    Available   Maximum shm segment   50% of RAM
                                RAM         size in bytes
  shmmin   1         1          -           Minimum shm segment   1
                                            size in bytes
  shmmni   100       100        -           Number of shm id      100
                                            to pre-allocate
  shmseg   6         6          -           Maximum number shm    32
                                            seg per process
  semmni   10        10         65535       Number of semaphore   64
                                            identifiers
  semmns   60        -          -           Number of semaphores  1600
                                            in system
  semmnu   30        -          -           Number of undo        1250
                                            structures in sys
  semmsl   25        -          -           Maximum number of     25 (fixed)
                                            semaphores per ID
  </pre>
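
  The sizing relationships in the tip and table above can be written
  out as a small sketch.  The function names are hypothetical; the
  formulas are the ones stated above (semmns = semmsl X semmni, and
  shmmax suggested at 50% of RAM with a floor of 1048576 bytes).

```python
def semaphore_tunables(semmsl, semmni):
    """Tip from the text: semmns = semmsl X semmni."""
    return {"semmsl": semmsl, "semmni": semmni, "semmns": semmsl * semmni}


def suggested_shmmax(ram_bytes):
    """The table's suggestion: 50% of RAM, never below the 1048576 minimum."""
    return max(1048576, ram_bytes // 2)


# With the suggested semmsl = 25 and semmni = 64 this gives semmns = 1600.
sems = semaphore_tunables(25, 64)

# Example only: a machine with 2 GB of RAM.
shmmax = suggested_shmmax(2 * 1024 ** 3)
```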

Solaris Tuning According to the Experts
  Every month in SunWorld Online, the performance experts at Sun
  write articles on tuning.  In addition to the well known book,
  "Sun Performance and Tuning", Adrian Cockcroft with the help of
  Rich Pettit has put together a series of scripts called se2.5
  (www.sun.com/960301/columns/adrian/se2.5.html).

  Hal Stern, another well known Sun tuning guru, has written an
  O'Reilly press book on "Managing NFS & NIS" and he too writes
  articles that can be downloaded off of the web.

  Fellow SunService Engineers Chris Drake and Kimberley Woods
  wrote "Panic - System Core dump Analysis" which contains detailed
  information on the Solaris kernel and common techniques used to
  analyze core files.

  Brian Wong the hardware expert has written a book called
  "Configuration and Capacity Planning of Large Sun Servers".

  Most of the tuning information for large Sun Servers running Oracle
  can be found in these sources.  Since many customers often call
  SunService for further explanations, it is appropriate to highlight
  some common questions and answer them as the experts would.

Question 1 - Where is all my Memory?
  Probably the most common performance question of all is "Why does
  vmstat report only xxxx amount of free memory available?"  To use an
  example, type vmstat 5 and suppose the system shows freemem of
  80708 and available swap is 330000.  Now start the application and
  observe that the freemem goes down to 8824 and swap goes to 300000.
  Now stop the application and observe that all of the available
  swap returns to 330000 but the freemem returns only to 21260.
  Where then is all of the ram?  Do we have a memory leak?

  The answer is probably no because as Cockcroft notes "(the app)
  starts up more quickly than it did the first time, and with less
  disk activity.  The application code and its data files are still in
  memory, even though they are not active.  The memory they occupy is
  not "free."  If you restart the same application it finds the pages
  that are already in memory.  The pages are attached to the inode
  cache entries for the files.  If you start a different application,
  and there is insufficient free memory, the kernel will scan for
  pages that have not been touched for a long time, and "free" them.
  Once you quit the first application, the memory it occupies is
  not being touched, so it will be freed quickly for use by other
  applications."  (Cockcroft 1)

  Leaving parts of the app in memory even after termination is
  efficient because "Attaching to a page in memory is around 1,000
  times faster than reading it in from disk."  (Cockcroft 1) So how
  can one know if he has a memory leak in his application?  The answer
  is there will be a shortage of swap space after the program runs
  a while and the SZ field in ps -elf for that app will grow over time.
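
  The test described above (SZ growing over time together with a
  shrinking supply of swap) can be sketched as a simple heuristic.
  The function name and the SZ readings are hypothetical; the swap
  figures in the first example are the ones from the vmstat scenario
  in this section.

```python
def looks_like_leak(sz_samples, swap_free_samples):
    """Heuristic: a leak shows SZ growing while free swap shrinks.

    Both arguments are lists of readings taken at regular intervals:
    SZ from ps -elf for one process, free swap from vmstat.
    """
    never_shrinks = all(b >= a for a, b in zip(sz_samples, sz_samples[1:]))
    sz_grows = never_shrinks and sz_samples[-1] > sz_samples[0]
    swap_shrinks = swap_free_samples[-1] < swap_free_samples[0]
    return sz_grows and swap_shrinks


# Swap returns to 330000 and SZ is flat: cached pages, not a leak.
cached_only = looks_like_leak([25000, 25000, 25000], [330000, 300000, 330000])

# SZ keeps climbing and swap keeps falling: probably a leak.
leaking = looks_like_leak([25000, 26000, 28000], [330000, 320000, 300000])
```

  Low freemem alone is deliberately not part of the test, since the
  section explains that it does not indicate a leak.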

Question 2 - My Oracle Server is slow. Can you help me tune the kernel?
  The answer depends on the version of the operating system and the
  level of the patches.  Early versions of the os had performance bugs
  and incompatible hardware that were the cause of slow performance.
  The latest version of the os is self-tuning for high performance
  and will work quite successfully on systems ranging from a huge
  SparcCenter 2000 to small desktops.

  As Cockcroft says "In normal use there is no need to tune the Solaris
  2 kernel, since it dynamically adapts itself to the given hardware
  configuration and application workload.  " (Cockcroft 2) However
  for really large Oracle servers some tuning may be needed if using
  early versions of Solaris (2.3, 2.4, and 2.5) without a kernel patch
  that automatically adjusts the paging algorithm.  Solaris 2.5.1
  is self tuning for large memory systems.

  Paul Faramelli of the kernel TSE group has put together the following
  list of tunables for Solaris.  Recommendations for large Oracle
  servers (Ram > 1 GB) are listed.  (Tip - Use crash to display kernel
  tunables.  As root type crash.  At the greater than prompt, type
  "od -d maxusers" or "od -d lotsfree".  The od stands for octal dump,
  and the -d stands for decimal.  By the way every Solaris tunable
  [even undocumented ones] can be displayed by typing nm /kernel/unix).
  Note these recommendations are only necessary for early versions
  of Solaris.  Some of the recommendations are provided by Steve O'Neil
  of SunService.  (Caution - there is no right answer)
  <pre>
  Parameter        Description                                     Recommended
  ---------        -----------                                     -----------
  dump_cnt         Size of the dump

  autoup           Used in struct var for dynamic configuration    300
                   of the age that a delayed-write buffer must
                   be, in seconds, before bdflush will write it
                   out (default = 60)

  bufhwm           Used in struct var for v_bufhwm; it's the       8000
                   high water mark for buffer cache memory
                   usage, in Kbytes (2% of memory).

  maxusers         Maximum number of users (In 2.3 and 2.4 the
                   default is number of Megabytes in memory)

  max_nprocs       Maximum number of processes
                   (10 + 16 * maxusers)

  maxuprc          The maximum number of user processes.
                   (max_nprocs - 5)

  rstchown         POSIX_CHOWN_RESTRICTED is enabled
                   (default = 1)

  ngroups_max      Maximum number of supplementary groups per
                   user (default 32).

  rlim_fd_cur      Maximum number of open file descriptors per
                   process system wide (default = 64,
                   max = 1024)

  ncallout         Number of callout buffers (default = 16 +
                   max_nprocs).  (No longer exists in Solaris
                   2.2 and later releases)

  nautopush        Number of entries in the autopush free list     1024

  sadcnt           Number allowed of concurrent opens of both      2048
                   /dev/sad/user and /dev/sad/admin
                   (default 16).

  npty             Number of 4.X pseudo-ttys configured            1024
                   (default 48)

  pt_cnt           Number of 5.X pseudo-ttys configured            1024
                   (default 48)

  physmem          Sets the number of pages usable in physical
                   memory.  Only use this for testing, it
                   reduces the size of memory.

  minfree          Memory threshold which determines when to       100
                   start swapping processes, when free memory
                   falls to this level swapping begins
                   (default: 2.4 - 4d = 50 pages, all others
                   25 pages, 2.3 - physmem / 64).

  desfree          This is the "desperation" level, this           200
                   determines when paging is abandoned for
                   swapping.  When free memory stays below
                   this level for 30 seconds, swapping kicks
                   in (2.4 4d = 100 pages, all others 50
                   pages, 2.3 physmem / 32).

  lotsfree         Memory threshold which determines when to       512
                   start paging.  When free memory falls below
                   this level paging begins (2.4 4d = 256
                   pages, all others 128 pages, 2.3
                   physmem / 16)

  fastscan         The number of pages scanned per second when
                   free memory is zero, the scan rate increases
                   as free memory falls from lotsfree to zero,
                   reaching fastscan (default: 2.4 physmem / 4
                   with 64Mb being max, 2.3 physmem / 2).

  slowscan         The number of pages scanned per second when
                   free memory is equal to lotsfree, also see
                   fastscan (defaults: 2.4 is fixed at 100,
                   2.3 fastscan / 10).

  handspreadpages  Is the distance between the front hand and
                   backhand in the clock algorithm.  The larger
                   the number the longer an idle page can stay
                   in memory (default: 2.4 physmem / 4,
                   2.3 physmem / 2).

  maxpgio          The maximum number of page-out I/O              120
                   operations per second.  This acts as a
                   throttle for the page daemon to prevent
                   page thrashing ((DISKRPM * 2) / 3 = 40).
                   This parameter must be set higher if using
                   two swap partitions.

  t_gpgslo         2.1 through 2.3, Used to set the threshold
                   on when to swap out processes (default 25
                   pages).

  ufs_ninode       Maximum number of inodes.                       34906
                   (max_nprocs + 16 + maxusers + 64)

  ndquot           Number of disk quota structures. (default =
                   (maxusers * NMOUNT / 4) + max_nprocs)

  ncsize           Number of dnlc entries. (default =              34906
                   max_nprocs + 16 + maxusers + 64); dnlc is
                   the directory-name lookup cache
  </pre>
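
  How the paging thresholds in the table interact can be sketched
  as follows.  Two points are assumptions, not statements from the
  table: the ramp from slowscan to fastscan is taken to be linear,
  and DISKRPM in the maxpgio formula is read as revolutions per
  second of a 3600 rpm disk (which yields the quoted 40).  The
  fastscan value of 1000 in the example is hypothetical.

```python
def scan_rate(freemem, lotsfree, slowscan, fastscan):
    """Pages scanned per second for a given amount of free memory (pages)."""
    if freemem > lotsfree:
        return 0  # no scanning while free memory is above lotsfree
    # Assumed linear ramp: slowscan at lotsfree, fastscan at zero free memory.
    fraction_short = (lotsfree - freemem) / lotsfree
    return slowscan + (fastscan - slowscan) * fraction_short


def default_maxpgio(disk_rpm):
    """Two thirds of the disk's revolutions per second (assumed reading)."""
    return (disk_rpm / 60) * 2 / 3


# lotsfree = 512 is the table's recommendation, slowscan = 100 its 2.4 default.
at_threshold = scan_rate(512, 512, 100, 1000)
halfway = scan_rate(256, 512, 100, 1000)
exhausted = scan_rate(0, 512, 100, 1000)
```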

Cockcroft on maxusers
  "I never set maxusers.  It sizes itself based on the amount of RAM
  in the system.  In some cases on configurations with gigabytes of
  RAM it needs to be reduced to avoid problems with lack of kernel
  address space.  The kernel uses up a lot of space keeping track
  of all the RAM in a system.  Several other kernel table sizes and
  limits are derived from maxusers."  (Cockcroft 2)
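
  The way other limits follow from maxusers can be sketched using
  the default formulas listed in the tunables table.  The function
  name is hypothetical, and maxusers = 2048 is an assumed example
  value; with it, the formulas reproduce the 34906 recommended in
  the table for ufs_ninode and ncsize.

```python
def derived_tunables(maxusers):
    """Apply the default formulas listed in the tunables table."""
    max_nprocs = 10 + 16 * maxusers
    table_size = max_nprocs + 16 + maxusers + 64
    return {
        "max_nprocs": max_nprocs,
        "maxuprc": max_nprocs - 5,
        "ufs_ninode": table_size,
        "ncsize": table_size,
    }


# Assumed example value for maxusers.
limits = derived_tunables(2048)
```

  Because every entry scales with maxusers, lowering it on a machine
  with gigabytes of RAM shrinks all of these tables at once.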

Cockcroft on ncsize
  "The directory name lookup cache (DNLC) is sized to a default value
  based on maxusers.  A large cache size (ncsize) significantly
  helps NFS servers that have a lot of clients.  On other systems
  the default is adequate."(Cockcroft 2)

Question 3: How much swap is needed for a large Oracle database?
  Many people are under the impression that very little swap is needed
  for Oracle because the architecture uses temporary tablespaces for
  sorting and the SGA is fixed in memory.  Well the truth is large
  databases require a lot of swap.  The shared memory segment is backed
  by swap so the allocated swap MUST be at least as large as the shared
  memory segments.  In addition when the database uses intimate shared
  memory this is also backed by swap.  All of the Oracle processes
  must be partially backed by swap.  Steve Schuettinger, the Oracle
  applications specialist at Sun, recommends at least 2 GB of swap
  for benchmark testing on large servers.  Obviously since RAM plus
  swap equals virtual memory, once swap is gone, the program will halt
  and no new apps can be started until other programs have stopped.
  As Adrian Cockcroft says "The important thing to realize about swap
  space is that it is the combined total size of every program running
  and dormant on the system that matters.  When a system runs out of
  swap space it can be very difficult to recover.  Sometimes you find
  that there is insufficient swap space left to login as root or run
  the commands needed to kill the errant process that is consuming
  all the swap space."  (Cockcroft 3)

  In theory Solaris 2 changes the rules by adding the RAM and the
  disk space so if the system has enough RAM for the workload, "it
  can run with no swap disk.  In practice common database applications
  that are sized to run in a few gigabytes of RAM will actually need
  many gigabytes of disk allocated as swap space."  (Cockcroft 3)

  In the same article Cockcroft says "The consequences of running out
  of swap space affect a larger number of users on a big server, so it's
  wise to allocate a lot more than you normally need to cope with any
  usage peaks.  To start with, add twice as much disk as you have RAM."
  (Cockcroft 3) (Tip - It is not worth making a striped metadevice to
  swap on - that would just add overhead and slow it down.  There is
  also a limit of 2 gigabytes on the size of each swap partition,
  so striping disks together tends to make them too big.)
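
  The starting rule quoted above (twice as much swap as RAM) and the
  2 gigabyte limit per swap partition can be combined in a short
  sketch.  The function name is hypothetical and the result is only
  a starting point, as the quote itself says.

```python
import math

MAX_SWAP_PARTITION_GB = 2  # per-partition limit quoted in the tip above


def swap_plan(ram_gb, multiplier=2):
    """Return (total swap in GB, number of separate swap partitions)."""
    total_gb = ram_gb * multiplier
    partitions = math.ceil(total_gb / MAX_SWAP_PARTITION_GB)
    return total_gb, partitions


# A 4 GB server would start with 8 GB of swap in four 2 GB partitions.
plan = swap_plan(4)
```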

  The relevant tools are /usr/ucb/ps alx (fields SZ or SIZE) and
  /usr/proc/bin/pmap:
  <pre>
  % /usr/ucb/ps alx
  F   UID   PID  PPID CP PRI NI   SZ  RSS    WCHAN S TT     TIME COMMAND
  8  2595  1133  1130  0  48 20  988  360 modlinka S pts/4  0:00 -bin/csh
  </pre>

  There is confusion over what the two versions of ps report.  "/bin/ps prints
  a field labelled SZ, but this is the resident set size in RAM --
  printed as RSS by the /usr/ucb/ps.  You need to use the SZ or SIZE
  field reported by /usr/ucb/ps alx in units of kilobytes to determine
  the amount of swap space used by the process."  (Cockcroft 3)
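
  Pulling the SZ and RSS fields out of a /usr/ucb/ps alx line can be
  sketched as below, using the header and sample line shown above.
  The parser is a hypothetical helper that splits on whitespace, so
  it assumes no column (such as WCHAN) is left empty.

```python
HEADER = "F   UID   PID  PPID CP PRI NI   SZ  RSS    WCHAN S TT     TIME COMMAND"
SAMPLE = "8  2595  1133  1130  0  48 20  988  360 modlinka S pts/4  0:00 -bin/csh"


def parse_ps_alx(header, line):
    """Map column names to values; COMMAND keeps any embedded spaces."""
    names = header.split()
    values = line.split(None, len(names) - 1)
    return dict(zip(names, values))


row = parse_ps_alx(HEADER, SAMPLE)
swap_kb = int(row["SZ"])       # size in kilobytes, per the quote above
resident_kb = int(row["RSS"])  # resident set size
```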

  Oracle's Mark Johnson adds the following "I had thought the standard
  Oracle rule of thumb was 2 to 4 times physical memory (can be a bit
  less on very large memory systems).  Smaller memory systems may want
  to use higher ratios of SGA size to physical memory size and higher
  swap space ratios.  (I ended up using ratios of 1:1 and 1:4 for a
  very small Solaris for Intel system with surprisingly good results.)"

  Hal Stern says "So why do you need swap space if your SGA << phys
  mem?  The short answer is that the "phys mem" in that calculation
  is the non-locked-down physical memory, and when you allocate
  an oracle SGA, you allocate intimate shared memory (ISM) that is
  taken out of the physical memory pool (ie, it gets locked down).
  so on a 1 Gbyte machine, you may think you're ok with a 256M SGA,
  leaving 700M+ for processes.  BUT: the 256M SGA gets taken out of
  the available memory pool, so your maximum VM is only 700M+, and
  you could probably use the swap space....as the SGA/memory ratio
  goes up, this is even more true."  (private letter from Stern)
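
  Stern's arithmetic can be written out as a sketch.  The function
  name is hypothetical; the 1 Gbyte machine and 256M SGA are his
  example, and 1 Gbyte is taken as 1024M.

```python
def max_vm_mb(phys_mb, sga_mb, swap_mb=0):
    """Virtual memory left for processes once the ISM-backed SGA is locked."""
    pageable_mb = phys_mb - sga_mb  # the SGA is taken out of the pool
    return pageable_mb + swap_mb


# 1024M of RAM with a 256M SGA and no swap: the "700M+" in the quote.
without_swap = max_vm_mb(1024, 256)

# Adding 2048M of swap (an assumed figure) raises the ceiling.
with_swap = max_vm_mb(1024, 256, 2048)
```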


Question 4 - Will a faster cpu help performance?
  This question is not easy to answer.  As Hal Stern noted, "Noticing that
  you're using 20 percent of the CPU doesn't mean anything until you
  know the kind of work that's using the cycles.  If you're CPU-bound,
  then you have headroom to increase the workload by a factor of four
  or five.  An I/O-bound job, however, that uses 20 percent of the
  CPU might be improved by adding disk spindles.  As you increase the
  disk count and I/O load, to ease the bottleneck, you'll use more CPU
  to deal with the I/O setup, system calls, and interrupts from the
  additional work.  You run the risk of morphing a disk problem into
  a CPU shortage.  How do you know when relaxing one constraint pops
  another one into the foreground?  Define the right relationships --
  CPU time used per disk I/O tells you how much system time you eat up
  as you add disk load -- and measure with your tailored yardstick."
  (Stern 1)
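
  The yardstick Stern describes, CPU time used per disk I/O, can be
  sketched as below.  The function names and all of the numbers are
  hypothetical, and the projection assumes the cost per I/O stays
  constant as load grows.

```python
def cpu_seconds_per_io(cpu_seconds, disk_ios):
    """CPU time consumed per disk I/O over one measurement interval."""
    return cpu_seconds / disk_ios


def projected_cpu_utilization(cpu_per_io, ios_per_second, n_cpus=1):
    """Fraction of total CPU capacity consumed at a given I/O rate."""
    return cpu_per_io * ios_per_second / n_cpus


# Measured: 20 percent of one CPU for 60 seconds (12 CPU seconds)
# while the disks served 200 I/Os per second (12000 I/Os).
cost = cpu_seconds_per_io(12.0, 12000)

# Projected: quadrupling the disk load to 800 I/Os per second.
projected = projected_cpu_utilization(cost, 800)
```

  In this example a job that looked lightly loaded at 20 percent
  would reach about 80 percent of the CPU after the disks are added.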

Preventing Kernel Memory Starvation
  When Oracle is working very hard and the operating system is
  Solaris 2.3 or early Solaris 2.4, it is possible to have kernel
  memory allocation faults that can eventually lead to kernel memory
  starvation.  A new memory allocator algorithm has been developed
  and integrated into Solaris 2.5.1 (the old allocator had paging
  thresholds that were too low, which caused kernel memory allocation
  failures on very large systems).  The allocator has been backported
  to rev 40 of the Solaris 2.4 jumbo patch and will appear in a future
  rev of the 2.5 jumbo patch.  No fix has yet been developed for
  Solaris 2.3.  (Tip - large database users should upgrade to Solaris
  2.4 or later.)

  In the past Oracle customers could manually adjust paging thresholds.
  The value that needed to be set scaled with the amount of memory
  and the number of CPUs on the system.  In some cases decreasing
  maxusers and bufhwm would also mitigate the problem.  The total
  allowable size for the kernel on the UltraSPARC servers running 2.5
  is now so large that kernel memory allocation problems on very large
  systems are virtually impossible.  See examples below.  The crash
  output displaying kernel memory starvation is taken from a
  SPARCserver 1000 running Solaris 2.3 with 1 GB of RAM and 8 CPUs.
  <pre>
    Kernel memory limits:

    Architecture   Solaris 2.4   Solaris 2.5
      sun4c           33MB          33MB
      sun4m           61MB         100MB
      sun4d          139MB         251MB
      sun4u             -         2525MB

  $> kas crash 15
  >map kernelmap FREE: 2042      WANT: 1 SIZE: 2042 SIZE    ADDRESS TOTAL
  NUMBER OF SEGMENTS 0 TOTAL SIZE 0
  > kmastat
                         total bytes     total bytes
  size        # pools       in pools       allocated     # failures
  -----------------------------------------------------------------
  small       6807           26138880        25677584     1989915
  big         2652           75276288        73046528       0
  outsize       -                  -         18571264     45351
  </pre>

  Crash is a very powerful tool that helps analyze kernel memory
  allocation failures.  In the output, "TOTAL SIZE 0" indicates that
  no more free kernel memory exists.  The FREE field (2042) indicates
  that there is still plenty of memory in the user portion of the
  virtual address space.
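
  When many systems have to be checked, the kmastat table can be
  scanned mechanically.  The sketch below is only an illustration: it
  assumes the five-column layout shown in the crash output above, and
  the function names are made up for the example.

```python
# Data rows copied from the kmastat output shown above.
KMASTAT_ROWS = """\
small       6807           26138880        25677584     1989915
big         2652           75276288        73046528       0
outsize       -                  -         18571264     45351
"""


def parse_kmastat(text):
    """Return {size_class: counters} for each five-column data row."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly five columns ending in a failure count;
        # header and separator lines do not and are skipped.
        if len(fields) != 5 or not fields[4].isdigit():
            continue
        name, pools, pool_bytes, alloc_bytes, failures = fields
        rows[name] = {
            "pools": None if pools == "-" else int(pools),
            "pool_bytes": None if pool_bytes == "-" else int(pool_bytes),
            "allocated_bytes": int(alloc_bytes),
            "failures": int(failures),
        }
    return rows


def classes_with_failures(rows):
    """Names of the size classes that recorded allocation failures."""
    return [name for name, row in rows.items() if row["failures"] > 0]


rows = parse_kmastat(KMASTAT_ROWS)
print(classes_with_failures(rows))  # ['small', 'outsize']
```

  Any size class with a non-zero failure count is a sign that the
  kernel could not satisfy allocation requests.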

  Carl of Sunsoft provides an explanation of kernel map scarcity
  under Solaris 2.3 and Solaris 2.4:

  "In the overwhelming majority of cases on large database servers,
  we have found that 64MB is overly generous for bufhwm in that it
  can be cut back by one-half (to 32MB) without too much of an impact
  on the cache hit ratio.  What is usually in short supply on these
  machines is not the buffer cache but the amount of kernel heap
  (mapped by kernelmap) that remains for non-buffer cache usage.
  Limiting buffer cache growth to 32MB frees up an additional 32MB
  to the heap and has proven successful in avoiding kernelmap
  scarcity at a number of sites running large database applications.
  Kernelmap scarcity (or equivalently kernel heap scarcity as the size
  of the kernel heap is limited by the size of the address space the
  kernelmap can map) results in an extreme slowdown of processing in
  the systems.  All of a sudden kernelmap becomes a scarce resource
  that every thread contends for and to exacerbate the situation the
  rate of release is slowed by the very same contention to the point
  that kernelmap turnover grinds down almost to the point of deadlock.

  Why 64MB's worth of kernelmap is inadequate for the largest database
  servers is unknown.  The sites on which this has been a problem
  have been checked for kernelmap leakage and none has been found.
  There has also been a problem in the past with some kernel data
  structures being pre allocated from the heap and the size of this
  pre allocation being inappropriately scaled to physical memory.
  As it is fairly common now for machines to be equipped with 3GB of
  physical memory, this was not the right thing to do and did account
  for some kernelmap depletion headaches.  But this particular bug has
  been fixed.  With these two things discounted, the only conclusion
  is that modern database workloads are driving up peak transient
  demands for kernelmap to the 100MB level."

  (Tip - For large databases running Solaris 2.4 or earlier, set bufhwm
  to 8000 on sun4c, sun4m, and sun4d, or upgrade to Solaris 2.5, which
  has a large kernel map address space.)
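
  To see why the upgrade helps, the kernel memory limits quoted earlier
  can be set against the 100MB peak demand mentioned in Carl's
  explanation.  This is a rough sketch only: treating the published
  kernel memory limit as a stand-in for usable kernelmap space is an
  assumption, and the function name is hypothetical.

```python
# Kernel memory limits in MB, from the table earlier in this paper.
KERNEL_LIMIT_MB = {
    "2.4": {"sun4c": 33, "sun4m": 61, "sun4d": 139},
    "2.5": {"sun4c": 33, "sun4m": 100, "sun4d": 251, "sun4u": 2525},
}

# Peak transient kernelmap demand cited for large database workloads.
PEAK_DEMAND_MB = 100


def kernel_headroom_mb(release, arch, demand_mb=PEAK_DEMAND_MB):
    """Limit minus demand; a negative result means the limit is too small."""
    return KERNEL_LIMIT_MB[release][arch] - demand_mb


for release in ("2.4", "2.5"):
    for arch in sorted(KERNEL_LIMIT_MB[release]):
        print(f"Solaris {release} {arch}: "
              f"{kernel_headroom_mb(release, arch):+d} MB headroom")
```

  On these figures sun4d gains more than 100MB of headroom by moving
  from 2.4 to 2.5, and sun4u has room to spare.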

Acknowledgements
  I want to thank Sun performance gurus Adrian Cockcroft and Hal
  Stern for their contributions to this paper.  UNIX architect
  Mark Johnson of Oracle and database expert Jim Skeen of Sunsoft
  provided comments on Oracle internals.  Kernel architect Jeff
  Bonwick has added explanations and suggestions regarding kernel
  memory allocation and kernel memory starvation.  SunService kernel
  engineer Paul Faramelli documented the Solaris tuning parameters and
  SunService Technical Expert Steve O'Neil provided recommendations
  for tuning large Oracle databases on versions of Solaris that are
  not self tuning.  Finally I want to thank Uresh Vahalia who gave
  me permission to quote at length from his wonderful book "UNIX
  Internals - The New Frontiers".

Disclaimer
  The author alone is responsible for the contents of this paper.
  No one at Sun Microsystems, Sunsoft, SunService, or the Oracle
  Corporation has reviewed or approved the paper for completeness or
  accuracy in its published format and nothing in the paper can be
  construed as the official policy of Sun Microsystems or the Oracle
  Corporation.

References
  UNIX Internals - The New Frontiers by Uresh Vahalia, Prentice Hall 1996

  "How the Solaris Kernel is Optimized for Oracle" by Mike Jaffee 1996

  "Shared Page Table: Virtual Memory Enhancement for Data Sharing in UNIX"
  by H. Yoo

  "Comparative analysis of Asynchronous I/O in Multithreaded UNIX" by
  Hyuck Yoo

  "Help! I've lost my memory!" by Adrian Cockcroft, SunWorldOnline 1995 (1)

  "What are the tunable kernel parameters for Solaris 2?" by Adrian
  Cockcroft (2)

  "How does swap space work?" by Adrian Cockcroft, SunWorldOnline 1995 (3)

  "We suggest creative ways to better your system performance" by Hal Stern

  System Service Guide - Solaris 2.4 Manual, SunSoft, 1994

  "The Slab Allocator: An Object-Caching Kernel Memory Allocator" by Jeff
  Bonwick

----------------------------------------------------------------------------

Thanks again for the help.

Siddhartha Jain
Received on Tue May 8 21:44:28 2001
