SUMMARY: looking for 'per file' disk IO statistics

From: Peter Polasek (pete@cobra.brass.com)
Date: Tue Jan 26 1999 - 16:47:49 CST


Original Question:
------------------
> Is there a utility for Solaris that provides disk activity statistics
> on a 'per-file' basis? We have a server that has very heavy 'iostat'
> disk activity -- unfortunately there are about 10 files that are known
> to be extremely active, so it is difficult to identify the most offensive
> file. Incidentally, the files are 'memory mapped' images of physical
> memory, so I would be curious if anyone is aware of any issues,
> optimizations, or utilities for memory mapped files.

Summary of responses:
---------------------
I expected to get lots of responses about tools to collect file-specific
disk statistics. Unfortunately, that was not the case. I did, however,
get several very good suggestions. Several people suggested isolating the
files on separate disk spindles (both for performance and for analysis).
We do distribute the files over multiple disks in production and have
placed individual files on a separate disk for analysis, but I was hoping
for a more flexible approach. One person suggested using tmpfs file
systems - an excellent suggestion for performance. We are using tmpfs very
effectively for some temporary files, but unfortunately the memory mapped
files need to be retained in the event of a machine panic. Several people
also suggested truss, which is normally highly effective for analyzing file
writes. Unfortunately, the memory map updates are handled by the kernel
and are not reflected in the truss output. There was a reference to a
GUI tool for disk utilization:

http://www.sunworld.com/swol-01-1999/swol-01-perf.html?0105a

This tool shows overall disk utilization and fragmentation. One of the
references in this article describes a kernel tracing facility that
shows disk activity on a per-PID basis. This kernel disk tracing tool
is native to Solaris (2.5 and above) and provides a great deal of
information. While the tracing utility is very easy to use, the data is
very low-level and not easily interpreted. However, it shows pretty much
every aspect of file updates, and patient analysis of the results will
probably reveal a lot of information. The tool is described at the
following URL:

http://www.sunworld.com/swol-10-1996/swol-10-perf.html

A brief excerpt from this article is printed below:

> Obtaining an I/O trace
> ----------------------
> The command sequence is quite simple. The commands must be run as root, and
> a directory is needed to hold the output. I find it easiest to have two
> windows open. One to run prex, and the other to go through a cycle of
> extract, dump, and viewing the data as required. The command sequence for
> prex is to first allocate a buffer (the default is 384 kilobytes -- you can
> make it bigger), enable the "io" group of probes, make them trace accesses,
> then turn on the global flag that enables all kernel tracing.
>
> # prex -k
> Type "help" for help ...
> prex> buffer alloc
> Buffer of size 393216 bytes allocated
> prex> enable io
> prex> trace io
> prex> ktrace on
>
> Now wait a while or run the program you want to trace. In this case I ran
> iostat -x 10 in another window, didn't try to cause any activity, and waited
> for some slow service time to appear. After a minute or so I stopped collecting.
>
> prex> ktrace off
>
> In the other window I extracted the data and dumped it to take a look.
> # mkdir /tmp/tnf
> # cd /tmp/tnf
> # tnfxtract io.tnf
> # tnfdump io.tnf | more
>
> The first section identifies the three probes that were enabled because they
> have the key "io." The "strategy" probe records an I/O being issued, and
> "biodone" records an I/O completing. The other mechanism that may cause I/O
> is the pageout scanner, and this is also being probed. The location of the
> probe in the Solaris source code is printed for kernel developers to refer to.
>
> probe tnf_name: "strategy" tnf_string: "keys io blockio;file ../../common/os/driver.c;line 358;"
> probe tnf_name: "pageout" tnf_string: "keys vm pageio io;file ../../common/vm/vm_pvn.c;line 511;"
> probe tnf_name: "biodone" tnf_string: "keys io blockio;file ../../common/os/bio.c;line 935;"
>
> The next section is a table of probe events in time order. Where a user
> process was scheduling the I/O, its PID is recorded. Events caused by
> interrupts or kernel activity are recorded as PID 0. This trace was
> measured on a dual-CPU SPARCstation 10, so you can see a mixture of CPU 0
> and CPU 2. I've listed the first few records then skipped through looking
> for something interesting.

> Elapsed (ms)  Delta (ms)   PID  LWPID  TID         CPU  Probe Name  Data / Description . . .
>
>     0.000000    0.000000   632      1  0xf61a1d80    0  strategy    device: 8388632 block: 795344 size: 8192 buf: 0xf5afdb40 flags: 9
>   108.450000  108.450000     0      0  0xfbf6aec0    2  biodone     device: 8388632 block: 795344 buf: 0xf5afdb40
>   108.977000    0.527000   632      1  0xf61a1d80    0  strategy    device: 8388632 block: 795086 size: 1024 buf: 0xf610a358 flags: 524569
>   121.557000   12.580000     0      0  0xfbf6aec0    2  biodone     device: 8388632 block: 795086 buf: 0xf610a358
>   121.755000    0.179000   632      1  0xf61a1d80    0  pageout     vnode: 0xf61d0a48 pages_pageout: 0 pages_freed: 0 pages_reclaimed: 0
>
> The first strategy routine is an eight-kilobyte write (flags bit 0x40
> is set for reads) to block 795344. To decode the device 8388632 = 0x800018,
> 0x18 = 24 and running ls -lL on /dev/dsk shows that c0t3d0s0 is minor
> device 24, which is mounted as /export/home. It's an old Sun 424-megabyte
> drive, so valid blocks are from 0 to 828719. This is a seek to right at
> the end of the disk, and it takes 108 milliseconds. It is followed by a
> one-kilobyte write to a nearby block which only takes 12.58 milliseconds.
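
As a quick illustration of the decode described in the excerpt (reusing the
values from the excerpt itself, not numbers from one of our systems), the
same lookup can be reproduced from the shell:

    # Hedged sketch: decode the 'device:' field of a strategy/biodone record.
    echo 'obase=16; 8388632' | bc     # prints 800018; the low byte 0x18 = 24
                                      # is the minor device number
    ls -lL /dev/dsk | grep ' 24 '     # ls -lL shows "major, minor" for devices;
                                      # the slice with minor 24 is the one traced
                                      # (the pattern can also match dates, so
                                      #  eyeball the output)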

Thanks to the following for their help:

   Jamie Lawrence <jal@ThirdAge.com>
   birger@sdata.no (Birger Wathne)
   u-kevin@megami.veritas.com (Kevin Sheehan - Uniq)
   jrg@blodwen.demon.co.uk (James R Grinter)

I have included excerpts from the individual responses at the end of this
message.

Thanks,
Peter Polasek

<-- begin response excerpts -->

From: Jamie Lawrence <jal@ThirdAge.com>:
----------------------------------------
I don't know of anything that will do exactly what you want. If these
are custom apps, add some debugging code. Otherwise, I think you're
stuck either (a) doing test cases trying to isolate each file (even
if this is all from the same program, you can temporarily move
files to a different spindle, etc.) or (b) capturing a lot of truss
output and coming up with clever grep strings to get an idea. Option
B isn't pretty...
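
For illustration, a rough cut at option (b) might look like the sketch below
(<pid> is a placeholder for the process of interest, and as noted in the
summary above, writes to mmap'd pages are flushed by the kernel and will not
show up in the syscall trace):

    # Hedged sketch: trace file-related system calls for one process and
    # tally them. Sample for about a minute, then detach by killing truss.
    truss -o /tmp/truss.out -f -t open,read,write,mmap -p <pid> &
    sleep 60; kill %1
    grep -c 'write(' /tmp/truss.out                            # rough write(2) count
    grep 'open("' /tmp/truss.out | sort | uniq -c | sort -rn   # busiest paths opened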

The only issues I know of with mmapping files are that programmers
tend to use them in ways that cause lots of access - they do think
of it as memory instead of disk. Other than that, there are a lot
of performance advantages to doing IO this way.

From: birger@sdata.no (Birger Wathne):
--------------------------------------
Just a thought.... How about mounting a tmpfs file system and copying the
files over there? Could work well if the files are read-only or only
contain information that doesn't need saving on a system reboot (I have seen
a plotter spooling system that could need this kind of modification for its
polling behaviour...)
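
A minimal sketch of that idea, assuming an arbitrary mount point, size, and
file names (remember that tmpfs is swap-backed, so anything placed there is
lost on a reboot or panic):

    # Hedged sketch: stage the hot files on a tmpfs mount.
    mkdir /fastfiles
    mount -F tmpfs -o size=512m swap /fastfiles
    cp /data/busyfile* /fastfiles/      # hypothetical file names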

Look at Adrian Cockcroft's latest article in SunWorld Online (http://www.sunworld.com).
There is a new tool there that could help you. It can visualize disk access, so you can
see which blocks are being hit with I/O activity. It should be possible to see which
files are involved if you set up your own script to access the files when they are
otherwise idle, or by backtracking from block numbers. The tool is kind of crude, but
fully functional.

From: u-kevin@megami.veritas.com (Kevin Sheehan - Uniq)
-------------------------------------------------------
The short answer is "no". You can see what disks are doing, you can
see what NFS is doing, but there is no easy way to see what files or
other file systems are doing short of the disk traffic.

If you really mean "these are mmap()ed files", then there are some page
tools that let you see what is going on with pages in the system and
per-vnode (per-file) as it happens. proctool in particular is a good start.
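
For what it's worth, pmap (one of the /usr/proc/bin tools on Solaris 2.5 and
later) will at least list the mappings in a process, which shows which files
are mmap'd and how large each mapping is - a rough sketch, with <pid> as a
placeholder:

    # Hedged sketch: list a process's mappings (address, size, mapped file).
    /usr/proc/bin/pmap <pid>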

From: jrg@blodwen.demon.co.uk (James R Grinter):
------------------------------------------------
Peter Polasek <pete@cobra.brass.com> writes:

> Is there a utility for Solaris that provides disk activity statistics
> on a 'per-file' basis? We have a server that has very heavy 'iostat'
> disk activity -- unfortunately there are about 10 files that are known to

Difficult - I don't think the kernel counts it in that manner. You
might be able to get something that will count I/O per process.
(Solaris 2.6 onwards counts pages in/out per partition, so that might
be of some help.)
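
(For example, something along these lines samples the per-partition counters
every 10 seconds - a sketch only, since the exact flags depend on the iostat
version shipped with your release:)

    # Hedged sketch: extended, per-partition disk statistics, 10-second samples.
    # Check iostat(1M) on your release; the -p flag may not be combinable with
    # -x everywhere.
    iostat -xp 10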

Alternatively, you might be able to spot it by truss'ing the relevant
processes and seeing which ones are doing the most operations, but I'm
not sure what would really show up with mmap()'d files.

> be extremely active, so it is difficult to identify the most offensive
> file. Incidentally, the files are 'memory mapped' images of physical
> memory, so I would be curious if anyone is aware of any issues,
> optimizations, or utilities for memory mapped files.

All Solaris I/O ends up as mmap()/VM operations anyway, so mmap is
often good to use directly, if you can.

To address your actual problem: if you know which files are extremely
active, you ought to consider putting them on separate spindles as a
starting point, anyway.


