SUMMARY : SCSI bus tweaking

From: Geert Devos (Geert.Devos@ping.be)
Date: Mon Apr 08 1996 - 05:39:49 CDT


Hi all,

It has been some time since I posted my request for help on SCSI bus tweaking.

I didn't receive any answers that remedied the mediocre performance in a
substantial way, so I thought I'd let everyone know what I got from the
group. Perhaps somebody will be able to elaborate further on the discussion
I had with Kevin Sheehan.

I hope to get through this problem/question in the next week, and will post
a follow-up when ready.

I would like to thank everybody for the very swift responses I got:

"Brian O'Mahoney" <brian@teraflex.com>
mshon@sunrock.East.Sun.COM (Michael J. Shon {*Prof Services} Sun Rochester)
Kevin.Sheehan@uniq.com.au
Markus Storm <storm@plato.uni-paderborn.de>

Geert Devos

original posting
-------------------------------------------------------------------------

>From sun-managers-relay@ra.mcs.anl.gov Sun Mar 17 17:05 EST 1996
To: info-sun-managers@news.Belgium.EU.net
From: geert.devos@ping.be (Geert Devos)
Subject: SCSI bus tweaking
Date: Sun, 17 Mar 1996 17:19:33 +0100
Nntp-Posting-Host: dialup27.brussels2.eunet.be

Hi all,

I have a specific question.
I'm using an Ultra1 (140) as a file/print/OPI server in a prepress setup.
There's a SunSwift adapter card in the system that gives both 100Base-TX
and Fast&Wide SCSI connections.

On the F&W SCSI I have a hardware RAID system with five 9 GB drives behind
it (5,400 rpm drives). The system is set up using RAID 3.

The file system (newfs) I created was made with the parameter "-r 5400",
because I didn't really have time for full-bore testing before the system
went out to the customer.

After installing the system in the production environment, I ran "iostat
-xtc" for a couple of days, and noticed two very remarkable things:
1. I/O tops out at roughly 5 MB/s, whereas the F&W SCSI is supposed to be
able to achieve 20 MB/s. The "%b" column does show 100% utilisation at
that point, though.
2. When I look at the control panel of the RAID controller, it shows a
maximum of 20-30% utilisation.

What on earth is causing this? My "educated" guess would point me in the
direction of the "newfs -r 5400". Since we're using RAID 3 (synchronised
reads and writes over all spindles in the array), perhaps I should
reconfigure with "newfs -r 21600" or something like that (4 disks @
5,400 rpm in sync). But would this make any difference at all? Perhaps the
SCSI bus protocol won't let me have the whole bandwidth for my one device
on the F&W SCSI (yes, there's only this one device on the controller),
since it is presumed to be able to handle I/O from up to 15 devices. So
maybe I have to do something to the SCSI mode pages (I know things can
be changed, but I haven't found any docs on this).

The RAID system comes from MegaDrive Corporation; it uses a hardware
controller based on a 960i and has 32 MB of cache RAM. The drives are
presented as one LUN.

Thanks for any suggestions.

end original posting
-------------------------------------------------------------------------
>From Brian O'Mahoney

You are fairly comprehensively confused in two quite different
directions:

1. The newfs rpm setting simply affects the layout of the file
system and is mostly irrelevant these days, since disks have
large on-board caches which mask the seek delays that the file
system address skewing is supposed to optimise. In the case of
a RAID there is even more buffering in the RAID controller.

2. There are two quite different throughput considerations: (a) the
bandwidth of the bus (2x10 MB/s for FAST SCSI), and (b) the maximum
data generation rate of the CPU (writing) and disk/RAID controller
(reading). Just because the bus will run at 20 MB/s doesn't mean the
CPU or disk can!

--
Regards, Brian.

-------------------------------------------------------------------------
>From Michael Shon

The continuous data transfer rate from any one of the disks in there will be about 5 MB/s.

If the RAID width is very narrow (say 2K from one disk, then 2K from the next, etc.) you can get several disks involved in a single I/O, and get throughput of multiples of 5 MB/s.

If the width is larger (say 1 cylinder), then only one disk will be involved in a filesystem I/O, even a clustered I/O of 56K (the default). You will be limited by the transfer rate of a single drive: about 5 MB/s.

If the box does not allow you to pick the RAID width, then your mkfs parameters will not make much difference.
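
Just to make that arithmetic concrete, here is a small back-of-the-envelope sketch; the stripe units, per-disk rate and disk count below are assumed numbers for illustration, not figures from this particular array:

/*
 * Back-of-the-envelope sketch of the point above: how many spindles does one
 * clustered file-system I/O touch for a given stripe unit?  All numbers here
 * are assumptions for illustration, not measurements from this system.
 */
#include <stdio.h>

int main(void)
{
    const long io_size      = 56L * 1024;  /* default UFS cluster: maxcontig (7) * 8K */
    const long per_disk_mbs = 5;           /* assumed sustained rate of one drive, MB/s */
    const long ndisks       = 4;           /* data disks in the RAID 3 set */
    const long stripe_unit[] = { 2L * 1024, 1024L * 1024 };  /* narrow vs. ~one cylinder */
    int i;

    for (i = 0; i < 2; i++) {
        long touched = (io_size + stripe_unit[i] - 1) / stripe_unit[i];
        if (touched > ndisks)
            touched = ndisks;
        printf("stripe unit %5ldK: a %ldK I/O touches %ld disk(s) -> ~%ld MB/s ceiling\n",
               stripe_unit[i] / 1024, io_size / 1024, touched,
               touched * per_disk_mbs);
    }
    return 0;
}

With a 2K stripe unit the 56K cluster spans all the data disks; with a cylinder-sized stripe unit it stays on one disk and you are back at the single-drive rate.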

-------------------------------------------------------------------------
>From Kevin Sheehan

Generally you will not see straight media throughput on disk accesses. Rotational latency will be a good part of it, as will seek time for the normal UFS file system layout. You might want to think about tweaking maxbpg and maxcontig if you are doing large file access...

100% utilization includes seek and rotational latency, as well as data busyness...

-------------------------------------------------------------------------
>From Markus Storm

I don't know what throughput rates it is supposed to provide. However, your 9 GB drives are presumably Seagate Elites. AFAIK, they max out at 1 MB/s (sustained). So if your data stream is big enough to max out the cache(s) on the array, this would give you a maximum of 5 MB/s if you stripe across 5 drives.

Look at the 'svc_t' column in iostat (or use the SE package - great stuff); that's the average time to service requests (in msec). It's supposed to be a better measure than %busy (I forgot why ... maybe because it incorporates drive cache ... sorry, haven't got my tuning book handy).

Something you should look into is the SCSI options. You might have a line in /etc/system similar to

set scsi_options=0x378

where you can e.g. disable tagged command queuing and even disable F&W SCSI.

/*
 * SCSI subsystem options - global word of options are available
 *
 * bits 0-2 are reserved for debugging/informational level
 * bit 3 reserved for a global disconnect/reconnect switch
 * bit 4 reserved for a global linked command capability switch
 * bit 5 reserved for a global synchronous SCSI capability switch
 *
 * the rest of the bits are reserved for future use
 */

#define SCSI_DEBUG_TGT      0x1     /* debug statements in target drivers */
#define SCSI_DEBUG_LIB      0x2     /* debug statements in library */
#define SCSI_DEBUG_HA       0x4     /* debug statements in host adapters */

#define SCSI_OPTIONS_DR     0x8     /* Global disconnect/reconnect */
#define SCSI_OPTIONS_LINK   0x10    /* Global linked commands */
#define SCSI_OPTIONS_SYNC   0x20    /* Global synchronous xfer capability */
#define SCSI_OPTIONS_PARITY 0x40    /* Global parity support */
#define SCSI_OPTIONS_TAG    0x80    /* "      tagged command support */
#define SCSI_OPTIONS_FAST   0x100   /* "      FAST scsi support */
#define SCSI_OPTIONS_WIDE   0x200   /* "      WIDE scsi support */
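
(As an aside, not part of Markus' message: the example value 0x378 can be read straight off those defines - it is every capability bit set except tagged command queuing, with the low debug bits left at zero. A small sketch of the arithmetic:)

/*
 * Illustration only: compose an scsi_options word from the flag bits quoted
 * above and show the effect of clearing individual capabilities.
 */
#include <stdio.h>

#define SCSI_OPTIONS_DR     0x8
#define SCSI_OPTIONS_LINK   0x10
#define SCSI_OPTIONS_SYNC   0x20
#define SCSI_OPTIONS_PARITY 0x40
#define SCSI_OPTIONS_TAG    0x80
#define SCSI_OPTIONS_FAST   0x100
#define SCSI_OPTIONS_WIDE   0x200

int main(void)
{
    unsigned all = SCSI_OPTIONS_DR | SCSI_OPTIONS_LINK | SCSI_OPTIONS_SYNC |
                   SCSI_OPTIONS_PARITY | SCSI_OPTIONS_TAG |
                   SCSI_OPTIONS_FAST | SCSI_OPTIONS_WIDE;

    printf("everything enabled:              0x%x\n", all);                      /* 0x3f8 */
    printf("without tagged command queuing:  0x%x\n", all & ~SCSI_OPTIONS_TAG);  /* 0x378 */
    printf("without fast and wide transfers: 0x%x\n",
           all & ~(SCSI_OPTIONS_FAST | SCSI_OPTIONS_WIDE));                      /* 0xf8 */
    return 0;
}

Whatever value you end up with goes into the "set scsi_options=0x..." line in /etc/system, as in the example above.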

Hope that helps,

Markus

-------------------------------------------------------------------------
My reply to Kevin Sheehan

Hi Kevin,

I fully agree in case we're dealing with one physical disk. But here we're talking about a RAID3 system with a hardware RAID controller, something like the SPARC Storage Array (sorry, SSA has no RAID3). The 5 x 9 GB sits behind the controller and presents itself as one physical disk. Internally (on the disk side) the device is equipped with 5 separate SCSI controllers, one per disk. The hardware controller came with 32 MB of RAM cache.

The reason we're using RAID3 for this customer is that it is a prepress site, with typical file sizes of 4 - 30 MB per file for the scans (46%) and 200 KB - 2 MB (46%) for their low-res files. The remainder of the files are anything from 30 KB to 3 MB for the page layout files.

This means that most reads and writes on this system (more than 65%) are for more than 1.5 MB in a single file. I have seen file systems and hard disks behave in a very funny way under this kind of load.

The idea behind using RAID3 instead of RAID5 is that, with a caching controller between the host and the physical disks, one could use the maximum throughput on the disk side. RAID3 uses synchronised spindles (sorry if I sound like I'm preaching), meaning that, in theory, the total throughput over the disks should be something like 3 or 4 times the theoretical throughput per disk, if you can shove data to the RAID controller fast enough. If in this case the controllers on the disk side are Fast SCSI II (10 MB/s), and the controller on the host side is Fast & Wide SCSI II (20 MB/s), I think we should be seeing a bit more than 5 to 6 MB/s on large reads and writes. I know file system overhead can cost something, but not to the point of eating up 70% of the throughput.

About changing maxbpg and maxcontig:
- Normally, Sol_2.5 gives 25% of a cylinder group to one single file before jumping to another cg. I have found that in real-life performance this is adequate for "normal" disks, i.e. non-striped. Giving a full cg to one single file pushes performance up by some 3 to 6% (depending on the file size) in testing environments. But in a live production environment I've seen performance drops of up to 30%, because of simultaneous accesses to related files. I guess the caching algorithm has something to do with that.

-------------------------------------------------------------------------
>From Kevin Sheehan

Which flavor of RAID3 is it, though? The original idea is that all of the disks would be completely slaved. The general implementation is that they all get talked to at the same time, but are not rotationally slaved.

> The reason we're using RAID3 for this customer is because it is a prepress
> site, typical file size of 4 - 30 MB per file for the scans (46%) and 200

Pretty typical place for RAID3 to be...

> KB - 2 MB (46%) for their low-res files. The remainder of the files are
> anything from 30 KB to 3 MB for the page layout files.
>
> This means that any read or write likely to happen on this system is for
> more than 1.5 MB (more than 65%) for one single file. I saw file systems
> and hard disks behave in a very funny way with this kind of load.

Hmmm, at that size you may still be seeing rotational latency if the drives
are not slaved.

> The idea for using RAID3 instead of RAID5 is that with a caching controler
> between the host and the physical disks, one could use the maximum
> throughput on the disk side. RAID3 uses synchronised spindles (sorry if I
> sound like preaching) meaning that, in theory, the total throughput over
> the disks should be something like 3 or 4 times the theoretical throughput
> per disk if you can shove data to the RAID controler fast enough.

Yep - that's why I ask about "real" RAID3. I've seen some implementations where they just used a bunch of normal SCSI disks with no sync on the spindles.

> I know file system overhead can do something, but not to the point of
> eating up 70% of the throughput.

Not even close. I can get 95% of theoretical disk speed with a well tuned UFS file system and huge reads. *Writes* are a problem, since the metastate is written synchronously, which generally causes head movement.

It might be that you want to turn off synchronous metastate updates (which
will raise the chances of a hosed FS if you crash) and see what effect that
has. For restores, we see 5x improvement in time. At the end is fastfs.c,
which does this.

> About changing maxbpg and maxcontig:
> - Normaly, Sol_2.5 gives 25 % of a cylinder group to one single file before
> jumping to another cg. I have found that in real life performance this is
> adequate for "normal" disks, i.e. non-striped. Giving a full cg to one

Yep - given the size of cylinders these days.

> single file pushes performance for some 3 to 6% (depending on the file
> size) in testing environments. But in a life production environment I've
> seen performance drops of up to 30%, because of simultanuous accesses to
> related files. I guess the caching algorithm has something to do with that.

You are probably right that with simultaneous accesses to multiple files, letting it clog up like that will be an issue. Does your application use mmap() to access the files, or read()/write()?
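
[The fastfs.c Kevin refers to was not preserved in this summary. As a rough sketch of the idea only: the commonly circulated fastfs.c toggles delayed (asynchronous) metadata I/O on a mounted UFS file system via the Solaris _FIOSDIO/_FIOGDIO ioctls from <sys/filio.h>. The following assumes those ioctls and is an illustration, not Kevin's original source.]

/*
 * Sketch of the fastfs idea: toggle delayed (asynchronous) metadata I/O on a
 * mounted UFS file system.  Assumes the Solaris _FIOSDIO/_FIOGDIO ioctls from
 * <sys/filio.h>; this is an illustration, not the original fastfs.c.
 *
 * usage: fastfs fast|slow|status <mountpoint>
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/filio.h>

int main(int argc, char *argv[])
{
    int fd, flag;

    if (argc != 3) {
        fprintf(stderr, "usage: %s fast|slow|status <mountpoint>\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[2], O_RDONLY)) < 0) {
        perror(argv[2]);
        return 1;
    }

    if (strcmp(argv[1], "status") == 0) {
        if (ioctl(fd, _FIOGDIO, &flag) < 0) {   /* get current delayed-I/O state */
            perror("_FIOGDIO");
            return 1;
        }
        printf("%s: %s\n", argv[2],
               flag ? "fast (delayed metadata I/O)" : "slow (synchronous metadata I/O)");
    } else {
        flag = (strcmp(argv[1], "fast") == 0);  /* 1 = delayed, 0 = synchronous */
        if (ioctl(fd, _FIOSDIO, &flag) < 0) {   /* set delayed-I/O state */
            perror("_FIOSDIO");
            return 1;
        }
    }
    close(fd);
    return 0;
}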

-------------------------------------------------------------------------
end messages

--------------------------------------------------------------------------------
- Geert Devos                                             geert.devos@ping.be -
- Fax @ work +32 2 451 13 91                     Voice @ work +32 2 451 12 69 -
- system administrator/system integrator/chief numskull                       -
--------------------------------------------------------------------------------
-A committee is a life form with six legs or more and no brain (R. A. Heinlein)-
--------------------------------------------------------------------------------


