SUMMARY: Confused about tape drive compression

From: Judith Reed (jreed@wukon.appliedtheory.com)
Date: Wed Jun 24 1998 - 09:05:10 CDT


Thanks to:
 The Hermit Hacker <scrappy@hub.org>
 Charlie Mengler <charliem@anchorchips.com>
 "Peter L. Wargo" <plw@ncgr.org>
 "Adams, Chad M CRL" <cadams@crl02.crrel.usace.army.mil>
 mason@ncipher.com
 Jochen Bern <bern@penthesilea.uni-trier.de>
 Harvey Wamboldt <harvey@iotek.ns.ca>
 "Brion Leary" <brion@dia.state.ma.us>
 "Ackerson, Greg" <ackerson_ga@nns.com>
 Greg Sawicki <sawicki@interlog.com>
 Rich Kulawiec <rsk@gsp.org>
 Michael Sullivan <mike@trdlnk.com>
 nobroin@sced.esoc.esa.de (Niall O Broin)
 martin@stavanger.geoquest.slb.com (Martin Oksnevad)
(and a few others - forgive me if I missed you)
who all supplied lots of good info on tape compression.

Our original question was about actual amounts of data that can be stored
to an 8705DX Sun tape drive rated to hold 7GB uncompressed, 14GB compressed.
We are seeing about 27GB of Oracle data going onto a drive which should
only support 14GB of compressed data.

A number of people explained that the specific data we were backing up in
this instance, Oracle databases, could be compressed to a very small size
because it often consisted mostly of empty space, waiting to be filled with
data.

"Oracle pre-allocates the disk space when you create the table, so if the
tables are relatively empty, even though 'df' shows it using 29gig of disk
space, compressed it will use much much less than that, conceivably next to
nothing..."

Others pointed out that there's "average compression", and then there's
actual compression, and the actual result can be much better or much
worse than the average.

"The "rule of thumb" on data compression is "2 to 1" "on the average".
For some files, compression might leave the file the same size or
even larger. On some ASCII files I've seen compression ratios of
10 to 1 or greater."

There are ways to see this for yourself:
==========================================================
% gzip -9v csh.txt
csh.txt: 71.2% -- replaced with csh.txt.gz

(The text file is 71.2% smaller when compressed.)

% gzip -9v tire1-1.jpg
tire1-1.jpg: 1.0% -- replaced with tire1-1.jpg.gz

(The .jpg photo is already compressed, and can't go much further.)
============================================================
Try this simple test.

% mkfile 1m test
% ls -l test
% compress test
% ls -l test.Z

Note the difference in file size. This is because mkfile fills the
file with all zeros, a pattern that is highly compressible.
=============================================================
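
The same experiment can be tried without mkfile or compress. What follows is
only a minimal sketch in Python, assuming the standard zlib and os modules;
the helper name percent_saved and the three samples are made up for
illustration and did not come from any of the posters. zlib is not the same
algorithm as compress(1) or a tape drive's hardware compression, so the exact
figures will differ, but the pattern should be the same.

```python
import os
import zlib


def percent_saved(data: bytes) -> float:
    """Return how much smaller data gets under zlib level 9, as a percentage."""
    compressed = zlib.compress(data, 9)
    return 100.0 * (1 - len(compressed) / len(data))


size = 1024 * 1024  # 1 MiB, like "mkfile 1m test"
line = b"the quick brown fox jumps over the lazy dog\n"

samples = {
    "zeros": bytes(size),                  # what mkfile writes
    "text": line * (size // len(line)),    # highly repetitive text (illustrative)
    "random": os.urandom(size),            # stand-in for already-compressed data
}

for name, data in samples.items():
    # A negative figure means the "compressed" copy came out larger.
    print(f"{name:>6}: {percent_saved(data):6.1f}% smaller")
```

The zero-filled and repetitive samples should shrink to almost nothing, while
the random sample should not shrink at all and may even grow slightly.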

Two posters clarified the concept of compression:

"The SUN Guy was right in *some* Places and confused in others.
A Tape of 7G uncompressed Capacity holds 7G, either without or
after Compression, PERIOD. If you get 20G onto it, your Compression
Ratio is approx. 3:1 or better."

"For one thing the Sun person told you wrong when they said if it was
compressed 2:1 you could fit 28 GB; a 160M tape is rated to hold 14 GB
assuming a 2:1 ratio; hence, to get 28 GB you'd need a 4:1 ratio."
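
The arithmetic behind both comments is just the amount of data divided by the
tape's uncompressed capacity. A small sketch, assuming the 7 GB uncompressed
figure from the original question (the function name required_ratio is
illustrative):

```python
def required_ratio(data_gb: float, native_gb: float) -> float:
    """Compression ratio needed to fit data_gb on a tape that holds
    native_gb uncompressed."""
    return data_gb / native_gb


native_gb = 7.0  # uncompressed capacity quoted for the tape

for data_gb in (14.0, 20.0, 27.0, 28.0):
    ratio = required_ratio(data_gb, native_gb)
    print(f"{data_gb:4.0f} GB on a {native_gb:.0f} GB tape needs {ratio:.2f}:1")
```

So 14 GB corresponds to the advertised 2:1, 20 GB to a little under 3:1, and
28 GB to 4:1.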

And one poster pointed out that if an Oracle database compresses to a
very small size, you need to remember to plan for when that database fills up -
good point:

"However, there is some cause for concern: if the reason your data is
being compressed so well is due to having many empty fields of data, then it
stands to reason that, should meaningful data start getting imported or changed
at a fast pace, the database as a whole could quickly become a 2-tape deal.
(This happened to us just last week.) It's better to plan for this
contingency than to continuously expect your DB to fit on one tape."
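
One rough way to plan for this is to estimate tape usage as the database
fills. The sketch below is a toy model only. It assumes a 29 GB pre-allocated
database and a 7 GB (uncompressed) tape, and it assumes that real data
compresses at 1.3:1 while empty space compresses at 100:1. Both ratios are
guesses for illustration, not measurements, and the function name
tapes_needed is hypothetical.

```python
import math


def tapes_needed(db_gb, fill_fraction, native_gb=7.0,
                 data_ratio=1.3, empty_ratio=100.0):
    """Estimate tapes needed for a partly filled, pre-allocated database.

    Assumptions (illustrative only): filled space compresses at
    data_ratio:1 and empty space at empty_ratio:1.
    """
    filled_gb = db_gb * fill_fraction
    empty_gb = db_gb - filled_gb
    on_tape_gb = filled_gb / data_ratio + empty_gb / empty_ratio
    return max(1, math.ceil(on_tape_gb / native_gb))


for fill in (0.1, 0.25, 0.5, 0.75, 1.0):
    print(f"{fill:4.0%} full: {tapes_needed(29.0, fill)} tape(s)")
```

Under these assumptions the backup stops fitting on one tape well before the
database is full, which matches the poster's experience.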

It was pointed out that there's a compression FAQ:

"There's a three-part data compression FAQ in Usenet's news.answers that
goes into detail on algorithms, examples, etc."

and another poster talked about possible alternative ways to back up Oracle
databases:

"One thing that I am wondering about is whether or not you export your
database before backing it up. Over the years, I've found that it's
much better to use the database's native tools to dump it in a simple
format (e.g. ASCII) and then back *that* up because should you need to
recover from a failure of some kind, your chances of successfully
being able to do so from ASCII (which you can work on with standard Unix
tools and re-import using the database's tools) are much better than your
chances of restoring the relatively fragile database itself.

Not only that, if you export the database, you'll get a better idea of
its true size, since only data that exists will get exported."

Another poster explained where the 2:1 ratio comes from:

"The commonly mentioned ratio of 2:1 compression is just
a convenient simplification for marketing purposes that roughly corresponds
to what "typical" users can expect with "typical" data. If the original
data is random, there is no redundancy, so the data cannot be compressed and
the ratio will be 1 (or perhaps, slightly less than 1 if the compression
algorithm imposes some fixed overhead). If the data is very redundant,
then very high compression ratios can be achieved."
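
A quick way to get a feel for "redundancy" is to measure the entropy of the
byte values in a sample. The sketch below is an illustration, not something
the poster supplied: it computes the Shannon entropy of the byte histogram in
bits per byte, where 8 means the bytes look random and 0 means a single
repeated value. It is only a rough indicator, because it ignores repeated
patterns that a real compressor would exploit; the name byte_entropy is made
up for this example.

```python
import math
import os
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte-value histogram, in bits per byte (0 to 8)."""
    total = len(data)
    counts = Counter(data)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


samples = {
    "zeros": bytes(65536),          # maximally redundant
    "random": os.urandom(65536),    # essentially no redundancy
    "text": b"SELECT name, address FROM customers WHERE id = 42;\n" * 1000,
}

for name, data in samples.items():
    print(f"{name:>6}: {byte_entropy(data):.2f} bits per byte")
```

Data that scores close to 8 bits per byte is unlikely to compress; data that
scores low should compress very well.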

One poster talked about the more typical situation, as opposed to the Oracle
compression issue:

"It's very unusual to get 2:1 compression on tape drives and with
29gb on a 7gb tape you have 4:1 compression.

Data compression is very data (type) dependent but people normally
expect ~30% (20-40%) data compression (1.3:1). If your data is
compressed already (e.g. compressed tar files) don't expect to get
any compression at all on tape.

30% data compression on a 8705DX with 160 meter tapes should give
you ~9gb per tape.

With 45% data compression (more normal, though above average), your 29gb
would just fit on a 20gb tape on an Exabyte 8900 tape drive (Mammoth)."
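
The figures in that last quote work out if "N% compression" is read as "N%
more data per tape", so 30% corresponds to 1.3:1. That reading is an
assumption about the poster's convention. A small sketch of the arithmetic
(the function name effective_capacity is illustrative):

```python
def effective_capacity(native_gb: float, gain_percent: float) -> float:
    """Data that fits on a tape if compression adds gain_percent to its
    uncompressed capacity (30% means a 1.3:1 ratio)."""
    return native_gb * (1 + gain_percent / 100.0)


# 8705DX with 160 meter tapes: 7 GB uncompressed, 30% gain
print(f"{effective_capacity(7.0, 30.0):.1f} GB")

# Exabyte 8900 (Mammoth): 20 GB uncompressed, 45% gain
print(f"{effective_capacity(20.0, 45.0):.1f} GB")
```

That gives roughly 9 GB for the first case and 29 GB for the second, in line
with the poster's numbers.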

Thanks to all who wrote!!!

-- 
Judith Reed
jreed@appliedtheory.com
(315) 453-2912 x335


