SUMMARY: large data repository sync over WAN

From: Rob Windsor <windsor_at_warthog.com>
Date: Thu Feb 22 2007 - 12:47:24 EST

I received many responses.  Some pointed at tools (which is what I was 
looking for, honestly), but most had a common theme.  :)

Original Post:
> We need to sync 10TB of data in small files from one North American 
> coast to the other.
> 
> Our tentative plans are to sneakernet the data and then use some form of 
> sync to catch up the delta.
> 
> Aside from bandwidth constraints, we found that rsync quickly craps out 
> with large numbers of files.
> 
> What tools have you used to do this?

Most popular question:
> Does all 10TB of it change daily?

No.  The data comes in two flavors:
* Oracle DBF files (yes, changes daily), less than a TB here
* Small static files, the files themselves don't change, their count
   simply increases
   - side note: These files are about 16 subdirs deep and heavily
     scattered (er.. I mean.. "distributed")


Other common questions/comments:
> You didn't specify how rsync craps out, but I'm guessing

I forget the specifics, but it was basically "out of memory" due to the 
number of files and subdirs it has to dig in.

> what version of rsync you're using

2.6.8 (looking at 2.6.9 now to see if it addresses any of the problems 
we've had)

> but you can often throw ram at the issue.

Not in this case, unfortunately.

> In addition, you can fire off rsync on a subtree so it has less work to
> do.

That's certainly a consideration.  It won't be easy (cf. "about 16 
subdirs deep" above).
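
A minimal sketch of the per-subtree idea, for illustration only.  The 
function name, the RSYNC override variable, and the example paths are 
assumptions, not anything from the responses.  It runs one rsync per 
top-level subdirectory so that each run builds a smaller file list; 
files sitting directly in the top-level directory are not covered, and 
with a tree 16 levels deep the split may need to happen further down.

```sh
#!/bin/sh
# Hypothetical helper: one rsync run per top-level subtree.
# RSYNC can be overridden (for example, to add options or for testing).
RSYNC=${RSYNC:-rsync}

sync_by_subtree() {
    src=$1      # e.g. /export/data (hypothetical)
    dest=$2     # e.g. remotehost:/export/data (hypothetical)
    for dir in "$src"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        # $dir ends in "/", so its contents land in $dest/$name/.
        "$RSYNC" -a "$dir" "$dest/$name/" || echo "failed: $name" >&2
    done
}

# Usage (hypothetical paths):
#   sync_by_subtree /export/data remotehost:/export/data
```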

Then there were these:
> (Deborah Santomauro) Have you tried "rdist"?
and
> (Anthony D'Atri) rdist 6 from www.magnicomp.com with SSH as the transport works great for managing files.

Holy cow, now that's oldschool love!  I'll look into that.


Brad Morrison mentioned:
> I think cpio has a flag to skip files with equal or newer mod dates,

Yeah, we've also considered something like:
    "rsync -av `find . -newer <somefile> -print` dest:/path"
just to limit the volume of files that rsync has to consider.
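
One caveat with the backtick form: with enough matching files the 
expanded command line can exceed the system's argument-length limit.  
A hedged alternative is to feed the list to rsync on stdin with 
--files-from, sketched below.  The function name, the stamp-file 
convention, and the paths are assumptions for illustration; the sketch 
also assumes no filenames contain newlines (with a find that supports 
-print0, rsync's --from0 option removes that restriction).

```sh
#!/bin/sh
# Hypothetical helper: sync only files newer than a timestamp file.
RSYNC=${RSYNC:-rsync}

sync_newer_than() {
    stamp=$1    # absolute path to a file touched at the previous sync
    src=$2      # source directory
    dest=$3     # e.g. remotehost:/export/data (hypothetical)
    # List regular files newer than the stamp, relative to $src,
    # strip the leading "./", and let rsync read the list from stdin.
    (cd "$src" && find . -type f -newer "$stamp" -print) |
        sed 's|^\./||' |
        "$RSYNC" -a --files-from=- "$src"/ "$dest"
}

# Usage (hypothetical paths):
#   sync_newer_than /var/tmp/last-sync /export/data remotehost:/export/data
#   touch /var/tmp/last-sync
```

Touching the stamp file before the sync starts, rather than after it 
finishes, would avoid missing files that arrive while rsync is running.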

There was mention of NetApp, ZFS, and VxFS/VxVM, which aren't options in 
this situation.  As much as I tried to get to ZFS, it wasn't available 
at the time we upgraded the DB/file servers to Sol10.

Hutin Bertrand mentioned an app called "aide", which is an Intrusion 
Detection tool (think Tripwire) that you can use to spot files/subdirs 
that have changed.  Interesting find. 
(http://sourceforge.net/projects/aide)
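
The general idea of spotting changes by comparing snapshots can be 
sketched with standard tools.  This is not AIDE and does not use its 
configuration or database format; the function names and paths are 
hypothetical.  It records a checksum manifest per run and reports paths 
that are new or changed between two manifests.  Checksumming reads 
every file, so on a tree this size it would be slow.

```sh
#!/bin/sh
# Hypothetical helpers: manifest-based change detection (not AIDE itself).

# Print a sorted "CRC SIZE PATH" manifest for every regular file under $1.
snapshot() {
    (cd "$1" && find . -type f -exec cksum {} + | sort)
}

# Print paths whose manifest line appears only in the newer manifest,
# i.e. files that were added or whose contents changed.
changed_since() {
    old=$1
    new=$2
    comm -13 "$old" "$new" | cut -d' ' -f3-
}

# Usage (hypothetical paths):
#   snapshot /export/data > /var/tmp/manifest.new
#   changed_since /var/tmp/manifest.old /var/tmp/manifest.new
```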

Gedaliah Wolosh pointed me at http://www.openafs.org
Karl Rossing mentioned http://opensolaris.org/os/project/avs/

AFS is quite an endeavor; we're not quite prepared to go that route.

AVS looks interesting.  We might be able to do something with that if 
heavy rsync-frobulation doesn't work out.

Thanks all!

Rob++
-- 
Internet: windsor@warthog.com                             __o
Life: Rob@Carrollton.Texas.USA.Earth                    _`\<,_
                                                        (_)/ (_)
"They couldn't hit an elephant at this distance."
   -- Major General John Sedgwick
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Thu Feb 22 12:48:44 2007
