[Dirvish] dirvish and weeding out duplicates

Shane R. Spencer shane at tdxnet.com
Wed Nov 16 19:13:58 EST 2005

On Wed, 2005-11-16 at 13:14 -0800, Keith Lofstrom wrote:
> On Wed, Nov 16, 2005 at 10:35:14AM -0900, Shane R. Spencer wrote:
> > Also.. dirvish needs/wants/is born on transparency it seems.  so it
> > would be difficult to hard link files with different uid/gid's without a
> > gid/uid map for its restoration.  I really think a filesystem proxy that
> > could speak the rsync tongue should be born. possibly dirvishfs/backupfs
> > (which would limit it to fuse on linux for the most part) or like I
> > mentioned a fancy program similar to unison to handle linking/gid/uid
> > mapping and time restoration.
> I would like to take a slightly different tack on this, which answers
> the need (smaller, faster backups) without the problems (new file
> system formats, extra compute time, extra backup disk wearout).
> Dirvish is a tool that is part of the larger problem of maintaining
> data integrity.  Part of that problem is avoiding bloat in the original
> filesystems.  While some file duplication stems from multiple similar
> machines (and some variant of backuppc might be better for dealing
> with those), much of it comes from errors and inefficiencies managing
> the original data on the source drive (say, two downloads from the
> same data on a camera flash card, or two copies of the same .rpm ).  
> Source disk bloat is more of a problem than backup bloat, IMHO.  My
> laptop drive costs about 10x what my backup drives cost;  even with
> 3-way backup drive rotation, that is a 3x cost disadvantage for bloat
> at the source side.  What I would like to see is a dirvish-aware
> system for finding these duplications and hard-linking them out of the
> source drives.  The dirvish awareness is useful because the process
> will be imperfect, and we don't want to accidentally and permanently
> wipe something unique.  It would be good if we had some way of verifying
> that we had at least one copy of our duplicate source file in the
> backups before we start throwing files away.

Source bloat in my world is inevitable on multi-user or publicly
accessible systems, mainly web and ftp servers, but primarily
workstations for developers who keep 16 source trees of the same thing
around :)
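
For what it's worth, here is a rough sketch of the "find the duplicates
on the source drive" half of what Keith describes. It only reports
candidate groups; it does not hard-link or delete anything, and it does
not check the dirvish vault for an existing copy, which would have to
happen before any linking. The function names (file_digest,
find_duplicates) are made up for illustration and are not part of
dirvish or rsync.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root with identical content."""
    # Cheap first pass: only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        seen_inodes = set()
        for path in paths:
            st = os.stat(path)
            key = (st.st_dev, st.st_ino)
            if key in seen_inodes:
                continue  # already hard-linked to a file we counted
            seen_inodes.add(key)
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Comparing sizes first keeps the hashing down to files that could
actually match, which matters on a tree with 16 copies of the same
source checkout.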

> There are also rumors that rsync sometimes forgets about files in
> the backups.  While I have not seen this myself, a friend insisted
> that he has restored backups using rsync and had numerous files go
> missing in the process.  I would rather spend a few extra bucks on
> backup drives and duplicative storage than over-optimize myself into
> missing data, if optimization has a chance of increasing such errors.

I am personally scared of doing a restore.. I will be testing it on a
UML setup soon enough.
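
The kind of check I have in mind for that test is nothing fancy: walk
the original tree and the restored tree and list whatever did not come
back. A minimal sketch, assuming both trees are mounted locally; the
names relative_files and missing_after_restore are hypothetical, and
this compares file names only, not contents, ownership or timestamps.

```python
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_after_restore(source_root, restored_root):
    """List files present in source_root but absent from restored_root."""
    return sorted(relative_files(source_root) - relative_files(restored_root))
```

An empty list from missing_after_restore() would not prove the restore
is good, but a non-empty one would confirm the "files go missing" story.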

> Effort on new file systems, IMHO, should go into making ReiserFS4
> more tested and reliable, and finding ways to efficiently copy or
> move it.  I use ReiserFS3 on my backup drives, and it saves far
> more space (compared to, say ext3) than fooling with hardlinks on
> inefficient filesystems does.  The problem is, I don't trust 4,
> and don't trust 3 as much as I would like.   If I try to copy a
> dirvish vault from one disk to another,  the existing tools are
> laughably inadequate to deal with such hard-link laden monsters. 
> It can take many hours to just fsck a disk with dirvish vaults. 
> There are some good places to put that "new filesystem" energy
> without actually trying to wedge a new one into the kernel and
> into rsync.

It's a lame idea forced from a "must patch now" mentality.  My second
thought is to get dirvish incorporated with librsync and handling the
queue of files itself (including duplicate checks).  But that would be
very limiting and require the backup host to store a database for quick
reference. I like the idea but it just isn't dirvishy.

I need to think more.. story of my life.

> So, good thinking, but don't stop there.   You've taken the first
> steps toward something wonderful.
> Keith
