[Dirvish] dirvish and weeding out duplicates

Keith Lofstrom keithl at kl-ic.com
Wed Nov 16 16:14:44 EST 2005

On Wed, Nov 16, 2005 at 10:35:14AM -0900, Shane R. Spencer wrote:
> Also.. dirvish needs/wants/is born on transparency it seems.  so it
> would be difficult to hard link files with different uid/gid's without a
> gid/uid map for its restoration.  I really think a filesystem proxy that
> could speak the rsync tongue should be born. Possibly dirvishfs/backupfs
> (which would limit it to fuse on linux for the most part) or like I
> mentioned a fancy program similar to unison to handle linking/gid/uid
> mapping and time restoration.

I would like to take a slightly different tack on this, which answers
the need (smaller, faster backups) without the problems (new file
system formats, extra compute time, extra backup disk wearout).

Dirvish is a tool that is part of the larger problem of maintaining
data integrity.  Part of that problem is avoiding bloat in the original
filesystems.  While some file duplication stems from multiple similar
machines (and some variant of backuppc might be better for dealing
with those), much of it comes from errors and inefficiencies in managing
the original data on the source drive (say, two downloads of the
same data from a camera flash card, or two copies of the same .rpm).

Source disk bloat is more of a problem than backup bloat, IMHO.  My
laptop drive costs about 10x what my backup drives cost;  even with
3-way backup drive rotation, bloat still costs about 3x more on the
source side than in the backups.  What I would like to see is a dirvish-aware
system for finding these duplications and hard-linking them out of the
source drives.  The dirvish awareness is useful because the process
will be imperfect, and we don't want to accidentally and permanently
wipe something unique.  It would be good if we had some way of verifying
that we had at least one copy of our duplicate source file in the
backups before we start throwing files away.
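
As a rough illustration of that idea, here is a minimal Python sketch, not
an existing dirvish feature.  It groups source files by size and SHA-256,
and only hard-links a group together if a byte-identical copy is found at
the same relative path in a backup image.  The function names and the
`backup_tree` argument are hypothetical; `backup_tree` is assumed to point
at the `tree` directory of one dirvish image.  The sketch ignores ownership,
permissions, and timestamps, which differ between "duplicates" more often
than one would like, and it assumes the duplicates live on one filesystem.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # only regular files are candidates
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # a unique size cannot be a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(p) for p in by_hash.values() if len(p) > 1)
    return groups


def backup_has_copy(source_root, backup_tree, path):
    """True if backup_tree holds identical content at the same relative path."""
    rel = os.path.relpath(path, source_root)
    candidate = os.path.join(backup_tree, rel)
    return os.path.isfile(candidate) and file_digest(candidate) == file_digest(path)


def link_duplicates(source_root, backup_tree, dry_run=True):
    """Hard-link duplicates, but only when the content is verified in the backup."""
    actions = []
    for group in find_duplicates(source_root):
        keep = group[0]
        if not backup_has_copy(source_root, backup_tree, keep):
            continue  # no verified backup copy: leave this group alone
        for dup in group[1:]:
            if os.path.samefile(keep, dup):
                continue  # already the same inode
            actions.append((keep, dup))
            if not dry_run:
                tmp = dup + ".dedup-tmp"  # hypothetical temporary name
                os.link(keep, tmp)        # create the link first ...
                os.replace(tmp, dup)      # ... then swap it over the duplicate
    return actions
```

Running it with `dry_run=True` (the default) only reports the pairs it would
link, which seems the right first step for an admittedly imperfect process.
Linking before replacing means the duplicate is never missing from the
directory, even if the run is interrupted.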

There are also rumors that rsync sometimes forgets about files in
the backups.  While I have not seen this myself, a friend insisted
that he has restored backups using rsync and had numerous files go
missing in the process.  I would rather spend a few extra bucks on
backup drives and duplicative storage than over-optimize myself into
missing data, if optimization has a chance of increasing such errors.

Effort on new file systems, IMHO, should go into making ReiserFS4
more tested and reliable, and finding ways to efficiently copy or
move it.  I use ReiserFS3 on my backup drives, and it saves far
more space (compared to, say, ext3) than fooling with hardlinks on
inefficient filesystems does.  The problem is, I don't trust 4,
and don't trust 3 as much as I would like.   If I try to copy a
dirvish vault from one disk to another,  the existing tools are
laughably inadequate to deal with such hard-link laden monsters. 
It can take many hours to just fsck a disk with dirvish vaults. 
There are some good places to put that "new filesystem" energy
without actually trying to wedge a new one into the kernel and
into rsync.
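
To show why copying a vault is hard on the tools, here is a minimal Python
sketch of a hard-link-preserving copy.  It is an illustration only, not how
rsync or cp are implemented, and the function name is hypothetical.  The
copier has to remember every multiply-linked inode it has already written,
and in a dirvish vault that is nearly every file, so the lookup table grows
with the whole vault.  The sketch handles regular files and file symlinks
only; symlinked directories, devices, FIFOs, and ownership are left out.

```python
import os
import shutil


def copy_preserving_hardlinks(src_root, dst_root):
    """Copy a tree, recreating hard links instead of duplicating file data."""
    seen = {}  # (st_dev, st_ino) -> first destination path for that inode
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            st = os.lstat(src)
            key = (st.st_dev, st.st_ino)
            if st.st_nlink > 1 and key in seen:
                # Same inode already copied: link to it rather than copy again.
                os.link(seen[key], dst)
            else:
                shutil.copy2(src, dst, follow_symlinks=False)
                if st.st_nlink > 1:
                    seen[key] = dst
    return len(seen)  # number of multiply-linked inodes tracked
```

The return value is the size of the inode table, which gives a feel for how
much state a copy of a real vault would need to carry.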

So, good thinking, but don't stop there.   You've taken the first
steps toward something wonderful.


Keith Lofstrom          keithl at keithl.com         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
