[Dirvish] dirvish and weeding out duplicates

Shane R. Spencer shane at tdxnet.com
Wed Nov 16 14:28:55 EST 2005


On Wed, 2005-11-16 at 02:11 -0500, foner-dirvish at media.mit.edu wrote:
> I highly, highly recommend "faster-dupemerge", available from [1],
> which I actually found a few weeks ago because of a single reference
> by Steve Ramage on March 6, 2005 in the archives for this list.  (I'd
> decided to read the entire archive when considering using dirvish, and
> I'm glad I did.)  faster-dupemerge hardlinks duplicate files together
> a la the -H switch in rsync, and it does so efficiently.

It has no routine for invoking rsync itself, but I do like its
algorithm.

> On my workloads, its runtime scales approximately linearly with the
> number of files, and its memory usage is negligible.  For example, it
> took 412 minutes (about 7 hours) to link 2.4 million files into about
> 2 million files (in 130K directories) in a 300GB ext3fs filesystem,
> while using maybe 20 meg of memory to do so, and < 10% CPU.  This is
> on a typical Athlon 2800+ with a Seagate IDE 7200rpm disk drive.

7 hours isn't terrible at all; it took almost 2 hours using finddup on
a slow Athlon for approximately 650k files. At your file count it would
have taken my slow server 8 hours to generate all those checksums.

> Later runs were much faster, because most of the files were already
> hardlinked; thus rerunning the faster-dupemerge took only 175 minutes,
> or about 2.35 times faster.

What I am proposing is quite simple, however: generate the md5sum
remotely or locally and cache it.  Even if faster-dupemerge checks each
inode's hard-link count and filters down the amount of analysis it
does, it could still never compare to loading pre-generated md5sums
from a flatfile.  faster-dupemerge/finddupes/finddup/etc. still has to
generate these sums for like files; the md5sum step is the bottleneck
and will always be the bottleneck.
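
To make the idea concrete, here is a rough Python sketch of the kind of
cache lookup I mean. The flatfile name and the (inode, size, mtime) key
are my own choices for illustration; nothing like this exists in dirvish
or faster-dupemerge today.

import hashlib
import os

CACHE_FILE = ".md5cache"   # hypothetical cache location


def load_cache(cache_file=CACHE_FILE):
    """Read previously computed sums, keyed by (inode, size, mtime)."""
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file) as fh:
            for line in fh:
                # one record per line: "<md5> <inode> <size> <mtime> <path>"
                md5, inode, size, mtime, _path = line.rstrip("\n").split(" ", 4)
                cache[(int(inode), int(size), int(mtime))] = md5
    return cache


def md5_of(path, cache):
    """Return the cached md5 when inode/size/mtime still match, else re-hash."""
    st = os.stat(path)
    key = (st.st_ino, st.st_size, int(st.st_mtime))
    if key in cache:
        return cache[key]            # no I/O beyond the stat
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    cache[key] = h.hexdigest()       # remember for the next run
    return cache[key]

On a second run almost every file hits the cache, so the pass degenerates
into stats and lookups instead of full reads.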

There are a few ways of fixing this, one of which is generating an
md5sum for each file as it is being stored.

You may laugh at the complexity of this, but if you set up a FUSE
(http://fuse.sourceforge.net/, at its core highly reliable) filesystem
that generates md5sums on data while it is being written (checking for
backward writes and holes in 1k blocks), you could easily generate the
required md5sums as files are written for the first time and store
those sums, along with the inode, file size and name, in
bdb/cdb/sqlite/a flatfile/etc.
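
To show just the storage half of that idea, here is a minimal sketch of
the metadata store such a filesystem could populate; the sqlite schema,
database name and function names are all made up for illustration.

import sqlite3


def open_store(db_path="md5store.sqlite"):
    """Open (or create) the sum store the write path would feed."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sums (
            inode INTEGER,
            size  INTEGER,
            name  TEXT,
            md5   TEXT,
            PRIMARY KEY (inode, name)
        )""")
    return conn


def record_sum(conn, inode, size, name, md5):
    """Called once a file has been fully written and its sum is final."""
    conn.execute(
        "INSERT OR REPLACE INTO sums (inode, size, name, md5) VALUES (?, ?, ?, ?)",
        (inode, size, name, md5))
    conn.commit()


def duplicates_of(conn, md5):
    """Every stored path carrying a given content sum -- link candidates."""
    return conn.execute(
        "SELECT name, inode, size FROM sums WHERE md5 = ?", (md5,)).fetchall()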

The other methods are the ones I described before: as you generate an
index, generate md5sums. You only have to do this for files with a
hard-link count of 1. If you need a faster processor and one exists on
the remote machine you are backing up, a remote (pre-rsync) pass should
be set up; then all file times are rechecked and a local pass is run at
the end to clean up the md5sums of any files that changed after the
remote pass. Given better rsync output you could also capture the
md5sums when the -c option is enabled in rsync; unfortunately you are
then spending twice the CPU (remote and local) doing md5sums on each
pair of files being checked.
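
Roughly, the indexing pass I have in mind looks like this sketch;
hashing is skipped for anything whose link count is already above 1.
The function name and 1 MB read size are arbitrary.

import hashlib
import os
import stat


def sums_for_unlinked(root):
    """Yield (path, md5) for regular files that are not yet hard-linked."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue              # skip symlinks, devices, etc.
            if st.st_nlink > 1:
                continue              # already linked by an earlier pass
            h = hashlib.md5()
            with open(path, "rb") as fh:
                for block in iter(lambda: fh.read(1 << 20), b""):
                    h.update(block)
            yield path, h.hexdigest()

Run remotely, the same walk could write its (path, md5) pairs to a file
that the local side loads before rsync ever starts.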

I really like the rest of your comments. I just wish there were a better
way of generating fractured md5sums for files and a filesystem that
would dynamically offer md5sums or a similar sum as metadata. :)

I have backups of backups of backups, of duplicate files, of tars that
were opened and never removed, of backups of backups... and so on.  I
agree with case (a) in your email; I recently ran a 6-hour finddup -l
on a tree of such backups and freed over 14 gigs of data.  Yes, I am a
slob.

I also have large files that change, databases and so forth, and dirvish
does well to ignore those when told to, simply because in order to
snapshot a database you must in most cases dump it to a flatfile: a file
often larger than the cumulative size of the database files themselves,
however necessary it is for maintaining database integrity. Standard
scripts to dump databases into $backup/$database/$table.flatfile, plus a
single $backup/restore.flatfile that pulls in all of those, would be
easy and handy, and would also allow tables that have not changed to be
collapsed by the duplicate matcher.  That still doesn't address the fact
that the dumps' timestamps get bumped on every run, so rsync
re-transfers them anyway; this could of course be fixed with the same
script by resetting the times to match when nothing has changed. I could
go into doing this with SQL; I am sure somewhere out there a database
system offers its own internal incremental backup-to-flatfile solution.
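
Something along these lines is what I mean by fixing it with the same
script; dump_table() is only a stand-in for whatever mysqldump/pg_dump
invocation applies, and none of this exists in dirvish.

import filecmp
import os
import shutil
import subprocess


def dump_table(database, table, dest):
    """Hypothetical stand-in: dump one table as a flatfile at `dest`.
    The dump must be deterministic (no embedded dump date) for the
    comparison below to mean anything."""
    with open(dest, "wb") as out:
        subprocess.check_call(["mysqldump", database, table], stdout=out)


def refresh_dump(backup_root, database, table):
    """Re-dump a table, but keep the old file and its mtime if nothing changed."""
    final = os.path.join(backup_root, database, table + ".flatfile")
    fresh = final + ".new"
    os.makedirs(os.path.dirname(final), exist_ok=True)
    dump_table(database, table, fresh)

    if os.path.exists(final) and filecmp.cmp(final, fresh, shallow=False):
        # Identical content: drop the new dump so rsync sees an unchanged
        # file and the duplicate matcher can hard-link it across snapshots.
        os.unlink(fresh)
    else:
        shutil.move(fresh, final)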

I may try your solution for now; I can't break free from other
obligations to write a FUSE filesystem or even modify a few simple
scripts :)

Good point about the hard-link upper limit; hence why I am using XFS.
For this reason I also assumed it would be in the best interest of any
program that does hard linking to start with the largest files first
when running its pass.
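
For what it's worth, "largest first" only takes a sort before the
linking pass; this little helper is mine, not part of any of these
tools.

import os


def candidates_largest_first(paths):
    """Sort candidate files by size, descending, so the biggest space
    savings happen before any hard-link limit can get in the way."""
    sized = [(os.lstat(p).st_size, p) for p in paths]
    sized.sort(reverse=True)
    return [p for _size, p in sized]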

I want to have a lot more discussion on this. FUSE is not the way, and
getting rsync to do things it can't do is not the way. Possibly
modifying dirvish to store an md5sum database... possibly... but what
kind of future skew problems would we have with something like that?

Shane Spencer


