[Dirvish] dirvish and weeding out duplicates

Shane R. Spencer shane at tdxnet.com
Thu Nov 17 12:16:23 EST 2005


On Thu, 2005-11-17 at 11:32 +0100, Paul Slootman wrote:
> On Wed 16 Nov 2005, Shane R. Spencer wrote:
> > 
> > What I am proposing is quite simple, however: generate the md5sum
> > remotely or locally and cache it.  Even if faster-dupemerge checks each
> > inode's hard-link count to filter how much analysis it does, it could
> > still never compare to loading pre-generated md5sums from a flatfile.
> > faster-dupemerge/finddupes/finddup/etc. still have to generate these
> > for identical files; the md5sum process is the bottleneck and always
> > will be.
> 
> When I was confronted with the challenge of copying over a Debian
> archive from one system to another, I had to get creative...
> This was an archive of daily snapshots over a period of about 2 years.
> Common files between days hardlinked, total storage about 150GB, and the
> number of files in the millions. Rsync was clearly not an option, at
> least not in one go.
> 
> I started off by rsyncing each incremental day with the --link-dest
> option. I then found that because the original mirroring was sometimes
> not perfect (a day wasn't completed, for example) that this still led to
> many distinct identical files. I then came up with a perl script that
> checks every file, finds its md5sum, and then sees if there is a file in
> an "md5sum" directory, where the files are named as their md5sum (split
> into a number of directory levels to keep things manageable). If there
> is already a file there, it unlinks the newly discovered file and links
> the md5sum-named file to the new filename; if not, it links in the other
> direction, adding the new file to the md5sum tree.
> 
> Basically it's a way of quickly finding a file with a given md5sum.
> 
> This ignores other meta-data such as timestamp, uid/gid and modes, but
> in the case of a Debian archive that isn't relevant; those should all
> be the same anyway.  Otherwise the md5sum tree could be extended to
> include that data in the file naming...
> 
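[For reference, Paul's scheme can be sketched roughly as follows. This is a hypothetical reconstruction, not his actual script (which was Perl); the two-level directory split and the function names are assumptions.]

```python
import hashlib
import os

def md5sum(path, bufsize=1 << 20):
    """Compute the md5 hex digest of a file's contents."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def md5_path(tree, digest):
    # Split the 32-char digest across directory levels so no single
    # directory grows huge, e.g. tree/d4/1d/d41d8cd9...
    return os.path.join(tree, digest[:2], digest[2:4], digest)

def dedupe(path, tree):
    digest = md5sum(path)
    canon = md5_path(tree, digest)
    if os.path.exists(canon):
        # Content already known: drop the new copy and hard-link the
        # md5-named file to the newly discovered filename.
        os.unlink(path)
        os.link(canon, path)
    else:
        # First time this content is seen: link it into the md5sum tree.
        os.makedirs(os.path.dirname(canon), exist_ok=True)
        os.link(path, canon)
```

Run over every regular file in a snapshot, this converges all identical files onto a single inode, exactly as the quoted description says: whichever side of the comparison is new gets linked to the other.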

This is the kind of concept I like :)  I suppose reading an md5sum dir
full of inodes and filenames would be just as easy as reading a
flatfile, and hopefully less corruptible.

And you run this per day, I take it, skipping files whose inodes
already exist in the md5sum dir.
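[That skip could look something like this (again a hypothetical sketch): since deduped files are hard-linked into the md5sum tree, they share an inode with their md5-named entry, so collecting those inode numbers up front lets each day's pass avoid re-hashing files it has already seen.]

```python
import os

def known_inodes(tree):
    """Collect the inode numbers of every md5-named link in the tree."""
    inodes = set()
    for dirpath, _dirs, files in os.walk(tree):
        for name in files:
            inodes.add(os.stat(os.path.join(dirpath, name)).st_ino)
    return inodes

def needs_hashing(path, inodes):
    # A file already hard-linked into the md5sum tree shares an inode
    # with its md5-named entry, so it can be skipped without reading it.
    return os.stat(path).st_ino not in inodes
```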

Shane

> 
> Paul Slootman
> _______________________________________________
> Dirvish mailing list
> Dirvish at dirvish.org
> http://www.dirvish.org/mailman/listinfo/dirvish


