[Dirvish] dirvish and weeding out duplicates
paul at debian.org
Thu Nov 17 05:32:09 EST 2005
On Wed 16 Nov 2005, Shane R. Spencer wrote:
> What I am proposing is quite simple however. Generate the md5sum
> remotely or locally and cache it. Even if faster-dupemerge checks each
> inode for a hard-link count and filters the amount of analysis it will
> do, it could still never compare to loading pre-generated md5sums from a
> flatfile. faster-dupemerge/finddupes/finddup/etc. still have to
> generate these for like files. The md5sum process is the bottleneck and
> will always be the bottleneck.
When I was confronted with the challenge of copying over a Debian
archive from one system to another, I had to get creative...
This was an archive of daily snapshots over a period of about 2 years.
Common files between days were hardlinked; total storage was about 150GB,
and the number of files ran into the millions. Rsync was clearly not an
option, at least not in one go.
I started off by rsyncing each incremental day with the --link-dest
option. I then found that, because the original mirroring was sometimes
not perfect (a day wasn't completed, for example), this still led to
many distinct identical files. I then came up with a perl script that
checks every file, finds its md5sum, and then sees if there is a file in
an "md5sum" directory, where the files are named as their md5sum (split
into a number of directory levels to keep things manageable). If there
is already a file there, it unlinks the newly discovered file and links
the md5sum-named file to the new filename; if not, it links in the other
direction, adding the new file to the md5sum tree.
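The perl script itself isn't in this post, but the lookup-and-link logic is simple enough to sketch in shell (the function name, the two-level directory split, and all paths here are my assumptions, not the original code):

```shell
# Rough sketch of the described logic: dedup_one FILE TREE hardlinks
# FILE against a canonical copy stored under TREE, named by its md5sum.
dedup_one() {
    file=$1 tree=$2
    sum=$(md5sum "$file" | cut -d' ' -f1)
    # split the hash into two directory levels to keep directories small,
    # e.g. d41d8cd9... -> TREE/d4/1d/d41d8cd9...
    dir=$tree/$(printf '%s' "$sum" | cut -c1-2)/$(printf '%s' "$sum" | cut -c3-4)
    mkdir -p "$dir"
    canon=$dir/$sum
    if [ -e "$canon" ]; then
        # content already known: replace FILE with a hardlink to the
        # canonical copy
        ln -f "$canon" "$file"
    else
        # new content: register FILE itself as the canonical copy
        ln "$file" "$canon"
    fi
}
```

Because the tree entry is a hardlink rather than a copy, registering a new file costs no extra disk space beyond the directory entry.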
Basically it's a way of quickly finding a file with a given md5sum.
This ignores other metadata such as timestamps, uid/gid, and modes, but
for a Debian archive that isn't relevant; those should all be the same
anyway. Otherwise the md5sum tree could be extended to include that data
in the file naming...
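That extension could be as simple as folding the metadata into the canonical name, so only files identical in both content and metadata end up sharing an inode. A hypothetical key function (GNU md5sum/stat assumed; not from the original script):

```shell
# Hypothetical extension of the md5sum naming: append mode, owner and
# mtime to the hash, so files are only merged when content AND metadata
# match. Uses GNU md5sum and stat.
key_for() {
    file=$1
    sum=$(md5sum "$file" | cut -d' ' -f1)
    meta=$(stat -c '%a_%u_%g_%Y' "$file")  # octal mode, uid, gid, mtime
    printf '%s_%s\n' "$sum" "$meta"
}
```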