[Dirvish] dirvish and weeding out duplicates

Shane R. Spencer shane at tdxnet.com
Tue Nov 15 20:05:42 EST 2005


On Tue, 2005-11-15 at 15:57 -0900, Shane R. Spencer wrote:
> I think I may have a solution to speed up checking for duplicate files.
> 
> First off, it's nice to have a quick little CPU because of this change.
> 
> In the dirvish backup script there is a small area where it generates
> the index.
> 
> when it receives an index 

This is meant to say:

When dirvish creates an index it already handles the output one line
(one file) at a time. It should take advantage of that pass and also
create an md5sum for every file that is not already a hard link,
symlink, directory, or special device/system file, appending it to an
md5sums file written with the same gzip/bzip2 options as the index.
This would produce a small file of md5sums covering each actual file
being indexed, at some cost in speed, but only the initial full backup
is affected to any real degree.
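
Roughly, sketched in Python (dirvish itself is Perl, so treat this as
pseudocode for the idea; in dirvish the checksumming would ride along
with the pass that builds the index rather than being a second walk,
and the name write_md5sums is made up):

import gzip
import hashlib
import os
import stat

def write_md5sums(tree_root, out_path):
    # Walk the image tree and emit "<md5>  ./path" lines, compressed
    # the same way as the index.  Symlinks, directories, specials, and
    # files that are already hard links (nlink > 1) are skipped.
    with gzip.open(out_path, "wt") as out:
        for dirpath, dirnames, filenames in os.walk(tree_root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode) or st.st_nlink > 1:
                    continue
                digest = hashlib.md5()
                with open(path, "rb") as f:
                    for block in iter(lambda: f.read(65536), b""):
                        digest.update(block)
                rel = "./" + os.path.relpath(path, tree_root)
                out.write("%s  %s\n" % (digest.hexdigest(), rel))
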

> 
> Example (not from dirvish script):
> 
> index (grep \ \-rw|head):
> 9438940   32 -rwxr-xr-x   1 root     root        31404 Jul 16  2004 ./bin/chgrp
> 9437720    4 -rwxr-xr-x   1 root     root         2816 Sep 17 23:04 ./bin/arch
> 9437721  616 -rwxr-xr-x   1 root     root       625228 Dec 19  2004 ./bin/bash
> 9438939   20 -rwxr-xr-x   1 root     root        16504 Jul 16  2004 ./bin/cat
> 9438951   12 -rwxr-xr-x   1 root     root        10740 Dec 18  2003 ./bin/dnsdomainname
> 9438941   32 -rwxr-xr-x   1 root     root        31212 Jul 16  2004 ./bin/chmod
> 9438942   36 -rwxr-xr-x   1 root     root        34572 Jul 16  2004 ./bin/chown
> 9438943   56 -rwxr-xr-x   1 root     root        51212 Jul 16  2004 ./bin/cp
> 9438944   56 -rwxr-xr-x   1 root     root        51724 Oct  2 08:34 ./bin/cpio
> 9438945   88 -rwxr-xr-x   1 root     root        83960 May 11  2005 ./bin/dash
> 
> md5sums.gz (head):
> 648a70701ec59f7ffc69dc512ab812f9  ./bin/chgrp
> 713ccbecfb73a5362d639805e4473fbf  ./bin/arch
> 61e8b059aa16e0028a6f19ef5c979a31  ./bin/bash
> 388b2a370b29026d36ba484649098827  ./bin/cat
> 36ca6651f7671603d81fe1eba1eedd23  ./bin/dnsdomainname
> e3ebd7dabfe9fc1ddfcfae52139b8898  ./bin/chmod
> 88bfb20ed1595339b72aaee4461e59a0  ./bin/chown
> 45fbe95f3acc3c6d5410381c1e67c756  ./bin/cp
> b9e99f8e6b0936cdd4d34cae1b9900df  ./bin/cpio
> 9302508fccce18f1b0336adb2f3124fd  ./bin/dash
> 
> This allows a post-processor to read the index as it was written and
> build a list of sizes, checksums, and finally filename + path,
> without having to generate the checksums itself when it runs. The
> checksums are generated only once per image, versus every run when
> using finddup/freedup.
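
A rough post-processor along those lines (Python sketch; assumes the
gzipped index/md5sums files shown above and a "find -ls"-style field
layout in the index, and the function names are made up):

import gzip

def load_md5sums(md5_path):
    # "<md5>  ./path" lines -> {path: md5}
    sums = {}
    with gzip.open(md5_path, "rt") as f:
        for line in f:
            digest, name = line.rstrip("\n").split("  ", 1)
            sums[name] = digest
    return sums

def read_entries(index_path, md5_path):
    # Join the pre-generated checksums against the index, yielding
    # (size, md5, path) without checksumming anything at run time.
    sums = load_md5sums(md5_path)
    with gzip.open(index_path, "rt") as f:
        for line in f:
            # naive split: paths containing spaces would need smarter
            # parsing than this sketch bothers with
            fields = line.split()
            if len(fields) < 11 or not fields[2].startswith("-"):
                continue        # only regular files carry checksums
            size, path = int(fields[6]), fields[-1]
            if path in sums:
                yield (size, sums[path], path)
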
> 
> You could also have the index generation in the main dirvish script
> write an md5sum only for non-0-length files, to cut out pointless
> processing. The contrary to that: instead of only skipping the md5sum
> creation, use lseek to free up the space allocated for 0-byte files.
> 
> Pregenerating the md5sums for files would allow a sorted approach to
> hard linking (given that there is a limited number of hard links per
> inode): link the larger files first, and only files above a certain
> size, say 4096 or 8192 bytes, as more of a user-defined rule.
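
Something like this for the sorted pass (sketch only; MIN_SIZE and
MAX_NLINK are the user-tunable knobs I mean, and link_duplicates
consumes the (size, md5, path) tuples from read_entries above):

import os

MIN_SIZE = 8192      # don't bother linking files smaller than this
MAX_NLINK = 30000    # stay safely under the filesystem's link limit

def link_duplicates(entries):
    # Largest files first, so the limited supply of hard links is
    # spent where it frees the most space.  Assumes all paths live on
    # one filesystem (one bank).
    masters = {}
    for size, digest, path in sorted(entries, reverse=True):
        if size < MIN_SIZE:
            break                    # descending order: rest are smaller
        key = (size, digest)         # size guards md5 collisions a bit
        master = masters.get(key)
        if master is None:
            masters[key] = path
        elif os.lstat(master).st_nlink < MAX_NLINK:
            tmp = path + ".dirvish-dup"
            os.link(master, tmp)     # link then rename, so the file
            os.rename(tmp, path)     # is never missing mid-operation
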
> 
> This would only have a costly effect on initial backups via --init,
> which in my opinion (though not currently in the code) should ignore
> the "checksum: 1" setting by default, dropping the "-c" option from
> the rsync command line.
> 
> There could be optimizations to this, including snagging the md5sum
> list from rsync itself (local and remote) when the "checksum: 1"
> setting exists for that vault/branch.
> 
> Hard linking from these lists should happen in two passes (this still
> needs to be rethought; see the sketch after the list):
> 
>   1.) On the fly for all --reference based branches.  Read in the
>       md5sums and use the index creation of the dirvish script as an
>       opportunity to match md5sums and do the hard linking on the spot.
> 
>   2.) On each configured storage bank, do a run over all the index
>       and md5sum files in storagebank/vaults/(branches), possibly
>       skipping files somehow marked by the on-the-fly pass. Run this
>       after every dirvish-runall.
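
For pass 1, roughly (reusing read_entries from the earlier sketch;
build_ref_map and link_on_the_fly are made-up names for illustration):

import os

def build_ref_map(tree, index_path, md5_path):
    # {(size, md5): path in the --reference image}
    return {(s, d): os.path.join(tree, p)
            for s, d, p in read_entries(index_path, md5_path)}

def link_on_the_fly(ref_map, size, digest, path):
    # Called per file while the new image's index/md5sum lines are
    # being written; returns True when the file got replaced by a link.
    master = ref_map.get((size, digest))
    if master is None or os.path.samefile(master, path):
        return False                 # no match, or already linked
    tmp = path + ".dirvish-linking"
    os.link(master, tmp)
    os.rename(tmp, path)
    return True
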
> 
> 
> Now to write the second-pass dirvish-freedup, which emulates finddup
> and freedup behavior except that it uses the pregenerated lists.
> 
> Instead of hours to process over 6 million files, you should be
> looking at a few minutes, since for snapshots after the initial image
> only the changed files need new checksums in their respective
> directories.
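
The skeleton of that second pass might look like this (hypothetical
dirvish-freedup, gluing together the helpers sketched above; assumes
the usual bank/vault/image layout with the tree under image/tree and
gzipped index/md5sums files at the image level):

import glob
import os

def freedup_bank(bank):
    # Collect (size, md5, path) from every image's pre-generated
    # lists, then hard link duplicates largest-first.
    entries = []
    for md5_path in glob.glob(os.path.join(bank, "*", "*", "md5sums.gz")):
        image = os.path.dirname(md5_path)
        index_path = os.path.join(image, "index.gz")
        tree = os.path.join(image, "tree")
        for size, digest, rel in read_entries(index_path, md5_path):
            entries.append((size, digest, os.path.join(tree, rel)))
    link_duplicates(entries)
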
> 
> Thoughts?
> 
> Shane Spencer
> TDXNet L.L.C.
> 


