[Dirvish] dirvish and weeding out duplicates

Shane R. Spencer shane at tdxnet.com
Tue Nov 15 19:57:19 EST 2005


I think I may have a solution to speed up checking for duplicate files.

First off, it's nice to have a quick little CPU on hand for this change.

In the dirvish backup script there is a small area where it generates
the index.

When it receives that index, an md5sum list for the same files could be
written alongside it.

Example (not from dirvish script):

index (grep \ \-rw|head):
9438940   32 -rwxr-xr-x   1 root     root        31404 Jul 16  2004 ./bin/chgrp
9437720    4 -rwxr-xr-x   1 root     root         2816 Sep 17 23:04 ./bin/arch
9437721  616 -rwxr-xr-x   1 root     root       625228 Dec 19  2004 ./bin/bash
9438939   20 -rwxr-xr-x   1 root     root        16504 Jul 16  2004 ./bin/cat
9438951   12 -rwxr-xr-x   1 root     root        10740 Dec 18  2003 ./bin/dnsdomainname
9438941   32 -rwxr-xr-x   1 root     root        31212 Jul 16  2004 ./bin/chmod
9438942   36 -rwxr-xr-x   1 root     root        34572 Jul 16  2004 ./bin/chown
9438943   56 -rwxr-xr-x   1 root     root        51212 Jul 16  2004 ./bin/cp
9438944   56 -rwxr-xr-x   1 root     root        51724 Oct  2 08:34 ./bin/cpio
9438945   88 -rwxr-xr-x   1 root     root        83960 May 11  2005 ./bin/dash

md5sums.gz (head):
648a70701ec59f7ffc69dc512ab812f9  ./bin/chgrp
713ccbecfb73a5362d639805e4473fbf  ./bin/arch
61e8b059aa16e0028a6f19ef5c979a31  ./bin/bash
388b2a370b29026d36ba484649098827  ./bin/cat
36ca6651f7671603d81fe1eba1eedd23  ./bin/dnsdomainname
e3ebd7dabfe9fc1ddfcfae52139b8898  ./bin/chmod
88bfb20ed1595339b72aaee4461e59a0  ./bin/chown
45fbe95f3acc3c6d5410381c1e67c756  ./bin/cp
b9e99f8e6b0936cdd4d34cae1b9900df  ./bin/cpio
9302508fccce18f1b0336adb2f3124fd  ./bin/dash

This will allow a post-processor to read the index as it was written
and build a list of sizes, checksums, and finally the filename + path,
without having to generate a list of checksums when it runs. The
checksums are generated only once per image, versus recomputing them
every time with finddup/freedups.
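
Here is a minimal Python sketch of such a post-processor, assuming the
index is find -ls style output like the excerpt above and md5sums.gz
holds plain md5sum lines for the same tree (the file names, field
positions, and space-free paths are assumptions, not dirvish's actual
on-disk format):

import gzip

def load_md5s(path):
    """Map './relative/path' -> md5 hex digest from an md5sums.gz file."""
    md5s = {}
    with gzip.open(path, 'rt') as fh:
        for line in fh:
            digest, name = line.rstrip('\n').split(None, 1)
            md5s[name] = digest
    return md5s

def load_index(path):
    """Yield (size_in_bytes, './relative/path') for regular files."""
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 11 or not fields[2].startswith('-'):
                continue            # only plain files carry a leading '-'
            yield int(fields[6]), fields[10]   # size column, path column

def build_records(index_path, md5_path):
    """Return a list of (size, md5, path) tuples for one image."""
    md5s = load_md5s(md5_path)
    return [(size, md5s[name], name)
            for size, name in load_index(index_path)
            if name in md5s]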

You can also use the index generation in the main dirvish script to
only write an md5sum for non-zero-length files, to eliminate needless
processing. The contrary to that could be, instead of merely skipping
md5sum creation for 0-byte files, to replace those files with
lseek()-created sparse entries so they take up no allocated space.
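
As an illustration of the non-zero-length rule, something like the
following could write the md5sum list while skipping empty files (a
sketch only; dirvish itself would do this while building the index,
and the output file name is made up):

import gzip
import hashlib
import os

def write_md5sums(tree, out_path='md5sums.gz'):
    """Checksum every regular, non-empty file under tree, md5sum-style."""
    with gzip.open(out_path, 'wt') as out:
        for dirpath, dirnames, filenames in os.walk(tree):
            for name in filenames:
                full = os.path.join(dirpath, name)
                if os.path.islink(full) or not os.path.isfile(full):
                    continue
                if os.path.getsize(full) == 0:
                    continue          # nothing worth checksumming or linking
                h = hashlib.md5()
                with open(full, 'rb') as fh:
                    for chunk in iter(lambda: fh.read(1 << 20), b''):
                        h.update(chunk)
                rel = os.path.join('.', os.path.relpath(full, tree))
                out.write('%s  %s\n' % (h.hexdigest(), rel))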

Pregenerating the md5sums for files would allow a sorted approach to
hard linking (given there is a limited number of hard links per inode),
letting you hard link larger files first, and only files above a
certain size, say 4096 or 8192 bytes, as a user-defined rule.
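
A rough sketch of that ordering, assuming the (size, md5, path) records
from the post-processor sketch above have already been resolved to real
on-disk paths; the size threshold and the cap on links per inode are
illustrative user-defined rules, not dirvish settings:

import os

def link_duplicates(records, min_size=8192, max_links=1000):
    """Hard link files that share (size, md5), biggest files first."""
    groups = {}
    for size, digest, path in records:
        if size >= min_size:                  # user-defined size floor
            groups.setdefault((size, digest), []).append(path)

    # Sorting the (size, md5) keys in reverse works on the largest
    # files first, so the biggest disk savings come earliest.
    for (size, digest), paths in sorted(groups.items(), reverse=True):
        keeper = paths[0]
        for other in paths[1:]:
            if os.stat(keeper).st_nlink >= max_links:
                keeper = other                # start a fresh link group
                continue
            if os.path.samefile(keeper, other):
                continue                      # already hard linked
            tmp = other + '.linktmp'
            os.link(keeper, tmp)              # link first, then swap in place
            os.rename(tmp, other)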

This would only have a costly effect on initial backups via --init,
which in my opinion (though not currently in the code) should ignore
the "checksum: 1" setting by default, removing the "-c" option from
the rsync command line.

There could be optimizations to this, including snagging the md5sum
list from rsync itself (local and remote) if the "checksum: 1" setting
exists for that vault/branch.

Hard linking from these lists should happen in two passes (this part
still needs more thought):

  1.) On the fly for all --reference based branches.  Read in the
      md5sums and use the index creation of the dirvish script as an
      opportunity to match md5sums and do the hard linking on the spot.

  2.) On each configured storage bank, do a run over all the index and
      md5sum files in storagebank/vaults/(branches), possibly skipping
      files that were somehow marked during the on-the-fly pass. Run
      this after every dirvish-runall (see the sketch below).
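
A sketch of what that second pass might look like, assuming a
storagebank/vault/image layout with index, md5sums.gz and the backed-up
tree inside each image, and reusing build_records() and
link_duplicates() from the sketches above (the names and layout here
are illustrative, not dirvish's exact on-disk structure):

import glob
import os

def freedup_pass(bank):
    """Link duplicates across every image found in one storage bank."""
    records = []
    for image in glob.glob(os.path.join(bank, '*', '*')):
        index = os.path.join(image, 'index')
        md5s = os.path.join(image, 'md5sums.gz')
        if not (os.path.isfile(index) and os.path.isfile(md5s)):
            continue                          # not an image directory
        tree = os.path.join(image, 'tree')
        for size, digest, name in build_records(index, md5s):
            # Rebase the './path' entries onto this image's tree on disk.
            records.append((size, digest, os.path.join(tree, name[2:])))
    link_duplicates(records)

Collecting every image's records into one list before linking is what
lets duplicates be matched across branches, rather than only within a
single image.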


Now to write the second-pass dirvish-freedup, which emulates finddup
and freedup behavior except that it uses the pregenerated lists.

Instead of hours to process over 6 million files, you should be looking
at a few minutes, since in the snapshots after initialization only the
changes are indexed in their respective directories.

Thoughts?

Shane Spencer
TDXNet L.L.C.


