[Dirvish] dirvish and weeding out duplicates
Shane R. Spencer
shane at tdxnet.com
Tue Nov 15 19:57:19 EST 2005
I think I may have a solution to speed up checking for duplicate files.
First off, it's nice to have a fast CPU for this change.
In the dirvish backup script there is a small area where it generates
the index when it receives the file list.
Example (not from dirvish script):
index (grep \ \-rw|head):
9438940 32 -rwxr-xr-x 1 root root 31404 Jul 16
9437720 4 -rwxr-xr-x 1 root root 2816 Sep 17
9437721 616 -rwxr-xr-x 1 root root 625228 Dec 19
9438939 20 -rwxr-xr-x 1 root root 16504 Jul 16
9438951 12 -rwxr-xr-x 1 root root 10740 Dec 18
9438941 32 -rwxr-xr-x 1 root root 31212 Jul 16
9438942 36 -rwxr-xr-x 1 root root 34572 Jul 16
9438943 56 -rwxr-xr-x 1 root root 51212 Jul 16
9438944 56 -rwxr-xr-x 1 root root 51724 Oct 2
9438945 88 -rwxr-xr-x 1 root root 83960 May 11
This will allow a post-processor to read the index as it is written and
build a list of sizes, checksums, and finally the filename + path,
without having to generate a list of checksums every time it runs. The
checksums are generated only once per image, versus recomputing them on
every pass the way finddup/freedup do.
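Here's a rough sketch in Python of what I mean (my own code, not from
dirvish; the field layout is assumed from the example above and the
output format is made up):

#!/usr/bin/env python
# Post-processor sketch: read an ls-style index on stdin and emit
# "size md5 path" lines, hashing each file only once. It assumes the
# layout shown above, with the size in the seventh field and the path
# in the last; adjust to the real index format.
import hashlib
import sys

def md5_of(path, bufsize=1 << 16):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(bufsize), b''):
            h.update(chunk)
    return h.hexdigest()

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 8 or not fields[2].startswith('-'):
        continue                    # regular files only; skip dirs, links
    size, path = int(fields[6]), fields[-1]
    print('%d %s %s' % (size, md5_of(path), path))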
You can also use the index generation in the main dirvish script to
only write an md5sum for non-zero-length files, which eliminates
needless processing. The contrary to that would be, instead of just
skipping md5sum creation for zero-byte files, to replace those files
with sparse equivalents (via lseek) to free up allocated space.
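A minimal sketch of the non-zero-length rule, reusing md5_of() from the
sketch above (the function name here is again made up); the md5 of zero
bytes is a known constant, so empty files never need to be opened:

EMPTY_MD5 = 'd41d8cd98f00b204e9800998ecf8427e'  # md5 of zero bytes

def digest_for(path, size):
    if size == 0:
        return EMPTY_MD5            # or return None to skip empty files
    return md5_of(path)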
Pregenerating the md5sums for files would allow a sorted approach to
hard linking (given that there is a limited number of hard links per
inode), letting you hardlink larger files first, or only files above a
certain size, say 4096 or 8192 bytes, as a user-defined rule.
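A sketch of that size-sorted pass (illustrative names, not dirvish
code): take (size, md5, path) entries from the pregenerated list, link
the biggest duplicates first, and stop at a user-defined size floor.

import os

MIN_SIZE = 8192                     # e.g. the 4096/8192-byte rule above

def link_duplicates(entries):
    seen = {}                       # (size, md5) -> first path seen
    for size, md5, path in sorted(entries, reverse=True):
        if size < MIN_SIZE:
            break                   # descending order: nothing bigger left
        key = (size, md5)
        if key in seen:
            os.unlink(path)           # drop the copy...
            os.link(seen[key], path)  # ...and point it at the original
        else:
            seen[key] = path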
This would only have a costly effect on initial backups via --init,
which in my opinion (though not currently in the code) should ignore
the "checksum: 1" setting by default, removing the "-c" option from the
rsync command line.
There could be optimizations to this, including snagging the md5sum
list from rsync itself (local and remote) if the "checksum: 1" setting
exists for the vault.
Hard linking from these lists should happen in two passes; this needs
to work roughly as follows:
1.) On the fly for all --reference based branches. Read in the
md5sums and use the index creation of the dirvish script as an
opportunity to match md5sums and do the hard linking on the spot
(see the sketch after this list).
2.) On each configured storage bank, do a run over all the index and
md5sum files in storagebank/vaults/(branches), possibly skipping
files marked during the on-the-fly pass. Run this after every
dirvish-runall.
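A sketch of the first pass, assuming the reference branch carries a
pregenerated "size md5 path" list (the file name and function names
here are my own, not dirvish's):

import os

def load_reference(md5sum_file):
    # Load the reference branch's pregenerated list into a lookup table.
    table = {}
    with open(md5sum_file) as f:
        for line in f:
            size, md5, path = line.split(None, 2)
            table[(int(size), md5)] = path.rstrip('\n')
    return table

def link_on_the_fly(table, size, md5, new_path):
    # Called as each index entry is generated: if the reference branch
    # already holds an identical file, hard link to it on the spot.
    old_path = table.get((size, md5))
    if old_path and not os.path.samefile(old_path, new_path):
        os.unlink(new_path)
        os.link(old_path, new_path)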
Now to write the second-pass tool, dirvish-freedup, which emulates
finddup and freedup behavior except that it uses the pregenerated
lists.
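Something along these lines, assuming every branch under the bank keeps
a pregenerated "size md5 path" list named "md5sums" next to its index
(that name and the bank/vault/branch layout are my assumptions):

#!/usr/bin/env python
# dirvish-freedup sketch: walk every pregenerated list in the bank and
# collapse files with matching (size, md5) into hard links.
import glob
import os

def freedup(bank):
    seen = {}                       # (size, md5) -> canonical path
    for listfile in glob.glob(os.path.join(bank, '*', '*', 'md5sums')):
        with open(listfile) as f:
            for line in f:
                size_s, md5, path = line.split(None, 2)
                key = (int(size_s), md5)
                path = path.rstrip('\n')
                first = seen.setdefault(key, path)
                if first != path and not os.path.samefile(first, path):
                    os.unlink(path)         # collapse the duplicate
                    os.link(first, path)    # into a hard link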
Instead of hours to process over 6 million files, you should be looking
at a few minutes, since in snapshots after the initial one only changed
files are indexed in their respective directories.