[Dirvish] Combining branches after upgrades

Keith Lofstrom keithl at kl-ic.com
Sat Sep 29 17:37:51 UTC 2007

This describes a possible new feature for dirvish, which might save
backup disk space and network time.

The dirvish "branch" option is useful for those who run many machines
with the same distribution or similar images.  When the initial image
for a branch is made, it may be hardlinked to another existing branch,
saving backup space.

Over time, however, the branch images diverge unnecessarily, especially
with automatic upgrades enabled.  For example, machines alpha, beta,
and gamma may all start out with identical copies of gimp, and if they
are set up as branch images, they will all point their backup images
at the same gimp files through hard links.  Each evening's backups
add three more hard links to each of those files, one per branch.
However, if gimp is updated, new gimp files will appear in each of the
branch backup images, not linked to the others.  They are identical
files, but because they are not hard linked, they waste backup disk
space.
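
To make the difference concrete, here is a minimal Python sketch that
compares a hard link with an identical but separate copy.  The
directory and file names are hypothetical stand-ins for branch images;
nothing here is dirvish code.

```python
import os
import shutil
import tempfile


def same_inode(path_a, path_b):
    """Return True if both paths are hard links to one file on disk."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo():
    with tempfile.TemporaryDirectory() as top:
        # Hypothetical image directories for three branches.
        for name in ("alpha", "beta", "gamma"):
            os.mkdir(os.path.join(top, name))
        alpha = os.path.join(top, "alpha", "gimp")
        beta = os.path.join(top, "beta", "gimp")
        gamma = os.path.join(top, "gamma", "gimp")

        with open(alpha, "wb") as f:
            f.write(b"gimp binary, version 1\n")

        os.link(alpha, beta)        # shared storage, like a --link-dest match
        shutil.copy2(alpha, gamma)  # same bytes, but a second copy on disk

        linked = same_inode(alpha, beta)
        copied = same_inode(alpha, gamma)
        return linked, copied, os.stat(alpha).st_nlink


if __name__ == "__main__":
    print(demo())
```

The beta file shares storage with alpha; the gamma file has the same
contents but occupies its own disk blocks, which is the situation the
branches end up in after independent upgrades.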

Rsync hard linking is performed using the --link-dest option.  When
dirvish makes a new daily image, it passes --link-dest pointing at the
previous daily image.  When dirvish --init is used to make a branch,
it passes --link-dest pointing at the image it is branching from.
However, after the branch is made, and after files are upgraded, there
is no further connection between the upgraded files in those images.

Since version 2.6.4, rsync has allowed multiple --link-dest target
directories.  If a file is not found in the first --link-dest target,
rsync will search the next --link-dest target, then the next, and so
forth.  This is more work for rsync (I don't know how much more!) but
it does create more opportunities for hard linking and space savings.
It also saves network transfer time, since a file that can be hard
linked from a local target does not need to be sent again.
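
As a rough illustration, the Python sketch below assembles an rsync
command line with several --link-dest targets in search order.  The
helper name, host name, and paths are made up for the example, and
this is not how dirvish itself builds its rsync call.

```python
def build_rsync_args(source, dest, link_dests):
    """Return an rsync argv that tries each link-dest target in order."""
    args = ["rsync", "-a"]
    for target in link_dests:
        # rsync searches the targets in the order they are given.
        args.append("--link-dest=" + target)
    args += [source, dest]
    return args


if __name__ == "__main__":
    # Hypothetical paths: beta's new image links first against beta's
    # previous image, then against an image from another branch.
    argv = build_rsync_args(
        "beta.example.com:/",
        "/backup/vault/beta-today/tree/",
        [
            "/backup/vault/beta-yesterday/tree",
            "/backup/vault/alpha-today/tree",
        ],
    )
    print(" ".join(argv))
```

Absolute paths are used for the targets because rsync interprets a
relative --link-dest path relative to the destination directory.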

If we wanted to incorporate this behavior into a future version of
dirvish, the most obvious (and incorrect!) thing to do would be to give
each branch two --link-dest targets, the first being the branch's
previous image, the second being the latest image of the branch it
originally branched from.  Thus, if beta and gamma were originally
branches of alpha, then if alpha was updated, and beta and gamma were
subsequently updated, they would all share the hard links.

However, alpha might get updated, or backed up, after beta and gamma.
This is the case for my machines - my "alpha" machine for a new distro
or new dirvish backup disk is a low priority test machine which can
tolerate long initialization times.  My high priority machines are
branches off of that (they initialize much faster that way).  However,
during normal operation I back up the high priority machines first,
because I need them back in production soonest.

A more robust (and more time consuming) way is to do "all to all"
--link-dest: every branch checks every other branch before deciding to
create a new file.  That would maximize hard link possibilities, but
each of the N branches would check up to N-1 others, which may take a
long time, possibly resulting in an order N^2 slowdown for every
changing file.  Since most files don't change, that may not be much
slowdown in absolute terms, and it would save network bandwidth, so it
may still be a win in time savings.  A lot depends on the behavior of
rsync.
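
A back-of-the-envelope Python sketch of that scaling follows.  It
assumes, purely for illustration, that rsync does one lookup per
--link-dest target for a changed file that matches nowhere; the real
cost depends on rsync internals I have not measured.

```python
def worst_case_checks(n_branches, scheme):
    """Lookups for one changed file, summed over all branches.

    Assumes one lookup per --link-dest target when nothing matches.
    """
    targets_per_branch = {
        "previous-only": 1,         # today's behavior: own previous image
        "two-targets": 2,           # own previous image plus one other
        "all-to-all": n_branches,   # own previous image plus N-1 others
    }[scheme]
    return n_branches * targets_per_branch


if __name__ == "__main__":
    for n in (3, 10, 30):
        row = [worst_case_checks(n, s)
               for s in ("previous-only", "two-targets", "all-to-all")]
        print(n, row)
```

Under that assumption the all-to-all count grows as N^2 while the
other two grow linearly.  All-to-all would also run up against the
limit on the number of --link-dest directories that the rsync man
page mentions (20, if I recall it correctly).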

A possible compromise is to do two link-dests, and make the second
link-dest to the previously backed-up image in the vault.  Thus, if
three machines are backed up on day N in the order beta(N), alpha(N),
gamma(N), then branch image gamma(N) will point its first --link-dest
at image gamma(N-1), and its second --link-dest at image alpha(N). 
beta(N+1) will point at beta(N) and then gamma(N).  This would require
keeping track of the last successful backup in a vault; the easy way
is to leave a symlink after each successful backup, and always use
that symlink as the second --link-dest target.  This might miss some
file linking opportunities, especially if updates on some machines are
running at the same time as backups, and it might also fail to connect
identical files with slightly differing modification times if the
checksum option is not used.  This "link-to-previous-image-in-the-vault"
behavior would be relatively easy to implement as a new option.
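
The symlink bookkeeping could look something like the Python sketch
below.  The symlink name "last-successful" and both function names are
invented for this example and are not existing dirvish conventions.
For brevity the targets are the image directories themselves; a real
implementation would point at the file tree inside each image.

```python
import os
import tempfile

LAST_LINK = "last-successful"  # hypothetical name, not a dirvish file


def mark_success(vault, image_name):
    """Point the vault's symlink at the image that just finished."""
    link = os.path.join(vault, LAST_LINK)
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(image_name, tmp)  # relative target inside the vault
    os.replace(tmp, link)        # rename over the old symlink


def link_dest_targets(vault, own_previous):
    """Return link-dest targets: own previous image, then last success."""
    targets = []
    if own_previous:
        targets.append(os.path.join(vault, own_previous))
    link = os.path.join(vault, LAST_LINK)
    if os.path.islink(link):
        other = os.path.join(vault, os.readlink(link))
        # Skip a dangling symlink or a duplicate of the first target.
        if os.path.isdir(other) and other not in targets:
            targets.append(other)
    return targets


if __name__ == "__main__":
    vault = tempfile.mkdtemp()
    for image in ("gamma-day0", "beta-day1", "alpha-day1"):
        os.mkdir(os.path.join(vault, image))
    # Day 1 runs in the order beta, alpha, gamma.
    mark_success(vault, "beta-day1")
    mark_success(vault, "alpha-day1")
    print(link_dest_targets(vault, "gamma-day0"))
```

In the usage example, gamma's new image gets gamma's previous image
as its first target and alpha's image from the same day as its second,
matching the ordering described above.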

Perhaps you have better ideas.  Perhaps it isn't worth it, because
it uses extra disk accesses to save some disk space and network
bandwidth, and that may not be a good tradeoff.  What do you think?


Keith Lofstrom          keithl at keithl.com         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
