[Dirvish] dirvish and weeding out duplicates
cf-2003-nn at xs4all.nl
Tue Nov 15 09:42:46 EST 2005
Looking for a good backup system, I stumbled upon dirvish. I've only just
set up two backup systems, so my experience is very limited, but reading
the wiki I noticed that someone suggested using FreeDups to prune
duplicate files that dirvish itself doesn't catch. An interesting idea:
this would make it feasible to have two vaults, one for weekly backups of
the full filesystem and one for daily backups of a limited part of that
filesystem, without wasting too much space.
I think I have a very simple and straightforward algorithm for this, and
the good thing is that it is linear in time: each new file is read and
hashed just once.
The idea is to create a pool area, one pool on each partition in use as
a vault (hard links cannot cross partitions). After dirvish has done its
backup dance, a special pool dance has to follow, in which:
  - each new file is replaced by a hard link to the pool file whose
    name equals the MD5 sum of that new file,
  - or, if no such file exists in the pool yet, the hard linking goes
    the other way around: a hard link to the newly backed-up file is
    created in the pool, named after its MD5 sum.
I take it that the list of files newly added in the last backup
round can be extracted from the rsync log, and if people fear that
MD5 isn't distinctive enough, SHA1 could be used instead.
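
To make the pool dance concrete, here is a minimal sketch in Python. It
is not part of dirvish; the names pool_file and file_digest and the
".pooltmp" suffix are made up for illustration, and the pool is kept
flat (no intermediate dirs) to keep it short. It assumes the pool lives
on the same partition as the vault.

```python
import hashlib
import os


def file_digest(path, algo="md5", chunk=1 << 20):
    """Return the hex digest of a file's contents, read in chunks."""
    h = hashlib.new(algo)  # algo="sha1" works the same way
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def pool_file(path, pool_dir, algo="md5"):
    """Deduplicate one newly backed-up file against the pool.

    Returns "linked" if the file is (now) a hard link to an existing
    pool entry, "added" if it became a new pool entry, and "skipped"
    for anything that is not a regular file.
    """
    if os.path.islink(path) or not os.path.isfile(path):
        return "skipped"
    pooled = os.path.join(pool_dir, file_digest(path, algo))
    if os.path.exists(pooled):
        if not os.path.samefile(path, pooled):
            # Link the pool entry under a temporary name, then rename it
            # over the new file, so the file never goes missing.
            tmp = path + ".pooltmp"  # hypothetical temp-name convention
            os.link(pooled, tmp)
            os.replace(tmp, path)
        return "linked"
    # Not in the pool yet: link the other way around.
    os.link(path, pooled)
    return "added"
```

One would call pool_file() once for every path taken from the rsync log.
Since hard links share a single inode, this sketch also merges files
that have identical contents but different owner, mode or mtime.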
To keep the pool directory from growing unwieldy, it is wise to create
intermediate dirs, e.g. by splitting the MD5-sum file name and using
those parts as dir names.
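
A small sketch of what such a split could look like, again in Python.
The function name pool_path and the choice of two levels of two hex
characters each are assumptions for the example, nothing dirvish
prescribes.

```python
import os


def pool_path(pool_dir, digest, levels=2, width=2):
    """Map a hex digest to a nested path like pool/d4/1d/d41d8cd9..."""
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(pool_dir, *parts, digest)
```

With two levels of two hex characters the pool is spread over at most
256 * 256 = 65536 leaf dirs. The intermediate dirs have to exist before
linking, e.g. via os.makedirs(os.path.dirname(p), exist_ok=True).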
Of course dirvish-expire has to be made aware of this change: it should
take into account that each and every file is linked into the pool too.
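
One possible way to handle this, sketched in Python: after an image has
been expired, walk the pool and drop every entry whose link count has
fallen to 1, since then only the pool itself still refers to it. This is
my own guess at a cleanup step, not how dirvish-expire actually works,
and prune_pool is a made-up name.

```python
import os


def prune_pool(pool_dir):
    """Remove pool entries that no backup image links to any more.

    Returns the number of pool entries removed.
    """
    removed = 0
    for root, _dirs, files in os.walk(pool_dir):
        for name in files:
            entry = os.path.join(root, name)
            # A link count of 1 means the pool holds the only link left.
            if os.lstat(entry).st_nlink == 1:
                os.unlink(entry)
                removed += 1
    return removed
```

Without such a step the disk space of an expired image is never freed,
because the pool keeps every file alive.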
It's such an obvious approach that I'm surprised it hasn't been
mentioned on the wiki. The most likely reason is that I'm overlooking
something simple and it's not going to work.
I'm not sure what will happen in the retrieval process with files that
were hard-linked by the extra pool dance above without having been
hard-linked in the original fs. But I think the same holds for the
other proposed schemes.