[Dirvish] dirvish and weeding out duplicates

Carel Fellinger cf-2003-nn at xs4all.nl
Tue Nov 15 09:42:46 EST 2005


Hi list,

looking for a good backup system I stumbled upon dirvish.  I've just set
up two backup systems, so my experience is very limited, but reading the
wiki I noticed that someone suggested using FreeDups to prune
duplicate files that dirvish failed to recognise as such.  An
interesting idea: this would make it feasible to have two vaults, one
for weekly backups of the full filesystem and one for daily backups of
a limited part of that filesystem, without wasting too much space.

I think I have a very simple and straightforward algorithm for this,
and the good thing is that it runs in linear time.

the idea is to create a pool area, one pool on each partition in use as
a vault.  After dirvish has done its backup dance, a special pool dance
has to follow in which:

  each new file whose MD5sum already names a file in the pool is
  replaced by a hard link to that pool file,

  or, if no such pool entry exists yet, the hard-linking is done the
  other way around: a link to the newly backed-up file is created in
  the pool, named after its MD5sum.
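The pool dance above could be sketched roughly like this in Python
(illustrative only; the helper name pool_file is mine, and it assumes
the pool directory lives on the same filesystem as the vault, since
hard links cannot cross filesystems):

```python
import hashlib
import os

def pool_file(backup_path, pool_dir):
    """Link a newly backed-up file against a content-addressed pool.

    If a pool entry named after the file's MD5 digest already exists,
    the backup copy is replaced by a hard link to it; otherwise the
    pool entry is created as a hard link to the backup copy.
    """
    h = hashlib.md5()
    with open(backup_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    pool_path = os.path.join(pool_dir, h.hexdigest())

    if os.path.exists(pool_path):
        # Duplicate content: point the backup at the pooled copy.
        os.unlink(backup_path)
        os.link(pool_path, backup_path)
    else:
        # First sighting of this content: register it in the pool.
        os.link(backup_path, pool_path)
    return pool_path
```

Running this over every new file makes duplicates across vaults
collapse onto a single inode, which is exactly the space saving aimed
at above.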

I take it that the list of newly added files in the last backup round
can be extracted from the rsync log, and if people fear that MD5sum
isn't distinctive enough, SHA1 could be used instead.
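Assuming rsync was run with --itemize-changes (-i), the transferred
regular files can be picked out of its output by the leading ">f"
marker; a hedged sketch (new_files_from_log is my own name, and the
exact log layout depends on how dirvish invokes rsync):

```python
def new_files_from_log(log_lines):
    """Pick out files rsync actually (re)wrote on the receiver from
    --itemize-changes output.  Lines beginning '>f' mark a regular
    file transferred to the receiver; those are the pooling
    candidates.  The itemize field and the path are separated by a
    single space."""
    files = []
    for line in log_lines:
        if line.startswith(">f"):
            files.append(line.split(" ", 1)[1].rstrip("\n"))
    return files
```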

In order to keep the pool directory from growing unwieldy, it is wise
to create intermediate directories, e.g. by splitting the
MD5sum file name into parts and using those parts as directory names.
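The splitting could look like this (pool_path_for and the two-level,
two-character fan-out are my own illustrative choices):

```python
import os

def pool_path_for(digest, pool_dir, levels=2, width=2):
    """Fan a hex digest out over intermediate directories, e.g.
    'd41d8cd9...' becomes pool/d4/1d/d41d8cd9..., so no single
    pool directory accumulates millions of entries."""
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(pool_dir, *parts, digest)
```

With two levels of two hex characters each, entries spread over
65536 subdirectories.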

Of course dirvish-expire has to be made aware of this change: it
should take into account that each and every file is linked into the
pool too.

It's such an obvious approach that I'm surprised it hasn't been
mentioned on the wiki.  The most likely reason is that I'm overlooking
something simple and it's not going to work.

I'm not sure what will happen in the retrieval process with files that
were hard-linked in the extra dirvish dances above without having been
hard-linked in the original filesystem.  But I think the same holds
for the other proposed schemes.

-- 
groetjes, carel
