[Dirvish] dirvish and weeding out duplicates

foner-dirvish at media.mit.edu
Wed Nov 16 02:11:06 EST 2005


I highly, highly recommend "faster-dupemerge", available from [1],
which I actually found a few weeks ago because of a single reference
by Steve Ramage on March 6, 2005 in the archives for this list.  (I'd
decided to read the entire archive when considering using dirvish, and
I'm glad I did.)  faster-dupemerge hardlinks duplicate files together
a la the -H switch in rsync, and it does so efficiently.
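
For anyone who hasn't watched it happen, here's a toy illustration,
with made-up filenames, of what merging two identical files into
hardlinks looks like; after the ln, both names point at one inode, so
the data is stored only once.  faster-dupemerge does the equivalent of
this in bulk:

        $ ls -li a.iso b.iso     # two identical files, two separate inodes
        $ cmp a.iso b.iso && ln -f a.iso b.iso
        $ ls -li a.iso b.iso     # one inode, link count 2, space halved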

On my workloads, its runtime scales approximately linearly with the
number of files, and its memory usage is negligible.  For example, it
took 412 minutes (about 7 hours) to link 2.4 million files into about
2 million files (in 130K directories) in a 300GB ext3fs filesystem,
while using maybe 20 meg of memory to do so, and < 10% CPU.  This is
on a typical Athlon 2800+ with a Seagate IDE 7200rpm disk drive.

Later runs were much faster, because most of the files were already
hardlinked; rerunning faster-dupemerge took only 175 minutes, about
2.35 times faster.

The figures above were in its default mode, which checksums -and-
compares files; I think this is needlessly conservative, and I'll
probably change it.  (I've been in touch with its author [about adding
some additional documentation], and he commented that the comparison has
saved him when he's run into flaky disk hardware.)  On the other hand,
given that recently-read files are cached in RAM by the OS, the
compare step is probably negligible if it immediately follows the
checksum step, because no further disk I/O is necessary---and I don't
see how it would sidestep flakiness, either, if the file only gets
read once.  (I haven't verified whether this is really what's going on.)
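
To make that concrete, the default mode boils down to something like
the following two-step check on a candidate pair (just a sketch with
placeholder filenames; the real script does this internally, and I
haven't looked at exactly which digest it uses or how it orders its
checks):

        $ md5sum fileA fileB              # step 1: do the digests match?
        $ cmp fileA fileB && echo same    # step 2: full byte-for-byte compare

The second command is the one that's nearly free if the files are
still sitting in the page cache from the checksum pass.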

It also takes arguments which can be passed to "find" and used as a
filter, so if you only want to compare a subset of your files, it will
run much, much faster.  I'll give an example of why you'd do that below.
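
For example, something along these lines restricts the merge to big
files only (treat the exact command line as a sketch and check the
usage text from [1]; the directory and the size cutoff here are
placeholders, but the trailing argument is an ordinary find predicate):

        $ faster-dupemerge /backup/vault/mytree -size +1M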

In my case, I had two problems to solve:
(a) I had various near-duplicates of several filesystems on one disk,
    because they'd been moved around from machine to machine over the
    last few years: typically exactly-copied trees of files from older
    machines with smaller disks, snapshots of installations of various
    Linux distros (which usually have lots of mostly-identical files
    between them), and so on.  This wastes a lot of space, and while
    -eventually- all but one copy will go away, that day isn't today.
(b) I have a few large files (ISO's, the occasional gigantic tar, etc.),
    and I wanted to be able to (1) have a long retention time yet
    frequent backups in dirvish, (2) move them around or rename them
    if I decided to reorganize a directory, but (3) not have multiple
    copies of them taking up enormous amounts of space in my vault.
    (Moving something around means that rsync can't hardlink it to its
    old location, so each move means I'd have a complete copy of the
    file, no matter how cleverly each set of snapshots before (or
    after) the move was using hardlinks; see the sketch just after
    this list.)
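
That parenthetical in (b) is just a consequence of how dirvish drives
rsync: the --link-dest mechanism only hardlinks a file that is
unchanged -at the same path- as in the reference image, so a file that
was merely renamed on the source shows up as a brand-new copy in the
next image.  Roughly (paths and dates are placeholders):

        $ rsync -aH --link-dest=/backup/vault/2005-11-15/tree \
              /home/ /backup/vault/2005-11-16/tree/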

Here's how I used faster-dupemerge to solve both of these problems:

(a) I ran it on the source disk, once, and told it to merge absolutely
    everything it could; it reclaimed 17GB and 400K inodes.  Wow!
    Yup, I had a bunch of duplicated stuff---and now it's -really-
    easy to figure out -what- was duplicated, because I need only look
    for regular files that have more than one hardlink (see the find
    sketch just after this list).  (This would probably be a good
    preprocessing step for a utility that uses the
    information to recursively consolidate entire directories and
    trees of directories; anybody know of a tool for this?  If not,
    I'll probably write it someday.)
(b) After each dirvish run, I run it on the most recent pair of runs,
    e.g., the one that just finished, and the one before that.  Any
    file that I moved around on the source machine (the one being
    backed up) and which consequently got moved in the latest snapshot
    will wind up getting hardlinked by faster-dupemerge, and hence
    there's -still- only one copy of the file in the vault.
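
For (a), spotting what actually got merged afterward is a one-liner,
since the merged duplicates are exactly the regular files whose link
count is now greater than one; this prints link count, inode number,
and path for each, sorted by link count (GNU find; the path is a
placeholder):

        $ find /src -xdev -type f -links +1 -printf '%n %i %p\n' | sort -n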

Now, what about those "find" arguments?  Well, I don't really want
faster-dupemerge to bang on the backup disk for 6 hours or so after
each run.  (The dirvish runs already take about an hour, and the vast
majority of that time is spent in rsync -H, which is computationally
expensive because of all the work it must do to preserve hardlinks.)

On the other hand, the sizes of individual files in most filesystems
look exponentially distributed, or like a power law.  This means that
the vast majority of files are tiny---and it's pointless to try to
account for the move of every tiny file just to save a small amount
of space if it takes 6 hours to do it.

For example, in my filesystem, here's a table of the number of files
bigger than 0.1, 1, 10, 100, and 1000 MB, and the time it took for
faster-dupemerge to check for duplicates among that subset:

        size (>)   files    time (min)
        0          ~2M      175
        100 KB     92,000   17.5
        1 MB       9,500    8
        10 MB      1,400    7
        100 MB     200      6.6
        1 GB       30       6.5

Clearly, it takes around 6 minutes just to scan the filesystem at all,
no matter how many files are selected.  And if I want to check files
all the way down to 100 KB (pretty small), it's only going to take
about 18 minutes to do so---a far cry from the 6 hours it'd take to
check -every- file.  [Note that the 200 files > 100 MB -include- the
30 files > 1 GB, etc.]
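
For what it's worth, counts like the ones in that table are easy to
get from find alone, which is also a quick way to pick a sensible size
cutoff for your own data (GNU find size suffixes; the path is a
placeholder):

        $ find /backup/vault -xdev -type f -size +100k | wc -l
        $ find /backup/vault -xdev -type f -size +1M   | wc -l
        $ find /backup/vault -xdev -type f -size +1G   | wc -l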

This solution is -far- easier (and more effective!) than either of my
original two ideas, which were to segregate all the large files in one
place with a different retention policy, or to run dirvish twice for
each tree, putting either all the small files or all the large files
into their own snapshots (again so they could have different
policies), which would have required the very new --min-size/--max-size
options in the (currently CVS-only) version of rsync.

P.S.  This strategy -did- trip over one problem, however.  One of the
directories in the source filesystem had 1433 identical files, each of
which said, essentially, "this helpfile has been superseded by the
manual", and there turned out to be 4 copies of this directory hanging
around.
This means that the single copy of those bits on the source disk had 5732
hardlinks to it.  So the first dirvish run created, similarly, a file
with 5732 hardlinks to it, and successive runs upped that to 11464,
17196, 22928, 28660, and -boom-:  rsync got thousands of errors in
that directory on the next dirvish run because the filesystem
apparently can't cope with more than 32000 (not 32768, oddly)
hardlinks to a single file.  I think I'll replace that directory
everywhere with a gzipped tar file; gzip will no doubt compress all
that identical data down to a tiny size, and I won't have a file
hanging around with thousands of hardlinks to it.  But keep this
failure mode in mind...
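
Two sketches related to that failure mode (placeholder paths).  The
first flags any file whose link count is creeping up on ext3's
32000-link ceiling before rsync starts throwing errors; the second is
the tar-and-gzip replacement described above:

        # flag files within a few thousand links of the ext3 cap
        $ find /backup/vault -xdev -type f -links +30000 -printf '%n %p\n'

        # swap the directory of identical help files for one compressed tarball
        $ tar czf helpfiles.tar.gz helpfiles/ && rm -rf helpfiles/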

[1] http://www.furryterror.org/~zblaxell/projects/dupemerge/dupemerge.html

