[Dirvish] freedups

Steve Ramage dirvishusersspammedme at sjrx.net
Sun Mar 6 00:27:19 PST 2005


Keith Lofstrom said:

...fixing the kind of problem Steve Ramage had when he accidentally
created a second image set...

Actually, I just checked. That happened early in my Dirvish career, and as it turns out, all the sets from before 'The Great Split' have since expired, so disk space is thankfully back down.

>I can write a nice email back to the author (assuming he is
>the source of the comment) asking him to consider a hash table
>on disk or something.  Still, even with an on-disk table, he
>will have to traverse the directories of the whole dirvish
>vault, and that could take a VERY long time!  
>  
>
A quick Google search, though, has turned up another program (well, 
actually a set of programs that have evolved over time) called dupemerge: 
http://www.furryterror.org/~zblaxell/dupemerge/dupemerge.html

There are several versions, ranging from the original 'dupescan' from 
the mid-90s, which according to the site scales to about 10K files, to 
the latest rendition, which claims 100 million files.

On the homepage there is a comparison of each version; here is the 
entry for the latest one.

faster-dupemerge

Date
    2002-2003 
Scalability
    100M files 
Implementation
    Perl (File::Compare, File::Temp, Fcntl, Digest::SHA1 or external
    md5sum program) + sort + find 
Persistent Storage
    None 
Temporary Storage
    Whatever sort needs (usually /tmp), small Perl arrays 
Style
    Big Bang 
Strengths

        * Low RAM requirement
        * Avoids unnecessary hashing of files until two files of the
          same size seen
        * Supports compare-by-hash and compare-byte-by-byte modes
        * Space and time-efficient sorting of files before processing
        * Full access to find and sort program command line arguments
          for high flexibility
        * Built-in fcntl locking to exclude concurrent access (this was
          required by one of the applications...)
        * Uses no external programs per file if all Perl modules
          installed, but can use an external md5sum program
        * Uses stronger SHA1 hash by default
        * Copes better (but still not very well) with files in
          non-writable directories

Weaknesses

        * Limited by sort program (fortunately the limit is very large)
        * Needs disk space for large sets of files (used by sort)
        * No persistent storage
        * Less portable (needs GNU find, GNU sort)
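
To make the "avoid hashing until two files of the same size are seen" 
idea concrete, here is a rough sketch in Python. It is not taken from 
faster-dupemerge (which is Perl plus find and sort); the function names 
file_sha1 and find_duplicates are made up for illustration, and it keeps 
the size table in memory rather than using an external sort, so it will 
not scale the way the real tool claims to. It only reports duplicate 
groups and does not hard-link anything.

```python
import hashlib
import os
from collections import defaultdict


def file_sha1(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return sorted lists of paths under root whose contents hash identically.

    Hypothetical helper for illustration only; not faster-dupemerge's code.
    """
    # Pass 1: bucket regular files by size. No file contents are read here.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only files that share a size with at least one other file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, so it cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_sha1(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Calling find_duplicates on a directory returns a list of groups, each 
group being the paths that share a size and a SHA-1 digest. Note that 
files which are already hard-linked to each other will also show up as 
a group here, since the sketch does not compare inode numbers; a real 
merge tool would want to skip those.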



I haven't tested or played with it; I was just doing a Google search and 
thought it might be worth mentioning.

Steve R


