Keith Lofstrom keithl at kl-ic.com
Sat Mar 5 18:34:40 PST 2005

> Keith Lofstrom said:
... about using Bill Stearns' "freedups" to locate identical files...
> <snip>
> This looks useful.  The quality of the software is unknown, and it
> could use some documentation (limited to a README and the option
> --usage).  I will make a dd copy of a backup drive and see if this
> works over the next few days.
On Sat, Mar 05, 2005 at 08:34:28PM -0500, Jason Boxman wrote:
> It seems to keep a structure in memory with md5s and inodes.  I ran it
> against my vault, which has over a million hard links and likely at least
> 200,000 files.  It seemed to choke rather badly.

Sigh.  Jason,  thanks for being courageous (and saving me the
time)!  A response on the VaultBranch wiki page is called for; 
do you want to do it, or shall I?  I would probably refactor
the information to a FreeDups page, anyway.

I can write a nice email back to the author (assuming he is
the source of the comment) asking him to consider a hash table
on disk or something.  Still, even with an on-disk table, he
will have to traverse the directories of the whole dirvish
vault, and that could take a VERY long time!  
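
For illustration only, here is a minimal Python sketch of what an
on-disk table might look like.  It is not freedups code, and everything
in it is an assumption: the SQLite file, the table layout, and the
function names index_tree and duplicate_groups are hypothetical.  It
hashes each distinct inode once (paths that are already hard links to a
known inode are skipped) and keeps the md5s on disk instead of in
memory.  It still has to walk every directory, so the traversal cost
stays.

```python
import hashlib
import os
import sqlite3
import stat


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def index_tree(root, db_path):
    """Walk root and store one row per distinct inode in an SQLite file.

    Hypothetical layout: inode_key is "<device>:<inode>" kept as text.
    """
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " inode_key TEXT PRIMARY KEY, size INTEGER, md5 TEXT, path TEXT)"
    )
    db.execute("CREATE INDEX IF NOT EXISTS by_content ON files (size, md5)")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, devices, sockets
            key = f"{st.st_dev}:{st.st_ino}"
            known = db.execute(
                "SELECT 1 FROM files WHERE inode_key = ?", (key,)
            ).fetchone()
            if known:
                continue  # already indexed through another hard link
            db.execute(
                "INSERT INTO files VALUES (?, ?, ?, ?)",
                (key, st.st_size, file_md5(path), path),
            )
    db.commit()
    return db


def duplicate_groups(db):
    """Return lists of paths whose inodes differ but whose contents match."""
    groups = []
    query = "SELECT size, md5 FROM files GROUP BY size, md5 HAVING COUNT(*) > 1"
    for size, md5 in db.execute(query).fetchall():
        rows = db.execute(
            "SELECT path FROM files WHERE size = ? AND md5 = ? ORDER BY path",
            (size, md5),
        ).fetchall()
        groups.append([row[0] for row in rows])
    return groups
```

The database file should live outside the tree being indexed, or it
will be picked up by the walk.  A real tool would also confirm matches
byte for byte, and compare ownership and permissions, before linking.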

I wonder, though, if a specialized version of freedups could be
designed to join two sets of similarly-named images (reducing the
search time)?  This would have two applications:  (1) Healing a
large set of branches after a major multiple-machine upgrade, and
(2) fixing the kind of problem Steve Ramage had when he accidentally
created a second image set.  Even for that, freedups2 would need an
on-disk table!  Well, that is Mr. Stearns' problem.
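
To make the "join two image sets" idea concrete, here is a rough Python
sketch under stated assumptions.  It is not part of freedups or dirvish;
join_images and same_metadata are hypothetical names.  It pairs files
by relative path only, so it never searches the whole vault.  A pair is
linked only when both are regular files on the same filesystem with
matching mode, owner, size, mtime, and contents.  It handles a single
pair of images; mapping old inodes to new ones across many later images
is the part that would still want a table.

```python
import filecmp
import os
import stat


def same_metadata(a, b):
    """Compare the stat fields that should match before linking."""
    return (a.st_mode, a.st_uid, a.st_gid, a.st_size, int(a.st_mtime)) == (
        b.st_mode, b.st_uid, b.st_gid, b.st_size, int(b.st_mtime))


def join_images(image_a, image_b):
    """Hard-link files in image_b to identical same-named files in image_a.

    Returns the number of files in image_b that were replaced by links.
    """
    relinked = 0
    for dirpath, _dirnames, filenames in os.walk(image_a):
        rel = os.path.relpath(dirpath, image_a)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(image_b, rel, name)
            try:
                s, d = os.lstat(src), os.lstat(dst)
            except OSError:
                continue  # no counterpart in image_b
            if not (stat.S_ISREG(s.st_mode) and stat.S_ISREG(d.st_mode)):
                continue
            if s.st_dev != d.st_dev:
                continue  # hard links cannot cross filesystems
            if s.st_ino == d.st_ino:
                continue  # already the same inode
            if not same_metadata(s, d):
                continue
            if not filecmp.cmp(src, dst, shallow=False):
                continue  # same size and times, different bytes
            # Link under a temporary name, then rename over the old file,
            # so dst is never missing if the process is interrupted.
            tmp = dst + ".join-tmp"
            os.link(src, tmp)
            os.replace(tmp, dst)
            relinked += 1
    return relinked
```

Try it on a dd copy first, not on the live vault.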


