[Dirvish] Fatal errors with no apparent cause
keithl at kl-ic.com
Mon Mar 28 23:40:33 PST 2005
On Tue, Mar 29, 2005 at 10:37:53AM +1000, Matthew Palmer wrote:
> Every night, my dirvish runs produce error messages like this:
> dirvish vulcan/home:default fatal error: write error, filesystem probably full
> dirvish vulcan/home:default fatal error (12) -- write error, filesystem probably full
> dirvish goldwing/var:default fatal error: write error, filesystem probably full
> dirvish goldwing/var:default fatal error (12) -- write error, filesystem probably full
> It's different vaults every night, but it's rare if there isn't at least one
> that fails, but most of them work each night.
> In no case is the filesystem full, either in space or (as far as I can
> determine with dumpe2fs) inodes:
> Free blocks: 27583157
> Free inodes: 48609129
> I also can't find any hardware errors in dmesg.
> I'm not sure if it's appropriate, but (12) is ENOMEM in errno.h, which
> doesn't seem right, since the machine most of this is running on (vulcan)
> has oodles of memory, and I've never had any problems running a dirvish
> instance manually. I'm guessing that (12) is relating to something else,
> but I'm stuffed if I can work out what.
Welcome to dirvish! Your errors are in /var/log/messages, right?
That error is caused by rsync throwing either a 'failed to write XXXX
bytes:' error or a 'write failed' error. Since you observe that it
happens with different vaults every night, it may be caused by some
other memory hog running at the same time.
One thing that might be happening is that /etc/cron.daily/slocate.cron
is running updatedb at the same time, and you neglected to exclude
the dirvish bank directories from the -e excluded directories. That
will use an enormous amount of memory before it fails.
You should look at the $VAULT/log files that dirvish makes alongside
the tree image. If that doesn't suggest a cure, you can add the lines
To /etc/dirvish/master.conf, for a big steaming heap of debug output
in your log file. You can also start a top -b > top_log_file to
run overnight (collecting about 6MB/hr of activity).
Rsync can be a memory hog. If you have millions of little files, it
can run out of memory and fail. rsync needs a partial rewrite to fix
this, and the workaround is to artificially construct smaller vaults.
But since you say it happens on a different vault every night, I don't
think this is your problem. An Rsync memory hog failure would show up
in the top log file if you make one.
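If you do collect a top log, here is a rough Python sketch for pulling the memory hogs out of it. It is only a sketch: it assumes the default procps column layout (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND), and the function names are made up for this example, so adjust the pattern if your top prints something different.

```python
import re
from collections import defaultdict

# Matches one process row of `top -b` output. ASSUMPTION: default procps
# columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND.
# Header and summary lines do not start with a PID, so they are skipped.
ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S\s+"
    r"(?P<cpu>[\d.]+)\s+(?P<mem>[\d.]+)\s+\S+\s+(?P<cmd>.+?)\s*$"
)

def peak_mem_by_command(lines):
    """Return {command: highest %MEM seen in any sample of the log}."""
    peaks = defaultdict(float)
    for line in lines:
        m = ROW.match(line)
        if m:
            cmd = m.group("cmd")
            peaks[cmd] = max(peaks[cmd], float(m.group("mem")))
    return dict(peaks)

def top_offenders(lines, n=5):
    """Return the n commands with the largest peak %MEM, biggest first."""
    peaks = peak_mem_by_command(lines)
    return sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Calling top_offenders(open("top_log_file")) the next morning should show whether rsync, updatedb, or something else was the biggest consumer overnight.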
Let us know what you discover, and if you learn a Cool Trick, put it
on the wiki. If you are still stumped after trying all this, put some
files on your server for us to look at, and send us a pointer. Copies
of your config files, master and vault, a typical failing image log
and rsync_error file would give us a better idea of what is going on.
This does point out a bad feature of dirvish - that it rewrites these
two different errors to "write error, filesystem probably full".
Those rewrites are occurring in the @erraction table in the errorscan
sub. The rewrites seem to be more troublesome than helpful. Should
we simplify the code and take the rewrites out?
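To make the trade-off concrete, here is a small Python sketch of the idea. It is not the dirvish code (dirvish is Perl, and the real table lives in errorscan); the table, the function name, and the sample rsync lines in the usage note are all hypothetical. It just shows how two different rsync messages collapse into one rewritten message, and what you would see with the rewrite turned off.

```python
import re

# HYPOTHETICAL stand-in for a pattern-to-message rewrite table.
# Both patterns map to the same text, so the original cause is hidden.
REWRITES = [
    (re.compile(r"failed to write \d+ bytes"),
     "write error, filesystem probably full"),
    (re.compile(r"write failed"),
     "write error, filesystem probably full"),
]

def report(rsync_line, rewrite=True):
    """Return the message that would be logged for one rsync error line."""
    if rewrite:
        for pattern, message in REWRITES:
            if pattern.search(rsync_line):
                return message
    # No rewrite (or no match): pass the rsync text through unchanged.
    return rsync_line.strip()
```

With rewrite=True, made-up lines such as "rsync: write failed on x: Cannot allocate memory" and "rsync: failed to write 4096 bytes: Broken pipe" both come out as "write error, filesystem probably full". With rewrite=False each keeps its own text, which is the information you would want when the filesystem is not actually full.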
Keith Lofstrom keithl at keithl.com Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs