[Dirvish] Re: Fatal errors with no apparent cause

Matthew Palmer mpalmer at hezmatt.org
Tue Mar 29 04:38:34 PST 2005


On Mon, Mar 28, 2005 at 11:40:33PM -0800, Keith Lofstrom wrote:
> Welcome to dirvish!   Your errors are in /var/log/messages, right?
> That error is caused by rsync throwing either a 'failed to write XXXX
> bytes:' error or a 'write failed' error.   Since you observe that it
> happens with different vaults every night, it may be caused by some
> other memory hog running at the same time.

Aha.  rsync_error (which I knew about but foolishly didn't look at in this
case) provides errors such as:

rsync error: timeout in data send/receive (code 30) at io.c(153)
rsync: connection unexpectedly closed (3018949 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(359)
rsync: connection unexpectedly closed (8 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(359)

Which seems very, very strange to me -- this is doing a backup of the local
machine in this instance.  The same errors occur on remote transfers, so I'm
not convinced it's a network error (although I had some terrible troubles
with that at first -- damn old routers not liking lots of 1500 octet MTU
packets...).  I did push the rsync timeout down to 30 seconds, but it
*certainly* shouldn't fail on the local machine, and since I'm running
across 100Mb/s ethernet for the other machines, it shouldn't take 30 seconds
to transfer anything either. 
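
For anyone wanting to see which rsync failures dominate across several nights, the lines in rsync_error can be tallied by exit code. The following is only a rough sketch in Python: the function name and the regular expression are illustrative assumptions based on the error lines quoted above, not anything dirvish or rsync ships.

```python
import re
from collections import Counter

# Matches lines of the form seen in rsync_error, for example:
#   rsync error: timeout in data send/receive (code 30) at io.c(153)
# The pattern is an assumption derived from the sample lines above.
RSYNC_ERROR = re.compile(r"^rsync error: (?P<text>.+?) \(code (?P<code>\d+)\)")


def tally_rsync_errors(lines):
    """Count 'rsync error:' lines, keyed by (exit code, message text)."""
    counts = Counter()
    for line in lines:
        match = RSYNC_ERROR.match(line.strip())
        if match:
            counts[(int(match.group("code")), match.group("text"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "rsync error: timeout in data send/receive (code 30) at io.c(153)",
        "rsync: connection unexpectedly closed (8 bytes received so far) [sender]",
        "rsync error: error in rsync protocol data stream (code 12) at io.c(359)",
    ]
    for (code, text), n in sorted(tally_rsync_errors(sample).items()):
        print(f"code {code}: {text} (x{n})")
```

Lines that start with "rsync:" rather than "rsync error:" are skipped on purpose, since they carry no exit code.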

> One thing that might be happening is that /etc/cron.daily/slocate.cron
> is running updatedb at the same time, and you neglected to exclude
> the dirvish bank directories from the -e excluded directories.  That
> will use an enormous amount of memory before it fails.

Good point, hadn't considered that.  I've removed the bank from locate's
purview, which should solve that particular problem (if that's the cause).

> You should look at the $VAULT/log files that dirvish makes alongside

Lots of "broken pipe" messages.  Tallies up with the I/O timeout messages
earlier, but doesn't help me so much with a cure.

> rsync-option:
>       -vv

I'll do that for tonight.  See if we get anything useful.

> This does point out a bad feature of dirvish - that it rewrites these
> two different errors to "write error, filesystem probably full". 
> Those rewrites are occurring in the @erraction table in the errorscan
> sub.  The rewrites seem to be more troublesome than helpful.  Should
> we simplify the code and take the rewrites out?

I don't think hiding error messages is a particularly handy thing to do for
a tool of this type, and guessing at the cause ("filesystem probably full")
confused the heck out of me for a while.
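
One possible middle ground between rewriting and removing is to keep the original rsync message and only append the guess as a hint. The sketch below is in Python purely for illustration (dirvish itself is Perl); the HINTS table and the annotate function are hypothetical and do not reflect the actual contents of @erraction.

```python
import re

# Hypothetical hint table for illustration only; this is NOT dirvish's
# actual @erraction table. The patterns are assumptions based on the two
# rsync messages mentioned in the quoted text.
HINTS = [
    (re.compile(r"failed to write \d+ bytes"), "filesystem may be full"),
    (re.compile(r"write failed"), "filesystem may be full"),
]


def annotate(line):
    """Return the original line unchanged, with a hint appended if one matches."""
    for pattern, hint in HINTS:
        if pattern.search(line):
            return f"{line}  [hint: {hint}]"
    return line


if __name__ == "__main__":
    print(annotate("rsync: write failed on /backup/file"))
    print(annotate("rsync error: timeout in data send/receive (code 30) at io.c(153)"))
```

With this approach the raw error always reaches the log, and the guess is visibly marked as a guess.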

- Matt