[Dirvish] Re: Fatal errors with no apparent cause

Keith Lofstrom keithl at kl-ic.com
Tue Mar 29 09:32:33 PST 2005

On Tue, Mar 29, 2005 at 10:38:34PM +1000, Matthew Palmer wrote:
> Aha.  rsync_error (which I knew about but foolishly didn't look at in this
> case) provides errors such as:
> rsync error: timeout in data send/receive (code 30) at io.c(153)
> rsync: connection unexpectedly closed (3018949 bytes received so far) [generator]
> rsync error: error in rsync protocol data stream (code 12) at io.c(359)
> rsync: connection unexpectedly closed (8 bytes received so far) [sender]
> rsync error: error in rsync protocol data stream (code 12) at io.c(359)
> Which seems very, very strange to me -- this is doing a backup of the local
> machine in this instance.  The same errors occur on remote transfers, so I'm
> not convinced it's a network error (although I had some terrible troubles
> with that at first -- damn old routers not liking lots of 1500 octet MTU
> packets...).  I did push the rsync timeout down to 30 seconds, but it
> *certainly* shouldn't fail on the local machine, and since I'm running
> across 100Mb/s ethernet for the other machines, it shouldn't take 30 seconds
> to transfer anything either.

Assuming that the lo interface is configured correctly and dirvish/rsync
is using it properly, you are probably right in ruling out network errors.
I've emphasized in presentations that rsync really thrashes hardware and
software - it sucks up every drop of resources it can find to get the job
done fast.  So if anything else is pushing you to the limit, rsync may
push you right over.

Tell us a little more about your local machine.  Stock kernel or hacked?
2.4 or 2.6?  Distro?  ( uname -a )  What are you using for source and
target disks, and on what type of interfaces?  Which version of rsync
( rsync --version )?  Which version of perl ( perl --version )?  Which
version of ssh ( ssh -V )?  And THAT points out the need for ANOTHER
dirvish feature, one that reports not just the dirvish version, but the
versions of all the components dirvish relies on.

I am more curious than ever what ( top -b > top_log_file ), run overnight,
would report.
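
Until something like that exists, the version questions above can be
answered in one pass with a small script.  This is only a minimal sketch
and not part of dirvish; report_version is a hypothetical helper name, and
it assumes a POSIX shell with awk available.  It prints the first non-empty
line of each tool's version output, or notes that the tool is missing.

```shell
#!/bin/sh
# Hypothetical helper (not a dirvish feature): print one line per component.
# Usage: report_version LABEL COMMAND [ARGS...]
report_version() {
    label=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        # ssh -V writes to stderr, so merge stderr into stdout.
        # perl --version starts with a blank line, so take the first
        # non-empty line rather than simply the first line.
        printf '%s: %s\n' "$label" "$("$@" 2>&1 | awk 'NF { print; exit }')"
    else
        printf '%s: not found\n' "$label"
    fi
}

report_version kernel uname -a
report_version rsync  rsync --version
report_version perl   perl --version
report_version ssh    ssh -V
```

Pasting the output of this into a reply covers everything except the disk
and interface details, which still need to be described by hand.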

I am 90% sure this is NOT a problem with dirvish itself.  But since we
specify the components (dirvish, ssh, a *nix box), we are responsible
for at least determining which of those components is the culprit.

Lastly, since the problem is mediated by rsync, you should search the
rsync list archives for clues.  Try this Google search string:

site:lists.samba.org "rsync error: error in rsync protocol data stream (code 12)"

A lot of people on the rsync list have seen these errors, and they often
resolve to a machine resource issue ( early 2.4 kernels unable to handle
 >1GB RAM, for example ).  

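Before searching, it may help to know which rsync error codes actually
dominate in the rsync_error file.  The following is a rough sketch, assuming
the log lines have the same shape as the ones quoted above
( "rsync error: ... (code N) at ..." ); count_rsync_codes is a hypothetical
name, and the function reads the files given as arguments, or stdin if there
are none.

```shell
#!/bin/sh
# Hypothetical helper: tally rsync error codes found in rsync_error logs.
# Only lines of the form "rsync error: ... (code N) ..." are counted;
# the "rsync: connection unexpectedly closed" lines carry no code.
count_rsync_codes() {
    sed -n 's/^rsync error: .*(code \([0-9][0-9]*\)).*/\1/p' "$@" |
        sort -n |
        uniq -c |
        awk '{ print "code " $2 ": " $1 }'
}
```

Called as count_rsync_codes followed by the path to an image's rsync_error
file (wherever that lives in your vault), it prints one line per code, for
example "code 12: 2".  If one code accounts for nearly all failures, that is
the string to put in the search.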

Keith Lofstrom          keithl at keithl.com         Voice (503)-520-1993
KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
