[Dirvish] Job checkpoint/restart
Brian at MartinConsulting.com
Mon Jun 13 22:55:44 PDT 2005
So the classic Dirvish problem occurs when one is running the first back-up
of a system, or perhaps the first back-up after a major change like a system
upgrade. If you're going across the Internet, you could be looking at a
back-up that runs for several days. It's easy to have a job that's been
running for, say, two days, and then fails due to a network interruption.
In my experience, when dirvish gives up on a job, it deletes everything it
has backed up so far, possibly throwing away several days' worth of work.
That actually makes sense in a normal, day-to-day situation, as you wouldn't
want to use a partial copy (taken, say, on Tuesday) as a link reference, if
you had a full copy (taken on Monday) to use instead.
I've done some manual stuff in this area, but it's not pretty, it only works
if I know the failure is going to happen (i.e. I've decided that one of the
servers has to go down for other reasons), and I'm not even sure it's a
sound procedure. Let's suppose I get a normal back-up on Monday, then do a
big system upgrade Tuesday morning. Tuesday night the back-ups start
running, and would normally run until Friday to get everything that's
changed. Come Thursday, I decide the upgraded machine has to go down for
some emergency fixes. Here's what I do:
1) I start by killing all the dirvish and rsync processes on the system
doing the Tuesday back-up. This stops the job and prevents Dirvish from
cleaning up the incomplete directory.
2) I manually add the appropriate entry into the .hist file, so this partial
copy will be used as a link destination for the next run.
3) If there was an earlier back-up (i.e. Monday) that was being used as a
link destination for this job, I copy all its files into the new tree using
the "cp -a --link" options (something like "cp -a --link Monday/tree
Tuesday"). As far as I can see, this will add in all the files from the
Monday tree that are not present in the Tuesday tree into the Tuesday tree.
A file in the Monday tree that's already present in the Tuesday tree will
generate an error message saying it's already there, thus leaving the
Tuesday copy. This "cp" command basically uses Monday to fill in anything
in Tuesday that didn't get copied. We might get some deleted files back,
and of course the fill-in files are from Monday, not Tuesday, but it gives
us the semblance of a full back-up.
4) After the emergency maintenance I restart the job. Dirvish will use the
Tuesday tree as a reference. Dirvish will spend some extra time looking for
differences in the files that were already copied, but the work done over
Tuesday and Wednesday doesn't need to be repeated, so it basically resumes
copying files where it left off.
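
The fill-in from step 3 can also be sketched in a few lines of Python. This
is a rough illustration, not a tested procedure: the function name
fill_missing_with_hardlinks is made up for this example, and it assumes both
trees live on the same filesystem (hard links require that). Unlike
"cp -a --link", it silently skips anything already present in the Tuesday
tree instead of printing an error for each file, and it does not carry over
directory ownership or timestamps.

```python
import os


def fill_missing_with_hardlinks(reference_tree, partial_tree):
    """Hard-link entries from reference_tree that are absent in partial_tree.

    Entries that already exist in partial_tree are left untouched, so the
    newer (partial) copy always wins. Returns the sorted list of relative
    paths that were filled in.
    """
    filled = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(reference_tree):
        rel_dir = os.path.relpath(dirpath, reference_tree)
        target_dir = os.path.normpath(os.path.join(partial_tree, rel_dir))
        # Assumption: a path that is a directory in the reference tree is
        # not a plain file in the partial tree; that case would raise here.
        os.makedirs(target_dir, exist_ok=True)

        # Symlinks to directories show up in dirnames; treat them like files.
        dir_links = [d for d in dirnames
                     if os.path.islink(os.path.join(dirpath, d))]
        for name in filenames + dir_links:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if os.path.lexists(dst):
                continue  # already backed up in the partial run; keep it
            if os.path.islink(src):
                # Recreate the symlink itself rather than linking its target.
                os.symlink(os.readlink(src), dst)
            else:
                os.link(src, dst)
            filled.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(filled)
```

With the example above it would be called as
fill_missing_with_hardlinks("Monday/tree", "Tuesday/tree"). The same caveats
apply as with the "cp" approach: files deleted since Monday come back, and
the filled-in files hold Monday's contents.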
This seems to work, but it's awful! And it only works if I know the outage
is coming. Is there any better way to do this?