Poor Man's data replication for Linux-HA
Stephen C. Tweedie
1999-06-28 17:17:24 UTC
Hi,
Post by Alan Robertson
Set up RAID mirroring between a local disk partition, and a Network
Block Device (NBD). Hack the RAID code so that it prefers local disk to
the network block device, so it always *reads* from the local disk.
Post by Harald Milz
This is great (and exactly what IBM's HAGEO does), but we also need a
transaction-based mechanism to make sure things are a) written in the
correct sequence (serially, in parallel, or otherwise, sync or async across
the network, etc.) and b) reintegrated correctly after a network failure.
As far as the filesystem is concerned, journaling provides the necessary
guarantees. The only requirements are that (a) we don't acknowledge the
entire IO as complete until all copies have been written, and (b) on
failure, we make sure that reads are from one disk only until
reconstruction is complete, to avoid inconsistent data.
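Requirement (a) can be sketched in a few lines of shell. This is only an illustration, not the raid code's actual mechanism: replicate_write and the two image files are made-up names, an ordinary second file stands in for the NBD half of the mirror, and it assumes GNU sync, which accepts file arguments:

```shell
#!/bin/sh
# Illustration of requirement (a): do not acknowledge a write until
# ALL copies are on stable storage. LOCAL and REMOTE are stand-ins
# for the two halves of the mirror (the second would really be NBD).
LOCAL=/tmp/mirror-local.img
REMOTE=/tmp/mirror-remote.img
: > "$LOCAL"
: > "$REMOTE"

replicate_write() {
    # Write the data to both copies first...
    printf '%s\n' "$1" >> "$LOCAL"  || return 1
    printf '%s\n' "$1" >> "$REMOTE" || return 1
    # ...then force both to disk, and only then acknowledge.
    sync "$LOCAL" "$REMOTE" || return 1
    echo ack
}

replicate_write "transaction 42"
```

The real md layer does this per-request at the block level, of course; the point is only the ordering: data on both devices first, completion to the caller second.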

--Stephen
Harald Milz
1999-06-28 17:55:06 UTC
Post by Stephen C. Tweedie
As far as the filesystem is concerned, journaling provides the necessary
guarantees. The only requirements are that (a) we don't acknowledge the
entire IO as complete until all copies have been written, and (b) on
failure, we make sure that reads are from one disk only until
reconstruction is complete, to avoid inconsistent data.
That's exactly what I meant. Alan referred to NBD w/ RAID1, not specifically
to filesystems (actually, as soon as we have raw I/O, we could use NBD
to mirror a raw device across geographies). As for filesystem use, you are right.
--
Harald Milz phone +49 (0) 179 294-9444
SuSE Muenchen GmbH fax +49 (0) 89 4201-7701
Stahlgruberring 28, D-81829 Muenchen email hm at suse.de
http://www.suse.de (Deutsch) http://www.suse.com (English)
Stephen C. Tweedie
1999-06-28 20:47:31 UTC
Hi,
Post by Harald Milz
That's exactly what I meant. Alan referred to NBD w/ RAID1, not specific
for filesystems (actually as soon as we have raw IO, we could use the NBD
to mirror the raw device across a geography). As for FS use, you are right.
The raid code already does this, even if you don't have a filesystem
mounted. If it didn't restrict reads to a single device when doing
recovery, then an fsck running on top would be in danger of reading
inconsistent data.

--Stephen
Alan Robertson
1999-06-12 04:49:38 UTC
The following shared-nothing scheme for data replication is simple,
doable, and wonderfully perverse. I'm not completely sure, but credit
for this idea goes to Alan Cox and/or Mike Wangsmo.

Set up RAID mirroring between a local disk partition, and a Network
Block Device (NBD). Hack the RAID code so that it prefers local disk to
the network block device, so it always *reads* from the local disk.

Voila! A standby copy of the partition you're mirroring.

This idea works best with some kind of journalling filesystem on top of
the mirrored partition, a dedicated 100Mbit network, and an application
which is not write-intensive. Mike Wangsmo has actually tried this.
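For what it's worth, the read-preference "hack" later became a stock md feature. A rough sketch of the recipe with today's tools — device names, host name, and port are all assumptions, and everything here needs root plus a second machine exporting its partition over NBD:

```shell
# On the standby host: export the backing partition over NBD.
# (nbd-server option syntax varies by version; this is the classic form.)
# nbd-server 2000 /dev/sdb1

# On the primary host: attach the remote partition as /dev/nbd0 ...
nbd-client standby-host 2000 /dev/nbd0

# ... and mirror a local partition against it. --write-mostly marks the
# network leg so the md driver prefers the local disk for reads --
# exactly the behaviour described above, without patching anything.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/sda7 --write-mostly /dev/nbd0

mkfs.ext4 /dev/md0       # a journalling fs on top, as recommended
mount /dev/md0 /data
```

This is essentially what DRBD later packaged up properly, with the resync bookkeeping the raw RAID1-over-NBD approach lacks.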


-- Alan Robertson
alanr at bell-labs.com
Steve Underwood
1999-06-12 08:14:14 UTC
To anyone familiar with the innards of the file systems on Linux,

How tough would it be to allow dynamic switching of the mount state of a
Linux (e.g. ext2fs) partition between read only and read/write? The
reason I think this is useful is that a lot of stuff is now being served
up which changes only a few times per day (e.g. news). If the mount
state could be made read/write during those periods, and then changed to
read only in between, a disk (RAID or otherwise) would be much less
vulnerable to data loss if the host fails - only having significant
vulnerability if the failure occurs during the short update periods. The
"partition clean" flag would be set for most of the day, avoiding even
the limited cleanup journaling needs. All data would be guaranteed
flushed to disk, offering zero data loss, which journaling does not
generally do.

If there is huge complexity in dynamic changes of this kind, it's probably
a waste of time. If it's fairly easy to achieve, it seems like a useful
extra tool in the war against failures. A superficial look suggests it
shouldn't be too hard: check the open files, and ensure none are open
with write access; flush any outstanding data to disk, just like a
umount would; and set the partition clean flag. Is there more to it than
I am seeing?
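That sequence maps almost directly onto stock tools. A sketch, assuming a hypothetical mount point /img; note that the kernel itself enforces the open-writers check, failing the remount with EBUSY if any file on the filesystem is still open for writing:

```shell
#!/bin/sh
# Sketch of the switch-to-read-only sequence: flush outstanding data,
# then remount read-only, which also marks the fs clean much as a
# umount would. MNT is an assumption for illustration.
MNT=/img

sync                                     # flush outstanding dirty data
if mount -o remount,ro "$MNT"; then
    echo "$MNT is now read-only (and marked clean)"
else
    echo "cannot switch: files still open for writing on $MNT" >&2
    fuser -vm "$MNT" >&2                 # show who is holding it open
    exit 1
fi
```

Switching back is the same command with remount,rw, as the transcript later in this thread shows.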

Steve
Steve Underwood
1999-06-13 06:27:21 UTC
Hi,
Post by Steve Underwood
How tough would it be to allow dynamic switching of the mount state of a
Linux (e.g. ext2fs) partition between read only and read/write? The
Just calling a command...
Post by Steve Underwood
If there is huge complexity in dynamic changes of this kind its probably
a waste of time. If its fairly easy to achieve it seems like a useful
extra tool in the war against failures. A superficial look suggests it
# mount /dev/sda7 /img
# mount | grep /img
/dev/sda7 on /img type ext2 (rw)
# mount -o remount,ro /img
# mount | grep /img
/dev/sda7 on /img type ext2 (ro)
# mount -o remount,rw /img
# mount | grep /img
/dev/sda7 on /img type ext2 (rw)
Next problem?
I tried something similar to this before posting my original message. Linux
seemed to behave as if a remount were a umount followed by a mount; in other
words, it wouldn't work if there were any open files. Now I have tried
again, and I only get complaints changing from RW to RO when there are files
open with write access. I'm not sure what I am doing differently this time,
but I guess I did something stupid before. Anyway, it does appear you can
switch between RO and RW quite freely, and only a file opened with write
access will cause (appropriate) problems.

Steve
Dominique Chabord
1999-06-12 08:42:45 UTC
I agree with you, Alan.

It's not only a solution for poor men; in fact, several commercial products do no better.
It becomes painful during resynch for any reason, and management outside a cluster is complex.
A full resynch is necessary whenever a disk is repaired. This can be mitigated by mirroring the local disk as well as the remote disk; to avoid doubling the traffic on the net, the remote RAID1 should be handled by the remote computer.
A differential resynch is necessary when the net goes down and comes back, or after an interruption of service on one machine.

A journalled file system is not optional: you will simply not recover after a crash without one, because fsck will usually miss the buffered blocks.

This solution was built by my team in our lab at Digital, for ULTRIX (the old Digital Unix), in 1993. It was made reliable with limited effort. The NBD equivalent at the time was the LAD/LAST protocol (a technology from VMS clusters). It was never launched on the market, because Digital dropped ULTRIX and migrated to OSF/1, which became Digital UNIX. I think you have to go to a specific implementation of LVM; we couldn't do it with LSM from Veritas, and therefore we cancelled the project.

By the way, tell me if I'm wrong: isn't heartbeat challenged by this kind of topology? I mean, if you lose access to the local disk, you can either decide to live with degraded performance or fail over to the other machine. Can heartbeat implement both, depending on the actual situation? Can it prevent failover if the disks are not in sync?


Dominique
-----Original Message-----
From: Alan Robertson <alanr at bell-labs.com>
To: Linux-HA mailing list <linux-ha at muc.de>
Cc: rob at pangalactic.org <rob at pangalactic.org>; kevin at scrye.com <kevin at scrye.com>
Date: Saturday, 12 June 1999 06:54
Subject: Poor Man's data replication for Linux-HA


The following shared-nothing scheme for data replication is simple,
doable, and wonderfully perverse. I'm not completely sure, but credit
for this idea goes to Alan Cox and/or Mike Wangsmo.

Set up RAID mirroring between a local disk partition, and a Network
Block Device (NBD). Hack the RAID code so that it prefers local disk to
the network block device, so it always *reads* from the local disk.

Voila! A standby copy of the partition you're mirroring.

This idea works best with some kind of journalling filesystem on top of
the mirrored partition, a dedicated 100MB network, and an application
which is not write-intensive. Mike Wangsmo has actually tried this.


-- Alan Robertson
alanr at bell-labs.com
Chrisman
1999-06-12 10:30:14 UTC
Post by Steve Underwood
To anyone familiar with the innards of the file systems on Linux,
How tough would it be to allow dynamic switching of the mount state of a
Linux (e.g. ext2fs) partition between read only and read/write? The
reason I think this is useful is that a lot of stuff is now being served
up which changes only a few times per day (e.g. news). If the mount
state could be made read/write during those periods, and then changed to
read only in between, a disk (RAID or otherwise) would be much less
vulnerable to data loss if the host fails - only having significant
vulnerability if the failure occurs during the short update periods. The
"partition clean" flag would be set for most of the day, avoiding even
the limited cleanup journaling needs. All data would be guaranteed
flushed to disk, offering zero data loss, which journaling does not
generally do.
I'm not sure what you are gaining here. If your application is defined
to be not writing data, there won't be data loss/corruption,
regardless of whether your filesystem is mounted read-only or read/write.
I guess with read-only, you can avoid an fsck on takeover. If you were
planning on mounting the fs from both nodes simultaneously, that might
be more difficult, and you'd probably be better off mirroring your
data to the second node at the start of the day when you're doing your
processing. Of course, if you took the fs offline and then remounted
it read-only, you could mount it read-only from another node
(shared-bus storage). This would make you close the files first, though,
which is what I guess you don't want to have to make your application do.
Post by Steve Underwood
If there is huge complexity in dynamic changes of this kind its probably
a waste of time. If its fairly easy to achieve it seems like a useful
extra tool in the war against failures. A superficial look suggests it
shouldn't be too hard - check the open files, and ensure none are open
with write access; flush any outstanding data to disk, just like a
umount would; and set the partition clean flag. Is there more to it than
I am seeing?
Basically you want to be able to remount a filesystem from read/write to
read only... in the same way that right now you can remount the filesystem
from read only to read/write. Will definitely read up on this.. :-)
Post by Steve Underwood
Steve
------------------------------------------------------------------------------
http://www.henge.com/~alanr/ha/
http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
------------------------------------------------------------------------------
Alan Cox
1999-06-12 12:14:15 UTC
Post by Steve Underwood
How tough would it be to allow dynamic switching of the mount state of a
Linux (e.g. ext2fs) partition between read only and read/write? The
You mean doing mount -o rw / mount -o ro automatically on demand. As a kernel
hack it's very easy to do. Doing it right is probably far less trivial. If you
wanted to "do it right" you'd want some kind of thread doing timer-driven
disk tidies.

Problems however:
1. We switch to r/o on a physical media failure. You need to
avoid switching back.

2. Read-only file systems have different semantics from read/write
ones regarding atime updates.

It is a nice idea.

Alan
Alan Robertson
1999-06-12 13:53:30 UTC
Post by Alan Cox
Post by Steve Underwood
How tough would it be to allow dynamic switching of the mount state of a
Linux (e.g. ext2fs) partition between read only and read/write? The
You mean doing mount -o rw/mount -o ro automatically on demand. As a kernel
hack its very easy to do. Doing it right is probably far less trivial. If you
wanted to "do it right" you'd want some kind of thread doing timer driven
disk tidies.
1. We switch to r/o on a physical media failure. You need to
avoid switching back
2. Read only file systems have different semantics to read/write
ones about atime updates.
It is a nice idea
Maybe you could connect it to some variation of an automounter (?), so
that this could occur automatically on extended periods of write
inactivity. Of course, write inactivity is easiest to detect in the
kernel.

-- Alan Robertson
alanr at bell-labs.com
Alan Robertson
1999-06-12 13:45:07 UTC
Dominique Chabord sent this email to the list, and to me directly, but
the copy on the list was blank for some reason. I'm resending it. I'll
comment on it later :-)
-- Alan R.
----------------------------------------

I agree with you, Alan. It's not only a solution for poor men; in fact,
several commercial products do no better.

It becomes painful during resynch for any reason, and management outside a
cluster is complex. A full resynch is necessary whenever a disk is repaired.
This can be mitigated by mirroring the local disk as well as the remote disk;
to avoid doubling the traffic on the net, the remote RAID1 should be handled
by the remote computer. A differential resynch is necessary when the net goes
down and comes back, or after an interruption of service on one machine.

A journalled file system is not optional: you will simply not recover after a
crash without one, because fsck will usually miss the buffered blocks.

This solution was built by my team in our lab at Digital, for ULTRIX (the old
Digital Unix), in 1993. It was made reliable with limited effort. The NBD
equivalent at the time was the LAD/LAST protocol (a technology from VMS
clusters). It was never launched on the market, because Digital dropped
ULTRIX and migrated to OSF/1, which became Digital UNIX. I think you have to
go to a specific implementation of LVM; we couldn't do it with LSM from
Veritas, and therefore we cancelled the project.

By the way, tell me if I'm wrong: isn't heartbeat challenged by this kind of
topology? I mean, if you lose access to the local disk, you can either decide
to live with degraded performance or fail over to the other machine. Can
heartbeat implement both, depending on the actual situation? Can it prevent
failover if the disks are not in sync?

Dominique

-----Original Message-----
From: Alan Robertson <alanr at bell-labs.com>
To: Linux-HA mailing list <linux-ha at muc.de>
Cc: rob at pangalactic.org <rob at pangalactic.org>; kevin at scrye.com <kevin at scrye.com>
Date: Saturday, 12 June 1999 06:54
Subject: Poor Man's data replication for Linux-HA
The following shared-nothing scheme for data replication is simple,
doable, and wonderfully perverse. I'm not completely sure, but credit
for this idea goes to Alan Cox and/or Mike Wangsmo.

Set up RAID mirroring between a local disk partition, and a Network
Block Device (NBD). Hack the RAID code so that it prefers local disk to
the network block device, so it always *reads* from the local disk.

Voila! A standby copy of the partition you're mirroring.

This idea works best with some kind of journalling filesystem on top of
the mirrored partition, a dedicated 100MB network, and an application
which is not write-intensive. Mike Wangsmo has actually tried this.

-- Alan Robertson
alanr at bell-labs.com
wanger
1999-06-27 19:05:18 UTC
Post by Alan Robertson
The following shared-nothing scheme for data replication is simple,
doable, and wonderfully perverse. I'm not completely sure, but credit
for this idea goes to Alan Cox and/or Mike Wangsmo.
Actually, I think Stephen owns a much larger chunk of the credit here
than myself or Alan.
Post by Alan Robertson
Set up RAID mirroring between a local disk partition, and a Network
Block Device (NBD). Hack the RAID code so that it prefers local disk to
the network block device, so it always *reads* from the local disk.
Ingo has this in his queue, but I'm unsure as to whether or not he has
actually done anything with it yet.
Post by Alan Robertson
Voila! A standby copy of the partition you're mirroring.
This idea works best with some kind of journalling filesystem on top of
the mirrored partition, a dedicated 100MB network, and an application
which is not write-intensive. Mike Wangsmo has actually tried this.
Yep, ran it over full-duplex 100Mbit dedicated link. It works pretty
well, except the RAID code is a bit unstable for this right now and the
array breaks apart under loads at times. I haven't had time to play
with it much lately.

Mike

-----------------------------------------------------------------------
Mike Wangsmo Red Hat, Inc

"I've seen this before in Montana! Its snowing, nobody lick a flag
pole" -- Peggy Hill
Harald Milz
1999-06-25 09:50:00 UTC
Post by Alan Robertson
Set up RAID mirroring between a local disk partition, and a Network
Block Device (NBD). Hack the RAID code so that it prefers local disk to
the network block device, so it always *reads* from the local disk.
This is great (and exactly what IBM's HAGEO does), but we also need a
transaction-based mechanism to make sure things are a) written in the
correct sequence (serially, in parallel, or otherwise, sync or async across
the network, etc.) and b) reintegrated correctly after a network failure.
Post by Alan Robertson
This idea works best with some kind of journalling filesystem on top of
the mirrored partition, a dedicated 100MB network, and an application
which is not write-intensive. Mike Wangsmo has actually tried this.
IBM does this with apps which _are_ write-intensive. During my time
at IBM we installed a SAP R/3 system (Oracle V7 database) with HAGEO
disaster recovery like that.
--
Harald Milz phone +49 (0) 179 294-9444
SuSE Muenchen GmbH fax +49 (0) 89 4201-7701
Stahlgruberring 28, D-81829 Muenchen email hm at suse.de
http://www.suse.de (Deutsch) http://www.suse.com (English)
Kris Land
1999-06-24 23:18:56 UTC
Hello folks,

I need help. My company is in the middle of writing a specialized high-speed
device driver that links into the kernel and talks through an API to a
protected code table for doing high-speed RAID 5 and such.

I need help getting it debugged and finished, especially for hot-swapping of drives.

Kris

619-566-2514

Thanks
Post by wanger
Post by Alan Robertson
The following shared-nothing scheme for data replication is simple,
doable, and wonderfully perverse. I'm not completely sure, but credit
for this idea goes to Alan Cox and/or Mike Wangsmo.
Actually, I think Stephen owns a much larger chunk of the credit here
than myself or Alan.
Post by Alan Robertson
Set up RAID mirroring between a local disk partition, and a Network
Block Device (NBD). Hack the RAID code so that it prefers local disk to
the network block device, so it always *reads* from the local disk.
Ingo has this in his queue, but I'm unsure as to whether or not he has
actually done anything with it yet.
Post by Alan Robertson
Voila! A standby copy of the partition you're mirroring.
This idea works best with some kind of journalling filesystem on top of
the mirrored partition, a dedicated 100MB network, and an application
which is not write-intensive. Mike Wangsmo has actually tried this.
Yep, ran it over full-duplex 100Mbit dedicated link. It works pretty
well, except the RAID code is a bit unstable for this right now and the
array breaks apart under loads at times. I haven't had time to play
with it much lately.
Mike
-----------------------------------------------------------------------
Mike Wangsmo Red Hat, Inc
"I've seen this before in Montana! Its snowing, nobody lick a flag
pole" -- Peggy Hill