IMHO, a good thing in SA is its mechanism to deal with resource dependencies.
Do you know how FailSafe handles resource dependencies ?
If so, could you please comment a bit about it?
I'm not Lars ;-) but I'll try to provide an overall comparison here,
although I haven't used FailSafe, just read about it. I'll also provide
information from HACMP, since I know that reasonably well. Of course, only
FailSafe is currently available on Linux, but if the goal is to provide a
comprehensive HA platform on Linux, input is good! (HACMP documentation is
primarily the 'Concepts and Facilities' book to begin with.)
I might get bits about FailSafe or SA wrong, as I don't have extensive
experience with either, but I'm sure someone will correct me ;-)
Both FailSafe and HACMP provide 'Resource Groups' and in both they mean the
exact same thing - they encapsulate a set of related resources and they are
the failover unit. HACMP only recognizes a defined set of resource types,
e.g., nodes, IP addresses, disk volumes, file systems, applications.
FailSafe provides an ability to define new resource types, provided you
supply the scripts to control that resource. SA can deal with
individual resources and also provides 'application groups' which collect
related resources together.
SA uses 'agents' to manage resources, and has a model for how these work.
Phoenix (RSCT) likewise has a component that provides an API that can be
used to build such agents (the RMC component, a follow-on to the Event
Management work I described.) Unfortunately, RMC is only available on AIX;
its release on Linux and/or as OSS is not currently announced. Send requests
to the usual place...
HACMP and FailSafe both use scripts to manage resources (most provided by
the system, but users can add their own, or, if they're very brave, modify
the system-provided ones.)
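To illustrate the shape such control scripts tend to take, here is a minimal sketch of the action-dispatch pattern: one handler per action (start, stop, monitor), selected by a command-line argument. The action names, handlers, and exit codes are my invention for illustration, not HACMP's or FailSafe's actual interface.

```python
import sys

# Hypothetical resource-control handler in the spirit of the scripts
# HACMP and FailSafe use: one program per resource type, dispatching
# on an action argument.  Actions and exit codes are illustrative only.

def start(resource):
    # bring the resource online, e.g. mount a file system
    print(f"starting {resource}")
    return 0

def stop(resource):
    # take the resource offline cleanly
    print(f"stopping {resource}")
    return 0

def monitor(resource):
    # return 0 if the resource is healthy, non-zero otherwise
    print(f"monitoring {resource}")
    return 0

ACTIONS = {"start": start, "stop": stop, "monitor": monitor}

def dispatch(action, resource):
    handler = ACTIONS.get(action)
    if handler is None:
        print(f"unknown action: {action}", file=sys.stderr)
        return 2  # unknown action
    return handler(resource)

if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(dispatch(sys.argv[1], sys.argv[2]))
```

The point of the pattern is that the cluster manager only needs to agree with the script on the action vocabulary; the per-resource logic stays inside the script.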
Note that for HACMP and FailSafe a resource can appear in only one resource
group (nodes and communication adapters aren't really resources, so, aren't
limited this way, but, applications, file systems, IP addresses, etc. are
so limited.) SA doesn't have this restriction: resources can be contained
in multiple resource groups if desired, and application groups can
themselves contain other application groups.
As to resource dependencies, HACMP recognizes only pre-defined
relationships, i.e., an application depends on its file system(s) which in
turn depend on volume group(s) [disks], the application also depends on an
IP address. These relationships are imputed by the fact that these
resources are collected in a resource group, thus, HACMP will use this info
when starting a resource group:
- ensure the volume group (disks) is varied on (i.e., available for use)
- mount the file system(s)
- set the IP address up on an adapter
- start the application
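That fixed ordering can be sketched as a simple list of steps, with shutdown running the same steps in reverse. The step names below are illustrative labels, not actual HACMP commands:

```python
# Sketch of HACMP's fixed bring-up order for a resource group;
# shutdown reverses the list.  Step names are invented labels.

BRINGUP_ORDER = [
    "varyon_volume_group",   # make the disks available for use
    "mount_filesystems",
    "configure_service_ip",  # set the IP address up on an adapter
    "start_application",
]

def bring_up(run):
    for step in BRINGUP_ORDER:
        run(step)

def shut_down(run):
    for step in reversed(BRINGUP_ORDER):
        run(step)

# record the order the steps would execute in
log = []
bring_up(log.append)
shut_down(log.append)
```

The fixed list is exactly why HACMP doesn't need user-specified dependencies for the common cases: the ordering is baked into the product's model of a resource group.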
HACMP also understands NFS file systems, and imputes them to be dependent
upon an IP address, thus sets up the IP address then mounts the file
system. Although HACMP doesn't strictly allow you to define 'new' resource
types, it does provide many 'user exit' points where users can add in their
own scripts, and this provides a rough way to manage 'new' resource types
in a limited fashion.
FailSafe appears to offer some ability to manipulate resource dependencies,
or, at least resource type dependencies, although the documentation is a
bit unclear. But, it does clearly describe that there are 'levels' for
resources, and these levels are used to order the bring-up and shutdown of
resources. By default, I'd assume it works very much like HACMP, but it
does appear to provide a bit more flexibility and customisability here by
allowing new resource types to be fit into the framework.
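If I read the levels idea correctly, ordering reduces to a sort on an integer per resource type. A tiny sketch, assuming lower levels start first and stop last (the resource names and level values are invented):

```python
# Sketch of level-based ordering as FailSafe's docs describe it:
# each resource type carries an integer level, and (assuming lower
# levels start first) bring-up is just a sort on that level.
# Names and level values here are invented for illustration.

resources = {
    "volumes": 10,
    "filesystem": 20,
    "ip_address": 20,
    "webserver": 30,
}

start_order = sorted(resources, key=resources.get)
stop_order = list(reversed(start_order))
```

Note that two resources at the same level (filesystem and ip_address above) have no ordering relative to each other, which is the main expressiveness difference versus a full dependency graph.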
SA uses the user-configured resource dependency relationships to understand
which resources need to work together, and to determine the order of
bring-up and shutdown. The user can configure these dependencies as
desired, thus making this extremely flexible. Having never tried to
configure SA, I can't rationally comment on how many defaults are provided,
i.e., how many resource types SA recognises by default and is able to impute
default dependency relationships among them. As far as I can tell, if you
are willing to write an agent for the resource type, SA can manage it.
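A dependency graph like SA's can be ordered with a topological sort, which the Python standard library happens to provide. The resource names and dependency edges below are invented to mirror the earlier HACMP example:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Sketch of dependency-driven ordering in the style attributed to SA:
# the user declares which resource depends on which, and the manager
# derives bring-up order (shutdown would be the reverse).  Resource
# names and edges are invented for illustration.

deps = {
    "app":          {"filesystem", "service_ip"},
    "filesystem":   {"volume_group"},
    "service_ip":   set(),
    "volume_group": set(),
}

# static_order() yields each node only after all of its dependencies
order = list(TopologicalSorter(deps).static_order())
```

The flexibility the note describes falls out naturally: any new resource type just adds nodes and edges to the graph, and the ordering is recomputed rather than hard-coded.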
HACMP and FailSafe both provide 'fine grain' failover in that a single
resource group can be moved or failed-over to a different node, leaving
other resource groups alone on the node. SA allows you to specify a single
resource, and the dependency information will be used to determine the set
of affected resources, or you can direct an application group to be moved. I know
with HACMP (and I think it's the same with FailSafe) that all resource
groups are 'equal' to each other in priority.
To determine where a resource group is placed, HACMP provides two policies,
'cascading' and 'rotating'. Cascading is essentially the same as
FailSafe's 'ordered' default policy: the first node in the list that is
currently a member of the cluster is chosen. Rotating also uses a list of nodes, but
treats all nodes as equals and picks one based on which one has available
network adapters to use. For both HACMP policies, the list of nodes can be
all nodes in the cluster, or a subset. FailSafe has a 'round-robin' but (I
think) that uses all nodes in the cluster. In addition, FailSafe provides
a user exit here, where the user can provide a script that is allowed to
return a list of nodes dynamically to place the resource group (although
I'm a bit unclear how dynamic it is.)
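The two HACMP policies are simple enough to sketch directly. Everything below (node names, the adapter check) is illustrative, assuming my reading of the policies is right:

```python
# Sketch of the two HACMP placement policies described above.
# 'cascading': first node in the configured list that is currently a
# cluster member wins.  'rotating': listed nodes are treated as equals
# and one with a free network adapter is picked.  Names and the
# adapter check are invented for illustration.

def cascading(node_list, members):
    for node in node_list:
        if node in members:
            return node
    return None  # no listed node is currently in the cluster

def rotating(node_list, members, has_free_adapter):
    candidates = [n for n in node_list
                  if n in members and has_free_adapter(n)]
    return candidates[0] if candidates else None

members = {"nodeB", "nodeC"}
first = cascading(["nodeA", "nodeB", "nodeC"], members)       # nodeB
other = rotating(["nodeA", "nodeB", "nodeC"], members,
                 lambda n: n == "nodeC")                      # nodeC
```

FailSafe's user-exit variant would amount to replacing the static `node_list` with the output of a user-supplied script.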
HACMP and FailSafe both support various 'modifiers' such as auto-failback
and in-place recovery and such.
Here, SA goes somewhat over the top. It uses a combination of weights on
resources, load information, time of day constraints, dependencies and
where competing resources have been placed to determine the best home for a
resource. It also appears to have various goal-based performance setups to
allow 'less important' resources to be moved if a system is getting too
loaded, so the user can define which resources are more important than
others, and, (I think) that based on the time of day these settings can
change automatically. You can model the behaviours of HACMP and FailSafe,
as well as go well beyond them if desired.
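As a very rough caricature of that goal-driven placement (the scoring formula and all inputs are entirely invented; SA's real engine is far more elaborate):

```python
# Rough sketch of weight/constraint-driven placement as described for
# SA: each candidate node is scored from the resource's importance,
# the node's current load, and a time-of-day constraint, and the
# best-scoring node wins.  The formula is entirely invented.

def score(node, load, importance, allowed_now):
    if not allowed_now:
        return float("-inf")  # time-of-day window closed
    return importance - load[node]

def place(importance, load, allowed_now):
    # pick the node with the highest score
    return max(load, key=lambda n: score(n, load, importance, allowed_now))

load = {"node1": 0.9, "node2": 0.2}
best = place(5, load, allowed_now=True)  # the less-loaded node
```

The contrast with the HACMP/FailSafe policies is that the node list is no longer an ordered preference: every placement is a fresh optimisation over current conditions.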
I could sum this up by saying that HACMP and FailSafe are happy so long as all
resource groups are running, whereas SA isn't happy unless a whole range of
dependency, performance, load and time constraints are satisfied.
Some subjective comments. SA is probably overkill when looking at the vast
bulk of Linux clusters likely to be built. I don't see anything in any of
these tools that would prevent supporting common cluster usage, e.g., shared-disk DB
(e.g., Oracle), shared-nothing DB (DB2 UDB), web servers with failover,
etc., since FailSafe provides a proof point, being the only one currently
on Linux. IMHO SA would be quite useful, but, as an administrator with a
small number of nodes and no background in OS/390, it would appear to be a
VERY steep learning curve to get it working. On the other hand, FailSafe
follows a relatively common model for UNIX-based commercial
recovery/failover tools (per its many similarities with HACMP) so would be
familiar to admins with UNIX HA experience. Plus, it's a bit less
overwhelming to a rookie first approaching it.
In a heterogeneous environment centered around a 390 (oops, I mean zServer)
running OS/390 (oops, z/OS) and one or more Linux images, SA may be quite
useful to maintain a consistent model across the whole installation. On a
cluster of Intel workstations running Linux, it will be a harder sell.
These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
Clusters and High Availability, Beaverton, OR
wombat at us.ibm.com
and in no way should be construed as official opinion of IBM, Corp., my
email id notwithstanding.