Whither cman
Whatever happened to cman?
Christine “Chrissie” Caulfield, Red Hat
ccaulfie@redhat.com
Version history
0.1 30th January 2009 First draft
0.2 3rd February 2009 Add a chapter on migrating from libcman
0.3 27th February 2009 Add some commas and fix some typos
1. Introduction
CMAN was the mainstay of Red Hat's cluster suite in Red Hat Enterprise Linux 4, 5 and 6. Actually,
CMAN in RHEL4 was a totally different thing from that found in RHEL5 and 6, but the external
interfaces looked very similar.
So what was CMAN?
CMAN stood for “Cluster MANager” (boring eh?) and provided the low-level infrastructure for the
cluster; node communications, votes, quorum etc. Most of the useful bits of the cluster stack were built
on top of cman so few people got to see it unless things went horribly wrong.
Starting with RHEL5 we dispensed with the custom-written kernel CMAN and based the cluster
manager on OpenAIS. For more details about this see my document on the subject:
http://people.redhat.com/ccaulfie/docs/aiscman.pdf.
OK, a short summary then. CMAN in RHEL 5/6 is really just a quorum module that loads into openais
(these days known as corosync). It provides other API services such as messaging but those are really
just wrappers around other openais services.
At the cluster summit in 2008 it was decided that keeping companies' own clustering modules was a
waste of effort and almost always failed to build a community. Even though CMAN (all versions of it)
were released as free software, they lived in their own repositories and were seldom modified by the
original author (ie me!). So efforts were made to make the software even more open and included in the
components of their parent projects. Also part of the collaboration efforts were the pooling of resource
and fence agents and several others. But this is just about CMAN OK? Come on now, concentrate.
2. corosync/openais – what happened there?
In Red Hat Enterprise Linux 5 cman was a module for openais. In RHEL6 (we're not actually supposed
to call it RHEL, but “Red Hat Enterprise Linux” is a lot of typing. And even if I do a search/replace
afterwards it'll be unwieldy to read ... sorry marketing, but this is an engineering paper, not a sales one)
it's a module for corosync. So, what happened there?
Just in case you weren't already confused enough, openais has now been split into two parts. The parts
still called openais are the modules for corosync that implement the SAF AIS services and APIs – you
know: the ones with the VeryAnnoyinglyCapitalisedNames. Corosync is the communications layer
below that, but it also includes other services such as cpg (Closed Process Groups ... very popular
amongst service writers, that one) and, now, quorum. Most people won't really notice this split, apart
from, possibly, the name of the process in 'ps' output.
Just to put a little bit of detail on this, the services that are included with corosync are:
cfg cpg evs quorum
Services included in openais are:
Amf Ckpt Clm Evt Lck Msg (see what I mean about the capitalisation ... and that's just the names!)
corosync is the name of the daemon, replacing aisexec.
I hope that's all clear because it gets more complicated from here on in.
3. Intermediate status – RHEL6
Before getting onto the really new code, I'll talk a little about the RHEL6 code which is sort of a
half-way house between RHEL5 and the code that will be in RHEL7. I suppose that makes some sort of
sense, numerically if nothing else.
The first thing you'll notice about RHEL6 is that ccsd is not there any more. Actually I very much
doubt it'll be really the very first thing you notice unless you're incredibly sad or a real clustering nerd,
but I do hope you might have spotted it at some stage. I'll talk (well, write – unless you want to invite
me, all expenses paid, to give a presentation) about that in the next chapter.
Apart from this, you will notice very little difference in the RHEL6 cman. We were under pressure to
allow rolling upgrades from RHEL5 (ie no wire protocol changes) and time constraints meant that the
code discussed below was not stable enough to be included for the deadline. The only other substantial
difference is that cman is now a quorum provider for corosync. I'll also write more about this later
on (you're really agog now aren't you?), but basically it means that the rest of the corosync modules
will know about the quorum status and can act on it if they feel the urge, and that applications that are
only interested in the quorum status and node list can link straight into libquorum and bypass cman
altogether. This might help some people with the upgrade path ... or not ... we'll see.
4. Configuring corosync
This chapter is also relevant to RHEL6 by the way, just in case you were going to skip the rest of the
document and have an early night.
So, as I mentioned above, ccsd is now gone and I can hear the cheers already. The distribution of
cluster.conf files is now delegated to ricci and a new command ccs_sync. I will not focus too closely on
that bit any more but it brings me to the next main change in the way CMAN (as it still is at this stage)
is loaded. Actually you don't have to use ccs_sync to distribute cluster config files, conga, scp, nfs and
several other methods work perfectly well. Pick the one that suits you best; “choice is good” as the
free-marketeers love to say.
The way that corosync (not aisexec remember?) is launched is still via cman_tool as before. But it now
sets a few environment variables first. The most important of these tells corosync which configuration
modules to use, and there are up to three of those now.
The first, xmlconfig, simply loads the contents of /etc/cluster/cluster.conf into the corosync object
database, where it keeps all the configuration information. The second, cman-preconfig, translates that
format into one native to corosync by resolving nodenames, generating multicast addresses etc.
4.1. Those modules in full
Corosync config modules are different from the normal service modules that corosync uses. They are
specified in the environment variable COROSYNC_DEFAULT_CONFIG_IFACE before corosync is
started. This variable can contain a colon-separated list of modules that will be called before corosync
is started for real. They also manage the reloading of data when the underlying configuration file or
device changes. By default cman_tool starts corosync with
COROSYNC_DEFAULT_CONFIG_IFACE=xmlconfig:cmanpreconfig:openaisserviceenable
4.2. xmlconfig & friends
xmlconfig is simply an XML parser with a default filename of /etc/cluster/cluster.conf. It reads the tree
of objects it finds in that file and loads it, unmodified, into corosync's object database (objdb). There
are a few other config modules that can be used here. There is an ldap one (called, rather obviously I
hope, ldapconfig) and even a ccsd one too if you're hopelessly addicted to the past. These modules are
quite easy to write and if you have a different back-end database you want to use it's probably quite
simple to write a corosync config module for it.
Just in case you were wondering, objdb is simply an in-memory hierarchical database where corosync
holds configuration data and some state.
The point about xmlconfig and its similarly-named friends (actually, maybe they're relatives ... hmm)
is that they do not try to interpret the data, they simply copy what is there into objdb. If it makes no
sense to corosync then it will tell you when it tries to start up properly shortly afterwards. You don't
need cman-preconfig (see below) to use these if the format of the data matches the hierarchy of
corosync.conf(5).
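To make that concrete, here is a minimal, hypothetical cluster.conf fragment and the objdb entries
xmlconfig would create from it (the names and values are invented for the example, and the objdb
rendering below is schematic):

   <cluster name="mycluster" config_version="1">
     <clusternodes>
       <clusternode name="node1" nodeid="1"/>
     </clusternodes>
   </cluster>

becomes, in objdb:

   cluster { name = "mycluster", config_version = "1" }
   cluster.clusternodes.clusternode { name = "node1", nodeid = "1" }

Nothing is interpreted at this stage; the keys and values are copied in verbatim.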
cman_tool join allows you to specify a different main config module than xmlconfig on the
command-line with cman_tool join -C <module>. It will always add cman-preconfig though. eg the command:
# cman_tool join -C ldapconfig
will load corosync with the environment variable:
COROSYNC_DEFAULT_CONFIG_IFACE=ldapconfig:cmanpreconfig:openaisserviceenable
4.3. cman-preconfig
Cman-preconfig is a rather strange beast in some ways. Its main purpose is to re-interpret the loaded
configuration file. Let me endeavour to explain.
CMAN has had, since RHEL4, a very specific XML-based format which has not changed significantly in the
meantime. It contains details of all the nodes in the cluster, along with their fence details and also
rgmanager services too. This format bears no resemblance to corosync.conf. If you load a normal
RHEL cluster.conf into corosync using xmlconfig alone, without passing it through cman-preconfig
then corosync will not start. It will not be able to find a single entry that it needs to run.
cman-preconfig re-interprets this configuration data into a form that corosync expects, and adds some
sensible, we hope, defaults for things not specified at all, eg the cluster multicast address. It also has a
mode where it can start up cman/corosync with no configuration data at all (see cman_tool -X) which is
very useful for time-poor or lazy people who just want to test the cluster a little, or perhaps for
commissioning virtualised clusters. There's a whole chapter on this later on.
4.4. openaisserviceenable
This module simply tells corosync to load the openais services as well as its own default set.
Normally this is what you will want as many cluster daemons use openais services like checkpointing.
You can disable this by adding the -A switch to cman_tool join.
5. CMAN itself
Now we get to the nitty-gritty. The CMAN module exists in RHEL5 and RHEL6 and does all the
cman-y things that cman does such as quorum and messaging. It also services the libcman library using
an IPC mechanism all of its own. It does not use corosync's IPC (library to daemon communications)
system, for reasons that are now unlikely ever to become clear.
The new code, as I've said, dispenses with anything called cman at all. The main work of managing
quorum is done by a built-in corosync module called votequorum. This code is actually largely lifted
from the original cman module, with all the non-quorum bits removed, and converted to use the
corosync IPC layer.
This module also includes code for managing quorum devices (eg qdiskd) as well as the hated
“Disallowed” mode, though both are inactive in a default corosync system.
At the time of writing libcman still exists. It provides the same API as the libcman in previous versions
though there might be small semantic differences. It's also not ABI compatible so the soname has been
incremented to 4. This libcman uses a combination of calls to quorum, confdb and a custom cman
module to provide these services. It is provided largely as a compatibility layer for older applications
though the source code to libcman can also provide ideas on how to do cman-y type things with the
existing corosync services. All the programs in Red Hat clustering will be changed to use the standard
APIs that are provided with corosync.
6. The CMAN module
Now hang on there, you just said “a custom cman module”. So cman does still exist?
Well, yes and no.
There is a module called “cmanbits” which provides services for some libcman APIs. These are:
cman_send_data
cman_start_recv_data
cman_end_recv_data
cman_is_listening
Of these, the messaging API can be replaced with the normal openais messaging system. There is no
suitable replacement for is_listening.
Missing totally from this new cman is the barriers API which nobody used, to my knowledge. If you
did and are fuming with me for leaving it out ... sorry. But there are actually better ways of achieving
the same effect with existing corosync or openais calls.
Some other functions that were in libcman are now provided by the corosync cfg service. These are
corosync_cfg_kill_node
corosync_cfg_try_shutdown
corosync_cfg_replyto_shutdown
corosync_cfg_get_node_addrs
These are wrapped in the new libcman but are deprecated.
7. corosync quorum
7.1. quorum plugins
A brief word about corosync's quorum system might be helpful here. Corosync has a basic notion of
quorum built in to the executive. Services cannot be synchronised without it and several IPC calls will
be blocked if the cluster is inquorate.
By default corosync does not load the quorum module. This is for backwards compatibility so that
existing clusters will work. If the quorum module was loaded and no quorum provider was specified
then the cluster would never be quorate and never get any work done. Not a popular system I would
guess. If the quorum module is absent then everything just works all the time.
To load the quorum module, you need to add the following stanza to corosync.conf (or equivalent):
service {
name: corosync_quorum
ver:0
}
cman-preconfig does this for you by the way. Corosync will then load the quorum module and allow
users to use the basic quorum API as defined in /usr/include/corosync/quorum.h. This simply allows
programs to query the quorate state and to receive notifications when it changes. This notification also
includes a list of nodes that are part of the quorum. It's useful (I hope) to realise that this plugin is also
available in RHEL6 and that cman provides quorum for it. So if you only need to know the quorum
state (ie you don't need any of the other libcman gubbins) then calling into libquorum instead will
provide you with forward-compatibility.
However, as I mentioned above (or on the previous page if you were ungreen enough to print this
document out) this will not give you quorum, all it will do is to block a lot of useful activity in
corosync. To make it work you need a quorum provider.
A quorum provider is a corosync module that tells the corosync quorum plugin when the quorate state
changes, based on its own internal algorithm. There are many ways of calculating quorum so there are
several modules available, and I hope people will write more to suit their own needs. The quorum
provider is loaded by putting its name in corosync.conf as follows:
quorum {
provider: testquorum
}
This example, as I sincerely hope you have guessed, loads the 'testquorum' module. Testquorum is an
extremely simple module which I recommend as a template to other quorum provider writers. It simply
changes the corosync quorate status based on the quorum.quorate variable in the object database. Using
corosync-objctl the quorate state can be switched on and off at will. If you have some form of external
quorum system this might actually be useful to you.
Another quorum provider shipped with corosync is YKD. This is the YKD system that has been
shipped with openais for some time, it has merely been re-designated a quorum plugin now. Sadly the
developers did not have time to give it a shiny new logo too so it looks pretty much like it always did.
To be quite frank, I know almost nothing about YKD and it scares me a little, so unless a knight in
shining armour (ideally looking like David Tennant) comes along to rescue me I'll leave it there.
Now, where was I, sorry I got distracted by the David Tennant reference. Oh yes, votequorum is the
quorum module based on cman. This is the one I know most about and is probably of most interest to
those of you coming from a cman background. So I'll devote an entire subsection to it. Favouritism? Pah!
7.2. votequorum
Votequorum is the name of the corosync module that replaces most of what cman used to do. Only it
does it in a more 'corosync-y' way. It uses the corosync IPC system for communication with users, it
provides quorum to the central core of the corosync daemon and provides familiar tracking callbacks.
I'll outline the API calls for votequorum here. If you are familiar with libcman then you can go and get
a hot chocolate or something while this bit is going on ... see you in chapter 8.
cs_error_t votequorum_initialize(votequorum_handle_t *handle, votequorum_callbacks_t
*callbacks)
cs_error_t votequorum_finalize(votequorum_handle_t handle)
These two are the standard corosync initialise & finalise calls that corosync/openais developers will
already be familiar with. libcman users will note that the callbacks are specified here rather than in later
API calls. The callbacks are for quorum change or expected votes change – remember votequorum
does not have a messaging API.
cs_error_t votequorum_fd_get(votequorum_handle_t handle, int *fd);
cs_error_t votequorum_dispatch(votequorum_handle_t handle,cs_dispatch_flags_t
dispatch_types)
Again, these should be a familiar sort of call to both corosync and libcman users. You call
votequorum_dispatch when the FD returned from votequorum_fd_get becomes active from select or
poll.
cs_error_t votequorum_getinfo (votequorum_handle_t handle, unsigned int nodeid, struct
votequorum_info *info)
This call fills in the info structure with information about the quorum subsystem and the node for
which you requested information. Members starting node_ are specific to the node in question, the
others are for the cluster as a whole.
cs_error_t votequorum_setexpected(votequorum_handle_t handle, unsigned int expected_votes)
This call changes the expected_votes value for the cluster. This will persist until it is changed again.
NOTE: if you are using a configuration file then a forced reload of that file will overwrite a manually
set expected_votes value, as will adding a new node to the cluster.
cs_error_t votequorum_setvotes (votequorum_handle_t handle, unsigned int nodeid, unsigned
int votes)
This call changes the number of votes assigned to an existing node in the cluster. This will persist until it
is changed again. NOTE: if you are using a configuration file then a forced reload of that file will
overwrite a manually set votes value.
cs_error_t votequorum_qdisk_register(votequorum_handle_t handle, char *name, unsigned int
votes)
cs_error_t votequorum_qdisk_unregister(votequorum_handle_t handle)
cs_error_t votequorum_qdisk_poll(votequorum_handle_t handle, unsigned int state)
cs_error_t votequorum_qdisk_getinfo(votequorum_handle_t handle, struct
votequorum_qdisk_info *info)
This group of calls manages an optional extra quorum device, often a quorum disk. There can only be
one quorum device per cluster and it is up to the quorum device software to ensure that its state is
consistent across the cluster or to remove it from quorum. The device can have any number of votes,
but those votes cannot be changed without re-registering the device – doing this might cause a
temporary loss of quorum. Quorum devices must poll corosync regularly to be regarded as alive, by
default if a quorum device has not called votequorum_qdisk_poll() after ten seconds then it will be
regarded as dead and removed from quorum. This poll period is customisable, see below.
cs_error_t votequorum_setstate(votequorum_handle_t handle)
This call sets a non-resettable flag “HasState” in votequorum. This state is available from the getinfo
call above and used by some subsystems to prevent cluster merges. It only makes sense to call this if
the dreaded “disallowed” flag is set. (see configuration variable below).
cs_error_t votequorum_trackstart(votequorum_handle_t handle, uint64_t context, unsigned int
flags)
cs_error_t votequorum_trackstop(votequorum_handle_t handle)
These are the usual tracking start/stop calls for a corosync plugin. Trackstart enables quorum or
expected_votes change messages to be sent to the notification callback. To avoid race conditions it is
recommended that the flags CS_TRACKSTART | CS_TRACKCHANGES are used so the program
will get one immediate callback containing the current quorum state, then further callbacks as nodes
join or leave the cluster.
cs_error_t votequorum_leaving(votequorum_handle_t handle)
This call sets a flag inside all nodes in the cluster telling them that the node will be leaving the cluster
soon. When the node actually does leave the cluster, then quorum will be reduced and activity will
carry on. This is like 'cman_tool leave remove' in cman. If this is not called before removing a node
then the number of active votes will be reduced and the cluster could lose quorum.
Note that this call does not remove the node from the cluster, it merely sets the flag saying it will. There
is a timeout on this flag of ten seconds. If the node is not shut down within this time it will be
cancelled. It is important that this timeout is longer than cfg's shutdown timeout otherwise leaving will
be cancelled if daemons don't respond.
cs_error_t votequorum_context_get(votequorum_handle_t handle, void **context)
cs_error_t votequorum_context_set(votequorum_handle_t handle, void *context)
These two calls simply allow the context variable returned to the callbacks to be changed.
7.2.1. configuring votequorum
The votequorum module has a small number of variables that you can set in objdb. They are all held
under the main quorum{} section.
quorumdev_poll This is the time between quorum device polls, in milliseconds.
The default is 10000.
leaving_timeout If the node is not shut down <leaving_timeout> milliseconds after
votequorum_leaving is called then the leave flag will be cancelled.
The default is 10000.
votes The number of votes this node has.
The default is 1.
expected_votes The number of votes that this cluster expects. votequorum will use the
largest value of all the nodes in the cluster.
The default is 1024. ie you will not have quorum unless you change it!
two_node Enables a special two-node mode where a cluster with only two nodes
and no quorum device can continue running with only one node. This
expects you to have hardware fencing configured. two_node mode can be
cleared without a cluster restart (ie you can add a third node) but it
cannot be set without restarting all nodes.
The default is 0.
disallowed This is an odd mode that prevents two clusters with state (See setstate()
above) from merging. If this happens then nodes will be marked
“disallowed” and not included in quorum calculations. Those who have
read up on RHEL5 cman will know this mode and fear it. It cannot be
changed without a cluster restart.
The default is 0.
8. Configuration-free clusters
Configuring a cluster is hard work. Don't you wish you could just plug in a node and have it “just
work”? Well, within several limitations you can. This mode has existed in cman since RHEL4 but is
not well documented (oops, maybe that's my fault) and is hardly used to my knowledge.
The key command to remember is
# cman_tool join -X
On a freshly installed system this will fail with the error:
corosync died: Could not read cluster configuration
OK, so it's not that configuration-free. You don't get something for nothing in this life. However, don't
despair. All you need to do is generate a security key for the cluster and try again:
# mkdir /etc/cluster
# corosync-keygen
# mv /etc/ais/authkey /etc/cluster/cman_authkey
You now need to copy this file to all the nodes you want to join in this cluster. Once you've done that
try cman_tool join -X again and it should magically start up. If you start up the other nodes then they
will join the same cluster.
cman_tool join -X sets up several defaults to get you going. Maybe the first thing you will notice is that
the node numbers are HUGE! They are the IP addresses of the nodes. This isn't a problem, it just looks
a little unwieldy, especially on little-endian systems where adjacently numbered nodes might appear to
have random and massively different nodeids. eg:
# cman_tool nodes
Node Id Sts Name
16951488 M amy
33728704 M anna
50505920 M clara
67283136 M fanny
117614784 M nadia
134392000 M sofia
Anyway, if you can cope with those, this system should get you going with a simple single cluster that
has no fencing requirements (ie GFS cannot be used).
You can override several things on the cman_tool command-line to add things like sensible nodeids
and a different cluster name or cluster ID. The last two will allow you to run multiple configuration-
less clusters on a single network. By default all nodes will join a cluster called “RHCluster”.
Actually, you can change the nodeids by adding -n<nodeid> to the cman_tool command-line, but that
probably doesn't fit in too well with the zero-configuration system I suppose. See the cman_tool man
page for a list of options.
Quorum is the big problem with this system. There is no way to know how many nodes are needed in
such a cluster. You can set expected_votes on the cman_tool command-line, and I urge you to do so,
but without this quorum will default to 1. Yes ONE. Your cluster will always be quorate. I know this
goes against the principles of a lot of clustering but there are two main reasons why it is done this way.
1. It's easy to get going. A lot of the time, this mode will be used by people just testing clusters and
they want to get going and see what all the fuss is about. Once they understand clusters they can get
involved in sorting out quorum, fencing and configuration files.
2. A lot of the time these clusters are run as virtual machines in systems like Xen or KVM. In this case
there is no danger of a “split-brain” occurring as the systems are either all on the same host computer or
on nodes that are already part of a properly configured cluster with real hardware fencing.
9. Migrating code from libcman
This chapter is just a list of libcman calls showing you where in corosync to find equivalents. Most of
them have very similar names so it shouldn't be too hard I hope. The compatibility libcman provides
most of these calls so you might like to look at the source code for that library if you are unsure about
how to achieve a particular result.
cman_admin_init
cman_init
cman_finish
cman_setprivdata
cman_getprivdata
cman_start_notification
cman_stop_notification
cman_start_confchg
cman_stop_confchg
cman_get_fd
cman_dispatch
Use the equivalent calls for the subsystem(s) you are using.
cman_set_votes
cman_set_expected_votes
cman_set_dirty
cman_register_quorum_device
cman_unregister_quorum_device
cman_poll_quorum_device
cman_get_quorum_device
cman_get_node_count
cman_get_nodes
cman_get_disallowed_nodes
cman_get_node
Use libvotequorum. Note though that this does not provide you with node names for the cman_get_
calls.
cman_is_quorate
Use libquorum, or libvotequorum. You should only use libvotequorum if you are also using other
features of that module. If all you need to know is the quorum status then you should use libquorum as
that will work regardless of quorum provider.
cman_get_version
cman_set_version
cman_get_cluster
cman_set_debuglog
Use libconfdb or libccs to read/write to the objdb.
cman_kill_node
cman_shutdown
cman_replyto_shutdown
cman_get_node_addrs
Use libcfg.
cman_leave_cluster
If you just need to leave the cluster then libcfg will work. If you need to leave the cluster and make the
remaining nodes keep quorum then you also need to tell votequorum.
cman_send_data
cman_start_recv_data
cman_end_recv_data