git.samba.org - sahlberg/ctdb.git/log

add better error checking that the nodes we try to talk to using the "ctdb" tool actually exist and are connected.

two new dedicated ctdb error codes:
21: node does not exist
22: node is disconnected

Ronnie Sahlberg [Wed, 17 Dec 2008 01:01:40 +0000 (12:01 +1100)]

don't call ctdb_fatal() just because we are asked to restart a connection
to a remote node and ctdb->methods is NULL.

This can happen when we are in the middle of a normal shutdown of the
daemon and we have already shut down the transport layer (thus setting
ctdb->methods == NULL in the transport layer destructor)
and there is some unprocessed data related to a remote node.

This prevents an ugly race condition where ctdb might sometimes (rare)
cause a core dump during "ctdb shutdown".

Michael Adam [Mon, 15 Dec 2008 17:21:37 +0000 (18:21 +0100)]

skip directories containing macros (%) in ctdb_check_directories_probe

This prevents the monitor action of 50.samba from failing
on e.g. a typical [homes] service with "path = /home/%S" .

Michael
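
The check described in this commit lives in a shell function, but the idea can be sketched in C. This is only an illustration: `path_has_macro` is a hypothetical name, and it assumes that any `%` in a share path marks a Samba macro that cannot be probed as a literal directory.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if a share path contains a Samba macro such
 * as %S or %U, so that it cannot be checked as a literal directory. */
bool path_has_macro(const char *path)
{
    return strchr(path, '%') != NULL;
}
```

A directory probe would skip any path for which this returns true instead of reporting the directory as missing.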

Michael Adam [Sat, 5 Jul 2008 12:28:27 +0000 (14:28 +0200)]

ctdb.init: add Default-Start to init script to enable autostart.

Michael

Michael Adam [Fri, 12 Dec 2008 15:57:58 +0000 (16:57 +0100)]

ctdb.init: check availability of ctdb (with ping) before calling ctdb status

Michael

Michael Adam [Fri, 12 Dec 2008 15:00:07 +0000 (16:00 +0100)]

ctdb.init: behave correctly when calling "service ctdb stop" on stopped service

When "service ctdb stop" is called and the ctdbd is not running,
don't print the "Failed to connect to daemon" error messages.
But print a warning and exit with status success instead.

Michael

Michael Adam [Fri, 12 Dec 2008 15:05:04 +0000 (16:05 +0100)]

ctdb.init: fix return code of "service ctdb stop" on non-redhat systems

Michael

Michael Adam [Fri, 12 Dec 2008 15:04:29 +0000 (16:04 +0100)]

ctdb.init: fix status message of "service ctdb stop" on suse systems

Michael

Michael Adam [Sat, 5 Jul 2008 12:42:46 +0000 (14:42 +0200)]

packaging: set docdir in calls to make (to get it right on e.g. SuSE systems).

Currently docdir = /usr/share/doc is hardcoded in the Makefile.in.
Some systems use a different doc dir (SuSE uses /usr/share/doc/packages).

And not all versions of autoconf provide the --docdir parameter
(2.61 does, while 2.59 does not). So we use the quick solution
to specify "docdir=%{_docdir}" in the make calls in the spec file.

Michael

Ronnie Sahlberg [Thu, 11 Dec 2008 22:39:55 +0000 (09:39 +1100)]

New version 1.0.68

Michael Adam [Wed, 10 Dec 2008 21:27:36 +0000 (22:27 +0100)]

Improve the monitor event test for ethernet interfaces (link detection).

On some systems, the ethtool link detection is not successful when a
cable is plugged but the interface has not been brought up previously.
This improves the test by bringing the interface up (without checking
for success here) and trying the ethtool test again afterwards.

Michael

Michael Adam [Wed, 10 Dec 2008 21:19:31 +0000 (22:19 +0100)]

Use "grep -q" instead of "grep ... > /dev/null" in events.d/10.interfaces
This enhances readability.

Michael

root [Wed, 10 Dec 2008 01:06:51 +0000 (12:06 +1100)]

update the "ctdb recover" command.

block and wait until the cluster has completed the recovery before returning.
this makes it easier to script since it avoids the common need for
   ctdb recover
   ... complex loop to wait for recovery to complete ...
   script continues

root [Wed, 10 Dec 2008 01:01:19 +0000 (12:01 +1100)]

add a CTDB_TIMEOUT variable for the ctdb tool.
If set, this specifies the maximum runtime for the ctdb tool before it will terminate with status == 20,
just like the -T ... option would.
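
A minimal sketch of how a tool might read such a variable. The function name `timeout_from_env`, the macro `EXIT_TIMED_OUT`, and the rule that invalid or non-positive values fall back to a default are assumptions made for illustration, not taken from the ctdb source.

```c
#include <errno.h>
#include <stdlib.h>

#define EXIT_TIMED_OUT 20   /* exit status named in the commit message */

/* Return the timeout in seconds: the value of CTDB_TIMEOUT if it is set
 * to a positive integer, otherwise the supplied fallback. */
long timeout_from_env(long fallback)
{
    const char *val = getenv("CTDB_TIMEOUT");
    char *end = NULL;
    long secs;

    if (val == NULL || *val == '\0') {
        return fallback;
    }
    errno = 0;
    secs = strtol(val, &end, 10);
    if (errno != 0 || *end != '\0' || secs <= 0) {
        return fallback;   /* assumed policy: ignore unusable values */
    }
    return secs;
}
```

A caller would arm a timer with the returned value and exit with `EXIT_TIMED_OUT` when it fires.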

root [Wed, 10 Dec 2008 00:49:51 +0000 (11:49 +1100)]

make sure we return an error code when the ctdb command has hung and is timed out by the -T <timeout> setting

root [Tue, 9 Dec 2008 01:03:42 +0000 (12:03 +1100)]

add a helper that waits until the cluster is no longer in recovery mode and returns the generation number.

change the ban/unban logic to wait until we are not in recovery before it bans/unbans the node.

also wait until after the cluster has recovered from the ban/unban before returning so that the cluster is in recovery mode == normal when the command returns. this makes it much easier to script things ...
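
The helper described above could look roughly like the following. This is a sketch under stated assumptions: the callback types, `wait_until_recovered`, the `RECMODE_NORMAL` value, and the two `fake_*` stand-ins are all hypothetical, and a real tool would query the daemon through controls rather than callbacks.

```c
#include <stdint.h>
#include <unistd.h>

#define RECMODE_NORMAL 0   /* assumed value for "not in recovery" */

/* Hypothetical query callbacks standing in for controls sent to ctdbd. */
typedef int (*get_recmode_fn)(void *ctx);
typedef uint32_t (*get_generation_fn)(void *ctx);

/* Poll until the cluster reports normal mode, then return the generation. */
uint32_t wait_until_recovered(get_recmode_fn get_recmode,
                              get_generation_fn get_generation,
                              void *ctx, unsigned poll_secs)
{
    while (get_recmode(ctx) != RECMODE_NORMAL) {
        sleep(poll_secs);
    }
    return get_generation(ctx);
}

/* Stand-ins for demonstration: report "in recovery" for the number of
 * polls stored in *ctx, then report normal mode. */
int fake_recmode(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return 1;
    }
    return RECMODE_NORMAL;
}

uint32_t fake_generation(void *ctx)
{
    (void)ctx;
    return 42;
}
```

With such a helper, ban and unban can wait before acting and again before returning, so scripts do not need their own polling loop.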

root [Mon, 8 Dec 2008 23:45:14 +0000 (10:45 +1100)]

update to the flags handling
make sure to abort the monitoring and restart if we failed to get the nodemap from a remote node

root [Mon, 8 Dec 2008 06:29:17 +0000 (17:29 +1100)]

If ctdbd was started with the --socket option then we also set the CTDB_SOCKET variable so that the eventscripts can pick up the name proper

root [Mon, 8 Dec 2008 01:57:40 +0000 (12:57 +1100)]

return -1 if ctdb ping failed

root [Fri, 5 Dec 2008 05:32:30 +0000 (16:32 +1100)]

redo and update how we synchronize flags across the cluster.
this simplifies the code and should close a race condition between the local recovery daemon and a remote node when flags are changing.

root [Thu, 4 Dec 2008 23:33:38 +0000 (10:33 +1100)]

some platforms are very picky about the third argument passed to bind()
and would complain if sa.family is AF_INET and the third argument is not exactly the size of a sockaddr_in.

We used to pass a union containing both a sockaddr_in and a sockaddr_in6, which would mean that on those platforms bind() would fail since the passed structure for AF_INET would be too big.

Thus we need to set and pass the appropriate size to bind. At the same time, for those platforms we can also set sin[6]_len to the expected size.
(bind() on those platforms was surprisingly perfectly ok when sin_len was "too big")
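
The fix can be sketched as choosing the length from the address family instead of passing the size of the whole union. The union name `sock_addr` and the helpers `sock_addr_len` and `bind_addr` are hypothetical names used only for this illustration.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical union holding either address family, as the commit describes. */
union sock_addr {
    struct sockaddr     sa;
    struct sockaddr_in  ip;
    struct sockaddr_in6 ip6;
};

/* Length to pass as the third argument of bind() for this address,
 * or 0 if the family is not supported. */
socklen_t sock_addr_len(const union sock_addr *addr)
{
    switch (addr->sa.sa_family) {
    case AF_INET:
        return sizeof(struct sockaddr_in);
    case AF_INET6:
        return sizeof(struct sockaddr_in6);
    default:
        return 0;
    }
}

/* Bind using the family-specific size rather than sizeof(union sock_addr). */
int bind_addr(int fd, const union sock_addr *addr)
{
    socklen_t len = sock_addr_len(addr);

    if (len == 0) {
        return -1;
    }
    return bind(fd, &addr->sa, len);
}
```

On platforms whose socket address structures carry a length field, the same value could also be stored there before the call.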

Ronnie Sahlberg [Thu, 4 Dec 2008 04:25:03 +0000 (15:25 +1100)]

new version 1.0.67

root [Thu, 4 Dec 2008 04:03:40 +0000 (15:03 +1100)]

fix an incorrect path

Ronnie Sahlberg [Thu, 4 Dec 2008 03:35:00 +0000 (14:35 +1100)]

add a description of the recovery-process

Ronnie Sahlberg [Tue, 2 Dec 2008 03:08:10 +0000 (14:08 +1100)]

print the list of valid debug level literals when an invalid debug level
is specified in 'ctdb setdebug'

Ronnie Sahlberg [Tue, 2 Dec 2008 02:26:30 +0000 (13:26 +1100)]

redesign how reloadnodes is implemented.

modify the transport methods to allow to restart individual connections
and set up destructors properly.

only tear down/set-up tcp connections to nodes removed from the cluster
or nodes added to the cluster.
Leave tcp connections to unchanged nodes connected.

make "ctdb reloadnodes" explicitly cause a recovery of the cluster once
the files have been reloaded

root [Fri, 28 Nov 2008 00:29:43 +0000 (11:29 +1100)]

debuglevel is a signed int, not unsigned.

Ronnie Sahlberg [Thu, 27 Nov 2008 22:52:26 +0000 (09:52 +1100)]

make it possible to delete an ip from all nodes at once using
"ctdb delip x.x.x.x -n all"

This is not as straightforward as one might think since during the
delete process we do not want the ip to be bouncing from one node to
another as node by node deletes it.

Thus we first delete the ip from all connected nodes which are not
currently hosting it.

After this we delete the ip from the node which is hosting it.
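
The ordering described above can be sketched as a two-pass walk over the node list. The `node_view` structure and `delip_order` function are hypothetical and exist only to illustrate the order of operations.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node view used only for this illustration. */
struct node_view {
    int  pnn;        /* node number */
    bool connected;  /* node is currently reachable */
    bool hosts_ip;   /* node currently serves the address being deleted */
};

/* Fill `order` with node numbers in the order the address should be
 * deleted: every connected non-hosting node first, the hosting node last.
 * Disconnected nodes are left out. Returns the number of entries written. */
size_t delip_order(const struct node_view *nodes, size_t n, int *order)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (nodes[i].connected && !nodes[i].hosts_ip) {
            order[count++] = nodes[i].pnn;
        }
    }
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].connected && nodes[i].hosts_ip) {
            order[count++] = nodes[i].pnn;
        }
    }
    return count;
}
```

Because the hosting node is handled last, no other node still lists the address when it is finally released, so it has nowhere to fail over to.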

Ronnie Sahlberg [Mon, 24 Nov 2008 08:06:02 +0000 (19:06 +1100)]

new version 1.0.66

Ronnie Sahlberg [Fri, 21 Nov 2008 05:24:12 +0000 (16:24 +1100)]

allow changing the recmaster even if the database is not frozen

Ronnie Sahlberg [Fri, 21 Nov 2008 00:30:32 +0000 (11:30 +1100)]

remove two variables no longer used from the example sysconfig file

Andrew Tridgell [Thu, 20 Nov 2008 21:05:59 +0000 (08:05 +1100)]

fixed problem with looping ctdb recoveries

After a node failure, GPFS can get into a state where non-blocking
fcntl() locks can take a long time. This causes the ctdb set_recmode
test to time out, which leads to a recovery failure, and a new
recovery. The recovery loop can last a long time.

The fix is to consider a fcntl timeout as a success of this test. The
test is to see that we can't lock the shared reclock file, so a
timeout is fine for a success.
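
The decision rule in this fix can be written down in a few lines. The enum and function names below are hypothetical; the sketch only captures which outcomes of the lock attempt count as passing the test.

```c
#include <stdbool.h>

/* Hypothetical outcomes of trying to lock the shared reclock file. */
enum lock_probe {
    LOCK_ACQUIRED,   /* we got the lock, which the test does not expect */
    LOCK_DENIED,     /* the lock attempt was refused */
    LOCK_TIMED_OUT   /* fcntl() did not answer in time */
};

/* The test passes when we could NOT take the lock. After this fix a
 * timeout is treated the same way as a refusal. */
bool reclock_check_passed(enum lock_probe result)
{
    return result == LOCK_DENIED || result == LOCK_TIMED_OUT;
}
```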

Andrew Tridgell [Thu, 20 Nov 2008 10:23:26 +0000 (21:23 +1100)]

Merge commit 'ronnie/master'

Ronnie Sahlberg [Thu, 20 Nov 2008 05:39:56 +0000 (16:39 +1100)]

don't override/change CTDB_BASE if it is already set by the shell

Ronnie Sahlberg [Thu, 20 Nov 2008 02:35:08 +0000 (13:35 +1100)]

Keepalive packets were only sent every KeepaliveInterval if the socket
had been completely idle during that interval.
If we had been sending other packets such as Messages, Calls or Controls
there wouldn't be any need for an explicit keepalive and thus we didn't
send one.

This does make it somewhat awkward when analyzing traces since it is
non-intuitive when keepalives are sent and when they are not sent.

Change the keepalive logic to always send a keepalive regardless of
whether the link is idle or not.

Ronnie Sahlberg [Wed, 19 Nov 2008 03:43:46 +0000 (14:43 +1100)]

rewrite the handling of flag updates across the cluster to eliminate a
race between the ctdb tool and the recovery daemon both at once
trying to push flag changes across the cluster.

Ronnie Sahlberg [Wed, 12 Nov 2008 23:55:20 +0000 (10:55 +1100)]

new version 1.0.65

update the example sysconfig file. the default log level is 2, not 0

Ronnie Sahlberg [Tue, 11 Nov 2008 03:49:30 +0000 (14:49 +1100)]

add a CTDB_SOCKET variable that can be used to override the default
/tmp/ctdb.socket

Ronnie Sahlberg [Mon, 3 Nov 2008 10:54:52 +0000 (21:54 +1100)]

we actually need a ctdb_db variable

Ronnie Sahlberg [Thu, 30 Oct 2008 02:34:10 +0000 (13:34 +1100)]

latency is measured in us, not ms

use an explicit ctdb_db variable instead of dereferencing state

Ronnie Sahlberg [Thu, 30 Oct 2008 01:49:53 +0000 (12:49 +1100)]

add control and logging of very high latencies.

log the type of operation and the database name for all latencies higher
than a threshold

Ronnie Sahlberg [Wed, 22 Oct 2008 00:06:18 +0000 (11:06 +1100)]

new version 1.0.64

Ronnie Sahlberg [Wed, 22 Oct 2008 00:04:41 +0000 (11:04 +1100)]

add a context and a timed event so that once we have been in recovery
mode for too long we drop all public ip addresses

Ronnie Sahlberg [Sun, 19 Oct 2008 22:47:54 +0000 (09:47 +1100)]

new version 1.0.63

Ronnie Sahlberg [Sun, 19 Oct 2008 22:45:15 +0000 (09:45 +1100)]

don't log "running periodic cleanup" ...

Ronnie Sahlberg [Fri, 17 Oct 2008 10:38:42 +0000 (21:38 +1100)]

null out the pointer before we reload the nodes file

Ronnie Sahlberg [Fri, 17 Oct 2008 10:18:06 +0000 (21:18 +1100)]

when we reload the nodes file, we may need to reload the nodes file
inside the recovery daemon as well.

Ronnie Sahlberg [Thu, 16 Oct 2008 22:02:03 +0000 (09:02 +1100)]

make it possible to set the script log level in CTDB sysconfig

Ronnie Sahlberg [Thu, 16 Oct 2008 20:56:12 +0000 (07:56 +1100)]

specify a "script log level" on the commandline to set under which log
level any/all output from eventscripts will be logged as

Ronnie Sahlberg [Thu, 16 Oct 2008 06:59:55 +0000 (17:59 +1100)]

new version 1.0.62

Ronnie Sahlberg [Thu, 16 Oct 2008 06:57:50 +0000 (17:57 +1100)]

allow multiple eventscripts using the same prefix.
this eases the pain for users that use out of tree eventscripts

Andrew Tridgell [Thu, 16 Oct 2008 01:58:25 +0000 (12:58 +1100)]

Merge commit 'ronnie/master'

Ronnie Sahlberg [Wed, 15 Oct 2008 05:40:44 +0000 (16:40 +1100)]

new version 1.0.61

Ronnie Sahlberg [Wed, 15 Oct 2008 05:29:09 +0000 (16:29 +1100)]

install the new multipath monitoring event script

Ronnie Sahlberg [Wed, 15 Oct 2008 05:27:33 +0000 (16:27 +1100)]

add an eventscript to monitor that the multipath devices are healthy

Ronnie Sahlberg [Tue, 14 Oct 2008 21:33:37 +0000 (08:33 +1100)]

we must also check the status returned from the get tickles control to
determine whether it was successful or not

Ronnie Sahlberg [Tue, 14 Oct 2008 16:02:09 +0000 (03:02 +1100)]

lower the loglevel for the informational message that a TCP_ADD operation
described an ip address not known to be a public address.

This could happen if someone for genuine reasons accesses a share
through a static ip address.
It can also happen if non homogenous public address configurations are
used and when a tcp description is pushed out to a different node that
does not serve/know the specific ip address.

Ronnie Sahlberg [Tue, 14 Oct 2008 14:49:19 +0000 (01:49 +1100)]

change ip route add to route add -net since this works more reliably

update the makefile and rpm to install 99.routing

Ronnie Sahlberg [Tue, 14 Oct 2008 14:32:46 +0000 (01:32 +1100)]

new version 1.0.60

Ronnie Sahlberg [Tue, 14 Oct 2008 14:23:57 +0000 (01:23 +1100)]

verify that the nodes we try to ban/unban are operational and print an
error to the user otherwise.

Ronnie Sahlberg [Tue, 14 Oct 2008 14:08:29 +0000 (01:08 +1100)]

Revert "from Mathieu Parent <math.parent@gmail.com>"

This reverts commit dc9cd4779db4a89697731e4cf415be51067a07c1.

Conflicts:

Ronnie Sahlberg [Tue, 14 Oct 2008 13:24:44 +0000 (00:24 +1100)]

update the client side of getnodemap and getpublicips controls to
fallback to the old-style ipv4-only controls if the new-style ipv4/ipv6
control fails.

this allows a 1.0.59+ (ipv4/ipv6) ctdb daemon being recmaster to be
compatible with
pre-1.0.59 versions of ctdb that are ipv4 only.
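
The fallback described above follows a simple try-new-then-old pattern. The sketch below uses hypothetical callback types and stand-in functions rather than the real control calls, and assumes that a control returns 0 on success.

```c
/* Hypothetical control callback: returns 0 on success, non-zero on failure. */
typedef int (*control_fn)(void *ctx);

/* Try the new ipv4/ipv6 control first. If the remote daemon rejects it,
 * fall back to the old ipv4-only control. */
int control_with_fallback(control_fn new_style, control_fn old_style, void *ctx)
{
    if (new_style(ctx) == 0) {
        return 0;
    }
    return old_style(ctx);
}

/* Stand-ins: an older daemon rejects the new control but answers the old
 * one. The old-style stand-in counts its calls in *ctx. */
int fake_new_control_fails(void *ctx)
{
    (void)ctx;
    return -1;
}

int fake_old_control_ok(void *ctx)
{
    int *calls = ctx;

    (*calls)++;
    return 0;
}
```

The same wrapper shape would apply to both the getnodemap and getpublicips clients mentioned in the commit.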

Ronnie Sahlberg [Mon, 13 Oct 2008 23:40:29 +0000 (10:40 +1100)]

update TAKEIP/RELEASEIP/GETPUBLICIP/GETNODEMAP controls so we retain an
older ipv4-only version of these controls.

We need this so that we are backward compatible with old versions of ctdb
and so that we can interoperate with an ipv4-only recmaster during a
rolling upgrade.

Ronnie Sahlberg [Sun, 12 Oct 2008 21:27:33 +0000 (08:27 +1100)]

from Mathieu Parent <math.parent@gmail.com>
Hi,

I have attached a patch necessary as debian log dir (/var/log) is not
a subdir of VARDIR (/var/lib on rpm systems, /var/lib/ctdb on debian).
As I don't know much about autotools and friends, this patch may be
hacky.

This is part of the process to minimize diff between distributions.

Ronnie Sahlberg [Sun, 12 Oct 2008 21:21:20 +0000 (08:21 +1100)]

From Mathieu Parent
patch to make debian systems log the package versions in
ctdb_diagnostics

Andrew Tridgell [Thu, 9 Oct 2008 07:45:12 +0000 (18:45 +1100)]

added some more gpfs commands per-filesystem

Ronnie Sahlberg [Tue, 7 Oct 2008 08:34:34 +0000 (19:34 +1100)]

skip empty lines in the public addresses file, not skip all non-empty
lines

Ronnie Sahlberg [Tue, 7 Oct 2008 08:25:10 +0000 (19:25 +1100)]

from Michael Adam: allow #-style comments in the nodes and public
addresses file
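
A minimal sketch of the line filter such a parser needs, covering both comments and empty lines. The name `line_is_ignorable` is hypothetical, and tolerating leading whitespace before the `#` is an assumption of this sketch rather than documented behaviour.

```c
#include <ctype.h>
#include <stdbool.h>

/* True if a line from the nodes or public addresses file carries no entry:
 * it is empty, whitespace only, or a '#' comment. */
bool line_is_ignorable(const char *line)
{
    while (isspace((unsigned char)*line)) {
        line++;
    }
    return *line == '\0' || *line == '#';
}
```

The parser skips a line when this returns true and parses it otherwise; getting that test the wrong way round would skip every real entry.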

Ronnie Sahlberg [Tue, 7 Oct 2008 07:23:12 +0000 (18:23 +1100)]

new version 1.0.59

Ronnie Sahlberg [Tue, 7 Oct 2008 07:14:44 +0000 (18:14 +1100)]

remove an unused variable

Ronnie Sahlberg [Tue, 7 Oct 2008 07:12:54 +0000 (18:12 +1100)]

When we reload the nodes file
instead of shutting down/restarting the entire tcp layer
just bounce all outgoing connections and reconnect

Ronnie Sahlberg [Tue, 7 Oct 2008 00:03:30 +0000 (11:03 +1100)]

add a new eventscript : 99.routing that is used to add static routes to
interfaces when they are activated (an ip address is added during
takeip)

Andrew Tridgell [Tue, 30 Sep 2008 14:16:17 +0000 (07:16 -0700)]

The author of the upstream code asked for this code to be GPLv2+ not GPLv3

Andrew Tridgell [Tue, 30 Sep 2008 14:09:06 +0000 (07:09 -0700)]

merged a bugfix for the idtree code from the Linux kernel. This
matches commit 7aae6dd80e265aa9402ed507caaff4a5dba55069 in the kernel.

Many thanks to Jim Houston for pointing out this fix to us

Ronnie Sahlberg [Mon, 22 Sep 2008 15:38:28 +0000 (01:38 +1000)]

Check that a database exists first before we dump its content (and
implicitly also create it) using 'ctdb catdb'

Andrew Tridgell [Wed, 17 Sep 2008 11:00:04 +0000 (21:00 +1000)]

expanded ctdb_diagnostics based on recent experience

Ronnie Sahlberg [Wed, 17 Sep 2008 04:24:12 +0000 (14:24 +1000)]

use the correct tunable failcount not timeout

Ronnie Sahlberg [Wed, 17 Sep 2008 04:17:41 +0000 (14:17 +1000)]

The ctdb daemon keeps track of whether the recovery process is running
correctly by measuring how long it was since the last successful
communication with the recovery daemon was recorded.

After a certain timeout the ctdb daemon would deem the recovery daemon
as inoperable and shut down.

If the system clock is suddenly changed forward by many (60 or more)
seconds this could cause the timeout to trigger prematurely/immediately
where ctdb would incorrectly think that more than 60 seconds had passed
since last successful communications and thus abort.

Instead of checking for one timeout occurring, only deem the recovery
daemon to be "down" and trigger a shutdown if communications have
timed out for three intervals in a row.
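
The "three intervals in a row" rule can be sketched as a small counter that resets on every successful contact. The structure and function names are hypothetical, and the sketch assumes the check is driven once per interval.

```c
#include <stdbool.h>

#define MAX_MISSED_INTERVALS 3   /* "three intervals in a row" */

/* Hypothetical monitor state kept by the main daemon. */
struct recd_monitor {
    unsigned missed;   /* consecutive intervals without contact */
};

/* Call once per check interval. Returns true when the main daemon should
 * shut down because the recovery daemon missed three checks in a row. */
bool recd_check(struct recd_monitor *m, bool responded)
{
    if (responded) {
        m->missed = 0;
        return false;
    }
    m->missed++;
    return m->missed >= MAX_MISSED_INTERVALS;
}
```

A sudden forward jump of the system clock makes at most one interval look missed, so it can no longer trigger a shutdown on its own.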

Ronnie Sahlberg [Mon, 15 Sep 2008 23:00:48 +0000 (09:00 +1000)]

fix a slow memory leak in the recovery daemon in the error paths for the
memdump function

Ronnie Sahlberg [Mon, 15 Sep 2008 21:55:57 +0000 (07:55 +1000)]

fix some slow memory leaks in the vacuuming handler in the recovery
daemon

Ronnie Sahlberg [Mon, 15 Sep 2008 20:50:28 +0000 (06:50 +1000)]

From Volker L
Fix a slow memory leak in the recovery daemon if there is a recovery
triggered during the public ip reassignment process

Ronnie Sahlberg [Sun, 14 Sep 2008 21:04:26 +0000 (07:04 +1000)]

updates to the precompiled documentation

Martin Schwenke [Fri, 12 Sep 2008 08:20:52 +0000 (18:20 +1000)]

Document the new descriptive node specifications.

Signed-off-by: Martin Schwenke <martin@meltin.net>

Martin Schwenke [Fri, 12 Sep 2008 06:55:18 +0000 (16:55 +1000)]

onnode changes. "ok" is an alias for "healthy", "con" is an alias for
"connected". Allow "rm" or "recmaster" to be a nodespec for the
recovery master. Better error handling for interaction with ctdb
client.

Signed-off-by: Martin Schwenke <martin@meltin.net>

Martin Schwenke [Fri, 12 Sep 2008 08:21:51 +0000 (18:21 +1000)]

Merge commit 'origin/master' into for-ronnie

Ronnie Sahlberg [Fri, 12 Sep 2008 02:06:53 +0000 (12:06 +1000)]

add a new ctdb command "ctdb recmaster"
this shows the node id of the current recmaster

Martin Schwenke [Fri, 12 Sep 2008 01:22:50 +0000 (11:22 +1000)]

Changes to onnode. Add "healthy" and "connected" as possible
nodespecs. Since we're now explicitly using bash, use local variables
when sensible.

Signed-off-by: Martin Schwenke <martin@meltin.net>

Martin Schwenke [Fri, 12 Sep 2008 01:26:25 +0000 (11:26 +1000)]

Merge commit 'origin/master' into for-ronnie

Martin Schwenke [Fri, 12 Sep 2008 00:36:15 +0000 (10:36 +1000)]

Minor documentation fixes.

Signed-off-by: Martin Schwenke <martin@meltin.net>

Ronnie Sahlberg [Tue, 9 Sep 2008 03:59:48 +0000 (13:59 +1000)]

lower the debuglevel when logging unknown idr in responses

Ronnie Sahlberg [Tue, 9 Sep 2008 03:55:31 +0000 (13:55 +1000)]

lower the debug level for when printing that the nodeflags have changed

Ronnie Sahlberg [Tue, 9 Sep 2008 03:44:46 +0000 (13:44 +1000)]

additional monitoring between the two daemons.

we currently only monitor that the daemons are running by kill(pid, 0)
and verifying that the domain socket between them is ok.

this is not sufficient since we can have a situation where the recovery
daemon is hung.

this new code monitors that the recovery daemon is operating.
if the recovery hangs, we log this and shut down the main daemon
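
For reference, the existence check mentioned above is the standard signal-0 probe. The helper name `process_exists` is hypothetical; `kill()` with signal 0 is standard POSIX behaviour.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/* True if a process with this pid exists. Signal 0 performs the existence
 * and permission checks without delivering any signal. */
bool process_exists(pid_t pid)
{
    if (kill(pid, 0) == 0) {
        return true;
    }
    return errno == EPERM;   /* exists, but we are not allowed to signal it */
}
```

This only shows that the process is present. A hung process still passes the probe, which is why an additional check that the recovery daemon is actually responding is needed.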

Ronnie Sahlberg [Sun, 7 Sep 2008 22:57:42 +0000 (08:57 +1000)]

From C Cowan.
Patch to make AIX compile with the new ipv6 additions.

Ronnie Sahlberg [Fri, 29 Aug 2008 02:26:02 +0000 (12:26 +1000)]

zero out the address structure to keep valgrind happy

Ronnie Sahlberg [Wed, 27 Aug 2008 00:26:34 +0000 (10:26 +1000)]

new version 1.0.58
