Ronnie Sahlberg [Wed, 29 Jul 2009 03:25:43 +0000 (13:25 +1000)]
initial part of new vacuuming patch.
create some new fields for ctdb_db and tunables
Martin Schwenke [Wed, 30 Sep 2009 11:21:56 +0000 (21:21 +1000)]
Minor fixes to 01.reclock eventscript.
test -z really needs its argument to be quoted. Simplified a status
test.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Ronnie Sahlberg [Mon, 28 Sep 2009 04:12:59 +0000 (14:12 +1000)]
change the reclock fail count to 19 monitor intervals before we shut down ctdbd
Ronnie Sahlberg [Mon, 28 Sep 2009 04:06:40 +0000 (14:06 +1000)]
add a new eventscript 01.reclock
if the reclock file has been set, then this script will test that the
reclock file can actually be accessed.
if the file does not exist, or if the attempt to stat the file hangs,
the node will be marked unhealthy after the third failed monitoring event
and after the tenth failure, ctdb itself will shut down.
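The escalation described in this commit message can be sketched as a small failure counter. This is only an illustrative model, not the eventscript itself: the names `reclock_action` and `run_monitor` are hypothetical, the thresholds are the ones quoted in this message (a later commit in this log changes the shutdown count), and resetting the counter on a successful check is an assumption.

```python
# Thresholds quoted in the commit message; hypothetical constant names.
UNHEALTHY_AFTER = 3
SHUTDOWN_AFTER = 10


def reclock_action(consecutive_failures: int) -> str:
    """Map a count of consecutive failed reclock checks to an action."""
    if consecutive_failures >= SHUTDOWN_AFTER:
        return "shutdown"
    if consecutive_failures >= UNHEALTHY_AFTER:
        return "unhealthy"
    return "ok"


def run_monitor(results):
    """Feed check results (True = reclock file reachable) through the
    counter and return the action chosen after each monitor event.

    Assumption: a successful check resets the failure counter.
    """
    failures = 0
    actions = []
    for ok in results:
        failures = 0 if ok else failures + 1
        actions.append(reclock_action(failures))
    return actions
```

For example, `run_monitor([False] * 3)` ends with the node marked unhealthy, while a single successful check in between starts the count over.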
Ronnie Sahlberg [Mon, 27 Jul 2009 03:10:32 +0000 (13:10 +1000)]
new version 1.0.82-7
Ronnie Sahlberg [Tue, 30 Jun 2009 02:17:05 +0000 (12:17 +1000)]
don't try sending a keepalive if the transport is down
Ronnie Sahlberg [Tue, 30 Jun 2009 02:16:13 +0000 (12:16 +1000)]
Don't even try allocating and sending a CALL packet if the transport is down
Ronnie Sahlberg [Tue, 30 Jun 2009 02:14:58 +0000 (12:14 +1000)]
failing a dmaster send due to the transport being down is fatal
Ronnie Sahlberg [Tue, 30 Jun 2009 02:13:15 +0000 (12:13 +1000)]
if we fail a dmaster migration due to the transport being down, then that is a fatal condition.
Ronnie Sahlberg [Tue, 30 Jun 2009 02:10:27 +0000 (12:10 +1000)]
don't try to send error packets if the transport is down
Ronnie Sahlberg [Tue, 30 Jun 2009 02:09:28 +0000 (12:09 +1000)]
don't even try to send a message from the main daemon if the transport is down
Ronnie Sahlberg [Tue, 30 Jun 2009 02:03:12 +0000 (12:03 +1000)]
Don't try to allocate and send packets if the transport is down
Ronnie Sahlberg [Tue, 30 Jun 2009 01:55:42 +0000 (11:55 +1000)]
don't even try to allocate a packet if the transport is down since it will fail
Ronnie Sahlberg [Tue, 23 Jun 2009 01:29:26 +0000 (11:29 +1000)]
rename 99.routing to 11.routing so that it is executed before the service scripts
Ronnie Sahlberg [Tue, 14 Jul 2009 00:54:05 +0000 (10:54 +1000)]
new version 1.0.82-6
Ronnie Sahlberg [Mon, 18 May 2009 22:55:42 +0000 (08:55 +1000)]
Change the loglevel of "registered tcp client for ..." to INFO
instead of ERR
Ronnie Sahlberg [Wed, 10 Jun 2009 00:35:32 +0000 (10:35 +1000)]
new version 1.0.82-5
Ronnie Sahlberg [Wed, 10 Jun 2009 00:28:47 +0000 (10:28 +1000)]
When we ban a node, only drop the IPs on the node being banned, not on every node
Ronnie Sahlberg [Tue, 9 Jun 2009 02:33:06 +0000 (12:33 +1000)]
new version 1.0.82-4
Ronnie Sahlberg [Tue, 9 Jun 2009 02:31:36 +0000 (12:31 +1000)]
don't remove the socket when the daemon stops. This can race if the
service is immediately restarted
Conflicts:
server/ctdb_daemon.c
Ronnie Sahlberg [Tue, 2 Jun 2009 09:44:51 +0000 (19:44 +1000)]
new version 1.0.82-3
Ronnie Sahlberg [Tue, 2 Jun 2009 09:43:47 +0000 (19:43 +1000)]
make ctdb statistics machine-readable
Ronnie Sahlberg [Tue, 2 Jun 2009 07:59:03 +0000 (17:59 +1000)]
new version 1.0.82-2
Ronnie Sahlberg [Tue, 2 Jun 2009 07:56:20 +0000 (17:56 +1000)]
Add -Y machine-readable output to ctdb listvars and ctdb getvar
Ronnie Sahlberg [Thu, 14 May 2009 00:33:25 +0000 (10:33 +1000)]
Track how long it takes to take out the recovery lock from both the main daemon and also from the recovery daemon.
Log this in "ctdb statistics".
Also add a variable "RecLockLatencyMs" that will log an error every time it takes longer than this to access the reclock file.
Ronnie Sahlberg [Wed, 13 May 2009 22:55:40 +0000 (08:55 +1000)]
new version 1.0.82
Ronnie Sahlberg [Wed, 13 May 2009 22:55:05 +0000 (08:55 +1000)]
use scope host when adding the interface to loopback so we don't respond to ARPs for this ip
Ronnie Sahlberg [Wed, 13 May 2009 22:12:48 +0000 (08:12 +1000)]
change the prefix NATGW_ to CTDB_NATGW_
Michael Adam [Tue, 12 May 2009 05:56:23 +0000 (07:56 +0200)]
ping pong: fix logic for mmap reads vs. preads
Michael
Michael Adam [Tue, 12 May 2009 20:59:35 +0000 (22:59 +0200)]
maketarball.sh: add GPL license header
Michael
Michael Adam [Tue, 12 May 2009 20:59:08 +0000 (22:59 +0200)]
makerpms.sh: add GPL license header
Michael
Michael Adam [Thu, 26 Mar 2009 18:03:03 +0000 (19:03 +0100)]
Remove generated binary files.
Noted by Mathieu Parent <math.parent@gmail.com>
Michael
Ronnie Sahlberg [Tue, 12 May 2009 08:21:26 +0000 (18:21 +1000)]
remove NATGW_PRIVATE_IFACE from the documentation since we do not need
it any more.
Ronnie Sahlberg [Tue, 12 May 2009 08:42:13 +0000 (18:42 +1000)]
assign the natgw address to loopback and not the private network so that natgw will still work even when public and private networks are one and the same
Ronnie Sahlberg [Tue, 12 May 2009 08:39:34 +0000 (18:39 +1000)]
add extra debug statements to the log to make it easier to see when a recovery daemon has hung due to the underlying filesystem hanging.
Ronnie Sahlberg [Tue, 12 May 2009 08:32:41 +0000 (18:32 +1000)]
check that a node is banned before trying to unban it.
Martin Schwenke [Fri, 3 Apr 2009 01:54:26 +0000 (12:54 +1100)]
51_ctdb_bench.sh now allows a 2% difference between positive and
negative. ctdb_bench.c checks to ensure the timer has advanced from 0
before dividing.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 21 Apr 2009 06:50:37 +0000 (16:50 +1000)]
Avoid floating point divide by 0 in ctdb_fetch.c's bench_fetch().
Signed-off-by: Martin Schwenke <martin@meltin.net>
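The guard these two benchmark fixes describe (check that the timer has advanced from 0 before dividing) can be sketched as below. The actual changes are in the C sources named above; `rate_per_second` is a hypothetical helper, and returning 0.0 when no time has elapsed is an assumption about what a sensible fallback looks like.

```python
def rate_per_second(count: int, elapsed_seconds: float) -> float:
    """Return count / elapsed_seconds, or 0.0 if the timer has not advanced.

    Assumption: 0.0 is an acceptable placeholder rate when no time has
    elapsed; the real benchmark code may handle this case differently.
    """
    if elapsed_seconds <= 0.0:
        return 0.0
    return count / elapsed_seconds
```

Without the check, a very short benchmark run whose timer still reads 0 would divide by zero when computing the rate.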
Martin Schwenke [Fri, 1 May 2009 07:40:45 +0000 (17:40 +1000)]
Bug fixes for tests: simple/12_ctdb_getdebug.sh and scripts/test_wrap.
simple/12_ctdb_getdebug.sh now recognises output with multi-digit node
numbers.
Sharing the ctdb directory via NFS and testing on a real cluster by
setting CTDB_TEST_REAL_CLUSTER didn't work by default. The fix is to
hack scripts/test_wrap so that it tries to find a valid bin directory
next to the directory it is in.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Ronnie Sahlberg [Mon, 11 May 2009 22:59:49 +0000 (08:59 +1000)]
From: Sumit Bose <sbose@redhat.com>
fix handling of AC_INIT
Martin Schwenke [Mon, 11 May 2009 04:43:17 +0000 (14:43 +1000)]
Fix lvsmaster and natgwlist nodespecs.
They both need to use a -Y option to ctdb and for natgwlist we only
want the 1st line.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 11 May 2009 04:14:11 +0000 (14:14 +1000)]
Updated onnode docs to reflect recent changes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 11 May 2009 03:39:31 +0000 (13:39 +1000)]
New lvs/lvsmaster and natgw/natgwlist nodespecs for onnode.
Some code re-factoring to implement this and to make it easy to
implement new ones. New simpler implementation of echo_nth() no
longer uses deleted get_nth() function.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Wed, 6 May 2009 03:17:34 +0000 (13:17 +1000)]
New option "-o <prefix>" saves stdout from each node to file <prefix>.<ip>.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Tue, 5 May 2009 06:02:30 +0000 (16:02 +1000)]
Use ctdb_fetch_lock rather than ctdb_call.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Mon, 11 May 2009 04:50:28 +0000 (14:50 +1000)]
41.httpd event script workaround for RHEL5-ism.
RHEL5 can SIGKILL httpd when stopping it, causing it to leak
semaphores. This means that eventually a node runs out of semaphores
and httpd can't be started. So, before we attempt to start httpd we
clean up any semaphores owned by apache. We also try to restart httpd
in the monitor event if httpd has gone away.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Ronnie Sahlberg [Mon, 11 May 2009 04:44:59 +0000 (14:44 +1000)]
Add a -Y machine-readable flag to "lvsmaster"
Ronnie Sahlberg [Mon, 11 May 2009 03:56:28 +0000 (13:56 +1000)]
in the "lvsmaster" command, return -1 if there is no lvsmaster
Ronnie Sahlberg [Fri, 8 May 2009 07:29:57 +0000 (17:29 +1000)]
new version 1.0.81
Ronnie Sahlberg [Wed, 6 May 2009 10:32:39 +0000 (20:32 +1000)]
From: Sumit Bose <sbose@redhat.com>
fix handling of AC_INIT and read version from ctdb.spec
Michael Adam [Tue, 5 May 2009 11:16:38 +0000 (13:16 +0200)]
ping_pong: add GPL comment header with Tridge's copyright
Michael
Michael Adam [Wed, 29 Apr 2009 22:35:55 +0000 (00:35 +0200)]
ping_pong: get pread/pwrite prototypes from unistd.h
by defining _XOPEN_SOURCE to be 500 before including headers
Michael
Michael Adam [Wed, 29 Apr 2009 16:03:03 +0000 (18:03 +0200)]
ping_pong: reduce a couple of prototype warnings
Michael
Michael Adam [Wed, 29 Apr 2009 15:58:17 +0000 (17:58 +0200)]
packaging: also package ping_pong
Michael
Michael Adam [Wed, 29 Apr 2009 15:57:43 +0000 (17:57 +0200)]
build: also build and install ping_pong
Michael
Michael Adam [Wed, 29 Apr 2009 15:50:38 +0000 (17:50 +0200)]
add tridge's ping_pong.c to the utils folder
Michael
Ronnie Sahlberg [Wed, 6 May 2009 00:29:07 +0000 (10:29 +1000)]
From Sumit Bose <sbose@redhat.com>
add more 64bit platforms to configure.ac and preserve cli settings
Andrew Tridgell [Tue, 5 May 2009 06:06:58 +0000 (16:06 +1000)]
added link to Michael's SambaXP papers
Andrew Tridgell [Tue, 5 May 2009 06:49:05 +0000 (16:49 +1000)]
allow pages in subdirs
Andrew Tridgell [Tue, 5 May 2009 06:52:24 +0000 (16:52 +1000)]
more subdir html support
Andrew Tridgell [Tue, 5 May 2009 22:18:21 +0000 (08:18 +1000)]
use less intrusive smbstatus call in periodic connections cleanup
root [Tue, 5 May 2009 06:33:21 +0000 (16:33 +1000)]
change the talloc hierarchy for the main transaction_start context and the individual transaction_all handles
root [Tue, 5 May 2009 21:32:25 +0000 (07:32 +1000)]
fixed a problem with clients disconnecting during a traverse
When a client (such as smbstatus) is killed, it may have outstanding
traverse children on remote nodes. We need to catch the client
disconnect in ctdbd and send a control to all nodes telling them to
kill those outstanding traverse children.
root [Fri, 1 May 2009 02:37:52 +0000 (12:37 +1000)]
new version 1.0.80
root [Fri, 1 May 2009 02:30:26 +0000 (12:30 +1000)]
when tracking the ctdb statistics, only decrement num_clients and pending_calls IFF the counter is >0
Otherwise there is a chance that the statistics are reset to zero after the counter has been incremented (a client connects), and when the client disconnects we decrement it to a negative number.
this is a purely cosmetic patch with no operational impact on ctdb
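The race this commit message describes, and the guarded decrement that avoids it, can be modelled in a few lines. This is a toy sketch, not ctdb code: the `Stats` class and its method names are hypothetical, and only the `num_clients` counter from the message is modelled.

```python
class Stats:
    """Toy model of a statistics block with a client counter."""

    def __init__(self):
        self.num_clients = 0

    def client_connected(self):
        self.num_clients += 1

    def client_disconnected(self):
        # Only decrement if the counter is above zero, so a reset that
        # happened while the client was connected cannot drive it negative.
        if self.num_clients > 0:
            self.num_clients -= 1

    def reset(self):
        self.num_clients = 0
```

With this guard, the sequence connect, reset, disconnect leaves `num_clients` at 0 rather than -1.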
root [Thu, 30 Apr 2009 15:18:27 +0000 (01:18 +1000)]
Add a new variable VerifyRecoveryLock which can be used to disable the test that the recovery daemon holds the lock properly when performing a recovery
Ronnie Sahlberg [Thu, 30 Apr 2009 07:38:30 +0000 (17:38 +1000)]
don't unconditionally kill/restart ctdb when given "service ctdb start". Only start ctdb if it is not already running, and print an error message otherwise
Ronnie Sahlberg [Sat, 25 Apr 2009 22:47:38 +0000 (08:47 +1000)]
we only need to have transaction nesting disabled when we start the new transaction for the recovery
Ronnie Sahlberg [Sat, 25 Apr 2009 22:42:54 +0000 (08:42 +1000)]
set the TDB_NO_NESTING flag for the tdb before we start a transaction from within recovery
Ronnie Sahlberg [Sat, 25 Apr 2009 22:38:37 +0000 (08:38 +1000)]
add TDB_NO_NESTING. When this flag is set tdb will not allow any nested transactions and tdb_transaction_start() will implicitly _cancel() any pending transactions before starting any new ones.
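The behaviour described for this flag can be illustrated with a toy model. It does not use or reproduce the tdb API: `ToyDb`, `no_nesting`, `depth` and `cancelled` are hypothetical names, and the model only tracks how many transactions are open, not their contents.

```python
class ToyDb:
    """Toy model of transaction nesting with an optional no-nesting flag."""

    def __init__(self, no_nesting: bool = False):
        self.no_nesting = no_nesting
        self.depth = 0      # currently open (possibly nested) transactions
        self.cancelled = 0  # pending transactions discarded by a new start

    def transaction_start(self):
        if self.no_nesting and self.depth > 0:
            # With the flag set, pending transactions are cancelled
            # implicitly instead of being nested into.
            self.cancelled += self.depth
            self.depth = 0
        self.depth += 1
```

Starting two transactions in a row on `ToyDb(no_nesting=True)` leaves one open transaction and one cancelled, whereas the default model ends up with a nesting depth of 2.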
Ronnie Sahlberg [Fri, 24 Apr 2009 08:23:48 +0000 (18:23 +1000)]
add a tunable RecoveryDropAllIPs to control how long a node that has been stuck in recovery waits before it yields all public addresses.
this now defaults to 60 seconds
This is useful if a split brain occurs due to network partitioning, since it makes sure that the "other half" of the cluster, which does not contain the recovery master, will eventually release all ips, thus avoiding a duplicate ip situation for the public addresses
Ronnie Sahlberg [Fri, 24 Apr 2009 08:09:51 +0000 (18:09 +1000)]
increase the loglevel for the message we print when we automatically release all ips when we have been in recovery for too long
Ronnie Sahlberg [Fri, 24 Apr 2009 04:41:21 +0000 (14:41 +1000)]
tweak some timeouts so that we do trigger a banning even if the control hangs/times out
Ronnie Sahlberg [Fri, 24 Apr 2009 03:58:32 +0000 (13:58 +1000)]
If we can not pull a database from a node during recovery, mark this node as a "culprit" so that it will eventually become banned.
Andrew Tridgell [Thu, 23 Apr 2009 01:35:42 +0000 (11:35 +1000)]
change shutdown level for ctdb to be 01
We want ctdb to shutdown first, as it manages many other
services. With the old level of 32 the NFS service would shutdown
first, and that would trigger ctdb to do a recovery. Then ctdb itself
would be shutdown a few seconds later, which causes a lot of error
messages in the other nodes' logs
Andrew Tridgell [Thu, 23 Apr 2009 01:00:16 +0000 (11:00 +1000)]
Merge commit 'ronnie/master'
Ronnie Sahlberg [Wed, 8 Apr 2009 02:56:52 +0000 (12:56 +1000)]
new version 1.0.79
Ronnie Sahlberg [Wed, 8 Apr 2009 02:49:28 +0000 (12:49 +1000)]
create a function "remove_ip" which can be used from scripts to remove a single ip from an interface.
use this function from the natgw eventscript
Ronnie Sahlberg [Wed, 8 Apr 2009 00:45:00 +0000 (10:45 +1000)]
set libdir to ../lib64 on x86-64 platforms
Ronnie Sahlberg [Tue, 7 Apr 2009 23:34:20 +0000 (09:34 +1000)]
install ctdb.pc from the RPM
Ronnie Sahlberg [Tue, 7 Apr 2009 23:21:11 +0000 (09:21 +1000)]
From Mathieu Parent <math.parent@gmail.com>
Install the pkgconfig file
Mathieu Parent [Tue, 7 Apr 2009 23:14:20 +0000 (09:14 +1000)]
Ronnie Sahlberg [Tue, 7 Apr 2009 22:48:55 +0000 (08:48 +1000)]
install /etc/ctdb/notify.sh as executable.
this addresses bug 6250
Andrew Tridgell [Tue, 7 Apr 2009 07:07:41 +0000 (17:07 +1000)]
Merge commit 'ronnie/master'
Ronnie Sahlberg [Mon, 6 Apr 2009 04:03:09 +0000 (14:03 +1000)]
we only need to switch into client mode from the eventscript child if we are running the monitor event
Ronnie Sahlberg [Mon, 6 Apr 2009 04:00:41 +0000 (14:00 +1000)]
increase the listen queue. Now that the eventscripts may become clients and connect back to the server we do get a lot more concurrent connection attempts (takeip/releaseip are performed in parallel)
Ronnie Sahlberg [Mon, 6 Apr 2009 03:16:36 +0000 (13:16 +1000)]
use _exit() and not exit() when we terminate a failed eventscript child process
Ronnie Sahlberg [Mon, 6 Apr 2009 02:00:22 +0000 (12:00 +1000)]
We don't need to verify the nodemap on remote nodes that are banned
Ronnie Sahlberg [Thu, 2 Apr 2009 03:50:43 +0000 (14:50 +1100)]
if we can't pull the remote nodemap off a node we should mark it as a culprit so it eventually becomes banned.
Ronnie Sahlberg [Wed, 1 Apr 2009 06:21:38 +0000 (17:21 +1100)]
Change the (dodgy) seqnumfrequency variable to have ms resolution instead of second resolution.
Rename the variable to SeqnumInterval because
1, it is an interval and not a 1/interval unit
2, so that we catch when people use the old variable and can update the sysconfig file instead of silently changing the semantics of this variable
this is a real dodgy variable
Ronnie Sahlberg [Wed, 1 Apr 2009 06:13:48 +0000 (17:13 +1100)]
remove a prototype for a function no longer used
Ronnie Sahlberg [Tue, 31 Mar 2009 09:04:45 +0000 (20:04 +1100)]
new release 1.0.78
Ronnie Sahlberg [Tue, 31 Mar 2009 09:00:00 +0000 (20:00 +1100)]
we should also install the 11.natgw eventscript if we want to be able to use it
Ronnie Sahlberg [Tue, 31 Mar 2009 03:38:52 +0000 (14:38 +1100)]
install a default /etc/ctdb/notify.sh script as example on how to use
snmptrap/email to notify that a node has changed health status
Ronnie Sahlberg [Tue, 31 Mar 2009 03:23:31 +0000 (14:23 +1100)]
add a mechanism where the ctdb daemon will run a user-controlled script when the node status changes to/from UNHEALTHY state.
This would allow a sysadmin to set up ctdb to send an email/snmptrap/... when the status of the node changes.
Ronnie Sahlberg [Tue, 31 Mar 2009 00:42:10 +0000 (11:42 +1100)]
new version 1.0.77
Ronnie Sahlberg [Tue, 31 Mar 2009 00:33:28 +0000 (11:33 +1100)]
we must also try to set the routes when we release an ip since during the release/10.interfaces there can actually be a window where the kernel decides to remove all addresses (before we manually add them back in 10.interfaces) during which the kernel may also decide to delete all routes since there are no gateways reachable through this interface anymore.
Ronnie Sahlberg [Wed, 25 Mar 2009 03:52:08 +0000 (14:52 +1100)]
new version 1.0.76
Ronnie Sahlberg [Wed, 25 Mar 2009 03:46:05 +0000 (14:46 +1100)]
change the ctdb command table to allow us to describe commands which can be run independently of the ctdb daemon.
create a new debugging command xpnn which discovers the pnn of the local node and which works even if the local daemon is not running
Ronnie Sahlberg [Wed, 25 Mar 2009 02:46:41 +0000 (13:46 +1100)]
update the documentation for NATGW to reflect that you can now use
multiple natgw groups in one cluster