Andrew Tridgell [Thu, 9 Oct 2008 07:45:12 +0000 (18:45 +1100)]
added some more gpfs commands per-filesystem
Andrew Tridgell [Tue, 30 Sep 2008 14:16:17 +0000 (07:16 -0700)]
The author of the upstream code asked for this code to be GPLv2+ not GPLv3
Andrew Tridgell [Tue, 30 Sep 2008 14:09:06 +0000 (07:09 -0700)]
merged a bugfix for the idtree code from the Linux kernel. This
matches commit
7aae6dd80e265aa9402ed507caaff4a5dba55069 in the kernel.
Many thanks to Jim Houston for pointing out this fix to us
Ronnie Sahlberg [Mon, 22 Sep 2008 15:38:28 +0000 (01:38 +1000)]
Check that a database exists first before we dump its content (and
implicitly also create it) using 'ctdb catdb'
Andrew Tridgell [Wed, 17 Sep 2008 11:00:04 +0000 (21:00 +1000)]
expanded ctdb_diagnostics based on recent experience
Ronnie Sahlberg [Wed, 17 Sep 2008 04:24:12 +0000 (14:24 +1000)]
use the correct tunable failcount not timeout
Ronnie Sahlberg [Wed, 17 Sep 2008 04:17:41 +0000 (14:17 +1000)]
The ctdb daemon keeps track of whether the recovery process is running
correctly by measuring how long it was since the last successful
communication with the recovery daemon was recorded.
After a certain timeout the ctdb daemon would deem the recovery daemon
as inoperable and shut down.
If the system clock is suddenly changed forward by many (60 or more)
seconds this could cause the timeout to trigger prematurely/immediately
where ctdb would incorrectly think that more than 60 seconds had passed
since last successful communications and thus abort.
Instead of checking for one timeout occurring, only deem the recovery
daemon to be "down" and trigger a shutdown if communications have
timed out for three intervals in a row.
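The "three intervals in a row" rule described above can be sketched as follows. This is an illustration only, not the ctdb code: the class name RecoveryWatchdog, its methods, and the default values (60 second interval, 3 misses) are assumptions chosen to match the description.

```python
class RecoveryWatchdog:
    """Hypothetical sketch: deem the peer down only after several
    consecutive check intervals without communication."""

    def __init__(self, now, interval=60.0, max_misses=3):
        self.interval = interval      # assumed timeout per check, in seconds
        self.max_misses = max_misses  # assumed fail count before shutdown
        self.last_seen = now
        self.misses = 0

    def heard_from_peer(self, now):
        # A successful communication resets the consecutive-miss counter.
        self.last_seen = now
        self.misses = 0

    def tick(self, now):
        """Run once per check interval. Returns True if the peer should
        be deemed down."""
        if now - self.last_seen >= self.interval:
            self.misses += 1
        else:
            self.misses = 0
        return self.misses >= self.max_misses
```

With this shape a single forward jump of the system clock produces at most one miss, and the next successful communication clears it again.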
Ronnie Sahlberg [Mon, 15 Sep 2008 23:00:48 +0000 (09:00 +1000)]
fix a slow memory leak in the recovery daemon in the error paths for the
memdump function
Ronnie Sahlberg [Mon, 15 Sep 2008 21:55:57 +0000 (07:55 +1000)]
fix some slow memory leaks in the vacuuming handler in the recovery
daemon
Ronnie Sahlberg [Mon, 15 Sep 2008 20:50:28 +0000 (06:50 +1000)]
From Volker L
Fix a slow memory leak in the recovery daemon if there is a recovery
triggered during the public ip reassignment process
Ronnie Sahlberg [Sun, 14 Sep 2008 21:04:26 +0000 (07:04 +1000)]
updates to the precompiled documentation
Martin Schwenke [Fri, 12 Sep 2008 08:20:52 +0000 (18:20 +1000)]
Document the new descriptive node specifications.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 12 Sep 2008 06:55:18 +0000 (16:55 +1000)]
onnode changes. "ok" is an alias for "healthy", "con" is an alias for
"connected". Allow "rm" or "recmaster" to be a nodespec for the
recovery master. Better error handling for interaction with ctdb
client.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 12 Sep 2008 08:21:51 +0000 (18:21 +1000)]
Merge commit 'origin/master' into for-ronnie
Ronnie Sahlberg [Fri, 12 Sep 2008 02:06:53 +0000 (12:06 +1000)]
add a new ctdb command "ctdb recmaster"
this shows the node id of the current recmaster
Martin Schwenke [Fri, 12 Sep 2008 01:22:50 +0000 (11:22 +1000)]
Changes to onnode. Add "healthy" and "connected" as possible
nodespecs. Since we're now explicitly using bash, use local variables
when sensible.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Martin Schwenke [Fri, 12 Sep 2008 01:26:25 +0000 (11:26 +1000)]
Merge commit 'origin/master' into for-ronnie
Martin Schwenke [Fri, 12 Sep 2008 00:36:15 +0000 (10:36 +1000)]
Minor documentation fixes.
Signed-off-by: Martin Schwenke <martin@meltin.net>
Ronnie Sahlberg [Tue, 9 Sep 2008 03:59:48 +0000 (13:59 +1000)]
lower the debuglevel when logging unknown idr in responses
Ronnie Sahlberg [Tue, 9 Sep 2008 03:55:31 +0000 (13:55 +1000)]
lower the debug level for when printing that the nodeflags have changed
Ronnie Sahlberg [Tue, 9 Sep 2008 03:44:46 +0000 (13:44 +1000)]
additional monitoring between the two daemons.
we currently only monitor that the daemons are running by kill(pid, 0)
and verifying that the domain socket between them is ok.
this is not sufficient since we can have a situation where the recovery
daemon is hung.
this new code monitors that the recovery daemon is operating.
if the recovery hangs, we log this and shut down the main daemon
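For reference, a signal 0 existence check looks roughly like this on a POSIX system. The helper name process_exists is hypothetical. It also shows why the check alone is not sufficient: a hung process still exists, so the check still succeeds.

```python
import os

def process_exists(pid):
    """Hypothetical helper: True if a process with this pid exists.
    Signal 0 performs the permission and existence checks without
    delivering a signal, so it says nothing about whether the process
    is making progress."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process
    except PermissionError:
        return True    # process exists but belongs to another user
    return True
```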
Ronnie Sahlberg [Sun, 7 Sep 2008 22:57:42 +0000 (08:57 +1000)]
From C Cowan.
Patch to make AIX compile with the new ipv6 additions.
Ronnie Sahlberg [Fri, 29 Aug 2008 02:26:02 +0000 (12:26 +1000)]
zero out the address structure to keep valgrind happy
Ronnie Sahlberg [Wed, 27 Aug 2008 00:26:34 +0000 (10:26 +1000)]
new version 1.0.58
Ronnie Sahlberg [Wed, 27 Aug 2008 00:24:35 +0000 (10:24 +1000)]
rename ctdb_tcp_client back to the original name ctdb_control_tcp
Ronnie Sahlberg [Mon, 25 Aug 2008 00:13:18 +0000 (10:13 +1000)]
From Abhijith Das <adas@redhat.com>:
Fixup the initscript so it passes rpm-lint
Ronnie Sahlberg [Mon, 25 Aug 2008 00:03:16 +0000 (10:03 +1000)]
Add a "reload" option to the initscript.
Ronnie Sahlberg [Sun, 24 Aug 2008 23:41:08 +0000 (09:41 +1000)]
add a link to my webpage
Ronnie Sahlberg [Sun, 24 Aug 2008 22:52:29 +0000 (08:52 +1000)]
version 1.0.57 : initial ipv6 support
Ronnie Sahlberg [Thu, 21 Aug 2008 23:25:47 +0000 (09:25 +1000)]
Do not fail the takeip event if the "ip addr add ..." command failed.
Let the event complete successfully. The local recovery daemon will check that we have the address and reissue takeip otherwise.
There are several reasons why "ip addr add" can fail. One is a misconfiguration;
another is that for ipv6 the stack is a lot more picky than for ipv4. For example, this WILL fail in ipv6 if there is a duplicate ip address on the network.
Thus this check could cause rolling recoveries, which is why it has to go.
Ronnie Sahlberg [Thu, 21 Aug 2008 23:09:08 +0000 (09:09 +1000)]
when we collect all ip addresses and sort them for the "ctdb ip -n all" output we must look at more than just the first 4 bytes of the sockaddr address or ipv6 won't work
Ronnie Sahlberg [Wed, 20 Aug 2008 02:50:50 +0000 (12:50 +1000)]
When we harvest all tcp connections to kill off after a takeip/releaseip event we must also harvest the ipv4 connections which may be presented in ::ffff:xxxx:xxxx form by netstat
Ronnie Sahlberg [Wed, 20 Aug 2008 02:02:54 +0000 (12:02 +1000)]
we must canonicalize the sockaddr structures in killtcp so that we do the necessary downgrade if required
Ronnie Sahlberg [Wed, 20 Aug 2008 01:58:27 +0000 (11:58 +1000)]
make the function to canonicalize a sockaddr structure public
Ronnie Sahlberg [Wed, 20 Aug 2008 01:52:36 +0000 (11:52 +1000)]
when we compare ip addresses in ctdb_same_ip we must first canonicalize the addresses so that we realize that 127.0.0.1:22 is really the same thing as ::ffff:127.0.0.1:22
Downgrade all AF_INET6 ::ffff:xxxx:xxxx sockaddresses into AF_INET ones
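A minimal sketch of the downgrade described above, using Python's standard ipaddress module rather than raw sockaddr structures. The function names canonicalize and same_ip are hypothetical and only mirror the idea behind ctdb_same_ip; ports are left out.

```python
import ipaddress

def canonicalize(addr):
    """Return an IPv4 address object for IPv4-mapped IPv6 input
    (::ffff:a.b.c.d); return every other address unchanged."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6 and ip.ipv4_mapped is not None:
        return ip.ipv4_mapped
    return ip

def same_ip(a, b):
    # Compare only after both sides are in canonical form.
    return canonicalize(a) == canonicalize(b)
```

Without the canonicalization step, "127.0.0.1" and "::ffff:127.0.0.1" would compare as different addresses because they belong to different address families.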
Ronnie Sahlberg [Tue, 19 Aug 2008 23:47:00 +0000 (09:47 +1000)]
update the socketkiller in the eventscripts to be able to handle ipv6
Ronnie Sahlberg [Tue, 19 Aug 2008 23:23:31 +0000 (09:23 +1000)]
fix a bug in the tcp socketkiller for ipv6
Ronnie Sahlberg [Tue, 19 Aug 2008 08:24:08 +0000 (18:24 +1000)]
fix the ipv6 checksum calculation for pseudoheader so that it actually works
add support to send ipv6 "gratuitous arp" aka neighbor solicitation packets from ctdb
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
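For readers unfamiliar with the pseudoheader, the ICMPv6 checksum covers the source address, destination address, upper-layer length, three zero bytes and the next-header value (58 for ICMPv6), followed by the message itself. The sketch below illustrates that calculation; the function names are hypothetical and this is not the ctdb implementation.

```python
import ipaddress
import struct

def internet_checksum(data):
    """16-bit ones' complement of the ones' complement sum."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF

def icmpv6_checksum(src, dst, message):
    """Checksum over the IPv6 pseudoheader plus the ICMPv6 message.
    The checksum field inside `message` must be zero when computing."""
    pseudo = (ipaddress.IPv6Address(src).packed
              + ipaddress.IPv6Address(dst).packed
              + struct.pack("!I3xB", len(message), 58))
    return internet_checksum(pseudo + message)
```

A message whose checksum field has been filled in correctly yields 0 when run through the same function again, which is a convenient self-check.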
Ronnie Sahlberg [Tue, 19 Aug 2008 04:58:57 +0000 (14:58 +1000)]
remove a file we don't need
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Ronnie Sahlberg [Tue, 19 Aug 2008 04:58:29 +0000 (14:58 +1000)]
initial ipv6 patch
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Ronnie Sahlberg [Thu, 14 Aug 2008 00:57:08 +0000 (10:57 +1000)]
use a local tdb_traverse instead of a ctdb_pulldb to lessen the impact on the system while performing a database backup
Ronnie Sahlberg [Wed, 13 Aug 2008 23:52:23 +0000 (09:52 +1000)]
only freeze the local node when doing a backup and not the entire cluster
Ronnie Sahlberg [Wed, 13 Aug 2008 22:36:39 +0000 (08:36 +1000)]
store the database name, not the backup filename in the database header
Ronnie Sahlberg [Wed, 13 Aug 2008 22:35:19 +0000 (08:35 +1000)]
Encode a file version number in the database backup header
Encode the database name in the header so we don't need to provide the database
name when doing a restore
Encode a timestamp in the header telling us when the backup was created
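To illustrate the idea of a self-describing backup header, here is a sketch with a made-up layout: a 32-bit version, a 64-bit timestamp and a 64-byte NUL-padded database name. The field order, sizes and function names are assumptions for illustration and do not reflect the actual ctdb backup file format.

```python
import struct

HEADER_FMT = "!IQ64s"   # hypothetical layout: version, timestamp, db name
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def pack_header(db_name, timestamp, version=1):
    name = db_name.encode("utf-8")
    if len(name) >= 64:
        raise ValueError("database name too long for header")
    # struct pads the 64s field with NUL bytes.
    return struct.pack(HEADER_FMT, version, timestamp, name)

def unpack_header(blob):
    version, timestamp, raw = struct.unpack(HEADER_FMT, blob[:HEADER_SIZE])
    return version, timestamp, raw.rstrip(b"\x00").decode("utf-8")
```

Because the name travels inside the file, a restore tool can read the header first and decide which database to write to without being told on the command line.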
Ronnie Sahlberg [Wed, 13 Aug 2008 12:03:29 +0000 (22:03 +1000)]
Add two new ctdb commands :
ctdb backupdb : which will copy a database out from ctdb and write it to a file
ctdb restoredb : which will read a database backup from a file and write it into ctdb
Andrew Tridgell [Mon, 11 Aug 2008 14:10:48 +0000 (00:10 +1000)]
fixed merge
Andrew Tridgell [Mon, 11 Aug 2008 13:52:46 +0000 (23:52 +1000)]
up release version
Ronnie Sahlberg [Mon, 11 Aug 2008 13:50:42 +0000 (23:50 +1000)]
new version 1.0.56
Andrew Tridgell [Mon, 11 Aug 2008 13:33:46 +0000 (23:33 +1000)]
Merge commit 'ronnie/master'
Andrew Tridgell [Mon, 11 Aug 2008 13:33:05 +0000 (23:33 +1000)]
fixed a memory leak in the recovery daemon
thanks to vl for spotting this
Ronnie Sahlberg [Mon, 11 Aug 2008 00:36:38 +0000 (10:36 +1000)]
fix the date so rpmbuild works
Ronnie Sahlberg [Mon, 11 Aug 2008 00:33:22 +0000 (10:33 +1000)]
new version 1.0.55
Andrew Tridgell [Fri, 8 Aug 2008 12:06:39 +0000 (22:06 +1000)]
fixed send of release IP message
Ronnie Sahlberg [Fri, 8 Aug 2008 03:11:07 +0000 (13:11 +1000)]
Merge git://git.samba.org/tridge/ctdb
Andrew Tridgell [Fri, 8 Aug 2008 03:11:41 +0000 (13:11 +1000)]
added retry handling in client
Andrew Tridgell [Fri, 8 Aug 2008 03:11:28 +0000 (13:11 +1000)]
added a new control CTDB_CONTROL_TRANS2_COMMIT_RETRY so we can tell
the difference between an initial commit attempt and a retry, which
allows us to get the persistent updates counter right for retries
Andrew Tridgell [Fri, 8 Aug 2008 01:04:21 +0000 (11:04 +1000)]
imported failure handling from dbwrap_ctdb.c
Ronnie Sahlberg [Fri, 8 Aug 2008 00:59:40 +0000 (10:59 +1000)]
Merge git://git.samba.org/tridge/ctdb
Andrew Tridgell [Fri, 8 Aug 2008 00:15:23 +0000 (10:15 +1000)]
save writing the same data twice
Ronnie Sahlberg [Fri, 8 Aug 2008 00:01:20 +0000 (10:01 +1000)]
new version 1.0.54
Andrew Tridgell [Fri, 8 Aug 2008 00:00:33 +0000 (10:00 +1000)]
up release number
Andrew Tridgell [Thu, 7 Aug 2008 23:58:49 +0000 (09:58 +1000)]
return a more detailed error code from a trans2 commit error
Andrew Tridgell [Thu, 7 Aug 2008 14:48:19 +0000 (00:48 +1000)]
Merge commit 'ronnie/1.0.53'
Andrew Tridgell [Thu, 7 Aug 2008 14:44:33 +0000 (00:44 +1000)]
fixed a looping error bug with the new transactions code
Ronnie Sahlberg [Thu, 7 Aug 2008 08:57:24 +0000 (18:57 +1000)]
new version 1.0.53
this adds completely new transaction code for persistent databases
Ronnie Sahlberg [Thu, 7 Aug 2008 08:50:48 +0000 (18:50 +1000)]
Merge git://git.samba.org/tridge/ctdb
Andrew Tridgell [Thu, 7 Aug 2008 03:34:18 +0000 (13:34 +1000)]
cover some corner cases where the persistent database could become
inconsistent
Ronnie Sahlberg [Wed, 6 Aug 2008 01:52:26 +0000 (11:52 +1000)]
remove the reclock file we store pnn counts in.
This file creates additional locking stress on the backend filesystem and we may not need it anyway.
Ronnie Sahlberg [Tue, 5 Aug 2008 23:17:41 +0000 (09:17 +1000)]
Merge git://git.samba.org/tridge/ctdb
Ronnie Sahlberg [Mon, 4 Aug 2008 04:58:52 +0000 (14:58 +1000)]
New version 1.0.52
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Andrew Tridgell [Fri, 1 Aug 2008 04:23:15 +0000 (14:23 +1000)]
we need an additional gratuitous arp before the NFS tickles
Andrew Tridgell [Fri, 1 Aug 2008 04:17:50 +0000 (14:17 +1000)]
ensure we use killtcp on non-NFS/non-CIFS ports for faster failover of
other protocols
Andrew Tridgell [Mon, 4 Aug 2008 04:51:51 +0000 (14:51 +1000)]
implemented replayable transactions in ctdb to prevent deadlock
Andrew Tridgell [Fri, 1 Aug 2008 04:23:15 +0000 (14:23 +1000)]
we need an additional gratuitous arp before the NFS tickles
Andrew Tridgell [Fri, 1 Aug 2008 04:17:50 +0000 (14:17 +1000)]
ensure we use killtcp on non-NFS/non-CIFS ports for faster failover of
other protocols
Andrew Tridgell [Wed, 30 Jul 2008 09:59:54 +0000 (19:59 +1000)]
fixed some warnings
Andrew Tridgell [Wed, 30 Jul 2008 09:59:42 +0000 (19:59 +1000)]
fixed a warning
Andrew Tridgell [Wed, 30 Jul 2008 09:59:34 +0000 (19:59 +1000)]
cleanup of the old persistent db test
Andrew Tridgell [Wed, 30 Jul 2008 09:59:18 +0000 (19:59 +1000)]
renamed the pulldb structure to a ctdb_marshall_buffer
Andrew Tridgell [Wed, 30 Jul 2008 09:58:49 +0000 (19:58 +1000)]
make sure we honor the TDB_NOSYNC flag from clients in the server
Andrew Tridgell [Wed, 30 Jul 2008 09:58:27 +0000 (19:58 +1000)]
new prototypes
Andrew Tridgell [Wed, 30 Jul 2008 09:58:17 +0000 (19:58 +1000)]
added marshalling helper functions
Andrew Tridgell [Wed, 30 Jul 2008 09:58:03 +0000 (19:58 +1000)]
we don't need ctdb_ltdb_persistent_store() any more
Andrew Tridgell [Wed, 30 Jul 2008 09:57:48 +0000 (19:57 +1000)]
added client side functions for new transaction code
Andrew Tridgell [Wed, 30 Jul 2008 09:57:00 +0000 (19:57 +1000)]
added new multi-record transaction commit code
Andrew Tridgell [Wed, 30 Jul 2008 09:55:54 +0000 (19:55 +1000)]
added a new persistent transaction test program
Andrew Tridgell [Wed, 30 Jul 2008 04:24:56 +0000 (14:24 +1000)]
rename the structure we use for marshalling multiple records
Andrew Tridgell [Wed, 30 Jul 2008 03:21:02 +0000 (13:21 +1000)]
cleanup on SIGINT
Andrew Tridgell [Wed, 30 Jul 2008 03:20:47 +0000 (13:20 +1000)]
- cleanup persistent db at start
- catch SIGINT and kill daemons
Andrew Tridgell [Wed, 30 Jul 2008 03:20:24 +0000 (13:20 +1000)]
- show pids during test
- don't use first_time, as it is not safe for multiple
clients on a node
Ronnie Sahlberg [Mon, 28 Jul 2008 07:11:15 +0000 (17:11 +1000)]
new version 1.0.51
Ronnie Sahlberg [Mon, 28 Jul 2008 07:07:44 +0000 (17:07 +1000)]
From Alexander Saupp.
If we use vlan tagging and bonding we must strip the vlan part off the name
so we can check the main bond device for status.
I.e. check bond0 instead of bond0.<VLANTAG>
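The name handling can be sketched like this. The helper name base_interface is hypothetical, and the sketch assumes the VLAN tag is a numeric suffix after the first dot, as in bond0.<VLANTAG>.

```python
def base_interface(iface):
    """Strip a numeric VLAN suffix: 'bond0.100' -> 'bond0'.
    Names without such a suffix are returned unchanged."""
    name, sep, tag = iface.partition(".")
    if sep and tag.isdigit():
        return name
    return iface
```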
Andrew Tridgell [Wed, 23 Jul 2008 05:36:23 +0000 (15:36 +1000)]
run the testparm commands in 50.samba in the background, only running
in the foreground if something fails
Andrew Tridgell [Wed, 23 Jul 2008 05:35:46 +0000 (15:35 +1000)]
allow for probing of directories without raising an error
Andrew Tridgell [Wed, 23 Jul 2008 05:25:52 +0000 (15:25 +1000)]
fixed buffering in ctdb logging code to handle multiple lines
correctly
Ronnie Sahlberg [Mon, 21 Jul 2008 23:07:42 +0000 (09:07 +1000)]
From Michael Adams,
change one element from private to private_data
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Ronnie Sahlberg [Fri, 18 Jul 2008 03:49:05 +0000 (13:49 +1000)]
new version 1.0.50
Ronnie Sahlberg [Fri, 18 Jul 2008 03:42:39 +0000 (13:42 +1000)]
Merge git://git.samba.org/tridge/ctdb
Andrew Tridgell [Fri, 18 Jul 2008 03:46:01 +0000 (13:46 +1000)]
fixed a bug where we would look for a signal past the end of the
signal arrays
This could cause the events code to get into a loop chewing CPU
Ronnie Sahlberg [Fri, 18 Jul 2008 02:07:25 +0000 (12:07 +1000)]
if a new node enters the cluster, that node will already be frozen at start
but the rest of the nodes are not frozen.
at this stage an election is called by the new node.
Since in this case the nodes are not frozen, we cannot modify the recmaster
of the nodes so it is expected that this control would fail.
Add a boolean to send_election_request() to make it not
try to set the recmaster locally for the case where we are in an election phase
while not frozen.