amitay/ctdb.git
10 years agoeventscripts: Become unhealthy faster on nfsd failure
Martin Schwenke [Mon, 12 Aug 2013 01:36:25 +0000 (11:36 +1000)]
eventscripts: Become unhealthy faster on nfsd failure

Anecdotal evidence suggests that most nfsd RPC check failures are due
to cluster filesystem or storage problems.  Apparently these are rarely
helped by attempting to restart the NFS service because the restart
tends to hang.

Fail after 2 nfsd RPC check failures, instead of waiting for 6
failures.  Restart on every 10th failure to try to bring the node back
to good health.

Update unit tests to match.

Signed-off-by: Martin Schwenke <martin@meltin.net>
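
The thresholds described in this commit can be sketched as follows.  This is
a Python illustration only, not the eventscript shell code; the function name
and parameter names are hypothetical, and only the numbers 2 and 10 come from
the commit message.

```python
def nfsd_check_actions(failures, unhealthy_after=2, restart_every=10):
    """Decide what to do after `failures` consecutive nfsd RPC check failures.

    Returns a tuple (unhealthy, restart).  The thresholds mirror the commit
    message; the function itself is hypothetical and not part of CTDB.
    """
    # The node is reported unhealthy from the 2nd consecutive failure onwards.
    unhealthy = failures >= unhealthy_after
    # A restart is only attempted on every 10th failure (10, 20, 30, ...).
    restart = failures > 0 and failures % restart_every == 0
    return unhealthy, restart


for n in (1, 2, 9, 10, 20):
    print(n, nfsd_check_actions(n))
```

The point of the split is that the unhealthy state is reported quickly, while
the restart, which tends to hang, is attempted only occasionally.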
10 years agotools/ctdb: Increase default control timeout to 10 seconds
Martin Schwenke [Fri, 9 Aug 2013 01:56:29 +0000 (11:56 +1000)]
tools/ctdb: Increase default control timeout to 10 seconds

The current 3 second timeout is arbitrary and users trip over it
sometimes.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: Improve message logged when a counter hits a limit
Martin Schwenke [Thu, 8 Aug 2013 06:02:44 +0000 (16:02 +1000)]
eventscripts: Improve message logged when a counter hits a limit

It should print the actual number of consecutive failures rather than
the limit.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: Print a message when waiting for TCP connections to be killed
Martin Schwenke [Tue, 6 Aug 2013 02:42:13 +0000 (12:42 +1000)]
eventscripts: Print a message when waiting for TCP connections to be killed

This makes the gaps in the logs more obvious.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: New configuration variable $CTDB_RPCINFO_LOCALHOST
Martin Schwenke [Mon, 5 Aug 2013 05:12:14 +0000 (15:12 +1000)]
eventscripts: New configuration variable $CTDB_RPCINFO_LOCALHOST

Passing "localhost" to the rpcinfo command adds overhead, such as
reading /etc/services multiple times.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agoeventscripts: Add modulo (%) operator to ctdb_check_counter()
Martin Schwenke [Fri, 2 Aug 2013 05:18:47 +0000 (15:18 +1000)]
eventscripts: Add modulo (%) operator to ctdb_check_counter()

Also add it to the corresponding eventscript unit test infrastructure.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: Separate out RPC service restart code
Martin Schwenke [Fri, 2 Aug 2013 06:05:46 +0000 (16:05 +1000)]
eventscripts: Separate out RPC service restart code

While doing this:

* Explicitly assign RPC program and version information in
  _nfs_check_rpc_common().  This is more lines of code but is easier
  to read.

* Don't print the options when starting a service.  Trying to print it
  makes the code messy for little benefit.

  Update the eventscript unit testing code and a Ganesha test to
  reflect this.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agotests/eventscripts: Override background_with_logging(), just prepend "&"
Martin Schwenke [Fri, 2 Aug 2013 06:03:42 +0000 (16:03 +1000)]
tests/eventscripts: Override background_with_logging(), just prepend "&"

That is, output that goes through background_with_logging() just gets
"&" prepended to each line.  This is cleaner than having the tests
grovel through logs.

Update some 49.winbind/50.samba tests to deal with this.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: Remove support for RPC service 'q' and 's' restart flags
Martin Schwenke [Tue, 30 Jul 2013 06:24:24 +0000 (16:24 +1000)]
eventscripts: Remove support for RPC service 'q' and 's' restart flags

They're hard to maintain and provide very little benefit.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: When restarting the nfslock service only show output of start
Martin Schwenke [Tue, 30 Jul 2013 06:21:36 +0000 (16:21 +1000)]
eventscripts: When restarting the nfslock service only show output of start

That is, /dev/null the "stop" output.  This is consistent with the way
CTDB generally deals with the output when stopping a service.

It also makes updating the eventscript unit tests easier.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agotests/simple: Unreachable node test should wait for recovery to complete
Martin Schwenke [Mon, 29 Jul 2013 05:27:24 +0000 (15:27 +1000)]
tests/simple: Unreachable node test should wait for recovery to complete

This should minimise the chances of a control timing out.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agotests/simple: Fix the missing IP test
Martin Schwenke [Mon, 29 Jul 2013 05:09:23 +0000 (15:09 +1000)]
tests/simple: Fix the missing IP test

Update the missing IP test to wait until restarts are complete.
Otherwise a service restart can collide with the following monitor
event and cause chaos.

Also, do not disable 10.interface until it matters.  Disabling it too
early can cause even more chaos if something goes wrong with the
monitor step.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agorecoverd: Use TDB_INCOMPATIBLE_HASH when creating volatile databases
Amitay Isaacs [Tue, 13 Aug 2013 04:02:46 +0000 (14:02 +1000)]
recoverd: Use TDB_INCOMPATIBLE_HASH when creating volatile databases

When creating missing databases either locally or remotely, recovery
master calls ctdb_ctrl_createdb().  Recovery master always passes 0
for tdb_flags.  For volatile databases, if TDB_INCOMPATIBLE_HASH is not
specified, then they will be attached without using jenkins hash causing
database corruption.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
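
A minimal sketch of the flag selection described above, written in Python
rather than the C used by recoverd.  The helper name is hypothetical, and the
constant value is taken from tdb.h as far as I recall, so treat it as
illustrative.

```python
# Illustrative value; the authoritative definition lives in tdb.h.
TDB_INCOMPATIBLE_HASH = 2048


def createdb_flags(persistent, tdb_flags=0):
    """Return the tdb flags to pass when creating a missing database.

    Hypothetical helper: volatile (non-persistent) databases always get
    TDB_INCOMPATIBLE_HASH so that they are attached with the jenkins hash.
    """
    if not persistent:
        tdb_flags |= TDB_INCOMPATIBLE_HASH
    return tdb_flags


print(createdb_flags(persistent=False))
print(createdb_flags(persistent=True))
```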
10 years agoRevert "recoverd: Use correct tdb flags when creating missing databases"
Amitay Isaacs [Tue, 13 Aug 2013 03:55:47 +0000 (13:55 +1000)]
Revert "recoverd: Use correct tdb flags when creating missing databases"

This reverts commit 10a057d8e15c8c18e540598a940d3548c731b0b4.

This approach would not work when creating local databases since currently
there is no control to receive TDB flags for remote databases.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agocommon/io: Keep queue buffer size multiple of 4K
Amitay Isaacs [Mon, 5 Aug 2013 07:28:47 +0000 (17:28 +1000)]
common/io: Keep queue buffer size multiple of 4K

Currently queue buffer size is realloc'd every time we need to extend the
buffer.  Small increments can cause memory fragmentation.  Instead always
extend buffer in multiples of 4K.  This should reduce multiple talloc_realloc
calls when there are lots of packets in the socket buffer.

Also, if queue buffer has grown larger than 64K, throw away the buffer once
all the requests in the queue have been processed.  That way queue does not
hold on to large buffers.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
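
The sizing rule above can be sketched as follows.  This is a Python
illustration of the arithmetic only; the real code uses talloc_realloc in C,
and the names below are hypothetical.  Only the 4K and 64K figures come from
the commit message.

```python
QUEUE_BUFFER_SIZE = 4096      # growth granularity (4K)
QUEUE_BUFFER_MAX = 64 * 1024  # buffers larger than this are not retained


def extended_size(needed):
    """Round the number of bytes needed up to a multiple of 4K."""
    blocks = (needed + QUEUE_BUFFER_SIZE - 1) // QUEUE_BUFFER_SIZE
    return blocks * QUEUE_BUFFER_SIZE


def should_discard(buffer_size, pending_requests):
    """True if the buffer should be thrown away once the queue is drained."""
    return pending_requests == 0 and buffer_size > QUEUE_BUFFER_MAX


print(extended_size(1), extended_size(4096), extended_size(4097))
print(should_discard(128 * 1024, 0))
```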
10 years agopackaging: Allow setting custom release number in RPM spec file
Martin Schwenke [Fri, 26 Jul 2013 03:57:03 +0000 (13:57 +1000)]
packaging: Allow setting custom release number in RPM spec file

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-Programmed-With: Amitay Isaacs <amitay@gmail.com>

10 years agoctdbd: When a record is made sticky, log only once
Amitay Isaacs [Wed, 31 Jul 2013 05:59:11 +0000 (15:59 +1000)]
ctdbd: When a record is made sticky, log only once

Instead of logging from ctdb_request_call(), log the message from
ctdb_make_record_sticky().  That way if the record is already sticky, the
message is not repeated unnecessarily.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Improve high hopcount log messages when request is redirected
Amitay Isaacs [Mon, 15 Jul 2013 07:34:31 +0000 (17:34 +1000)]
ctdbd: Improve high hopcount log messages when request is redirected

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoscripts: Do not run ctdb tool commands when debugging hung "init" event
Martin Schwenke [Tue, 6 Aug 2013 06:11:40 +0000 (16:11 +1000)]
scripts: Do not run ctdb tool commands when debugging hung "init" event

CTDB daemon is not ready to accept clients in INIT runstate (init event).
CTDB daemon will start accepting connections in SETUP runstate (setup event)
and later.

Also, minor log formatting changes.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Avoid leaking file descriptor if talloc fails
Amitay Isaacs [Mon, 5 Aug 2013 07:38:42 +0000 (17:38 +1000)]
ctdbd: Avoid leaking file descriptor if talloc fails

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoeventscript: Wait for debug hung script to finish or timeout before continuing
Amitay Isaacs [Mon, 5 Aug 2013 04:08:28 +0000 (14:08 +1000)]
eventscript: Wait for debug hung script to finish or timeout before continuing

Currently, if the debug hung script takes a long time to finish, the
subsequent monitor event can collide with the previous event, which has
not yet finished.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
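
The "finish or time out" behaviour can be sketched in Python as below.  This
is not how ctdbd implements it (ctdbd is event-driven C); the function name
and the timeout values are assumptions for illustration.

```python
import subprocess
import sys


def wait_for_debug_script(argv, timeout_seconds):
    """Run a debug script and wait until it finishes or the timeout expires.

    Returns True if the script finished in time, False if it timed out.
    subprocess.run() kills the child when the timeout expires, so nothing
    is left running when the caller continues with the next event.
    """
    try:
        subprocess.run(argv, timeout=timeout_seconds, check=False)
        return True
    except subprocess.TimeoutExpired:
        return False


# A script that exits immediately completes within the deadline.
print(wait_for_debug_script([sys.executable, "-c", "pass"], 30))
```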
10 years agoeventscripts: Use configured RECLOCK file instead of asking CTDB
Amitay Isaacs [Fri, 2 Aug 2013 05:49:06 +0000 (15:49 +1000)]
eventscripts: Use configured RECLOCK file instead of asking CTDB

On a cluster where the recovery lock file is not being used, asking the
CTDB daemon is unnecessary overhead.  And if CTDB is using a recovery lock
file, then changing the configuration without restarting is *stupid*.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>

10 years agolocking: Do not create multiple lock processes for the same key
Amitay Isaacs [Fri, 2 Aug 2013 00:54:38 +0000 (10:54 +1000)]
locking: Do not create multiple lock processes for the same key

If there are multiple lock helper processes waiting for the same record, then
it will cause a thundering herd when that record has been unlocked.  So avoid
scheduling lock contexts for the same record.  This will also mean that
multiple requests will get queued up behind the same lock context and can be
processed quickly once the lock has been obtained.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
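
The scheduling idea above, one lock helper per key with later requests
queued behind it, can be sketched as follows.  This is a Python model with
hypothetical names; it does not reflect the actual lock context structures
in the C code.

```python
from collections import defaultdict, deque


class LockScheduler:
    """Toy model: at most one lock helper is active per record key."""

    def __init__(self):
        self.waiting = defaultdict(deque)  # key -> requests queued for it
        self.active = set()                # keys that already have a helper

    def request(self, key, req):
        """Queue a request; return True if a new helper must be started."""
        self.waiting[key].append(req)
        if key in self.active:
            return False  # piggy-back on the helper already waiting
        self.active.add(key)
        return True

    def lock_obtained(self, key):
        """Helper got the lock: hand back every request queued for the key."""
        self.active.discard(key)
        return list(self.waiting.pop(key, ()))


s = LockScheduler()
print(s.request(b"rec1", "a"), s.request(b"rec1", "b"))
print(s.lock_obtained(b"rec1"))
```

Because only one helper waits per key, unlocking a hot record wakes a single
process instead of a herd, and all queued requests are served in one pass.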
10 years agolocking: Move function find_lock_context() before ctdb_lock_schedule()
Amitay Isaacs [Fri, 2 Aug 2013 00:51:45 +0000 (10:51 +1000)]
locking: Move function find_lock_context() before ctdb_lock_schedule()

So that ctdb_lock_schedule() can call this function without requiring extra
prototype declaration.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Print set db sticky message after it's set
Amitay Isaacs [Tue, 30 Jul 2013 04:17:55 +0000 (14:17 +1000)]
ctdbd: Print set db sticky message after it's set

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agotests: Add a test program to hold a lock on a database
Amitay Isaacs [Tue, 4 Dec 2012 07:27:10 +0000 (18:27 +1100)]
tests: Add a test program to hold a lock on a database

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Use correct tdb flags when creating missing databases
Amitay Isaacs [Tue, 30 Jul 2013 02:45:01 +0000 (12:45 +1000)]
recoverd: Use correct tdb flags when creating missing databases

When creating missing databases either locally or remotely, make sure
to use the correct tdb flags from other nodes.  Without this, volatile
databases can get attached without TDB_INCOMPATIBLE_HASH flag.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoclient: Always use jenkins hash when attaching volatile databases
Amitay Isaacs [Thu, 1 Aug 2013 01:07:59 +0000 (11:07 +1000)]
client: Always use jenkins hash when attaching volatile databases

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Make sure to use jenkins hash for recovery databases
Amitay Isaacs [Mon, 29 Jul 2013 03:50:44 +0000 (13:50 +1000)]
recoverd: Make sure to use jenkins hash for recovery databases

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Assemble up-to-date node flags information from remote nodes
Amitay Isaacs [Mon, 22 Jul 2013 07:26:28 +0000 (17:26 +1000)]
recoverd: Assemble up-to-date node flags information from remote nodes

Currently nodemap used by recovery master is the one obtained from the local
node.  This information may have been updated while processing main loop.
Before comparing node flags on all the nodes, create up-to-date node flags
information based on the information received from all the nodes.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
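
One way to picture the assembly step is sketched below in Python.  It
assumes that each node's own view of its own flags is taken as
authoritative, which is my reading of the commit message rather than
something it states; the function and argument names are hypothetical.

```python
def assemble_node_flags(local_flags, remote_nodemaps):
    """Build an up-to-date list of node flags.

    local_flags:     flags per node as seen by the local node.
    remote_nodemaps: {pnn: flags per node as reported by node pnn}.

    Assumption: for each node that answered, use the flags that node
    reports about itself; otherwise fall back to the local view.
    """
    flags = list(local_flags)
    for pnn, nodemap in remote_nodemaps.items():
        if 0 <= pnn < len(flags) and pnn < len(nodemap):
            flags[pnn] = nodemap[pnn]
    return flags


print(assemble_node_flags([0, 0, 0], {1: [0, 2, 0], 2: [0, 0, 4]}))
```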
10 years agotools/ctdb: Only print the hot records with non-zero hopcount
Amitay Isaacs [Mon, 15 Jul 2013 06:35:30 +0000 (16:35 +1000)]
tools/ctdb: Only print the hot records with non-zero hopcount

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Don't consider a hot record if the hopcount is zero
Amitay Isaacs [Mon, 15 Jul 2013 06:32:40 +0000 (16:32 +1000)]
ctdbd: Don't consider a hot record if the hopcount is zero

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Fix updating of hot keys in database statistics
Amitay Isaacs [Fri, 12 Jul 2013 07:33:13 +0000 (17:33 +1000)]
ctdbd: Fix updating of hot keys in database statistics

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Remove incomplete ctdb_db_statistics_wire structure
Amitay Isaacs [Mon, 15 Jul 2013 05:24:11 +0000 (15:24 +1000)]
ctdbd: Remove incomplete ctdb_db_statistics_wire structure

Instead of maintaining another structure, add an element as a placeholder
for the marshall buffer of hot keys.  This avoids duplication of the
structure.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoRevert "ctdbd: Remove incomplete ctdb_db_statistics_wire structure"
Amitay Isaacs [Mon, 15 Jul 2013 04:52:07 +0000 (14:52 +1000)]
Revert "ctdbd: Remove incomplete ctdb_db_statistics_wire structure"

The structure cannot be removed without adding support for marshalling keys
for hot records.

This reverts commit 26a4653df594d351ca0dc1bd5f5b2f5b0eb0a9a5.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agodoc: Update XML files to use standard DocBook DTD
Martin Schwenke [Fri, 26 Jul 2013 05:09:24 +0000 (15:09 +1000)]
doc: Update XML files to use standard DocBook DTD

This simplifies building since we don't use any of the Samba
extensions.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoinitscript: The wrapper script should export CTDB_SOCKET
Martin Schwenke [Fri, 26 Jul 2013 01:20:47 +0000 (11:20 +1000)]
initscript: The wrapper script should export CTDB_SOCKET

This ensures that any invocation of the ctdb tool (within the wrapper)
gets the desired value.  This at least ensures that ctdbd will be
started.

If a non-standard value is set for CTDB_SOCKET then command-line users
will still need the variable in their environment.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agoctdbd: Kill client process without checking for tracked child
Martin Schwenke [Thu, 25 Jul 2013 06:17:07 +0000 (16:17 +1000)]
ctdbd: Kill client process without checking for tracked child

Commit f73a4b1495830bcdd094a93732a89dd53b3c2f78 added a safety check
to ensure that CTDB never kills unrelated processes.  However, client
processes are unrelated.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: kill_tcp_connections() should send connections to stdin
Martin Schwenke [Thu, 25 Jul 2013 03:40:43 +0000 (13:40 +1000)]
eventscripts: kill_tcp_connections() should send connections to stdin

This avoids issuing multiple "ctdb killtcp" commands to terminate tcp
connections, one per connection.  This will considerably reduce the
time when there is a large number of tcp connections.  This also makes
it possible to avoid calling "ctdb killtcp" when there are no connections.

Add a couple of unit tests for killtcp and update eventscript unit
test infrastructure to support.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
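
The batching described above can be sketched as follows.  This is a Python
illustration, not the shell code in the eventscripts; the function names are
hypothetical, and the one-connection-per-line "source destination" input
format is an assumption rather than something stated in the commit message.

```python
import subprocess
import sys


def format_connections(connections):
    """Render (src, dst) pairs as one line per connection (assumed format)."""
    return "".join(f"{src} {dst}\n" for src, dst in connections)


def kill_connections(argv, connections):
    """Feed all connections to a single command on its standard input.

    Returns False without spawning anything when there are no connections.
    """
    if not connections:
        return False
    subprocess.run(argv, input=format_connections(connections),
                   text=True, check=True)
    return True


conns = [("10.0.0.1:2049", "10.0.0.9:700"), ("10.0.0.1:2049", "10.0.0.9:701")]
# Stand-in command that just consumes stdin; a real caller would pass the
# ctdb tool's killtcp command instead.
kill_connections([sys.executable, "-c", "import sys; sys.stdin.read()"], conns)
```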

10 years agotools/ctdb: Allow killtcp to read connections from standard input
Martin Schwenke [Thu, 25 Jul 2013 03:28:26 +0000 (13:28 +1000)]
tools/ctdb: Allow killtcp to read connections from standard input

This allows eventscripts to send information about multiple tcp
connections to a single "ctdb killtcp" command, saving the overhead of
setting up a client connection per tcp connection.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agotests: Always tally the number of passed/failed tests
Martin Schwenke [Mon, 22 Jul 2013 10:11:58 +0000 (20:11 +1000)]
tests: Always tally the number of passed/failed tests

Regardless of whether a summary is being printed!

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Call takeover fail callback only once per node
Martin Schwenke [Mon, 22 Jul 2013 06:39:46 +0000 (16:39 +1000)]
recoverd: Call takeover fail callback only once per node

Currently the fail callback is called once per (takeip/releaseip) control
failure.  This is overkill and can get a node banned much too quickly.

Instead, keep track of control failures per node and only call fail
callback once per failed node.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
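
The per-node bookkeeping can be sketched as below.  This is a Python model
with hypothetical names; the real code tracks failures of takeip/releaseip
controls in C and may invoke the callback at a different point.

```python
def run_controls(results, fail_callback):
    """Call fail_callback once per failed node, not once per failed control.

    results: iterable of (pnn, ok) pairs, one per takeip/releaseip control.
    Returns the set of nodes that had at least one failed control.
    """
    failed = set()
    for pnn, ok in results:
        if not ok and pnn not in failed:
            failed.add(pnn)
            fail_callback(pnn)
    return failed


calls = []
run_controls([(0, True), (1, False), (1, False), (2, False)], calls.append)
print(calls)
```

With three failed controls spread over two nodes, the callback runs twice,
so a node with many failing IPs is no longer penalised once per IP.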

10 years agoscripts: Run scriptstatus for hung event
Martin Schwenke [Mon, 22 Jul 2013 05:08:32 +0000 (15:08 +1000)]
scripts: Run scriptstatus for hung event

The timeout information printed by ctdbd is less than useful because
it refers to the cumulative time taken by the eventscripts run so far.
Adding scriptstatus output indicates where time was actually spent.

Since there is now quite a bit of output, serialise the calls to this
script using flock.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
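
Serialising with flock can be sketched in Python as follows.  The script in
this commit presumably uses the flock(1) utility from shell; the lock file
path and function name here are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def serialised(lock_path):
    """Hold an exclusive flock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Blocks until any other holder releases the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)


with serialised("/tmp/debug-hung-script.lock"):
    print("only one instance writes its output at a time")
```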

10 years agoctdbd: Pass event name to hung script debugger
Martin Schwenke [Mon, 22 Jul 2013 05:06:52 +0000 (15:06 +1000)]
ctdbd: Pass event name to hung script debugger

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agotests/complex: Fix NFS tests to work with root_squash
Martin Schwenke [Mon, 22 Jul 2013 04:32:13 +0000 (14:32 +1000)]
tests/complex: Fix NFS tests to work with root_squash

Refactor the NFS test setup/cleanup code into new common functions.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agotests: Fix exit status of run_tests when a single test is run with -H
Martin Schwenke [Fri, 19 Jul 2013 09:59:43 +0000 (19:59 +1000)]
tests: Fix exit status of run_tests when a single test is run with -H

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agotests/simple: Add -p in onnode test to help show groups of connections
Martin Schwenke [Fri, 19 Jul 2013 05:33:38 +0000 (15:33 +1000)]
tests/simple: Add -p in onnode test to help show groups of connections

Change the command from "true" to "hostname" since the former won't
produce any output when used in combination with "onnode -p".  This
could just be changed to "echo" but the hostname might actually be
useful.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Sleep at exit to allow time for log messages to flush
Martin Schwenke [Wed, 17 Jul 2013 01:14:37 +0000 (11:14 +1000)]
ctdbd: Sleep at exit to allow time for log messages to flush

Register print_exit_message() earlier so that it covers most of the
early exits.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agoctdbd: Exit if something is already listening on CTDB socket
Martin Schwenke [Fri, 19 Jul 2013 05:36:29 +0000 (15:36 +1000)]
ctdbd: Exit if something is already listening on CTDB socket

Don't blindly remove the socket.

Signed-off-by: Martin Schwenke <martin@meltin.net>
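
A check of this kind can be sketched by attempting a connection before
touching the socket file.  This is a Python illustration with a hypothetical
function name and example path; the commit message does not say how ctdbd
performs the check.

```python
import socket


def socket_in_use(path):
    """Return True if something accepts connections on the Unix socket."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return True
    except OSError:
        # Missing file, stale socket file, or connection refused.
        return False
    finally:
        s.close()


# Example path only; the real socket location depends on the installation.
print(socket_in_use("/tmp/ctdb.socket.example"))
```

Only when the connection attempt fails is it reasonable to treat an existing
socket file as stale and remove it.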
10 years agotests/eventscripts: Add tests for monitoring of missing interfaces
Martin Schwenke [Tue, 16 Jul 2013 09:57:18 +0000 (19:57 +1000)]
tests/eventscripts: Add tests for monitoring of missing interfaces

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: A missing interface should cause monitoring to fail
Martin Schwenke [Fri, 12 Jul 2013 02:48:34 +0000 (12:48 +1000)]
eventscripts: A missing interface should cause monitoring to fail

A missing interface is at least as bad as an interface with a link
that is down so should have a similar effect.

This couldn't be done previously because orphaned interfaces used to
be listed for monitoring.  This was worked around in 10.interface in
commit 49b2d1bd9554461ed8edbfc21e777c0eca9e1443 and fixed in ctdbd in
commit cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a.

If $CTDB_PARTIALLY_ONLINE_INTERFACES="yes" then monitoring won't
actually fail but the interface is still marked as down.

While we're touching this code, use "ip link" instead of "ip addr".
It is marginally cheaper but not enough for a separate patch.  ;-)

This effectively reverts d67955b42f7627be9dae995230c8fcbb8a948ec2.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: Get list of configured interfaces using "ctdb ifaces"
Martin Schwenke [Fri, 12 Jul 2013 02:33:36 +0000 (12:33 +1000)]
eventscripts: Get list of configured interfaces using "ctdb ifaces"

This was previously changed because ctdbd didn't garbage collect
orphaned interfaces.  This was fixed in commit
cc1a3ae911d3fee8b87fda5de5ab6d9499d7510a.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Allow extra recovery to repair persistent DBs during first recovery
Martin Schwenke [Mon, 24 Jun 2013 05:49:48 +0000 (15:49 +1000)]
ctdbd: Allow extra recovery to repair persistent DBs during first recovery

Commit 8076773a9924dcf8aff16f7d96b2b9ac383ecc28 introduced a potential
regression because a node may not have completed the "recovered" event
(so might still be in CTDB_RUNSTATE_FIRST_RECOVERY) when another node
becomes healthy.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agopackaging: Bundle debug_locks.sh script in RPM
Amitay Isaacs [Tue, 16 Jul 2013 02:53:16 +0000 (12:53 +1000)]
packaging: Bundle debug_locks.sh script in RPM

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agopackaging: No need to check for existence of scripts, they always exist
Amitay Isaacs [Tue, 16 Jul 2013 02:52:00 +0000 (12:52 +1000)]
packaging: No need to check for existence of scripts, they always exist

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoscripts: ctdbd_wrapper logs a message to syslog if syslog is not being used
Martin Schwenke [Thu, 11 Jul 2013 04:26:38 +0000 (14:26 +1000)]
scripts: ctdbd_wrapper logs a message to syslog if syslog is not being used

It can be very disconcerting when logging to syslog is expected but
nothing is being logged there.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoUpdate Nagios check to work with ctdb versions past 30 Aug 2011
Mathieu Parent [Fri, 7 Jun 2013 17:01:06 +0000 (19:01 +0200)]
Update Nagios check to work with ctdb versions past 30 Aug 2011

Because of commit a779d83a6213e2ba

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Really fix bogus info in message about changed flags
Martin Schwenke [Thu, 11 Jul 2013 03:01:13 +0000 (13:01 +1000)]
recoverd: Really fix bogus info in message about changed flags

Commit 9119a568c2b4601318f7751f537dca2f92a7230b attempted to fix this.
However, this was wrong because old_flags and new_flags were confused.
The latter has since been fixed in commit
7eb2f89979360b6cc98ca9b17c48310277fa89fc so this can now be fixed
properly.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agodoc: Update NEWS
Martin Schwenke [Wed, 10 Jul 2013 04:44:56 +0000 (14:44 +1000)]
doc: Update NEWS

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoPrint deleted nodes as well
Sumit Bose [Mon, 19 Nov 2012 17:45:37 +0000 (18:45 +0100)]
Print deleted nodes as well

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoIPv6 neighbor solicit cleanup
Sumit Bose [Thu, 1 Sep 2011 13:18:46 +0000 (15:18 +0200)]
IPv6 neighbor solicit cleanup

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoFix memory leak in ctdb_send_message()
Sumit Bose [Mon, 19 Nov 2012 10:13:03 +0000 (11:13 +0100)]
Fix memory leak in ctdb_send_message()

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoFixes for various issues found by Coverity
Sumit Bose [Wed, 10 Aug 2011 15:53:56 +0000 (17:53 +0200)]
Fixes for various issues found by Coverity

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoCheck return value of tdb_delete()
Sumit Bose [Mon, 19 Nov 2012 10:20:31 +0000 (11:20 +0100)]
Check return value of tdb_delete()

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoweb: Update webpages
Amitay Isaacs [Thu, 11 Jul 2013 03:46:18 +0000 (13:46 +1000)]
web: Update webpages

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoTests: Correct the arguments to memset
Amitay Isaacs [Thu, 11 Jul 2013 01:34:46 +0000 (11:34 +1000)]
Tests: Correct the arguments to memset

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agodoc: Update NEWS
Amitay Isaacs [Wed, 10 Jul 2013 04:44:56 +0000 (14:44 +1000)]
doc: Update NEWS

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-programmed-with: Martin Schwenke <martin@meltin.net>

10 years agopackaging: Add systemd support
Martin Schwenke [Wed, 10 Jul 2013 07:19:55 +0000 (17:19 +1000)]
packaging: Add systemd support

Based on an original patch by Sumit Bose <sbose@redhat.com>.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agobuild: Turn off all deprecation warnings
Martin Schwenke [Wed, 10 Jul 2013 06:35:53 +0000 (16:35 +1000)]
build: Turn off all deprecation warnings

The "‘tevent_loop_allow_nesting’ is deprecated" warnings will be
around for a while and are annoying.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agobuild: Remove -DTEVENT_DEPRECATED_QUIET=1 from CFLAGS
Martin Schwenke [Wed, 10 Jul 2013 06:30:29 +0000 (16:30 +1000)]
build: Remove -DTEVENT_DEPRECATED_QUIET=1 from CFLAGS

This reverts the last part of 788cdbddbc902a5b076d23473450065b551d274d
- the rest of this has been implicitly reverted via tevent syncs.
This is just leftover noise.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoinitscript: Simplify initscript and control CTDB via new ctdbd_wrapper
Martin Schwenke [Tue, 9 Jul 2013 05:22:07 +0000 (15:22 +1000)]
initscript: Simplify initscript and control CTDB via new ctdbd_wrapper

Currently the initscript is very complex.  This makes it hard to read
and hard to add support for new init systems, such as systemd.

Create a wrapper called ctdbd_wrapper to be installed alongside ctdbd.
This is called by the initscript to start and stop ctdbd.  It
constructs the ctdbd options and waits until ctdbd is properly
initialised before it exits.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agorecoverd: Recovery daemon should use ctdb_get_pnn, which can't fail
Martin Schwenke [Mon, 8 Jul 2013 02:45:31 +0000 (12:45 +1000)]
recoverd: Recovery daemon should use ctdb_get_pnn, which can't fail

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Print tdb flags when logging attached to database message
Amitay Isaacs [Wed, 10 Jul 2013 02:23:30 +0000 (12:23 +1000)]
ctdbd: Print tdb flags when logging attached to database message

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Set process names for child processes
Amitay Isaacs [Tue, 9 Jul 2013 02:32:53 +0000 (12:32 +1000)]
ctdbd: Set process names for child processes

This helps distinguish processes in process list in top, perf, etc.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agocommon/system: Add ctdb_set_process_name() function
Amitay Isaacs [Tue, 9 Jul 2013 02:24:59 +0000 (12:24 +1000)]
common/system: Add ctdb_set_process_name() function

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agotraverse: Remove unused start_time field
Amitay Isaacs [Thu, 6 Jun 2013 06:29:04 +0000 (16:29 +1000)]
traverse: Remove unused start_time field

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agotraverse: Send records directly from traverse child to srcnode
Amitay Isaacs [Thu, 6 Jun 2013 06:26:25 +0000 (16:26 +1000)]
traverse: Send records directly from traverse child to srcnode

Currently CTDB daemon reads records from a child process and then sends them to
srcnode via TRAVERSE_DATA control.  This ties up main CTDB daemon and also
requires an extra copy of the record in the CTDB daemon.  Instead send records
directly from traverse child process.

The control from child process still goes via local CTDB daemon as there
is no infrastructure currently to open a TCP socket to the srcnode.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agotraverse: Pass reqid and srcnode information to local database traverse
Amitay Isaacs [Thu, 6 Jun 2013 06:12:07 +0000 (16:12 +1000)]
traverse: Pass reqid and srcnode information to local database traverse

So that traverse child process can directly send the TRAVERSE_DATA control to
the srcnode without first sending it to local node.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agopackaging: When building with system libraries, add dependency for them
Amitay Isaacs [Mon, 8 Jul 2013 06:14:59 +0000 (16:14 +1000)]
packaging: When building with system libraries, add dependency for them

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: No need for DeadlockTimeout tunable
Amitay Isaacs [Mon, 8 Jul 2013 05:49:58 +0000 (15:49 +1000)]
ctdbd: No need for DeadlockTimeout tunable

The code for deadlock detection and for killing the smbd process causing
the deadlock has been removed and replaced with an external debug script.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoinitscript: Export CTDB_DEBUG_LOCKS variable
Amitay Isaacs [Mon, 8 Jul 2013 05:57:22 +0000 (15:57 +1000)]
initscript: Export CTDB_DEBUG_LOCKS variable

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoscripts: Add an example debug_locks.sh script to debug locking issue
Amitay Isaacs [Mon, 8 Jul 2013 05:56:30 +0000 (15:56 +1000)]
scripts: Add an example debug_locks.sh script to debug locking issue

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agolocking: Use external script to debug locking issues
Amitay Isaacs [Mon, 8 Jul 2013 05:46:53 +0000 (15:46 +1000)]
locking: Use external script to debug locking issues

Use an external script to parse /proc/locks and log useful debugging
information about locks rather than doing that in C code.

To use this feature, add configuration variable to /etc/sysconfig/ctdb:

  CTDB_DEBUG_LOCKS=/etc/ctdb/debug_locks.sh

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agolocking: Update locking bucket intervals
Amitay Isaacs [Wed, 3 Jul 2013 01:01:21 +0000 (11:01 +1000)]
locking: Update locking bucket intervals

 0   < 1 ms
 1   < 10 ms
 2   < 100 ms
 3   < 1 s
 4   < 2 s
 5   < 4 s
 6   < 8 s
 7   < 16 s
 8   < 32 s
 9   < 64 s
10   >= 64 s

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
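
The bucket table above maps a lock latency to an index from 0 to 10.  A
short Python sketch of that mapping follows; the names are hypothetical and
the real code is C, but the boundaries are the ones listed in the commit
message.

```python
from bisect import bisect_right

# Upper bounds (in seconds) of buckets 0..9; bucket 10 is everything >= 64 s.
BUCKET_UPPER_BOUNDS = [0.001, 0.01, 0.1, 1, 2, 4, 8, 16, 32, 64]


def lock_bucket(latency_seconds):
    """Return the bucket index for a lock latency given in seconds."""
    # bisect_right counts the bounds that are <= latency, which is exactly
    # the index of the first bucket whose "< bound" condition still holds.
    return bisect_right(BUCKET_UPPER_BOUNDS, latency_seconds)


for latency in (0.0005, 0.5, 3, 64):
    print(latency, lock_bucket(latency))
```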
10 years agolocking: Update locks latency in CTDB statistics only for RECORD or DB locks
Amitay Isaacs [Wed, 3 Jul 2013 01:46:53 +0000 (11:46 +1000)]
locking: Update locks latency in CTDB statistics only for RECORD or DB locks

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agotools/ctdb: Fix the format of DB statistics output
Amitay Isaacs [Tue, 25 Jun 2013 05:36:13 +0000 (15:36 +1000)]
tools/ctdb: Fix the format of DB statistics output

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Remove incomplete ctdb_db_statistics_wire structure
Amitay Isaacs [Tue, 25 Jun 2013 05:25:16 +0000 (15:25 +1000)]
ctdbd: Remove incomplete ctdb_db_statistics_wire structure

Send the ctdb_db_statistics structure directly instead of first copying
it to a duplicate ctdb_db_statistics_wire structure.  This simplifies
the implementation of the control to get database statistics.

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agoctdbd: Update debug messages for setting readonly property on database
Amitay Isaacs [Wed, 3 Jul 2013 23:04:49 +0000 (09:04 +1000)]
ctdbd: Update debug messages for setting readonly property on database

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
10 years agorecoverd: Fix buffer overflow error in reloadips
Amitay Isaacs [Fri, 5 Jul 2013 04:04:20 +0000 (14:04 +1000)]
recoverd: Fix buffer overflow error in reloadips

Signed-off-by: Amitay Isaacs <amitay@gmail.com>
Pair-Programmed-With: Martin Schwenke <martin@meltin.net>

10 years agotests/eventscripts: Add some rudimentary tests for 60.ganesha
Martin Schwenke [Thu, 4 Jul 2013 10:02:29 +0000 (20:02 +1000)]
tests/eventscripts: Add some rudimentary tests for 60.ganesha

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: New configuration variable $CTDB_SKIP_GANESHA_NFSD_CHECK
Martin Schwenke [Thu, 4 Jul 2013 06:05:01 +0000 (16:05 +1000)]
eventscripts: New configuration variable $CTDB_SKIP_GANESHA_NFSD_CHECK

This allows 60.ganesha to be unit tested, except for the core Ganesha
monitoring code.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscript: Move Ganesha nfsd monitoring to a function
Martin Schwenke [Thu, 4 Jul 2013 06:00:33 +0000 (16:00 +1000)]
eventscript: Move Ganesha nfsd monitoring to a function

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoeventscripts: Drop RPC service version from nfs_check_rpc_service() calls
Martin Schwenke [Thu, 4 Jul 2013 05:11:54 +0000 (15:11 +1000)]
eventscripts: Drop RPC service version from nfs_check_rpc_service() calls

Support for this was removed in commit
77302dbfd85754e02559eccb2dd6c090db0b6b9f and I overlooked its use in
60.ganesha.

Signed-off-by: Martin Schwenke <martin@meltin.net>
Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>

10 years agoctdbd: Log something when releasing all IPs
Martin Schwenke [Tue, 2 Jul 2013 04:43:17 +0000 (14:43 +1000)]
ctdbd: Log something when releasing all IPs

At the moment this is silent and it can be confusing to see IPs just
disappear.

Also, this message:

  Been in recovery mode for too long. Dropping all IPS

can cause anxiety when all IPs should already have been dropped.
Adding a comforting message saying that 0 IPs were dropped relieves
such anxiety.  :-)

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Minor style improvements for ctdb_reload_remote_public_ips()
Martin Schwenke [Sun, 30 Jun 2013 09:00:36 +0000 (19:00 +1000)]
recoverd: Minor style improvements for ctdb_reload_remote_public_ips()

* Add a variable to the loop to make the code more readable and have
  it generally fit into 80 columns.

* Improve comments.

* Improve log messages.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Clean up log messages in remote IP verification
Martin Schwenke [Sun, 30 Jun 2013 08:45:46 +0000 (18:45 +1000)]
recoverd: Clean up log messages in remote IP verification

The log messages in verify_remote_ip_allocation() are confusing
because they don't include the PNN of the problem node, which is not
known in this function.

Add the PNN of the node being verified as a function argument and then
shuffle the log messages around to make them clearer.

Also fold 3 nested if statements into just one.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Fix an unclear log message - "Restart recovery process"
Martin Schwenke [Sun, 30 Jun 2013 07:57:33 +0000 (17:57 +1000)]
recoverd: Fix an unclear log message - "Restart recovery process"

When the recovery master notices a node in recovery mode it starts the
recovery process; it doesn't restart it.

Update documentation to match.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agorecoverd: Fix an incorrect comment
Martin Schwenke [Sun, 30 Jun 2013 07:53:37 +0000 (17:53 +1000)]
recoverd: Fix an incorrect comment

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Use ctdb_die() on "setup" event failure
Martin Schwenke [Sun, 30 Jun 2013 07:48:01 +0000 (17:48 +1000)]
ctdbd: Use ctdb_die() on "setup" event failure

This is slightly easier to read because it all fits on 1 line.

Signed-off-by: Martin Schwenke <martin@meltin.net>
10 years agoctdbd: Avoid a core dump when "init" event fails
Martin Schwenke [Sun, 30 Jun 2013 07:43:52 +0000 (17:43 +1000)]
ctdbd: Avoid a core dump when "init" event fails

The "init" event only really fails in the scripts, which should log
something useful on failure.  Therefore, a core dump isn't terribly
useful and sometimes attracts unwanted attention.

Signed-off-by: Martin Schwenke <martin@meltin.net>