git.samba.org - ctdb.git/commit

author	Martin Schwenke <martin@meltin.net>
	Fri, 23 May 2014 11:58:55 +0000 (21:58 +1000)
committer	Amitay Isaacs <amitay@gmail.com>
	Thu, 29 May 2014 04:05:14 +0000 (14:05 +1000)
commit	a89f6a073d0d780ac8336c202b7b8f7179aa8d35
tree	90025da8edf6780052c26f690938ff8372a28163
parent	1c3abfc5502cc99a2325152de28a1592a58675af

tools-ctdb: Make natgwlist and lvsmaster more resilient

Recent changes have caused these commands to attempt to get
capabilities from all nodes before doing further filtering.  This
means that capabilities are unnecessarily fetched from nodes that are
unlikely to be the master.  If such a node does not answer the control,
then many nodes can fail to calculate the master node.  In the case of
natgwlist, this causes "monitor" events to fail, resulting in
unhealthy nodes.

Restore the behaviour where capabilities are only fetched for a node
that will be the master if it has the desired flags.

Although this masks a problem where a connected node is not replying,
it can help to avoid an outage in some cases.

Add supporting tests and infrastructure.  The infrastructure lets a
timeout be faked, so far only for ctdb_ctrl_getcapabilities_stub().
The first test checks that this infrastructure works when the first
node times out in natgwlist.  The second test checks the case worked
around by the above fix: there is no failure when a node with a PNN
beyond the NATGW master times out.
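
The restored behaviour and the two test cases can be illustrated with
a minimal, self-contained sketch.  This is not the code in
tools/ctdb.c: struct node, NODE_FLAG_UNAVAILABLE, CAP_NATGW,
get_capabilities() and find_master() are hypothetical names made up
for illustration, and the timeout is simulated with a per-node
boolean rather than the real stub mechanism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag and capability bits, for illustration only. */
#define NODE_FLAG_UNAVAILABLE 0x1u
#define CAP_NATGW             0x8u

struct node {
	uint32_t pnn;
	uint32_t flags;     /* known locally, no round trip needed */
	uint32_t caps;      /* what the node would report if asked */
	bool     times_out; /* simulate an unanswered capabilities request */
};

/* Counts how many capability requests were actually issued. */
static int fetch_count;

/* Stand-in for a remote capabilities request; -1 means simulated timeout. */
static int get_capabilities(const struct node *n, uint32_t *caps)
{
	fetch_count++;
	if (n->times_out) {
		return -1;
	}
	*caps = n->caps;
	return 0;
}

/*
 * Walk the nodes in order.  Filter on flags first, and only ask a node
 * for its capabilities once its flags make it a candidate.  Stop at the
 * first candidate with the capability, so later nodes are never asked.
 *
 * Returns the master's PNN, -1 if there is no master, or -2 if a
 * capabilities request that was actually needed failed.
 */
static int find_master(const struct node *nodes, size_t count)
{
	assert(nodes != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		uint32_t caps;

		if (nodes[i].flags & NODE_FLAG_UNAVAILABLE) {
			continue; /* rejected on flags alone, no fetch */
		}
		if (get_capabilities(&nodes[i], &caps) != 0) {
			return -2; /* a candidate did not answer */
		}
		if (caps & CAP_NATGW) {
			return (int)nodes[i].pnn;
		}
	}
	return -1;
}
```

In this sketch a timeout on a node that is asked before a master has
been found is still an error, while a timeout on a node with a PNN
beyond the master goes unnoticed because that node is never asked.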

Signed-off-by: Martin Schwenke <martin@meltin.net>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
Autobuild-User(master): Amitay Isaacs <amitay@samba.org>
Autobuild-Date(master): Thu May 29 05:59:37 CEST 2014 on sn-devel-104

(Imported from commit 4dd382296d3e78000713ab0ac1f8e531e25857cc)

tests/src/ctdb_test_stubs.c
tests/tool/stubby.natgwlist.009.sh	[new file with mode: 0755]
tests/tool/stubby.natgwlist.010.sh	[new file with mode: 0755]
tools/ctdb.c