Ronnie and I tracked down a bug which appears to be caused by a node
running so slowly that we timed out the request and reused its request
id before the node responded.
As a result we unlocked the wrong record, leading to the following
log output:
ctdbd: tdb_unlock: count is 0
ctdbd: tdb_chainunlock failed
smbd[1630912]: [2010/06/08 15:32:28.251716, 0] lib/util_sock.c:1491(get_peer_addr_internal)
ctdbd: Could not find idr:43
ctdbd: server/ctdb_call.c:492 reqid 43 not found
This exact problem is now detected, but in general we want to delay
id reuse as long as possible to make our system more robust.
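For illustration only, here is a minimal standalone sketch of the
"allocate above the last id, wrap only when that fails" idea. It is not
the ctdb or idr code: the toy_* names and the tiny TOY_MAX_ID id space
are hypothetical, chosen so the wrap is easy to observe.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ID 8	/* hypothetical, deliberately tiny id space */

struct toy_ids {
	bool used[TOY_MAX_ID];	/* used[i] is true while id i is allocated */
	int lastid;		/* most recently allocated id, -1 if none yet */
};

/* Return the lowest free id in [from, TOY_MAX_ID), or -1 if there is none. */
static int toy_first_free(const struct toy_ids *t, int from)
{
	for (int i = from; i < TOY_MAX_ID; i++) {
		if (!t->used[i]) {
			return i;
		}
	}
	return -1;
}

/* Allocate an id, preferring ids above the last one handed out. */
static int toy_id_new(struct toy_ids *t)
{
	int id = toy_first_free(t, t->lastid + 1);
	if (id < 0) {
		/* Nothing free above lastid: wrap and search from 0. */
		id = toy_first_free(t, 0);
	}
	if (id >= 0) {
		t->used[id] = true;
		t->lastid = id;
	}
	return id;	/* -1 means every id is in use */
}

/* Release an id so that it can eventually be handed out again. */
static void toy_id_free(struct toy_ids *t, int id)
{
	assert(id >= 0 && id < TOY_MAX_ID);
	t->used[id] = false;
}
```

Starting from `struct toy_ids t = { .lastid = -1 };`, freeing id 0 right
after allocating ids 0 and 1 does not make the next allocation return 0
again. Id 0 is only reused once the search above lastid fails and the
allocator wraps.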
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
}
ctdb->ev = ev;
ctdb->idr = idr_init(ctdb);
+ /* Wrap early to exercise code. */
+ ctdb->lastid = INT_MAX-2;
CTDB_NO_MEMORY_NULL(ctdb, ctdb->idr);
ret = ctdb_set_socketname(ctdb, CTDB_PATH);
uint32_t ctdb_reqid_new(struct ctdb_context *ctdb, void *state)
{
- return idr_get_new(ctdb->idr, state, INT_MAX);
+ int id = idr_get_new_above(ctdb->idr, state, ctdb->lastid+1, INT_MAX);
+ if (id < 0) {
+ DEBUG(DEBUG_NOTICE, ("Reqid wrap!\n"));
+ id = idr_get_new(ctdb->idr, state, INT_MAX);
+ }
+ ctdb->lastid = id;
+ return id;
}
void *_ctdb_reqid_find(struct ctdb_context *ctdb, uint32_t reqid, const char *type, const char *location)
unsigned flags;
uint32_t capabilities;
struct idr_context *idr;
- uint16_t idr_cnt;
+ int lastid;
struct ctdb_node **nodes; /* array of nodes in the cluster - indexed by vnn */
struct ctdb_vnn *vnn; /* list of public ip addresses and interfaces */
struct ctdb_vnn *single_ip_vnn; /* a structure for the single ip */