ctdb/doc/ctdb.7.xml

   1 <?xml version="1.0" encoding="iso-8859-1"?>
   2 <!DOCTYPE refentry
   3         PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
   4         "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
   5 <refentry id="ctdb.7">
   6
   7 <refmeta>
   8         <refentrytitle>ctdb</refentrytitle>
   9         <manvolnum>7</manvolnum>
  10         <refmiscinfo class="source">ctdb</refmiscinfo>
  11         <refmiscinfo class="manual">CTDB - clustered TDB database</refmiscinfo>
  12 </refmeta>
  13
  14
  15 <refnamediv>
  16         <refname>ctdb</refname>
  17         <refpurpose>Clustered TDB</refpurpose>
  18 </refnamediv>
  19
  20 <refsect1>
  21   <title>DESCRIPTION</title>
  22
  23   <para>
  24     CTDB is a clustered database component in clustered Samba that
  25     provides a high-availability load-sharing CIFS server cluster.
  26   </para>
  27
  28   <para>
  29     The main functions of CTDB are:
  30   </para>
  31
  32   <itemizedlist>
  33     <listitem>
  34       <para>
  35         Provide a clustered version of the TDB database with automatic
  36         rebuild/recovery of the databases upon node failures.
  37       </para>
  38     </listitem>
  39
  40     <listitem>
  41       <para>
  42       Monitor nodes in the cluster and services running on each node.
  43       </para>
  44     </listitem>
  45
  46     <listitem>
  47       <para>
  48         Manage a pool of public IP addresses that are used to provide
  49         services to clients.  Alternatively, CTDB can be used with
  50         LVS.
  51       </para>
  52     </listitem>
  53   </itemizedlist>
  54
  55   <para>
  56     Combined with a cluster filesystem, CTDB provides a full
  57     high-availability (HA) environment for services such as
  58     clustered Samba and NFS.
  59   </para>
  60 </refsect1>
  61
  62 <refsect1>
  63   <title>ANATOMY OF A CTDB CLUSTER</title>
  64
  65   <para>
  66     A CTDB cluster is a collection of nodes with 2 or more network
  67     interfaces.  All nodes provide network (usually file/NAS) services
  68     to clients.  Data served by file services is stored on shared
  69     storage (usually a cluster filesystem) that is accessible by all
  70     nodes.
  71   </para>
  72   <para>
  73     CTDB provides an "all active" cluster, where services are load
  74     balanced across all nodes.
  75   </para>
  76 </refsect1>
  77
  78   <refsect1>
  79     <title>Recovery Lock</title>
  80
  81     <para>
  82       CTDB uses a <emphasis>recovery lock</emphasis> to avoid a
  83       <emphasis>split brain</emphasis>, where a cluster becomes
  84       partitioned and each partition attempts to operate
  85       independently.  Issues that can result from a split brain
  86       include file data corruption, because file locking metadata may
  87       not be tracked correctly.
  88     </para>
  89
  90     <para>
  91       CTDB uses a <emphasis>cluster leader and follower</emphasis>
  92       model of cluster management.  All nodes in a cluster elect one
  93       node to be the leader.  The leader node coordinates privileged
  94       operations such as database recovery and IP address failover.
  95       CTDB refers to the leader node as the <emphasis>recovery
  96       master</emphasis>.  This node takes and holds the recovery lock
  97       to assert its privileged role in the cluster.
  98     </para>
  99
 100     <para>
 101       By default, the recovery lock is implemented using a file
 102       (specified by <parameter>recovery lock</parameter> in the
 103       <literal>[cluster]</literal> section of
 104       <citerefentry><refentrytitle>ctdb.conf</refentrytitle>
 105       <manvolnum>5</manvolnum></citerefentry>) residing in shared
 106       storage (usually on a cluster filesystem).  To support a
 107       recovery lock, the cluster filesystem must support lock
 108       coherence.  See
 109       <citerefentry><refentrytitle>ping_pong</refentrytitle>
 110       <manvolnum>1</manvolnum></citerefentry> for more details.
 111     </para>
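
    <para>
      As a minimal sketch, a file-based recovery lock might be
      configured as follows.  The path
      <filename>/clusterfs/.ctdb/reclock</filename> is hypothetical
      and stands for any file on the cluster filesystem that is
      visible to all nodes:
    </para>
    <screen format="linespecific">
[cluster]
    recovery lock = /clusterfs/.ctdb/reclock
    </screen>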
 112
 113     <para>
 114       The recovery lock can also be implemented using an arbitrary
 115       cluster mutex call-out by using an exclamation point ('!') as
 116       the first character of <parameter>recovery lock</parameter>.
 117       For example, a value of <command>!/usr/local/bin/myhelper
 118       recovery</command> would run the given helper with the specified
 119       arguments.  See the source code relating to cluster mutexes for
 120       clues about writing call-outs.
 121     </para>
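
    <para>
      Using the helper from the example above, the corresponding
      configuration fragment might look like this.  The helper path
      and its argument are illustrative only:
    </para>
    <screen format="linespecific">
[cluster]
    recovery lock = !/usr/local/bin/myhelper recovery
    </screen>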
 122
 123     <para>
 124       If a cluster becomes partitioned (for example, due to a
 125       communication failure) and a different recovery master is
 126       elected by the nodes in each partition, then only one of these
 127       recovery masters will be able to take the recovery lock.  The
 128       recovery master in the "losing" partition will not be able to
 129       take the recovery lock and will be excluded from the cluster.
 130       The nodes in the "losing" partition will elect each node in turn
 131       as their recovery master so eventually all the nodes in that
 132       partition will be excluded.
 133     </para>
 134
 135     <para>
 136       CTDB does sanity checks to ensure that the recovery lock is held
 137       as expected.
 138     </para>
 139
 140     <para>
 141       CTDB can run without a recovery lock but this is not recommended
 142       as there will be no protection from split brains.
 143     </para>
 144   </refsect1>
 145
 146   <refsect1>
 147     <title>Private vs Public addresses</title>
 148
 149     <para>
 150       Each node in a CTDB cluster has multiple IP addresses assigned
 151       to it:
 152
 153       <itemizedlist>
 154         <listitem>
 155           <para>
 156             A single private IP address that is used for communication
 157             between nodes.
 158           </para>
 159         </listitem>
 160         <listitem>
 161           <para>
 162             One or more public IP addresses that are used to provide
 163             NAS or other services.
 164           </para>
 165         </listitem>
 166       </itemizedlist>
 167     </para>
 168
 169     <refsect2>
 170       <title>Private address</title>
 171
 172       <para>
 173         Each node is configured with a unique, permanently assigned
 174         private address.  This address is configured by the operating
 175         system.  This address uniquely identifies a physical node in
 176         the cluster and is the address that CTDB daemons will use to
 177         communicate with the CTDB daemons on other nodes.
 178       </para>
 179
 180       <para>
 181         Private addresses are listed in the file
 182         <filename>/usr/local/etc/ctdb/nodes</filename>.  This file
 183         contains the list of private addresses for all nodes in the
 184         cluster, one per line. This file must be the same on all nodes
 185         in the cluster.
 186       </para>
 187
 188       <para>
 189         Some users like to put this configuration file in their
 190         cluster filesystem.  A symbolic link should be used in this
 191         case.
 192       </para>
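
      <para>
        For instance, assuming the shared copy lives at the
        hypothetical path <filename>/clusterfs/ctdb/nodes</filename>,
        the link could be created on each node with:
      </para>
      <screen format="linespecific">
ln -s /clusterfs/ctdb/nodes /usr/local/etc/ctdb/nodes
      </screen>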
 193
 194       <para>
 195         Private addresses should not be used by clients to connect to
 196         services provided by the cluster.
 197       </para>
 198       <para>
 199         It is strongly recommended that the private addresses are
 200         configured on a private network that is separate from client
 201         networks.  This is because the CTDB protocol is both
 202         unauthenticated and unencrypted.  If clients share the private
 203         network then steps need to be taken to stop injection of
 204         packets to relevant ports on the private addresses.  It is
 205         also likely that CTDB protocol traffic between nodes could
 206         leak sensitive information if it can be intercepted.
 207       </para>
 208
 209       <para>
 210         Example <filename>/usr/local/etc/ctdb/nodes</filename> for a four node
 211         cluster:
 212       </para>
 213       <screen format="linespecific">
 214 192.168.1.1
 215 192.168.1.2
 216 192.168.1.3
 217 192.168.1.4
 218       </screen>
 219     </refsect2>
 220
 221     <refsect2>
 222       <title>Public addresses</title>
 223
 224       <para>
 225         Public addresses are used to provide services to clients.
 226         Public addresses are not configured at the operating system
 227         level and are not permanently associated with a particular
 228         node.  Instead, they are managed by CTDB and are assigned to
 229         interfaces on physical nodes at runtime.
 230       </para>
 231       <para>
 232         The CTDB cluster will assign/reassign these public addresses
 233         across the available healthy nodes in the cluster. When one
 234         node fails, its public addresses will be taken over by one or
 235         more other nodes in the cluster.  This ensures that services
 236         provided by each public address remain available to
 237         clients, as long as at least one available node is capable
 238         of hosting that address.
 239       </para>
 240
 241       <para>
 242         The public address configuration is stored in
 243         <filename>/usr/local/etc/ctdb/public_addresses</filename> on
 244         each node.  This file contains a list of the public addresses
 245         that the node is capable of hosting, one per line.  Each entry
 246         also contains the netmask and the interface to which the
 247         address should be assigned.  If this file is missing then no
 248         public addresses are configured.
 249       </para>
 250
 251       <para>
 252         Some users who have the same public addresses on all nodes
 253         like to put this configuration file in their cluster
 254         filesystem.  A symbolic link should be used in this case.
 255       </para>
 256
 257       <para>
 258         Example <filename>/usr/local/etc/ctdb/public_addresses</filename> for a
 259         node that can host 4 public addresses, on 2 different
 260         interfaces:
 261       </para>
 262       <screen format="linespecific">
 263 10.1.1.1/24 eth1
 264 10.1.1.2/24 eth1
 265 10.1.2.1/24 eth2
 266 10.1.2.2/24 eth2
 267       </screen>
 268
 269       <para>
 270         In many cases the public addresses file will be the same on
 271         all nodes.  However, it is possible to use different public
 272         address configurations on different nodes.
 273       </para>
 274
 275       <para>
 276         Example: 4 nodes partitioned into two subgroups:
 277       </para>
 278       <screen format="linespecific">
 279 Node 0:/usr/local/etc/ctdb/public_addresses
 280         10.1.1.1/24 eth1
 281         10.1.1.2/24 eth1
 282
 283 Node 1:/usr/local/etc/ctdb/public_addresses
 284         10.1.1.1/24 eth1
 285         10.1.1.2/24 eth1
 286
 287 Node 2:/usr/local/etc/ctdb/public_addresses
 288         10.1.2.1/24 eth2
 289         10.1.2.2/24 eth2
 290
 291 Node 3:/usr/local/etc/ctdb/public_addresses
 292         10.1.2.1/24 eth2
 293         10.1.2.2/24 eth2
 294       </screen>
 295       <para>
 296         In this example nodes 0 and 1 host two public addresses on the
 297         10.1.1.x network while nodes 2 and 3 host two public addresses
 298         for the 10.1.2.x network.
 299       </para>
 300       <para>
 301         Public address 10.1.1.1 can be hosted by either of nodes 0 or
 302         1 and will be available to clients as long as at least one of
 303         these two nodes is available.
 304       </para>
 305       <para>
 306         If both nodes 0 and 1 become unavailable then public address
 307         10.1.1.1 also becomes unavailable. 10.1.1.1 cannot be failed
 308         over to nodes 2 or 3 since these nodes do not have this public
 309         address configured.
 310       </para>
 311       <para>
 312         The <command>ctdb ip</command> command can be used to view the
 313         current assignment of public addresses to physical nodes.
 314       </para>
 315     </refsect2>
 316   </refsect1>
 317
 318
 319   <refsect1>
 320     <title>Node status</title>
 321
 322     <para>
 323       The current status of each node in the cluster can be viewed
 324       using the <command>ctdb status</command> command.
 325     </para>
 326
 327     <para>
 328       A node can be in one of the following states:
 329     </para>
 330
 331     <variablelist>
 332       <varlistentry>
 333         <term>OK</term>
 334         <listitem>
 335           <para>
 336             This node is healthy and fully functional.  It hosts public
 337             addresses to provide services.
 338           </para>
 339         </listitem>
 340       </varlistentry>
 341
 342       <varlistentry>
 343         <term>DISCONNECTED</term>
 344         <listitem>
 345           <para>
 346             This node is not reachable by other nodes via the private
 347             network.  It is not currently participating in the cluster.
 348             It <emphasis>does not</emphasis> host public addresses to
 349             provide services.  It might be shut down.
 350           </para>
 351         </listitem>
 352       </varlistentry>
 353
 354       <varlistentry>
 355         <term>DISABLED</term>
 356         <listitem>
 357           <para>
 358             This node has been administratively disabled. This node is
 359             partially functional and participates in the cluster.
 360             However, it <emphasis>does not</emphasis> host public
 361             addresses to provide services.
 362           </para>
 363         </listitem>
 364       </varlistentry>
 365
 366       <varlistentry>
 367         <term>UNHEALTHY</term>
 368         <listitem>
 369           <para>
 370             A service provided by this node has failed a health check
 371             and should be investigated.  This node is partially
 372             functional and participates in the cluster.  However, it
 373             <emphasis>does not</emphasis> host public addresses to
 374             provide services.  Unhealthy nodes should be investigated
 375             and may require an administrative action to rectify.
 376           </para>
 377         </listitem>
 378       </varlistentry>
 379
 380       <varlistentry>
 381         <term>BANNED</term>
 382         <listitem>
 383           <para>
 384             CTDB is not behaving as designed on this node.  For example,
 385             it may have failed too many recovery attempts.  Such nodes
 386             are banned from participating in the cluster for a
 387             configurable time period before they attempt to rejoin the
 388             cluster.  A banned node <emphasis>does not</emphasis> host
 389             public addresses to provide services.  All banned nodes
 390             should be investigated and may require an administrative
 391             action to rectify.
 392           </para>
 393         </listitem>
 394       </varlistentry>
 395
 396       <varlistentry>
 397         <term>STOPPED</term>
 398         <listitem>
 399           <para>
 400             This node has been administratively excluded from the
 401             cluster.  A stopped node does not participate in the cluster
 402             and <emphasis>does not</emphasis> host public addresses to
 403             provide services.  This state can be used while performing
 404             maintenance on a node.
 405           </para>
 406         </listitem>
 407       </varlistentry>
 408
 409       <varlistentry>
 410         <term>PARTIALLYONLINE</term>
 411         <listitem>
 412           <para>
 413             A node that is partially online participates in a cluster
 414             like a healthy (OK) node.  Some interfaces to serve public
 415             addresses are down, but at least one interface is up.  See
 416             also <command>ctdb ifaces</command>.
 417           </para>
 418         </listitem>
 419       </varlistentry>
 420
 421     </variablelist>
 422   </refsect1>
 423
 424   <refsect1>
 425     <title>CAPABILITIES</title>
 426
 427     <para>
 428       Cluster nodes can have several different capabilities enabled.
 429       These are listed below.
 430     </para>
 431
 432     <variablelist>
 433
 434       <varlistentry>
 435         <term>RECMASTER</term>
 436         <listitem>
 437           <para>
 438             Indicates that a node can become the CTDB cluster recovery
 439             master.  The current recovery master is decided via an
 440             election held by all active nodes with this capability.
 441           </para>
 442           <para>
 443             Default is YES.
 444           </para>
 445         </listitem>
 446       </varlistentry>
 447
 448       <varlistentry>
 449         <term>LMASTER</term>
 450         <listitem>
 451           <para>
 452             Indicates that a node can be the location master (LMASTER)
 453             for database records.  The LMASTER always knows which node
 454             has the latest copy of a record in a volatile database.
 455           </para>
 456           <para>
 457             Default is YES.
 458           </para>
 459         </listitem>
 460       </varlistentry>
 461
 462     </variablelist>
 463
 464     <para>
 465       The RECMASTER and LMASTER capabilities can be disabled when CTDB
 466       is used to create a cluster spanning across WAN links. In this
 467       case CTDB acts as a WAN accelerator.
 468     </para>
 469
 470   </refsect1>
 471
 472   <refsect1>
 473     <title>LVS</title>
 474
 475     <para>
 476       LVS is a mode where CTDB presents one single IP address for the
 477       entire cluster. This is an alternative to using public IP
 478       addresses and round-robin DNS to load-balance clients across the
 479       cluster.
 480     </para>
 481
 482     <para>
 483       This is similar to using a layer-4 load-balancing switch but with
 484       some restrictions.
 485     </para>
 486
 487     <para>
 488       One extra LVS public address is assigned on the public network
 489       to each LVS group.  Each LVS group is a set of nodes in the
 490       cluster that presents the same LVS public address to the
 491       outside world.  Normally there would only be one LVS group
 492       spanning an entire cluster, but in situations where one CTDB
 493       cluster spans multiple physical sites it might be useful to have
 494       one LVS group for each site.  There can be multiple LVS groups
 495       in a cluster but each node can only be a member of one LVS group.
 496     </para>
 497
 498     <para>
 499       Client access to the cluster is load-balanced across the HEALTHY
 500       nodes in an LVS group.  If no HEALTHY nodes exist then all
 501       nodes in the group are used, regardless of health status.  CTDB
 502       will, however, never load-balance LVS traffic to nodes that are
 503       BANNED, STOPPED, DISABLED or DISCONNECTED.  The <command>ctdb
 504       lvs</command> command shows the nodes across which traffic is
 505       currently load-balanced.
 506     </para>
 507
 508     <para>
 509       In each LVS group, one of the nodes is selected by CTDB to be
 510       the LVS master.  This node receives all traffic from clients
 511       coming in to the LVS public address and multiplexes it across
 512       the internal network to one of the nodes that LVS is using.
 513       When responding to the client, that node will send the data back
 514       directly to the client, bypassing the LVS master node.  The
 515       command <command>ctdb lvs master</command> will show which node
 516       is the current LVS master.
 517     </para>
 518
 519     <para>
 520       The path used for a client I/O is:
 521       <orderedlist>
 522         <listitem>
 523           <para>
 524             Client sends request packet to LVSMASTER.
 525           </para>
 526         </listitem>
 527         <listitem>
 528           <para>
 529             LVSMASTER passes the request on to one node across the
 530             internal network.
 531           </para>
 532         </listitem>
 533         <listitem>
 534           <para>
 535             Selected node processes the request.
 536           </para>
 537         </listitem>
 538         <listitem>
 539           <para>
 540             Node responds back to client.
 541           </para>
 542         </listitem>
 543       </orderedlist>
 544     </para>
 545
 546     <para>
 547       This means that all incoming traffic to the cluster will pass
 548       through one physical node, which limits scalability. You cannot
 549       send more data to the LVS address than one physical node can
 550       multiplex. This means that you should not use LVS if your I/O
 551       pattern is write-intensive, since throughput will be limited by
 552       the network bandwidth that node can handle.  LVS does work
 553       very well for read-intensive workloads where only smallish READ
 554       requests are going through the LVSMASTER bottleneck and the
 555       majority of the traffic volume (the data in the read replies)
 556       goes straight from the processing node back to the clients. For
 557       read-intensive I/O patterns you can achieve very high throughput
 558       rates in this mode.
 559     </para>
 560
 561     <para>
 562       Note: you can use LVS and public addresses at the same time.
 563     </para>
 564
 565     <para>
 566       If you use LVS, you must have a permanent address configured for
 567       the public interface on each node. This address must be routable
 568       and the cluster nodes must be configured so that all traffic
 569       back to client hosts is routed through this interface. This is
 570       also required in order to allow samba/winbind on the node to
 571       talk to the domain controller.  This LVS IP address cannot be
 572       used to initiate outgoing traffic.
 573     </para>
 574     <para>
 575       Make sure that the domain controller and the clients are
 576       reachable from a node <emphasis>before</emphasis> you enable
 577       LVS.  Also ensure that outgoing traffic to these hosts is routed
 578       out through the configured public interface.
 579     </para>
 580
 581     <refsect2>
 582       <title>Configuration</title>
 583
 584       <para>
 585         To activate LVS on a CTDB node you must specify the
 586         <varname>CTDB_LVS_PUBLIC_IFACE</varname>,
 587         <varname>CTDB_LVS_PUBLIC_IP</varname> and
 588         <varname>CTDB_LVS_NODES</varname> configuration variables.
 589         <varname>CTDB_LVS_NODES</varname> specifies a file containing
 590         the private addresses of all nodes in the current node's LVS
 591         group.
 592       </para>
 593
 594       <para>
 595         Example:
 596         <screen format="linespecific">
 597 CTDB_LVS_PUBLIC_IFACE=eth1
 598 CTDB_LVS_PUBLIC_IP=10.1.1.237
 599 CTDB_LVS_NODES=/usr/local/etc/ctdb/lvs_nodes
 600         </screen>
 601       </para>
 602
 603       <para>
 604         Example <filename>/usr/local/etc/ctdb/lvs_nodes</filename>:
 605       </para>
 606       <screen format="linespecific">
 607 192.168.1.2
 608 192.168.1.3
 609 192.168.1.4
 610       </screen>
 611
 612       <para>
 613         Normally any node in an LVS group can act as the LVS master.
 614         Nodes that are highly loaded due to other demands may be
 615         flagged with the "slave-only" option in the
 616         <varname>CTDB_LVS_NODES</varname> file to limit the LVS
 617         functionality of those nodes.
 618       </para>
 619
 620       <para>
 621         LVS nodes file that excludes 192.168.1.4 from being
 622         the LVS master node:
 623       </para>
 624       <screen format="linespecific">
 625 192.168.1.2
 626 192.168.1.3
 627 192.168.1.4 slave-only
 628       </screen>
 629
 630     </refsect2>
 631   </refsect1>
 632
 633   <refsect1>
 634     <title>TRACKING AND RESETTING TCP CONNECTIONS</title>
 635
 636     <para>
 637       CTDB tracks TCP connections from clients to public IP addresses,
 638       on known ports.  When an IP address moves from one node to
 639       another, all existing TCP connections to that IP address are
 640       reset.  The node taking over this IP address will also send
 641       gratuitous ARPs (for IPv4) or neighbour advertisements (for
 642       IPv6).  This allows clients to reconnect quickly, rather than
 643       waiting for TCP timeouts, which can be very long.
 644     </para>
 645
 646     <para>
 647       It is important that established TCP connections do not survive
 648       a release and take of a public IP address on the same node.
 649       Such connections can get out of sync with sequence and ACK
 650       numbers, potentially causing a disruptive ACK storm.
 651     </para>
 652
 653   </refsect1>
 654
 655   <refsect1>
 656     <title>NAT GATEWAY</title>
 657
 658     <para>
 659       NAT gateway (NATGW) is an optional feature that is used to
 660       configure fallback routing for nodes.  This allows cluster nodes
 661       to connect to external services (e.g. DNS, AD, NIS and LDAP)
 662       when they do not host any public addresses (e.g. when they are
 663       unhealthy).
 664     </para>
 665     <para>
 666       This also applies to node startup because CTDB marks nodes as
 667       UNHEALTHY until they have passed a "monitor" event.  In this
 668       context, NAT gateway helps to avoid a "chicken and egg"
 669       situation where a node needs to access an external service to
 670       become healthy.
 671     </para>
 672     <para>
 673       Another way of solving this type of problem is to assign an
 674       extra static IP address to a public interface on every node.
 675       This is simpler but it uses an extra IP address per node, while
 676       NAT gateway generally uses only one extra IP address.
 677     </para>
 678
 679     <refsect2>
 680       <title>Operation</title>
 681
 682       <para>
 683         One extra NATGW public address is assigned on the public
 684         network to each NATGW group.  Each NATGW group is a set of
 685         nodes in the cluster that shares the same NATGW address to
 686         talk to the outside world.  Normally there would only be one
 687         NATGW group spanning an entire cluster, but in situations
 688         where one CTDB cluster spans multiple physical sites it might
 689         be useful to have one NATGW group for each site.
 690       </para>
 691       <para>
 692         There can be multiple NATGW groups in a cluster but each node
 693         can only be a member of one NATGW group.
 694       </para>
 695       <para>
 696         In each NATGW group, one of the nodes is selected by CTDB to
 697         be the NATGW master and the other nodes are considered to be
 698         NATGW slaves.  NATGW slaves establish a fallback default route
 699         to the NATGW master via the private network.  When a NATGW
 700         slave hosts no public IP addresses then it will use this route
 701         for outbound connections.  The NATGW master hosts the NATGW
 702         public IP address and routes outgoing connections from
 703         slave nodes via this IP address.  It also establishes a
 704         fallback default route.
 705       </para>
 706     </refsect2>
 707
 708     <refsect2>
 709       <title>Configuration</title>
 710
 711       <para>
 712         NATGW is usually configured similarly to the following example:
 713       </para>
 714       <screen format="linespecific">
 715 CTDB_NATGW_NODES=/usr/local/etc/ctdb/natgw_nodes
 716 CTDB_NATGW_PRIVATE_NETWORK=192.168.1.0/24
 717 CTDB_NATGW_PUBLIC_IP=10.0.0.227/24
 718 CTDB_NATGW_PUBLIC_IFACE=eth0
 719 CTDB_NATGW_DEFAULT_GATEWAY=10.0.0.1
 720       </screen>
 721
 722       <para>
 723         Normally any node in a NATGW group can act as the NATGW
 724         master.  Some configurations may have special nodes that lack
 725         connectivity to a public network.  In such cases, those nodes
 726         can be flagged with the "slave-only" option in the
 727         <varname>CTDB_NATGW_NODES</varname> file to limit the NATGW
 728         functionality of those nodes.
 729       </para>
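
      <para>
        As an illustration, a <varname>CTDB_NATGW_NODES</varname>
        file that prevents 192.168.1.4 from becoming the NATGW master
        might look like the following.  The addresses are assumed to
        be those of the four node example cluster, and the layout is
        assumed to mirror that of the LVS nodes file:
      </para>
      <screen format="linespecific">
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4 slave-only
      </screen>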
 730
 731       <para>
 732         See the <citetitle>NAT GATEWAY</citetitle> section in
 733         <citerefentry><refentrytitle>ctdb-script.options</refentrytitle>
 734         <manvolnum>5</manvolnum></citerefentry> for more details of
 735         NATGW configuration.
 736       </para>
 737     </refsect2>
 738
 739
 740     <refsect2>
 741       <title>Implementation details</title>
 742
 743       <para>
 744         When the NATGW functionality is used, one of the nodes is
 745         selected to act as a NAT gateway for all the other nodes in
 746         the group when they need to communicate with the external
 747         services.  The NATGW master is selected to be a node that is
 748         most likely to have usable networks.
 749       </para>
 750
 751       <para>
 752         The NATGW master hosts the NATGW public IP address
 753         <varname>CTDB_NATGW_PUBLIC_IP</varname> on the configured public
 754         interface <varname>CTDB_NATGW_PUBLIC_IFACE</varname> and acts as
 755         a router, masquerading outgoing connections from slave nodes
 756         via this IP address.  If
 757         <varname>CTDB_NATGW_DEFAULT_GATEWAY</varname> is set then it
 758         also establishes a fallback default route to this gateway
 759         with a metric of 10.  A metric 10 route is used
 760         so it can co-exist with other default routes that may be
 761         available.
 762       </para>
 763
 764       <para>
 765         A NATGW slave establishes its fallback default route to the
 766         NATGW master via the private network
 767         <varname>CTDB_NATGW_PRIVATE_NETWORK</varname> with a metric of 10.
 768         This route is used for outbound connections when no other
 769         default route is available because the node hosts no public
 770         addresses.  A metric 10 route is used so that it can co-exist
 771         with other default routes that may be available when the node
 772         is hosting public addresses.
 773       </para>
 774
 775       <para>
 776         <varname>CTDB_NATGW_STATIC_ROUTES</varname> can be used to
 777         have NATGW create more specific routes instead of just default
 778         routes.
 779       </para>
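
      <para>
        A sketch of such a setting is shown below.  The networks are
        illustrative, and the form of each entry (a network,
        optionally followed by <literal>@</literal> and a gateway) is
        stated here as an assumption; the
        <citetitle>NAT GATEWAY</citetitle> section in
        <citerefentry><refentrytitle>ctdb-script.options</refentrytitle>
        <manvolnum>5</manvolnum></citerefentry> is the authoritative
        reference for the syntax:
      </para>
      <screen format="linespecific">
CTDB_NATGW_STATIC_ROUTES="10.1.1.0/24 10.1.2.0/24@10.0.0.1"
      </screen>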
 780
 781       <para>
 782         This is implemented in the <filename>11.natgw</filename>
 783         eventscript.  Please see the eventscript file and the
 784         <citetitle>NAT GATEWAY</citetitle> section in
 785         <citerefentry><refentrytitle>ctdb-script.options</refentrytitle>
 786         <manvolnum>5</manvolnum></citerefentry> for more details.
 787       </para>
 788
 789     </refsect2>
 790   </refsect1>
 791
 792   <refsect1>
 793     <title>POLICY ROUTING</title>
 794
 795     <para>
 796       Policy routing is an optional CTDB feature to support complex
 797       network topologies.  Public addresses may be spread across
 798       several different networks (or VLANs) and it may not be possible
 799       to route packets from these public addresses via the system's
 800       default route.  Therefore, CTDB has support for policy routing
 801       via the <filename>13.per_ip_routing</filename> eventscript.
 802       This allows routing to be specified for packets sourced from
 803       each public address.  The routes are added and removed as CTDB
 804       moves public addresses between nodes.
 805     </para>
 806
 807     <refsect2>
 808       <title>Configuration variables</title>
 809
 810       <para>
 811         There are 4 configuration variables related to policy routing:
 812         <varname>CTDB_PER_IP_ROUTING_CONF</varname>,
 813         <varname>CTDB_PER_IP_ROUTING_RULE_PREF</varname>,
 814         <varname>CTDB_PER_IP_ROUTING_TABLE_ID_LOW</varname>,
 815         <varname>CTDB_PER_IP_ROUTING_TABLE_ID_HIGH</varname>.  See the
 816         <citetitle>POLICY ROUTING</citetitle> section in
 817         <citerefentry><refentrytitle>ctdb-script.options</refentrytitle>
 818         <manvolnum>5</manvolnum></citerefentry> for more details.
 819       </para>
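
           <para>
             A possible set of values is sketched here.  The file name and
             the rule preference match the examples used elsewhere in this
             section; the routing table ID range is an assumption chosen
             for illustration only:
           </para>

           <screen>
     CTDB_PER_IP_ROUTING_CONF=/usr/local/etc/ctdb/policy_routing
     CTDB_PER_IP_ROUTING_RULE_PREF=100
     CTDB_PER_IP_ROUTING_TABLE_ID_LOW=1000
     CTDB_PER_IP_ROUTING_TABLE_ID_HIGH=9000
           </screen>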
 820     </refsect2>
 821
 822     <refsect2>
 823       <title>Configuration</title>
 824
 825       <para>
 826         The format of each line of
 827         <varname>CTDB_PER_IP_ROUTING_CONF</varname> is:
 828       </para>
 829
 830       <screen>
 831 &lt;public_address&gt; &lt;network&gt; [ &lt;gateway&gt; ]
 832       </screen>
 833
 834       <para>
 835         Leading whitespace is ignored and arbitrary whitespace may be
 836         used as a separator.  Lines that have a "public address" item
 837         that doesn't match an actual public address are ignored.  This
 838         means that comment lines can be added using a leading
 839         character such as '#', since this will never match an IP
 840         address.
 841       </para>
 842
 843       <para>
 844         A line without a gateway indicates a link-local route.
 845       </para>
 846
 847       <para>
 848         For example, consider the configuration line:
 849       </para>
 850
 851       <screen>
 852   192.168.1.99  192.168.1.0/24
 853       </screen>
 854
 855       <para>
 856         If the corresponding public_addresses line is:
 857       </para>
 858
 859       <screen>
 860   192.168.1.99/24     eth2,eth3
 861       </screen>
 862
 863       <para>
 864         <varname>CTDB_PER_IP_ROUTING_RULE_PREF</varname> is 100, and
 865         CTDB adds the address to eth2, then the following routing
 866         information is added:
 867       </para>
 868
 869       <screen>
 870   ip rule add from 192.168.1.99 pref 100 table ctdb.192.168.1.99
 871   ip route add 192.168.1.0/24 dev eth2 table ctdb.192.168.1.99
 872       </screen>
 873
 874       <para>
 875         This causes traffic from 192.168.1.99 to 192.168.1.0/24 to go
 876         via eth2.
 877       </para>
 878
 879       <para>
 880         The <command>ip rule</command> command will show something like
 881         the following, depending on other public addresses and other
 882         routes on the system:
 883       </para>
 884
 885       <screen>
 886   0:            from all lookup local
 887   100:          from 192.168.1.99 lookup ctdb.192.168.1.99
 888   32766:        from all lookup main
 889   32767:        from all lookup default
 890       </screen>
 891
 892       <para>
 893         <command>ip route show table ctdb.192.168.1.99</command> will show:
 894       </para>
 895
 896       <screen>
 897   192.168.1.0/24 dev eth2 scope link
 898       </screen>
 899
 900       <para>
 901         The usual use for a line containing a gateway is to add a
 902         default route corresponding to a particular source address.
 903         Consider this line of configuration:
 904       </para>
 905
 906       <screen>
 907   192.168.1.99  0.0.0.0/0       192.168.1.1
 908       </screen>
 909
 910       <para>
 911         In the situation described above this will cause an extra
 912         routing command to be executed:
 913       </para>
 914
 915       <screen>
 916   ip route add 0.0.0.0/0 via 192.168.1.1 dev eth2 table ctdb.192.168.1.99
 917       </screen>
 918
 919       <para>
 920         With both configuration lines, <command>ip route show table
 921         ctdb.192.168.1.99</command> will show:
 922       </para>
 923
 924       <screen>
 925   192.168.1.0/24 dev eth2 scope link
 926   default via 192.168.1.1 dev eth2
 927       </screen>
 928     </refsect2>
 929
 930     <refsect2>
 931       <title>Sample configuration</title>
 932
 933       <para>
 934         Here is a more complete example configuration.
 935       </para>
 936
 937       <screen>
 938 /usr/local/etc/ctdb/public_addresses:
 939
 940   192.168.1.98/24  eth2,eth3
 941   192.168.1.99/24  eth2,eth3
 942
 943 /usr/local/etc/ctdb/policy_routing:
 944
 945   192.168.1.98 192.168.1.0/24
 946   192.168.1.98 192.168.200.0/24 192.168.1.254
 947   192.168.1.98 0.0.0.0/0        192.168.1.1
 948   192.168.1.99 192.168.1.0/24
 949   192.168.1.99 192.168.200.0/24 192.168.1.254
 950   192.168.1.99 0.0.0.0/0        192.168.1.1
 951       </screen>
 952
 953       <para>
 954         This routes local packets as expected and keeps the default
 955         route as previously discussed, but packets to 192.168.200.0/24
 956         are routed via the alternate gateway 192.168.1.254.
 957       </para>
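
           <para>
             As a sketch of the expected result, assuming CTDB adds
             192.168.1.98 to eth2, <command>ip route show table
             ctdb.192.168.1.98</command> would be expected to list entries
             like these (the order may differ):
           </para>

           <screen>
       192.168.1.0/24 dev eth2 scope link
       192.168.200.0/24 via 192.168.1.254 dev eth2
       default via 192.168.1.1 dev eth2
           </screen>

           <para>
             One way to check which route the kernel selects for a given
             source address is <command>ip route get</command>.  The
             destination 192.168.200.10 is an arbitrary example host:
           </para>

           <screen>
       ip route get 192.168.200.10 from 192.168.1.98
           </screen>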
 958
 959     </refsect2>
 960   </refsect1>
 961
 962   <refsect1>
 963     <title>NOTIFICATIONS</title>
 964
 965     <para>
 966       When certain state changes occur in CTDB, it can be configured
 967       to perform arbitrary actions via notifications.  For example,
 968       it can send SNMP traps or emails when a node becomes
 969       unhealthy.
 970     </para>
 971
 972     <para>
 973       The notification mechanism runs all executable files in
 974       <filename>/usr/local/etc/ctdb/notify.d/</filename>, ignoring any
 975       failures and continuing to run all files.
 976     </para>
 977
 978     <para>
 979       CTDB currently generates notifications after CTDB changes to
 980       these states:
 981     </para>
 982
 983     <simplelist>
 984       <member>init</member>
 985       <member>setup</member>
 986       <member>startup</member>
 987       <member>healthy</member>
 988       <member>unhealthy</member>
 989     </simplelist>
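
         <para>
           A minimal sketch of a notification script follows.  The file
           name <filename>/usr/local/etc/ctdb/notify.d/50-syslog</filename>
           is hypothetical, and the sketch assumes that the new state is
           passed to the script as its first argument; check the
           notification handling shipped with your CTDB version before
           relying on this.  Logging via <command>logger</command> is
           only one possible action:
         </para>

         <screen>
       #!/bin/sh
       # Hypothetical notify.d script: log CTDB state changes to syslog.
       # Assumption: the new state is passed as the first argument.

       # Print a message for a known state, or fail for anything else.
       notify_message ()
       {
           case "$1" in
           unhealthy) echo "CTDB node $(uname -n) is UNHEALTHY" ;;
           healthy)   echo "CTDB node $(uname -n) is healthy again" ;;
           init|setup|startup)
                      echo "CTDB node $(uname -n) entered state $1" ;;
           *)         return 1 ;;
           esac
       }

       # Log known states; silently ignore states not handled here.
       if msg=$(notify_message "$1") ; then
           logger -t ctdb-notify "$msg"
       fi
         </screen>

         <para>
           The file must be executable, since only executable files in
           the directory are run.
         </para>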
 990
 991   </refsect1>
 992
 993   <refsect1>
 994     <title>LOG LEVELS</title>
 995
 996     <para>
 997       Valid log levels, in increasing order of verbosity, are:
 998     </para>
 999
1000     <simplelist>
1001       <member>ERROR</member>
1002       <member>WARNING</member>
1003       <member>NOTICE</member>
1004       <member>INFO</member>
1005       <member>DEBUG</member>
1006     </simplelist>
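
         <para>
           As an illustration, the log level of a running daemon can be
           inspected and changed with the <command>getdebug</command> and
           <command>setdebug</command> commands of
           <citerefentry><refentrytitle>ctdb</refentrytitle>
           <manvolnum>1</manvolnum></citerefentry>.  INFO is just an
           example value here:
         </para>

         <screen>
       ctdb getdebug
       ctdb setdebug INFO
         </screen>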
1007   </refsect1>
1008
1009
1010   <refsect1>
1011     <title>REMOTE CLUSTER NODES</title>
1012     <para>
1013         It is possible to have a CTDB cluster that spans a WAN link,
1014         for example a CTDB cluster in your datacentre with one
1015         additional CTDB node located at a remote branch site.  This is
1016         similar to how a WAN accelerator works, with one difference: a
1017         WAN accelerator often acts as a proxy or man-in-the-middle
1018         (MitM), whereas in the CTDB remote cluster node configuration
1019         the Samba instance at the remote site is the genuine server and
1020         thus provides fully correct CIFS semantics to clients.
1021     </para>
1022
1023     <para>
1024         Think of the cluster as a single multihomed Samba server where
1025         one of the NICs (the remote node) is very far away.
1026     </para>
1027
1028     <para>
1029         NOTE: This requires a cluster filesystem that can cope with
1030         WAN-link latencies, and not all cluster filesystems can.
1031         Whether this provides very good WAN-accelerator performance or
1032         performs very poorly depends entirely on how well your cluster
1033         filesystem handles high latency for data and metadata
1034         operations.
1035     </para>
1036
1037     <para>
1038         To make a node a remote cluster node, set the following two
1039         parameters in <filename>/etc/sysconfig/ctdb</filename> on that node:
1040         <screen format="linespecific">
1041 CTDB_CAPABILITY_LMASTER=no
1042 CTDB_CAPABILITY_RECMASTER=no
1043         </screen>
1044     </para>
1045
1046     <para>
1047         Verify with <command>ctdb getcapabilities</command> that the node
1048         no longer has the recmaster or lmaster capabilities.
1049     </para>
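
         <para>
             On a remote node configured as above, the output would be
             expected to look something like this.  This is a sketch; the
             exact set of capabilities listed varies between CTDB
             versions:
         </para>

         <screen>
     RECMASTER: NO
     LMASTER: NO
         </screen>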
1050
1051   </refsect1>
1052
1053
1054   <refsect1>
1055     <title>SEE ALSO</title>
1056
1057     <para>
1058       <citerefentry><refentrytitle>ctdb</refentrytitle>
1059       <manvolnum>1</manvolnum></citerefentry>,
1060
1061       <citerefentry><refentrytitle>ctdbd</refentrytitle>
1062       <manvolnum>1</manvolnum></citerefentry>,
1063
1064       <citerefentry><refentrytitle>ctdbd_wrapper</refentrytitle>
1065       <manvolnum>1</manvolnum></citerefentry>,
1066
1067       <citerefentry><refentrytitle>ctdb_diagnostics</refentrytitle>
1068       <manvolnum>1</manvolnum></citerefentry>,
1069
1070       <citerefentry><refentrytitle>ltdbtool</refentrytitle>
1071       <manvolnum>1</manvolnum></citerefentry>,
1072
1073       <citerefentry><refentrytitle>onnode</refentrytitle>
1074       <manvolnum>1</manvolnum></citerefentry>,
1075
1076       <citerefentry><refentrytitle>ping_pong</refentrytitle>
1077       <manvolnum>1</manvolnum></citerefentry>,
1078
1079       <citerefentry><refentrytitle>ctdb.conf</refentrytitle>
1080       <manvolnum>5</manvolnum></citerefentry>,
1081
1082       <citerefentry><refentrytitle>ctdb-script.options</refentrytitle>
1083       <manvolnum>5</manvolnum></citerefentry>,
1084
1085       <citerefentry><refentrytitle>ctdb.sysconfig</refentrytitle>
1086       <manvolnum>5</manvolnum></citerefentry>,
1087
1088       <citerefentry><refentrytitle>ctdb-statistics</refentrytitle>
1089       <manvolnum>7</manvolnum></citerefentry>,
1090
1091       <citerefentry><refentrytitle>ctdb-tunables</refentrytitle>
1092       <manvolnum>7</manvolnum></citerefentry>,
1093
1094       <ulink url="http://ctdb.samba.org/"/>
1095     </para>
1096   </refsect1>
1097
1098   <refentryinfo>
1099     <author>
1100       <contrib>
1101         This documentation was written by
1102         Ronnie Sahlberg,
1103         Amitay Isaacs,
1104         Martin Schwenke
1105       </contrib>
1106     </author>
1107
1108     <copyright>
1109       <year>2007</year>
1110       <holder>Andrew Tridgell</holder>
1111       <holder>Ronnie Sahlberg</holder>
1112     </copyright>
1113     <legalnotice>
1114       <para>
1115         This program is free software; you can redistribute it and/or
1116         modify it under the terms of the GNU General Public License as
1117         published by the Free Software Foundation; either version 3 of
1118         the License, or (at your option) any later version.
1119       </para>
1120       <para>
1121         This program is distributed in the hope that it will be
1122         useful, but WITHOUT ANY WARRANTY; without even the implied
1123         warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
1124         PURPOSE.  See the GNU General Public License for more details.
1125       </para>
1126       <para>
1127         You should have received a copy of the GNU General Public
1128         License along with this program; if not, see
1129         <ulink url="http://www.gnu.org/licenses"/>.
1130       </para>
1131     </legalnotice>
1132   </refentryinfo>
1133
1134 </refentry>