autocluster: Retry setup stage
author: Martin Schwenke <martin@meltin.net>
Thu, 21 May 2020 07:33:23 +0000 (17:33 +1000)
committer: Martin Schwenke <martin@meltin.net>
Tue, 26 May 2020 03:31:14 +0000 (13:31 +1000)
Ansible sometimes finds that nodes are unreachable even though it is
possible to ssh to them manually.  Retry the setup stage up to 5
times, pausing 1 second after each failure, in case a later attempt
succeeds.

Signed-off-by: Martin Schwenke <martin@meltin.net>
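
For illustration only, the retry approach described above could be
factored into a standalone helper along the following lines.  This is a
minimal sketch, not part of this commit: the name run_with_retries and
its attempts/delay parameters are hypothetical.  Unlike the patch, the
sketch does not sleep or print "retrying" after the final failed
attempt, and it returns the exit status rather than calling sys.exit().

```python
import subprocess
import sys
import time


def run_with_retries(args, attempts=5, delay=1.0):
    """Run a command until it succeeds or attempts are exhausted.

    Hypothetical helper, not taken from autocluster.py.  Returns 0 on
    success, otherwise the exit status of the last failed attempt.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')

    returncode = 0
    for attempt in range(1, attempts + 1):
        try:
            subprocess.check_call(args)
        except subprocess.CalledProcessError as err:
            returncode = err.returncode
            # Only warn and pause if another attempt will follow
            if attempt < attempts:
                print('warning: command exited with %d, retrying' %
                      returncode,
                      file=sys.stderr)
                time.sleep(delay)
        else:
            return 0

    return returncode
```

A caller such as cluster_setup() could then, under the same
assumptions, pass the returned status to sys.exit() with an error
message when it is non-zero.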
autocluster.py

index 8cac6e4f377bf8b25178b4f206aefb0cd3e870ef..a7c11ce6264e1fdec2a8641e23f55e6575b41836 100755 (executable)
@@ -28,6 +28,7 @@ import sys
 import re
 import subprocess
 import shutil
+import time
 
 import ipaddress
 
@@ -676,10 +677,21 @@ def cluster_setup(cluster):
             '-e', '@%s' % config_file,
             '-i', inventory,
             playbook]
-    try:
-        subprocess.check_call(args)
-    except subprocess.CalledProcessError as err:
-        sys.exit('ERROR: cluster setup exited with %d' % err.returncode)
+
+    # First attempt sometimes fails, so try a few times
+    for _ in range(5):
+        try:
+            subprocess.check_call(args)
+        except subprocess.CalledProcessError as err:
+            print('warning: cluster setup exited with %d, retrying' %
+                  err.returncode,
+                  file=sys.stderr)
+            saved_err = err
+            time.sleep(1)
+        else:
+            return
+
+    sys.exit('ERROR: cluster setup exited with %d' % saved_err.returncode)
 
 
 def cluster_build(cluster):