Fault-tolerant Cluster database operations #224

alexjpwalker · 2021-05-18T09:41:57Z

What is the goal of this PR?

We improved the fault tolerance of our Cluster database operations, namely Create, Contains and Delete.

What are the changes implemented in this PR?

Make database operations "Create, Contains, Delete" fault tolerant

alexjpwalker · 2021-05-19T11:01:01Z

tests/integration/test_cluster_failover.py

@@ -94,7 +94,7 @@ def test_put_entity_type_to_crashed_primary_replica(self):
                        lsof = subprocess.check_output(["lsof", "-i", ":%s" % port])
                    except subprocess.CalledProcessError:
                        pass
-                    sleep(0.5)
+                    sleep(1)


The test seems to fail intermittently with the shorter sleep, perhaps because it hits an unexpected error while the Cluster node is still starting up.
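
For illustration only, here is a minimal sketch of one alternative to a fixed sleep: polling until the node's port accepts TCP connections. The helper name `wait_for_port` and its parameters are hypothetical and not part of this PR, and an open port does not by itself guarantee that a Cluster node has finished starting up.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.2) -> bool:
    """Hypothetical helper: poll until a TCP connection to (host, port) succeeds.

    Returns True as soon as a connection is accepted, or False once the
    timeout has elapsed without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError (e.g. connection refused) while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A test could call something like `wait_for_port("127.0.0.1", port)` after restarting a node and only fall back to a short sleep if the server needs extra time after the port opens.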

alexjpwalker · 2021-05-19T11:02:26Z

typedb/cluster/database.py

@@ -142,3 +147,99 @@ def __hash__(self):

            def __str__(self):
                return "%s/%s" % (self._address, self._database)
+
+
+# This class has to live here because of circular class creation between ClusterDatabase and FailsafeTask


This is highly unpleasant, but I can't find a better workaround. Because an instance method of FailsafeTask creates a ClusterDatabase, and an instance method of ClusterDatabase creates a FailsafeTask, the two cannot be placed in separate modules without creating a circular import in Python. (This may also be the case in NodeJS.)

alexjpwalker · 2021-05-19T11:06:54Z

typedb/cluster/database_manager.py

            except TypeDBClientException as e:
                errors.append("- %s: %s\n" % (address, e))
        raise TypeDBClientException.of(CLUSTER_ALL_NODES_FAILED, str([str(e) for e in errors]))

    def database_mgrs(self) -> Dict[str, _CoreDatabaseManager]:
        return self._database_mgrs
+
+    def _failsafe_task(self, name: str, task: Callable[[TypeDBClusterStub, _CoreDatabaseManager], T]):


This is where the bulk of the fault-tolerant database operations logic lives.
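
As a rough sketch of the pattern visible in the hunk above (try each node, collect the per-node errors, raise only if every node fails): the names `run_failsafe` and `NodeError` are hypothetical stand-ins, and the real `_failsafe_task` takes a stub and a database manager and may do more than this.

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class NodeError(Exception):
    """Hypothetical stand-in for TypeDBClientException."""


def run_failsafe(nodes: Dict[str, object], task: Callable[[object], T]) -> T:
    """Run `task` against each node in turn and return the first successful result."""
    errors: List[str] = []
    for address, node in nodes.items():
        try:
            return task(node)
        except NodeError as e:
            # Remember why this node failed, then move on to the next one
            errors.append("- %s: %s" % (address, e))
    # Every node failed: surface all collected errors in a single exception
    raise NodeError("all nodes failed:\n" + "\n".join(errors))
```

With this shape, an operation such as Create, Contains or Delete is written once as a `task` callable and reused across nodes.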

lolski · 2021-05-19T11:14:05Z

tests/integration/test_cluster_failover.py

@@ -94,7 +94,7 @@ def test_put_entity_type_to_crashed_primary_replica(self):
                        lsof = subprocess.check_output(["lsof", "-i", ":%s" % port])
                    except subprocess.CalledProcessError:
                        pass
-                    sleep(0.5)
+                    sleep(1)


Do you still see the error? Can you paste it here?

typedb/cluster/session.py

lolski · 2021-05-19T11:20:07Z

typedb/cluster/client.py

@@ -23,9 +23,8 @@
 from typedb.api.client import TypeDBClusterClient
 from typedb.api.options import TypeDBOptions, TypeDBClusterOptions
 from typedb.api.session import SessionType
-from typedb.cluster.database import _ClusterDatabase
+from typedb.cluster.database import _ClusterDatabase, _FailsafeTask


Is this needed?

lolski · 2021-05-19T11:37:46Z

tools/cluster_test_rule.bzl

-           ./1/typedb server --data server/data --address 127.0.0.1:11729:11730 --peer 127.0.0.1:11729:11730 --peer 127.0.0.1:21729:21730 --peer 127.0.0.1:31729:31730 &
-           ./2/typedb server --data server/data --address 127.0.0.1:21729:21730 --peer 127.0.0.1:11729:11730 --peer 127.0.0.1:21729:21730 --peer 127.0.0.1:31729:31730 &
-           ./3/typedb server --data server/data --address 127.0.0.1:31729:31730 --peer 127.0.0.1:11729:11730 --peer 127.0.0.1:21729:21730 --peer 127.0.0.1:31729:31730 &
+           ./1/typedb server --data server/data --address 127.0.0.1:11729:11730:11731 --peer 127.0.0.1:11729:11730:11731 --peer 127.0.0.1:21729:21730:21731 --peer 127.0.0.1:31729:31730:31731 &


I've created an issue to extract the TypeDB setup logic into TypeDBRunner and TypeDBClusterRunner, just like in Java: typedb/typedb-dependencies#286. I've also created one for NodeJS: typedb/typedb-dependencies#287.

Fix match:test-core

07bd641

alexjpwalker added type: bug priority: medium labels May 18, 2021

alexjpwalker self-assigned this May 18, 2021

alexjpwalker requested review from flyingsilverfin and vmax as code owners May 18, 2021 09:41

alexjpwalker marked this pull request as draft May 18, 2021 09:42

Alex Walker added 2 commits May 19, 2021 11:38

Fault-tolerant database creation

61b2811

Fix all Cluster tests

880fdbf

alexjpwalker changed the title ~~Fix failing tests~~ Fault-tolerant database creation May 19, 2021

alexjpwalker changed the title ~~Fault-tolerant database creation~~ Fault-tolerant database management May 19, 2021

alexjpwalker changed the title ~~Fault-tolerant database management~~ Fault-tolerant Cluster database operations May 19, 2021

bump

fc93885

alexjpwalker marked this pull request as ready for review May 19, 2021 10:49

Set cluster failover test iteration count to 10

2b3fb37

alexjpwalker commented May 19, 2021

alexjpwalker requested a review from lolski May 19, 2021 11:02

alexjpwalker commented May 19, 2021

lolski reviewed May 19, 2021

Add comment to sleep statement in failover test

5c2e077

lolski approved these changes May 19, 2021

lolski merged commit 6af81c2 into typedb:master May 19, 2021

alexjpwalker deleted the fix-tests branch May 19, 2021 13:03

flyingsilverfin added this to the 2.1.0 milestone May 20, 2021
