-
Notifications
You must be signed in to change notification settings - Fork 24
Fault-tolerant Cluster database operations #224
Conversation
@@ -94,7 +94,7 @@ def test_put_entity_type_to_crashed_primary_replica(self): | |||
lsof = subprocess.check_output(["lsof", "-i", ":%s" % port]) | |||
except subprocess.CalledProcessError: | |||
pass | |||
sleep(0.5) | |||
sleep(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to fail intermittently without this sleep, perhaps due to getting an unexpected error as the Cluster node is in the process of starting up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you still have the error? can you paste them here?
@@ -142,3 +147,99 @@ def __hash__(self): | |||
|
|||
def __str__(self): | |||
return "%s/%s" % (self._address, self._database) | |||
|
|||
|
|||
# This class has to live here because of circular class creation between ClusterDatabase and FailsafeTask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is highly unpleasant but I can't find any workaround that is any better. Because an instance method of FailsafeTask
creates a ClusterDatabase
, and an instance method of ClusterDatabase
creates a FailsafeTask
, it is physically impossible to decouple the two in Python. (This may also be the case in NodeJS)
except TypeDBClientException as e: | ||
errors.append("- %s: %s\n" % (address, e)) | ||
raise TypeDBClientException.of(CLUSTER_ALL_NODES_FAILED, str([str(e) for e in errors])) | ||
|
||
def database_mgrs(self) -> Dict[str, _CoreDatabaseManager]: | ||
return self._database_mgrs | ||
|
||
def _failsafe_task(self, name: str, task: Callable[[TypeDBClusterStub, _CoreDatabaseManager], T]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the bulk of the fault-tolerant DB operations logic lies
@@ -94,7 +94,7 @@ def test_put_entity_type_to_crashed_primary_replica(self): | |||
lsof = subprocess.check_output(["lsof", "-i", ":%s" % port]) | |||
except subprocess.CalledProcessError: | |||
pass | |||
sleep(0.5) | |||
sleep(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you still have the error? can you paste them here?
@@ -23,9 +23,8 @@ | |||
from typedb.api.client import TypeDBClusterClient | |||
from typedb.api.options import TypeDBOptions, TypeDBClusterOptions | |||
from typedb.api.session import SessionType | |||
from typedb.cluster.database import _ClusterDatabase | |||
from typedb.cluster.database import _ClusterDatabase, _FailsafeTask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this needed?
./1/typedb server --data server/data --address 127.0.0.1:11729:11730 --peer 127.0.0.1:11729:11730 --peer 127.0.0.1:21729:21730 --peer 127.0.0.1:31729:31730 & | ||
./2/typedb server --data server/data --address 127.0.0.1:21729:21730 --peer 127.0.0.1:11729:11730 --peer 127.0.0.1:21729:21730 --peer 127.0.0.1:31729:31730 & | ||
./3/typedb server --data server/data --address 127.0.0.1:31729:31730 --peer 127.0.0.1:11729:11730 --peer 127.0.0.1:21729:21730 --peer 127.0.0.1:31729:31730 & | ||
./1/typedb server --data server/data --address 127.0.0.1:11729:11730:11731 --peer 127.0.0.1:11729:11730:11731 --peer 127.0.0.1:21729:21730:21731 --peer 127.0.0.1:31729:31730:31731 & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created an issue to extract the TypeDB setup logic into TypeDBRunner
and TypeDBClusterRunner
, just like in Java: typedb/typedb-dependencies#286. I've also created one for NodeJS: typedb/typedb-dependencies#287.
What is the goal of this PR?
We improved the fault tolerance of our Cluster database operations, namely Create, Contains and Delete.
What are the changes implemented in this PR?
Make database operations "Create, Contains, Delete" fault tolerant