Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Fault-tolerant Cluster database operations #224

Merged
merged 6 commits into from
May 19, 2021

Conversation

alexjpwalker
Copy link
Member

@alexjpwalker alexjpwalker commented May 18, 2021

What is the goal of this PR?

We improved the fault tolerance of our Cluster database operations, namely Create, Contains and Delete.

What are the changes implemented in this PR?

Make database operations "Create, Contains, Delete" fault tolerant

@alexjpwalker alexjpwalker changed the title Fix failing tests Fault-tolerant database creation May 19, 2021
@alexjpwalker alexjpwalker changed the title Fault-tolerant database creation Fault-tolerant database management May 19, 2021
@alexjpwalker alexjpwalker changed the title Fault-tolerant database management Fault-tolerant Cluster database operations May 19, 2021
@alexjpwalker alexjpwalker marked this pull request as ready for review May 19, 2021 10:49
@@ -94,7 +94,7 @@ def test_put_entity_type_to_crashed_primary_replica(self):
lsof = subprocess.check_output(["lsof", "-i", ":%s" % port])
except subprocess.CalledProcessError:
pass
sleep(0.5)
sleep(1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to fail intermittently without this sleep, perhaps due to getting an unexpected error as the Cluster node is in the process of starting up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you still have the error? can you paste them here?

@@ -142,3 +147,99 @@ def __hash__(self):

def __str__(self):
return "%s/%s" % (self._address, self._database)


# This class has to live here because of circular class creation between ClusterDatabase and FailsafeTask
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is highly unpleasant but I can't find any workaround that is any better. Because an instance method of FailsafeTask creates a ClusterDatabase, and an instance method of ClusterDatabase creates a FailsafeTask, it is physically impossible to decouple the two in Python. (This may also be the case in NodeJS)

@alexjpwalker alexjpwalker requested a review from lolski May 19, 2021 11:02
except TypeDBClientException as e:
errors.append("- %s: %s\n" % (address, e))
raise TypeDBClientException.of(CLUSTER_ALL_NODES_FAILED, str([str(e) for e in errors]))

def database_mgrs(self) -> Dict[str, _CoreDatabaseManager]:
return self._database_mgrs

def _failsafe_task(self, name: str, task: Callable[[TypeDBClusterStub, _CoreDatabaseManager], T]):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the bulk of the fault-tolerant DB operations logic lies

@@ -94,7 +94,7 @@ def test_put_entity_type_to_crashed_primary_replica(self):
lsof = subprocess.check_output(["lsof", "-i", ":%s" % port])
except subprocess.CalledProcessError:
pass
sleep(0.5)
sleep(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you still have the error? can you paste them here?

typedb/cluster/session.py Show resolved Hide resolved
@@ -23,9 +23,8 @@
from typedb.api.client import TypeDBClusterClient
from typedb.api.options import TypeDBOptions, TypeDBClusterOptions
from typedb.api.session import SessionType
from typedb.cluster.database import _ClusterDatabase
from typedb.cluster.database import _ClusterDatabase, _FailsafeTask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed?

./1/typedb server --data server/data --address 127.0.0.1:11729:11730 --peer 127.0.0.1:11729:11730 --peer 127.0.0.1:21729:21730 --peer 127.0.0.1:31729:31730 &
./2/typedb server --data server/data --address 127.0.0.1:21729:21730 --peer 127.0.0.1:11729:11730 --peer 127.0.0.1:21729:21730 --peer 127.0.0.1:31729:31730 &
./3/typedb server --data server/data --address 127.0.0.1:31729:31730 --peer 127.0.0.1:11729:11730 --peer 127.0.0.1:21729:21730 --peer 127.0.0.1:31729:31730 &
./1/typedb server --data server/data --address 127.0.0.1:11729:11730:11731 --peer 127.0.0.1:11729:11730:11731 --peer 127.0.0.1:21729:21730:21731 --peer 127.0.0.1:31729:31730:31731 &
Copy link
Member

@lolski lolski May 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created an issue to extract the TypeDB setup logic into TypeDBRunner and TypeDBClusterRunner, just like in Java: typedb/typedb-dependencies#286. I've also created one for NodeJS: typedb/typedb-dependencies#287.

@lolski lolski merged commit 6af81c2 into typedb:master May 19, 2021
@alexjpwalker alexjpwalker deleted the fix-tests branch May 19, 2021 13:03
@flyingsilverfin flyingsilverfin added this to the 2.1.0 milestone May 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants