Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8066268338 Key segment pair refactor - do not move the segment out of the KeySegmentPair as it is a shared resource #2166

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

poodlewars
Copy link
Collaborator

@poodlewars poodlewars commented Feb 4, 2025

This allows us to re-enable some tests in arcticdb-enterprise https://github.com/man-group/arcticdb-enterprise/pull/new/aseaton/keysegmentpair-refactor-enterprise .

The changes are:

  • Remove the mutable Segment& segment() in the KeySegmentPair - this was often used to move out of the Segment owned by the shared pointer in KeySegmentPair
  • Pass KeySegmentPair by lvalue rather than rvalue reference where possible. The Task layer still needs rvalue references as that is what Folly expects (rightly, the Folly executors need to have ownership).
  • Refactor lookup_match_in_dedup_match to return a variant
  • Unrelated tidy up to DecodeMetadataTask
  • clone() segments when adding them to the in in-memory and mock storages

In particular this lets CopyCompressedInterstore safely copy the same KeySegmentPair to several targets, whereas before the Segment could be moved from during the copy to each target.

@@ -85,7 +85,7 @@ class S3ClientInterface {

virtual S3Result<std::monostate> put_object(
const std::string& s3_object_name,
Segment&& segment,
Segment& segment,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this would be a const Segment& (also for other storages). However this might require a much bigger refactor and is likely out of scope for this change.

It would make writing put_object(*key_seg.segment_ptr()) much less worrysome.

I had to spend quite a bit of time verifying the put_object doesn't somehow move some buffers outside of the segment.
The reason it works at all is because the aws-sdk-cpp's PutObjectRequest::SetBody doesn't take an owning copy of the underlying buffer and copies from it to send the request to S3. This is also probably why we weren't seeing the use-after-move issues manifest all the time in our tests before.

Maybe it's just worth a comment saying why it should not move data out of the segment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is where we chose to draw the line and stop refactoring, making all the consumers of the Segment const will be an even bigger change. But isn't it implied that this shouldn't move data out of the segment, given the segment is passed by lvalue reference?

const auto total_size = segment.calculate_size();
/*thread_local*/ std::vector<uint8_t> buffer{};
buffer.resize(total_size);
bsoncxx::types::b_binary data = {};
kv.segment().write_to(buffer.data());
kv.segment_ptr()->write_to(buffer.data());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Just use the reference above and do segment.write_to()

kv.segment().force_own_buffer();
return std::move(kv.segment());
kv.segment_ptr()->force_own_buffer();
return kv.segment().clone();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need the force_own_buffer if we're doing clone. See here, clone will copy underlying buffer even if not owning.

@@ -225,7 +225,7 @@ void do_read_impl(
KeyDecoder&& key_decoder,
ReadKeyOpts opts) {
auto key_seg = do_read_impl(std::move(variant_key), root_folder, bucket_name, s3_client, std::forward<KeyBucketizer>(bucketizer), std::forward<KeyDecoder>(key_decoder), opts);
visitor(key_seg.variant_key(), std::move(key_seg.segment()));
visitor(key_seg.variant_key(), *key_seg.segment_ptr());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO perhaps when reading with a visitor it does make sense to move.

I.e. makes sense to move the ownership to the visitor instead of giving it a reference to something which will soon get destructed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to undo this we should do it for all storages.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point thank you, I'll try this

@@ -54,7 +54,7 @@ std::optional<Azure::Core::RequestFailedException> has_failure_trigger(const std

void MockAzureClient::write_blob(
const std::string& blob_name,
arcticdb::Segment&& segment,
arcticdb::Segment& segment,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should clone for all mock storages. Looks like we've forgotten it for azure.

@@ -161,7 +165,7 @@ folly::Future<folly::Unit> write_compressed(storage::KeySegmentPair ks) override
}

void write_compressed_sync(storage::KeySegmentPair ks) override {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change this to take a KeySegmentPair& to be consistent with the others and to make clear that it does not copy the underlying segment?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes for non-sync variant

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to keep it like this, this is what this bit of the PR description is referring to,

Pass KeySegmentPair by lvalue rather than rvalue reference where possible. The Task layer still needs rvalue references as that is what Folly expects (rightly, the Folly executors need to have ownership).

ARCTICDB_DEBUG(log::storage(), "DecodeTimeseriesDescriptorTask decoding segment with key {}", variant_key_view(key_seg.variant_key()));

auto maybe_desc = decode_timeseries_descriptor(key_seg.segment());
auto maybe_desc = decode_timeseries_descriptor(*key_seg.segment_ptr());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe again a bit out of scope but I think decode_timeseries_descriptor decode_metadata_and_descriptor_fields etc. can be changed to just take a const Segment&?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can't be, they modify the segment in place

@poodlewars poodlewars force-pushed the aseaton/8066268338/ksp-refactor branch from f8744b2 to 56a1983 Compare February 6, 2025 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants