Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test: shm] Add testcase for fast reconnection #1916

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

DownerCase
Copy link
Contributor

@DownerCase DownerCase commented Jan 17, 2025

Description

Will eventually track down and resolve an issue with subscribers connecting but not receiving data.

Current status: Test case refined, problem understood (see issue for details).

Related issues

Cherry-pick to

@DownerCase DownerCase changed the title WIP: Resolve topic resubscribing creating "zombie subscribers" WIP: Resolve SHM topic resubscribing creating "zombie subscribers" Jan 17, 2025
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

ecal/tests/cpp/pubsub_test/src/pubsub_test_shm.cpp Outdated Show resolved Hide resolved
@DownerCase DownerCase changed the title WIP: Resolve SHM topic resubscribing creating "zombie subscribers" WIP: Resolve SHM topic resubscribing in less than the registration timeout period creating "zombie subscribers" Jan 18, 2025
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

ecal/core/src/ecal_event.cpp Outdated Show resolved Hide resolved
ecal/core/src/ecal_event.cpp Outdated Show resolved Hide resolved
@DownerCase
Copy link
Contributor Author

Responding to @hannemn's comment from here:

@KerstinKeller Yes, this definitely needs to be solved more cleanly, even if it is the safest solution for a short term release now. I still had an approach where the validity of the event handle is checked, but I like idea with reference counter that @DownerCase proposed even more. :-) However, this requires most likely a larger test loop.

I explored a few different options targeting either the unlinking of the shm file or interrupting the file observer. I was unable to find a solution for the later option that didn't involve adding something to named_event_t, at which point you totally break compatibility with things on eCAL v5.X

So, unless you changed the scheme used for the shm files since v5, trying to use a v6 subscriber with a v5 publisher will have the subscriber try and remove the shm file out from under the publisher as it'll see a 0 refcount. I don't know what your policy is w.r.t compatibility between v5 and v6 (I also haven't tested anything).

Regarding checking the validity of the event handle, that prevents you from doing pthread_*_destroy (not that it was being done previously anyway!) because at destructor time you can't know if the resource is in use elsewhere.

@DownerCase DownerCase changed the title WIP: Resolve SHM topic resubscribing in less than the registration timeout period creating "zombie subscribers" Resolve SHM topic resubscribing in less than the registration timeout period creating "zombie subscribers" Jan 24, 2025
@DownerCase DownerCase marked this pull request as ready for review January 24, 2025 13:57
@DownerCase DownerCase changed the title Resolve SHM topic resubscribing in less than the registration timeout period creating "zombie subscribers" [shm: linux] Resolve SHM topic resubscribing in less than the registration timeout period creating "zombie subscribers" Jan 24, 2025
@KerstinKeller
Copy link
Contributor

Hi @DownerCase, thanks for digging so deep into it.
At the moment, we would like to avoid to break wire compatibility (that's what we call the possibility to communicate with eCAL over different eCAL versions).
And when we do break it, it would be great to put in some additional features, like Shared Locks for reading them SHM files, etc.

@hannemn also tried interrupting the Memfile Observer, this is another area where it might be worth refactoring / restructuring the code a bit.

The current solution, to go back to eCAL 5 behavior, and not to unlink any file until process destruction, also has the benefit that it will work seemlessly with eCAL 5.

However, it would be great, if we could just take your testcase and see if this now works with current master.

@DownerCase DownerCase changed the title [shm: linux] Resolve SHM topic resubscribing in less than the registration timeout period creating "zombie subscribers" [test: shm] Add testcase for fast reconnection Jan 24, 2025
@DownerCase
Copy link
Contributor Author

Hi @DownerCase, thanks for digging so deep into it. At the moment, we would like to avoid to break wire compatibility (that's what we call the possibility to communicate with eCAL over different eCAL versions). And when we do break it, it would be great to put in some additional features, like Shared Locks for reading them SHM files, etc.

However, it would be great, if we could just take your testcase and see if this now works with current master.

@KerstinKeller Totally understandable, I've rebased onto master and removed the ref counting changes. I believe the test will pass now.

@KerstinKeller KerstinKeller added the cherry-pick-to-NONE Don't cherry-pick these changes label Jan 24, 2025
Copy link
Contributor

@KerstinKeller KerstinKeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testcase looks good, I am just wondering about printing to std::cerr. Does this show up in the test logs?

const eCAL::SDataTypeInformation & /*data_type_info_*/,
const eCAL::SReceiveCallbackData & /*data_*/) {
{
std::cerr << "Callback received message\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DownerCase why are you printing to cerr here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force of habit mostly, cerr's unbuffered so don't need to pepper std::endl/flush.

Shows up in the logs all the same: https://github.com/eclipse-ecal/ecal/actions/runs/12952082976/job/36128611552#step:11:805

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am already not so happy with all the tests logging in the unittests (I know we also do have quite a bunch of tests that do it).

I'd prefer if we only see what's the problem on test failures, e.g. you could log to the google test command:
EXPECT_TRUE(data_received) << "Missing data from publisher" ;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, I tried to keep it to a minimum but found the extra messages useful when diagnosing a failure (particularly when finding when internal eCAL debug logs are occurring relative to the test).

As an in-between what about logging via the eCAL logging functionality on the debug4 level (or similar)? So that normally no logs will be visible but they're there when you need to dig into it.

I'll add a descriptive error message to the EXPECT_TRUE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-to-NONE Don't cherry-pick these changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants