Exception during data aggregation in parallel BP4 writing #3936

Open
franzpoeschel opened this issue Nov 21, 2023 · 4 comments

@franzpoeschel
Contributor

Describe the bug
This is an error message that a PIConGPU user saw when using PIConGPU+openPMD+ADIOS2+BP4. Since I have never seen this particular error message before, I wanted to ask here whether someone has an idea what's going on.

The stderr log as described in ComputationalRadiationPhysics/picongpu#4744 is:

File: stderr

[AbstractIOHandlerImpl] IO Task CLOSE_FILE failed with exception. Removing task from IO queue and passing on the exception.
[AbstractIOHandlerImpl] IO Task CLOSE_FILE failed with exception. Removing task from IO queue and passing on the exception.
[AbstractIOHandlerImpl] IO Task CLOSE_FILE failed with exception. Removing task from IO queue and passing on the exception.
[AbstractIOHandlerImpl] IO Task CLOSE_FILE failed with exception. Removing task from IO queue and passing on the exception.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Sun Nov 19 12:38:05 2023] [ADIOS2 EXCEPTION] <Toolkit> <aggregator::mpi::MPIChain> <IExchangeAbsolutePosition> : An existing exchange is still active

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Sun Nov 19 12:38:05 2023] [ADIOS2 EXCEPTION] <Toolkit> <aggregator::mpi::MPIChain> <IExchangeAbsolutePosition> : An existing exchange is still active

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Sun Nov 19 12:38:05 2023] [ADIOS2 EXCEPTION] <Toolkit> <aggregator::mpi::MPIChain> <IExchangeAbsolutePosition> : An existing exchange is still active

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Sun Nov 19 12:38:05 2023] [ADIOS2 EXCEPTION] <Toolkit> <aggregator::mpi::MPIChain> <IExchangeAbsolutePosition> : An existing exchange is still active

To Reproduce
Not my setup, I don't have a reproducer.

Expected behavior
No crash

Desktop (please complete the following information):

  • ADIOS2 v2.8.3 on Karolina
  • further details in the linked issue

Additional context

Following up
Was the issue fixed? Please report back.

@eisenhauer
Member

Just responding because nobody else had. This is not a familiar error to me, so no clue...

@berceanu

Do we at least know if the message is coming from ADIOS2 or some other library in the stack?

@eisenhauer
Member

It appears to come from ADIOS. At least the "An existing exchange is still active" exception is possibly thrown by the ADIOS2 BP4 engine. The ADIOS Git history also mentions a similar problem, and a fix for it, from 4 years ago. I agree with the suggestion that it's probably time to move to the BP5 engine. That is better than trying to diagnose problems in BP4.

@franzpoeschel
Contributor Author

Ok, then trying to reproduce this sounds like it's probably more effort than it's worth. Good to know that this error is not only unknown to me.
@berceanu I'd suggest upgrading to ADIOS2 v2.9 and openPMD-api 0.15 and using --openPMD.ext bp5 to activate the BP5 backend. Just using ADIOS2 v2.9 with BP4 might already help, but it might not.
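
For anyone driving openPMD-api directly instead of through the PIConGPU flag, the engine choice can probably also be expressed through openPMD-api's JSON runtime configuration. The following is only a minimal sketch: it assumes the `adios2.engine.type` key is the relevant one for your openPMD-api version, and the helper name `adios2_engine_options` and the file name in the commented usage are hypothetical, not taken from this issue.

```python
import json


def adios2_engine_options(engine_type="bp5", parameters=None):
    """Build an openPMD-api JSON options string that selects an ADIOS2 engine.

    Hypothetical helper for illustration; only the nested
    {"adios2": {"engine": {"type": ...}}} layout is assumed from openPMD-api.
    """
    engine = {"type": engine_type}
    if parameters:
        # ADIOS2 engine parameters are given as strings, so convert values.
        engine["parameters"] = {key: str(value) for key, value in parameters.items()}
    return json.dumps({"adios2": {"engine": engine}})


options = adios2_engine_options("bp5")
print(options)

# Assumed usage, requires openPMD-api >= 0.15 built against ADIOS2 >= 2.9:
#   import openpmd_api as io
#   series = io.Series("simData_%T.bp", io.Access.create, options)
```

Building the string with `json.dumps` rather than by hand avoids quoting mistakes when engine parameters are added later.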
