[ocaml5-issue] Segfault in STM Domain.DLS test sequential on bytecode #446
Comments
I wonder if it is related to #12889, the only recent change to Domain.DLS that I can think of. (I hope not!)
This just triggered again on 32-bit.
Is there more information that we can use to try to investigate this? "There is a segfault somewhere in Domain.DLS on 32bit" is not that much.
First off, this is a collection of failures we observe.
Come on, there are QCheck seeds that caused the failures, GitHub Actions workflows listing the steps taken, and links to 2 CI run logs with full information about versions.
No snark intended, I genuinely wonder how you work with these failures. For example, I'm not sure if it is reasonably easy to extract a backtrace, and/or to observe the same failure within the debug runtime. (Is this segfault due to a memory corruption, or an assert failure?) If you prefer to work on this without upstream looking over your shoulder for now, I am happy to let you work your magic and wait for easier reproduction instructions.
OK, fair enough. I've been trying today for this one: https://github.com/ocaml-multicore/multicoretests/actions?query=branch%3Adomain-dls-32-bit-focus
I finally managed to reproduce this one - on 5.2.0 - and only once for now. It is indeed a sequential fault! 😮
Switching the hard-coded seed to 107236932 (the first one) works much better!
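For anyone trying to replicate with a pinned seed, here is a minimal sketch assuming the `~rand` parameter of `QCheck_base_runner.run_tests`; the property shown is a stand-in, not the actual Domain.DLS STM test, and the base runner also picks up a seed from the `QCHECK_SEED` environment variable:

```ocaml
(* Minimal sketch: pin a QCheck run to the seed 107236932.
   The property below is a stand-in, not the Domain.DLS STM test. *)
let stand_in =
  QCheck.Test.make ~name:"stand-in" ~count:1000
    QCheck.small_nat (fun i -> i >= 0)

let () =
  (* Fix the random state so the generated command sequences are reproducible. *)
  let rand = Random.State.make [| 107236932 |] in
  exit (QCheck_base_runner.run_tests ~verbose:true ~rand [ stand_in ])
```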
I've made a bit of progress on this.
The assertion in question is this one:
I've tried testing previous versions (5.0.0, 5.1.0, 5.1.1) too - both with a hard-coded seed 107236932 and with random ones. Result:
How easy/hard would it be for you to run the testsuite on an arbitrary patch, if I send you changes to the runtime that might be related to the crash?
The call was placed here for thread-safety, to work correctly if another thread (on the same domain) resizes the DLS array under our feet. It is also the only notable change in terms of access to `Caml_state->dls_root` since 5.1, so it might be involved in the mysterious crash being investigated at ocaml-multicore/multicoretests#446
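To illustrate the scenario the quoted comment describes, here is a minimal sketch, assuming the systhreads library and the 5.x `Domain.DLS` API (this is not the multicoretests test and is expected to run cleanly; it only shows the access pattern): one systhread keeps allocating fresh DLS keys, forcing the per-domain DLS array to be resized, while another thread of the same domain keeps using a pre-existing key.

```ocaml
(* Sketch of the "resize under our feet" scenario; compile with the threads
   library. One thread grows the domain's DLS key array by allocating keys,
   while the main thread concurrently reads and writes an existing key. *)
let () =
  let key = Domain.DLS.new_key (fun () -> 0) in
  let grower =
    Thread.create
      (fun () ->
         for i = 1 to 1_000 do
           ignore (Domain.DLS.new_key (fun () -> i))  (* triggers DLS array resizes *)
         done)
      ()
  in
  for i = 1 to 1_000 do
    Domain.DLS.set key i;
    assert (Domain.DLS.get key = i)
  done;
  Thread.join grower
```

With systhreads only one thread runs OCaml code at a time on a given domain, so the interleaving happens at thread switch points; the quoted change is about keeping the runtime-side DLS access correct across such a resize.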
Here is a proposed small patch for example: https://github.com/gasche/ocaml/commits/mysterious-dls-crash-1/ (Another obvious idea would be to revert #12889 and see whether you can still reproduce the crash. I don't see any other relevant change in the runtime, but I also read this change carefully several times without noticing anything that could explain the current observations.)
I should be able to do that for a feature branch like the proposed one 👍
Thanks! This change is really a blind move, so it is unlikely to work. I think the reasonable next step is to revert #12889. Let me know if you need help doing that -- it should revert cleanly from 5.2 in particular, but I haven't actually tried.
No cigar unfortunately. On that one I also saw this (even rarer) non-crashing misbehaviour:
despite being generated as a number between 0 and [...], the `Get` constructor's argument somehow ends up being -81244680 ...
That signals some form of heap corruption - like the assertion failure indicates. What makes you suspect #12889?
My reasoning is that your test exercises the DLS primitives, and started failing in 5.2 and no older release. #12889 is the only substantial change to the implementation of DLS that happened between 5.1 and 5.2, and it touches an unsafe part of the language (a mix of C runtime code and Obj.magic on the OCaml side). This could, of course, be an entirely unrelated issue, but then I wonder why it would only fail on this test precisely -- maybe the sheer luck of picking a favorable seed?
With the debug-runtime strategy to trigger this, in the CI I'm now repeating a QCheck test 200 times with [...]
I've now completed a round of [...] with the latest run available here: [...]
Here's the [...]
I accidentally kicked off a run with an even smaller heap ([...])
I've tried running under gdb in batch mode in the CI. For the 6 assertion failures this doesn't add much new info:
This is a failure in `caml_verify_heap_from_stw`:
```c
void caml_verify_heap_from_stw(caml_domain_state *domain) {
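  /* Debug-runtime heap check run from a stop-the-world section: scan the
     domain's roots and the global roots, then verify every object reachable
     from them; the assertion failures discussed above fire somewhere in this
     traversal. */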
struct heap_verify_state* st = caml_verify_begin();
caml_do_roots (&caml_verify_root, verify_scanning_flags, st, domain, 1);
caml_scan_global_roots(&caml_verify_root, st);
while (st->sp) verify_object(st, st->stack[--st->sp]);
caml_addrmap_clear(&st->seen);
caml_stat_free(st->stack);
caml_stat_free(st);
}
```
For the 2 clean segfaults it reveals a little,
with the crash happening in `Instruct(RETURN)`:
```c
Instruct(RETURN): {
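  /* RETURN n pops the n stack slots of the current frame; if extra arguments
     are pending, the returned closure in accu is tail-applied to them,
     otherwise control jumps to do_return and resumes the caller's frame. */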
sp += *pc++;
if (extra_args > 0) {
extra_args--;
pc = Code_val(accu);
env = accu;
Next;
} else {
goto do_return;
}
}
```
Isn't the common theme to both of these "stack corruption"? 🤔
I've dug some more into this issue. Experiments reveal that this can still trigger without [...]
Using tmate I've also managed to log into the GitHub Actions runner machines, reproduce crashes there, and observe backtraces. A backtrace for an assertion-failure run, with each thread annotated with its role:
The bisection pointing at ocaml/ocaml#12193 combined with [...]
Here's another backtrace, this one from a pure segfault run:
The common theme is still stack memory corruption (`caml_scan_stack` and `RETURN`). I've still not been able to reproduce locally.
(Issue title changed from "STM Domain.DLS test sequential on 32-bit trunk" to "STM Domain.DLS test sequential on bytecode".)
I've finally managed to reproduce this locally by using [...]
Fixed in ocaml/ocaml#13553
In the CI run for #445 on 32-bit trunk the STM Domain.DLS test sequential triggered a segfault: https://github.com/ocaml-multicore/multicoretests/actions/runs/8436771284/job/23104952265?pr=445
This may be another case of a 32-bit/bytecode issue showing up in a couple of different tests:
Surprisingly, however, this case triggered in a sequential (single-domain) test! 😮
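For context, here is a rough sketch of what an STM specification over Domain.DLS can look like, using the STM and STM_sequential modules from this repository's qcheck-stm library; it is illustrative only, not the actual stm_tests code, and the key count, generators, command names and test count are invented:

```ocaml
(* Illustrative sketch only -- not the repository's stm_tests code.
   Model: an array of Domain.DLS keys holding ints, driven by Get/Set commands. *)
open QCheck
open STM

module DLSConf = struct
  let nb_keys = 8                                  (* invented: small fixed key count *)

  type cmd = Get of int | Set of int * int         (* key index; value for Set *)
  let show_cmd = function
    | Get i      -> Printf.sprintf "Get %d" i
    | Set (i, v) -> Printf.sprintf "Set (%d, %d)" i v

  type state = int list                            (* model: current value of each key *)
  type sut = int Domain.DLS.key array

  let arb_cmd _s =
    let idx = Gen.int_bound (nb_keys - 1) in
    make ~print:show_cmd
      Gen.(oneof [ map  (fun i -> Get i) idx;
                   map2 (fun i v -> Set (i, v)) idx int ])

  let init_state = List.init nb_keys (fun _ -> 0)
  let next_state c s = match c with
    | Get _      -> s
    | Set (i, v) -> List.mapi (fun j x -> if j = i then v else x) s

  let init_sut () = Array.init nb_keys (fun _ -> Domain.DLS.new_key (fun () -> 0))
  let cleanup _ = ()
  let precond _ _ = true

  let run c keys = match c with
    | Get i      -> Res (int,  Domain.DLS.get keys.(i))
    | Set (i, v) -> Res (unit, Domain.DLS.set keys.(i) v)

  let postcond c (s : state) res = match c, res with
    | Get i, Res ((Int, _), v)   -> v = List.nth s i
    | Set _, Res ((Unit, _), ()) -> true
    | _, _                       -> false
end

module DLS_seq = STM_sequential.Make (DLSConf)

let () =
  QCheck_base_runner.run_tests_main
    [ DLS_seq.agree_test ~count:1000 ~name:"STM Domain.DLS test sequential (sketch)" ]
```

A sequential STM test like this interprets random command sequences against both the model and the real Domain.DLS within a single domain, which is why a crash here points at the DLS primitives themselves rather than at inter-domain interaction.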