[ocaml5-issue] Segfault in STM Domain.DLS test sequential on bytecode #446

Closed
jmid opened this issue Mar 26, 2024 · 24 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Collaborator

jmid commented Mar 26, 2024

In the CI run for #445 on 32-bit trunk, the STM Domain.DLS test sequential triggered a segfault:
https://github.com/ocaml-multicore/multicoretests/actions/runs/8436771284/job/23104952265?pr=445

random seed: 107236932
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
File "src/domain/dune", line 31, characters 7-20:
31 |  (name stm_tests_dls)
            ^^^^^^^^^^^^^
(cd _build/default/src/domain && ./stm_tests_dls.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

This may be another case of a 32-bit/bytecode issue showing up in a couple of different tests:

Surprisingly, however, this case triggered in a sequential (single-domain) test! 😮

@gasche

gasche commented Mar 30, 2024

I wonder if it is related to #12889, the only recent change to Domain.DLS that I can think of. (I hope not!)

@jmid
Collaborator Author

jmid commented May 21, 2024

This just triggered again on 32-bit 5.3.0+trunk by the merge to main of #460:
https://github.com/ocaml-multicore/multicoretests/actions/runs/9169655398/job/25210472949

random seed: 103830913
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
File "src/domain/dune", line 31, characters 7-20:
31 |  (name stm_tests_dls)
            ^^^^^^^^^^^^^
(cd _build/default/src/domain && ./stm_tests_dls.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

@gasche

gasche commented May 21, 2024

Is there more information that we can use to try to investigate this? "There is a segfault somewhere in Domain.DLS on 32bit" is not that much.

@jmid
Collaborator Author

jmid commented May 21, 2024

First off, this is a collection of failures we observe.
Once we have fleshed out reproducible steps, these are reported upstream.
Help is very welcome, snarky remarks less so.

"There is a segfault somewhere in Domain.DLS on 32bit" is not that much.

Come on - there are the QCheck seeds that caused the failures, GitHub Actions workflows listing the steps taken, and links to two CI run logs with full information about versions.

Run opam exec -- ocamlc -config
  opam exec -- ocamlc -config
  opam config list
  opam exec -- dune printenv
  opam list --columns=name,installed-version,repository,synopsis-or-target
  opam clean --all-switches --unused-repositories --logs --download-cache --repo-cache
  shell: /usr/bin/bash -e {0}
  env:
    QCHECK_MSG_INTERVAL: 60
    DUNE_PROFILE: dev
    OCAMLRUNPARAM: 
    DUNE_CI_ALIAS: runtest
    COMPILER: ocaml-variants.5.3.0+trunk,ocaml-option-32bit
    OCAML_COMPILER_GIT_REF: refs/heads/trunk
    CUSTOM_COMPILER_VERSION: 
    CUSTOM_COMPILER_SRC: 
    CUSTOM_OCAML_PKG_VERSION: 
    OPAMCLI: 2.0
    OPAMCOLOR: always
    OPAMERRLOGLEN: 0
    OPAMJOBS: 4
    OPAMPRECISETRACKING: 1
    OPAMSOLVERTIMEOUT: 1000
    OPAMYES: 1
    DUNE_CACHE: enabled
    DUNE_CACHE_TRANSPORT: direct
    DUNE_CACHE_STORAGE_MODE: copy
    CLICOLOR_FORCE: 1
version: 5.3.0+dev0-2023-12-22
standard_library_default: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
standard_library: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
ccomp_type: cc
c_compiler: gcc -m32
ocamlc_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlc_cppflags:  -D_FILE_OFFSET_BITS=64 
ocamlopt_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlopt_cppflags:  -D_FILE_OFFSET_BITS=64 
bytecomp_c_compiler: gcc -m32  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
native_c_compiler: gcc -m32  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
bytecomp_c_libraries: -lzstd  -latomic -lm  -lpthread
native_c_libraries:  -latomic -lm  -lpthread
native_ldflags: 
native_pack_linker: ld -r -o 
native_compiler: false
architecture: i386
model: default
int_size: 31
word_size: 32
system: linux
asm: i386-linux-as
asm_cfi_supported: false
with_frame_pointers: false
ext_exe: 
ext_obj: .o
ext_asm: .s
ext_lib: .a
ext_dll: .so
os_type: Unix
default_executable_name: a.out
systhread_supported: true
host: i386-pc-linux-gnu
target: i386-pc-linux-gnu
flambda: false
safe_string: true
default_safe_string: true
flat_float_array: true
function_sections: false
afl_instrument: false
tsan: false
windows_unicode: false
supports_shared_libraries: true
native_dynlink: false
naked_pointers: false
exec_magic_number: Caml1999X035
cmi_magic_number: Caml1999I035
cmo_magic_number: Caml1999O035
cma_magic_number: Caml1999A035
cmx_magic_number: Caml1999Y035
cmxa_magic_number: Caml1999Z035
ast_impl_magic_number: Caml1999M035
ast_intf_magic_number: Caml1999N035
cmxs_magic_number: Caml1999D035
cmt_magic_number: Caml1999T035
linear_magic_number: Caml1999L035

<><> Global opam variables ><><><><><><><><><><><><><><><><><><><><><><><><><><>
arch              x86_64                                          # Inferred from system
exe                                                               # Suffix needed for executable filenames (Windows)
jobs              4                                               # The number of parallel jobs set up in opam configuration
make              make                                            # The 'make' command to use
opam-version      2.1.6                                           # The currently running opam version
os                linux                                           # Inferred from system
os-distribution   ubuntu                                          # Inferred from system
os-family         debian                                          # Inferred from system
os-version        22.04                                           # Inferred from system
root              /home/runner/.opam                              # The current opam root directory
switch            /home/runner/work/multicoretests/multicoretests # The identifier of the current switch
sys-ocaml-arch                                                    # Target architecture of the OCaml compiler present on your system
sys-ocaml-cc                                                      # Host C Compiler type of the OCaml compiler present on your system
sys-ocaml-libc                                                    # Host C Runtime Library type of the OCaml compiler present on your system
sys-ocaml-version                                                 # OCaml version present on your system independently of opam, if any

<><> Configuration variables from the current switch ><><><><><><><><><><><><><>
prefix   /home/runner/work/multicoretests/multicoretests/_opam
lib      /home/runner/work/multicoretests/multicoretests/_opam/lib
bin      /home/runner/work/multicoretests/multicoretests/_opam/bin
sbin     /home/runner/work/multicoretests/multicoretests/_opam/sbin
share    /home/runner/work/multicoretests/multicoretests/_opam/share
doc      /home/runner/work/multicoretests/multicoretests/_opam/doc
etc      /home/runner/work/multicoretests/multicoretests/_opam/etc
man      /home/runner/work/multicoretests/multicoretests/_opam/man
toplevel /home/runner/work/multicoretests/multicoretests/_opam/lib/toplevel
stublibs /home/runner/work/multicoretests/multicoretests/_opam/lib/stublibs
user     runner
group    docker

<><> Package variables ('opam var --package PKG' to show) <><><><><><><><><><><>
PKG:name       # Name of the package
PKG:version    # Version of the package
PKG:depends    # Resolved direct dependencies of the package
PKG:installed  # Whether the package is installed
PKG:enable     # Takes the value "enable" or "disable" depending on whether the package is installed
PKG:pinned     # Whether the package is pinned
PKG:bin        # Binary directory for this package
PKG:sbin       # System binary directory for this package
PKG:lib        # Library directory for this package
PKG:man        # Man directory for this package
PKG:doc        # Doc directory for this package
PKG:share      # Share directory for this package
PKG:etc        # Etc directory for this package
PKG:build      # Directory where the package was built
PKG:hash       # Hash of the package archive
PKG:dev        # True if this is a development package
PKG:build-id   # A hash identifying the precise package version with all its dependencies
PKG:opamfile   # Path of the current opam file
(flags
 (-w
  @[email protected]@30..39@[email protected]@[email protected]
  -strict-sequence
  -strict-formats
  -short-paths
  -keep-locs))
(ocamlc_flags (-g))
(ocamlopt_flags (-g))
(melange.compile_flags (-g))
(c_flags
 (-m32
  -O2
  -fno-strict-aliasing
  -fwrapv
  -pthread
  -fPIC
  -pthread
  -m32
  -D_FILE_OFFSET_BITS=64
  -fdiagnostics-color=always))
(cxx_flags
 (-x
  c++
  -m32
  -O2
  -fno-strict-aliasing
  -fwrapv
  -pthread
  -fPIC
  -pthread
  -fdiagnostics-color=always))
(link_flags ())
(menhir_flags ())
(menhir_explain ())
(coq_flags (-q))
(coqdoc_flags (--toc))
(js_of_ocaml_flags
 (--pretty --source-map-inline))
(js_of_ocaml_build_runtime_flags
 (--pretty --source-map-inline))
(js_of_ocaml_link_flags (--source-map-inline))
# Packages matching: installed
# Name                     # Installed # Repository # Synopsis
base-bigarray              base        default
base-domains               base        default
base-nnp                   base        default      Naked pointers prohibited in the OCaml heap
base-threads               base        default
base-unix                  base        default
dune                       3.15.2      default      Fast, portable, and opinionated build system
ocaml                      5.3.0       default      The OCaml compiler (virtual package)
ocaml-config               3           default      OCaml Switch Configuration
ocaml-option-32bit         1           default      Set OCaml to be compiled in 32-bit mode for 64-bit Linux and OS X hosts
ocaml-option-bytecode-only 1           default      Compile OCaml without the native-code compiler
ocaml-variants             5.3.0+trunk default      Current trunk
qcheck-core                0.21.3      default      Core qcheck library

@gasche

gasche commented May 21, 2024

No snark intended, I genuinely wonder how you work with these failures. For example I'm not sure if it is reasonably easy to extract a backtrace, and/or to observe the same failure within the debug runtime. (Is this segfault due to a memory corruption, or an assert failure?)

If you prefer to work on this without upstream looking over your shoulder for now, I am happy to let you work your magic and wait for easier reproduction instructions.

@jmid
Collaborator Author

jmid commented May 21, 2024

OK, fair enough.
Some of these remaining ones are just hard to reproduce - I suspect because they are timing or signal related.

I've been trying today for this one: https://github.com/ocaml-multicore/multicoretests/actions?query=branch%3Adomain-dls-32-bit-focus

  • for one, we are using focused tests on the CI, which let us iterate on a single test rather than a full test-suite run
  • I failed to reproduce the sequential failure, and then started wondering whether the crash could actually come from a parallel test, if it happened to crash quickly before either test printed anything
  • Rerunning both tests didn't help reproduce it
  • I then tried hard-coding the last known problematic seed (see the sketch below)
  • Lastly, I've tried adding a stress-test mode for STM similar to Add Lin_domain.stress_test #443, so far without luck
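
For reference, hard-coding a QCheck seed amounts to passing an explicit random state to the runner. Below is a minimal sketch assuming a placeholder test in place of the real STM one; only the seed value 107236932 and the test name come from the runs discussed here:

(* Sketch only: placeholder_test stands in for the real STM Domain.DLS test *)
let placeholder_test =
  QCheck.Test.make ~count:1000 ~name:"STM Domain.DLS test sequential"
    QCheck.(list int) (fun _cmds -> true)

let () =
  (* fixing the seed makes every run generate the same inputs *)
  let rand = Random.State.make [| 107236932 |] in
  exit (QCheck_base_runner.run_tests ~verbose:true ~rand [ placeholder_test ])

In the CI runs the same effect is obtained by passing the seed on the runner's command line, e.g. ./focusedtest.exe -v -s 107236932.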

@jmid
Collaborator Author

jmid commented May 21, 2024

I finally managed to reproduce this one - on 5.2.0 - and only once for now. It is indeed a sequential fault! 😮
https://github.com/ocaml-multicore/multicoretests/actions/runs/9180414541/job/25244781126#step:18:762

Starting 74-th run

random seed: 103830913
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
/usr/bin/bash: line 1: 197189 Segmentation fault      (core dumped) ./focusedtest.exe -v -s 103830913
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

@jmid
Collaborator Author

jmid commented May 21, 2024

Switching hard-coded seed to 107236932 (the first one) works much better!
Across 500 repetitions this triggered 56 segfaults on 5.2.0
https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424522/job/25248130724
and 50 segfaults on 5.3.0+trunk
https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424534/job/25248130821

@jmid
Collaborator Author

jmid commented Jun 3, 2024

I've made a bit of progress on this.

  • The issue still triggers on 5.2.0 and a fresh trunk as of this morning on the CI
  • I'm still unable to reproduce locally
  • I tried running under the debug runtime in the CI to see if an assertion would catch the issue before causing a crash, and indeed it does - 6-8 times out of 200:
    random seed: 107236932
    generated error fail pass / total     time test name
    
    [ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
    [00] file runtime/shared_heap.c; line 787 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)
    /usr/bin/bash: line 1: 394730 Aborted                 (core dumped) ./focusedtest.exe -v -s 107236932
    [ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
    

The assertion in question is this one:
https://github.com/ocaml/ocaml/blob/23d896786adb694d39785bd7770c537a6d8c6fe6/runtime/shared_heap.c#L787
It is used to verify the heap at the end of a stop-the-world (STW) section.

@jmid
Collaborator Author

jmid commented Jun 3, 2024

I've tried testing previous versions (5.0.0, 5.1.0, 5.1.1) too - both with a hard-coded seed 107236932 and with random ones. Result:

  • 5.0.0 - not able to trigger
  • 5.1.0 - not able to trigger
  • 5.1.1 - not able to trigger
  • 5.2.0 - triggering in 7-12 cases out of 200
  • trunk - triggering in 9-12 cases out of 200

@gasche

gasche commented Jun 3, 2024

How easy/hard would it be for you to run the testsuite on an arbitrary patch, if I send you changes to the runtime that might be related to the crash?

gasche added a commit to gasche/ocaml that referenced this issue Jun 3, 2024
The call was placed here for thread-safety, to work correctly if
another thread (on the same domain) resizes the DLS array under our
feet. It is also the only notable change in terms of access to
`Caml_state->dls_root` since 5.1, so it might be involved in the
mysterious crash being investigated at

  ocaml-multicore/multicoretests#446
@gasche

gasche commented Jun 3, 2024

Here is a proposed small patch for example: https://github.com/gasche/ocaml/commits/mysterious-dls-crash-1/

(Another obvious idea would be to revert #12889 and see whether you can still reproduce the crash. I don't see any other relevant change in the runtime, but I also read this change carefully several times without noticing anything that could explain the current observations.)

@jmid
Collaborator Author

jmid commented Jun 3, 2024

How easy/hard would it be for you to run the testsuite on an arbitrary patch, if I send you changes to the runtime that might be related to the crash?

I should be able to do that for a feature branch like the proposed one 👍
Thanks for looking into this - I'll keep you posted.

@gasche

gasche commented Jun 3, 2024

Thanks! This change is really a blind move, so it is unlikely to work. I think the reasonable next step is to revert #12889. Let me know if you need help doing that -- it should revert cleanly from 5.2 in particular, but I haven't actually tried.

@jmid
Collaborator Author

jmid commented Jun 3, 2024

No cigar unfortunately:
https://github.com/ocaml-multicore/multicoretests/tree/domain-dls-32-bit-gabriel
https://github.com/ocaml-multicore/multicoretests/actions/runs/9352909686/job/25742143625

On that one I also saw this (even rarer) non-crashing misbehaviour:

Starting 89-th run

random seed: 135228812
generated error fail pass / total     time test name
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
[✗]  136    1    0  135 / 1000     0.3s STM Domain.DLS test sequential
=== Error ======================================================================
Test STM Domain.DLS test sequential errored on (21 shrink steps):
   Get (-81244680)
exception Invalid_argument("List.nth")
================================================================================
failure (0 tests failed, 1 tests errored, ran 1 tests)

Despite being generated as a number between 0 and length-1 (with length=4), and without any shrinking of it,

let index = Gen.int_bound (length-1) in

the Get constructor's argument somehow ends up being -81244680...
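
For context, here is a hypothetical sketch of what such a command generator typically looks like in an STM spec. Only the Gen.int_bound (length-1) line and length=4 come from the test and discussion above; the cmd type, printer and the Set payload are illustrative:

type cmd = Get of int | Set of int * int

let show_cmd = function
  | Get i -> Printf.sprintf "Get (%d)" i
  | Set (i, x) -> Printf.sprintf "Set (%d, %d)" i x

let length = 4  (* number of slots exercised by the test *)

let arb_cmd _state =
  let open QCheck in
  let index = Gen.int_bound (length-1) in
  make ~print:show_cmd
    (Gen.oneof
       [ Gen.map  (fun i -> Get i) index;
         Gen.map2 (fun i x -> Set (i, x)) index Gen.nat ])

Any Get argument produced this way lies in 0..3.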

That signals some form of heap corruption - like the assertion failure indicates. What makes you suspect #12889?

@gasche

gasche commented Jun 3, 2024

My reasoning is that your test exercises the DLS primitives, and started failing in 5.2 and no older release. #12889 is the only substantial change to the implementation of DLS that happened between 5.1 and 5.2, and it touches an unsafe part of the language (a mix of C runtime code and Obj.magic on the OCaml side). This could, of course, be an entirely unrelated issue, but then I wonder why it would only fail on this test precisely -- maybe the sheer luck of picking a favorable seed?

@jmid
Collaborator Author

jmid commented Jun 4, 2024

This could, of course, be an entirely unrelated issue, but then I wonder why it would only fail on this test precisely -- maybe the sheer luck of picking a favorable seed?

With the debug-runtime strategy to trigger this, in the CI I'm now repeating a QCheck test with count=1000 a total of 200 times - with no hard-coded seeds. That makes for 200,000 arbitrary test cases and gives a pretty clear signal.

@jmid
Collaborator Author

jmid commented Jun 4, 2024

I've now completed a round of git bisect CI golf, and the finger points at:

  • bdd8d96 - Make the GC compact again (#12193)

with the latest run available here:
https://github.com/ocaml-multicore/multicoretests/actions/runs/9368325592/job/25790072295
Highlights:

  • 4 assertion failures: file runtime/shared_heap.c; line 778 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)
  • 1 clean segfault
  • 1 memory(?) corruption causing Get 78

Here's the log score-card from the golf round:

  • f87ea0b - first commit on branch 5.2 - BAD
  • cbc5a0c - Scan runtime/*.c for primitives less often (#12753) - BAD
  • df663ca - github-linguist: use M4Sugar for autoconf macros - BAD
  • 6601f1b - Merge pull request #12714 from MisterDA/cc-Wundef - BAD
  • bdd8d96 - Make the GC compact again (#12193) - BAD
  • 44d2127 - Warn on use of undefined macros - GOOD
  • 3d76183 - Guard against use of undefined macros in headers - GOOD
  • c6eec0c - Improve the documentation of {String,Byte}.split_on_char (#12717) - GOOD
  • d77bc97 - minor code cleanup - GOOD
  • 8cc728b - Merge pull request #12707 from jmid/domain-assert-failure - GOOD
  • ebae0d1 - Merge pull request #12461 from panglesd/parsetree-change-in-ci - GOOD
  • 560c444 - Merge pull request #12560 from OlivierNicole/improve_tsan_comments - GOOD
  • 6d0c551 - first commit after branching 5.1 - GOOD

@jmid
Collaborator Author

jmid commented Jun 5, 2024

I accidentally kicked off a run with an even smaller minor heap ("s=2048").
Among a couple of assertion failures, this triggered the following, which confirms my suspicion of memory corruption (here, bytecode corruption):
https://github.com/ocaml-multicore/multicoretests/actions/runs/9386404318/job/25846922781

Starting 179-th run

random seed: 517273910
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
Fatal error: bad opcode (d701d6d7)
/usr/bin/bash: line 1: 415035 Aborted                 (core dumped) ./focusedtest.exe -v
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

@jmid
Collaborator Author

jmid commented Jun 5, 2024

I've tried running under gdb in batch mode in the CI:
https://github.com/ocaml-multicore/multicoretests/actions/runs/9388010576/job/25852264722

For the 6 assertion failures this doesn't add much new info:

Starting 49-th run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

random seed: 246561918
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
[00] file runtime/shared_heap.c; line 778 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
Thread 1 "focusedtest.exe" received signal SIGABRT, Aborted.
0xf7fc4579 in __kernel_vsyscall ()

This is a failure in verify_object called from caml_verify_heap_from_stw:

void caml_verify_heap_from_stw(caml_domain_state *domain) {
  struct heap_verify_state* st = caml_verify_begin();
  /* push all roots (local roots, stacks, globals) onto the verification stack */
  caml_do_roots (&caml_verify_root, verify_scanning_flags, st, domain, 1);
  caml_scan_global_roots(&caml_verify_root, st);

  /* pop and check each reachable object; verify_object is what raises the assertion */
  while (st->sp) verify_object(st, st->stack[--st->sp]);

  caml_addrmap_clear(&st->seen);
  caml_stat_free(st->stack);
  caml_stat_free(st);
}

For the 2 clean segfaults it reveals a little:

Starting 39-th run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

random seed: 66516876
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
571	runtime/interp.c: No such file or directory.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
Thread 1 "focusedtest.exe" received signal SIGSEGV, Segmentation fault.
0x565a8d88 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:571

with the crash happening in the pc = Code_val(accu); line:

    Instruct(RETURN): {
      sp += *pc++;              /* pop the current stack frame */
      if (extra_args > 0) {
        /* over-application: accu holds the closure to apply to the remaining args */
        extra_args--;
        pc = Code_val(accu);    /* crash site when accu is not a valid closure */
        env = accu;
        Next;
      } else {
        goto do_return;
      }
    }

Isn't the common theme to both of these "stack corruption"? 🤔

  • assertion failure during stack walking(?) heap verification
  • crash during stack RETURN

@jmid
Collaborator Author

jmid commented Aug 8, 2024

I've dug some more into this issue.

Experiments reveal that this can still trigger without split_from_parent, and with either the Get or the Set commands omitted entirely. This indicates an issue not directly tied to either of these (there may be another one... 🤷 )
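
For context, here is a minimal sketch of the two kinds of DLS keys referred to here, using the stdlib Domain.DLS API; the key type and initial values are made up for illustration:

(* a key without split_from_parent: every domain starts from the init function *)
let plain_key : int Domain.DLS.key =
  Domain.DLS.new_key (fun () -> 0)

(* a key with split_from_parent: a newly spawned domain's initial value is
   derived from the parent domain's value at spawn time *)
let split_key : int Domain.DLS.key =
  Domain.DLS.new_key ~split_from_parent:(fun parent -> parent + 1) (fun () -> 0)

let () =
  Domain.DLS.set plain_key 42;
  assert (Domain.DLS.get plain_key = 42)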

@jmid
Collaborator Author

jmid commented Aug 8, 2024

Using tmate I've also managed to log into the GitHub Actions runner machines, reproduce crashes there, and observe backtraces.

A backtrace from an assertion-failure run, with each thread annotated with its role:

Thread 4 (Thread 0xf03ffac0 (LWP 172957)):  ##### A waiting backup thread for the child domain
#0  0xf1a6c579 in __kernel_vsyscall ()
#1  0xf1683243 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2  0xf168a06a in pthread_mutex_lock () from /lib/i386-linux-gnu/libc.so.6
#3  0x5922ab80 in caml_plat_lock_blocking (m=0x5a9df4c0) at runtime/caml/platform.h:457
#4  backup_thread_func (v=<optimized out>) at runtime/domain.c:1076
#5  0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#6  0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6

Thread 3 (Thread 0xf1880740 (LWP 171712)):  ##### Main thread paused during blocked C_CALL2 to caml_ml_condition_wait
#0  0xf1a6c579 in __kernel_vsyscall ()
#1  0xf1715336 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2  0xf1682e81 in ?? () from /lib/i386-linux-gnu/libc.so.6
#3  0xf1686079 in pthread_cond_wait () from /lib/i386-linux-gnu/libc.so.6
#4  0x592572d1 in sync_condvar_wait (m=0x5a9e3920, c=0x5a9e1620) at runtime/sync_posix.h:116
#5  caml_ml_condition_wait (wcond=<optimized out>, wmut=<optimized out>) at runtime/sync.c:172
#6  0x5925dce2 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:1047
#7  0x59261052 in caml_startup_code_exn (pooling=0, argv=0xffdf2e24, section_table_size=3683, section_table=0x59292020 <caml_sections> "\204\225\246\276", data_size=21404, data=0x59292ea0 <caml_data> "\204\225\246\276", code_size=528104, code=0x59298240 <caml_code>) at runtime/startup_byt.c:655
#8  caml_startup_code_exn (code=0x59298240 <caml_code>, code_size=528104, data=0x59292ea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x59292020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xffdf2e24) at runtime/startup_byt.c:588
#9  0x59261101 in caml_startup_code (code=0x59298240 <caml_code>, code_size=528104, data=0x59292ea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x59292020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xffdf2e24) at runtime/startup_byt.c:669
#10 0x592120b4 in main (argc=4, argv=0xffdf2e24) at camlprim.c:25901

Thread 2 (Thread 0xee0f6ac0 (LWP 172956)):  ##### Child domain thread triggering major GC slice on MAKEBLOCK2
#0  0x592512cc in caml_verify_root (state=0xf10ae180, v=-245978988, p=0xf11f7410) at runtime/shared_heap.c:759
#1  0x5923013d in caml_scan_stack (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, stack=0xf117f010, v_gc_regs=0x0) at runtime/fiber.c:396
#2  0x5924f826 in caml_do_local_roots (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, local_roots=0xee0f61ec, current_stack=0xf117f010, v_gc_regs=0x0) at runtime/roots.c:65
#3  0x5924f865 in caml_do_roots (f=0x592512c0 <caml_verify_root>, fflags=0, fdata=0xf10ae180, d=0xf0402620, do_final_val=1) at runtime/roots.c:41
#4  0x5925343e in caml_verify_heap_from_stw (domain=0xf0402620) at runtime/shared_heap.c:804
#5  0x59240c39 in stw_cycle_all_domains (domain=<optimized out>, args=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/major_gc.c:1434
#6  0x5922af41 in caml_try_run_on_all_domains_with_spin_work (sync=<optimized out>, handler=<optimized out>, data=<optimized out>, leader_setup=<optimized out>, enter_spin_callback=<optimized out>, enter_spin_data=<optimized out>) at runtime/domain.c:1695
#7  0x5922b10a in caml_try_run_on_all_domains (handler=0x592407c0 <stw_cycle_all_domains>, data=0xee0f5ca8, leader_setup=0x0) at runtime/domain.c:1717
#8  0x5924324e in major_collection_slice (howmuch=<optimized out>, participant_count=participant_count@entry=0, barrier_participants=barrier_participants@entry=0x0, mode=<optimized out>, force_compaction=<optimized out>) at runtime/major_gc.c:1851
#9  0x59243670 in caml_major_collection_slice (howmuch=-1) at runtime/major_gc.c:1869
#10 0x5922a7d8 in caml_poll_gc_work () at runtime/domain.c:1874
#11 0x59254e67 in caml_do_pending_actions_res () at runtime/signals.c:338
#12 0x5924cc9c in caml_alloc_small_dispatch (dom_st=0xf0402620, wosize=2, flags=3, nallocs=1, encoded_alloc_lens=0x0) at runtime/minor_gc.c:896
#13 0x5925f1f9 in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:788
#14 0x59224cbc in caml_callbackN_exn (closure=<optimized out>, narg=<optimized out>, args=<optimized out>) at runtime/callback.c:131
#15 0x59224faa in caml_callback_exn (arg1=<optimized out>, closure=<optimized out>) at runtime/callback.c:144
#16 caml_callback_res (closure=-243007372, arg=1) at runtime/callback.c:320
#17 0x59229e4a in domain_thread_func (v=<optimized out>) at runtime/domain.c:1244
#18 0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#19 0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6

Thread 1 (Thread 0xef3feac0 (LWP 171715)): ##### Main backup thread participating in STW
#0  caml_failed_assert (expr=0x5926bf18 "Has_status_val(v, caml_global_heap_state.UNMARKED)", file_os=0x5926b995 "runtime/shared_heap.c", line=784) at runtime/misc.c:48
#1  0x59253709 in verify_object (v=-298832060, st=0xf1074c70) at runtime/shared_heap.c:784
#2  caml_verify_heap_from_stw (domain=0x5a9e0120) at runtime/shared_heap.c:807
#3  0x59240c39 in stw_cycle_all_domains (domain=<optimized out>, args=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/major_gc.c: 1434
#4  0x5922aa28 in stw_handler (domain=0x5a9e0120) at runtime/domain.c:1486
#5  handle_incoming (s=<optimized out>) at runtime/domain.c:351
#6  0x5922ac9a in caml_handle_incoming_interrupts () at runtime/domain.c:364
#7  backup_thread_func (v=<optimized out>) at runtime/domain.c:1057
#8  0xf1686c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#9  0xf172372c in ?? () from /lib/i386-linux-gnu/libc.so.6

The bisection pointing at ocaml/ocaml#12193, combined with stw_cycle_all_domains showing up in the backtrace above, made me suspect 32-bit-relevant changes in that PR, but I've not been able to find anything so far.

Here's another backtrace, this one from a pure segfault run:

Thread 2 (Thread 0xf17feac0 (LWP 122715)):
#0  0xf3f47579 in __kernel_vsyscall ()
#1  0xf3b15336 in ?? () from /lib/i386-linux-gnu/libc.so.6
#2  0xf3a82e81 in ?? () from /lib/i386-linux-gnu/libc.so.6
#3  0xf3a86079 in pthread_cond_wait () from /lib/i386-linux-gnu/libc.so.6
#4  0x648daa9d in caml_plat_wait (cond=0x64d693b4, mut=0x64d6939c) at runtime/platform.c:127
#5  0x648b6c1a in backup_thread_func (v=<optimized out>) at runtime/domain.c:1068
#6  0xf3a86c01 in ?? () from /lib/i386-linux-gnu/libc.so.6
#7  0xf3b2372c in ?? () from /lib/i386-linux-gnu/libc.so.6

Thread 1 (Thread 0xf3d5b740 (LWP 122712)):  ##### Segfault during RETURN instruction in pc = Code_val(accu);
#0  0x648ea29f in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:573
#1  0x648ed052 in caml_startup_code_exn (pooling=0, argv=0xff8ca1e4, section_table_size=3683, section_table=0x6491e020 <caml_sections> "\204\225\246\276", data_size=21404, data=0x6491eea0 <caml_data> "\204\225\246\276", code_size=528104, code=0x64924240 <caml_code>) at runtime/startup_byt.c:655
#2  caml_startup_code_exn (code=0x64924240 <caml_code>, code_size=528104, data=0x6491eea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x6491e020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xff8ca1e4) at runtime/startup_byt.c:588
#3  0x648ed101 in caml_startup_code (code=0x64924240 <caml_code>, code_size=528104, data=0x6491eea0 <caml_data> "\204\225\246\276", data_size=21404, section_table=0x6491e020 <caml_sections> "\204\225\246\276", section_table_size=3683, pooling=0, argv=0xff8ca1e4) at runtime/startup_byt.c:669
#4  0x6489e0b4 in main (argc=4, argv=0xff8ca1e4) at camlprim.c:25901

The common theme is still stack memory corruption (caml_scan_stack and RETURN).
The issue also seems restricted to trunk and 5.2 (incl. compaction and ocaml/ocaml#12889).

I've still not been able to reproduce locally.

@jmid jmid changed the title [ocaml5-issue] Segfault in STM Domain.DLS test sequential on 32-bit trunk [ocaml5-issue] Segfault in STM Domain.DLS test sequential on bytecode Sep 11, 2024
@jmid
Collaborator Author

jmid commented Sep 11, 2024

I've finally managed to reproduce this locally by using musl.
It turns out this is not restricted to 32-bit: it also reproduces with a musl bytecode compiler.
I've shared repro steps on the upstream issue: ocaml/ocaml#13402

@jmid
Collaborator Author

jmid commented Oct 23, 2024

Fixed in ocaml/ocaml#13553

@jmid jmid closed this as completed Oct 23, 2024