Unhandled exception Type=Segmentation error #595

Open
Jenson3210 opened this issue Dec 13, 2024 · 22 comments

Jenson3210 commented Dec 13, 2024

Hi,

We're running 24.0.0.11-full-java17-openj9-ubi. However, I suspect 24.0.0.12-full-java17-openj9-ubi would show the same problem.

We're seeing crashes with error log:

Unhandled exception
Type=Segmentation error vmState=0x00040000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
...
----------- Stack Backtrace -----------
decodeStackFrameDataFromStackMapTable+0x1a (0x00007F7808D6024A [libj9vrb29.so+0x1124a])
generateJ9RtvExceptionDetails+0x9a5 (0x00007F7808D5F0F5 [libj9vrb29.so+0x100f5])
j9bcv_createVerifyErrorString+0x332 (0x00007F780A80C932 [libj9vm29.so+0x17c932])
classInitStateMachine+0xab5 (0x00007F780A6B3225 [libj9vm29.so+0x23225])
resolveStaticMethodRefInto+0x2c8 (0x00007F780A707BD8 [libj9vm29.so+0x77bd8])
resolveStaticMethodRef+0x22 (0x00007F780A707DB2 [libj9vm29.so+0x77db2])

We think it might be related to this bug.
According to the details there, it should be resolved in Semeru 17.0.14.0.
However, it seems that no image containing Semeru 17.0.14 is available yet;
the one used in the Dockerfile does not include it.

Our startup logs contain

Launching defaultServer (Open Liberty 24.0.0.11/wlp-1.0.95.cl241120241021-1102) on Eclipse OpenJ9 VM, version 17.0.13+11 (en_US)

showing that it is still on 17.0.13.

I could not find a way to raise this question/issue against the IBM UBI image that is used in the Dockerfile, so I'm raising it here instead, hoping for some insights.


leochr commented Dec 13, 2024

@Jenson3210 The error message and the method match the referenced bug/APAR, but the stack trace is slightly different, so I don't know for sure that the fix for that APAR would resolve your error. The Semeru team would likely know.

Repository for Semeru is at https://github.com/ibmruntimes/Semeru-Runtimes/issues

As per the release roadmap for Semeru, 17.0.14.0 is planned for February 2025. Semeru containers are typically available 1-2 weeks after the binary (non-container) release. Liberty images can be expected to be updated within 1-2 weeks of Semeru container release.

Jenson3210 (Author) commented:

@leochr Thanks a lot for the quick answer. We raised our question over there as well.

At the moment we are running an e2e pipeline test every 30 minutes against 23.0.0.4-full-java17-openj9-ubi, and we have not hit this error so far.
So either the Liberty version difference or the Semeru runtime itself might be the root cause here.


leochr commented Dec 16, 2024

I checked 23.0.0.4-full-java17-openj9-ubi and it included JDK 17.0.6+10. I am not sure whether changes in Semeru, in the Liberty runtime, or a combination of the two are the root cause.

JorisNens commented:

@leochr Our pipeline has already failed several times, but we are not able to retrieve the dump file.
The dump file named in the log is not present (or not visible) at that location. In the Open Liberty documentation I found the following command:
server dump defaultServer --include=heap,system,thread

I tried it on a healthy container and I get a dump file. But on a crashed container the command does not finish (even after one hour); I only get an empty directory.
I also tried without the include options, and I also tried the javadump variant, but the commands never finish.
Can you advise on this?


leochr commented Dec 19, 2024

@JorisNens Heap and system dumps are heavyweight. The thread dump is lightweight and commonly useful. Can you try with --include=thread to see if that works?

JorisNens commented:

@leochr I tried with --include=thread only, but I hit the same issue: the command does not stop. In the OpenShift metrics view I see a small increase in memory, but no increase in CPU (<10 mCore usage).


leochr commented Dec 20, 2024

Kevin (@kgibm), adding you in case you have any suggestions or insights for debugging/resolving the issue with gathering the dump. Thank you.


kgibm commented Dec 21, 2024

Most commonly with containers, crash core dumps go to the worker node: https://eclipse.dev/openj9/docs/xdump/#piped-system-dumps

Check cat /proc/sys/kernel/core_pattern and review the link above to see where.

As far as the hanging server dump goes, I've seen this once before in containers, but the customer couldn't reproduce it. To start, what we would need are javacores of both the Liberty and the server dump processes, taken with kill -3 $PID after the hang.
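
A minimal sketch of that step, assuming a shell inside the container (any PID other than 1 is illustrative):

# find the Liberty JVM (usually PID 1 in these images) and the separate server dump JVM
ps -ef | grep java
# request a javacore (thread dump) from each process once the hang occurs
kill -3 1
kill -3 <server-dump-pid>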


leochr commented Jan 6, 2025

Thank you Kevin!

@JorisNens, please try what Kevin suggested and provide the artifacts so we can investigate further. Thank you.

jorisnenscolruyt commented:

Hi, sorry for the late response.
The kill does seem to generate a file: javacore.20250106.154207.1.0003.txt
I tried to convert it with jpackcore, but the file format is unknown. I also tried to convert it to hprof, but was not able to.

I see some company details in the file, so I first have to check internally whether I can share the txt file.


kgibm commented Jan 6, 2025

tried to convert it with jpackcore, but the file format is unknown. I also tried to convert it to hprof, but was not able to.

A javacore is a thread dump which is just a small text file that shows basic information about a Java process such as stack traces of all threads: https://eclipse.dev/openj9/docs/dump_javadump/

You can either review it in a text editor or in a free tool such as the IBM Thread and Monitor Dump Analyzer.

jpackcore is for operating system coredumps produced by a crash or some other mechanism that produces a J9 System Dump such as server dump --include=system. By default, kill -3 produces javacores, not system dumps. If you want to produce system dumps on kill instead of other techniques such as server dump, then I suggest starting the JVM with this argument:

-Xdump:system:events=user2,request=exclusive+prepwalk

Then execute kill -USR2 $PID to request a system dump.
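
For example, a sketch of where that option could go (the jvm.options path assumes the default Open Liberty container layout; adjust for your setup):

# /config/jvm.options
-Xdump:system:events=user2,request=exclusive+prepwalk

kill -USR2 1    # assuming the Liberty JVM is PID 1 in the container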

I see some company details in the file, so I first have to check internally if I can share the txt file.

Yes, please scrub any confidential information. For the issue of server dump being hung, the main things we're interested in are the stacks of the threads and any lock contention, so we'd just want the 0SECTION THREADS section with the thread stacks and the 0SECTION LOCKS section for lock information.


jorisnenscolruyt commented Jan 7, 2025

javacore.20250106.154207.1.0003_clean2.txt
The javacore file is attached.

I changed the setting to collect a system dump with the kill command. I will send a new javacore file when the crash happens again.


kgibm commented Jan 7, 2025

@jorisnenscolruyt Was this taken during the server dump hang? If so, I don't see that the command reached Liberty, which suggests the server dump process itself is hung. In that case, please reproduce and take a javacore of the server dump process itself using kill -3 $PID after it hangs.

I changed the setting to collect the system dump with the Kill command. I will send a new javacore file when the crash happens again.

A javacore and system dump will be produced automatically on a crash, so you do not need to execute the kill command in that condition. My understanding is that you were just trying to test system dumps (since that's what the Java team needs to evaluate the crash) using Liberty's server dump but server dump is hanging. The kill -USR2 command is an alternative to test the production of system dumps instead of using server dump (although we would also like to investigate why server dump is hanging).

jorisnenscolruyt commented:

I was able to trigger a new javacore dump:
javacore.20250107.200210.1.0003 copy.txt
This was done by performing a kill -3 command.
Then I did a server dump, but it hung. I performed the kill -3 command again and I received this file:

javacore.20250107.200605.1.0004 copy.txt

When the server crashed, a dump was automatically created, but the dump creation failed:

JVMDUMP039I Processing dump event "gpf", detail "" at 2025/01/07 19:23:22 - please wait.
JVMDUMP032I JVM requested System dump using '/opt/ol/wlp/output/defaultServer/core.20250107.192322.1.0001.dmp' in response to an event
JVMPORT030W /proc/sys/kernel/core_pattern setting "|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" specifies that the core dump is to be piped to an external program. Attempting to rename either core or core.205. Review the manual for the external program to find where the core dump is written and ensure the program does not truncate it.
JVMPORT049I The core file created by child process with pid = 205 was not found. Review the documentation for the /proc/sys/kernel/core_pattern program "|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" to find where the core file is written and ensure that program does not truncate it.
JVMDUMP012E Error in System dump: /opt/ol/wlp/output/defaultServer/core.20250107.192322.1.0001.dmp

That is why I am trying to generate a dump with the kill command or the server dump command.

The following error is printed in the server logs:

Unhandled exception
Type=Segmentation error vmState=0x00040000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007FBC90ECFD60 Handler2=00007FBC90E28750 InaccessibleAddress=0000000000000000
RDI=00007FBC351F0090 RSI=0000000000000000 RAX=0000000000000000 RBX=00007FBC8C0F4320
RCX=00007FBC351F0120 RDX=000000000000FFFF R8=00007FBC8C0F4320 R9=0000000000000050
R10=00000000FFFFFFFF R11=0000000000000001 R12=00007FBC351F0120 R13=00007FBC351F0070
R14=00007FBC351F0090 R15=0000000000000001
RIP=00007FBC9024524A GS=0000 FS=0000 RSP=00007FBC351F0000
EFlags=0000000000010246 CS=0033 RBP=00007FBC351F0090 ERR=0000000000000004
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=0000000000000000
xmm0=9271f630b53fb101 (f: 3040850176.000000, d: -7.950421e-220)
xmm1=9271f63000000000 (f: 0.000000, d: -7.950417e-220)
xmm2=00000000b7654321 (f: 3076866816.000000, d: 1.520174e-314)
xmm3=0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm4=00000000ff000000 (f: 4278190080.000000, d: 2.113707e-314)
xmm5=0000003000000020 (f: 32.000000, d: 1.018558e-312)
xmm6=00000000ff000000 (f: 4278190080.000000, d: 2.113707e-314)
xmm7=0000000001a16730 (f: 27354928.000000, d: 1.351513e-316)
xmm8=00000000c0112200 (f: 3222348288.000000, d: 1.592052e-314)
xmm9=24245b9facd5b18c (f: 2899685888.000000, d: 1.400442e-134)


kgibm commented Jan 7, 2025

I was able to trigger a new javacore dump:
javacore.20250107.200210.1.0003 copy.txt
This was with performing a kill -3 command.
Then I did a server dump, but it hung. I performed the kill -3 command again and I received this file:
javacore.20250107.200605.1.0004 copy.txt

These are from the same PID 1, which is Liberty. The server dump process is a separate Java process. You'll have to look in ps (or under /proc if you don't have ps installed) to find the PID of the server dump process.
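
If ps isn't available, a sketch of the /proc approach (output and PIDs will vary):

# list all processes with their command lines, then pick out the server dump JVM
for p in /proc/[0-9]*; do echo "$p $(tr '\0' ' ' < "$p/cmdline")"; done | grep java
kill -3 <server-dump-pid>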

JVMDUMP039I Processing dump event "gpf", detail "" at 2025/01/07 19:23:22 - please wait.
JVMDUMP032I JVM requested System dump using '/opt/ol/wlp/output/defaultServer/core.20250107.192322.1.0001.dmp' in response to an event

This means the system dump went to systemd-coredump on the worker node. You should find it on the OpenShift worker node, most commonly at /var/lib/systemd/coredump/. You'll need a cluster-admin role to access the worker node.
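
A sketch of how to reach it, assuming a cluster-admin role (the node name is a placeholder):

oc debug node/<worker-node-name>
chroot /host
coredumpctl list                     # core dumps handled by systemd-coredump
ls -ltr /var/lib/systemd/coredump/   # default storage location (files are typically zstd-compressed)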


kgibm commented Jan 7, 2025

I should add that it won't be named core.20250107.192322.1.0001.dmp despite the JVM writing that in the message. That is the name the JVM wanted to give the core dump, but since the Linux kernel took over core dump processing, it is renamed to a different format based on the systemd-coredump configuration, so the simplest way to find it is to match the timestamp of the crash.

[A systemd-coredump-named core dump is] period-separated with core, the process or thread name, user ID, /proc/sys/kernel/random/boot_id, PID, and the time when the core was created in microseconds since the Unix Epoch. However, in the case of J9-forked core dumps, the process ID will not match the original process (instead, use jdmpview's info proc).

Jenson3210 (Author) commented:

Hi, thanks for the patience here.

Finally we were able to get the dump from our worker nodes.
We extracted a zstd file from our OpenShift cluster and then decompressed it using zstd --rm -d ....

We volume-mounted this dump into a Liberty server container to have the same runtime and processed the decompressed file with jpackcore ./mounted/file ./mounted/dump.zip

This generates a dump.zip on my host machine, which can be analysed locally after installing 17.0.13-sem (sdk install java 17.0.13-sem) and running ~/.sdkman/candidates/java/17.0.13-sem/bin/jdmpview ./dump.zip

As this file appears to contain a lot of data (some of it sensitive), do you have any advice on how best to continue here?
I can run certain commands and share the results if that would be helpful.
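
For reference, the steps above condensed into one sketch (file names and the mount point are placeholders):

zstd --rm -d core.<...>.zst                         # decompress the core retrieved from the worker node
jpackcore ./mounted/<corefile> ./mounted/dump.zip   # run inside a Liberty container with the same JVM
sdk install java 17.0.13-sem                        # install a matching Semeru JDK locally
~/.sdkman/candidates/java/17.0.13-sem/bin/jdmpview ./dump.zip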

jorisnenscolruyt commented:

analysis dump.zip
I made some exports with the jdmpview application:
info heap
info lockinfo
info mmap
info mod
info thread

Do these files contain anything you need?


kgibm commented Jan 14, 2025

The core dump analysis for this issue is best handled in ibmruntimes/Semeru-Runtimes#101 or in a support case. Based on Peter's last comment, it looks like jdmpview output won't be enough and gdb analysis of the core dump will be needed and/or traceformat on the Snap file.

By the way, if you want to test if an early release of 17.0.14 fixes the issue, here's a procedure to dynamically use a different JDK in containers: https://publib.boulder.ibm.com/httpserv/cookbook/Troubleshooting_Recipes-Troubleshooting_OpenShift_Recipes-Replace_Container_Directory_in_OpenShift.html

jorisnenscolruyt commented:

I created an image with Java 17.0.14. It took longer before the server crashed, but in the end I received the same error:
Unhandled exception
Type=Segmentation error vmState=0x00040000
...
Module=/opt/java/openjdk/lib/default/libj9vrb29.so
Module_base_address=00007FC814367000
Target=2_90_20250121_916 (Linux 5.14.0-284.57.1.el9_2.x86_64)
CPU=amd64 (10 logical CPUs) (0x1f718ea000 RAM)
----------- Stack Backtrace -----------
decodeStackFrameDataFromStackMapTable+0x1a (0x00007FC81437823A [libj9vrb29.so+0x1123a])
generateJ9RtvExceptionDetails+0x9a5 (0x00007FC8143770E5 [libj9vrb29.so+0x100e5])
j9bcv_createVerifyErrorString+0x332 (0x00007FC815E0BA32 [libj9vm29.so+0x181a32])
classInitStateMachine+0xab5 (0x00007FC815CAD235 [libj9vm29.so+0x23235])
resolveStaticMethodRefInto+0x2c8 (0x00007FC815D01CB8 [libj9vm29.so+0x77cb8])
resolveStaticMethodRef+0x22 (0x00007FC815D01E92 [libj9vm29.so+0x77e92])
_ZN37VM_DebugBytecodeInterpreterCompressed3runEP10J9VMThread+0xc2e3 (0x00007FC815D84B03 [libj9vm29.so+0xfab03])
debugBytecodeLoopCompressed+0xd2 (0x00007FC815D78812 [libj9vm29.so+0xee812])
(0x00007FC815DE4D52 [libj9vm29.so+0x15ad52])

During startup of the pod, I see these additional messages:

JVMSHRC226E Error opening shared class cache file
JVMSHRC336E Port layer error code = -108
JVMSHRC337E Platform error message: No such file or directory
JVMSHRC840E Failed to start up the shared cache.
JVMSHRC686I Failed to startup shared class cache. Continue without using it as -Xshareclasses:nonfatal is specified
WARNING: Unknown module: jdk.management.agent specified to --add-exports
WARNING: Unknown module: jdk.attach specified to --add-exports

Could this be causing the errors?


kgibm commented Jan 17, 2025

My guess is that the shared class cache would need to be rebuilt for the new JDK but those shared class cache errors aren't fatal; they just impact startup performance as the JVM falls back to no shared class cache, which is okay.
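
If you want to force a clean rebuild of the cache, a sketch using standard OpenJ9 shared-classes options (the cache name and directory depend on the image configuration):

java -Xshareclasses:listAllCaches   # show existing shared class caches
java -Xshareclasses:destroyAll      # remove them; a fresh cache is created on the next startup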

I do find those module warnings a bit strange so I don't know if they're related but if you don't see them on the previous JDK, then I guess they're not the main issue.

I think at this point deeper analysis of the core dump would be required by the Java team. A support case is the best way to share a core dump.


kgibm commented Jan 17, 2025

I noticed debugBytecodeLoopCompressed in the crash stack which sounds like it's related to debugging and I noticed this in your javacore:

2CIUSERARG               -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

This enables the Java debugger. Is this a business requirement? Besides being potentially related to the crash, the debugger may have a significant performance impact even if there's no active debugger.
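
If the agent was enabled via the server's jvm.options (a common place for it; the path assumes the default Open Liberty container layout), removing or commenting out that line disables the debug agent:

# /config/jvm.options -- remove or comment out the JDWP agent line
# -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005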
