
LGTM tests are stalling (or are too slow for the timeout, I don't know) #45774

Open
gsmet opened this issue Jan 22, 2025 · 7 comments · May be fixed by #45822
Labels
area/housekeeping: Issue type for generalized tasks not related to bugs or enhancements · kind/bug: Something isn't working

Comments

@gsmet (Member) commented Jan 22, 2025

Describe the bug

We are seeing some recent failures in the LGTM tests.

I am not sure whether it's a problem with the new startup condition of the container, or whether starting the container is simply too slow and the test needs a higher timeout, but it needs to be addressed as it has been failing quite a lot on CI.

You can get the whole output from the raw logs of this build: https://github.com/quarkusio/quarkus/actions/runs/12897381888/job/35963408203 .

Wait for the logs to load entirely, then search for 2025-01-22T01:14:31.6075066Z 2025-01-22 01:14:31,520 INFO [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Tempo is up and running.

2025-01-22T01:14:01.3760182Z 2025-01-22 01:14:01,295 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Grafana is up and running.
2025-01-22T01:14:16.4919608Z 2025-01-22 01:14:16,406 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Loki is up and running.
2025-01-22T01:14:16.4920697Z 2025-01-22 01:14:16,412 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Prometheus is up and running.
2025-01-22T01:14:31.6075066Z 2025-01-22 01:14:31,520 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Tempo is up and running.
2025-01-22T01:14:35.3116972Z @QuarkusTest has detected a hang, as there has been no test activity in PT1M
2025-01-22T01:14:35.3119426Z To configure this timeout use the quarkus.test.hang-detection-timeout config property
2025-01-22T01:14:35.3120356Z A stack trace is below to help diagnose the potential hang
2025-01-22T01:14:35.3120939Z === Stack Trace ===
2025-01-22T01:14:35.3121302Z Thread main: WAITING
2025-01-22T01:14:35.3121759Z   Waiting on io.quarkus.builder.Execution@580eef62
2025-01-22T01:14:35.3122291Z   Stack:
2025-01-22T01:14:35.3122778Z     [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
2025-01-22T01:14:35.3123855Z     [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
2025-01-22T01:14:35.3124706Z     io.quarkus.builder.Execution.run(Execution.java:101)
2025-01-22T01:14:35.3125455Z     io.quarkus.builder.BuildExecutionBuilder.execute(BuildExecutionBuilder.java:79)
2025-01-22T01:14:35.3126714Z     io.quarkus.deployment.QuarkusAugmentor.run(QuarkusAugmentor.java:161)
2025-01-22T01:14:35.3127806Z     io.quarkus.runner.bootstrap.AugmentActionImpl.runAugment(AugmentActionImpl.java:350)
2025-01-22T01:14:35.3129279Z     io.quarkus.runner.bootstrap.AugmentActionImpl.createInitialRuntimeApplication(AugmentActionImpl.java:271)
2025-01-22T01:14:35.3131143Z     io.quarkus.runner.bootstrap.AugmentActionImpl.createInitialRuntimeApplication(AugmentActionImpl.java:61)
2025-01-22T01:14:35.3132661Z     app//io.quarkus.test.junit.QuarkusTestExtension.doJavaStart(QuarkusTestExtension.java:195)
2025-01-22T01:14:35.3134125Z     app//io.quarkus.test.junit.QuarkusTestExtension.ensureStarted(QuarkusTestExtension.java:578)
2025-01-22T01:14:35.3135432Z     app//io.quarkus.test.junit.QuarkusTestExtension.beforeAll(QuarkusTestExtension.java:628)
2025-01-22T01:14:35.3137115Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.lambda$invokeBeforeAllCallbacks$12(ClassBasedTestDescriptor.java:396)
2025-01-22T01:14:35.3138773Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor$$Lambda$515/0x00007f440018b978.execute(Unknown Source)
2025-01-22T01:14:35.3140176Z     app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
2025-01-22T01:14:35.3141728Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.invokeBeforeAllCallbacks(ClassBasedTestDescriptor.java:396)
2025-01-22T01:14:35.3143388Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.before(ClassBasedTestDescriptor.java:212)
2025-01-22T01:14:35.3144898Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.before(ClassBasedTestDescriptor.java:85)
2025-01-22T01:14:35.3146281Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:153)
2025-01-22T01:14:35.3147759Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$456/0x00007f440017aad8.execute(Unknown Source)
2025-01-22T01:14:35.3149200Z     app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
2025-01-22T01:14:35.3150701Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:146)
2025-01-22T01:14:35.3152074Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$455/0x00007f440017a8b0.invoke(Unknown Source)
2025-01-22T01:14:35.3153236Z     app//org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
2025-01-22T01:14:35.3154585Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:144)
2025-01-22T01:14:35.3156063Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$454/0x00007f440017a488.execute(Unknown Source)
2025-01-22T01:14:35.3156889Z     app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
2025-01-22T01:14:35.3157670Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:143)
2025-01-22T01:14:35.3158396Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:100)
2025-01-22T01:14:35.3159295Z     app//org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService$$Lambda$460/0x00007f440017ba80.accept(Unknown Source)
2025-01-22T01:14:35.3160072Z     [email protected]/java.util.ArrayList.forEach(ArrayList.java:1511)
2025-01-22T01:14:35.3160900Z     app//org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
2025-01-22T01:14:35.3161921Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:160)
2025-01-22T01:14:35.3162722Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$456/0x00007f440017aad8.execute(Unknown Source)
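
For reference, the hang-detection timeout mentioned in the log output above can be raised via configuration. A minimal sketch, assuming the default PT1M is simply too short for the LGTM container to start (PT5M is an arbitrary example value, not a tested recommendation):

```properties
# src/test/resources/application.properties
# ISO-8601 duration; the default is PT1M
quarkus.test.hang-detection-timeout=PT5M
```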
gsmet added the kind/bug label on Jan 22, 2025
@gsmet (Member, Author) commented Jan 22, 2025

/cc @alesj @holly-cummins

geoand added the area/housekeeping label and removed the triage/needs-triage label on Jan 22, 2025
@holly-cummins (Contributor) commented Jan 22, 2025

The Develocity history (https://ge.quarkus.io/scans/tests?search.timeZoneId=Europe%2FLondon&tests.container=io.quarkus.observability.test.LgtmResourcesIT&tests.test=testTracing#) is now showing this:

[image: Develocity history of LgtmResourcesIT.testTracing]

The first build it fails in is the one that merges #45689.

That strongly suggests my #45689 is to blame ... except that, on #45689, I looked up the history of the failing test and saw this:

[image: earlier Develocity history of the failing test]

That screencap shows the test starting to fail on the 17th. I didn't capture the previous URL, so I can't tell why the two histories are different.

@alesj (Contributor) commented Jan 22, 2025

We're actually waiting on .*The OpenTelemetry collector and the Grafana LGTM stack are up and running.* (plus the Grafana port),
which indicates that the LGTM stack is up and fully running ...

Imho, the problem is that we've added a few more tests (reload, Prometheus), and LGTM is a full stack of things "hidden" in the same Docker image:

  • Loki
  • Grafana
  • Tempo
  • Metrics --> Prometheus
  • Collector
  • ...

We always sort of suspected this might become a problem, startup-time wise.

Dunno what we can do?
@brunobat any idea?

I do have a PR somewhere :-) which breaks these things into separate pieces,
but LGTM is very convenient for us and for users, since it's super simple to use -- all-in-one.
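
For context, a minimal sketch of what such a log-based wait looks like with Testcontainers. The class name, image tag, ports and timeout here are illustrative assumptions, not the actual LgtmContainer code:

```java
import java.time.Duration;

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;

// Illustrative sketch only -- the real io.quarkus.observability LgtmContainer
// composes its own wait strategy.
public class LgtmLikeContainer extends GenericContainer<LgtmLikeContainer> {

    public LgtmLikeContainer() {
        super("grafana/otel-lgtm:latest");
        // Grafana UI plus OTLP gRPC/HTTP ports.
        addExposedPorts(3000, 4317, 4318);
        // Block startup until the all-in-one image reports the whole stack
        // (Loki, Grafana, Tempo, Prometheus, collector) as ready.
        waitingFor(Wait.forLogMessage(
                ".*The OpenTelemetry collector and the Grafana LGTM stack are up and running.*", 1)
                .withStartupTimeout(Duration.ofMinutes(2)));
    }
}
```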

@alesj (Contributor) commented Jan 22, 2025

@holly-cummins I don't see how your PR could be the problem -- it just fixed the missing properties on re-use.

Imho, the 34eb48e fix -- adding a proper container wait strategy -- is to blame. :-)

Before, we somehow got lucky and the tests passed even though we weren't properly waiting for LGTM to be fully operational.
But I guess the retry in the metrics / traces lookup made up for that on most occasions.

@holly-cummins (Contributor) commented

> @holly-cummins I don't see how your PR could be the problem -- it just fixed the missing properties on re-use.

That's my feeling too. The re-use path isn't even used in normal test execution (which is why we didn't notice the properties were missing). My change did touch the code on the non-re-use path as part of the refactoring, so it's possible it accidentally changed the behaviour. The only thing I can think of is that the refactored code was much faster or much slower, and that somehow affected how much time there was for the container to start. But it's a pretty long shot.

> Imho, the 34eb48e fix -- adding a proper container wait strategy -- is to blame. :-)

:)

> Before, we somehow got lucky and the tests passed even though we weren't properly waiting for LGTM to be fully operational. But I guess the retry in the metrics / traces lookup made up for that on most occasions.

That makes a lot of sense. It might be that just increasing the timeout for container start is enough to get things reliable again.
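
A sketch of that kind of tweak, assuming a Testcontainers-style wait strategy; the helper name and the five-minute value are illustrative assumptions, not a tested recommendation:

```java
import java.time.Duration;

import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.wait.strategy.Wait;

class StartupTimeoutTweak {

    // Hypothetical helper: keep the existing readiness condition but give
    // slow CI runners more time before Testcontainers declares a failure.
    static <T extends GenericContainer<T>> T withGenerousStartupTimeout(T container) {
        return container.waitingFor(
                Wait.forLogMessage(
                        ".*The OpenTelemetry collector and the Grafana LGTM stack are up and running.*", 1)
                        .withStartupTimeout(Duration.ofMinutes(5))); // arbitrary example value
    }
}
```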

@brunobat (Contributor) commented

Let's increase the container timeouts and see what happens, no?

@edeandrea (Contributor) commented

> Imho, the 34eb48e fix -- adding a proper container wait strategy -- is to blame. :-)

Damned if I do, damned if I don't :)

gsmet added a commit to gsmet/quarkus that referenced this issue Jan 23, 2025
gsmet linked a pull request (#45822) on Jan 23, 2025 that will close this issue
gsmet added a commit to gsmet/quarkus that referenced this issue Jan 23, 2025