Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LGTM tests are stalling (or are too slow for the timeout, I don't know) #45774

Open
gsmet opened this issue Jan 22, 2025 · 7 comments · May be fixed by #45822
Open

LGTM tests are stalling (or are too slow for the timeout, I don't know) #45774

gsmet opened this issue Jan 22, 2025 · 7 comments · May be fixed by #45822
Labels
area/housekeeping Issue type for generalized tasks not related to bugs or enhancements kind/bug Something isn't working

Comments

@gsmet
Copy link
Member

gsmet commented Jan 22, 2025

Describe the bug

We are seeing some recent issues in the LGTM tests.

I am not sure if it's a problem with the new startup condition of the container or if starting the container is just too slow and this test needs a higher timeout but it needs to be addressed as it has been failing on CI quite a lot.

You can get the whole output by getting the raw logs for this build: https://github.com/quarkusio/quarkus/actions/runs/12897381888/job/35963408203 .

Then you wait for them to load entirely and you search for 2025-01-22T01:14:31.6075066Z 2025-01-22 01:14:31,520 INFO [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Tempo is up and running.

2025-01-22T01:14:01.3760182Z 2025-01-22 01:14:01,295 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Grafana is up and running.
2025-01-22T01:14:16.4919608Z 2025-01-22 01:14:16,406 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Loki is up and running.
2025-01-22T01:14:16.4920697Z 2025-01-22 01:14:16,412 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Prometheus is up and running.
2025-01-22T01:14:31.6075066Z 2025-01-22 01:14:31,520 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1572734623) [LGTM] STDOUT: Tempo is up and running.
2025-01-22T01:14:35.3116972Z @QuarkusTest has detected a hang, as there has been no test activity in PT1M
2025-01-22T01:14:35.3119426Z To configure this timeout use the quarkus.test.hang-detection-timeout config property
2025-01-22T01:14:35.3120356Z A stack trace is below to help diagnose the potential hang
2025-01-22T01:14:35.3120939Z === Stack Trace ===
2025-01-22T01:14:35.3121302Z Thread main: WAITING
2025-01-22T01:14:35.3121759Z   Waiting on io.quarkus.builder.Execution@580eef62
2025-01-22T01:14:35.3122291Z   Stack:
2025-01-22T01:14:35.3122778Z     [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
2025-01-22T01:14:35.3123855Z     [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
2025-01-22T01:14:35.3124706Z     io.quarkus.builder.Execution.run(Execution.java:101)
2025-01-22T01:14:35.3125455Z     io.quarkus.builder.BuildExecutionBuilder.execute(BuildExecutionBuilder.java:79)
2025-01-22T01:14:35.3126714Z     io.quarkus.deployment.QuarkusAugmentor.run(QuarkusAugmentor.java:161)
2025-01-22T01:14:35.3127806Z     io.quarkus.runner.bootstrap.AugmentActionImpl.runAugment(AugmentActionImpl.java:350)
2025-01-22T01:14:35.3129279Z     io.quarkus.runner.bootstrap.AugmentActionImpl.createInitialRuntimeApplication(AugmentActionImpl.java:271)
2025-01-22T01:14:35.3131143Z     io.quarkus.runner.bootstrap.AugmentActionImpl.createInitialRuntimeApplication(AugmentActionImpl.java:61)
2025-01-22T01:14:35.3132661Z     app//io.quarkus.test.junit.QuarkusTestExtension.doJavaStart(QuarkusTestExtension.java:195)
2025-01-22T01:14:35.3134125Z     app//io.quarkus.test.junit.QuarkusTestExtension.ensureStarted(QuarkusTestExtension.java:578)
2025-01-22T01:14:35.3135432Z     app//io.quarkus.test.junit.QuarkusTestExtension.beforeAll(QuarkusTestExtension.java:628)
2025-01-22T01:14:35.3137115Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.lambda$invokeBeforeAllCallbacks$12(ClassBasedTestDescriptor.java:396)
2025-01-22T01:14:35.3138773Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor$$Lambda$515/0x00007f440018b978.execute(Unknown Source)
2025-01-22T01:14:35.3140176Z     app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
2025-01-22T01:14:35.3141728Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.invokeBeforeAllCallbacks(ClassBasedTestDescriptor.java:396)
2025-01-22T01:14:35.3143388Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.before(ClassBasedTestDescriptor.java:212)
2025-01-22T01:14:35.3144898Z     app//org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.before(ClassBasedTestDescriptor.java:85)
2025-01-22T01:14:35.3146281Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:153)
2025-01-22T01:14:35.3147759Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$456/0x00007f440017aad8.execute(Unknown Source)
2025-01-22T01:14:35.3149200Z     app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
2025-01-22T01:14:35.3150701Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:146)
2025-01-22T01:14:35.3152074Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$455/0x00007f440017a8b0.invoke(Unknown Source)
2025-01-22T01:14:35.3153236Z     app//org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
2025-01-22T01:14:35.3154585Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:144)
2025-01-22T01:14:35.3156063Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$454/0x00007f440017a488.execute(Unknown Source)
2025-01-22T01:14:35.3156889Z     app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
2025-01-22T01:14:35.3157670Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:143)
2025-01-22T01:14:35.3158396Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:100)
2025-01-22T01:14:35.3159295Z     app//org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService$$Lambda$460/0x00007f440017ba80.accept(Unknown Source)
2025-01-22T01:14:35.3160072Z     [email protected]/java.util.ArrayList.forEach(ArrayList.java:1511)
2025-01-22T01:14:35.3160900Z     app//org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
2025-01-22T01:14:35.3161921Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:160)
2025-01-22T01:14:35.3162722Z     app//org.junit.platform.engine.support.hierarchical.NodeTestTask$$Lambda$456/0x00007f440017aad8.execute(Unknown Source)
@gsmet gsmet added the kind/bug Something isn't working label Jan 22, 2025
@gsmet
Copy link
Member Author

gsmet commented Jan 22, 2025

/cc @alesj @holly-cummins

@geoand geoand added area/housekeeping Issue type for generalized tasks not related to bugs or enhancements and removed triage/needs-triage labels Jan 22, 2025
@holly-cummins
Copy link
Contributor

holly-cummins commented Jan 22, 2025

The Develocity history (https://ge.quarkus.io/scans/tests?search.timeZoneId=Europe%2FLondon&tests.container=io.quarkus.observability.test.LgtmResourcesIT&tests.test=testTracing#) is now showing this:

Image

The first build it fails in is the one which merges #45689.

That strongly suggests my #45689 is to blame ... except that on #45689 I looked up the history of the failing test, and had this:

Image

That screencap shows it starts failing on the 17th. I didn't capture the previous URL to see why the two histories are different.

@alesj
Copy link
Contributor

alesj commented Jan 22, 2025

We're actually waiting on .*The OpenTelemetry collector and the Grafana LGTM stack are up and running.* (+ Grafana port)
Which indicates that the LGTM stack is up and fully running ...

Imho, the problem is we've added a few more tests - reload, Prometheus,
and the LGTM is a full stack of things -- "hidden" in the same Docker image:

  • Loki
  • Grafana
  • Tempo
  • Metrics --> Prometheus
  • Collector
  • ...

which we always sort of suspected it might be a problem -- startup time wise.

Dunno what we can do?
@brunobat any idea?

I do have a PR somewhere :-), which breaks this things in separate pieces,
but the LGTM is very convenient for us and users, since it's super simple to use -- all-in-one.

@alesj
Copy link
Contributor

alesj commented Jan 22, 2025

@holly-cummins I don't see how your PR could be the problem -- it just fixed the missing properties on the re-use.

Imho, the 34eb48e fix -- to have proper container wait strategy -- is to blame. :-)

Before we somehow got lucky for the tests to pass, even though we weren't properly waiting for LGTM to be fully operational.
But I guess the re-try in the metrics / traces lookup made-up for that on most occasions.

@holly-cummins
Copy link
Contributor

@holly-cummins I don't see how your PR could be the problem -- it just fixed the missing properties on the re-use.

That's my feeling too. The re-use path isn't even used on the normal test execution (which is why we didn't notice the properties were missing). My change did touch the code on the non-re-use path just as part of the refactoring, so it's possible it could have accidentally changed the behaviour. But the only thing I can think of is if the refactored code was much faster or much slower and that somehow affected how much time there was for the container to start. But it's pretty long shot.

Imho, the 34eb48e fix -- to have proper container wait strategy -- is to blame. :-)

:)

Before we somehow got lucky for the tests to pass, even though we weren't properly waiting for LGTM to be fully operational. But I guess the re-try in the metrics / traces lookup made-up for that on most occasions.

That makes a lot of sense. It might be that just increasing the timeout for container start is enough to get things reliable again.

@brunobat
Copy link
Contributor

Let's increment the container timeouts and see what happens, no?

@edeandrea
Copy link
Contributor

Imho, the 34eb48e fix -- to have proper container wait strategy -- is to blame. :-)

Damned if I do, damned if I don't :)

gsmet added a commit to gsmet/quarkus that referenced this issue Jan 23, 2025
@gsmet gsmet linked a pull request Jan 23, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/housekeeping Issue type for generalized tasks not related to bugs or enhancements kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants