GitHub Action CI test-executor E2E Test Often Fails - TestStopBehavior #14120

Open
4 tasks done
wesleyscholl opened this issue Jan 23, 2025 · 3 comments

@wesleyscholl

wesleyscholl commented Jan 23, 2025

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Over the last week, while updating my PR #13895, the GitHub Actions CI run has often failed on the test-executor E2E test TestSignalsSuite/TestStopBehavior.

=== FAIL: SignalsSuite/TestStopBehavior
FAIL	github.com/argoproj/argo-workflows/v3/test/e2e	623.621s
FAIL
make: *** [Makefile:609: test-executor] Error 1
Error: Process completed with exit code 2.

Failed CI runs for my PR:

Other community members affected:

I found these recurring issues and PRs regarding this failing TestStopBehavior E2E test.

Let me know if you need more information and how I can help resolve this issue, thanks.

Version(s)

GitHub Action - latest

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Open or update a PR without Go source code changes (docs, examples, etc.). The test-executor TestStopBehavior E2E test will fail more than 50% of the time.

Logs from the workflow controller

Logs from the E2E Tests (test-executor, v1.29.10+k3s1, minimal, false) CI step:


=== RUN   TestSignalsSuite/TestStopBehavior
Submitting workflow  stop-terminate-
Waiting up to 2m0s for workflow with field selector 'metadata.name=stop-terminate-2llr9' and label selector 'workflows.argoproj.io/test'
 ? stop-terminate-2llr9 Workflow 0s      

 ● stop-terminate-2llr9   Workflow 0s      
 └ ● stop-terminate-2llr9 DAG      0s      
 └ ◷ A                    Pod      0s      

 ● stop-terminate-2llr9   Workflow 0s      
 └ ● stop-terminate-2llr9 DAG      0s      
 └ ◷ A                    Pod      0s      

 ● stop-terminate-2llr9   Workflow 0s      
 └ ● stop-terminate-2llr9 DAG      0s      
 └ ◷ A                    Pod      0s      PodInitializing

 ● stop-terminate-2llr9   Workflow 0s      
 └ ● A                    Pod      0s      
 └ ● stop-terminate-2llr9 DAG      0s      

Condition "to have running pod" met after 3s
Waiting up to 2m15s for workflow with field selector 'metadata.name=stop-terminate-2llr9' and label selector 'workflows.argoproj.io/test'
 ● stop-terminate-2llr9   Workflow 0s      
 └ ● stop-terminate-2llr9 DAG      0s      
 └ ● A                    Pod      0s      

 ● stop-terminate-2llr9   Workflow 0s      
 └ ✖ A                    Pod      3s      workflow shutdown with strategy:  Stop
 └ ● stop-terminate-2llr9 DAG      0s      
 └ ◷ A.onExit             Pod      0s      

 ● stop-terminate-2llr9          Workflow 0s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ✖ stop-terminate-2llr9        DAG      3s      
 └ ✖ A.onExit                    Pod      0s      workflow shutdown with strategy:  Stop
 └ ◷ stop-terminate-2llr9.onExit Pod      0s      

 ● stop-terminate-2llr9          Workflow 0s      
 └ ✖ stop-terminate-2llr9        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ✖ A.onExit                    Pod      0s      workflow shutdown with strategy:  Stop
 └ ◷ stop-terminate-2llr9.onExit Pod      0s      PodInitializing

 ● stop-terminate-2llr9          Workflow 0s      
 └ ✖ stop-terminate-2llr9        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ✖ A.onExit                    Pod      0s      workflow shutdown with strategy:  Stop
 └ ◷ stop-terminate-2llr9.onExit Pod      0s      PodInitializing

 ● stop-terminate-2llr9          Workflow 0s      
 └ ✖ stop-terminate-2llr9        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ✖ A.onExit                    Pod      0s      workflow shutdown with strategy:  Stop
 └ ● stop-terminate-2llr9.onExit Pod      0s      

 ● stop-terminate-2llr9          Workflow 0s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ✖ stop-terminate-2llr9        DAG      3s      
 └ ✖ A.onExit                    Pod      0s      workflow shutdown with strategy:  Stop
 └ ● stop-terminate-2llr9.onExit Pod      0s      

 ● stop-terminate-2llr9          Workflow 0s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ✖ stop-terminate-2llr9        DAG      3s      
 └ ✖ A.onExit                    Pod      0s      workflow shutdown with strategy:  Stop
 └ ✔ stop-terminate-2llr9.onExit Pod      5s      

 ✖ stop-terminate-2llr9          Workflow 12s     Stopped with strategy 'Stop'
 └ ✖ stop-terminate-2llr9        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ✖ A.onExit                    Pod      0s      workflow shutdown with strategy:  Stop
 └ ✔ stop-terminate-2llr9.onExit Pod      5s      

Condition "to be done" met after 9s
Checking expectation stop-terminate-2llr9
stop-terminate-2llr9 : Failed Stopped with strategy 'Stop'
    signals_test.go:44: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:44
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:69
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:44
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:36
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Failed"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1alpha1.NodePhase) (len=9) "Succeeded"
        	            	+(v1alpha1.NodePhase) (len=6) "Failed"
        	            	 
        	Test:       	TestSignalsSuite/TestStopBehavior
=== FAIL: SignalsSuite/TestStopBehavior
FAIL	github.com/argoproj/argo-workflows/v3/test/e2e	623.621s
FAIL
make: *** [Makefile:609: test-executor] Error 1
Error: Process completed with exit code 2.

Logs from your workflow's wait container

Logs from the E2E Tests (test-executor, minimal, false) CI step:


=== RUN   TestSignalsSuite/TestStopBehavior
Submitting workflow  stop-terminate-
Waiting up to 2m0s for workflow with field selector 'metadata.name=stop-terminate-6c2n5' and label selector 'workflows.argoproj.io/test'
 ? stop-terminate-6c2n5 Workflow 0s      

 ● stop-terminate-6c2n5   Workflow 0s      
 └ ● stop-terminate-6c2n5 DAG      0s      
 └ ◷ A                    Pod      0s      

 ● stop-terminate-6c2n5   Workflow 0s      
 └ ● stop-terminate-6c2n5 DAG      0s      
 └ ◷ A                    Pod      0s      PodInitializing

 ● stop-terminate-6c2n5   Workflow 0s      
 └ ● A                    Pod      0s      
 └ ● stop-terminate-6c2n5 DAG      0s      

Condition "to have running pod" met after 3s
Waiting up to 2m15s for workflow with field selector 'metadata.name=stop-terminate-6c2n5' and label selector 'workflows.argoproj.io/test'
 ● stop-terminate-6c2n5   Workflow 0s      
 └ ● stop-terminate-6c2n5 DAG      0s      
 └ ● A                    Pod      0s      

 ● stop-terminate-6c2n5   Workflow 0s      
 └ ● stop-terminate-6c2n5 DAG      0s      
 └ ✖ A                    Pod      3s      workflow shutdown with strategy:  Stop
 └ ◷ A.onExit             Pod      0s      

 ● stop-terminate-6c2n5          Workflow 0s      
 └ ✖ stop-terminate-6c2n5        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ✖ A.onExit                    Pod      0s      workflow shutdown with strategy:  Stop
 └ ◷ stop-terminate-6c2n5.onExit Pod      0s      

 ● stop-terminate-6c2n5          Workflow 0s      
 └ ✖ stop-terminate-6c2n5        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ◷ A.onExit                    Pod      0s      PodInitializing
 └ ◷ stop-terminate-6c2n5.onExit Pod      0s      PodInitializing

 ● stop-terminate-6c2n5          Workflow 0s      
 └ ✖ stop-terminate-6c2n5        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ◷ A.onExit                    Pod      0s      PodInitializing
 └ ◷ stop-terminate-6c2n5.onExit Pod      0s      PodInitializing

 ● stop-terminate-6c2n5          Workflow 0s      
 └ ✖ stop-terminate-6c2n5        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ◷ A.onExit                    Pod      0s      PodInitializing
 └ ● stop-terminate-6c2n5.onExit Pod      0s      

 ● stop-terminate-6c2n5          Workflow 0s      
 └ ✖ stop-terminate-6c2n5        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ◷ A.onExit                    Pod      0s      PodInitializing
 └ ● stop-terminate-6c2n5.onExit Pod      0s      

 ● stop-terminate-6c2n5          Workflow 0s      
 └ ✖ stop-terminate-6c2n5        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ✔ stop-terminate-6c2n5.onExit Pod      4s      
 └ ◷ A.onExit                    Pod      0s      PodInitializing

 ✖ stop-terminate-6c2n5          Workflow 12s     Stopped with strategy 'Stop'
 └ ✖ stop-terminate-6c2n5        DAG      3s      
 └ ✖ A                           Pod      3s      workflow shutdown with strategy:  Stop
 └ ◷ A.onExit                    Pod      0s      PodInitializing
 └ ✔ stop-terminate-6c2n5.onExit Pod      4s      

Condition "to be done" met after 8s
Checking expectation stop-terminate-6c2n5
stop-terminate-6c2n5 : Failed Stopped with strategy 'Stop'
    signals_test.go:44: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:44
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:69
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:44
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:36
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Pending"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1alpha1.NodePhase) (len=9) "Succeeded"
        	            	+(v1alpha1.NodePhase) (len=7) "Pending"
        	            	 
        	Test:       	TestSignalsSuite/TestStopBehavior
=== FAIL: SignalsSuite/TestStopBehavior
FAIL	github.com/argoproj/argo-workflows/v3/test/e2e	610.760s
FAIL
make: *** [Makefile:609: test-executor] Error 1
Error: Process completed with exit code 2.
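
The two failing runs report different actual values ("Failed" and "Pending") at the same assertion, signals_test.go:44. The following is a minimal sketch, not the project's test code, that uses the final node trees printed above to narrow down which node that assertion is likely reading. It assumes the glyphs map to phases as ✖ = Failed, ◷ = Pending, ✔ = Succeeded, and that both failures come from the same check. The types `NodePhase` and `run`, the helpers `candidates` and `loggedRuns`, and the shortened node labels are all hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sort"
)

// NodePhase is a local stand-in for v1alpha1.NodePhase (assumption: only
// the three phases visible in the logs are needed here).
type NodePhase string

const (
	Pending   NodePhase = "Pending"
	Failed    NodePhase = "Failed"
	Succeeded NodePhase = "Succeeded"
)

// run holds the final node phases of one logged run and the value the
// failing assertion reported as "actual".
type run struct {
	nodes  map[string]NodePhase
	actual NodePhase
}

// loggedRuns transcribes the last node tree of each failing run above.
// Node names are shortened labels, not the real generated names.
func loggedRuns() []run {
	return []run{
		{ // stop-terminate-2llr9
			nodes: map[string]NodePhase{
				"DAG": Failed, "A": Failed, "A.onExit": Failed, "workflow.onExit": Succeeded,
			},
			actual: Failed,
		},
		{ // stop-terminate-6c2n5
			nodes: map[string]NodePhase{
				"DAG": Failed, "A": Failed, "A.onExit": Pending, "workflow.onExit": Succeeded,
			},
			actual: Pending,
		},
	}
}

// candidates returns, sorted, the nodes whose final phase equals the
// reported "actual" value in every run.
func candidates(runs []run) []string {
	var out []string
	if len(runs) == 0 {
		return out
	}
	for name := range runs[0].nodes {
		match := true
		for _, r := range runs {
			if r.nodes[name] != r.actual {
				match = false
				break
			}
		}
		if match {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(candidates(loggedRuns())) // prints [A.onExit]
}
```

Under those assumptions, A.onExit is the only node consistent with both failures: it ends as Failed in the first run and is still Pending when the workflow completes in the second.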
@wesleyscholl
Author

After investigation, this issue is related to intermittent workflow failures when stopping a workflow.

Normal behavior should be:

  • Submit workflow -> Pod Running -> Stop Workflow -> Pod Failure -> Pod onExit -> Workflow Failure -> Workflow onExit -> E2E test successful

However, the intermittent failures follow this sequence instead:

  • Submit workflow -> Pod Running -> Stop Workflow -> Pod Failure & Workflow Failure -> Pod onExit & Workflow onExit -> E2E test failure
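
The difference between the two sequences above is one of ordering: in the expected case the pod's onExit step comes before the workflow failure, while in the failing case the workflow failure is grouped with the pod failure and so precedes it. A small sketch of that check, using only the event names from the two bullets; the `step` type and all function names are hypothetical and not part of the Argo test fixtures:

```go
package main

import "fmt"

// step is a group of events the issue lists as happening together ("&").
type step []string

// indexOf returns the position of the first step containing event, or -1.
func indexOf(seq []step, event string) int {
	for i, s := range seq {
		for _, e := range s {
			if e == event {
				return i
			}
		}
	}
	return -1
}

// podExitBeforeWorkflowFailure reports whether "Pod onExit" occurs in a
// strictly earlier step than "Workflow Failure".
func podExitBeforeWorkflowFailure(seq []step) bool {
	exit := indexOf(seq, "Pod onExit")
	fail := indexOf(seq, "Workflow Failure")
	return exit >= 0 && fail >= 0 && exit < fail
}

// expectedSequence transcribes the "normal behavior" bullet.
func expectedSequence() []step {
	return []step{
		{"Submit workflow"}, {"Pod Running"}, {"Stop Workflow"},
		{"Pod Failure"}, {"Pod onExit"}, {"Workflow Failure"}, {"Workflow onExit"},
	}
}

// intermittentSequence transcribes the "intermittent failures" bullet.
func intermittentSequence() []step {
	return []step{
		{"Submit workflow"}, {"Pod Running"}, {"Stop Workflow"},
		{"Pod Failure", "Workflow Failure"}, {"Pod onExit", "Workflow onExit"},
	}
}

func main() {
	fmt.Println(podExitBeforeWorkflowFailure(expectedSequence()))     // prints true
	fmt.Println(podExitBeforeWorkflowFailure(intermittentSequence())) // prints false
}
```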

Successful Test:

[Two screenshots, not preserved]

Test Failure:

[Two screenshots, not preserved]

@wesleyscholl
Author

Modifying this line improved the test success rate to >75%:

// Original
WaitForWorkflow(fixtures.ToHaveRunningPod, killDuration).

// Updated
WaitForWorkflow(fixtures.ToHaveRunningPod, 15*time.Second). // Reduced timeout; TestStopBehavior passes more often but still fails intermittently

Any thoughts on this? Thanks

wesleyscholl changed the title from "GitHub CI Action test-executor E2E Test Often Fails - TestStopBehavior" to "GitHub Action CI test-executor E2E Test Often Fails - TestStopBehavior" on Jan 25, 2025
@jswxstw
Member

jswxstw commented Jan 26, 2025

It seems that there is indeed a bug here; the exit handler in the DAG cannot execute properly when stopping.

# argo get stop-terminate-cmnw9                         
Name:                stop-terminate-cmnw9
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Failed (Terminated)
Message:             Stopped with strategy 'Stop'
Conditions:          
 PodRunning          False
 Completed           True
Created:             Sun Jan 26 10:46:15 +0800 (1 minute ago)
Started:             Sun Jan 26 10:46:15 +0800 (1 minute ago)
Finished:            Sun Jan 26 10:46:28 +0800 (1 minute ago)
Duration:            13 seconds
Progress:            1/3
ResourcesDuration:   0s*(1 cpu),2s*(100Mi memory)

STEP                            TEMPLATE       PODNAME                                        DURATION  MESSAGE
 ✖ stop-terminate-cmnw9         main                                                                                                            
 ├─✖ A                          echo           stop-terminate-cmnw9-echo-2502829069           4s        workflow shutdown with strategy:  Stop  
 └─● A.onExit                   exit-template  stop-terminate-cmnw9-exit-template-4002442228  0s        workflow shutdown with strategy:  Stop  
                                                                                                                                                         
 ✔ stop-terminate-cmnw9.onExit  exit           stop-terminate-cmnw9-exit-1439502999           7s 
