Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[E2E] Elasticsearch cluster health should not have been red during mutation process #8267

Open
barkbay opened this issue Nov 27, 2024 · 5 comments
Labels
>test Related to unit/integration/e2e tests

Comments

@barkbay
Copy link
Contributor

barkbay commented Nov 27, 2024

We already had this failure in the past (#5040 and #5795) but it was quiet until yesterday:

2024-11-26 - TestMutationResizeMemoryDown on Kind 1.29

{
    "Time": "2024-11-26T02:10:27.650635375Z",
    "Action": "output",
    "Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/es",
    "Test": "TestMutationResizeMemoryDown/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process",
    "Output": "    steps_mutation.go:159: ContinuousHealthChecks failures count (10) is above the tolerance (0): 10 x [cluster health red]\n"
}

2024-11-27 - TestMutationAndReversal on Kind 1.27.17

{
    "Time": "2024-11-27T02:41:29.251348422Z",
    "Action": "output",
    "Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/es",
    "Test": "TestMutationAndReversal/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process",
    "Output": "    steps_mutation.go:159: ContinuousHealthChecks failures count (11) is above the tolerance (0): 11 x [cluster health red]\n"
}

I had a quick look at the TestMutationAndReversal failure and nothing really stands out for now, the cluster was eventually green.

@barkbay barkbay added the >test Related to unit/integration/e2e tests label Nov 27, 2024
@barkbay
Copy link
Contributor Author

barkbay commented Dec 2, 2024

Trying to make sense of the events for this failure :

Timestamp Source Message
02:29:25.020 E2E TestMutationAndReversal/Starting_to_watch_for_correct_node_shutdown_API_usage
02:29:25.948 E2E TestMutationAndReversal/Applying_the_Elasticsearch_mutation_should_succeed
02:29:26.387 ECK Requesting shutdown node: test-reverted-mutation-25tq-es-masterdata-2 note that masterdata-2 is the elected master at this point
02:29:26.401 ES test-reverted-mutation-25tq-es-masterdata-2 - creating shutdown record {nodeId=[M1SoknqJRn-azlIWOql3jg], type=[RESTART], reason=[123067]}
02:29:26.462 ES test-reverted-mutation-25tq-es-masterdata-2 - marking node as not ready because it's shutting down
02:29:26.464 ES test-reverted-mutation-25tq-es-masterdata-2 - took [1ms] to notify listeners on successful publication of cluster state (version: 80, uuid: NlqSkGGzQO-lLXOCWgtVIg) for [put-node-shutdown-M1SoknqJRn-azlIWOql3jg[PutShutdownNodeTask[request=org.elasticsearch.xpack.shutdown.PutShutdownN
02:29:26.489 ECK Node shutdown complete, can start node deletion
02:29:26.999 ES test-reverted-mutation-25tq-es-masterdata-1 - transport connection to [{test-reverted-mutation-25tq-es-masterdata-2}.... closed by remote
no elected master from 03:29:26.999 to 02:29:27.208?
02:29:27.208 ES test-reverted-mutation-25tq-es-masterdata-1 - master node changed {previous [], current [{test-reverted-mutation-25tq-es-masterdata-0}{5_A7oYdMRaOAkef3P0rXhA}{2l0Fm1KqQcmKD2PmczNuHQ}{test-reverted-mutation-25tq-es-masterdata-0}{10.244.2.68}{10.244.2.68:9300}{cdfhilmrstw}{8.16.0}{
02:29:27.327 ES test-reverted-mutation-25tq-es-masterdata-0 - Cluster health status changed from [YELLOW] to [RED] Cluster health status changed from [YELLOW] to [RED] (reason: [{test-reverted-mutation-25tq-es-masterdata-2}{M1SoknqJRn-azlIWOql3jg}{UN68ohjnSiSbhA_9LcjJyQ}{test-reverted-mutation-25tq-es-masterdata-2}{10.244.3.48}{10.244.3.48:9300}{cdfhilmrstw}{8.16.0}{7000099-8518000} reason: disconnected]).
02:29:57.817 ES test-reverted-mutation-25tq-es-masterdata-2 started {test-reverted-mutation-25tq-es-masterdata-2}
02:29:58.521 ES test-reverted-mutation-25tq-es-masterdata-0 - Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ds-ilm-history-7-2024.12.02-000001][0]]]).

@barkbay
Copy link
Contributor Author

barkbay commented Dec 2, 2024

From the logs above I'm a bit surprised that we had no election while masterdata-2 was still around? It seems that the other nodes started a new election only when masterdata-2 was terminated and the connection was lost.

@pebrc
Copy link
Collaborator

pebrc commented Dec 2, 2024

I removed my previous post. I cut off the routing table in a bad place. There was still a shard assigned.

@barkbay
Copy link
Contributor Author

barkbay commented Jan 3, 2025

Happened again today, found this in the logs:

[.ds-ilm-history-7-2025.01.03-000001][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=NODE_RESTARTING], at[2025-01-03T02:29:28.253Z], delayed=true, last_node[uYfufXxSSbmzwIVCKnepjw], details[node_left [uYfufXxSSbmzwIVCKnepjw]], allocation_status[fetching_shard_data]]
[.ds-ilm-history-7-2025.01.03-000001][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=PRIMARY_FAILED], at[2025-01-03T02:29:28.253Z], delayed=false, last_node[zYLIr4AfRiu0gkJov8wH0w], details[primary failed while replica initializing], allocation_status[no_attempt]]

ilm-history-7 was created a few milliseconds after the shutdown of the first node has been initiated:

Image

So it seems to be a duplicate of #5795 😐 , what I don't understand is why it was so quiet until recently...

@barkbay
Copy link
Contributor Author

barkbay commented Jan 6, 2025

Only ideas I have in mind:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>test Related to unit/integration/e2e tests
Projects
None yet
Development

No branches or pull requests

2 participants