Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [broker] topics failed to delete after remove cluster from replicated clusters set and caused OOM #23360

Merged

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Sep 27, 2024

Motivation

Background

  • Users may start 2 clusters with the Geo-Replication feature with Global ZK
  • Users want to hire one cluster, so remove the cluster from namespace-level replicatedClusters
    • Broker will remove the topics of the cluster removed automatically.
  • Issue occurs:
    • The system topic __change_events being deleted first.
    • User topics can not be deleted anymore, because the operation delete topic-level policies can not be executed successfully anymore.
    • The task topic.checkReplication and other operations will retry again and again, as a result, the broker will crash of OOM

Screenshot 2024-09-27 at 17 17 42
Screenshot 2024-09-27 at 17 18 56

Modifications

  • Skip removing topic-level policies if the system topic __change_event has been deleted.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added type/bug The PR fixed a bug or issue reported a bug category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost release/3.2.5 release/3.0.7 labels Sep 27, 2024
@poorbarcode poorbarcode added this to the 4.0.0 milestone Sep 27, 2024
@poorbarcode poorbarcode self-assigned this Sep 27, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 27, 2024
bin/ledgers.md Outdated Show resolved Hide resolved
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the review comments.

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good work @poorbarcode

@lhotari lhotari added the release/blocker Indicate the PR or issue that should block the release until it gets resolved label Oct 9, 2024
@lhotari
Copy link
Member

lhotari commented Oct 10, 2024

Triggering CI with fix #23431

@lhotari lhotari closed this Oct 10, 2024
@lhotari lhotari reopened this Oct 10, 2024
@codecov-commenter
Copy link

codecov-commenter commented Oct 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.36%. Comparing base (bbc6224) to head (5c2e2b4).
Report is 663 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23360      +/-   ##
============================================
+ Coverage     73.57%   74.36%   +0.78%     
- Complexity    32624    34983    +2359     
============================================
  Files          1877     1953      +76     
  Lines        139502   147162    +7660     
  Branches      15299    16199     +900     
============================================
+ Hits         102638   109430    +6792     
- Misses        28908    29305     +397     
- Partials       7956     8427     +471     
Flag Coverage Δ
inttests 27.31% <62.50%> (+2.73%) ⬆️
systests 24.32% <62.50%> (-0.01%) ⬇️
unittests 73.72% <100.00%> (+0.88%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../service/SystemTopicBasedTopicPoliciesService.java 74.42% <100.00%> (+0.23%) ⬆️
...er/systopic/NamespaceEventsSystemTopicFactory.java 100.00% <100.00%> (ø)

... and 633 files with indirect coverage changes

@Technoboy- Technoboy- merged commit d9bc7af into apache:master Oct 14, 2024
73 of 76 checks passed
@lhotari lhotari removed this from the 4.1.0 milestone Oct 14, 2024
@lhotari lhotari added this to the 4.0.0 milestone Oct 14, 2024
@lhotari
Copy link
Member

lhotari commented Oct 17, 2024

I wonder if this is causing #23474 which currently blocks Pulsar CI. There's PR #23478 to disable OneWayReplicatorTestBase tests the problem is addressed. @poorbarcode do you have a chance to fix the problem?

@poorbarcode
Copy link
Contributor Author

I wonder if this is causing #23474 which currently blocks Pulsar CI. There's PR #23478 to disable OneWayReplicatorTestBase tests the problem is addressed. @poorbarcode do you have a chance to fix the problem?

Self-assigned #23474, I will fix it next week

@lhotari
Copy link
Member

lhotari commented Oct 21, 2024

@poorbarcode Please also handle cherry-picking to branch-3.0 and branch-3.3

poorbarcode added a commit that referenced this pull request Oct 25, 2024
…icated clusters set and caused OOM (#23360)

(cherry picked from commit d9bc7af)
poorbarcode added a commit that referenced this pull request Oct 25, 2024
…icated clusters set and caused OOM (#23360)

(cherry picked from commit d9bc7af)
poorbarcode added a commit that referenced this pull request Oct 25, 2024
…icated clusters set and caused OOM (#23360)

(cherry picked from commit d9bc7af)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Oct 28, 2024
…icated clusters set and caused OOM (apache#23360)

(cherry picked from commit d9bc7af)
(cherry picked from commit 187e6d4)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Oct 28, 2024
…icated clusters set and caused OOM (apache#23360)

(cherry picked from commit d9bc7af)
(cherry picked from commit 187e6d4)
@heesung-sn
Copy link
Contributor

heesung-sn commented Oct 28, 2024

It appears that this PR fails the test in branch-3.3.

https://github.com/heesung-sn/pulsar/actions/runs/11535492430/job/32111441443

  Error:  Tests run: 8, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 782.877 s <<< FAILURE! - in org.apache.pulsar.broker.service.OneWayReplicatorUsingGlobalZKTest
  Error:  org.apache.pulsar.broker.service.OneWayReplicatorUsingGlobalZKTest.testRemoveCluster  Time elapsed: 300.528 s  <<< FAILURE!
  org.testng.internal.thread.ThreadTimeoutException: Method org.apache.pulsar.broker.service.OneWayReplicatorUsingGlobalZKTest.testRemoveCluster() didn't finish within the time-out 300000
  	at java.base/java.lang.Thread.sleep0(Native Method)
  	at java.base/java.lang.Thread.sleep(Thread.java:509)
  	at org.apache.pulsar.broker.service.OneWayReplicatorUsingGlobalZKTest.testRemoveCluster(OneWayReplicatorUsingGlobalZKTest.java:211)
  	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
  	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)

@poorbarcode
Copy link
Contributor Author

@heesung-sn

It appears that this PR fails the test in branch-3.3.

Sure, it will be fixed by the PR: #23522

@lhotari
Copy link
Member

lhotari commented Nov 1, 2024

@poorbarcode Do you have a chance to take a look at #23543 ? It's a deadlock in SystemTopicBasedTopicPoliciesService which happens also after #23522 has been merged. The deadlock might be completely unrelated to the recent changes, but since you know SystemTopicBasedTopicPoliciesService well, I'd appreciate it if you can take a look at the thread dump and the deadlocks.

nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Nov 6, 2024
…icated clusters set and caused OOM (apache#23360)

(cherry picked from commit d9bc7af)
(cherry picked from commit 187e6d4)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Nov 6, 2024
…icated clusters set and caused OOM (apache#23360)

(cherry picked from commit d9bc7af)
(cherry picked from commit 187e6d4)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Nov 7, 2024
…icated clusters set and caused OOM (apache#23360)

(cherry picked from commit d9bc7af)
(cherry picked from commit 187e6d4)
sagar627 added a commit to cognitree/ds_pulsar that referenced this pull request Nov 8, 2024
dlg99 pushed a commit to datastax/pulsar that referenced this pull request Nov 11, 2024
…icated clusters set and caused OOM (apache#23360)

(cherry picked from commit d9bc7af)
dlg99 added a commit to datastax/pulsar that referenced this pull request Nov 11, 2024
…rom replicated clusters set and caused OOM (apache#23360)"

This reverts commit cab51f4.
dlg99 pushed a commit to datastax/pulsar that referenced this pull request Nov 11, 2024
…icated clusters set and caused OOM (apache#23360)

(cherry picked from commit d9bc7af)
(cherry picked from commit 187e6d4)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 6, 2024
…icated clusters set and caused OOM (apache#23360)

(cherry picked from commit d9bc7af)
(cherry picked from commit 187e6d4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost cherry-picked/branch-3.0 cherry-picked/branch-3.2 cherry-picked/branch-3.3 doc-not-needed Your PR changes do not impact docs release/blocker Indicate the PR or issue that should block the release until it gets resolved release/3.0.8 release/3.2.5 release/3.3.3 release/4.0.0 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants