Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mcs: use a controller to manage scheduling jobs #7270

Merged
merged 11 commits into from
Nov 10, 2023

Conversation

rleungx
Copy link
Member

@rleungx rleungx commented Oct 26, 2023

What problem does this PR solve?

Issue Number: Ref #5839.

What is changed and how does it work?

This PR uses a controller to start/stop the scheduling jobs so that we can dynamically control the scheduling jobs according to if there is a scheduling service.

Check List

Tests

  • Unit test

Release note

None.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Oct 26, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • CabinfeverB
  • lhy1024

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Oct 26, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Oct 26, 2023
@ti-chi-bot ti-chi-bot bot requested review from disksing and HunDunDM October 26, 2023 10:20
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 26, 2023
@rleungx rleungx force-pushed the dynamic-enable-scheduling branch 3 times, most recently from 212f0df to 1684a14 Compare November 2, 2023 05:07
@rleungx rleungx changed the title mcs: dynamically enable scheduling service mcs: use a controller to manage scheduling jobs Nov 2, 2023
@rleungx rleungx marked this pull request as ready for review November 2, 2023 05:08
@ti-chi-bot ti-chi-bot bot removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-linked-issue labels Nov 2, 2023
@rleungx rleungx requested review from JmPotato, nolouch, lhy1024 and CabinfeverB and removed request for disksing and HunDunDM November 3, 2023 03:24
@rleungx rleungx force-pushed the dynamic-enable-scheduling branch from 6cdff51 to 85b941c Compare November 7, 2023 06:43
@rleungx rleungx force-pushed the dynamic-enable-scheduling branch from 85b941c to 36bc79d Compare November 7, 2023 06:47
Signed-off-by: Ryan Leung <[email protected]>
c.enabledServices.Store(mcsutils.SchedulingServiceName, true)
} else if !c.schedulingController.running.Load() {
c.startSchedulingJobs()
c.enabledServices.Delete(mcsutils.SchedulingServiceName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enabledServices naming is confusing to me, it seems more like servicesToEnable since we will delete the service name from it after starting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the scheduling server is started, the enabledServices will store it, otherwise, it will be deleted in enabledServices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also confusing to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any better idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about independentServices? And do we need maintain this in PD mode?

@rleungx rleungx force-pushed the dynamic-enable-scheduling branch from 31eee0e to 2d6d331 Compare November 7, 2023 08:35
server/server.go Outdated
@@ -489,7 +489,7 @@ func (s *Server) startServer(ctx context.Context) error {
s.safePointV2Manager = gc.NewSafePointManagerV2(s.ctx, s.storage, s.storage, s.storage)
s.hbStreams = hbstream.NewHeartbeatStreams(ctx, s.clusterID, "", s.cluster)
// initial hot_region_storage in here.
if !s.IsAPIServiceMode() {
if !s.IsServiceEnabled(mcs.SchedulingServiceName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When running startServer, is the RaftCluter running?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

sc.coordinator.Stop()
sc.cancel()
sc.wg.Wait()
sc.running.Store(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use CompareAndSwap to avoid stop more one time.

checkFn := func() {
if c.isAPIServiceMode {
once.Do(c.initSchedulers)
c.enabledServices.Store(mcsutils.SchedulingServiceName, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described in this code, scheduling will not start again in api mode. I think we can delete runServiceCheckJob first and put it in the next PR. Because once the PR merged, the scheduling service must be deployed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense

c.enabledServices.Store(mcsutils.SchedulingServiceName, true)
} else if !c.schedulingController.running.Load() {
c.startSchedulingJobs()
c.enabledServices.Delete(mcsutils.SchedulingServiceName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about independentServices? And do we need maintain this in PD mode?

server/server.go Outdated Show resolved Hide resolved
@rleungx rleungx force-pushed the dynamic-enable-scheduling branch from 1f37ab8 to e2aee75 Compare November 9, 2023 04:05
Signed-off-by: Ryan Leung <[email protected]>
@rleungx rleungx force-pushed the dynamic-enable-scheduling branch from e2aee75 to 2ce4ff0 Compare November 9, 2023 04:12
Signed-off-by: Ryan Leung <[email protected]>
Copy link
Member

@CabinfeverB CabinfeverB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

sc.mu.Lock()
defer sc.mu.Unlock()
if sc.running {
log.Warn("scheduling service is already running")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will print frequently

@ti-chi-bot ti-chi-bot bot added the status/LGT1 Indicates that a PR has LGTM 1. label Nov 9, 2023
@rleungx
Copy link
Member Author

rleungx commented Nov 9, 2023

PTAL @JmPotato @lhy1024 @nolouch

func (c *RaftCluster) GetPausedSchedulerDelayUntil(name string) (int64, error) {
return c.coordinator.GetSchedulersController().GetPausedSchedulerDelayUntil(name)
// IsServiceIndependent returns whether the service is independent.
func (c *RaftCluster) IsServiceIndependent(name string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add a comment to indicate what service to be supported? only scheduling?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe later, right now, we just have a scheduling service.

server/cluster/scheduling_controller.go Show resolved Hide resolved
@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Nov 9, 2023
Copy link
Contributor

@nolouch nolouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@nolouch
Copy link
Contributor

nolouch commented Nov 10, 2023

/merge

Copy link
Contributor

ti-chi-bot bot commented Nov 10, 2023

@nolouch: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Copy link
Contributor

ti-chi-bot bot commented Nov 10, 2023

This pull request has been accepted and is ready to merge.

Commit hash: e807c69

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Nov 10, 2023
Copy link
Contributor

ti-chi-bot bot commented Nov 10, 2023

@rleungx: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot bot merged commit 1a0233b into tikv:master Nov 10, 2023
@rleungx rleungx deleted the dynamic-enable-scheduling branch November 13, 2023 02:08
rleungx added a commit to rleungx/pd that referenced this pull request Dec 1, 2023
ref tikv#5839

Signed-off-by: Ryan Leung <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants