refactor the process of controlling a training job #39

Merged 5 commits on Sep 18, 2018

m3ngyang
Collaborator

fix #26

@@ -106,9 +126,11 @@ func New(
// is closed, at which point it will shutdown the workqueue and wait for
// workers to finish processing their current work items.
func (c *TrainingJobController) Run(threadiness int, maxLoadDesired float64, stopCh <-chan struct{}) error {
// TODO add a lock to ensure there is only one controller in the cluster
Collaborator Author

A resource lock has been implemented, so this TODO is no longer needed.

@@ -125,9 +147,12 @@ func (c *TrainingJobController) Run(threadiness int, maxLoadDesired float64, sto
go wait.Until(c.runWorker, time.Second, stopCh)
}

// gc := NewGarbageCollector(c.KubeCli, c.trainingjobLister)
// go gc.CleanOrphans(10 * time.Minute)
Collaborator Author

Delete the commented-out code and add a TODO for GC.

@typhoonzero typhoonzero requested a review from gongweibao July 12, 2018 12:25
@@ -34,6 +34,7 @@ var (
func main() {
masterURL := flag.String("master", "", "Address of a kube master.")
kubeConfig := flag.String("kubeconfig", "", "Path to a kube config. Only required if out-of-cluster.")
autoClean := flag.Bool("autoclean", false, "Auto clean pods after terminating job, default false")
Collaborator

So defaulting this to false means the user may still need to retrieve the logs?

Collaborator Author

Yes. If the autoclean flag is false, the controller keeps the pods after the job succeeds or fails; otherwise, all pods are deleted automatically. Keeping them is useful for debugging and retrieving logs.
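
The behavior described in this reply can be sketched roughly as below. This is an illustrative assumption, not the controller's actual code: the jobPhase type, its constants, and shouldDeletePods are hypothetical names; only the autoclean flag definition is taken from the diff.

```go
package main

import (
	"flag"
	"fmt"
)

// jobPhase is a hypothetical stand-in for the training job's status.
type jobPhase string

const (
	phaseRunning   jobPhase = "Running"
	phaseSucceeded jobPhase = "Succeeded"
	phaseFailed    jobPhase = "Failed"
)

// shouldDeletePods reports whether the pods of a job in the given phase should
// be removed. Pods are deleted only when autoClean is set and the job has
// terminated (succeeded or failed); otherwise they are kept for debugging.
func shouldDeletePods(autoClean bool, phase jobPhase) bool {
	terminated := phase == phaseSucceeded || phase == phaseFailed
	return autoClean && terminated
}

func main() {
	// Same flag definition as in the diff above.
	autoClean := flag.Bool("autoclean", false, "Auto clean pods after terminating job, default false")
	flag.Parse()

	for _, p := range []jobPhase{phaseRunning, phaseSucceeded, phaseFailed} {
		fmt.Printf("phase=%s deletePods=%t\n", p, shouldDeletePods(*autoClean, p))
	}
}
```

With the default of false, terminated pods stay around so that their logs remain available; passing -autoclean=true opts in to automatic cleanup.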

@tizhou86 tizhou86 self-requested a review September 17, 2018 10:52
Member

@tizhou86 tizhou86 left a comment

I already did the code review in the Baidu repository. Merging it now.

@tizhou86 tizhou86 merged commit d141736 into elasticdeeplearning:develop_CRD Sep 18, 2018