refactor the process of controlling a training job #39

m3ngyang · 2018-07-12T03:00:13Z

m3ngyang · 2018-07-12T03:09:18Z

pkg/controller/trainingjob_controller.go

@@ -106,9 +126,11 @@ func New(
 // is closed, at which point it will shutdown the workqueue and wait for
 // workers to finish processing their current work items.
 func (c *TrainingJobController) Run(threadiness int, maxLoadDesired float64, stopCh <-chan struct{}) error {
+	// TODO add a lock to ensure there is only one controller in the cluster


A resource lock has already been implemented, so this TODO is unnecessary.

m3ngyang · 2018-07-12T03:10:32Z

pkg/controller/trainingjob_controller.go

@@ -125,9 +147,12 @@ func (c *TrainingJobController) Run(threadiness int, maxLoadDesired float64, sto
 		go wait.Until(c.runWorker, time.Second, stopCh)
 	}

+	// gc := NewGarbageCollector(c.KubeCli, c.trainingjobLister)
+	// go gc.CleanOrphans(10 * time.Minute)


Delete the commented-out code and add a TODO for gc instead.

typhoonzero · 2018-07-20T03:32:29Z

cmd/edl/edl.go

@@ -34,6 +34,7 @@ var (
 func main() {
 	masterURL := flag.String("master", "", "Address of a kube master.")
 	kubeConfig := flag.String("kubeconfig", "", "Path to a kube config. Only required if out-of-cluster.")
+	autoClean := flag.Bool("autoclean", false, "Auto clean pods after terminating job, default false")


So making this default to false means the user may need to get the logs?

Yes. If the autoclean flag is false, the controller keeps the pods after the job succeeds or fails; otherwise, all pods are deleted automatically. Keeping them is useful for debugging and collecting logs.
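For illustration, the behavior described in this reply can be sketched roughly as follows. This is a minimal sketch, not the controller's actual code: `jobPhase`, its constants, and `shouldCleanPods` are hypothetical names, and only the `autoclean` flag definition is taken from the diff.

```go
package main

import (
	"flag"
	"fmt"
)

// jobPhase is a hypothetical, simplified job state used only for this sketch.
type jobPhase string

const (
	phaseRunning   jobPhase = "Running"
	phaseSucceeded jobPhase = "Succeeded"
	phaseFailed    jobPhase = "Failed"
)

// shouldCleanPods reports whether a job's pods should be deleted.
// Pods are removed only when autoclean is enabled AND the job has
// terminated (succeeded or failed); a running job is never cleaned.
func shouldCleanPods(autoClean bool, phase jobPhase) bool {
	terminated := phase == phaseSucceeded || phase == phaseFailed
	return autoClean && terminated
}

func main() {
	// Same flag definition as in cmd/edl/edl.go in this PR.
	autoClean := flag.Bool("autoclean", false, "Auto clean pods after terminating job, default false")
	flag.Parse()

	// Print the cleanup decision for each hypothetical phase.
	for _, p := range []jobPhase{phaseRunning, phaseSucceeded, phaseFailed} {
		fmt.Printf("phase=%s cleanPods=%v\n", p, shouldCleanPods(*autoClean, p))
	}
}
```

With the default (`-autoclean=false`), the pods of a terminated job stay around so their logs can still be read; passing `-autoclean=true` trades that for automatic cleanup.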

tizhou86

The code review was already done in the Baidu repository. Merging it now.

refactor the process of controlling a training job

165d2a1

m3ngyang requested review from typhoonzero and Yancey1989 July 12, 2018 03:01

m3ngyang commented Jul 12, 2018

typhoonzero requested a review from gongweibao July 12, 2018 12:25

m3ngyang added 4 commits July 16, 2018 10:53

modify comments to pass golint check

6617af0

Merge branch 'develop_CRD' into refactory

804cdda

fix travis problem

189444b

Merge branch 'refactory' of https://github.com/m3ngyang/edl into refa…

6a438c4

…ctory

typhoonzero reviewed Jul 20, 2018

tizhou86 self-requested a review September 17, 2018 10:52

tizhou86 approved these changes Sep 18, 2018

tizhou86 merged commit d141736 into elasticdeeplearning:develop_CRD Sep 18, 2018
