Make sure to read :doc:`prerequisites` before installing mlbench.
All guides assume you have checked out the mlbench-helm GitHub repository and have a terminal open in the checked-out mlbench-helm directory.
Since every Kubernetes cluster is different, there are no reasonable defaults for some values, so the following properties have to be set. You can save them in a YAML file of your choosing. This guide assumes you saved them in myvalues.yaml. For a reference of all configurable values, you can copy the values.yaml file to myvalues.yaml.
limits:
  workers:
  cpu:
  bandwidth:
  gpu:

gcePersistentDisk:
  enabled:
  pdName:
limits.workers
    The maximum number of worker nodes available to mlbench. This sets the maximum number of nodes that can be chosen for an experiment in the UI. By default, mlbench starts 2 workers on startup.

limits.cpu
    The maximum number of CPUs (cores) available on each worker node, in Kubernetes notation (8 or 8000m for 8 cpus/cores). This is also the maximum number of cores that can be selected for an experiment in the UI.

limits.bandwidth
    The maximum network bandwidth available between workers, in mbit per second. This is the default bandwidth used and the maximum value selectable in the UI.

limits.gpu
    The number of GPUs requested by each worker pod.

gcePersistentDisk.enabled
    If true, create the resources related to the NFS persistentVolume and persistentVolumeClaim.

gcePersistentDisk.pdName
    The name of an existing persistent disk in GKE.
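For illustration, a minimal myvalues.yaml could look like the following; the concrete numbers and the disk name are placeholders that you need to adapt to your own cluster:

limits:
  workers: 2          # maximum number of worker nodes usable by mlbench
  cpu: 4000m          # maximum CPU cores per worker node (Kubernetes notation)
  bandwidth: 1000     # maximum bandwidth between workers, in mbit/s
  gpu: 0              # number of GPUs requested by each worker pod

gcePersistentDisk:
  enabled: false      # set to true to mount an existing GCE persistent disk
  pdName: my-pd-name  # name of the persistent disk in GKE (placeholder)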
Caution!
If you set workers, cpu or gpu higher than what is available in your cluster, Kubernetes will not be able to allocate nodes to mlbench and the deployment will hang indefinitely, without throwing an exception. Kubernetes will simply wait until nodes that fit the requirements become available, so make sure your cluster actually has the resources you requested.
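Kubernetes reports the allocatable CPU and GPU of each node, which you can inspect before choosing these limits, for example:

$ kubectl describe nodes | grep -A 7 Allocatable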
Note
To use gpu in the cluster, the nvidia device plugin should be installed. See :ref:`plugins` for details.
Note
Use a command like gcloud compute disks create --size=10G --zone=europe-west1-b my-pd-name to create the persistent disk.
Note
The GCE persistent disk will be mounted to the /datasets/ directory on each worker.
Set the :ref:`helm-charts`
Use helm to install the mlbench chart (Replace ${RELEASE_NAME} with a name of your choice):
$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install ${RELEASE_NAME} .
Follow the instructions at the end of the helm install to get the dashboard URL. E.g.:
$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install rel .
[...]
NOTES:
1. Get the application URL by running these commands:
export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services rel-mlbench-master)
export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
echo http://$NODE_IP:$NODE_PORT
These commands output the URL at which the dashboard is accessible.
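You can also verify that the master and worker pods started correctly (the pod names typically include the release name you chose):

$ kubectl get pods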
In myvalues.yaml, one can optionally install Kubernetes plugins by turning the following flags on or off:

weave.enabled
    If true, install the weave network plugin.

nvidiaDevicePlugin.enabled
    If true, install the nvidia device plugin.
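For example, the corresponding entries in myvalues.yaml could look like this; whether each plugin should be enabled depends on your cluster:

weave:
  enabled: false            # set to true to install the weave network plugin
nvidiaDevicePlugin:
  enabled: true             # set to true if the workers should use GPUs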
Set the :ref:`helm-charts`
Important
Make sure to read the prerequisites for :ref:`google-cloud`.
Please make sure that kubectl is configured correctly.
Caution!
Google installs several pods on each node by default, limiting the available CPU; this can take up to 0.5 CPU cores per node. So make sure to provision VMs that have at least 1 more core than the number of cores you want to use for your mlbench experiment. See the Google Kubernetes Engine documentation for further details on node limits.
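For instance, a cluster meant for experiments that use 3 cores per worker could be provisioned with 4-core machines; the cluster name, machine type and node count below are placeholders to adapt:

$ gcloud container clusters create my-mlbench-cluster --machine-type=n1-standard-4 --num-nodes=3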
Install mlbench (Replace ${RELEASE_NAME} with a name of your choice):
$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install ${RELEASE_NAME} .
To access mlbench, run these commands and open the URL that is returned (note: the default instructions printed by helm on the command line return the internal cluster IP only):
$ export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services ${RELEASE_NAME}-mlbench-master)
$ export NODE_IP=$(gcloud compute instances list|grep $(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}") |awk '{print $5}')
$ gcloud compute firewall-rules create --quiet mlbench --allow tcp:$NODE_PORT,tcp:$NODE_PORT
$ echo http://$NODE_IP:$NODE_PORT
!DANGER!
The firewall-rules command above opens a port of your cluster to the outside world. Make sure to delete the rule once it's no longer needed:
$ gcloud compute firewall-rules delete --quiet mlbench
Minikube allows running a single-node Kubernetes cluster inside a VM on your laptop, for users looking to try out Kubernetes or to develop with it.
Installing mlbench on minikube.
Set the :ref:`helm-charts`
Start the minikube cluster:
$ minikube start
Next, install or upgrade the helm chart with the desired configuration, using ${RELEASE_NAME} as the release name:
$ helm init --kube-context minikube --wait
$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install ${RELEASE_NAME} .
Note
Minikube runs a single-node Kubernetes cluster inside a VM, so we need to set replicaCount to 1 in myvalues.yaml.
Once the installation has finished, one can obtain the URL:
$ export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services ${RELEASE_NAME}-mlbench-master)
$ export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
$ echo http://$NODE_IP:$NODE_PORT
Now the mlbench dashboard should be available at http://${NODE_IP}:${NODE_PORT}.
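Alternatively, minikube can print the URL of the dashboard service directly (the service name below assumes the naming pattern used above):

$ minikube service ${RELEASE_NAME}-mlbench-master --url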
Note
To access http://$NODE_IP:$NODE_PORT outside minikube, run the following command on the host:
$ ssh -i ${MINIKUBE_HOME}/.minikube/machines/minikube/id_rsa -N -f -L localhost:${NODE_PORT}:${NODE_IP}:${NODE_PORT} docker@$(minikube ip)
where $MINIKUBE_HOME defaults to $HOME. One can then view the mlbench dashboard at http://localhost:${NODE_PORT}.
Docker-in-Docker allows simulating multiple nodes locally on a single machine. This is useful for development.
Hint
For development purposes, it makes sense to use a local Docker registry together with DIND. Describing how to set up a local registry in full would be too long for this guide, but the following gives a starting point.
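A minimal sketch: the standard registry image from Docker Hub can be run on the host (port 5000 and the container name below are the conventional defaults; see the Docker registry documentation for details):

$ docker run -d -p 5000:5000 --restart=always --name registry registry:2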
Download the kubeadm-dind-cluster script.
$ wget https://cdn.rawgit.com/kubernetes-sigs/kubeadm-dind-cluster/master/fixed/dind-cluster-v1.11.sh
$ chmod +x dind-cluster-v1.11.sh
For networking to work in DIND, we need to set a CNI plugin. In our experience, weave works well with DIND.
$ export CNI_PLUGIN=weave
Now we can start the local cluster with
$ ./dind-cluster-v1.11.sh up
This might take a couple of minutes.
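Once the script has finished, you can check that the simulated nodes are registered and Ready:

$ kubectl get nodes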
Hint
If you're using a local docker registry, run dind-proxy.sh after the previous step.
Install helm (see :doc:`prerequisites`) and set the :ref:`helm-charts`.
Hint
For a local registry, make sure you have an imagePullSecret added to the Kubernetes service account and set the repository and secret in myvalues.yaml (regcred in this example; a sketch of the kubectl commands follows the snippet below):
master:
  imagePullSecret: regcred
  image:
    repository: localhost:5000/mlbench_master
    tag: latest
    pullPolicy: Always

worker:
  imagePullSecret: regcred
  image:
    repository: localhost:5000/mlbench_worker
    tag: latest
    pullPolicy: Always
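A secret named regcred can be created and attached to the default service account with kubectl, for example (the registry address and credentials are placeholders):

$ kubectl create secret docker-registry regcred --docker-server=localhost:5000 --docker-username=<user> --docker-password=<password>
$ kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "regcred"}]}'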
Install mlbench (the example below uses rel as the release name; replace it with a name of your choice):
$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install rel .
[...]
NOTES:
1. Get the application URL by running these commands:
export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services rel-mlbench-master)
export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
echo http://$NODE_IP:$NODE_PORT
Run the three commands printed under NOTES; they output the URL at which the dashboard is accessible.