
ElasticDL Client: Submit ElasticDL Job to Kubernetes

Prepare Model Definition

You need to create a model definition directory that contains the following files:

  • (mandatory) A Python source file that defines the Keras model, with the directory's base name as its filename (see the sketch below).
  • (mandatory) An __init__.py file.
  • (optional) Source files of other Python modules.
  • (optional) A requirements.txt file that lists dependencies required by the above source files.

Several Keras examples are provided in the model_zoo directory.
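
As an illustration, a minimal model definition file might look like the sketch below. The directory would be named after the file (e.g. model_zoo/custom_model/custom_model.py). The exact set of helper functions (such as the loss, optimizer, and dataset/feed functions) and their signatures should be copied from a model_zoo example for your ElasticDL version; the names used here are only illustrative.

import tensorflow as tf


class CustomModel(tf.keras.Model):
    """A small Keras subclass model, similar in spirit to the model_zoo examples."""

    def __init__(self):
        super(CustomModel, self).__init__(name="custom_model")
        self.dense_1 = tf.keras.layers.Dense(64, activation="relu")
        self.dense_2 = tf.keras.layers.Dense(10)

    def call(self, inputs, training=False):
        x = self.dense_1(inputs)
        return self.dense_2(x)


# The model_zoo examples also export helper functions; the names and
# signatures below are illustrative, so copy the real ones from model_zoo.
def loss(labels, predictions):
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, predictions, from_logits=True
        )
    )


def optimizer(lr=0.1):
    return tf.optimizers.SGD(lr)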

Submit ElasticDL Job In Development Mode

Download ElasticDL Source Code

git clone https://github.com/sql-machine-learning/elasticdl.git
cd elasticdl

Use the ElasticDL client to launch the ElasticDL system on a Kubernetes cluster and submit a model to it, e.g. model_zoo/mnist_subclass/mnist_subclass.py.

Submit to local Kubernetes on Your Machine

For demonstration purposes, we use the data stored in the elasticdl:ci Docker image. First, we build all development Docker images, which include the elasticdl:ci image:

export TRAVIS_BUILD_DIR=$PWD
bash scripts/travis/build_images.sh

By default, the above script builds images with the TensorFlow CPU image as the base image. If you want to switch to another base image, for example a Python, Ubuntu, or TensorFlow GPU image, please edit elasticdl/docker/Dockerfile.

Submit the training job (make sure you have the kubernetes and docker packages installed in your environment):

python -m elasticdl.python.elasticdl.client train \
    --model_zoo=model_zoo \
    --model_def=mnist_subclass.mnist_subclass.CustomModel \
    --image_base=elasticdl:ci \
    --training_data=/data/mnist/train \
    --validation_data=/data/mnist/test \
    --num_epochs=1 \
    --master_resource_request="cpu=1,memory=512Mi" \
    --master_resource_limit="cpu=1,memory=512Mi" \
    --worker_resource_request="cpu=1,memory=1024Mi" \
    --worker_resource_limit="cpu=1,memory=1024Mi" \
    --minibatch_size=10 \
    --num_minibatches_per_task=10 \
    --num_workers=1 \
    --checkpoint_steps=2 \
    --grads_to_wait=2 \
    --job_name=test \
    --image_pull_policy=Never \
    --log_level=INFO \
    --envs=e1=v1,e2=v2
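
The --envs flag passes key-value pairs such as e1=v1 to the job. Assuming these end up as environment variables inside the master and worker pods (check the client's help output for your version to confirm), the model code can read them with the standard library:

import os

# Assuming --envs=e1=v1,e2=v2 exports e1 and e2 as environment variables in
# the master and worker pods, model code can read the values like this:
e1 = os.getenv("e1", "default-value")
print("e1 =", e1)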

Submit to a GKE cluster

Please check out this tutorial for instructions on submitting jobs to a GKE cluster.

Submit to an on-premise Kubernetes cluster

An on-premise Kubernetes cluster may require some additional configurations for the pods to be launched. ElasticDL provides an easy way for users to specify such pod requirements.

python -m elasticdl.python.elasticdl.client train \
    --job_name=test \
    --image_name=gcr.io/elasticdl/mnist:dev \
    --model_zoo=model_zoo \
    --model_def=mnist_subclass.mnist_subclass.CustomModel \
    --cluster_spec=<path_to_cluster_specification_file> \
    --training_data=/data/mnist_nfs/mnist/train \
    --validation_data=/data/mnist_nfs/mnist/test \
    --num_epochs=1 \
    --minibatch_size=10 \
    --num_minibatches_per_task=10 \
    --num_workers=1 \
    --checkpoint_steps=2 \
    --master_pod_priority=high-priority \
    --worker_pod_priority=high-priority \
    --master_resource_request="cpu=1,memory=2048Mi" \
    --master_resource_limit="cpu=1,memory=2048Mi" \
    --worker_resource_request="cpu=2,memory=4096Mi" \
    --worker_resource_limit="cpu=2,memory=4096Mi" \
    --grads_to_wait=2 \
    --volume="mount_path=/data,claim_name=fileserver-claim" \
    --image_pull_policy=Always \
    --log_level=INFO \
    --docker_image_repository=gcr.io/elasticdl \
    --envs=e1=v1,e2=v2

The difference is that we add a new argument cluster_spec, which points to a cluster specification file. The cluster specification module includes a cluster component, and ElasticDL will invoke the function cluster.with_pod(pod) to add extra specifications to the pod and the function cluster.with_service(service) to add extra specifications to the service. Here is an example that assigns the label "app": "elasticdl" to the pod and service. Users can implement more customized configurations inside these two functions.

class KubernetesCluster:
    def with_pod(self, pod):
        pod.metadata.labels["app"] = "elasticdl"
        return pod

    def with_service(self, service):
        service.metadata.labels["app"] = "elasticdl"
        return service

# TODO: need to change this after we make the same change to the model definition
cluster = KubernetesCluster()
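
Beyond labels, with_pod can attach any pod-level settings that the kubernetes Python client exposes. The sketch below adds a toleration as a further example; the taint key elasticdl-dedicated is made up for illustration and is not something ElasticDL itself defines.

from kubernetes import client


class KubernetesCluster:
    def with_pod(self, pod):
        pod.metadata.labels["app"] = "elasticdl"
        # Example of a richer customization: tolerate a hypothetical taint so
        # the pod can be scheduled onto dedicated nodes.
        toleration = client.V1Toleration(
            key="elasticdl-dedicated", operator="Exists", effect="NoSchedule"
        )
        if pod.spec.tolerations:
            pod.spec.tolerations.append(toleration)
        else:
            pod.spec.tolerations = [toleration]
        return pod

    def with_service(self, service):
        service.metadata.labels["app"] = "elasticdl"
        return service


cluster = KubernetesCluster()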

Submit ElasticDL Job In Command Line Mode

Download ElasticDL Source Code

git clone https://github.com/sql-machine-learning/elasticdl.git
cd elasticdl

Build And Install Wheel Package From Source Code

python3 setup.py install

Submit Jobs

The usage is the same as in the development mode; just replace the python -m elasticdl.python.elasticdl.client part with elasticdl.

Check the pod status

kubectl get pods
kubectl logs $pod_name
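
Since the kubernetes Python package is already required to submit jobs, you can also inspect pod status programmatically. A minimal sketch, assuming your kubeconfig points at the target cluster and the job runs in the default namespace:

from kubernetes import client, config

# Load credentials from ~/.kube/config, the same way kubectl does.
config.load_kube_config()
v1 = client.CoreV1Api()

# List the pods in the target namespace and print their phases; narrow this
# down with a label_selector if you know the labels your job's pods carry.
for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.status.phase)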