
ElasticDL Command-line Client Tool


ElasticDL is a Kubernetes-native deep learning framework. Because it runs distributed training, prediction, and evaluation jobs in a cluster, we need a client to submit those jobs to the cluster. The client has two main functions: building the Docker image for an ElasticDL job and submitting the job.

We currently have a client, but it is tightly coupled with the main package. It is too heavy: users must pip install the whole elasticdl package along with many dependencies such as TensorFlow and grpcio.

To improve the user experience, the client should be lightweight and depend only on Docker and the Kubernetes API. This document discusses the design of this command-line client tool.

User Story

  1. Prerequisites

    • Install Docker CE >= 18.x for building the Docker images of the distributed ElasticDL jobs.
    • Install Python >= 3.6.
    • Install the ElasticDL command-line tool with pip install elasticdl_client.
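
The prerequisites above can be checked from Python before installing the client. The following is a minimal sketch and not part of ElasticDL: the helper names `python_ok` and `docker_on_path` are hypothetical, and the Docker check only confirms that a `docker` executable is on PATH, not that its version is >= 18.x.

```python
import shutil
import sys


def python_ok(version_info=None, minimum=(3, 6)):
    """Return True if the given (or running) Python version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor), e.g. (3, 6).
    return tuple(version_info[:2]) >= minimum


def docker_on_path():
    """Return True if a `docker` executable is found on PATH (version not checked)."""
    return shutil.which("docker") is not None


if __name__ == "__main__":
    print("Python >= 3.6:", python_ok())
    print("docker found:", docker_on_path())
```
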
  2. Users develop a model. The directory structure of the model definition files is as follows:

  3. Generate a Dockerfile.

    Run the following commands:

     cd model_zoo_root
     elasticdl zoo init

    Options shown inside [] are optional. The default value of base_image is python:3.6. An example of the generated Dockerfile is:

     FROM python:3.6
     RUN pip install elasticdl_preprocessing
     RUN pip install elasticdl
     COPY . /model_zoo
     RUN pip install -r /model_zoo/requirements.txt

    Users can edit the generated Dockerfile further if necessary.
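
To make the generation step concrete, here is a minimal sketch of how a Dockerfile like the example above could be rendered from a base image. It is an illustration only, not the actual implementation of elasticdl zoo init; `DOCKERFILE_TEMPLATE` and `render_dockerfile` are hypothetical names, and the template simply reproduces the example Dockerfile with the base image as a parameter.

```python
# Template mirroring the example Dockerfile; only the base image varies.
DOCKERFILE_TEMPLATE = """\
FROM {base_image}
RUN pip install elasticdl_preprocessing
RUN pip install elasticdl
COPY . /model_zoo
RUN pip install -r /model_zoo/requirements.txt
"""


def render_dockerfile(base_image="python:3.6"):
    """Return the Dockerfile text for the given base image (hypothetical helper)."""
    return DOCKERFILE_TEMPLATE.format(base_image=base_image)


if __name__ == "__main__":
    # Print the Dockerfile that the default base image would produce.
    print(render_dockerfile(), end="")
```

Keeping the template as plain text means that any manual edits users make afterwards are ordinary Dockerfile edits, with no client-specific format to learn.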

  4. Build the Docker image for an ElasticDL job.

     elasticdl zoo build --image=a_docker_registry/bright/elasticdl-wnd:1.0 .

  5. Push the Docker image to a remote registry.

     elasticdl zoo push a_docker_registry/bright/elasticdl-wnd:1.0

    If you want to execute the job locally in Minikube, the push step is not necessary.
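
The build and push steps correspond roughly to the standard docker build and docker push commands. The sketch below shows one way a lightweight client could wrap them. This is an assumption made for illustration: the document only states that the client depends on Docker, not how it invokes it. The helper names `build_cmd`, `push_cmd`, and `run` are hypothetical.

```python
import subprocess


def build_cmd(image, context="."):
    """Return the argv for building `image` from the Dockerfile in `context`."""
    return ["docker", "build", "-t", image, context]


def push_cmd(image):
    """Return the argv for pushing `image` to its remote registry."""
    return ["docker", "push", image]


def run(cmd):
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    image = "a_docker_registry/bright/elasticdl-wnd:1.0"
    # Only print the commands here; call run(...) to actually execute them.
    print(" ".join(build_cmd(image)))
    print(" ".join(push_cmd(image)))
```

Building the argument list separately from executing it keeps the command construction easy to test without a Docker daemon.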

  6. Submit a model training/prediction/evaluation job.

     elasticdl train \
         --image_name=a_docker_registry/bright/elasticdl-wnd:1.0 \
         --model_def=a_directory.wide_and_deep.custom_model \
         --training_data=/data/mnist/train \
         --validation_data=/data/mnist/test \
         --num_epochs=2 \
         --minibatch_size=64 \
         --num_ps_pods=1 \
         --num_workers=1 \
         --evaluation_steps=50 \
         --job_name=test-mnist \
         --distribution_strategy=ParameterServerStrategy \
         --master_resource_request="cpu=0.2,memory=1024Mi" \
         --master_resource_limit="cpu=1,memory=2048Mi" \
         --worker_resource_request="cpu=0.4,memory=1024Mi" \
         --worker_resource_limit="cpu=1,memory=2048Mi" \
         --ps_resource_request="cpu=0.2,memory=1024Mi" \