ElasticDL: A Kubernetes-native Deep Learning Framework

Development Docker Image

The development Docker image contains the dependencies for ElasticDL development and processed demo data in RecordIO format. In the repo’s root directory, run the following command:

docker build \
    -t elasticdl:dev \
    -f elasticdl/docker/Dockerfile.dev .

To build the Docker image with GPU support, run the following command:

docker build \
    -t elasticdl:dev-gpu \
    -f elasticdl/docker/Dockerfile.dev \
    --build-arg BASE_IMAGE=tensorflow/tensorflow:2.0.0-gpu-py3 .

Note that since ElasticDL depends on TensorFlow, the base image must have TensorFlow installed.

If you have difficulty downloading from the main PyPI site, you can pass an extra PyPI index URL to docker build, for example:

docker build \
    --build-arg EXTRA_PYPI_INDEX=https://mirrors.aliyun.com/pypi/simple \
    -t elasticdl:dev \
    -f elasticdl/docker/Dockerfile.dev .

To develop in the Docker container, run the following command to mount your cloned elasticdl git repo directory (e.g. EDL_REPO below) to the /edl_dir directory in the container and start the container:

EDL_REPO=<your_elasticdl_git_repo>
docker run --rm -u $(id -u):$(id -g) -it \
    -v $EDL_REPO:/edl_dir \
    -w /edl_dir \
    elasticdl:dev
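
If EDL_REPO is unset or points at the wrong directory, the container still starts but /edl_dir will not contain the source tree. A minimal pre-flight sketch follows; check_edl_repo is a hypothetical helper, not part of the ElasticDL repo, and it only assumes that a checkout contains the elasticdl/docker directory used by the build commands above.

```shell
# Hypothetical helper (not part of ElasticDL): succeed only if the
# given path looks like an ElasticDL checkout.
check_edl_repo() {
    if [ -z "$1" ]; then
        echo "EDL_REPO is not set" >&2
        return 1
    fi
    # The build commands above reference elasticdl/docker, so use it as a marker.
    if [ ! -d "$1/elasticdl/docker" ]; then
        echo "$1 does not contain elasticdl/docker" >&2
        return 1
    fi
    return 0
}
```

It could be chained in front of the run command, for example `check_edl_repo "$EDL_REPO" && docker run ...`.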

Continuous Integration Docker Image

The continuous integration Docker image contains everything from the development Docker image plus the ElasticDL source code. It is used to run continuous integration with the latest version of the source code. In the repo’s root directory, run the following command:

docker build \
    -t elasticdl:ci \
    -f elasticdl/docker/Dockerfile.ci .

Test and Debug

Pre-commit Check

We have set up pre-commit checks in the GitHub repo for pull requests, which catch some Python style problems. To avoid waiting in the Travis CI queue, you can also run the pre-commit checks locally:

docker run --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \
    elasticdl:dev \
    bash -c \
    "pre-commit run --files $(find elasticdl/python model_zoo -name '*.py' -print0 | tr '\0' ' ')"

Unit Tests

In the root directory of the elasticdl repo inside the dev Docker container, run the following:

make -f elasticdl/Makefile && K8S_TESTS=False pytest elasticdl/python/tests

You can also start the Docker container and run the unit tests in a single command:

docker run --rm -u $(id -u):$(id -g) -it \
    -v $EDL_REPO:/edl_dir \
    -w /edl_dir \
    elasticdl:dev \
    bash -c "make -f elasticdl/Makefile && K8S_TESTS=False pytest elasticdl/python/tests"

Note that some unit tests require a running Kubernetes cluster. To include those unit tests, run the following:

make -f elasticdl/Makefile && pytest elasticdl/python/tests

ODPS-related tests require additional environment variables. To run those tests, execute the following:

docker run --rm -it -v $PWD:/edl_dir -w /edl_dir \
    -e ODPS_PROJECT_NAME=xxx \
    -e ODPS_ACCESS_ID=xxx \
    -e ODPS_ACCESS_KEY=xxx \
    -e ODPS_ENDPOINT=xxx \
    elasticdl:dev bash -c "make -f elasticdl/Makefile && K8S_TESTS=False ODPS_TESTS=True pytest elasticdl/python/tests/odps_* elasticdl/python/tests/data_reader_test.py"
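
Because the ODPS tests depend on four variables, it can help to verify them before launching the container. The sketch below is only an illustration: missing_odps_vars is a hypothetical helper, not part of ElasticDL, and it checks nothing beyond whether each variable named in the command above is non-empty in the current shell.

```shell
# Hypothetical pre-flight check (not part of ElasticDL): print the name
# of every ODPS variable that is unset or empty in the current shell.
missing_odps_vars() {
    for name in ODPS_PROJECT_NAME ODPS_ACCESS_ID ODPS_ACCESS_KEY ODPS_ENDPOINT; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
        fi
    done
}
```

If the variables are exported in the host shell, docker run can forward them by name alone (`-e ODPS_ACCESS_KEY` with no `=value`), which keeps credentials out of the command line and shell history.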

Test in Docker

In a terminal, start the master to distribute MNIST training tasks:

docker run --net=host --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \
    elasticdl:dev \
    bash -c "python -m elasticdl.python.master.main \
          --model_zoo=model_zoo \
          --model_def=mnist_functional_api.mnist_functional_api.custom_model \
          --job_name=test \
          --training_data=/data/mnist/train \
          --validation_data=/data/mnist/test \
          --evaluation_steps=15 \
          --num_epochs=2 \
          --checkpoint_steps=2 \
          --grads_to_wait=2 \
          --minibatch_size=10 \
          --num_minibatches_per_task=10 \
          --log_level=INFO"

In another terminal, start a worker:

docker run --net=host --rm -it -v $EDL_REPO:/edl_dir -w /edl_dir \
    elasticdl:dev \
    bash -c "python -m elasticdl.python.worker.main \
          --worker_id=1 \
          --model_zoo=model_zoo \
          --model_def=mnist_functional_api.mnist_functional_api.custom_model \
          --minibatch_size=10 \
          --job_type=training_with_evaluation \
          --master_addr=localhost:50001 \
          --log_level=INFO"

This trains the model defined in model_zoo/mnist_functional_api/mnist_functional_api.py on the MNIST data for 2 epochs. Note that the master saves model checkpoints in a local directory, checkpoint_dir.

If you run into issues related to proto definitions, run the following command to rebuild the latest proto components:

make -f elasticdl/Makefile

Test with Kubernetes

We can also test an ElasticDL job in a Kubernetes cluster using the previously built image.

First make sure the built image has been pushed to a Docker registry, and then run the following command to launch the job:

kubectl apply -f manifests/examples/elasticdl-demo-k8s.yaml

To run the demo job in Minikube, make sure to run eval $(minikube docker-env) first and then build the images, so that they are built inside Minikube’s Docker daemon. Then launch the job:

kubectl apply -f manifests/examples/elasticdl-demo-minikube.yaml

If you find a permission error in the main pod log, e.g., "pods is forbidden: User \"system:serviceaccount:default:default\" cannot create resource \"pods\"", you need to grant pod-related permissions to the default user:

kubectl apply -f manifests/examples/elasticdl-rbac.yaml
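
One way to confirm that the grant took effect is kubectl's built-in access check. This is a sketch under assumptions: the job runs as the default service account in the default namespace (as the error message above suggests), kubectl points at the same cluster, and your own user is allowed to impersonate service accounts. The function name is hypothetical.

```shell
# Hypothetical wrapper: ask the API server whether the default service
# account may create pods. kubectl prints "yes" or "no" and exits
# non-zero when the answer is "no".
default_sa_can_create_pods() {
    kubectl auth can-i create pods \
        --as=system:serviceaccount:default:default \
        --namespace=default
}
```

Running default_sa_can_create_pods before and after applying the RBAC manifest should show the answer change from "no" to "yes" if the permission error was the cause.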

Test on Travis CI

All tests are executed on Travis CI. They include:

  • Pre-commit checks
  • Unit tests
  • Integration tests

The unit tests and integration tests also include tests that run on a local Kubernetes cluster via Minikube and tests that require data sources from ODPS. Please refer to the Travis configuration file for more details.

Note that tests related to ODPS will not be executed on pull requests created from forks since the ODPS access information has been secured on Travis and only those who have write access can retrieve it. Developers who have write access to this repo are encouraged to submit pull requests from branches instead of forks if any code related to ODPS has been modified.

Also note that two of the integration test cases involve loading a checkpoint. It is not easy to generate checkpoints automatically during integration tests, so we currently keep a checkpoint file in the test data folder of the ElasticDL GitHub repository and use it for the integration tests. You therefore need to regenerate this checkpoint file if your PR modifies the definition of the Model protocol buffer.

If you want to trigger Travis builds without submitting a pull request, you can do so by developing on a branch and adding the branch name to the list in the branches section of the Travis configuration file. You can also trigger Travis builds from forks, but this requires additional work such as activating Travis for the forked repo, and ODPS-related tests will be skipped as mentioned earlier.