Design for ElasticDL Operator
Motivation
ElasticDL uses a master-worker architecture. Each ElasticDL job has a unique master pod. The master pod manages the lifecycle of worker pods and controls the training process.
ElasticDL provides a command-line client tool to submit a job to a Kubernetes cluster. First, a master pod is launched. Then, the master pod launches worker pods and, if necessary, PS pods. The training process begins once a worker pod becomes ready.
When turning ElasticDL into a cloud computing product, we find that we have to address the following two points:
- Job monitoring and management. The ElasticDL client tool only launches a job. We have to write extra scripts to monitor the pods' status and to clean up pods when a job completes.
- Product compatibility. Current products have already deployed some Kubeflow operators, such as TF operator and PyTorch operator. Much development work has been done to integrate these operators, including dashboards, command-line tools, and controllers with rich monitoring functions. It is better to reuse this work.
So, we decide to apply the operator pattern to ElasticDL as well. We introduce a CRD to define the workload of ElasticDL jobs. Then, we describe each ElasticDL job with a YAML file according to the CRD. A custom controller handles the requests expressed in these YAML files.
Please note that the controller only launches the master pod and monitors the job. The worker pods are still managed by the master pod. The controller does not take part in the fault-tolerance and elastic scheduling features.
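To make the controller's limited responsibility concrete, the following is a minimal sketch of a reconcile loop based on the controller-runtime library. The API package path, the ElasticAIJob type, the newMasterPod helper, and the status field are all hypothetical; a real implementation would also set owner references so that the master pod is garbage-collected together with the job.

package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	elasticdlv1alpha1 "elasticdl.org/elasticdl-operator/api/v1alpha1" // hypothetical module path
)

// ElasticAIJobReconciler launches the master pod of an ElasticAIJob and
// monitors the job. Worker and PS pods are still managed by the master pod.
type ElasticAIJobReconciler struct {
	client.Client
}

func (r *ElasticAIJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var job elasticdlv1alpha1.ElasticAIJob
	if err := r.Get(ctx, req.NamespacedName, &job); err != nil {
		// The job may have been deleted; nothing left to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Check whether the master pod already exists.
	masterName := types.NamespacedName{
		Namespace: job.Namespace,
		Name:      "elasticdl-" + job.Name + "-master",
	}
	var master corev1.Pod
	err := r.Get(ctx, masterName, &master)
	if apierrors.IsNotFound(err) {
		// Launch only the master pod; it launches worker and PS pods itself.
		pod := newMasterPod(&job) // hypothetical helper rendering the pod spec
		return ctrl.Result{}, r.Create(ctx, pod)
	}
	if err != nil {
		return ctrl.Result{}, err
	}

	// The master pod exists: mirror its phase into the job status.
	job.Status.Phase = string(master.Status.Phase)
	return ctrl.Result{}, r.Status().Update(ctx, &job)
}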
Case Study: Describing an MNIST Training Job
Let’s use a real case to drive the design of the ElasticDL CRD. The following is the master pod YAML file of an MNIST job dumped from the ElasticDL client tool. It contains all the necessary information.
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: elasticdl
    elasticdl-job-name: test-mnist
    elasticdl-replica-index: '0'
    elasticdl-replica-type: master
  name: elasticdl-test-mnist-master
  namespace: default
spec:
  containers:
  - args:
    - -c
    - set -o pipefail; python -m elasticdl.python.master.main --worker_image 'elasticdl:test'
      --model_zoo 'model_zoo' --cluster_spec '' --minibatch_size '64' --log_level
      'INFO' --dataset_fn 'dataset_fn' --loss 'loss' --optimizer 'optimizer' --callbacks
      'callbacks' --eval_metrics_fn 'eval_metrics_fn' --custom_data_reader 'custom_data_reader'
      --model_def 'mnist.mnist_functional_api.custom_model' --model_params
      '' --get_model_steps '1' --data_reader_params '' --distribution_strategy 'ParameterServerStrategy'
      --checkpoint_steps '0' --checkpoint_dir '' --keep_checkpoint_max '0' --output
      '' --image_name 'elasticdl:test' --job_name 'test-mnist' --master_resource_request
      'cpu=1,memory=1024Mi' --master_resource_limit 'cpu=1,memory=2048Mi' --num_workers
      '8' --worker_resource_request 'cpu=2,gpu=1,memory=2048Mi' --worker_resource_limit
      'cpu=2,gpu=1,memory=2048Mi' --master_pod_priority '' --worker_pod_priority 'high=0.5' --num_ps_pods
      '1' --ps_resource_request 'cpu=2,memory=1024Mi' --ps_resource_limit 'cpu=2,memory=2048Mi'
      --ps_pod_priority 'high' --volume 'host_path=/data,mount_path=/data' --image_pull_policy
      'Never' --restart_policy 'Never' --envs '' --extra_pypi_index 'https://pypi.org/simple'
      --namespace 'default' --num_minibatches_per_task '2' --use_go_ps 'True' --aux_params '' --log_file_path '' --tensorboard_log_dir
      '' --num_epochs '2' --grads_to_wait '1' --training_data '/data/mnist/train'
      --validation_data '' --evaluation_steps '0' --evaluation_start_delay_secs '100'
      --evaluation_throttle_secs '0' --checkpoint_dir_for_init '' --sync_version_tolerance
      '0' --log_loss_steps '100' --use_async 'False' --lr_staleness_modulation 'False'
    command:
    - /bin/bash
    env:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    image: elasticdl:test
    imagePullPolicy: Never
    name: elasticdl-test-mnist-master
    resources:
      limits:
        cpu: '1'
        memory: 2048Mi
      requests:
        cpu: '1'
        memory: 1024Mi
    volumeMounts:
    - mountPath: /data
      name: elasticdl-test-mnist-master-volume-0
  priorityClassName: ''
  restartPolicy: Never
  volumes:
  - hostPath:
      path: /host_data
    name: elasticdl-test-mnist-master-volume-0
We could rewrite it as a custom ElasticAIJob object after the ElasticDL CRD is created. The following is a sample:
apiVersion: "elasticdl.org/v1"
kind: "ElasticAIJob"
metadata:
name: "test-mnist"
spec:
jobArgs:
- "--model_zoo /model_zoo"
- "--model_def mnist.mnist_functional_api.custom_model"
- "--training_data /data/mnist/train"
- "--valiation_data /data/mnist/val"
- "--output /data/output"
- "--minibatch_size 64"
- "--num_minibatches_per_task 2"
- "--evaluation_step 1000"
master:
image: elasticdl-mnist
priority: high
resource_request: "cpu=1,memory=1024Mi"
volume: "host_path=/host_data,mount_path=/data"
ps:
count: 2
priority: high
image: elasticdl-ps
resource_request: "cpu=1,memory=1024Mi"
worker:
count: 10
priority: high=0.5
image: elasticdl-worker
resource_request: "cpu=4,gpu=1,memory=2048Mi"
volume: "host_path=/host_data,mount_path=/data"
ElasticDL CRD
This is a reference CRD design. Each ElasticDL operator implementation may have its own design.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: elasticaijobs.elasticdl.org
spec:
  group: elasticdl.org
  scope: Namespaced
  versions:
  - name: v1alpha1
    served: true
    storage: true
  names:
    kind: ElasticAIJob
    listKind: ElasticAIJobList
    singular: elasticaijob
    plural: elasticaijobs
    shortNames:
    - eaijob
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          properties:
            jobArgs:
              type: array
              items:
                type: string
                # each item looks like "--flag value"
                pattern: '^--[a-z0-9_]+\s\S+$'
            master:
              type: object
              properties:
                image:
                  type: string
                priority:
                  type: string
                resource_request:
                  type: string
                volume:
                  type: string
            ps:
              type: object
              properties:
                count:
                  type: integer
                  minimum: 0
                image:
                  type: string
                priority:
                  type: string
                resource_request:
                  type: string
                volume:
                  type: string
            worker:
              type: object
              properties:
                count:
                  type: integer
                  minimum: 1
                image:
                  type: string
                priority:
                  type: string
                resource_request:
                  type: string
                volume:
                  type: string
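For an implementation written in Go, the schema above maps naturally onto API types. The following is a sketch assuming a kubebuilder-style project layout; the type and field names mirror the CRD but are otherwise hypothetical.

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ReplicaSpec describes one pod group (master, ps, or worker) of a job.
type ReplicaSpec struct {
	Count           int    `json:"count,omitempty"`
	Image           string `json:"image"`
	Priority        string `json:"priority,omitempty"`
	ResourceRequest string `json:"resource_request,omitempty"`
	Volume          string `json:"volume,omitempty"`
}

// ElasticAIJobSpec mirrors the openAPIV3Schema in the CRD above.
type ElasticAIJobSpec struct {
	JobArgs []string     `json:"jobArgs,omitempty"`
	Master  ReplicaSpec  `json:"master"`
	PS      *ReplicaSpec `json:"ps,omitempty"`
	Worker  ReplicaSpec  `json:"worker"`
}

// ElasticAIJobStatus is a hypothetical status block exposed through the
// status subresource declared above.
type ElasticAIJobStatus struct {
	Phase string `json:"phase,omitempty"`
}

// ElasticAIJob is the Go representation of the eaijob custom resource.
type ElasticAIJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ElasticAIJobSpec   `json:"spec,omitempty"`
	Status ElasticAIJobStatus `json:"status,omitempty"`
}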
ElasticDL Controller
Currently, we do not plan to provide an ElasticDL controller implementation in the community.