Data Preparation Tutorial


Currently, ElasticDL requires the input data in RecordIO format. This tutorial is to help users convert raw training data to the required RecordIO format. The RecordIO API is written in Golang and you can see how to use that here. Because we process our data via PySpark job, what we use is the python wrapper outside its Golang implementation.

This tutorial provides three approaches to process the data: local Python script, local PySpark job and PySpark job running on Google Cloud.

Python Script

If your data amount is small that can fit in your local disk, the following section is the approach you want to use. Here is a sample program to convert the MNIST dataset into RecordIO format. It should be straightforward to write your own script by mimicking the sample program.

Local PySpark Job

If your data amount is large but can fit in your local disk, and you want to process your data in parallel, the following section is the approach you want to use. You can set up your own Spark environment locally. But we recommend to run the PySpark job in the Docker container. Here are the steps to run our sample PySpark job, which processes the MNIST data, in the Docker container:

  1. Download MNIST sampled training data in jpg format here, and unzip it to your training data directory, and make the tar file: ```bash TRAINING_DATA_DIR= TAR_FILE=mnist_sampled_training_data.tar

tar -cvf $TRAINING_DATA_DIR/$TAR_FILE $TRAINING_DATA_DIR/mnist_sampled_training_data

1. Build Docker image:
    docker build -t elasticdl:data_process \
        -f elasticdl/docker/Dockerfile.data_process .

2. Run PySpark job in Docker Container:

    docker run --rm -v $OUTPUT_DIR:$MOUNTED_OUTPUT_DIR \
        elasticdl:data_process \
        /elasticdl/python/data/recordio_gen/sample_pyspark_recordio_gen/ \
        --training_data_tar_file=$MOUNTED_TRAINING_DATA_DIR/$TAR_FILE \
        --output_dir=$MOUNTED_OUTPUT_DIR  \
        --model_file=$MODEL_FILE \
    After the job is finished, you should see your data named in the format of `data-<worker_num_that_generate_this_data>-<chunk_num>` located in `OUTPUT_DIR`.
    1. If your PySpark job needs other dependencies in the image, you can create your own image derived from the sample Dockerfile.
    2. You need to provide your own [model file](/elasticdl/docs/tutorials/model_building/), from which we need the [feature](/elasticdl/docs/tutorials/model_building/#feature_columns) and [label](/elasticdl/docs/tutorials/model_building/#label_columns) columns, as well as the [user-defined data processing logic](prepare_data_for_a_single_file).
    3. There are some other arguments you can pass to our backbone PySpark file as you can see in the main() function of [](

## PySpark Job on Google Cloud

If your amount of data is so huge that it can't fit into your local disk, the following section is the approach you want to use. In this tutorial we use [Google Filestore]( as our training data storage. We also tried [Google Cloud Storage](, which is not a good fit for our use case(see [here]( Here are the steps to run our PySpark job on Google Cloud:
1. Set up the Google Cloud SDK and project following [here]( based on your OS.

2. [Create a Google Cloud Storage bucket](, and upload the [initialization script](../../elasticdl/python/data/recordio_gen/sample_pyspark_recordio_gen/, which will install all dependencies we need to Spark cluster, to bucket:

gsutil mb gs://$GS_BUCKET_NAME/
  1. Create a Dataproc cluster with the initialization actions by the script $GS_INIT_SCRIPT: ```bash CLUSTER_NAME=test-cluster

gcloud beta dataproc clusters create $CLUSTER_NAME \ –image-version=preview \ –optional-components=ANACONDA \ –initialization-actions $GS_INIT_SCRIPT

4. [Create a Google Filestore instance](, which is used to store our training data:

gcloud filestore instances create $FILESTORE_NAME \
    --project=$PROJECT_NAME \
    --location=us-west1-a \
    --file-share=name="elasticdl",capacity=1TB \
  1. Mount the Filestore to every node of your Spark cluster per here. In this tutorial, we mount it to /filestore_mnt.

  2. Copy the training data from local to Filestore:
    gcloud compute scp $TRAINING_DATA_DIR --recurse \
     $CLUSTER_NAME-m:/filestore_mnt/$TRAINING_DATA_DIR \
     --project elasticdl --zone us-west1-a
  3. Zip the elasticdl folder as the dependency for the PySpark job, which will be submitted together with PySpark job in the next step:
    zip -r elasticdl
  4. Submit the PySpark job:
    gcloud dataproc jobs submit pyspark \
     elasticdl/python/data/recordio_gen/sample_pyspark_recordio_gen/ \
     --cluster=$CLUSTER_NAME --region=global \
     --files=/model_zoo/mnist_functional_api/ \
     -- --training_data_tar_file=/filestore_mnt/$TAR_FILE \
     --output_dir=/filestore_mnt --model_file=$MODEL_FILE --records_per_file=200

After the job is finished, you can see four generated RecordIO files data-0-0000, data-0-0001, data-1-0000 and data-1-0001 located in the mounted directory /filestore_mnt.