Overview
ElasticDL is a framework implements the swamp optimization meta-algorithm. It is like Apache Hadoop is a framework that implements the MapReduce parallel programming paradigm.
To program the ElasticDL framework, programmers need to provide at least one
nn.Module
-derived class that describes the specification of a model. It is
like programmers of Hadoop need to provide a class that implements the methods
of Map and Reduce.
To train a model, ElasticDL needs (1) hyperparameter values, and (2) the data. Each ElasticDL job uses the same data to train one or more models, where each model needs a set of hyperparameter values. A model could have more than one sets of hyperparameter values. In such a case, they are considered multiple models.
- A job is associated with a dataset.
- A job is associated with one or more model specifications, each model specification is a Python classed derived from torch.nn.Module.
- A model specification is associated with one or more sets of hyperparameter values.
- The pair of a model specification and a set of its hyperparameter values is a model.
- A job includes a coordinator process and one or more bee processes.
- A bee process trains one or more models.
- The coordinator dispatches models to bees.
The following example command line starts an ElasticDL job.
elasticdl start \
-model='MyModuleClass,a_param=[0.1:0.001:10:logscale],another_one="a string value"' \
-model='AnotherModuleClass,yet_a_param=[1:10:5]'
This example train the following models six models:
- MyModuleClass(a_param=0.1, another_one=”a string value”)
- MyModuleClass(a_param=0.01, another_one=”a string value”)
- MyModuleClass(a_param=0.001, another_one=”a string value”)
- AnotherModuleClass(yet_a_param=1)
- AnotherModuleClass(yet_a_param=5)
- AnotherModuleClass(yet_a_param=10)