Training
This page serves as a guide to the different options for training a model in the platform.
Training options
There are currently three main options to train a model in the platform:
standard mode: you are given access to a persistent deployment that you can interact with via an IDE (e.g. VS Code).
batch mode: you deploy a temporary job that runs your training and is killed once the training completes.
federated mode: you deploy a federated learning server that orchestrates the training. Several clients can then join and share the training load among them.
Each of these options has its own pros and cons.
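To make the role of the federated server more concrete, here is a minimal sketch of one aggregation round. It assumes the server uses sample-weighted federated averaging, which is a common strategy but is not confirmed by this page as the platform's actual method. The function name `federated_average` and the weight layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: layer name -> flat list of parameters.
Weights = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Combine client weights, weighting each client by its sample count."""
    total = sum(n_samples for _, n_samples in updates)
    if total <= 0:
        raise ValueError("at least one client must contribute samples")

    averaged: Weights = {}
    for weights, n_samples in updates:
        share = n_samples / total  # this client's contribution to the average
        for name, values in weights.items():
            acc = averaged.setdefault(name, [0.0] * len(values))
            for i, value in enumerate(values):
                acc[i] += share * value
    return averaged


# Two illustrative clients that trained locally on 100 and 300 samples.
client_a = ({"layer": [1.0, 2.0]}, 100)
client_b = ({"layer": [3.0, 4.0]}, 300)
global_weights = federated_average([client_a, client_b])
```

In this sketch only the weights and sample counts reach the server, which is why this style of training is often considered when the raw data has to stay with each client.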
| Option | ✅ Pros | ❌ Cons |
|---|---|---|
| Standard mode (persistent deployment) | | |
| Batch mode (temporary jobs) | | |
Given the above specifications, we recommend the following typical workflows:
Use standard mode for your preliminary training runs, when you might still need direct access to the code and data to debug things.
Use batch mode when your training script is stable and you are mostly tweaking hyperparameters.
Use federated mode if you have sensitive data and/or need to distribute your training across many machines.
Note
Bear in mind that all the modules of the platform are fully open-source and packaged as Docker containers. So if you have access to your own HPC resources, you can always take a module's container and run the training there using udocker.
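The udocker route mentioned in the note can be sketched as a short sequence of commands: pull the image, create a named container, then run the training inside it. The helper below only builds those command lines. The image name, container name, training command, and data path are all hypothetical placeholders, and the exact options you need will depend on your HPC setup.

```python
import shlex
from typing import List, Optional


def udocker_commands(
    image: str,
    name: str,
    train_cmd: List[str],
    data_dir: Optional[str] = None,
) -> List[List[str]]:
    """Build the udocker pull / create / run sequence for one training run."""
    run = ["udocker", "run"]
    if data_dir:
        # Mount a host directory at /data inside the container (assumed path).
        run.append(f"--volume={data_dir}:/data")
    run += [name, *train_cmd]
    return [
        ["udocker", "pull", image],
        ["udocker", "create", f"--name={name}", image],
        run,
    ]


# Hypothetical values; replace them with the module image you actually use.
commands = udocker_commands(
    image="myorg/my-module:latest",
    name="my-training",
    train_cmd=["python", "train.py"],
    data_dir="/scratch/datasets",
)

for cmd in commands:
    print(shlex.join(cmd))
```

To execute the steps rather than print them, each list can be passed to `subprocess.run(cmd, check=True)` on a machine where udocker is installed.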