DLRover makes distributed training of large AI models easy, stable, fast, and green. It can automatically train deep learning models on a distributed cluster, letting model developers focus on model architecture without handling engineering concerns such as hardware acceleration and distributed execution. It now provides automated operation and maintenance for deep learning training jobs on K8s/Ray. Major features:
Furthermore, DLRover offers extension libraries for PyTorch and TensorFlow to expedite training. These are also open-source projects available in our GitHub repositories.
DLRover can restore the training when the process fails without stopping the training job. The actions to restore training in DLRover are:
For details, see the blog on fault tolerance and elasticity. With fault tolerance, the goodput of GLM-65B training on thousands of GPUs increased from 69% to 95%. Goodput is the time spent computing useful new steps divided by the elapsed time of the training job. The downtime details are shown below:
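The goodput metric defined above is a simple ratio; a minimal illustration (the numbers here are arbitrary and not taken from the GLM-65B run):

```python
def goodput(useful_compute_seconds: float, elapsed_seconds: float) -> float:
    """Goodput = time spent computing useful new steps / total elapsed time."""
    return useful_compute_seconds / elapsed_seconds

# A job that spent 95 of every 100 hours on useful steps has 95% goodput.
print(f"{goodput(95.0, 100.0):.0%}")  # prints "95%"
```

Time lost to failure detection, restarts, and re-computing rolled-back steps all reduce the numerator, which is why fault tolerance and fast checkpointing raise goodput.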
In addition to fault tolerance, DLRover provides flash checkpoint to save and load checkpoints in seconds. With flash checkpoint, training can save checkpoints frequently, reducing the number of steps rolled back when resuming from the latest checkpoint after a failure. The features of flash checkpoint are:
The Performance of DLRover Flash Checkpoint to Save/Load GPT2-1.5B.
The figure illustrates the I/O time that different DL frameworks spend reading checkpoint files when resuming training. With DLRover Flash Checkpoint, recovery completes on the order of seconds by loading checkpoints directly from shared memory, which is much faster than loading them from SSD or NAS.
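A minimal sketch of the shared-memory idea behind flash checkpoint (illustrative only; DLRover's real implementation lives in its PyTorch extension, and the function names below are hypothetical):

```python
import pickle
from multiprocessing import shared_memory

def save_to_shm(state: dict, name: str) -> shared_memory.SharedMemory:
    """Serialize a (toy) training state into shared memory.

    Writing to shared memory blocks the training loop only for a memory
    copy, far shorter than a direct write to SSD/NAS; a background process
    can then persist the bytes to durable storage asynchronously."""
    data = pickle.dumps(state)
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(data))
    shm.buf[: len(data)] = data
    return shm

def load_from_shm(name: str, size: int) -> dict:
    """Resume by reading the checkpoint straight back from shared memory."""
    shm = shared_memory.SharedMemory(name=name)
    state = pickle.loads(bytes(shm.buf[:size]))
    shm.close()
    return state

ckpt = {"step": 1200, "weights": [0.1, 0.2, 0.3]}
shm = save_to_shm(ckpt, "toy_flash_ckpt")
restored = load_from_shm("toy_flash_ckpt", len(pickle.dumps(ckpt)))
shm.close()
shm.unlink()
assert restored == ckpt
```

Because the shared-memory segment outlives a crashed training process on the same node, a relaunched process can reload from it without touching remote storage, which is the source of the seconds-level recovery time.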
DLRover can recover failed parameter servers and workers to resume training.
At AntGroup, DLRover manages hundreds of DL training jobs every day on AntGroup's customized Kubernetes cluster. Excluding jobs that failed due to code errors, the rate of completed jobs increased from 89% with tf-operator in KubeFlow to 95%. Other unrecoverable causes of job failure include data errors, NaN loss of the model, network breakdown, and so on.
DLRover automatically scales resources (parameter servers or workers) up or down at the runtime of a training job. By monitoring the workload of nodes and the throughput, DLRover can diagnose bottlenecks in the resource configuration. Common bottlenecks include node stragglers, unbalanced PS workload, insufficient CPU cores per node, and an insufficient number of nodes. DLRover improves training performance through dynamic resource adjustment.
To improve training throughput, users tend to configure their jobs with over-provisioned resources to avoid any risk of insufficient resources, which usually results in significant waste. DLRover auto-scaling allocates resources according to the demand of model training, reducing this waste.
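As an illustration of the kind of decision auto-scaling makes, here is a toy heuristic (not DLRover's actual algorithm): scale up while workers are saturated and the last scale-up still paid off in throughput, and scale down when workers sit idle.

```python
def suggest_worker_count(current_workers: int,
                         cpu_util: float,
                         throughput_gain_last_scale: float) -> int:
    """Toy scaling heuristic.

    cpu_util: average worker CPU utilization in [0, 1].
    throughput_gain_last_scale: relative throughput improvement observed
    after the previous scale-up (0.0 if none)."""
    if cpu_util > 0.9 and throughput_gain_last_scale > 0.05:
        return current_workers + 1          # bottlenecked: scale up
    if cpu_util < 0.3:
        return max(1, current_workers - 1)  # over-provisioned: scale down
    return current_workers                  # configuration looks balanced

print(suggest_worker_count(4, 0.95, 0.20))  # busy and still gaining -> 5
print(suggest_worker_count(4, 0.20, 0.0))   # mostly idle -> 3
```

The real system also has to account for stragglers and PS load imbalance, but the feedback loop — observe workload and throughput, then adjust node counts — is the same.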
Dynamic data sharding splits the dataset into many small shards, each containing only a few batches of training samples. A worker gets a new shard only after it uses up the samples of the previous one, so the uncompleted shard of a failed worker can be reassigned to another worker, and data consumption stays balanced when the number of workers changes.
With the data source transparency provided by dynamic data sharding, DLRover can be integrated with offline training which consumes batch data, and also supports online learning with real-time streaming data. (fed with a message queue like RocketMQ/Kafka/Pulsar/..., or executed as a training sink node inside Flink/Spark/Ray/...)
In practice, DLRover is an ideal component for building an end-to-end industrial online learning system; estimator.md provides a detailed example implemented with tf.estimator.Estimator.
We can use dlrover-run to run any training script that torchrun or torch.distributed.run can run.
pip install dlrover[torch]
dlrover-run --nnodes=1 --nproc_per_node=$NUM_TRAINERS train_scripts.py
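Since dlrover-run is torchrun-compatible, the launched script can read the usual elastic-launch environment variables. A minimal skeleton for the train_scripts.py referenced above (the body is illustrative, not a complete training loop):

```python
import os

def dist_env():
    """Read the per-process environment that dlrover-run (like torchrun)
    injects into every worker process it launches."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    return rank, world_size, local_rank

if __name__ == "__main__":
    rank, world_size, local_rank = dist_env()
    print(f"rank {rank}/{world_size}, local rank {local_rank}")
    # ... initialize torch.distributed and run the training loop here ...
```

Each of the $NUM_TRAINERS processes sees its own RANK and LOCAL_RANK, which is what the normal PyTorch distributed setup consumes.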
More detailed tutorials:
We can use DLRover to train a TensorFlow model with the following steps:
Define the inputs of tf.dataset in a training configuration of DLRover. We can refer to estimator.md to train a model with DLRover.
Please refer to the DEVELOPMENT guide.
An Example of Flash Checkpoint.
Train a PyTorch Model on Kubernetes.
Train a GPT Model on Kubernetes.
Train a TensorFlow Estimator on Kubernetes.
You are welcome to scan the DingTalk QR code or search "AI Infra" in WeChat (微信) to join the DLRover group. The DingTalk QR code is: