Proto commits in volcengine/veScale

These are the three commits in which the Protocol Buffers files changed:

Commit: c4afc72
Author: MingjiHan
Committer: GitHub

[checkpoint] feat: open source fast checkpoint system (#38)

## Summary

We improved `vescale.checkpoint` with the following new features for fast checkpointing (the first three are built-in techniques that require no manual activation):

- **Saving Plan Caching**: During training, the program may save model and optimizer checkpoints every n steps. Once a saving plan is created, it remains unchanged as long as the model does. We implemented plan caching to avoid regenerating the plan when checkpointing a model or optimizer multiple times, reducing unnecessary compute and communication costs. As of 05/30/2024, PyTorch DCP does not support plan caching.
- **Saving Plan Load-Balancing**: In data parallel training, models are replicated across GPUs that have different data parallel ranks but the same pipeline and tensor parallel ranks. Existing PyTorch DCP (as of 05/30/2024) deduplicates replicated tensors with a simple algorithm that makes GPUs with data parallel rank 0 save the entire model, causing a load imbalance. We implemented a load-balancing algorithm for deduplicating model tensors that addresses this issue (see the first sketch after this commit message).
- **D2H Tensor Copying via Pinned Memory**: When copying tensors from GPU to host memory, `vescale.checkpoint` uses pinned host memory, avoiding the memory allocation cost on each checkpoint save (see the second sketch after this commit message). As of 05/30/2024, PyTorch DCP does not support pinned memory.
- **Checkpoint Broadcasting**: In data parallel training, models are replicated across GPUs that have different data parallel ranks but the same pipeline and tensor parallel ranks. If `broadcast_checkpoint` is enabled, `vescale.checkpoint.load` lets GPUs with data parallel rank 0 load the model and broadcast it to the GPUs with higher data parallel ranks. If GPUs are connected with NCCL and the I/O bandwidth is fully utilized, broadcasting model tensors speeds up checkpoint loading compared with every GPU loading the model from persistent storage. E.g.:

  ```python
  # prepare checkpoint state for the model and optimizer
  checkpoint_state = {
      "model": distributed_model,
      "optimizer": distributed_optimizer
  }
  # load the checkpoint
  vescale.checkpoint.load("/user/vescale/gpt/", checkpoint_state, broadcast_checkpoint=True)
  ```

- **Asynchronous Checkpointing**: When `vescale.checkpoint.save` is called, it first generates a saving plan and then synchronously copies tensors from GPU to host memory. If `async_checkpoint` is enabled, the training program can continue as soon as the D2H copy finishes, while `vescale.checkpoint.save` serializes the tensors and dumps the checkpoint to persistent storage asynchronously, without blocking training. As of 05/30/2024, PyTorch DCP does not support asynchronous checkpointing. E.g.:

  ```python
  # prepare checkpoint state for the model and optimizer
  checkpoint_state = {
      "model": distributed_model,
      "optimizer": distributed_optimizer
  }
  # save the checkpoint asynchronously
  vescale.checkpoint.save("/user/vescale/gpt/", checkpoint_state, async_checkpoint=True)
  ```

## Acknowledgement

We sincerely appreciate all contributors, including but not limited to @shanesyy-1992 @raywan-110 @lazychao @AHEADer @MingjiHan99
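The load-balancing idea above can be illustrated with a hypothetical sketch: replicated tensors are assigned round-robin to data parallel saver ranks instead of all falling on rank 0. The helper name and the round-robin policy are assumptions for illustration, not veScale's actual algorithm.

```python
# Hypothetical sketch of load-balanced deduplication across data parallel
# ranks; the round-robin policy is an illustrative assumption, not
# veScale's actual algorithm.
def assign_saver_ranks(tensor_names: list[str], dp_world_size: int) -> dict[str, int]:
    # Map each replicated tensor to the single dp rank that writes it,
    # spreading the write volume across ranks instead of piling it on rank 0.
    return {name: i % dp_world_size for i, name in enumerate(tensor_names)}

# With dp_world_size=4, consecutive tensors go to ranks 0, 1, 2, 3, 0, ...
plan = assign_saver_ranks([f"layer{i}.weight" for i in range(8)], dp_world_size=4)
# Each rank then saves only the tensors where plan[name] equals its dp rank.
```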

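And a minimal PyTorch sketch of the pinned-memory D2H copy technique, using standard PyTorch calls; this illustrates the general idea and is not `vescale.checkpoint`'s implementation.

```python
# Minimal PyTorch sketch of D2H copying via pinned (page-locked) host
# memory; illustrative only, not vescale.checkpoint's implementation.
import torch

def make_pinned_buffer(gpu_tensor: torch.Tensor) -> torch.Tensor:
    # Allocate the pinned host buffer once and reuse it across checkpoint
    # steps, instead of paying the allocation cost on every save.
    return torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True)

def d2h_copy(gpu_tensor: torch.Tensor, pinned_buf: torch.Tensor) -> torch.Tensor:
    # With a pinned destination, non_blocking=True lets the copy overlap
    # with compute; synchronize before reading the host data.
    pinned_buf.copy_(gpu_tensor, non_blocking=True)
    torch.cuda.synchronize()
    return pinned_buf

if torch.cuda.is_available():
    weight = torch.randn(1024, 1024, device="cuda")
    buf = make_pinned_buffer(weight)   # allocated once
    host_copy = d2h_copy(weight, buf)  # reused on every checkpoint save
```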
The documentation is generated from this commit.

Commit: 97735b1
Author: MackZackA
Committer: GitHub

added veDeviceMesh (#32)

This PR introduces veDeviceMesh, a device mesh API that integrates the handling of submeshes and process groups for training with DDP, TP/SP, the distributed optimizer, and checkpointing. It also brings the repository up to date with fixes and patches related to the veDeviceMesh API since the last PR.
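The PR text does not show veDeviceMesh's API itself. As a rough analogue, the sketch below uses PyTorch's upstream `DeviceMesh` to illustrate the submesh/process-group idea that such an API integrates; the mesh sizes and dimension names are assumptions.

```python
# Rough analogue using PyTorch's upstream DeviceMesh (not veDeviceMesh);
# mesh sizes and dimension names are illustrative assumptions.
# Run under torchrun with 8 processes; init_device_mesh sets up the
# default process group if one is not already initialized.
from torch.distributed.device_mesh import init_device_mesh

# A 2-D mesh over 8 ranks: 2-way data parallel x 4-way tensor parallel.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

# Slicing by dimension name yields a submesh, and each submesh carries the
# process group covering only the ranks along that dimension, which is what
# DDP, TP/SP, the distributed optimizer, and checkpointing consume.
dp_mesh = mesh["dp"]
tp_group = mesh["tp"].get_group()
```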

Commit: bbf2860
Author: MingjiHan
Committer: GitHub

[checkpoint] Open Source (#27)

In this PR, we open source `vescale.checkpoint`, a distributed LLM checkpointing system (a minimal usage sketch follows this commit message).

`vescale.checkpoint` offers simple and straightforward APIs, enabling users to load and save a distributed model (`DModule`) and optimizer (`DistributedOptimizer`) seamlessly, abstracting away the complexities of underlying details such as process ranks and device meshes.

`vescale.checkpoint` supports load-time checkpoint resharding when varying the degree of data, tensor, or pipeline (TODO) parallelism for both the veScale distributed model (`DModule`) and optimizer (`DistributedOptimizer`).

`vescale.checkpoint` incorporates [fast checkpointing](https://arxiv.org/abs/2402.15627) and various I/O optimization techniques, enhancing I/O efficiency during large language model training.

`vescale.checkpoint` will be part of the OmniStore project, a new open source project coming soon.

### Credit to veScale Checkpoint Team

This endeavor would not have been possible without the contributions of the veScale Checkpoint team, which includes but is not limited to: @shanesyy-1992 @MingjiHan99 @AHEADer @raywan-110 @michael4RD @lazychao @leochen-ai

Thanks also to the great guidance and leadership of: [@pengyanghua](https://github.com/pengyanghua) [@eric-haibin-lin](https://github.com/eric-haibin-lin) [@liwenchangbdbz](https://github.com/liwenchangbdbz) [@Meteorix](https://github.com/Meteorix)

### Credit to veScale Team

We sincerely acknowledge the assistance of and collaboration with the veScale team, which includes but is not limited to: [@leonardo0lyj](https://github.com/leonardo0lyj) @JsBlueCat @MackZackA @Vremold @jc-bytedance @lichen225

### Credit to PyTorch Distributed Checkpoint (DCP) Team

We sincerely acknowledge the assistance of and collaboration with the [PyTorch Distributed Checkpoint (DCP) team](https://github.com/pytorch/pytorch/tree/main/torch/distributed/checkpoint), which includes but is not limited to: @wz337 @kumpera @fegin @LucasLLC
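For context, here is a minimal usage sketch of the load/save APIs described above, following the signatures shown in commit c4afc72; the checkpoint path and the `distributed_model` / `distributed_optimizer` variables are placeholders.

```python
# Minimal usage sketch following the save/load signatures shown in
# commit c4afc72; the path and model/optimizer variables are placeholders.
import vescale

checkpoint_state = {
    "model": distributed_model,          # a veScale DModule
    "optimizer": distributed_optimizer,  # a veScale DistributedOptimizer
}

# Save a distributed checkpoint ...
vescale.checkpoint.save("/user/vescale/gpt/", checkpoint_state)

# ... and load it back, possibly under different data/tensor parallel
# degrees thanks to load-time resharding.
vescale.checkpoint.load("/user/vescale/gpt/", checkpoint_state)
```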