[Jul. 12, 2025] Our paper has been accepted at SOSP'25! 🎉 Please check our updated preprint at [Paper]. We are still refactoring the code and will update the repo soon. Stay tuned :)
[Nov. 6, 2024] PhOS is open sourced 🎉 [Repo] [Documentation]
👉 PhOS currently fully supports single-GPU checkpoint and restore
👉 We will soon release code for cross-node live migration and multi-GPU support :)
[May 20, 2024] The PhOS paper is now released on arXiv [Paper]
PhoenixOS (PhOS) is an OS-level GPU checkpoint/restore (C/R) system. It can transparently C/R processes that use the GPU, without requiring any cooperation from the application (though with cooperation it would be faster :)). Most importantly, PhOS is the first (and only) OS-level GPU C/R system that can concurrently execute C/R without stopping the execution of applications.
Concurrent execution brings substantial performance gains; see the comparison below against NVIDIA's cuda-checkpoint (nvidia/cuda-checkpoint):
[Figure: Checkpointing Llama2-13b-chat]
[Figure: Restoring Llama2-13b-chat]
PhOS aims to be a generic design across hardware platforms from different vendors: it defines a set of interfaces that each hardware platform implements. We currently provide the C/R implementation for the CUDA platform and plan to support ROCm and Ascend. We would welcome help from the community, because getting the whole C/R path done is non-trivial.
PhOS is currently under heavy development. If you're interested in contributing to this project, please join our Slack workspace for more upcoming cool features on PhOS.
[Clone Repository] First of all, clone this repository recursively:
git clone --recursive https://github.com/SJTU-IPADS/PhoenixOS.git
[Start Container] PhOS can be built and installed on an official vendor image.
NOTE: PhOS requires libc6 >= 2.29 for compiling CRIU from source.
For example, to run PhOS with CUDA 11.3, you can build on an official CUDA image (e.g., nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04):
# enter repository
cd PhoenixOS
# start container
sudo docker run -dit --gpus all \
    -v "$PWD":/root \
    --privileged --network=host --ipc=host \
    --name phos nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
# enter container
sudo docker exec -it phos /bin/bash
Note that it's important to run the Docker container with root privileges, as CRIU needs permission to C/R kernel-space memory pages.
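The libc6 requirement noted above can be checked from inside the container before building. The following is a minimal sketch, assuming a glibc-based image with GNU coreutils (for `sort -V`); the helper name `version_ge` is hypothetical and not part of PhOS.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PhOS): succeeds if version $1 >= version $2.
version_ge() {
    # sort -V orders version strings; if the smaller of the two is $2, then $1 >= $2
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# getconf prints e.g. "glibc <version>" on glibc-based systems; keep the version field.
# "|| true" keeps the script going if getconf is unavailable.
glibc_version="$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}' || true)"

if version_ge "${glibc_version:-0}" "2.29"; then
    echo "glibc ${glibc_version} satisfies the >= 2.29 requirement"
else
    echo "glibc ${glibc_version:-unknown} is older than 2.29; compiling CRIU may fail"
fi
```

If the check reports an older version, picking a newer base image is likely simpler than upgrading libc6 in place.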
[Downloading Necessary Assets] PhOS relies on some assets to build and test. Download them by running the following commands:
# inside container
# install basic dependencies from OS pkg manager
apt-get update
apt-get install git wget
# download assets
cd /root/scripts/build_scripts
bash download_assets.sh
[Build] Building PhOS is simple!
PhOS provides a convenient build system, which covers compiling, linking and installing all PhOS components:
| Component | Description |
|---|---|
| phos-autogen | Autogen Engine for generating most of the Parser and Worker code for a specific hardware platform, based on lightweight notation. |
| phosd | PhOS Daemon, which runs continuously in the background, taking over control of all GPU devices on the node. |
| libphos.so | PhOS Hijacker, which hijacks all GPU API calls on the client side and forwards them to the PhOS Daemon. |
| libpccl.so | PhOS Checkpoint Communication Library (PCCL), which provides highly optimized device-to-device state migration. Note that this library is not included in the current release. |
| unit-testing | Unit Tests for PhOS, which are based on GoogleTest. |
| phos-cli | Command Line Interface (CLI) for interacting with PhOS. |
| phos-remoting | Remoting Framework, which provides highly optimized GPU API remoting performance. See more details at SJTU-IPADS/PhoenixOS-Remoting. |
To build and install all of the above components and other dependencies, simply run the build script in the container:
# inside container
cd /root/scripts/build_scripts
# clear old build cache
# -c: clear previous build
# -3: the clean process involves all third-parties
bash build.sh -c -3
# start building
# -3: the build process involves all third-parties
# -i: install after successful building
bash build.sh -3 -i
To customize build options, refer to and modify the available options in scripts/build_scripts/build_config.yaml.
If you encounter any build issues, you can find the build logs under build_log. Please open a new issue if things are stuck :-|
Will soon be updated, stay tuned :)
Once PhOS is successfully installed, you can run your program with PhOS support!
For more details, refer to the examples for step-by-step tutorials on running PhOS.
[Start phosd and your program] Start the PhOS daemon (phosd), which takes over all GPU resources on the node:
pos_cli --start --target daemon
To run your program with PhOS support, you need to put a YAML configuration file in the directory that your program regards as $PWD. This file contains all the information PhOS needs to hijack your program. An example file looks like:
# [Field] name of the job
# [Note] jobs with the same name share some resources in posd, e.g., CUModule, etc.
job_name: "llama2-13b-chat-hf"
# [Field] remote address of posd, default is local
daemon_addr: "127.0.0.1"
You are ready to launch! Run your program with the env $phos prefix, for example:
env $phos python3 train.py
To pre-dump your program, which saves the CPU & GPU state without stopping its execution, simply run:
# create directory to store checkpoint files
mkdir -p /root/ckpt
# pre-dump command
pos_cli --pre-dump --dir /root/ckpt --pid [your program's pid]
To dump your program, which saves the CPU & GPU state and stops its execution, simply run:
# create directory to store checkpoint files
mkdir -p /root/ckpt
# dump command
pos_cli --dump --dir /root/ckpt --pid [your program's pid]
To restore your program, simply run:
# restore command
pos_cli --restore --dir /root/ckpt
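The pre-dump, dump and restore commands above differ only in the mode flag and in whether a PID is passed. As a rough sketch of how they relate, the hypothetical helper below (`phos_ckpt_cmd`, not part of PhOS) prints the matching `pos_cli` command line for a given mode without executing it. It uses only the flags shown above; the `CKPT_DIR` variable and the PID `12345` are placeholders introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper (not part of PhOS): prints the pos_cli command
# for a given mode instead of running it.
CKPT_DIR="${CKPT_DIR:-/root/ckpt}"   # assumed checkpoint directory

phos_ckpt_cmd() {
    local mode="$1" pid="${2:-}"
    case "$mode" in
        pre-dump|dump)
            # both modes need the PID of the running program
            if [ -z "$pid" ]; then
                echo "usage: phos_ckpt_cmd $mode <pid>" >&2
                return 1
            fi
            echo "pos_cli --$mode --dir $CKPT_DIR --pid $pid"
            ;;
        restore)
            # restore takes only the checkpoint directory; no PID is passed
            echo "pos_cli --restore --dir $CKPT_DIR"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}

# Dry run: one pre-dump, then a dump, then a restore (12345 is a placeholder PID)
phos_ckpt_cmd pre-dump 12345
phos_ckpt_cmd dump 12345
phos_ckpt_cmd restore
```

Replace the placeholder PID with your program's PID, and run the printed command once it looks right.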
For more details, please check our paper.
If you use PhOS in your research, please cite our paper:
@inproceedings{phoenixos,
title={PhoenixOS: Concurrent {OS}-level GPU Checkpoint and Restore with Validated Speculation},
author={Wei, Xingda and Huang, Zhuobin and Sun, Tianle and Hao, Yingyi and Chen, Rong and Han, Mingcong and Gu, Jinyu and Chen, Haibo},
booktitle={Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles},
year={2025}
}
Please check mailmap for all contributors.