| Documentation | Paper | Discord | WeChat |
ServerlessLLM (sllm, pronounced "slim") is an open-source serverless framework designed to make custom and elastic LLM deployment easy, fast, and affordable. As LLMs grow in size and complexity, deploying them on AI hardware has become increasingly costly and technically challenging, limiting custom LLM deployment to only a select few. ServerlessLLM addresses these challenges with a full-stack, LLM-centric serverless system design, optimizing everything from checkpoint formats and inference runtimes to the storage layer and cluster scheduler.
Curious about how it works under the hood? Check out our System Walkthrough for a deep dive into the technical design, useful whether you're exploring your own research ideas or building with ServerlessLLM.
ServerlessLLM is designed to let multiple LLMs efficiently share limited AI hardware and switch between models dynamically on demand, increasing hardware utilization and reducing the cost of LLM services. This multi-LLM scenario, commonly referred to as serverless inference, is highly sought after by AI practitioners, as seen in offerings such as Serverless Inference, Inference Endpoints, and Model Endpoints. However, these existing offerings often face performance overhead and scalability challenges, which ServerlessLLM addresses through three key capabilities:
ServerlessLLM is Fast:
ServerlessLLM is Cost-Efficient:
ServerlessLLM is Easy-to-Use:
# On the head node
conda create -n sllm python=3.10 -y
conda activate sllm
pip install serverless-llm
# On a worker node
conda create -n sllm-worker python=3.10 -y
conda activate sllm-worker
pip install serverless-llm[worker]
Start a local ServerlessLLM cluster using the Quick Start Guide.
Want to try fast checkpoint loading in your own code? Check out the ServerlessLLM Store Guide.
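For a taste of what the Store Guide covers, here is a minimal sketch of converting a HuggingFace model into ServerlessLLM's loading-optimized checkpoint format and loading it back. It assumes the sllm-store package is installed, the store server is running locally, and checkpoints live under a `./models` directory; the exact signatures of `save_model` and `load_model` (e.g., `storage_path`, `fully_parallel`) may differ between versions, so treat the Store Guide as the reference.

```python
import torch
from transformers import AutoModelForCausalLM

# Convert a HuggingFace checkpoint into ServerlessLLM's loading-optimized format.
from sllm_store.transformers import save_model, load_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", torch_dtype=torch.float16)
save_model(model, "./models/facebook/opt-1.3b")  # writes the converted checkpoint

# Later, with the sllm-store server running (see the Store Guide), load the
# checkpoint back onto the GPU with ServerlessLLM's fast loader.
model = load_model(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
    storage_path="./models/",
    fully_parallel=True,
)
```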
For detailed installation instructions, please follow the steps outlined in our documentation. ServerlessLLM also offers Python APIs for loading and unloading checkpoints, as well as CLI tools for launching an LLM cluster. Both the CLI tools and APIs are demonstrated in the documentation, and a querying example is sketched below.
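As an illustration, the sketch below sends a chat completion request to a running cluster's OpenAI-compatible endpoint. The address 127.0.0.1:8343 and the model name facebook/opt-1.3b are assumptions borrowed from the Quick Start defaults; substitute the values from your own deployment.

```python
import requests

# Send a chat completion request to a local ServerlessLLM cluster through its
# OpenAI-compatible endpoint. The URL and model name below are assumptions;
# use the values from your own deployment (see the Quick Start Guide).
response = requests.post(
    "http://127.0.0.1:8343/v1/chat/completions",
    json={
        "model": "facebook/opt-1.3b",
        "messages": [{"role": "user", "content": "What is serverless inference?"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```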
Benchmark results for ServerlessLLM can be found here.
ServerlessLLM is maintained by a global team of over 10 developers, and the team is growing. If you're interested in learning more or getting involved, we invite you to join our community on Discord and WeChat. Share your ideas, ask questions, and contribute to the development of ServerlessLLM. To become a contributor, please refer to our Contributor Guide.
If you use ServerlessLLM for your research, please cite our paper:
@inproceedings{fu2024serverlessllm,
  title={ServerlessLLM: Low-Latency Serverless Inference for Large Language Models},
  author={Fu, Yao and Xue, Leyang and Huang, Yeqi and Brabete, Andrei-Octavian and Ustiugov, Dmitrii and Patel, Yuvraj and Mai, Luo},
  booktitle={18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)},
  pages={135--153},
  year={2024}
}