
English | 中文

News

Introduction

Built on a C++ runtime, DashInfer aims to deliver production-level implementations that are highly optimized for various hardware architectures, including CUDA, x86, and ARMv9.

Main Features

DashInfer is a highly optimized LLM inference engine with the following core features:

Supported Hardware and Data Types

Hardware

Data Types

Quantization

DashInfer provides a variety of quantization techniques for LLM weights, such as int8 and int4 weight-only quantization and int8 activation quantization, along with many customized fused kernels that deliver the best performance on the target device.
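As a rough illustration of what per-channel, weight-only int8 quantization does, here is a minimal NumPy sketch; it is conceptual only and does not reflect DashInfer's actual kernels or API:

import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a weight matrix.

    w: float32 weights of shape [out_features, in_features].
    Returns int8 weights plus one float32 scale per output channel.
    """
    # Pick each channel's scale so its largest magnitude maps to 127.
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 127.0, 1e-8)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # At run time the engine dequantizes on the fly (typically fused into the GEMM kernel).
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_per_channel_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small per-channel reconstruction error

Weight-only schemes like this mainly reduce memory footprint and bandwidth; int8 activation quantization additionally quantizes the GEMM inputs.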

To put it simply, models fine-tuned with GPTQ will provide better accuracy, while our InstantQuant (IQ) technique, which does not require fine-tuning, offers a faster deployment experience. A detailed explanation of IQ quantization can be found at the end of this document.

In terms of supported quantization algorithms, DashInfer supports models fine-tuned with GPTQ as well as dynamic quantization via the IQ technique, in two ways:

The quantization strategies introduced here can be broadly divided into two categories:

In terms of quantization granularity, there are two types:

Documentation and Example Code

Documentation

For the detailed user manual, please refer to the documentation: Documentation Link.

Quick Start:

  1. Using the API: Python Quick Start
  2. LLM OpenAI Server: Quick Start Guide for OpenAI API Server
  3. VLM OpenAI Server: VLM Support

Feature Introduction:

  1. Prefix Cache
  2. Guided Decoding
  3. Engine Config

Development:

  1. Development Guide
  2. Build From Source
  3. OP Profiling
  4. Environment Variable

Code Examples

Examples of the C++ and Python interfaces are provided in <path_to_dashinfer>/examples; please refer to the documentation in <path_to_dashinfer>/documents/EN to run them.

Multi-Modal Model(VLMs) Support

VLM support lives in the multimodal folder. It is a toolkit for Vision Language Model (VLM) inference built on the DashInfer engine, compatible with the OpenAI Chat Completion API and supporting text as well as image/video inputs.
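As an example, a running VLM server can be queried with the standard OpenAI Python client; the base URL, API key, and model name below are illustrative placeholders rather than DashInfer defaults:

from openai import OpenAI

# Point the standard OpenAI client at the locally running, OpenAI-compatible server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen-vl",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)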

Performance

We have conducted several benchmarks to compare the performance of mainstream LLM inference engines.

Multi-Modal Model (VLMs)

We compared the performance of Qwen-VL between DashInfer and vLLM across various model sizes:

img_1.png

Benchmarks were conducted using an A100-80Gx1 for 2B and 7B sizes, and an A100-80Gx4 for the 72B model. For more details, please refer to the benchmark documentation.

Prefix Cache

We evaluated the performance of the prefix cache at different cache hit rates:

dahsinfer-benchmark-prefix-cache.png

The chart above shows the reduction in TTFT (Time to First Token) with varying PrefixCache hit rates in DashInfer.

dashinfer-prefix-effect.png
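Conceptually, a prefix cache keys previously computed KV blocks by the token prefix, so a request that shares a prefix with earlier traffic skips that part of prefill, and TTFT drops roughly in proportion to the hit rate. A toy Python sketch of the idea (not DashInfer's internal data structures):

import hashlib

BLOCK = 16  # tokens per cached KV block (illustrative)

class PrefixCache:
    """Toy prefix cache: maps a hash of each token-prefix block to a stored KV block id."""

    def __init__(self):
        self.blocks = {}  # prefix hash -> KV block id

    @staticmethod
    def _key(tokens):
        return hashlib.sha1(str(tokens).encode()).hexdigest()

    def match(self, tokens):
        """Return how many leading tokens are already covered by cached KV blocks."""
        hit = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if self._key(tokens[:end]) in self.blocks:
                hit = end
            else:
                break
        return hit

    def insert(self, tokens, kv_block_ids):
        for i, end in enumerate(range(BLOCK, len(tokens) + 1, BLOCK)):
            self.blocks[self._key(tokens[:end])] = kv_block_ids[i]

# A request sharing a long prefix with a previous one only needs to prefill the tail:
cache = PrefixCache()
prompt_a = list(range(64))
cache.insert(prompt_a, kv_block_ids=[0, 1, 2, 3])
prompt_b = prompt_a[:48] + [999] * 16
cached = cache.match(prompt_b)
print(f"prefill only {len(prompt_b) - cached} of {len(prompt_b)} tokens")  # 16 of 64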

Test Setup:

Guided Decoding (JSON Mode)

We compared guided output (in JSON format) across different engines using the same request with a customized JSON schema (context length: 45, generated length: 63):

dashinfer-benchmark-json-mode.png
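For reference, such a request can be expressed through an OpenAI-compatible endpoint roughly as follows; the response_format JSON-schema field follows the OpenAI convention, and the endpoint, model name, and exact parameter support are assumptions here, not confirmed DashInfer settings:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")  # placeholder endpoint

# A small customized schema the model output must conform to.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="qwen",  # placeholder model name
    messages=[{"role": "user", "content": "Give me a fictional person as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": schema},
    },
)
print(response.choices[0].message.content)  # constrained to match the schema above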

Subprojects

  1. HIE-DNN: an operator library for high-performance inference of deep neural networks (DNNs).
  2. SpanAttention: a high-performance decode-phase attention implementation with a paged KV cache for LLM inference on CUDA-enabled devices (a conceptual sketch follows below).
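To sketch what a paged KV cache means in practice (a NumPy toy, not SpanAttention's CUDA implementation): each sequence's KV cache lives in fixed-size blocks scattered through a shared pool, and a per-sequence block table maps logical token positions to physical blocks during decode-phase attention.

import numpy as np

BLOCK, HEAD_DIM, NUM_BLOCKS = 16, 64, 256

# A physical pool of KV blocks; each sequence owns a list of block indices (its block table).
k_pool = np.zeros((NUM_BLOCKS, BLOCK, HEAD_DIM), dtype=np.float32)
v_pool = np.zeros((NUM_BLOCKS, BLOCK, HEAD_DIM), dtype=np.float32)

def decode_attention(q, block_table, seq_len):
    """Single-head decode-phase attention over a paged KV cache.

    q: [HEAD_DIM] query of the newest token.
    block_table: physical block indices for this sequence, in logical order.
    seq_len: number of valid cached tokens.
    """
    # Gather the logical K/V from the scattered physical blocks.
    k = np.concatenate([k_pool[b] for b in block_table])[:seq_len]  # [seq_len, HEAD_DIM]
    v = np.concatenate([v_pool[b] for b in block_table])[:seq_len]
    scores = k @ q / np.sqrt(HEAD_DIM)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ v  # [HEAD_DIM]

# Example: a sequence of 40 cached tokens spread over physical blocks 7, 3, and 11.
table = [7, 3, 11]
k_pool[table], v_pool[table] = np.random.randn(2, 3, BLOCK, HEAD_DIM)
out = decode_attention(np.random.randn(HEAD_DIM), table, seq_len=40)
print(out.shape)  # (64,)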

Citation

The high-performance implementation of the DashInfer MoE operator is introduced in this paper, and DashInfer employs the efficient top-k operator RadiK. If you find them useful, please feel free to cite these papers:

@misc{dashinfermoe2025,
  title = {Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference}, 
  author = {Yinghan Li and Yifei Li and Jiejing Zhang and Bujiao Chen and Xiaotong Chen and Lian Duan and Yejun Jin and Zheng Li and Xuanyu Liu and Haoyu Wang and Wente Wang and Yajie Wang and Jiacheng Yang and Peiyang Zhang and Laiwen Zheng and Wenyuan Yu},
  year = {2025},
  eprint = {2501.16103},
  archivePrefix = {arXiv},
  primaryClass = {cs.DC},
  url = {https://arxiv.org/abs/2501.16103}
}

@inproceedings{radik2024,
  title = {RadiK: Scalable and Optimized GPU-Parallel Radix Top-K Selection},
  author = {Li, Yifei and Zhou, Bole and Zhang, Jiejing and Wei, Xuechao and Li, Yinghan and Chen, Yingda},
  booktitle = {Proceedings of the 38th ACM International Conference on Supercomputing},
  year = {2024}
}

Future Plans

License

The DashInfer source code is licensed under the Apache 2.0 license, and you can find the full text of the license in the root of the repository.