BEVFUSION

Updated 23 days ago

ID: 50280063/25

hanlab.mit.edu

Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight,..
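As a rough illustration of the "4-bit weight" part of W4A8KV4 and of the dequantization step the description identifies as a source of overhead, here is a minimal sketch of symmetric per-group INT4 quantization in plain Python. It is not QoQ's algorithm or QServe's API: the function names `quantize_int4` and `dequantize_int4`, the group size, and the symmetric scaling scheme are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Signed 4-bit integers cover the range [-8, 7].
INT4_MIN, INT4_MAX = -8, 7


def quantize_int4(weights: List[float], group_size: int = 4) -> Tuple[List[int], List[float]]:
    """Hypothetical helper: symmetric per-group quantization to signed 4-bit codes.

    Returns the integer codes and one floating-point scale per group.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; use 1.0 for an all-zero group.
        scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            # Clamp to the representable 4-bit range.
            codes.append(max(INT4_MIN, min(INT4_MAX, q)))
    return codes, scales


def dequantize_int4(codes: List[int], scales: List[float], group_size: int = 4) -> List[float]:
    """Hypothetical helper: recover approximate floats from codes and group scales.

    This per-element multiply is the kind of extra work that must run
    before (or alongside) the matrix multiply when weights are stored in 4 bits.
    """
    return [q * scales[i // group_size] for i, q in enumerate(codes)]
```

Calling `quantize_int4(weights)` followed by `dequantize_int4(codes, scales)` returns values that differ from the originals by at most half a group scale. The sketch shows only the arithmetic; it says nothing about how a GPU kernel would schedule this work.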

Also known as: MIT HAN Lab

Interest Score

HIT Score

0.90

Domain

bevfusion.mit.edu

Actual

hanlab.mit.edu

18.25.16.171

Status

Category

Company
