vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs
vLLM hits GitHub Trending on the strength of its LLM inference performance: PagedAttention-based KV-cache management, continuous batching, speculative decoding, FlashAttention integration, and support for multiple quantization schemes (AWQ, GPTQ, INT4, INT8, FP8) to raise throughput and cut memory costs. Originally developed at UC Berkeley, the project is now community-driven, with contributions spanning academia and industry.
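For readers who have not tried the engine, here is a minimal offline-inference sketch using vLLM's Python API, assuming the vllm package is installed; the model name is just an example AWQ checkpoint and not something the post itself recommends.

```python
# Minimal vLLM offline-generation sketch (assumes `pip install vllm`).
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one sentence.",
    "What problem does PagedAttention solve?",
]

# Basic sampling settings for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# Example AWQ-quantized checkpoint (placeholder); quantization="awq" tells
# vLLM to load AWQ weights instead of full-precision ones.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

# generate() schedules all prompts together, so batching happens internally.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```

The same engine can be exposed as an OpenAI-compatible HTTP server via `vllm serve <model>`, which is how most deployments consume it.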
GitHub Trending · GitHub repo