← LLM Runners tool · /VLLM
vLLM
High-throughput LLM serving for production GPU workloads.
// github
★ 81.8k
last commit · today
heavy GPU required Apache-2.0
// readme · what it is
vLLM is the production-grade serving engine behind a large share of public LLM endpoints. PagedAttention and continuous batching let one GPU sustain dramatically higher throughput than naive HuggingFace pipelines. Pick this when you're past the prototype stage and need to actually serve traffic.
// deploy notes
Requires CUDA. OpenAI-compatible endpoint. Tune `--gpu-memory-utilization` and `--max-model-len` for your card.
[ ALTERNATIVE TO ]
// links
- website https://docs.vllm.ai
- github github.com/vllm-project/vllm