own/metal
← LLM Runners tool · /VLLM

vLLM

High-throughput LLM serving for production GPU workloads.

// github

★ 81.8k

last commit · today

heavy GPU required Apache-2.0

// readme · what it is

vLLM is the production-grade serving engine behind a large share of public LLM endpoints. PagedAttention and continuous batching let one GPU sustain dramatically higher throughput than naive HuggingFace pipelines. Pick this when you're past the prototype stage and need to actually serve traffic.

// deploy notes

Requires CUDA. OpenAI-compatible endpoint. Tune `--gpu-memory-utilization` and `--max-model-len` for your card.