This guide covers multiple installation methods for vLLM, including pip, Docker, and building from source.
| Component | Minimum | Recommended |
|---|---|---|
| OS | Linux, macOS, Windows (WSL2) | Linux (Ubuntu 20.04+) |
| Python | 3.9+ | 3.10+ |
| GPU VRAM | 8 GB+ | 24 GB+ per GPU |
| RAM | 16 GB | 64 GB+ |
| Storage | 50 GB SSD | 500 GB+ NVMe |
| CUDA | 11.8+ | Latest stable |
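To see why the table recommends 24 GB+ of VRAM per GPU, a rough back-of-envelope check helps: in fp16/bf16 each parameter takes 2 bytes, so model weights alone for a 7B model need about 13 GiB, before any KV cache or activation memory. A minimal sketch (the function name is illustrative, not part of vLLM):

```python
def estimate_weight_vram_gib(num_params_billion, bytes_per_param=2):
    """Rough VRAM needed for model weights only (fp16/bf16 = 2 bytes/param).

    Excludes KV cache and activations, which vLLM also keeps on the GPU.
    """
    return num_params_billion * 1e9 * bytes_per_param / 1024**3

# A 7B model in fp16 needs ~13 GiB for weights alone, which is why an
# 8 GB card is tight even with quantization and 24 GB+ is recommended.
print(round(estimate_weight_vram_gib(7), 1))
```

Add a healthy margin on top of this figure, since vLLM pre-allocates most of the remaining VRAM for the KV cache.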
vLLM supports multiple hardware backends, including NVIDIA GPUs (CUDA) and AMD GPUs (ROCm).
```bash
# Create a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate   # Linux/macOS
# or
vllm-env\Scripts\activate      # Windows

# Install vLLM
pip install vllm

# For CUDA 12.1
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.8
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118

# For AMD ROCm 6.0
pip install vllm --extra-index-url https://download.pytorch.org/whl/rocm6.0
```
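After installation, it is worth confirming that the package is importable from the active environment before starting a server. A small sketch (the helper name `vllm_installed` is illustrative):

```python
import importlib.util


def vllm_installed():
    """Return True if the vllm package is importable in this environment."""
    return importlib.util.find_spec("vllm") is not None


if vllm_installed():
    import vllm
    print("vLLM version:", vllm.__version__)
else:
    print("vLLM not found -- activate the virtualenv and re-run pip install")
```

If the import fails inside the virtual environment, the most common cause is that `pip` resolved to a different interpreter; `python -m pip install vllm` avoids that.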
```bash
# Pull the official image
docker pull vllm/vllm-openai:latest

# Run with NVIDIA GPU support
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-chat-hf
```
See Docker Setup for detailed instructions.
```bash
# Install build dependencies
pip install cmake ninja packaging wheel

# Install PyTorch first
pip install torch torchvision torchaudio

# Clone the repository
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Install in development (editable) mode
pip install -e .
```
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --host 0.0.0.0 \
  --port 8000
```
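The server takes a while to load model weights before it accepts requests, so scripts should poll it rather than fire requests immediately. A minimal readiness check against the OpenAI-compatible `/v1/models` endpoint, using only the standard library (the function name is illustrative):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8000/v1/models", timeout=60.0):
    """Poll the server until it answers, or return False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)  # server not up yet; retry
    return False
```

Large models can take minutes to load, so a generous timeout is sensible.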
```bash
# Completions endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "Once upon a time",
    "max_tokens": 100
  }'

# Chat endpoint (for chat models)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
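The same chat request can be issued from Python with only the standard library. A sketch that mirrors the curl example above (the helper names are illustrative; the payload builder is separated out so it can be reused or inspected before sending):

```python
import json
import urllib.request


def build_chat_payload(messages, model="meta-llama/Llama-2-7b-chat-hf",
                       max_tokens=100):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


def chat_complete(messages, base_url="http://localhost:8000", **kwargs):
    """POST to /v1/chat/completions and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(messages, **kwargs)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]


# Example (requires the server started above):
# print(chat_complete([{"role": "user", "content": "Hello!"}]))
```

Because the API is OpenAI-compatible, the official `openai` Python client also works by pointing its `base_url` at the vLLM server.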
```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=100,
    top_p=0.9,
)

# Generate completions
prompts = ["Hello, my name is", "The future of AI is"]
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    print(output.outputs[0].text)
```
vLLM supports most popular open-source models from HuggingFace, including the Llama and Mistral families. See the official documentation for the complete list.
| Argument | Description | Example |
|---|---|---|
| `--model` | Model name or path | `meta-llama/Llama-2-7b-chat-hf` |
| `--host` | Server host | `0.0.0.0` |
| `--port` | Server port | `8000` |
| `--tensor-parallel-size` | Number of GPUs | `2` |
| `--max-model-len` | Max sequence length | `4096` |
| `--gpu-memory-utilization` | GPU memory fraction | `0.9` |
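When launching the server from scripts, it can help to assemble the argument list programmatically from the options in the table above. A minimal sketch (the function name and defaults are illustrative):

```python
def build_server_cmd(model, host="0.0.0.0", port=8000,
                     tensor_parallel_size=1, max_model_len=None,
                     gpu_memory_utilization=0.9):
    """Assemble the api_server launch command from the common arguments."""
    cmd = ["python", "-m", "vllm.entrypoints.openai.api_server",
           "--model", model,
           "--host", host,
           "--port", str(port),
           "--tensor-parallel-size", str(tensor_parallel_size),
           "--gpu-memory-utilization", str(gpu_memory_utilization)]
    if max_model_len is not None:  # only pass when overriding the model default
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


# The resulting list can be passed to subprocess.Popen(cmd).
print(build_server_cmd("meta-llama/Llama-2-7b-chat-hf",
                       tensor_parallel_size=2, max_model_len=4096))
```

Building the command as a list (rather than one shell string) avoids quoting issues when model paths contain spaces.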
| Argument | Description | Default |
|---|---|---|
| `--quantization` | Quantization method | None |
| `--enforce-eager` | Enforce eager execution | False |
| `--enable-prefix-caching` | Enable prefix caching | False |
| `--enable-chunked-prefill` | Enable chunked prefill | False |
See Configuration for detailed options.
If you run out of GPU memory, lower the memory footprint:

```bash
# Reduce GPU memory utilization
--gpu-memory-utilization 0.8

# Reduce max model length
--max-model-len 2048
```

If a model download is corrupted, clear the cache and re-download:

```bash
# Clear the HuggingFace cache; the model is re-downloaded on next launch
rm -rf ~/.cache/huggingface
```

If the GPU is not detected, verify that it is visible with `nvidia-smi`.
Setting up LLM inference can be complex; we offer consulting services for vLLM deployments. Contact us at office@linux-server-admin.com or visit our contact page.