A guide to configuring vLLM for optimal performance, memory efficiency, and use-case-specific tuning.
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --model | str | (required) | Model name or path on HuggingFace |
| --tokenizer | str | Same as model | Custom tokenizer name or path |
| --tokenizer-mode | str | auto | auto, slow, or tokenizer class |
| --trust-remote-code | bool | False | Trust remote code from HF |
| --download-dir | str | HF cache | Directory for model downloads |
| --load-format | str | auto | auto, pt, safetensors, npcache |
| --dtype | str | auto | auto, float16, bfloat16, float32 |
| --seed | int | 0 | Random seed for reproducibility |
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --gpu-memory-utilization | float | 0.9 | GPU memory fraction (0.0-1.0) |
| --max-model-len | int | Model default | Maximum sequence length |
| --kv-cache-dtype | str | auto | KV cache dtype: auto, fp8, fp8_e5m2, fp8_e4m3 |
| --quantization-param-path | str | None | Path to quantization params |
| --num-gpu-blocks-override | int | Auto-detected | Override GPU block count |
| --swap-space | int | 4 | CPU swap space (GB) |
| --cpu-offload-gb | float | 0 | CPU offload space (GB) |
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --tensor-parallel-size | int | 1 | Number of GPUs for tensor parallelism |
| --pipeline-parallel-size | int | 1 | Number of GPUs for pipeline parallelism |
| --data-parallel-size | int | 1 | Data parallelism degree |
| --max-num-seqs | int | 256 | Maximum concurrent sequences |
| --max-num-batched-tokens | int | Model dependent | Max tokens per batch |
| --enable-chunked-prefill | bool | False | Enable chunked prefill |
| --prefill-schedule-type | str | default | Prefill scheduling type |
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --quantization | str | None | Quantization: awq, gptq, fp8, squeezellm |
| --enforce-eager | bool | False | Enforce eager execution |
| --enable-prefix-caching | bool | False | Enable prefix caching (APC) |
| --enable-chunked-prefill | bool | False | Enable chunked prefill |
| --disable-sliding-window | bool | False | Disable sliding window attention |
| --use-v2-block-manager | bool | True | Use V2 block manager |
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --host | str | localhost | Server hostname |
| --port | int | 8000 | Server port |
| --uvicorn-log-level | str | info | Uvicorn log level |
| --allow-credentials | bool | False | Allow CORS credentials |
| --allowed-origins | list | ["*"] | Allowed CORS origins |
| --allowed-methods | list | ["*"] | Allowed CORS methods |
| --allowed-headers | list | ["*"] | Allowed CORS headers |
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --api-key | str | None | API key for authentication |
| --served-model-name | str | Model name | Custom model name in API |
| --chat-template | str | Model default | Custom chat template path |
| --response-role | str | assistant | Default response role |
| --enable-auto-tool-choice | bool | False | Enable auto tool choice |
| --tool-call-parser | str | None | Tool call parser name |
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| --log-level | str | info | Log level: debug, info, warning, error |
| --log-stats | bool | False | Log periodic stats |
| --log-requests | bool | False | Log all requests |
| --logging-config | str | None | Path to logging config file |
| Variable | Description | Example |
|----------|-------------|---------|
| HUGGING_FACE_HUB_TOKEN | Authentication token for HF Hub | hf_xxx |
| HF_HOME | HuggingFace cache directory | /data/huggingface |
| HF_DATASETS_CACHE | Datasets cache location | /data/datasets |
| HF_TOKEN | Alternative token variable | hf_xxx |
| Variable | Description | Example |
|----------|-------------|---------|
| CUDA_VISIBLE_DEVICES | Select visible GPUs | 0,1,2 |
| CUDA_LAUNCH_BLOCKING | Debug CUDA launches | 1 |
| NCCL_DEBUG | NCCL debug level | INFO, WARN |
| NCCL_IB_DISABLE | Disable InfiniBand | 1 |
| NCCL_SOCKET_IFNAME | Network interface for NCCL | eth0 |
| Variable | Description | Example |
|----------|-------------|---------|
| VLLM_ALLOW_LONG_MAX_MODEL_LEN | Allow longer model lengths | 1 |
| VLLM_TEST_FORCE_FP8 | Force FP8 quantization | 1 |
| VLLM_USE_RAY_SPMD_WORKER | Use Ray SPMD worker | 1 |
| VLLM_NO_USAGE_STATS | Disable usage stats | 1 |
| VLLM_CONFIGURE_LOGGING | Configure logging | 1 |
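These variables can also be set from Python before vLLM (or any CUDA-touching library) is imported; a minimal sketch, with illustrative values:

```python
import os

# Set process environment before importing vLLM; GPU visibility and
# cache locations are read once at library initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"       # expose only GPUs 0 and 1
os.environ["VLLM_NO_USAGE_STATS"] = "1"          # opt out of usage stats
os.environ["HF_HOME"] = "/data/huggingface"      # redirect the HF cache

print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 0,1
```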
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096
```

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192 \
    --enable-prefix-caching
```

```bash
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7b-Chat-AWQ \
    --quantization awq \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096
```

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --max-num-seqs 512 \
    --max-num-batched-tokens 100000 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.95
```
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --max-num-seqs 128 \
    --max-model-len 2048 \
    --gpu-memory-utilization 0.8 \
    --enforce-eager
```

Note that --enforce-eager is a boolean switch: include it to force eager execution, omit it to allow CUDA graph capture. It does not take a True/False argument.
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key sk-your-secret-api-key \
    --served-model-name llama-2-7b-prod \
    --log-level info \
    --log-stats \
    --enable-prefix-caching
```
These parameters are passed via the API request, not server startup:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| temperature | float | 1.0 | Sampling temperature (0 = greedy) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| top_k | int | -1 | Top-k sampling |
| max_tokens | int | 16 | Maximum output tokens |
| min_tokens | int | 0 | Minimum output tokens |
| stop | list | [] | Stop sequences |
| presence_penalty | float | 0.0 | Presence penalty (-2 to 2) |
| frequency_penalty | float | 0.0 | Frequency penalty (-2 to 2) |
| repetition_penalty | float | 1.0 | Repetition penalty |
| seed | int | None | Random seed |
| logprobs | int | None | Number of logprobs |
| best_of | int | 1 | Number of candidates |
```json
{
  "model": "llama-2-7b",
  "prompt": "Explain quantum computing",
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 500,
  "presence_penalty": 0.5,
  "frequency_penalty": 0.5,
  "stop": ["\n\n", "##"]
}
```
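The same request can be built from Python with the standard library; a sketch that constructs (but does not send) the call, assuming a server on the default localhost:8000:

```python
import json
import urllib.request

# Mirror the JSON request above as a Python payload.
payload = {
    "model": "llama-2-7b",
    "prompt": "Explain quantum computing",
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 500,
    "presence_penalty": 0.5,
    "frequency_penalty": 0.5,
    "stop": ["\n\n", "##"],
}

# Build the POST request against the OpenAI-compatible completions endpoint.
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With a running server, uncomment to send:
# body = json.load(urllib.request.urlopen(req))
print(req.get_method(), req.full_url)  # → POST http://localhost:8000/v1/completions
```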
| Strategy | Configuration | Impact |
|----------|---------------|--------|
| Reduce GPU memory | --gpu-memory-utilization 0.8 | Lower memory, more headroom |
| Shorter sequences | --max-model-len 2048 | Less KV cache memory |
| KV cache dtype | --kv-cache-dtype fp8 | ~50% KV cache reduction |
| CPU offload | --cpu-offload-gb 8 | Offload to CPU RAM |
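The impact of these settings can be estimated with simple arithmetic. A sketch for a Llama-2-7B-shaped model (32 layers, 32 KV heads, head dimension 128; the figures are illustrative, taken from the public model config):

```python
# Per-token KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim = 32, 32, 128

def kv_bytes_per_token(dtype_bytes: int) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes

fp16 = kv_bytes_per_token(2)  # --kv-cache-dtype auto with an fp16 model
fp8 = kv_bytes_per_token(1)   # --kv-cache-dtype fp8 halves the per-token cost

# One 4096-token sequence at fp16:
seq_gib = fp16 * 4096 / 1024**3
print(f"fp16: {fp16} B/token, fp8: {fp8} B/token, 4096-token seq: {seq_gib:.2f} GiB")
```

This is why halving --max-model-len or switching the KV cache to fp8 buys so much headroom: the cache grows linearly in both sequence length and bytes per element.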
| Strategy | Configuration | Impact |
|----------|---------------|--------|
| Increase batch size | --max-num-seqs 512 | Higher throughput |
| Chunked prefill | --enable-chunked-prefill | Better memory utilization |
| Prefix caching | --enable-prefix-caching | Faster repeated prompts |
| Higher memory util | --gpu-memory-utilization 0.95 | More concurrent requests |
| Strategy | Configuration | Impact |
|----------|---------------|--------|
| Limit batch size | --max-num-seqs 64 | Lower latency |
| Shorter sequences | --max-model-len 1024 | Faster processing |
| Eager execution | --enforce-eager | More predictable latency |
| Lower memory util | --gpu-memory-utilization 0.7 | Less memory pressure |
Shard each layer's weight matrices across multiple GPUs (tensor parallelism):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1
```
Split the layer stack into sequential stages across GPUs (pipeline parallelism), here combined with tensor parallelism:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2
```
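The GPU count a launch needs is the product of the two degrees; a quick sanity-check sketch:

```python
def gpus_required(tensor_parallel: int, pipeline_parallel: int) -> int:
    # Each layer is sharded across tensor_parallel GPUs and the layer
    # stack is cut into pipeline_parallel stages, so the world size
    # is the product of the two.
    return tensor_parallel * pipeline_parallel

print(gpus_required(4, 1))  # → 4 (tensor-parallel example above)
print(gpus_required(2, 2))  # → 4 (combined TP x PP example above)
```

Both launches above therefore need exactly four visible GPUs.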
Multiple replicas for higher throughput:

```bash
# Using Ray for data parallelism
export VLLM_USE_RAY=1
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --data-parallel-size 4
```
The OpenAI-compatible server exposes Prometheus-compatible metrics at the /metrics endpoint on the API port; no extra flags are required:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000

# Scrape metrics from the running server
curl http://localhost:8000/metrics
```
```yaml
# logging.yaml
version: 1
disable_existing_loggers: false
formatters:
  default:
    format: '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
handlers:
  console:
    class: logging.StreamHandler
    formatter: default
    level: INFO
root:
  handlers: [console]
  level: INFO
```

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --logging-config logging.yaml
```
```bash
# Reduce memory utilization
--gpu-memory-utilization 0.8
# Reduce sequence length
--max-model-len 2048
# Use quantization
--quantization fp8
```
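Before relaunching after an out-of-memory error, a rough fit check helps pick values. A sketch assuming 2 bytes per parameter (fp16/bf16 weights); the numbers are illustrative:

```python
def kv_headroom_gib(param_billions: float, gpu_gib: float,
                    utilization: float, bytes_per_param: int = 2) -> float:
    """GiB left for KV cache after loading weights (negative means likely OOM)."""
    weights_gib = param_billions * 1e9 * bytes_per_param / 1024**3
    return gpu_gib * utilization - weights_gib

# A 7B model on a 24 GiB card at --gpu-memory-utilization 0.8:
print(f"{kv_headroom_gib(7, 24, 0.8):.1f} GiB for KV cache")  # → 6.2 GiB for KV cache

# A 70B model on the same card clearly does not fit without sharding:
print(kv_headroom_gib(70, 24, 0.9) < 0)  # → True
```

This ignores activation and CUDA graph overhead, so treat a small positive headroom as a warning sign, not a guarantee.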
```bash
# Enable optimizations
--enable-prefix-caching
--enable-chunked-prefill
# Increase batch size
--max-num-seqs 512
```
```bash
# Trust remote code (if needed)
--trust-remote-code
# Specify dtype explicitly
--dtype float16
```
Squeezing every bit of performance from your vLLM installation? Our experts help with:
- Memory and resource tuning
- Connection pool optimization
- Caching strategies
- Load balancing and clustering
Optimize your setup: office@linux-server-admin.com