The story of vLLM, from academic research project to industry standard for LLM inference and serving.
vLLM was developed in the Sky Computing Lab at the University of California, Berkeley. The lab focuses on cloud computing, distributed systems, and efficient resource management for modern computing workloads.
The vLLM project was created by a team of researchers:
| Name | Role |
|------|------|
| Woosuk Kwon | Lead author, PhD student |
| Zhuohan Li | Researcher |
| Siyuan Zhuang | Researcher |
| Ying Sheng | Researcher |
| Lianmin Zheng | Researcher |
| Cody Hao Yu | Researcher |
| Joseph E. Gonzalez | Professor, advisor |
| Hao Zhang | Professor, advisor |
| Ion Stoica | Professor, advisor |
- Joseph E. Gonzalez - Professor at UC Berkeley, expert in distributed systems and machine learning
- Hao Zhang - Professor at UC Berkeley, focused on systems and ML
- Ion Stoica - Professor at UC Berkeley, co-creator of Ray and co-founder of Anyscale
Large Language Models (LLMs) require significant memory for inference, particularly for the KV cache (key-value cache) used during attention computation. Traditional approaches suffered from:
- Memory fragmentation - 60-80% memory waste
- Inefficient batching - Limited concurrent requests
- Static allocation - Pre-allocated memory regardless of actual usage
- Redundant duplication - Same prefixes stored multiple times
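The scale of this waste is easy to see with a back-of-the-envelope calculation. In the sketch below (all numbers hypothetical), a server statically reserves KV-cache space for a 2,048-token maximum even though most requests finish far earlier:

```python
# Illustrative arithmetic for static KV-cache allocation waste.
# Numbers are hypothetical; real waste depends on model and workload.

max_seq_len = 2048                        # tokens reserved per request
actual_lens = [180, 512, 90, 1024, 300]   # tokens actually generated

reserved = max_seq_len * len(actual_lens)
used = sum(actual_lens)
waste = 1 - used / reserved
print(f"reserved={reserved} tokens, used={used}, waste={waste:.0%}")
# → reserved=10240 tokens, used=2106, waste=79%
```

Even this small toy workload lands in the 60-80% waste range quoted above, because every short request pays for the longest possible one.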
The team drew inspiration from operating system virtual memory techniques that have been used for decades in computer systems:
| OS Concept | vLLM Equivalent |
|------------|-----------------|
| Virtual memory pages | KV cache blocks |
| Page table | Block table |
| Demand paging | On-demand block allocation |
| Copy-on-write | Shared prefix caching |
PagedAttention treats the KV cache like OS virtual memory:
- Fixed-size blocks - KV cache split into fixed-size “pages” (blocks)
- On-demand allocation - Blocks allocated only when needed
- Block table - Non-contiguous blocks mapped via block table
- Copy-on-write - Shared sequences use same blocks until modified
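The four mechanisms above can be sketched in a few lines of pure Python. This is an illustrative toy only, not vLLM's actual implementation (which lives in its scheduler and CUDA kernels); the class and method names are invented for the sketch:

```python
# Toy PagedAttention-style block manager: fixed-size blocks, a block table
# per sequence, on-demand allocation, and refcounted sharing for forks.

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class BlockManager:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = {}       # physical block id -> reference count
        self.block_tables = {}     # sequence id -> list of physical block ids

    def append_token(self, seq_id, num_tokens):
        """Allocate a new block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens > len(table) * BLOCK_SIZE:   # current blocks are full
            block = self.free_blocks.pop()
            table.append(block)
            self.ref_counts[block] = 1

    def fork(self, parent_id, child_id):
        """Copy-on-write sharing: the child reuses the parent's physical
        blocks and only bumps refcounts; no memory is copied."""
        table = list(self.block_tables[parent_id])
        self.block_tables[child_id] = table
        for block in table:
            self.ref_counts[block] += 1

mgr = BlockManager(num_blocks=8)
for n in range(1, 20):            # 19 tokens -> ceil(19/16) = 2 blocks
    mgr.append_token("seq-0", n)
mgr.fork("seq-0", "seq-1")        # shared prefix: zero new blocks allocated
```

After the fork both sequences map to the same two physical blocks; a real implementation would copy a block lazily the first time a sequence with refcount > 1 writes to it.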
| Metric | Traditional | vLLM with PagedAttention |
|--------|-------------|--------------------------|
| Memory waste | 60-80% | <4% |
| Concurrent requests | Baseline | 2-4× more |
| Throughput | Baseline | 2-4× higher |
| Memory planning | Complex | Predictable |
| Date | Event |
|------|-------|
| Early 2023 | vLLM project begins at UC Berkeley Sky Computing Lab |
| June 2023 | PagedAttention algorithm developed |
| July 2023 | vLLM system built on PagedAttention |
| September 12, 2023 | Paper submitted to arXiv |
| October 2023 | Paper accepted to SOSP 2023 |
| October 23, 2023 | SOSP 2023 conference presentation |
| Late 2023 | vLLM gains traction in ML community |
| Q1 2024 | Major cloud providers begin adopting vLLM |
| Q2 2024 | vLLM 0.4.0 released with multi-GPU support |
| Q3 2024 | vLLM 0.5.0 with FP8 quantization |
| Q4 2024 | 10M+ Docker pulls, industry-standard status |
| Q1 2025 | vLLM 0.6.0 with chunked prefill |
| Q2 2025 | AMD ROCm support added |
| Q3 2025 | vLLM 0.7.0 with speculative decoding |
| Q4 2025 | TPU and AWS Trainium/Inferentia support |
| March 20, 2026 | Latest release: v0.18.0 |
| March 2026 | 73.9k+ GitHub stars, 14.1k+ forks |
| March 2026 | 10M+ Docker pulls |
| March 2026 | 2,244+ contributors |
| Property | Value |
|----------|-------|
| Title | Efficient Memory Management for Large Language Model Serving with PagedAttention |
| Venue | SOSP 2023 (ACM Symposium on Operating Systems Principles) |
| Date | October 23, 2023 |
| arXiv | arxiv.org/abs/2309.06180 |
| DOI | 10.1145/3600006.3613165 |
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca.
- PagedAttention algorithm - OS-inspired memory management for LLMs
- vLLM system - Production-ready LLM serving engine
- Near-zero memory waste - <4% waste vs 60-80% traditional
- 2-4× throughput improvement - Over FasterTransformer and Orca
- Flexible KV cache sharing - Within and across requests
| Model | System | Throughput (tokens/s) | Improvement |
|-------|--------|-----------------------|-------------|
| LLaMA-7B | FasterTransformer | 2,737 | Baseline |
| LLaMA-7B | vLLM | 6,859 | 2.5× |
| LLaMA-13B | FasterTransformer | 1,423 | Baseline |
| LLaMA-13B | vLLM | 3,876 | 2.7× |
| LLaMA-70B | Orca | 287 | Baseline |
| LLaMA-70B | vLLM (8 GPUs) | 1,151 | 4.0× |
¶ Growth and Adoption
| Metric | 2023 | 2024 | 2025 | 2026 |
|--------|------|------|------|------|
| GitHub Stars | 5,000 | 25,000 | 55,000 | 71,500+ |
| Docker Pulls | 100K | 2M | 7M | 10M+ |
| Contributors | 50 | 500 | 1,500 | 2,244+ |
| Forks | 1,000 | 5,000 | 10,000 | 13,800+ |
| Sector | Use Case |
|--------|----------|
| Cloud Providers | AWS, GCP, Azure managed LLM serving |
| AI Startups | Production LLM APIs |
| Enterprises | Internal LLM infrastructure |
| Research | Fast LLM experimentation |
| Open Source | Powering OpenAI-compatible endpoints |
vLLM powers:
- Many OpenAI-compatible API endpoints
- Cloud provider managed services
- Enterprise LLM infrastructure
- Research institutions worldwide
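Because vLLM's server speaks the OpenAI API, existing clients mostly just need a different base URL. A hedged sketch below builds a request body following the OpenAI completions schema; the model name and localhost URL are placeholders, and the actual send is left commented out since it requires a running server (e.g. started with `vllm serve <model>`):

```python
# Build a request body for a vLLM OpenAI-compatible endpoint
# (POST /v1/completions). Model name and URL are placeholders.
import json

def build_completion_request(model, prompt, max_tokens=64, temperature=0.7):
    """Return the JSON body for a POST to /v1/completions."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_completion_request("meta-llama/Llama-2-7b-hf", "Hello,")
body = json.dumps(payload)

# To actually send it against a running server:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read())
```

The same payload shape works with any OpenAI-compatible client library by pointing its base URL at the vLLM server.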
| Version | Release | Key Features |
|---------|---------|--------------|
| 0.1.x | 2023 | Initial release, PagedAttention |
| 0.2.x | 2023 | Multi-GPU tensor parallelism |
| 0.3.x | 2024 | Pipeline parallelism |
| 0.4.x | 2024 | OpenAI-compatible API server |
| 0.5.x | 2024 | FP8 quantization, prefix caching |
| 0.6.x | 2025 | Chunked prefill, multi-LoRA |
| 0.7.x | 2025 | Speculative decoding |
| 0.8.x+ | 2025 | AMD/TPU support |
| 0.16.x+ | 2026 | Current release series (through v0.18.0) |
| Feature | Version | Impact |
|---------|---------|--------|
| PagedAttention | 0.1 | Memory efficiency breakthrough |
| Tensor Parallelism | 0.2 | Large model support |
| API Server | 0.4 | Easy integration |
| Prefix Caching | 0.5 | Multi-turn conversation boost |
| Chunked Prefill | 0.6 | Better memory utilization |
| Speculative Decoding | 0.7 | 2-3× output throughput |
| Multi-LoRA | 0.6 | Efficient multi-model serving |
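Speculative decoding, one of the features listed above, is worth a concrete illustration. The toy below uses stand-in functions for the models (real systems pair a cheap draft model with the expensive target model); the draft proposes several tokens, and the loop keeps the longest prefix the target confirms, emitting the target's own token at the first disagreement:

```python
# Toy draft-and-verify loop illustrating speculative decoding.
# Both "models" are hypothetical stand-ins over integer tokens.

def draft_model(prefix, k):
    # Cheap model: usually right (next = prev + 1), but deliberately
    # wrong at the third position to show a rejection.
    out, cur = [], prefix
    for i in range(k):
        cur = cur + 1 if i != 2 else cur + 5   # injected error
        out.append(cur)
    return out

def target_model(prefix):
    # Expensive model: the "true" next token is always prev + 1.
    return prefix + 1

def speculative_step(prefix, k=4):
    """Accept draft tokens until the target disagrees; on disagreement,
    emit the target's token instead, so each step yields >= 1 token."""
    accepted = []
    for token in draft_model(prefix, k):
        expected = target_model(prefix)
        if token != expected:
            accepted.append(expected)   # target's correction; stop here
            break
        accepted.append(token)
        prefix = token
    return accepted

print(speculative_step(0))  # → [1, 2, 3]
```

One verification pass here yields three tokens (two accepted drafts plus the correction) where plain decoding would yield one, which is the source of the 2-3× output-throughput gain the table cites.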
Before vLLM:
- High memory waste (60-80%)
- Limited concurrent requests
- Complex deployment
- Expensive LLM serving

With vLLM:
- Near-zero memory waste (<4%)
- 2-4× more concurrent requests
- Simple Docker deployment
- Cost-effective LLM serving
¶ Industry Standard
vLLM has become the de facto standard for:
- OpenAI-compatible API endpoints
- Self-hosted LLM serving
- Production LLM infrastructure
- Cloud provider managed services
- SOSP 2023 - Top systems conference
- Highly cited - Widely referenced in LLM systems research
- Influence - Inspired similar approaches in other systems
- 10M+ Docker pulls - Most popular LLM serving image
- 73.9k+ GitHub stars - One of the most starred ML projects
- Cloud adoption - Integrated into major cloud platforms
| Metric | Status |
|--------|--------|
| Latest Release | v0.18.0 (March 20, 2026) |
| Release Cadence | Regular monthly releases |
| Active Contributors | 100+ monthly |
| Issues | Active triage and resolution |
| PRs | Quick review and merge |
- GitHub Discussions - Active community support
- Slack - vLLM Dev Slack workspace
- Forum - Community discussions and announcements
- Continued performance optimization
- Broader hardware support
- Enhanced multi-model serving
- Improved observability
- Enterprise features
- PagedAttention Paper - arxiv.org/abs/2309.06180
- SOSP 2023 Proceedings - dl.acm.org/doi/10.1145/3600006.3613165