vLLM KV Offloading: Key Findings from the Official Announcement
Tags: ai, llm, vllm, kv-cache, performance, optimization, inference
vLLM recently published a detailed blog post about their KV offloading connector feature, introduced in v0.9.0 with major improvements in v0.12.0 and v0.14.0. This feature addresses a critical bottleneck in high-throughput LLM inference: what happens when GPU memory fills up and requests get preempted.
In my LMCache + Redis article, I covered distributed cache sharing across instances. vLLM's native offloading takes a different approach: extending GPU memory with CPU DRAM for a single instance. Here are the key findings from their announcement.
The Core Problem: Preemption Without Recovery
When vLLM runs out of GPU memory while serving multiple concurrent requests, it must preempt (pause) lower-priority requests to make room. Before KV offloading, this meant:
- The preempted request's KV cache is discarded completely
- On resume, everything is recomputed from scratch
- Long prompts (8K+ tokens) incur a massive prefill penalty
The cost: On an H100, recomputing an 8K-token prompt takes ~3.2 seconds of wasted GPU cycles.
Key Innovation: Async Offloading to CPU
The KV offloading connector introduces an asynchronous API that:
- Before preemption: Offloads KV cache to CPU DRAM (via DMA)
- On resume: Imports KV cache back to GPU
- Result: Avoid recomputation entirely
Critical design choice: Asynchronous transfers don't block inference. While KV data moves between GPU and CPU, the model continues processing other requests.
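To make that overlap concrete, here is a minimal PyTorch sketch of the same pattern (my illustration, not vLLM's implementation): a device-to-host copy of a pinned, KV-block-sized tensor runs on a side CUDA stream while compute continues on the default stream.

```python
import torch

# Minimal sketch of compute/transfer overlap (not vLLM internals): copy a
# KV-sized block to page-locked (pinned) CPU memory on a side stream while
# compute continues. Pinned memory is what lets the cudaMemcpyAsync-style
# DMA proceed without stalling the GPU's compute stream.
device = torch.device("cuda")

kv_block_gpu = torch.randn(1024, 1024, dtype=torch.float16, device=device)  # ~2 MiB
kv_block_cpu = torch.empty(kv_block_gpu.shape, dtype=torch.float16,
                           device="cpu", pin_memory=True)

copy_stream = torch.cuda.Stream()
copy_stream.wait_stream(torch.cuda.current_stream())  # block must be ready first

with torch.cuda.stream(copy_stream):
    kv_block_cpu.copy_(kv_block_gpu, non_blocking=True)  # async D2H via DMA

# "Inference" keeps running on the default stream while the copy is in flight.
activations = torch.randn(4096, 4096, dtype=torch.float16, device=device)
result = activations @ activations

copy_stream.synchronize()  # wait only when the offloaded block is needed again
print("offloaded:", kv_block_cpu.numel() * kv_block_cpu.element_size(), "bytes")
```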
The v0.12.0 Game-Changer: Memory Layout Reorganization
Early versions had a fatal flaw: KV cache was fragmented across transformer layers, creating tiny transfer blocks (8-72 KB). This killed transfer efficiency.
v0.12.0 breakthrough: Consolidated KV data into one contiguous physical block per request across all layers.
Block Size Impact
| Model | Old Block Size | New Block Size | Multiplier |
|---|---|---|---|
| Llama-3.1-8B | 32 KB | 2 MB | 62x larger |
| DeepSeek-R1-Distill-32B | 16 KB | 2 MB | 125x larger |
| Llama-3.2-1B | 16 KB | 0.5 MB | 31x larger |
Why it matters: Larger contiguous blocks amortize DMA setup overhead and enable full memory bandwidth utilization.
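For intuition on where those figures come from, here is a back-of-envelope for Llama-3.1-8B using my own assumptions (not the blog's): 32 layers, 8 KV heads with GQA, head dim 128, fp16 KV, 16-token blocks, with K and V stored as separate per-layer slabs.

```python
# Back-of-envelope KV block sizing for Llama-3.1-8B (assumed model geometry).
layers, kv_heads, head_dim, dtype_bytes, block_tokens = 32, 8, 128, 2, 16

per_layer_slab = block_tokens * kv_heads * head_dim * dtype_bytes  # one K (or V) slab
contiguous_block = per_layer_slab * 2 * layers                     # K+V, all layers, one block

print(f"fragmented transfer unit: {per_layer_slab / 1024:.0f} KiB")     # 32 KiB
print(f"contiguous transfer unit: {contiguous_block / 2**20:.0f} MiB")  # 2 MiB
print(f"ratio: {contiguous_block // per_layer_slab}x")                  # 64x (blog reports 62x)
```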
Real-world impact from their benchmarks:
- 4x reduction in TTFT (time-to-first-token)
- 5x increase in throughput after the memory layout change alone
Transfer Method Showdown: DMA vs. Custom CUDA Kernel
The team compared two approaches for GPU↔CPU transfers:
DMA (Direct Memory Access via cudaMemcpyAsync)
- Bandwidth: 83.4 GB/s bidirectional with 2MB blocks
- Pros: No GPU core interference, consistent performance
- Cons: Less efficient for tiny blocks (<0.5 MB)
Custom CUDA Kernel
- Bandwidth: 68.5 GB/s with higher variance
- Pros: Better for small fragmented blocks
- Cons: Competes with inference for GPU cores
Winner: DMA by a landslide after v0.12.0's contiguous memory layout. The blog reports 32% more throughput using DMA versus the custom kernel, while matching TTFT.
Key insight: The memory layout optimization made DMA the clear winner. With the old fragmented layout, custom kernels were a necessary evil.
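A quick illustrative calculation shows why block size decides this. The per-call setup overhead below is an assumed order-of-magnitude figure, not a number from the blog:

```python
# Effective bandwidth collapses when per-transfer setup cost isn't amortized.
PEAK_BW = 83.4e9        # bytes/s, the blog's measured DMA bandwidth
SETUP_OVERHEAD = 5e-6   # seconds per async copy call (assumed order of magnitude)

for size in (32 * 1024, 2 * 2**20):  # old 32 KiB fragment vs new 2 MiB block
    wire_time = size / PEAK_BW
    effective_bw = size / (wire_time + SETUP_OVERHEAD)
    print(f"{size // 1024:5d} KiB -> effective ~{effective_bw / 1e9:.1f} GB/s")
# 32 KiB fragments lose most of the bandwidth (~6 GB/s here);
# 2 MiB blocks recover the bulk of the 83 GB/s peak (~70 GB/s here).
```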
Benchmark Results: The Numbers That Matter
Testing setup: H100 80GB, Llama-3.1-8B-Instruct, 500GB DRAM, Ubuntu 24.04.1
Single Request Latency: 2-22x Faster TTFT
When a request's KV cache is already in CPU memory (from previous preemption), reloading it dramatically reduces time-to-first-token:
| Prompt Length | Recompute (ms) | CPU Load (ms) | Speedup |
|---|---|---|---|
| 512 tokens | ~200 | ~100 | 2x |
| 2K tokens | ~800 | ~80 | 10x |
| 8K tokens | ~3200 | ~145 | 22x |
Critical finding: Longer prompts benefit more because the CPU-to-GPU reload is dominated by fixed DMA setup overhead and stays in the ~80-145 ms range, while recomputation scales roughly linearly with prompt length.
Why 22x for 8K tokens? The reload takes ~145 ms; recomputation takes ~3200 ms. The ratio keeps improving as prompts grow.
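A back-of-envelope makes the scaling concrete. The per-token KV size is my assumption for Llama-3.1-8B at fp16 (~128 KiB per token), not a figure from the blog:

```python
# Recompute time grows ~linearly with prompt length; raw DMA time stays tiny,
# so the measured reload is dominated by fixed overhead.
DMA_GBPS = 83.4
KV_BYTES_PER_TOKEN = 128 * 1024        # assumed: Llama-3.1-8B, fp16 KV
RECOMPUTE_MS_PER_TOKEN = 3200 / 8192   # ~0.4 ms/token, from the 8K row above

for tokens in (512, 2048, 8192):
    recompute_ms = tokens * RECOMPUTE_MS_PER_TOKEN
    wire_ms = tokens * KV_BYTES_PER_TOKEN / (DMA_GBPS * 1e9) * 1e3
    print(f"{tokens:5d} tokens: recompute ~{recompute_ms:6.0f} ms, raw DMA ~{wire_ms:4.1f} ms")
# Even at 8K tokens the raw transfer is ~13 ms; most of the measured ~80-145 ms
# reload is fixed setup cost, while recomputation keeps growing with length.
```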
Concurrent Throughput: Up to 9x Improvement
Benchmark scenario: 10,000 unique 512-token requests hitting the server.
| CPU DRAM Allocated | Baseline | With Offloading | Improvement |
|---|---|---|---|
| 0 GB (disabled) | 1850 tok/s | 1850 tok/s | 1x |
| 16 GB | 1850 tok/s | 3200 tok/s | 1.7x |
| 64 GB | 1850 tok/s | 8500 tok/s | 4.6x |
| 128 GB | 1850 tok/s | 16,650 tok/s | 9x |
Key finding: The more CPU DRAM you allocate, the higher the cache hit rate and the better throughput scales.
Why such massive gains?
- Without offloading: preempted requests must recompute → wasted GPU cycles
- With offloading: GPU spends time generating tokens, not re-prefilling
- Effective batch size increases because GPU isn't blocked on recomputation
Practical implication: Adding cheap CPU DRAM (128GB DDR5 ≈ $400) can nearly 10x your throughput on expensive GPUs (H100 ≈ $30,000).
Configuration Evolution: CLI Simplicity
The configuration story shows vLLM's maturity over versions:
Legacy (pre-0.14.0): Complex JSON config
```bash
--kv-transfer-config '{
  "kv_connector": "OffloadingConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {"num_cpu_blocks": 8192}
}'
```
Modern (v0.14.0+): Two simple flags
```bash
--kv-offloading-backend native \
--kv-offloading-size 128   # GB of CPU DRAM
```
Finding: The API surface simplification indicates the feature has moved from experimental to production-ready.
What Makes This Async Design Work
The blog emphasizes the non-blocking nature of the connector API:
- Before handling requests: Query connector to import cached KV (async)
- During inference: Model computes while DMA transfers happen in background
- After generation: Store new KV values externally (async)
Critical insight: vLLM doesn't wait for transfers to complete. It overlaps computation with data movement. This is why the latency overhead is described as "not user-facing."
Architecture Deep Dive: Request Lifecycle
The blog walks through how offloading integrates into vLLM's scheduler and request lifecycle.
Key takeaway: Offloading is transparent to the application layer. The scheduler makes all decisions about when to preempt and offload.
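As a schematic illustration of that lifecycle (invented names, not vLLM's connector interface), the sketch below simulates the three async phases with a background executor standing in for the DMA engine:

```python
# Schematic lifecycle sketch (illustrative only; not vLLM's connector API).
import time
from concurrent.futures import ThreadPoolExecutor

DMA_GBPS = 83.4  # the blog's measured DMA bandwidth

def transfer(direction: str, mib: int) -> str:
    time.sleep(mib * 2**20 / (DMA_GBPS * 1e9))  # pretend-DMA transfer time
    return f"{direction}: {mib} MiB"

connector = ThreadPoolExecutor(max_workers=2)

# 1. Before handling the request: start importing KV already offloaded to CPU.
pending_import = connector.submit(transfer, "CPU->GPU import", 512)

# 2. During inference: the model keeps computing while the DMA runs in background.
time.sleep(0.05)                  # stand-in for decode steps
print(pending_import.result())    # block only at the point the KV is actually needed

# 3. After generation: store new KV externally without blocking the next step.
connector.submit(transfer, "GPU->CPU offload", 512)
connector.shutdown(wait=True)
```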
Upcoming Improvements (v0.14.0+)
The blog mentions work-in-progress features:
- Preempted request reloading: currently, if a request gets preempted, it can't automatically resume from the CPU cache. This is being fixed.
- Race condition fixes: between offloading operations and model computation; the async nature creates timing challenges they're addressing.
Finding: The feature is mature but still evolving. Production users should track releases for stability improvements.
When This Actually Matters
The blog doesn't explicitly state this, but the numbers reveal the sweet spot:
✅ High Impact Scenarios
Long contexts + high concurrency:
- 50+ concurrent requests on single GPU
- 8K+ token prompts (22x benefit)
- Frequent preemption due to memory pressure
Bursty traffic:
- Traffic spikes cause aggressive preemption
- CPU cache smooths out GPU memory bottleneck
- Cost-effective scaling (CPU DRAM is cheap)
⚠️ Minimal Impact Scenarios
Short contexts (<2K tokens):
- Recomputation is already fast (<800ms)
- DMA overhead comparable to just recomputing
- Benefit drops to 2x (barely worth complexity)
Low concurrency:
- GPU memory not under pressure
- No preemption happening
- Feature adds overhead without benefit
The Elephant in the Room: No Cross-Instance Sharing
What the blog doesn't emphasize: This is purely local to one vLLM instance.
Unlike LMCache with Redis:
- ❌ No cache sharing across multiple vLLM workers
- ❌ No persistence (lost on restart)
- ❌ No chunk-level position-independent matching
It's purely a memory extension mechanism, not a distributed cache.
The complement: Run LMCache on top for cross-instance sharing + native offloading for local memory extension.
Production Lessons from the Benchmarks
Reading between the lines of their benchmark setup reveals production considerations:
Memory planning:
- They tested with 500GB DRAM on an H100 system
- Allocated up to 128GB for KV cache (25% of total)
- Left headroom for OS, other processes
CPU core limitation:
- Limited to 8 cores despite having more available
- Suggests CPU cycles aren't the bottleneck (DMA is)
- Don't need high core count, just fast memory bandwidth
Block size matters:
- Tests use 16-token blocks (standard vLLM default)
- With contiguous layout, these aggregate into MB-sized transfers
- Configuration choice affects transfer efficiency
Comparison: Where This Fits in the KV Cache Landscape
The blog exists in a broader ecosystem:
| Approach | Scope | Latency | Use Case |
|---|---|---|---|
| Native vLLM prefix cache | Single instance, prefix-only | 0 (in GPU) | Same prompt beginnings |
| Native KV offloading | Single instance, CPU DRAM | Sub-ms (DMA) | High concurrency, preemption |
| LMCache + Redis | Multi-instance cluster | 1-5ms (network) | Distributed fleet, chunk-level sharing |
| LMCache + S3 | Multi-instance, persistent | 50-200ms | Cold storage, cost optimization |
Finding: These are complementary layers in a storage hierarchy, not competitors.
Key Architectural Decision: Why DMA Won
The blog spends significant time justifying DMA over custom CUDA kernels. Here's why this matters:
Before v0.12.0:
- Fragmented blocks (8-72 KB)
- Custom kernels needed to batch small transfers
- Lower throughput but necessary
After v0.12.0:
- Contiguous blocks (0.5-2 MB)
- DMA shines at this size (83 GB/s)
- No GPU core interference
Lesson: Architecture changes (memory layout) unlocked a simpler, faster solution (DMA). Sometimes the right abstraction makes the obvious approach work.
What They Didn't Benchmark: Multi-GPU Scenarios
Notably absent: How does this work with tensor parallelism across multiple GPUs?
Open question: When a model is split across 4 GPUs, does offloading transfer from all 4 in parallel? Or serialize? The blog doesn't say.
Implication for production: Users running Llama-70B on 4x A100s need to test this themselves.
The Prometheus Metrics Gap
The blog mentions monitoring but doesn't provide metric names. Based on vLLM patterns, expect:
```
vllm_kv_offload_total    # Count of offload operations
vllm_kv_offload_bytes    # Bytes moved to CPU
vllm_kv_reload_total     # Count of reload operations
vllm_kv_cache_hit_rate   # % of requests with a CPU cache hit
```
Production gap: No guidance on what "good" values look like. Cache hit rate >70% likely ideal based on throughput curves.
Practical Takeaways: What to Actually Do
Distilling the blog's findings into actionable advice:
Start Simple
```bash
vllm serve <model> \
  --kv-offloading-backend native \
  --kv-offloading-size 64   # Start with 64 GB
```
Monitor cache hit rate. If low (<50%), you need more CPU DRAM or less concurrency.
Size the CPU Cache
Rule of thumb from the benchmark setup (Llama-3.1-8B, fp16 KV ≈ 128 KB per token):
- Roughly 1 GB of KV per 8K tokens of active working set
- For 100 concurrent 8K-token requests: ~100 GB
- Larger models and longer contexts inflate this quickly → tune based on hit rate
Practical sizing:
- 16GB: Minimal (handles ~20 concurrent 8K requests)
- 64GB: Good (handles ~80 concurrent 8K requests)
- 128GB: Excellent (handles ~160 concurrent 8K requests)
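To turn those tiers into numbers for your own workload, here is a small sizing helper. It assumes fp16 KV and Llama-3.1-8B geometry (32 layers, 8 KV heads, head dim 128); substitute your model's config.

```python
# Rough CPU-DRAM sizing for KV offloading (assumed fp16 KV, Llama-3.1-8B geometry).
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # K and V

def dram_gib_needed(concurrent_requests, tokens_per_request, **model):
    per_request = tokens_per_request * kv_bytes_per_token(**model)
    return concurrent_requests * per_request / 2**30

for n in (20, 80, 160):
    print(f"{n:4d} x 8K-token requests -> ~{dram_gib_needed(n, 8192):.0f} GiB CPU DRAM")
# ~20, ~80, ~160 GiB -- the same ballpark as the 16/64/128 GB tiers above.
```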
Know Your Break-Even Point
From the TTFT numbers:
- 8K tokens: 22x benefit → use offloading
- 2K tokens: 10x benefit → probably use offloading
- 512 tokens: 2x benefit → maybe skip (low ROI)
If your median prompt is <1K tokens, this feature might not be worth the complexity.
Combine with LMCache for Maximum Effect
```bash
vllm serve <model> \
  --kv-offloading-backend native \
  --kv-offloading-size 64 \
  --enable-lmcache \
  --lmcache-config redis.yaml
```
Stack the benefits:
- LMCache handles cross-instance sharing (chunk-level)
- Native offloading handles local preemption (DMA-fast)
- Best of both worlds
Conclusion: A Narrow but Powerful Tool
The vLLM blog post describes a feature that solves one specific problem exceptionally well: avoiding recomputation after GPU memory preemption.
What it is:
- Memory extension for single vLLM instance
- DMA-based GPU↔CPU transfers
- Massive TTFT reduction (2-22x)
- Up to 9x throughput with high cache hit rates
What it isn't:
- Distributed cache (use LMCache for that)
- Persistent storage (lost on restart)
- Position-independent matching (prefix-based only)
The real finding: Adding $400 of CPU DRAM can 10x throughput on a $30,000 GPU. The ROI is absurd for high-concurrency, long-context workloads.
For production LLM deployments running vLLM with memory pressure and long contexts, this isn't optional — it's table stakes.