LMCache-Aware Routing: When Prefill Workers Stop Being Stateless
Published on: 2026/03/06
Tags: vllm, rust, lmcache, inference, kv-cache, optimization, performance
In my previous article I covered the production features in my vLLM router fork: YAML config, response caching, semantic cluster routing, admin API. This article covers the latest addition — and arguably the most consequential one for PD disaggregation: LMCache-aware routing.
The core idea is simple: instead of guessing which worker has KV cache (the cache_aware approach with a radix tree), ask the system that actually manages the cache. The LMCache controller knows exactly which worker holds how many cached KV chunks. The router polls it and routes accordingly.
But the implications are bigger than a routing improvement. With LMCache-aware routing, prefill workers are no longer stateless. The router knows they hold cache state. It routes to preserve it. This changes the fundamental assumption of PD disaggregation.
The Problem With Guessing
The existing cache_aware policy in the vLLM router maintains an approximate radix tree. Every time a request is routed to a worker, the router inserts the prompt's token prefix into a per-worker tree. On the next request, it walks the tree to find which worker has the longest matching prefix and routes there.
This works — until it doesn't:
Evictions are invisible. When a vLLM worker evicts KV cache entries due to memory pressure, the router's radix tree doesn't know. It still thinks the prefix is there. It routes the request, the worker recomputes from scratch, and you get a cache miss that looks like a cache hit from the router's perspective.
Multi-instance state diverges. If you run multiple router instances (which you should, for HA), each builds its own radix tree independently. Their views of cache state drift apart. One router sends multi-turn session requests to worker A, another sends to worker B — both think they're doing cache-aware routing, both are wrong.
Cold starts are blind. After a router restart, the radix tree is empty. Every request is a guess until enough history accumulates to rebuild an accurate tree. During this window, cache hit rates drop to near zero.
The cache_aware policy also has a complex dual-mode behavior — it switches between cache-based and load-based routing depending on load imbalance. This works but makes the routing behavior harder to predict and debug.
Data-Driven Routing With LMCache
The lmcache_aware policy replaces all of this with a single source of truth: the LMCache controller.
Redis / Valkey)] R -->|polls every 10s| LC[LMCache Controller] R -->|routes| W1[Worker 1 + LMCache] R -->|routes| W2[Worker 2 + LMCache] LC -->|ZMQ| W1 LC -->|ZMQ| W2 W1 --> RD[(KV Cache
Redis / Valkey)] W2 --> RD
The architecture has three components:
LMCache on each vLLM worker — configured with enable_controller: true, each worker reports its cache state to the controller via ZMQ. This includes key_count (number of cached KV chunks), heartbeat timestamps, and instance identity.
LMCache controller — a lightweight FastAPI service (no GPU needed) that aggregates cache state from all workers and exposes it via HTTP. It runs as a single pod in Kubernetes.
The router's lmcache_aware policy — spawns an async Tokio task that polls GET /controller/workers at a configurable interval. It maintains an in-memory map of instance_id → key_count and uses this to score routing decisions.
The Scoring Formula
For each healthy worker, the policy computes:
score = cache_weight * normalized_key_count + (1 - cache_weight) * normalized_inverse_load
Where:
normalized_key_count= worker'skey_count/ maxkey_countacross all workersnormalized_inverse_load= (max_load - worker_load) / max_loadcache_weightis a tunable parameter (default 0.7, range 0.0–1.0)
The worker with the highest score wins. At cache_weight=1.0, routing is purely cache-affinity. At cache_weight=0.0, it's purely load-based. The default of 0.7 strongly prefers workers with more cached data but still accounts for load distribution.
fn compute_score(
&self,
worker: &dyn Worker,
max_key_count: usize,
max_load: usize,
instance_id: Option<&str>,
cache_state: &HashMap<String, WorkerCacheInfo>,
) -> f64 {
let cache_weight = self.config.cache_weight as f64;
let load_weight = 1.0 - cache_weight;
let normalized_cache = if let Some(id) = instance_id {
if let Some(info) = cache_state.get(id) {
if max_key_count > 0 {
info.key_count as f64 / max_key_count as f64
} else { 0.0 }
} else { 0.0 }
} else { 0.0 };
let normalized_inverse_load = if max_load > 0 {
(max_load - worker.load().min(max_load)) as f64 / max_load as f64
} else { 1.0 };
cache_weight * normalized_cache + load_weight * normalized_inverse_load
}
Graceful Fallback
When the controller is unreachable — timeout, error, not yet deployed — the policy delegates to a configurable fallback (default: power_of_two). No error reaches the client. When the controller comes back, routing automatically switches to cache-aware. This makes deployment incremental: you can add the controller later without changing the router config.
Regular Mode: The Simplest Win
The most straightforward use of lmcache_aware is in regular (non-PD) mode — multiple vLLM workers behind the router, all serving the same model. No prefill/decode split, just a pool of workers.
mode:
type: regular
worker_urls:
- "http://vllm-worker-001:8000"
- "http://vllm-worker-002:8000"
policy:
type: lmcache_aware
controller_url: "http://lmcache-controller:9000"
poll_interval_secs: 10
cache_weight: 0.7
lookup_mode: occupancy
fallback_policy: "power_of_two"
controller_timeout_ms: 2000
lmcache_worker_map:
"vllm-001": "http://vllm-worker-001:8000"
"vllm-002": "http://vllm-worker-002:8000"
This is the setup that gives you the biggest improvement with the least architectural change. You're already running multiple workers — you already have cache affinity problems. Every multi-turn conversation that lands on a different worker than its previous turn wastes GPU cycles recomputing KV.
With lmcache_aware in regular mode:
- Turn 1 goes to Worker 1 (load balanced, both workers are empty)
- LMCache on Worker 1 reports 100 cached chunks to the controller
- Turn 2 arrives — the router polls the controller, sees Worker 1 has 100 chunks, Worker 2 has 10
- Turn 2 routes to Worker 1 — prefix cache hit, skip recomputation
Without it, Turn 2 goes to Worker 2 via round robin or power-of-two — full recomputation.
This is the right starting point for most teams. If you're running multi-turn workloads behind multiple vLLM instances, add the LMCache controller and switch to lmcache_aware before even considering PD disaggregation.
Why This Changes PD Disaggregation
In the standard PD disaggregation model, the mental model is clean:
- Prefill workers are stateless. Any prefill worker can handle any request. Use load balancing.
- Decode workers are stateful. They hold the KV cache across turns. Use session affinity.
This mental model is wrong as soon as you introduce LMCache.
With LMCache enabled, prefill workers also hold KV cache. When Worker 1 prefills a request, LMCache stores the computed KV chunks in local CPU memory (and optionally reports them to the controller). If Turn 2 of the same conversation arrives at Worker 1, it gets a prefix cache hit and skips recomputation. If it arrives at Worker 2, it recomputes everything.
Prefill workers are now stateful. The question is whether the router knows it.
Without lmcache_aware, the answer is no — the router treats prefill workers as interchangeable. With lmcache_aware, the router has real visibility into which prefill worker holds cached data for which conversations.
This is the configuration that makes PD disaggregation cache-aware at both layers:
mode:
type: pd_disaggregation
prefill_urls:
- "http://prefill-1:8081"
- "http://prefill-2:8081"
decode_urls:
- "http://decode-1:8083"
- "http://decode-2:8083"
prefill_policy:
type: lmcache_aware
controller_url: "http://lmcache-controller:9000"
poll_interval_secs: 10
cache_weight: 0.8
fallback_policy: "power_of_two"
controller_timeout_ms: 2000
lmcache_worker_map:
"prefill-instance-1": "http://prefill-1:8081"
"prefill-instance-2": "http://prefill-2:8081"
decode_policy:
type: consistent_hash
virtual_nodes: 160
Prefill side: lmcache_aware routes to the worker with the most relevant cached KV chunks. In multi-turn conversations, this means Turn 2 goes to the same prefill worker that processed Turn 1 — because the controller reports that worker has 100 cached chunks while others have 10.
Decode side: consistent_hash with sticky sessions pins each conversation to the same decode worker. This hasn't changed — decode workers accumulate state across turns by design.
Without lmcache_aware, Turn 2 might go to Prefill Worker 2 via round robin or power-of-two — full recomputation, no cache benefit, wasted GPU cycles.
cache_aware vs lmcache_aware
Both policies aim for the same goal — route to the worker with the most relevant cache. They differ fundamentally in how they know where the cache is:
| Aspect | cache_aware | lmcache_aware |
|---|---|---|
| Cache state source | Approximate radix tree built from request history | Real state from LMCache controller |
| Eviction visibility | No — tree doesn't know when vLLM evicts | Yes — controller reflects actual cache occupancy |
| Multi-router consistency | No — each router builds its own tree | Yes — all routers poll the same controller |
| Cold start behavior | Empty tree, blind routing until history accumulates | Polls controller immediately, gets current state |
| Infrastructure required | None — runs standalone | LMCache controller + LMCache on each worker |
| Load balancing | Dual-mode: cache-based when balanced, load-based when imbalanced | Single formula: weighted score with tunable cache_weight |
| Per-request overhead | Radix tree lookup (microseconds) | HashMap read (microseconds) — polling is background |
| Regular mode | Works (but guesses) | Works — simplest and most impactful improvement |
| PD disaggregation | Works as prefill policy (guesses) | Works as prefill policy with real cache visibility |
| Best for | Deployments without LMCache, simpler setups | Any deployment with LMCache: regular or PD, multi-turn workloads |
The cache_aware policy isn't obsolete — it's the right choice when you don't have LMCache deployed. It works well enough for many workloads and has zero infrastructure dependencies. But when you have LMCache (which you should if you're running multi-turn at scale), lmcache_aware is strictly better because it uses real data instead of approximations.
The Moving Parts
The integration requires three components: an LMCache controller (lightweight FastAPI pod, no GPU), LMCache enabled on each vLLM worker with enable_controller: true, and the router configured with lmcache_aware policy. The one detail worth highlighting here is the lmcache_worker_map — the controller identifies workers by instance_id while the router identifies them by URL. The map bridges these two identities. Without it, the router can't correlate cache state to its worker pool and routing degrades to pure load balancing.
Full configuration examples are in the repository: configs/lmcache-aware.yaml for regular mode and configs/pd-lmcache-aware.yaml for PD disaggregation. The LMCache integration docs cover prerequisites, Kubernetes manifests, and troubleshooting.
Redis at Two Layers
In my deployment I use Redis at two distinct layers, and it's worth being explicit about why they're different things.
LMCache + Redis: Distributed KV Cache
LMCache supports a multi-tier storage hierarchy: GPU memory, CPU DRAM, local disk, and a remote backend. When configured with remote_url: "redis://host:6379", KV cache chunks are serialized and stored in Redis as the outermost tier. This means any worker can pull cached KV chunks from Redis — even if it didn't compute them originally.
This changes the cache locality picture. Without Redis, KV cache is trapped on the worker that computed it. If the router sends Turn 2 to the wrong worker, that worker starts from scratch. With Redis as remote backend, the wrong worker can still pull the KV from Redis — it's slower than a local CPU cache hit, but faster than full recomputation.
Does this make lmcache_aware routing unnecessary? No. The storage hierarchy matters:
| Cache Location | Latency | What Triggers It |
|---|---|---|
| Local GPU memory | <1us | Same worker, same session, still in VRAM |
| Local CPU DRAM | ~10us | Same worker, evicted from GPU, still in RAM |
| Redis (remote) | ~1ms | Any worker, fetched over network |
| Recomputation | ~100ms+ | Cache miss everywhere |
Routing to the worker that has the KV in local CPU (~10us) is still two orders of magnitude faster than pulling from Redis (~1ms). And Redis pulls add network traffic that scales with context length. lmcache_aware routing minimizes Redis fallback by preferring workers with local cache. Redis is the safety net, not the primary path.
I'm currently running Redis for both layers, but my intention is to migrate to Valkey after more extensive testing. Valkey is the open-source fork of Redis (post-license change), wire-compatible, and backed by the Linux Foundation. The LMCache and router Redis clients use standard Redis protocol — the migration should be a URL swap, but I want to validate performance under KV cache workloads before committing.
For my LMCache + Redis setup, see my previous article on LMCache + Redis.
Router + Redis: Response Caching
Separately, the router fork supports Redis as a response cache backend. This is an entirely different use case: caching complete LLM responses (not KV chunks) so that identical requests return instantly without touching any vLLM worker. When running multiple router instances, a shared Redis cache ensures deduplication across all instances.
These two Redis uses are complementary:
- LMCache Redis prevents KV recomputation across workers (inference-level optimization)
- Router Redis prevents duplicate inference entirely for repeated prompts (request-level optimization)
Two Phases
Phase 1: Occupancy Routing (implemented)
lookup_mode: occupancy
The router polls GET /controller/workers at the configured interval. Each worker's total key_count (number of cached KV chunks across all sessions) is used for scoring. Workers with more cached data are preferred.
This is a coarse signal — it tells you "Worker 1 has more cache than Worker 2" but not "Worker 1 has cache for this specific conversation". In practice, it's still far better than guessing because it reflects real state including evictions, and in multi-turn workloads the worker with the most cache is usually the one that processed your previous turns.
Phase 2: Prefix Lookup (not yet implemented)
lookup_mode: prefix_lookup
The config accepts this value and the plumbing exists (needs_request_text() returns true in this mode), but the actual POST /lookup call to the controller is not implemented yet. Today, setting lookup_mode: prefix_lookup will not give you per-request prefix matching — routing still uses the occupancy-based scoring.
The design for Phase 2: per-request POST /lookup to the controller with the tokenized prompt. The controller already exposes this endpoint in LMCache (lmcache.v1.api_server):
POST /lookup
{"tokens": [1, 2, 3, 4, 5, ...]}
Response:
{"layout_info": {"vllm-001": ["LocalCPUBackend", 768]}}
The layout_info maps instance_id to (location, matched_token_count). This gives exact prefix-match routing — "Worker vllm-001 has 768 tokens cached for this exact prefix."
What's needed to implement it:
- The router needs to tokenize the incoming prompt before the lookup call. The tokenizer must produce the same token IDs as vLLM/LMCache (same HuggingFace model). The fork already supports HuggingFace, TikToken, and SentencePiece tokenizers.
- The
select_worker_with_headersmethod needs to callPOST /lookupwith the token IDs and use thematched_token_countas the cache score instead ofkey_count. - Latency budget: the lookup adds a per-request HTTP call, mitigated by
controller_timeout_ms.
Phase 2 is where lmcache_aware will become qualitatively different from any other routing policy — routing based on actual cached content for the specific incoming request, not aggregate occupancy.
Practical Takeaways
- Start with regular mode. If you have multiple vLLM workers behind a router,
lmcache_awarein regular mode is the simplest and highest-impact change. You don't need PD disaggregation to benefit. - Start with Phase 1 (occupancy). It requires zero changes to the tokenizer setup and gives you the biggest improvement over
cache_awareorpower_of_two— real cache visibility instead of guessing. - Use
cache_weight: 0.7as default. This strongly prefers cached workers but doesn't create hot spots. Tune down to 0.5 if you see load imbalance. - Set
fallback_policy: "power_of_two". If the controller goes down, this is the best general-purpose fallback. The transition is transparent to clients. - Always configure
lmcache_worker_map. Without it, the routing degrades to pure load balancing because the router can't correlate controller data to its workers. - For PD mode:
lmcache_awarefor prefill,consistent_hashfor decode. This is the recommended production configuration when you do need PD disaggregation with multi-turn workloads. - Monitor cache hit rates. The router exports per-policy Prometheus metrics. Compare cache hit rates between
cache_awareandlmcache_awarebefore and after deployment. - The controller is lightweight. It needs 500m CPU and 512Mi memory. Don't skip deploying it because you think it's heavy — it's cheaper than the GPU cycles you waste on cache misses.
What's Next
Phase 2 (prefix lookup) is the priority. The LMCache controller already exposes the /lookup endpoint — the missing piece is the router-side implementation: tokenize the prompt, call the endpoint, use the matched token count for scoring. The plumbing exists; the logic doesn't yet.
Once implemented, combined with P2P cache sharing between LMCache instances, the end state looks like:
- The router tokenizes the incoming prompt and asks the controller which worker has the longest cached prefix
- Routes to that worker — exact prefix cache hit
- If no single worker has the full prefix, LMCache can pull missing chunks from another worker via P2P (NIXL)
- The decode worker receives the KV cache and continues generation
That's the end state for cache-aware routing: real state, exact matching, cross-worker sharing.
Sources
- Fork repository — full source code and configs
- LMCache project — the KV cache management system
- Previous article: vLLM Router fork features
- vLLM Router and PD Disaggregation — background on cache-aware routing
- LMCache + Redis: Distributed KV Cache — LMCache deep dive