vLLM Router: the story of a fork and the features upstream doesn't have
Published on: 2026/03/05
Tags: vllm, rust, performance, inference, tutorial
In my previous article I covered why prefix-cache-aware routing matters for PD disaggregation and looked at the vLLM router as one of the production-grade solutions. Since then I've been running it in real workloads and experimenting with features I'd like to see in the routing layer: response caching, semantic-aware routing, graceful operations, config-file-driven deployments.
The upstream router is solid for what it does. But there are features I wanted to try (some experimental, some production-hardened) that don't exist upstream yet. The vLLM project has its own semantic-router, but it's a more complex system with a broader scope. I wanted something lightweight for my own experiments and deployments that still holds up in production. So I maintain a fork where I can iterate on these ideas.
This article walks through what the fork adds, why each feature exists, and how to configure them.
What the upstream router gives you
The vllm-project/router is a Rust-based request router for vLLM. It handles the basics well:
- Five load balancing policies: round robin, random, consistent hash, power of two, cache-aware
- PD disaggregation with separate prefill/decode pools
- Circuit breakers and retries with exponential backoff
- Kubernetes service discovery
- Prometheus metrics
- Bearer token authentication
All configuration is done via CLI flags, which works well for straightforward setups. The fork builds on top of this foundation with experimental and production features that I wanted to have available for my own use cases.
What the fork adds
Here's the full list of additions. These are features I've been experimenting with or needed for specific deployments — some are battle-tested in production, others are still evolving:
| Feature | Upstream | Fork |
|---|---|---|
| YAML config file (--config-file) | - | Full YAML for all settings |
| Exact-match response cache | - | FNV-1a hash, DashMap, TTL + LRU |
| Semantic similarity cache | - | Cosine similarity via embeddings |
| Semantic cluster routing | - | Route by prompt content to worker groups |
| Anthropic Messages API | - | POST /v1/messages with streaming |
| Graceful worker drain | - | POST /admin/drain |
| Hot config reload | - | POST /admin/reload |
| Per-worker API keys | - | Each backend gets its own Bearer token |
| Redis cache backend | - | Shared cache across router instances |
| Inbound API key auth | - | Static Bearer token for all /v1/* |
| Sticky sessions with failover | - | DashMap TTL + ring walk on failure |
| /v1/completions, /v1/embeddings, /v1/rerank | - | Full proxy + streaming |
| SentencePiece tokenizer | - | Via system libsentencepiece |
| Per-routing Prometheus metrics | - | Worker, cluster, fallback counters |
| INFO-level routing logs | - | Model, worker, method, status, duration |
The rest of this article covers the most impactful features in detail.
YAML configuration
The upstream router requires all settings as CLI flags. A typical production deployment ends up looking like this:
vllm-router \
--policy cache_aware \
--vllm-pd-disaggregation \
--prefill http://prefill-1:8000 http://prefill-2:8000 \
--decode http://decode-1:8000 http://decode-2:8000 \
--bearer-token-file /etc/secrets/token \
--metrics-port 29000 \
--health-check-interval 60 \
--circuit-breaker-failure-threshold 5 \
--retry-count 3
In the fork, the same deployment is a single YAML file:
host: "0.0.0.0"
port: 8090
log_level: info
mode:
  type: pd_disaggregation
  prefill_urls:
    - "http://prefill-1:8000"
    - "http://prefill-2:8000"
  decode_urls:
    - "http://decode-1:8000"
    - "http://decode-2:8000"
  prefill_policy:
    type: power_of_two
    load_check_interval_secs: 10
  decode_policy:
    type: consistent_hash
    virtual_nodes: 160
metrics:
  host: "0.0.0.0"
  port: 29000
health_check:
  check_interval_secs: 60
  timeout_secs: 5
  failure_threshold: 3
  success_threshold: 2
  endpoint: /health
vllm-router --config-file configs/pd-disagg.yaml
This is better for version control, easier to review in PRs, and you can template it for Kubernetes ConfigMaps. The config also enables features that would be impractical to express as CLI flags, like semantic cluster definitions or per-worker API key maps.
Two-level response caching
This is the feature I built first because the use case is so common: the same prompt (or a nearly identical one) gets sent to your inference cluster hundreds of times. Without caching, every single request triggers a full inference pass. The fork adds a two-level cache pipeline:
Request → exact-match cache (FNV-1a hash): on a hit, return the cached response; on a miss → semantic cache (cosine similarity): on a match above the threshold, return the cached response; on a miss → route to a vLLM worker → store the response in both caches → return the response to the client.
Level 1: exact-match cache
The first layer hashes the request body (after stripping non-deterministic fields like stream, user, and request_id) using FNV-1a. If an identical request was seen before and the cached response hasn't expired, it returns immediately without touching any backend worker.
cache:
  backend: memory
  max_entries: 2048
  ttl_secs: 120
That's it. This alone can save significant compute if your workload has any repetition — think shared system prompts, common user questions, or automated pipelines that retry the same call.
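To make the first layer concrete, here is a minimal Python sketch of the idea. It is not the fork's Rust implementation. It assumes the key is a 64-bit FNV-1a hash over a sorted-key JSON encoding of the body with the non-deterministic fields removed; the canonicalization step and the names `cache_key` and `ExactCache` are hypothetical.

```python
import json
import time
from collections import OrderedDict

FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
# Fields the article lists as non-deterministic; stripped before hashing.
IGNORED_FIELDS = ("stream", "user", "request_id")


def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def cache_key(body: dict) -> int:
    """Hash the body without the ignored fields (sorted-key JSON is an assumption)."""
    stable = {k: v for k, v in body.items() if k not in IGNORED_FIELDS}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return fnv1a_64(canonical.encode("utf-8"))


class ExactCache:
    """Toy TTL + LRU cache keyed by request hash (hypothetical helper)."""

    def __init__(self, max_entries=2048, ttl_secs=120.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_secs = ttl_secs
        self.clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, response)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, response = item
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return response

    def put(self, key, response):
        self._entries[key] = (self.clock() + self.ttl_secs, response)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
```

Two bodies that differ only in `user` or `request_id` map to the same key here, which is what lets retries from automated pipelines hit the cache.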
Level 2: semantic cache
The second layer handles the case where prompts aren't identical but are semantically equivalent. "Explain what a Transformer is" and "What is a Transformer model?" should probably return the same cached response.
The semantic cache embeds each request using an OpenAI-compatible embeddings endpoint (like Infinity or a vLLM instance serving an embedding model) and compares it against stored embeddings using cosine similarity:
semantic_cache:
  embeddings_url: "http://localhost:8030"
  embeddings_model: "BAAI/bge-small-en-v1.5"
  threshold: 0.95
  max_entries: 1024
  ttl_secs: 300
The threshold parameter controls how similar two prompts need to be. At 0.95, only near-paraphrases match. At 0.80, you'll get broader matches but risk returning irrelevant cached responses. Start high and tune down.
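The threshold check itself is small. The sketch below shows one way the lookup could work once the request embedding is available; the linear scan, the `(embedding, response)` entry layout, and the function names are assumptions for illustration, and TTL handling is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query_embedding, entries, threshold=0.95):
    """Return the response of the most similar entry at or above the threshold.

    `entries` is a list of (embedding, response) pairs; scanning all of them
    is an assumption made for clarity, not a claim about the fork.
    """
    best_score, best_response = -1.0, None
    for embedding, response in entries:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

With toy 2-D vectors, a query at a small angle to a stored embedding clears 0.95, while a query at 45 degrees scores about 0.71 and only matches once the threshold drops to around 0.70.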
Redis backend
For multi-instance deployments, in-memory caching means each router instance builds its own cache independently. The fork supports Redis as a shared backend:
cache:
  backend: redis
  max_entries: 2048
  ttl_secs: 120
  redis:
    url: "redis://127.0.0.1:6379/0"
    pool_size: 8
    key_prefix: "vllm-router:"
    connection_timeout_ms: 3000
    command_timeout_ms: 500
This requires building with --features redis-cache. The cache degrades gracefully — if Redis is unreachable, the router treats it as a cache miss and forwards to the backend normally.
Important: streaming responses are never cached. Only non-streaming requests go through the cache pipeline.
Semantic cluster routing
This is the most interesting routing feature in the fork. Instead of routing purely by load or session affinity, semantic cluster routing routes requests by what the user is asking about.
The idea: you define clusters of workers, each specialized (or simply allocated) for a domain. You provide example prompts for each cluster. At startup, the router embeds these examples and computes a centroid vector per cluster. When a request arrives, the router embeds it and routes to the cluster whose centroid is closest, provided the similarity exceeds a threshold.
Here's the configuration:
semantic_cluster:
  embeddings_url: "http://localhost:8030"
  embeddings_model: "BAAI/bge-small-en-v1.5"
  threshold: 0.70
  embedding_timeout_ms: 2000
  clusters:
    - name: coding
      workers:
        - "http://worker-code-1:8000"
        - "http://worker-code-2:8000"
      examples:
        - "Write a Python function to sort a list"
        - "How do I implement a binary search tree in Rust?"
        - "Debug this JavaScript code that throws a TypeError"
        - "Implement a REST API endpoint in FastAPI"
    - name: science
      workers:
        - "http://worker-sci-1:8000"
      examples:
        - "Explain the process of photosynthesis"
        - "What is the difference between mitosis and meiosis?"
        - "Describe Newton's laws of motion"
        - "How does quantum entanglement work?"
If no cluster matches above the threshold, the request falls through to the default load balancing policy (round robin, consistent hash, etc.). The router also sets x-semantic-cluster-id and other x-semantic-* headers on matched requests, which get propagated to vLLM workers.
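The centroid-and-threshold logic described in this section can be sketched in a few lines of Python. The toy 2-D vectors stand in for real embeddings, and `centroid`, `cosine`, and `pick_cluster` are hypothetical names; the fork's actual code may differ.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def pick_cluster(request_embedding, centroids, threshold=0.70):
    """Return the best-matching cluster name, or None to use the default policy."""
    best_name, best_score = None, -1.0
    for name, center in centroids.items():
        score = cosine(request_embedding, center)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Computed once at startup from each cluster's embedded example prompts.
centroids = {
    "coding": centroid([[0.9, 0.1], [1.0, 0.0]]),
    "science": centroid([[0.1, 0.9], [0.0, 1.0]]),
}
```

A `None` result corresponds to the fall-through case: the request goes to whatever default load balancing policy is configured.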
When this is useful: multi-tenant deployments where different teams share a cluster but want their traffic isolated; workloads where different prompt types benefit from different LoRA adapters or model configurations; or simply organizing traffic by domain for better monitoring.
Anthropic Messages API support
If your clients use the Anthropic SDK, you no longer need a separate translation layer. The fork natively accepts Anthropic's Messages API format and translates it to OpenAI format before forwarding to vLLM:
curl http://router:3000/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: your-key" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "What is KV caching in LLMs?"}
]
}'
This includes full streaming support with Anthropic's SSE format. The response comes back in Anthropic format — the client never knows it's talking to vLLM.
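For a sense of what the request-side translation involves, here is a rough Python sketch covering text-only messages. The field names on both sides come from the public Anthropic Messages and OpenAI Chat Completions formats; the function names are hypothetical, and tool use, images, and the response-side mapping are deliberately left out.

```python
def _text_of(content):
    """Anthropic content is a string or a list of blocks; keep only text blocks."""
    if isinstance(content, str):
        return content
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")


def anthropic_to_openai(request):
    """Map an Anthropic Messages request to an OpenAI chat completions request."""
    messages = []
    # Anthropic carries the system prompt at the top level, OpenAI as a message.
    if request.get("system"):
        messages.append({"role": "system", "content": _text_of(request["system"])})
    for message in request["messages"]:
        messages.append(
            {"role": message["role"], "content": _text_of(message["content"])}
        )
    out = {
        "model": request["model"],
        "messages": messages,
        "max_tokens": request["max_tokens"],
    }
    if "stop_sequences" in request:
        out["stop"] = request["stop_sequences"]  # same meaning, different name
    for key in ("temperature", "top_p", "stream"):
        if key in request:
            out[key] = request[key]
    return out
```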
Admin API: graceful drain and hot reload
Graceful worker drain
When you need to take a worker offline for maintenance or scaling down, you don't want to kill in-flight requests. The drain endpoint stops sending new traffic to a worker, waits for active requests to finish, and then removes it:
# Start draining worker
curl -X POST http://router:3000/admin/drain \
-H "Authorization: Bearer admin-secret" \
-H "Content-Type: application/json" \
-d '{"url": "http://worker-1:8000", "timeout_secs": 300}'
# Check drain status
curl "http://router:3000/admin/drain/status?url=http://worker-1:8000" \
-H "Authorization: Bearer admin-secret"
If the in-flight requests don't complete within the timeout, the worker is force-removed. The GET /workers endpoint now includes a draining field per worker so you can see the full fleet status.
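The drain lifecycle boils down to three rules: stop accepting new requests, wait for in-flight requests to reach zero, and force removal at the deadline. A minimal Python model of that state, with a hypothetical `DrainingWorker` class and an injectable clock, might look like this; it is a sketch of the behavior described above, not the fork's code.

```python
import time


class DrainingWorker:
    """Tracks in-flight requests for one worker during a drain (hypothetical)."""

    def __init__(self, url, timeout_secs, clock=time.monotonic):
        self.url = url
        self.in_flight = 0  # updated by the router as requests start and finish
        self.draining = False
        self._timeout_secs = timeout_secs
        self._clock = clock
        self._deadline = None

    def start_drain(self):
        """Stop routing new traffic here and start the timeout countdown."""
        self.draining = True
        self._deadline = self._clock() + self._timeout_secs

    def accepts_new_requests(self):
        return not self.draining

    def can_remove(self):
        """True once drained, or once the timeout forces removal."""
        if not self.draining:
            return False
        return self.in_flight == 0 or self._clock() >= self._deadline
```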
Hot config reload
Change API keys, add or remove workers, adjust settings — all without restarting the router:
# Edit the YAML config, then:
curl -X POST http://router:3000/admin/reload \
-H "Authorization: Bearer admin-secret"
The router re-reads the YAML file, diffs the worker lists, gracefully drains any removed workers, and adds new ones. API keys are swapped atomically behind Arc<RwLock<>>. No downtime, no dropped connections.
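The worker-list diff at the center of a reload amounts to two set differences. The helper below is a hypothetical illustration of that step only; key swapping and the drain itself are out of scope.

```python
def diff_workers(current_urls, new_urls):
    """Return (to_add, to_drain) for a reload, each as a sorted list of URLs."""
    current, new = set(current_urls), set(new_urls)
    to_add = sorted(new - current)    # present only in the reloaded config
    to_drain = sorted(current - new)  # removed from the config: drain, then drop
    return to_add, to_drain
```

Workers that appear in both lists are untouched, which is why a reload does not disturb their in-flight requests.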
Per-worker API keys
In multi-provider setups, different backend workers may require different authentication tokens. The upstream router only supports a single global API key. The fork adds per-worker key mapping:
api_key: "default-key-for-most-workers"
worker_api_keys:
  "http://worker-provider-a:8000": "sk-provider-a-secret"
  "http://worker-provider-b:8000": "sk-provider-b-secret"
The priority chain is: per-worker key (highest) > global api_key > OPENAI_API_KEY env var (PD mode only) > no Authorization header. This applies to all routing modes: regular, PD disaggregation, and OpenAI proxy.
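Written out as code, the priority chain is a short cascade. This Python sketch mirrors the order stated above; the function names and the `env` parameter (there so the lookup can be exercised without touching the real environment) are hypothetical.

```python
import os


def resolve_outbound_key(worker_url, worker_api_keys, global_api_key, pd_mode, env=None):
    """Pick the key for one backend request, following the stated priority."""
    env = os.environ if env is None else env
    if worker_url in worker_api_keys:  # 1. per-worker key wins
        return worker_api_keys[worker_url]
    if global_api_key:  # 2. global api_key
        return global_api_key
    if pd_mode and env.get("OPENAI_API_KEY"):  # 3. env var, PD mode only
        return env["OPENAI_API_KEY"]
    return None  # 4. no Authorization header


def auth_headers(key):
    """Build the outbound header dict; empty when no key applies."""
    return {"Authorization": f"Bearer {key}"} if key else {}
```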
Cache-aware routing with tunable parameters
The upstream router has a cache-aware policy, but the fork exposes every knob as a configuration parameter with sensible defaults:
policy:
  type: cache_aware
  # Minimum cached prefix ratio to prefer a worker
  cache_threshold: 0.5
  # Absolute request count difference to force rebalancing
  balance_abs_threshold: 32
  # Relative load ratio to force rebalancing
  balance_rel_threshold: 1.1
  # How often to prune the prefix tree (seconds)
  eviction_interval_secs: 30
  # Maximum nodes in the prefix tree per worker
  max_tree_size: 10000
The key insight is the dual-mode behavior:
- When load is balanced: the policy maximizes cache hits. Requests go to whichever worker has the longest matching prefix.
- When load is imbalanced (one worker has balance_abs_threshold more requests than another, or balance_rel_threshold times the load): the policy switches to load-based selection regardless of cache state.
This prevents the common failure mode where cache-aware routing creates hot spots by always sending similar prompts to the same overloaded worker.
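The dual-mode decision can be sketched as follows. This is a simplified Python model under several assumptions: prefixes are compared character by character against a plain list of earlier prompts instead of a prefix tree, the two thresholds are combined with "or" to follow the wording above (the fork's source is the authority on the exact rule), and a best match below `cache_threshold` falls back to the least-loaded worker. `pick_worker` and `common_prefix_len` are hypothetical names.

```python
def common_prefix_len(a, b):
    """Number of leading characters two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(prompt, cached_prompts, loads, cache_threshold=0.5,
                balance_abs_threshold=32, balance_rel_threshold=1.1):
    """Choose a worker URL or name.

    cached_prompts: worker -> list of earlier prompts (stand-in for the prefix tree)
    loads: worker -> current in-flight request count
    """
    least_loaded = min(loads, key=loads.get)
    busiest, idlest = max(loads.values()), min(loads.values())
    imbalanced = (busiest - idlest > balance_abs_threshold
                  or busiest > idlest * balance_rel_threshold)
    if imbalanced or not prompt:
        return least_loaded  # load-based mode ignores cache state

    def match_ratio(worker):
        longest = max(
            (common_prefix_len(prompt, p) for p in cached_prompts.get(worker, [])),
            default=0,
        )
        return longest / len(prompt)

    best = max(loads, key=match_ratio)
    # Cache mode: prefer the longest prefix match if it is long enough.
    return best if match_ratio(best) >= cache_threshold else least_loaded
```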
PD disaggregation with independent policies
A significant improvement over upstream: you can set different load balancing policies for prefill and decode pools:
mode:
  type: pd_disaggregation
  prefill_urls:
    - "http://prefill-1:8081"
    - "http://prefill-2:8081"
  decode_urls:
    - "http://decode-1:8083"
    - "http://decode-2:8083"
  prefill_policy:
    type: power_of_two
    load_check_interval_secs: 10
  decode_policy:
    type: consistent_hash
    virtual_nodes: 160
This matters because prefill and decode have fundamentally different routing needs. Prefill workers are stateless between turns — any load-balancing policy works, and power_of_two avoids hot spots under variable prompt lengths. Decode workers hold the KV cache across turns — consistent_hash pins each session to the same worker, preserving the accumulated context.
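To show why consistent hashing pins a session to one decode worker, here is a small hash ring with virtual nodes and a clockwise walk past unhealthy workers. It is an illustrative sketch: SHA-256 as the ring hash, the `worker#i` virtual-node naming, and the `HashRing` class are all assumptions, not the fork's implementation.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, workers, virtual_nodes=160):
        # Each worker is placed on the ring `virtual_nodes` times to even out load.
        self._ring = sorted(
            (self._hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def pick(self, session_id, healthy=None):
        """Walk clockwise from the session's point to the first healthy worker."""
        if not self._ring:
            return None
        start = bisect.bisect_left(self._points, self._hash(session_id))
        for step in range(len(self._ring)):
            worker = self._ring[(start + step) % len(self._ring)][1]
            if healthy is None or worker in healthy:
                return worker
        return None  # no healthy worker left
```

The same session ID always lands on the same ring position, so it keeps hitting the worker that holds its KV cache until that worker drops out of the healthy set.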
The router encodes the selected prefill and decode addresses directly in the vLLM request ID:
___prefill_addr_<host:port>___decode_addr_<host:port>_<uuid>
This tells vLLM where to transfer the KV cache via the NIXL connector (UCX/GDS), without any out-of-band coordination.
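A quick way to see how much information the request ID carries is to build one and parse it back. The Python sketch below follows the format shown above; the 32-character hex UUID suffix and both function names are assumptions made for the example.

```python
import re
import uuid

# Prefill address is matched lazily up to the decode marker; the decode
# address is everything before the final "_<uuid>" suffix.
_REQUEST_ID = re.compile(
    r"^___prefill_addr_(?P<prefill>.+?)___decode_addr_(?P<decode>.+)_(?P<uuid>[0-9a-f]{32})$"
)


def build_request_id(prefill_addr, decode_addr):
    """Encode both host:port pairs into the request ID."""
    return (
        f"___prefill_addr_{prefill_addr}"
        f"___decode_addr_{decode_addr}_{uuid.uuid4().hex}"
    )


def parse_request_id(request_id):
    """Recover (prefill_addr, decode_addr) from a request ID."""
    match = _REQUEST_ID.match(request_id)
    if match is None:
        raise ValueError(f"not a PD request id: {request_id!r}")
    return match["prefill"], match["decode"]
```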
Authentication layers
The fork adds three authentication layers that the upstream router doesn't have:
Inbound (client to router):
# Static API key for all /v1/* endpoints
inbound_api_key: "sk-my-inference-key"
# Admin endpoints get their own key
admin_api_key: "sk-admin-secret"
Health endpoints (/health, /liveness, /readiness) are exempt from authentication — Kubernetes probes work without tokens.
Outbound (router to workers): per-worker keys as described above.
Embeddings endpoint: separate key for the embedding service used by semantic cache and cluster routing:
semantic_cache:
  embeddings_api_key: "sk-embed-secret"
Getting the router
Pre-built releases
Every tagged version triggers a GitHub Actions workflow that builds a Linux amd64 binary, packages it with configs, docs, and scripts, and publishes it as a GitHub Release:
# Download the latest release
curl -LO https://github.com/bet0x/vllm-router/releases/latest/download/vllm-router-v0.6.5-linux-amd64.tar.gz
tar xzf vllm-router-v0.6.5-linux-amd64.tar.gz
./vllm-router-v0.6.5-linux-amd64/vllm-router --config-file configs/round-robin.yaml
Docker images
Docker images are built automatically on every release and pushed to both Docker Hub and GitHub Container Registry:
# Docker Hub
docker pull barrahome/vllm-router:latest
docker pull barrahome/vllm-router:v0.6.5
# GitHub Container Registry
docker pull ghcr.io/bet0x/vllm-router:latest
docker pull ghcr.io/bet0x/vllm-router:v0.6.5
Run it with your config file mounted:
docker run -p 3000:3000 -p 29000:29000 \
-v /path/to/config.yaml:/config.yaml \
barrahome/vllm-router:latest --config-file /config.yaml
Build from source
If you need the Redis cache feature or want to modify the code:
# Prerequisites (Ubuntu/Debian)
sudo apt-get install -y protobuf-compiler libprotobuf-dev libsentencepiece-dev
# Standard build
cargo build --release
# With Redis cache support
cargo build --release --features redis-cache
Quick test
# Start the router with round robin
vllm-router --config-file configs/round-robin.yaml
# Send a request
curl http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Practical takeaways
- Start with the YAML config. It's easier to manage, version-control, and template for Kubernetes. Every configs/*.yaml file is a working example.
- Enable the exact-match cache immediately. Even a small cache with short TTL saves significant compute if your workload has any repetition. It costs almost nothing to turn on.
- Use semantic caching carefully. It adds latency (embedding call per request on cache miss). Only worth it if you have high prompt similarity and the embedding service is fast and local.
- Semantic cluster routing is for organization, not performance. It adds ~2ms per request for the embedding lookup. The value is in traffic isolation and specialized worker allocation, not raw speed.
- Use separate PD policies. power_of_two for prefill, consistent_hash for decode. This is the recommended production configuration for multi-turn workloads.
- Set up the admin API key. The drain and reload endpoints are powerful; protect them.
- Monitor with Prometheus. The fork exports per-routing-decision metrics. Use them to understand cache hit rates, cluster routing decisions, and worker load distribution.
Feedback
If you're using this fork — or considering it — I'd genuinely like to hear from you. Bug reports, feature requests, questions, and general feedback are all welcome on the GitHub issues page. I'm especially interested in hearing about real-world deployments: what worked, what didn't, and what features would make it more useful for your setup.
Sources
- Fork repository — full source code, documentation, and example configs
- Upstream vLLM router — the original project this fork extends
- vLLM project — the inference engine
- Previous article: vLLM router and PD disaggregation — background on why cache-aware routing matters