
LMCache + Redis: Distributed KV Cache for Enterprise LLM Inference

Tags: ai, llm, vllm, redis, kv-cache, optimization, inference


In my previous article about vLLM router and PD disaggregation, I discussed how prefix-cache-aware routing solves cache hit rate problems when distributing prefill and decode across different nodes. That article focused on where to route requests to maximize cache hits.

This article tackles a different but complementary problem: where to store the KV cache itself. Enter LMCache with Redis — a distributed KV cache layer that transforms how we manage, share, and reuse cached key-value tensors across LLM inference workloads.

The Problem: KV Cache is Stuck in GPU Memory

Traditional LLM inference keeps the entire KV cache in GPU memory during generation. This works fine for single requests, but creates several problems at scale:

Memory Pressure:

  • Long context (32K, 128K+ tokens) consumes massive GPU memory
  • Limits batch size and concurrent requests
  • Forces expensive GPU upgrades

No Cache Sharing:

  • Each vLLM instance has its own isolated cache
  • Repeated content (RAG passages, system prompts) gets recomputed
  • Multi-turn conversations can't reuse previous context

Wasted Computation:

  • Same document chunks processed repeatedly across sessions
  • Customer support bots re-encode identical policy text
  • RAG systems recompute embeddings for common passages

Limited Prefix Caching:

  • vLLM's native prefix cache only works within a single instance
  • Only matches tokens at the beginning of prompts
  • Doesn't help when repeated content appears mid-prompt or at the end

What is LMCache?

LMCache is an open-source KV cache management layer that extends LLM inference engines (vLLM, SGLang) with a multi-tier storage hierarchy for KV caches. Instead of keeping everything in GPU memory, LMCache can offload and share caches across:

  • GPU memory (active working set)
  • CPU DRAM (hot cache with pinned memory)
  • Local storage (NVMe, SSD)
  • Remote backends (Redis, Mooncake, S3, InfiniStore)

Key Innovation: Chunk-Level Caching

Unlike traditional prefix caching, LMCache operates at the chunk level (default: 256 tokens). This enables:

  • Position-independent matching — reuses chunks regardless of where they appear in the prompt
  • Cross-instance sharing — multiple vLLM workers share the same cache
  • Flexible granularity — configurable chunk sizes and overlap strategies

Example:

Traditional prefix cache:

Prompt A: "Return policy: You can return items within 30 days. How do I..."
Prompt B: "How do I process returns? Return policy: You can return items within 30 days."
          ❌ No cache hit — repeated text is not at the beginning

LMCache chunk-level cache:

Prompt A: "Return policy: You can return items within 30 days. How do I..."
          └─ Chunk hash: abc123 (cached)

Prompt B: "How do I process returns? Return policy: You can return items within 30 days."
          └─ Chunk hash: abc123 ✅ Cache hit!

Why Redis as the Storage Backend?

LMCache supports multiple storage backends (local CPU, disk, Redis, Mooncake, S3), but Redis stands out for production deployments:

Low-Latency Retrieval:

  • Sub-millisecond lookups for cached chunks
  • Significantly faster than disk or S3
  • Production-proven at scale

Structured Storage:

  • Stores KV cache data + metadata separately
  • Supports filtering by model, temperature, format
  • Hash-based key structure for efficient lookups

Production-Ready Features:

  • Redis Sentinel for high availability
  • TTL management for cache freshness
  • Horizontal scaling with Redis Cluster
  • Monitoring and observability built-in

Ecosystem Integration:

  • Works with Redis Cloud, AWS ElastiCache, Azure Cache
  • Compatible with existing Redis infrastructure
  • Familiar operational tooling (redis-cli, monitoring)

Architecture Overview

LMCache Multi-Tier Storage Architecture
graph LR
  A[Request] --> B[Token Chunking]
  B --> C{Cache Lookup}
  C -->|Hit| D[Inject Cached KV]
  C -->|Miss| E[Compute KV]
  E --> F[GPU Memory]
  F --> G[CPU DRAM]
  G -->|Async| H[Local Disk]
  H -->|LRU| I[Redis]
  I --> J[Metadata]
  I --> K[KV Bytes]
  L[Future Request] --> M{Check Redis}
  M -->|Hit| N[Prefetch]
  N --> O[Restore GPU]
  O --> D
  style I fill:#c8a060,stroke:#8a6520,color:#fff
  style J fill:#5f9ea0,stroke:#3d6e70,color:#fff
  style K fill:#5f9ea0,stroke:#3d6e70,color:#fff

Data Flow Breakdown

1. Chunk Hashing:

  • Input tokens split into chunks (e.g., 256 tokens)
  • SHA-256 hash computed for each chunk
  • Hash used as lookup key across storage tiers (see the sketch below)
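
To make the idea concrete, here is a minimal, simplified sketch of content-addressed chunk keys (illustrative only; LMCache's actual key derivation also incorporates the model name, worker layout, and token-prefix context, so don't treat this as the exact algorithm):

import hashlib
from typing import List

CHUNK_SIZE = 256  # tokens per chunk (LMCache default)

def chunk_hashes(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split a token sequence into fixed-size chunks and hash each one."""
    keys = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Content-addressed key: identical chunks produce identical hashes,
        # regardless of where they appear in the prompt
        digest = hashlib.sha256(str(chunk).encode("utf-8")).hexdigest()
        keys.append(digest)
    return keys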

2. Storage Hierarchy:

Tier          Latency    Capacity      Use Case
GPU Memory    <1 µs      24-80 GB      Active generation
CPU DRAM      ~10 µs     128 GB-2 TB   Hot cache (pinned memory)
Local NVMe    ~100 µs    1-10 TB       Recent sessions
Redis         ~1 ms      Unlimited     Shared across cluster

3. Cache Operations:

  • Offload: GPU → CPU (frees GPU memory for new requests)
  • Evict: CPU → Redis (LRU policy, async writes)
  • Prefetch: Redis → CPU (on cache hit)
  • Restore: CPU → GPU (zero-copy DMA transfer)

Redis Storage Structure

LMCache stores each cached chunk as two separate Redis entries:

Key Format

{model_name}@{world_size}@{worker_id}@{chunk_hash}

Example base key:

meta-llama/Llama-3.1-70B@1@0@a7f3b2c9d1e4f5a6b8c9d0e1f2a3b4c5

Each chunk has two entries:

  1. Metadata entry (base key, no suffix)
  2. KV bytes entry (base key + _bytes suffix)

Metadata Entry (Redis Hash)

# Metadata stored at base key
HGETALL "meta-llama/Llama-3.1-70B@1@0@abc123"
1) "model_name"
2) "meta-llama/Llama-3.1-70B"
3) "format"
4) "naive"
5) "world_size"
6) "1"
7) "worker_id"
8) "0"

KV Bytes Entry (Binary Blob)

# KV cache data stored with _bytes suffix
GET "meta-llama/Llama-3.1-70B@1@0@abc123_bytes"
# Returns serialized tensor data (pickle by default, or CacheGen compressed)

This separation allows (the sketch after this list shows both lookups):

  • Fast metadata filtering without deserializing tensors
  • Efficient storage (only download KV bytes when needed)
  • Flexible serialization formats (naive vs. cachegen)
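
A quick way to poke at both entries from Python, as a sketch using redis-py against the key layout described above (the chunk hash abc123 is the same shortened placeholder used earlier):

import redis

r = redis.Redis(host="localhost", port=6379)

base_key = "meta-llama/Llama-3.1-70B@1@0@abc123"  # placeholder chunk key

# Metadata is a small Redis hash: cheap to read and filter on
print(r.hgetall(base_key))

# The serialized KV tensors live under the _bytes suffix; fetch only when needed
print("KV blob size:", r.memory_usage(base_key + "_bytes"), "bytes")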

Configuration: vLLM + LMCache + Redis

Step 1: Install Dependencies

# Install LMCache and vLLM
uv pip install lmcache vllm

# Or install from source
git clone https://github.com/LMCache/LMCache.git
cd LMCache
uv pip install -e . --no-build-isolation

Step 2: Start Redis Server

# Local Redis (development)
redis-server --port 6379

# Redis Sentinel (high availability)
redis-sentinel /etc/redis/sentinel.conf

Step 3: Configure LMCache

Option A: Environment Variables

export LMCACHE_CHUNK_SIZE=256
export LMCACHE_REMOTE_URL="redis://localhost:6379"
export LMCACHE_REMOTE_SERDE="naive"

Option B: Configuration File (lmcache_config.yaml)

chunk_size: 256
remote_url: "redis://localhost:6379"
remote_serde: "naive"
local_cpu: true
max_local_cpu_size: 5.0  # GB
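
The YAML file is not discovered automatically; in recent LMCache releases you typically point the engine at it through an environment variable (name per the LMCache docs; verify for your version):

export LMCACHE_CONFIG_FILE=/path/to/lmcache_config.yaml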

Step 4: Launch vLLM with LMCache

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768

Configuration Parameters:

  • kv_connector: "LMCacheConnectorV1" for vLLM v1 (latest)
  • kv_role: "kv_both" (read and write), "kv_producer" (write only), "kv_consumer" (read only)

Python API Example

import os
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Configure LMCache
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_REMOTE_URL"] = "redis://localhost:6379"
os.environ["LMCACHE_REMOTE_SERDE"] = "naive"

# Initialize vLLM with the LMCache connector
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
    gpu_memory_utilization=0.85,
    max_model_len=32768
)

# First request (cache miss - computes KV cache)
prompts = [
    "You can return your item within 30 days of purchase. How do I start a return?"
]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))

# Second request (cache hit - reuses cached chunk)
prompts = [
    "What's your policy? You can return your item within 30 days of purchase."
]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))
# ✅ The repeated chunk is found in Redis and reused

Performance Benchmarks

LMCache + Redis Performance Impact
graph LR
  subgraph Baseline[Baseline vLLM]
    A1[TTFT: 1.24s]
    A2[Throughput: 230 req/s]
    A3[Cost: X per 1K tokens]
  end
  subgraph WithLMCache[vLLM + LMCache + Redis]
    B1[TTFT: 0.18s]
    B2[Throughput: 420 req/s]
    B3[Cost: 0.36X per 1K tokens]
  end
  A1 -->|6.9x faster| B1
  A2 -->|82% increase| B2
  A3 -->|64% reduction| B3
  style B1 fill:#5f9ea0,stroke:#3d6e70,color:#fff
  style B2 fill:#5f9ea0,stroke:#3d6e70,color:#fff
  style B3 fill:#5f9ea0,stroke:#3d6e70,color:#fff

Real-World Results

Multi-Turn QA (Customer Support):

  • TTFT improvement: 6.9× faster (1.24s → 0.18s)
  • Throughput: +82% (230 → 420 req/s)
  • Cost reduction: 64% per 1K tokens

RAG Applications (Document Analysis):

  • Cache hit rate: 60-80% on repeated passages
  • GPU memory savings: 40% (more concurrent requests)
  • Latency reduction: 2× lower end-to-end latency

Long-Context Processing (128K tokens):

  • Prefill speedup: 3-10× for repeated document chunks
  • Memory efficiency: 50% reduction in GPU memory usage
  • Throughput: 2.3-14× higher at same TTFT (low QPS scenarios)

Use Cases: When to Use LMCache + Redis

✅ Ideal Scenarios

1. Multi-Turn Conversations

User: "What's the return policy?"
Agent: [Response with policy text]
User: "How long do I have?"
Agent: [Reuses cached policy chunk ✅]

2. RAG Applications

  • Same document chunks retrieved across different queries
  • Common passages in knowledge base
  • Repeated context in prompt templates

3. Customer Support Agents

  • Repeated policy statements
  • Standard procedure explanations
  • Common troubleshooting steps

4. Document Summarization

  • Overlapping content across documents
  • Repeated headers/footers
  • Common boilerplate text

5. Code Generation

  • Standard library imports
  • Common code patterns
  • Repeated documentation snippets

⚠️ Less Effective Scenarios

Unique, One-Off Queries:

  • Every prompt is completely different
  • No repeated content across sessions
  • Low cache hit rate → minimal benefit

Streaming with Very Short Outputs:

  • Redis latency (~1ms) may exceed compute time for tiny generations
  • Better to keep cache in GPU/CPU only

Latency-Critical Applications (<10ms SLA):

  • Redis network round-trip adds 1-3ms overhead
  • Consider local CPU DRAM only for ultra-low latency

Comparison: LMCache vs. Alternatives

Distributed KV Cache Solutions Comparison
Feature                  LMCache + Redis             Mooncake                   NVIDIA NIXL
Cache Granularity        Chunk-level (256 tokens)    Chunk-level                Page-level
Storage Backend          Redis, S3, NFS, local       Distributed memory pool    GPU RDMA fabric
Latency                  ~1 ms (Redis)               ~500 µs (RDMA)             <5 µs (RDMA)
Multi-Node Support       Yes                         Yes                        Yes (same datacenter)
Cross-Datacenter         Yes (geo-replication)       Limited                    No
vLLM Integration         Native connector            Native connector           Native support
SGLang Support           Yes                         Yes                        Limited
Storage Capacity         Unlimited (Redis/S3)        Limited by cluster RAM     Limited by GPU memory
Operational Complexity   Low (standard Redis)        Medium (custom cluster)    High (RDMA fabric)
Cost                     Low (commodity Redis)       Medium (RAM expensive)     High (RDMA NICs)

Key Differences

LMCache + Redis:

  • Best for: Cross-datacenter deployments, long-term cache persistence, multi-cloud
  • Trade-off: Higher latency (~1ms) vs. RDMA solutions

Mooncake:

  • Best for: Single-datacenter deployments with massive memory pools
  • Trade-off: More complex setup, requires dedicated memory nodes
  • Note: LMCache and Mooncake have partnered — you can use Mooncake as an LMCache backend

NVIDIA NIXL:

  • Best for: Single-node multi-GPU with ultra-low latency (<5μs)
  • Trade-off: Limited to RDMA-connected GPUs, doesn't scale across datacenters

Hybrid Approach

You can combine multiple backends:

# lmcache_config.yaml
chunk_size: 256

# Tier 1: CPU DRAM (hot cache)
local_cpu: true
max_local_cpu_size: 10.0  # GB

# Tier 2: Local NVMe (recent sessions)
local_storage: "/nvme/lmcache"
max_local_storage_size: 100.0  # GB

# Tier 3: Redis (shared across cluster)
remote_url: "redis://redis-cluster:6379"
remote_serde: "naive"

Production Considerations

High Availability

Redis Sentinel Configuration:

# sentinel.conf
sentinel monitor lmcache-redis redis-master 6379 2
sentinel down-after-milliseconds lmcache-redis 5000
sentinel failover-timeout lmcache-redis 10000
sentinel parallel-syncs lmcache-redis 1

LMCache Connection String:

export LMCACHE_REMOTE_URL="redis-sentinel://sentinel1:26379,sentinel2:26379,sentinel3:26379"
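
Before pointing LMCache at the Sentinels, it's worth verifying that the quorum actually resolves a master. A small redis-py check (the service name lmcache-redis matches the sentinel.conf above; the hostnames are the same placeholders):

from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel1", 26379), ("sentinel2", 26379), ("sentinel3", 26379)],
    socket_timeout=0.5,
)

# Resolve the current master for the monitored service and ping it
host, port = sentinel.discover_master("lmcache-redis")
print(f"Current master: {host}:{port}")

master = sentinel.master_for("lmcache-redis", socket_timeout=0.5)
print("PING:", master.ping())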

Memory Management

Redis Memory Limits:

# redis.conf
maxmemory 50gb
# Evict least-recently-used keys once the memory limit is reached
maxmemory-policy allkeys-lru

LMCache TTL (Optional):

LMCache doesn't set a Redis TTL by default, so entries persist until Redis evicts them (via maxmemory-policy) or you delete them. If you want explicit expiration:

# Custom wrapper to set a TTL on cached entries (run periodically, e.g. from cron)
import redis

r = redis.Redis(host='localhost', port=6379)
for key in r.scan_iter("*_bytes"):
    r.expire(key, 86400)                      # 24-hour TTL on the KV bytes entry
    r.expire(key[:-len(b"_bytes")], 86400)    # matching TTL on the metadata entry

Monitoring & Observability

Key Metrics to Track:

# Cache hit rate
redis-cli INFO stats | grep keyspace_hits
redis-cli INFO stats | grep keyspace_misses

# Memory usage
redis-cli INFO memory | grep used_memory_human

# LMCache-specific keys (SCAN avoids blocking Redis the way KEYS does)
redis-cli --scan --pattern "*_bytes" | wc -l                       # KV cache entries
redis-cli --scan --pattern "*@*@*@*" | grep -v "_bytes" | wc -l    # metadata entries

Prometheus Metrics:

  • lmcache_cache_hit_rate — percentage of cache hits
  • lmcache_gpu_to_cpu_transfer_bytes — offload volume
  • lmcache_redis_latency_ms — Redis round-trip time
  • vllm_queue_depth — request backlog (indicates cache benefit)

Security

Redis Authentication:

# redis.conf
requirepass your_secure_password

# LMCache config
export LMCACHE_REMOTE_URL="redis://:your_secure_password@localhost:6379"

Network Segmentation:

  • Keep Redis on private network (no public exposure)
  • Use TLS for Redis connections: rediss:// (note the extra 's'); see the example below
  • Firewall rules: only allow vLLM nodes to access Redis
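
For example, a TLS-encrypted, authenticated connection string (assuming your LMCache build hands the URL to a TLS-capable Redis client; redis.internal is a placeholder hostname, so verify against your deployment):

# Note the extra 's' in rediss:// and the password before the host
export LMCACHE_REMOTE_URL="rediss://:your_secure_password@redis.internal:6379"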

Debugging: Inspecting Redis Cache

Connect to Redis

redis-cli -h localhost -p 6379

List LMCache Entries

# Run these from a shell (the pipes below won't work inside an interactive
# redis-cli session); --scan also avoids the blocking behavior of KEYS

# Count KV cache entries (with _bytes suffix)
redis-cli --scan --pattern "*_bytes" | wc -l

# Count metadata entries (base keys without _bytes)
# Note: this requires filtering, as base keys don't have a specific suffix
redis-cli --scan --pattern "*@*@*@*" | grep -v "_bytes" | wc -l

# Sample KV cache keys
redis-cli --scan --pattern "*_bytes" | head -5

# Sample metadata keys (showing the pattern for one model)
redis-cli --scan --pattern "meta-llama*" | grep -v "_bytes" | head -5

Examine Metadata

# Get metadata for a specific chunk (base key, no suffix)
HGETALL "meta-llama/Llama-3.1-70B@1@0@abc123"

Check Memory Usage

# Memory used by KV cache entry
MEMORY USAGE "meta-llama/Llama-3.1-70B@1@0@abc123_bytes"

# Memory used by metadata entry
MEMORY USAGE "meta-llama/Llama-3.1-70B@1@0@abc123"

# Total Redis memory
INFO memory

Delete Stale Entries

# Delete all LMCache KV cache entries (reset cache)
redis-cli --scan --pattern "*_bytes" | xargs -r -n 100 redis-cli DEL

# Delete all metadata entries (use pattern matching carefully)
redis-cli --scan --pattern "*@*@*@*" | grep -v "_bytes" | xargs -r -n 100 redis-cli DEL

How LMCache + Redis Relates to vLLM Router

In my previous article about vLLM router PD disaggregation, I explained how prefix-cache-aware routing solves the problem of low cache hit rates when distributing prefill and decode across different nodes.

Here's how they work together:

The Complete Picture

┌─────────────────────────────────────────────────────┐
│ vLLM Router (NVIDIA Dynamo / vllm-project/router)  │ ← Routing Layer
│ - Decides which node handles prefill/decode         │
│ - Uses consistent hashing for cache locality        │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│ LMCache + Redis                                      │ ← Storage Layer
│ - Stores KV cache chunks in Redis                   │
│ - Shares cache across all vLLM instances            │
│ - Chunk-level reuse (position-independent)          │
└─────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────┐
│ vLLM Instances (Prefill + Decode Workers)           │ ← Inference Layer
│ - Fetch cached chunks from Redis when available     │
│ - Offload new chunks to Redis for future reuse      │
└─────────────────────────────────────────────────────┘

Complementary Benefits

vLLM Router solves:

  • Routing decision: "Which node should handle this request?"
  • Cache locality: "Send prefill to nodes that already have the prefix cached"

LMCache + Redis solves:

  • Storage scalability: "Where do we store KV cache beyond GPU memory?"
  • Cross-instance sharing: "How do we share cache across all vLLM workers?"
  • Chunk-level reuse: "How do we match repeated content anywhere in the prompt?"

Combined Architecture

You can use both together for maximum efficiency:

  1. LMCache stores all KV chunks in Redis (shared storage layer)
  2. vLLM Router routes requests using consistent hashing (cache-aware routing)
  3. Cache hits increase because:
    • Router sends similar requests to the same node (locality)
    • LMCache shares chunks across nodes via Redis (global cache)
    • Chunks match position-independently (flexible reuse)

Example:

Request A: "Return policy: You can return items within 30 days."
  ├─ Router → sends to Node 1 (based on prefix hash)
  ├─ Node 1 computes KV cache
  └─ LMCache stores chunk in Redis

Request B: "How do I process returns? Return policy: You can return items within 30 days."
  ├─ Router → might send to Node 2 (different prefix)
  ├─ Node 2 checks LMCache
  └─ ✅ Cache hit from Redis (chunk stored by Node 1)

Practical Takeaways

Use LMCache + Redis when:

✅ You have repeated content across prompts (RAG, multi-turn chat, support bots)
✅ You run multiple vLLM instances and want cache sharing
✅ You need long-term cache persistence beyond single sessions
✅ GPU memory is a bottleneck (long context, high concurrency)
✅ You can tolerate ~1-3ms additional latency for cache lookups

Skip LMCache + Redis when:

❌ Every prompt is unique (no repeated content)
❌ Ultra-low latency is critical (<10ms SLA)
❌ You have a single vLLM instance with plenty of GPU memory
❌ Content changes frequently (low cache reuse rate)

Optimization Tips

1. Tune Chunk Size:

  • Smaller chunks (128 tokens) → higher reuse rate, more storage overhead
  • Larger chunks (512 tokens) → lower reuse rate, less storage overhead
  • Default 256 tokens is a good starting point

2. Use Local CPU DRAM First:

  • Configure max_local_cpu_size to use pinned memory before Redis
  • Significantly reduces Redis load for hot cache

3. Monitor Cache Hit Rate:

  • Track keyspace_hits / (keyspace_hits + keyspace_misses), as in the snippet below
  • If hit rate < 30%, LMCache may not be beneficial
  • Investigate prompt patterns to increase reuse
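
A small helper for that calculation, as a sketch using redis-py (note these counters cover every keyspace lookup on the instance, so run LMCache on a dedicated Redis instance or database for clean numbers):

import redis

r = redis.Redis(host="localhost", port=6379)
stats = r.info("stats")

hits = stats["keyspace_hits"]
misses = stats["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0

print(f"Cache hit rate: {hit_rate:.1%}")
if hit_rate < 0.30:
    print("Hit rate below 30%: LMCache may not be paying for itself here")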

4. Combine with Prompt Engineering:

  • Place reusable content (system prompts, guidelines) in consistent positions
  • Use templates to standardize repeated sections (see the sketch below)
  • Increases chunk match probability
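
As a simple illustration, a template that keeps the reusable text in a fixed position so it produces identical chunks on every request (RETURN_POLICY and build_prompt are just placeholder names):

# Reusable content first, byte-identical across requests, so its chunks hash the same
RETURN_POLICY = "Return policy: You can return items within 30 days of purchase."

def build_prompt(user_question: str) -> str:
    return f"{RETURN_POLICY}\n\nCustomer question: {user_question}"

prompt_a = build_prompt("How do I start a return?")
prompt_b = build_prompt("Do I need the original packaging?")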

Conclusion

LMCache with Redis transforms KV cache from a single-instance, ephemeral resource into a distributed, shared, persistent layer that scales across your entire LLM infrastructure.

By combining chunk-level caching with Redis as a storage backend, you get:

  • 3-10× faster TTFT on repeated content
  • 60-80% cache hit rates in real-world RAG/chat applications
  • 40-50% GPU memory savings (more concurrent requests)
  • Cross-instance cache sharing (no more isolated caches)

When paired with prefix-cache-aware routing (from my previous article), you achieve both optimal routing decisions and efficient cache storage — the complete solution for distributed LLM inference at scale.
