AI workloads are expensive: every uncached LLM call costs real money and adds hundreds of milliseconds of latency, often more. A good caching strategy can cut bills by 40-70% and make your product feel instant. This guide covers the layered cache patterns we use in production AI backends.
What Makes AI Caching Different
- Inputs are often near-duplicates, not exact matches (paraphrases, typos)
- Outputs are large and stochastic — temperature > 0 means non-determinism
- Costs scale with input and output tokens, not request count
- Freshness needs vary wildly: a translation can cache forever; a stock summary, seconds
Layer 1: Exact-Match Cache
The simplest and most reliable. Hash (model, version, prompt, params) and store the response in Redis or Memcached. Hit rates of 20-40% are realistic on chat workloads thanks to repeated questions and bot traffic.
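import hashlib
import json
import redis

r = redis.Redis()

def cache_key(model: str, prompt: str, params: dict) -> str:
    # Nest params under their own key so a param named "m" or "p" cannot
    # overwrite the model or prompt. `model` should include the version.
    payload = json.dumps({"m": model, "p": prompt, "params": params}, sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()

def cached_complete(model, prompt, **params):
    key = cache_key(model, prompt, params)
    if (hit := r.get(key)) is not None:
        return json.loads(hit)
    res = llm_call(model, prompt, **params)  # llm_call: placeholder for your provider client
    r.set(key, json.dumps(res), ex=86400)  # expire after 24 hours
    return res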
Layer 2: Semantic Cache
Catches paraphrases the exact-match cache misses. Embed the query, search a vector store, and serve the cached answer if similarity exceeds a threshold (typically 0.92+, though the right value depends on your embedding model and similarity metric).
def semantic_lookup(query: str, threshold: float = 0.92):
    # embed() and vector_index stand in for your embedding model and vector store.
    emb = embed(query)
    hits = vector_index.query(vector=emb, top_k=1, include_metadata=True)
    if hits and hits[0].score >= threshold:
        return hits[0].metadata["answer"]
    return None
Tune the threshold carefully. Too low and you serve wrong answers; too high and the cache barely fires.
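To make the lookup-and-store flow concrete, here is a minimal in-memory sketch. It assumes cosine similarity over unit-normalized vectors. `SemanticCache`, `toy_embed`, and `cosine` are hypothetical names, and `toy_embed` is a bag-of-words stand-in for a real embedding model, used only so the example runs without external services.

```python
import math
from collections import Counter
from typing import Callable, Optional

Vector = dict[str, float]  # sparse vector: token -> weight


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: L2-normalized bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a: Vector, b: Vector) -> float:
    """For unit vectors, the dot product equals the cosine similarity."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []  # (query embedding, answer)

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        emb = self.embed(query)
        # Linear scan for clarity; a real deployment would query a vector index.
        best = max(self.entries, key=lambda e: cosine(emb, e[0]), default=None)
        if best is not None and cosine(emb, best[0]) >= self.threshold:
            return best[1]
        return None
```

Replaying logged queries against a cache like this at several thresholds is one way to see where wrong answers start to appear before choosing a production value.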
Layer 3: KV-Cache Reuse
Inside the model server, the attention KV cache for a long prompt is expensive to recompute on every request. Tools like vLLM, SGLang, and TGI support prefix sharing: every request that starts with the same system prompt skips that prefix's compute. Standardize a small set of system prompts to maximize hit rate.
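Prefix sharing only pays off when the shared text is identical and comes first, so prompt layout matters as much as the serving engine. The sketch below is illustrative only: `SYSTEM_PROMPTS`, `build_prompt`, `build_prompt_bad`, and `shared_prefix_len` are hypothetical names, and character counts stand in for token counts.

```python
import os

# Hypothetical registry: a small, fixed set of system prompts.
SYSTEM_PROMPTS = {
    "support": "You are a concise support assistant.",
    "sql": "You translate questions into ANSI SQL.",
}


def build_prompt(role: str, user_input: str) -> str:
    """Static system prompt first, variable user content last."""
    return f"{SYSTEM_PROMPTS[role]}\n\nUser: {user_input}"


def build_prompt_bad(role: str, user_input: str) -> str:
    """Anti-pattern: variable content first leaves almost nothing to share."""
    return f"User: {user_input}\n\n{SYSTEM_PROMPTS[role]}"


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading text (characters, as a proxy for tokens)."""
    return len(os.path.commonprefix([a, b]))
```

With `build_prompt`, two requests for the same role share the whole system prompt as a prefix; with `build_prompt_bad`, they diverge as soon as the user text differs.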
Layer 4: Embedding Cache
Embedding the same text twice is pure waste. Cache embeddings keyed by (model, text). For document corpora, store embeddings alongside the source so you never recompute them across deploys.
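A minimal sketch of such a cache, assuming an in-process dict as the store. `EmbeddingCache` and the injected `embed_fn(model, text)` callable are hypothetical; in production the dict would typically be replaced by Redis or a database column next to the source document.

```python
import hashlib
from typing import Callable, Sequence


class EmbeddingCache:
    """Memoizes embeddings keyed by (model, text)."""

    def __init__(self, embed_fn: Callable[[str, str], Sequence[float]]):
        self.embed_fn = embed_fn  # hypothetical signature: (model, text) -> vector
        self.store: dict[str, Sequence[float]] = {}
        self.misses = 0

    @staticmethod
    def key(model: str, text: str) -> str:
        # Hash model and text separately so ("ab", "c") and ("a", "bc") differ.
        h = hashlib.sha256()
        h.update(hashlib.sha256(model.encode()).digest())
        h.update(hashlib.sha256(text.encode()).digest())
        return "emb:" + h.hexdigest()

    def get(self, model: str, text: str) -> Sequence[float]:
        k = self.key(model, text)
        if k not in self.store:
            self.misses += 1
            self.store[k] = self.embed_fn(model, text)
        return self.store[k]
```

Including the model name in the key means a switch to a new embedding model never serves vectors from the old one.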
Layer 5: Edge / CDN
For public, deterministic endpoints (default avatars, static summaries, FAQ answers), cache at the CDN. Round-trips drop to milliseconds at near-zero compute cost. Add appropriate Cache-Control and Vary headers.
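As a rough illustration of the header side, the helpers below build response headers as plain dicts. The function names are hypothetical and the TTL values are left to the caller; `public`, `private`, `no-store`, `max-age`, and `s-maxage` are standard Cache-Control directives.

```python
def public_cache_headers(max_age: int, shared_max_age: int, vary: tuple = ()) -> dict:
    """Headers for a public, deterministic response.

    max-age applies to browsers; s-maxage overrides it for shared caches
    such as CDNs.
    """
    headers = {"Cache-Control": f"public, max-age={max_age}, s-maxage={shared_max_age}"}
    if vary:
        # Vary lists the request headers that change the response body.
        headers["Vary"] = ", ".join(vary)
    return headers


def private_headers() -> dict:
    """Headers for per-user or non-deterministic responses: keep them out of caches."""
    return {"Cache-Control": "private, no-store"}
```

For example, an FAQ answer served in several languages might use `public_cache_headers(60, 3600, ("Accept-Language",))`, while anything tenant-specific gets `private_headers()`.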
Cache Invalidation
The hard problem. Strategies:
- TTL: simplest; pick based on freshness need
- Version key: bump model_version in the cache key on every model swap
- Tag-based: tag entries by tenant or document ID; bulk-invalidate on update
- Stale-while-revalidate: serve cached, refresh in background
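Three of these strategies (TTL, version key, and tags) can be sketched together in a small in-memory class. `TaggedCache` is a hypothetical name and the dict-backed store is an assumption made for brevity; the same bookkeeping could live in Redis or another shared store.

```python
import time
from collections import defaultdict
from typing import Any, Iterable, Optional


class TaggedCache:
    def __init__(self, model_version: str):
        self.model_version = model_version
        self._data: dict[str, tuple[float, Any]] = {}  # full key -> (expires_at, value)
        self._tags: defaultdict[str, set[str]] = defaultdict(set)  # tag -> full keys

    def _full_key(self, key: str) -> str:
        # Version key: changing model_version makes every old entry unreachable.
        return f"{self.model_version}:{key}"

    def set(self, key: str, value: Any, ttl: float, tags: Iterable[str] = ()) -> None:
        fk = self._full_key(key)
        self._data[fk] = (time.monotonic() + ttl, value)
        for tag in tags:
            self._tags[tag].add(fk)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(self._full_key(key))
        if entry is None or entry[0] <= time.monotonic():
            return None  # missing, or past its TTL
        return entry[1]

    def invalidate_tag(self, tag: str) -> int:
        """Drop every entry carrying this tag; returns how many keys were tagged."""
        keys = self._tags.pop(tag, set())
        for fk in keys:
            self._data.pop(fk, None)
        return len(keys)
```

In this sketch, entries orphaned by a version bump or an expired TTL stay in memory until overwritten; a real store would evict them.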
Stochastic Outputs and Caching
If temperature > 0, two calls with the same prompt can return different answers. You have a choice:
- Force temperature = 0 for cache-eligible endpoints
- Cache the first response and accept that follow-up calls return the cached version
- Skip caching for endpoints where variation is the feature (creative writing)
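One way to encode that choice is a small per-endpoint policy function. This is a sketch under assumptions: `NO_CACHE_ENDPOINTS`, `cache_policy`, and the endpoint names are hypothetical.

```python
from typing import Any, Mapping

# Hypothetical set of endpoints where variation is the feature.
NO_CACHE_ENDPOINTS = frozenset({"creative_writing", "brainstorm"})


def cache_policy(
    endpoint: str, params: Mapping[str, Any], pin_temperature: bool = False
) -> tuple[bool, dict]:
    """Return (cacheable, params_to_send) for one request."""
    if endpoint in NO_CACHE_ENDPOINTS:
        return False, dict(params)  # skip caching entirely
    out = dict(params)
    if pin_temperature:
        out["temperature"] = 0  # force temperature = 0 for this endpoint
    # Otherwise leave temperature alone: the first sampled response gets cached.
    return True, out
```

The returned params are what should go into both the model call and the cache key, so a pinned temperature and a sampled one never share an entry.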
Privacy and Multi-Tenancy
Caches that mix tenants leak data. Always include the tenant ID in the cache key. For PII-heavy workloads, prefer per-user caches with short TTLs and strict eviction.
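A tenant-scoped key might look like the sketch below. `tenant_cache_key` and the key layout are hypothetical, and the readable prefix assumes tenant and user IDs never contain a colon.

```python
import hashlib
import json
from typing import Optional


def tenant_cache_key(
    tenant_id: str,
    model: str,
    prompt: str,
    params: dict,
    user_id: Optional[str] = None,
) -> str:
    """Build a cache key that can never collide across tenants (or users)."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    digest = hashlib.sha256(payload.encode()).hexdigest()
    # Assumption: IDs contain no ":" so the scope prefix stays unambiguous.
    scope = f"t:{tenant_id}" if user_id is None else f"t:{tenant_id}:u:{user_id}"
    return f"llm:{scope}:{digest}"
```

Keeping the scope in clear text rather than inside the hash makes it possible to find and evict one tenant's or one user's entries by prefix.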
Observability
- Hit rate per layer (exact, semantic, edge)
- Latency saved per layer
- Cost saved per layer (estimated tokens × price)
- Cache size and eviction rate
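The first three metrics can be tracked with a few counters per layer. The sketch below is illustrative: `LayerStats` and `CacheMetrics` are hypothetical names, and it assumes a single flat price per 1,000 tokens, whereas real pricing usually differs between input and output tokens. Cache size and eviction rate typically come from the cache store's own statistics.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LayerStats:
    hits: int = 0
    misses: int = 0
    tokens_saved: int = 0
    latency_saved_ms: float = 0.0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


class CacheMetrics:
    def __init__(self, price_per_1k_tokens: float):
        self.price_per_1k_tokens = price_per_1k_tokens  # assumed flat, blended price
        self.layers: defaultdict[str, LayerStats] = defaultdict(LayerStats)

    def record_hit(self, layer: str, tokens_saved: int, latency_saved_ms: float) -> None:
        stats = self.layers[layer]
        stats.hits += 1
        stats.tokens_saved += tokens_saved
        stats.latency_saved_ms += latency_saved_ms

    def record_miss(self, layer: str) -> None:
        self.layers[layer].misses += 1

    def cost_saved(self, layer: str) -> float:
        """Estimated savings: tokens saved x price per token."""
        return self.layers[layer].tokens_saved / 1000 * self.price_per_1k_tokens
```

Calling `record_hit` or `record_miss` at each layer's lookup site is enough to feed a dashboard with per-layer hit rate, latency saved, and estimated cost saved.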
Conclusion
Caching is the single highest-leverage optimization in an AI backend. Start with exact-match, layer in semantic, embrace KV-cache reuse, and never recompute embeddings. Treat hit rate and cost saved as first-class metrics. The savings compound — and your users get a faster product as a bonus.