Implementing Semantic Caching Using Spring AI
Optimize AI responses, reduce latency, and cut API costs using semantic similarity-based caching with vector databases
1. What is Semantic Caching?
Semantic caching is an intelligent caching strategy that goes beyond exact string matching. Instead of requiring identical queries, it finds cached responses for queries that are semantically similar — meaning they have the same meaning or intent, even if worded differently.
Traditional Caching
- Exact string match required
- "What is AI?" ≠ "What's artificial intelligence?"
- High cache miss rate
- Limited effectiveness for natural language
Semantic Caching
- Meaning-based similarity match
- "What is AI?" ≈ "What's artificial intelligence?"
- Much higher cache hit rate
- Well suited to AI/LLM applications
How It Works
Semantic caching uses embedding vectors to represent the meaning of queries. When a new query arrives, its embedding is compared against cached embeddings using cosine similarity. If a match exceeds the similarity threshold, the cached response is returned.
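To make the similarity check concrete, here is a minimal, self-contained sketch of cosine similarity between two embedding vectors. In practice the vector store performs this comparison for you; the three-dimensional values below are toy numbers, while real embeddings have hundreds or thousands of dimensions.

```java
// Toy illustration of the comparison a vector store performs internally.
public final class CosineSimilarityDemo {

    // Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical embeddings for two similarly worded queries.
        float[] whatIsAi = {0.21f, 0.80f, 0.10f};
        float[] whatsArtificialIntelligence = {0.19f, 0.78f, 0.12f};

        double score = cosine(whatIsAi, whatsArtificialIntelligence);
        System.out.printf("similarity = %.3f%n", score); // close to 1.0 => cache hit
    }
}
```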
2. Benefits of Semantic Caching
Cost Reduction
Reduce API calls to expensive LLM providers by 40-70% with semantic matching
Faster Responses
Cache hits return in milliseconds vs seconds for LLM calls
Consistency
Similar questions always get consistent answers from cache
Real-World Impact
Customer Support Bot: Many users ask the same question in different ways. "How do I reset my password?", "Password reset help", and "I forgot my password" can all be served by one cached response.
FAQ Systems: Product questions like "What's the battery life?" and "How long does the battery last?" are semantically identical and benefit from caching.
3. Architecture Overview
Semantic Caching Flow
User Query ("What is Spring?") → Embed Query ([0.2, 0.8, ...]) → Vector Search (find similar entries) → Similarity Check (above threshold?) → Return the cached response or call the LLM and cache the result
Key Components
1. Embedding Model
Converts text queries into numerical vectors that capture semantic meaning. Spring AI supports OpenAI, Ollama, and other embedding providers.
2. Vector Store
Stores embeddings and performs similarity search. Options include Redis, PostgreSQL with pgvector, Pinecone, Milvus, etc.
3. Cache Advisor
Intercepts requests, checks the cache, and decides whether to return cached response or call the LLM.
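Before wiring this up with Spring AI in the next section, the advisor's decision logic can be sketched in plain Java. The CacheLookup and LlmCall interfaces below are hypothetical stand-ins, not Spring AI types:

```java
import java.util.Optional;

// Conceptual sketch of the cache advisor's check-then-delegate logic
// (hypothetical interfaces, not the Spring AI Advisor API).
public class CacheAdvisorSketch {

    interface CacheLookup { Optional<String> find(String query); } // vector-store lookup
    interface LlmCall { String generate(String query); }           // downstream LLM call

    private final CacheLookup cache;
    private final LlmCall llm;

    CacheAdvisorSketch(CacheLookup cache, LlmCall llm) {
        this.cache = cache;
        this.llm = llm;
    }

    String handle(String query) {
        // Return the cached answer when a semantically similar entry exists,
        // otherwise fall through to the LLM.
        return cache.find(query).orElseGet(() -> llm.generate(query));
    }
}
```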
4. Implementation with Spring AI
Step 1: Add Dependencies
```xml
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-redis-store-spring-boot-starter</artifactId>
</dependency>
```
Step 2: Configure Vector Store
```yaml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
    vectorstore:
      redis:
        uri: redis://localhost:6379
        index: semantic-cache
        prefix: "cache:"
```
Step 3: Create Semantic Cache Service
```java
import java.util.List;
import java.util.Map;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class SemanticCacheService {

    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;
    private final ChatClient chatClient;

    private static final double SIMILARITY_THRESHOLD = 0.92;

    public SemanticCacheService(VectorStore vectorStore,
                                EmbeddingModel embeddingModel,
                                ChatClient.Builder chatClientBuilder) {
        this.vectorStore = vectorStore;
        this.embeddingModel = embeddingModel;
        this.chatClient = chatClientBuilder.build();
    }

    public String getResponse(String userQuery) {
        // 1. Search for semantically similar cached queries
        List<Document> similarDocs = vectorStore.similaritySearch(
                SearchRequest.query(userQuery)
                        .withTopK(1)
                        .withSimilarityThreshold(SIMILARITY_THRESHOLD));

        // 2. If cache hit, return the cached response
        if (!similarDocs.isEmpty()) {
            return similarDocs.get(0).getMetadata().get("response").toString();
        }

        // 3. Otherwise, call the LLM
        String response = chatClient.prompt()
                .user(userQuery)
                .call()
                .content();

        // 4. Store the query and response in the cache; the vector store
        //    embeds the document text (the query) automatically
        Document doc = new Document(userQuery, Map.of("response", response));
        vectorStore.add(List.of(doc));

        return response;
    }
}
```
Cache Hit Scenario
When a user asks "Explain Spring Framework", and there's a cached response for "What is Spring Framework?", the semantic similarity will be high enough to return the cached response instantly.
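As a usage sketch, a thin REST endpoint can delegate every question to the service. The controller class and the /chat path below are illustrative, not part of the article's code:

```java
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical endpoint that routes user questions through the semantic cache.
@RestController
public class ChatController {

    private final SemanticCacheService cacheService;

    public ChatController(SemanticCacheService cacheService) {
        this.cacheService = cacheService;
    }

    @PostMapping("/chat")
    public String chat(@RequestBody String question) {
        // Cache hit -> milliseconds; cache miss -> LLM call, then cached for next time.
        return cacheService.getResponse(question);
    }
}
```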
5. Configuration & Tuning
Similarity Threshold
The similarity threshold determines how "close" a query must be to return a cached response. This is crucial for balancing cache hits vs. accuracy.
| Threshold | Cache Hit Rate | Accuracy | Use Case |
|---|---|---|---|
| 0.95+ | Low | Very High | Critical/legal queries |
| 0.90-0.95 | Medium | High | General Q&A (recommended) |
| 0.85-0.90 | High | Medium | FAQs, casual chatbots |
| <0.85 | Very High | Low | Not recommended |
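Rather than hard-coding SIMILARITY_THRESHOLD, the value can be externalized so it can be tuned per environment. A minimal sketch, assuming a custom semantic.cache.similarity-threshold property (the property name and CacheTuning class are hypothetical):

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Hypothetical holder for a tunable threshold, read from configuration.
@Component
public class CacheTuning {

    private final double similarityThreshold;

    public CacheTuning(
            @Value("${semantic.cache.similarity-threshold:0.92}") double similarityThreshold) {
        this.similarityThreshold = similarityThreshold;
    }

    public double similarityThreshold() {
        return similarityThreshold;
    }
}
```

The service could then take this component as a constructor argument instead of the constant, so threshold experiments require only a configuration change.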
Cache Invalidation
Consider implementing TTL (Time-To-Live) for cached responses to ensure freshness. For time-sensitive information, cache entries should expire and be regenerated.
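One lightweight way to approximate a TTL, sketched below, is to store a timestamp in each document's metadata when caching and treat entries older than the TTL as misses. The cachedAt key and the 24-hour TTL are illustrative choices, not part of the article's code:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Illustrative freshness check for a cached entry's metadata.
public final class CacheTtl {

    private static final Duration TTL = Duration.ofHours(24); // example TTL

    // Expects a "cachedAt" epoch-millis value written when the entry was stored.
    static boolean isFresh(Map<String, Object> metadata) {
        Object cachedAt = metadata.get("cachedAt");
        if (cachedAt == null) {
            return false; // no timestamp -> treat as stale
        }
        Instant storedAt = Instant.ofEpochMilli((Long) cachedAt);
        return Instant.now().isBefore(storedAt.plus(TTL));
    }
}
```

When storing a new entry, the service would additionally put the cachedAt timestamp (for example System.currentTimeMillis()) into the document metadata alongside the response.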
6. Best Practices
Do: Normalize Queries
Convert to lowercase, remove extra whitespace, and trim queries before embedding for better matching.
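A minimal normalization helper might look like this (the exact rules are a matter of taste):

```java
import java.util.Locale;

// Simple query normalization before embedding: lowercase, trim, collapse whitespace.
public final class QueryNormalizer {

    static String normalize(String query) {
        return query.toLowerCase(Locale.ROOT)
                .trim()
                .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("  What   IS  Spring? ")); // "what is spring?"
    }
}
```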
Do: Monitor Cache Metrics
Track hit rate, miss rate, and average similarity scores to optimize threshold settings.
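If Micrometer is on the classpath (it ships with Spring Boot Actuator), plain counters are enough to derive the hit rate; the metric names below are illustrative:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

// Illustrative cache metrics backed by Micrometer counters.
public class CacheMetrics {

    private final Counter hits;
    private final Counter misses;

    public CacheMetrics(MeterRegistry registry) {
        this.hits = registry.counter("semantic.cache.hits");
        this.misses = registry.counter("semantic.cache.misses");
    }

    public void recordHit()  { hits.increment(); }
    public void recordMiss() { misses.increment(); }

    public double hitRate() {
        double total = hits.count() + misses.count();
        return total == 0 ? 0.0 : hits.count() / total;
    }
}
```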
Do: Use Domain-Specific Embeddings
For specialized domains, consider fine-tuned embedding models for better semantic understanding.
Don't: Cache Personalized Content
User-specific responses (account info, preferences) should bypass the semantic cache.
Production Considerations
- Use Redis Cluster or managed vector databases for scalability
- Implement cache warm-up with common queries during deployment
- Add fallback mechanisms if the cache service is unavailable (see the sketch after this list)
- Consider multi-tenant caching with namespace isolation
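For the fallback point above, one option is an extra method on SemanticCacheService that degrades to a direct LLM call when the vector store is unreachable. A sketch under the assumption that vector-store client failures surface as runtime exceptions:

```java
// Inside SemanticCacheService:

// Illustrative fallback: if the cache lookup fails, skip the cache
// and call the LLM directly rather than failing the request.
public String getResponseWithFallback(String userQuery) {
    try {
        return getResponse(userQuery); // normal semantic-cache path
    } catch (RuntimeException cacheUnavailable) {
        // Assumption: vector-store failures are reported as runtime exceptions.
        return chatClient.prompt()
                .user(userQuery)
                .call()
                .content();
    }
}
```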
What You've Learned
Semantic vs Traditional
Why meaning-based caching outperforms exact-match
Architecture
Embeddings, vector stores, and cache advisors
Implementation
Spring AI code for semantic caching
Configuration
Similarity thresholds and tuning
Best Practices
Production-ready patterns
Cost Optimization
Reduce LLM API costs significantly