Spring AI Tutorials
    Tutorial 08

    Implementing Semantic Caching Using Spring AI

    Optimize AI responses, reduce latency, and cut API costs using semantic similarity-based caching with vector databases

    1. What is Semantic Caching?

    Semantic caching is an intelligent caching strategy that goes beyond exact string matching. Instead of requiring identical queries, it finds cached responses for queries that are semantically similar — meaning they have the same meaning or intent, even if worded differently.

    Traditional Caching

    • Exact string match required
    • "What is AI?" ≠ "What's artificial intelligence?"
    • High cache miss rate
    • Limited effectiveness for natural language

    Semantic Caching

    • Meaning-based similarity match
    • "What is AI?" ≈ "What's artificial intelligence?"
    • Much higher cache hit rate
    • Well suited to AI/LLM applications

    How It Works

    Semantic caching uses embedding vectors to represent the meaning of queries. When a new query arrives, its embedding is compared against cached embeddings using cosine similarity. If a match exceeds the similarity threshold, the cached response is returned.
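
    To make the comparison step concrete, here is a hand-rolled cosine similarity over two tiny, made-up vectors; in a real deployment the vector store performs this search internally, so this is purely illustrative.

    Java Example
    // Illustrative only: real embeddings have hundreds of dimensions, and the
    // vector store runs this comparison for you during similarity search.
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Hypothetical 3-dimensional embeddings of two similarly worded queries:
    // cosineSimilarity(new float[]{0.2f, 0.8f, 0.1f}, new float[]{0.25f, 0.75f, 0.1f})
    // returns ~0.997, which would clear a 0.92 threshold.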

    2. Benefits of Semantic Caching

    Cost Reduction

    Cut calls to expensive LLM providers by 40-70% on workloads with repetitive, similarly worded queries

    Faster Responses

    Cache hits return in milliseconds, versus the seconds a full LLM call can take

    Consistency

    Similar questions receive the same cached answer, rather than varying LLM generations

    Real-World Impact

    Customer Support Bot: Many users ask the same questions differently. "How do I reset my password?", "Password reset help", and "I forgot my password" can all be served by a single cached response.

    FAQ Systems: Product questions like "What's the battery life?" and "How long does the battery last?" are semantically identical and benefit from caching.

    3. Architecture Overview

    Semantic Caching Flow

    User Query ("What is Spring?") → Embed Query ([0.2, 0.8, ...]) → Vector Search (find similar cached entries) → Similarity Check (above threshold?) → Return cached response or generate via LLM

    Key Components

    1. Embedding Model

    Converts text queries into numerical vectors that capture semantic meaning. Spring AI supports OpenAI, Ollama, and other embedding providers.
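
    A single call turns text into a vector. A minimal sketch, assuming the starter has auto-configured an EmbeddingModel bean (older Spring AI milestones return List<Double> rather than float[]):

    Java Example
    // Sketch: the embedding starter auto-configures an EmbeddingModel you can inject.
    float[] vector = embeddingModel.embed("What is Spring?");
    // vector is a dense float array, e.g. 1536 dimensions for OpenAI embedding models.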

    2. Vector Store

    Stores embeddings and performs similarity search. Options include Redis, PostgreSQL with pgvector, Pinecone, Milvus, etc.
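
    Its two roles in this design look roughly as follows (same pre-1.0 SearchRequest style as the full example in Step 3 of the next section; the cached response text is made up):

    Java Example
    // Write path: cache an answered query; the store embeds the document text itself.
    vectorStore.add(List.of(new Document("What is Spring Framework?",
            Map.of("response", "Spring is a Java application framework..."))));

    // Read path: find the nearest cached query above a similarity threshold.
    List<Document> hits = vectorStore.similaritySearch(
            SearchRequest.query("Explain Spring Framework")
                    .withTopK(1)
                    .withSimilarityThreshold(0.92));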

    3. Cache Advisor

    Intercepts requests, checks the cache, and decides whether to return cached response or call the LLM.
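
    Spring AI ships an Advisor SPI for intercepting ChatClient calls, but its exact interface has shifted between releases, so the sketch below expresses the same check-then-delegate logic with plain, hypothetical helpers instead of the SPI:

    Java Example
    // Sketch of the advisor's responsibility; lookupSimilar, callLlm, and store
    // are hypothetical helpers, not Spring AI methods.
    public String intercept(String userQuery) {
        Optional<String> cached = lookupSimilar(userQuery); // vector search over cached queries
        if (cached.isPresent()) {
            return cached.get();                            // hit: skip the LLM entirely
        }
        String answer = callLlm(userQuery);                 // miss: generate a fresh answer
        store(userQuery, answer);                           // write-through for future similar queries
        return answer;
    }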

    4. Implementation with Spring AI

    Step 1: Add Dependencies

    XML Example
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-redis-store-spring-boot-starter</artifactId>
    </dependency>

    Step 2: Configure Vector Store

    YAML Example
    spring:
      ai:
        openai:
          api-key: ${OPENAI_API_KEY}
        vectorstore:
          redis:
            uri: redis://localhost:6379
            index: semantic-cache
            prefix: "cache:"

    Step 3: Create Semantic Cache Service

    Java Example
    import java.util.List;
    import java.util.Map;

    import org.springframework.ai.chat.client.ChatClient;
    import org.springframework.ai.document.Document;
    import org.springframework.ai.embedding.EmbeddingModel;
    import org.springframework.ai.vectorstore.SearchRequest;
    import org.springframework.ai.vectorstore.VectorStore;
    import org.springframework.stereotype.Service;

    @Service
    public class SemanticCacheService {

        private final VectorStore vectorStore;
        // Injected for direct embedding use cases; the Redis vector store embeds
        // document text itself when documents are added.
        private final EmbeddingModel embeddingModel;
        private final ChatClient chatClient;

        private static final double SIMILARITY_THRESHOLD = 0.92;

        public SemanticCacheService(VectorStore vectorStore,
                                    EmbeddingModel embeddingModel,
                                    ChatClient.Builder chatClientBuilder) {
            this.vectorStore = vectorStore;
            this.embeddingModel = embeddingModel;
            this.chatClient = chatClientBuilder.build();
        }

        public String getResponse(String userQuery) {
            // 1. Search for semantically similar cached queries
            List<Document> similarDocs = vectorStore.similaritySearch(
                    SearchRequest.query(userQuery)
                            .withTopK(1)
                            .withSimilarityThreshold(SIMILARITY_THRESHOLD));

            // 2. If cache hit, return the cached response
            if (!similarDocs.isEmpty()) {
                return similarDocs.get(0).getMetadata().get("response").toString();
            }

            // 3. Otherwise, call the LLM
            String response = chatClient.prompt()
                    .user(userQuery)
                    .call()
                    .content();

            // 4. Store the new query/response pair in the cache
            Document doc = new Document(userQuery, Map.of("response", response));
            vectorStore.add(List.of(doc));

            return response;
        }
    }
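
    To exercise the service end to end, a minimal REST endpoint could look like this; the /ask path and parameter name are our own choices, not part of Spring AI:

    Java Example
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.bind.annotation.RestController;

    // Hypothetical endpoint: GET /ask?q=Explain%20Spring%20Framework
    @RestController
    public class ChatController {

        private final SemanticCacheService cacheService;

        public ChatController(SemanticCacheService cacheService) {
            this.cacheService = cacheService;
        }

        @GetMapping("/ask")
        public String ask(@RequestParam String q) {
            // Served from the semantic cache when a similar query exists,
            // otherwise answered by the LLM and cached for next time.
            return cacheService.getResponse(q);
        }
    }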

    Cache Hit Scenario

    When a user asks "Explain Spring Framework", and there's a cached response for "What is Spring Framework?", the semantic similarity will be high enough to return the cached response instantly.

    5. Configuration & Tuning

    Similarity Threshold

    The similarity threshold determines how "close" a query must be to return a cached response. This is crucial for balancing cache hits vs. accuracy.

    Threshold    Cache Hits   Accuracy    Use Case
    0.95+        Low          Very High   Critical/legal queries
    0.90-0.95    Medium       High        General Q&A (recommended)
    0.85-0.90    High         Medium      FAQs, casual chatbots
    < 0.85       Very High    Low         Not recommended
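
    Because the right value is workload-dependent, it helps to externalize the threshold rather than hard-code it. The property name below is our own convention, not a built-in Spring AI setting:

    Java Example
    // "app.semantic-cache.similarity-threshold" is an application-defined property;
    // 0.92 is the fallback default if it is not set.
    @Value("${app.semantic-cache.similarity-threshold:0.92}")
    private double similarityThreshold;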

    Cache Invalidation

    Consider implementing TTL (Time-To-Live) for cached responses to ensure freshness. For time-sensitive information, cache entries should expire and be regenerated.
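
    The VectorStore abstraction has no built-in TTL, so one workable approach is to stamp each entry with a creation time in metadata and treat stale hits as misses. A sketch, where the cachedAt key and the 24-hour window are illustrative choices (uses java.time.Duration and java.time.Instant):

    Java Example
    // Sketch: TTL via metadata. "cachedAt" is our own key, not a Spring AI convention.
    private static final Duration TTL = Duration.ofHours(24);

    private boolean isFresh(Document doc) {
        Object cachedAt = doc.getMetadata().get("cachedAt");
        return cachedAt instanceof Long millis
                && Instant.now().isBefore(Instant.ofEpochMilli(millis).plus(TTL));
    }

    // When caching, record the timestamp alongside the response:
    // new Document(userQuery, Map.of("response", response,
    //         "cachedAt", System.currentTimeMillis()));
    // On a stale hit, delete the entry via vectorStore.delete(...) and regenerate.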

    6. Hands-On Tutorial

    Ready to build a complete semantic search application? Check out our step-by-step guide on implementing Vector Similarity Search with Spring Boot and Redis.

    7. Best Practices

    Do: Normalize Queries

    Convert to lowercase, remove extra whitespace, and trim queries before embedding for better matching.
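
    A minimal normalizer might look like this; the exact rules (casing, punctuation, stop words) are an application-level judgment call:

    Java Example
    // Normalize before both caching and lookup, so equivalent phrasings collide.
    static String normalize(String query) {
        return query.trim()
                .toLowerCase()
                .replaceAll("\\s+", " "); // collapse runs of whitespace
    }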

    Do: Monitor Cache Metrics

    Track hit rate, miss rate, and average similarity scores to optimize threshold settings.
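
    One straightforward way to capture these numbers is with Micrometer counters; the metric names below are our own conventions:

    Java Example
    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;

    // Records hits and misses; wire recordHit()/recordMiss() into the cache service.
    public class CacheMetrics {

        private final Counter hits;
        private final Counter misses;

        public CacheMetrics(MeterRegistry registry) {
            this.hits = Counter.builder("semantic.cache.hits").register(registry);
            this.misses = Counter.builder("semantic.cache.misses").register(registry);
        }

        public void recordHit()  { hits.increment(); }
        public void recordMiss() { misses.increment(); }
        // Hit rate = hits / (hits + misses); compute it in your dashboard or expose a Gauge.
    }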

    Do: Use Domain-Specific Embeddings

    For specialized domains, consider fine-tuned embedding models for better semantic understanding.

    Don't: Cache Personalized Content

    User-specific responses (account info, preferences) should bypass the semantic cache.

    Production Considerations

    • Use Redis Cluster or managed vector databases for scalability
    • Implement cache warm-up with common queries during deployment
    • Add fallback mechanisms if the cache service is unavailable (see the sketch after this list)
    • Consider multi-tenant caching with namespace isolation
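
    The fallback mentioned above can be as simple as failing open to a direct LLM call when the cache layer throws. A minimal sketch, added to the Step 3 service:

    Java Example
    // Sketch: fail open to the LLM if the cache lookup throws (e.g. Redis is down).
    public String getResponseWithFallback(String userQuery) {
        try {
            return getResponse(userQuery);       // normal path: cache check, LLM on miss
        } catch (RuntimeException cacheFailure) {
            // Cache layer unavailable: answer directly so users are not blocked by it.
            return chatClient.prompt().user(userQuery).call().content();
        }
    }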

    What You've Learned

    Semantic vs Traditional

    Why meaning-based caching outperforms exact-match

    Architecture

    Embeddings, vector stores, and cache advisors

    Implementation

    Spring AI code for semantic caching

    Configuration

    Similarity thresholds and tuning

    Best Practices

    Production-ready patterns

    Cost Optimization

    Reduce LLM API costs significantly
