RAG (Retrieval Augmented Generation)

    Ground your AI responses in facts from your own documents. RAG dramatically reduces hallucinations by giving the LLM access to your proprietary knowledge base, turning it from a generic chatbot into a domain expert.

    Retrieval Augmented Generation is the technique of fetching relevant documents from a knowledge base and injecting them into the LLM's context window before generating a response. Instead of asking "What is our refund policy?" and hoping the model guesses correctly from its training data, RAG retrieves your actual refund policy document and provides it as context, grounding the answer in the real text.

    Spring AI provides first-class support for RAG with built-in document loaders, text splitters, embedding clients, and vector store integrations. You can build a production-ready knowledge base in hours, not weeks.

    Why RAG Changes Everything

    Without RAG (Pure LLM)

    • Limited to training data (often 1-2 years old)
    • No knowledge of your internal documents
    • Confidently hallucinates when it doesn't know
    • Cannot cite sources for its claims

    With RAG

    • Access to your latest documents in real-time
    • Answers grounded in your proprietary data
    • Says "I don't know" when context is missing
    • Can cite exact documents and page numbers

    RAG Pipeline Architecture

    Indexing Phase (Offline)

    1. Load Documents: PDFs, Word docs, Markdown, HTML, JSON, databases
    2. Chunk Text: Split into 500-1000 token segments with overlap
    3. Generate Embeddings: Convert each chunk to a 1536-dimensional vector
    4. Store in Vector DB: Index vectors for fast similarity search

    Query Phase (Online)

    5. Embed User Query: Convert the question to the same vector space
    6. Similarity Search: Find the top-K most similar chunks
    7. Augment Prompt: Inject retrieved context into the system/user message
    8. Generate Response: The LLM answers based on the provided context

    Vector Store Options

    PGVector

    PostgreSQL Extension

    Use your existing Postgres database. No new infrastructure. Great for starting out.

    spring-ai-pgvector-store-spring-boot-starter

    Chroma

    Open Source

    Lightweight, embeddable vector DB. Perfect for local development and small deployments.

    spring-ai-chroma-store-spring-boot-starter

    Pinecone

    Managed Cloud

    Fully managed, scales to billions of vectors. Best for production at scale.

    spring-ai-pinecone-store-spring-boot-starter

    Spring AI Implementation

    Configuration

    application.properties
    # Vector Store (PGVector example)
    spring.ai.vectorstore.pgvector.index-type=HNSW
    spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
    spring.ai.vectorstore.pgvector.dimensions=1536

    # Embedding Model
    spring.ai.openai.embedding.options.model=text-embedding-3-small

    # Or use local embeddings with Ollama
    # spring.ai.ollama.embedding.options.model=nomic-embed-text

    Document Ingestion Service

    Load, split, and index your documents

    DocumentIngestionService.java
    @Service
    public class DocumentIngestionService {

        private final VectorStore vectorStore;

        public DocumentIngestionService(VectorStore vectorStore) {
            this.vectorStore = vectorStore;
        }

        public void ingestPdf(Resource pdfResource) {
            // 1. Load PDF
            PagePdfDocumentReader reader = new PagePdfDocumentReader(pdfResource);
            List<Document> documents = reader.read();

            // 2. Split into chunks (500 tokens, 100 token overlap)
            TokenTextSplitter splitter = new TokenTextSplitter(500, 100);
            List<Document> chunks = splitter.split(documents);

            // 3. Add metadata for filtering/citation
            chunks.forEach(doc -> {
                doc.getMetadata().put("source", pdfResource.getFilename());
                doc.getMetadata().put("ingested_at", Instant.now().toString());
            });

            // 4. Store (embeddings generated automatically)
            vectorStore.add(chunks);
        }

        public void ingestMarkdown(Resource mdResource) {
            TextReader reader = new TextReader(mdResource);
            List<Document> docs = new TokenTextSplitter().split(reader.read());
            vectorStore.add(docs);
        }
    }
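
    One simple way to trigger ingestion at application startup is a CommandLineRunner; this is a sketch, and the classpath location "docs/handbook.pdf" is a placeholder, not a file from the original example.

    IngestionRunner.java (sketch)
    @Component
    class IngestionRunner implements CommandLineRunner {

        private final DocumentIngestionService ingestionService;

        IngestionRunner(DocumentIngestionService ingestionService) {
            this.ingestionService = ingestionService;
        }

        @Override
        public void run(String... args) {
            // Index the bundled PDF once when the application boots
            ingestionService.ingestPdf(new ClassPathResource("docs/handbook.pdf"));
        }
    }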

    RAG Query Service

    Retrieve context and generate answers

    RAGService.java
    @Service
    public class RAGService {

        private final ChatClient chatClient;
        private final VectorStore vectorStore;

        public RAGService(ChatClient.Builder builder, VectorStore vectorStore) {
            this.chatClient = builder
                .defaultSystem("""
                    You are a helpful assistant that answers questions based on the provided context.
                    If the answer is not in the context, say "I don't have information about that."
                    Always cite which document your answer came from.
                    """)
                .build();
            this.vectorStore = vectorStore;
        }

        public String query(String question) {
            // 1. Semantic search for relevant chunks
            List<Document> relevantDocs = vectorStore.similaritySearch(
                SearchRequest.query(question)
                    .withTopK(5)
                    .withSimilarityThreshold(0.7));

            // 2. Build context string with sources
            String context = relevantDocs.stream()
                .map(doc -> "Source: " + doc.getMetadata().get("source")
                    + "\nContent: " + doc.getContent())
                .collect(Collectors.joining("\n\n---\n\n"));

            // 3. Generate response with context
            return chatClient.prompt()
                .user(u -> u.text("""
                        Context:
                        {context}

                        Question: {question}
                        """)
                    .param("context", context)
                    .param("question", question))
                .call()
                .content();
        }
    }

    Chunking Strategies

    Chunking dramatically affects retrieval quality. Too small = lost context. Too large = irrelevant noise.

    Fixed Size

    Split every N tokens with overlap

    Simple, Fast

    Semantic

    Split at paragraph/section boundaries

    Better Context

    Recursive

    Try multiple separators hierarchically

    Most Flexible

    Rule of Thumb: Start with 500 tokens per chunk, 100 token overlap. Adjust based on your content type—code needs smaller chunks, narratives need larger.
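
    To make the fixed-size strategy concrete, here is a minimal, library-free sketch in plain Java. It counts whitespace-separated words rather than model tokens, and the chunkSize/overlap values are the placeholders from the rule of thumb above.

    FixedSizeChunker.java (sketch)
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class FixedSizeChunker {

        // Split text into chunks of roughly chunkSize tokens, each sharing
        // `overlap` tokens with the previous chunk so context survives the boundary.
        public static List<String> chunk(String text, int chunkSize, int overlap) {
            String[] tokens = text.split("\\s+");
            List<String> chunks = new ArrayList<>();
            int step = Math.max(1, chunkSize - overlap);   // how far each chunk advances
            for (int start = 0; start < tokens.length; start += step) {
                int end = Math.min(tokens.length, start + chunkSize);
                chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
                if (end == tokens.length) break;           // last chunk reached
            }
            return chunks;
        }
    }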

    Best Practices

    ✓ Do

    • Add rich metadata (source, date, author) for filtering (see the sketch after these lists)
    • Use a similarity threshold to avoid irrelevant matches
    • Implement hybrid search (semantic + keyword)
    • Cache embeddings—they're expensive to regenerate
    • Test with real user questions, not synthetic ones

    ✗ Avoid

    • Retrieving too many chunks (context overflow)
    • Ignoring document freshness (stale data)
    • Mixing unrelated content in the same vector store
    • Skipping chunk overlap (context loss at boundaries)
    • Using different embedding models for indexing and querying
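
    To illustrate the metadata and similarity-threshold practices above, the fluent SearchRequest used in RAGService can combine both. This is a sketch: the filter expression builds on the "source" metadata added during ingestion, and the filename is a placeholder.

    Metadata Filter (sketch)
    // Drop weak matches and restrict retrieval to a single source document.
    List<Document> docs = vectorStore.similaritySearch(
        SearchRequest.query(question)
            .withTopK(5)
            .withSimilarityThreshold(0.7)                       // ignore weak matches
            .withFilterExpression("source == 'handbook.pdf'")); // metadata filter (placeholder value)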

    Enterprise Use Cases

    Internal Knowledge Base

    Index company policies, procedures, and documentation. Employees ask questions in natural language and get instant, accurate answers with citations.

    Customer Support Bot

    Train on FAQs, support tickets, and product docs. Resolve common issues instantly and escalate only when context is truly missing.

    Codebase Q&A

    Index your entire codebase including comments and docs. Ask "How does auth work?" and get answers pointing to actual implementation files.

    Legal/Compliance Research

    Search through contracts, regulations, and legal documents. Find relevant clauses in seconds instead of hours of manual review.

    Advanced RAG Patterns

    Basic RAG works well for simple use cases, but production systems often need more sophisticated retrieval strategies. These advanced patterns can dramatically improve answer quality and relevance.

    Multi-Query Retrieval

    Users often phrase questions poorly or ambiguously. Multi-query RAG uses the LLM itself to generate 3-5 alternative phrasings of the user's question, retrieves documents for each variant, and merges the results. This captures semantically related content that a single query might miss.

    Multi-Query Pattern
    // Generate query variants
    List<String> queryVariants = chatClient.prompt()
        .user("Generate 3 alternative phrasings for: " + originalQuery)
        .call()
        .entity(new ParameterizedTypeReference<List<String>>() {});

    // Retrieve for each variant and merge
    Set<Document> allDocs = new HashSet<>();
    for (String variant : queryVariants) {
        allDocs.addAll(vectorStore.similaritySearch(variant));
    }

    HyDE (Hypothetical Document Embeddings)

    Instead of embedding the question directly, HyDE asks the LLM to generate a hypothetical answer (even if it's hallucinated), then uses that answer's embedding to search. This works because a hypothetical answer is semantically closer to actual answers than a question is.

    HyDE Pattern
    // Generate a hypothetical answer
    String hypotheticalAnswer = chatClient.prompt()
        .user("Answer this question (guess if needed): " + question)
        .call()
        .content();

    // Search using the hypothetical answer's embedding
    List<Document> docs = vectorStore.similaritySearch(hypotheticalAnswer);

    Re-Ranking

    Vector search is fast but imprecise. Re-ranking uses a more expensive model (like a cross-encoder) to re-score the top-K results based on actual relevance to the query. Retrieve 20 documents, re-rank, and use only the top 5. This can boost accuracy by 15-30%.

    Re-Ranking Pattern
    // Over-retrieve candidates (copy into a mutable list so it can be sorted)
    List<Document> candidates = new ArrayList<>(vectorStore.similaritySearch(
        SearchRequest.query(question).withTopK(20)));

    // Re-rank with a cross-encoder or LLM (scoreRelevance is your scoring function)
    candidates.sort(Comparator.comparingDouble(
        (Document doc) -> scoreRelevance(doc, question)).reversed());

    // Use top 5 after re-ranking
    List<Document> topDocs = candidates.subList(0, Math.min(5, candidates.size()));

    Hybrid Search (Semantic + Keyword)

    Pure vector search can miss exact keyword matches (like product codes or technical terms). Hybrid search combines semantic similarity with traditional BM25 keyword matching, giving you the best of both worlds. Most vector databases now support this natively.

    Vector stores with native hybrid support include Pinecone, Weaviate, and Elasticsearch; a sketch of application-level fusion follows below.
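
    The portable VectorStore API used above only exposes similaritySearch, so if your store lacks native hybrid search you can fuse rankings in application code. The sketch below merges a semantic ranking with a keyword ranking via reciprocal rank fusion; keywordSearch is a hypothetical helper (for example, a BM25 query against Elasticsearch), and 60 is the conventional RRF damping constant.

    Hybrid Search Fusion (sketch)
    // Retrieve two independent rankings: semantic and keyword-based.
    List<Document> semantic = vectorStore.similaritySearch(
        SearchRequest.query(question).withTopK(20));
    List<Document> keyword = keywordSearch(question, 20);   // hypothetical helper

    // Reciprocal rank fusion: documents ranked highly in either list score highest.
    Map<String, Double> fusedScores = new HashMap<>();
    for (List<Document> ranking : List.of(semantic, keyword)) {
        for (int rank = 0; rank < ranking.size(); rank++) {
            fusedScores.merge(ranking.get(rank).getId(), 1.0 / (60 + rank), Double::sum);
        }
    }

    // Keep the IDs of the top 5 fused results
    List<String> topIds = fusedScores.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
        .limit(5)
        .map(Map.Entry::getKey)
        .toList();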

    Embedding Model Comparison

    The embedding model you choose significantly impacts retrieval quality. Here's how the popular options compare. Remember: you should use the same embedding model for both indexing and querying—mixing models will produce garbage results.

    Model                        Dimensions   Best For                          Cost
    text-embedding-3-small       1536         General purpose, cost-effective   $0.02/1M tokens
    text-embedding-3-large       3072         Maximum accuracy                  $0.13/1M tokens
    nomic-embed-text (Ollama)    768          Local/private, good quality       Free (local)
    Cohere embed-v3              1024         Multilingual, long documents      $0.10/1M tokens

    Pro Tip: Start with text-embedding-3-small for cloud or nomic-embed-text for local. Only upgrade to larger models if you see retrieval quality issues in production.

    Troubleshooting Common Issues

    "The AI doesn't find relevant documents"

    Causes: Poor chunking (chunks too big or too small), wrong embedding model, similarity threshold too high, or query phrasing doesn't match document vocabulary.

    Fixes: Experiment with chunk sizes (try 300-800 tokens). Lower the similarity threshold to around 0.5. Use multi-query retrieval. Check that documents were actually indexed (query the row count in the vector store).

    "Answers are correct but cite the wrong source"

    Cause: Metadata not preserved during chunking, or prompt doesn't instruct model to cite sources.

    Fix: Add source/filename to each chunk's metadata before indexing. Include instructions in system prompt: "Always cite the source document for each claim."

    "Context window exceeded" errors

    Cause: Retrieving too many chunks or chunks are too large, exceeding the model's context limit.

    Fix: Reduce topK from 10 to 3-5. Use smaller chunk sizes. Consider summarizing retrieved content before injecting into prompt. Use a model with larger context (GPT-4 Turbo: 128K, Claude: 200K).
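
    One way to compress retrieved content is to summarize it with an extra LLM call before building the final prompt. This sketch reuses the ChatClient and the context string from RAGService; the prompt wording is illustrative.

    Context Compression (sketch)
    // Condense retrieved chunks before the final RAG prompt,
    // trading an extra LLM call for a smaller context footprint.
    String condensedContext = chatClient.prompt()
        .user("Summarize the following documents, keeping only facts relevant to \""
            + question + "\":\n\n" + context)
        .call()
        .content();
    // Pass condensedContext instead of context when building the final prompt.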

    "Embeddings are slow to generate"

    Cause: Processing documents synchronously, one at a time.

    Fix: Batch embedding requests (OpenAI supports up to 2048 inputs per call). Use async processing with CompletableFuture. For local models, increase Ollama's parallelism.
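
    As a sketch of the batching and async idea, you can add chunks to the vector store in parallel batches; the batch size and thread count below are illustrative values, not Spring AI defaults.

    Batched Ingestion (sketch)
    // Index chunks in parallel batches instead of one at a time.
    int batchSize = 100;
    ExecutorService executor = Executors.newFixedThreadPool(4);

    List<CompletableFuture<Void>> futures = new ArrayList<>();
    for (int i = 0; i < chunks.size(); i += batchSize) {
        List<Document> batch = chunks.subList(i, Math.min(i + batchSize, chunks.size()));
        futures.add(CompletableFuture.runAsync(() -> vectorStore.add(batch), executor));
    }

    // Wait for all batches to finish, then release the worker threads
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    executor.shutdown();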

    Build Your Knowledge Base

    RAG transforms AI from a generic assistant into your domain expert. Start with your most valuable documents and iterate from there.