RAG (Retrieval Augmented Generation)
Ground your AI responses in facts from your own documents. RAG dramatically reduces hallucinations by giving the LLM access to your proprietary knowledge base, turning it from a generic chatbot into a domain expert.
Retrieval Augmented Generation is the technique of fetching relevant documents from a knowledge base and injecting them into the LLM's context window before generating a response. Instead of asking "What is our refund policy?" and hoping the model guesses correctly from its internet training data, RAG retrieves your actual refund policy document and provides it as context, so the answer is grounded in the real document.
Spring AI provides first-class support for RAG with built-in document loaders, text splitters, embedding clients, and vector store integrations. You can build a production-ready knowledge base in hours, not weeks.
Why RAG Changes Everything
Without RAG (Pure LLM)
- ✗ Limited to training data (often 1-2 years old)
- ✗ No knowledge of your internal documents
- ✗ Confidently hallucinates when it doesn't know
- ✗ Cannot cite sources for its claims
With RAG
- ✓ Access to your latest documents in real-time
- ✓ Answers grounded in your proprietary data
- ✓ Says "I don't know" when context is missing
- ✓ Can cite exact documents and page numbers
RAG Pipeline Architecture
Indexing Phase (Offline)
1. Load Documents: PDFs, Word docs, Markdown, HTML, JSON, databases
2. Chunk Text: Split into 500-1000 token segments with overlap
3. Generate Embeddings: Convert each chunk to a 1536-dimensional vector
4. Store in Vector DB: Index vectors for fast similarity search
Query Phase (Online)
1. Embed User Query: Convert the question to the same vector space
2. Similarity Search: Find the top-K most similar chunks
3. Augment Prompt: Inject retrieved context into the system/user message
4. Generate Response: LLM answers based on the provided context
Vector Store Options
PGVector
Use your existing Postgres database. No new infrastructure. Great for starting out.
Starter: spring-ai-pgvector-store-spring-boot-starter
Chroma
Lightweight, embeddable vector DB. Perfect for local development and small deployments.
Starter: spring-ai-chroma-store-spring-boot-starter
Pinecone
Fully managed, scales to billions of vectors. Best for production at scale.
Starter: spring-ai-pinecone-store-spring-boot-starter
Spring AI Implementation
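Each store is enabled by adding its starter to your build. A minimal Maven sketch for the PGVector option (starter artifact names have changed between Spring AI releases, so match the name to the version you use; the version itself is typically managed by the Spring AI BOM):

```xml
<!-- PGVector vector store starter; version managed by the Spring AI BOM -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
</dependency>
```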
Configuration
```properties
# Vector Store (PGVector example)
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.dimensions=1536

# Embedding Model
spring.ai.openai.embedding.options.model=text-embedding-3-small

# Or use local embeddings with Ollama
# spring.ai.ollama.embedding.options.model=nomic-embed-text
```
Document Ingestion Service
Load, split, and index your documents
```java
@Service
public class DocumentIngestionService {

    private final VectorStore vectorStore;

    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void ingestPdf(Resource pdfResource) {
        // 1. Load PDF
        PagePdfDocumentReader reader = new PagePdfDocumentReader(pdfResource);
        List<Document> documents = reader.read();

        // 2. Split into chunks (500 tokens, 100 token overlap)
        TokenTextSplitter splitter = new TokenTextSplitter(500, 100);
        List<Document> chunks = splitter.split(documents);

        // 3. Add metadata for filtering/citation
        chunks.forEach(doc -> {
            doc.getMetadata().put("source", pdfResource.getFilename());
            doc.getMetadata().put("ingested_at", Instant.now().toString());
        });

        // 4. Store (embeddings generated automatically)
        vectorStore.add(chunks);
    }

    public void ingestMarkdown(Resource mdResource) {
        TextReader reader = new TextReader(mdResource);
        List<Document> docs = new TokenTextSplitter().split(reader.read());
        vectorStore.add(docs);
    }
}
```
RAG Query Service
Retrieve context and generate answers
```java
@Service
public class RAGService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public RAGService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder
            .defaultSystem("""
                You are a helpful assistant that answers questions based on the provided context.
                If the answer is not in the context, say "I don't have information about that."
                Always cite which document your answer came from.
                """)
            .build();
        this.vectorStore = vectorStore;
    }

    public String query(String question) {
        // 1. Semantic search for relevant chunks
        List<Document> relevantDocs = vectorStore.similaritySearch(
            SearchRequest.query(question)
                .withTopK(5)
                .withSimilarityThreshold(0.7));

        // 2. Build context string with sources
        String context = relevantDocs.stream()
            .map(doc -> "Source: " + doc.getMetadata().get("source")
                + "\nContent: " + doc.getContent())
            .collect(Collectors.joining("\n\n---\n\n"));

        // 3. Generate response with context
        return chatClient.prompt()
            .user(u -> u.text("""
                Context:
                {context}

                Question: {question}
                """)
                .param("context", context)
                .param("question", question))
            .call()
            .content();
    }
}
```
Chunking Strategies
Chunking dramatically affects retrieval quality. Too small = lost context. Too large = irrelevant noise.
- Fixed Size: Split every N tokens with overlap
- Semantic: Split at paragraph/section boundaries
- Recursive: Try multiple separators hierarchically
Rule of Thumb: Start with 500 tokens per chunk, 100 token overlap. Adjust based on your content type—code needs smaller chunks, narratives need larger.
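The recursive strategy is straightforward to implement by hand. Below is a minimal, framework-free sketch (not Spring AI's TokenTextSplitter): it splits on progressively finer separators until each piece fits a rough character budget. The separator order and the character budget are illustrative assumptions, and a production splitter would also merge small adjacent pieces back up to the budget and add overlap between chunks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RecursiveSplitter {

    // Coarse-to-fine separators: paragraphs, then lines, then sentences, then words.
    private static final String[] SEPARATORS = {"\n\n", "\n", ". ", " "};

    // Rough budget per chunk; ~2000 chars is in the ballpark of 500 tokens.
    private final int maxChars;

    public RecursiveSplitter(int maxChars) {
        this.maxChars = maxChars;
    }

    public List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        split(text, 0, chunks);
        return chunks;
    }

    private void split(String text, int separatorIndex, List<String> chunks) {
        if (text.length() <= maxChars) {
            if (!text.isBlank()) {
                chunks.add(text);
            }
            return;
        }
        if (separatorIndex >= SEPARATORS.length) {
            // No separators left: fall back to a hard cut.
            for (int i = 0; i < text.length(); i += maxChars) {
                chunks.add(text.substring(i, Math.min(text.length(), i + maxChars)));
            }
            return;
        }
        // Split on the current separator; recurse with a finer one for oversized pieces.
        for (String piece : text.split(Pattern.quote(SEPARATORS[separatorIndex]))) {
            split(piece, separatorIndex + 1, chunks);
        }
    }
}
```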
Best Practices
✓ Do
- Add rich metadata (source, date, author) for filtering; see the filter sketch after these lists
- Use a similarity threshold to avoid irrelevant matches
- Implement hybrid search (semantic + keyword)
- Cache embeddings—they're expensive to regenerate
- Test with real user questions, not synthetic ones
✗ Avoid
- Retrieving too many chunks (context overflow)
- Ignoring document freshness (stale data)
- Mixing unrelated content in the same vector store
- Skipping chunk overlap (context loss at boundaries)
- Switching embedding models between indexing and querying
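The metadata you attach at ingestion time can narrow retrieval at query time. A minimal sketch using Spring AI's filter expression support; the "source" key matches the metadata added in the ingestion service above, and the filename is illustrative:

```java
// Restrict the similarity search to chunks from one specific document.
List<Document> policyDocs = vectorStore.similaritySearch(
    SearchRequest.query("What is our refund policy?")
        .withTopK(5)
        .withFilterExpression("source == 'refund-policy.pdf'"));
```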
Enterprise Use Cases
Internal Knowledge Base
Index company policies, procedures, and documentation. Employees ask questions in natural language and get instant, accurate answers with citations.
Customer Support Bot
Index FAQs, support tickets, and product docs. Resolve common issues instantly and escalate only when context is truly missing.
Codebase Q&A
Index your entire codebase including comments and docs. Ask "How does auth work?" and get answers pointing to actual implementation files.
Legal/Compliance Research
Search through contracts, regulations, and legal documents. Find relevant clauses in seconds instead of hours of manual review.
Advanced RAG Patterns
Basic RAG works well for simple use cases, but production systems often need more sophisticated retrieval strategies. These advanced patterns can dramatically improve answer quality and relevance.
Multi-Query Retrieval
Users often phrase questions poorly or ambiguously. Multi-query RAG uses the LLM itself to generate 3-5 alternative phrasings of the user's question, retrieves documents for each variant, and merges the results. This captures semantically related content that a single query might miss.
```java
// Generate query variants
List<String> queryVariants = chatClient.prompt()
    .user("Generate 3 alternative phrasings for: " + originalQuery)
    .call()
    .entity(new ParameterizedTypeReference<List<String>>() {});

// Retrieve for each variant and merge
Set<Document> allDocs = new HashSet<>();
for (String variant : queryVariants) {
    allDocs.addAll(vectorStore.similaritySearch(variant));
}
```
HyDE (Hypothetical Document Embeddings)
Instead of embedding the question directly, HyDE asks the LLM to generate a hypothetical answer (even if it's hallucinated), then uses that answer's embedding to search. This works because a hypothetical answer is semantically closer to actual answers than a question is.
```java
// Generate hypothetical answer
String hypotheticalAnswer = chatClient.prompt()
    .user("Answer this question (guess if needed): " + question)
    .call()
    .content();

// Search using the hypothetical answer's embedding
List<Document> docs = vectorStore.similaritySearch(hypotheticalAnswer);
```
Re-Ranking
Vector search is fast but imprecise. Re-ranking uses a more expensive model (like a cross-encoder) to re-score the top-K results based on actual relevance to the query. Retrieve 20 documents, re-rank, and use only the top 5. This can boost accuracy by 15-30%.
```java
// Over-retrieve candidates (copy into a mutable list so we can sort in place)
List<Document> candidates = new ArrayList<>(vectorStore.similaritySearch(
    SearchRequest.query(question).withTopK(20)));

// Re-rank with a cross-encoder or LLM (scoreRelevance is your relevance-scoring function)
candidates.sort(Comparator.comparingDouble(
    (Document doc) -> scoreRelevance(doc, question)).reversed());

// Use top 5 after re-ranking
List<Document> topDocs = candidates.subList(0, Math.min(5, candidates.size()));
```
Hybrid Search (Semantic + Keyword)
Pure vector search can miss exact keyword matches (like product codes or technical terms). Hybrid search combines semantic similarity with traditional BM25 keyword matching, giving you the best of both worlds. Most vector databases now support this natively.
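Where your store does not support hybrid search natively, you can fuse the two result lists yourself. A minimal sketch using Reciprocal Rank Fusion (RRF), assuming a hypothetical keywordSearch helper backed by your database's full-text (BM25-style) index; the candidate count, final cut-off, and the conventional k = 60 constant are illustrative choices:

```java
// Fuse semantic and keyword rankings with Reciprocal Rank Fusion (RRF).
public List<Document> hybridSearch(String question) {
    List<Document> semantic = vectorStore.similaritySearch(
        SearchRequest.query(question).withTopK(20));
    List<Document> keyword = keywordSearch(question, 20); // hypothetical BM25/full-text helper

    Map<String, Double> scores = new HashMap<>();
    Map<String, Document> byId = new HashMap<>();
    for (List<Document> ranking : List.of(semantic, keyword)) {
        for (int rank = 0; rank < ranking.size(); rank++) {
            Document doc = ranking.get(rank);
            byId.putIfAbsent(doc.getId(), doc);
            // RRF score: 1 / (k + rank), with k = 60 dampening the dominance of top ranks.
            scores.merge(doc.getId(), 1.0 / (60 + rank + 1), Double::sum);
        }
    }

    // Keep the 5 highest-scoring documents across both rankings.
    return byId.values().stream()
        .sorted(Comparator.comparingDouble((Document d) -> scores.get(d.getId())).reversed())
        .limit(5)
        .toList();
}
```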
Embedding Model Comparison
The embedding model you choose significantly impacts retrieval quality. Here's how the popular options compare. Remember: you should use the same embedding model for both indexing and querying—mixing models will produce garbage results.
| Model | Dimensions | Best For | Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | General purpose, cost-effective | $0.02/1M tokens |
| text-embedding-3-large | 3072 | Maximum accuracy | $0.13/1M tokens |
| nomic-embed-text (Ollama) | 768 | Local/private, good quality | Free (local) |
| Cohere embed-v3 | 1024 | Multilingual, long documents | $0.10/1M tokens |
Pro Tip: Start with text-embedding-3-small for cloud or nomic-embed-text for local. Only upgrade to larger models if you see retrieval quality issues in production.
Troubleshooting Common Issues
"The AI doesn't find relevant documents"
Causes: Poor chunking (chunks too big or too small), wrong embedding model, similarity threshold too high, or query phrasing doesn't match document vocabulary.
Fixes: Experiment with chunk sizes (try 300-800 tokens). Lower the similarity threshold to 0.5. Use multi-query retrieval. Verify that documents were actually indexed, for example by querying the record count in your vector store.
"Answers are correct but cite the wrong source"
Cause: Metadata not preserved during chunking, or prompt doesn't instruct model to cite sources.
Fix: Add source/filename to each chunk's metadata before indexing. Include instructions in system prompt: "Always cite the source document for each claim."
"Context window exceeded" errors
Cause: Retrieving too many chunks or chunks are too large, exceeding the model's context limit.
Fix: Reduce topK from 10 to 3-5. Use smaller chunk sizes. Consider summarizing retrieved content before injecting into prompt. Use a model with larger context (GPT-4 Turbo: 128K, Claude: 200K).
"Embeddings are slow to generate"
Cause: Processing documents synchronously, one at a time.
Fix: Batch embedding requests (OpenAI supports up to 2048 inputs per call). Use async processing with CompletableFuture. For local models, increase Ollama's parallelism.
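A minimal sketch of batched, asynchronous indexing, reusing the chunks list and vectorStore from the ingestion service above; the batch size of 200 is an illustrative value to tune against your provider's limits, not a library default:

```java
// Split chunks into batches and index them concurrently.
int batchSize = 200; // illustrative; stay under your embedding provider's per-call limit
List<CompletableFuture<Void>> futures = new ArrayList<>();

for (int i = 0; i < chunks.size(); i += batchSize) {
    List<Document> batch = chunks.subList(i, Math.min(chunks.size(), i + batchSize));
    // Each batch is embedded and stored on a separate worker thread.
    futures.add(CompletableFuture.runAsync(() -> vectorStore.add(batch)));
}

// Wait for all batches to finish before reporting success.
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
```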