    Advanced
    ~30 minutes

    RAG-Enhanced Chatbots

    Ground your chatbot responses in real data. Retrieval-Augmented Generation combines the power of LLMs with your organization's knowledge base for accurate, sourced answers.

    Vector Databases
    Semantic Search
    Document Processing

    Large Language Models are incredibly powerful, but they have a fundamental limitation: they only know what they were trained on. Ask GPT-4 about your company's internal policies, last quarter's sales figures, or the contents of your proprietary documentation, and it can only hallucinate an answer—or honestly admit it doesn't know.

    Retrieval-Augmented Generation (RAG) solves this by giving the LLM access to your data at query time. When a user asks a question, we first search our knowledge base for relevant information, then provide that context to the LLM along with the question. The model generates a response grounded in actual documents, dramatically reducing hallucinations and enabling citation of sources.

    This is the architecture behind modern enterprise chatbots, customer support systems, and internal knowledge assistants. With Spring AI, implementing production-grade RAG in Java is remarkably straightforward.

    How RAG Works

1. Document Ingestion: load documents (PDFs, web pages, databases) and split them into chunks.

    2. Embedding Generation: convert text chunks into vector embeddings using an embedding model.

    3. Vector Storage: store embeddings in a vector database for efficient similarity search.

    4. Query & Retrieval: when a user asks a question, find the most relevant chunks.

    5. Augmented Generation: send the retrieved context plus the question to the LLM for a grounded response.
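    Put in Spring AI terms, the whole loop is only a few calls. The sketch below uses the built-in TextReader and a default TokenTextSplitter for brevity; treat it as a minimal outline of the five steps, which the rest of this lesson fleshes out.

    // Indexing (steps 1-3): read, chunk, and store; embeddings are generated on add()
    List<Document> chunks = new TokenTextSplitter()
        .apply(new TextReader(resource).get());
    vectorStore.add(chunks);

    // Query time (steps 4-5): retrieve relevant chunks, then generate a grounded answer
    List<Document> relevant = vectorStore.similaritySearch(
        SearchRequest.query(userQuestion).withTopK(5));

    String context = relevant.stream()
        .map(Document::getContent)
        .collect(Collectors.joining("\n\n"));

    String answer = chatClient.prompt()
        .system("Answer using only this context:\n" + context)
        .user(userQuestion)
        .call()
        .content();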

    Choose Your Vector Database

    Vector databases store embeddings and enable fast similarity search. Spring AI supports multiple options.

    PgVector

    PostgreSQL extension, familiar SQL, great for existing Postgres users

    Best for: Teams already using PostgreSQL

    Pinecone

    Managed service, serverless scaling, excellent performance

    Best for: Production at scale

    Chroma

    Lightweight, easy setup, great for development

    Best for: Local development & prototyping

    Weaviate

    GraphQL API, hybrid search, built-in ML models

    Best for: Complex search requirements
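    Whichever store you pick, your application code talks to the same Spring AI VectorStore interface, so switching backends is mostly a dependency and configuration change. A brief sketch of that shared API:

    // The same calls work against PgVector, Pinecone, Chroma, or Weaviate.
    vectorStore.add(List.of(
        new Document("Our refund policy allows returns within 30 days.",
            Map.of("source", "policies.md"))));

    List<Document> hits = vectorStore.similaritySearch(
        SearchRequest.query("How long do customers have to return a product?")
            .withTopK(3));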
1. Project Setup & Dependencies

    Add Spring AI dependencies for your chosen vector store and document processing. We'll use PgVector with PostgreSQL—a great choice if you're already using Postgres, as it adds vector capabilities via an extension.

    pom.xml - RAG Dependencies
<!-- pom.xml - RAG Dependencies -->
<dependencies>
    <!-- Spring AI Core -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    </dependency>

    <!-- Vector Store - Choose one -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
    </dependency>

    <!-- Document Readers -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pdf-document-reader</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-tika-document-reader</artifactId>
    </dependency>
</dependencies>

    Vector Store Configuration

    VectorStoreConfig.java
@Configuration
public class VectorStoreConfig {

    @Bean
    public VectorStore vectorStore(JdbcTemplate jdbcTemplate, EmbeddingModel embeddingModel) {
        return new PgVectorStore(jdbcTemplate, embeddingModel,
            PgVectorStore.PgVectorStoreConfig.builder()
                .withDimensions(1536)  // OpenAI ada-002 dimensions
                .withDistanceType(PgVectorStore.PgDistanceType.COSINE_DISTANCE)
                .withIndexType(PgVectorStore.PgIndexType.HNSW)
                .build());
    }
}

    application.properties
    spring.datasource.url=jdbc:postgresql://localhost:5432/vectordb
    spring.datasource.username=postgres
    spring.datasource.password=secret
    spring.ai.openai.api-key=${OPENAI_API_KEY}

    The PgVectorStore configuration specifies important parameters. dimensions must match your embedding model (OpenAI's text-embedding-ada-002 produces 1536-dimensional vectors). COSINE_DISTANCE is the most common similarity metric: it measures the angle between vectors rather than their magnitude, which works better for semantic comparison. HNSW indexing provides fast approximate nearest-neighbor search, essential for production performance.
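
    For intuition, here is a small plain-Java sketch of cosine similarity (not Spring AI code); cosine distance is simply 1 minus this value.

    // Cosine similarity: dot(a, b) / (|a| * |b|). 1.0 means same direction,
    // 0.0 means orthogonal (unrelated), regardless of vector magnitude.
    public static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }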

2. Document Ingestion Pipeline

    Before your chatbot can answer questions, you need to load your documents, split them into searchable chunks, and store their vector embeddings. This is the "indexing" phase of RAG.

    DocumentIngestionService.java
@Service
public class DocumentIngestionService {

    private static final Logger log = LoggerFactory.getLogger(DocumentIngestionService.class);

    private final VectorStore vectorStore;
    private final TokenTextSplitter textSplitter;

    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        this.textSplitter = new TokenTextSplitter(
            800,    // Default chunk size
            350,    // Minimum chunk size
            200,    // Overlap between chunks
            10000,  // Max chunks per document
            true    // Keep separator
        );
    }

    public void ingestPdfDocument(Resource pdfResource, Map<String, Object> metadata) {
        // Read PDF
        var pdfReader = new PagePdfDocumentReader(pdfResource);
        List<Document> documents = pdfReader.get();

        // Add metadata to each document
        documents.forEach(doc -> doc.getMetadata().putAll(metadata));

        // Split into chunks
        List<Document> chunks = textSplitter.apply(documents);

        // Store in vector database (embeddings generated automatically)
        vectorStore.add(chunks);

        log.info("Ingested {} chunks from PDF: {}",
            chunks.size(), pdfResource.getFilename());
    }

    public void ingestTextContent(String content, String source, Map<String, Object> metadata) {
        Document document = new Document(content, metadata);
        document.getMetadata().put("source", source);

        List<Document> chunks = textSplitter.apply(List.of(document));
        vectorStore.add(chunks);
    }
}

    Chunking is perhaps the most critical decision in RAG. Chunks that are too small lose context; chunks that are too large dilute the specific information you're looking for. The TokenTextSplitter configuration above uses 800 tokens per chunk with 200-token overlap. The overlap ensures that information at chunk boundaries isn't lost—if a relevant sentence spans two chunks, it will appear in both.

    Metadata is equally important. Each chunk carries metadata like source file, document title, and ingestion timestamp. This enables filtering (e.g., "only search the 2024 Q3 reports"), citation (tell users exactly where an answer came from), and debugging (trace unexpected answers back to their source documents).
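
    As a hedged sketch of what such filtering can look like at query time, Spring AI's SearchRequest accepts a filter expression over chunk metadata; the keys category and year below are illustrative, not ones set in the examples above.

    // Restrict retrieval to a subset of the knowledge base via metadata.
    // "category" and "year" are hypothetical keys you would set during ingestion.
    List<Document> results = vectorStore.similaritySearch(
        SearchRequest.query("What were the Q3 revenue highlights?")
            .withTopK(5)
            .withFilterExpression("category == 'quarterly-report' && year == 2024"));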

    Pro tip: Run ingestion as a batch job, not during user requests. Consider a scheduled job that detects new/updated documents and incrementally updates your vector store.
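
    A minimal sketch of that pattern, assuming a hypothetical DocumentRepository that knows which documents changed since the last run:

    // Hypothetical incremental re-indexing job; DocumentRepository and its
    // findUpdatedSince(...) method are assumptions, not Spring AI APIs.
    @Component
    public class IngestionScheduler {

        private final DocumentIngestionService ingestionService;
        private final DocumentRepository documentRepository;
        private Instant lastRun = Instant.EPOCH;

        public IngestionScheduler(DocumentIngestionService ingestionService,
                                  DocumentRepository documentRepository) {
            this.ingestionService = ingestionService;
            this.documentRepository = documentRepository;
        }

        @Scheduled(cron = "0 0 2 * * *")  // every night at 02:00
        public void reindexUpdatedDocuments() {
            Instant now = Instant.now();
            documentRepository.findUpdatedSince(lastRun).forEach(doc ->
                ingestionService.ingestTextContent(
                    doc.content(), doc.source(), Map.of("updated_at", now.toString())));
            lastRun = now;
        }
    }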

3. The RAG Chat Controller

    Now we bring it together: a controller that retrieves relevant documents and uses them to generate accurate, grounded responses. Spring AI's QuestionAnswerAdvisor handles the retrieval automatically.

    RagChatController.java
@RestController
@RequestMapping("/api/rag-chat")
public class RagChatController {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public RagChatController(ChatClient.Builder builder, VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        this.chatClient = builder
            .defaultSystem("""
                You are a knowledgeable assistant for TechCorp. Answer questions based
                on the provided context from our documentation. If the context doesn't
                contain relevant information, say so honestly rather than making up answers.
                Always cite your sources when possible.
                """)
            .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore,
                SearchRequest.defaults()
                    .withTopK(5)
                    .withSimilarityThreshold(0.7)))
            .build();
    }

    @PostMapping
    public ChatResponse chat(@RequestBody ChatRequest request) {
        String response = chatClient.prompt()
            .user(request.message())
            .call()
            .content();

        // Optionally include retrieved documents
        List<Document> sources = vectorStore.similaritySearch(
            SearchRequest.query(request.message()).withTopK(3));

        return new ChatResponse(response, sources.stream()
            .map(doc -> new Source(
                doc.getMetadata().get("source").toString(),
                doc.getContent().substring(0, Math.min(200, doc.getContent().length()))))
            .toList());
    }
}

// Request/response DTOs (package-private so all three can share this file)
record ChatRequest(String message) {}
record ChatResponse(String answer, List<Source> sources) {}
record Source(String name, String snippet) {}

    The QuestionAnswerAdvisor is Spring AI's turnkey RAG solution. Behind the scenes, it converts the user's question to a vector embedding, searches the vector store for similar documents, injects those documents into the prompt as context, and lets the LLM generate a response grounded in that context—all in a single .call().
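
    The advisor can also be attached per request rather than as a default, which is handy when the retrieval scope depends on the caller. A hedged sketch (user.department() is a placeholder for however you resolve the current user):

    // Per-request advisor with a caller-specific metadata filter.
    String answer = chatClient.prompt()
        .user(request.message())
        .advisors(new QuestionAnswerAdvisor(vectorStore,
            SearchRequest.defaults()
                .withTopK(5)
                .withFilterExpression("department == '" + user.department() + "'")))
        .call()
        .content();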

    The topK(5) parameter retrieves the 5 most similar chunks. More chunks provide more context but increase token usage and cost. similarityThreshold(0.7) filters out weakly relevant results—if no chunks score above 0.7 similarity, the model knows there's no relevant context and can respond accordingly.

    Notice we also return the source documents in the response. This enables your frontend to display citations, letting users verify answers and explore source materials.

    Advanced

    Chunking Strategies

    Different document types benefit from different chunking approaches. Token-based splitting is fast but can break mid-sentence. Semantic chunking preserves logical boundaries but is more complex.

    ChunkingConfig.java
@Configuration
public class ChunkingConfig {

    // Strategy 1: Token-based (default)
    @Bean
    public TokenTextSplitter tokenSplitter() {
        return new TokenTextSplitter(800, 350, 200, 10000, true);
    }

    // Strategy 2: Semantic chunking (keeps paragraphs together)
    @Bean
    public TextSplitter semanticSplitter() {
        return new TextSplitter() {
            @Override
            protected List<String> splitText(String text) {
                // Split by paragraphs, then regroup to target size
                String[] paragraphs = text.split("\n\n+");
                List<String> chunks = new ArrayList<>();
                StringBuilder current = new StringBuilder();

                for (String para : paragraphs) {
                    if (current.length() + para.length() > 1500) {
                        if (current.length() > 0) {
                            chunks.add(current.toString().trim());
                            current = new StringBuilder();
                        }
                    }
                    current.append(para).append("\n\n");
                }
                if (current.length() > 0) {
                    chunks.add(current.toString().trim());
                }
                return chunks;
            }
        };
    }
}

// Metadata enrichment during ingestion (e.g. a helper in your ingestion service)
public Document enrichChunk(Document chunk, int chunkIndex, String documentTitle) {
    chunk.getMetadata().put("chunk_index", chunkIndex);
    chunk.getMetadata().put("document_title", documentTitle);
    chunk.getMetadata().put("ingested_at", Instant.now().toString());

    // Add section headers for better context
    String content = chunk.getContent();
    String[] lines = content.split("\n");
    if (lines.length > 0 && lines[0].matches("^#+\\s+.*")) {
        chunk.getMetadata().put("section_header", lines[0].replaceAll("^#+\\s+", ""));
    }
    return chunk;
}

    Semantic chunking keeps paragraphs and sections together, preserving the logical flow of documents. This is especially valuable for technical documentation, legal contracts, or any content where meaning depends heavily on context. The tradeoff is variable chunk sizes—some may be quite short if a document has many small paragraphs.

    Metadata enrichment adds valuable information to each chunk. Storing section headers enables the LLM to reference "As described in the 'Security Best Practices' section..." The chunk index allows reconstruction of document order if needed. Ingestion timestamps support freshness-based filtering.

    Hybrid Retrieval & Re-ranking

    Pure semantic search sometimes misses exact keyword matches. Hybrid approaches combine semantic and keyword search for better recall, then use LLM-based re-ranking for precision.

    HybridRetrievalService.java
@Service
public class HybridRetrievalService {

    private final VectorStore vectorStore;
    private final ChatClient chatClient;

    public HybridRetrievalService(VectorStore vectorStore, ChatClient.Builder builder) {
        this.vectorStore = vectorStore;
        this.chatClient = builder.build();
    }

    // Hybrid search: keyword + semantic
    public List<Document> hybridSearch(String query, String category) {
        // Semantic search
        SearchRequest semanticRequest = SearchRequest.query(query)
            .withTopK(10)
            .withSimilarityThreshold(0.6)
            .withFilterExpression("category == '" + category + "'");

        List<Document> semanticResults = vectorStore.similaritySearch(semanticRequest);

        // Re-rank using LLM for better relevance
        return rerank(query, semanticResults);
    }

    private List<Document> rerank(String query, List<Document> documents) {
        if (documents.isEmpty()) return documents;

        String rerankPrompt = """
            Given the query: "%s"

            Rank these documents by relevance (1-10 scale):
            %s

            Return JSON array: [{"index": 0, "score": 8.5}, ...]
            """.formatted(query, formatDocuments(documents));

        String rankingJson = chatClient.prompt()
            .user(rerankPrompt)
            .call()
            .content();

        // Parse and reorder based on scores (formatDocuments, parseRankings and
        // reorderByScores are helper methods omitted here)
        return reorderByScores(documents, parseRankings(rankingJson));
    }

    // Query expansion for better recall
    public String expandQuery(String originalQuery) {
        return chatClient.prompt()
            .system("Generate 3 alternative phrasings of this query for search. Return JSON array.")
            .user(originalQuery)
            .call()
            .content();
    }
}

    Re-ranking is a powerful technique: retrieve a broader set of candidates (topK=10), then use the LLM to score each document against the original query. This catches relevant documents that might have lower embedding similarity but are actually more useful. The tradeoff is additional LLM calls and latency.

    Query expansion generates alternative phrasings of the user's question. If someone asks "How do I reset my password?", the system also searches for "forgot password", "change credentials", "account recovery", increasing the chance of finding relevant documents regardless of how they're worded.
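
    One hedged way to wire expansion into retrieval: search with the original query plus each alternative phrasing, then merge and de-duplicate the results. The phrasings parameter is assumed to be the parsed output of expandQuery above; JSON parsing is omitted for brevity.

    // Search with every phrasing and merge results, de-duplicating by document id.
    public List<Document> searchWithExpansion(String originalQuery, List<String> phrasings) {
        Map<String, Document> merged = new LinkedHashMap<>();

        Stream.concat(Stream.of(originalQuery), phrasings.stream())
            .flatMap(q -> vectorStore.similaritySearch(
                SearchRequest.query(q).withTopK(5)).stream())
            .forEach(doc -> merged.putIfAbsent(doc.getId(), doc));

        return List.copyOf(merged.values());
    }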

    Conversational RAG with Memory

    Combine RAG with conversation memory for multi-turn question answering, so the system can handle follow-up questions like "What about the pricing?", where the subject was only established earlier in the conversation.

    ConversationalRagService.java
@Service
public class ConversationalRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;
    private final ChatMemory chatMemory;

    // ChatClient is assumed to be built elsewhere with a chat-memory advisor registered;
    // CHAT_MEMORY_CONVERSATION_ID_KEY is assumed to be statically imported from that advisor support.
    public ConversationalRagService(ChatClient chatClient, VectorStore vectorStore, ChatMemory chatMemory) {
        this.chatClient = chatClient;
        this.vectorStore = vectorStore;
        this.chatMemory = chatMemory;
    }

    public String chat(String sessionId, String userMessage) {
        // 1. Get conversation history for context
        List<Message> history = chatMemory.get(sessionId, 5);

        // 2. Reformulate query based on conversation context
        String standaloneQuery = reformulateQuery(userMessage, history);

        // 3. Retrieve relevant documents
        List<Document> relevantDocs = vectorStore.similaritySearch(
            SearchRequest.query(standaloneQuery).withTopK(5));

        // 4. Build context-aware prompt
        String context = relevantDocs.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n---\n\n"));

        // 5. Generate response with memory
        String response = chatClient.prompt()
            .system("""
                Answer based on the provided context and conversation history.
                If the context doesn't contain the answer, say so.

                Context:
                %s
                """.formatted(context))
            .user(userMessage)
            .advisors(a -> a.param(CHAT_MEMORY_CONVERSATION_ID_KEY, sessionId))
            .call()
            .content();

        return response;
    }

    private String reformulateQuery(String query, List<Message> history) {
        if (history.isEmpty()) return query;

        String historyContext = history.stream()
            .map(m -> m.getRole() + ": " + m.getContent())
            .collect(Collectors.joining("\n"));

        return chatClient.prompt()
            .system("Given the conversation history, reformulate this query as a standalone question.")
            .user("History:\n" + historyContext + "\n\nNew query: " + query)
            .call()
            .content();
    }
}

    The key insight is query reformulation. When a user asks "What about the pricing?", we use the conversation history to reformulate this into "What is the pricing for TechCorp Enterprise Edition?" before searching. This standalone query can be effectively matched against documents without relying on conversation context.

    This architecture—memory + reformulation + retrieval—is how production-grade conversational RAG systems work. Each component can be tuned independently: adjust memory window, improve reformulation prompts, tweak retrieval parameters.

    Remember: Query reformulation adds latency (an extra LLM call). For simple, standalone questions, skip reformulation. Use heuristics like "does the query contain pronouns?" to decide.
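
    A crude version of that heuristic (the pronoun list is illustrative and can be tuned for your domain):

    // Skip the reformulation call when the query looks self-contained.
    private static final Pattern REFERS_BACK =
        Pattern.compile("\\b(it|that|this|they|them|those|these|the above)\\b",
            Pattern.CASE_INSENSITIVE);

    private boolean needsReformulation(String query, List<Message> history) {
        return !history.isEmpty() && REFERS_BACK.matcher(query).find();
    }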

    RAG Best Practices

    Start with good data

    RAG quality depends on your documents. Clean, well-structured content retrieves better than messy data.

    Experiment with chunk sizes

    There's no universal right answer. Test 500, 800, and 1000 token chunks on your specific use case.

    Use metadata filtering

    Don't search everything. Filter by document type, date, category, or user permissions.

    Monitor retrieval quality

    Log retrieved chunks and user feedback. Poor retrievals cause poor answers.

    Keep embeddings fresh

    Documents change. Build a pipeline to detect updates and re-embed affected chunks.

    Handle 'no results' gracefully

    When nothing relevant is found, have the LLM honestly say so rather than guess.
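
    For that last point, one hedged sketch: check whether retrieval returned anything above the similarity threshold before calling the model, and fall back to an honest default if not.

    // If nothing clears the similarity threshold, don't ask the LLM to guess.
    List<Document> docs = vectorStore.similaritySearch(
        SearchRequest.query(question).withTopK(5).withSimilarityThreshold(0.7));

    if (docs.isEmpty()) {
        return "I couldn't find anything in our documentation about that. "
             + "Try rephrasing, or contact support for help.";
    }
    // ...otherwise proceed with the normal RAG prompt shown earlier.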

    Your Chatbot Now Knows Your Data!

    With RAG, your AI assistant can answer questions about your documents, products, policies, and more—with accurate, sourced responses instead of hallucinations.