RAG (Retrieval Augmented Generation)

    Ground your AI responses in facts from your own documents. RAG dramatically reduces hallucinations by giving the LLM access to your proprietary knowledge base, turning it from a generic chatbot into a domain expert.

    Retrieval Augmented Generation is the technique of fetching relevant documents from a knowledge base and injecting them into the LLM's context window before generating a response. Instead of asking "What is our refund policy?" and hoping the model guesses correctly from its training data, RAG retrieves your actual refund policy document and provides it as context, grounding the answer in the real text.

    Spring AI provides first-class support for RAG with built-in document loaders, text splitters, embedding clients, and vector store integrations. You can build a production-ready knowledge base in hours, not weeks.

    Why RAG Changes Everything

    Without RAG (Pure LLM)

    • Limited to training data (often 1-2 years old)
    • No knowledge of your internal documents
    • Confidently hallucinates when it doesn't know
    • Cannot cite sources for its claims

    With RAG

    • Access to your latest documents in real-time
    • Answers grounded in your proprietary data
    • Says "I don't know" when context is missing
    • Can cite exact documents and page numbers

    RAG Pipeline Architecture

    Indexing Phase (Offline)

    1. Load Documents: PDFs, Word docs, Markdown, HTML, JSON, databases
    2. Chunk Text: Split into 500-1000 token segments with overlap
    3. Generate Embeddings: Convert each chunk to a 1536-dimensional vector
    4. Store in Vector DB: Index vectors for fast similarity search

    Query Phase (Online)

    5. Embed User Query: Convert the question to the same vector space
    6. Similarity Search: Find the top-K most similar chunks
    7. Augment Prompt: Inject retrieved context into the system/user message
    8. Generate Response: The LLM answers based on the provided context

    Vector Store Options

    PGVector

    PostgreSQL Extension

    Use your existing Postgres database. No new infrastructure. Great for starting out.

    spring-ai-pgvector-store-spring-boot-starter

    Chroma

    Open Source

    Lightweight, embeddable vector DB. Perfect for local development and small deployments.

    spring-ai-chroma-store-spring-boot-starter

    Pinecone

    Managed Cloud

    Fully managed, scales to billions of vectors. Best for production at scale.

    spring-ai-pinecone-store-spring-boot-starter

    Spring AI Implementation

    Configuration

    application.properties
    # Vector Store (PGVector example)
    spring.ai.vectorstore.pgvector.index-type=HNSW
    spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
    spring.ai.vectorstore.pgvector.dimensions=1536

    # Embedding Model
    spring.ai.openai.embedding.options.model=text-embedding-3-small

    # Or use local embeddings with Ollama
    # spring.ai.ollama.embedding.options.model=nomic-embed-text

    Document Ingestion Service

    Load, split, and index your documents

    DocumentIngestionService.java
    @Service
    public class DocumentIngestionService {

        private final VectorStore vectorStore;

        public DocumentIngestionService(VectorStore vectorStore) {
            this.vectorStore = vectorStore;
        }

        public void ingestPdf(Resource pdfResource) {
            // 1. Load PDF
            PagePdfDocumentReader reader = new PagePdfDocumentReader(pdfResource);
            List<Document> documents = reader.read();

            // 2. Split into chunks (500 tokens, 100 token overlap)
            TokenTextSplitter splitter = new TokenTextSplitter(500, 100);
            List<Document> chunks = splitter.split(documents);

            // 3. Add metadata for filtering/citation
            chunks.forEach(doc -> {
                doc.getMetadata().put("source", pdfResource.getFilename());
                doc.getMetadata().put("ingested_at", Instant.now().toString());
            });

            // 4. Store (embeddings generated automatically)
            vectorStore.add(chunks);
        }

        public void ingestMarkdown(Resource mdResource) {
            TextReader reader = new TextReader(mdResource);
            List<Document> docs = new TokenTextSplitter().split(reader.read());
            vectorStore.add(docs);
        }
    }
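
    One simple way to trigger ingestion at application startup is a CommandLineRunner; this is a sketch, and the classpath location "docs/handbook.pdf" is a placeholder, not a file from the original example.

    IngestionRunner.java (sketch)
    @Component
    class IngestionRunner implements CommandLineRunner {

        private final DocumentIngestionService ingestionService;

        IngestionRunner(DocumentIngestionService ingestionService) {
            this.ingestionService = ingestionService;
        }

        @Override
        public void run(String... args) {
            // Index the bundled PDF once when the application boots
            ingestionService.ingestPdf(new ClassPathResource("docs/handbook.pdf"));
        }
    }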

    RAG Query Service

    Retrieve context and generate answers

    RAGService.java
    @Service
    public class RAGService {

        private final ChatClient chatClient;
        private final VectorStore vectorStore;

        public RAGService(ChatClient.Builder builder, VectorStore vectorStore) {
            this.chatClient = builder
                .defaultSystem("""
                    You are a helpful assistant that answers questions based on the provided context.
                    If the answer is not in the context, say "I don't have information about that."
                    Always cite which document your answer came from.
                    """)
                .build();
            this.vectorStore = vectorStore;
        }

        public String query(String question) {
            // 1. Semantic search for relevant chunks
            List<Document> relevantDocs = vectorStore.similaritySearch(
                SearchRequest.query(question)
                    .withTopK(5)
                    .withSimilarityThreshold(0.7));

            // 2. Build context string with sources
            String context = relevantDocs.stream()
                .map(doc -> "Source: " + doc.getMetadata().get("source")
                    + "\nContent: " + doc.getContent())
                .collect(Collectors.joining("\n\n---\n\n"));

            // 3. Generate response with context
            return chatClient.prompt()
                .user(u -> u.text("""
                        Context:
                        {context}

                        Question: {question}
                        """)
                    .param("context", context)
                    .param("question", question))
                .call()
                .content();
        }
    }

    Chunking Strategies

    Chunking dramatically affects retrieval quality. Too small = lost context. Too large = irrelevant noise.

    Fixed Size

    Split every N tokens with overlap

    Simple, Fast

    Semantic

    Split at paragraph/section boundaries

    Better Context

    Recursive

    Try multiple separators hierarchically

    Most Flexible

    Rule of Thumb: Start with 500 tokens per chunk, 100 token overlap. Adjust based on your content type—code needs smaller chunks, narratives need larger.
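
    To make the fixed-size strategy concrete, here is a minimal, library-free sketch in plain Java. It counts whitespace-separated words rather than model tokens, and the chunkSize/overlap values are the placeholders from the rule of thumb above.

    FixedSizeChunker.java (sketch)
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class FixedSizeChunker {

        // Split text into chunks of roughly chunkSize tokens, each sharing
        // `overlap` tokens with the previous chunk so context survives the boundary.
        public static List<String> chunk(String text, int chunkSize, int overlap) {
            String[] tokens = text.split("\\s+");
            List<String> chunks = new ArrayList<>();
            int step = Math.max(1, chunkSize - overlap);   // how far each chunk advances
            for (int start = 0; start < tokens.length; start += step) {
                int end = Math.min(tokens.length, start + chunkSize);
                chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
                if (end == tokens.length) break;           // last chunk reached
            }
            return chunks;
        }
    }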

    Best Practices

    ✓ Do

    • Add rich metadata (source, date, author) for filtering (see the sketch after these lists)
    • Use a similarity threshold to avoid irrelevant matches
    • Implement hybrid search (semantic + keyword)
    • Cache embeddings—they're expensive to regenerate
    • Test with real user questions, not synthetic ones

    ✗ Avoid

    • Retrieving too many chunks (context overflow)
    • Ignoring document freshness (stale data)
    • Mixing unrelated content in the same vector store
    • Skipping chunk overlap (context loss at boundaries)
    • Using different embedding models for indexing and querying
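
    To illustrate the metadata and similarity-threshold practices above, the fluent SearchRequest used in RAGService can combine both. This is a sketch: the filter expression builds on the "source" metadata added during ingestion, and the filename is a placeholder.

    Metadata Filter (sketch)
    // Drop weak matches and restrict retrieval to a single source document.
    List<Document> docs = vectorStore.similaritySearch(
        SearchRequest.query(question)
            .withTopK(5)
            .withSimilarityThreshold(0.7)                       // ignore weak matches
            .withFilterExpression("source == 'handbook.pdf'")); // metadata filter (placeholder value)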

    Enterprise Use Cases

    Internal Knowledge Base

    Index company policies, procedures, and documentation. Employees ask questions in natural language and get instant, accurate answers with citations.

    Customer Support Bot

    Train on FAQs, support tickets, and product docs. Resolve common issues instantly and escalate only when context is truly missing.

    Codebase Q&A

    Index your entire codebase including comments and docs. Ask "How does auth work?" and get answers pointing to actual implementation files.

    Legal/Compliance Research

    Search through contracts, regulations, and legal documents. Find relevant clauses in seconds instead of hours of manual review.

    Advanced RAG Patterns

    Basic RAG works well for simple use cases, but production systems often need more sophisticated retrieval strategies. These advanced patterns can dramatically improve answer quality and relevance.

    Multi-Query Retrieval

    Users often phrase questions poorly or ambiguously. Multi-query RAG uses the LLM itself to generate 3-5 alternative phrasings of the user's question, retrieves documents for each variant, and merges the results. This captures semantically related content that a single query might miss.

    Multi-Query Pattern
    // Generate query variants
    List<String> queryVariants = chatClient.prompt()
        .user("Generate 3 alternative phrasings for: " + originalQuery)
        .call()
        .entity(new ParameterizedTypeReference<List<String>>() {});

    // Retrieve for each variant and merge
    Set<Document> allDocs = new HashSet<>();
    for (String variant : queryVariants) {
        allDocs.addAll(vectorStore.similaritySearch(variant));
    }

    HyDE (Hypothetical Document Embeddings)

    Instead of embedding the question directly, HyDE asks the LLM to generate a hypothetical answer (even if it's hallucinated), then uses that answer's embedding to search. This works because a hypothetical answer is semantically closer to actual answers than a question is.

    HyDE Pattern
    // Generate a hypothetical answer
    String hypotheticalAnswer = chatClient.prompt()
        .user("Answer this question (guess if needed): " + question)
        .call()
        .content();

    // Search using the hypothetical answer's embedding
    List<Document> docs = vectorStore.similaritySearch(hypotheticalAnswer);

    Re-Ranking

    Vector search is fast but imprecise. Re-ranking uses a more expensive model (like a cross-encoder) to re-score the top-K results based on actual relevance to the query. Retrieve 20 documents, re-rank, and use only the top 5. This can boost accuracy by 15-30%.

    Re-Ranking Pattern
    // Over-retrieve candidates (copy into a mutable list so it can be sorted)
    List<Document> candidates = new ArrayList<>(vectorStore.similaritySearch(
        SearchRequest.query(question).withTopK(20)));

    // Re-rank with a cross-encoder or LLM (scoreRelevance is your scoring function)
    candidates.sort(Comparator.comparingDouble(
        (Document doc) -> scoreRelevance(doc, question)).reversed());

    // Use top 5 after re-ranking
    List<Document> topDocs = candidates.subList(0, Math.min(5, candidates.size()));

    Hybrid Search (Semantic + Keyword)

    Pure vector search can miss exact keyword matches (like product codes or technical terms). Hybrid search combines semantic similarity with traditional BM25 keyword matching, giving you the best of both worlds. Most vector databases now support this natively.

    Vector stores with native hybrid support include Pinecone, Weaviate, and Elasticsearch; a sketch of application-level fusion follows below.
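
    The portable VectorStore API used above only exposes similaritySearch, so if your store lacks native hybrid search you can fuse rankings in application code. The sketch below merges a semantic ranking with a keyword ranking via reciprocal rank fusion; keywordSearch is a hypothetical helper (for example, a BM25 query against Elasticsearch), and 60 is the conventional RRF damping constant.

    Hybrid Search Fusion (sketch)
    // Retrieve two independent rankings: semantic and keyword-based.
    List<Document> semantic = vectorStore.similaritySearch(
        SearchRequest.query(question).withTopK(20));
    List<Document> keyword = keywordSearch(question, 20);   // hypothetical helper

    // Reciprocal rank fusion: documents ranked highly in either list score highest.
    Map<String, Double> fusedScores = new HashMap<>();
    for (List<Document> ranking : List.of(semantic, keyword)) {
        for (int rank = 0; rank < ranking.size(); rank++) {
            fusedScores.merge(ranking.get(rank).getId(), 1.0 / (60 + rank), Double::sum);
        }
    }

    // Keep the IDs of the top 5 fused results
    List<String> topIds = fusedScores.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
        .limit(5)
        .map(Map.Entry::getKey)
        .toList();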

    Embedding Model Comparison

    The embedding model you choose significantly impacts retrieval quality. Here's how the popular options compare. Remember: you should use the same embedding model for both indexing and querying—mixing models will produce garbage results.

    Model                        Dimensions   Best For                          Cost
    text-embedding-3-small       1536         General purpose, cost-effective   $0.02/1M tokens
    text-embedding-3-large       3072         Maximum accuracy                  $0.13/1M tokens
    nomic-embed-text (Ollama)    768          Local/private, good quality       Free (local)
    Cohere embed-v3              1024         Multilingual, long documents      $0.10/1M tokens

    Pro Tip: Start with text-embedding-3-small for cloud or nomic-embed-text for local. Only upgrade to larger models if you see retrieval quality issues in production.

    Troubleshooting Common Issues

    "The AI doesn't find relevant documents"

    Causes: Poor chunking (chunks too big or too small), wrong embedding model, similarity threshold too high, or query phrasing doesn't match document vocabulary.

    Fixes: Experiment with chunk sizes (try 300-800 tokens). Lower the similarity threshold to around 0.5. Use multi-query retrieval. Check that documents were actually indexed (query the row count in the vector store).

    "Answers are correct but cite the wrong source"

    Cause: Metadata not preserved during chunking, or prompt doesn't instruct model to cite sources.

    Fix: Add source/filename to each chunk's metadata before indexing. Include instructions in system prompt: "Always cite the source document for each claim."

    "Context window exceeded" errors

    Cause: Retrieving too many chunks or chunks are too large, exceeding the model's context limit.

    Fix: Reduce topK from 10 to 3-5. Use smaller chunk sizes. Consider summarizing retrieved content before injecting into prompt. Use a model with larger context (GPT-4 Turbo: 128K, Claude: 200K).
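
    One way to compress retrieved content is to summarize it with an extra LLM call before building the final prompt. This sketch reuses the ChatClient and the context string from RAGService; the prompt wording is illustrative.

    Context Compression (sketch)
    // Condense retrieved chunks before the final RAG prompt,
    // trading an extra LLM call for a smaller context footprint.
    String condensedContext = chatClient.prompt()
        .user("Summarize the following documents, keeping only facts relevant to \""
            + question + "\":\n\n" + context)
        .call()
        .content();
    // Pass condensedContext instead of context when building the final prompt.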

    "Embeddings are slow to generate"

    Cause: Processing documents synchronously, one at a time.

    Fix: Batch embedding requests (OpenAI supports up to 2048 inputs per call). Use async processing with CompletableFuture. For local models, increase Ollama's parallelism.
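
    As a sketch of the batching and async idea, you can add chunks to the vector store in parallel batches; the batch size and thread count below are illustrative values, not Spring AI defaults.

    Batched Ingestion (sketch)
    // Index chunks in parallel batches instead of one at a time.
    int batchSize = 100;
    ExecutorService executor = Executors.newFixedThreadPool(4);

    List<CompletableFuture<Void>> futures = new ArrayList<>();
    for (int i = 0; i < chunks.size(); i += batchSize) {
        List<Document> batch = chunks.subList(i, Math.min(i + batchSize, chunks.size()));
        futures.add(CompletableFuture.runAsync(() -> vectorStore.add(batch), executor));
    }

    // Wait for all batches to finish, then release the worker threads
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    executor.shutdown();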

    Build Your Knowledge Base

    RAG transforms AI from a generic assistant into your domain expert. Start with your most valuable documents and iterate from there.