What is RAG?
Retrieval-Augmented Generation (RAG) is the technique of giving LLMs access to external knowledge by retrieving relevant documents and including them in the prompt. Instead of relying solely on training data, the LLM can reference your company documents, product manuals, knowledge bases, or any other text corpus to generate accurate, up-to-date answers.
Consider a customer support bot for your SaaS product. Without RAG, it can only answer based on general knowledge from its training. With RAG, it can pull information from your actual documentation, release notes, and FAQ to give precise, product-specific answers. A user asking "How do I reset my password in version 3.2?" gets the exact steps from your docs, not generic advice.
RAG is essential because LLMs have knowledge cutoffs (they don't know about events after their training date) and can't access private/proprietary information. RAG bridges this gap by dynamically injecting relevant context into each conversation, making AI assistants genuinely useful for enterprise applications.
Without RAG
"Tell me about our Q3 pricing changes" → "I don't have information about your specific company's pricing."
With RAG
"Tell me about our Q3 pricing changes" → Retrieves pricing doc → "In Q3, we introduced a 15% discount for annual plans..."
Understanding Embeddings
Embeddings are numerical representations of text that capture semantic meaning. Instead of comparing exact words, embeddings let you find text with similar meaning. The sentence "How do I cancel my subscription?" is semantically close to "I want to stop my membership" even though they share few words. Embeddings make this similarity discoverable.
An embedding transforms text into a vector of numbers (1536 dimensions for OpenAI's text-embedding-3-small; other models produce different sizes). Similar texts produce similar vectors, so you can use mathematical distance to find relevant documents. This is the foundation of semantic search: finding content by meaning rather than keywords.
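To make "mathematical distance" concrete, here is a small sketch of cosine similarity, the measure most vector stores use by default, computed directly on two float[] vectors:

// Cosine similarity between two embedding vectors: values near 1.0 mean the
// vectors point in the same direction, i.e. the texts are semantically close.
static double cosineSimilarity(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Usage with two embeddings produced by the same model (created in the next section):
// double score = cosineSimilarity(
//         embeddingModel.embed("How do I cancel my subscription?").content().vector(),
//         embeddingModel.embed("I want to stop my membership").content().vector());
// The higher the score, the closer the meaning.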
Creating Embeddings
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.data.embedding.Embedding;
import java.util.Arrays;

public class EmbeddingExample {
    public static void main(String[] args) {
        EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))
                .modelName("text-embedding-3-small")
                .build();

        // Convert text to embedding vector
        Embedding embedding = embeddingModel.embed("How do I reset my password?").content();

        // embedding.vector() returns float[] with 1536 dimensions for text-embedding-3-small
        System.out.println("Vector dimensions: " + embedding.vector().length);
        System.out.println("First 5 values: " + Arrays.toString(Arrays.copyOf(embedding.vector(), 5)));
    }
}

Embedding Model Choice Matters
Different embedding models have different strengths. OpenAI's "text-embedding-3-small" is cost-effective for most use cases. For better accuracy at higher cost, use "text-embedding-3-large". Always use the same model for storing and querying — mixing models produces incompatible vectors.
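One cheap guard against accidentally mixing models is to record the vector length you indexed with and check it before querying. A sketch, reusing the embeddingModel from the example above:

// Fail fast if the query-time model produces vectors of a different size than
// the ones already stored (a typical symptom of mixed embedding models).
int storedDimension = 1536;   // dimension used when the store was populated

Embedding queryEmbedding = embeddingModel.embed("How do I reset my password?").content();
if (queryEmbedding.vector().length != storedDimension) {
    throw new IllegalStateException("Embedding dimension mismatch: expected "
            + storedDimension + " but got " + queryEmbedding.vector().length
            + ". Are you querying with a different embedding model?");
}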
Vector Stores
Embeddings need to be stored somewhere for efficient retrieval. Vector stores (also called vector databases) are specialized databases optimized for storing and searching embedding vectors. They use algorithms like HNSW or IVF to find similar vectors in milliseconds, even across millions of documents.
LangChain4j integrates with many vector stores: in-memory for development, Chroma and Milvus for open-source solutions, Pinecone and Weaviate for managed services, or Elasticsearch and PostgreSQL with pgvector for leveraging existing infrastructure.
In-Memory Store (Development)
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.data.embedding.Embedding;
import java.util.List;

// Create an in-memory store (perfect for development/testing)
InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

// Store a document
TextSegment segment = TextSegment.from("Our refund policy allows returns within 30 days.");
Embedding embedding = embeddingModel.embed(segment.text()).content();
embeddingStore.add(embedding, segment);

// Search for relevant content
Embedding queryEmbedding = embeddingModel.embed("How do I get my money back?").content();
List<EmbeddingMatch<TextSegment>> matches = embeddingStore.findRelevant(
        queryEmbedding,
        3   // Return top 3 matches
);

// First match will be the refund policy!
matches.forEach(match -> {
    System.out.println("Score: " + match.score());
    System.out.println("Text: " + match.embedded().text());
});

Production: PostgreSQL with pgvector
import dev.langchain4j.store.embedding.pgvector.PgVectorEmbeddingStore;

// PostgreSQL with pgvector extension (production-ready)
PgVectorEmbeddingStore embeddingStore = PgVectorEmbeddingStore.builder()
        .host("localhost")
        .port(5432)
        .database("knowledge_base")
        .user("postgres")
        .password(System.getenv("DB_PASSWORD"))
        .table("document_embeddings")
        .dimension(1536)   // Must match your embedding model
        .build();

// Usage is identical to the in-memory store!
embeddingStore.add(embedding, segment);
List<EmbeddingMatch<TextSegment>> results = embeddingStore.findRelevant(queryEmbedding, 5);

Loading & Chunking Documents
Real documents are often too long to embed as a single piece. LLMs have context limits, and retrieving a 100-page PDF as one chunk isn't useful. Document loaders read various file formats, and text splitters break them into manageable chunks that can be embedded and retrieved independently.
Chunking strategy significantly impacts RAG quality. Chunks too small lose context; chunks too large dilute relevance. A good starting point is 500-1000 characters with 100-200 character overlap to maintain context across chunk boundaries.
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.TextDocumentParser;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Load documents from a directory
List<Document> documents = FileSystemDocumentLoader.loadDocuments(
        Path.of("./knowledge-base"),
        new TextDocumentParser()
);

// Split into chunks
DocumentSplitter splitter = DocumentSplitters.recursive(
        500,   // Max characters per chunk
        100    // Overlap between chunks
);

List<TextSegment> segments = new ArrayList<>();
for (Document doc : documents) {
    segments.addAll(splitter.split(doc));
}

// Embed and store all segments
for (TextSegment segment : segments) {
    Embedding embedding = embeddingModel.embed(segment.text()).content();
    embeddingStore.add(embedding, segment);
}
System.out.println("Indexed " + segments.size() + " chunks");

Supported File Formats
LangChain4j supports PDF, TXT, HTML, Markdown, and more through different parsers. For PDFs, add the appropriate dependency (e.g., Apache PDFBox). Each format may require specific parser configuration.
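For example, to index PDFs you can swap in a PDF-capable parser. The sketch below assumes the langchain4j-document-parser-apache-pdfbox module is on the classpath; the parser's package and name may differ slightly between versions:

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;
import java.nio.file.Path;
import java.util.List;

// Same loader as before, but with a PDF parser instead of TextDocumentParser
List<Document> pdfDocuments = FileSystemDocumentLoader.loadDocuments(
        Path.of("./knowledge-base/manuals"),   // illustrative directory
        new ApachePdfBoxDocumentParser());
// The resulting Documents are split, embedded, and stored exactly as shown above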
Complete RAG Pipeline
Now let's put it all together. LangChain4j provides ContentRetriever and RetrievalAugmentor to seamlessly integrate retrieval into your AI Services. The retriever finds relevant documents, and the augmentor injects them into the prompt automatically.
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class RagPipeline {

    interface KnowledgeAssistant {
        String answer(String question);
    }

    public static void main(String[] args) {
        // 1. Set up embedding model and store (populated with your docs)
        EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))
                .modelName("text-embedding-3-small")
                .build();

        InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
        // ... populate store with your documents ...

        // 2. Create content retriever
        ContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .maxResults(5)     // Retrieve top 5 relevant chunks
                .minScore(0.7)     // Only include if similarity > 70%
                .build();

        // 3. Create chat model
        ChatLanguageModel chatModel = OpenAiChatModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))
                .modelName("gpt-4o-mini")
                .build();

        // 4. Build AI Service with RAG
        KnowledgeAssistant assistant = AiServices.builder(KnowledgeAssistant.class)
                .chatLanguageModel(chatModel)
                .contentRetriever(retriever)
                .build();

        // 5. Ask questions - retrieval happens automatically!
        String answer = assistant.answer("What is our refund policy?");
        System.out.println(answer);
        // Response will be based on your actual documents!
    }
}
"What's the refund policy?"
Embed Query
Convert to vector
Retrieve
Find similar docs
Generate
LLM answers with context
RAG Best Practices
Tune Your Chunk Size
Experiment with different chunk sizes. For factual Q&A, smaller chunks (300-500 chars) often work better. For complex reasoning, larger chunks (800-1200 chars) provide more context.
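As a starting point, the two settings described above might look like this with the recursive splitter shown earlier:

// Smaller chunks for factual Q&A: precise hits, less surrounding context
DocumentSplitter faqSplitter = DocumentSplitters.recursive(400, 100);

// Larger chunks for complex reasoning: more context per retrieved segment
DocumentSplitter reasoningSplitter = DocumentSplitters.recursive(1000, 200);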
Add Metadata
Include source file, date, author, and category as metadata. This enables filtering (e.g., "only search in HR policies") and helps users verify sources.
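As a sketch of the idea, metadata can be attached when a segment is created so it travels with the embedding; the exact filtering API depends on the store and LangChain4j version, so treat the details as illustrative:

import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.segment.TextSegment;
import java.util.Map;

// Attach metadata when creating a segment (file name and category are illustrative)
TextSegment segment = TextSegment.from(
        "Employees accrue 1.5 vacation days per month.",
        Metadata.from(Map.of(
                "source", "hr-policies.md",
                "category", "HR")));

// Each EmbeddingMatch later exposes its segment's metadata, which you can surface
// as a citation; stores that support filtering can also restrict a search to
// category = "HR" before similarity is computed.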
Use Hybrid Search
Combine semantic search (embeddings) with keyword search (BM25). Some queries work better with exact keyword matching while others need semantic understanding.
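The retriever shown in this level is purely embedding-based, so one way to add keyword search is to run it yourself (for example BM25 in Elasticsearch) and merge the two result lists. A store-agnostic sketch using reciprocal rank fusion:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reciprocal rank fusion: a document's score is the sum of 1 / (k + rank) over
// every ranking it appears in; k = 60 is a commonly used constant.
static List<String> fuseRankings(List<String> semanticHits, List<String> keywordHits) {
    int k = 60;
    Map<String, Double> scores = new HashMap<>();
    for (List<String> ranking : List.of(semanticHits, keywordHits)) {
        for (int rank = 0; rank < ranking.size(); rank++) {
            scores.merge(ranking.get(rank), 1.0 / (k + rank + 1), Double::sum);
        }
    }
    List<String> fused = new ArrayList<>(scores.keySet());
    fused.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
    return fused;
}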
Set Minimum Similarity Threshold
Don't include low-relevance results. A minScore of 0.7-0.8 usually filters out noise. If nothing meets the threshold, acknowledge uncertainty rather than using irrelevant content.
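One way to implement that fallback, sketched against the findRelevant call used earlier (assuming the overload that also takes a minimum score):

// Only answer from retrieved context when something clears the threshold
double minScore = 0.75;
List<EmbeddingMatch<TextSegment>> matches =
        embeddingStore.findRelevant(queryEmbedding, 5, minScore);

if (matches.isEmpty()) {
    // Nothing relevant enough: acknowledge uncertainty instead of adding noise
    System.out.println("I couldn't find anything about that in the knowledge base.");
} else {
    // Build the prompt (or let your AI Service run) with the retrieved matches
}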
🎉 Level 5 Complete!
You've mastered embeddings and RAG! Your AI can now answer questions using your own documents, making it genuinely useful for domain-specific applications. In the next level, we'll explore AI Services — the declarative API that simplifies everything you've learned into clean, type-safe interfaces.