OpenAI Integration
Integrate OpenAI's GPT models into your Spring Boot applications with a unified API, streaming support, and rigorous type safety.
OpenAI has set the standard for modern AI capabilities with models like GPT-4, GPT-4 Turbo, and GPT-4o. Spring AI provides first-class support for OpenAI, giving Java developers access to chat completions, embeddings, image generation with DALL-E, speech synthesis, and audio transcription—all through a consistent, type-safe API.
The OpenAI integration handles connection management, automatic retries, rate limiting backoff, and streaming. You write business logic; Spring AI handles the infrastructure. And when you need to switch to Azure OpenAI for enterprise deployment, your code stays the same—only configuration changes.
Available OpenAI Models
Chat Models
- GPT-4o — Latest
- GPT-4 Turbo — 128k context
- GPT-4 — 8k/32k context
- GPT-3.5 Turbo — Budget
Other Capabilities
- DALL-E 3 — Image generation
- Whisper — Speech to text
- TTS — Text to speech
- Vision — Image understanding
Core Concepts
Tokens & Context
LLMs process text in chunks called tokens (one token ≈ 0.75 English words). Each model has a strict context window: GPT-4 Turbo supports 128k tokens, while GPT-4 supports 8k/32k. The limit covers both input and output tokens.
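As a back-of-envelope check (not a Spring AI feature), you can estimate whether a prompt will fit a model's window before sending it. The sketch below uses the rough ~4-characters-per-token heuristic; a real tokenizer library gives exact counts.

```java
// Rough token estimation - a heuristic only, not a real tokenizer.
public final class TokenBudget {

    private static final int GPT_4_TURBO_CONTEXT = 128_000;

    // ~4 characters per token is a common approximation for English text
    public static int estimateTokens(String text) {
        return Math.max(1, text.length() / 4);
    }

    // Does prompt + requested completion fit in the context window?
    public static boolean fitsContext(String prompt, int maxCompletionTokens) {
        return estimateTokens(prompt) + maxCompletionTokens <= GPT_4_TURBO_CONTEXT;
    }
}
```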
Chat Roles
System: Sets behavioral guidelines and persona.
User: The actual query or input from humans.
Assistant: The model's response (use for conversation history).
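In Spring AI, each role maps to a dedicated message type that you can assemble into a conversation; the message strings below are illustrative.

```java
import java.util.List;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.messages.Message;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.messages.UserMessage;

// System sets the persona, User carries the query, Assistant holds prior replies.
List<Message> conversation = List.of(
        new SystemMessage("You are a concise technical assistant."),       // behavioral guidelines
        new UserMessage("What is a context window?"),                      // human input
        new AssistantMessage("It's the model's per-request token limit."), // earlier model reply
        new UserMessage("How large is it for GPT-4 Turbo?"));              // follow-up question

// Pass the list straight to chatClient.prompt().messages(conversation)
```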
Temperature
Controls randomness (0.0 to 2.0). Use 0.0-0.3 for factual/code tasks, 0.7-1.0 for creative writing. Higher values increase variety but may reduce coherence.
Understanding these concepts is essential for cost control and output quality. Token limits determine how much context you can provide—for RAG applications, you might use 80% of the context for retrieved documents and 20% for the actual conversation. Temperature dramatically affects output: customer support bots should use low temperature for consistent answers, while brainstorming assistants benefit from higher values.
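One way to encode that split is to define one client per use case. This is a sketch assuming the auto-configured ChatClient.Builder; the bean names are made up, and the builder methods follow the style used in the examples later in this article.

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ChatClientConfig {

    @Bean
    ChatClient supportClient(ChatClient.Builder builder) {    // hypothetical bean name
        return builder
                .defaultOptions(OpenAiChatOptions.builder()
                        .temperature(0.2)   // low: consistent support answers
                        .build())
                .build();
    }

    @Bean
    ChatClient brainstormClient(ChatClient.Builder builder) { // hypothetical bean name
        return builder
                .defaultOptions(OpenAiChatOptions.builder()
                        .temperature(0.9)   // high: varied, creative output
                        .build())
                .build();
    }
}
```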
Getting Started
Configuration
Add the Spring AI OpenAI starter dependency and configure your API key.
```xml
<!-- Maven Dependency -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

<!-- Add Spring AI BOM for version management -->
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.0.0-M4</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
```
Application Properties
Configure your OpenAI connection settings
```properties
# Required: Your OpenAI API key (use environment variable in production)
spring.ai.openai.api-key=${OPENAI_API_KEY}

# Model selection (default: gpt-4o)
spring.ai.openai.chat.options.model=gpt-4o

# Generation parameters
spring.ai.openai.chat.options.temperature=0.7
spring.ai.openai.chat.options.max-tokens=2000
spring.ai.openai.chat.options.top-p=1.0

# Optional: Organization ID (for teams)
spring.ai.openai.organization-id=org-xxxx

# Optional: Override base URL (for proxies or Azure)
# spring.ai.openai.base-url=https://your-proxy.com/v1
```
Never commit API keys! Use environment variables (`export OPENAI_API_KEY=sk-...`) or Spring's `@Value("${OPENAI_API_KEY}")` injection.
Basic Usage
Chat Service Implementation
```java
import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.Message;
import org.springframework.stereotype.Service;

import reactor.core.publisher.Flux;

@Service
public class OpenAIChatService {

    private final ChatClient chatClient;

    public OpenAIChatService(ChatClient.Builder builder) {
        this.chatClient = builder
                .defaultSystem("""
                        You are a helpful customer service assistant for TechStore.
                        Be friendly, concise, and helpful. If you don't know something,
                        say so honestly rather than making up information.
                        """)
                .build();
    }

    // Simple chat - returns the complete response
    public String chat(String userMessage) {
        return chatClient.prompt()
                .user(userMessage)
                .call()
                .content();
    }

    // Streaming - returns tokens as they're generated
    public Flux<String> streamChat(String userMessage) {
        return chatClient.prompt()
                .user(userMessage)
                .stream()
                .content();
    }

    // With conversation history
    public String chatWithHistory(String userMessage, List<Message> history) {
        return chatClient.prompt()
                .messages(history)
                .user(userMessage)
                .call()
                .content();
    }
}
```
The ChatClient.Builder is injected automatically by Spring AI. Use .defaultSystem() to set a persona that applies to all requests—this is where you define your assistant's behavior, personality, and constraints. The system prompt is crucial for consistent, on-brand responses.
Advanced Features
Streaming with SSE
Stream responses to your frontend in real-time using Server-Sent Events.
@GetMapping(value ="/chat/stream",
produces =MediaType.TEXT_EVENT_STREAM_VALUE)publicFlux<String>streamChat(@RequestParamString message){return chatClient.prompt().user(message).stream().content().map(chunk ->"data: "+ chunk +"\n\n");}Structured Output (JSON Mode)
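If the frontend needs named events or ids, each chunk can be wrapped in Spring's ServerSentEvent instead of a raw string; the endpoint path and event name here are illustrative.

```java
import org.springframework.http.codec.ServerSentEvent;

// Same controller as above; emits named 'token' events (name is illustrative).
@GetMapping(value = "/chat/stream-events", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<ServerSentEvent<String>> streamChatEvents(@RequestParam String message) {
    return chatClient.prompt()
            .user(message)
            .stream()
            .content()
            .map(chunk -> ServerSentEvent.builder(chunk)
                    .event("token")
                    .build());
}
```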
Structured Output (JSON Mode)
Parse AI responses directly into Java objects—no regex needed.
```java
import java.math.BigDecimal;
import java.util.List;

public record ProductInfo(
        String name,
        String category,
        BigDecimal price,
        List<String> features) {
}

public ProductInfo extractProduct(String description) {
    return chatClient.prompt()
            .user("Extract product info: " + description)
            .call()
            .entity(ProductInfo.class);
}
```
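A hypothetical call, with made-up product text, to show what comes back:

```java
// Illustrative input and the expected shape of the extracted record
ProductInfo info = extractProduct(
        "The AcmeBook Pro is a 14-inch electronics-category laptop priced at "
        + "$1,299, with 16GB RAM and a 512GB SSD.");

System.out.println(info.name());     // e.g. "AcmeBook Pro"
System.out.println(info.price());    // e.g. 1299.00
System.out.println(info.features()); // e.g. [16GB RAM, 512GB SSD]
```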
Custom Model Parameters
Override default settings per request for fine-grained control over model behavior.
```java
import org.springframework.ai.openai.OpenAiChatOptions;

public String generateCreativeContent(String prompt) {
    return chatClient.prompt()
            .user(prompt)
            .options(OpenAiChatOptions.builder()
                    .model("gpt-4o")
                    .temperature(0.9)       // More creative
                    .maxTokens(2000)        // Longer responses
                    .topP(0.95)             // Nucleus sampling
                    .presencePenalty(0.6)   // Encourage new topics
                    .frequencyPenalty(0.3)  // Reduce repetition
                    .build())
            .call()
            .content();
}

public String generateCode(String spec) {
    return chatClient.prompt()
            .user(spec)
            .options(OpenAiChatOptions.builder()
                    .model("gpt-4-turbo")
                    .temperature(0.0)   // Deterministic
                    .maxTokens(4000)    // Room for code
                    .build())
            .call()
            .content();
}
```
Vision: Image Understanding
GPT-4o and GPT-4 Vision can analyze images. Pass images as URLs or base64-encoded data.
```java
public String analyzeImage(String imageUrl, String question) {
    var userMessage = new UserMessage(question,
            List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageUrl)));
    return chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();
}

// Example: Analyze a product image
String analysis = analyzeImage(
        "https://example.com/product.jpg",
        "Describe this product. What are its key features?");
```
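For local images, a Spring Resource can stand in for the URL; the classpath file below is illustrative.

```java
import org.springframework.core.io.ClassPathResource;

// Same service as above; sends an image bundled with the application.
public String analyzeLocalImage(String question) {
    var image = new ClassPathResource("images/product.png"); // illustrative path
    var userMessage = new UserMessage(question,
            List.of(new Media(MimeTypeUtils.IMAGE_PNG, image)));
    return chatClient.prompt()
            .messages(userMessage)
            .call()
            .content();
}
```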
Pricing & Cost Optimization

| Model | Positioning | Price (per 1M input tokens) | Best for |
| --- | --- | --- | --- |
| GPT-3.5 Turbo | Budget-friendly | $0.50 | Simple Q&A, classification, summarization |
| GPT-4o | Balanced | $5.00 | Complex tasks, code, multimodal |
| GPT-4 Turbo | Maximum context | $10.00 | Long documents, complex reasoning |
- Use GPT-3.5 Turbo for simple tasks—it's 10-20x cheaper
- Set maxTokens to prevent unexpectedly long responses
- Cache responses for repeated queries with Spring Cache (see the sketch after this list)
- Use smaller models for initial filtering, GPT-4 for final answers
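A minimal caching sketch for the third tip, assuming @EnableCaching is active and a cache named "ai-responses" exists (both names are illustrative). Caching pays off mainly for deterministic, low-temperature prompts, since high-temperature responses are intentionally varied.

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class CachedChatService {

    private final ChatClient chatClient;

    public CachedChatService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // Identical messages return the cached answer without calling OpenAI again
    @Cacheable("ai-responses")   // illustrative cache name
    public String chat(String message) {
        return chatClient.prompt().user(message).call().content();
    }
}
```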
Error Handling & Rate Limits
```java
@Service
public class ResilientOpenAIService {

    private final ChatClient chatClient;

    @Retryable(
            value = {OpenAiApiException.class},
            maxAttempts = 3,
            backoff = @Backoff(delay = 1000, multiplier = 2))
    public String chat(String message) {
        try {
            return chatClient.prompt().user(message).call().content();
        } catch (OpenAiApiException e) {
            if (e.getStatusCode() == 429) {
                log.warn("Rate limit hit, retry scheduled...");
                throw e; // Will be retried
            }
            if (e.getStatusCode() == 503) {
                log.error("OpenAI service unavailable");
                throw new ServiceUnavailableException("AI service temporarily down");
            }
            throw e;
        }
    }

    @Recover
    public String fallback(OpenAiApiException e, String message) {
        log.error("All retries exhausted for: {}", message);
        return "I'm sorry, our AI service is currently unavailable. Please try again later.";
    }
}
```
OpenAI enforces rate limits based on your tier—typically requests per minute (RPM) and tokens per minute (TPM). When you hit these limits, the API returns a 429 Too Many Requests error. Spring Retry with exponential backoff handles this gracefully, automatically waiting before retrying.
Best Practices
💰 Cost Management
- Set maxTokens to prevent runaway costs
- Use GPT-3.5 Turbo for simple tasks (10x cheaper)
- Cache common queries with Spring Cache
- Monitor usage via OpenAI dashboard
🛡️ Security
- Never expose API keys in frontend code
- Sanitize user input to prevent prompt injection
- Implement rate limiting per user
- Use content moderation for user inputs
⚡ Performance
- Use streaming for better perceived latency
- Batch similar requests when possible
- Set appropriate timeouts (30-60s typical)
- Monitor response times and token usage
🎯 Quality
- Write detailed, specific system prompts
- Include examples in prompts for consistency
- Use temperature 0 for factual tasks
- Test prompts with varied inputs