Multimodal AI
Beyond text: process images, generate visuals, transcribe audio, and synthesize speech. Multimodal AI unlocks capabilities that text-only models simply cannot achieve.
Multimodal AI refers to models that understand and generate multiple types of data: text, images, audio, and video. Instead of describing a product in words, you can show the model a photo and ask "What is this and how much should I charge?" Instead of typing questions, users can speak naturally. Instead of describing an image, the AI can generate it.
Spring AI provides abstractions for vision models (GPT-4V, Claude 3, Gemini), audio transcription (Whisper), speech synthesis (TTS), and image generation (DALL-E 3). Each modality has its own model interface, and all of them follow the same pattern: build a prompt, call the model, read the response.
Multimodal Capabilities
Vision
Analyze images, extract text (OCR), describe content, compare visuals
Transcription
Convert speech to text with timestamps (speaker diarization depends on the provider and model)
Text-to-Speech
Generate natural-sounding speech from text in multiple voices
Image Generation
Create images from text prompts, edit existing images, generate variations
Vision: Image Understanding
Analyze Images with GPT-4 Vision
Send images via URL or base64-encoded data
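For the base64 route, the image bytes end up in a data URL, and the MIME type in that URL should match the real image format. Below is a minimal sketch of that step using only the JDK. `ImageDataUrl` is a hypothetical helper, not part of Spring AI, and the set of recognized formats (PNG, JPEG, GIF) is an assumption chosen for illustration.

```java
import java.util.Base64;

/** Hypothetical helper (not part of Spring AI): wraps raw image bytes in a data URL. */
final class ImageDataUrl {

    private ImageDataUrl() {}

    /** Infers the MIME type from the leading magic bytes; returns null if unrecognized. */
    static String detectMimeType(byte[] data) {
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47)) return "image/png";  // \x89PNG
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) return "image/jpeg";       // JPEG SOI marker
        if (startsWith(data, 0x47, 0x49, 0x46, 0x38)) return "image/gif";  // GIF8
        return null;
    }

    /** Builds "data:<mime>;base64,<payload>" and rejects bytes of unknown format. */
    static String toDataUrl(byte[] data) {
        String mime = detectMimeType(data);
        if (mime == null) {
            throw new IllegalArgumentException("Unrecognized image format");
        }
        return "data:" + mime + ";base64," + Base64.getEncoder().encodeToString(data);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare bytes as unsigned values.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Detecting the type up front also gives a cheap place to reject files that are not images before any API call is made.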
    // Note: UserMessage and Media constructor signatures differ between
    // Spring AI versions; adjust to the version you are using.
    @Service
    public class VisionService {

        private final ChatClient chatClient;

        public VisionService(ChatClient.Builder builder) {
            this.chatClient = builder
                    .defaultOptions(OpenAiChatOptions.builder()
                            .model("gpt-4o") // vision-capable model
                            .build())
                    .build();
        }

        // Analyze image from URL
        public String analyzeImageUrl(String imageUrl, String question) {
            UserMessage message = new UserMessage(
                    question,
                    List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageUrl)));
            return chatClient.prompt()
                    .messages(message)
                    .call()
                    .content();
        }

        // Analyze local image (base64)
        public String analyzeLocalImage(byte[] imageData, String question) {
            String base64 = Base64.getEncoder().encodeToString(imageData);
            String dataUrl = "data:image/png;base64," + base64;
            UserMessage message = new UserMessage(
                    question,
                    List.of(new Media(MimeTypeUtils.IMAGE_PNG, dataUrl)));
            return chatClient.prompt()
                    .messages(message)
                    .call()
                    .content();
        }

        // Compare multiple images
        public String compareImages(List<String> imageUrls) {
            List<Media> media = imageUrls.stream()
                    .map(url -> new Media(MimeTypeUtils.IMAGE_PNG, url))
                    .toList();
            UserMessage message = new UserMessage(
                    "Compare these images and describe the key differences",
                    media);
            return chatClient.prompt()
                    .messages(message)
                    .call()
                    .content();
        }
    }

Structured Data Extraction from Images
    // Extract structured data from receipts, invoices, etc.
    // LineItem is not shown in this section; it is assumed to be defined elsewhere.
    public record InvoiceData(
            String invoiceNumber,
            String vendorName,
            LocalDate date,
            List<LineItem> items,
            double total
    ) {}

    public InvoiceData extractInvoiceData(String imageUrl) {
        UserMessage message = new UserMessage(
                "Extract all invoice data from this image. Return as JSON.",
                List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageUrl)));
        return chatClient.prompt()
                .messages(message)
                .call()
                .entity(InvoiceData.class); // maps the JSON reply onto the record
    }

Audio Transcription (Whisper)
Speech-to-Text
Transcribe audio files with OpenAI's Whisper model
    @Service
    public class TranscriptionService {

        private final OpenAiAudioTranscriptionModel transcriptionModel;

        public TranscriptionService(OpenAiAudioTranscriptionModel transcriptionModel) {
            this.transcriptionModel = transcriptionModel;
        }

        // Basic transcription
        public String transcribe(Resource audioFile) {
            AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile);
            AudioTranscriptionResponse response = transcriptionModel.call(prompt);
            return response.getResult().getOutput();
        }

        // With timestamps (for subtitles)
        public String transcribeWithTimestamps(Resource audioFile) {
            OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                    .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.VTT)
                    .build();
            AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
            return transcriptionModel.call(prompt).getResult().getOutput();
        }

        // Transcribe English audio with a language hint.
        // Note: `language` names the language spoken in the audio (ISO-639-1 code).
        // It does not translate the output; translating non-English audio to English
        // goes through OpenAI's separate translations endpoint.
        public String transcribeEnglishAudio(Resource audioFile) {
            OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                    .language("en") // input language hint, improves accuracy and latency
                    .build();
            AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
            return transcriptionModel.call(prompt).getResult().getOutput();
        }
    }

Supported Formats: MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM. Maximum file size is 25 MB. For longer audio, split into chunks.
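The chunking advice can be sketched as a planning step: work out time ranges that each stay under the size limit, then cut the audio at those points with an external tool. The sketch below is illustrative only. `AudioChunkPlanner` is a hypothetical class, it assumes a roughly constant bitrate, it reads 25 MB as 25 × 1024 × 1024 bytes, and the 5% headroom for container overhead is an assumed safety margin. It plans the cuts but does not split the file itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner: computes time ranges whose estimated size stays under the upload limit. */
final class AudioChunkPlanner {

    // Assumption: the 25 MB limit is interpreted as 25 MiB.
    static final long MAX_BYTES = 25L * 1024 * 1024;

    // Assumption: keep 5% headroom for container and header overhead.
    static final double HEADROOM = 0.95;

    record Chunk(double startSeconds, double endSeconds) {}

    private AudioChunkPlanner() {}

    /** Plans consecutive chunks for audio of the given duration and (constant) bitrate. */
    static List<Chunk> plan(double durationSeconds, int bitrateKbps) {
        if (durationSeconds <= 0 || bitrateKbps <= 0) {
            throw new IllegalArgumentException("Duration and bitrate must be positive");
        }
        double bytesPerSecond = bitrateKbps * 1000.0 / 8.0;
        double maxChunkSeconds = Math.floor(MAX_BYTES * HEADROOM / bytesPerSecond);
        if (maxChunkSeconds < 1) {
            throw new IllegalArgumentException("Bitrate too high to fit one second per chunk");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (double start = 0; start < durationSeconds; start += maxChunkSeconds) {
            chunks.add(new Chunk(start, Math.min(start + maxChunkSeconds, durationSeconds)));
        }
        return chunks;
    }
}
```

Under these assumptions, `AudioChunkPlanner.plan(3600, 128)` plans three chunks for a one-hour recording at 128 kbps. Cutting at silence rather than at fixed offsets would likely give cleaner transcripts at the chunk boundaries.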
Text-to-Speech
Generate Natural Speech
Convert text to lifelike audio with multiple voice options
    @Service
    public class SpeechService {

        private final OpenAiAudioSpeechModel speechModel;

        public SpeechService(OpenAiAudioSpeechModel speechModel) {
            this.speechModel = speechModel;
        }

        public byte[] generateSpeech(String text, String voice) {
            OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
                    .voice(OpenAiAudioApi.SpeechRequest.Voice.valueOf(voice.toUpperCase()))
                    .speed(1.0f)
                    .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
                    .build();
            SpeechPrompt prompt = new SpeechPrompt(text, options);
            SpeechResponse response = speechModel.call(prompt);
            return response.getResult().getOutput();
        }

        // Stream audio for real-time playback
        public Flux<byte[]> streamSpeech(String text) {
            SpeechPrompt prompt = new SpeechPrompt(text);
            return speechModel.stream(prompt)
                    .map(response -> response.getResult().getOutput());
        }
    }

    // Available voices: alloy, echo, fable, onyx, nova, shimmer

Image Generation (DALL-E 3)
Generate Images from Text
Create unique images using natural language descriptions
    @Service
    public class ImageGenerationService {

        private final OpenAiImageModel imageModel;

        public ImageGenerationService(OpenAiImageModel imageModel) {
            this.imageModel = imageModel;
        }

        public String generateImage(String prompt) {
            ImagePrompt imagePrompt = new ImagePrompt(
                    prompt,
                    OpenAiImageOptions.builder()
                            .model("dall-e-3")
                            .quality("hd")
                            .n(1)
                            .height(1024)
                            .width(1024)
                            .responseFormat("url") // or "b64_json"
                            .style("vivid") // or "natural"
                            .build());
            ImageResponse response = imageModel.call(imagePrompt);
            return response.getResult().getOutput().getUrl();
        }

        // Generate product mockups
        public String generateProductImage(String productDescription) {
            String enhancedPrompt = """
                    Professional product photography of %s,
                    white background, studio lighting,
                    high resolution, commercial quality
                    """.formatted(productDescription);
            return generateImage(enhancedPrompt);
        }
    }

Multimodal Provider Comparison
| Provider | Vision | Transcription | TTS | Image Gen |
|---|---|---|---|---|
| OpenAI | GPT-4o | Whisper | TTS-1-HD | DALL-E 3 |
| Anthropic | Claude 3 | No | No | No |
| Google | Gemini Pro | Cloud Speech | Cloud TTS | Imagen 3 |
| Ollama (Local) | LLaVA | No | No | No |
Real-World Applications
E-Commerce
- Auto-generate product descriptions from photos
- Extract data from supplier invoices
- Quality control with visual inspection
- Generate product mockups for A/B testing
Customer Service
- Transcribe support calls for analysis
- Voice-enabled chatbots
- Generate audio responses for accessibility
- Analyze customer-submitted photos
Document Processing
- OCR with context understanding
- Extract data from handwritten forms
- Analyze charts and diagrams
- ID verification from photos
Content Creation
- Generate blog post illustrations
- Create social media visuals
- Produce podcast narrations
- Design marketing materials
Best Practices
✓ Do
- Compress images before sending (reduces cost/latency)
- Use appropriate resolution (1024px usually enough)
- Cache generated content when possible
- Handle rate limits gracefully
- Validate file types and sizes before API calls
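One common way to handle rate limits gracefully is to retry with exponential backoff. The sketch below is a generic, JDK-only illustration. `Retry.withBackoff` is a hypothetical helper, and for brevity it retries on any `RuntimeException`; real code would more likely retry only on rate-limit or transient errors. Spring AI has its own retry support as well, which may already cover this case.

```java
import java.util.function.Supplier;

/** Hypothetical helper: retries a call, doubling the wait after each failure. */
final class Retry {

    private Retry() {}

    static <T> T withBackoff(Supplier<T> call, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                // Simplification: every RuntimeException is treated as retryable.
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                delay *= 2; // exponential backoff
            }
        }
        throw last; // all attempts failed; surface the final error
    }
}
```

A call such as `Retry.withBackoff(() -> visionService.analyzeImageUrl(url, question), 4, 500)` would wait 500 ms, 1 s, and 2 s between its four attempts. Adding random jitter to each delay is a frequent refinement when many clients retry at once.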
✗ Avoid
- Sending uncompressed 4K images
- Regenerating the same images repeatedly
- Ignoring content moderation filters
- Storing base64 in databases (use URLs)
- Synchronous TTS for long text (stream instead)
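The caching advice in both lists can be sketched as a small in-memory cache keyed by a hash of the prompt, so that an identical prompt does not trigger a second generation. `GeneratedContentCache` is a hypothetical class, and the generator function stands in for a call such as `generateImage`. Treating prompts that differ only in surrounding whitespace as identical is an assumption made for this example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical in-memory cache: maps a prompt hash to the generated content's URL. */
final class GeneratedContentCache {

    private final Map<String, String> urlsByKey = new ConcurrentHashMap<>();

    /** Returns the cached URL for this prompt, invoking the generator only on a miss. */
    String getOrGenerate(String prompt, Function<String, String> generator) {
        return urlsByKey.computeIfAbsent(keyFor(prompt), key -> generator.apply(prompt));
    }

    /** SHA-256 of the trimmed prompt, as a 64-character hex string. */
    static String keyFor(String prompt) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(prompt.strip().getBytes(StandardCharsets.UTF_8));
            return String.format("%064x", new BigInteger(1, hash));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

This stores URLs rather than base64 payloads, in line with the list above. Provider-hosted URLs may expire, so a production cache would probably copy the image into its own storage and cache that location, and it would also want an eviction policy.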