Multimodal AI

Go beyond text: process images, generate visuals, transcribe audio, and synthesize speech. Multimodal AI enables capabilities that text-only models cannot provide.

Multimodal AI refers to models that understand and generate multiple types of data: text, images, audio, and video. Instead of describing a product in words, you can show the model a photo and ask "What is this and how much should I charge?" Instead of typing questions, users can speak naturally. Instead of describing an image, the AI can generate it.

Spring AI provides unified abstractions for vision models (GPT-4V, Claude 3, Gemini), audio transcription (Whisper), speech synthesis (TTS), and image generation (DALL-E 3). Each modality has its own model interface, but they all follow the same pattern: build a prompt, call the model, read the response.

Multimodal Capabilities

Vision

Analyze images, extract text (OCR), describe content, compare visuals

Transcription

Convert speech to text with timestamps (speaker diarization depends on the provider; the Whisper API does not offer it)

Text-to-Speech

Generate natural-sounding speech from text in multiple voices

Image Generation

Create images from text prompts, edit existing images, generate variations

Vision: Image Understanding

Analyze Images with GPT-4 Vision

Send images via URL or base64-encoded data

VisionService.java

    @Service
    public class VisionService {

        private final ChatClient chatClient;

        public VisionService(ChatClient.Builder builder) {
            this.chatClient = builder
                .defaultOptions(OpenAiChatOptions.builder()
                    .model("gpt-4o") // GPT-4o has best vision
                    .build())
                .build();
        }

        // Note: the argument types accepted by the Media constructor have changed
        // between Spring AI releases; some versions require a URL, URI, or Resource
        // instead of a String. Check the version you are using.

        // Analyze image from URL
        public String analyzeImageUrl(String imageUrl, String question) {
            UserMessage message = new UserMessage(
                question,
                List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageUrl)));
            return chatClient.prompt().messages(message).call().content();
        }

        // Analyze local image (base64)
        public String analyzeLocalImage(byte[] imageData, String question) {
            String base64 = Base64.getEncoder().encodeToString(imageData);
            String dataUrl = "data:image/png;base64," + base64;
            UserMessage message = new UserMessage(
                question,
                List.of(new Media(MimeTypeUtils.IMAGE_PNG, dataUrl)));
            return chatClient.prompt().messages(message).call().content();
        }

        // Compare multiple images
        public String compareImages(List<String> imageUrls) {
            List<Media> media = imageUrls.stream()
                .map(url -> new Media(MimeTypeUtils.IMAGE_PNG, url))
                .toList();
            UserMessage message = new UserMessage(
                "Compare these images and describe the key differences",
                media);
            return chatClient.prompt().messages(message).call().content();
        }
    }
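
The local-image method above always labels the payload as image/png. As a minimal sketch using only the JDK, the helper below inspects the file signature to pick a MIME type before building the base64 data URL. The class and method names (`ImageDataUrls`, `detectMimeType`, `toDataUrl`) are hypothetical, and only PNG, JPEG, and GIF are recognized here.

```java
import java.util.Base64;

final class ImageDataUrls {

    private ImageDataUrls() {}

    /** Guesses the MIME type from the leading bytes; only PNG, JPEG, and GIF are handled. */
    static String detectMimeType(byte[] data) {
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47)) return "image/png";
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) return "image/jpeg";
        if (startsWith(data, 0x47, 0x49, 0x46, 0x38)) return "image/gif";
        throw new IllegalArgumentException("Unrecognized image format");
    }

    /** Builds a data URL of the form data:<mime>;base64,<payload>. */
    static String toDataUrl(byte[] data) {
        return "data:" + detectMimeType(data) + ";base64,"
                + Base64.getEncoder().encodeToString(data);
    }

    private static boolean startsWith(byte[] data, int... signature) {
        if (data == null || data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            // Mask to compare as unsigned bytes.
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }
}
```

Rejecting unknown signatures early also doubles as a cheap file-type check before any API call is made.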

Structured Data Extraction from Images

Structured Extraction

    // Extract structured data from receipts, invoices, etc.
    public record InvoiceData(
        String invoiceNumber,
        String vendorName,
        LocalDate date,
        List<LineItem> items,
        double total
    ) {}

    public InvoiceData extractInvoiceData(String imageUrl) {
        UserMessage message = new UserMessage(
            "Extract all invoice data from this image. Return as JSON.",
            List.of(new Media(MimeTypeUtils.IMAGE_PNG, imageUrl)));
        return chatClient.prompt()
            .messages(message)
            .call()
            .entity(InvoiceData.class);
    }
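
Data extracted from an image can contain misread digits, so it may be worth sanity-checking the result before trusting it. The sketch below checks that the line items add up to the stated total. It mirrors the `InvoiceData` record above, but the `LineItem` fields, the `InvoiceChecks` class, and the tolerance parameter are assumptions for illustration, since the original does not define them.

```java
import java.time.LocalDate;
import java.util.List;

// Hypothetical shape: the original does not show LineItem's fields.
record LineItem(String description, int quantity, double unitPrice) {}

// Same components as the InvoiceData record above.
record InvoiceData(String invoiceNumber, String vendorName, LocalDate date,
                   List<LineItem> items, double total) {}

final class InvoiceChecks {

    private InvoiceChecks() {}

    /** Sums quantity * unitPrice over all line items. */
    static double lineItemSum(InvoiceData invoice) {
        return invoice.items().stream()
                .mapToDouble(item -> item.quantity() * item.unitPrice())
                .sum();
    }

    /** True when the items add up to the stated total, within the given tolerance. */
    static boolean totalsMatch(InvoiceData invoice, double tolerance) {
        if (invoice.items() == null || invoice.items().isEmpty()) {
            return false; // nothing to cross-check against
        }
        return Math.abs(lineItemSum(invoice) - invoice.total()) <= tolerance;
    }
}
```

A failed check does not prove the extraction is wrong (taxes or discounts may be missing from the items), but it is a reasonable signal to route the invoice to manual review.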

Audio Transcription (Whisper)

Speech-to-Text

Transcribe audio files with OpenAI's Whisper model

TranscriptionService.java

    @Service
    public class TranscriptionService {

        private final OpenAiAudioTranscriptionModel transcriptionModel;

        public TranscriptionService(OpenAiAudioTranscriptionModel transcriptionModel) {
            this.transcriptionModel = transcriptionModel;
        }

        // Basic transcription
        public String transcribe(Resource audioFile) {
            AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile);
            AudioTranscriptionResponse response = transcriptionModel.call(prompt);
            return response.getResult().getOutput();
        }

        // With timestamps (for subtitles)
        public String transcribeWithTimestamps(Resource audioFile) {
            OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.VTT)
                .build();
            AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
            return transcriptionModel.call(prompt).getResult().getOutput();
        }

        // Transcribe with a language hint.
        // Note: `language` declares the language of the input audio (ISO-639-1 code)
        // and improves accuracy; it does not translate. Translating speech into
        // English uses OpenAI's separate audio translation endpoint, not this option.
        public String transcribeWithLanguageHint(Resource audioFile) {
            OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                .language("en") // The input audio is in English
                .build();
            AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
            return transcriptionModel.call(prompt).getResult().getOutput();
        }
    }

Supported Formats: MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM. Maximum file size is 25 MB. Split longer recordings into chunks and transcribe each chunk separately.
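
These limits are easy to check before any network call is made. The following is a minimal sketch, not part of Spring AI: the `AudioUploadValidator` name is hypothetical, the check looks only at the file extension (not the actual content), and it assumes 25 MB means 25 * 1024 * 1024 bytes.

```java
import java.util.Locale;
import java.util.Set;

final class AudioUploadValidator {

    // Formats and size limit as listed above.
    static final Set<String> SUPPORTED_EXTENSIONS =
            Set.of("mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm");

    // Assumption: 25 MB is interpreted as 25 * 1024 * 1024 bytes.
    static final long MAX_BYTES = 25L * 1024 * 1024;

    private AudioUploadValidator() {}

    /** Returns the lower-cased extension, or an empty string when there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Throws IllegalArgumentException when the file should not be sent. */
    static void validate(String fileName, long sizeInBytes) {
        String extension = extensionOf(fileName);
        if (!SUPPORTED_EXTENSIONS.contains(extension)) {
            throw new IllegalArgumentException("Unsupported audio format: '" + extension + "'");
        }
        if (sizeInBytes <= 0) {
            throw new IllegalArgumentException("Audio file is empty");
        }
        if (sizeInBytes > MAX_BYTES) {
            throw new IllegalArgumentException(
                    "Audio file is " + sizeInBytes + " bytes; the limit is " + MAX_BYTES);
        }
    }
}
```

Calling `AudioUploadValidator.validate(name, size)` at the top of an upload handler lets you reject bad input with a clear message instead of waiting for the provider to return an error.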

Text-to-Speech

Generate Natural Speech

Convert text to lifelike audio with multiple voice options

SpeechService.java

    @Service
    public class SpeechService {

        private final OpenAiAudioSpeechModel speechModel;

        public SpeechService(OpenAiAudioSpeechModel speechModel) {
            this.speechModel = speechModel;
        }

        public byte[] generateSpeech(String text, String voice) {
            OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
                .voice(OpenAiAudioApi.SpeechRequest.Voice.valueOf(voice.toUpperCase()))
                .speed(1.0f)
                .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
                .build();
            SpeechPrompt prompt = new SpeechPrompt(text, options);
            SpeechResponse response = speechModel.call(prompt);
            return response.getResult().getOutput();
        }

        // Stream audio for real-time playback
        public Flux<byte[]> streamSpeech(String text) {
            SpeechPrompt prompt = new SpeechPrompt(text);
            return speechModel.stream(prompt)
                .map(response -> response.getResult().getOutput());
        }
    }

    // Available voices: alloy, echo, fable, onyx, nova, shimmer
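
Speech endpoints typically cap the length of a single input, so long text usually has to be split before it is synthesized. The sketch below packs whole sentences into chunks no longer than `maxChars`. The `SpeechTextChunker` class is hypothetical, the sentence split is a simple regex on `.`, `!`, and `?`, and the right value for `maxChars` is an assumption to confirm against your provider's documented limit.

```java
import java.util.ArrayList;
import java.util.List;

final class SpeechTextChunker {

    private SpeechTextChunker() {}

    /** Splits text into chunks of at most maxChars, breaking at sentence ends where possible. */
    static List<String> chunk(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Split after sentence-ending punctuation followed by whitespace.
        for (String sentence : text.trim().split("(?<=[.!?])\\s+")) {
            if (sentence.isEmpty()) {
                continue;
            }
            // A single sentence longer than the limit is cut at the limit.
            while (sentence.length() > maxChars) {
                flush(current, chunks);
                chunks.add(sentence.substring(0, maxChars));
                sentence = sentence.substring(maxChars);
            }
            int needed = current.length() == 0
                    ? sentence.length()
                    : current.length() + 1 + sentence.length();
            if (needed > maxChars) {
                flush(current, chunks);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(sentence);
        }
        flush(current, chunks);
        return chunks;
    }

    private static void flush(StringBuilder current, List<String> chunks) {
        if (current.length() > 0) {
            chunks.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Each chunk can then be passed to `generateSpeech` or `streamSpeech` in order. Breaking at sentence boundaries keeps the pauses between audio segments sounding natural; the hard cut for oversized sentences is a fallback and may split a word.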

Image Generation (DALL-E 3)

Generate Images from Text

Create unique images using natural language descriptions

ImageGenerationService.java

    @Service
    public class ImageGenerationService {

        private final OpenAiImageModel imageModel;

        public ImageGenerationService(OpenAiImageModel imageModel) {
            this.imageModel = imageModel;
        }

        public String generateImage(String prompt) {
            ImagePrompt imagePrompt = new ImagePrompt(
                prompt,
                OpenAiImageOptions.builder()
                    .model("dall-e-3")
                    .quality("hd")
                    .n(1)
                    .height(1024)
                    .width(1024)
                    .responseFormat("url") // or "b64_json"
                    .style("vivid") // or "natural"
                    .build());
            ImageResponse response = imageModel.call(imagePrompt);
            return response.getResult().getOutput().getUrl();
        }

        // Generate product mockups
        public String generateProductImage(String productDescription) {
            String enhancedPrompt = """
                Professional product photography of %s,
                white background, studio lighting,
                high resolution, commercial quality
                """.formatted(productDescription);
            return generateImage(enhancedPrompt);
        }
    }

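Every generation call costs time and money, so identical prompts are good candidates for caching. Here is a minimal in-memory sketch that keys results by a SHA-256 hash of the prompt. The `GeneratedImageCache` class is hypothetical, and the generator function stands in for a method such as `generateImage` above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class GeneratedImageCache {

    private final Map<String, String> resultsByPromptHash = new ConcurrentHashMap<>();
    private final Function<String, String> generator;

    /** @param generator the expensive call, for example a method that returns an image URL */
    GeneratedImageCache(Function<String, String> generator) {
        this.generator = generator;
    }

    /** Returns the cached result for this prompt, generating it only on the first request. */
    String getOrGenerate(String prompt) {
        String normalized = prompt.strip(); // ignore leading/trailing whitespace
        String key = sha256(normalized);
        // computeIfAbsent holds a map lock while the generator runs, which is
        // acceptable for a sketch but can block other callers on slow calls.
        return resultsByPromptHash.computeIfAbsent(key, k -> generator.apply(normalized));
    }

    static String sha256(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

A cache could be wired up as `new GeneratedImageCache(imageGenerationService::generateImage)`. Provider-hosted image URLs may expire, so a production version would likely download the image and cache a link to your own storage rather than the returned URL.
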
Multimodal Provider Comparison

Provider	Vision	Transcription	TTS	Image Gen
OpenAI	GPT-4o	Whisper	TTS-1-HD	DALL-E 3
Anthropic	Claude 3	No	No	No
Google	Gemini Pro	Cloud Speech	Cloud TTS	Imagen 3
Ollama (Local)	LLaVA	No	No	No

Real-World Applications

E-Commerce

• Auto-generate product descriptions from photos
• Extract data from supplier invoices
• Quality control with visual inspection
• Generate product mockups for A/B testing

Customer Service

• Transcribe support calls for analysis
• Voice-enabled chatbots
• Generate audio responses for accessibility
• Analyze customer-submitted photos

Document Processing

• OCR with context understanding
• Extract data from handwritten forms
• Analyze charts and diagrams
• ID verification from photos

Content Creation

• Generate blog post illustrations
• Create social media visuals
• Produce podcast narrations
• Design marketing materials

Best Practices

✓ Do

• Compress images before sending (reduces cost/latency)
• Use appropriate resolution (1024px usually enough)
• Cache generated content when possible
• Handle rate limits gracefully
• Validate file types and sizes before API calls
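
Handling rate limits gracefully usually means retrying with increasing delays instead of failing on the first rejection. The sketch below shows one way to do that in plain Java. `RetryWithBackoff` and `RateLimitedException` are hypothetical names: in a real application you would catch whatever exception your client library raises for a rate-limit response, such as HTTP 429.

```java
import java.time.Duration;
import java.util.function.Supplier;

final class RetryWithBackoff {

    private RetryWithBackoff() {}

    /** Hypothetical marker for a rate-limit rejection from the provider. */
    static class RateLimitedException extends RuntimeException {
        RateLimitedException(String message) {
            super(message);
        }
    }

    /**
     * Runs the action, retrying on RateLimitedException up to maxAttempts times in total.
     * The delay doubles after each failed attempt.
     */
    static <T> T call(Supplier<T> action, int maxAttempts, Duration initialDelay) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Duration delay = initialDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RateLimitedException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the original error
                }
                sleep(delay);
                delay = delay.multipliedBy(2);
            }
        }
    }

    private static void sleep(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }
}
```

Only the rate-limit exception is retried; other failures propagate immediately. Adding random jitter to the delay is a common refinement when many clients retry at once.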

✗ Avoid

• Sending uncompressed 4K images
• Regenerating the same images repeatedly
• Ignoring content moderation filters
• Storing base64 in databases (use URLs)
• Synchronous TTS for long text (stream instead)
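
To avoid sending oversized images, you can shrink them so the longest side fits a target size before encoding. The following sketch uses only the JDK's `java.awt` classes. The `ImageDownscaler` name is hypothetical, and the 1024-pixel target used in the usage note comes from the guideline above rather than from any provider requirement.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class ImageDownscaler {

    private ImageDownscaler() {}

    /**
     * Returns an image whose longest side is at most maxSide, preserving aspect ratio.
     * Images that already fit are returned unchanged.
     */
    static BufferedImage fitWithin(BufferedImage source, int maxSide) {
        if (maxSide <= 0) {
            throw new IllegalArgumentException("maxSide must be positive");
        }
        int width = source.getWidth();
        int height = source.getHeight();
        int longest = Math.max(width, height);
        if (longest <= maxSide) {
            return source;
        }
        double scale = (double) maxSide / longest;
        int newWidth = Math.max(1, (int) Math.round(width * scale));
        int newHeight = Math.max(1, (int) Math.round(height * scale));

        // TYPE_INT_RGB drops transparency; transparent areas will render as black.
        BufferedImage scaled = new BufferedImage(newWidth, newHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D graphics = scaled.createGraphics();
        try {
            graphics.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            graphics.drawImage(source, 0, 0, newWidth, newHeight, null);
        } finally {
            graphics.dispose();
        }
        return scaled;
    }
}
```

For example, `ImageDownscaler.fitWithin(image, 1024)` turns a 4096 x 2048 photo into a 1024 x 512 one. Encoding the result as JPEG (for instance with `javax.imageio.ImageIO.write`) would reduce the payload further.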

See Beyond Text

Multimodal AI opens entirely new application categories. Start with vision for document processing, then add speech for accessibility.