Multimodal AI

    Go beyond text: process images, generate visuals, transcribe audio, and synthesize speech. Multimodal AI enables capabilities that text-only models cannot provide.

    Multimodal AI refers to models that understand and generate multiple types of data: text, images, audio, and video. Instead of describing a product in words, you can show the model a photo and ask "What is this and how much should I charge?" Instead of typing questions, users can speak naturally. Instead of describing an image, the AI can generate it.

    Spring AI provides abstractions for vision models (GPT-4V, Claude 3, Gemini), audio transcription (Whisper), speech synthesis (TTS), and image generation (DALL-E 3). Each modality has its own model interface, but all of them follow the same pattern: build a prompt, call the model, read the response.

    Multimodal Capabilities

    Vision

    Analyze images, extract text (OCR), describe content, compare visuals

    Transcription

    Convert speech to text with timestamps; speaker diarization depends on the provider

    Text-to-Speech

    Generate natural-sounding speech from text in multiple voices

    Image Generation

    Create images from text prompts, edit existing images, generate variations

    Vision: Image Understanding

    Analyze Images with GPT-4o

    Send images via URL or base64-encoded data

    VisionService.java
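    // Imports and constructors assume Spring AI 1.0.x; earlier milestones
    // build UserMessage and Media differently.
    import java.net.URI;
    import java.util.List;

    import org.springframework.ai.chat.client.ChatClient;
    import org.springframework.ai.chat.messages.UserMessage;
    import org.springframework.ai.content.Media;
    import org.springframework.ai.openai.OpenAiChatOptions;
    import org.springframework.core.io.ByteArrayResource;
    import org.springframework.stereotype.Service;
    import org.springframework.util.MimeTypeUtils;

    @Service
    public class VisionService {

        private final ChatClient chatClient;

        public VisionService(ChatClient.Builder builder) {
            this.chatClient = builder
                    .defaultOptions(OpenAiChatOptions.builder()
                            .model("gpt-4o") // vision-capable model
                            .build())
                    .build();
        }

        // Analyze image from URL.
        // The MIME type is hard-coded to PNG; pass the real type for JPEG, WebP, etc.
        public String analyzeImageUrl(String imageUrl, String question) {
            return ask(question, List.of(new Media(MimeTypeUtils.IMAGE_PNG, URI.create(imageUrl))));
        }

        // Analyze local image; the bytes are base64-encoded into the request for you
        public String analyzeLocalImage(byte[] imageData, String question) {
            return ask(question,
                    List.of(new Media(MimeTypeUtils.IMAGE_PNG, new ByteArrayResource(imageData))));
        }

        // Compare multiple images
        public String compareImages(List<String> imageUrls) {
            List<Media> media = imageUrls.stream()
                    .map(url -> new Media(MimeTypeUtils.IMAGE_PNG, URI.create(url)))
                    .toList();
            return ask("Compare these images and describe the key differences", media);
        }

        // Shared path: one user message carrying the text plus all attached media
        private String ask(String question, List<Media> media) {
            UserMessage message = UserMessage.builder().text(question).media(media).build();
            return chatClient.prompt().messages(message).call().content();
        }
    }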
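    The service above labels every image as PNG. When images arrive as raw bytes from uploads, the real format can be read from the first few bytes of the file. The following is a minimal sketch in plain Java, not part of Spring AI: the class name ImagePayloads and its methods are hypothetical, and it recognizes only PNG, JPEG, GIF, and WebP signatures.

```java
import java.util.Base64;

// Hypothetical helper (not a Spring AI class): detects common image formats
// from their magic bytes and builds a base64 data URL with the matching type.
final class ImagePayloads {

    private ImagePayloads() {}

    /** Returns the MIME type implied by the leading bytes, or null if unrecognized. */
    static String detectMimeType(byte[] data) {
        if (matches(data, 0, 0x89, 0x50, 0x4E, 0x47)) return "image/png";  // \x89 P N G
        if (matches(data, 0, 0xFF, 0xD8, 0xFF)) return "image/jpeg";
        if (matches(data, 0, 0x47, 0x49, 0x46, 0x38)) return "image/gif";  // G I F 8
        // WebP files start with "RIFF", then a 4-byte size, then "WEBP"
        if (matches(data, 0, 0x52, 0x49, 0x46, 0x46)
                && matches(data, 8, 0x57, 0x45, 0x42, 0x50)) return "image/webp";
        return null;
    }

    /** Builds a data URL, rejecting bytes whose format could not be identified. */
    static String toDataUrl(byte[] data) {
        String mimeType = detectMimeType(data);
        if (mimeType == null) {
            throw new IllegalArgumentException("Unrecognized image format");
        }
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(data);
    }

    private static boolean matches(byte[] data, int offset, int... expected) {
        if (data.length < offset + expected.length) return false;
        for (int i = 0; i < expected.length; i++) {
            if ((data[offset + i] & 0xFF) != expected[i]) return false;
        }
        return true;
    }
}
```

    The detected string can be turned into a Spring MimeType with MimeTypeUtils.parseMimeType before constructing the Media object, so JPEG or WebP uploads are not mislabeled.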

    Structured Data Extraction from Images

    Structured Extraction
    // Extract structured data from receipts, invoices, etc.
    // LineItem is assumed to be a record defined elsewhere in the project.
    public record InvoiceData(
            String invoiceNumber,
            String vendorName,
            LocalDate date,
            List<LineItem> items,
            double total) {}

    public InvoiceData extractInvoiceData(String imageUrl) {
        UserMessage message = UserMessage.builder()
                .text("Extract all invoice data from this image. Return as JSON.")
                .media(List.of(new Media(MimeTypeUtils.IMAGE_PNG, URI.create(imageUrl))))
                .build();

        // entity(...) maps the model's JSON reply onto the record
        return chatClient.prompt()
                .messages(message)
                .call()
                .entity(InvoiceData.class);
    }

    Audio Transcription (Whisper)

    Speech-to-Text

    Transcribe audio files with OpenAI's Whisper model

    TranscriptionService.java
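    // Imports assume Spring AI 1.0.x package names.
    import org.springframework.ai.audio.transcription.AudioTranscriptionPrompt;
    import org.springframework.ai.audio.transcription.AudioTranscriptionResponse;
    import org.springframework.ai.openai.OpenAiAudioTranscriptionModel;
    import org.springframework.ai.openai.OpenAiAudioTranscriptionOptions;
    import org.springframework.ai.openai.api.OpenAiAudioApi;
    import org.springframework.core.io.Resource;
    import org.springframework.stereotype.Service;

    @Service
    public class TranscriptionService {

        private final OpenAiAudioTranscriptionModel transcriptionModel;

        public TranscriptionService(OpenAiAudioTranscriptionModel transcriptionModel) {
            this.transcriptionModel = transcriptionModel;
        }

        // Basic transcription
        public String transcribe(Resource audioFile) {
            AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile);
            AudioTranscriptionResponse response = transcriptionModel.call(prompt);
            return response.getResult().getOutput();
        }

        // With timestamps (for subtitles): returns WebVTT text
        public String transcribeWithTimestamps(Resource audioFile) {
            OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                    .responseFormat(OpenAiAudioApi.TranscriptResponseFormat.VTT)
                    .build();
            AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
            return transcriptionModel.call(prompt).getResult().getOutput();
        }

        // Translate non-English to English.
        // Caution: `language` declares the language of the input audio (ISO-639-1 code).
        // It is not a documented translation switch, so English output is not guaranteed;
        // OpenAI exposes a separate audio translations endpoint for reliable translation.
        public String translateToEnglish(Resource audioFile) {
            OpenAiAudioTranscriptionOptions options = OpenAiAudioTranscriptionOptions.builder()
                    .language("en")
                    .build();
            AudioTranscriptionPrompt prompt = new AudioTranscriptionPrompt(audioFile, options);
            return transcriptionModel.call(prompt).getResult().getOutput();
        }
    }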

    Supported Formats: MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM. Maximum file size is 25 MB. For larger files, split the audio into chunks and transcribe each chunk separately.
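    The format and size limits above can be checked locally before any request is sent. The following is a minimal sketch in plain Java; the class AudioUploadCheck and its method names are hypothetical, the extension list is copied from the text above, and it assumes "25 MB" means 25 × 1024 × 1024 bytes.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-flight check (not a Spring AI class) for audio uploads.
final class AudioUploadCheck {

    // Assumption: the 25 MB limit is interpreted as 25 * 1024 * 1024 bytes.
    static final long MAX_BYTES = 25L * 1024 * 1024;

    // Extensions taken from the supported-formats list in the text.
    static final Set<String> SUPPORTED =
            Set.of("mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm");

    private AudioUploadCheck() {}

    /** True if the file name ends in one of the supported extensions (case-insensitive). */
    static boolean hasSupportedExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) return false;
        return SUPPORTED.contains(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /** True if a file of this size can be sent in a single request. */
    static boolean fitsInOneRequest(long sizeBytes) {
        return sizeBytes > 0 && sizeBytes <= MAX_BYTES;
    }

    /** Minimum number of chunks needed so that each chunk stays within the limit. */
    static int chunksNeeded(long sizeBytes) {
        if (sizeBytes <= 0) throw new IllegalArgumentException("sizeBytes must be positive");
        return (int) ((sizeBytes + MAX_BYTES - 1) / MAX_BYTES); // ceiling division
    }
}
```

    chunksNeeded only gives a lower bound. The actual split has to be done by an audio tool that cuts on valid frame boundaries, ideally at pauses in speech so words are not cut in half.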

    Text-to-Speech

    Generate Natural Speech

    Convert text to lifelike audio with multiple voice options

    SpeechService.java
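    // Imports assume Spring AI 1.0.x package names.
    import org.springframework.ai.openai.OpenAiAudioSpeechModel;
    import org.springframework.ai.openai.OpenAiAudioSpeechOptions;
    import org.springframework.ai.openai.api.OpenAiAudioApi;
    import org.springframework.ai.openai.audio.speech.SpeechPrompt;
    import org.springframework.ai.openai.audio.speech.SpeechResponse;
    import org.springframework.stereotype.Service;
    import reactor.core.publisher.Flux;

    @Service
    public class SpeechService {

        private final OpenAiAudioSpeechModel speechModel;

        public SpeechService(OpenAiAudioSpeechModel speechModel) {
            this.speechModel = speechModel;
        }

        // Available voices: alloy, echo, fable, onyx, nova, shimmer
        // (valueOf throws IllegalArgumentException for any other name)
        public byte[] generateSpeech(String text, String voice) {
            OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
                    .voice(OpenAiAudioApi.SpeechRequest.Voice.valueOf(voice.toUpperCase()))
                    .speed(1.0f)
                    .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
                    .build();
            SpeechPrompt prompt = new SpeechPrompt(text, options);
            SpeechResponse response = speechModel.call(prompt);
            return response.getResult().getOutput();
        }

        // Stream audio for real-time playback
        public Flux<byte[]> streamSpeech(String text) {
            SpeechPrompt prompt = new SpeechPrompt(text);
            return speechModel.stream(prompt)
                    .map(response -> response.getResult().getOutput());
        }
    }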
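    Speech endpoints usually cap the input length per request (OpenAI documents a limit of 4096 characters for its speech API; treat that figure as something to verify for your provider and model). Long text therefore has to be split before it reaches generateSpeech or streamSpeech. The following is a minimal sketch in plain Java that splits on sentence endings; SpeechChunker is a hypothetical helper, and the regex-based sentence detection is deliberately simple.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Spring AI class): packs whole sentences into
// chunks of at most maxChars characters so each chunk fits one TTS request.
final class SpeechChunker {

    private SpeechChunker() {}

    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) throw new IllegalArgumentException("maxChars must be positive");
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        // Split after '.', '!' or '?' followed by whitespace (a simple heuristic).
        for (String sentence : text.strip().split("(?<=[.!?])\\s+")) {
            // A single sentence longer than the limit is cut at the limit.
            while (sentence.length() > maxChars) {
                flush(current, chunks);
                chunks.add(sentence.substring(0, maxChars));
                sentence = sentence.substring(maxChars);
            }
            if (sentence.isEmpty()) continue;

            // Length if this sentence were appended, including the joining space.
            int needed = current.length() == 0
                    ? sentence.length()
                    : current.length() + 1 + sentence.length();
            if (needed > maxChars) flush(current, chunks);
            if (current.length() > 0) current.append(' ');
            current.append(sentence);
        }
        flush(current, chunks);
        return chunks;
    }

    private static void flush(StringBuilder current, List<String> chunks) {
        if (current.length() > 0) {
            chunks.add(current.toString());
            current.setLength(0);
        }
    }
}
```

    Each chunk can then be synthesized in order and the audio segments played or concatenated sequentially. Splitting on sentence boundaries keeps the pauses between segments sounding natural.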

    Image Generation (DALL-E 3)

    Generate Images from Text

    Create unique images using natural language descriptions

    ImageGenerationService.java
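    // Imports assume Spring AI 1.0.x package names.
    import org.springframework.ai.image.ImagePrompt;
    import org.springframework.ai.image.ImageResponse;
    import org.springframework.ai.openai.OpenAiImageModel;
    import org.springframework.ai.openai.OpenAiImageOptions;
    import org.springframework.stereotype.Service;

    @Service
    public class ImageGenerationService {

        private final OpenAiImageModel imageModel;

        public ImageGenerationService(OpenAiImageModel imageModel) {
            this.imageModel = imageModel;
        }

        public String generateImage(String prompt) {
            ImagePrompt imagePrompt = new ImagePrompt(
                    prompt,
                    OpenAiImageOptions.builder()
                            .model("dall-e-3")
                            .quality("hd")
                            .N(1) // builder method is N(...) in 1.0.x; DALL-E 3 returns one image per request
                            .height(1024)
                            .width(1024)
                            .responseFormat("url") // or "b64_json"
                            .style("vivid") // or "natural"
                            .build());
            ImageResponse response = imageModel.call(imagePrompt);
            return response.getResult().getOutput().getUrl();
        }

        // Generate product mockups
        public String generateProductImage(String productDescription) {
            String enhancedPrompt = """
                    Professional product photography of %s,
                    white background, studio lighting,
                    high resolution, commercial quality
                    """.formatted(productDescription);
            return generateImage(enhancedPrompt);
        }
    }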

    Multimodal Provider Comparison

    Provider          Vision        Transcription   TTS          Image Gen
    OpenAI            GPT-4o        Whisper         TTS-1-HD     DALL-E 3
    Anthropic         Claude 3      No              No           No
    Google            Gemini Pro    Cloud Speech    Cloud TTS    Imagen 3
    Ollama (Local)    LLaVA         No              No           No

    Real-World Applications

    E-Commerce

    • Auto-generate product descriptions from photos
    • Extract data from supplier invoices
    • Quality control with visual inspection
    • Generate product mockups for A/B testing

    Customer Service

    • Transcribe support calls for analysis
    • Voice-enabled chatbots
    • Generate audio responses for accessibility
    • Analyze customer-submitted photos

    Document Processing

    • OCR with context understanding
    • Extract data from handwritten forms
    • Analyze charts and diagrams
    • ID verification from photos

    Content Creation

    • Generate blog post illustrations
    • Create social media visuals
    • Produce podcast narrations
    • Design marketing materials

    Best Practices

    ✓ Do

    • Compress images before sending (reduces cost/latency)
    • Use appropriate resolution (1024px usually enough)
    • Cache generated content when possible
    • Handle rate limits gracefully
    • Validate file types and sizes before API calls
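    The first two points can be handled with the JDK alone. The following is a minimal sketch that scales an image so its longest side is at most a given size and then encodes it as JPEG; ImageDownscaler is a hypothetical helper, and 1024 is simply the figure suggested in the list above.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper (not a Spring AI class): shrinks images before upload.
final class ImageDownscaler {

    private ImageDownscaler() {}

    /** Returns the source unchanged if it already fits, otherwise a scaled RGB copy. */
    static BufferedImage fitWithin(BufferedImage source, int maxSide) {
        int width = source.getWidth();
        int height = source.getHeight();
        int longest = Math.max(width, height);
        if (longest <= maxSide) return source;

        double scale = (double) maxSide / longest;
        int newWidth = Math.max(1, (int) Math.round(width * scale));
        int newHeight = Math.max(1, (int) Math.round(height * scale));

        // TYPE_INT_RGB has no alpha channel, so transparent pixels become black.
        BufferedImage scaled = new BufferedImage(newWidth, newHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, newWidth, newHeight, null);
        } finally {
            g.dispose();
        }
        return scaled;
    }

    /** Encodes the image as JPEG with the default writer settings. */
    static byte[] toJpeg(BufferedImage image) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // write(...) returns false when no writer accepts the image,
            // which can happen for images that carry an alpha channel.
            if (!ImageIO.write(image, "jpg", out)) {
                throw new IllegalStateException("No JPEG writer accepted this image type");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

    The resulting bytes can be wrapped in a ByteArrayResource and attached with the image/jpeg MIME type. For text-heavy images such as invoices, check that the reduced resolution still leaves small print legible before adopting a fixed limit.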

    ✗ Avoid

    • Sending uncompressed 4K images
    • Regenerating the same images repeatedly
    • Ignoring content moderation filters
    • Storing base64 in databases (use URLs)
    • Synchronous TTS for long text (stream instead)
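    Avoiding repeated generation comes down to keying results by the request that produced them. The following is a minimal in-memory sketch in plain Java; GenerationCache is a hypothetical class, and the generator function stands in for whatever call produces the image and returns its location.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical in-memory cache (not a Spring AI class): one generation per distinct prompt.
final class GenerationCache {

    private final Map<String, String> resultsByKey = new ConcurrentHashMap<>();
    private final Function<String, String> generator;

    /** generator: takes a prompt and returns the location of the generated image. */
    GenerationCache(Function<String, String> generator) {
        this.generator = generator;
    }

    /** Returns the cached result for an equivalent prompt, generating it only once. */
    String getOrGenerate(String prompt) {
        // Note: computeIfAbsent blocks other writers to the same map bin while the
        // generator runs, which is acceptable for a sketch but worth revisiting under load.
        return resultsByKey.computeIfAbsent(keyFor(prompt), key -> generator.apply(prompt));
    }

    /** SHA-256 of the prompt after trimming and collapsing runs of whitespace. */
    static String keyFor(String prompt) {
        String normalized = prompt.strip().replaceAll("\\s+", " ");
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(normalized.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

    In a service like the one shown earlier, the generator could be a method reference such as imageGenerationService::generateImage. If the provider returns short-lived URLs, cache the location of your own stored copy instead of the provider's URL, and include the model and size options in the key if they vary between calls.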

    See Beyond Text

    Multimodal AI opens entirely new application categories. Start with vision for document processing, then add speech for accessibility.