Ollama & Local AI Models

    Run powerful large language models directly on your machine. Complete privacy, zero API costs, and lightning-fast inference—all without internet connectivity.

    Ollama is an open-source tool that makes running LLMs locally as simple as running Docker containers. It handles model downloads, quantization, memory management, and provides an OpenAI-compatible API—all with a single command. Spring AI integrates seamlessly with Ollama, allowing you to use the same ChatClient interface whether you're calling GPT-4 in the cloud or Llama 3 on your laptop.

    Why Run Models Locally?

    Complete Privacy

    Your data never leaves your infrastructure. Process sensitive documents, proprietary code, and customer information without any third-party exposure. Essential for healthcare, legal, and financial applications where data sovereignty matters.

    Low Latency

    No network round-trips means responses start almost immediately: small models can begin streaming tokens in well under 100ms on modern hardware, and latency stays consistent because you are not sharing capacity with a cloud API. Perfect for real-time applications, IDE integrations, and interactive experiences where responsiveness matters.

    Offline Capable

    Works without internet connectivity. Deploy to air-gapped environments, edge devices, or remote locations. Build applications that function reliably regardless of network conditions.

    Hardware Requirements

    Local AI requires RAM, not necessarily a GPU. Modern quantization techniques compress models to fit in system memory while preserving quality. Here's what you need for different model sizes:

    • 7B models (Llama 3.1 8B, Mistral 7B, Gemma 7B): about 8GB RAM; runs on most laptops.

    • 13-14B models (Llama 2 13B, Phi-3 Medium): about 16GB RAM; comfortable on pro laptops.

    • 70B-class models (Llama 3.3 70B, Mixtral 8x7B): about 48GB RAM; workstation territory.

    Apple Silicon Advantage: M1/M2/M3 Macs use unified memory, allowing the GPU to access all system RAM. A 64GB MacBook Pro can run 70B models that would require an expensive NVIDIA GPU on Windows/Linux.

    Installing Ollama

    Quick Start

    Get up and running in under 2 minutes

    Step 1: Install Ollama

    Installation
    # macOS / Linux (one command)
    curl -fsSL https://ollama.com/install.sh | sh

    # Windows: download the installer from https://ollama.com/download
    # Or use winget:
    winget install Ollama.Ollama

    Step 2: Download a Model

    Pull Models
    # Best all-around model (4.7GB download)
    ollama pull llama3.1:8b
    # Code-specialized model
    ollama pull codellama:13b
    # Lightweight and fast
    ollama pull phi3:mini

    Step 3: Test It

    Test Run
    # Interactive chat mode
    ollama run llama3.1:8b
    # Or via the API (OpenAI-compatible!)
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello!"}]}'
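
    The same endpoint can be exercised from plain Java, which is handy as a quick smoke test before wiring up Spring. A minimal sketch using the JDK's built-in HttpClient; the class name OllamaSmokeTest is illustrative and assumes llama3.1:8b has already been pulled:

    OllamaSmokeTest.java
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class OllamaSmokeTest {

        public static void main(String[] args) throws Exception {
            // Same JSON payload as the curl example above
            String body = """
                    {"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello!"}]}""";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:11434/v1/chat/completions"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // Prints the raw JSON; the reply text is at choices[0].message.content
            System.out.println(response.body());
        }
    }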

    Spring AI Integration

    Maven Configuration

    pom.xml
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
    </dependency>

    <!-- Spring AI BOM -->
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.springframework.ai</groupId>
                <artifactId>spring-ai-bom</artifactId>
                <version>1.0.0-M4</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    Application Properties

    application.properties
    # Ollama connection (default port)
    spring.ai.ollama.base-url=http://localhost:11434

    # Default model for chat
    spring.ai.ollama.chat.options.model=llama3.1:8b
    spring.ai.ollama.chat.options.temperature=0.7

    # Embedding model for RAG
    spring.ai.ollama.embedding.options.model=nomic-embed-text
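
    The last property points Spring AI's auto-configured EmbeddingModel at nomic-embed-text, which is what a RAG pipeline uses to vectorize text before storing it in a vector store. A minimal sketch of using it, assuming the model has been pulled with ollama pull nomic-embed-text; the service name is illustrative:

    LocalEmbeddingService.java
    import org.springframework.ai.embedding.EmbeddingModel;
    import org.springframework.stereotype.Service;

    @Service
    public class LocalEmbeddingService {

        private final EmbeddingModel embeddingModel;

        public LocalEmbeddingService(EmbeddingModel embeddingModel) {
            this.embeddingModel = embeddingModel;
        }

        // Vector for one chunk of text, e.g. before writing it to a vector store
        public float[] embed(String text) {
            return embeddingModel.embed(text);
        }
    }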

    Service Implementation

    Identical API to OpenAI—switch models with zero code changes

    LocalAIService.java
    @Service
    public class LocalAIService {

        private final ChatClient chatClient;

        public LocalAIService(ChatClient.Builder builder) {
            this.chatClient = builder
                    .defaultSystem("You are a helpful coding assistant.")
                    .build();
        }

        public String chat(String message) {
            return chatClient.prompt()
                    .user(message)
                    .call()
                    .content();
        }

        // Switch models per request
        public String generateCode(String specification) {
            return chatClient.prompt()
                    .user("Write Java code: " + specification)
                    .options(OllamaChatOptions.builder()
                            .model("codellama:13b")
                            .temperature(0.2f)
                            .build())
                    .call()
                    .content();
        }
    }
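
    Exposing the service over HTTP needs nothing Ollama-specific. A minimal controller sketch that delegates to LocalAIService; the path and class name are illustrative:

    LocalAIController.java
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    @RequestMapping("/api/ai")
    public class LocalAIController {

        private final LocalAIService aiService;

        public LocalAIController(LocalAIService aiService) {
            this.aiService = aiService;
        }

        // POST /api/ai/chat with a plain-text body returns the model's reply
        @PostMapping("/chat")
        public String chat(@RequestBody String message) {
            return aiService.chat(message);
        }
    }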

    Advanced Configuration

    Custom Models (Modelfile)

    Create specialized models with baked-in system prompts and parameters:

    Modelfile
    FROM llama3.1:8b
    PARAMETER temperature 0.3
    PARAMETER top_p 0.9
    SYSTEM """
    You are a Spring Boot expert. Always use:
    - Constructor injection over @Autowired
    - Java 21 features when applicable
    - Proper exception handling
    """

    Create: ollama create spring-expert -f Modelfile
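
    Once created, the custom model is addressed by name like any other Ollama model. A sketch that reuses the OllamaChatOptions pattern from LocalAIService above to route a request to spring-expert; the service and method names are illustrative:

    SpringExpertService.java
    import org.springframework.ai.chat.client.ChatClient;
    import org.springframework.stereotype.Service;

    @Service
    public class SpringExpertService {

        private final ChatClient chatClient;

        public SpringExpertService(ChatClient.Builder builder) {
            this.chatClient = builder.build();
        }

        // The SYSTEM prompt and parameters baked into the Modelfile apply automatically;
        // OllamaChatOptions is the same options builder used in LocalAIService above
        public String review(String code) {
            return chatClient.prompt()
                    .user("Review this Spring Boot code:\n" + code)
                    .options(OllamaChatOptions.builder()
                            .model("spring-expert")
                            .build())
                    .call()
                    .content();
        }
    }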

    Network & Performance

    Docker Access

    Allow containers to reach Ollama:

    OLLAMA_HOST=0.0.0.0 ollama serve

    Prevent Cold Starts

    Keep models loaded in memory (by default they are unloaded after a few minutes of inactivity):

    OLLAMA_KEEP_ALIVE=24h ollama serve

    GPU Layers

    Control GPU offloading:

    OLLAMA_NUM_GPU=35

    Recommended Models

    🦙 General Purpose

    • llama3.1:8b
      Best Balance
    • mistral:7b
      Fast
    • gemma2:9b
      Google

    💻 Code & Embeddings

    • codellama:13b
      Code Gen
    • nomic-embed-text
      RAG
    • llava:13b
      Vision (see the sketch below)
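
    The llava:13b entry above is a multimodal model: Spring AI's ChatClient can attach an image to the prompt and have it described entirely on your machine. A minimal sketch, assuming the model has been pulled and the image is available as a Spring Resource; the class and method names are illustrative, and OllamaChatOptions is the options builder used earlier:

    LocalVisionService.java
    import org.springframework.ai.chat.client.ChatClient;
    import org.springframework.core.io.Resource;
    import org.springframework.stereotype.Service;
    import org.springframework.util.MimeTypeUtils;

    @Service
    public class LocalVisionService {

        private final ChatClient chatClient;

        public LocalVisionService(ChatClient.Builder builder) {
            this.chatClient = builder.build();
        }

        // Sends the prompt plus an image to the local llava model
        public String describe(Resource image) {
            return chatClient.prompt()
                    .user(u -> u.text("Describe this image in one paragraph.")
                            .media(MimeTypeUtils.IMAGE_PNG, image))
                    .options(OllamaChatOptions.builder()
                            .model("llava:13b")
                            .build())
                    .call()
                    .content();
        }
    }

    For example, describe(new ClassPathResource("diagram.png")) runs without the image ever leaving your infrastructure, which matters when the images themselves are sensitive.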

    Start Building Locally

    Local AI gives you complete control over your data and costs. Download Ollama, pull a model, and start building in minutes.