Ollama & Local AI Models
Run powerful large language models directly on your machine. Complete privacy, zero API costs, and lightning-fast inference—all without internet connectivity.
Ollama is an open-source tool that makes running LLMs locally as simple as running Docker containers. It handles model downloads, quantization, memory management, and provides an OpenAI-compatible API—all with a single command. Spring AI integrates seamlessly with Ollama, allowing you to use the same ChatClient interface whether you're calling GPT-4 in the cloud or Llama 3 on your laptop.
Why Run Models Locally?
Complete Privacy
Your data never leaves your infrastructure. Process sensitive documents, proprietary code, and customer information without any third-party exposure. Essential for healthcare, legal, and financial applications where data sovereignty matters.
Low Latency
No network round-trips means responses start immediately, and small models on capable hardware can reach sub-100ms time to first token. Latency is also predictable: no rate limits, shared queues, or variable API response times. Perfect for real-time applications, IDE integrations, and interactive experiences where speed matters.
Offline Capable
Works without internet connectivity. Deploy to air-gapped environments, edge devices, or remote locations. Build applications that function reliably regardless of network conditions.
Hardware Requirements
Local AI requires RAM, not necessarily a GPU. Modern quantization techniques compress models to fit in system memory while preserving quality. Here's what you need for different model sizes:
7-8B models: 8GB RAM (Llama 3.1 8B, Mistral 7B, Gemma 7B)
13-14B models: 16GB RAM (Llama 2 13B, Phi-3 Medium)
70B models: 48GB RAM, 4-bit quantized (Llama 3.3 70B, Qwen 2.5 72B)
Apple Silicon Advantage: M1/M2/M3 Macs use unified memory, allowing the GPU to access all system RAM. A 64GB MacBook Pro can run 70B models that would require an expensive NVIDIA GPU on Windows/Linux.
Installing Ollama
Quick Start
Get up and running in under 2 minutes
Step 1: Install Ollama
# macOS / Linux (one command)
curl -fsSL https://ollama.com/install.sh | sh

# Windows: Download installer from https://ollama.com/download
# Or use winget:
winget install Ollama.Ollama

Step 2: Download a Model
# Best all-around model (4.7GB download)
ollama pull llama3.1:8b
# Code-specialized model
ollama pull codellama:13b
# Lightweight and fast
ollama pull phi3:mini

Step 3: Test It
# Interactive chat mode
ollama run llama3.1:8b
# Or via API (OpenAI-compatible!)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello!"}]}'
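Because the endpoint speaks the OpenAI wire format, you can also test it from plain Java before wiring up Spring AI. A minimal sketch using Spring Framework's RestClient; the class name OllamaSmokeTest and the hard-coded request body are illustrative, not part of Ollama or Spring AI:

import org.springframework.web.client.RestClient;

public class OllamaSmokeTest {
    public static void main(String[] args) {
        // Point a plain HTTP client at Ollama's OpenAI-compatible endpoint
        RestClient client = RestClient.create("http://localhost:11434");
        String response = client.post()
                .uri("/v1/chat/completions")
                .header("Content-Type", "application/json")
                .body("""
                        {"model": "llama3.1:8b",
                         "messages": [{"role": "user", "content": "Hello!"}]}
                        """)
                .retrieve()
                .body(String.class);
        System.out.println(response);   // raw JSON chat completion
    }
}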
Spring AI Integration
Maven Configuration
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
</dependency>

<!-- Spring AI BOM -->
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.0.0-M4</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

Application Properties
# Ollama connection (default port)
spring.ai.ollama.base-url=http://localhost:11434

# Default model for chat
spring.ai.ollama.chat.options.model=llama3.1:8b
spring.ai.ollama.chat.options.temperature=0.7

# Embedding model for RAG
spring.ai.ollama.embedding.options.model=nomic-embed-text

Service Implementation
Identical API to OpenAI—switch models with zero code changes
@Service
public class LocalAIService {

    private final ChatClient chatClient;

    public LocalAIService(ChatClient.Builder builder) {
        this.chatClient = builder
                .defaultSystem("You are a helpful coding assistant.")
                .build();
    }

    public String chat(String message) {
        return chatClient.prompt()
                .user(message)
                .call()
                .content();
    }

    // Switch models per request
    public String generateCode(String specification) {
        return chatClient.prompt()
                .user("Write Java code: " + specification)
                .options(OllamaChatOptions.builder()
                        .model("codellama:13b")
                        .temperature(0.2)
                        .build())
                .call()
                .content();
    }
}
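To expose the service over HTTP, a thin controller can delegate to it. A minimal sketch; the LocalAIController name and the /ai/chat mapping are illustrative choices, not something Spring AI prescribes:

@RestController
public class LocalAIController {

    private final LocalAIService aiService;

    public LocalAIController(LocalAIService aiService) {
        this.aiService = aiService;
    }

    // GET /ai/chat?message=... returns the model's reply as plain text
    @GetMapping("/ai/chat")
    public String chat(@RequestParam String message) {
        return aiService.chat(message);
    }
}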
Advanced Configuration
Custom Models (Modelfile)
Create specialized models with baked-in system prompts and parameters:
FROM llama3.1:8b
PARAMETER temperature 0.3
PARAMETER top_p 0.9
SYSTEM """
You are a Spring Boot expert. Always use:
- Constructor injection over @Autowired
- Java 21 features when applicable
- Proper exception handling
"""Create: ollama create spring-expert -f Modelfile
Network & Performance
Docker Access
Allow containers to reach Ollama:
OLLAMA_HOST=0.0.0.0 ollama serve

Prevent Cold Starts
Keep models loaded in memory:
OLLAMA_KEEP_ALIVE=24h

GPU Layers
Control GPU offloading:
OLLAMA_NUM_GPU=35

Recommended Models
🦙 General Purpose
llama3.1:8b (best balance)
mistral:7b (fast)
gemma2:9b (Google)
💻 Code & Embeddings
codellama:13b (code generation)
nomic-embed-text (embeddings for RAG)
llava:13b (vision)
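The nomic-embed-text model configured earlier plugs into Spring AI's EmbeddingModel abstraction, which is what a RAG pipeline uses to vectorize documents and queries. A minimal sketch assuming the Ollama starter has auto-configured an EmbeddingModel bean; EmbeddingService is an illustrative name:

@Service
public class EmbeddingService {

    private final EmbeddingModel embeddingModel;

    public EmbeddingService(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    // Convert text into a vector using the locally running nomic-embed-text model
    public float[] embed(String text) {
        return embeddingModel.embed(text);
    }
}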