Ollama Tutorial: Run Local LLMs on Windows, Linux, and macOS
Most people experience large language models (LLMs) through a cloud API — ChatGPT, Gemini, Claude, Grok, DeepSeek, and friends. That is convenient, but you are paying per token, bumping into rate limits, sending data to someone else’s server, and getting stuck whenever the network drops. Ollama solves all of that by running open-source LLMs directly on your own computer, fully offline, and makes it trivial to plug those models into AI agents, MCP servers, or UIs like AnythingLLM and Open WebUI later on.
This guide walks through installing Ollama on Windows, Linux, and macOS, running your first model, picking a model size that matches your hardware, and calling Ollama from Python or any REST client.
What is Ollama?
Ollama is a local LLM runtime. Once installed, you can drive it from the terminal, from its built-in REST API, or from libraries in Python, JavaScript, Java, C#, Go, and more. It is also the backend that many popular tools — AI agents, AnythingLLM, Open WebUI — talk to by default.
Ollama vs. the models themselves
It helps to think of Ollama as a media player and LLMs as the video files. The player does not contain any movies — it downloads, loads, and plays whatever you pick. Ollama is the runtime that manages downloads, memory, and execution; the actual knowledge lives inside model weights such as Llama 3, Phi-3, Qwen, Mistral, or Gemma, typically distributed as GGUF weight files.
Ollama vs. LLaMA
A quick clarification because the names look similar: LLaMA is Meta’s family of open-weight models, while Ollama is the runtime you use to run them. They are separate projects — Ollama just happens to run LLaMA well, along with dozens of other model families.
Installation
Windows
On Windows you can install Ollama through winget, or grab the installer directly from the official download page.
winget install --id=Ollama.Ollama -e
After installation, open PowerShell and type ollama to verify everything is wired up.
Linux
Linux has a one-line install script that the Ollama team maintains at ollama.com/install.sh. It sets up a systemd service automatically.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Check service status
systemctl status ollama
# Start and stop the service
systemctl start ollama
systemctl stop ollama
# Enable on boot (already enabled by default)
systemctl enable ollama
Most of these commands need sudo because they touch systemd.
macOS
On macOS, the cleanest path is Homebrew. You can also grab the .dmg from ollama.com/download/mac if you prefer a GUI installer.
# Install via Homebrew
brew install ollama
# Start Ollama in the current terminal
ollama serve
# Or run it as a background service
brew services start ollama
# Stop the background service
brew services stop ollama
If you run ollama serve in the foreground, it keeps that terminal occupied. Press Control + C to stop it.
Run your first model
If you did not start Ollama as a service, open one terminal and run ollama serve to boot the runtime, then open a second terminal for the commands below. If you installed via a service (systemd or brew services), you can skip that step.
# Pull Google's Gemma 3 4B model from the Ollama library
ollama pull gemma3:4b
# Run it
ollama run gemma3:4b
Once it finishes loading you will see:
>>> Send a message (/? for help)
You can just start typing. /? shows built-in commands, and /bye exits the interactive session. Here is a quick exchange from my machine:
>>> How tall is the Eiffel Tower?
The Eiffel Tower stands **330 meters (1,083 feet)** tall, including its antennas.
Without the antennas, the structure itself reaches **300 meters (984 feet)**.
For more details, you can visit the official website:
* **Eiffel Tower official site:** [https://www.toureiffel.paris/en](https://www.toureiffel.paris/en)
Hope that helps!
Two things to notice in the reply. First, the output is full of Markdown syntax like **bold** and [link text](url) — that is how LLMs are trained to format their answers. Ollama’s terminal UI shows them raw; any graphical frontend will render them into styled text. Second, reasoning-capable models often wrap their internal thought process inside <think>...</think> blocks. In the terminal you see those directly; most GUI frontends hide them behind a “Show reasoning” toggle.
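To illustrate what a GUI frontend does with those reasoning tags, here is a minimal sketch (my own helper, not part of Ollama) that strips <think>...</think> blocks from a reply with a regex:

```python
import re

def strip_reasoning(reply: str) -> str:
    """Remove <think>...</think> blocks, as a frontend might before rendering."""
    # re.DOTALL lets '.' match newlines, since reasoning blocks usually span lines
    return re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL).strip()

raw = "<think>330 m with antennas, 300 m without.</think>The Eiffel Tower is about 330 meters tall."
print(strip_reasoning(raw))  # → The Eiffel Tower is about 330 meters tall.
```

A real frontend would render the block behind a toggle instead of discarding it, but the parsing step is the same.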
Using Ollama from the REST API
Ollama ships with a built-in REST API at http://localhost:11434 that any language or tool can hit. The two endpoints you will use most are:
- POST /api/generate — single-shot completion
- POST /api/chat — multi-turn conversation
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{"model":"llama3.2","prompt":"Hello"}'
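By default the API streams its response as newline-delimited JSON: each line is one chunk carrying a partial piece of text, and the final chunk has done set to true. Here is a sketch of how a client reassembles the pieces, parsing a captured /api/generate-shaped stream as a string so it runs without a live server:

```python
import json

# Chunks shaped like /api/generate streaming output, captured as a string
# so this sketch runs without a live Ollama server.
raw_stream = """\
{"model":"llama3.2","response":"Hel","done":false}
{"model":"llama3.2","response":"lo!","done":false}
{"model":"llama3.2","response":"","done":true,"done_reason":"stop"}
"""

reply = ""
for line in raw_stream.splitlines():
    chunk = json.loads(line)
    reply += chunk["response"]  # append each partial piece
    if chunk["done"]:
        break

print(reply)  # → Hello!
```

Pass "stream": false in the request body if you would rather receive one complete JSON object, as the Python examples later in this guide do.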
Managing models
Everyday commands
# Download or update a model
ollama pull <model> # ollama pull gpt-oss
ollama pull <model>:<tag> # ollama pull gpt-oss:20b
# List installed models
ollama list
# Run a model interactively
ollama run <model> # ollama run gemma3
ollama run <model>:<tag> # ollama run gemma3:4b
# See what is currently loaded in memory
ollama ps
# Stop a running model
ollama stop <id>
# Delete a model
ollama rm <model>
ollama rm <model>:<tag>
The tag after the colon is the parameter size. If you omit it, Ollama uses the default latest tag for that model family — so gemma3 is usually equivalent to gemma3:latest, which today points to gemma3:4b.
You can browse every model Ollama supports at ollama.com/search. Each model page lists the available sizes and tells you which one latest maps to. For example, the Gemma 3 page lists 1B / 4B / 12B / 27B variants.
Parameters and quantization
Parameters are the number of weights in the model. 8B means 8 billion weights. More parameters usually means better reasoning and generation, at the cost of bigger downloads and more RAM / VRAM.
Quantization compresses those floating-point weights into lower-bit representations (q4_K_M, q8_0, and so on). The file gets smaller and loads faster, with a small quality hit. For local use, starting with a q4 variant is almost always the right default.
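A model's download size is roughly parameters × bits-per-weight ÷ 8, which makes a handy back-of-the-envelope check before pulling anything (real GGUF files run slightly larger because of embeddings and metadata):

```python
def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough model file size: number of weights * bits per weight / 8, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# An 8B model at 4-bit quantization vs. full 16-bit precision
print(f"8B @ q4:   ~{approx_size_gb(8, 4):.1f} GB")   # → ~4.0 GB
print(f"8B @ fp16: ~{approx_size_gb(8, 16):.1f} GB")  # → ~16.0 GB
```

The same arithmetic explains the RAM column in the sizing table below: a q4 8B model needs roughly 4 GB just to hold the weights, before context and overhead.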
There is also a separate class of Mixture of Experts (MoE) models like Mixtral 8x7B. The name means eight 7B expert sub-models packed into one file — about 47B total parameters rather than a naive 8 × 7B = 56B, because the experts share their attention layers. At inference time only two experts activate per token, so you get the knowledge breadth of a large model at a fraction of the compute cost.
Matching model size to hardware
The 4B model in the earlier example was deliberately small so it would download quickly and run on almost any laptop. Here is a rough sizing guide for the rest:
| Model size | Parameters | Minimum RAM / VRAM | Typical machine |
|---|---|---|---|
| 3B | 3 billion | 4 GB RAM / 2 GB VRAM | Ultrabook — basic chat, simple tool calls |
| 8B | 8 billion | 8 GB RAM / 4–6 GB VRAM | Sweet spot for most laptops — daily assistant, light coding help |
| 13B | 13 billion | 16 GB RAM / 8–10 GB VRAM | Mid-range GPU laptop or desktop — steadier answers, longer context |
| 30B | 30 billion | 32 GB RAM / 16 GB+ VRAM | Workstation — RAG, harder tasks |
| 70B | 70 billion | 64 GB RAM / 48 GB+ VRAM | Server-grade hardware |
On a MacBook Pro with an M2 Pro and 16 GB of unified memory, gemma3:4b flies and gemma3:12b is still responsive but eats most of the system RAM. Larger models get painful — deepseek-r1:14b is marginally usable, and gpt-oss:20b takes several seconds per token.
Gemma 3 also ships a tiny gemma3:270m variant — only 270 million parameters. It runs at warp speed but the output quality is too low to be useful for anything beyond testing.
Popular open-source models
The Ollama library has hundreds of models. Here are the families worth knowing when you are starting out:
| Model family | Team | Strengths |
|---|---|---|
| Llama 3.1 / 3.2 (llama3.1, llama3.2) | Meta | General-purpose, strong multilingual support. Widely used for chat, Q&A, and coding. 8B runs locally; 70B needs serious hardware or cloud. |
| Gemma 2 / Gemma 3 (gemma3:4b, gemma3:12b) | Google | Lightweight and research-friendly. Small sizes start fast; 12B+ is capable enough for real work. |
| GPT-OSS (gpt-oss:20b, gpt-oss:120b) | OpenAI | OpenAI’s first open-weight release since GPT-2, with built-in chain-of-thought reasoning. Only ships in 20B and 120B sizes. |
| Phi-3 (phi3:3.8b, phi3:7b) | Microsoft | “Small but sharp” — strong at code and reasoning for their size. Great for resource-constrained machines. |
| Mistral / Mixtral 8x7B | Mistral AI | Mistral 7B hits a great efficiency / quality balance; Mixtral 8x7B is an MoE model with much stronger reasoning at the cost of more RAM. |
| Qwen 2 / 2.5 (qwen2:7b, qwen2:14b) | Alibaba | Particularly strong in Chinese and generally multilingual. Solid for knowledge Q&A and document processing. |
| DeepSeek (deepseek-r1:7b, deepseek-coder) | DeepSeek | Focused on code generation and mathematical reasoning. deepseek-coder is an especially nice local coding assistant. |
| LLaVA (llava:7b) | Community (originated at UW-Madison + Microsoft Research) | Multimodal — accepts both text and images. Used for image captioning, visual Q&A, and screen understanding. Built on top of Llama or Mistral. |
“Open source” vs. “open weights”
A quick terminology note that trips up a lot of newcomers. When we say something is open source — think Linux or Python — we usually mean the full source code is public, anyone can read, modify, and redistribute it, and some licenses even allow commercial forks.
LLMs are usually different. Most of what people call “open-source models” are more accurately open-weight models: the team releases the trained weight files so you can download and run them locally, but the training code, datasets, and recipe stay private. You can use the model, but you cannot fully reproduce how it was made.
It is worth keeping the distinction in mind when you evaluate licenses and provenance.
Calling Ollama from Python
Two quick examples. The first uses the REST API directly — so it works from any language, or even plain curl. The second uses Ollama’s official Python client.
Python with requests
Any language with an HTTP client can call Ollama the same way. Make sure Ollama is running before you send requests — you do not need to run ollama run gemma3 first, because the runtime loads the requested model on demand.
import requests

def main():
    url = "http://localhost:11434/api/chat"
    payload = {
        "model": "gemma3:4b",
        "messages": [
            {"role": "system", "content": "You are an SEO analyst. Answer in English."},
            {"role": "user", "content": "I run a developer blog. Give me a few SEO tips."},
        ],
        "stream": False,
    }
    resp = requests.post(url, json=payload)
    print(resp.json())

if __name__ == "__main__":
    main()
While the request is running (and for a short grace period afterwards), ollama ps will show the model loaded in memory. The JSON response contains the fields below.
| Field | Type | What it means |
|---|---|---|
| model | string | The model name and tag used, e.g. gemma3:4b. |
| created_at | string (ISO 8601, UTC) | Request timestamp. |
| message | object | The model’s reply. Contains role (system / user / assistant) and content. |
| done | boolean | Whether this response is complete. In streaming mode, only the final chunk has done: true. |
| done_reason | string | Why the model stopped: stop (hit a stop token), length (hit max tokens), or error. |
| total_duration | integer (ns) | Total request time in nanoseconds. |
| load_duration | integer (ns) | Time spent loading the model (longer on cold starts). |
| prompt_eval_count | integer | Number of prompt tokens processed. |
| prompt_eval_duration | integer (ns) | Time spent processing those prompt tokens. |
| eval_count | integer | Number of output tokens generated. |
| eval_duration | integer (ns) | Time spent generating the output. |
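Those duration fields make it easy to compute the number most people actually care about: tokens per second. A small sketch, using made-up example values in place of a real response (remember the durations are nanoseconds):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from the response's eval_count and eval_duration fields."""
    return eval_count / (eval_duration_ns / 1e9)

# Example stats as they might appear in a response: 180 tokens in 6 seconds
stats = {"eval_count": 180, "eval_duration": 6_000_000_000}
print(f"{tokens_per_second(stats['eval_count'], stats['eval_duration']):.1f} tok/s")  # → 30.0 tok/s
```

Comparing this figure across models and quantizations is a quick way to find the largest model your machine runs comfortably.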
Python with the official ollama package
If you prefer a typed client, pip install ollama and skip the manual JSON handling.
import ollama

def main():
    messages = [
        {"role": "system", "content": "You are a friendly assistant. Answer in English."},
        {"role": "user", "content": "Briefly describe the solar system."},
    ]
    response = ollama.chat(model="gemma3:4b", messages=messages)
    print(response["message"]["content"])

if __name__ == "__main__":
    main()
That is enough to get started. Ollama really does just work — pick a size your machine can comfortably hold, point a tool at http://localhost:11434, and you have a fully local LLM backend ready for chat UIs, AI agents, retrieval pipelines, and whatever else you want to build next.