Using a Local Large Language Model (LLM): Running Ollama on Your Laptop
You can now run powerful LLMs like Llama 3.1 directly on your laptop using Ollama. No cloud, no cost. Just install, pull a model, and start chatting, all in a local shell.
Large Language Models (LLMs) have revolutionized how we interact with data and systems, but many assume you need significant cloud resources or specialized hardware to run them. Today, I want to walk you through getting started with Ollama, an approachable tool that lets you run large language models locally on your laptop.
Why Run LLMs Locally?
Before diving into the details of working with a local LLM, let’s address why you might want to run LLMs on your own hardware:
- Privacy: Your data never leaves your machine
- Cost control: No cloud, no usage-based billing
- Offline usage: Work anywhere without an internet connection
- Learning: Better understanding of how these models work
- Customization: Fine-tune models for your specific needs
Installing Ollama
On macOS, we can use Homebrew:
brew install ollama
For Linux users, the official installation script works well:
curl -fsSL https://ollama.com/install.sh | sh
Windows users can download the installer from https://ollama.com/download/windows.
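Whichever platform you’re on, you can confirm the install by checking the version. The version number you see will differ from the example below:
ollama --version
ollama version is 0.6.5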
You might be wondering why I’m not using a Docker container, which is typically my preferred choice for tasks like this. Currently, Docker Desktop and OrbStack do not support GPU passthrough, and allowing Ollama access to the GPU is essential for optimal performance. While running on the CPU is possible, performance will not be great. For those looking for a solution, Docker Model Runner is an excellent (and emerging) option. It enables GPU acceleration for large language models (LLMs) on Docker Desktop. I’ll be blogging about this topic in the near future.
Starting the Service
Once installed, we need to start the Ollama service:
ollama serve &
On Linux and macOS, the ampersand (&) runs the Ollama process in the background, freeing up your terminal for further commands. Windows users should open a new terminal instead.
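If you want to confirm the service is listening before moving on, a quick request to its default port should come back with a short status message:
curl http://localhost:11434
Ollama is running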
Launching an Interactive Session
Let’s start with a simple interaction using the llama3.1 model at the command line. Since we specified the llama3.1 model, Ollama will download this model to your local machine.
ollama run llama3.1
pulling manifest
pulling 667b0c1932bc... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.9 GB
pulling 948af2743fc7... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.5 KB
pulling 0ba8f0e314b4... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 12 KB
pulling 56bb8bd477a5... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 96 B
pulling 455f34728c9b... 100% ▕███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 487 B
verifying sha256 digest
writing manifest
success
>>> Send a message (/? for help)
Interacting with a Model at the CLI
Once the model is downloaded, you will be dropped into the Ollama CLI, where you can interact with the LLM by typing at the prompt and pressing enter.
This launches an interactive session where you can chat with the model. In the output below, you can see that I asked the LLM, “Please give me a short sentence describing what a large language model is.”
Type /bye when you’re done to exit.
Please give me a short sentence describing what a large language model is.
A large language model is a computer program that can process and understand human language
to generate responses, answers, or text based on the input it receives.
>>> /bye
Listing and Managing Models
Once you’ve pulled and run a model on your local machine, you can check its status and confirm that it is loaded by running:
ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
llama3.1:latest    46e0c10c039e    7.0 GB    100% GPU     4 minutes from now
This command lists all currently running models and shows information such as their ID, processor usage, and how long each model will stay loaded in memory.
To see what models are available locally:
ollama list
NAME               ID              SIZE      MODIFIED
llama3.1:latest    46e0c10c039e    4.9 GB    44 hours ago
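If you want more detail than ollama list provides, such as a model’s architecture, parameter count, and context length, the ollama show command prints that information for a given model:
ollama show llama3.1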
Downloading Additional Models
Ollama makes it easy to access various LLMs, allowing you to experiment with different models depending on your use case. Here’s how you can pull models to your local machine:
ollama pull llama3
In this case, we’re pulling the llama3 model, but you can substitute the model name with any available model, such as llama3.1, nomic-embed-text, or others. Ollama will download the model and its associated files to your local storage.
During the download process, you’ll see progress logs similar to the following:
pulling manifest
pulling 6a0746a1ec1a... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.7 GB
pulling 4fa551d4f938... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 12 KB
pulling 8ab4849b038c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 254 B
pulling 577073ffcc6c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 110 B
pulling 3f8eb4da87fa... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 485 B
verifying sha256 digest
writing manifest
success
Using ollama list again, we can see that the additional model has been downloaded.
ollama list
NAME               ID              SIZE      MODIFIED
llama3:latest      365c0bd3c000    4.7 GB    44 minutes ago
llama3.1:latest    46e0c10c039e    4.9 GB    44 hours ago
Choosing Between Models
Ollama supports a range of models, each with different capabilities and optimizations. For instance, the llama3.1 model is optimized for general-purpose tasks, while models like nomic-embed-text are designed for more specific applications, such as text embedding for semantic search. We’ll be using nomic-embed-text later in this blog series.
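For example, to pull that embedding model now so it’s ready when we need it:
ollama pull nomic-embed-text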
Head over to https://ollama.com/search for a listing of available models.
Exploring the REST API
Ollama provides a REST API for easy integration into applications. After pulling and starting a model, you can interact with it programmatically. The API is accessible at http://localhost:11434 by default. You can send requests to perform actions like generating completions, embedding text, or managing models. When you submit a request, the API responds with a streaming response by default, returning partial output as it is generated, which improves responsiveness in interactive applications.
curl -k http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"prompt": "Please give me a short sentence describing what a large language model is."
}'
{"model":"llama3.1","created_at":"2025-04-18T00:24:01.921459Z","response":"A","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:01.996083Z","response":" large","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.063884Z","response":" language","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.129147Z","response":" model","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.189396Z","response":" (","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.250582Z","response":"LL","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.31109Z","response":"M","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.374554Z","response":")","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.437303Z","response":" is","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.498758Z","response":" a","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.561764Z","response":" computer","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.624437Z","response":" program","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.688192Z","response":" that","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.756787Z","response":" uses","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.829267Z","response":" artificial","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.895286Z","response":" intelligence","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:02.956524Z","response":" to","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.016778Z","response":" process","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.07894Z","response":" and","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.139738Z","response":" generate","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.201174Z","response":" human","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.262151Z","response":"-like","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.3223Z","response":" text","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.385279Z","response":" based","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.446721Z","response":" on","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.507238Z","response":" patterns","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.568208Z","response":" learned","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.629634Z","response":" from","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.691328Z","response":" massive","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.752396Z","response":" amounts","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.812789Z","response":" of","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.873268Z","response":" data","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.934768Z","response":".","done":false}
{"model":"llama3.1","created_at":"2025-04-18T00:24:03.996892Z","response":"","done":true,"done_reason":"stop","context":[128006,882,128007,271,36227,757,264,2875,11914,23524,1148,264,3544,4221,1646,374,30,128009,128006,78191,128007,271,32,3544,4221,1646,320,4178,44,8,374,264,6500,2068,430,5829,21075,11478,311,1920,323,7068,3823,12970,1495,3196,389,12912,9687,505,11191,15055,315,828,13],"total_duration":2377351459,"load_duration":18090834,"prompt_eval_count":23,"prompt_eval_duration":280934292,"eval_count":34,"eval_duration":2077071583}
The response includes the answer along with metadata such as token counts and processing durations.
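If you would rather receive the complete answer as a single JSON document instead of a stream of partial tokens, set the stream parameter to false in the request body. Here is the same request with streaming disabled:
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "prompt": "Please give me a short sentence describing what a large language model is.",
    "stream": false
  }'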
Performance Considerations
The performance of these models significantly depends on your hardware, particularly the capability of your GPU. While Ollama is designed for ease of use, it is important to keep a few key factors in mind when assessing performance.
- GPU Usage: Ollama runs on both CPU-only and GPU-equipped systems. On Apple Silicon Macs, it uses the built-in GPU (via Metal) for accelerated inference. Running Ollama with a GPU is preferred as it significantly improves responsiveness; you may experience slower response times on systems with only a CPU.
- Disk Space: Models can range from a few GBs to tens of GBs. Be mindful of how much disk space each model consumes on your system, and remove models that are no longer being used.
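When you no longer need a model, removing it is a one-liner. For example, to delete the llama3 model we pulled earlier and reclaim its disk space:
ollama rm llama3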
Integrating with Databases
I’m a database pro and am particularly interested in how LLMs can enhance database operations. For example, with embedding models like nomic-embed-text, you can:
- Generate embeddings from your database’s text fields
- Store these vectors in Azure SQL Database, and soon SQL Server 2025
- Perform semantic similarity searches
- Build intelligent query interfaces for your data using natural language prompts
This bridges the gap between traditional structured data and natural language processing. I’ll be blogging on this topic in some upcoming posts.
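As a small preview, here is what generating an embedding looks like through the same REST API, assuming you have already pulled the nomic-embed-text model. The response is a JSON document containing an embedding array of floating-point values that you could then store in a vector column:
curl http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "Ollama makes it easy to run large language models locally."
  }'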
Conclusion
Running large language models (LLMs) locally with Ollama democratizes access to AI technologies. What once required specialized knowledge and advanced hardware is now accessible to anyone with a reasonably powerful laptop.
In future posts, I will explore how to integrate these locally hosted models with Azure SQL Database. This will enable hybrid solutions that combine structured data with natural language AI.