Running Local LLM - 0$ Personal Agentic AI Assistant - Part 3

Introduction

This is Part 3 of the Zero Dollar Personal AI Assistant series: running local LLMs on a free cloud server, and what actually works. Part 1 covers the architecture. Part 2 covers the free Oracle Cloud setup.

Running a language model locally sounds straightforward until you try it. Download a model, point your app at it, done. In practice, there are real constraints: RAM limits, disk-space surprises, and CPU inference-speed walls that most tutorials gloss over.

This article is honest about all of it: what works on a free Oracle ARM instance, what doesn't, and how a hybrid setup (local model plus a free API fallback) makes the whole thing practical.

The CPU Inference Reality Check

Before picking a model, understand what you're getting into.

Your Oracle ARM instance has no GPU. Every token generated by a language model runs on CPU cores. This matters because modern LLMs were designed to run on GPUs, whose massively parallel architecture is what makes inference fast. A CPU has far fewer cores, so that parallelism is not available to the same degree.

What this means in practice:

Model size	RAM needed	Tokens/sec on 4 ARM CPUs	Response time (100 tokens)
3B parameters	~2GB	15-25 tok/s	4-7 seconds
8B parameters	~5GB	5-10 tok/s	10-20 seconds
14B parameters	~9GB	2-5 tok/s	20-50 seconds
70B parameters	~40GB	Won't fit	—
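
The last column is plain arithmetic: response length divided by generation speed. Here is a minimal sketch of that calculation, assuming generation speed is the only cost (it ignores model load time and prompt processing). The helper name estimate_seconds is illustrative, not part of Ollama.

```shell
#!/usr/bin/env bash
# Estimate seconds needed to generate a response: tokens / tokens-per-second.
# estimate_seconds is a hypothetical helper name used only for illustration.
estimate_seconds() {
  local tokens="$1" tok_per_sec="$2"
  # LC_ALL=C keeps the decimal separator a dot regardless of locale.
  LC_ALL=C awk -v t="$tokens" -v s="$tok_per_sec" \
    'BEGIN { if (s <= 0) exit 1; printf "%.1f\n", t / s }'
}
```

For example, estimate_seconds 100 15 prints 6.7 and estimate_seconds 100 25 prints 4.0, which is where the 4-7 second range for the 3B model comes from.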

For a personal assistant responding to Telegram messages, 4-7 seconds for a short response is acceptable. You send a message, put your phone down, and pick it up again when the reply arrives. It is a different mental model from a real-time chat UI, but workable.

The mistake to avoid: pulling a 70B model because it benchmarks well. It needs 40GB RAM minimum and simply won't run on your instance. I learned this the hard way: a partial 42GB download filled the disk before the model even ran.
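
A quick free-space check before a large pull would have caught this. The sketch below is one way to do it, assuming a POSIX df. The helper name has_free_gb is hypothetical, and the size you pass in is your own estimate of the download.

```shell
#!/usr/bin/env bash
# Succeeds (exit 0) if the filesystem holding <path> has at least <needed_gb>
# gigabytes free. has_free_gb is a hypothetical helper name.
has_free_gb() {
  local needed_gb="$1" path="${2:-/}"
  local avail_kb
  # df -P prints POSIX-format output; field 4 of line 2 is available 1K blocks.
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( needed_gb * 1024 * 1024 )) ]
}
```

Usage could look like: has_free_gb 5 / && ollama pull llama3.1:8b. The 5 is an assumed figure, borrowed from the RAM column as a rough stand-in for download size.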

Installing Ollama

Ollama is the runtime that downloads and runs open-source models locally. Think of it as the music player; the models are the music it plays.

Always use tmux before long-running commands:

sudo apt install tmux -y
tmux new -s setup

If your SSH session drops mid-install, reconnect and run tmux attach -t setup to pick up exactly where you left off. Skipping tmux for a large model download is how you end up restarting from scratch.

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Verify it's running:

systemctl status ollama
ollama --version

Ollama installs as a systemd service and starts automatically on boot, so no manual management is needed.

Model Selection

This is where most guides give you a benchmark table and call it done. What actually matters for your use case is the RAM-to-quality tradeoff on CPU hardware.

The models that make sense for this stack:

Llama 3.2:3B — The Speed Choice

ollama pull llama3.2:3b

RAM: ~2GB
Speed: 15-25 tokens/second — fastest option
Quality: Good for everyday tasks, struggles with complex reasoning
Made by: Meta
Best for: Quick responses, simple Q&A, drafting short content

Llama 3.1:8B — The Quality Choice

ollama pull llama3.1:8b

RAM: ~5GB
Speed: 5-10 tokens/second
Quality: Significantly better reasoning, handles nuanced tasks
Made by: Meta
Best for: More complex tasks where quality matters more than speed

Phi-4:14B — The Reasoning Choice

ollama pull phi4

RAM: ~9GB
Speed: 2-5 tokens/second — noticeably slower
Quality: Strong reasoning and instruction following, punches above its weight
Made by: Microsoft
Best for: Tasks requiring careful reasoning, analysis, and structured output

The recommendation for this stack: llama3.2:3b

Not because it's the best model (it isn't), but because OpenClaw's agent mode wraps every model call with tool context, memory, session history, and system prompts. What feels fast in a bare ollama run test becomes significantly slower when the agent layer adds 2-3KB of context to every request. With that overhead, the 3B model stays within acceptable response times, while the 8B model starts hitting timeouts in agent mode on CPU.

If you want better quality and can accept 30-90 second response times for complex queries, llama3.1:8b is worth trying.

Disk Space Management

Model files are large. Managing disk space proactively saves painful cleanup sessions later.

Check your current disk usage:

df -h
du -sh /usr/share/ollama/.ollama/models/

List downloaded models:

ollama list

Remove a model you no longer need:

ollama rm <modelname>

The gotcha with partial downloads:

If a download fails or you cancel it, Ollama leaves a partial file in the blobs directory. These can be gigabytes in size and won't show up in ollama list. Check and clean manually:

# Stop Ollama first
sudo systemctl stop ollama

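# Remove as the ollama user (files are owned by this user).
# Warning: this deletes every blob, including fully downloaded models,
# so you will need to ollama pull them again afterwards.
# sh -c makes the * expand as the ollama user, who can read the directory.
sudo -u ollama sh -c 'rm -rf /usr/share/ollama/.ollama/models/blobs/*'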

# Restart
sudo systemctl start ollama
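
Before deleting anything, it can help to see what is actually left over. The sketch below lists candidate partial files with their sizes. It assumes that interrupted downloads carry "-partial" in their file name, which is worth verifying on your own system. The helper name list_partials is hypothetical.

```shell
#!/usr/bin/env bash
# List leftover partial downloads in a blobs directory, with their sizes.
# Assumption: partial files have "-partial" in their name; check this on
# your own system first. list_partials is a hypothetical helper name.
list_partials() {
  local blobs_dir="${1:-/usr/share/ollama/.ollama/models/blobs}"
  # -exec ... + runs du only when at least one file matches.
  find "$blobs_dir" -maxdepth 1 -type f -name '*-partial*' -exec du -h {} +
}
```

Reading the default directory may need elevated permissions, since the files belong to the ollama user. If the listing matches what you expect, removing only those files leaves complete models in place.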

If the disk fills and growpart fails with "no space left on device", you need to free space before the partition can be extended: even growing the partition requires some temporary space. Remove partial downloads first, then retry growpart.

The Hybrid Architecture: Local + Gemini Fallback

Here's the truth about local-only inference for an AI assistant: it works, but has a quality ceiling. The 3B model handles most everyday tasks fine. But occasionally a request that needs real reasoning, such as a complex question or a nuanced writing task, either produces a weak response or times out entirely.

The solution: use the local model as the primary and Google's Gemini API as a free fallback.

Why Gemini free tier works here:

250K tokens per minute (TPM) on the free tier, more than enough for one person
No credit card required
Gemini 2.5 Flash-Lite responds in 1-2 seconds
When the local model times out, Gemini catches it automatically

The flow:

Your message
     ↓
Ollama llama3.2:3b (primary)
     ↓ if timeout or failure
Gemini 2.5 Flash-Lite (fallback) ← free, fast, no card needed
     ↓
Response to Telegram

Most responses come from the local model at zero cost. Complex queries or timeouts fall through to Gemini, also at zero cost. The experience from your phone is just: you send a message, you get a response.
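
The control flow above is a plain "try with a time limit, else fall back" pattern. The sketch below illustrates that pattern only. It is not how OpenClaw implements its fallback. It assumes the timeout command from GNU coreutils is available, and with_fallback is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Run a primary command with a time limit; if it fails, times out, or
# prints nothing, run a fallback command with the same arguments.
# with_fallback is a hypothetical helper, not part of Ollama or OpenClaw.
# Usage: with_fallback <timeout_seconds> <primary_cmd> <fallback_cmd> [args...]
with_fallback() {
  local limit="$1" primary="$2" fallback="$3"
  shift 3
  local out
  # timeout (GNU coreutils) exits non-zero when the limit is reached.
  # The primary must be an executable or script, not a shell function.
  if out=$(timeout "$limit" "$primary" "$@" 2>/dev/null) && [ -n "$out" ]; then
    printf '%s\n' "$out"
  else
    "$fallback" "$@"
  fi
}
```

In practice the primary could be a small script that sends the prompt to the local Ollama model and the fallback a script wrapping the Gemini call, for example with_fallback 60 ./ask_ollama.sh ./ask_gemini.sh "your prompt". Both script names and the 60-second limit are placeholders.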

Get Your Gemini API Key

Go to aistudio.google.com
Click Get API Key → Create API key
Copy the key — it may start with AIza...

No credit card, no billing setup. Takes two minutes.

Verifying Everything Works

Check RAM usage while the model is loaded:

free -h

With llama3.2:3b loaded, you should see ~2-3GB used out of 24GB, plenty of headroom for OpenClaw and everything else.
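
free -h is fine for a glance. For scripts, the same headroom figure can be read from /proc/meminfo. Here is a minimal sketch, assuming a Linux kernel that reports MemAvailable. The helper name mem_available_gb is hypothetical.

```shell
#!/usr/bin/env bash
# Print available memory in GB, read from a meminfo-style file.
# MemAvailable is the kernel's estimate of memory usable without swapping,
# reported in kB. mem_available_gb is a hypothetical helper name.
mem_available_gb() {
  local meminfo="${1:-/proc/meminfo}"
  LC_ALL=C awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "$meminfo"
}
```

Calling mem_available_gb with no argument reads the live system. Passing a file path is only there to make the function easy to test.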

Check Ollama has auto-started:

systemctl status ollama

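It should show active (running). The model itself loads into RAM only when first called and then stays resident for subsequent calls, which is why the first response after a reboot takes longer than later ones. Note that Ollama unloads a model after it has been idle for a while (five minutes by default, adjustable with the OLLAMA_KEEP_ALIVE environment variable), so the first call after a long pause is slow again.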

Test Ollama directly:

ollama run llama3.2:3b "Just Reply OKAY!"

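Once the model is loaded, it should respond in under 10 seconds. The very first call can take longer while the model loads into RAM. If it stays slow after that, something is wrong with the Ollama service.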

Test the Gemini API call (replace API_KEY with your own key):

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent" \
  -H 'Content-Type: application/json' \
  -H 'X-goog-api-key: API_KEY' \
  -X POST \
  -d '{
    "contents": [
      {
        "parts": [
          {
            "text": "Just! Reply OKAY"
          }
        ]
      }
    ]
  }'

The HTTP response status code should be 200, with the response text in the body, and the call should show up in the logs in Google AI Studio.

Common Issues

model requires more system memory than is available
You pulled a model too large for your RAM. llama3.3 requires 40GB, so it will never run on a 24GB instance. Remove it and pull a smaller model:

ollama rm llama3.3
ollama pull llama3.2:3b

Disk full during model download
The download filled your boot volume. Stop Ollama, remove partial files as the ollama user (not root), free space, then extend the partition if needed via Oracle Console → Boot Volume resize.

Ollama slow after reboot
The first call after a reboot loads the model into RAM, so a slow first response is expected. Subsequent calls are faster since the model stays resident.

What's Next

With Ollama running and your hybrid local + Gemini fallback configured, the AI layer is ready.

Part 4 will cover installing OpenClaw on Linux — the right user, systemd service setup, the config file traps, and every mistake worth avoiding so you don't have to make them yourself.

This article is the third in a five-part series:

$0 Personal Agentic AI Assistant - Architecture
Setting Up Free Cloud Server: VCN, ARM instances, static IPs, the gotchas
Running Ollama on ARM: model selection, disk management, CPU inference reality ← you are here
Installing OpenClaw on Linux: avoiding every trap
The Complete Setup: Telegram, Gemini fallback, end-to-end testing

Stay tuned; all links will be updated as articles are published.

If you have reached this point, I have made a satisfactory effort to keep you reading. Please be kind enough to leave any comments or share any corrections.
