Introduction
This is Part 3 of the Zero Dollar Personal AI Assistant series: Running Local LLMs on a Free Cloud Server, What Actually Works. Part 1 covers the architecture. Part 2 covers the free Oracle Cloud setup.
Running a language model locally sounds straightforward until you try it. Download a model, point your app at it, done. In practice, there are real constraints: RAM limits, disk-space surprises, and CPU inference-speed walls that most tutorials gloss over.
This article is honest about all of it. What works on a free Oracle ARM instance, what doesn't, and how a hybrid local + free API fallback makes the whole thing practical.
The CPU Inference Reality Check
Before picking a model, understand what you're getting into.
Your Oracle ARM instance has no GPU. Every token generated by a language model runs on CPU cores. This matters because modern LLMs are designed to run on GPUs, whose massively parallel architecture is what makes inference fast. A CPU offers far less of that parallelism.
What this means in practice:
| Model size | RAM needed | Tokens/sec on 4 ARM CPUs | Response time (100 tokens) |
|---|---|---|---|
| 3B parameters | ~2GB | 15-25 tok/s | 4-7 seconds |
| 8B parameters | ~5GB | 5-10 tok/s | 10-20 seconds |
| 14B parameters | ~9GB | 2-5 tok/s | 20-50 seconds |
| 70B parameters | ~40GB | Won't fit | — |
For a personal assistant responding to Telegram messages, 4-7 seconds for a short response is acceptable. You send a message, put your phone down, and pick it up again to read the reply. It's a different mental model from a real-time chat UI, but workable.
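The response-time column in the table is plain arithmetic: output tokens divided by generation rate. A minimal shell sketch of that calculation follows. The function name `estimate_seconds` is made up for illustration, and the rates are the rough ranges from the table, not measurements from your instance.

```shell
# estimate_seconds TOKENS TOKENS_PER_SEC
# Prints the expected generation time in seconds, to one decimal place.
estimate_seconds() {
  awk -v n="$1" -v rate="$2" 'BEGIN {
    if (rate <= 0) { print "rate must be positive" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", n / rate
  }'
}

estimate_seconds 100 20   # 3B model at a mid-range 20 tok/s -> 5.0
estimate_seconds 100 5    # 8B model at the low end, 5 tok/s -> 20.0
```

This covers generation only. Time spent processing the prompt comes on top, which matters once an agent layer adds context to every request.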
The mistake to avoid: pulling a 70B model because it benchmarks well. It needs 40GB RAM minimum and simply won't run on your instance. I learned this the hard way: a partial 42GB download filled the disk before the model even ran.
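One way to avoid that kind of surprise is to compare free disk space with the model's approximate size before pulling. The sketch below is a hypothetical helper, not part of Ollama: `disk_free_gb` and `can_pull` are illustrative names, sizes are whole gigabytes, and the 5GB default margin is an arbitrary choice.

```shell
# disk_free_gb PATH: free space on the filesystem holding PATH, in whole GB.
disk_free_gb() {
  # -P forces one line per filesystem; column 4 is available space in KB.
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# can_pull MODEL_GB PATH [MARGIN_GB]: succeeds only if model plus margin fits.
can_pull() {
  local model_gb="$1" path="$2" margin_gb="${3:-5}" free_gb
  free_gb="$(disk_free_gb "$path")"
  [ "$free_gb" -ge "$((model_gb + margin_gb))" ]
}

# Check the root filesystem before pulling a model of roughly 2GB.
if can_pull 2 /; then
  echo "Enough free disk for a ~2GB model"
else
  echo "Not enough free disk; clean up before pulling"
fi
```

Run against the 42GB download mentioned above, a check like `can_pull 42 /` would have failed before the transfer started.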
Installing Ollama
Ollama is the runtime that downloads and runs open-source models locally. Think of it as the music player; the models are the music it plays.
Always start a tmux session before running long commands:
sudo apt install tmux -y
tmux new -s setup
If your SSH session drops mid-install, reconnect and run tmux attach -t setup to pick up exactly where you left off. Skipping tmux for a large model download is how you end up restarting from scratch.
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Verify it's running:
systemctl status ollama
ollama --version
Ollama installs as a systemd service and starts automatically on boot, so no manual management is needed.
Model Selection
This is where most guides give you a benchmark table and call it done. What actually matters for your use case is the RAM-to-quality tradeoff on CPU hardware.
The models that make sense for this stack:
Llama 3.2:3B — The Speed Choice
ollama pull llama3.2:3b
- RAM: ~2GB
- Speed: 15-25 tokens/second — fastest option
- Quality: Good for everyday tasks, struggles with complex reasoning
- Made by: Meta
- Best for: Quick responses, simple Q&A, drafting short content
Llama 3.1:8B — The Quality Choice
ollama pull llama3.1:8b
- RAM: ~5GB
- Speed: 5-10 tokens/second
- Quality: Significantly better reasoning, handles nuanced tasks
- Made by: Meta
- Best for: More complex tasks where quality matters more than speed
Phi-4:14B — The Reasoning Choice
ollama pull phi4
- RAM: ~9GB
- Speed: 2-5 tokens/second — noticeably slower
- Quality: Strong reasoning and instruction following, punches above its weight
- Made by: Microsoft
- Best for: Tasks requiring careful reasoning, analysis, and structured output
The recommendation for this stack: llama3.2:3b
Not because it's the best model (it isn't), but because OpenClaw's agent mode wraps every model call with tool context, memory, session history, and system prompts. What feels fast in a bare ollama run test becomes significantly slower when the agent layer adds 2-3KB of context to every request. With that overhead, the 3B model stays within acceptable response times. The 8B model starts hitting timeouts in agent mode on CPU.
If you want better quality and can accept 30-90 second response times for complex queries, llama3.1:8b is worth trying.
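To see what your own instance actually delivers, you can compute tokens per second from the timing fields Ollama returns. A non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The helper name `tokens_per_sec` is illustrative, and the live call in the comments assumes `jq` is installed and Ollama is on its default port 11434.

```shell
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# Converts Ollama's eval_count and eval_duration (nanoseconds) into tok/s.
tokens_per_sec() {
  awk -v count="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) exit 1
    printf "%.1f\n", count / (ns / 1e9)
  }'
}

# Example with made-up numbers: 120 tokens generated in 6 seconds.
tokens_per_sec 120 6000000000

# Against a live server:
# resp="$(curl -s http://localhost:11434/api/generate \
#   -d '{"model":"llama3.2:3b","prompt":"Say OK","stream":false}')"
# tokens_per_sec "$(echo "$resp" | jq .eval_count)" "$(echo "$resp" | jq .eval_duration)"
```

For a quick look without any scripting, `ollama run llama3.2:3b --verbose` prints similar timing statistics after each response.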
Disk Space Management
Model files are large. Managing disk space proactively saves painful cleanup sessions later.
Check your current disk usage:
df -h
du -sh /usr/share/ollama/.ollama/models/
List downloaded models:
ollama list
Remove a model you no longer need:
ollama rm <modelname>
The gotcha with partial downloads:
If a download fails or you cancel it, Ollama leaves a partial file in the blobs directory. These can be gigabytes in size and won't show up in ollama list. Check and clean manually:
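# Stop Ollama first
sudo systemctl stop ollama
# Remove as the ollama user (it owns these files). The glob runs inside sh -c so it expands with that user's permissions.
# Warning: this deletes every blob, including fully downloaded models, so re-pull what you need afterwards.
sudo -u ollama sh -c 'rm -rf /usr/share/ollama/.ollama/models/blobs/*'
# Restart
sudo systemctl start ollama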
If the disk fills and growpart fails with "no space left on device", you need to free space before the partition can be extended, because even growing the volume requires some temporary space. Remove partial downloads first, then retry growpart.
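Before deleting anything, it can help to list what is actually left behind. The sketch below assumes in-progress blobs carry a `-partial` suffix in their file names; verify that on your install with `ls` before relying on it. `list_partials` is a hypothetical helper name.

```shell
# list_partials BLOBS_DIR: print leftover partial downloads with sizes in KB.
list_partials() {
  find "$1" -maxdepth 1 -type f -name '*-partial*' -exec du -k {} +
}

# Default Ollama blob location on a Linux systemd install.
BLOBS_DIR=/usr/share/ollama/.ollama/models/blobs
if [ -d "$BLOBS_DIR" ]; then
  list_partials "$BLOBS_DIR"
fi
```

If the directory is not readable by your login user, run the `find` command as the ollama user instead. No output means no files matched the pattern.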
The Hybrid Architecture: Local + Gemini Fallback
Here's the truth about local-only inference for an AI assistant: it works, but has a quality ceiling. The 3B model handles most everyday tasks fine. But occasionally, a complex question, a nuanced writing task, something that requires real reasoning, either produces a weak response or times out entirely.
The solution: use the local model as the primary and Google's Gemini API as a free fallback.
Why Gemini free tier works here:
- 250K Tokens Per Minute (TPM) on the free tier — more than enough for one person
- No credit card required
- Gemini 2.5 Flash-Lite responds in 1-2 seconds
- When the local model times out, Gemini catches it automatically
The flow:
Your message
↓
Ollama llama3.2:3b (primary)
↓ if timeout or failure
Gemini 2.5 Flash-Lite (fallback) ← free, fast, no card needed
↓
Response to Telegram
Most responses come from the local model at zero cost. Complex queries or timeouts fall through to Gemini, also at zero cost. The experience from your phone is just: you send a message, you get a response.
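The flow above boils down to "try the local model with a time limit, otherwise call the fallback". The sketch below shows only that control flow. It is not how OpenClaw implements or configures its fallback: `try_with_fallback` is a hypothetical helper, it assumes GNU `timeout` (standard on Ubuntu), and the two commands are stand-ins so the example runs without Ollama or an API key.

```shell
# try_with_fallback LIMIT_SECS PRIMARY_CMD FALLBACK_CMD
# Runs PRIMARY_CMD under a time limit; if it times out, fails, or prints
# nothing, runs FALLBACK_CMD instead.
try_with_fallback() {
  local limit="$1" primary="$2" fallback="$3" out
  if out="$(timeout "$limit" sh -c "$primary" 2>/dev/null)" && [ -n "$out" ]; then
    printf '%s\n' "$out"
  else
    sh -c "$fallback"
  fi
}

# Stand-in commands: the "local model" takes 3 seconds but is allowed only 1,
# so the fallback answers instead.
try_with_fallback 1 'sleep 3; echo "local answer"' 'echo "fallback answer"'
```

In a real setup the primary command would be a call to Ollama and the fallback a request to the Gemini API. Passing user prompts through command strings needs careful quoting, so treat this as an illustration of the decision logic only.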
Get Your Gemini API Key
- Go to aistudio.google.com
- Click Get API Key → Create API key
- Copy the key (it may start with AIza...)
No credit card, no billing setup. Takes two minutes.
Verifying Everything Works
Check RAM usage while the model is loaded:
free -h
With llama3.2:3b loaded, you should see ~2-3GB used out of 24GB, plenty of headroom for OpenClaw and everything else.
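If you would rather script this check than read `free -h` by eye, the kernel exposes the same figure as `MemAvailable` in `/proc/meminfo`. The sketch below is a minimal example; `mem_available_gb` and `has_headroom` are illustrative names, and the optional file argument exists only so the functions can be tried against a saved copy.

```shell
# mem_available_gb [MEMINFO_FILE]: prints MemAvailable in GB, one decimal.
mem_available_gb() {
  awk '$1 == "MemAvailable:" { printf "%.1f\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# has_headroom NEEDED_GB [MEMINFO_FILE]: succeeds if at least NEEDED_GB is available.
has_headroom() {
  awk -v need="$1" '
    $1 == "MemAvailable:" { ok = ($2 / 1024 / 1024 >= need) }
    END { exit (ok ? 0 : 1) }
  ' "${2:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
  echo "Available RAM: $(mem_available_gb) GB"
fi
```

For example, `has_headroom 9 && ollama pull phi4` would only pull the 14B model when roughly 9GB is free at that moment.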
Check Ollama has auto-started:
systemctl status ollama
Should show active (running). The model itself loads into RAM only when first called and then stays resident for subsequent calls (by default Ollama unloads a model after about five minutes of inactivity), which is why the first response after a reboot or a long idle gap takes longer than the ones that follow.
Test Ollama directly:
ollama run llama3.2:3b "Just Reply OKAY!"
Should respond in under 10 seconds. If it takes longer, something is wrong with the Ollama service.
Test the Gemini API call:
# Replace API_KEY with the key you copied from AI Studio
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent" \
  -H 'Content-Type: application/json' \
  -H 'X-goog-api-key: API_KEY' \
  -X POST \
  -d '{
    "contents": [
      {
        "parts": [
          { "text": "Just reply OKAY" }
        ]
      }
    ]
  }'
The HTTP response status should be 200 and the body should contain the response text. The call should also appear under Logs in Google AI Studio.
Common Issues
model requires more system memory than is available
You pulled a model too large for your RAM. llama3.3 requires 40GB, so it will never run on a 24GB instance. Remove it and pull a smaller model:
ollama rm llama3.3
ollama pull llama3.2:3b
Disk full during model download
The download filled your boot volume. Stop Ollama, remove partial files as the ollama user (not root), free space, then extend the partition if needed via Oracle Console → Boot Volume resize.
Ollama slow after reboot
The first call after a reboot loads the model into RAM, so a slow first response is expected. Subsequent calls are faster while the model stays resident.
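Ollama also unloads an idle model after a keep-alive window (five minutes by default), so the slow first call can reappear after a quiet period, not only after a reboot. One option is a warm-up request: sending `/api/generate` a model name with no prompt loads the model, and `keep_alive` controls how long it stays loaded. The helper name `preload_body` and the `30m` value are illustrative, and the port is Ollama's default 11434.

```shell
# preload_body MODEL KEEP_ALIVE: JSON body for a prompt-less "load only" request.
preload_body() {
  printf '{"model":"%s","keep_alive":"%s"}\n' "$1" "$2"
}

# Show the request body that would be sent.
preload_body llama3.2:3b 30m

# Against a running Ollama server:
# curl -s http://localhost:11434/api/generate -d "$(preload_body llama3.2:3b 30m)"
```

A longer keep-alive trades idle RAM for faster first responses, which is a comfortable trade with a 3B model on a 24GB instance.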
What's Next
With Ollama running and your hybrid local + Gemini fallback configured, the AI layer is ready.
Part 4 will cover installing OpenClaw on Linux — the right user, systemd service setup, the config file traps, and every mistake worth avoiding so you don't have to make them yourself.
This article is the third in a five-part series:
- $0 Personal Agentic AI Assistant - Architecture
- Setting Up Free Cloud Server — VCN, ARM instances, static IPs, the gotchas
- Running Ollama on ARM — model selection, disk management, CPU inference, reality ← you are here
- Installing OpenClaw on Linux — avoiding every trap
- The Complete Setup — Telegram, Gemini fallback, end-to-end testing
Stay tuned. All links will be updated as articles are published.
If you have reached this point, I have made a satisfactory effort to keep you reading. Please be kind enough to leave any comments or share any corrections.















