Running Local LLM - 0$ Personal Agentic AI Assistant - Part 3
AK DevCraft · 2026-05-25 · via DEV Community

Introduction

This is Part 3 of the Zero Dollar Personal AI Assistant series: running local LLMs on a free cloud server, and what actually works. Part 1 covers the architecture. Part 2 covers the free Oracle Cloud setup.

Running a language model locally sounds straightforward until you try it. Download a model, point your app at it, done. In practice, there are real constraints: RAM limits, disk-space surprises, and CPU inference-speed walls that most tutorials gloss over.

This article is honest about all of it: what works on a free Oracle ARM instance, what doesn't, and how a hybrid local + free API fallback makes the whole thing practical.

The CPU Inference Reality Check

Before picking a model, understand what you're getting into.

Your Oracle ARM instance has no GPU. Every token generated by a language model runs on CPU cores. This matters because modern LLMs were designed to run on GPUs, whose massively parallel architecture is what makes inference fast. A CPU offers far less of that parallelism.

What this means in practice:

| Model size | RAM needed | Tokens/sec on 4 ARM CPUs | Response time (100 tokens) |
|---|---|---|---|
| 3B parameters | ~2GB | 15-25 tok/s | 4-7 seconds |
| 8B parameters | ~5GB | 5-10 tok/s | 10-20 seconds |
| 14B parameters | ~9GB | 2-5 tok/s | 20-50 seconds |
| 70B parameters | ~40GB | Won't fit | n/a |

For a personal assistant responding to Telegram messages, 4-7 seconds for a short response is acceptable. You send a message, put your phone down, and pick it up again when the reply arrives. It's a different mental model from a real-time chat UI, but workable.
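The response-time column is plain arithmetic: tokens to generate divided by tokens per second. A minimal sketch of that calculation, where response_time is a hypothetical helper name (not part of Ollama) and awk does the decimal division:

```shell
# response_time TOKENS TOK_PER_SEC -> seconds, printed with one decimal place
response_time() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

response_time 100 15   # 3B model, slow end: 6.7
response_time 100 25   # 3B model, fast end: 4.0
response_time 100 5    # 8B model, slow end: 20.0
```

These figures only cover generation. Time spent loading the model or reading a long prompt comes on top.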

The mistake to avoid: pulling a 70B model because it benchmarks well. It needs 40GB RAM minimum and simply won't run on your instance. I learned this the hard way: a partial 42GB download filled the disk before the model even ran.
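A quick pre-flight check can catch this before any download starts. The sketch below compares a model's RAM requirement (the figures from the table above) against total system RAM read from /proc/meminfo. total_ram_gb and model_fits are hypothetical helper names, and the comparison is deliberately crude: it ignores memory already used by the OS and other processes.

```shell
# total_ram_gb [MEMINFO_FILE] -> total RAM in whole GB, rounded down.
# MemTotal in /proc/meminfo is reported in kB.
total_ram_gb() {
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}

# model_fits NEEDED_GB [MEMINFO_FILE] -> exit 0 if NEEDED_GB is below total RAM
model_fits() {
  [ "$1" -lt "$(total_ram_gb "${2:-/proc/meminfo}")" ]
}

# Example (commented out so that sourcing this file has no side effects):
# model_fits 40 || echo "a 40GB model will not fit on this machine"
```

On a 24GB instance, model_fits 9 should succeed and model_fits 40 should fail.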

Installing Ollama

Ollama is the runtime that downloads and runs open-source models locally. Think of it as the music player; the models are the music it plays.

Always use tmux before long-running commands:

sudo apt install tmux -y
tmux new -s setup

If your SSH session drops mid-install, reconnect and run tmux attach -t setup to pick up exactly where you left off. Skipping tmux during a large model download is how you end up restarting from scratch.

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Verify it's running:

systemctl status ollama
ollama --version

Ollama installs as a systemd service and starts automatically on boot, so no manual management is needed.
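If you script the setup, the service may need a moment to come up after the install script finishes, and a small retry loop avoids racing it. A minimal sketch, assuming a hypothetical helper named retry; systemctl is-active --quiet is a standard systemctl call that exits 0 once the unit is active:

```shell
# retry ATTEMPTS DELAY CMD... -> runs CMD until it succeeds or ATTEMPTS run out.
# The body runs in a subshell, so its variables do not leak into the caller.
retry() (
  attempts=$1; delay=$2; shift 2
  i=1
  while ! "$@" >/dev/null 2>&1; do
    [ "$i" -ge "$attempts" ] && exit 1   # give up: the function returns 1
    i=$((i + 1))
    sleep "$delay"
  done
)

# Example (commented out so that sourcing this file has no side effects):
# retry 10 1 systemctl is-active --quiet ollama && echo "ollama is up"
```

The helper is generic on purpose, so the same loop can wrap any other check in your setup script.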

Model Selection

This is where most guides give you a benchmark table and call it done. What actually matters for your use case is the RAM-to-quality tradeoff on CPU hardware.

The models that make sense for this stack:

Llama 3.2:3B — The Speed Choice

ollama pull llama3.2:3b

  • RAM: ~2GB
  • Speed: 15-25 tokens/second — fastest option
  • Quality: Good for everyday tasks, struggles with complex reasoning
  • Made by: Meta
  • Best for: Quick responses, simple Q&A, drafting short content

Llama 3.1:8B — The Quality Choice

ollama pull llama3.1:8b

  • RAM: ~5GB
  • Speed: 5-10 tokens/second
  • Quality: Significantly better reasoning, handles nuanced tasks
  • Made by: Meta
  • Best for: More complex tasks where quality matters more than speed

Phi-4:14B — The Reasoning Choice

ollama pull phi4

  • RAM: ~9GB
  • Speed: 2-5 tokens/second — noticeably slower
  • Quality: Strong reasoning and instruction following, punches above its weight
  • Made by: Microsoft
  • Best for: Tasks requiring careful reasoning, analysis, and structured output

The recommendation for this stack: llama3.2:3b

Not because it's the best model; it isn't. The reason is that OpenClaw's agent mode wraps every model call with tool context, memory, session history, and system prompts. What feels fast in a bare ollama run test becomes significantly slower when the agent layer adds 2-3KB of context to every request. With that overhead, the 3B model stays within acceptable response times, while the 8B model starts hitting timeouts in agent mode on CPU.

If you want better quality and can accept 30-90 second response times for complex queries, llama3.1:8b is worth trying.
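The tradeoff can be written down as a simple rule: pick the largest model whose worst-case time still fits your patience. A rough sketch, where pick_model is a hypothetical helper and the thresholds (7, 20, 50 seconds) are the upper bounds for a 100-token reply from the table earlier in this article. Those are bare-model figures, so agent mode will be slower than this suggests.

```shell
# pick_model MAX_SECONDS -> largest model whose worst-case time for a
# 100-token reply fits within MAX_SECONDS (a whole number of seconds).
# Returns 1 and prints nothing if even the 3B model is too slow.
pick_model() {
  if   [ "$1" -ge 50 ]; then echo "phi4"
  elif [ "$1" -ge 20 ]; then echo "llama3.1:8b"
  elif [ "$1" -ge 7 ];  then echo "llama3.2:3b"
  else return 1
  fi
}

pick_model 10   # llama3.2:3b
pick_model 30   # llama3.1:8b
```

The result can be passed straight to the pull command, for example ollama pull "$(pick_model 10)".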

Disk Space Management

Model files are large. Managing disk space proactively saves painful cleanup sessions later.

Check your current disk usage:

df -h
du -sh /usr/share/ollama/.ollama/models/

List downloaded models:

ollama list

Remove a model you no longer need:

ollama rm <modelname>

The gotcha with partial downloads:

If a download fails or you cancel it, Ollama leaves a partial file in the blobs directory. These can be gigabytes in size and won't show up in ollama list. Check and clean manually:

# Stop Ollama first
sudo systemctl stop ollama

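# Remove as the ollama user (files are owned by this user).
# Note: this deletes every blob, including fully downloaded models, so pull them again afterwards.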
sudo -u ollama rm -rf /usr/share/ollama/.ollama/models/blobs/*

# Restart
sudo systemctl start ollama

If the disk fills up and growpart fails with "no space left on device", free some space before extending the partition: even growing the volume requires temporary space. Remove partial downloads first, then retry growpart.
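The cleanup command above clears the whole blobs directory, which also removes models that downloaded fine. A gentler first step is to look at what is actually there. The sketch below lists leftover partial files and reports free space. It assumes that interrupted downloads carry "-partial" in their file names, which is what I have seen in the blobs directory but is worth verifying with ls on your own instance. list_partials and free_gb are hypothetical helper names.

```shell
# list_partials [BLOBS_DIR] -> prints leftover partial download files, one per line.
# Assumption: partial files have "-partial" in their names.
list_partials() {
  find "${1:-/usr/share/ollama/.ollama/models/blobs}" -type f -name '*-partial*'
}

# free_gb [PATH] -> free space on PATH's filesystem in whole GB, rounded down.
# df -P prints one line per filesystem; -k reports sizes in kB; column 4 is "Available".
free_gb() {
  df -Pk "${1:-/}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Example (commented out so that sourcing this file has no side effects):
# list_partials | xargs -r du -h
# echo "free: $(free_gb /) GB"
```

Depending on the directory's permissions, you may need to run list_partials with sudo. Once you have confirmed the names, deleting only those files keeps your complete models intact.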

The Hybrid Architecture: Local + Gemini Fallback

Here's the truth about local-only inference for an AI assistant: it works, but it has a quality ceiling. The 3B model handles most everyday tasks fine. But occasionally a complex question or a nuanced writing task, something that requires real reasoning, either produces a weak response or times out entirely.

The solution: use the local model as the primary and Google's Gemini API as a free fallback.

Why Gemini free tier works here:

  • 250K Tokens Per Minute (TPM) on the free tier — more than enough for one person
  • No credit card required
  • Gemini 2.5 Flash-Lite responds in 1-2 seconds
  • When the local model times out, Gemini catches it automatically

The flow:

Your message
     ↓
Ollama llama3.2:3b (primary)
     ↓ if timeout or failure
Gemini 2.5 Flash-Lite (fallback) ← free, fast, no card needed
     ↓
Response to Telegram

Most responses come from the local model at zero cost. Complex queries or timeouts fall through to Gemini, also at zero cost. The experience from your phone is just: you send a message, you get a response.
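To make the flow concrete, here is a rough shell sketch of the same idea. This is not how OpenClaw implements its fallback; it only illustrates the logic. ask_with_fallback, ask_local, and ask_gemini are hypothetical helper names, and GEMINI_API_KEY and LOCAL_TIMEOUT are assumed environment variables. timeout is the GNU coreutils command, and the Gemini endpoint is the one tested later in this article.

```shell
# ask_with_fallback PRIMARY FALLBACK PROMPT
# Runs PRIMARY with the prompt; if it fails or prints nothing, runs FALLBACK.
ask_with_fallback() (
  primary=$1; fallback=$2; prompt=$3
  if reply=$("$primary" "$prompt") && [ -n "$reply" ]; then
    printf '%s\n' "$reply"
  else
    "$fallback" "$prompt"
  fi
)

# Primary: the local model, cut off after LOCAL_TIMEOUT seconds (default 60).
ask_local() {
  timeout "${LOCAL_TIMEOUT:-60}" ollama run llama3.2:3b "$1"
}

# Fallback: Gemini over HTTPS; prints the raw JSON response.
# Assumes the prompt contains no double quotes or backslashes (no JSON escaping here).
ask_gemini() {
  curl -fsS "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent" \
    -H 'Content-Type: application/json' \
    -H "X-goog-api-key: $GEMINI_API_KEY" \
    -X POST \
    -d "{\"contents\":[{\"parts\":[{\"text\":\"$1\"}]}]}"
}

# Example (commented out so that sourcing this file has no side effects):
# ask_with_fallback ask_local ask_gemini "Just Reply OKAY!"
```

Passing the two backends in as arguments keeps the fallback logic testable without a model or a network connection. If jq is installed, piping the Gemini output through jq -r '.candidates[0].content.parts[0].text' should extract just the reply text.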

Get Your Gemini API Key

  1. Go to aistudio.google.com
  2. Click Get API Key → Create API key
  3. Copy the key — it may start with AIza...

No credit card, no billing setup. Takes two minutes.

Verifying Everything Works

Check RAM usage while the model is loaded:

free -h

With llama3.2:3b loaded, you should see ~2-3GB used out of 24GB, plenty of headroom for OpenClaw and everything else.

Check Ollama has auto-started:

systemctl status ollama

Should show active (running). The model itself loads into RAM only when first called and then stays resident for subsequent calls, which is why the first response after a reboot takes longer than subsequent ones.

Test Ollama directly:

ollama run llama3.2:3b "Just Reply OKAY!"

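Once the model is loaded, it should respond in under 10 seconds. The very first call is slower because the model has to load into RAM. If later calls still take much longer, something is wrong with the Ollama service.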
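To put a number on it rather than counting in your head, a tiny timing wrapper helps. A minimal sketch, where elapsed_seconds is a hypothetical helper that uses date +%s, so the resolution is whole seconds:

```shell
# elapsed_seconds CMD... -> runs CMD with its output discarded,
# then prints how many whole seconds it took.
elapsed_seconds() (
  start=$(date +%s)
  "$@" >/dev/null 2>&1 || true   # the timing is reported even if CMD fails
  end=$(date +%s)
  echo $((end - start))
)

# Example (commented out so that sourcing this file has no side effects):
# elapsed_seconds ollama run llama3.2:3b "Just Reply OKAY!"
```

Run it twice in a row: the first number includes model loading, and the second is closer to what you will see in normal use.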

Test Gemini Model API Call

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent" \
  -H 'Content-Type: application/json' \
  -H 'X-goog-api-key: API_KEY' \
  -X POST \
  -d '{
    "contents": [
      {
        "parts": [
          {
            "text": "Just! Reply OKAY"
          }
        ]
      }
    ]
  }'

The HTTP response status code should be 200, with the response text in the body, and the call should appear under Logs in Google AI Studio.
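The plain curl call prints the body but not the status code. A small sketch that surfaces the status and turns it into a hint, using standard curl flags (-s, -o, -w). gemini_ping and classify_status are hypothetical helper names, and the hints follow general HTTP conventions rather than an exhaustive list of Gemini error codes:

```shell
# classify_status CODE -> one-line description of an HTTP status code
classify_status() {
  case "$1" in
    200) echo "ok" ;;
    429) echo "rate limited: slow down or wait for the quota to reset" ;;
    4??) echo "client error: check the API key, model name, and JSON body" ;;
    5??) echo "server error: retry later" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# gemini_ping KEY -> sends a short test prompt and prints only the status code.
# -s hides the progress meter, -o discards the body, -w prints the HTTP status.
gemini_ping() {
  curl -s -o /dev/null -w '%{http_code}' \
    "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent" \
    -H 'Content-Type: application/json' \
    -H "X-goog-api-key: $1" \
    -X POST \
    -d '{"contents":[{"parts":[{"text":"Just Reply OKAY!"}]}]}'
}

# Example (commented out so that sourcing this file has no side effects):
# classify_status "$(gemini_ping "$GEMINI_API_KEY")"
```

Here GEMINI_API_KEY is an assumed environment variable holding the key you copied from AI Studio.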

Common Issues

  • model requires more system memory than is available
    You pulled a model too large for your RAM. llama3.3 requires 40GB, so it will never run on a 24GB instance. Remove it and pull a smaller model:
ollama rm llama3.3
ollama pull llama3.2:3b

  • Disk full during model download
    The download filled your boot volume. Stop Ollama, remove partial files as the ollama user (not root), free space, then extend the partition if needed via Oracle Console → Boot Volume resize.

  • Ollama slow after reboot
    The first call after a reboot loads the model into RAM, which is expected. Subsequent calls are faster since the model stays resident.

What's Next

With Ollama running and your hybrid local + Gemini fallback configured, the AI layer is ready.

Part 4 will cover installing OpenClaw on Linux — the right user, systemd service setup, the config file traps, and every mistake worth avoiding so you don't have to make them yourself.

This article is the third in a five-part series:

  1. $0 Personal Agentic AI Assistant - Architecture
  2. Setting Up Free Cloud Server: VCN, ARM instances, static IPs, the gotchas
  3. Running Ollama on ARM: model selection, disk management, CPU inference reality ← you are here
  4. Installing OpenClaw on Linux: avoiding every trap
  5. The Complete Setup: Telegram, Gemini fallback, end-to-end testing

Stay tuned, all links will be updated as articles are published.

If you have reached this point, I have made a satisfactory effort to keep you reading. Please be kind enough to leave any comments or share any corrections.

My Other Blogs: