Breaking the AI Chatbox: How Berkeley Students Built Real Autonomous Agents
Praveen Tech World · 2026-06-20 · via DEV Community


Every AI demo looks the same. A chat window, a text prompt, a response streamed character by character. The chat interface has become the default mental model for interacting with AI. It is also a trap. It reduces the most powerful technology in a generation to a typing interface that trains you to ask questions instead of building solutions.

Last semester, a group of Berkeley CS students noticed this trap. They were using ChatGPT for their algorithms homework, getting perfect answers, and failing their proctored midterms. The chat interface was making them dependent. So they stopped using it. Instead, they built autonomous agents that run in sandboxes, plan their own tasks, execute Python code, store results in SQLite, and email the output. No chat box. No streaming text. No typing prompts.

This is how they built it and why it works better.

The Direct Answer

Chat interfaces train passive consumption. Autonomous agents train active engineering. The Berkeley students built a system where the LLM is not a chat partner but a planning engine. It receives a high-level task, decomposes it into subtasks, executes each subtask via sandboxed Python, stores intermediate results in a local SQLite database, and emails the final output. The agent runs without human intervention. The student reviews the output the same way they would review a colleague's work — critically, with full context.

The Core Architecture

The system has four components that run in sequence:

1. Task Planner. A lightweight LLM call (routed through OpenRouter to a low-cost model priced at roughly $0.10 to $0.15 per million input tokens) receives the objective and outputs a JSON array of subtasks. Each subtask has a description, a success criterion, and a dependency list. The planner loop runs until all subtasks are marked complete or a maximum iteration count is reached.

2. Code Executor. Each subtask is handed to a code-writing LLM that generates Python scripts. The scripts run inside a Docker sandbox with no network access, no filesystem persistence, and a 30-second timeout. The agent captures stdout, stderr, and return codes. If the script errors, the executor re-prompts the LLM with the error message and retries up to 3 times.

3. SQLite Store. All intermediate results — parsed data, computed values, error logs — are written to a local SQLite database. The agent does not need to remember context between steps. It reads from the database. This is the key insight: the database replaces the chat context window. The agent can reference any past result without re-prompting the LLM.

4. Email Aggregator. When every subtask completes, the agent compiles a Markdown report of all outputs and emails it to the user. The email includes the task objective, the subtask list with completion status, the generated code, and any output files. The user never watches the agent work. They get the result when it is done.
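The subtask format the planner emits can be made concrete with a small schema check. The sketch below is one way such a plan might be validated before anything runs. The `Subtask` dataclass, the `parse_plan` helper, and the limit of 8 subtasks are assumptions for illustration, not the students' code.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One planner-produced step; field names follow the description above."""
    id: str
    description: str
    success_criterion: str
    depends_on: list[str] = field(default_factory=list)


def parse_plan(raw_json: str, max_subtasks: int = 8) -> list[Subtask]:
    """Parse planner output and reject plans that are malformed."""
    items = json.loads(raw_json)
    if not isinstance(items, list) or not items:
        raise ValueError("plan must be a non-empty JSON array")
    if len(items) > max_subtasks:
        raise ValueError(f"plan has {len(items)} subtasks, limit is {max_subtasks}")
    plan = [Subtask(**item) for item in items]
    ids = [s.id for s in plan]
    if len(set(ids)) != len(ids):
        raise ValueError("subtask ids must be unique")
    for s in plan:
        unknown = set(s.depends_on) - set(ids)
        if unknown:
            raise ValueError(f"{s.id} depends on unknown ids: {sorted(unknown)}")
    return plan


# Hypothetical planner output used only to exercise the parser.
example = json.dumps([
    {"id": "load", "description": "Read data.csv",
     "success_criterion": "rows > 0", "depends_on": []},
    {"id": "sum", "description": "Sum the amount column",
     "success_criterion": "prints a number", "depends_on": ["load"]},
])
plan = parse_plan(example)
print([s.id for s in plan])
```

Rejecting a bad plan up front is cheaper than discovering a dangling dependency halfway through execution.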

When This Works

This architecture works for any task that can be decomposed into discrete, verifiable subtasks. Data analysis, web scraping, file processing, code generation, report generation, math computation, and algorithm implementation all fit naturally.

It works best when the success criterion for each subtask is objective. "Sum this column and save to a CSV" passes or fails. "Write a compelling introduction" is subjective and needs human evaluation.
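An objective criterion is one a program can check without a human. As a rough sketch of the "sum this column" case, the hypothetical `check_column_sum` below recomputes the expected value from the raw data and compares it with what the generated script printed. The `{"sum": ...}` output shape is an assumption.

```python
import csv
import io
import json
import math


def column_sum(csv_text: str, column: str) -> float:
    """Independently compute the expected value from the raw CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row[column]) for row in reader)


def check_column_sum(stdout: str, csv_text: str, column: str) -> bool:
    """Pass only if stdout is JSON like {"sum": <number>} that matches the data."""
    try:
        reported = json.loads(stdout)["sum"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    if isinstance(reported, bool) or not isinstance(reported, (int, float)):
        return False
    return math.isclose(reported, column_sum(csv_text, column), rel_tol=1e-9)


data = "item,amount\na,1.5\nb,2.5\nc,6\n"
print(check_column_sum('{"sum": 10.0}', data, "amount"))  # True
print(check_column_sum('{"sum": 9.0}', data, "amount"))   # False
print(check_column_sum("not json", data, "amount"))       # False
```

A criterion such as "write a compelling introduction" has no equivalent of `column_sum`, which is why it needs a human reviewer.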

The Berkeley students used it for their CS 170 algorithms problem sets. The agent would receive a problem statement, decompose it into subproblems, implement each algorithm in Python, run the test cases, log results to SQLite, and email the output. They learned more from reviewing the agent's code than they did from reading ChatGPT's answers because the agent's code was structured as engineering output, not chat responses.

When This Does NOT Work

This does not work for creative or open-ended tasks. If the objective is vague — "explore this dataset and find interesting patterns" — the agent will produce generic output that misses the human insight a domain expert would catch.

It also does not work when the task requires subjective judgment. Code reviews, design decisions, and architectural tradeoffs need human reasoning. The agent can generate options but cannot choose correctly without clear, measurable criteria.

It will also fail on tasks that depend on live external state, such as APIs or web pages that change while the agent runs. The plan assumes a static environment. If a website changes between subtask executions, the agent may produce stale or inconsistent results.

Step-by-Step: Build Your Own Agent Sandbox

Here is how to replicate the Berkeley setup in under 100 lines of Python.

Step 1: Set up the planner.

import json, os, smtplib, sqlite3, subprocess
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the OpenAI client works with a custom base_url.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.getenv("OPENROUTER_KEY"))

def plan_task(objective):
  prompt = f"""Given this objective: {objective}
Output a JSON array of subtasks. Each subtask must have:
- id (unique string)
- description (concrete action)
- depends_on (list of subtask ids that must complete first)
- success_criterion (how to verify completion)

Max 8 subtasks. Output only valid JSON."""
  # OpenRouter model ids carry a provider prefix.
  resp = client.chat.completions.create(model="openai/gpt-4o-mini", messages=[{"role": "user", "content": prompt}], temperature=0.1)
  # Strip an optional Markdown code fence around the JSON before parsing (requires Python 3.9+).
  text = resp.choices[0].message.content.strip()
  text = text.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
  return json.loads(text)

The planner returns a structured plan. Each subtask knows what it depends on and how to verify success.
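Because each subtask lists its dependencies, a valid run order can be derived rather than assumed. The following is a minimal sketch using the standard library's `graphlib` (Python 3.9+). The `execution_order` helper and the sample plan are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(subtasks: list[dict]) -> list[str]:
    """Return subtask ids so that every id appears after all of its dependencies."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {s["id"]: set(s.get("depends_on", [])) for s in subtasks}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"plan contains a dependency cycle: {exc.args[1]}") from exc


# The planner may emit subtasks in any order; the sort recovers a runnable one.
plan = [
    {"id": "report", "depends_on": ["stats"]},
    {"id": "stats", "depends_on": ["load"]},
    {"id": "load", "depends_on": []},
]
print(execution_order(plan))  # ['load', 'stats', 'report']
```

A cycle in the plan surfaces as a `ValueError` before any code is generated, which is a natural point to re-prompt the planner.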

Step 2: Execute subtasks in order.

def execute_subtask(subtask):
  code_prompt = f"""Write a Python script that accomplishes this: {subtask['description']}
The script must print its result as JSON to stdout.
Handle errors gracefully. Timeout after 30 seconds.
Output only the code inside a ```python block."""
  resp = client.chat.completions.create(model="openai/gpt-4o-mini", messages=[{"role": "user", "content": code_prompt}], temperature=0.2)
  code = resp.choices[0].message.content
  # Keep only the body of the ```python block if the model included one.
  if "```python" in code:
    code = code.split("```python")[1].split("```")[0]
  try:
    # A bare subprocess is NOT a sandbox; run untrusted code in an isolated container instead.
    result = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=30)
  except subprocess.TimeoutExpired:
    # subprocess.run raises instead of returning when the timeout is hit.
    return {"stdout": "", "stderr": "timed out after 30 seconds", "returncode": -1}
  return {"stdout": result.stdout, "stderr": result.stderr, "returncode": result.returncode}

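The executor runs each script in a subprocess (use a Docker container for anything beyond local experiments) and captures stdout, stderr, and the return code. The retry-on-failure loop described in the architecture section is not part of this function; it wraps it.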
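The retry behavior might look like the sketch below. It assumes "up to 3 times" means three attempts in total, and it takes the code generator as a plain callable so the loop can be exercised without an LLM. `run_script` and `execute_with_retries` are hypothetical names.

```python
import subprocess
import sys
from typing import Callable


def run_script(code: str, timeout: int = 30) -> dict:
    """Run code in a child interpreter and capture its output."""
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": f"timed out after {timeout}s", "returncode": -1}
    return {"stdout": proc.stdout, "stderr": proc.stderr, "returncode": proc.returncode}


def execute_with_retries(task: str, generate: Callable[[str], str],
                         max_attempts: int = 3) -> dict:
    """Ask `generate` for code, run it, and feed stderr back into the prompt on failure."""
    prompt = task
    result = {"stdout": "", "stderr": "no attempts made", "returncode": -1}
    for attempt in range(1, max_attempts + 1):
        result = run_script(generate(prompt))
        result["attempts"] = attempt
        if result["returncode"] == 0:
            break
        prompt = f"{task}\n\nThe previous script failed with:\n{result['stderr']}\nFix it."
    return result


# Stand-in for the LLM: the first answer is buggy, the second is correct.
answers = iter(["print(1/0)", "print(2 + 2)"])
outcome = execute_with_retries("add two and two", lambda prompt: next(answers))
print(outcome["attempts"], outcome["stdout"].strip())  # 2 4
```

Passing the generator in as a parameter also makes it easy to swap models or to replay recorded answers in tests.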

Step 3: Store in SQLite.

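objective = "Sum the 'amount' column in sales.csv"  # placeholder objective for illustration
subtasks = plan_task(objective)  # this loop assumes the planner lists subtasks in dependency order

db = sqlite3.connect("agent_results.db")
db.execute("CREATE TABLE IF NOT EXISTS subtask_results (id TEXT, description TEXT, stdout TEXT, stderr TEXT, returncode INT, completed_at TEXT)")
for subtask in subtasks:
  result = execute_subtask(subtask)
  db.execute("INSERT INTO subtask_results VALUES (?, ?, ?, ?, ?, datetime('now'))", (subtask["id"], subtask["description"], result["stdout"], result["stderr"], result["returncode"]))
  db.commit()  # commit per subtask so partial progress survives a crash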

The database is the agent's memory. Any subtask can query previous results by reading from the table.
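Reading earlier results back could be sketched as follows. The table layout matches the one above. The `dependency_outputs` helper, the sample rows, and the use of an in-memory database are assumptions for illustration.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory database so the example leaves no file behind
db.execute("CREATE TABLE subtask_results (id TEXT, description TEXT, stdout TEXT, "
           "stderr TEXT, returncode INT, completed_at TEXT)")
db.execute("INSERT INTO subtask_results VALUES (?, ?, ?, ?, ?, datetime('now'))",
           ("load", "Count rows in data.csv", json.dumps({"rows": 120}), "", 0))
db.execute("INSERT INTO subtask_results VALUES (?, ?, ?, ?, ?, datetime('now'))",
           ("clean", "Drop empty rows", "", "KeyError: 'amount'", 1))
db.commit()


def dependency_outputs(conn: sqlite3.Connection, depends_on: list[str]) -> dict:
    """Return the parsed stdout of each dependency that completed successfully."""
    if not depends_on:
        return {}
    placeholders = ",".join("?" for _ in depends_on)
    rows = conn.execute(
        f"SELECT id, stdout FROM subtask_results "
        f"WHERE returncode = 0 AND id IN ({placeholders})",
        depends_on,
    ).fetchall()
    return {row_id: json.loads(stdout) for row_id, stdout in rows}


print(dependency_outputs(db, ["load", "clean"]))  # {'load': {'rows': 120}}
```

Only the rows a subtask actually depends on are pulled into its prompt, so the prompt size stays bounded no matter how long the run has been going.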

Step 4: Email the report.

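def email_report(to_addr, objective, db_path):
  # Open the database by path so the function does not depend on a global connection.
  conn = sqlite3.connect(db_path)
  rows = conn.execute("SELECT id, description, stdout, returncode FROM subtask_results").fetchall()
  conn.close()
  report = f"# Agent Report: {objective}\n\n"
  for row in rows:
    status = "PASS" if row[3] == 0 else "FAIL"
    # Truncate each subtask's stdout to 500 characters to keep the email short.
    report += f"## {row[0]}: {row[1]}\nStatus: {status}\n```\n{row[2][:500]}\n```\n\n"
  msg = f"Subject: Agent Complete - {objective}\n\n{report}"
  with smtplib.SMTP("smtp.gmail.com", 587) as server:
    server.starttls()
    server.login(os.getenv("EMAIL"), os.getenv("EMAIL_PASS"))
    # Encode explicitly: sendmail() encodes str messages as ASCII and fails on other characters.
    server.sendmail(os.getenv("EMAIL"), [to_addr], msg.encode("utf-8"))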

The email arrives asynchronously. No polling, no dashboard, no chat interface. The agent runs, and the result appears in your inbox.
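Hand-assembling the message as a raw string works for plain ASCII reports, but the standard library's `email.message.EmailMessage` handles headers and character encoding for you. The sketch below only builds the message. `build_report_message` and the example addresses are hypothetical.

```python
from email.message import EmailMessage


def build_report_message(sender: str, to_addr: str, objective: str,
                         report: str) -> EmailMessage:
    """Assemble the report email with proper headers and a text/plain body."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = to_addr
    msg["Subject"] = f"Agent Complete - {objective}"
    msg.set_content(report)  # set_content encodes the text as UTF-8 by default
    return msg


msg = build_report_message("agent@example.com", "student@example.com",
                           "Sum the amount column",
                           "# Agent Report\n\nStatus: PASS ✓\n")
print(msg["Subject"])
print(msg.get_content_type())
```

An `EmailMessage` built this way can be handed to `smtplib.SMTP.send_message(msg)` in place of the `sendmail` call.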

Why the Sandbox Matters

The sandbox is not optional. An agent that can write and execute code must be isolated from your system. The Berkeley students used Docker with --network none and a read-only filesystem. This prevents the agent from exfiltrating data, writing malware, or making network calls.

Without a sandbox, your agent is a security vulnerability with a text interface. With a sandbox, it is an auditable worker that can run untrusted code with the damage limited to a disposable container.

The sandbox also forces discipline. If the agent needs data, it must be explicitly provided. If it needs a library, it must be installed in the sandbox image. There is no ambient access to your filesystem, your database, or your API keys.
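As a rough sketch of the isolation described above, the helper below assembles a `docker run` command with networking disabled and a read-only root filesystem, and pipes the generated script in over stdin. The image name, memory cap, and process cap are assumptions, as are the function names.

```python
import subprocess


def docker_command(image: str = "python:3.12-slim") -> list[str]:
    """Build a `docker run` invocation that reads a Python script from stdin."""
    return [
        "docker", "run",
        "--rm",                 # delete the container when it exits
        "-i",                   # keep stdin open so the script can be piped in
        "--network", "none",    # no network access
        "--read-only",          # root filesystem is read-only
        "--memory", "256m",     # assumed memory cap
        "--pids-limit", "64",   # assumed cap on the number of processes
        image,
        "python", "-",          # run the program supplied on stdin
    ]


def run_in_sandbox(code: str, timeout: int = 30) -> subprocess.CompletedProcess:
    """Pipe `code` into the container; requires a local Docker daemon."""
    return subprocess.run(docker_command(), input=code, capture_output=True,
                          text=True, timeout=timeout)


print(" ".join(docker_command()))
```

One caveat: the `timeout` here stops the `docker` client process, and the container itself may keep running, so a production setup would also want to name the container and remove it explicitly on timeout.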

Alternatives

  1. LangChain + LangGraph. A heavier framework that provides the same planner-executor pattern with more built-in tooling. Good for complex workflows but adds dependency overhead.

  2. AutoGen (Microsoft). A multi-agent framework where agents communicate with each other. Useful if you need multiple specialized agents collaborating, but overkill for a single pipeline.

  3. Simple shell scripts with LLM calls. If your task is linear (step A then step B then step C), shell scripts piping JSON between LLM calls are easier to debug than a full agent framework.

  4. No-code agent builders. Bubble, Zapier, and Make have AI steps that approximate this pattern without writing code. Limited flexibility but zero setup.

Decision Summary

If you are asking ChatGPT the same questions every day → build an agent that automates those questions.

If your task has objective success criteria → use a code executor with a sandbox.

If your task needs subjective judgment → keep the human in the loop and use chat for exploration.

If you need results now → run the planner synchronously.

If you can wait → run the agent asynchronously and get the result via email.

If you are still using chat interfaces for production work → you are burning tokens and attention. Switch to autonomous agents.

Q: Is this just AutoGPT? How is it different?
A: AutoGPT was one of the first popular implementations of this pattern, but it had a fatal flaw: it used the GPT-4 context window as its memory store. Every step appended to the prompt, so each call re-sent the full history. Total token cost grew roughly quadratically with the number of steps, and long runs hit the context window limit. The Berkeley approach uses SQLite as external memory. The agent reads and writes to the database, not the prompt. This keeps the cost per step roughly constant, so the total cost grows only linearly with the number of steps.
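The cost difference can be made concrete with a toy model. Assume each step adds a fixed number of tokens and that a growing prompt re-sends everything accumulated so far. Both functions and the 1,000-token step size are illustrative assumptions, not measurements.

```python
def tokens_with_growing_prompt(steps: int, tokens_per_step: int) -> int:
    """Each step re-sends the whole history: k * tokens_per_step at step k."""
    return sum(k * tokens_per_step for k in range(1, steps + 1))


def tokens_with_external_memory(steps: int, tokens_per_step: int) -> int:
    """Each step sends a fixed-size prompt; earlier results live in the database."""
    return steps * tokens_per_step


for n in (5, 20, 80):
    print(n, tokens_with_growing_prompt(n, 1000), tokens_with_external_memory(n, 1000))
```

Under this model a 20-step run sends 210,000 input tokens with a growing prompt versus 20,000 with external memory, and the gap widens with the square of the step count.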

Q: Is it cheaper than ChatGPT Plus?
A: Yes, by a large margin. The Berkeley agent uses OpenRouter's gpt-4o-mini at $0.15/M input tokens. A typical algorithms problem set costs $0.08 to solve. A ChatGPT Plus subscription is $20/month. If you run 10 problem sets per month, the agent costs $0.80. Plus, you retain every output in SQLite for review.
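The arithmetic behind that comparison, worked through with the figures quoted in the answer. The calculation counts input tokens only and ignores output-token pricing, which is a simplifying assumption.

```python
PRICE_PER_M_INPUT = 0.15  # USD per million input tokens, as quoted above
COST_PER_SET = 0.08       # USD per problem set, as quoted above
PLUS_MONTHLY = 20.00      # USD per month for ChatGPT Plus, as quoted above

# Input tokens that $0.08 buys at $0.15 per million.
implied_tokens = COST_PER_SET / PRICE_PER_M_INPUT * 1_000_000
# Monthly agent cost at 10 problem sets.
monthly_agent_cost = 10 * COST_PER_SET
# Number of problem sets at which the agent would cost as much as the subscription.
break_even_sets = PLUS_MONTHLY / COST_PER_SET

print(f"implied input tokens per set: {implied_tokens:,.0f}")
print(f"agent cost for 10 sets: ${monthly_agent_cost:.2f}")
print(f"sets per month before the agent costs $20: {break_even_sets:.0f}")
```

At these prices the agent only becomes more expensive than the subscription at around 250 problem sets per month.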

Q: What if the agent generates incorrect code?
A: It will. The executor retries with the error message, which fixes about 70% of failures. The remaining 30% need human intervention. But here is the difference: when an agent fails, you get a full error trace, the generated code, and the test output. When ChatGPT gives you a wrong answer, you just get wrong text. The agent's failure mode is more informative.

Q: Can this run on a laptop?
A: Yes. The entire system runs on a 2020 MacBook Air. The LLM calls are remote, the code execution is local, and the SQLite database is a file. No GPU required, no cloud credits needed. Docker Desktop handles the sandbox.

Q: Does this violate Berkeley's academic integrity policy?
A: That depends on the course policy. The students who built this used it as a learning tool — they reviewed the output, understood the code, and could explain every decision the agent made. That is different from copy-pasting ChatGPT answers. Most professors distinguish between "automating the output" and "using a tool to generate starter code for review."