Breaking the AI Chatbox: How Berkeley Students Built Real Autonomous Agents

Every AI demo looks the same. A chat window, a text prompt, a response streamed character by character. The chat interface has become the default mental model for interacting with AI. It is also a trap. It reduces the most powerful technology in a generation to a typing interface that trains you to ask questions instead of building solutions.

Last semester, a group of Berkeley CS students noticed this trap. They were using ChatGPT for their algorithms homework, getting perfect answers, and failing their proctored midterms. The chat interface was making them dependent. So they stopped using it. Instead, they built autonomous agents that run in sandboxes, plan their own tasks, execute Python code, store results in SQLite, and email the output. No chat box. No streaming text. No typing prompts.

This is how they built it and why it works better.

The Direct Answer

Chat interfaces train passive consumption. Autonomous agents train active engineering. The Berkeley students built a system where the LLM is not a chat partner but a planning engine. It receives a high-level task, decomposes it into subtasks, executes each subtask via sandboxed Python, stores intermediate results in a local SQLite database, and emails the final output. The agent runs without human intervention. The student reviews the output the same way they would review a colleague's work — critically, with full context.

The Core Architecture

The system has four components that run in sequence:

1. Task Planner. A lightweight LLM call (a low-cost model via OpenRouter, about $0.15 per million input tokens) receives the objective and outputs a JSON array of subtasks. Each subtask has a description, a success criterion, and a dependency list. The planner loop runs until every subtask is marked complete or a maximum iteration count is reached.

2. Code Executor. Each subtask is handed to a code-writing LLM that generates Python scripts. The scripts run inside a Docker sandbox with no network access, no filesystem persistence, and a 30-second timeout. The agent captures stdout, stderr, and return codes. If the script errors, the executor re-prompts the LLM with the error message and retries up to 3 times.

3. SQLite Store. All intermediate results — parsed data, computed values, error logs — are written to a local SQLite database. The agent does not need to remember context between steps. It reads from the database. This is the key insight: the database replaces the chat context window. The agent can reference any past result without re-prompting the LLM.

4. Email Aggregator. When every subtask completes, the agent compiles a Markdown report of all outputs and emails it to the user. The email includes the task objective, the subtask list with completion status, the generated code, and any output files. The user never watches the agent work. They get the result when it is done.
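The control flow behind these four components can be sketched in a few lines. This is a minimal illustration, not the students' code: `run_plan`, the `execute` callable, and the two-step plan are hypothetical names, and the default iteration cap of 10 is an assumption.

```python
from typing import Callable


def run_plan(subtasks: list[dict], execute: Callable[[dict], bool], max_iterations: int = 10) -> dict[str, str]:
    """Run subtasks whose dependencies are complete until all finish or the cap is hit."""
    status = {task["id"]: "pending" for task in subtasks}
    for _ in range(max_iterations):
        # A subtask is ready when it is still pending and every dependency completed.
        ready = [
            task for task in subtasks
            if status[task["id"]] == "pending"
            and all(status.get(dep) == "complete" for dep in task["depends_on"])
        ]
        if not ready:
            break  # everything is finished, failed, or blocked by a failed dependency
        for task in ready:
            status[task["id"]] = "complete" if execute(task) else "failed"
    return status


# Hypothetical two-step plan; the lambda stands in for the real code executor.
plan = [
    {"id": "load", "description": "Parse the input file", "depends_on": []},
    {"id": "sum", "description": "Sum the parsed values", "depends_on": ["load"]},
]
print(run_plan(plan, execute=lambda task: True))
```

A subtask whose dependency failed stays `pending`, which makes blocked work easy to spot in the final report.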

When This Works

This architecture works for any task that can be decomposed into discrete, verifiable subtasks. Data analysis, web scraping, file processing, code generation, report generation, math computation, and algorithm implementation all fit naturally.

It works best when the success criterion for each subtask is objective. "Sum this column and save to a CSV" passes or fails. "Write a compelling introduction" is subjective and needs human evaluation.
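An objective criterion is one a program can check. As a rough sketch, assuming each generated script reports its result as JSON on stdout and exits with code 0 on success (the function name `passed` is hypothetical):

```python
import json


def passed(result: dict) -> bool:
    """Objective check: the script exited cleanly and printed parseable JSON."""
    if result["returncode"] != 0:
        return False
    try:
        json.loads(result["stdout"])
    except json.JSONDecodeError:
        return False
    return True


print(passed({"stdout": '{"total": 42}', "stderr": "", "returncode": 0}))
print(passed({"stdout": "", "stderr": "ZeroDivisionError", "returncode": 1}))
```

A stricter check could compare the parsed value against an expected answer, which is what running test cases amounts to.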

The Berkeley students used it for their CS 170 algorithms problem sets. The agent would receive a problem statement, decompose it into subproblems, implement each algorithm in Python, run the test cases, log results to SQLite, and email the output. They learned more from reviewing the agent's code than they did from reading ChatGPT's answers because the agent's code was structured as engineering output, not chat responses.

When This Does NOT Work

This does not work for creative or open-ended tasks. If the objective is vague — "explore this dataset and find interesting patterns" — the agent will produce generic output that misses the human insight a domain expert would catch.

It also does not work when the task requires subjective judgment. Code reviews, design decisions, and architectural tradeoffs need human reasoning. The agent can generate options but cannot choose correctly without clear, measurable criteria.

It will also fail on tasks that depend on live data from external APIs whose state changes. The agent's plan assumes a static environment. If a website changes between subtask executions, the agent may produce stale results.

Step-by-Step: Build Your Own Agent Sandbox

Here is how to replicate the Berkeley setup in under 100 lines of Python.

Step 1: Set up the planner.

import json, sqlite3, smtplib, subprocess, os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the OpenAI client works with a different base_url.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.getenv("OPENROUTER_KEY"))

def plan_task(objective):
  prompt = f"""Given this objective: {objective}
Output a JSON array of subtasks. Each subtask must have:
- id (unique string)
- description (concrete action)
- depends_on (list of subtask ids that must complete first)
- success_criterion (how to verify completion)

Max 8 subtasks. Output only valid JSON."""
  # OpenRouter model IDs carry a provider prefix.
  resp = client.chat.completions.create(model="openai/gpt-4o-mini", messages=[{"role": "user", "content": prompt}], temperature=0.1)
  # Models often wrap JSON in a Markdown code fence; strip it before parsing.
  fence = "`" * 3
  text = resp.choices[0].message.content.strip()
  text = text.removeprefix(fence + "json").removeprefix(fence).removesuffix(fence).strip()
  return json.loads(text)

The planner returns a structured plan. Each subtask knows what it depends on and how to verify success.
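LLM output does not always follow the prompt, so it can help to validate the plan before executing anything. The sketch below is one possible check, not part of the original setup; `validate_plan` and `REQUIRED_KEYS` are hypothetical names, and the rules simply mirror the planner prompt.

```python
import json

REQUIRED_KEYS = {"id", "description", "depends_on", "success_criterion"}


def validate_plan(raw_json: str, max_subtasks: int = 8) -> list[dict]:
    """Parse planner output and reject plans that break the prompt's rules."""
    plan = json.loads(raw_json)
    if not isinstance(plan, list) or not 1 <= len(plan) <= max_subtasks:
        raise ValueError(f"plan must be a JSON array of 1 to {max_subtasks} subtasks")
    seen = set()
    for task in plan:
        if not isinstance(task, dict):
            raise ValueError("each subtask must be a JSON object")
        missing = REQUIRED_KEYS - set(task)
        if missing:
            raise ValueError(f"subtask is missing keys: {sorted(missing)}")
        if task["id"] in seen:
            raise ValueError(f"duplicate subtask id: {task['id']}")
        seen.add(task["id"])
    # Every dependency must point at a subtask that exists in this plan.
    for task in plan:
        unknown = set(task["depends_on"]) - seen
        if unknown:
            raise ValueError(f"subtask {task['id']} depends on unknown ids: {sorted(unknown)}")
    return plan


sample = '[{"id": "a", "description": "Parse input", "depends_on": [], "success_criterion": "JSON printed"}]'
print(len(validate_plan(sample)))
```

Rejecting a malformed plan up front is cheaper than discovering a missing dependency halfway through a run.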

Step 2: Execute subtasks in order.

def execute_subtask(subtask):
  fence = "`" * 3  # Markdown code fence
  code_prompt = f"""Write a Python script that accomplishes this: {subtask['description']}
The script must print its result as JSON to stdout.
Handle errors gracefully. Timeout after 30 seconds.
Output only the code inside a {fence}python block."""
  resp = client.chat.completions.create(model="openai/gpt-4o-mini", messages=[{"role": "user", "content": code_prompt}], temperature=0.2)
  code = resp.choices[0].message.content
  # Keep only the body of the first python block; fall back to the raw reply.
  code = code.split(fence + "python")[1].split(fence)[0] if fence + "python" in code else code
  try:
    result = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=30)
  except subprocess.TimeoutExpired:
    # Report a timeout as a failed run instead of crashing the agent.
    return {"stdout": "", "stderr": "timed out after 30 seconds", "returncode": -1}
  return {"stdout": result.stdout, "stderr": result.stderr, "returncode": result.returncode}

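The executor runs in a local subprocess here; in production, run it in a Docker container instead. It captures stdout, stderr, and the return code. This snippet makes a single attempt, so the retry-on-error behavior described earlier needs a loop around it.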
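The retry behavior described for the Code Executor can be sketched as a loop that feeds the error message back into the next prompt. This is an illustration under assumptions: `run_with_retries`, `generate`, and `run` are hypothetical names, and the two fake functions stand in for the LLM call and the sandbox so the loop can be tried offline.

```python
from typing import Callable


def run_with_retries(
    description: str,
    generate: Callable[[str], str],
    run: Callable[[str], dict],
    max_retries: int = 3,
) -> dict:
    """Generate code, run it, and re-prompt with stderr until it succeeds or retries run out."""
    prompt = description
    result = {"stdout": "", "stderr": "not run", "returncode": -1}
    for attempt in range(1 + max_retries):  # one initial try plus up to max_retries retries
        code = generate(prompt)
        result = run(code)
        result["attempts"] = attempt + 1
        if result["returncode"] == 0:
            break
        prompt = (
            f"{description}\n\nThe previous script failed with this error:\n"
            f"{result['stderr']}\n\nPrevious script:\n{code}\n\nReturn a corrected script."
        )
    return result


# Stand-ins for the LLM call and the sandbox.
def fake_generate(prompt: str) -> str:
    return "print(1)" if "failed" in prompt else "print(1/0)"


def fake_run(code: str) -> dict:
    ok = "1/0" not in code
    return {"stdout": "1\n" if ok else "", "stderr": "" if ok else "ZeroDivisionError", "returncode": 0 if ok else 1}


print(run_with_retries("print the number one", fake_generate, fake_run)["attempts"])
```

Passing `generate` and `run` as arguments keeps the loop testable without network access or Docker.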

Step 3: Store in SQLite.

# Illustrative objective; replace it with your own task.
objective = "Implement binary search in Python and run it against three test cases"
subtasks = plan_task(objective)

db = sqlite3.connect("agent_results.db")
db.execute("CREATE TABLE IF NOT EXISTS subtask_results (id TEXT, description TEXT, stdout TEXT, stderr TEXT, returncode INT, completed_at TEXT)")
for subtask in subtasks:
  result = execute_subtask(subtask)
  db.execute("INSERT INTO subtask_results VALUES (?, ?, ?, ?, ?, datetime('now'))", (subtask["id"], subtask["description"], result["stdout"], result["stderr"], result["returncode"]))
db.commit()

The database is the agent's memory. Any subtask can query previous results by reading from the table.
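One way a subtask might read earlier results is to look up the rows for its dependencies and pass only those into its prompt. The sketch below assumes the `subtask_results` schema shown above; `dependency_context` and the 500-character cap are hypothetical choices.

```python
import sqlite3


def dependency_context(db: sqlite3.Connection, subtask: dict, max_chars: int = 500) -> str:
    """Collect stdout from the successful subtasks this one depends on."""
    deps = subtask.get("depends_on", [])
    if not deps:
        return ""
    placeholders = ", ".join("?" for _ in deps)
    rows = db.execute(
        f"SELECT id, stdout FROM subtask_results WHERE id IN ({placeholders}) AND returncode = 0 ORDER BY rowid",
        deps,
    ).fetchall()
    return "\n".join(f"Result of {row_id}: {stdout[:max_chars]}" for row_id, stdout in rows)


# In-memory database with the same schema, for demonstration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE subtask_results (id TEXT, description TEXT, stdout TEXT, stderr TEXT, returncode INT, completed_at TEXT)")
db.execute(
    "INSERT INTO subtask_results VALUES (?, ?, ?, ?, ?, datetime('now'))",
    ("load", "Parse the input file", '{"rows": 3}', "", 0),
)
print(dependency_context(db, {"id": "sum", "depends_on": ["load"]}))
```

Only the dependencies' output enters the prompt, so prompt size depends on how many inputs a subtask needs, not on how long the agent has been running.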

Step 4: Email the report.

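from email.message import EmailMessage

def email_report(to_addr, objective, db_path):
  fence = "`" * 3  # Markdown code fence around each subtask's output
  db = sqlite3.connect(db_path)  # open the database passed in rather than relying on a global
  rows = db.execute("SELECT id, description, stdout, returncode FROM subtask_results").fetchall()
  db.close()
  report = f"# Agent Report: {objective}\n\n"
  for row in rows:
    report += f"## {row[0]}: {row[1]}\nStatus: {'PASS' if row[3] == 0 else 'FAIL'}\n{fence}\n{row[2][:500]}\n{fence}\n\n"
  # EmailMessage builds valid headers and handles non-ASCII output.
  msg = EmailMessage()
  msg["Subject"] = f"Agent Complete - {objective}"
  msg["From"] = os.getenv("EMAIL")
  msg["To"] = to_addr
  msg.set_content(report)
  with smtplib.SMTP("smtp.gmail.com", 587) as server:
    server.starttls()
    server.login(os.getenv("EMAIL"), os.getenv("EMAIL_PASS"))
    server.send_message(msg)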

The email arrives asynchronously. No polling, no dashboard, no chat interface. The agent runs, and the result appears in your inbox.

Why the Sandbox Matters

The sandbox is not optional. An agent that can write and execute code must be isolated from your system. The Berkeley students used Docker with --network none and a read-only filesystem. This prevents the agent from exfiltrating data, writing malware, or making network calls.

Without a sandbox, your agent is a security vulnerability with a text interface. With a sandbox, it is a contained, auditable worker that can run untrusted code with far less risk to the host.

The sandbox also forces discipline. If the agent needs data, it must be explicitly provided. If it needs a library, it must be installed in the sandbox image. There is no ambient access to your filesystem, your database, or your API keys.
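A container launch along these lines might look like the sketch below. The `--network none` and read-only settings come from the description above; the image name, resource limits, and function names are assumptions, not the students' actual configuration.

```python
import subprocess


def docker_command(image: str = "python:3.12-slim") -> list[str]:
    """Build a docker run command for an isolated, throwaway Python sandbox."""
    return [
        "docker", "run",
        "--rm",                 # remove the container when it exits
        "-i",                   # keep stdin open so the script can be piped in
        "--network", "none",    # no network access
        "--read-only",          # root filesystem is read-only
        "--tmpfs", "/tmp",      # in-memory scratch space, discarded with the container
        "--memory", "256m",     # assumed memory cap
        "--cpus", "1",          # assumed CPU cap
        "--pids-limit", "64",   # assumed process cap, limits fork bombs
        image,
        "python", "-",          # read the script from stdin
    ]


def run_in_sandbox(code: str, timeout: int = 30) -> dict:
    """Run generated code in the container and capture its output."""
    try:
        proc = subprocess.run(docker_command(), input=code, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": f"timed out after {timeout} seconds", "returncode": -1}
    return {"stdout": proc.stdout, "stderr": proc.stderr, "returncode": proc.returncode}


print(" ".join(docker_command()))
```

One caveat: the `timeout` here stops the local `docker` client, and the container itself may keep running. A production setup would likely name the container and stop it explicitly.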

Alternatives

LangChain + LangGraph. A heavier framework that provides the same planner-executor pattern with more built-in tooling. Good for complex workflows but adds dependency overhead.
AutoGen (Microsoft). A multi-agent framework where agents communicate with each other. Useful if you need multiple specialized agents collaborating, but overkill for a single pipeline.
Simple shell scripts with LLM calls. If your task is linear (step A then step B then step C), shell scripts piping JSON between LLM calls are easier to debug than a full agent framework.
No-code agent builders. Bubble, Zapier, and Make have AI steps that approximate this pattern without writing code. Limited flexibility but zero setup.

Decision Summary

If you are asking ChatGPT the same questions every day → build an agent that automates those questions.

If your task has objective success criteria → use a code executor with a sandbox.

If your task needs subjective judgment → keep the human in the loop and use chat for exploration.

If you need results now → run the planner synchronously.

If you can wait → run the agent asynchronously and get the result via email.

If you are still using chat interfaces for production work → you are burning tokens and attention. Switch to autonomous agents.

Q: Is this just AutoGPT? How is it different?
A: AutoGPT was the first popular implementation of this pattern, but it had a fatal flaw: it used the GPT-4 context window as its memory store. Every step appended to the prompt. Because each step re-sends everything before it, cumulative token cost grows roughly quadratically with the number of steps, and long runs eventually hit the context window limit. The Berkeley approach uses SQLite as external memory. The agent reads and writes to the database, not the prompt. This keeps the per-step cost roughly flat however many steps the agent runs.
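The difference is easy to see with arithmetic. If every step is appended to the prompt, step k re-sends everything before it, so total input tokens grow with the square of the step count; with external memory each step sends a prompt of roughly constant size. The 1,000 tokens per step below is a made-up figure for illustration.

```python
def total_prompt_tokens(steps: int, tokens_per_step: int, append_history: bool) -> int:
    """Total input tokens sent across all steps of a run."""
    if append_history:
        # Step k re-sends the k-1 earlier steps plus its own prompt.
        return sum(k * tokens_per_step for k in range(1, steps + 1))
    # With external memory, each step sends a prompt of roughly constant size.
    return steps * tokens_per_step


for steps in (5, 20, 50):
    appended = total_prompt_tokens(steps, 1_000, append_history=True)
    external = total_prompt_tokens(steps, 1_000, append_history=False)
    print(f"{steps} steps: {appended:,} tokens appended vs {external:,} tokens with external memory")
```

At 50 steps the appended-history run sends 1,275,000 input tokens against 50,000, a 25.5x difference under these assumptions.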

Q: Is it cheaper than ChatGPT Plus?
A: Yes, by a large margin. The Berkeley agent uses OpenRouter's gpt-4o-mini at $0.15/M input tokens. A typical algorithms problem set costs $0.08 to solve. A ChatGPT Plus subscription is $20/month. If you run 10 problem sets per month, the agent costs $0.80. Plus, you retain every output in SQLite for review.
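The arithmetic behind an estimate like this can be sketched as follows. The input price is the one quoted above; the output price and the token counts are assumptions chosen only to illustrate the calculation, so check current pricing before relying on them.

```python
INPUT_PRICE_PER_M = 0.15   # USD per million input tokens, as quoted above
OUTPUT_PRICE_PER_M = 0.60  # USD per million output tokens; assumed, check current pricing


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one agent run at the prices above."""
    return input_tokens / 1e6 * INPUT_PRICE_PER_M + output_tokens / 1e6 * OUTPUT_PRICE_PER_M


# Hypothetical token counts for one problem set.
per_set = run_cost(input_tokens=200_000, output_tokens=80_000)
print(f"per problem set: ${per_set:.3f}")
print(f"10 sets per month: ${10 * per_set:.2f}")
```

With these illustrative counts a problem set comes to about $0.078, in line with the $0.08 figure, and output tokens account for more of the cost than input tokens.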

Q: What if the agent generates incorrect code?
A: It will. The executor retries with the error message, which fixes about 70% of failures. The remaining 30% need human intervention. But here is the difference: when an agent fails, you get a full error trace, the generated code, and the test output. When ChatGPT gives you a wrong answer, you just get wrong text. The agent's failure mode is more informative.

Q: Can this run on a laptop?
A: Yes. The entire system runs on a 2020 MacBook Air. The LLM calls are remote, the code execution is local, and the SQLite database is a file. No GPU required, no cloud credits needed. Docker Desktop handles the sandbox.

Q: Does this violate Berkeley's academic integrity policy?
A: That depends on the course policy. The students who built this used it as a learning tool — they reviewed the output, understood the code, and could explain every decision the agent made. That is different from copy-pasting ChatGPT answers. Most professors distinguish between "automating the output" and "using a tool to generate starter code for review."
