No More Manual Test Writing: How I Used Gemma 4 to Turn a GitHub Repo Into a Full Test Suite 🎯

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

I built Scriptless.ai — an AI-powered testing workspace that lets you go from "I have a GitHub repo" to "I have running browser tests with failure analysis" without writing a single line of test code yourself.

Here's the honest backstory: every team I've seen treats testing as something you do after you're already exhausted from writing the feature. Test coverage slips, QA bottlenecks pile up, and debugging a CI failure at 2 AM means staring at a red checkbox with zero context. Scriptless.ai is my attempt to flip that dynamic entirely.

The core loop is this:

Connect your GitHub repo — OAuth in, pick a repository, done.
Generate test cases with Gemma 4 — The model reads your actual source files and file tree, then produces structured, route-specific test cases (UI, auth, form, API, edge-case, integration) tailored to your exact codebase. Not generic Lorem Ipsum tests — real ones that know your routes and files.
Execute in a real cloud browser — Each test case gets turned into a Playwright script and run in a live Browserbase session. Real browser, real network, real DOM.
Understand failures visually — When a test fails, Gemma 4's vision capability analyzes a screenshot of the page at the moment of failure, tells you what it sees, what likely went wrong, and what to fix. No more blind log archaeology.
Control everything with your voice — A Speechmatics-powered voice command layer lets you say "run failed tests" or "show only passing" and the UI responds. Hands-free QA.

It's a full-stack product: Next.js frontend, Neon Postgres + Drizzle for persistence, Clerk for auth, Stripe scaffold for monetization, and Vercel for deployment. Credits are tracked per user, deducted on generation and execution, and the billing infrastructure is wired up and ready for subscription plans.

Demo

🚀 Live app: https://scriptless-ai.vercel.app/

Sign in with Clerk → authorize GitHub → connect a repo → hit Generate Tests and watch Gemma 4 do its thing.

What to try first:

Connect any public Next.js or React repo you own
Click Generate Test Cases and see how the model structures tests specific to your file tree
Run a test — the Browserbase session spins up a real Chrome instance
If it fails, scroll to the Vision Analysis section to see Gemma 4's diagnosis

Code

📦 GitHub: https://github.com/Arjunhg/scriptless-ai

Key files worth exploring:

File	What it does
`lib/featherless/client.ts`	Featherless client config, model aliases pointing to `gemma-4-31B-it`
`lib/featherless/generateTests.ts`	Calls Gemma 4 with tool-calling to produce structured test cases
`lib/featherless/analyzeScreenshot.ts`	Sends failure screenshots to Gemma 4 vision for diagnosis
`lib/featherless/prompts/testGeneration.ts`	The system prompt that turns Gemma into a QA engineer
`app/api/generate-test-cases/route.ts`	API route that reads GitHub files → calls Gemma → saves to DB
`app/api/test-cases/run/route.ts`	Execution pipeline: script gen → Browserbase → vision fallback
`hooks/useVoiceCommands.ts`	Speechmatics real-time pipeline + utterance finalization logic
`lib/speechmatics/commandParser.ts`	Token-based NLP command parser (stemming, stopwords, synonyms)
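To give a feel for the token-based approach in `commandParser.ts`, here is a minimal sketch of how an utterance like "run failed tests" could be reduced to a command. It is not the code from the repo: the `VoiceCommand` type, the word lists, and the `parseCommand` function are all assumptions made for illustration.

```typescript
// Hypothetical command parser sketch; word lists and command names are assumptions.
type VoiceCommand = { action: "run" | "show"; filter: "failed" | "passed" | "all" };

const STOPWORDS = new Set(["the", "only", "please", "me", "my", "test"]);

const SYNONYMS: Record<string, string> = {
  execute: "run", rerun: "run", start: "run",
  display: "show", list: "show",
  fail: "failed", failing: "failed", broken: "failed",
  pass: "passed", passing: "passed", green: "passed",
};

// Very light stemming: strip a plural "s" so "tests" matches "test".
function stem(token: string): string {
  const plural = token.length > 3 && token.endsWith("s") && !token.endsWith("ss");
  return plural ? token.slice(0, -1) : token;
}

// Lowercase, tokenize, stem, map synonyms, then drop stopwords.
function normalize(utterance: string): string[] {
  return utterance
    .toLowerCase()
    .replace(/[^a-z\s]/g, " ")
    .split(/\s+/)
    .filter(Boolean)
    .map(stem)
    .map((token) => SYNONYMS[token] ?? token)
    .filter((token) => !STOPWORDS.has(token));
}

// Returns null when no known action word is present.
function parseCommand(utterance: string): VoiceCommand | null {
  const tokens = normalize(utterance);
  const filter = tokens.includes("failed")
    ? "failed"
    : tokens.includes("passed")
      ? "passed"
      : "all";
  if (tokens.includes("run")) return { action: "run", filter };
  if (tokens.includes("show")) return { action: "show", filter };
  return null;
}
```

With these word lists, `parseCommand("run failed tests")` yields a `run` action filtered to failed tests, and `parseCommand("show only passing")` yields a `show` action filtered to passing tests.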

How I Used Gemma 4

I'm using google/gemma-4-31B-it (the 31B dense instruction-tuned variant), accessed through Featherless AI's OpenAI-compatible API. Gemma 4 powers two distinct use cases in this app:

1. 🧠 Test Case Generation (Text + Tool Calling)

When a user clicks "Generate Test Cases," the backend fetches their repo's file tree and a filtered set of source files from GitHub, then sends everything to Gemma 4 with a structured system prompt and a tool call definition for submit_test_cases.

// lib/featherless/client.ts
export const FEATHERLESS_TEXT_MODEL = "google/gemma-4-31B-it";

// lib/featherless/generateTests.ts
const response = await featherlessClient.chat.completions.create({
  model: FEATHERLESS_TEXT_MODEL,
  messages: [...messages],
  tools: [TEST_CASE_TOOL_DEFINITION],
  // Force the model to answer through the submit_test_cases tool instead of free text
  tool_choice: { type: "function", function: { name: "submit_test_cases" } },
});
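The `messages` array is not shown in the snippet above. As a rough sketch of how the file tree and a filtered set of source files might be packed into it, consider the following. The extension list, skipped directories, character budget, and the `buildMessages` name are assumptions, not the repo's actual logic.

```typescript
// Hypothetical prompt assembly; filters and the character budget are assumptions.
interface RepoFile {
  path: string;
  content: string;
}

type ChatMessage = { role: "system" | "user"; content: string };

const SOURCE_EXTENSIONS = [".ts", ".tsx", ".js", ".jsx"];
const SKIPPED_SEGMENTS = ["node_modules/", ".next/", "dist/"];

// Keep application source files and skip dependencies and build output.
function isRelevantSource(path: string): boolean {
  return (
    SOURCE_EXTENSIONS.some((ext) => path.endsWith(ext)) &&
    !SKIPPED_SEGMENTS.some((segment) => path.includes(segment))
  );
}

function buildMessages(
  systemPrompt: string,
  fileTree: string[],
  files: RepoFile[],
  maxChars = 16000,
): ChatMessage[] {
  const sections: string[] = [];
  let used = 0;
  for (const file of files.filter((f) => isRelevantSource(f.path))) {
    const section = `--- ${file.path} ---\n${file.content}`;
    if (used + section.length > maxChars) break; // stay within a rough prompt budget
    sections.push(section);
    used += section.length;
  }
  const user =
    `File tree:\n${fileTree.join("\n")}\n\n` +
    `Source files:\n${sections.join("\n\n")}`;
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: user },
  ];
}
```

The budget is counted in characters here only because it is simple; a real implementation would more likely count tokens.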

The model returns a JSON payload with 5–8 test cases, each with a title, description, type (ui/auth/form/api/integration/edge-case), priority, targetRoute, targetFiles, and expectedResult. Because the prompt includes the actual file tree, the model can ground targetRoute and targetFiles in paths that really exist, which makes hallucinated routes or files much less likely, though not impossible.
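The post names the tool and lists its fields but does not show `TEST_CASE_TOOL_DEFINITION` itself. A plausible shape, in the OpenAI-style function-tool format, might look like the sketch below. The field names come from the description above; the `testCases` wrapper key, the field types, the priority values, and the `minItems`/`maxItems` bounds are assumptions.

```typescript
// Hypothetical shape of TEST_CASE_TOOL_DEFINITION; the real definition may differ.
const TEST_CASE_TYPES = ["ui", "auth", "form", "api", "integration", "edge-case"] as const;

const TEST_CASE_TOOL_DEFINITION = {
  type: "function" as const,
  function: {
    name: "submit_test_cases",
    description: "Submit the generated test cases for the repository.",
    parameters: {
      type: "object",
      properties: {
        testCases: {
          type: "array",
          minItems: 5, // mirrors the 5–8 range mentioned in the post
          maxItems: 8,
          items: {
            type: "object",
            properties: {
              title: { type: "string" },
              description: { type: "string" },
              type: { type: "string", enum: [...TEST_CASE_TYPES] },
              priority: { type: "string", enum: ["high", "medium", "low"] }, // assumed values
              targetRoute: { type: "string" },
              targetFiles: { type: "array", items: { type: "string" } },
              expectedResult: { type: "string" },
            },
            required: [
              "title", "description", "type", "priority",
              "targetRoute", "targetFiles", "expectedResult",
            ],
          },
        },
      },
      required: ["testCases"],
    },
  },
};
```

Listing every field under `required` gives the model less room to return partial objects.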

I chose 31B over the smaller variants because structured output fidelity matters here. A smaller model tends to drift from the tool-call schema or produce partial JSON, especially on larger repos with complex file trees. The 31B model is reliably structured even with multi-thousand-token inputs.
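Even with a reliable model, it seems prudent to validate the tool-call arguments before saving anything to the database. The sketch below shows one way to do that without extra dependencies. The `parseTestCases` function, the `testCases` wrapper key, and the choice to return `null` on unusable output are assumptions for illustration, not the repo's implementation.

```typescript
// Hypothetical validator for the arguments string of a submit_test_cases tool call.
interface GeneratedTestCase {
  title: string;
  description: string;
  type: string;
  priority: string;
  targetRoute: string;
  targetFiles: string[];
  expectedResult: string;
}

const VALID_TYPES: readonly string[] = [
  "ui", "auth", "form", "api", "integration", "edge-case",
];

function isTestCase(value: unknown): value is GeneratedTestCase {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringFields = ["title", "description", "priority", "targetRoute", "expectedResult"];
  return (
    stringFields.every((key) => typeof v[key] === "string") &&
    typeof v.type === "string" &&
    VALID_TYPES.includes(v.type) &&
    Array.isArray(v.targetFiles) &&
    v.targetFiles.every((file) => typeof file === "string")
  );
}

// Returns the well-formed test cases, or null when the payload is unusable
// (malformed JSON, missing wrapper key, no valid items) so the caller can retry.
function parseTestCases(rawArguments: string): GeneratedTestCase[] | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return null; // partial or malformed JSON
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const cases = (parsed as Record<string, unknown>).testCases; // assumed wrapper key
  if (!Array.isArray(cases)) return null;
  const valid = cases.filter(isTestCase);
  return valid.length > 0 ? valid : null;
}
```

Dropping malformed items while keeping the valid ones is one design choice; rejecting the whole payload and retrying would be stricter.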

2. 👁️ Failure Vision Analysis (Multimodal)

When a test run fails, the Playwright script captures a screenshot at the point of failure. That screenshot gets passed to Gemma 4's vision capability:

// lib/featherless/analyzeScreenshot.ts
const response = await featherlessClient.chat.completions.create({
  model, // google/gemma-4-31B-it
  messages: [{
    role: "user",
    content: [
      { type: "text", text: `You are a QA engineer analyzing a browser test failure screenshot.\n\nTest case: ${testDescription}\n\nDescribe what is visible on the page, what likely went wrong, and suggest concrete next steps...` },
      { type: "image_url", image_url: { url: screenshotUrl } },
    ],
  }],
  max_tokens: 512,
});

The result is a 3–5 sentence diagnosis rendered directly in the test result card. Instead of "FAILED — element not found," you get something like: "The page shows a 404 error. The route /dashboard/analytics does not exist yet. The test references a navigation action that was likely removed in a recent commit. Suggest updating the target route or adding a redirect."
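How `screenshotUrl` is produced is not shown in the post. The OpenAI chat format accepts either a hosted image URL or a base64 data URL in `image_url.url`, and Playwright's `page.screenshot()` resolves to a Buffer of image bytes, so one option is to inline the screenshot directly. The sketch below assumes the Featherless endpoint accepts data URLs; the `toImageContentPart` helper is hypothetical.

```typescript
// Hypothetical helper: wraps PNG bytes as an OpenAI-style image_url content part.
// Assumes the provider accepts base64 data URLs in image_url.url.
function toImageContentPart(png: Uint8Array) {
  const base64 = Buffer.from(png).toString("base64");
  return {
    type: "image_url" as const,
    image_url: { url: `data:image/png;base64,${base64}` },
  };
}
```

Inlining avoids hosting the screenshot anywhere, at the cost of a larger request body.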

This is the part that genuinely surprised me during development — Gemma 4's vision analysis is sharp. It doesn't just describe the page, it reasons about why a test targeting that page would fail.

Why 31B Dense?

Tool-calling reliability: Smaller models tend to drift from structured output schemas on long or complex prompts. 31B is consistent.
Long-context reasoning: Test generation prompts can be 3,000–5,000 tokens with file content. The 31B model handles this gracefully.
Vision quality on UI screenshots: Browser screenshots have dense UI elements. The 31B vision model correctly identifies components, error states, and layout issues that smaller models tend to miss or describe too generically.
Two-in-one: Using the same model for both text and vision keeps the integration simple and the behavior predictable across both tasks.

Built solo during the Gemma 4 Challenge. I designed and shipped all of the infrastructure, prompting, voice pipeline, and UI.
