Gemma 4 Has Four Variants. Here's How to Pick the Right One Before You Write a Single Line of Code.
Soumyadeep D · 2026-05-24 · via DEV Community

This is a submission for the Gemma 4 Challenge: Write About Gemma 4


The single most common mistake developers make when picking a local model is choosing based on benchmark scores. The second most common mistake is choosing based on what fits in VRAM.

Both of those things matter. But neither one is the actual first question.

The actual first question is: where does your model need to live, and what does it need to do there?

Gemma 4 ships in four variants - E2B, E4B, 26B A4B (MoE), and 31B - and Google made very deliberate architectural choices for each one. If you understand those choices, picking the right variant takes about five minutes. If you skip that step and benchmark-shop, you'll end up either underbuilding (a 128K-context E4B doing work that needs 256K context) or overbuilding (a 31B model sitting on $80/month of cloud compute when an E4B running locally would have been fine).

This post is that five-minute decision guide.


What Gemma 4 Actually Is

Released on April 2, 2026 under Apache 2.0, Gemma 4 is Google DeepMind's latest open-weight model family. Every variant ships with multimodal understanding (text + image as baseline, audio natively on the two smallest models), native function calling, and support for over 140 languages.

The headline capability that separates Gemma 4 from previous generations isn't any single feature. It's the intelligence-per-parameter ratio. The 26B MoE model only activates roughly 4B parameters per forward pass. The E4B runs on a phone. The 31B scores 89.2% on AIME 2026 math benchmarks - a score that would have required a model several times larger just a year ago.

The architecture decisions that make this possible:

  • Alternating local/global attention layers (local layers use sliding windows of 512-1024 tokens, global layers handle long-range context)
  • Per-Layer Embeddings (PLE) on the edge variants, which keeps the parameter count low while maintaining expressivity
  • Mixture-of-Experts on the 26B that routes each token through only the relevant expert layers, not the full network
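
To make the local/global split concrete, here is a minimal sketch of the two attention masks. It is an illustration of the general sliding-window idea, not Gemma 4's actual implementation: the function names and the toy window of 3 are assumptions chosen so the result is easy to check by hand.

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean matrix; True means query i may attend to key j.

    window=None models a global causal layer. An integer models a local
    (sliding-window) causal layer that only sees the most recent `window` positions.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i                      # causal: never look ahead
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs a layer actually has to score."""
    return sum(sum(row) for row in mask)


seq_len = 8
local_layer = causal_mask(seq_len, window=3)   # toy window; the post cites 512-1024 tokens
global_layer = causal_mask(seq_len)            # full causal attention

print(attended_pairs(local_layer))   # 21
print(attended_pairs(global_layer))  # 36
```

The local layer's cost grows linearly with sequence length (about `window` pairs per token), while the global layer's grows quadratically. Interleaving the two is what keeps long contexts affordable.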

This isn't just efficiency for efficiency's sake. It's what allows a 4-billion-parameter model to run offline on an Android phone with 4GB of RAM while still having a 128K context window. That combination didn't exist before.


The Four Variants, Actually Explained

Gemma 4 E2B - The Phone Model

~2.3B effective parameters, ~5.1B total with PLE, 35 layers, 128K context

This is the model you reach for when the edge is the deployment target. It runs on Android 12+ via Google AICore, on Raspberry Pi, and on Jetson devices. It supports text, image, and audio natively.

The "E" in the name stands for effective - because PLE means the model has more total parameters than it activates per forward pass, similar to how MoE works at a different level of the architecture. The practical result is a 1.5GB footprint with capabilities that land well above what a raw 2B parameter count would suggest.

Use E2B when: you're building a mobile app, an edge inference pipeline, a device-local assistant, or anything where network latency or data privacy makes sending requests to a remote API unacceptable.

Real use case: a receipt-scanning expense tracker that runs fully offline, reads image input, parses line items, and categorizes spending - all on device, no API call, no data leaving the phone.

# Running E2B locally with transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-E2B-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "Extract the total amount and vendor name from this receipt text: ..."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the model-turn marker so generation starts a reply
    return_tensors="pt",
    return_dict=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)



Gemma 4 E4B - The Laptop Model

~4.5B effective parameters, ~8B total, 42 layers, 128K context

This is the everyday workhorse for developers who want to run a capable model locally without dedicated GPU hardware. It runs comfortably on a MacBook with 16GB unified memory, on a mid-range laptop with an integrated GPU, and on any machine where you'd rather not spin up a cloud instance.

The jump from E2B to E4B isn't just more parameters. The additional layers and parameter budget give it noticeably better instruction following, more reliable structured output, and stronger performance on tasks that require holding context across a long conversation.

It supports the same text, image, and audio modalities as E2B, which makes it genuinely multimodal in a way that matters for developer tooling - you can feed it screenshots, diagrams, or audio transcripts as part of a pipeline without needing a separate vision model.

Use E4B when: local inference is the requirement, your hardware doesn't have a discrete GPU, or you're prototyping something you'll later scale to a larger model and want fast iteration cycles.

Real use case: a local code review tool that takes a screenshot of your editor alongside the diff, understands both, and gives context-aware feedback - all running on your laptop, no telemetry.

# Quick Ollama setup for E4B (easiest local path)
# After installing Ollama: https://ollama.com

# In terminal:
# ollama pull gemma4:e4b

import ollama

response = ollama.chat(
    model="gemma4:e4b",
    messages=[
        {
            "role": "user",
            "content": "Review this function for edge cases and suggest improvements:",
        }
    ],
    options={
        "temperature": 0.3,
        "num_ctx": 8192  # can go up to 128K
    }
)

print(response["message"]["content"])



Gemma 4 26B A4B (MoE) - The Consumer GPU Model

25.2B total parameters, ~3.8B active per forward pass, ~30 layers, 256K context

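This is the one that makes the architecture story interesting. The 26B MoE sounds like it needs 26 billion parameters' worth of compute. It doesn't. Only about 4 billion parameters activate for each token, so per-token compute is closer to that of a 4B dense model. All 25.2B parameters still have to be resident in memory, though: that is roughly 50GB at BF16, so a single 24GB RTX 3090 or RTX 4090 needs 4-bit quantization (about 13GB of weights) to hold it. In exchange you get quality that competes with much larger dense models.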
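
A rough sketch of what "only some parameters activate per token" means in practice. This is generic top-k expert routing, not Gemma 4's actual router: the toy scalar experts, the router scores, and `top_k=2` are all assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Pick the top_k experts for one token and renormalise their weights to sum to 1."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(x, experts, router_scores, top_k=2):
    """Weighted sum of the chosen experts' outputs; the other experts never run."""
    weights = route_token(router_scores, top_k)
    output = sum(w * experts[i](x) for i, w in weights.items())
    return output, sorted(weights)


# Eight toy "experts", each just a scalar function (hypothetical stand-ins for FFN blocks)
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]
router_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

output, active = moe_layer(3.0, experts, router_scores, top_k=2)
print(active)  # [1, 4]
```

Only two of the eight experts are evaluated for this token, which is where the compute saving comes from. Every expert's weights still have to sit in memory, because the next token may be routed somewhere else.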

The jump to 256K context window is significant for developers. At 128K you can fit roughly a medium-sized codebase or a very long document. At 256K you're fitting large repositories, multi-document research contexts, or full conversation histories in customer-facing applications.
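
Before committing to a variant on context size alone, a quick budget check helps. The sketch below assumes roughly 4 characters per token, which is a common rule of thumb rather than a property of the Gemma tokenizer, and treats 128K and 256K as 128 x 1024 and 256 x 1024 tokens. The helper names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, real ratio depends on tokenizer and content


def estimate_tokens(text):
    """Crude token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(texts, context_window, reserve_for_output=4096):
    """Return (estimated prompt tokens, whether they fit after reserving room for the reply)."""
    needed = sum(estimate_tokens(t) for t in texts)
    budget = context_window - reserve_for_output
    return needed, needed <= budget


# Two placeholder documents standing in for a ~700K-character repository
docs = ["x" * 400_000, "y" * 300_000]

for window in (128 * 1024, 256 * 1024):
    needed, ok = fits_in_context(docs, window)
    print(f"window={window} needed~{needed} fits={ok}")
```

For a real decision, swap the heuristic for the model's own tokenizer and count the actual files you plan to send.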

The MoE architecture also means that quality degrades more gracefully with quantization than a dense model of equivalent total parameters would. INT4 at 26B MoE looks better than INT4 at a comparable dense model.

Use 26B A4B when: you have a consumer GPU (24GB VRAM), need 256K context, and want near-flagship quality without flagship hardware costs. Also the right choice for anything agentic where the model needs to reason across large amounts of context to plan multi-step tasks.

Real use case: an agentic document processor that ingests a full legal contract (or a full codebase) in a single prompt, reasons across the entire document, and extracts structured data or answers specific questions - running locally on a 4090.

# Using the Gemma 4 26B with native function calling
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "google/gemma-4-26B-A4B-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # 4-bit weights via bitsandbytes so the model fits on 24GB
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

# Native function calling - define your tools as JSON Schema function specs
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_contracts",
            "description": "Search the contract database by clause type or party name",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "clause_type": {
                        "type": "string",
                        "enum": ["liability", "termination", "payment", "IP"],
                        "description": "Type of clause to filter by"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

messages = [
    {
        "role": "user",
        "content": "Find all termination clauses across the Q1 vendor contracts and summarize the notice periods."
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,  # append the model-turn marker so generation starts a reply
    return_tensors="pt",
    return_dict=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
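
The model only proposes a tool call. Your code still has to run it and hand the result back. The sketch below shows one way to do that dispatch step. It assumes the call has already been parsed into a `{"name": ..., "arguments": {...}}` dict (the raw output format depends on the model's chat template, so parsing is left out), and both the `search_contracts` stub and the `"tool"` role name are hypothetical placeholders.

```python
import json


def search_contracts(query, clause_type=None):
    """Hypothetical stand-in for a real contract search backend."""
    return [{"contract": "example-vendor", "clause_type": clause_type, "matched": query}]


# Map tool names the model may emit to the Python callables that implement them
TOOL_REGISTRY = {"search_contracts": search_contracts}


def dispatch_tool_call(call, registry):
    """Run one parsed tool call and return a JSON-serialisable result or error."""
    name = call.get("name")
    if name not in registry:
        return {"error": f"unknown tool: {name}"}
    try:
        result = registry[name](**call.get("arguments", {}))
    except TypeError as exc:  # wrong or missing argument names
        return {"error": f"bad arguments for {name}: {exc}"}
    return {"name": name, "result": result}


# Assumed shape of an already-parsed tool call
call = {
    "name": "search_contracts",
    "arguments": {"query": "notice period", "clause_type": "termination"},
}

tool_message = {
    "role": "tool",  # assumption: check the model card for the exact role name
    "content": json.dumps(dispatch_tool_call(call, TOOL_REGISTRY)),
}
print(tool_message)
```

Returning errors as data instead of raising lets the model see what went wrong and retry with corrected arguments, which matters in multi-step agentic loops.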



Gemma 4 31B - The Server Model

31 billion dense parameters, 256K context, full multimodal, thinking mode

This is the flagship. Every capability available in the family is present here. Thinking mode (chain-of-thought reasoning) is enabled. Math benchmark scores are serious: 89.2% on AIME 2026, compared to Gemma 3 27B's 20.8% on the same benchmark. It sits at #3 on the Arena open model leaderboard.

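Memory is the constraint. At BF16 the weights alone are about 62GB (31B parameters x 2 bytes), and 4-bit quantization brings that down to roughly 16GB before the KV cache and activations. A single A100 80GB handles it at full precision. Two RTX 4090s (48GB combined) work with tensor parallelism only if the weights are quantized. This is the model you deploy to a server, not run on a laptop.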
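
The weight-memory arithmetic is easy to check yourself: parameters times bits per parameter, divided by 8. The sketch below covers weights only. The 20% headroom for KV cache and activations is an assumed placeholder, since the real overhead depends on context length and batch size.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8


def fits(params_billion, bits_per_param, vram_gb, headroom=0.2):
    """True if weights plus an assumed fractional headroom fit in vram_gb."""
    return weight_memory_gb(params_billion, bits_per_param) * (1 + headroom) <= vram_gb


# Parameter counts as quoted in this post
variants = [("26B A4B", 25.2), ("31B", 31.0)]

for name, params in variants:
    for bits in (16, 4):
        print(f"{name} @ {bits}-bit: {weight_memory_gb(params, bits):.1f} GB of weights")

print(fits(31.0, 16, vram_gb=80))  # single A100 80GB
print(fits(31.0, 16, vram_gb=48))  # 2x RTX 4090 combined
print(fits(25.2, 4, vram_gb=24))   # single RTX 4090
```

Note that the MoE model is sized by its total parameter count here, not its active count: routing saves compute, not weight storage.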

Use 31B when: benchmark quality matters for your application, you need thinking mode for reasoning-heavy tasks, you're building a production service that will handle requests from multiple users, or you need the best math and coding performance available in an open-weight model.

Real use case: a coding assistant API that developers on your team query through a self-hosted endpoint - one 31B instance serving your whole engineering org at a cost that's a fraction of equivalent proprietary API calls.

# Serving 31B with vLLM for production throughput
# pip install vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-31B-it",
    tensor_parallel_size=2,   # shard the model across 2 GPUs
    dtype="bfloat16",         # ~62GB of weights: needs 2x 48GB-class GPUs; 2x RTX 4090 needs a quantized checkpoint
    max_model_len=65536        # 64K for production balance
)

sampling_params = SamplingParams(
    temperature=0.2,
    top_p=0.9,
    max_tokens=2048
)

# Reasoning-heavy request. llm.chat() applies the model's own chat template,
# so no hand-written turn markers are needed.
messages = [
    {
        "role": "user",
        "content": "Think step by step: Given this algorithm, what's the worst-case time complexity and where is the bottleneck?\n\n[your code here]"
    }
]

outputs = llm.chat(messages, sampling_params)
for output in outputs:
    print(output.outputs[0].text)



The Decision Matrix

Here's the five-minute version:

| Situation | Model |
| --- | --- |
| Mobile app, Raspberry Pi, offline-first | E2B |
| Laptop development, no GPU, fast iteration | E4B |
| Consumer GPU (24GB), 256K context needed | 26B A4B MoE |
| Server deployment, best quality, team-serving | 31B |
| Agentic pipeline with many tool calls | 26B A4B MoE (active param efficiency) |
| Math, coding, or reasoning-heavy production | 31B |
| Privacy-sensitive user data, no API calls | E4B or E2B |
| You have an A100 and want the best | 31B |
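
If you would rather encode the matrix than remember it, a small helper along these lines works. It is a sketch of the rules in the table above, with hypothetical target names (`"mobile"`, `"laptop"`, `"consumer_gpu"`, `"server"`) and one added guard: the edge variants are listed at 128K context, so a 256K requirement rules them out.

```python
EDGE_TARGETS = {"mobile": "E2B", "laptop": "E4B"}


def pick_variant(target, needs_256k_context=False, agentic=False, reasoning_heavy=False):
    """Map deployment requirements to a Gemma 4 variant, following the decision matrix."""
    if target in EDGE_TARGETS:
        if needs_256k_context:
            raise ValueError("E2B and E4B are 128K-context models; 256K needs 26B A4B or 31B")
        return EDGE_TARGETS[target]
    if target == "consumer_gpu":
        return "26B A4B"
    if target == "server":
        # Reasoning-heavy work goes to the flagship; tool-call-heavy agents favour the MoE
        if reasoning_heavy:
            return "31B"
        if agentic:
            return "26B A4B"
        return "31B"
    raise ValueError(f"unknown target: {target!r}")


print(pick_variant("mobile"))
print(pick_variant("consumer_gpu", needs_256k_context=True))
print(pick_variant("server", agentic=True))
```

The first question is still where the model lives. The other flags only break ties once the deployment target is fixed.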

The Bigger Thing Happening Here

I want to step back from the specs for a second.

A model that scores 89.2% on a serious math benchmark, supports 256K context, runs multimodal inference, and has native function calling for agentic tasks... is now open-weight, Apache 2.0, and runs on hardware that a developer can actually own.

The E4B running on a laptop with 128K context and audio support isn't a "small model compromise." It's a capability that would have been frontier-level two years ago. The E2B running on a phone offline isn't a demo trick. It's a production-viable deployment target.

What that actually means is that the architectural question of "cloud or local?" is no longer primarily a capability question. It's a cost, latency, and privacy question. And for a lot of applications - the ones where user data is sensitive, where offline availability matters, where API costs compound at scale - local wins.

Gemma 4 doesn't make that argument. It just makes it very hard to argue against.


Getting Started in Under 5 Minutes

The fastest path to running any Gemma 4 variant locally is Ollama:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the variant you want
ollama pull gemma4:e4b     # ~5GB, laptop-ready
ollama pull gemma4:26b     # ~15GB, GPU-ready

# Run it
ollama run gemma4:e4b

# Or use the API directly
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [
    { "role": "user", "content": "Hello, what can you do?" }
  ]
}'


If you want Python with the full transformers ecosystem (function calling, thinking mode, multimodal), the Hugging Face model cards for each variant have complete working examples. Start with google/gemma-4-E4B-it - it's the most accessible entry point and covers most development use cases.
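
If you only need the local Ollama endpoint from Python and want no extra dependencies, the same chat request as the curl call can be sent with the standard library. This is a minimal sketch: the helper names are hypothetical, and it sets `"stream": false` so the server returns a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(model, user_text, url=OLLAMA_CHAT_URL):
    """Build a POST request for Ollama's chat endpoint without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,  # one JSON response instead of streamed chunks
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, user_text, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, user_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]["content"]


# Requires a running Ollama server with the model already pulled:
# print(chat("gemma4:e4b", "Hello, what can you do?"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a server running.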


Quick Note on Licensing

Apache 2.0 means you can use Gemma 4 commercially, modify the weights, build products on top of it, and distribute your derivative work - without paying royalties or asking permission. That is not the case for every "open" model out there, and it matters a lot for anyone building a business on top of local inference.


The right Gemma 4 variant is the one that runs where your users are, fits the hardware you can actually provision, and has enough context to do the task you're designing for. Everything else is optimization.

Start with E4B if you're unsure. Scale up when the task demands it.


Tags: devchallenge gemmachallenge gemma ai machinelearning python opensource