Beyond RAG: Architecting Local Long-Context Pipelines with Gemma 4's 31B Dense Model
Jagadeesh · 2026-05-24 · via DEV Community

Most AI document processing relies heavily on Retrieval-Augmented Generation (RAG). We split data into small chunks, vectorize them, and stitch the retrieved pieces or their summaries back together. RAG is excellent for finding a needle in a haystack, but it breaks down when you need the model to understand the entire haystack at once.

With the release of Gemma 4, and specifically its native 128K context window, we finally have the tools to move away from aggressive chunking.

In this post, I’ll break down why long-context local models change how we design AI pipelines, examine the architectural differences between the Gemma 4 variants, and share a case study of how I utilized the 31B Dense model to process massive, unbroken log files locally.


The Problem: Chunking Destroys Narrative Coherence

Imagine an Operational Command Center (OCC) monitoring a multi-tenant Kubernetes deployment. A massive cascading failure occurs, generating 200 interconnected infrastructure alerts—Kafka backlogs, CPU spikes, and database deadlocks.

If you feed these logs into a standard chunked AI pipeline, it:

  1. Splits the logs into 2,000-token chunks.
  2. Summarizes each chunk independently.
  3. Merges those summaries into a final report.

The problem? Separation of concerns works in code, but not in narrative analysis. The Kafka backlog in chunk 1 is never contextually linked to the database deadlock in chunk 7. You get a sterile list of bullet points, missing the actual root cause that ties the event together.

To solve this, the model must read the entire event timeline in a single prompt.
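
The chunked pipeline described above can be sketched in a few lines. This is a minimal illustration, not the code from the project: `split_into_chunks`, `chunked_report`, and the recording summarizer are hypothetical names, and the 4-characters-per-token figure is a rough heuristic standing in for a real tokenizer. The point is structural: every summarizer call sees only its own slice.

```python
from typing import Callable, List

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not a real tokenizer


def split_into_chunks(text: str, chunk_tokens: int = 2000) -> List[str]:
    """Split text into fixed-size pieces of roughly `chunk_tokens` tokens."""
    size = chunk_tokens * CHARS_PER_TOKEN
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunked_report(text: str, summarize: Callable[[str], str]) -> str:
    """Map-reduce summarization: each chunk is summarized in isolation."""
    chunks = split_into_chunks(text)
    # Each call sees only its own chunk, so an alert in the first chunk
    # cannot be linked to a related alert in a later one.
    partials = [summarize(chunk) for chunk in chunks]
    # The merge step only ever sees the partial summaries, not the raw logs.
    return summarize("\n".join(partials))


# Usage with a stand-in summarizer that records what each call was shown.
seen_inputs: List[str] = []


def recording_summarizer(text: str) -> str:
    seen_inputs.append(text)
    return f"summary of {len(text)} chars"


log = "kafka backlog on broker-1\n" + "cpu spike\n" * 5000 + "db deadlock on orders\n"
chunked_report(log, recording_summarizer)

# No single call ever saw both alerts together.
print(any("kafka" in s and "deadlock" in s for s in seen_inputs))  # prints False
```

With a real model in place of the stand-in, the merge step can only correlate the two alerts if both partial summaries happened to preserve the relevant detail.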


Why the 31B Dense Model is the Right Tool

The Gemma 4 family offers three main architectures. When designing a system that relies on a 128K context window, intentional model selection is critical.

| Model | Primary Strength | Best For |
| --- | --- | --- |
| 2B / 4B | Edge execution | Ultra-mobile, browser-based tasks |
| 26B MoE | Throughput / Speed | Chatbots, high-volume fast inference |
| 31B Dense | Deep Recall / Reasoning | Complex analysis across large contexts |

A typical severe OCC incident log is roughly 80,000 to 100,000 tokens.
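
To check whether a given log is in that range before sending it, a quick estimate is usually enough. The sketch below assumes about 4 characters per token, which is only a rule of thumb (logs full of IDs, timestamps, and numbers often tokenize less efficiently), and the 8,000-token reply budget is an arbitrary illustrative value. `estimate_tokens` and `fits_in_context` are hypothetical helpers, not part of the project.

```python
CONTEXT_WINDOW_TOKENS = 128_000  # context size stated in this post
CHARS_PER_TOKEN = 4              # assumption: rough average, not a tokenizer count


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate the token count from character length (ceiling division)."""
    return -(-len(text) // chars_per_token)


def fits_in_context(text: str, reserved_for_output: int = 8_000) -> bool:
    """True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW_TOKENS


incident_log = "x" * 400_000               # stand-in for a 400,000-character log
print(estimate_tokens(incident_log))        # prints 100000
print(fits_in_context(incident_log))        # prints True
```

For anything close to the limit, counting with the model's actual tokenizer is the safer choice.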

I explicitly chose the 31B Dense model over the 26B Mixture-of-Experts (MoE). While MoE models are undeniably faster at inference, Dense architectures traditionally exhibit superior long-context recall. When asking a model to evaluate 100,000 tokens of raw server metrics and deduce the single underlying failure thread, coherent reasoning across the full document is far more valuable than raw token generation speed.

The Local-First Advantage

Infrastructure alert data is confidential. Running the model locally with ollama run gemma4:31b means the data never leaves the machine: no API keys, no data residency concerns, and no per-token cost at scale.
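
For programmatic use, the same local model can be reached through Ollama's HTTP API instead of the interactive CLI. The following is a minimal sketch using only the standard library. It assumes Ollama is listening on its default port and that the `gemma4:31b` tag from this post is already pulled; `build_generate_payload` and `generate` are hypothetical helper names. One detail worth noting: the context size is requested explicitly through `num_ctx`, because Ollama may otherwise use a much smaller default window and truncate a long prompt.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_payload(prompt: str, model: str = "gemma4:31b",
                           num_ctx: int = 131_072) -> dict:
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Ask for a large context window explicitly; the default may be
        # far smaller than the model's maximum.
        "options": {"num_ctx": num_ctx},
    }


def generate(prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    body = json.dumps(build_generate_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `generate(open("your_alerts.txt").read())` would then send the whole file in one request, with nothing leaving localhost.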


Case Study: The Long-Context "Fast-Path" Architecture

To demonstrate this, I built a 4-agent pipeline to generate analytical reports from raw data. Instead of forcing all data through a chunking mechanism, the architecture implements a Long-Context Fast-Path.

Here is how the routing logic cleanly separates the decision-making process:

def _use_full_document(self, document_text: str) -> bool:
    """
    Determines if the document can be processed in a single, unchunked pass.
    """
    provider = getattr(config, "PROVIDER", "ollama")
    use_long_ctx = getattr(config, "USE_LONG_CONTEXT", True)
    model = getattr(config, "OLLAMA_MODEL", "gemma4:31b")

    if not use_long_ctx:
        return False

    is_gemma4_local = (provider == "ollama" and "gemma4" in model.lower())
    is_gemma4_cloud = (
        provider == "openrouter" and 
        "gemma-4" in getattr(config, "MODEL_ALL", "").lower()
    )

    if not (is_gemma4_local or is_gemma4_cloud):
        return False

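    # Gemma 4 supports 128K tokens. Assuming roughly 4 characters per token,
    # 400,000 characters is about 100K tokens, which leaves headroom for the reply.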
    max_chars = getattr(config, "GEMMA4_LONG_CONTEXT_CHARS", 400_000)
    return len(document_text) <= max_chars


When this returns True, the orchestrator bypasses all intermediate summarizing agents. The entire context is injected directly into the primary narrative agent.
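
The bypass can be pictured as a single branch in the orchestrator. The sketch below is an assumption about the shape of that logic, not the repository's actual code: `build_report` and its parameters are hypothetical, the agents are passed in as plain callables, and the fallback uses a simple fixed-size split.

```python
from typing import Callable, List


def build_report(
    document_text: str,
    use_full_document: Callable[[str], bool],
    narrative_agent: Callable[[str], str],
    summarizer_agent: Callable[[str], str],
    chunk_chars: int = 8_000,
) -> str:
    """Route to the single-pass fast path when possible, else chunk first."""
    if use_full_document(document_text):
        # Fast path: the narrative agent reads the whole document at once.
        return narrative_agent(document_text)

    # Fallback: summarize fixed-size chunks, then narrate over the summaries.
    chunks: List[str] = [
        document_text[i:i + chunk_chars]
        for i in range(0, len(document_text), chunk_chars)
    ]
    summaries = [summarizer_agent(chunk) for chunk in chunks]
    return narrative_agent("\n".join(summaries))
```

Keeping the routing decision in one predicate means the chunked path stays available for providers or documents that cannot use the long context.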

Multimodal Processing

I also implemented a call_vision() gateway using Gemma 4's native multimodal input. Ops teams can drop in a dashboard screenshot (.png, .jpg), and Gemma 4 relates the visual spikes to the text-based logs and extracts the numbers for the slides, with no separate vision model required.
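
As a rough idea of what such a gateway sends, here is a sketch of a request body for Ollama's /api/generate endpoint, which accepts base64-encoded images in an `images` list for multimodal models. The internals of call_vision() are not shown in this post, so `build_vision_payload` and the prompt wording are hypothetical, and whether a given local model tag accepts image input should be checked before relying on it.

```python
import base64


def build_vision_payload(image_bytes: bytes, log_excerpt: str,
                         model: str = "gemma4:31b") -> dict:
    """Build a request body that pairs a dashboard screenshot with log text."""
    # Ollama expects images as base64-encoded strings.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    prompt = (
        "The attached image is a monitoring dashboard screenshot. "
        "Relate any visible spikes to the following log excerpt and "
        "list the key numbers:\n\n" + log_excerpt
    )
    return {
        "model": model,
        "prompt": prompt,
        "images": [encoded],
        "stream": False,
    }
```

The screenshot bytes would typically come from reading the uploaded .png or .jpg file in binary mode.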


Code & Running It Yourself

You can find the complete code for the CLI pipeline, FastAPI backend, and React frontend here:

GitHub Repository: [Insert your GitHub URL here]

For local, private execution:

# Install Ollama and pull the model
ollama pull gemma4:31b

# Clone and install
git clone [Your-Repo-URL]
pip install -r requirements.txt
playwright install chromium

# Set provider
echo "PROVIDER=ollama" >> .env
echo "OLLAMA_MODEL=gemma4:31b" >> .env

# Run the orchestrator
python orchestrator.py --input your_alerts.txt


(Instructions for OpenRouter are also available in the repository README).


What I Learned

  1. Long context is not free. Feeding 80,000+ tokens into a model requires real hardware: the 31B variant needs roughly 32 GB of VRAM to run efficiently locally with quantization. For most developers, cloud APIs or Kaggle notebooks are the practical path.
  2. Dense beats MoE for recall tasks. For reading hundreds of alerts and synthesizing a coherent narrative, the Dense architecture was significantly more reliable.
  3. Multimodal is genuinely useful. Unlocking screenshot processing completely changed the workflow for teams who rely on visual dashboards.
  4. Open weights = architecture freedom. Being able to run this pipeline entirely on-premise under Apache 2.0 is a legitimate business advantage for enterprise environments.
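
To put the hardware figure from the first lesson in perspective, the memory taken by the weights alone is simple arithmetic: parameter count times bits per weight. The sketch below covers only that part. The KV cache for an 80,000+ token prompt adds more on top, and its size depends on layer and attention-head details not covered in this post, so treat the output as a lower bound. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for model weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for bits in (16, 8, 4):
    print(f"31B at {bits}-bit: {weight_memory_gb(31, bits):.1f} GB")
# 31B at 16-bit: 62.0 GB
# 31B at 8-bit: 31.0 GB
# 31B at 4-bit: 15.5 GB
```

At 4-bit quantization the weights take about half of a 32 GB budget, which leaves the remainder for the long-context cache and runtime overhead.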

The shift toward capable, open-weight, large-context models like Gemma 4 means we no longer have to compromise our data architecture to fit the limitations of an AI. We can finally build systems that read our data the way we do.