What's new in Data Preprocessor 1.5.x — R codegen, Robust Scaler, and a deadlock post-mortem
Godwill Chri · 2026-05-25 · via DEV Community

It's been a few months since I last wrote about Data Preprocessor, the IntelliJ plugin I built to stop re-writing the same pandas preprocessing scripts every project. The 1.5.x series has landed a real R codegen path, a more honest outlier-resistant normalizer, and one genuinely embarrassing deadlock that I want to talk about openly because the lesson is useful.
tl;dr on what the plugin does
You load a CSV, Excel, or JSON file inside your JetBrains IDE. The plugin profiles every column (type, null count, mean/median/std, mode, unique count). You build a pipeline visually — drop nulls, fill with mean, deduplicate, remove IQR outliers, normalize (min-max / z-score / robust), label-encode, one-hot, train/test split, sort, filter, type-cast — and then one click emits a complete, ready-to-run Python (pandas) or R (base + a few small libs) script.
All processing is local. The plugin collects no telemetry. The generated code is normal pandas or normal R — no runtime library, no plugin import, nothing magic. Read it, edit it, commit it alongside your dataset, run it long after you've uninstalled the plugin.
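To make the profiling step concrete, here is a minimal pandas sketch of the per-column summary described above. It is not the plugin's implementation (which runs inside the IDE); the `profile` function and the sample frame are hypothetical and only illustrate which statistics are involved.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column: dtype, null count, unique count, mode, basic stats."""
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        modes = col.mode()  # ignores nulls; may return several values on ties
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "nulls": int(col.isna().sum()),
            "unique": int(col.nunique()),
            "mode": modes.iloc[0] if not modes.empty else None,
            # Mean/median/std only make sense for numeric columns.
            "mean": col.mean() if numeric else None,
            "median": col.median() if numeric else None,
            "std": col.std() if numeric else None,
        })
    return pd.DataFrame(rows).set_index("column")


# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "department": ["eng", "ops", None, "eng"],
    "salary": [100.0, 120.0, 110.0, None],
})
summary = profile(df)
print(summary)
```

Every statistic here skips nulls, so the null count is reported separately rather than being folded into the other numbers.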
Here's roughly what a 5-step pipeline turns into:
```python
# Generated by Data Preprocessor 1.5.6
# Source: sample-data/employees.csv

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sample-data/employees.csv")

# Step 1: drop rows where 'department' is null
df = df.dropna(subset=["department"])

# Step 2: fill 'performance_score' null with median
df["performance_score"] = df["performance_score"].fillna(
    df["performance_score"].median()
)

# Step 3: remove duplicates
df = df.drop_duplicates()

# Step 4: Robust Scaler on 'salary' (median/IQR, IQR=0 guard)
_med = df["salary"].median()
_q1 = df["salary"].quantile(0.25)
_q3 = df["salary"].quantile(0.75)
_iqr = _q3 - _q1
if _iqr != 0:
    df["salary"] = (df["salary"] - _med) / _iqr

# Step 5: train/test split (ratio 0.8)
train, test = train_test_split(df, train_size=0.8, random_state=42)
```
The R output is structurally the same, with readxl / jsonlite / fastDummies imported only when the pipeline actually uses them.
1.5.0 — R code generation, for real
The biggest change since I last posted is that the codegen is no longer Python-only. The full 16-operation pipeline now has an R equivalent. Label-encode is 0-based to match pandas.factorize (R's native factor() is 1-based by default — that was a fun footgun to find and fix in 1.5.5).
This was a deliberate choice rather than a feature request: when you preprocess for an analytics team, half of them are in Python and half are in R, and forcing the cleanup to be language-specific defeats the point of having a reproducible artifact. The visual pipeline is the spec; Python and R are just two render targets.
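For reference, this is the pandas behaviour the 0-based label encoding is matched against. The column values are made up for illustration.

```python
import pandas as pd

# Hypothetical column, for illustration only.
departments = pd.Series(["sales", "eng", "sales", "ops", "eng"])

# pd.factorize assigns integer codes in order of first appearance, starting at 0.
codes, uniques = pd.factorize(departments)

print(codes.tolist())  # [0, 1, 0, 2, 1]
print(list(uniques))   # ['sales', 'eng', 'ops']
```

R's `factor()` yields integer codes that start at 1 and, by default, sorts its levels, so matching this output from R takes at least an explicit shift by one.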
1.3.0 → 1.5.5 — Robust Scaler with honest edge cases
Min-max and z-score break in interesting ways when your column has outliers. A single row at 10⁹ collapses the rest of the column into a narrow band near zero. So 1.3.0 added the Robust Scaler — (x - median) / IQR — which gives you a normalization that doesn't get yanked around by the long tail.
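A small sketch of the difference, using invented salaries with one extreme value. The formulas are the standard min-max and median/IQR definitions, written inline rather than taken from the plugin's generated code.

```python
import pandas as pd

# Hypothetical salaries with a single outlier at 1e9.
s = pd.Series([40_000.0, 50_000.0, 60_000.0, 70_000.0, 1e9])

# Min-max: the outlier defines the range, so everything else lands near 0.
min_max = (s - s.min()) / (s.max() - s.min())

# Robust Scaler: median and IQR barely notice the outlier.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
robust = (s - s.median()) / (q3 - q1)

print(min_max.tolist())
print(robust.tolist())  # [-1.0, -0.5, 0.0, 0.5, 49997.0]
```

With min-max, the four ordinary salaries all fall below 0.0001. With the Robust Scaler they keep a usable spread between -1 and 0.5, and the outlier stays visible as an outlier instead of compressing everything else.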
The catch: when IQR = 0 (the column is constant or near-constant), the naïve formula divides by zero. Neither pandas nor R raises an error here: 0/0 silently becomes NaN, and any nonzero value divided by zero becomes ±Inf. The Java preview already guarded against this (it returned the column unchanged), but the generated scripts didn't. 1.5.5 added explicit if _iqr != 0: / if (.iqr != 0) guards in both generated outputs to match the preview's behaviour exactly.
Boring fix, but the kind of thing where the absence of an error is worse than a noisy crash. A NaN that propagates through three more steps is much harder to debug than a ZeroDivisionError at the source.
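The silent failure is easy to reproduce. The sketch below uses a hypothetical `robust_scale` helper that mirrors the guard described above (return the column unchanged when IQR is 0); it is an illustration, not the generated code itself.

```python
import pandas as pd


def robust_scale(s: pd.Series) -> pd.Series:
    """Scale by (x - median) / IQR; return the column unchanged when IQR is 0."""
    iqr = s.quantile(0.75) - s.quantile(0.25)
    if iqr == 0:
        return s
    return (s - s.median()) / iqr


# Hypothetical constant column: median is 5.0 and IQR is 0.
constant = pd.Series([5.0, 5.0, 5.0, 5.0])

# Unguarded: pandas raises nothing for 0.0 / 0.0, it just yields NaN.
unguarded = (constant - constant.median()) / 0.0
print(unguarded.isna().all())  # True

# Guarded: the constant column passes through untouched.
guarded = robust_scale(constant)
print(guarded.tolist())  # [5.0, 5.0, 5.0, 5.0]
```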
1.5.3 — the deadlock post-mortem
This is the one I want to talk about. IntelliJ Platform 2024.2 changed how FileChooser.chooseFiles interacts with the EDT (event dispatch thread). The Browse button started failing intermittently on newer IDEs, so 1.5.3 wrapped the call in ApplicationManager.getApplication().invokeLater(...).
That was wrong, and not in a "minor regression" way — in a "the entire IDE freezes for every user who installs the plugin" way.
Here's the trap: FileChooser.chooseFiles is already asynchronous on its own. Wrapping it in invokeLater queues a runnable behind the EDT pump, but the runnable itself opens a modal-style dispatcher that blocks the EDT pump waiting for itself to dispatch. Neither side makes progress. Cursor hangs, dock icon stops responding, and the JVM has to be killed from Activity Monitor.
I caught it within about an hour because users on Marketplace were immediate and direct about it. Sincere gratitude for that: angry early users are the most valuable kind. I retracted 1.5.3, shipped 1.5.4 as a straight revert, and added a permanent comment to the source so I don't repeat the mistake:
```java
// FileChooser.chooseFiles is already asynchronous and must be called
// directly from the EDT — no wrapper is needed or safe.
```
1.5.5 then fixed the original Browse problem the right way: switched to the built-in single-file chooser, kept directories visible in the filter so users can navigate normally, and anchored the dialog to the tool window component rather than letting it float free.
Two lessons I'm carrying forward:

When the platform changes async semantics, read the source — don't guess. The 2024.2 release notes mentioned the dispatcher change, but I didn't connect it back to FileChooser because the API surface hadn't moved.
Modal-dialog-on-EDT bugs don't show up in CI. They show up the moment a real user clicks the button. Manual smoke-testing on a sandbox IDE before every publish is now non-negotiable for me.

1.5.6 — SDK alignment
Just shipped today. pluginSinceBuild bumped from 233 to 243, matching the 2024.3 SDK I actually compile against. JetBrains' Plugin Verifier reports Compatible against IC-243, IC-251, IC-252, and IU-253 — zero deprecated-API usages against 2024.3 itself, three soft deprecations in 2025.x that I'll address in the next minor.
I also disabled the Gradle IntelliJ Plugin's GitHub self-update check, which had a habit of failing the entire build whenever GitHub's API was rate-limited or my network was offline. That one ate two hours of my Monday before I tracked down the fix:
```properties
# gradle.properties
systemProp.org.jetbrains.intellij.buildFeature.selfUpdateCheck=false
```
If you build any IntelliJ plugin and you've ever stared at "Cannot resolve the latest Gradle IntelliJ Plugin version" and wondered why a build with no actual problems is failing, that line is the fix.
What's next
The most-requested features right now, in order:

Categorical binning — equal-width and quantile-based bucketization for numeric columns into categorical bins. Pandas has pd.cut and pd.qcut; R has a few options. Codegen for both is straightforward; the UI work is figuring out how to preview the bins without making the tool window huge.
Pipeline import/export as JSON so teams can share pipeline definitions in the repo and re-apply them via CLI in CI. This is the change that turns the plugin from a "speed up the first cleanup" tool into a "version-control your data cleanups" tool.
DuckDB read path for files too large to fit in memory. The current LoaderArchitecture is single-pass row-oriented; DuckDB would let the plugin profile and clean files up to ~10 GB on a laptop without rewriting the engine.
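For the binning item in the list above, this is roughly the pandas side the codegen would target. It is a sketch of `pd.cut` and `pd.qcut` on invented data; the bin counts and labels are assumptions, not a preview of the plugin's eventual output.

```python
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([22, 25, 31, 38, 44, 52, 61, 70])

# Equal-width: 3 bins that each span the same range of values.
equal_width = pd.cut(ages, bins=3, labels=["low", "mid", "high"])

# Quantile-based: 4 bins that each hold roughly the same number of rows.
quantile = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

width_counts = equal_width.value_counts(sort=False).to_dict()
quantile_counts = quantile.value_counts(sort=False).to_dict()

print(width_counts)     # {'low': 4, 'mid': 2, 'high': 2}
print(quantile_counts)  # {'q1': 2, 'q2': 2, 'q3': 2, 'q4': 2}
```

The counts show the trade-off a preview has to communicate: equal-width bins are easy to read but can be unevenly filled, while quantile bins are balanced but have edges that depend on the data.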

If you've used the plugin and have opinions on which of these to prioritize — or a totally different request — please drop it as an issue or just reply here. The most useful feedback is "I tried to do X and the generated code does Y instead" because those are the highest-leverage fixes.
Try it

Marketplace: https://plugins.jetbrains.com/plugin/31226-data-preprocessor
Source (MIT): https://github.com/codaBlurd/data-preprocessor-plugin

Bug reports, feature requests, and PRs all welcome. Reviews on the Marketplace are how the plugin gets discovered by new users — if it's saved you time, two minutes there is the highest-leverage thing you can do for it.
Thanks for reading. Build something good this week.