What's new in Data Preprocessor 1.5.x — R codegen, Robust Scaler, and a deadlock post-mortem

It's been a few months since I last wrote about Data Preprocessor, the IntelliJ plugin I built to stop re-writing the same pandas preprocessing scripts every project. The 1.5.x series has landed a real R codegen path, a more honest outlier-resistant normalizer, and one genuinely embarrassing deadlock that I want to talk about openly because the lesson is useful.
tl;dr on what the plugin does
You load a CSV, Excel, or JSON file inside your JetBrains IDE. The plugin profiles every column (type, null count, mean/median/std, mode, unique count). You build a pipeline visually — drop nulls, fill with mean, deduplicate, remove IQR outliers, normalize (min-max / z-score / robust), label-encode, one-hot, train/test split, sort, filter, type-cast — and then one click emits a complete, ready-to-run Python (pandas) or R (base + a few small libs) script.
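For a feel of what that per-column profile contains, here is a minimal pandas sketch of the same statistics. It is an illustration only: the plugin computes its profile inside the IDE, and the `profile` helper and the sample columns below are hypothetical names made up for this example.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, null count, unique count, mean/median/std, mode."""
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        modes = col.mode(dropna=True)
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "nulls": int(col.isna().sum()),
            "unique": int(col.nunique(dropna=True)),
            # mean/median/std only make sense for numeric columns
            "mean": col.mean() if numeric else None,
            "median": col.median() if numeric else None,
            "std": col.std() if numeric else None,
            "mode": modes.iloc[0] if not modes.empty else None,
        })
    return pd.DataFrame(rows).set_index("column")


# Hypothetical sample data, not the plugin's bundled dataset
sample = pd.DataFrame({
    "department": ["eng", "ops", None, "eng"],
    "salary": [100.0, 120.0, 110.0, None],
})
summary = profile(sample)
print(summary)
```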
All processing is local. The plugin collects no telemetry. The generated code is normal pandas or normal R — no runtime library, no plugin import, nothing magic. Read it, edit it, commit it alongside your dataset, run it long after you've uninstalled the plugin.
Here's roughly what a 5-step pipeline turns into:
```python
# Generated by Data Preprocessor 1.5.6
# Source: sample-data/employees.csv

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sample-data/employees.csv")

# Step 1: drop rows where 'department' is null
df = df.dropna(subset=["department"])

# Step 2: fill 'performance_score' null with median
df["performance_score"] = df["performance_score"].fillna(
    df["performance_score"].median()
)

# Step 3: remove duplicates
df = df.drop_duplicates()

# Step 4: Robust Scaler on 'salary' (median/IQR, IQR=0 guard)
_med = df["salary"].median()
_q1 = df["salary"].quantile(0.25)
_q3 = df["salary"].quantile(0.75)
_iqr = _q3 - _q1
if _iqr != 0:
    df["salary"] = (df["salary"] - _med) / _iqr

# Step 5: train/test split (ratio 0.8)
train, test = train_test_split(df, train_size=0.8, random_state=42)
```
The R output is structurally the same, with readxl / jsonlite / fastDummies imported only when the pipeline actually uses them.
1.5.0 — R code generation, for real
The biggest change since I last posted is that the codegen is no longer Python-only. The full 16-operation pipeline now has an R equivalent. Label-encode is 0-based to match pandas.factorize (the integer codes behind R's native factor() are 1-based, which was a fun footgun to find and fix in 1.5.5).
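To make the 0-based convention concrete, here is a small sketch of what pandas.factorize returns. The `departments` column is a made-up example, and this shows the pandas behaviour the R output is aligned with, not necessarily the exact lines the plugin emits.

```python
import pandas as pd

# Hypothetical column; codes are assigned in order of first appearance
departments = pd.Series(["eng", "ops", "eng", "sales"])

codes, uniques = pd.factorize(departments)
print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['eng', 'ops', 'sales']
```

The first category gets code 0, so an R script that wants identical labels has to shift its 1-based factor codes down by one.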
This was a deliberate choice rather than a feature request: when you preprocess for an analytics team, half of them are in Python and half are in R, and forcing the cleanup to be language-specific defeats the point of having a reproducible artifact. The visual pipeline is the spec; Python and R are just two render targets.
1.3.0 → 1.5.5 — Robust Scaler with honest edge cases
Min-max and z-score break in interesting ways when your column has outliers. A single row at 10⁹ collapses the rest of the column into a narrow band near zero. So 1.3.0 added the Robust Scaler — (x - median) / IQR — which gives you a normalization that doesn't get yanked around by the long tail.
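A quick way to see the difference is to scale a tiny column with one extreme value both ways. This is a sketch with made-up salary figures, using the same median/IQR formula as above:

```python
import pandas as pd

# Four ordinary values plus one outlier at 10^9 (hypothetical data)
salary = pd.Series([40_000.0, 50_000.0, 60_000.0, 70_000.0, 1e9])

# Min-max: the outlier defines the range, so everything else lands near 0
min_max = (salary - salary.min()) / (salary.max() - salary.min())

# Robust: median and IQR come from the bulk of the data
q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
robust = (salary - salary.median()) / (q3 - q1)

print(min_max.tolist())
print(robust.tolist())
```

With min-max, the four ordinary salaries all end up below 0.0001. With the robust formula they land at -1.0, -0.5, 0.0, and 0.5, and only the outlier itself gets a huge value.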
The catch: when IQR = 0 (the column is constant, or near-constant), the naïve formula divides by zero and silently produces NaN or ±Inf instead of raising an error. This happens in both pandas and R: 0/0 gives NaN, and any non-zero value over 0 gives Inf. The Java preview already guarded against this (it returned the column unchanged), but the generated scripts didn't. 1.5.5 added explicit if _iqr != 0: / if (.iqr != 0) guards in both generated outputs to match the preview's behaviour exactly.
Boring fix, but the kind of thing where the absence of an error is worse than a noisy crash. A NaN that propagates through three more steps is much harder to debug than a ZeroDivisionError at the source.
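Here is a minimal reproduction of the silent failure, next to a guarded version. The `robust_scale` helper is a hypothetical name for this sketch; the generated scripts inline the guard rather than defining a function.

```python
import pandas as pd


def robust_scale(col: pd.Series) -> pd.Series:
    """(x - median) / IQR, returning the column unchanged when IQR is 0."""
    iqr = col.quantile(0.75) - col.quantile(0.25)
    if iqr == 0:
        return col
    return (col - col.median()) / iqr


# A constant column: median is 5.0 and IQR is 0
constant = pd.Series([5.0, 5.0, 5.0, 5.0])

# Unguarded formula: 0 / 0 for every row, no exception raised
naive = (constant - constant.median()) / (
    constant.quantile(0.75) - constant.quantile(0.25)
)

print(naive.isna().all())               # True
print(robust_scale(constant).tolist())  # [5.0, 5.0, 5.0, 5.0]
```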
1.5.3 — the deadlock post-mortem
This is the one I want to talk about. The IntelliJ Platform 2024.2 changed how FileChooser.chooseFiles interacts with the EDT (event dispatch thread). The Browse button started failing intermittently on newer IDEs, so 1.5.3 wrapped the call in ApplicationManager.getApplication().invokeLater(...).
That was wrong, and not in a "minor regression" way — in a "the entire IDE freezes for every user who installs the plugin" way.
Here's the trap: FileChooser.chooseFiles is already asynchronous on its own. Wrapping it in invokeLater queues a runnable behind the EDT pump, but the runnable itself opens a modal-style dispatcher that blocks the EDT pump waiting for itself to dispatch. Neither side makes progress. Cursor hangs, dock icon stops responding, and the JVM has to be killed from Activity Monitor.
I caught it within about an hour, because users on Marketplace were immediate and direct about it (sincere gratitude for that: angry early users are the most valuable kind). I retracted 1.5.3, shipped 1.5.4 as a straight revert, and added a permanent comment to the source so I don't repeat the mistake:
```java
// FileChooser.chooseFiles is already asynchronous and must be called
// directly from the EDT: no wrapper is needed or safe.
```
1.5.5 then fixed the original Browse problem the right way: switched to the built-in single-file chooser, kept directories visible in the filter so users can navigate normally, and anchored the dialog to the tool window component rather than letting it float free.
Two lessons I'm carrying forward:

1. When the platform changes async semantics, read the source; don't guess. The 2024.2 release notes mentioned the dispatcher change, but I didn't connect it back to FileChooser because the API surface hadn't moved.
2. Modal-dialog-on-EDT bugs don't show up in CI. They show up the moment a real user clicks the button. Manual smoke-testing on a sandbox IDE before every publish is now non-negotiable for me.

1.5.6 — SDK alignment
Just shipped today. pluginSinceBuild bumped from 233 to 243, matching the 2024.3 SDK I actually compile against. JetBrains' Plugin Verifier reports Compatible against IC-243, IC-251, IC-252, and IU-253 — zero deprecated-API usages against 2024.3 itself, three soft deprecations in 2025.x that I'll address in the next minor.
I also disabled the Gradle IntelliJ Plugin's GitHub self-update check, which had a habit of failing the entire build whenever GitHub's API was rate-limited or my network was offline. That one ate two hours of my Monday before I tracked down the fix:
```properties
# gradle.properties
systemProp.org.jetbrains.intellij.buildFeature.selfUpdateCheck=false
```
If you build any IntelliJ plugin and you've ever stared at Cannot resolve the latest Gradle IntelliJ Plugin version and wondered why a build with no actual problems is failing — that line is the fix.
What's next
The most-requested features right now, in order:

1. Categorical binning: equal-width and quantile-based bucketization for numeric columns into categorical bins. Pandas has pd.cut and pd.qcut; R has a few options. Codegen for both is straightforward; the UI work is figuring out how to preview the bins without making the tool window huge.
2. Pipeline import/export as JSON so teams can share pipeline definitions in the repo and re-apply them via CLI in CI. This is the change that turns the plugin from a "speed up the first cleanup" tool into a "version-control your data cleanups" tool.
3. DuckDB read path for files too large to fit in memory. The current LoaderArchitecture is single-pass row-oriented; DuckDB would let the plugin profile and clean files up to ~10 GB on a laptop without rewriting the engine.
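For the binning item, this is roughly what the two pandas calls do. It is a sketch of possible output, not shipped codegen: the `age` column and the choice of four bins are assumptions made for the example.

```python
import pandas as pd

# Hypothetical numeric column
age = pd.Series([22, 25, 31, 38, 45, 52, 60, 67])

# Equal-width: four intervals of the same size across min..max
equal_width = pd.cut(age, bins=4)

# Quantile-based: four intervals holding (roughly) the same number of rows
quantile = pd.qcut(age, q=4)

print(equal_width.value_counts(sort=False))
print(quantile.value_counts(sort=False))
```

Equal-width bins can end up unevenly populated when the data is skewed, while quantile bins stay balanced but have uneven widths. That trade-off is what a bin preview in the UI would need to make visible.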

If you've used the plugin and have opinions on which of these to prioritize — or a totally different request — please drop it as an issue or just reply here. The most useful feedback is "I tried to do X and the generated code does Y instead" because those are the highest-leverage fixes.
Try it

Marketplace: https://plugins.jetbrains.com/plugin/31226-data-preprocessor
Source (MIT): https://github.com/codaBlurd/data-preprocessor-plugin

Bug reports, feature requests, and PRs all welcome. Reviews on the Marketplace are how the plugin gets discovered by new users — if it's saved you time, two minutes there is the highest-leverage thing you can do for it.
Thanks for reading. Build something good this week.
