Stop Shipping ML Models With Bare Floats: A Deep Dive Into Statistically Rigorous Model Evaluation

Every week, somewhere, a team makes a deployment decision that looks like this:

Model A: AUROC = 0.847
Model B: AUROC = 0.851

They ship Model B.

Maybe it's better.

Maybe it's noise.

Nobody knows—because nobody computed a confidence interval.

That's exactly why I built reliably-metrics.

The Problem With Bare Floats

Most ML evaluation today looks like this:

print(f"AUROC = {auroc:.4f}")

Output:

AUROC = 0.8512

Looks precise.

Looks scientific.

But it tells you almost nothing about uncertainty.

Metrics are estimates computed from finite samples. Without uncertainty quantification, you're making decisions using a single point estimate and hoping it's representative.

Consider two models evaluated on 500 test samples:

Model A: AUROC = 0.847
Model B: AUROC = 0.851
Difference = +0.004

Is that improvement real?

Or would it disappear if you collected another batch of test data?

Most ML tooling doesn't answer that question.
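A back-of-the-envelope check shows how large the noise can be. The sketch below uses the Hanley and McNeil (1982) approximation for the standard error of a single AUROC. The 250/250 class split is an assumption (the example only states 500 test samples), and the function name is illustrative rather than part of any library.

```python
import math

def auroc_standard_error(auc, n_pos, n_neg):
    """Hanley-McNeil (1982) approximation of the standard error of an AUROC."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

# Assumed split: 250 positives and 250 negatives out of the 500 test samples.
se = auroc_standard_error(0.851, n_pos=250, n_neg=250)
print(f"approximate SE = {se:.3f}")
```

Under these assumptions the standard error of one AUROC is roughly 0.017, about four times the 0.004 gap. Both models are scored on the same samples, so the standard error of the difference is usually smaller than this. Working it out properly takes a paired test rather than a comparison of two bare floats.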

Introducing `reliably-metrics`

pip install reliably-metrics

Basic evaluation:

import reliably as rb

report = rb.evaluate(y_true, y_prob)

print(report.summary())

Output:

Report(task=binary, n=500)
  ECE=0.0412 [0.0287, 0.0541]
  smECE=0.0389 [0.0261, 0.0523]
  Brier=0.1834 [0.1612, 0.2063]
  NLL=0.4821 [0.4503, 0.5148]
  AUROC=0.8234 [0.7941, 0.8509]

Notice something different?

Every metric comes with a 95% confidence interval.

No extra code.

No manual bootstrap implementation.

No statistics package required.

Compare Models With Statistical Significance Testing

Instead of comparing raw metric values, compare uncertainty-aware estimates.

result = rb.compare(
    model_a,
    model_b,
    metric="auroc",
    y_true=y_true
)

print(f"Delta: {result.delta:+.4f}")
print(f"95% CI: [{result.ci.low:.4f}, {result.ci.high:.4f}]")
print(f"p-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")

Output:

Delta: +0.0182
95% CI: [-0.0031, 0.0396]
p-value: 0.0940
Significant: False

Interpretation:

The confidence interval crosses zero.
The p-value is greater than 0.05.
The improvement is not statistically significant.

Translation:

Don't deploy Model B yet.
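To make the "interval crosses zero" reasoning concrete, here is a minimal sketch of a paired bootstrap on the Brier score. It is not the DeLong test and not necessarily how `rb.compare` works internally. The function name and defaults are assumptions made for illustration.

```python
import numpy as np

def paired_bootstrap_brier_delta(y_true, prob_a, prob_b,
                                 n_boot=5000, alpha=0.05, seed=0):
    """Percentile CI for Brier(B) - Brier(A) using a paired bootstrap."""
    y = np.asarray(y_true, dtype=float)
    pa = np.asarray(prob_a, dtype=float)
    pb = np.asarray(prob_b, dtype=float)
    # Per-sample difference in squared error. Pairing keeps each sample's
    # two predictions together whenever that sample is resampled.
    diff = (pb - y) ** 2 - (pa - y) ** 2
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y), size=(n_boot, len(y)))
    boot = diff[idx].mean(axis=1)
    low, high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    # "Significant" here simply means the interval excludes zero.
    significant = not (low <= 0.0 <= high)
    return float(diff.mean()), float(low), float(high), significant

# Synthetic check: model A always predicts 0.5, model B knows the true rate.
rng = np.random.default_rng(1)
q = rng.uniform(0.05, 0.95, size=500)
y = (rng.uniform(size=500) < q).astype(float)
delta, low, high, sig = paired_bootstrap_brier_delta(y, np.full(500, 0.5), q)
print(f"Delta: {delta:+.4f}  95% CI: [{low:.4f}, {high:.4f}]  Significant: {sig}")
```

Lower Brier is better, so a negative delta favors Model B in this sketch. When the two models are identical, every per-sample difference is zero and the interval collapses onto zero.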

The library automatically selects the appropriate test:

| Scenario | Statistical method |
| --- | --- |
| AUROC | DeLong test |
| Other metrics | Paired bootstrap |
| Multiple comparisons | Holm–Bonferroni correction |
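The Holm–Bonferroni step is short enough to show in full. This is a standalone sketch of the standard step-down procedure, not the library's own code, and the function name is hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/keep decision per hypothesis using Holm's step-down rule."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject

# Four pairwise model comparisons with illustrative p-values.
decisions = holm_bonferroni([0.010, 0.040, 0.030, 0.005])
print(decisions)
```

Note that 0.03 and 0.04 would each pass an uncorrected 0.05 threshold, but they are not rejected once the number of comparisons is accounted for.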

Calibration: Measure It, Fix It, Verify It

A model can have excellent accuracy while being poorly calibrated.

If your model outputs:

predict_proba = 0.90

it should be correct approximately 90% of the time.

In practice, many production systems are far from this ideal.
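The gap between stated confidence and observed frequency is what ECE measures. Below is a minimal sketch of a binned ECE for binary problems, assuming equal-width bins and comparing the mean predicted probability in each bin to the observed positive rate. The library's exact binning scheme may differ.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width binned ECE for binary labels and positive-class probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)
    # Map each probability to a bin index; p == 1.0 goes into the last bin.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(y[mask].mean() - p[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of samples
    return ece

# Ten predictions of 0.90: well calibrated if 9 are positive, overconfident if 5 are.
probs = np.full(10, 0.9)
print(expected_calibration_error([1] * 9 + [0], probs))
print(expected_calibration_error([1] * 5 + [0] * 5, probs))
```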

Diagnose

# Score the uncalibrated model on the same test split used for verification below
report_before = rb.evaluate(
    y_true_test,
    y_prob_test
)

print(report_before["ECE"])

Output:

ECE=0.0821 [0.0612, 0.1034]

Recalibrate

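# Fit the calibrator on a held-out calibration split (y_true, y_prob)
cal = rb.recalibrate(
    y_true,
    y_prob,
    method="temperature"
)

# Apply it to test-set probabilities the calibrator has not seen
y_prob_cal = cal.predict(y_prob_test)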

Verify Improvement

report_after = rb.evaluate(
    y_true_test,
    y_prob_cal
)

print(report_after["ECE"])

Output:

ECE=0.0241 [0.0143, 0.0352]

Supported methods:

Temperature Scaling
Isotonic Regression
Platt Scaling
Beta Calibration
Histogram Binning
Vector Scaling
Matrix Scaling
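Of the methods listed above, temperature scaling is the simplest: divide the logits by a single scalar T chosen to minimize negative log-likelihood on a calibration split. The sketch below is a binary version that uses a plain grid search instead of a gradient-based optimizer. All function names and the grid range are assumptions, not the library's implementation.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _logit(p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)

def binary_nll(y_true, y_prob, eps=1e-12):
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log1p(-p)))

def fit_temperature(y_true, y_prob, grid=None):
    """Pick the temperature T that minimizes NLL of sigmoid(logit(p) / T)."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)  # assumed search range, step 0.05
    z = _logit(np.asarray(y_prob, dtype=float))
    losses = [binary_nll(y_true, _sigmoid(z / t)) for t in grid]
    return float(grid[int(np.argmin(losses))])

def apply_temperature(y_prob, temperature):
    return _sigmoid(_logit(np.asarray(y_prob, dtype=float)) / temperature)

# Synthetic overconfident model: its logits are twice the true logits.
rng = np.random.default_rng(0)
true_logit = rng.normal(0.0, 1.5, size=5000)
labels = (rng.uniform(size=5000) < _sigmoid(true_logit)).astype(float)
overconfident = _sigmoid(2.0 * true_logit)

t_hat = fit_temperature(labels, overconfident)
recalibrated = apply_temperature(overconfident, t_hat)
print(f"fitted T = {t_hat:.2f}")
```

For brevity the sketch fits and applies T on the same synthetic data. In practice T should be fit on a calibration split and checked on a separate test split, as in the workflow above.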

Reliability Diagrams With Confidence Bands

Most calibration plots show a line and leave interpretation to the reader.

reliably-metrics can visualize uncertainty directly.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))

report.reliability_diagram(
    y_true,
    y_prob,
    ax=ax,
    band=True
)

plt.savefig(
    "calibration.png",
    dpi=150
)

The shaded region represents a bootstrap confidence band around the calibration curve.

This helps distinguish real calibration errors from random fluctuations.
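One way such a band could be computed is to resample the test set and recompute the observed positive rate in each probability bin. The sketch below assumes equal-width bins and a percentile interval. The function name and defaults are hypothetical, and the library may construct its band differently.

```python
import numpy as np

def reliability_band(y_true, y_prob, n_bins=10, n_boot=1000, alpha=0.05, seed=0):
    """Per-bin percentile bootstrap band for the observed positive rate."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y), size=(n_boot, len(y)))
    y_boot, bin_boot = y[idx], bins[idx]  # both have shape (n_boot, n)
    low = np.full(n_bins, np.nan)
    high = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = bin_boot == b
        counts = in_bin.sum(axis=1)
        positives = (y_boot * in_bin).sum(axis=1)
        populated = counts > 0  # skip replicates where the bin came up empty
        if populated.any():
            rate = positives[populated] / counts[populated]
            low[b], high[b] = np.quantile(rate, [alpha / 2, 1 - alpha / 2])
    return low, high

# Synthetic, perfectly calibrated predictions.
rng = np.random.default_rng(0)
p_demo = rng.uniform(size=2000)
y_demo = (rng.uniform(size=2000) < p_demo).astype(float)
band_low, band_high = reliability_band(y_demo, p_demo, n_boot=500)
```

Bins that never receive a sample keep `nan` bounds. With matplotlib, the two arrays could be passed to `ax.fill_between` over the bin centers to draw the shaded region.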

Generate HTML Reports in One Line

Need a report for teammates or stakeholders?

report.to_html(
    path="model_report.html"
)

That's it.

The generated report contains:

Metrics
Confidence intervals
Calibration analysis
Reliability diagrams
Statistical comparisons

No Jupyter notebook required.

Why The Library Is Designed This Way

1. Dependency Isolation

Core installation:

pip install reliably-metrics

Visualization support:

pip install reliably-metrics[viz]

HTML reporting:

pip install reliably-metrics[report]

Everything:

pip install reliably-metrics[all]

Heavy dependencies are loaded only when needed.
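A common way to get this behavior is to import optional packages inside the function that needs them and point the user at the right extra when the import fails. The helper below is a hypothetical sketch of that pattern, not the library's actual code.

```python
import importlib

def require_optional(module_name, extra):
    """Import an optional dependency lazily, with an actionable error message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install reliably-metrics[{extra}]"
        ) from exc

def plot_something():
    # matplotlib is only imported when plotting is actually requested.
    plt = require_optional("matplotlib.pyplot", extra="viz")
    return plt.subplots()
```

With this layout, importing the core package never touches matplotlib. The import cost is paid only by code paths that plot.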

2. Vectorized Bootstrap

Traditional bootstrap implementations often look like this:

boot_metrics = []
for i in range(10000):
    sample = resample(data)  # draw len(data) rows with replacement
    boot_metrics.append(compute_metric(sample))  # keep every replicate

That means 10,000 iterations of a Python-level loop, each one paying interpreter overhead.

reliably-metrics instead generates all bootstrap indices up front and performs calculations using vectorized NumPy operations.

The result:

Faster execution
Lower overhead
Better scalability
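As a rough illustration of the idea, here is a vectorized percentile bootstrap for the Brier score. It assumes the metric is a mean of per-sample terms, which makes the vectorization straightforward. Rank-based metrics such as AUROC need more work. The function name and defaults are illustrative.

```python
import numpy as np

def bootstrap_ci_brier(y_true, y_prob, n_boot=10_000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI for the Brier score."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)
    sq_err = (p - y) ** 2  # per-sample contribution to the Brier score
    rng = np.random.default_rng(seed)
    # All resampling indices are drawn in one call: shape (n_boot, n).
    idx = rng.integers(0, len(y), size=(n_boot, len(y)))
    # Fancy indexing plus a row-wise mean replaces the Python loop.
    boot = sq_err[idx].mean(axis=1)
    low, high = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(sq_err.mean()), float(low), float(high)

# Synthetic data for demonstration.
rng = np.random.default_rng(7)
p_demo = rng.uniform(size=500)
y_demo = (rng.uniform(size=500) < p_demo).astype(float)
point, low, high = bootstrap_ci_brier(y_demo, p_demo, n_boot=2000, seed=42)
print(f"Brier={point:.4f} [{low:.4f}, {high:.4f}]")
```

The index matrix holds `n_boot * n` integers, so the speedup is bought with memory. For very large test sets the resampling may need to be done in chunks.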

3. Deterministic Results

Every stochastic operation accepts an explicit seed.

report = rb.evaluate(
    y_true,
    y_prob,
    seed=42
)

Same data.

Same seed.

Same output.

Always.

4. Confidence Intervals Are Actually Tested

Many libraries claim statistical rigor.

We verify it.

The test suite repeatedly generates synthetic datasets with known ground-truth metrics and checks empirical confidence interval coverage.

If a nominal 95% confidence interval stops covering the true value approximately 95% of the time, the coverage tests fail.

Statistical correctness isn't just documentation—it's enforced in continuous integration.
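A coverage check of this kind can be sketched in a few lines. The example below simulates a classifier with a known true accuracy, builds a percentile bootstrap interval on each simulated test set, and counts how often the interval contains the truth. The function names, sample sizes, and tolerance are assumptions, not the project's actual test suite.

```python
import numpy as np

def bootstrap_mean_ci(x, n_boot, alpha, rng):
    """Percentile bootstrap interval for the mean of x."""
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot = x[idx].mean(axis=1)
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])

def empirical_coverage(true_acc=0.8, n=200, n_trials=300,
                       n_boot=500, alpha=0.05, seed=0):
    """Fraction of simulated test sets whose CI contains the known accuracy."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        # Each prediction is correct with probability true_acc.
        correct = (rng.uniform(size=n) < true_acc).astype(float)
        low, high = bootstrap_mean_ci(correct, n_boot, alpha, rng)
        hits += int(low <= true_acc <= high)
    return hits / n_trials

coverage = empirical_coverage()
print(f"empirical coverage of nominal 95% CI: {coverage:.3f}")
# A test would assert the coverage is close to nominal, within simulation noise.
assert 0.88 <= coverage <= 0.99
```

The tolerance has to allow for the randomness of the simulation itself. With 300 trials the coverage estimate carries a standard error of a bit over one percentage point.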

Bonus: Disentanglement Metrics

If you're working on:

VAEs
Representation Learning
Self-Supervised Learning
Generative Models

the library also includes disentanglement evaluation metrics.

from reliably.repr import disentanglement

results = disentanglement(
    z,
    factors,
    metrics=(
        "mig",
        "sap",
        "dci",
        "factorvae",
        "irs"
    )
)

print(results["mig"])

Output:

MIG=0.312 [0.271, 0.354]

Included metrics:

MIG (Chen et al., 2018)
SAP (Kumar et al., 2017)
DCI (Eastwood & Williams, 2018)
FactorVAE Score (Kim & Mnih, 2018)
IRS (Suter et al., 2019)

All reported with bootstrap confidence intervals.

Get Involved

The project is still in its early stages, and contributions are welcome.

GitHub

https://github.com/nischal1234/reliably

Documentation

https://reliably.readthedocs.io

PyPI

pip install reliably-metrics

Good First Issues

ENIR recalibration
Bayesian Binning into Quantiles (BBQ)
HuggingFace adapters
LightGBM adapters
XGBoost adapters
Multiclass calibration metrics
Tutorial notebooks
Real-world examples

Final Thought

Machine learning has become incredibly good at reporting tiny metric improvements.

We're much worse at determining whether those improvements are actually real.

A model with:

AUROC = 0.851

isn't enough.

What you really need is:

AUROC = 0.851 [0.812, 0.887]

Because uncertainty isn't optional.

It's part of the measurement.

Let's make statistically rigorous ML evaluation the default—not the exception.
