How to Brier-grade your own ML option-pricing forecasts in 40 lines of Python

If you ship a probabilistic forecast, the single highest-value habit you can build is logging your forecasts so you can grade them later. Sabermetrics figured this out forty years ago. Weather forecasting has done it for a century. Most ML model owners still do not do it.

This post walks through a 40-line Python recipe that logs an ML option-pricing model's per-contract probability-ITM forecast to a CSV, so you can compute the Brier loss after the option expires. The recipe is part of a small open-source cookbook for the Helium MCP REST surface — an MCP server that also exposes its tools as plain HTTPS GETs, which makes it convenient as a teaching substrate even if you do not use MCP.

You will not need an API key, a signup, or a Python SDK.

What we are doing

For every option contract we care about, we want one row that records:

The contract identifier (symbol, strike, expiration, type)
The model's predicted fair value
The model's probability the contract finishes in the money
The model's data date
(Filled in later) the market mark at the same timestamp
(Filled in at expiration) the realized underlying price
(Computed) whether the contract was actually ITM
(Computed) the Brier loss for the probability forecast

When we Brier-grade later, we get one number per contract. Average across many contracts and we have a directly comparable calibration score — exactly the discipline a baseball win-probability model or a weather precipitation forecast gets graded on.

The endpoint

The Helium server exposes its option-pricing tool at this URL:

GET https://heliumtrades.com/mcp_option_price/
    ?symbol=AAPL&strike=310&expiration=2026-06-26&option_type=call

Plain GET, query parameters in, JSON out, no auth header, free tier of 50 calls per IP per day. A live call returns:

{
  "symbol": "AAPL",
  "strike": 310.0,
  "expiration": "2026-06-26",
  "option_type": "call",
  "predicted_price": 6.53,
  "prob_itm": 0.42,
  "options_data_date": "2026-05-26"
}

Two of those fields are forecasts about the future: predicted_price (the model's fair value) and prob_itm (the model's probability the option finishes ITM at expiration). The expiration date in the request is the fixed resolution date. That gives us a clean falsifiable target.
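Before logging a forecast it can help to check that the response actually contains a gradable probability. The sketch below is one way to do that, not part of the cookbook: `parse_forecast` and `_is_real_number` are hypothetical helper names, and the rule that `prob_itm` always lies in [0, 1] is an assumption about the API rather than something it documents here.

```python
import math


def _is_real_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )


def parse_forecast(payload: dict) -> dict:
    """Return the fields worth logging, or raise ValueError if they look unusable.

    Hypothetical helper: assumes prob_itm is a probability in [0, 1] and
    predicted_price is a non-negative number.
    """
    prob = payload.get("prob_itm")
    price = payload.get("predicted_price")
    if not _is_real_number(prob) or not 0.0 <= prob <= 1.0:
        raise ValueError(f"prob_itm is not a probability: {prob!r}")
    if not _is_real_number(price) or price < 0:
        raise ValueError(f"predicted_price is not a non-negative number: {price!r}")
    return {
        "predicted_price": float(price),
        "prob_itm": float(prob),
        "options_data_date": payload.get("options_data_date"),
    }


# The sample response shown above.
sample = {
    "symbol": "AAPL",
    "strike": 310.0,
    "expiration": "2026-06-26",
    "option_type": "call",
    "predicted_price": 6.53,
    "prob_itm": 0.42,
    "options_data_date": "2026-05-26",
}
print(parse_forecast(sample))
```

Rejecting a bad row at logging time is cheaper than discovering weeks later that a forecast cannot be graded.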

The recipe

"""Log Helium's ML option-price + prob_itm forecasts to a CSV so you can
Brier-grade them at expiration.
"""
import csv
import sys
from datetime import datetime, timezone
from pathlib import Path

import requests

ENDPOINT = "https://heliumtrades.com/mcp_option_price/"
LOG_FILE = Path("calibration_log.csv")


def main(symbol, strike, expiration, option_type):
    params = {
        "symbol": symbol, "strike": strike,
        "expiration": expiration, "option_type": option_type,
    }
    resp = requests.get(ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of logging junk
    data = resp.json()

    # Write the header only when the log file is first created.
    is_new = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        w = csv.writer(f)
        if is_new:
            w.writerow([
                "timestamp", "symbol", "strike", "expiration", "option_type",
                "helium_predicted_price", "helium_prob_itm", "helium_data_date",
                "market_mark", "realized_underlying_price", "realized_itm",
                "brier_loss",
            ])
        w.writerow([
            # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated
            # since Python 3.12.
            datetime.now(timezone.utc).isoformat(timespec="seconds"),
            symbol, strike, expiration, option_type,
            data.get("predicted_price"), data.get("prob_itm"),
            data.get("options_data_date"),
            "", "", "", "",  # filled in later: mark, realized price, ITM, Brier
        ])
    print(f"Logged {symbol} ${strike} {option_type.upper()} {expiration}: "
          f"predicted={data.get('predicted_price')} prob_itm={data.get('prob_itm')}")


if __name__ == "__main__":
    main(sys.argv[1], float(sys.argv[2]), sys.argv[3], sys.argv[4])

Save as track.py, then:

pip install requests
python track.py AAPL 310 2026-06-26 call
python track.py AAPL 295 2026-06-26 put
python track.py NVDA 220 2026-07-17 call
# repeat for any contracts you want to grade later

The script appends one row per run to calibration_log.csv. Rerun it once a day for the same contracts to capture how the forecast evolves over time.
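If the same contract is logged on several days, the log holds a short time series per contract. The sketch below groups rows by contract so you can see how `prob_itm` moved. `forecast_history` is a hypothetical helper, and the sample rows are made up for illustration; with the real log you would pass `csv.DictReader(open("calibration_log.csv", newline=""))` instead.

```python
import csv
import io
from collections import defaultdict


def forecast_history(rows):
    """Group logged rows by contract; return each contract's prob_itm path, oldest first."""
    history = defaultdict(list)
    for row in rows:
        key = (row["symbol"], float(row["strike"]),
               row["expiration"], row["option_type"])
        history[key].append((row["timestamp"], float(row["helium_prob_itm"])))
    # ISO-8601 timestamps written in one consistent format sort
    # chronologically as plain strings.
    return {key: sorted(points) for key, points in history.items()}


# Made-up rows using the column names from the recipe (other columns omitted).
sample_csv = """timestamp,symbol,strike,expiration,option_type,helium_prob_itm
2026-05-27T14:00:00,AAPL,310.0,2026-06-26,call,0.45
2026-05-26T14:00:00,AAPL,310.0,2026-06-26,call,0.42
2026-05-26T14:05:00,AAPL,295.0,2026-06-26,put,0.31
"""

history = forecast_history(csv.DictReader(io.StringIO(sample_csv)))
for contract, points in history.items():
    print(contract, points)
```

One caveat worth keeping in mind: repeated forecasts for the same contract all resolve on the same outcome, so they are not independent samples. A reasonable choice is to grade a single forecast per contract (the first, say) or to grade each forecast horizon separately.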

Grading the forecast after expiration

At expiration, fill in the realized underlying price and compute Brier loss. For a single contract the Brier loss for the prob_itm forecast is:

brier_loss = (prob_itm - realized_itm) ** 2

where realized_itm is 1 if the contract finished in the money and 0 otherwise. Score every contract you logged, average the losses, and you have a calibration number you can compare across models, weeks, or strike regimes.
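To make the arithmetic concrete, here is a small worked example using only the formula above. The forecast and outcome pairs are made up, and `brier_loss` and `mean_brier` are illustrative names. It also scores a constant 0.5 forecast on the same outcomes, which always comes out at 0.25 and makes a handy reference point.

```python
def brier_loss(prob_itm: float, realized_itm: int) -> float:
    """Squared error between the forecast probability and the 0/1 outcome."""
    return (prob_itm - realized_itm) ** 2


def mean_brier(forecasts) -> float:
    """Average Brier loss over (prob_itm, realized_itm) pairs."""
    losses = [brier_loss(p, outcome) for p, outcome in forecasts]
    return sum(losses) / len(losses)


# Made-up (prob_itm, realized_itm) pairs for illustration.
forecasts = [(0.42, 0), (0.80, 1), (0.10, 0), (0.65, 1)]

model_score = mean_brier(forecasts)
# Reference point: a forecaster who always says 0.5 scores 0.25,
# whatever the outcomes are.
baseline_score = mean_brier([(0.5, outcome) for _, outcome in forecasts])

print(f"model mean Brier:    {model_score:.4f}")
print(f"constant-0.5 Brier:  {baseline_score:.4f}")
```

Lower is better: 0 is a perfect forecast, 1 is a confident forecast that was wrong every time.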

A quick scorer:

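import pandas as pd

df = pd.read_csv("calibration_log.csv")

def realized_itm(row):
    s = float(row["realized_underlying_price"])
    k = float(row["strike"])
    # Strictly in the money: a close exactly at the strike counts as not ITM.
    if str(row["option_type"]).lower() == "call":
        return 1 if s > k else 0
    return 1 if s < k else 0

# read_csv loads blank cells as NaN, so filter with notna(), not != "".
resolved = df[df["realized_underlying_price"].notna()].copy()
if resolved.empty:
    raise SystemExit("No resolved contracts to grade yet.")
resolved["realized_itm"] = resolved.apply(realized_itm, axis=1)
resolved["brier_loss"] = (
    resolved["helium_prob_itm"].astype(float) - resolved["realized_itm"]
) ** 2

print(f"Contracts graded: {len(resolved)}")
print(f"Mean Brier loss: {resolved['brier_loss'].mean():.4f}")
print("Calibration histogram (realized ITM rate per forecast bin):")
# include_lowest keeps a forecast of exactly 0 in the first bin.
bins = pd.cut(
    resolved["helium_prob_itm"].astype(float),
    [0, 0.25, 0.5, 0.75, 1.0],
    include_lowest=True,
)
print(resolved.groupby(bins, observed=False)["realized_itm"].mean())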

The calibration histogram is the part most people skip. A model with a mean Brier loss of 0.18 can still be badly miscalibrated in specific probability bins (overconfident at the extremes, say). The histogram shows where: for a well-calibrated model, the realized ITM rate in each bin should fall inside that bin's probability range.
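The scorer above prints only the realized ITM rate per bin. A slightly fuller reliability table also reports how many contracts fell in each bin and their mean forecast, which makes the gap between forecast and outcome easier to read. The sketch below uses only the standard library; `reliability_table` is a hypothetical helper, the data is made up, and it assumes every probability lies in [0, 1].

```python
def reliability_table(forecasts, edges=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Per bin: count, mean forecast probability, and realized ITM frequency.

    Bins are right-closed, and the first bin also includes its lower edge,
    so a forecast of exactly 0.0 is counted.
    """
    bins = [[] for _ in range(len(edges) - 1)]
    for prob, outcome in forecasts:
        # Place the forecast in the first bin whose upper edge is >= prob.
        for i in range(len(bins)):
            if prob <= edges[i + 1]:
                bins[i].append((prob, outcome))
                break
    table = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins rather than report a rate of 0/0
        n = len(members)
        table.append({
            "bin": (edges[i], edges[i + 1]),
            "n": n,
            "mean_predicted": sum(p for p, _ in members) / n,
            "realized_freq": sum(o for _, o in members) / n,
        })
    return table


# Made-up (prob_itm, realized_itm) pairs for illustration.
forecasts = [(0.10, 0), (0.20, 0), (0.20, 1), (0.60, 1),
             (0.70, 0), (0.90, 1), (0.95, 1)]

table = reliability_table(forecasts)
for row in table:
    print(f"{row['bin']}  n={row['n']}  "
          f"predicted={row['mean_predicted']:.2f}  "
          f"realized={row['realized_freq']:.2f}")
```

Bins with only a handful of contracts are noisy, so the count column is there to stop you from reading too much into a gap based on two or three outcomes.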

Why this is useful

Most quant content compares predicted prices to current prices and stops there. That comparison cannot distinguish "the model is right and the market is wrong" from the reverse, and neither claim can be tested until expiration. Probability-ITM, on the other hand, has an unambiguous resolution: at expiration the underlying either closes on the in-the-money side of the strike (above for a call, below for a put) or it does not.

So prob_itm is the friendliest output to grade. If you want to spend an hour playing with calibration intuition, log forecasts for 50 contracts across a few different expirations, wait for them to resolve, and run the scorer.
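The "wait for them to resolve" step means writing the underlying's closing price into the `realized_underlying_price` column. You can do that by hand in a spreadsheet; the sketch below is one way to script it. `fill_realized` and `update_log` are hypothetical helpers, and the sample rows and the 315.20 close are made up. Where the closing price comes from is up to you.

```python
import csv
import io


def fill_realized(rows, symbol, expiration, close_price):
    """Set realized_underlying_price on every row for this symbol and expiration.

    Returns the number of rows updated.
    """
    updated = 0
    for row in rows:
        if row["symbol"] == symbol and row["expiration"] == expiration:
            row["realized_underlying_price"] = str(close_price)
            updated += 1
    return updated


def update_log(path, symbol, expiration, close_price):
    """Read the CSV log, fill in the realized price, and write the log back."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    updated = fill_realized(rows, symbol, expiration, close_price)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return updated


# In-memory demo with made-up rows (a subset of the recipe's columns).
sample_csv = """symbol,strike,expiration,option_type,helium_prob_itm,realized_underlying_price
AAPL,310.0,2026-06-26,call,0.42,
AAPL,295.0,2026-06-26,put,0.31,
NVDA,220.0,2026-07-17,call,0.55,
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
updated = fill_realized(rows, "AAPL", "2026-06-26", 315.20)  # made-up close
print(f"Updated {updated} rows")
```

With the real log, `update_log("calibration_log.csv", "AAPL", "2026-06-26", close)` would fill both AAPL contracts for that expiration in one pass, since every strike on the same underlying and date resolves on the same closing price.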

Other recipes in the cookbook

The same pattern — one endpoint, one short script, real output — works for the other tools the Helium server exposes:

News-bias dashboard: pull every tracked source's bias profile and rank by overall credibility, fearful bias, emotionality_score, or any other dimension
Balanced-news synthesis: pull multi-source synthesis on any topic with probability-weighted falsifiable outcomes already baked in
Source credibility ranking: top-N and bottom-N sources by credibility, with their emotionality and prescriptiveness alongside
Ticker forecast explorer: pull HTML-stripped bull/bear narrative cases for a watchlist
Top-strategies explorer: pull the daily short-vol and long-vol candidate lists

All six recipes are in the open-source cookbook here:

➡️ github.com/connerlambden/helium-mcp-cookbook

The cookbook is MIT-licensed. Fork it, modify it, write your own recipes. PRs welcome.

If you want MCP instead of REST

The same ten tools are also exposed as a remote MCP server. If you would rather call them from inside Claude Desktop, Cursor, or any MCP-aware client, the config is:

{
  "mcpServers": {
    "helium": {
      "command": "npx",
      "args": ["mcp-remote", "https://heliumtrades.com/mcp"]
    }
  }
}

After a client restart your LLM can call the same tools by name. The Helium repo is at github.com/connerlambden/helium-mcp.

Closing thought

If your model emits probabilities, you should grade them. The friction-free version is a 40-line script and a CSV. The day you put that habit in place is the day your forecasts start improving — not because the model changes, but because you finally have a feedback signal to learn from.
