Ground-truth: checking against the real world

In one line

Put the gridded product (ERA5) next to real thermometers and rain gauges. Measure exactly how wrong it is (a bias, an RMSE, a correlation) instead of hand-waving "ERA5 has some error."

Why this is the real frontier

Two models agreeing (cross-validation) is reassuring but circular — they can share inputs and be wrong together. The only way out of the model bubble is to compare against independent in-situ instruments: weather-station gauges from networks like GHCN and ISD (pulled via the meteostat library). Those stations don't define ERA5's grid, so they're a genuinely external yardstick.
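To make "compare against stations" concrete, here is a minimal sketch of the co-location step: finding the grid cell whose centre is closest to a station. It assumes a regular latitude/longitude grid; the helper name nearest_cell, the grid extent, and the station coordinates are all hypothetical and chosen only for illustration.

```python
import numpy as np

def nearest_cell(grid_lats, grid_lons, lat, lon):
    """Return (i, j) indices of the grid-cell centre closest to a station.

    Assumes a regular lat/lon grid given as 1-D arrays of cell centres.
    """
    i = int(np.argmin(np.abs(grid_lats - lat)))
    # Wrap longitude differences into [-180, 180) so that grids stored
    # as 0..360 still match stations given as -180..180.
    dlon = (grid_lons - lon + 180.0) % 360.0 - 180.0
    j = int(np.argmin(np.abs(dlon)))
    return i, j

# Illustrative 0.25-degree grid roughly covering Germany (an assumption).
grid_lats = np.arange(47.0, 55.25, 0.25)
grid_lons = np.arange(5.0, 15.25, 0.25)

# Hypothetical station location, for illustration only.
i, j = nearest_cell(grid_lats, grid_lons, lat=52.47, lon=13.40)
print("station pairs with the cell centred at", grid_lats[i], grid_lons[j])
```

Nearest-cell matching is the simplest choice; interpolating between neighbouring cells is a common alternative. Either way the station remains a point and the cell an area, so some disagreement is expected.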

Three numbers that replace a hand-wave

Bias: the average offset. Does ERA5 run systematically warm or cool? bias = mean(ERA5 − station)
RMSE: typical error size, in the data's own units (°C). Squaring the errors penalises big misses.
r: the correlation. Does it move with the stations year to year? (1 = perfect, 0 = unrelated.)
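Written out, the three numbers look like this. The notation is introduced here for illustration and is not from the original page: e_i is the ERA5 value and s_i the station value for matched pair i, N is the number of pairs, and bars denote means over the pairs.

```latex
\begin{align}
\text{bias} &= \frac{1}{N}\sum_{i=1}^{N} (e_i - s_i) \\
\text{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N} (e_i - s_i)^2} \\
r &= \frac{\sum_{i=1}^{N} (e_i - \bar{e})(s_i - \bar{s})}
          {\sqrt{\sum_{i=1}^{N} (e_i - \bar{e})^2 \, \sum_{i=1}^{N} (s_i - \bar{s})^2}}
\end{align}
```

One useful consequence: with d_i = e_i − s_i, RMSE² equals bias² plus the variance of the d_i, so RMSE can never be smaller than |bias|. A small bias alongside a large RMSE means the misses are mostly scatter rather than a systematic offset.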

A real result reads like: "vs 14 stations / 218 station-years over Germany, ERA5 runs 0.25 °C cool, RMSE 0.90, r = 0.57." That's evidence, not a caveat.

The discipline that matters

When in-situ coverage is too thin to be meaningful (say central Maharashtra: 2 station-years), the check does not fabricate a number; it returns UNVALIDATED. Refusing to validate is itself a verification result.
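A minimal sketch of such a gate, under stated assumptions: the function name grade, the "VALIDATED" status label, and the threshold of 30 matched pairs are hypothetical choices for illustration, not the values the real check uses.

```python
import numpy as np

MIN_PAIRS = 30  # hypothetical threshold; pick one that suits your region

def grade(era5, station, min_pairs=MIN_PAIRS):
    """Return bias/RMSE/r for matched pairs, or an UNVALIDATED verdict."""
    era5 = np.asarray(era5, dtype=float)
    station = np.asarray(station, dtype=float)
    # Keep only pairs where both the grid value and the station value exist.
    ok = ~(np.isnan(era5) | np.isnan(station))
    n = int(ok.sum())
    if n < min_pairs:
        # Too little evidence: report that, rather than a number.
        return {"status": "UNVALIDATED", "n": n}
    diff = era5[ok] - station[ok]
    return {
        "status": "VALIDATED",
        "n": n,
        "bias": float(diff.mean()),
        "rmse": float(np.sqrt((diff ** 2).mean())),
        "r": float(np.corrcoef(era5[ok], station[ok])[0, 1]),
    }

# Two matched values (made-up numbers) are far below the threshold.
print(grade([25.1, 26.0], [25.0, 26.4]))
```

Returning a status alongside the count keeps the refusal visible downstream: a caller cannot mistake a missing metric for a good one.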

Play with it

Each dot is a station: its real reading (x) vs what ERA5 said (y). The 1:1 line is "perfect agreement." Add a bias, add scatter, and — crucially — drop the station count and watch it refuse to grade itself.

Controls: Stations, ERA5 bias (°C), Scatter

Do it yourself

import numpy as np
rng = np.random.default_rng(2)
MIN_STATIONS = 5
n_stations = 8                                   # try 3 -> it refuses to grade at all
if n_stations < MIN_STATIONS:
    print("UNVALIDATED - too few gauges to grade. Refuse, do not fabricate.")
else:
    truth = rng.normal(26, 3, 60)                # the real temperatures
    obs = truth + rng.normal(0, 0.3, 60)         # what the ground stations measure
    sat = truth + 0.8 + rng.normal(0, 0.7, 60)   # the gridded product, e.g. ERA5 (0.8 degC warm bias)
    bias = float(np.mean(sat - obs))
    rmse = float(np.sqrt(np.mean((sat - obs) ** 2)))
    r = float(np.corrcoef(sat, obs)[0, 1])
    print("vs", n_stations, "stations: bias", round(bias, 2), "degC, RMSE", round(rmse, 2), ", r =", round(r, 2))
    print(">>>", "good agreement" if abs(bias) < 1 and r > 0.9 else "check the bias before you trust it")

The honest caveats

Stations are sparse and uneven — dense in Europe/US, thin across much of the Global South. Coverage drives whether you can validate at all.
A station is a point; a grid cell is an area — some mismatch is expected and isn't all "error."
This pillar has only just been started: it turns "trust me" into a measured number where coverage allows, and an honest "unvalidated" where it doesn't. Turning that into proven correctness at scale is the ongoing work.