Why this is the real frontier
Two models agreeing (cross-validation) is reassuring but circular: they can share inputs and be wrong together. The only way out of the model bubble is to compare against independent in-situ instruments, meaning weather-station observations from networks like GHCN and ISD (pulled via the meteostat library). Those stations don't define ERA5's grid, so they're a genuinely external yardstick.
Three numbers that replace a hand-wave
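- Bias: the average offset. Does ERA5 run systematically warm or cool? bias = mean(ERA5 − station), so a negative value means ERA5 runs cool.
- RMSE: root-mean-square error, the typical error size in the data's own units (°C). Squaring before averaging penalises big misses.
- r: Pearson correlation. Does ERA5 move with the stations year to year? (1 = perfect co-variation, 0 = unrelated; a constant offset doesn't change r, which is why bias is reported separately.)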
A real result reads like: "vs 14 stations / 218 station-years over Germany, ERA5 runs 0.25 °C cool, RMSE 0.90 °C, r = 0.57." That's evidence, not a caveat.
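Written out, the three numbers look roughly like this. The notation is an assumption made for illustration, not something the original defines: $e_i$ is the ERA5 value and $s_i$ the station value for the $i$-th matched pair (for example one station-year), $n$ is the number of pairs, and a bar denotes the mean over all pairs.

```latex
\begin{align}
\text{bias} &= \frac{1}{n}\sum_{i=1}^{n} (e_i - s_i) \\
\text{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (e_i - s_i)^2} \\
r &= \frac{\sum_{i=1}^{n} (e_i - \bar{e})(s_i - \bar{s})}
          {\sqrt{\sum_{i=1}^{n} (e_i - \bar{e})^2}\,\sqrt{\sum_{i=1}^{n} (s_i - \bar{s})^2}}
\end{align}
```

With this sign convention, "ERA5 runs 0.25 °C cool" corresponds to a bias of −0.25 °C.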
Play with it
Each dot is a station: its real reading (x) vs what ERA5 said (y). The 1:1 line is "perfect agreement." Add a bias, add scatter, and, crucially, drop the station count and watch the validation refuse to grade itself.
Do it yourself
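A minimal sketch of the scoring step, using only the Python standard library. It assumes you have already matched each station reading to the ERA5 value for the same place and period (the station side could come from the meteostat library, but loading is left out here because it needs network access). The function name `validate`, the `(station_id, station_value, era5_value)` tuple layout, and the `MIN_STATIONS` threshold of 5 are all hypothetical choices for illustration, not part of any library.

```python
import math

# Hypothetical cut-off: below this many distinct stations, refuse to score.
MIN_STATIONS = 5


def validate(pairs, min_stations=MIN_STATIONS):
    """Score ERA5 against stations.

    pairs: list of (station_id, station_value, era5_value), values in °C.
    Returns bias, RMSE and Pearson r, or an explicit "unvalidated" status
    when too few stations (or fewer than two pairs) are available.
    """
    stations = {sid for sid, _, _ in pairs}
    n = len(pairs)
    if len(stations) < min_stations or n < 2:
        return {"status": "unvalidated", "stations": len(stations), "n": n}

    obs = [s for _, s, _ in pairs]   # station readings
    mod = [e for _, _, e in pairs]   # ERA5 values
    diffs = [e - s for s, e in zip(obs, mod)]  # ERA5 minus station

    bias = sum(diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)

    # Pearson correlation, computed by hand to avoid extra dependencies.
    mean_obs = sum(obs) / n
    mean_mod = sum(mod) / n
    cov = sum((o - mean_obs) * (m - mean_mod) for o, m in zip(obs, mod))
    ss_obs = sum((o - mean_obs) ** 2 for o in obs)
    ss_mod = sum((m - mean_mod) ** 2 for m in mod)
    if ss_obs > 0 and ss_mod > 0:
        r = cov / math.sqrt(ss_obs * ss_mod)
    else:
        r = float("nan")  # correlation is undefined for a constant series

    return {
        "status": "validated",
        "stations": len(stations),
        "n": n,
        "bias": bias,
        "rmse": rmse,
        "r": r,
    }


# Made-up numbers for illustration only: six stations, ERA5 0.5 °C cool.
demo = [(f"S{i}", 8.0 + i, 7.5 + i) for i in range(6)]
print(validate(demo))
print(validate(demo[:2]))  # too few stations: reported as unvalidated
```

Returning an explicit "unvalidated" status rather than a number is the point of the threshold: with only a couple of stations the metrics would still compute, but they would not mean much.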
The honest caveats
- Stations are sparse and uneven — dense in Europe/US, thin across much of the Global South. Coverage drives whether you can validate at all.
- A station is a point; a grid cell is an area — some mismatch is expected and isn't all "error."
- This pillar has only just started: it turns "trust me" into a measured number where coverage allows, and an honest "unvalidated" where it doesn't. Turning that into proven-correct at scale is the ongoing work.