Earth Data School
Lesson 4.2 · 11 of 17

Ground-truth — checking against the real world

Cross-validation compares two models to each other. Ground-truth does the harder, more honest thing: it compares the satellite/model to actual instruments on the ground — and refuses to grade itself when there aren't enough of them.

In one line: Put the gridded product (ERA5) next to real thermometers and rain gauges. Measure exactly how wrong it is (a bias, an RMSE, a correlation) instead of hand-waving "ERA5 has some error."

Why this is the real frontier

Two models agreeing (cross-validation) is reassuring but circular — they can share inputs and be wrong together. The only way out of the model bubble is to compare against independent in-situ instruments: weather-station gauges from networks like GHCN and ISD (pulled via the meteostat library). Those stations don't define ERA5's grid, so they're a genuinely external yardstick.

Three numbers that replace a hand-wave

  • Bias: the average offset. Does ERA5 run systematically warm or cool? bias = mean(ERA5 − station), so a negative value means ERA5 runs cool.
  • RMSE: typical error size, in the data's own units (°C). Penalises big misses.
  • r: correlation. Does it move with the stations year to year? (1 = perfect, 0 = unrelated.)

A real result reads like: "vs 14 stations / 218 station-years over Germany, ERA5 runs 0.25 °C cool, RMSE 0.90 °C, r = 0.57." That's evidence, not a caveat.
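The three numbers above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's own implementation: the function name `validation_metrics` and the sample values are made up for the example, and it assumes the gridded and station series are already paired one-to-one (same place, same period).

```python
import math

def validation_metrics(gridded, station):
    """Return (bias, rmse, r) for paired gridded and station values.

    Hypothetical helper for illustration. Bias follows the convention
    mean(gridded - station), so a negative bias means the gridded
    product runs cool relative to the stations.
    """
    if len(gridded) != len(station) or len(gridded) < 2:
        raise ValueError("need at least two paired values")
    n = len(gridded)
    diffs = [g - s for g, s in zip(gridded, station)]

    bias = sum(diffs) / n                            # average offset
    rmse = math.sqrt(sum(d * d for d in diffs) / n)  # typical error size

    # Pearson correlation, written out so no third-party library is needed.
    mean_g = sum(gridded) / n
    mean_s = sum(station) / n
    cov = sum((g - mean_g) * (s - mean_s) for g, s in zip(gridded, station))
    var_g = sum((g - mean_g) ** 2 for g in gridded)
    var_s = sum((s - mean_s) ** 2 for s in station)
    # Correlation is undefined if either series is constant.
    r = cov / math.sqrt(var_g * var_s) if var_g > 0 and var_s > 0 else float("nan")
    return bias, rmse, r

# Made-up annual mean temperatures in °C, not real ERA5 or station data.
station_values = [9.1, 10.4, 8.7, 11.2, 9.9]
gridded_values = [8.8, 10.0, 8.9, 10.7, 9.6]

bias, rmse, r = validation_metrics(gridded_values, station_values)
print(f"bias = {bias:+.2f} °C, RMSE = {rmse:.2f} °C, r = {r:.2f}")
```

Bias and RMSE come out in the units of the input (°C here), while r is unitless, so the three are best reported together, as in the example sentence above.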

The discipline that matters: When in-situ coverage is too thin to be meaningful (say central Maharashtra: 2 station-years), the check does not fabricate a number; it returns UNVALIDATED. Refusing to validate is itself a verification result.
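One way to encode that refusal is a coverage gate that runs before any metric is computed. The sketch below is an assumption about how such a gate could look, not the lesson's actual code: the function name, the record layout, and the thresholds `MIN_STATIONS` and `MIN_STATION_YEARS` are all hypothetical, since the lesson does not state its cutoffs.

```python
from collections import namedtuple

# Hypothetical record: one station's value for one year, paired with
# the gridded value for the same place and year (one station-year).
Pair = namedtuple("Pair", "station_id year station_value gridded_value")

# Illustrative thresholds only; the lesson does not state its cutoffs.
MIN_STATIONS = 5
MIN_STATION_YEARS = 30

def ground_truth(pairs, min_stations=MIN_STATIONS,
                 min_station_years=MIN_STATION_YEARS):
    """Report a bias only when in-situ coverage is sufficient."""
    n_stations = len({p.station_id for p in pairs})
    n_station_years = len(pairs)
    coverage = {"stations": n_stations, "station_years": n_station_years}

    # Refuse to grade rather than report a number nobody should trust.
    if n_stations < min_stations or n_station_years < min_station_years:
        return {"status": "UNVALIDATED", "bias": None, **coverage}

    diffs = [p.gridded_value - p.station_value for p in pairs]
    return {"status": "VALIDATED", "bias": sum(diffs) / len(diffs), **coverage}

# Two made-up station-years from a single station: too thin to grade.
thin = [Pair("A", 2001, 27.4, 27.9), Pair("A", 2002, 27.1, 27.3)]
print(ground_truth(thin)["status"])  # prints UNVALIDATED
```

Returning the coverage counts alongside the status keeps the refusal informative: a reader can see how far short the region falls, not just that it failed.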

Play with it

Each dot is a station: its real reading (x) vs what ERA5 said (y). The 1:1 line is "perfect agreement." Add a bias, add scatter, and — crucially — drop the station count and watch it refuse to grade itself.
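Without the interactive plot, the same experiment can be approximated with synthetic data. The sketch below is an illustration under stated assumptions, not the lesson's widget: it invents station readings, adds a chosen bias and Gaussian scatter to mimic the gridded product, and repeats the draw to show how much the estimated bias wanders when the station count is small. All names, the seed, and the parameter values are made up.

```python
import random
import statistics

def simulate_pairs(n_stations, true_bias, scatter, rng):
    """Return n_stations synthetic (station, gridded) pairs in °C."""
    pairs = []
    for _ in range(n_stations):
        station = rng.uniform(5.0, 15.0)  # invented "real" reading
        gridded = station + true_bias + rng.gauss(0.0, scatter)
        pairs.append((station, gridded))
    return pairs

def estimated_bias(pairs):
    """Mean of (gridded - station) over the sample."""
    return statistics.fmean(g - s for s, g in pairs)

def bias_spread(n_stations, true_bias=-0.25, scatter=0.9, trials=500, seed=0):
    """Standard deviation of the bias estimate across repeated draws."""
    rng = random.Random(seed)
    estimates = [
        estimated_bias(simulate_pairs(n_stations, true_bias, scatter, rng))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

# Drop the station count and watch the estimate become unreliable.
for n in (2, 14, 200):
    print(f"{n:>3} stations: bias estimate varies by about ±{bias_spread(n):.2f} °C")
```

With independent Gaussian scatter, the spread of the estimate shrinks roughly as scatter / sqrt(n), which is why a couple of station-years cannot pin down an offset of a few tenths of a degree.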

The honest caveats

  • Stations are sparse and uneven — dense in Europe/US, thin across much of the Global South. Coverage drives whether you can validate at all.
  • A station is a point; a grid cell is an area — some mismatch is expected and isn't all "error."
  • This pillar has only just been started: it turns "trust me" into a measured number where coverage allows, and an honest "unvalidated" where it doesn't. Extending that to proven correctness at scale is the ongoing work.