Refusal discipline: a trust tool must fail loudly

In one line: The AI is caged. It can only pick a typed query from fixed options, so it structurally cannot output a number and therefore cannot hallucinate one. A validator then rejects questions that can't be answered honestly, instead of faking an answer.

The cage: the LLM never touches the number

In most "AI over data" tools the model writes the answer — which is exactly how you get confident, wrong numbers. Here the language model has one job: translate your question into a typed EarthQuery (a variable from a fixed list, a region, a year range, an operation). It picks from enums; it can't emit a value. Audited Python computes the number, deterministically — same query, same answer, every time. The model can be creative about understanding the question and still can't invent the result.
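A minimal sketch of what such a typed query could look like, assuming Python enums and a frozen dataclass. The enum members, field names, the region string, and the `parse_model_choice` helper are hypothetical illustrations, not the tool's actual schema. Only the `2m_temperature` variable name appears elsewhere on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Variable(Enum):
    # Hypothetical fixed list; only "2m_temperature" appears on this page.
    TEMPERATURE_2M = "2m_temperature"
    TOTAL_PRECIPITATION = "total_precipitation"


class Operation(Enum):
    # Hypothetical operations the compute layer might support.
    MEAN = "mean"
    TREND = "trend"


@dataclass(frozen=True)
class EarthQuery:
    variable: Variable
    region: str
    year_range: Tuple[int, int]
    operation: Operation


def parse_model_choice(raw: dict) -> EarthQuery:
    """Build a query from the model's picks; unknown options raise ValueError."""
    return EarthQuery(
        variable=Variable(raw["variable"]),
        region=raw["region"],
        year_range=(int(raw["start_year"]), int(raw["end_year"])),
        operation=Operation(raw["operation"]),
    )


query = parse_model_choice(
    {"variable": "2m_temperature", "region": "Iberia",
     "start_year": 1990, "end_year": 2020, "operation": "trend"}
)
print(query.operation.value, query.variable.value, query.year_range)

try:
    # The model tries to smuggle in a value where an option is expected.
    parse_model_choice(
        {"variable": "1.7 degrees", "region": "Iberia",
         "start_year": 1990, "end_year": 2020, "operation": "trend"}
    )
except ValueError as err:
    print("rejected:", err)
```

The point of the design is that the query type has no field that could carry a result: the model's output can only select among options, and anything outside the enums is rejected before compute runs.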

The validator: refuse, clarify, or split

Before any compute runs, the typed query is validated. If it's ill-posed, the system says so:

Too-short trend — a slope over < ~10 years is noise dressed as a trend → refuse.
Ambiguous region — "the valley" could be anywhere → clarify, don't guess.
Two questions in one — "hotter and drier?" mixes two variables → split into two answers.
Not a real season, a sum where only a mean makes sense, etc. → refuse.
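The rules above can be sketched as a single function that returns a verdict before any compute runs. This is an illustration under stated assumptions: the `Verdict` enum, the `KNOWN_REGIONS` set, the variable names other than `2m_temperature`, and the order in which the checks run are all hypothetical, not the system's actual validator.

```python
from enum import Enum

MIN_TREND_YEARS = 10  # the roughly 10-year floor described above
KNOWN_REGIONS = {"iberia", "sahel"}  # hypothetical region list


class Verdict(Enum):
    ACCEPT = "accept"
    CLARIFY = "clarify"
    SPLIT = "split"
    REFUSE = "refuse"


def validate(variables, region, year_range, operation):
    """Return (verdict, reason) for a query shape, before any compute runs."""
    start, end = year_range
    if operation == "trend" and end - start < MIN_TREND_YEARS:
        return Verdict.REFUSE, f"trend span {end - start}y < {MIN_TREND_YEARS}y"
    if operation == "sum" and "2m_temperature" in variables:
        return Verdict.REFUSE, "a sum of temperature is meaningless; use a mean"
    if region.lower() not in KNOWN_REGIONS:
        return Verdict.CLARIFY, f"which region is {region!r}?"
    if len(variables) > 1:
        return Verdict.SPLIT, "answer each variable separately"
    return Verdict.ACCEPT, "well-posed"


print(validate(["2m_temperature"], "Iberia", (1990, 2020), "trend"))
print(validate(["2m_temperature"], "the valley", (1990, 2020), "trend"))
print(validate(["2m_temperature", "total_precipitation"], "Iberia", (1990, 2020), "trend"))
print(validate(["2m_temperature"], "Iberia", (2018, 2021), "trend"))
```

Refusals are checked first in this sketch so that an ill-posed question is never "rescued" by a clarification prompt; the real ordering may differ.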

Play with it

Pose a question by setting its shape. Watch the validator accept it, ask to clarify, split it, or refuse — exactly the behaviour behind the refusal cards on /verify.


It's tested by trying to break it

Discipline you can't measure is just a promise. There's an adversarial eval suite whose only job is to make the system lie, and the system currently refuses every case in it:

a 4-year "trend" (too short) · a sum of temperature (meaningless) · a scattered fake "season"
a below-noise-floor effect dressed up as a signal · "evidence" from a single pixel · an ocean box with no forest to analyse

Each must fail loudly — a clear refusal, not a plausible number. A trust tool that fails silently is worse than no tool.
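A minimal sketch of how such a suite could be structured, assuming each adversarial case is a query that must raise rather than return. The `Refusal` exception, the toy `check` validator, and the case dictionary are hypothetical and cover only two of the cases listed above; the real suite is not shown on this page.

```python
MIN_TREND_YEARS = 10


class Refusal(Exception):
    """Raised when a query is ill-posed; the caller shows it as a refusal."""


def check(query):
    # Toy validator for this sketch only.
    start, end = query["time_range"]
    if query["operation"] == "trend" and end - start < MIN_TREND_YEARS:
        raise Refusal("trend span too short")
    if query["operation"] == "sum" and query["variable"] == "2m_temperature":
        raise Refusal("sum of temperature is meaningless")
    return query


ADVERSARIAL_CASES = {
    "4-year trend": {
        "variable": "2m_temperature", "operation": "trend", "time_range": (2016, 2020),
    },
    "sum of temperature": {
        "variable": "2m_temperature", "operation": "sum", "time_range": (1990, 2020),
    },
}


def run_suite(cases):
    """Return the names of cases that were NOT refused (silent failures)."""
    leaked = []
    for name, query in cases.items():
        try:
            check(query)
        except Refusal:
            continue  # loud failure: this is the desired outcome
        leaked.append(name)
    return leaked


leaked = run_suite(ADVERSARIAL_CASES)
print("cases that slipped through:", leaked)
```

The suite passes only when the returned list is empty: a case that produces any answer at all counts as a failure, however plausible the answer looks.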

Do it yourself


# The validator is the trust gate: it raises on an ill-posed query,
# and the agent surfaces that as a refusal.
MIN_TREND_YEARS = 10

def make_query(variable, time_range):
    """Return a query dict, or raise ValueError if the trend span is too short."""
    y0, y1 = time_range
    span = y1 - y0
    if span < MIN_TREND_YEARS:
        raise ValueError(f"trend span {span}y < {MIN_TREND_YEARS}y - refuse rather than mislead")
    return {"variable": variable, "time_range": time_range}

print("valid query :", make_query("2m_temperature", (1990, 2020)))
try:
    make_query("2m_temperature", (2018, 2021))  # only 3 years -> ill-posed
except ValueError as e:
    print(">>> REFUSED:", e)

Where the honesty ends: Even a well-formed, computed answer is labelled "computed," not "verified." "Verified" is reserved for results a domain scientist has signed off. The refusals, the cage, and the cross-checks get you to trustworthy-by-construction; a human still draws the final line.
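One way to picture the "computed" versus "verified" distinction is a result type whose label depends only on whether a sign-off is recorded. This is a sketch under assumptions: the `Answer` class, its fields, and the reviewer string are hypothetical, and the value shown is a placeholder, not a real result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Answer:
    value: float
    signed_off_by: Optional[str] = None  # hypothetical field for a reviewer's name

    @property
    def label(self) -> str:
        # Only a recorded human sign-off upgrades the label.
        return "verified" if self.signed_off_by else "computed"


draft = Answer(value=0.0)  # placeholder value, not a real result
reviewed = Answer(value=0.0, signed_off_by="domain scientist")
print(draft.label, reviewed.label)
```

Because the label is derived rather than set directly, nothing in the compute path can mark its own output as verified.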