The cage: the LLM never touches the number
In most "AI over data" tools the model writes the answer — which is exactly how you get confident, wrong numbers. Here the language model has one job: translate your question into a typed EarthQuery (a variable from a fixed list, a region, a year range, an operation). It picks from enums; it can't emit a value. Audited Python computes the number, deterministically — same query, same answer, every time. The model can be creative about understanding the question and still can't invent the result.
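As a rough illustration of what "picks from enums" can mean in practice, here is a minimal sketch. The names `Variable`, `Operation`, `parse_model_output` and the exact fields of `EarthQuery` are assumptions made for this example, not the project's actual definitions. The point it shows: the model's output is coerced into a closed set of choices, and there is no field where a result could be smuggled in.

```python
from dataclasses import dataclass
from enum import Enum


class Variable(Enum):
    # Hypothetical fixed list; the real one is not given in the text.
    TEMPERATURE = "temperature"
    PRECIPITATION = "precipitation"


class Operation(Enum):
    # Hypothetical operations.
    MEAN = "mean"
    SUM = "sum"
    TREND = "trend"


@dataclass(frozen=True)
class EarthQuery:
    variable: Variable
    region: str
    start_year: int
    end_year: int
    operation: Operation
    # Deliberately no "value" or "answer" field: the query describes
    # what to compute, never the result.


ALLOWED_FIELDS = {"variable", "region", "start_year", "end_year", "operation"}


def parse_model_output(raw: dict) -> EarthQuery:
    """Coerce the model's output into a typed query, or raise ValueError."""
    extra = set(raw) - ALLOWED_FIELDS
    if extra:
        # Anything outside the schema (e.g. a guessed number) is rejected.
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    missing = ALLOWED_FIELDS - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return EarthQuery(
        variable=Variable(raw["variable"]),    # ValueError if not in the enum
        region=str(raw["region"]),
        start_year=int(raw["start_year"]),
        end_year=int(raw["end_year"]),
        operation=Operation(raw["operation"]),  # ValueError if not in the enum
    )
```

With this shape, a model that tries to answer directly (say, by adding `"value": 1.7`) produces a parse error rather than a number on screen.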
The validator: refuse, clarify, or split
Before any compute runs, the typed query is validated. If it's ill-posed, the system says so:
- Too-short trend: a slope fitted over fewer than about 10 years is noise dressed as a trend → refuse.
- Ambiguous region: "the valley" could be anywhere → clarify, don't guess.
- Two questions in one: "hotter and drier?" mixes two variables → split into two answers.
- Ill-posed season or operation: a season that isn't a real one, a sum where only a mean makes sense, and the like → refuse.
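The rules above can be sketched as a small pure function that runs before any compute. This is a minimal sketch under stated assumptions: the `Query` shape, the `Verdict` names, the order of the checks, the hard threshold of 10 years (the text only says "~10"), and the idea that an unresolved region arrives as `None` are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

MIN_TREND_YEARS = 10                     # assumed cut-off for "~10 years"
MEAN_ONLY = frozenset({"temperature"})   # variables where a sum is meaningless


class Verdict(Enum):
    ACCEPT = "accept"
    REFUSE = "refuse"
    CLARIFY = "clarify"
    SPLIT = "split"


@dataclass(frozen=True)
class Query:
    variables: Tuple[str, ...]
    region: Optional[str]          # None: place name could not be resolved uniquely
    start_year: int
    end_year: int
    operation: str                 # "mean", "sum" or "trend"
    months: Tuple[int, ...] = ()   # season as month numbers; empty = whole year


def is_contiguous_season(months: Tuple[int, ...]) -> bool:
    """True if months form one unbroken run, allowing wrap-around (Dec-Jan-Feb)."""
    chosen = set(months)
    if not chosen or len(chosen) == 12:
        return True
    if any(m < 1 or m > 12 for m in chosen):
        return False
    # A single cyclic run has exactly one month whose predecessor is absent.
    starts = [m for m in chosen if ((m - 2) % 12) + 1 not in chosen]
    return len(starts) == 1


def validate(q: Query) -> Tuple[Verdict, str]:
    if not q.variables:
        return Verdict.REFUSE, "no variable given"
    if q.region is None:
        return Verdict.CLARIFY, "region is ambiguous; ask which place is meant"
    if len(q.variables) > 1:
        return Verdict.SPLIT, "answer each variable separately"
    if not is_contiguous_season(q.months):
        return Verdict.REFUSE, "months do not form a real season"
    if q.operation == "sum" and q.variables[0] in MEAN_ONLY:
        return Verdict.REFUSE, "a sum is meaningless for this variable"
    n_years = q.end_year - q.start_year + 1   # inclusive year range
    if q.operation == "trend" and n_years < MIN_TREND_YEARS:
        return Verdict.REFUSE, f"{n_years}-year window is too short for a trend"
    return Verdict.ACCEPT, "ok"
```

For instance, `validate(Query(("temperature",), "Po Valley", 2019, 2022, "trend"))` yields a refusal with a reason string, which is the kind of message a refusal card could display.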
Play with it
Pose a question by setting its shape. Watch the validator accept it, ask to clarify, split it, or refuse — exactly the behaviour behind the refusal cards on /verify.
It's tested by trying to break it
Discipline you can't measure is just a promise. There's an adversarial eval suite whose only job is to make the system lie, and the system currently resists every case in it:
- a 4-year "trend" (too short) · a sum of temperature (meaningless) · a scattered fake "season"
- a below-noise-floor effect dressed up as a signal · "evidence" from a single pixel · an ocean box with no forest to analyse
Each must fail loudly — a clear refusal, not a plausible number. A trust tool that fails silently is worse than no tool.
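One way to make "fail loudly" checkable is to give refusals their own type and have the suite flag any adversarial case that comes back as anything else. The sketch below is illustrative only: `toy_system` is a stand-in for the real pipeline, and its thresholds, the `Case` fields, and the case list are assumptions, not the project's actual suite.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Answer:
    value: float


@dataclass(frozen=True)
class Refusal:
    reason: str


@dataclass(frozen=True)
class Case:
    name: str
    n_years: int = 30
    n_pixels: int = 500
    effect: float = 1.0
    noise_floor: float = 0.2


def toy_system(case: Case) -> Union[Answer, Refusal]:
    """Stand-in pipeline; every threshold here is an illustrative assumption."""
    if case.n_years < 10:
        return Refusal("trend window too short")
    if case.n_pixels == 0:
        return Refusal("no valid pixels in the region")
    if case.n_pixels < 2:
        return Refusal("a single pixel is not evidence")
    if abs(case.effect) < case.noise_floor:
        return Refusal("effect is below the noise floor")
    return Answer(case.effect)


ADVERSARIAL_CASES = [
    Case("4-year trend", n_years=4),
    Case("below-noise-floor effect", effect=0.05),
    Case("single pixel", n_pixels=1),
    Case("ocean box, no forest", n_pixels=0),
]


def silent_failures(
    system: Callable[[Case], Union[Answer, Refusal]],
    cases: List[Case],
) -> List[str]:
    """Names of adversarial cases that did NOT produce an explicit Refusal."""
    return [c.name for c in cases if not isinstance(system(c), Refusal)]


if __name__ == "__main__":
    failures = silent_failures(toy_system, ADVERSARIAL_CASES)
    print("silent failures:", failures)
```

The design choice worth noting is that a plausible number counts as a failure here: the harness passes only when every adversarial case returns a `Refusal`, so a regression that starts answering shows up by name.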