ARCO, Zarr, GeoParquet, GeoArrow & DataFusion

How do you analyse a petabyte from a laptop? You don't download it — you read only the bytes you need, straight from the cloud. This is the stack that makes that possible, and why each piece exists.

In one line: Store the data chunked + columnar in the cloud, describe it so a reader can fetch just the slice it wants, and run the query next to the data. That's the whole game.

The old pain

Classic Earth data ships as big monolithic files (HDF, NetCDF). To get a 30-year rainfall series for one Indian state you'd download entire global files — tens of gigabytes — then throw 99% away after subsetting locally. Slow, expensive, laptop-melting. The whole modern stack exists to delete that step.

ARCO — the philosophy

Analysis-Ready, Cloud-Optimized. Not a format — a goal: lay data out in cloud object storage (S3/GCS) so a tool can read an arbitrary sub-region by fetching only the relevant bytes over HTTP, no full download, no server. "Analysis-ready" = already cleaned/gridded so you compute, not wrangle. ARCO is realised by the formats below.
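
"Fetching only the relevant bytes over HTTP" usually comes down to a Range header on an ordinary GET request. The sketch below only builds such a request without sending it; the URL, offset, and length are made-up values for illustration, and real readers work out the byte positions from the format's own metadata.

```python
from urllib.request import Request

def range_request(url: str, offset: int, length: int) -> Request:
    """Build (but do not send) a GET request for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})

# Hypothetical object URL and byte positions, for illustration only.
req = range_request("https://example.com/bucket/data.bin", offset=1_048_576, length=65_536)
range_header = req.get_header("Range")
print(range_header)
```

A server that supports range requests answers with only those 64 KiB, which is the mechanism the formats below build on.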

Zarr — chunked arrays (for grids/cubes)

For N-dimensional arrays (time × lat × lon, the shape of ERA5, MODIS, climate cubes). Zarr splits the big array into a grid of chunks, compresses each, and stores every chunk as a separate object in the cloud, plus a little JSON describing the layout. Want one region for one decade? The reader computes which chunks overlap and fetches only those: typically megabytes, not gigabytes. The ARCO-ERA5 store this site's trends use, a cloud-optimized copy of ECMWF's ERA5 reanalysis, is Zarr.
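
The "which chunks overlap" step is plain integer arithmetic. Here is a minimal sketch of the idea, not Zarr's actual implementation; the array shape, chunk shape, and selection are hypothetical numbers chosen for illustration.

```python
from itertools import product

def chunks_for_selection(selection, chunk_shape):
    """Return the chunk grid indices that overlap a selection.

    selection:   one (start, stop) index pair per dimension, stop exclusive
    chunk_shape: chunk length along each dimension
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size       # chunk holding the first selected index
        last = (stop - 1) // size   # chunk holding the last selected index
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))

# Hypothetical cube: 3650 daily steps x 720 lat x 1440 lon, chunked 365 x 90 x 90.
array_shape = (3650, 720, 1440)
chunk_shape = (365, 90, 90)

total_chunks = 1
for n, c in zip(array_shape, chunk_shape):
    total_chunks *= -(-n // c)  # ceiling division: chunks along this dimension

# Two years of data over a small lat/lon window (index ranges, stop exclusive).
needed = chunks_for_selection([(0, 730), (250, 300), (1000, 1100)], chunk_shape)
print(f"{len(needed)} of {total_chunks} chunks")
```

With these numbers the selection touches 8 of 1280 chunks, well under 1% of the array. How small that fraction gets depends on how well the chunk shape matches the query shape.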

GeoParquet — columnar tables (for vectors/points)

Not everything is a grid. Footprints, station points, polygons, STAC catalog entries: those are tables with a geometry column. Parquet is the cloud-standard columnar table format: it stores each column together (so reading 2 of 40 columns touches roughly 5% of the file, assuming the columns are similar in size) and keeps per-chunk (row group) min/max statistics so a query can skip whole blocks. GeoParquet is Parquet plus a standard for the geometry column + coordinate-system metadata. Call geopandas.read_parquet and you get a GeoDataFrame.
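
The "2 of 40 columns" arithmetic can be checked with a toy columnar layout. This is a sketch in plain Python rather than real Parquet: the table, the column names, and the read_columns helper are all hypothetical, and real files add compression and encodings that change the exact ratio.

```python
import struct

NUM_ROWS = 1_000
# Hypothetical table: 40 float64 columns named col_00 .. col_39.
table = {f"col_{i:02d}": [float(i + r) for r in range(NUM_ROWS)] for i in range(40)}

# Columnar layout: each column is serialized as one contiguous block of bytes.
column_blocks = {
    name: struct.pack(f"<{NUM_ROWS}d", *values) for name, values in table.items()
}
file_size = sum(len(block) for block in column_blocks.values())

def read_columns(names):
    """Decode only the requested columns and report how many bytes were touched."""
    touched = sum(len(column_blocks[n]) for n in names)
    data = {n: list(struct.unpack(f"<{NUM_ROWS}d", column_blocks[n])) for n in names}
    return data, touched

data, touched = read_columns(["col_03", "col_17"])
fraction = touched / file_size
print(f"{touched} of {file_size} bytes ({fraction:.0%})")
```

A row-oriented file would interleave all 40 values of every row, so the same two-column read would have to scan everything.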

GeoArrow — the in-memory layout

Parquet is how geo tables rest on disk; Apache Arrow is how they live in memory — a columnar layout every modern tool shares, so data moves between them zero-copy (no re-parsing). GeoArrow is the agreed way to put geometry into Arrow. The pairing: GeoParquet on disk ⇄ GeoArrow in memory, no expensive conversion at the boundary.
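
To give a feel for what "geometry in a columnar layout" means, here is a rough sketch using only the standard library. It mimics the general idea of one flat coordinate buffer plus an offsets buffer; it is not the GeoArrow specification or any library's API, and the coordinates and the linestring_view helper are invented for illustration.

```python
from array import array

# Two hypothetical linestrings: one with 3 vertices, one with 2.
# All coordinates sit in one contiguous buffer, interleaved as x, y, x, y, ...
coords = array("d", [74.0, 17.0, 75.5, 18.2, 77.0, 19.0, 80.0, 22.0, 79.0, 21.0])
# offsets[i] .. offsets[i + 1] is the vertex range of geometry i.
offsets = array("i", [0, 3, 5])

def linestring_view(i):
    """Return a view of geometry i's interleaved coordinates without copying them."""
    start, stop = offsets[i], offsets[i + 1]
    return memoryview(coords)[2 * start : 2 * stop]

second = linestring_view(1)
shares_buffer = second.obj is coords  # the view points at the original buffer
values = second.tolist()              # copy out only when plain Python floats are needed
vertices = list(zip(values[0::2], values[1::2]))
print(vertices, shares_buffer)
```

Because a geometry is just a pair of offsets into a shared buffer, handing it to another tool means passing a pointer and a length rather than re-parsing text or WKB.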

DataFusion — the query engine

Given Arrow/Parquet data, you want to ask questions of it fast. DataFusion is a high-performance query engine (Rust, built on Arrow) that runs SQL over Parquet/Arrow and pushes filters down — using those per-chunk statistics to scan only the blocks that can match. Projects like zarr-datafusion-search point DataFusion at Zarr/STAC so you can SQL-query "which granules cover this box and time" without crawling the whole catalog.
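
Filter pushdown with min/max statistics can be sketched in a few lines. This is a toy model of the idea, not DataFusion's implementation; the blocks, ids, and helper names are hypothetical.

```python
def make_block(rows):
    """Bundle rows with the min/max statistics of their `year` column."""
    years = [r["year"] for r in rows]
    return {"rows": rows, "year_min": min(years), "year_max": max(years)}

# Hypothetical catalog: three blocks of rows, roughly sorted by year.
blocks = [
    make_block([{"id": "a", "year": 2017}, {"id": "b", "year": 2018}]),
    make_block([{"id": "c", "year": 2019}, {"id": "d", "year": 2020}]),
    make_block([{"id": "e", "year": 2021}, {"id": "f", "year": 2022}]),
]

def query_year(blocks, year):
    """Return matching ids and the number of blocks whose rows were actually read."""
    scanned, ids = 0, []
    for block in blocks:
        if year < block["year_min"] or year > block["year_max"]:
            continue  # statistics prove no row can match, so the rows are never read
        scanned += 1
        ids.extend(r["id"] for r in block["rows"] if r["year"] == year)
    return ids, scanned

ids, scanned = query_year(blocks, 2020)
print(ids, f"scanned {scanned} of {len(blocks)} blocks")
```

The pruning only pays off when the data is reasonably sorted or clustered on the filtered column. If every block spans the full range of years, no block can be skipped.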

How they fit together

cloud objects (Zarr chunks · GeoParquet files) → Arrow / GeoArrow (in memory, zero-copy) → DataFusion / xarray (query, filter-pushdown) → your analysis

That chain is why a verified trend for one region computes in seconds from a laptop (or a tiny server): the reader pulls a handful of Zarr chunks, lands them in Arrow, reduces them — never touching the petabytes it didn't ask for.

Do it yourself

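# Zarr: open a cloud array lazily, subset, THEN load (only the overlapping chunks are fetched)
import xarray as xr  # reading gs:// paths also needs the gcsfs package installed
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg.zarr", chunks={})
# slice(22, 17) assumes latitude is stored north to south, as in ERA5; swap the bounds if it is ascending
box = ds["2m_temperature"].sel(latitude=slice(22,17), longitude=slice(74,80))
# "YE" is the year-end alias in recent pandas; older versions use "Y"
series = box.mean(["latitude","longitude"]).resample(time="YE").mean().load()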

# GeoParquet: read a geo table, only the columns you need
import geopandas as gpd
gdf = gpd.read_parquet("s3://.../footprints.parquet", columns=["id","geometry"])

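# DataFusion: SQL over Parquet, filters pushed down to skip blocks
from datafusion import SessionContext
ctx = SessionContext()
# note: an s3:// path needs an S3 object store registered on the context first (omitted here)
ctx.register_parquet("items", "s3://.../stac_items.parquet")
# half-open range so timestamps later on 2020-12-31 are included
ctx.sql("SELECT id FROM items WHERE datetime >= '2020-01-01' AND datetime < '2021-01-01'").show()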

The takeaway

You rarely move the data anymore: you move the question to the data, and only the bytes that answer it come back. Chunked (Zarr) for cubes, columnar (GeoParquet/Arrow) for tables, a pushdown engine (DataFusion) over both.