The old pain
Classic Earth data ships as big monolithic files (HDF, NetCDF). To get a 30-year rainfall series for one Indian state you'd download entire global files — tens of gigabytes — then throw 99% away after subsetting locally. Slow, expensive, laptop-melting. The whole modern stack exists to delete that step.
ARCO — the philosophy
Analysis-Ready, Cloud-Optimized. Not a format — a goal: lay data out in cloud object storage (S3/GCS) so a tool can read an arbitrary sub-region by fetching only the relevant bytes over HTTP, no full download, no server. "Analysis-ready" = already cleaned/gridded so you compute, not wrangle. ARCO is realised by the formats below.
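To make "fetching only the relevant bytes" concrete, here is a minimal sketch of the arithmetic behind an HTTP range read. It assumes a hypothetical uncompressed, row-major float32 grid with no header; the function names and the tiny in-memory `blob` standing in for a cloud object are illustrative, not any particular library's API.

```python
import struct

ITEM = struct.calcsize("<f")  # 4 bytes per little-endian float32


def row_byte_range(row_start, row_stop, n_cols, header_bytes=0):
    """Inclusive (first, last) byte offsets of rows [row_start, row_stop)."""
    first = header_bytes + row_start * n_cols * ITEM
    last = header_bytes + row_stop * n_cols * ITEM - 1
    return first, last


def range_header(first, last):
    """HTTP Range header asking for exactly those bytes."""
    return {"Range": f"bytes={first}-{last}"}


# Stand-in for a cloud object: a 4 x 3 grid holding the values 0..11
n_rows, n_cols = 4, 3
values = [float(i) for i in range(n_rows * n_cols)]
blob = struct.pack(f"<{len(values)}f", *values)

# Ask for rows 1 and 2 only
first, last = row_byte_range(1, 3, n_cols)
header = range_header(first, last)

# What a server honouring that Range header would send back
part = blob[first:last + 1]
rows = struct.unpack(f"<{len(part) // ITEM}f", part)
```

Real formats add compression and chunking on top, so the reader looks offsets up in metadata rather than computing them this directly, but the principle is the same: ask for a byte window, not the whole object.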
Zarr — chunked arrays (for grids/cubes)
For N-dimensional arrays (time × lat × lon, the shape of ERA5, MODIS, climate cubes). Zarr splits the big array into a grid of chunks, compresses each, and stores every chunk as a separate object in the cloud, plus a little JSON describing the layout. Want one region for one decade? The reader computes which chunks overlap and fetches only those: kilobytes to megabytes, not gigabytes. ERA5 itself is ECMWF's reanalysis; the ARCO-ERA5 copy of it that this site's trends use is a Zarr store.
See it: fetch only what you need
Click cells to "select a region". The array is chunked 1×1 here — watch how few chunks a small query actually pulls.
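The "which chunks overlap" step is plain integer arithmetic. The sketch below is an illustration under stated assumptions: a hypothetical 365 × 180 × 360 array with 30 × 45 × 90 chunks, and chunk keys written in the dotted style Zarr v2 uses by default. The helper names are made up for this example.

```python
from itertools import product


def chunks_for_slice(start, stop, chunk_len):
    """Chunk indices along one axis that overlap the half-open range [start, stop)."""
    return range(start // chunk_len, (stop - 1) // chunk_len + 1)


def chunk_keys(selection, chunk_shape):
    """Keys of every chunk touched by a selection of (start, stop) pairs, one per axis."""
    per_axis = [
        chunks_for_slice(start, stop, size)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return [".".join(str(i) for i in idx) for idx in product(*per_axis)]


# Hypothetical cube: time x lat x lon, in array index space
shape = (365, 180, 360)
chunk_shape = (30, 45, 90)

# Ceiling division per axis gives the total number of chunks in the store
total_chunks = 1
for n, c in zip(shape, chunk_shape):
    total_chunks *= -(-n // c)

# First 60 time steps, lat indices 50..99, lon indices 80..169
selection = [(0, 60), (50, 100), (80, 170)]
keys = chunk_keys(selection, chunk_shape)
```

Here the query touches 8 of 208 chunks. Each key maps to one object in the bucket, so the reader issues 8 small requests and ignores the rest.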
GeoParquet — columnar tables (for vectors/points)
Not everything is a grid. Footprints, station points, polygons, STAC catalog entries: those are tables with a geometry column. Parquet is the cloud-standard columnar table format. It stores each column together (so reading 2 of 40 similarly sized columns touches about 5% of the file) and keeps min/max statistics for each block of rows (a "row group") so a query can skip whole blocks. GeoParquet is Parquet plus a standard for the geometry column and its coordinate-system metadata. Call geopandas.read_parquet and you get a GeoDataFrame.
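The block-skipping idea can be sketched without any Parquet library. This is a pure-Python stand-in, not how a real reader is written: the `RowGroupStats` class and the year values are hypothetical, and an actual reader would take the min/max values from the Parquet file footer.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    index: int      # position of the row group in the file
    min_value: int  # smallest value of the filtered column in this group
    max_value: int  # largest value of the filtered column in this group


def groups_to_scan(stats, lo, hi):
    """Row groups whose [min, max] range can contain a value in [lo, hi]."""
    return [s.index for s in stats if s.max_value >= lo and s.min_value <= hi]


# Hypothetical footer statistics for a "year" column
stats = [
    RowGroupStats(0, 1990, 1999),
    RowGroupStats(1, 2000, 2009),
    RowGroupStats(2, 2010, 2019),
    RowGroupStats(3, 2020, 2024),
]

# WHERE year BETWEEN 2008 AND 2012 can only match groups 1 and 2
selected = groups_to_scan(stats, 2008, 2012)
```

The statistics only prove that a block cannot match; the selected blocks still have to be read and filtered row by row. Skipping works best when the file is sorted or clustered on the filtered column, so each block covers a narrow range.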
GeoArrow — the in-memory layout
Parquet is how geo tables rest on disk; Apache Arrow is how they live in memory — a columnar layout every modern tool shares, so data moves between them zero-copy (no re-parsing). GeoArrow is the agreed way to put geometry into Arrow. The pairing: GeoParquet on disk ⇄ GeoArrow in memory, no expensive conversion at the boundary.
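As a rough picture of what "geometry in a columnar layout" means, the sketch below stores three linestrings as flat coordinate columns plus an offsets array, which is the general shape of GeoArrow's native encodings. It uses the standard library's `array` and `memoryview` as a stand-in for Arrow buffers; the coordinates and the `linestring` helper are made up for illustration.

```python
from array import array

# Three linestrings with 2, 3 and 2 vertices, stored as flat columns
xs = array("d", [0.0, 1.0, 5.0, 6.0, 7.0, 9.0, 9.5])
ys = array("d", [0.0, 1.0, 2.0, 2.5, 3.0, 4.0, 4.5])

# offsets[i]:offsets[i + 1] is the vertex range of linestring i
offsets = array("i", [0, 2, 5, 7])


def linestring(i):
    """Views onto the coordinates of linestring i; no bytes are copied."""
    a, b = offsets[i], offsets[i + 1]
    return memoryview(xs)[a:b], memoryview(ys)[a:b]


x1, y1 = linestring(1)
middle = list(zip(x1, y1))

# The view shares memory with the column: a write to xs shows up in x1
xs[2] = 50.0
shared = x1[0]
```

Because every geometry is just a range into shared buffers, a tool that understands the layout can hand those buffers to another tool without re-encoding each shape.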
DataFusion — the query engine
Given Arrow/Parquet data, you want to ask questions of it fast. DataFusion is a high-performance query engine (Rust, built on Arrow) that runs SQL over Parquet/Arrow and pushes filters down — using those per-chunk statistics to scan only the blocks that can match. Projects like zarr-datafusion-search point DataFusion at Zarr/STAC so you can SQL-query "which granules cover this box and time" without crawling the whole catalog.
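The "which granules cover this box and time" question has a simple SQL shape: a time-range filter plus a bounding-box overlap test. The sketch below runs it with the standard library's sqlite3 purely as a stand-in SQL engine, not DataFusion, and the table layout (`xmin`, `ymin`, `xmax`, `ymax` columns) and rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE items "
    "(id TEXT, datetime TEXT, xmin REAL, ymin REAL, xmax REAL, ymax REAL)"
)
con.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("g1", "2020-03-01T00:00:00Z", 70.0, 15.0, 76.0, 20.0),
        ("g2", "2020-07-15T00:00:00Z", 100.0, 0.0, 110.0, 10.0),
        ("g3", "2019-11-30T00:00:00Z", 73.0, 16.0, 79.0, 21.0),
        ("g4", "2020-12-31T18:00:00Z", 78.0, 18.0, 84.0, 24.0),
    ],
)

# Query box in lon/lat degrees
box = {"xmin": 74.0, "ymin": 17.0, "xmax": 80.0, "ymax": 22.0}

# Two rectangles overlap when each one starts before the other ends on both axes.
# ISO 8601 strings in the same timezone sort chronologically, so text comparison works.
query = """
SELECT id FROM items
WHERE datetime >= '2020-01-01' AND datetime < '2021-01-01'
  AND xmin <= :xmax AND xmax >= :xmin
  AND ymin <= :ymax AND ymax >= :ymin
ORDER BY id
"""
hits = [row[0] for row in con.execute(query, box)]
```

Every clause is a plain comparison on a column, which is exactly the kind of filter an engine with statistics pushdown can check against per-block min/max values before reading any rows.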
How they fit together
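The chain runs from cloud object storage (Zarr for grids, GeoParquet for tables) through Arrow in memory to a query engine or a reduction. That is why a verified trend for one region computes in seconds from a laptop (or a tiny server): the reader pulls a handful of Zarr chunks, lands them in Arrow, and reduces them, never touching the petabytes it didn't ask for.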
Do it yourself
# Zarr: open a cloud array lazily, subset, THEN load (only those chunks fetch)
# Needs gcsfs installed; token="anon" reads the public bucket without credentials.
import xarray as xr
ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks={},
    storage_options={"token": "anon"},
)
# ERA5 latitude runs north to south, so the slice goes from 22 down to 17
box = ds["2m_temperature"].sel(latitude=slice(22, 17), longitude=slice(74, 80))
series = box.mean(["latitude", "longitude"]).resample(time="YE").mean().load()

# GeoParquet: read a geo table, only the columns you need
import geopandas as gpd
gdf = gpd.read_parquet("s3://.../footprints.parquet", columns=["id", "geometry"])

# DataFusion: SQL over Parquet, filters pushed down to skip blocks
# (s3:// paths need an object store registered on the context first)
from datafusion import SessionContext
ctx = SessionContext()
ctx.register_parquet("items", "s3://.../stac_items.parquet")
# Half-open range so that all of 31 December 2020 is included
ctx.sql("SELECT id FROM items WHERE datetime >= '2020-01-01' AND datetime < '2021-01-01'").show()