MMEarth banner

Abstract

The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create MMEarth, a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks, including image classification and semantic segmentation. We find that pretraining with multi-modal pretext tasks notably improves the linear probing performance compared to pretraining on optical satellite images only. It also leads to better label and parameter efficiency, which are crucial aspects in global-scale applications.


MMEarth Dataset

  • MMEarth covers data from 1.2 million locations sampled globally, making the optical image count similar to ImageNet-1K. At each location, data from 12 aligned modalities were collected and grouped into pixel-level and image-level modalities.
  • MMEarth was sampled uniformly across 14 biomes (e.g., Mangroves, Temperate Conifer Forests, and Tundra). To increase diversity, we considered data from the four years 2017–2020. Furthermore, we ensured that time-critical modalities were collected around the Sentinel-2 observation date, which serves as the reference.
  • The six pixel-level modalities are rasters of 128 × 128 pixels covering 1.28 km × 1.28 km on the ground (e.g., Sentinel-2, Sentinel-1, ASTER DEM, Dynamic World, and ESA WorldCover). The remaining six image-level modalities are scalar values per location (e.g., Biome, Ecoregion, ERA5 temperature, ERA5 precipitation, Geolocation, and Sentinel-2 observation date); a data-layout sketch follows this list.
  • The data can be downloaded from [here]. In addition to the full MMEarth dataset, we provide two smaller "taster" versions (MMEarth64 and MMEarth100k) to facilitate research in multi-modal representation learning:
    • MMEarth: 1.2M locations with image size 128 × 128 pixels. [597GB]
    • MMEarth64: 1.2M locations with center crops of 64 × 64 pixels. [152GB]
    • MMEarth100k: 100k locations with image size 128 × 128 pixels. [48GB]
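To make the split between pixel-level and image-level modalities concrete, here is a minimal sketch of how one MMEarth location could be laid out in Python. The field names, array shapes, and channel counts are illustrative assumptions based on the description above, not the released data format.

```python
# Illustrative sketch of one MMEarth-style location. Field names, shapes, and
# channel counts are assumptions, not the released data format.
import numpy as np

def make_dummy_location(size: int = 128) -> dict:
    """Create a fake sample with the pixel-level / image-level split described above."""
    pixel_level = {
        # Rasters of size x size pixels, covering 1.28 km x 1.28 km on the ground.
        "sentinel2":      np.zeros((12, size, size), dtype=np.float32),  # multi-spectral bands (count assumed)
        "sentinel1":      np.zeros((8, size, size), dtype=np.float32),   # SAR backscatter channels (count assumed)
        "aster_dem":      np.zeros((2, size, size), dtype=np.float32),   # elevation-derived layers (assumed)
        "dynamic_world":  np.zeros((size, size), dtype=np.int64),        # land-cover class per pixel
        "esa_worldcover": np.zeros((size, size), dtype=np.int64),        # land-cover class per pixel
        # The sixth pixel-level modality is omitted here.
    }
    image_level = {
        # One scalar / categorical value per location.
        "biome": 0,                   # index into the 14 biomes
        "ecoregion": 0,
        "era5_temperature": 0.0,
        "era5_precipitation": 0.0,
        "geolocation": (0.0, 0.0),    # (lat, lon)
        "s2_date": "2020-01-01",      # Sentinel-2 observation date (the reference date)
    }
    return {"pixel_level": pixel_level, "image_level": image_level}
```

In this layout, the MMEarth64 taster version corresponds to `size=64` center crops, and MMEarth100k to a 100k-location subset at the full 128 × 128 pixel size.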

Temporal and Spatial Distribution of the Data

Figure: Temporal distribution of the data.
Figure: Spatial distribution of the data.

Multi-Pretext Masked Autoencoder

  • Our Multi-Pretext Masked Autoencoder (MP-MAE) builds on masked image modelling with the ConvNeXt V2 architecture. ConvNeXt V2 is a fully convolutional masked autoencoder (MAE) that uses sparse convolutions to predict the masked pixels of an image.
  • MP-MAE extends ConvNeXt V2 by adding a task-specific decoder for each pretext task. The general-purpose representation is learned by combining the losses of all pretext tasks (see the sketch below).
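As a rough illustration of this design, the sketch below wires a shared encoder to one lightweight decoder per pretext task and sums the per-task losses. It simplifies several things: plain dense modules stand in for ConvNeXt V2's sparse convolutions, every task uses an MSE loss, and the class and argument names are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPretextMAE(nn.Module):
    """Simplified multi-pretext masked autoencoder (illustrative, not the paper's exact code)."""

    def __init__(self, encoder: nn.Module, decoders: nn.ModuleDict, loss_weights: dict):
        super().__init__()
        self.encoder = encoder            # shared backbone, e.g. a ConvNeXt V2-style encoder
        self.decoders = decoders          # one task-specific decoder per pretext modality
        self.loss_weights = loss_weights  # scalar weight per pretext task

    def forward(self, masked_s2: torch.Tensor, targets: dict) -> torch.Tensor:
        latent = self.encoder(masked_s2)            # encode the masked Sentinel-2 input
        total_loss = masked_s2.new_zeros(())
        for task, decoder in self.decoders.items():
            pred = decoder(latent)                  # reconstruct this modality from the shared latent
            loss = F.mse_loss(pred, targets[task])  # regression loss; a simplification for all tasks
            total_loss = total_loss + self.loss_weights.get(task, 1.0) * loss
        return total_loss
```

For categorical pretext targets such as land cover or biome, a cross-entropy term would replace the MSE loss; the general-purpose representation used downstream is simply the output of the shared encoder.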
MMEarth Model

Experimental Results

Our main results can be summarized as follows:
  1. Domain-specific pretraining (i.e., pretraining on optical satellite images) improves representations, and multi-spectral input images improve over RGB channels alone.
  2. Multi-modal pretext tasks improve representations for Sentinel-2 images, especially for linear probing and in low-data scenarios.
  3. Our MP-MAE compares favourably to prior work on self-supervised learning (SSL) for Earth observation data, even with a small encoder.
table results
Table: Downstream performance. Fine-tuning (FT) and linear probing (LP) performance for multi-pretext pretraining (MMEarth64) and masked image pretraining (ImageNet, MMEarth64-S2). Multi-pretext pretraining improves both FT and LP results.
label_efficiency
Figure: Label efficiency for few-shot downstream performance. Linear probing performance for varying downstream dataset sizes. MP-MAE (‘Atto’) pretrained on ImageNet, MMEarth64-S2 (multi-spectral only), MMEarth64 (all multi-modal pretext tasks).
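For reference, the two evaluation protocols differ in what is trained: fine-tuning (FT) updates the whole network on the downstream task, while linear probing (LP) freezes the pretrained encoder and trains only a linear head. Below is a minimal, generic PyTorch sketch of the LP setup; the feature dimension, optimizer, and learning rate are placeholder choices, not the paper's settings.

```python
import torch
import torch.nn as nn

def build_linear_probe(encoder: nn.Module, feat_dim: int, num_classes: int):
    """Freeze a pretrained encoder and attach a trainable linear head (illustrative)."""
    for p in encoder.parameters():
        p.requires_grad = False  # the encoder stays fixed under linear probing
    head = nn.Linear(feat_dim, num_classes)
    # Only the head's parameters are optimized; fine-tuning would also pass
    # encoder.parameters() (typically with a smaller learning rate).
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
    return head, optimizer
```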

MMEarth examples


Video Presentation

Citation

Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, & Nico Lang (2024). MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning.

@misc{nedungadi2024mmearth,
      title={MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning},
      author={Vishal Nedungadi and Ankit Kariryaa and Stefan Oehmcke and Serge Belongie and Christian Igel and Nico Lang},
      year={2024},
      eprint={2405.02771},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}