MMEarth banner

Abstract

The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks, including image classification and semantic segmentation. We find that multi-modal pretraining notably improves the linear probing performance, e.g., 4pp on BigEarthNet and 16pp on So2Sat, compared to pretraining on optical satellite images only. We show that this also leads to better label and parameter efficiency, which are crucial aspects in global scale applications.


MMEarth Dataset

  • MMEarth covers data from 1.2 million locations sampled globally, making the optical image count similar to ImageNet-1K. At each location, data from 12 aligned modalities were collected and grouped into pixel-level and image-level modalities.
  • MMEarth was sampled uniformly across 14 biomes (e.g., Mangroves, Temperate Conifer Forests, Tundra, etc.). To increase diversity, we considered data from four years 2017–2020. Furthermore, we ensured that time-critical modalities were collected around the Sentinel-2 observation date, which serves as the reference.
  • The six pixel-level modalities are rasters of 128 × 128 pixels covering 1.28 km × 1.28 km on the ground, i.e., 10 m per pixel (e.g., Sentinel-2, Sentinel-1, ASTER DEM, Dynamic World, and ESA World Cover). The remaining six image-level modalities are scalar values for each location (e.g., Biome, Ecoregion, ERA5 temperature, ERA5 precipitation, Geolocation, and Sentinel-2 observation date).
  • The data can be downloaded from [here]. In addition to the full MMEarth dataset, we provide two smaller "taster" versions (MMEarth64 and MMEarth100k) to facilitate research in multi-modal representation learning:
    • MMEarth: 1.2M locations with image size 128 × 128 pixels. [639GB]
    • MMEarth64: 1.2M locations with center crops of 64 × 64 pixels. [163GB]
    • MMEarth100k: 100k locations with image size 128 × 128 pixels. [48GB]
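Since MMEarth64 consists of 64 × 64 center crops of the full 128 × 128 rasters, the relationship between the two versions can be sketched as follows. This is a minimal NumPy illustration with a hypothetical channels-first array layout, not the dataset's actual loading code:

```python
import numpy as np

def center_crop(raster: np.ndarray, size: int = 64) -> np.ndarray:
    """Center-crop the spatial dimensions (..., H, W) of a raster."""
    h, w = raster.shape[-2], raster.shape[-1]
    top = (h - size) // 2
    left = (w - size) // 2
    return raster[..., top:top + size, left:left + size]

# Example: a 12-band Sentinel-2 tile of 128 x 128 pixels -> 64 x 64 crop
tile = np.random.rand(12, 128, 128)
crop = center_crop(tile)
print(crop.shape)  # (12, 64, 64)
```

The same crop applies to any of the pixel-level modalities, since they are spatially aligned at each location.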

Temporal and Spatial distribution of the data

temporal_distribution
spatial_distribution

Multi-Pretext Masked Autoencoder

  • Our Multi-Pretext Masked Autoencoder (MP-MAE) model builds on masked image modelling with the ConvNeXt V2 architecture. ConvNeXt V2 is a fully convolutional masked autoencoder (MAE) that uses sparse convolutions to predict the masked pixels of an image.
  • MP-MAE extends ConvNeXt V2 by adding a task-specific decoder for each pretext task. The general-purpose representation is learned by combining the losses of all pretext tasks.
MMEarth Model
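The core training idea above, masking patches of the input and combining per-task reconstruction losses from the task-specific decoders, can be sketched in a few lines. This is a simplified NumPy illustration, not the actual model code: the task names, target shapes, and equal loss weighting are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch_mask(grid: int = 4, ratio: float = 0.6) -> np.ndarray:
    """Boolean mask over a grid x grid patch layout; True = masked."""
    n = grid * grid
    idx = rng.permutation(n)[: int(n * ratio)]
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask.reshape(grid, grid)

def masked_mse(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """MSE computed only on the masked patches, as in MAE-style training."""
    diff = (pred - target) ** 2
    return float(diff[..., mask].mean())

# One hypothetical decoder output per pretext task (patch-level targets here).
mask = random_patch_mask()
losses = {
    "sentinel2": masked_mse(rng.normal(size=(12, 4, 4)),
                            rng.normal(size=(12, 4, 4)), mask),
    "dem":       masked_mse(rng.normal(size=(1, 4, 4)),
                            rng.normal(size=(1, 4, 4)), mask),
}
weights = {"sentinel2": 1.0, "dem": 1.0}  # assumed equal weighting
total_loss = sum(weights[k] * v for k, v in losses.items())
```

In the real model, each pretext task (pixel-level or image-level) gets its own lightweight decoder on top of the shared encoder, and the combined loss drives the general-purpose representation.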

Experimental Results

Our main results can be summarized as:
  1. Domain-specific pretraining improves representations (i.e., pretraining on optical satellite images). Multi-spectral input images improve over RGB channels alone.
  2. Multi-modal pretext tasks improve representations for Sentinel-2 images, especially for linear probing and in low-data scenarios.
  3. Our MP-MAE compares favourably to prior work on SSL for Earth observation data, even with a small encoder.
table results

MMEarth examples


Citation

Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, & Nico Lang (2024). MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning.

@misc{nedungadi2024mmearth,
      title={MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning},
      author={Vishal Nedungadi and Ankit Kariryaa and Stefan Oehmcke and Serge Belongie and Christian Igel and Nico Lang},
      year={2024},
      eprint={2405.02771},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}