Updates | Introduction | Usage | Results & Models | Statement |
Scene Recognition: Please see Usage for a quick start;
Semantic Segmentation: Please see Remote Sensing Pretraining for Semantic Segmentation;
Object Detection: Please see Remote Sensing Pretraining for Object Detection;
Change Detection: Please see Remote Sensing Pretraining for Change Detection;
ViTAE: Please see ViTAE-Transformer;
Matting: Please see ViTAE-Transformer for matting;
11/04/2022
The baiduyun links of scene recognition models are provided.
07/04/2022
The paper is posted on arXiv!
06/04/2022
The pretrained models for ResNet-50, Swin-T and ViTAEv2-S are released. The code for pretraining and for the scene recognition task is also provided for reference.
This repository contains the code, models, and test results for the paper "An Empirical Study of Remote Sensing Pretraining".
Aerial images are usually captured from a bird's-eye view by cameras mounted on aircraft or satellites, and they cover a large scope of land uses and land covers. These scenes are often difficult to interpret because of interference from scene-irrelevant regions and the complicated spatial distribution of land objects. Although deep learning has largely reshaped remote sensing research on aerial image understanding and achieved great success, most existing deep models are initialized with ImageNet pretrained weights. Natural images inevitably present a large domain gap relative to aerial images, which probably limits the finetuning performance on downstream aerial scene tasks. This issue motivates us to conduct an empirical study of remote sensing pretraining (RSP). To this end, we train different networks from scratch on MillionAID, the largest remote sensing scene recognition dataset to date, to obtain remote sensing pretrained backbones, including both convolutional neural networks (CNNs) and vision transformers such as Swin and ViTAE, which have shown promising performance on computer vision tasks. Then, we investigate the impact of ImageNet pretraining (IMP) and RSP on a series of downstream tasks, including scene recognition, semantic segmentation, object detection, and change detection, using the CNN and vision transformer backbones.
Backbone | Input size | Acc@1 (μ±σ) | Model |
---|---|---|---|
RSP-ResNet-50-E300 | 224 × 224 | 99.48 ± 0.10 | google & baidu |
RSP-Swin-T-E300 | 224 × 224 | 99.52 ± 0.00 | google & baidu |
RSP-ViTAEv2-S-E100 | 224 × 224 | 99.90 ± 0.13 | google & baidu |
Backbone | Input size | Acc@1 (μ±σ) | Model |
---|---|---|---|
RSP-ResNet-50-E300 | 224 × 224 | 96.81 ± 0.03 | google & baidu |
RSP-Swin-T-E300 | 224 × 224 | 96.89 ± 0.08 | google & baidu |
RSP-ViTAEv2-S-E100 | 224 × 224 | 96.91 ± 0.06 | google & baidu |
Backbone | Input size | Acc@1 (μ±σ) | Model |
---|---|---|---|
RSP-ResNet-50-E300 | 224 × 224 | 97.89 ± 0.08 | google & baidu |
RSP-Swin-T-E300 | 224 × 224 | 98.30 ± 0.04 | google & baidu |
RSP-ViTAEv2-S-E100 | 224 × 224 | 98.22 ± 0.09 | google & baidu |
Backbone | Input size | Acc@1 (μ±σ) | Model |
---|---|---|---|
RSP-ResNet-50-E300 | 224 × 224 | 93.93 ± 0.10 | google & baidu |
RSP-Swin-T-E300 | 224 × 224 | 93.02 ± 0.12 | google & baidu |
RSP-ViTAEv2-S-E100 | 224 × 224 | 94.41 ± 0.11 | google & baidu |
Backbone | Input size | Acc@1 (μ±σ) | Model |
---|---|---|---|
RSP-ResNet-50-E300 | 224 × 224 | 95.02 ± 0.06 | google & baidu |
RSP-Swin-T-E300 | 224 × 224 | 94.51 ± 0.05 | google & baidu |
RSP-ViTAEv2-S-E100 | 224 × 224 | 95.60 ± 0.06 | google & baidu |
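The Acc@1 (μ±σ) column in the tables above reports a mean and a standard deviation over repeated runs. A minimal sketch of how such a summary could be computed is given here. The accuracies in the example are made up for illustration, the helper names are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator was used.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, std) of top-1 accuracies collected from repeated runs."""
    mean = statistics.fmean(accuracies)
    # Assumption: population standard deviation. Swap in statistics.stdev
    # if the sample estimator is the one you need.
    std = statistics.pstdev(accuracies)
    return mean, std


def format_acc(accuracies):
    """Format repeated-run accuracies in the 'mean ± std' style of the tables."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.2f} ± {std:.2f}"


# Illustrative values only; these are NOT results from the paper.
example_runs = [96.0, 97.0, 98.0, 97.0, 97.0]
print(format_acc(example_runs))
```

A standard deviation of 0.00 simply means that every repeated run gave the same accuracy at the reported precision.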
- Create a conda virtual environment and activate it
conda create -n rsp python=3.8 -y
conda activate rsp
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch
pip install timm==0.4.12
- Install apex (optional)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
- Install other requirements:
pip install pyyaml yacs pillow
- Clone this repo
git clone https://github.com/ViTAE-Transformer/RSP.git
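After the installation steps above, a quick sanity check of the pinned versions can save debugging time later. The following is a minimal sketch using only the standard library. The helper `check_versions` is hypothetical and not part of this repository, and it assumes that the conda PyTorch package is registered under the distribution name `torch`.

```python
from importlib import metadata

# Versions pinned in the installation steps above.
EXPECTED = {"torch": "1.10.1", "torchvision": "0.11.2", "timm": "0.4.12"}


def check_versions(expected, get_version=metadata.version):
    """Compare installed package versions with the expected pins."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Some builds carry a local suffix such as "+cu102"; compare the
        # public part of the version only.
        public = found.split("+")[0]
        report[name] = "ok" if public == wanted else f"mismatch ({found})"
    return report


if __name__ == "__main__":
    for name, status in check_versions(EXPECTED).items():
        print(f"{name}: {status}")
```

The `get_version` parameter exists only so the check can be exercised without the packages installed.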
We use the MillionAID dataset for pretraining and fine-tune the pretrained models on the UCM, AID, and NWPU-RESISC45 datasets. For each dataset, we first merge all images together and then split them into training and validation sets, whose information is recorded separately in train_label.txt and valid_label.txt. Note that we only consider the third-level categories (51 classes in total) of the MillionAID dataset. The format of train_label.txt is exemplified as follows:
P0960374.jpg dry_field 0
P0973343.jpg dry_field 0
P0235595.jpg dry_field 0
P0740591.jpg dry_field 0
P0099281.jpg dry_field 0
P0285964.jpg dry_field 0
...
Here, 0 is the training ID of the category for the corresponding image.
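The label-file format shown above (file name, category name, training ID) can be read with a few lines of Python. This is a minimal sketch, assuming the three fields are separated by whitespace. The `Sample` type and `parse_label_lines` helper are hypothetical names and are not taken from the repository's data loader.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Sample(NamedTuple):
    filename: str
    category: str
    label: int


def parse_label_lines(lines: Iterable[str]) -> List[Sample]:
    """Parse 'filename category id' records, skipping blank lines."""
    samples = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split from the right so that the last field is always the training ID.
        filename, category, label = line.rsplit(maxsplit=2)
        samples.append(Sample(filename, category, int(label)))
    return samples


example = """P0960374.jpg dry_field 0
P0973343.jpg dry_field 0
"""
samples = parse_label_lines(example.splitlines())
print(Counter(s.category for s in samples))
```

To read a real file, pass the open file object instead, for example `parse_label_lines(open("train_label.txt"))`.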
- For pretraining, taking ResNet-50 as an example, train on the MillionAID dataset with 4 GPUs and a total batch size of 512 (128 per GPU)
python -m torch.distributed.launch --nproc_per_node 4 --master_port 6666 main.py \
--dataset 'millionAID' --model 'resnet' --exp_num 1 \
--batch-size 128 --epochs 300 --img_size 224 --split 100 \
--lr 5e-4 --weight_decay 0.05 --gpu_num 4 \
--output [model save path]
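The relation between the batch size in the description and the value passed on the command line can be sketched as follows, assuming that `--batch-size` is the per-GPU batch size (this reading matches 4 GPUs × 128 = 512). The helpers `global_batch_size` and `build_pretrain_command` and the output path `checkpoints/resnet50` are hypothetical. Only flags that appear in the command above are used.

```python
import shlex


def global_batch_size(per_gpu_batch, num_gpus):
    """Total samples per step, assuming --batch-size is a per-GPU value."""
    return per_gpu_batch * num_gpus


def build_pretrain_command(num_gpus, per_gpu_batch, output_dir, master_port=6666):
    """Assemble the ResNet-50 pretraining command as an argument list."""
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        "--master_port", str(master_port),
        "main.py",
        "--dataset", "millionAID", "--model", "resnet", "--exp_num", "1",
        "--batch-size", str(per_gpu_batch), "--epochs", "300",
        "--img_size", "224", "--split", "100",
        "--lr", "5e-4", "--weight_decay", "0.05",
        "--gpu_num", str(num_gpus),
        "--output", output_dir,
    ]


# Hypothetical output directory, used only for illustration.
cmd = build_pretrain_command(num_gpus=4, per_gpu_batch=128,
                             output_dir="checkpoints/resnet50")
print("global batch size:", global_batch_size(128, 4))
print(shlex.join(cmd))
```

Building the command as a list keeps `--nproc_per_node` and `--gpu_num` in sync when the number of GPUs changes.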
- To fine-tune the pretrained ViTAE model on the AID dataset with the (2:8) setting, repeated 5 times
python -m torch.distributed.launch --nproc_per_node 1 --master_port 7777 main.py \
--dataset 'aid' --model 'vitae_win' --ratio 28 --exp_num 5 \
--batch-size 64 --epochs 200 --img_size 224 --split 1 \
--lr 5e-4 --weight_decay 0.05 --gpu_num 1 \
--output [model save path] \
--pretrained [pretrained vitae path]
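The (2:8) setting and `--ratio 28` refer to how the dataset is divided. A minimal sketch of such a split is given below, assuming the first digit is the training share and the second the testing share, and assuming the split is made per category. The source does not describe the exact splitting procedure, so `split_by_ratio` is a hypothetical helper and not the authors' script.

```python
import random
from collections import defaultdict


def split_by_ratio(samples, ratio="28", seed=0):
    """Split (filename, label) pairs per category, e.g. '28' -> 2:8."""
    train_parts, test_parts = int(ratio[0]), int(ratio[1])
    train_fraction = train_parts / (train_parts + test_parts)

    # Group file names by category so every class keeps the same ratio.
    by_label = defaultdict(list)
    for filename, label in samples:
        by_label[label].append(filename)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for label in sorted(by_label):
        files = sorted(by_label[label])
        rng.shuffle(files)
        cut = round(len(files) * train_fraction)
        train += [(f, label) for f in files[:cut]]
        test += [(f, label) for f in files[cut:]]
    return train, test


# Synthetic file names for illustration: 3 categories with 10 images each.
demo = [(f"img_{c}_{i}.jpg", c) for c in range(3) for i in range(10)]
train_set, test_set = split_by_ratio(demo, ratio="28")
print(len(train_set), len(test_set))
```

Changing `seed` gives a different random split with the same ratio.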
- Evaluate the existing model
python -m torch.distributed.launch --nproc_per_node 1 --master_port 8888 main.py \
--dataset 'nwpuresisc' --model 'vitae_win' --ratio 28 --exp_num 5 \
--batch-size 64 --epochs 200 --img_size 224 --split 100 \
--lr 5e-4 --weight_decay 0.05 --gpu_num 1 \
--output [log save path] \
--resume [model load path] \
--eval
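Evaluation reports Acc@1, the percentage of images whose highest-scoring class matches the ground-truth label. The following pure-Python sketch illustrates the metric on plain lists. It is written for illustration only and is not the repository's evaluation code, and `top1_accuracy` is a hypothetical name.

```python
def top1_accuracy(scores, targets):
    """Percentage of samples whose arg-max class equals the target label."""
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, target in zip(scores, targets):
        # Index of the largest score in this row (first one wins on ties).
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == target:
            correct += 1
    return 100.0 * correct / len(targets)


# Toy scores for four images and two classes (illustrative values).
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_targets = [1, 0, 0, 0]
print(top1_accuracy(toy_scores, toy_targets))
```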
Note: When pretraining the Swin model, please uncomment _update_config_from_file(config, args.cfg) in config.py, and add --cfg configs/swin_tiny_patch4_window7_224.yaml to the command.
This project is for research purposes only. For any other questions, please contact di.wang at gmail.com.
The code for the Pretraining & Recognition part comes mainly from Swin Transformer.