GroupViT: Semantic Segmentation Emerges from Text Supervision

This repository is the official implementation for GroupViT introduced in the paper:

GroupViT: Semantic Segmentation Emerges from Text Supervision
Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang
CVPR 2022

The project page with examples is at https://jerryxu.net/GroupViT/.

Citation

If you find our work useful in your research, please cite:

@article{xu2022groupvit,
  author    = {Xu, Jiarui and De Mello, Shalini and Liu, Sifei and Byeon, Wonmin and Breuel, Thomas and Kautz, Jan and Wang, Xiaolong},
  title     = {GroupViT: Semantic Segmentation Emerges from Text Supervision},
  journal   = {arXiv preprint arXiv:2202.11094},
  year      = {2022},
}

Environment Setup

  • Python 3.7
  • PyTorch 1.8
  • webdataset 0.1.103
  • mmsegmentation 0.18.0
  • timm 0.4.12

Quick start (full setup script):

conda create -n groupvit python=3.7 -y
conda activate groupvit
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install mmcv-full==1.3.14 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html
pip install mmsegmentation==0.18.0
pip install webdataset==0.1.103
pip install timm==0.4.12
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
pip install opencv-python==4.4.0.46 termcolor==1.1.0 diffdist einops omegaconf
pip install nltk ftfy regex tqdm
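
To verify the environment before training, a quick check like the following may help (a minimal sanity check, not part of the original setup):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
pip list | grep -E "mmcv|mmsegmentation|webdataset|timm"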

Demo

The demo is integrated into Hugging Face Spaces 🤗 using Gradio; try out the web demo there.

The demo can also be run on Colab.

To run the demo from the command line:

python demo/demo_seg.py --cfg configs/group_vit_gcc_yfcc_30e.yml --resume /path/to/checkpoint --vis input_pred_label final_group --input demo/examples/voc.jpg --output_dir demo/output

The output is saved in demo/output/.

Benchmark

                               Zero-shot Classification   Zero-shot Segmentation
config                         ImageNet                   Pascal VOC   Pascal Context   COCO
group_vit_gcc_yfcc_30e.yml     43.7                       52.3         22.4             24.3
group_vit_gcc_redcap_30e.yml   51.6                       50.8         23.7             27.5

You may download the pre-trained weights group_vit_gcc_yfcc_30e-879422e0.pth and group_vit_gcc_redcap_30e-3dd09a76.pth from Jiarui Xu's GitHub.

Zero-shot Transfer to Classification on ImageNet
./tools/dist_launch.sh main_group_vit.py /path/to/config 8 --resume /path/to/checkpoint --eval
Zero-shot Transfer to Semantic Segmentation on Pascal VOC
./tools/dist_launch.sh main_seg.py /path/to/config 8 --resume /path/to/checkpoint
Zero-shot Transfer to Semantic Segmentation on Pascal Context
./tools/dist_launch.sh main_seg.py /path/to/config 8 --resume /path/to/checkpoint --opts evaluate.seg.cfg=segmentation/configs/_base_/datasets/pascal_context.py
Zero-shot Transfer to Semantic Segmentation on COCO
./tools/dist_launch.sh main_seg.py /path/to/config 8 --resume /path/to/checkpoint --opts evaluate.seg.cfg=segmentation/configs/_base_/datasets/coco.py
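
For example, evaluating the GCC+YFCC pre-trained model on Pascal VOC with 8 GPUs might look like the following (assuming the checkpoint above was saved under checkpoints/; adjust the path to your download location):

./tools/dist_launch.sh main_seg.py configs/group_vit_gcc_yfcc_30e.yml 8 --resume checkpoints/group_vit_gcc_yfcc_30e-879422e0.pth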

Data Preparation

During training, we use webdataset for scalable data loading. To convert image-text pairs into webdataset format, we use the img2dataset tool to download and preprocess the datasets.

For inference, we use mmsegmentation for semantic segmentation testing, evaluation and visualization on Pascal VOC, Pascal Context and COCO datasets.

The overall file structure is as follows:

GroupViT
├── local_data
│   ├── gcc3m_shards
│   │   ├── gcc-train-000000.tar
│   │   ├── ...
│   │   ├── gcc-train-000436.tar
│   ├── gcc12m_shards
│   │   ├── gcc-conceptual-12m-000000.tar
│   │   ├── ...
│   │   ├── gcc-conceptual-12m-001943.tar
│   ├── yfcc14m_shards
│   │   ├── yfcc14m-000000.tar
│   │   ├── ...
│   │   ├── yfcc14m-001888.tar
│   ├── redcap12m_shards
│   │   ├── redcap12m-000000.tar
│   │   ├── ...
│   │   ├── redcap12m-001211.tar
│   ├── imagenet_shards
│   │   ├── imagenet-val-000000.tar
│   │   ├── ...
│   │   ├── imagenet-val-000049.tar
│   ├── VOCdevkit
│   │   ├── VOC2012
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClass
│   │   │   ├── ImageSets
│   │   │   │   ├── Segmentation
│   │   ├── VOC2010
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClassContext
│   │   │   ├── ImageSets
│   │   │   │   ├── SegmentationContext
│   │   │   │   │   ├── train.txt
│   │   │   │   │   ├── val.txt
│   │   │   ├── trainval_merged.json
│   │   ├── VOCaug
│   │   │   ├── dataset
│   │   │   │   ├── cls
│   ├── coco
│   │   ├── images
│   │   │   ├── train2017
│   │   │   ├── val2017
│   │   ├── annotations
│   │   │   ├── train2017
│   │   │   ├── val2017

The instructions for preparing each dataset are as follows.

GCC3M

Please download the training split annotation file from Conceptual Captions 3M and name it gcc3m.tsv.

Then run img2dataset to download the image-text pairs and save them in webdataset format.

sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" --output_format webdataset \
            --output_folder local_data/gcc3m_shards \
            --processes_count 16 --thread_count 64 \
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*
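
Each resulting shard is a plain tar archive of paired image and caption files, so you can spot-check one with standard tools (assuming the first shard downloaded successfully):

tar tf local_data/gcc3m_shards/gcc-train-000000.tar | head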

Please refer to the img2dataset CC3M tutorial for details.

GCC12M

Please download the annotation file from Conceptual Captions 12M and name it gcc12m.tsv.

Then run img2dataset to download the image-text pairs and save them in webdataset format.

sed -i '1s/^/caption\turl\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" --output_format webdataset \
            --output_folder local_data/gcc12m_shards \
            --processes_count 16 --thread_count 64 \
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-conceptual-12m-/' local_data/gcc12m_shards/*

Please refer to the img2dataset CC12M tutorial for details.

YFCC14M

Please follow the CLIP Data Preparation instructions to download the YFCC14M subset.

wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
bunzip2 yfcc100m_subset_data.tsv.bz2

Then run the preprocessing script to create the subset SQLite database and annotation TSV file (this may take a while).

python convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv

This script creates two files: the SQLite database yfcc100m_dataset.sql and the annotation file yfcc14m_dataset.tsv.
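
Before starting the lengthy download, a quick sanity check on the generated annotation file may help (roughly 14M entries are expected, per the subset name):

wc -l yfcc14m_dataset.tsv
head -2 yfcc14m_dataset.tsv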

Then follow the YFCC100M Download Instructions to download the dataset and its metadata file.

pip install git+https://gitlab.com/jfolz/yfcc100m.git
mkdir -p yfcc100m_meta
python -m yfcc100m.convert_metadata . -o yfcc100m_meta --skip_verification
mkdir -p yfcc100m_zip
python -m yfcc100m.download yfcc100m_meta -o yfcc100m_zip

Finally, convert the dataset into webdataset format.

python convert_dataset/convert_yfcc14m.py --root yfcc100m_zip --info yfcc14m_dataset.tsv --shards yfcc14m_shards

RedCaps12M

Please download the annotation file from RedCaps.

wget https://www.dropbox.com/s/cqtdpsl4hewlli1/redcaps_v1.0_annotations.zip?dl=1 -O redcaps_v1.0_annotations.zip
unzip redcaps_v1.0_annotations.zip

Then run the preprocessing script and img2dataset to download the image-text pairs and save them in webdataset format.

python convert_dataset/process_redcaps.py annotations redcaps12m_meta/redcaps12m.parquet --num-split 16
img2dataset --url_list redcaps12m_meta --input_format "parquet" \
            --url_col "URL" --caption_col "TEXT" --output_format webdataset \
            --output_folder local_data/redcap12m_shards \
            --processes_count 16 --thread_count 64 \
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/redcap12m-/' local_data/redcap12m_shards/*
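
After the rename, the shard names should match the redcap12m-*.tar pattern shown in the file structure above; a quick check (assuming the download completed):

ls local_data/redcap12m_shards | head -3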

ImageNet

Please follow the webdataset ImageNet example to convert ImageNet into webdataset format.

Pascal VOC

Please follow MMSegmentation's Pascal VOC Preparation to download and set up the Pascal VOC dataset.

Pascal Context

Please refer to MMSegmentation's Pascal Context Preparation to download and set up the Pascal Context dataset.

COCO

The COCO dataset is an object detection dataset with instance segmentation annotations. To evaluate GroupViT, we combine all the instance masks of each image to generate semantic segmentation maps. Please follow MMSegmentation's documentation to download the COCO-Stuff-164k dataset first, then run the following:

python convert_dataset/convert_coco.py local_data/data/coco/ -o local_data/data/coco/
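
As a quick sanity check, you can confirm that a semantic label map was generated for each validation image (this assumes the converted masks land under annotations/val2017 as in the file structure above; COCO val2017 contains 5,000 images):

ls local_data/data/coco/annotations/val2017 | wc -l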

Run Experiments

Pre-train

Train on single node:

(node0)$ ./tools/dist_train.sh /path/to/config $GPUS_PER_NODE

For example, to train on a node with 8 GPUs, run:

(node0)$ ./tools/dist_train.sh configs/group_vit_gcc_yfcc_30e.yml 8

Train on multiple nodes:

(node0)$ ./tools/dist_mn_train.sh /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR
(node1)$ ./tools/dist_mn_train.sh /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR

For example, to train on two nodes with 8 GPUs each, run:

(node0)$ ./tools/dist_mn_train.sh configs/group_vit_gcc_yfcc_30e.yml 0 2 8 tcp://node0
(node1)$ ./tools/dist_mn_train.sh configs/group_vit_gcc_yfcc_30e.yml 1 2 8 tcp://node0

We use 16 GPUs for pre-training in our paper.

Zero-shot Transfer to Semantic Segmentation

Pascal VOC

./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint

Pascal Context

./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --opts evaluate.seg.cfg=segmentation/configs/_base_/datasets/pascal_context.py

COCO

./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --opts evaluate.seg.cfg=segmentation/configs/_base_/datasets/coco.py