|
@@ -1,25 +1,35 @@
|
|
|
# GroupViT: Semantic Segmentation Emerges from Text Supervision
|
|
|
|
|
|
-This repository is the official implementation for the GroupViT framework introduced in the paper:
|
|
|
+GroupViT is a framework for learning semantic segmentation purely from text captions without
|
|
|
+using any mask supervision. It learns to perform bottom-up hierarchical spatial grouping of
|
|
|
+semantically-related visual regions. This repository is the official implementation of GroupViT
|
|
|
+introduced in the paper:
|
|
|
|
|
|
-[**GroupViT: Semantic Segmentation Emerges from Text Supervision**](https://arxiv.org/abs/2202.11094)
|
|
|
-<br>
|
|
|
+[**GroupViT: Semantic Segmentation Emerges from Text Supervision**](https://arxiv.org/abs/2202.11094),
|
|
|
[*Jiarui Xu*](https://jerryxu.net),
|
|
|
[*Shalini De Mello*](https://research.nvidia.com/person/shalini-gupta),
|
|
|
[*Wonmin Byeon*](https://wonmin-byeon.github.io/),
|
|
|
[*Thomas Breuel*](http://www.tmbdev.net/),
|
|
|
[*Jan Kautz*](https://research.nvidia.com/person/jan-kautz),
|
|
|
[*Xiaolong Wang*](https://xiaolonw.github.io/),
|
|
|
-<br>
|
|
|
CVPR 2022.
|
|
|
+<div align="center">
|
|
|
+<img src="figs/github_arch.gif" width="85%">
|
|
|
|
|
|
-GroupViT is a framework for learning to perform semantic segmentation guided purely by text annotations,
|
|
|
-which performs bottom-up heirarchical spatial grouping of semantically-related visual regions.
|
|
|
+</div>
|
|
|
|
|
|
-The project page with examples is at [https://jerryxu.net/GroupViT/](https://jerryxu.net/GroupViT/).
|
|
|
+## Visual Results
|
|
|
|
|
|
<div align="center">
|
|
|
-<img src="figs/github_arch.gif" width="85%">
|
|
|
+<img src="figs/github_voc.gif" width="32%">
|
|
|
+<img src="figs/github_ctx.gif" width="32%">
|
|
|
+<img src="figs/github_coco.gif" width="32%">
|
|
|
+</div>
|
|
|
+
|
|
|
+## Links
|
|
|
+* [Jiarui Xu's Project Page](https://jerryxu.net/GroupViT/) (with additional visual results)
|
|
|
+* [arXiv Page](https://arxiv.org/abs/2202.11094)
|
|
|
+
|
|
|
+
|
|
|
-</div>
|
|
|
|
|
|
## Citation
|
|
@@ -43,7 +54,7 @@ If you find our work useful in your research, please cite:
|
|
|
* mmsegmentation 0.18.0
|
|
|
* timm 0.4.12
|
|
|
|
|
|
-Quick start full script:
|
|
|
+Instructions:
|
|
|
|
|
|
```shell
|
|
|
conda create -n groupvit python=3.7 -y
|
|
@@ -61,18 +72,18 @@ pip install nltk ftfy regex tqdm
|
|
|
|
|
|
## Demo
|
|
|
|
|
|
-Integrated into [Huggingface Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the web demo: [](https://huggingface.co/spaces/xvjiarui/GroupViT)
|
|
|
+* Integrated into [Huggingface Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the web demo: [GroupViT on Hugging Face Spaces](https://huggingface.co/spaces/xvjiarui/GroupViT)
|
|
|
|
|
|
-Run demo on Colab: [](https://colab.research.google.com/drive/1pdJVfAZUchMiHCraA_qBwAs4xnt1ekIU)
|
|
|
+* Run the demo on Google Colab: [Open in Colab](https://colab.research.google.com/drive/1pdJVfAZUchMiHCraA_qBwAs4xnt1ekIU)
|
|
|
|
|
|
-To run demo from command line:
|
|
|
+* To run the demo from the command line:
|
|
|
|
|
|
```shell
|
|
|
python demo/demo_seg.py --cfg configs/group_vit_gcc_yfcc_30e.yml --resume /path/to/checkpoint --vis input_pred_label final_group --input demo/examples/voc.jpg --output_dir demo/output
|
|
|
```
|
|
|
-The output is saved in `demo/output/`.
|
|
|
+ The output is saved in `demo/output/`.
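+
+You can then open the saved visualizations from Python, for example with PIL. This is a minimal sketch; the exact file names depend on the `--vis` options and the input image.
+
+```python
+from pathlib import Path
+
+from PIL import Image
+
+# Display every image the demo wrote to --output_dir.
+for path in sorted(Path("demo/output").rglob("*")):
+    if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
+        print(path)
+        Image.open(path).show()
+```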
|
|
|
|
|
|
-## Benchmark
|
|
|
+## Benchmark Results
|
|
|
|
|
|
<table>
|
|
|
<thead>
|
|
@@ -91,14 +102,14 @@ The output is saved in `demo/output/`.
|
|
|
<td>COCO</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
- <td><a href="configs/group_vit_gcc_yfcc_30e.yml">cfg</a></td>
|
|
|
+ <td>GCC + YFCC (<a href="configs/group_vit_gcc_yfcc_30e.yml">cfg</a>)</td>
|
|
|
<td>43.7</td>
|
|
|
<td>52.3</td>
|
|
|
<td>22.4</td>
|
|
|
<td>24.3</td>
|
|
|
</tr>
|
|
|
<tr>
|
|
|
- <td><a href="configs/group_vit_gcc_redcap_30e.yml">cfg</a></td>
|
|
|
+ <td>GCC + RedCaps (<a href="configs/group_vit_gcc_redcap_30e.yml">cfg</a>)</td>
|
|
|
<td>51.6</td>
|
|
|
<td>50.8</td>
|
|
|
<td>23.7</td>
|
|
@@ -107,15 +118,11 @@ The output is saved in `demo/output/`.
|
|
|
</tbody>
|
|
|
</table>
|
|
|
|
|
|
-You may download pre-trained weights `group_vit_gcc_yfcc_30e-879422e0.pth` and `group_vit_gcc_redcap_30e-3dd09a76.pth` from [Jiarui Xu's Github](https://github.com/xvjiarui/GroupViT#benchmark).
|
|
|
+Pre-trained weights `group_vit_gcc_yfcc_30e-879422e0.pth` and `group_vit_gcc_redcap_30e-3dd09a76.pth` for these models are provided by Jiarui Xu [here](https://github.com/xvjiarui/GroupViT#benchmark).
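+
+After downloading a checkpoint, you can quickly sanity-check it before running evaluation. This is a minimal sketch that only assumes the file is a regular PyTorch checkpoint; the exact key layout is not documented here and may differ.
+
+```python
+import torch
+
+# Load on CPU; the file name below refers to the GCC + YFCC checkpoint.
+ckpt = torch.load("group_vit_gcc_yfcc_30e-879422e0.pth", map_location="cpu")
+
+# Print the top-level structure to confirm the download is intact.
+print(type(ckpt))
+if isinstance(ckpt, dict):
+    print(list(ckpt.keys()))
+```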
|
|
|
|
|
|
-<div align="center">
|
|
|
-<img src="figs/github_voc.gif" width="32%">
|
|
|
-<img src="figs/github_ctx.gif" width="32%">
|
|
|
-<img src="figs/github_coco.gif" width="32%">
|
|
|
-</div>
|
|
|
+To reproduce the benchmark results with these pre-trained models:
|
|
|
|
|
|
-<details><summary>Zero-shot Transfer to Classification on ImageNet</summary> <pre><code>./tools/dist_launch.sh main_group_vit.py /path/to/config 8 --resume /path/to/checkpoint --eval</code></pre> </details>
|
|
|
+<details><summary>Zero-shot Transfer to Classification on ImageNet</summary><pre><code>./tools/dist_launch.sh main_group_vit.py /path/to/config 8 --resume /path/to/checkpoint --eval</code></pre></details>
|
|
|
<details><summary>Zero-shot Transfer to Semantic Segmentation on Pascal VOC</summary><pre><code>./tools/dist_launch.sh main_seg.py /path/to/config 8 --resume /path/to/checkpoint</code></pre></details>
|
|
|
<details><summary>Zero-shot Transfer to Semantic Segmentation on Pascal Context</summary><pre><code>./tools/dist_launch.sh main_seg.py /path/to/config 8 --resume /path/to/checkpoint --opts evaluate.seg.cfg=segmentation/configs/_base_/datasets/pascal_context.py</code></pre></details>
|
|
|
<details><summary>Zero-shot Transfer to Semantic Segmentation on COCO</summary><pre><code>./tools/dist_launch.sh main_seg.py /path/to/config 8 --resume /path/to/checkpoint --opts evaluate.seg.cfg=segmentation/configs/_base_/datasets/coco.py</code></pre></details>
|
|
@@ -123,7 +130,7 @@ You may download pre-trained weights `group_vit_gcc_yfcc_30e-879422e0.pth` and `
|
|
|
## Data Preparation
|
|
|
|
|
|
During training, we use [webdataset](https://webdataset.github.io/webdataset/) for scalable data loading.
|
|
|
-To convert image text pairs into webdataset format, we use the [img2dataset](https://github.com/rom1504/img2dataset) tool to download and preprocess the dataset.
|
|
|
+To convert image text pairs into the webdataset format, we use the [img2dataset](https://github.com/rom1504/img2dataset) tool to download and preprocess the dataset.
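+
+For reference, once the shards have been created (see the per-dataset instructions below), they can be inspected with a few lines of Python. This is only an illustrative sketch: the shard path is an example, and it assumes the default img2dataset layout where the image is stored under the `jpg` key and the caption under the `txt` key.
+
+```python
+import webdataset as wds
+
+# Example shard produced by img2dataset and renamed as described below;
+# adjust the path to one of your own shards.
+shard = "local_data/gcc3m_shards/gcc-train-000000.tar"
+
+# Decode images to PIL and pair each image with its caption.
+dataset = wds.WebDataset(shard).decode("pil").to_tuple("jpg;png;jpeg", "txt")
+
+for image, caption in dataset:
+    print(image.size, caption[:80])
+    break  # peek at the first sample only
+```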
|
|
|
|
|
|
For inference, we use [mmsegmentation](https://github.com/open-mmlab/mmsegmentation) for semantic segmentation testing, evaluation and visualization on Pascal VOC, Pascal Context and COCO datasets.
|
|
|
|
|
@@ -178,13 +185,13 @@ GroupViT
|
|
|
│ │ │ ├── val2017
|
|
|
```
|
|
|
|
|
|
-The instructions for preparing each dataset are as followed.
|
|
|
+The instructions for preparing each dataset are as follows.
|
|
|
|
|
|
### GCC3M
|
|
|
|
|
|
-Please download the training split annotation file from [Conceptual Caption 12M](https://ai.google.com/research/ConceptualCaptions/download) and name it to `gcc3m.tsv`.
|
|
|
+Please download the training split annotation file from [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/download) and name it `gcc3m.tsv`.
|
|
|
|
|
|
-Then run `img2dataset` to download the image text pairs and save in webdataset format.
|
|
|
+Then run `img2dataset` to download the image text pairs and save them in the webdataset format.
|
|
|
```
|
|
|
sed -i '1s/^/caption\turl\n/' gcc3m.tsv
|
|
|
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
|
|
@@ -195,13 +202,13 @@ img2dataset --url_list gcc3m.tsv --input_format "tsv" \
|
|
|
--enable_wandb True --save_metadata False --oom_shard_count 6
|
|
|
rename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*
|
|
|
```
|
|
|
-Please refer to [img2dataset CC3M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) for details.
|
|
|
+Please refer to [img2dataset CC3M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) for more details.
|
|
|
|
|
|
### GCC12M
|
|
|
|
|
|
-Please download the annotation file from [Conceptual Caption 12M](https://github.com/google-research-datasets/conceptual-12m) and name it to `gcc12m.tsv`.
|
|
|
+Please download the annotation file from [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m) and name it `gcc12m.tsv`.
|
|
|
|
|
|
-Then run `img2dataset` to download the image text pairs and save in webdataset format.
|
|
|
+Then run `img2dataset` to download the image text pairs and save them in the webdataset format.
|
|
|
```
|
|
|
sed -i '1s/^/caption\turl\n/' gcc12m.tsv
|
|
|
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
|
|
@@ -212,22 +219,22 @@ img2dataset --url_list gcc12m.tsv --input_format "tsv" \
|
|
|
--enable_wandb True --save_metadata False --oom_shard_count 6
|
|
|
rename -d 's/^/gcc-conceptual-12m-/' local_data/gcc12m_shards/*
|
|
|
```
|
|
|
-Please refer to [img2dataset CC12M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc12m.md) for details.
|
|
|
+Please refer to [img2dataset CC12M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc12m.md) for more details.
|
|
|
|
|
|
### YFCC14M
|
|
|
-Please run following [CLIP Data Preparation](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) to download YFCC14M subset.
|
|
|
+Please follow the [CLIP Data Preparation](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md) instructions to download the YFCC14M subset.
|
|
|
```
|
|
|
wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
|
|
|
bunzip2 yfcc100m_subset_data.tsv.bz2
|
|
|
```
|
|
|
|
|
|
-Then run preprocessing script to create subset sql db and annotation tsv file (may take a while).
|
|
|
+Then run the preprocessing script to create the subset SQLite db and annotation tsv file (this may take a while).
|
|
|
```
|
|
|
python convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv
|
|
|
```
|
|
|
-This script will create two files: SQLite db `yfcc100m_dataset.sql` and annotation tsv file `yfcc14m_dataset.tsv`.
|
|
|
+This script will create two files: an SQLite db called `yfcc100m_dataset.sql` and an annotation tsv file called `yfcc14m_dataset.tsv`.
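+
+Optionally, you can sanity-check the generated files before moving on. The sketch below lists the tables in the SQLite db without assuming a particular schema, and peeks at the first rows of the tsv (read without a header row, since the column layout is not documented here).
+
+```python
+import sqlite3
+
+import pandas as pd
+
+# List the tables contained in the generated SQLite db.
+con = sqlite3.connect("yfcc100m_dataset.sql")
+print(con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
+con.close()
+
+# Peek at the first few rows of the annotation tsv.
+print(pd.read_csv("yfcc14m_dataset.tsv", sep="\t", header=None, nrows=5))
+```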
|
|
|
|
|
|
-Then follow [YFCC100M Download Instruction](https://gitlab.com/jfolz/yfcc100m/-/tree/master) to download the dataset and meta file.
|
|
|
+Then follow the [YFCC100M Download Instruction](https://gitlab.com/jfolz/yfcc100m/-/tree/master) to download the dataset and its metadata file.
|
|
|
```
|
|
|
pip install git+https://gitlab.com/jfolz/yfcc100m.git
|
|
|
mkdir -p yfcc100m_meta
|
|
@@ -236,7 +243,7 @@ mkdir -p yfcc100m_zip
|
|
|
python -m yfcc100m.download yfcc100m_meta -o yfcc100m_zip
|
|
|
```
|
|
|
|
|
|
-Finally convert dataset into webdataset format.
|
|
|
+Finally, convert the dataset into the webdataset format.
|
|
|
```
|
|
|
python convert_dataset/convert_yfcc14m.py --root yfcc100m_zip --info yfcc14m_dataset.tsv --shards yfcc14m_shards
|
|
|
```
|
|
@@ -249,7 +256,7 @@ wget https://www.dropbox.com/s/cqtdpsl4hewlli1/redcaps_v1.0_annotations.zip?dl=1
|
|
|
unzip redcaps_v1.0_annotations.zip
|
|
|
```
|
|
|
|
|
|
-Then run preprocessing script and `img2dataset` to download the image text pairs and save in webdataset format.
|
|
|
+Then run the preprocessing script and `img2dataset` to download the image text pairs and save them in the webdataset format.
|
|
|
```
|
|
|
python convert_dataset/process_redcaps.py annotations redcaps12m_meta/redcaps12m.parquet --num-split 16
|
|
|
img2dataset --url_list ~/data/redcaps12m/ --input_format "parquet" \
|
|
@@ -263,21 +270,21 @@ rename -d 's/^/redcap12m-/' local_data/recaps12m_shards/*
|
|
|
|
|
|
### ImageNet
|
|
|
|
|
|
-Please follow [webdataset ImageNet Example](https://github.com/tmbdev-archive/webdataset-examples/blob/master/makeshards.py) to convert ImageNet into webdataset format.
|
|
|
+Please follow the [webdataset ImageNet Example](https://github.com/tmbdev-archive/webdataset-examples/blob/master/makeshards.py) to convert ImageNet into the webdataset format.
|
|
|
|
|
|
### Pascal VOC
|
|
|
|
|
|
-Please follow [MMSegmentation Pascal VOC Preparation](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-voc) to download and setup the Pascal VOC dataset.
|
|
|
+Please follow the [MMSegmentation Pascal VOC Preparation](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-voc) instructions to download and set up the Pascal VOC dataset.
|
|
|
|
|
|
### Pascal Context
|
|
|
|
|
|
-Please refer to [MMSegmentation Pascal Context Preparation](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-context) to download and setup the Pascal Context dataset.
|
|
|
+Please refer to the [MMSegmentation Pascal Context Preparation](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-context) instructions to download and set up the Pascal Context dataset.
|
|
|
|
|
|
### COCO
|
|
|
|
|
|
[COCO dataset](https://cocodataset.org/) is an object detection dataset with instance segmentation annotations.
|
|
|
-To evaluate GroupViT, we combine all the instance masks together and generate semantic segmentation maps.
|
|
|
-To generate the semantic segmentation maps, please follow [MMSegmentation's documentation](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#coco-stuff-164k) to download the COCO-Stuff-164k dataset first, then run following
|
|
|
+To evaluate GroupViT, we combine all the instance masks of a category together and generate semantic segmentation maps.
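+
+Conceptually, the merging step looks like the sketch below. This is illustrative only (the repository's `convert_dataset/convert_coco.py` script is what actually generates the maps used for evaluation, and the annotation path is an example):
+
+```python
+import numpy as np
+from pycocotools.coco import COCO
+
+# Build a semantic map for one image by merging all instance masks per category.
+coco = COCO("local_data/data/coco/annotations/instances_val2017.json")
+img_id = coco.getImgIds()[0]
+img_info = coco.loadImgs(img_id)[0]
+
+# 0 is treated as background here.
+semantic = np.zeros((img_info["height"], img_info["width"]), dtype=np.uint8)
+for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
+    mask = coco.annToMask(ann)  # binary instance mask
+    semantic[mask == 1] = ann["category_id"]
+
+print(np.unique(semantic))
+```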
|
|
|
+To generate the semantic segmentation maps, please follow [MMSegmentation's documentation](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#coco-stuff-164k) to download the COCO-Stuff-164k dataset first, and then run the following:
|
|
|
|
|
|
```shell
|
|
|
python convert_dataset/convert_coco.py local_data/data/coco/ -o local_data/data/coco/
|
|
@@ -287,7 +294,7 @@ python convert_dataset/convert_coco.py local_data/data/coco/ -o local_data/data/
|
|
|
|
|
|
### Pre-train
|
|
|
|
|
|
-Train on single node:
|
|
|
+Train on a single node:
|
|
|
|
|
|
```shell
|
|
|
(node0)$ ./tools/dist_train.sh /path/to/config $GPUS_PER_NODE
|
|
@@ -312,7 +319,7 @@ For example, to train on two nodes with 8 GPUs each, run:
|
|
|
(node1)$ ./tools/dist_mn_train.sh configs/group_vit_gcc_yfcc_30e.yml 1 2 8 tcp://node0
|
|
|
```
|
|
|
|
|
|
-We use 16 GPUs for pre-training in our paper.
|
|
|
+In our paper, we used 16 NVIDIA V100 GPUs for pre-training, which took 2 days.
|
|
|
|
|
|
### Zero-shot Transfer to Semantic Segmentation
|
|
|
|