- Classification
- Available datasets
- Finetune: eval_cls.py, eval_cls_ffcv.py
- Linear Probing
- k-NN Classification
- Unsupervised Classification on ImageNet
- Semi-Supervised Classification on ImageNet
- Object Detection and Instance Segmentation on COCO
- Semantic Segmentation on ADE20K
- Transfer Learning on Smaller Datasets
- Nearest Neighbor Retrieval
- Linear Probing on COCO
- Projection
- Few-shot Classification
Classification
Available datasets
The argument `--data_set` can be one of the following:
- IN1K
- ominiglot
- STL
- CIFAR10
- CIFAR100
- Cars
- Pets
- Aircraft
- Flowers
- Folder
eval_cls.py
eval_cls_ffcv.py
Finetune: We reproduced the results of MAE on ImageNet. The results are as follows:
| ImageNet Accuracy | ViT-Base | ViT-Large | ViT-Huge |
| --- | --- | --- | --- |
| MAE repo | 83.664 | 85.952 | 86.928 |
| Our repo | | | |
| Our repo (dres) | 83.302 | | |
Training time is ~7h11m on 32 V100 GPUs for the MAE repo.
To launch the evaluation, use vitrun or submitit. For example, to finetune a pre-trained model on ImageNet, run:
```bash
submitit --module vitookit.evaluation.eval_cls_ffcv --train_path ~/data/ffcv/IN1K_train_500_95.ffcv --val_path ~/data/ffcv/IN1K_val_500_100.ffcv --fast_dir /raid/local_scratch/jxw30-hxc19/ --gin VisionTransformer.global_pool='"avg"' --blr 5e-4 --layer_decay 0.65 --weight_decay 0.05 --drop_path 0.1 --checkpoint_key=model -w ~/models/mae_pretrain_vit_base.pth
```
Here the effective batch size is 128 (batch_size per gpu) * 8 (gpus per node) = 1024.
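For reference, the actual learning rate follows the usual linear scaling rule stated in the Linear Probing notes below (`lr = blr * effective batch size / 256`); a minimal sketch with the values from the command above (variable names are illustrative):

```python
# Effective batch size and lr scaling for the finetuning command above.
batch_size_per_gpu = 128
gpus_per_node = 8
nodes = 1
accum_iter = 1

eff_batch_size = batch_size_per_gpu * gpus_per_node * nodes * accum_iter  # 1024
blr = 5e-4                          # base lr passed as --blr above
lr = blr * eff_batch_size / 256     # actual lr = 2e-3
print(eff_batch_size, lr)
```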
dres: Dynamic Resolution for Efficient Supervised Learning.
```bash
vitrun --nproc_per_node=8 eval_cls_ffcv.py --train_path <> --val_path <> -w ~/models/mae_pretrain_vit_base.pth --checkpoint_key=model --layer_decay=0.65 --gin VisionTransformer.global_pool='"avg"' DynamicResolution.start_ramp=0 DynamicResolution.end_ramp=60 DynamicResolution.scheme=1 --dynamic_resolution
```
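The idea behind dres is to train at a reduced resolution early on and ramp up to the full resolution between `start_ramp` and `end_ramp`. A hypothetical sketch, assuming a linear ramp (the actual `DynamicResolution.scheme` in the codebase may differ):

```python
# Hypothetical dynamic-resolution schedule: ramp image resolution
# linearly between start_ramp and end_ramp epochs (illustrative only).
def resolution_at(epoch, start_ramp=0, end_ramp=60,
                  min_res=160, max_res=224, patch=16):
    if epoch >= end_ramp:
        return max_res
    frac = max(epoch - start_ramp, 0) / max(end_ramp - start_ramp, 1)
    res = min_res + frac * (max_res - min_res)
    return int(res // patch) * patch  # keep divisible by the ViT patch size

print([resolution_at(e) for e in (0, 30, 60)])  # [160, 192, 224]
```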
Linear Probing
We follow the MAE recipe to train the linear classifier. Note that:
- The effective batch size is 16384 = 512 (batch_size per gpu) * 1 (node) * 8 (gpus per node) * 4 (accum_iter).
- The actual `lr` is computed as `lr` = `blr` * effective batch size / 256.
- Training time is ~2h20m for 90 epochs on 32 V100 GPUs.

Reference results for MAE in linear probing:
| | ViT-Base | ViT-Large | ViT-Huge |
| --- | --- | --- | --- |
| paper (TF/TPU) | 68.0 | 75.8 | 76.6 |
| MAE repo (PT/GPU) | 67.8 | 76.0 | 77.2 |
| Our repo (PT/GPU) | 67.8 | 76.0 | 77.2 |
To train a single classifier on frozen weights, run:
```bash
submitit --module vitookit.evaluation.eval_linear_ffcv --train_path ~/data/ffcv/IN1K_train_500_95.ffcv --val_path ~/data/ffcv/IN1K_val_500_95.ffcv -w ~/models/mae_pretrain_vit_base.pth --checkpoint_key=model --gin VisionTransformer.global_pool='"avg"' --fast_dir /raid/local_scratch/jxw30-hxc19/ --batch_size=128 --accum_iter=16 --blr=0.1
```
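This matches the recipe above: 128 (batch_size per gpu) * 8 (gpus per node) * 16 (accum_iter) = 16384 effective batch size, so the actual lr is 0.1 * 16384 / 256 = 6.4.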
k-NN Classification
To evaluate k-NN classification on the frozen features, run:
```bash
python -m torch.distributed.launch --master_port=29501 --nproc_per_node=2 evaluation/eval_knn.py --pretrained_weights <weight> --data_location <data_path> --data_set <data_set> --output_dir <output_dir> --head_type <> --dis_fn <cosine/euclidean>
```
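For intuition, the weighted k-NN classifier typically used on frozen features (as in DINO) looks roughly like the sketch below, assuming precomputed, L2-normalized train/test features; the actual eval_knn.py may differ in details such as the temperature weighting, and `--dis_fn` switches the distance function:

```python
import torch

def knn_classify(train_feats, train_labels, test_feats, k=20, T=0.07):
    # cosine similarity (features assumed L2-normalized)
    sim = test_feats @ train_feats.t()
    topk_sim, topk_idx = sim.topk(k, dim=1)
    topk_labels = train_labels[topk_idx]       # (num_test, k)
    weights = (topk_sim / T).exp()             # temperature-weighted votes
    num_classes = int(train_labels.max()) + 1
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, weights)
    return votes.argmax(dim=1)
```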
Unsupervised Classification on ImageNet
To evaluate unsupervised classification, run:
```bash
./run.sh imagenet_unsup_cls $JOB_NAME vit_{small,base} teacher 8
```
Note: To ensure one-to-one assignment, the output dimension of the projection head for the [CLS] token (also patch tokens for iBOT) should be set to 1000 during pre-training. We share this pre-trained model together with its args.
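For reference, the one-to-one assignment between the 1000 cluster ids and the 1000 ImageNet classes is usually scored with Hungarian matching; a minimal sketch, assuming `preds` and `labels` are integer arrays over the validation set:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(preds, labels, num_classes=1000):
    # hits[c, y] = number of samples with cluster id c and true label y
    hits = np.zeros((num_classes, num_classes))
    for p, y in zip(preds, labels):
        hits[p, y] += 1
    # linear_sum_assignment minimizes, so negate to maximize total hits
    row, col = linear_sum_assignment(hits.max() - hits)
    return hits[row, col].sum() / len(preds)
```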
Semi-Supervised Classification on ImageNet
For semi-supervised classification, we use the data split defined in SimCLRv2, see here. For settings evaluated on frozen features (k-NN, LR, and linear probing), just change the --data_location in the full-data commands above to the ImageNet splits. For end-to-end fine-tuning, we fine-tune the pre-trained model from the first layer of the projection head (see the sketch after the command below):
```bash
./run.sh imagenet_semi_cls $JOB_NAME vit_small teacher 8 \
    --epochs 1000 \
    --lr 5e-6 \
    --data_location data/imagenet_{1,10}p_split \
    --finetune_head_layer 1
```
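Conceptually, `--finetune_head_layer 1` keeps the backbone plus the first projection-head layer and trains a fresh classifier on top. A hypothetical sketch (module names are illustrative, not the actual code):

```python
import torch.nn as nn

class SemiSupModel(nn.Module):
    """Backbone + first projection-head layer + new linear classifier."""
    def __init__(self, backbone, proj_head, num_classes=1000):
        super().__init__()
        self.backbone = backbone
        self.head_layer1 = proj_head[0]  # first layer of the projection head
        self.fc = nn.Linear(self.head_layer1.out_features, num_classes)

    def forward(self, x):
        return self.fc(self.head_layer1(self.backbone(x)))
```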
Object Detection and Instance Segmentation on COCO
To train ViT-S/16 with Cascade Mask R-CNN as the task layer, run:
```bash
./run.sh coco_det $JOB_NAME vit_small teacher 8 \
    data.samples_per_gpu=4 \
    lr_config.step=8,11 \
    runner.max_epochs=12 \
    optimizer.paramwise_cfg.layer_decay_rate=0.8
```
To train ViT-B/16 with Cascade Mask R-CNN as the task layer, run:
```bash
./run.sh coco_det $JOB_NAME vit_base teacher 8 \
    data.samples_per_gpu=2 \
    lr_config.step=8,11 \
    runner.max_epochs=12 \
    optimizer.paramwise_cfg.layer_decay_rate=0.75
```
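For intuition, `layer_decay_rate` scales each block's learning rate geometrically with depth, so earlier layers change least. A minimal sketch of the usual rule, here for ViT-B's 12 blocks with rate 0.75:

```python
# Illustrative layer-wise lr decay: block i of a depth-L ViT is trained
# at lr * rate ** (L - i).
def layer_lr_scales(num_layers=12, rate=0.75):
    # index 0 = patch embedding, index num_layers = top of the network
    return [rate ** (num_layers - i) for i in range(num_layers + 1)]

scales = layer_lr_scales()
print(round(scales[0], 4), scales[-1])  # 0.0317 for the embedding, 1.0 at the top
```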
Semantic Segmentation on ADE20K
To train ViT-S/16 with UperNet as the task layer, run:
```bash
./run.sh ade20k_seg $JOB_NAME vit_small teacher 4 \
    data.samples_per_gpu=4 \
    model.backbone.out_with_norm=true \
    optimizer.lr=3e-5
```
To train ViT-B/16 with fixed backbone and linear head as the task layer, run:
```bash
torchrun --nproc_per_node=4 evaluation/semantic_segmentation/train.py evaluation/semantic_segmentation/configs/linear/vit_base_512_*.py \
    --launcher pytorch --work_dir $(output_dir) --auto_resume --options data.samples_per_gpu=2 model.backbone.frozen_stages=12
```
To finetune the whole network, set model.backbone.frozen_stages=-1.
To test ViT-B/16, run:
```bash
torchrun --nproc_per_node=4 evaluation/semantic_segmentation/test.py evaluation/semantic_segmentation/configs/linear/vit_base_512_ade20k_160k.py <ckpt path>/iter_40000.pth --launcher pytorch --eval mIoU --out <file path to write the results>
```
Transfer Learning on Smaller Datasets
For historical reasons and reproducibility, we use the default fine-tuning recipe (i.e., w/o layer-wise decay, a smaller learning rate, and a longer training schedule) proposed in DeiT.
The default configuration in run.sh is for ViT-B/16, and a single one-line command is all it takes:
```bash
./run.sh cifar10_cls+cifar_cls+cars_cls+flwrs_cls+inat_cls+inat19_cls $JOB_NAME vit_base teacher 8
```
Note: ViT-S/16 shares most of the configuration, except that we set the --lr to 5e-5 for the INAT18 dataset and 2.5e-5 for the INAT19 dataset.
Nearest Neighbor Retrieval
Nearest neighbor retrieval is performed on the frozen pre-trained features, following the evaluation protocol of DINO. We consider three settings:
Video Object Segmentation on DAVIS
```bash
model_dir=<>
python evaluation/eval_video_segmentation.py --pretrained_weights=${model_dir}/weights.pth --arch=vit_base --data_location=../data/DAVIS --output_dir=${model_dir}/eval_seg/video_knn-davis &&\
python evaluation/davis2017-evaluation/evaluation_method.py --task semi-supervised --davis_path ../data/DAVIS --results_path ${model_dir}/eval_seg/video_knn-davis
```
Image Retrieval on Paris and Oxford
```bash
python evaluation/eval_image_retrieval.py --data_location ../data/revisited_paris_oxford/ --arch=vit_base -w outputs/models/SiT/checkpoint.pth --data_set=roxford5k
```
The evaluation takes about 5 minutes for vit_small.
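Under the hood, retrieval with frozen features amounts to ranking database images by cosine similarity to the query; a minimal sketch (the revisited Oxford/Paris mAP protocol adds query cropping and medium/hard splits, which are omitted here):

```python
import torch

def rank_database(query_feat, db_feats):
    # cosine similarity after L2 normalization, best matches first
    q = torch.nn.functional.normalize(query_feat, dim=-1)
    db = torch.nn.functional.normalize(db_feats, dim=-1)
    return (db @ q).argsort(descending=True)
```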
Batch evaluation:
```python
import os

model_dirs = ['outputs/models/mae-base',
              'outputs/models/SiT',
              'outputs/models/sit_iBOT-ViT_B',
              'outputs/models/ours_450',
              'outputs/models/sit_sit-ViT_B',
              'outputs/models/iBOT-ViT_B',
              'outputs/models/MC_SSL-ViT_B',
              'outputs/models/ours_iBOT2',
              'outputs/models/MSN-ViT_B',
              'outputs/models/DINO-ViT_B',
              'outputs/models/sit-ViT_B']
for m in model_dirs:
    for d in ['roxford5k', 'rparis6k']:
        cmd = (f"python evaluation/eval_image_retrieval.py "
               f"--data_location ../data/revisited_paris_oxford/ --arch=vit_base "
               f"-w {m}/checkpoint.pth --output_dir={m}/eval/ImageRetieval/ --data_set={d}")
        print(cmd)
        os.system(cmd)
```
Copy Detection on Copydays
```bash
./run.sh copydays_copydet $JOB_NAME vit_small teacher 1 \
    --data_location data/copydays
```
Linear Probing on COCO
```bash
CUDA_VISIBLE_DEVICES=1 MODEL_DIR= screen bash evaluation/awesome-semantic-segmentation-pytorch/scripts/linear_coco_dist.sh
```
Projection
Few-shot Classification
```bash
python evaluation/eval_fewshot_cls.py --pretrained_weights <weight> --data_location <data_path> --output_dir <output_dir>
```
Batch evaluation:
```python
import os

model_dirs = [
    'outputs/models/MMC',
    'outputs/models/mae-base',
    'outputs/models/SiT',
    'outputs/models/iBOT-ViT_B',
    'outputs/models/MSN-ViT_B',
    'outputs/models/DINO-ViT_B',
    'outputs/models/sit-ViT_B',
    'outputs/models/MC_SSL-ViT_B',
]
for m in model_dirs:
    for ds in ['ominiglot']:
        cmd = (f"python evaluation/eval_fewshot_cls.py --data_location ../data "
               f"--arch=vit_base -w {m}/checkpoint.pth "
               f"--output_dir={m}/eval/fewshot --data_set={ds}")
        print(cmd)
        os.system(cmd)
```
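For intuition, frozen-feature few-shot evaluation often reduces to nearest-centroid (prototype) classification per episode; a minimal sketch, assuming precomputed features (eval_fewshot_cls.py may use a different classifier, e.g. logistic regression):

```python
import torch
import torch.nn.functional as F

def prototype_episode(support_feats, support_labels, query_feats):
    # class prototypes = mean of support features per class
    classes = support_labels.unique()
    protos = torch.stack([support_feats[support_labels == c].mean(0)
                          for c in classes])
    protos = F.normalize(protos, dim=1)
    query = F.normalize(query_feats, dim=1)
    # assign each query to the nearest prototype by cosine similarity
    return classes[(query @ protos.t()).argmax(dim=1)]
```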