
Classification

Available datasets

The argument --data_set can be one of the following:

  • IN1K
  • ominiglot
  • STL
  • CIFAR10
  • CIFAR100
  • Cars
  • Pets
  • Aircraft
  • Flowers
  • Folder

Fine-tuning: eval_cls.py, eval_cls_ffcv.py

We reproduced the results of MAE on ImageNet. The results are as follows:

| ImageNet Accuracy | ViT-Base | ViT-Large | ViT-Huge |
|-------------------|----------|-----------|----------|
| MAE repo          | 83.664   | 85.952    | 86.928   |
| Our repo          |          |           |          |
| Our repo (dres)   | 83.302   |           |          |

For the MAE repo, training time is ~7h11m on 32 V100 GPUs.

To launch the evaluation, use vitrun or submitit. For example, to fine-tune a pre-trained model on ImageNet, run:

submitit  --module vitookit.evaluation.eval_cls_ffcv   --train_path ~/data/ffcv/IN1K_train_500_95.ffcv --val_path  ~/data/ffcv/IN1K_val_500_100.ffcv --fast_dir /raid/local_scratch/jxw30-hxc19/ --gin VisionTransformer.global_pool='"avg"'   --blr 5e-4 --layer_decay 0.65 --weight_decay 0.05 --drop_path 0.1 --checkpoint_key=model -w ~/models/mae_pretrain_vit_base.pth 

Here the effective batch size is 128 (batch_size per gpu) * 8 (gpus per node) = 1024.

dres: Dynamic Resolution for Efficient Supervised Learning

vitrun --nproc_per_node=8 eval_cls_ffcv.py --train_path <> --val_path <> -w ~/models/mae_pretrain_vit_base.pth --checkpoint_key=model --layer_decay=0.65 --gin VisionTransformer.global_pool='"avg"' DynamicResolution.start_ramp=0 DynamicResolution.end_ramp=60 DynamicResolution.scheme=1 --dynamic_resolution
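The exact ramp is controlled by the DynamicResolution gin bindings above. As a rough illustration only (we assume here that the schedule is a linear ramp from a reduced resolution up to the full one between start_ramp and end_ramp; this is our reading, not a confirmed detail of scheme=1), the resolution schedule could look like this:

# Hedged sketch of a dynamic-resolution schedule. The real behaviour of
# DynamicResolution is defined by vitookit's gin config; this helper and its
# default values (min_res, max_res) are illustrative assumptions.
def resolution_at(epoch, start_ramp=0, end_ramp=60, min_res=160, max_res=224, patch=16):
    if epoch <= start_ramp:
        res = min_res
    elif epoch >= end_ramp:
        res = max_res
    else:
        frac = (epoch - start_ramp) / (end_ramp - start_ramp)
        res = min_res + frac * (max_res - min_res)
    return int(round(res / patch) * patch)  # keep divisible by the ViT patch size

print([resolution_at(e) for e in (0, 15, 30, 45, 60)])  # 160, 176, 192, 208, 224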

Linear Probing

We follow the MAE recipe to train the linear classifier. Note that:

  • The effective batch size is 16384 = 512 (batch_size per gpu) * 1 (node) * 8 (gpus per node) * 4 (accum_iter).
  • The actual lr is computed as lr = blr * effective batch size / 256 (see the sketch after the table below).
  • Training time is ~2h20m for 90 epochs on 32 V100 GPUs.

Reference results for MAE in linear probing:

|                   | ViT-Base | ViT-Large | ViT-Huge |
|-------------------|----------|-----------|----------|
| paper (TF/TPU)    | 68.0     | 75.8      | 76.6     |
| MAE repo (PT/GPU) | 67.8     | 76.0      | 77.2    |
| Our repo (PT/GPU) | 67.8     | 76.0      | 77.2    |
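The scaling rule is easy to sanity-check. A minimal sketch (the helper below is illustrative, not part of vitookit; the flag names mirror the commands in this README):

# lr = blr * effective_batch_size / 256, the MAE-style scaling rule used above.
def effective_lr(blr: float, batch_size: int, gpus: int, accum_iter: int = 1) -> float:
    effective_batch_size = batch_size * gpus * accum_iter
    return blr * effective_batch_size / 256

# Fine-tuning example from above: 128 per GPU * 8 GPUs = 1024
print(effective_lr(5e-4, 128, 8))        # 2e-3
# Linear probing example: 512 * 1 node * 8 GPUs * 4 accum_iter = 16384
print(effective_lr(0.1, 512, 8, 4))      # 6.4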

To train a single classifier on frozen weights, run:

submitit --module vitookit.evaluation.eval_linear_ffcv --train_path ~/data/ffcv/IN1K_train_500_95.ffcv --val_path ~/data/ffcv/IN1K_val_500_95.ffcv  -w ~/models/mae_pretrain_vit_base.pth --checkpoint_key=model  --gin VisionTransformer.global_pool='"avg"'  --fast_dir /raid/local_scratch/jxw30-hxc19/ --batch_size=128 --accum_iter=16 --blr=0.1

k-NN Classification

To evaluate k-NN classification on the frozen features, run:

python -m torch.distributed.launch  --master_port=29501 --nproc_per_node=2 evaluation/eval_knn.py --pretrained_weights <weight> --data_location <data_path> --data_set <data_set> --output_dir <output_dir> --head_type <> --dis_fn <cosine/euclidean>
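For reference, the core of weighted k-NN classification on frozen features can be sketched as follows. This is a minimal illustration of the technique, not the actual eval_knn.py implementation; feature extraction is assumed to have been done already.

# train_feats: (N_train, D), train_labels: (N_train,), test_feats: (N_test, D)
import torch
import torch.nn.functional as F

def knn_predict(train_feats, train_labels, test_feats, k=20, dis_fn="cosine", num_classes=1000):
    if dis_fn == "cosine":
        train_feats = F.normalize(train_feats, dim=1)
        test_feats = F.normalize(test_feats, dim=1)
        sim = test_feats @ train_feats.t()          # higher = closer
    else:  # euclidean
        sim = -torch.cdist(test_feats, train_feats)  # negate distance so topk = nearest
    topk_sim, topk_idx = sim.topk(k, dim=1)
    if dis_fn == "cosine":
        weights = topk_sim.clamp(min=0)              # cosine similarity as vote weight
    else:
        weights = torch.ones_like(topk_sim)          # plain majority vote
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, train_labels[topk_idx], weights)
    return votes.argmax(dim=1)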

Unsupervised Classification on ImageNet

To evaluate for unsupervised classification, run:

./run.sh imagenet_unsup_cls $JOB_NAME vit_{small,base} teacher 8

Note: To ensure one-to-one assignment, the output dimension of the projection head for the [CLS] token (and also the patch tokens for iBOT) should be set to 1000 during pre-training. We share this pre-trained model, together with its args, here.
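The one-to-one assignment is a Hungarian matching between the 1000 head outputs (used as cluster ids) and the 1000 ground-truth classes. A minimal sketch of the resulting accuracy computation (illustrative, not the repo's exact code):

import numpy as np
from scipy.optimize import linear_sum_assignment

def unsup_cls_accuracy(cluster_ids, labels, num_classes=1000):
    # co-occurrence counts between predicted clusters and true labels
    cost = np.zeros((num_classes, num_classes), dtype=np.int64)
    for c, y in zip(cluster_ids, labels):
        cost[c, y] += 1
    # one-to-one cluster -> label mapping that maximizes total agreement
    row, col = linear_sum_assignment(cost, maximize=True)
    mapping = dict(zip(row, col))
    remapped = np.array([mapping[c] for c in cluster_ids])
    return (remapped == np.asarray(labels)).mean()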

Semi-Supervised Classification on ImageNet

For semi-supervised classification, we use the data split defined in SimCLRv2, see here. For settings evaluated on frozen features (k-NN, LR, and linear probing), just change --data_location in the full-data commands above to the ImageNet splits. For end-to-end fine-tuning, we fine-tune the pre-trained model from the first layer of the projection head:

./run.sh imagenet_semi_cls $JOB_NAME vit_small teacher 8 \
  --epochs 1000 \
  --lr 5e-6 \
  --data_location data/imagenet_{1,10}p_split \
  --finetune_head_layer 1
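For intuition, "fine-tuning from the first layer of the projection head" (--finetune_head_layer 1) can be sketched as keeping the backbone plus the first linear layer of the pre-training head and stacking a fresh classifier on top. The module names below are hypothetical, not the repo's actual attributes:

import torch.nn as nn

class FinetuneModel(nn.Module):
    def __init__(self, backbone, head_layer1: nn.Linear, num_classes=1000):
        super().__init__()
        self.backbone = backbone
        self.head_layer1 = head_layer1  # retained from the pre-training projection head
        self.classifier = nn.Linear(head_layer1.out_features, num_classes)  # trained from scratch

    def forward(self, x):
        feats = self.backbone(x)        # assumed to return the [CLS] embedding
        return self.classifier(self.head_layer1(feats))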

Object Detection and Instance Segmentation on COCO

To train ViT-S/16 with Cascaded Mask R-CNN as the task layer, run:

./run.sh coco_det $JOB_NAME vit_small teacher 8 \
  data.samples_per_gpu=4 \
  lr_config.step=8,11 \
  runner.max_epochs=12 \
  optimizer.paramwise_cfg.layer_decay_rate=0.8

To train ViT-B/16 with Cascaded Mask R-CNN as the task layer, run:

./run.sh coco_det $JOB_NAME vit_base teacher 8 \
  data.samples_per_gpu=2 \
  lr_config.step=8,11 \
  runner.max_epochs=12 \
  optimizer.paramwise_cfg.layer_decay_rate=0.75
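The layer_decay_rate option applies layer-wise lr decay: earlier transformer blocks receive geometrically smaller learning rates than the head. A minimal sketch of the multiplier (illustrative; the exact parameter grouping lives in the mmdetection config):

def layer_scale(layer_id, num_layers=12, decay_rate=0.75):
    """lr multiplier for block `layer_id` (0 = patch embed, num_layers = head)."""
    return decay_rate ** (num_layers - layer_id)

# With decay_rate=0.75 and 12 blocks, the patch embedding trains at
# 0.75**12 ~= 0.032x the base lr, while the head trains at the full lr.
print([round(layer_scale(i), 3) for i in (0, 6, 12)])  # [0.032, 0.178, 1.0]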

Semantic Segmentation on ADE20K

To train ViT-S/16 with UperNet as the task layer, run:

./run.sh ade20k_seg $JOB_NAME vit_small teacher 4 \
  data.samples_per_gpu=4 \
  model.backbone.out_with_norm=true \
  optimizer.lr=3e-5

To train ViT-B/16 with fixed backbone and linear head as the task layer, run:

torchrun --nproc_per_node=4  evaluation/semantic_segmentation/train.py evaluation/semantic_segmentation/configs/linear/vit_base_512_*.py \
    --launcher pytorch  --work_dir ${output_dir} --auto_resume --options data.samples_per_gpu=2 model.backbone.frozen_stages=12

To fine-tune the whole network, set model.backbone.frozen_stages=-1.
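As a rough illustration of what frozen_stages controls (assuming mmseg-style semantics and hypothetical attribute names, not the repo's exact backbone code):

# Stages up to and including `frozen_stages` stop receiving gradients;
# -1 leaves everything trainable, 12 freezes the entire 12-block ViT.
def freeze_stages(backbone, frozen_stages: int):
    if frozen_stages >= 0:
        backbone.patch_embed.requires_grad_(False)   # stage 0: patch embedding
    for i, block in enumerate(backbone.blocks, start=1):
        if i <= frozen_stages:
            block.requires_grad_(False)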

To test ViT-B/16, run:

torchrun --nproc_per_node=4 evaluation/semantic_segmentation/test.py evaluation/semantic_segmentation/configs/linear/vit_base_512_ade20k_160k.py <ckpt path>/iter_40000.pth --launcher pytorch --eval mIoU --out < file path to write the results>

Transfer Learning on Smaller Datasets

For historical reasons and reproducibility, we use the default fine-tuning recipe (i.e., without layer-wise decay, with a smaller learning rate and a longer training schedule) proposed in DeiT.

The default configuration in run.sh is for ViT-B/16, and a single one-line command is all you need:

./run.sh cifar10_cls+cifar_cls+cars_cls+flwrs_cls+inat_cls+inat19_cls $JOB_NAME vit_base teacher 8

Note: ViT-S/16 shares most of this configuration, except that we set --lr to 5e-5 for the INAT18 dataset and 2.5e-5 for the INAT19 dataset.

Nearest Neighbor Retrieval

Nearest neighbor retrieval uses the frozen pre-trained features, following the evaluation protocol of DINO. We consider three settings:

Video Object Segmentation on DAVIS

model_dir=<>
python evaluation/eval_video_segmentation.py --pretrained_weights=${model_dir}/weights.pth --arch=vit_base --data_location=../data/DAVIS --output_dir=${model_dir}/eval_seg/video_knn-davis &&\
python evaluation/davis2017-evaluation/evaluation_method.py --task semi-supervised --davis_path ../data/DAVIS --results_path ${model_dir}/eval_seg/video_knn-davis

Image Retrieval on Paris and Oxford

python evaluation/eval_image_retrieval.py --data_location ../data/revisited_paris_oxford/ --arch=vit_base -w outputs/models/SiT/checkpoint.pth --data_set=roxford5k

The evaluation takes about 5 minutes for vit_small.

Batch evaluation:

import os

model_dirs = ['outputs/models/mae-base',
              'outputs/models/SiT',
              'outputs/models/sit_iBOT-ViT_B',
              'outputs/models/ours_450',
              'outputs/models/sit_sit-ViT_B',
              'outputs/models/iBOT-ViT_B',
              'outputs/models/MC_SSL-ViT_B',
              'outputs/models/ours_iBOT2',
              'outputs/models/MSN-ViT_B',
              'outputs/models/DINO-ViT_B',
              'outputs/models/sit-ViT_B']

for m in model_dirs:
    for d in ['roxford5k', 'rparis6k']:
        cmd = (f"python evaluation/eval_image_retrieval.py "
               f"--data_location ../data/revisited_paris_oxford/ --arch=vit_base "
               f"-w {m}/checkpoint.pth --output_dir={m}/eval/ImageRetieval/ --data_set={d}")
        print(cmd)
        os.system(cmd)

Copy Detection on Copydays

./run.sh copydays_copydet $JOB_NAME vit_small teacher 1 \
  --data_location data/copydays

Linear Probing on COCO

CUDA_VISIBLE_DEVICES=1 MODEL_DIR= screen bash evaluation/awesome-semantic-segmentation-pytorch/scripts/linear_coco_dist.sh


Few-shot Classification

python evaluation/eval_fewshot_cls.py --pretrained_weights <weight> --data_location <data_path> --output_dir <output_dir> 
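As a reference point, few-shot evaluation on frozen features is typically a nearest-centroid (prototypical) classifier over sampled episodes. A minimal sketch of one episode (illustrative, not the actual eval_fewshot_cls.py logic):

import torch
import torch.nn.functional as F

def episode_accuracy(support_feats, support_labels, query_feats, query_labels):
    # support_feats: (N*K, D) for an N-way K-shot episode; query_feats: (Q, D)
    classes = support_labels.unique()
    # class prototype = mean of the support embeddings of that class
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in classes])
    # assign each query to the nearest prototype by cosine similarity
    sims = F.normalize(query_feats, dim=1) @ F.normalize(protos, dim=1).t()
    pred = classes[sims.argmax(dim=1)]
    return (pred == query_labels).float().mean().item()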

batch evaluation:

import os

model_dirs = [
    'outputs/models/MMC',
    'outputs/models/mae-base',
    'outputs/models/SiT',
    'outputs/models/iBOT-ViT_B',
    'outputs/models/MSN-ViT_B',
    'outputs/models/DINO-ViT_B',
    'outputs/models/sit-ViT_B',
    'outputs/models/MC_SSL-ViT_B',
]

for m in model_dirs:
    for ds in ['ominiglot']:
        cmd = (f"python evaluation/eval_fewshot_cls.py --data_location ../data "
               f"--arch=vit_base -w {m}/checkpoint.pth --output_dir={m}/eval/fewshot "
               f"--data_set={ds}")
        print(cmd)
        os.system(cmd)