Unverified commit 2c79b01a authored by Georgios Pavlakos, committed by GitHub

Initial commit

parent b91ed5ca
Showing 733 additions and 1 deletion
MIT License
Copyright (c) 2023 UC Regents, Georgios Pavlakos
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# HaMeR: Hand Mesh Recovery
Code repository for the paper:
**Reconstructing Hands in 3D with Transformers**
[Georgios Pavlakos](https://geopavlakos.github.io/), [Dandan Shan](https://ddshan.github.io/), [Ilija Radosavovic](https://people.eecs.berkeley.edu/~ilija/), [Angjoo Kanazawa](https://people.eecs.berkeley.edu/~kanazawa/), [David Fouhey](https://cs.nyu.edu/~fouhey/), [Jitendra Malik](http://people.eecs.berkeley.edu/~malik/)
[![arXiv](https://img.shields.io/badge/arXiv-2312.05251-00ff00.svg)](https://arxiv.org/pdf/2312.05251.pdf) [![Website shields.io](https://img.shields.io/website-up-down-green-red/http/shields.io.svg)](https://geopavlakos.github.io/hamer/) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rQbQzegFWGVOm1n1d-S6koOWDo7F2ucu?usp=sharing) [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/geopavlakos/HaMeR)
![teaser](assets/teaser.jpg)
## Installation
First, you need to clone the repo:
```bash
git clone --recursive git@github.com:geopavlakos/hamer.git
cd hamer
```
We recommend creating a virtual environment for HaMeR. You can use venv:
```bash
python3.10 -m venv .hamer
source .hamer/bin/activate
```
or alternatively conda:
```bash
conda create --name hamer python=3.10
conda activate hamer
```
Then, you can install the rest of the dependencies. This is for CUDA 11.7, but you can adapt accordingly:
```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu117
pip install -e .[all]
pip install -v -e third-party/ViTPose
```
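For example, on a machine with CUDA 11.8 the same steps would look like the sketch below; the `cu118` wheel index is only an illustration, so pick the tag that matches your local CUDA/driver setup:
```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install -e .[all]
pip install -v -e third-party/ViTPose
```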
You also need to download the trained models:
```bash
bash fetch_demo_data.sh
```
Besides these files, you also need to download the MANO model. Please visit the [MANO website](https://mano.is.tue.mpg.de) and register to get access to the downloads section. We only require the right hand model. You need to put `MANO_RIGHT.pkl` under the `_DATA/data/mano` folder.
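As a reference, here is a minimal sketch of placing the MANO file, assuming you extracted the official MANO download to `/path/to/mano_v1_2` (that path is illustrative):
```bash
mkdir -p _DATA/data/mano
# copy the right-hand model from wherever you extracted the MANO download
cp /path/to/mano_v1_2/models/MANO_RIGHT.pkl _DATA/data/mano/
```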
## Demo
```bash
python demo.py \
    --img_folder example_data --out_folder demo_out \
    --batch_size=48 --side_view --save_mesh --full_frame
```
## Training
First, download the training data to `./hamer_training_data/` by running:
```bash
bash fetch_training_data.sh
```
Then you can start training using the following command:
```bash
python train.py exp_name=hamer data=mix_all experiment=hamer_vit_transformer trainer=gpu launcher=local
```
Checkpoints and logs will be saved to `./logs/`.
## Acknowledgements
Parts of the code are taken or adapted from the following repos:
- [4DHumans](https://github.com/shubham-goel/4D-Humans)
- [SLAHMR](https://github.com/vye16/slahmr)
- [ProHMR](https://github.com/nkolot/ProHMR)
- [SPIN](https://github.com/nkolot/SPIN)
- [SMPLify-X](https://github.com/vchoutas/smplify-x)
- [HMR](https://github.com/akanazawa/hmr)
- [ViTPose](https://github.com/ViTAE-Transformer/ViTPose)
- [Detectron2](https://github.com/facebookresearch/detectron2)
Additionally, we thank [StabilityAI](https://stability.ai/) for a generous compute grant that enabled this work.
## Citing
If you find this code useful for your research, please consider citing the following paper:
```bibtex
@inproceedings{pavlakos2023reconstructing,
title={Reconstructing Hands in 3{D} with Transformers},
author={Pavlakos, Georgios and Shan, Dandan and Radosavovic, Ilija and Kanazawa, Angjoo and Fouhey, David and Malik, Jitendra},
booktitle={arxiv},
year={2023}
}
```
assets/teaser.jpg (binary image, 1.53 MiB)
demo.py 0 → 100644
from pathlib import Path
import torch
import argparse
import os
import cv2
import numpy as np

from hamer.configs import CACHE_DIR_HAMER
from hamer.models import HAMER, download_models, load_hamer, DEFAULT_CHECKPOINT
from hamer.utils import recursive_to
from hamer.datasets.vitdet_dataset import ViTDetDataset, DEFAULT_MEAN, DEFAULT_STD
from hamer.utils.renderer import Renderer, cam_crop_to_full

LIGHT_BLUE=(0.65098039, 0.74117647, 0.85882353)

from vitpose_model import ViTPoseModel

import json
from typing import Dict, Optional

def main():
    parser = argparse.ArgumentParser(description='HaMeR demo code')
    parser.add_argument('--checkpoint', type=str, default=DEFAULT_CHECKPOINT, help='Path to pretrained model checkpoint')
    parser.add_argument('--img_folder', type=str, default='images', help='Folder with input images')
    parser.add_argument('--out_folder', type=str, default='out_demo', help='Output folder to save rendered results')
    parser.add_argument('--side_view', dest='side_view', action='store_true', default=False, help='If set, render side view also')
    parser.add_argument('--full_frame', dest='full_frame', action='store_true', default=True, help='If set, render all people together also')
    parser.add_argument('--save_mesh', dest='save_mesh', action='store_true', default=False, help='If set, save meshes to disk also')
    parser.add_argument('--batch_size', type=int, default=1, help='Batch size for inference/fitting')
    parser.add_argument('--rescale_factor', type=float, default=2.0, help='Factor for padding the bbox')
    parser.add_argument('--file_type', nargs='+', default=['*.jpg', '*.png'], help='List of file extensions to consider')
    args = parser.parse_args()

    # Download and load checkpoints
    #download_models(CACHE_DIR_HAMER)
    model, model_cfg = load_hamer(args.checkpoint)

    # Setup HaMeR model
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model = model.to(device)
    model.eval()

    # Load detector
    from hamer.utils.utils_detectron2 import DefaultPredictor_Lazy
    from detectron2.config import LazyConfig
    import hamer
    cfg_path = Path(hamer.__file__).parent/'configs'/'cascade_mask_rcnn_vitdet_h_75ep.py'
    detectron2_cfg = LazyConfig.load(str(cfg_path))
    detectron2_cfg.train.init_checkpoint = "https://dl.fbaipublicfiles.com/detectron2/ViTDet/COCO/cascade_mask_rcnn_vitdet_h/f328730692/model_final_f05665.pkl"
    for i in range(3):
        detectron2_cfg.model.roi_heads.box_predictors[i].test_score_thresh = 0.25
    detector = DefaultPredictor_Lazy(detectron2_cfg)

    # keypoint detector
    cpm = ViTPoseModel(device)

    # Setup the renderer
    renderer = Renderer(model_cfg, faces=model.mano.faces)

    # Make output directory if it does not exist
    os.makedirs(args.out_folder, exist_ok=True)

    # Get all demo images that end with .jpg or .png
    img_paths = [img for end in args.file_type for img in Path(args.img_folder).glob(end)]

    # Iterate over all images in folder
    for img_path in img_paths:
        img_cv2 = cv2.imread(str(img_path))

        # Detect humans in image
        det_out = detector(img_cv2)
        img = img_cv2.copy()[:, :, ::-1]

        det_instances = det_out['instances']
        valid_idx = (det_instances.pred_classes==0) & (det_instances.scores > 0.5)
        pred_bboxes=det_instances.pred_boxes.tensor[valid_idx].cpu().numpy()
        pred_scores=det_instances.scores[valid_idx].cpu().numpy()

        # Detect human keypoints for each person
        vitposes_out = cpm.predict_pose(
            img_cv2,
            [np.concatenate([pred_bboxes, pred_scores[:, None]], axis=1)],
        )

        bboxes = []
        is_right = []

        # Use hands based on hand keypoint detections
        for vitposes in vitposes_out:
            left_hand_keyp = vitposes['keypoints'][-42:-21]
            right_hand_keyp = vitposes['keypoints'][-21:]

            # Reject low-confidence detections
            keyp = left_hand_keyp
            valid = keyp[:,2] > 0.5
            if sum(valid) > 3:
                bbox = [keyp[valid,0].min(), keyp[valid,1].min(), keyp[valid,0].max(), keyp[valid,1].max()]
                bboxes.append(bbox)
                is_right.append(0)
            keyp = right_hand_keyp
            valid = keyp[:,2] > 0.5
            if sum(valid) > 3:
                bbox = [keyp[valid,0].min(), keyp[valid,1].min(), keyp[valid,0].max(), keyp[valid,1].max()]
                bboxes.append(bbox)
                is_right.append(1)

        if len(bboxes) == 0:
            continue

        boxes = np.stack(bboxes)
        right = np.stack(is_right)

        # Run reconstruction on all detected hands
        dataset = ViTDetDataset(model_cfg, img_cv2, boxes, right, rescale_factor=args.rescale_factor)
        dataloader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size, shuffle=False, num_workers=0)

        all_verts = []
        all_cam_t = []
        all_right = []

        for batch in dataloader:
            batch = recursive_to(batch, device)
            with torch.no_grad():
                out = model(batch)

            multiplier = (2*batch['right']-1)
            pred_cam = out['pred_cam']
            pred_cam[:,1] = multiplier*pred_cam[:,1]
            box_center = batch["box_center"].float()
            box_size = batch["box_size"].float()
            img_size = batch["img_size"].float()
            multiplier = (2*batch['right']-1)
            scaled_focal_length = model_cfg.EXTRA.FOCAL_LENGTH / model_cfg.MODEL.IMAGE_SIZE * img_size.max()
            pred_cam_t_full = cam_crop_to_full(pred_cam, box_center, box_size, img_size, scaled_focal_length).detach().cpu().numpy()

            # Render the result
            batch_size = batch['img'].shape[0]
            for n in range(batch_size):
                # Get filename from path img_path
                img_fn, _ = os.path.splitext(os.path.basename(img_path))
                person_id = int(batch['personid'][n])
                white_img = (torch.ones_like(batch['img'][n]).cpu() - DEFAULT_MEAN[:,None,None]/255) / (DEFAULT_STD[:,None,None]/255)
                input_patch = batch['img'][n].cpu() * (DEFAULT_STD[:,None,None]/255) + (DEFAULT_MEAN[:,None,None]/255)
                input_patch = input_patch.permute(1,2,0).numpy()

                regression_img = renderer(out['pred_vertices'][n].detach().cpu().numpy(),
                                          out['pred_cam_t'][n].detach().cpu().numpy(),
                                          batch['img'][n],
                                          mesh_base_color=LIGHT_BLUE,
                                          scene_bg_color=(1, 1, 1),
                                          )

                if args.side_view:
                    side_img = renderer(out['pred_vertices'][n].detach().cpu().numpy(),
                                        out['pred_cam_t'][n].detach().cpu().numpy(),
                                        white_img,
                                        mesh_base_color=LIGHT_BLUE,
                                        scene_bg_color=(1, 1, 1),
                                        side_view=True)
                    final_img = np.concatenate([input_patch, regression_img, side_img], axis=1)
                else:
                    final_img = np.concatenate([input_patch, regression_img], axis=1)

                cv2.imwrite(os.path.join(args.out_folder, f'{img_fn}_{person_id}.png'), 255*final_img[:, :, ::-1])

                # Add all verts and cams to list
                verts = out['pred_vertices'][n].detach().cpu().numpy()
                is_right = batch['right'][n].cpu().numpy()
                verts[:,0] = (2*is_right-1)*verts[:,0]
                cam_t = pred_cam_t_full[n]
                all_verts.append(verts)
                all_cam_t.append(cam_t)
                all_right.append(is_right)

                # Save all meshes to disk
                if args.save_mesh:
                    camera_translation = cam_t.copy()
                    tmesh = renderer.vertices_to_trimesh(verts, camera_translation, LIGHT_BLUE, is_right=is_right)
                    tmesh.export(os.path.join(args.out_folder, f'{img_fn}_{person_id}.obj'))

        # Render front view
        if args.full_frame and len(all_verts) > 0:
            misc_args = dict(
                mesh_base_color=LIGHT_BLUE,
                scene_bg_color=(1, 1, 1),
                focal_length=scaled_focal_length,
            )
            cam_view = renderer.render_rgba_multiple(all_verts, cam_t=all_cam_t, render_res=img_size[n], is_right=all_right, **misc_args)

            # Overlay image
            input_img = img_cv2.astype(np.float32)[:,:,::-1]/255.0
            input_img = np.concatenate([input_img, np.ones_like(input_img[:,:,:1])], axis=2) # Add alpha channel
            input_img_overlay = input_img[:,:,:3] * (1-cam_view[:,:,3:]) + cam_view[:,:,:3] * cam_view[:,:,3:]

            cv2.imwrite(os.path.join(args.out_folder, f'{img_fn}_all.jpg'), 255*input_img_overlay[:, :, ::-1])


if __name__ == '__main__':
    main()
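If `--save_mesh` is set, the demo writes one `.obj` file per detected hand into the output folder. A minimal sketch for inspecting such a file with trimesh (the filename below is illustrative, and `mesh.show()` assumes a viewer backend such as pyglet is installed):
```python
import trimesh

# Illustrative path: produced by the demo above for image test1.jpg, person/hand id 0
mesh = trimesh.load('demo_out/test1_0.obj')
print(mesh.vertices.shape, mesh.faces.shape)  # MANO hand meshes have 778 vertices
mesh.show()  # opens an interactive viewer if a backend is available
```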
example_data/test1.jpg (101 KiB)
example_data/test2.jpg (32.9 KiB)
example_data/test3.jpg (86.2 KiB)
example_data/test4.jpg (488 KiB)
example_data/test5.jpg (166 KiB)
fetch_demo_data.sh
wget https://www.dropbox.com/s/6zejmxu0aur3568/hamer_demo_data.tar.gz
tar --warning=no-unknown-keyword --exclude=".*" -xvf hamer_demo_data.tar.gz
fetch_training_data.sh
# Downloading all tars
wget -O hamer_training_data_part1.tar.gz https://www.dropbox.com/scl/fi/f249h32hd35x78l058ofy/hamer_training_data_part1.tar.gz?rlkey=puuvwg5ngueaxl4xxwf3yd15a
wget -O hamer_training_data_part2.tar.gz https://www.dropbox.com/scl/fi/l9l5udalchu0mh4qxnw2t/hamer_training_data_part2.tar.gz?rlkey=i0n2lzix4q6jxmhm4sr5rtmkt
wget -O hamer_training_data_part3.tar.gz https://www.dropbox.com/scl/fi/6lamcbwt79ri0oj4knwm3/hamer_training_data_part3.tar.gz?rlkey=j5y7ea7xrlu440ud12otaj2ne
wget -O hamer_training_data_part4a.tar.gz https://www.dropbox.com/scl/fi/vp6cw7he8t0eigjf6001l/hamer_training_data_part4a.tar.gz?rlkey=wylmufft4a5nq3yxep2olifrk
wget -O hamer_training_data_part4b.tar.gz https://www.dropbox.com/scl/fi/vyjasngr67ru14fb8s108/hamer_training_data_part4b.tar.gz?rlkey=qgotg1v9lkgo5eu78gh8b007t
wget -O hamer_training_data_part4c.tar.gz https://www.dropbox.com/scl/fi/nfvz5zpcmhz8hkwzc6ji4/hamer_training_data_part4c.tar.gz?rlkey=ygh0wvse04twhh1ri3xiw2sag
# Extracting all tars (the downloaded archives end in .tar.gz)
for f in hamer_training_data_part*.tar.gz; do
    tar --warning=no-unknown-keyword --exclude=".*" -xvf $f
done
import os
from typing import Dict
from yacs.config import CfgNode as CN
CACHE_DIR_HAMER = "./_DATA"
def to_lower(x: Dict) -> Dict:
    """
    Convert all dictionary keys to lowercase
    Args:
      x (dict): Input dictionary
    Returns:
      dict: Output dictionary with all keys converted to lowercase
    """
    return {k.lower(): v for k, v in x.items()}
_C = CN(new_allowed=True)
_C.GENERAL = CN(new_allowed=True)
_C.GENERAL.RESUME = True
_C.GENERAL.TIME_TO_RUN = 3300
_C.GENERAL.VAL_STEPS = 100
_C.GENERAL.LOG_STEPS = 100
_C.GENERAL.CHECKPOINT_STEPS = 20000
_C.GENERAL.CHECKPOINT_DIR = "checkpoints"
_C.GENERAL.SUMMARY_DIR = "tensorboard"
_C.GENERAL.NUM_GPUS = 1
_C.GENERAL.NUM_WORKERS = 4
_C.GENERAL.MIXED_PRECISION = True
_C.GENERAL.ALLOW_CUDA = True
_C.GENERAL.PIN_MEMORY = False
_C.GENERAL.DISTRIBUTED = False
_C.GENERAL.LOCAL_RANK = 0
_C.GENERAL.USE_SYNCBN = False
_C.GENERAL.WORLD_SIZE = 1
_C.TRAIN = CN(new_allowed=True)
_C.TRAIN.NUM_EPOCHS = 100
_C.TRAIN.BATCH_SIZE = 32
_C.TRAIN.SHUFFLE = True
_C.TRAIN.WARMUP = False
_C.TRAIN.NORMALIZE_PER_IMAGE = False
_C.TRAIN.CLIP_GRAD = False
_C.TRAIN.CLIP_GRAD_VALUE = 1.0
_C.LOSS_WEIGHTS = CN(new_allowed=True)
_C.DATASETS = CN(new_allowed=True)
_C.MODEL = CN(new_allowed=True)
_C.MODEL.IMAGE_SIZE = 224
_C.EXTRA = CN(new_allowed=True)
_C.EXTRA.FOCAL_LENGTH = 5000
_C.DATASETS.CONFIG = CN(new_allowed=True)
_C.DATASETS.CONFIG.SCALE_FACTOR = 0.3
_C.DATASETS.CONFIG.ROT_FACTOR = 30
_C.DATASETS.CONFIG.TRANS_FACTOR = 0.02
_C.DATASETS.CONFIG.COLOR_SCALE = 0.2
_C.DATASETS.CONFIG.ROT_AUG_RATE = 0.6
_C.DATASETS.CONFIG.TRANS_AUG_RATE = 0.5
_C.DATASETS.CONFIG.DO_FLIP = False
_C.DATASETS.CONFIG.FLIP_AUG_RATE = 0.5
_C.DATASETS.CONFIG.EXTREME_CROP_AUG_RATE = 0.10
def default_config() -> CN:
    """
    Get a yacs CfgNode object with the default config values.
    """
    # Return a clone so that the defaults will not be altered
    # This is for the "local variable" use pattern
    return _C.clone()
def dataset_config() -> CN:
    """
    Get dataset config file
    Returns:
      CfgNode: Dataset config as a yacs CfgNode object.
    """
    cfg = CN(new_allowed=True)
    config_file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'datasets_tar.yaml')
    cfg.merge_from_file(config_file)
    cfg.freeze()
    return cfg
def get_config(config_file: str, merge: bool = True, update_cachedir: bool = False) -> CN:
    """
    Read a config file and optionally merge it with the default config file.
    Args:
      config_file (str): Path to config file.
      merge (bool): Whether to merge with the default config or not.
    Returns:
      CfgNode: Config as a yacs CfgNode object.
    """
    if merge:
        cfg = default_config()
    else:
        cfg = CN(new_allowed=True)
    cfg.merge_from_file(config_file)

    if update_cachedir:
        def update_path(path: str) -> str:
            if os.path.isabs(path):
                return path
            return os.path.join(CACHE_DIR_HAMER, path)

        cfg.MANO.MODEL_PATH = update_path(cfg.MANO.MODEL_PATH)
        cfg.MANO.MEAN_PARAMS = update_path(cfg.MANO.MEAN_PARAMS)

    cfg.freeze()
    return cfg
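A short sketch of how these helpers might be used; `logs/my_run/model_config.yaml` is a hypothetical path, and any yacs-compatible config file that defines a `MANO` section would work for the `update_cachedir` branch:
```python
from hamer.configs import default_config, dataset_config, get_config

# Built-in defaults (a clone of _C, so local edits do not leak into the module)
cfg = default_config()
print(cfg.MODEL.IMAGE_SIZE, cfg.EXTRA.FOCAL_LENGTH)  # 224 5000

# Webdataset shard definitions from datasets_tar.yaml
datasets = dataset_config()

# Merge an experiment config over the defaults and remap MANO paths into ./_DATA
cfg = get_config('logs/my_run/model_config.yaml', merge=True, update_cachedir=True)
```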
## coco_loader_lsj.py
import detectron2.data.transforms as T
from detectron2 import model_zoo
from detectron2.config import LazyCall as L
# Data using LSJ
image_size = 1024
dataloader = model_zoo.get_config("common/data/coco.py").dataloader
dataloader.train.mapper.augmentations = [
    L(T.RandomFlip)(horizontal=True),  # flip first
    L(T.ResizeScale)(
        min_scale=0.1, max_scale=2.0, target_height=image_size, target_width=image_size
    ),
    L(T.FixedSizeCrop)(crop_size=(image_size, image_size), pad=False),
]
dataloader.train.mapper.image_format = "RGB"
dataloader.train.total_batch_size = 64
# recompute boxes due to cropping
dataloader.train.mapper.recompute_boxes = True
dataloader.test.mapper.augmentations = [
    L(T.ResizeShortestEdge)(short_edge_length=image_size, max_size=image_size),
]
from functools import partial
from fvcore.common.param_scheduler import MultiStepParamScheduler
from detectron2 import model_zoo
from detectron2.config import LazyCall as L
from detectron2.solver import WarmupParamScheduler
from detectron2.modeling.backbone.vit import get_vit_lr_decay_rate
# mask_rcnn_vitdet_b_100ep.py
model = model_zoo.get_config("common/models/mask_rcnn_vitdet.py").model
# Initialization and trainer settings
train = model_zoo.get_config("common/train.py").train
train.amp.enabled = True
train.ddp.fp16_compression = True
train.init_checkpoint = "detectron2://ImageNetPretrained/MAE/mae_pretrain_vit_base.pth"
# Schedule
# 100 ep = 184375 iters * 64 images/iter / 118000 images/ep
train.max_iter = 184375
lr_multiplier = L(WarmupParamScheduler)(
    scheduler=L(MultiStepParamScheduler)(
        values=[1.0, 0.1, 0.01],
        milestones=[163889, 177546],
        num_updates=train.max_iter,
    ),
    warmup_length=250 / train.max_iter,
    warmup_factor=0.001,
)
# Optimizer
optimizer = model_zoo.get_config("common/optim.py").AdamW
optimizer.params.lr_factor_func = partial(get_vit_lr_decay_rate, num_layers=12, lr_decay_rate=0.7)
optimizer.params.overrides = {"pos_embed": {"weight_decay": 0.0}}
# cascade_mask_rcnn_vitdet_b_100ep.py
from detectron2.config import LazyCall as L
from detectron2.layers import ShapeSpec
from detectron2.modeling.box_regression import Box2BoxTransform
from detectron2.modeling.matcher import Matcher
from detectron2.modeling.roi_heads import (
    FastRCNNOutputLayers,
    FastRCNNConvFCHead,
    CascadeROIHeads,
)
# arguments that don't exist for Cascade R-CNN
[model.roi_heads.pop(k) for k in ["box_head", "box_predictor", "proposal_matcher"]]
model.roi_heads.update(
    _target_=CascadeROIHeads,
    box_heads=[
        L(FastRCNNConvFCHead)(
            input_shape=ShapeSpec(channels=256, height=7, width=7),
            conv_dims=[256, 256, 256, 256],
            fc_dims=[1024],
            conv_norm="LN",
        )
        for _ in range(3)
    ],
    box_predictors=[
        L(FastRCNNOutputLayers)(
            input_shape=ShapeSpec(channels=1024),
            test_score_thresh=0.05,
            box2box_transform=L(Box2BoxTransform)(weights=(w1, w1, w2, w2)),
            cls_agnostic_bbox_reg=True,
            num_classes="${...num_classes}",
        )
        for (w1, w2) in [(10, 5), (20, 10), (30, 15)]
    ],
    proposal_matchers=[
        L(Matcher)(thresholds=[th], labels=[0, 1], allow_low_quality_matches=False)
        for th in [0.5, 0.6, 0.7]
    ],
)
# cascade_mask_rcnn_vitdet_h_75ep.py
from functools import partial
train.init_checkpoint = "detectron2://ImageNetPretrained/MAE/mae_pretrain_vit_huge_p14to16.pth"
model.backbone.net.embed_dim = 1280
model.backbone.net.depth = 32
model.backbone.net.num_heads = 16
model.backbone.net.drop_path_rate = 0.5
# 7, 15, 23, 31 for global attention
model.backbone.net.window_block_indexes = (
    list(range(0, 7)) + list(range(8, 15)) + list(range(16, 23)) + list(range(24, 31))
)
optimizer.params.lr_factor_func = partial(get_vit_lr_decay_rate, lr_decay_rate=0.9, num_layers=32)
optimizer.params.overrides = {}
optimizer.params.weight_decay_norm = None
train.max_iter = train.max_iter * 3 // 4 # 100ep -> 75ep
lr_multiplier.scheduler.milestones = [
    milestone * 3 // 4 for milestone in lr_multiplier.scheduler.milestones
]
lr_multiplier.scheduler.num_updates = train.max_iter
datasets_tar.yaml
FREIHAND-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/freihand-train/{000000..000130}.tar
  epoch_size: 130_240
INTERHAND26M-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/interhand26m-train/{000000..001056}.tar
  epoch_size: 1_424_632
HALPE-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/halpe-train/{000000..000022}.tar
  epoch_size: 34_289
COCOW-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/cocow-train/{000000..000036}.tar
  epoch_size: 78_666
MTC-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/mtc-train/{000000..000306}.tar
  epoch_size: 363_947
RHD-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/rhd-train/{000000..000041}.tar
  epoch_size: 61_705
MPIINZSL-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/mpiinzsl-train/{000000..000015}.tar
  epoch_size: 15_184
HO3D-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/ho3d-train/{000000..000083}.tar
  epoch_size: 83_325
H2O3D-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/h2o3d-train/{000000..000060}.tar
  epoch_size: 121_996
DEX-TRAIN:
  TYPE: ImageDataset
  URLS: hamer_training_data/dataset_tars/dex-train/{000000..000406}.tar
  epoch_size: 406_888
FREIHAND-MOCAP:
  DATASET_FILE: hamer_training_data/freihand_mocap.npz
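The `URLS` entries describe ranges of tar shards with a `{start..end}` pattern. A small, self-contained sketch of expanding such a pattern into explicit shard paths (this is illustrative plain Python, not the loader the training code actually uses):
```python
import re

def expand_shards(pattern: str) -> list[str]:
    """Expand a '{000000..000130}'-style range into an explicit list of shard paths."""
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if m is None:
        return [pattern]
    lo, hi = m.group(1), m.group(2)
    width = len(lo)  # keep the zero-padding of the original pattern
    prefix, suffix = pattern[:m.start()], pattern[m.end():]
    return [prefix + f"{i:0{width}d}" + suffix for i in range(int(lo), int(hi) + 1)]

shards = expand_shards("hamer_training_data/dataset_tars/freihand-train/{000000..000130}.tar")
print(len(shards), shards[0])  # 131 hamer_training_data/dataset_tars/freihand-train/000000.tar
```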
# @package _global_
defaults:
  - /data_filtering: low1

DATASETS:
  TRAIN:
    FREIHAND-TRAIN:
      WEIGHT: 0.25
    INTERHAND26M-TRAIN:
      WEIGHT: 0.25
    MTC-TRAIN:
      WEIGHT: 0.1
    RHD-TRAIN:
      WEIGHT: 0.05
    COCOW-TRAIN:
      WEIGHT: 0.1
    HALPE-TRAIN:
      WEIGHT: 0.05
    MPIINZSL-TRAIN:
      WEIGHT: 0.05
    HO3D-TRAIN:
      WEIGHT: 0.05
    H2O3D-TRAIN:
      WEIGHT: 0.05
    DEX-TRAIN:
      WEIGHT: 0.05
  VAL:
    FREIHAND-TRAIN:
      WEIGHT: 1.0
  MOCAP: FREIHAND-MOCAP
# @package _global_
DATASETS:
  # Data filtering during training
  SUPPRESS_KP_CONF_THRESH: 0.3
  FILTER_NUM_KP: 4
  FILTER_NUM_KP_THRESH: 0.0
  FILTER_REPROJ_THRESH: 31000
  SUPPRESS_BETAS_THRESH: 3.0
  SUPPRESS_BAD_POSES: False
  POSES_BETAS_SIMULTANEOUS: True
  FILTER_NO_POSES: False  # If True, filters images that don't have poses
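Roughly speaking, thresholds like these control keypoint suppression and per-sample filtering. The snippet below is only a plausible illustration of that idea, not the repository's actual filtering code:
```python
import numpy as np

SUPPRESS_KP_CONF_THRESH = 0.3  # zero out keypoint confidences below this value
FILTER_NUM_KP = 4              # require at least this many confident keypoints
FILTER_NUM_KP_THRESH = 0.0     # confidence level used for the count above

def filter_sample(keypoints_2d: np.ndarray):
    """keypoints_2d: (N, 3) array of (x, y, conf). Returns cleaned keypoints, or None to drop the sample."""
    kp = keypoints_2d.copy()
    # Suppress unreliable detections instead of trusting noisy annotations
    kp[kp[:, 2] < SUPPRESS_KP_CONF_THRESH, 2] = 0.0
    # Drop the sample if too few confident keypoints remain
    if (kp[:, 2] > FILTER_NUM_KP_THRESH).sum() < FILTER_NUM_KP:
        return None
    return kp
```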
# @package _global_
MANO:
  DATA_DIR: _DATA/data/
  MODEL_PATH: ${MANO.DATA_DIR}/mano
  GENDER: neutral
  NUM_HAND_JOINTS: 15
  MEAN_PARAMS: ${MANO.DATA_DIR}/mano_mean_params.npz
  CREATE_BODY_POSE: FALSE
EXTRA:
  FOCAL_LENGTH: 5000
  NUM_LOG_IMAGES: 4
  NUM_LOG_SAMPLES_PER_IMAGE: 8
  PELVIS_IND: 0
DATASETS:
  BETAS_REG: True
  CONFIG:
    SCALE_FACTOR: 0.3
    ROT_FACTOR: 30
    TRANS_FACTOR: 0.02
    COLOR_SCALE: 0.2
    ROT_AUG_RATE: 0.6
    TRANS_AUG_RATE: 0.5
    DO_FLIP: False
    FLIP_AUG_RATE: 0.0
    EXTREME_CROP_AUG_RATE: 0.0
    EXTREME_CROP_AUG_LEVEL: 1
# @package _global_
defaults:
  - default.yaml

GENERAL:
  TOTAL_STEPS: 1_000_000
  LOG_STEPS: 1000
  VAL_STEPS: 1000
  CHECKPOINT_STEPS: 1000
  CHECKPOINT_SAVE_TOP_K: 1
  NUM_WORKERS: 25
  PREFETCH_FACTOR: 2

TRAIN:
  LR: 1e-5
  WEIGHT_DECAY: 1e-4
  BATCH_SIZE: 8
  LOSS_REDUCTION: mean
  NUM_TRAIN_SAMPLES: 2
  NUM_TEST_SAMPLES: 64
  POSE_2D_NOISE_RATIO: 0.01
  SMPL_PARAM_NOISE_RATIO: 0.005

MODEL:
  IMAGE_SIZE: 256
  IMAGE_MEAN: [0.485, 0.456, 0.406]
  IMAGE_STD: [0.229, 0.224, 0.225]
  BACKBONE:
    TYPE: vit
    PRETRAINED_WEIGHTS: hamer_training_data/vitpose_backbone.pth
  MANO_HEAD:
    TYPE: transformer_decoder
    IN_CHANNELS: 2048
    TRANSFORMER_DECODER:
      depth: 6
      heads: 8
      mlp_dim: 1024
      dim_head: 64
      dropout: 0.0
      emb_dropout: 0.0
      norm: layer
      context_dim: 1280 # from vitpose-H

LOSS_WEIGHTS:
  KEYPOINTS_3D: 0.05
  KEYPOINTS_2D: 0.01
  GLOBAL_ORIENT: 0.001
  HAND_POSE: 0.001
  BETAS: 0.0005
  ADVERSARIAL: 0.0005
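These weights scale the individual loss terms before they are summed into the training objective. A generic sketch of that pattern (the tensors below are placeholders; the real terms come from the model's predictions and ground truth):
```python
import torch

loss_weights = {
    'KEYPOINTS_3D': 0.05, 'KEYPOINTS_2D': 0.01,
    'GLOBAL_ORIENT': 0.001, 'HAND_POSE': 0.001,
    'BETAS': 0.0005, 'ADVERSARIAL': 0.0005,
}
# Placeholder per-term losses; during training these are computed per batch
losses = {name: torch.tensor(1.0) for name in loss_weights}
total_loss = sum(weight * losses[name] for name, weight in loss_weights.items())
```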
# disable python warnings if they annoy you
ignore_warnings: False
# ask user for tags if none are provided in the config
enforce_tags: True
# pretty print config tree at the start of the run using Rich library
print_config: True