Experiments in using Masked Autoencoders (MAE) to pre-train a Vision Transformer (ViT) on EXTBNB data.
We will use this as a baseline for several experiments:
- Fine-tuning for Matt's network
- Fine-tuning for LArMatch
- Collaboration with Bill Freeman on unsupervised semantic segmentation
- Fine-tuning for DeiT
The Run3 G1 EXTBNB sample has 34K files with about 15 events each. If we aim for a crop size of 512x512, we will get about 2*4 = 8 images from each event.
This gives an effective sample of roughly 4 million images per plane (34K files x 15 events x 8 crops), a bit more for the Y-plane.
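As a quick sanity check on the numbers above, here is a minimal sketch of the sample-size arithmetic plus a helper that lists non-overlapping crop positions. The helper name `crop_origins` and the 1024x2048 image shape are hypothetical placeholders picked only so the grid comes out to 2 x 4; they are not the actual detector image dimensions.

```python
# Numbers quoted in the text: 34K files, ~15 events each, ~2*4 crops per event.
n_files = 34_000
events_per_file = 15          # approximate
crops_per_event = 2 * 4       # 512x512 crops per plane, per event

n_images_per_plane = n_files * events_per_file * crops_per_event
print(f"{n_images_per_plane:,} crops per plane")  # 4,080,000 crops per plane


def crop_origins(n_rows, n_cols, crop=512):
    """Upper-left (row, col) corners of non-overlapping crops that fit fully inside the image."""
    return [
        (r, c)
        for r in range(0, n_rows - crop + 1, crop)
        for c in range(0, n_cols - crop + 1, crop)
    ]


# Placeholder image shape (assumption), chosen so the grid is 2 rows x 4 columns.
origins = crop_origins(1024, 2048)
print(len(origins))  # 8
```

Any leftover border smaller than one crop is simply dropped here; padding or overlapping crops would change the count.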
We use larbys/larcv Version 1 for handling MicroBooNE data.
- Generate a small sample (size 10)
- Get a ViT encoder, maybe a standard one from something like mm
- Define the decoder, just a small number of blocks
- Practice training on a small sample or even a single image
- Big training
- Fine-tuning on MC-labeled SSNet
- Write paper, publish weights.
The repository lucidrains/vit-pytorch has a ViT and an MAE wrapper of some sort. Easy peasy!