How to Finetune VideoMAE

Abdulkader Helwan
3 min read · Dec 29, 2023

VideoMAE is a self-supervised video pre-training method that uses masked autoencoders to learn data-efficient video representations. It masks a very high proportion of each video (90% to 95% in the paper), which makes video reconstruction a more challenging self-supervision task and encourages representations that generalize well on small datasets. The authors of the paper show that VideoMAE is a data-efficient learner for self-supervised video pre-training and that it achieves strong results on very small datasets without using any extra data. The code for VideoMAE is available on GitHub.

VideoMAE architecture

What is a Masked Autoencoder

A masked autoencoder is a neural network that learns meaningful latent representations of its input by reconstructing data that has been partly hidden. It masks random patches of the input image and reconstructs the missing pixels. The approach rests on two core designs. The first is an asymmetric encoder-decoder architecture: the encoder operates only on the visible subset of patches (without mask tokens), and a lightweight decoder reconstructs the original image from the latent representation and mask tokens. The second is a high masking ratio: masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Masked autoencoders are scalable self-supervised learners for computer vision and…
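To make the masking step concrete, here is a minimal sketch in plain Python. It is not the code from the MAE or VideoMAE repositories. The helper names (`random_masking`, `decoder_input`), the 224x224 image size, and the 16x16 patch size are assumptions chosen for illustration, and the "encoder" is just a stand-in that labels each visible patch.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into visible and masked sets (hypothetical helper)."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    visible = sorted(shuffled[:num_keep])
    masked = sorted(shuffled[num_keep:])
    return visible, masked


def decoder_input(encoded_visible, visible, num_patches, mask_token="[MASK]"):
    """Put encoded visible patches back in position and fill the rest with a mask token."""
    full = [mask_token] * num_patches
    for idx, latent in zip(visible, encoded_visible):
        full[idx] = latent
    return full


# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
visible, masked = random_masking(num_patches, mask_ratio=0.75, seed=0)

# Stand-in for the encoder: it only ever receives the visible patches.
encoded = [f"latent_{i}" for i in visible]

# The lightweight decoder sees the full-length sequence: latents plus mask tokens.
full_sequence = decoder_input(encoded, visible, num_patches)

print(len(visible), len(masked))  # 49 147
```

The asymmetry shows up in the sequence lengths: the encoder processes only 49 of the 196 patches, while the decoder works on all 196 positions. That is why the encoder can be large while pre-training stays affordable.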