Video Classification Using CNN and Transformer

Abdulkader Helwan
6 min read · Feb 2, 2023

Video classification is an important task in computer vision, with applications in areas such as surveillance, autonomous vehicles, and medical diagnostics. Many earlier methods applied 2D convolutional neural networks (CNNs) to individual frames. Because a 2D CNN sees each frame in isolation, it cannot by itself model the temporal relationships between frames, and so it misses spatiotemporal features such as motion.

To address these limitations, 3D convolutional neural networks (3D CNNs) have been proposed. A 3D CNN works like a 2D CNN, but its kernels also extend along the time axis, so it operates on a stack of consecutive frames rather than on one frame at a time. This lets it learn spatiotemporal features, such as motion, that a 2D CNN applied to single frames cannot capture.

In this blog post, we will discuss how to classify videos using 3D convolutions in TensorFlow. We will first look at the architecture of 3D CNNs and then build a 3D CNN for video classification in TensorFlow. We will also show how to use a CNN as a feature extractor for the frames of a video and feed those features to a Transformer that acts as the classification model.
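The second approach can be pictured as a two-stage pipeline: a CNN turns each frame into a feature vector, and a Transformer mixes information across frames before classification. The sketch below is a deliberately simplified illustration in plain Python, not the model built in this post. The feature vectors are made-up stand-ins for CNN outputs, the helper names are hypothetical, and `self_attention` uses identity projections (queries, keys, and values are the features themselves), whereas a real Transformer layer adds learned projections, multiple heads, positional information, and feed-forward sublayers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with Q = K = V = features (assumption:
    no learned projections, single head)."""
    d = len(features[0])
    attended = []
    for q in features:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in features]
        weights = softmax(scores)
        # Each output is a weighted average of all frame features.
        attended.append([sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)])
    return attended


def mean_pool(features):
    """Average the per-frame vectors into one clip-level vector."""
    n = len(features)
    return [sum(f[j] for f in features) / n for j in range(len(features[0]))]


# Hypothetical per-frame features, standing in for the output of a CNN
# feature extractor (4 frames, 3 features each).
frame_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

attended = self_attention(frame_features)  # one context-aware vector per frame
clip_vector = mean_pool(attended)          # single vector describing the clip
```

In a full model, `clip_vector` would then be passed to a classification head, for example a dense layer followed by a softmax over the video classes.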

Classifying Videos Using 3D Convolutions in TensorFlow

The architecture of a 3D CNN is similar to that of a 2D CNN, with two main differences. First, a 3D CNN uses three-dimensional kernels that slide along time as well as height and width, which allows them to capture temporal relationships between the frames of a video. Second, its feature maps are also three-dimensional, so the time dimension is preserved from layer to layer and deeper layers can build on spatiotemporal features such as motion.
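To make the first point concrete, here is a minimal sketch in plain Python of what a single 3D kernel computes. It is not the model built in this post. The helper names `conv3d_valid` and `make_video`, the toy clips, and the 2x1x1 frame-difference kernel are all assumptions made for illustration; in practice a layer such as Keras `layers.Conv3D` learns its kernel weights and handles multiple channels and filters.

```python
def conv3d_valid(video, kernel):
    """Naive single-channel 3D cross-correlation with 'valid' padding.

    video:  nested list indexed as video[t][y][x]
    kernel: nested list indexed as kernel[dt][dy][dx]
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel covers a small block in time AND space.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def make_video(num_frames, size, moving):
    """Toy clip: one bright pixel on row 0 that stays put or shifts right each frame."""
    frames = []
    for t in range(num_frames):
        frame = [[0.0] * size for _ in range(size)]
        x = t if moving else 0
        frame[0][x] = 1.0
        frames.append(frame)
    return frames


# Hypothetical kernel spanning 2 frames x 1 x 1 pixels: next frame minus current frame.
temporal_diff = [[[-1.0]], [[1.0]]]

static_clip = make_video(num_frames=3, size=4, moving=False)
moving_clip = make_video(num_frames=3, size=4, moving=True)

static_response = conv3d_valid(static_clip, temporal_diff)
moving_response = conv3d_valid(moving_clip, temporal_diff)

print(static_response[0][0])  # [0.0, 0.0, 0.0, 0.0]
print(moving_response[0][0])  # [-1.0, 1.0, 0.0, 0.0]
```

On the static clip every response is zero, while on the moving clip the kernel responds where the pixel leaves and where it arrives. This is the kind of motion cue that a kernel confined to a single frame cannot produce. With 'valid' padding, each output dimension is the input size minus the kernel size plus one, so 3 frames and a 2-frame kernel give 2 output time steps.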

Here is a snippet of the code for creating the 3D CNN in TensorFlow, starting with the imports:

import numpy as np
import h5py
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.initializers import Constant
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

Load the dataset


Build the model

model = Sequential()…