Zero-shot image classification using CLIP

Abdulkader Helwan
7 min read · Sep 11, 2023

State-of-the-art (SotA) computer vision models share a limitation: their understanding of the visual world is shaped almost entirely by their training data.

While these models excel in specific tasks and datasets, their ability to generalize is limited. They struggle with novel categories or images that fall outside the scope of their original training domain.

This brittleness is a problem when building specialized image classification applications, such as identifying defects in agricultural products or detecting counterfeit banknotes to combat fraud. Gathering labeled datasets large enough to fine-tune conventional computer vision models in these niche areas can be very difficult.

Ideally, a computer vision model should learn to grasp the content of images without fixating excessively on the specific labels it was initially trained on. For instance, when presented with an image of a dog, the model should not only recognize the dog but also understand contextual details like the presence of trees in the background, the time of day, and the dog’s location on a grassy field.

Regrettably, classification training works against this ideal. Models tend to cluster their internal representations of dogs into a designated “dog vector space” and those of cats into a designated “cat vector space.” Their focus narrows to a single question: does an image belong to a specific class or not?
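
To make the closed-label behavior described above concrete, here is a minimal sketch of what a conventional classifier head does with its output scores. The label list, the `classify` helper, and the logit values are all made up for illustration and do not come from any real model; the point is only that whatever image goes in, the answer must be one of the labels fixed at training time.

```python
import math

# Hypothetical label set, fixed when the model was trained.
CLASSES = ["dog", "cat"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Pick the most probable label among the trained classes."""
    probs = softmax(logits)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best], probs[best]

# Made-up scores for an image of a horse, a category the model never saw.
label, confidence = classify([0.3, -0.1])
print(label, round(confidence, 3))
```

Even though the imagined input is a horse, the classifier can only answer "dog" or "cat", because the softmax spreads all of the probability over the labels it knows. Nothing in the output says "none of the above."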