Introduction
Have you ever wanted to find a particular image in an ever-growing image collection, but found scrolling through it too tedious? In this tutorial we’ll build an image similarity search engine to easily find images using either a text query or a reference image. For your convenience, the complete code for this tutorial is provided at the bottom of the article as a Colab notebook.
Pipeline Overview
The semantic meaning of an image can be represented by a numerical vector called an embedding. Comparing these low-dimensional embedding vectors, rather than the raw images, allows for efficient similarity searches. For each image in the dataset, we’ll create an embedding vector and store it in an index. When a text query or a reference image is provided, its embedding is generated and compared against the indexed embeddings to retrieve the most similar images.
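To make this concrete, here is a toy example that compares two made-up embedding vectors with cosine similarity, one common similarity measure for embeddings (the values are invented for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 means the vectors point in the same direction; 0.0 means orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; real CLIP ViT-B/32 embeddings have 512 dimensions.
query_embedding = np.array([0.1, 0.9, 0.2, 0.4])
image_embedding = np.array([0.2, 0.8, 0.1, 0.5])

print(cosine_similarity(query_embedding, image_embedding))  # ~0.98, highly similar
```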
Here’s a brief overview:
- Embedding: The embeddings of the images are extracted using the CLIP model.
- Indexing: The embeddings are stored as a FAISS index.
- Retrieval: With FAISS, the embedding of the query is compared against the indexed embeddings to retrieve the most similar images (an end-to-end sketch follows this list).
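To see how these stages fit together before we dive in, here is a compact end-to-end sketch. It assumes the `clip-ViT-B-32` checkpoint from SentenceTransformers and a local `images/` folder of `.jpg` files; the folder and the query text are illustrative, not part of the original notebook:

```python
import faiss
import numpy as np
from pathlib import Path
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")

# 1. Embedding: encode every image in the dataset.
image_paths = sorted(Path("images").glob("*.jpg"))  # illustrative location
images = [Image.open(p) for p in image_paths]
embeddings = np.asarray(model.encode(images), dtype="float32")

# 2. Indexing: store the embeddings in a FAISS index. Inner product on
# L2-normalized vectors is equivalent to cosine similarity.
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# 3. Retrieval: embed a text query and fetch the 3 closest images.
query = model.encode(["a dog playing on the beach"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)
print([image_paths[i].name for i in ids[0]])
```

The sections below walk through each stage in more detail.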
CLIP Model
The CLIP (Contrastive Language-Image Pre-training) model, developed by OpenAI, is a multi-modal vision and language model that maps images and text to the same latent space. Since we will use both image and text queries to search for images, we will use the CLIP model to embed our data. For further reading about CLIP, you can check out my previous article here.
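As a small illustration of that shared latent space, the sketch below embeds one image and two candidate captions and scores them against each other; the `clip-ViT-B-32` checkpoint and the `dog.jpg` file name are placeholders chosen for the example:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Image and text are mapped into the same 512-dimensional space.
img_emb = model.encode(Image.open("dog.jpg"))  # placeholder image
txt_emb = model.encode(["a photo of a dog", "a photo of a car"])

# The caption that matches the image should receive the higher cosine score.
print(util.cos_sim(img_emb, txt_emb))
```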
FAISS Index
FAISS (Facebook AI Similarity Search) is an open-source library developed by Meta. It is built around the Index object that stores the database embedding vectors. FAISS enables efficient similarity search and clustering of dense vectors, and we will use it to index our dataset and retrieve the photos that most closely resemble the query.
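Here is a minimal sketch of that workflow, with random vectors standing in for real image embeddings:

```python
import faiss
import numpy as np

dim = 512  # the embedding size of CLIP ViT-B/32
database = np.random.rand(52, dim).astype("float32")  # stand-ins for image embeddings

index = faiss.IndexFlatIP(dim)  # exact search using inner-product similarity
index.add(database)             # store the database vectors in the index

query = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(query, 3)  # retrieve the top-3 nearest neighbors
print(ids[0])  # positions of the most similar database vectors
```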
Code Implementation
Step 1 — Dataset Exploration
To create the image dataset for this tutorial, I collected 52 images on varied topics from Pexels. To get a feel for the data, let's display 10 random images:
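One way to do this, assuming the images sit in a local `images/` folder (the path is an assumption for this sketch), is with a small matplotlib grid:

```python
import random
from pathlib import Path

import matplotlib.pyplot as plt
from PIL import Image

image_paths = list(Path("images").glob("*.jpg"))  # assumed dataset location
sample = random.sample(image_paths, 10)

# Lay the 10 sampled images out on a 2x5 grid.
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for ax, path in zip(axes.flat, sample):
    ax.imshow(Image.open(path))
    ax.set_title(path.name, fontsize=8)
    ax.axis("off")
plt.show()
```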

Step 2 — Extract CLIP Embeddings from the Image Dataset
To extract CLIP embeddings, we'll first load the CLIP model using Hugging Face's SentenceTransformers library:
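A sketch of this step, assuming the `clip-ViT-B-32` checkpoint and the same illustrative `images/` folder as above:

```python
from pathlib import Path

from PIL import Image
from sentence_transformers import SentenceTransformer

# Load a pre-trained CLIP checkpoint; other CLIP variants are also available.
model = SentenceTransformer("clip-ViT-B-32")

# Encode every image in the dataset into a 512-dimensional embedding vector.
image_paths = sorted(Path("images").glob("*.jpg"))  # assumed dataset folder
embeddings = model.encode(
    [Image.open(p) for p in image_paths],
    batch_size=16,
    convert_to_numpy=True,
    show_progress_bar=True,
)
print(embeddings.shape)  # e.g. (52, 512) for the 52 images
```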



