Remove the background from images using AI and Python

Artificial intelligence has taken the world by storm. It has changed almost every domain as more and more professions integrate it into their fields. This article explores the use of artificial intelligence for removing the background from images.

The idea of background removal is not a modern one; it has existed since the early days of computer vision, and numerous techniques have been developed for it. Nevertheless, classical computer vision techniques are steadily being replaced by modern, artificial intelligence-based ones, which give more precise and accurate results.


Classical Computer Vision

Classical computer vision relied on extensive manual effort to construct rule-based techniques that detect and classify a particular group of pixels in an image. The model had to be told explicitly what to detect, what to classify, and what objects are present in the image. The field comprises multiple techniques, such as edge detectors and object detectors. Image filtering falls under the same umbrella, as do convolution, the Hough transform, and smoothing techniques. To learn more about the classical methods of computer vision, visit classical concepts in computer vision.

Returning to the question at hand, how does classical computer vision handle background removal? One of the most famous methods for foreground extraction, otherwise known as background removal, is segmentation-based color thresholding. In this method, the image is divided into multiple smaller segments, and each pixel value is compared with a previously set threshold. Pixels that fall on the background side of the threshold are suppressed by setting their intensity to a single solid value (usually pure white, with a pixel value of 255). You can read more about it in the paper: Color thresholding Method for Segmentation of Natural Images.

Processed image for background removal using a classical approach. Source: https://datacarpentry.org/image-processing/07-thresholding/
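The thresholding idea can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the image is grayscale, the background is brighter than the foreground, and the function name `threshold_background` and the threshold of 200 are hypothetical choices, not values taken from the paper above.

```python
import numpy as np

def threshold_background(gray, threshold=200, background_value=255):
    """Replace pixels at or above `threshold` with a solid background value.

    Assumes a bright background and a darker foreground object.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    mask = gray < threshold  # True where the pixel is kept as foreground
    result = np.where(mask, gray, background_value).astype(np.uint8)
    return result, mask

# A tiny 3x3 "image": a dark object (values 40-60) on a light background.
image = np.array([[230, 225, 240],
                  [228,  50,  60],
                  [235,  40, 222]], dtype=np.uint8)

cleaned, mask = threshold_background(image)
print(cleaned)
```

The weakness of this approach is visible in the assumption itself: as soon as the foreground contains bright pixels, or the background is not uniform, a single threshold can no longer separate the two.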

Another good example of the classical approach is the work we did here on the blog for self-driving cars and the automated lane detection system.


Modern Computer Vision

The rise of modern AI and machine learning has reshaped the field of computer vision. Unlike classical computer vision, which relies on hand-coded mathematical functions and transforms, modern computer vision is primarily based on deep learning. Deep learning is a sub-branch of machine learning concerned with algorithms inspired by the functioning of the human brain. The model at the core of deep learning is the artificial neural network. You can find a lot more about neural networks in our top deep learning algorithms you should know.

Neural network architecture. Source: https://www.extremetech.com/extreme/215170-artificial-neural-networks-are-changing-the-world-what-are-they

Convolutional neural networks have strongly influenced how everyday computer vision tasks are approached. Artificial neural networks require a tremendous amount of input data for training, so much so that for deep learning we can even flip the saying ‘less is more’ and refer to the data as ‘more is less.’

A convolutional neural network takes an image as input and passes it through several convolutional and non-linear operations to output a so-called ‘feature map’ that carries spatial information about the input image. The feature map is then used to predict a class label, which is compared with the known, actual label of the input image using a loss function.

This process is repeated for all the examples in the training data until an optimized model is obtained. The optimized model is then used to predict a label for a new image. There can be two labels (binary, 0 and 1) or more than two. The labels are binary when there are only two classes; for example, when we must determine whether or not the image shows a cat, or whether it shows a cat or a dog. To learn more about the use of deep learning in computer vision, refer to the blog here. For image processing in computer vision and related methods, refer to our first steps with OpenCV for Python.
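To make the forward pass described above more concrete, here is a minimal NumPy sketch of a single convolution followed by a non-linearity and a loss. It is an illustration only: the kernel is hand-picked rather than learned, and the helper names `conv2d`, `relu`, and `binary_cross_entropy` are assumptions introduced for this example. A real network stacks many such layers and learns the kernels from data.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation used in CNN layers."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Non-linearity: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def binary_cross_entropy(probability, label, eps=1e-12):
    """Loss comparing a predicted probability with the true label (0 or 1)."""
    p = np.clip(probability, eps, 1.0 - eps)
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)))

# A 4x4 image with a vertical edge between the second and third columns.
image = np.array([[0, 0, 1, 1]] * 4, dtype=np.float64)
kernel = np.array([[-1.0, 1.0]])  # hand-picked horizontal gradient filter

feature_map = relu(conv2d(image, kernel))  # responds only at the edge
score = feature_map.mean()                 # crude summary of the feature map
probability = 1.0 / (1.0 + np.exp(-score))  # sigmoid turns it into a probability
loss = binary_cross_entropy(probability, label=1)
print(feature_map)
print(loss)
```

During training, the loss is what drives the updates: kernels are adjusted so that the loss shrinks across the training examples.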


Semantic Segmentation

Now, returning to the original problem of the article: how can background removal be achieved using artificial intelligence? The most famous method for this is called semantic image segmentation.

The concept behind semantic image segmentation is that each pixel of an image belongs to a particular class and is given that class’s label. This method is used in many applications across the world, including, but not limited to, handwriting recognition, virtual makeup, virtual try-on, visual image search, and autonomous vehicles.

At a high level, every pixel in the training images carries a class label, and the neural network is trained over numerous examples to learn which kind of pixel belongs to one class, say the foreground, and which belongs to the other class, the background.

After the pixel classifier is trained, the pixels it labels as background are assigned a single solid color (usually white). The foreground pixels are retained as they are; hence the background of the image is said to be removed.
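The last step, turning per-pixel labels into a background-free image, can be sketched as follows. The labels here are written by hand to stand in for the output of a trained classifier, and the function name `remove_background` as well as the convention that class 0 means background are assumptions made for this example.

```python
import numpy as np

def remove_background(image, labels, background_class=0, fill=255):
    """Paint every pixel labelled as background with a solid fill value."""
    image = np.asarray(image, dtype=np.uint8)
    labels = np.asarray(labels)
    result = image.copy()
    # The boolean mask selects pixels; the fill is applied to all channels.
    result[labels == background_class] = fill
    return result

# A 2x2 RGB image and a hand-written label map (1 = foreground, 0 = background).
image = np.array([[[10, 20, 30], [40, 50, 60]],
                  [[70, 80, 90], [15, 25, 35]]], dtype=np.uint8)
labels = np.array([[1, 0],
                   [0, 1]])

cutout = remove_background(image, labels)
print(cutout)
```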

Two widely used architectures for semantic segmentation are ‘UNet’ and ‘DeepLab.’ Both propose techniques that aim to minimize computational cost while maximizing accuracy.

To learn more about how semantic segmentation is used for both images and videos, visit A 2021 guide to Semantic Segmentation.

State of the art example using DL. Source: https://devblogs.microsoft.com/cse/2018/04/18/deep-learning-image-segmentation-for-ecommerce-catalogue-visual-search/

This article will describe a state-of-the-art model known as MODNet, which uses image matting (an advanced form of semantic segmentation) to perform background subtraction.


What is Image Matting?

Image matting is an extension of image segmentation. In image matting, instead of assigning each pixel one of only two labels, foreground (1) or background (0), we label each pixel with a value between 0 and 1 that describes the opacity of the foreground at that pixel. This value is referred to as α_p, the alpha value of pixel p. Because every pixel now takes a continuous value rather than one of two labels, image matting is a far harder task than image segmentation; hence interactive methods such as trimaps and strokes are used to guide the result towards what the user wants.
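The role of the alpha value becomes clearer through the standard compositing relation used in matting, where each observed pixel is modelled as a blend of a foreground and a background color: I_p = α_p · F_p + (1 − α_p) · B_p. The sketch below applies that relation with NumPy; the function name `composite` and the sample colors are assumptions made for illustration.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend foreground over background: I = alpha * F + (1 - alpha) * B."""
    fg = np.asarray(foreground, dtype=np.float64)
    bg = np.asarray(background, dtype=np.float64)
    # Add a channel axis so one alpha value applies to all color channels.
    a = np.asarray(alpha, dtype=np.float64)[..., None]
    return a * fg + (1.0 - a) * bg

# One row of three pixels: red foreground over a white background.
foreground = np.full((1, 3, 3), [255, 0, 0], dtype=np.float64)
background = np.full((1, 3, 3), 255, dtype=np.float64)
alpha = np.array([[1.0, 0.5, 0.0]])  # opaque, half transparent, fully background

blended = composite(foreground, background, alpha)
print(blended)
```

An alpha of 1 reproduces the foreground color, an alpha of 0 reproduces the background, and values in between describe soft edges such as hair, which a binary segmentation mask cannot represent.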

The trimap made a meaningful extraction of the foreground practical. With a trimap, the user manually assigns each pixel of an image one of three labels, i.e. foreground, background, or unknown region. For a given trimap, image matting is then reduced to estimating the foreground colors, the background colors, and the alpha values of the pixels in the unknown regions, based on the known foreground and background pixels.
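A trimap is simply a third image with three possible values per pixel. The text above describes drawing it by hand; the sketch below instead derives one from a rough foreground probability map, which is an assumption made so the example can run without user input. The encoding 0 / 128 / 255 and the cut-off values are likewise illustrative choices, and `make_trimap` is a hypothetical helper.

```python
import numpy as np

# Illustrative encoding of the three trimap regions.
BACKGROUND, UNKNOWN, FOREGROUND = 0, 128, 255

def make_trimap(soft_mask, low=0.1, high=0.9):
    """Label confident pixels as foreground/background, the rest as unknown."""
    soft_mask = np.asarray(soft_mask, dtype=np.float64)
    trimap = np.full(soft_mask.shape, UNKNOWN, dtype=np.uint8)
    trimap[soft_mask >= high] = FOREGROUND
    trimap[soft_mask <= low] = BACKGROUND
    return trimap

# A rough foreground probability for a 2x3 image.
soft_mask = np.array([[0.00, 0.50, 1.00],
                      [0.05, 0.80, 0.95]])

trimap = make_trimap(soft_mask)
print(trimap)
```

A matting algorithm would then only need to estimate alpha for the pixels marked as unknown, which is typically the thin band around the object boundary.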

Unlike the trimap method, the strokes method only requires the user to draw a few foreground and background scribbles in appropriate image regions. A strokes-based algorithm takes these scribbles as input and uses them to estimate the alpha values.

Besides the trimap and strokes methods, there are several other approaches to image matting, including blue screen matting, natural image matting, sampling-based image matting, propagation-based image matting, and learning-based image matting. To learn more about these techniques, visit ‘What’s the role of image matting in image segmentation?’.


MODNet

First, we will take a high-level look at the architecture of MODNet, a model that performs image matting from the input image alone, without requiring a trimap.

Architecture of MODNet

MODNet builds on the observation that a trimap-free segmentation step followed by a trimap-based matting step can achieve better performance. MODNet extends this idea by dividing trimap-free image matting into semantic estimation, detail prediction, and semantic-detail fusion. The semantic estimation outputs a coarse foreground mask, whereas the detail prediction outputs a fine foreground mask. The semantic-detail fusion blends the features from the first two branches.

  1. Semantic Estimation. The first step is to locate the person in the image. The difference from earlier methods is that the high-level features are extracted by an encoder only, i.e. the low-resolution branch. This makes the estimation more efficient, since it is no longer done by a separate model that contains a decoder in addition to the encoder.

  2. Detail Prediction. The region around the foreground portrait is processed by a high-resolution branch, which takes the image, the output of the low-resolution branch, and the low-level features from the low-resolution branch as inputs. Reusing the low-level features reduces the computational overhead of detail prediction.

  3. Semantic-Detail Fusion. The fusion branch of MODNet is a convolutional neural network. To learn about convolutional neural networks, visit a comprehensive guide to convolutional neural networks. The features from the previous two branches are concatenated to predict the final alpha matte.

MODNet presents a simple, fast, and effective architecture that avoids the need for a green screen in real-time portrait matting. By looking only at RGB images, the model can predict the alpha matte under varying scenes. Its main downside is that it struggles with unusual costumes and strong motion blur, because these cases are not covered by the training set. To study the MODNet architecture and its experiments in depth, refer to the paper ‘Is a Green Screen Really Necessary for Real-Time Portrait Matting?’.

The MODNet Architecture. Source: https://arxiv.org/abs/2011.11961


Time to code

As always, you can code it yourself, or you can get access to the full working code on Google Colab here .

Now that we have covered the basics of background removal, from the difference between classical and modern computer vision to the basic architecture of MODNet, we can move forward and use the pre-trained MODNet as a background remover for your images.

Please note that the code was developed for Google Colab, so if you want to run locally you will have to make adaptations in how you store and load files.

Model Preparation

import os

# clone the repository
%cd /content
if not os.path.exists('MODNet'):
  !git clone https://github.com/ZHKKKe/MODNet
%cd MODNet/

# download the pre-trained ckpt for image matting
pretrained_ckpt = 'pretrained/modnet_photographic_portrait_matting.ckpt'
if not os.path.exists(pretrained_ckpt):
  !gdown --id 1mcr7ALciuAsHCpLnrtG_eop5-EYhbCmz \
          -O pretrained/modnet_photographic_portrait_matting.ckpt

The ‘import os’ statement imports the operating system module, part of the standard library. The os module provides functions for creating and removing directories, listing their contents, changing or identifying the current directory, etc.

The ‘%cd’ command, a notebook magic available in Colab, changes into the directory whose path follows it. The if statement checks whether the MODNet directory is already present in the current directory. If not, the MODNet repository is cloned from GitHub.

After cloning the MODNet repository, we define the path of the pre-trained model checkpoint file. Another check looks for whether the checkpoint file is already present at that path. If not, the pre-trained model checkpoint file is downloaded with gdown.

Upload image

import shutil
import os
from google.colab import files

# clean and rebuild the image folders
input_folder = 'demo/image_matting/colab/input'
if os.path.exists(input_folder):
  shutil.rmtree(input_folder)
os.makedirs(input_folder)

output_folder = 'demo/image_matting/colab/output'
if os.path.exists(output_folder):
  shutil.rmtree(output_folder)
os.makedirs(output_folder)

# upload images (PNG or JPG)
image_names = list(files.upload().keys())
for image_name in image_names:
  shutil.move(image_name, os.path.join(input_folder, image_name))

Now we move on to the next segment of the code. We import the necessary modules first. The ‘shutil’ module offers several functions for operations on files and collections of files, such as copying, moving, and removing them. We also import the os module and the files module of Colab. The files module lets us upload files from our local machine to Google Colab. First, we prepare the input and output folders by checking whether they already exist. If they do, the existing folders are removed, and new, empty ones with the same names are created. Next, we use the files module to upload the images from our local machine to the Google Colab runtime. After the upload, we move the uploaded images to the input folder we created before.
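When running outside Colab, the `google.colab` module is not available, so the upload step has to be replaced. One possible adaptation, shown here as a sketch rather than as part of the original notebook, is to copy images from local paths into the input folder. The function name `stage_images` and the list of accepted extensions are assumptions made for this example.

```python
import shutil
import tempfile
from pathlib import Path

def stage_images(image_paths, input_folder):
    """Copy local PNG/JPG images into a freshly emptied input folder."""
    input_folder = Path(input_folder)
    if input_folder.exists():
        shutil.rmtree(input_folder)  # clean, as the Colab cell does
    input_folder.mkdir(parents=True)

    staged = []
    for path in map(Path, image_paths):
        if path.suffix.lower() in {'.png', '.jpg', '.jpeg'}:
            shutil.copy(path, input_folder / path.name)
            staged.append(path.name)
    return staged

# Demonstration with a placeholder file inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'portrait.jpg'
    source.write_bytes(b'placeholder')  # stand-in for a real photo
    notes = Path(tmp) / 'notes.txt'
    notes.write_text('not an image')
    names = stage_images([source, notes], Path(tmp) / 'input')
    print(names)
```

In a local run, `input_folder` would be the same `demo/image_matting/colab/input` path used above, relative to the cloned MODNet directory.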

Inference

!python -m demo.image_matting.colab.inference \
        --input-path demo/image_matting/colab/input \
        --output-path demo/image_matting/colab/output \
        --ckpt-path ./pretrained/modnet_photographic_portrait_matting.ckpt

Run the command above as is to use the pre-trained MODNet model to infer the alpha matte of each input image. The command uses the checkpoint file we downloaded in the Model Preparation step to perform the matting operation on the uploaded images.

Visualization

import numpy as np
from PIL import Image

def combined_display(image, matte):
  # calculate display resolution
  w, h = image.width, image.height
  rw, rh = 800, int(h * 800 / (3 * w))
  
  # obtain predicted foreground
  image = np.asarray(image)
  if len(image.shape) == 2:
    image = image[:, :, None]
  if image.shape[2] == 1:
    image = np.repeat(image, 3, axis=2)
  elif image.shape[2] == 4:
    image = image[:, :, 0:3]
  matte = np.repeat(np.asarray(matte)[:, :, None], 3, axis=2) / 255
  foreground = image * matte + np.full(image.shape, 255) * (1 - matte)
  
  # combine image, foreground, and alpha into one line
  combined = np.concatenate((image, foreground, matte * 255), axis=1)
  combined = Image.fromarray(np.uint8(combined)).resize((rw, rh))
  return combined

# visualize all images
image_names = os.listdir(input_folder)
for image_name in image_names:
  matte_name = image_name.split('.')[0] + '.png'
  image = Image.open(os.path.join(input_folder, image_name))
  matte = Image.open(os.path.join(output_folder, matte_name))
  display(combined_display(image, matte))
  print(image_name, '\n')

Now we move on to the visualization of the new image with its background removed. We will also look at the alpha matte extracted from the input image.

First, we import the necessary modules. NumPy is a library that supports large, multi-dimensional arrays and matrices, along with an extensive collection of high-level mathematical functions to operate on these arrays. To know more about what operations can be performed using NumPy, read the documentation here. The Python Imaging Library (PIL) is a library that adds support for opening, manipulating, and saving many different image file formats.

We start with the function ‘combined_display.’ First, the resolution of the combined image to be displayed is calculated. Next, the input image is converted into a 3-channel (RGB) array, whether it was grayscale or had an extra alpha channel. The predicted foreground, that is, the image with its background removed, is then obtained by blending the image with a white background according to the matte. Finally, the three images (original, foreground, and matte) are concatenated side by side and resized for display. We obtain the output as shown below.

Original, processed and mask image
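The display above blends the foreground onto white. If a transparent background is preferred, the matte can instead be attached to the image as an alpha channel. The following is a minimal NumPy sketch of that idea, not part of the MODNet demo; the helper name `to_rgba` is an assumption, and the matte is expected to be a single-channel array with values from 0 to 255, as produced by the inference step.

```python
import numpy as np

def to_rgba(image, matte):
    """Stack an RGB image and a 0-255 matte into one RGBA array."""
    image = np.asarray(image, dtype=np.uint8)
    matte = np.asarray(matte, dtype=np.uint8)
    if image.ndim == 2:                 # grayscale without channel axis
        image = image[:, :, None]
    if image.shape[2] == 1:             # grayscale: repeat to three channels
        image = np.repeat(image, 3, axis=2)
    image = image[:, :, :3]             # drop any existing alpha channel
    return np.dstack((image, matte))

# A 1x2 RGB image with one opaque and one fully transparent pixel.
image = np.array([[[10, 20, 30], [40, 50, 60]]], dtype=np.uint8)
matte = np.array([[255, 0]], dtype=np.uint8)

rgba = to_rgba(image, matte)
print(rgba.shape)
```

With Pillow available, as it is in the cell above, `Image.fromarray(rgba).save('cutout.png')` would write the result as a PNG with a transparent background; the file name is only an example.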

Save the results

zip_filename = 'matte.zip'
if os.path.exists(zip_filename):
  os.remove(zip_filename)

os.system(f"zip -r -j {zip_filename} {output_folder}/*")
files.download(zip_filename)

Finally, you can pack the predicted alpha mattes into a zip file and download it to your local machine. The zip command is run through the os module, and the download is handled by the files module of Colab.
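The cell above depends on the `zip` command line tool and on Colab's download helper. For a local run, one alternative is Python's built-in `zipfile` module. The sketch below is an assumption about how that could look, with `zip_folder` as a hypothetical helper; it stores files without their directory path, mirroring the `-j` flag used above.

```python
import tempfile
import zipfile
from pathlib import Path

def zip_folder(folder, zip_path):
    """Write every file directly inside `folder` into a flat zip archive."""
    folder = Path(folder)
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.iterdir()):
            if path.is_file():
                # arcname drops the directory part, like `zip -j`.
                archive.write(path, arcname=path.name)
    return Path(zip_path)

# Demonstration with placeholder files inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / 'output'
    output_dir.mkdir()
    (output_dir / 'a.png').write_bytes(b'matte a')
    (output_dir / 'b.png').write_bytes(b'matte b')

    archive_path = zip_folder(output_dir, Path(tmp) / 'matte.zip')
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()
    print(archived_names)
```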


Conclusion

Modern, deep learning-based computer vision has taken the world by storm with its applications. Background removal has many classical techniques, but artificial intelligence has recently proven to provide better and more efficient methods for the task. One such method is semantic segmentation, where a model is trained to look at each pixel of the image and assign it to one of two classes, foreground or background.

Many researchers and developers have since worked to create improved and more efficient models and architectures for the task of background removal, otherwise known as foreground extraction. One example of such an architecture is MODNet, which performs image matting without a trimap. Where trimap-based methods rely on the pixels being divided into three classes, namely foreground, background, and unknown regions, MODNet predicts an alpha value for every pixel from the RGB image alone, based on the opacity of the foreground.

Using the pre-trained MODNet model is straightforward: you download the pre-trained model via the official public GitHub repository and feed it the images you want the background removed from. It outputs an alpha matte for each image, which is then used to remove the background.

Thanks for reading!