Image segmentation is a computer vision task that involves dividing an image into meaningful and semantically coherent regions or segments. The goal is to partition an image into regions that share similar visual properties, such as color, texture, or intensity, while being distinct from surrounding areas. Image segmentation is a crucial step in many computer vision applications, including object recognition, scene understanding, and medical image analysis.
To build the U-Net for image segmentation, I have used the MobileNetV2 model for encoding (or down-sampling) and the Pix2Pix model for decoding (or up-sampling).
The U-Net architecture is a convolutional neural network (CNN) designed for semantic segmentation tasks, where the goal is to partition an image into distinct regions and assign each pixel to a specific class. U-Net was introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015 and has since become a popular choice for biomedical image segmentation and other applications.
The U-Net architecture is characterized by a U-shaped structure consisting of three key components: a contracting path (encoder) that progressively down-samples the input to capture context, a bottleneck at the lowest resolution, and an expansive path (decoder) that up-samples the features back to the input resolution, with skip connections carrying fine-grained detail from the encoder to the decoder.
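To make the structure concrete, here is a minimal, generic U-Net sketch in Keras (not the exact model used in this project); the layer sizes and depth are illustrative assumptions only:

```python
import tensorflow as tf

def tiny_unet(input_shape=(128, 128, 3), num_classes=3):
    inputs = tf.keras.Input(shape=input_shape)

    # Contracting path (encoder): convolutions followed by down-sampling.
    c1 = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    p1 = tf.keras.layers.MaxPooling2D()(c1)          # 128 -> 64
    c2 = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = tf.keras.layers.MaxPooling2D()(c2)          # 64 -> 32

    # Bottleneck: lowest spatial resolution, widest channel count.
    b = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

    # Expansive path (decoder): up-sample and concatenate the matching
    # encoder feature map (skip connection) at each level.
    u2 = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same")(b)
    u2 = tf.keras.layers.Concatenate()([u2, c2])
    u2 = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(u2)
    u1 = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same")(u2)
    u1 = tf.keras.layers.Concatenate()([u1, c1])
    u1 = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(u1)

    # Per-pixel class logits at the original resolution.
    outputs = tf.keras.layers.Conv2D(num_classes, 1)(u1)
    return tf.keras.Model(inputs, outputs)
```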
MobileNetV2 is a lightweight convolutional neural network (CNN) architecture designed specifically for mobile and embedded vision applications. Developed by Google researchers as an improvement over the original MobileNet, it strikes a good balance between model size and accuracy, making it well suited to resource-constrained devices.
The architecture of MobileNetV2 is built from depthwise separable convolutions organized into inverted residual blocks with linear bottlenecks. These components work together to reduce the number of parameters and computations required while preserving the model's ability to capture complex features.
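As a rough illustration of that building block, the sketch below follows the inverted residual design from the MobileNetV2 paper (1x1 expansion, 3x3 depthwise convolution, 1x1 linear projection); the expansion factor of 6 is the paper's default, and this is not the library implementation:

```python
import tensorflow as tf

def inverted_residual(x, out_channels, stride=1, expansion=6):
    in_channels = x.shape[-1]

    # 1x1 "expand" convolution: widen the representation before filtering.
    h = tf.keras.layers.Conv2D(expansion * in_channels, 1, use_bias=False)(x)
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU(6.0)(h)

    # 3x3 depthwise convolution: cheap spatial filtering, one filter per channel.
    h = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU(6.0)(h)

    # 1x1 "project" convolution: the linear bottleneck, no activation.
    h = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(h)
    h = tf.keras.layers.BatchNormalization()(h)

    # Residual connection only when stride and channel count allow it.
    if stride == 1 and in_channels == out_channels:
        h = tf.keras.layers.Add()([x, h])
    return h
```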
Pix2Pix, short for "Image-to-Image Translation with Conditional Adversarial Networks," is a generative adversarial network (GAN) architecture designed for image-to-image translation tasks. It was introduced by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros in 2017. Pix2Pix is particularly well-known for its ability to transform images from one domain to another, often producing realistic and high-quality results.
Pix2Pix employs a U-Net-like architecture for its generator, where the decoder (or up-sampler) is responsible for reconstructing the output image from the features extracted by the encoder. The decoder is crucial in restoring the spatial resolution of the image after it has been down-sampled by the encoder; its key building block is an up-sampling unit that doubles the spatial resolution at each step, sketched below.
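The following sketch shows roughly what such an up-sampling block does, and approximately what the pix2pix.upsample helper in tensorflow_examples builds: a strided transposed convolution followed by normalization and ReLU. The initializer and the optional dropout are assumptions based on the common Pix2Pix setup:

```python
import tensorflow as tf

def upsample_block(filters, size, apply_dropout=False):
    initializer = tf.random_normal_initializer(0.0, 0.02)
    block = tf.keras.Sequential()
    # Transposed convolution with stride 2 doubles the spatial resolution.
    block.add(tf.keras.layers.Conv2DTranspose(
        filters, size, strides=2, padding="same",
        kernel_initializer=initializer, use_bias=False))
    block.add(tf.keras.layers.BatchNormalization())
    if apply_dropout:
        block.add(tf.keras.layers.Dropout(0.5))
    block.add(tf.keras.layers.ReLU())
    return block
```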
The dataset used here is the "oxford_iiit_pet" dataset. The Oxford-IIIT Pet dataset is a 37-category pet image dataset with roughly 200 images per class. The images have large variations in scale, pose, and lighting. All images have an associated ground-truth annotation of breed, head ROI, and pixel-level trimap segmentation; the trimap masks serve as the segmentation targets here.
The train split contains 3,680 examples and the test split contains 3,669 examples.
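The dataset can be loaded and preprocessed with TensorFlow Datasets along these lines; the 128x128 input size and the "mask - 1" shift (so the trimap labels become 0, 1, 2) are assumptions typical for this setup:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

dataset, info = tfds.load("oxford_iiit_pet:3.*.*", with_info=True)

def load_example(datapoint):
    image = tf.image.resize(datapoint["image"], (128, 128))
    mask = tf.image.resize(datapoint["segmentation_mask"], (128, 128))
    image = tf.cast(image, tf.float32) / 255.0   # normalize pixels to [0, 1]
    mask -= 1                                    # trimap labels {1,2,3} -> {0,1,2}
    return image, mask

train_ds = dataset["train"].map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
test_ds = dataset["test"].map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
```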
To install tensorflow_examples and tensorflow-datasets, run:
pip install -q git+https://github.com/tensorflow/examples.git
pip install tensorflow-datasets
For the encoder part of the U-Net I am using the block_1_expand_relu, block_3_expand_relu, block_6_expand_relu, block_13_expand_relu, and block_16_project layers of the MobileNetV2 model.
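A sketch of how the encoder (down-sampling stack) can be built from those pretrained MobileNetV2 layers; the 128x128 input size is an assumption carried over from the preprocessing step, and the feature-map sizes in the comments correspond to that input size:

```python
import tensorflow as tf

base_model = tf.keras.applications.MobileNetV2(
    input_shape=[128, 128, 3], include_top=False)

# Activations at these layers are reused as skip connections by the decoder.
layer_names = [
    "block_1_expand_relu",    # 64x64
    "block_3_expand_relu",    # 32x32
    "block_6_expand_relu",    # 16x16
    "block_13_expand_relu",   # 8x8
    "block_16_project",       # 4x4
]
base_model_outputs = [base_model.get_layer(name).output for name in layer_names]

# Feature-extraction model; the pretrained weights are kept frozen here.
down_stack = tf.keras.Model(inputs=base_model.input, outputs=base_model_outputs)
down_stack.trainable = False
```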
For the decoder part of the U-Net I am using the pix2pix.upsample(512, 3), pix2pix.upsample(256, 3), pix2pix.upsample(128, 3), and pix2pix.upsample(64, 3) layers of the Pix2Pix model.
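Building on the down_stack sketch above, the decoder stack and the full U-Net can be assembled roughly as follows; the 3 output channels correspond to the trimap classes, and the final transposed convolution layout is an assumption that follows the standard TensorFlow segmentation tutorial:

```python
import tensorflow as tf
from tensorflow_examples.models.pix2pix import pix2pix

up_stack = [
    pix2pix.upsample(512, 3),   # 4x4   -> 8x8
    pix2pix.upsample(256, 3),   # 8x8   -> 16x16
    pix2pix.upsample(128, 3),   # 16x16 -> 32x32
    pix2pix.upsample(64, 3),    # 32x32 -> 64x64
]

def unet_model(output_channels=3):
    inputs = tf.keras.Input(shape=[128, 128, 3])

    # Encoder: collect the skip connections from MobileNetV2.
    skips = down_stack(inputs)
    x = skips[-1]
    skips = reversed(skips[:-1])

    # Decoder: up-sample and concatenate the matching skip connection.
    for up, skip in zip(up_stack, skips):
        x = up(x)
        x = tf.keras.layers.Concatenate()([x, skip])

    # Final up-sampling back to the input resolution: 64x64 -> 128x128.
    outputs = tf.keras.layers.Conv2DTranspose(
        output_channels, 3, strides=2, padding="same")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

model = unet_model(output_channels=3)
```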
Model Summary:
The model achieved 93.77% training accuracy and 88.96% evaluation accuracy.
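A sketch of a training setup that could produce accuracy numbers of this kind, building on the train_ds, test_ds, and model sketches above; the optimizer, loss, batch size, and epoch count are assumptions, not the exact settings behind the reported figures:

```python
import tensorflow as tf

BATCH_SIZE = 64
train_batches = (train_ds.cache().shuffle(1000)
                 .batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE))
test_batches = test_ds.batch(BATCH_SIZE)

# Per-pixel classification: sparse labels against raw logits.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

history = model.fit(train_batches, epochs=20, validation_data=test_batches)
loss, acc = model.evaluate(test_batches)
```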