arivis Scientific Image Analysis
Sreenivas Bhattiprolu Oct 12, 2023 2:00:43 PM 15 min read

What to expect with small training datasets for instance segmentation?

Deep neural networks require large amounts of training data to tune millions of parameters and develop a learned model for future use. While it is possible to crowdsource image annotation for natural scenes, it is difficult to find domain experts to annotate scientific images. The research topic is often unique, requiring researchers to annotate their own images. As a result, a researcher can end up with a handful of annotated images containing only tens of labeled objects.

In this blog, I’d like to share my experience working with a small dataset for ‘instance segmentation’ using the arivis Cloud ML toolkit. The arivis Cloud team modified the standard U-Net with an EfficientNet encoder to make it versatile for many applications. In addition, it uses Focal Loss to address class imbalance and to emphasize hard-to-segment classes over easy ones.
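To make the idea behind Focal Loss more concrete, here is a minimal NumPy sketch of the commonly used binary form. It is only an illustration: the function name and the default values for gamma and alpha are my assumptions, and this is not the arivis Cloud implementation.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss (illustrative sketch; defaults are assumptions)."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    # p_t is the predicted probability of the true class for each pixel
    p_t = np.where(y_true == 1, p, 1.0 - p)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # (1 - p_t) ** gamma shrinks the contribution of well-classified pixels
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

labels = np.array([1, 1, 0, 0])
easy_predictions = np.array([0.95, 0.90, 0.05, 0.10])  # confident and correct
hard_predictions = np.array([0.30, 0.40, 0.70, 0.60])  # mostly wrong

print(binary_focal_loss(labels, easy_predictions))
print(binary_focal_loss(labels, hard_predictions))
```

With gamma set to 0 the modulating factor disappears and the expression reduces to a weighted cross-entropy, which is why larger gamma values shift the focus toward hard pixels.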

Dataset description:

For this exercise, I downloaded the electron microscopy dataset with annotated mitochondria from the EPFL public repository [1]. The dataset represents a 5x5x5 µm section taken from the CA1 hippocampus region of the brain. Each voxel measures approximately 5x5x5 nm, and the data are provided as multipage TIF files.


Figure: original image (left) and annotated image with labeled mitochondria (right)

The dataset contains a total of 330 images and corresponding masks (annotated images). I picked 12 pairs of images and masks, each 1024x768 pixels, together containing about 170 labeled mitochondria. arivis Cloud ML takes care of dividing images and masks into smaller patches for efficient training, so I supplied the original large images and masks as inputs to the arivis Cloud ML trainer for instance segmentation. The training workflow includes a hyperparameter tuning step that finds optimal values for the learning rate and batch size. It also includes an image augmentation step that slightly transforms the input images during each training epoch. In effect, this is like adding more training data to increase the accuracy and robustness of the model.
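For readers curious what patching and augmentation look like in practice, here is a rough NumPy sketch. The 256-pixel patch size, the choice of random flips, and the function names are assumptions for illustration only; the actual patch size and augmentations used by arivis Cloud ML are not described here.

```python
import numpy as np

def tile_image(image, patch_size=256):
    """Split a 2D image into non-overlapping square patches (assumed size)."""
    height, width = image.shape[:2]
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches

def random_flip(image, mask, rng):
    """Apply the same random horizontal/vertical flip to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:
        image, mask = np.flipud(image), np.flipud(mask)
    return image, mask

rng = np.random.default_rng(0)
image = rng.random((768, 1024))             # stand-in for one 1024x768 EM image
mask = np.zeros((768, 1024), dtype=np.uint8)
mask[100:200, 300:450] = 1                  # stand-in for a labeled region

image_patches = tile_image(image)
mask_patches = tile_image(mask)
print(len(image_patches))                   # 12 patches of 256x256 per image

aug_image, aug_mask = random_flip(image_patches[0], mask_patches[0], rng)
```

The important detail is that the image and its mask always receive the identical transformation, otherwise the labels no longer line up with the pixels.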


Understanding the training plots:

Upon successful training of the model, three training plots are reported: loss, accuracy, and intersection over union (IoU), each plotted as a function of the number of epochs. Each plot shows curves for both the training data (in blue) and the validation data (in orange). Let me define these terms and interpret the plots generated in this exercise.


Epoch: An epoch is one complete pass of the model through all the training data. An epoch can be very short for small datasets and very long for large datasets, sometimes longer than an hour. The total training time for my data was fixed at 12 hours. From the plots, it is evident that the training ran for about 4000 epochs. Therefore, the time per epoch works out to about 11 seconds (12 hours / 4000 epochs), a very short time for deep learning training. Also, the bumpy nature of the training plots indicates that there is not enough training data for a reliable calculation of the metrics.
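The time-per-epoch estimate is simple arithmetic, spelled out here with the numbers quoted above (the epoch count is read off the plot, so it is approximate):

```python
total_training_hours = 12
approx_epochs = 4000  # read off the training plot, so only approximate

seconds_per_epoch = total_training_hours * 3600 / approx_epochs
print(f"{seconds_per_epoch:.1f} s per epoch")  # prints "10.8 s per epoch"
```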

Loss: Loss is the penalty for a bad prediction. It is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of a deep learning algorithm is to minimize loss. In the loss curves (left), training loss trended downward while validation loss stayed constant or even trended slightly upward. The upward trend in this case implies that the amount of validation data is not enough to reliably calculate the loss metric. Please note that about 20% of the total data is set aside for validation; in this exercise, that is only 2 images.

Accuracy: Accuracy is a metric used during training to evaluate the model. It represents the fraction of predictions the model got right. In the accuracy plot (middle), training accuracy went up to 99.7% while validation accuracy saturated at 99.2%. This looks like excellent accuracy, especially considering the small training dataset. Unfortunately, it is easy to achieve high accuracy on images with large background regions, because the accuracy metric is heavily skewed by the background in comparison to the objects of interest. IoU is a better metric for post-training evaluation of the segmentation model.

Intersection over Union (IoU): IoU is the most popular metric for evaluating the quality of semantic or instance segmentation. It quantifies the overlap between two areas as the area of their intersection divided by the area of their union. In my example, it compares the area of the predicted mitochondria against their area in the corresponding labeled image (ground truth). IoU in the plot (right) was calculated after each epoch. It shows both training and validation values trending upward, reaching 85% and 82%, respectively.
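A small synthetic example shows why accuracy can look excellent while IoU tells a different story. The masks below are made up for illustration (a single object covering 1% of the image, with a prediction shifted by half its width), and the function names are mine.

```python
import numpy as np

def pixel_accuracy(y_true, y_pred):
    """Fraction of pixels where prediction and ground truth agree."""
    return float(np.mean(y_true == y_pred))

def iou(y_true, y_pred):
    """Intersection over union of two boolean masks."""
    intersection = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return float(intersection / union) if union else 1.0

truth = np.zeros((100, 100), dtype=bool)
truth[40:50, 40:50] = True       # 100 object pixels, 1% of the image

prediction = np.zeros_like(truth)
prediction[40:50, 45:55] = True  # same size, shifted right by 5 pixels

print(f"accuracy = {pixel_accuracy(truth, prediction):.2f}")  # accuracy = 0.99
print(f"IoU      = {iou(truth, prediction):.2f}")             # IoU      = 0.33
```

Only half of the object is found, yet accuracy still reads 99% because the background dominates the pixel count.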

In summary, the model appears to perform very well on training and validation data. Therefore, I applied the model to 165 images from the original dataset to evaluate its effectiveness in performing instance segmentation. To properly interpret and compare the results, I repeated the exercise using all 165 images and corresponding masks for training, which is about 14 times more data than the 12 training images used in my original exercise.

Segmentation Results:

The following images represent slices 17 and 40 from the test image stack and show the original image, the ground truth label, the segmentation result using 12 training images (in blue), and the segmentation result using 165 training images (in pink).

Figure: slice 17 (left) and slice 40 (right), shown in four rows: original images, ground truth labels, results with 12 training images, and results with 165 training images


As expected, the results using 165 training images are far superior to the results using only 12 training images. In fact, the results from the large training dataset are almost identical to the ground truth, which demonstrates the efficacy of the algorithm at instance segmentation. For the most part, the results from the smaller training dataset are acceptable, except in areas where the model fails to separate closely spaced objects. These objects can be separated during post-processing with operations such as watershed.
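As an illustration of watershed post-processing, here is a minimal sketch using SciPy and scikit-image on a synthetic mask of two touching discs. It is one common recipe (distance transform, thresholded markers, then watershed), not necessarily what arivis Cloud or any particular pipeline does, and the 0.8 threshold is an assumption that would need tuning on real data.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

# Synthetic binary mask: two overlapping discs that form one connected blob
rows, cols = np.indices((80, 80))
disc_a = (rows - 40) ** 2 + (cols - 28) ** 2 < 16 ** 2
disc_b = (rows - 40) ** 2 + (cols - 52) ** 2 < 16 ** 2
merged = disc_a | disc_b

_, objects_before = ndi.label(merged)

# The distance to the background peaks near the center of each disc
distance = ndi.distance_transform_edt(merged)

# Markers: one seed region per peak (0.8 is an assumed, data-dependent threshold)
markers, _ = ndi.label(distance > 0.8 * distance.max(), structure=np.ones((3, 3)))

# Flood from the seeds over the inverted distance map, restricted to the mask
separated = watershed(-distance, markers, mask=merged)
objects_after = len(np.unique(separated)) - 1  # subtract the background label 0

print(objects_before, objects_after)
```

On this toy mask the single merged blob is split along the narrow neck between the two discs. Real mitochondria are less regular, so the marker step usually needs more care.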


- Using smaller datasets with image augmentation for training provides acceptable results, especially for objects that are well separated.

- For arivis Cloud ML (and U-Net in general), it is recommended to work with enough images to cover at least 150 objects of interest. In deep learning, more data often yields better results.

Further steps to enhance the results:

- Use additional augmentation methods to improve the accuracy. For example, creating elastically transformed images helps in representing mitochondria under various scenarios. This augmentation will be released on arivis Cloud early April 2021.

- Start with a pre-trained model instead of a poor initial state with random weights. Transfer learning has been proven to provide superior results in less training time. This ability is also planned for arivis Cloud.
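To give a feel for the elastic transformation mentioned in the first item, here is a minimal sketch using SciPy. The function name and the alpha (displacement strength) and sigma (smoothness) values are assumptions chosen for illustration, not the settings of the arivis Cloud augmentation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_transform(image, mask, alpha=30.0, sigma=4.0, seed=0):
    """Warp an image and its mask with the same smooth random displacement field."""
    rng = np.random.default_rng(seed)
    shape = image.shape
    # Random per-pixel offsets, smoothed so that neighboring pixels move together
    dy = gaussian_filter(rng.uniform(-1, 1, size=shape), sigma) * alpha
    dx = gaussian_filter(rng.uniform(-1, 1, size=shape), sigma) * alpha
    rows, cols = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    coords = np.array([rows + dy, cols + dx])
    # Linear interpolation for the image, nearest neighbor to keep mask labels intact
    warped_image = map_coordinates(image, coords, order=1, mode="reflect")
    warped_mask = map_coordinates(mask, coords, order=0, mode="reflect")
    return warped_image, warped_mask

rng = np.random.default_rng(1)
image = rng.random((64, 64))
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:40, 25:45] = 1

warped_image, warped_mask = elastic_transform(image, mask)
print(warped_image.shape, np.unique(warped_mask))
```

Using nearest-neighbor interpolation for the mask avoids creating fractional label values at object boundaries.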




Sreenivas Bhattiprolu

Dr. Sreenivas Bhattiprolu's team at ZEISS focuses on solving tough microscopy challenges by leveraging the latest advancements in digital technology and artificial intelligence. Dr. Bhattiprolu has over 25 years of experience in microscopy. He received his Doctorate in Materials Sciences and Engineering from Michigan Technological University and earned his Master’s degree in Physics from the University of Hyderabad.