Tutorial 1: Bacterial Segmentation and Curve Fitting

© 2020 Vahe Galstyan. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license. Credit to Muir Morrison for code snippets borrowed from his tutorials.

In this tutorial, we will walk through the steps of segmenting out bacterial colonies from fluorescence microscopy images.

We start off by importing the necessary modules for our analysis. numpy and matplotlib.pylob are the standard computational and plotting packages, respectively, which you are already familiar with from the earlier tutorials. We will be additionally using the glob package for reading all the filenames of movie frames, and different modules of the powerful image analysis package skimage (scikit-image).

In [4]:
# Import the Numpy module
import numpy as np

# For plotting
import matplotlib.pyplot as plt
%matplotlib inline

# For reading filenames
import glob

# For dealing with images
import skimage.io
import skimage.filters
import skimage.morphology

The data for the E. coli growth movie are provided as a collection of .tif files, whose paths share a common format. Note that the specification of the path format may be different depending on where the data is located. Using the glob.glob function, we extact the full list of frame filenames and sort them alphabetically to match their temporal acquisition sequence.

In [5]:
# Format of image frame names
path_format = '../data/ecoli_growth/*_000.tif'

# Full list of image frame names
paths = glob.glob(path_format)

# Sort the list
paths = sorted(paths)

# First few entries
paths[0:4]
Out[5]:
['../data/ecoli_growth/img_000000000_TRITC_000.tif',
 '../data/ecoli_growth/img_000000001_TRITC_000.tif',
 '../data/ecoli_growth/img_000000002_TRITC_000.tif',
 '../data/ecoli_growth/img_000000003_TRITC_000.tif']

To read the set of images, we will use the skimage.io.ImageCollection function and pass the frame paths as an input. Looking at the first frame, we can see that it contains no cells and therefore, we remove this frame from the image collection variable.

In [8]:
# Read the collection of frames into a single object
im_set = skimage.io.ImageCollection(paths)

# Look at the first frame
plt.imshow(im_set[0])
Out[8]:
<matplotlib.image.AxesImage at 0x1c1e1317b8>

To check to see that the first frame of the pruned image collection indeed contains cells.

In [9]:
# Remove the first frame
im_set_pruned = im_set[1:]

# Show the first frame in the pruned set
plt.imshow(im_set_pruned[0])
Out[9]:
<matplotlib.image.AxesImage at 0x1c1ee72f28>

To get a closer look at the colony, we can 'zoom in' by slicing only a small region of the image that contains the cells. Cells appear much more pronounced in this zoomed image, and the reason for that is that the background that was cut out contained a 'hot pixel' with an intensity value much higher than that of the bacterial pixels.

In [10]:
# Save the first frame in a separate variable
im = im_set_pruned[0]

# Zoomed image
im_zoom = im[400:600, 500:850]

# Show the zoomed image
plt.imshow(im_zoom)
Out[10]:
<matplotlib.image.AxesImage at 0x1c1f6eedd8>

Higher intensity value is what distinguishes bacterial pixels from background. To find the right threshold, we plot the histogram of pixel intensities. As we can see, there is a large population of pixels with intensity values less than 210, which would correspond to background, and a much smaller population of pixels with higher intensity values, which correspond to cells.

In [24]:
# Histogram of pixel intensities
_ = plt.hist(im.flatten(), 100)

# Limits on the X-axis
plt.xlim([180, 230])

# Axis labels
plt.xlabel('pixel intensity')
plt.ylabel('frequency')
plt.show()

Choosing 210 as a threshold value, we convert the intensity image into a binary image. Pixels with intensities greater than 210 would be given a value of 1 in the binary image, while the rest will be given a value of 0.

In [11]:
# Threshold the image
im_thresh = im > 210

# Show the thresholded image
plt.imshow(im_thresh)
Out[11]:
<matplotlib.image.AxesImage at 0x114182828>

As we can see, the colony is identified fairly well. However, there are standalone pixels in the backgroud with value 1 and also, some of the bacterial pixels are assigned a value of 0. To remove these artifacts, we perform a median filter on the original fluorescence image which assigns to each pixel the median of the surrounding 9 pixel values (3x3 square grid by default), thereby substantially removing the local noise.

In [15]:
# Apply a median filter and threshold
im_median = skimage.filters.median(im)
im_thresh = im_median > 210

plt.imshow(im_thresh)
Out[15]:
<matplotlib.image.AxesImage at 0x1c20106d68>

The median-filtered image yields a much better colony segmentation. We can now calculate the total number of bacterial pixels, which serves as a good proxy for the amount of cells (with a proportionality constant, which is irrelevant for the growth rate calculation).

In [16]:
area_tot = np.sum(im_thresh)
print('Number of bacterial pixels: ' + str(area_tot))
Number of bacterial pixels: 10117

Applying the same steps for a different movie frame, we get a very innacturate segmentation. This means that, unfortunately, our strategy of manually picking a threshold value for one frame and reapplying it for different frames is not going to work.

In [17]:
# Look at a different frame (call im2 so that the previous im is not overwritten)
im2 = im_set_pruned[22]
im2_median = skimage.filters.median(im2)

plt.imshow(im2_median > 210)
Out[17]:
<matplotlib.image.AxesImage at 0x1c1fc84048>

Otsu's thresholding method

To overcome the challenge of manually identifying the appropriate threshold value for each frame, we use the established Otsu's thresholding method which automatically outputs the optimal threshold value for the given fluorescence image. The main idea behind the workings of the method is two divide the pixels into two populations in such a way that minimizes the variance of intensity values within each population.

In [18]:
thresh = skimage.filters.threshold_otsu(im2_median)
print(thresh)
206
In [19]:
im2_thresh = im2_median > thresh

# Show the fluorescence and thresholded images side by side
fig, ax = plt.subplots(1, 2, figsize = (8,4))
ax[0].imshow(im2_median)
ax[1].imshow(im2_thresh)
Out[19]:
<matplotlib.image.AxesImage at 0x1c206a1a58>

This automated thresholding method does an excellent job of segmenting out the bacterial colony.

We can still see some bright pixels in the segmented image though, even after performing a median filter. In this last step, we remove these pixels by imposing a threshold on the minimum number of pixels for any connected component. This nicely clears the small objects and leaves out only the bacterial colony.

In [20]:
# Remove the small objects
im2_thresh = skimage.morphology.remove_small_objects(im2_thresh,min_size=200)

plt.imshow(im2_thresh)
Out[20]:
<matplotlib.image.AxesImage at 0x1c1ebf8278>
In [21]:
print('Number of bacterial pixels in the second image: ' + str(im2_thresh.sum()))
Number of bacterial pixels in the second image: 373486

It is now your task to repeat this or a different procedure of your choice for all the movie frames in order to find the bacterial colony size as a function of time.

Poor man's fitting procedure

Having calculated the bacterial area for the different time points, you will next need to use that information to infer the bacterial growth rate. Here we outline a simple procedure for fitting data to a theoretical curve, and you are welcome to use an alternative procedure to complete this task.

The exponentially growing size of a bacterial colony can be described through the equation $N(t)/N_0 = e^{k t}$, where $N_0$ is the initial size, $N(t)$ is the size at time $t$, and $k$ is the growth rate. We therefore use noisely data generated from an exponential growth curve for our demonstration. Our goal will be to infer the parameter $k = 0.8$ used in the data generation.

In [36]:
x_ls = np.linspace(0, 4, 100)
y_ls = np.exp(0.8*x_ls) + np.random.normal(size=100)
plt.scatter(x_ls, y_ls)
Out[36]:
<matplotlib.collections.PathCollection at 0x1c2297be10>

We start by guessing a range of values for $k$ and plotting the corresponding theoretical curves on top of the data. We can see that data falls in between the upper and lower theoretical curves, which means that the optimal $k$ value is in the range that we specified.

In [37]:
k_ls = np.linspace(0.4, 1.0, 10)

plt.scatter(x_ls, y_ls)

for i in range(len(k_ls)):
    k = k_ls[i]
    y_theory_ls = np.exp(k*x_ls)

    plt.plot(x_ls, y_theory_ls, '--', linewidth = 2, color = 'y', alpha = 0.5)

plt.ylim([-5, 30])
Out[37]:
(-5, 30)

To obtain the optimal value for $k$, we scan over the specified range with a higher resolution (100 points), and calculate the sum of the deviation distances between the theoretical curves and the data. This sum should be the lowest for the optimal $k$ value.

In [38]:
k_ls = np.linspace(0.4, 1.0, 100)
diff_ls = []

for i in range(len(k_ls)):
    
    k = k_ls[i]
    y_theory_ls = np.exp(k*x_ls)
    
    diffs = np.abs(y_theory_ls - y_ls)
    
    diff_ls.append(np.sum(diffs))
    
plt.plot(k_ls, diff_ls)

plt.xlabel('k')
plt.ylabel('Sum of deviation distances')
plt.show()

We can visually see that $k \approx 0.79$ corresponds to the lowest sum. To check that this is indeed the optimal value, let us plot the theoretical curve on top of the data.

In [41]:
k_optimal = 0.79

y_theory_ls = np.exp(k_optimal * x_ls)

plt.scatter(x_ls, y_ls, label = 'data')
plt.plot(x_ls, y_theory_ls, color = 'y', linewidth = 4, label = 'fit')
plt.legend()

plt.xlabel('x')
plt.ylabel('y')
Out[41]:
Text(0, 0.5, 'y')

The achieved fit passes the visual test and therefore, we can report the value $k \approx 0.79$ as the optimal fit. In fact, since the data is fairly noisy, keeping only the first significant digit would be equally fine, i.e. $k \approx 0.8$.