Automatic Panoramas

Image registration based on automatic feature matching

About

This project implements a pipeline for the automatic stitching of images into panoramic compositions via reprojection of the images onto a common plane. In order to compute a homography to reproject the images, common features between pairs of images are detected and matched automatically. The project is an application of image registration and is based on the work by Brown, Szeliski and Winder [1].

Figure 1. Living Room dataset (3 images: left, center, right) spanning approximately a 120° FOV.
Figure 2. Computed panorama from the Living Room dataset (ANMS).
Figure 3. Diagram of the pipeline.

Corner Detection + Non-Maximal Suppression

The first step in the pipeline is detecting Harris corners in both (grayscale) images. To keep the computational cost of the subsequent matching stage low [1], the number of corners is then reduced by a suppression scheme. Additionally, in this project the corner detection and generation of feature descriptors are carried out in parallel (one thread per image) to further reduce computation time.
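
A minimal sketch of the per-image parallelism, using scikit-image for the Harris response (the file names and the 300-corner budget are assumed placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

from skimage.color import rgb2gray
from skimage.feature import corner_harris, peak_local_max
from skimage.io import imread

def detect(gray):
    # Harris response, then suppression down to a fixed corner budget
    # (see the next paragraph for the two suppression schemes compared).
    return peak_local_max(corner_harris(gray), min_distance=1, num_peaks=300)

# Hypothetical input file names; one worker thread per image.
grays = [rgb2gray(imread(f)) for f in ("left.jpg", "center.jpg")]
with ThreadPoolExecutor(max_workers=len(grays)) as pool:
    corners_left, corners_center = pool.map(detect, grays)
```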

The suppression scheme presented in [1] is Adaptive Non-Maximal Suppression (ANMS), which preserves a good spatial distribution among the remaining corners, a desirable property. In practice, however, I found that simply keeping the N strongest corners at least a pixel apart (i.e., local maxima over 3x3 pixel regions) sometimes produced better corners, even if it resulted in a spatially biased distribution.
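
A sketch of both schemes, continuing from the snippet above (the corner budget and the robustness factor $c_{robust} = 0.9$ from [1] are the assumed parameters):

```python
import numpy as np
from skimage.feature import corner_harris, peak_local_max

def anms(coords, response, n_keep=300, c_robust=0.9):
    """ANMS after [1]: keep the n_keep corners whose suppression radius
    (distance to the nearest sufficiently stronger corner) is largest.
    A straightforward O(n^2) sketch."""
    strengths = response[coords[:, 0], coords[:, 1]]
    radii = np.full(len(coords), np.inf)
    for i in range(len(coords)):
        stronger = c_robust * strengths > strengths[i]
        if stronger.any():
            # Squared distances suffice: only the ranking matters.
            radii[i] = np.min(np.sum((coords[stronger] - coords[i]) ** 2, axis=1))
    return coords[np.argsort(-radii)[:n_keep]]

response = corner_harris(grays[0])                    # from the sketch above
coords = peak_local_max(response, min_distance=1)     # all 3x3 local maxima
kept_anms = anms(coords, response)
# Alternative that sometimes worked better: the N strongest maxima only.
kept_strongest = peak_local_max(response, min_distance=1, num_peaks=300)
```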

Figure 4. 666 corners initially detected.
Figure 5. 300 corners remaining after ANMS (left) vs. 244 with scikit-image's peak_local_max (right).
Figure 6. Computed panorama from the Balcony dataset (11 images) using ANMS.
Figure 7. Computed panorama from the Balcony dataset (11 images) using peak_local_max. Notice the better convergence on the top-right.
Figure 8. Computed panorama from the Living Room dataset using peak_local_max. Notice the worse convergence around the sofa and coffee table compared to Figure 2.

Feature Descriptors

From the remaining corners, 9x9 feature descriptors centered at each corner are extracted. Instead of following the exact approach in [1], I subsample the 45x45 patch around each corner with a stride of 5, yielding a 9x9 patch, which is then normalized.
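
A sketch of this extraction; the reflect padding at image borders and the zero-mean, unit-deviation normalization are assumptions (the text above only states that descriptors are normalized):

```python
import numpy as np

def describe(gray, coords, patch=45, stride=5):
    """Subsample the 45x45 window around each corner with stride 5, giving
    a 9x9 descriptor, then normalize it (assumed: zero mean, unit std)."""
    half = patch // 2
    padded = np.pad(gray, half, mode="reflect")      # handle border corners
    descriptors = []
    for r, c in coords:
        window = padded[r:r + patch, c:c + patch]    # centered on (r, c)
        sub = window[2::stride, 2::stride]           # 9x9, center row/col kept
        descriptors.append((sub - sub.mean()) / (sub.std() + 1e-8))
    return np.stack(descriptors).reshape(len(coords), -1)
```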

Figure 9. Patch around a corner (left), 45x45 descriptor (center) and normalized 9x9 descriptor (right).

Feature Matching and Homography Computation

Descriptors from the two images are matched in a two-step process. First, candidate matches are selected with a ratio test: a feature is matched only if its nearest neighbor (under the chosen distance metric) is significantly closer than its second-nearest neighbor. The resulting candidates are then evaluated with 4-point Random Sample Consensus (RANSAC) in order to further filter out outliers.
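
A sketch of the first step, assuming Euclidean distance and an illustrative ratio threshold of 0.8:

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_ratio(desc_a, desc_b, ratio=0.8):
    """Keep (i, j) pairs whose nearest-neighbor distance is clearly smaller
    than the second-nearest; metric and threshold are assumed choices."""
    dists = cdist(desc_a, desc_b)            # pairwise Euclidean distances
    order = np.argsort(dists, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc_a))
    keep = dists[rows, best] < ratio * dists[rows, second]
    return [(i, best[i]) for i in np.where(keep)[0]]
```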

RANSAC repeatedly samples random sets of 4 matched pairs (4 feature centers), computes the homography each set induces, and evaluates how well that homography maps the remaining points from source to target image. Once a suitable 4-point set is found, another homography is computed from those 4 points plus all points that mapped correctly (according to a pixel threshold) under the first one; this larger set of correspondences translates to a more robust homography. The process is repeated a fixed number of times, and the homography from the best candidate set is used to finally warp the images.
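
A sketch of this loop; fit_homography stands for the least-squares solver sketched in the next section, and the iteration count and pixel threshold are assumed values:

```python
import numpy as np

def ransac_homography(src_pts, dst_pts, n_iters=1000, thresh=3.0):
    """4-point RANSAC over matched (x, y) arrays of equal length."""
    best_inliers = None
    for _ in range(n_iters):
        idx = np.random.choice(len(src_pts), 4, replace=False)
        H = fit_homography(src_pts[idx], dst_pts[idx])
        # Map every source point and measure its reprojection error.
        ones = np.ones((len(src_pts), 1))
        mapped = np.hstack([src_pts, ones]) @ H.T
        mapped = mapped[:, :2] / mapped[:, 2:3]
        inliers = np.linalg.norm(mapped - dst_pts, axis=1) < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Re-fit on the winning sample's inliers for a more robust estimate.
    return fit_homography(src_pts[best_inliers], dst_pts[best_inliers])
```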

Figure 10. A first matching step returns 24 candidates for the Living Room dataset. Some of them, however, are outliers.
Figure 11. After further outlier rejection with RANSAC, 9 consistent candidates remain. These are used to compute the final homography.

The homography $H$ maps source pixel locations to their corresponding target locations. That is, if $p_s = (x_s, y_s, 1)^T$ is a source pixel location in homogeneous coordinates, the homography is such that $H p_s = (w x_t, w y_t, w)^T$, where $w \neq 0$ and $(x_t, y_t)$ is the target in cartesian coordinates.

A homography satisfying the above equation can be recovered by formulating the search as a least squares approximation (i.e., $\min_h \lVert A h - b \rVert^2$), where $h$ represents the unknown parameters in $H$.

The constraints to this system of equations are derived from requiring each component of a target point to match the corresponding component of its source point transformed by the homography. Writing the entries of $H$ as $h_{ij}$ and fixing $h_{33} = 1$, at least 4 matching feature pairs are needed (each pair provides 2 constraints; there are 8 unknowns / DOF in $H$), but the more the better:

  • x: $h_{11} x_s + h_{12} y_s + h_{13} = w x_t$

  • y: $h_{21} x_s + h_{22} y_s + h_{23} = w y_t$

  • scale: $h_{31} x_s + h_{32} y_s + h_{33} = w$
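
Substituting the scale row for $w$ in the x and y rows gives the two linear constraints per pair. As a minimal numpy sketch of the resulting least-squares solve:

```python
import numpy as np

def fit_homography(src, dst):
    """Least-squares homography with h33 fixed to 1 (8 unknowns).
    src, dst: (N, 2) arrays of matched (x, y) locations, N >= 4."""
    A, b = [], []
    for (xs, ys), (xt, yt) in zip(src, dst):
        # x-row and y-row after substituting w = h31*xs + h32*ys + 1.
        A.append([xs, ys, 1, 0, 0, 0, -xs * xt, -ys * xt]); b.append(xt)
        A.append([0, 0, 0, xs, ys, 1, -xs * yt, -ys * yt]); b.append(yt)
    h, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float),
                            rcond=None)
    return np.append(h, 1.0).reshape(3, 3)
```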

Image Warping

Once a good homography has been found, the source image's 4 corners are forward warped onto the target space in order to compute the extent of the final composite image, and two empty images of this new size are created.
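
A sketch of the extent computation (image sizes passed as (width, height); rounding the bounds to integers is an assumption):

```python
import numpy as np

def composite_extent(H, src_size, dst_size):
    """Forward-warp the source's 4 corners through H and take the bounding
    box of those corners together with the unwarped target image."""
    sw, sh = src_size
    dw, dh = dst_size
    corners = np.array([[0, 0, 1], [sw, 0, 1], [sw, sh, 1], [0, sh, 1]], float)
    warped = corners @ H.T
    xs, ys = warped[:, 0] / warped[:, 2], warped[:, 1] / warped[:, 2]
    x_min, y_min = np.floor([min(xs.min(), 0), min(ys.min(), 0)]).astype(int)
    x_max, y_max = np.ceil([max(xs.max(), dw), max(ys.max(), dh)]).astype(int)
    return (x_min, y_min), (x_max, y_max)   # offset and far corner
```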

The target image is copied into one of these images without warping (accounting only for its offset within the new image size). The source image is instead inverse warped in order to interpolate pixel values. First, all coordinates of the target space are warped onto source space (via the inverse of $H$). Then, for all target coordinates which mapped to within the range of the original source image (i.e., between (0, 0) and (w, h) of the unmodified source), pixels in the other empty image are assigned values bilinearly interpolated from the source image.
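
A grayscale sketch of the inverse warp (color images would be handled per channel; the offset into the composite canvas is assumed folded into H for brevity):

```python
import numpy as np

def inverse_warp(src, H, out_shape):
    """Inverse-warp src onto an out_shape canvas: each canvas pixel is
    mapped back through inv(H) and bilinearly sampled when it lands
    inside the source bounds."""
    h_out, w_out = out_shape
    ys, xs = np.mgrid[0:h_out, 0:w_out]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    sx, sy, sw = np.linalg.inv(H) @ pts
    sx, sy = sx / sw, sy / sw
    h_src, w_src = src.shape
    valid = (sx >= 0) & (sx < w_src - 1) & (sy >= 0) & (sy < h_src - 1)
    x0, y0 = np.floor(sx[valid]).astype(int), np.floor(sy[valid]).astype(int)
    fx, fy = sx[valid] - x0, sy[valid] - y0
    # Bilinear blend of the four surrounding source pixels.
    vals = (src[y0, x0] * (1 - fx) * (1 - fy) + src[y0, x0 + 1] * fx * (1 - fy)
            + src[y0 + 1, x0] * (1 - fx) * fy + src[y0 + 1, x0 + 1] * fx * fy)
    out = np.zeros(out_shape)
    out.ravel()[valid] = vals
    return out
```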

Figure 12. The left image from the Living Room dataset is warped onto the space of the center image in the dataset.

Compositing

Compositing is done by adding the two final images, each multiplied by its own alpha mask. The process is similar to linear interpolation between two images, except that each mask has a nonlinear gradient applied only over the area where both images intersect. The full process can then be repeated with another image pair consisting of the composited image and a new image.
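
A sketch of this blend for two same-size grayscale images; the nonzero-coverage test, the left-to-right ramp direction, and the smoothstep easing are assumptions standing in for the project's actual masks:

```python
import numpy as np

def composite(img_a, img_b):
    """Add two warped images, each weighted by its own alpha mask; a
    nonlinear ramp fades img_a out (and img_b in) across the overlap."""
    cov_a, cov_b = img_a > 0, img_b > 0            # crude coverage masks
    overlap = cov_a & cov_b
    alpha_a = cov_a.astype(float)
    cols = np.where(overlap.any(axis=0))[0]        # overlap column range
    if cols.size:
        t = np.clip((np.arange(img_a.shape[1]) - cols[0])
                    / max(cols[-1] - cols[0], 1), 0.0, 1.0)
        ramp = 1.0 - (3 * t**2 - 2 * t**3)         # smoothstep fade-out
        alpha_a = np.where(overlap, ramp[None, :], alpha_a)
    alpha_b = np.where(cov_b, 1.0 - alpha_a, 0.0)
    return img_a * alpha_a + img_b * alpha_b
```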

Figure 13. Left and center images from the Living Room dataset after applying alpha masks.
Figure 14. Final composition for left and center images in the Living Room dataset.