Transformations on images open up a world of possibilities. This project explores the creation of panoramic views by seamlessly combining multiple photographs. The process involves aligning the images and stitching them together while minimizing visual artifacts to achieve a smooth and cohesive final result.
A homography is a projective transformation that aligns points between two images of the same planar surface. It can handle perspective distortions, unlike translations or affine transformations that preserve parallelism. One can define the homography matrix as
where and are corresponding points, and is a scaling factor. Homographies are utilized to align images by transforming one image to match the perspective of another. To compute , we need at least 4 point correspondences. For each correspondence, we derive two linear equations, leading to the following system of equations
For more than 4 points, we solve the overdetermined system using least squares to minimize the error and estimate the homography matrix. Later we will be transforming images with their calculated homographies to align them with a reference image, rectify them and build mosaic images.
Homography can be applied to rectify images by transforming perspective views into frontal views, allowing angled objects to appear as if they are facing the camera directly. This is achieved by selecting four corners of a polygon representing the object in the image and defining corresponding points on a canvas that form a rectangle. By computing the homography matrix from these point pairs, we can warp the image, effectively converting the polygon into a rectangle and correcting its perspective. This technique is particularly useful in applications such as document scanning and object recognition, where accurate geometric representation is essential.
Now that we know how homography transformations work we can try do more cool stuff with them like building mosaic images. We begin by choosing two images to stitch together and determining a one-to-one mapping of key points between the two base and new image.
Once the key points are identified, we compute the homography matrix that will align the new image with the reference points in the base image. The resulting matrix will allow the transformation needed to overlay the new image accurately. Adding a new image to the base could potentially affect the size of the output. To accommodate this, one should compute the coordinates of the warped boundaries of the new image and check the minimum and maximum extents. Based on these calculations, expand the canvas size if necessary and translate the images to create space for the new addition. After defining the canvas size and applying the appropiate translation to the images, we warp the new image using the homography computed earlier.
To minimize visual artifacts, particularly at the edges of the images, one could blend two images using a two-band Laplacian approach. The low-frequency band is blended by combining pixel values through a convex combination. The blending coefficients are determined based on the relative distance transform of the images.
For the high-frequency band, the pixel with the higher distance transform is selected from the overlapping images to preserve sharp details. This ensures smooth transitions for low-frequency components and clarity in the high-frequency components.
To build larger mosaics including more than two images one could iteratively follow the described procedure. Note that then the base image would contain previously stitched images, providing a base to extend with a new image. If the base image consists of previously stitched images, the coordinates of the points might have changed due to earlier transformations. To find their new true coordinates, apply the transformations that were used applied to reference image.
The results for stitching look very nice, however, producing images can be very tedious due to the manual selection of the correspondences. Now, we attempt to automate the correspondences detection.
We first we identify key interest points within an image, particularly focusing on corners with the Harris Interest Point Detector. Corners serve as ideal points for matching due to their unique, localized structure, which contrasts distinctly with the surrounding area. The Harris Interest Point Detector identifies corners by evaluating changes in intensity around each pixel, effectively using the local gradient structure. It considers the second moment matrix over a small window of the image:
where and are the intensity gradients in the and directions. Then, we define the response of the detector as
where and , with and as the eigenvalues of . Then, if both and are large we identify it as a corner, if only is large then it is an edge, otherwise just a flat region. By setting a threshold for , we select points with strong corner responses. This approach provides reliable interest points ideal for image alignment and stitching.
Corners are very unique from their local surroundings so one would expect them to be nicely distributed throughout the image, producing a meaningful variety of correspondances. However the corners detected by Harris algorithm can be very clustered together. That's why we refine initial corner detections using Adaptive Non-Maximal Suppression (ANMS), which improves spatial distribution by selecting only the strongest and most isolated corners. This prevents clustering of corners in high-response regions, leading to a more balanced set of points for robust matching. For each corner with response , we calculate a minimum radius , defined as the smallest Euclidean distance to any stronger corner where . Then
where is the set of corners with stronger responses than . ANMS then selects corners based on either a specified radius threshold, enforcing a minimum spacing between corners, or a target number of corners by selecting those with the largest values.
After detecting corners for the two images we are attempting to define correspondances for, one must extract the features of these corners to compare them with those of another image. We extract axis-aligned features by defining patches centered around each corner with a specified radius. Note that before extracting features we apply a Gaussian blur to smooth the image and reduce noise, enhancing key structural details and avoiding aliasing. The results below are the non-normalized features of the image we found previously found corners for.
To find correspondences, we compare the features extracted from two images using a nearest-neighbor approach. Lowe's ratio test is applied to assess the quality of matches, where for each feature in one image, we identify its nearest and second-nearest neighbors in the other image. We retain a match only if the ratio of their distances
for some ratio threshold , ensuring that a true match is significantly closer than any potential false match. This method reduces the likelihood of incorrect correspondences by leveraging the intuition that a reliable match should exhibit a substantial distance disparity from the next closest feature, thereby enhancing the robustness of the matching process. Note that before matching features we normalize them to reduce the influence of varying feature magnitudes, allowing for a more accurate comparison based on the structure of the features.
After so many layers of filtering, there might be some outliers which skew the homography. A powerful solution ot outliers while computing the homogrpahy is to employ RANSAC (Random Sample Consensus), a powerful algorithm that iteratively estimates a homography matrix between matched points while identifying and discarding outliers. In each iteration, RANSAC randomly selects a minimal subset of correspondences, four points for an homography, to compute a candidate transformation. The algorithm then evaluates all correspondences against this transformation, counting inliers based on a defined error threshold. By iterating this process, RANSAC effectively converges on the most accurate homography that best fits the inliers, making it resilient to outliers and enhancing the overall robustness of the matching process.
Then we apply our stitching procedure to the images, only now with the correspondances we found automatically instead of the manual mapping. Notice that the difference is negligible, yet we have made the very tedious task of labeling images a breeze.