We propose a framework for aligning and fusing multiple images into a single coordinate-based neural representation. Our framework targets burst images that are misaligned due to camera ego motion and small changes in the scene. We describe different strategies for alignment depending on the assumed scene motion, namely, perspective planar (i.e., homography), optical flow with minimal scene change, and optical flow with notable occlusion and disocclusion. Our framework effectively combines the multiple inputs into a single neural implicit function without the need to select one of the images as a reference frame. We demonstrate how to use this multi-frame fusion framework for various layer separation tasks.
Figure 1. Overview of the neural image representations for multi-image fusion. Assuming that f(x, y) learns a canonical view that summarizes all input images, the rendering of each image is formulated as a projection of the canonical view onto the view of that image. Depending on the assumption about the scene, we use a different parameterization of motion: (a) homography-based neural representations, (b) occlusion-free flow-based neural representations, and (c) occlusion-aware flow-based neural representations. Unlike conventional multi-image fusion, which works on discrete 2D grids, our method fuses multiple images in a continuous image space. In addition, our method does not rely on a reference image manually selected from the input images.
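The homography-based case (a) can be illustrated with a minimal sketch. It assumes that each frame has its own 3x3 homography mapping frame pixel coordinates into the canonical view, and that the canonical image f(x, y) is sampled at the warped coordinate. The names `apply_homography`, `canonical_f`, `render_pixel`, and `reconstruction_loss` are hypothetical, and `canonical_f` is only an analytic stand-in for the learned network, not the authors' model.

```python
import math


def apply_homography(H, x, y):
    """Map (x, y) through a 3x3 homography H given as nested row-major lists."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w  # perspective divide


def canonical_f(x, y):
    """Stand-in for the canonical view f(x, y); a real model would be an MLP."""
    return 0.5 + 0.5 * math.sin(x) * math.cos(y)


def render_pixel(H_i, x, y, f=canonical_f):
    """Render pixel (x, y) of frame i by sampling f at the warped coordinate."""
    cx, cy = apply_homography(H_i, x, y)
    return f(cx, cy)


def reconstruction_loss(frames, homographies, f=canonical_f):
    """Mean squared error between observed frames and their renderings.

    frames: list of 2D lists of grayscale values, indexed as frame[y][x].
    homographies: one 3x3 matrix per frame (assumed frame-to-canonical).
    """
    total, count = 0.0, 0
    for frame, H in zip(frames, homographies):
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                total += (render_pixel(H, x, y, f) - value) ** 2
                count += 1
    return total / count
```

In this sketch no frame is treated as the reference: every frame is explained through its own warp into the shared canonical view, so minimizing the loss over both f and the homographies would align and fuse the frames jointly.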
Figure 2. Visualization of learned canonical view. We capture 9 consecutive images (left), and fit a homography-based neural representation to them. As can be seen, our method automatically stitches all the images in the canonical view (right) learned in the neural representation.
Figure 3. Overview of our two-stream neural representations for multi-image layer separation. The goal of our method is to separate the underlying scene and the interference, which move differently across the images, into two layers, each stored in its own neural representation. To this end, we simultaneously train two neural image representations. f1 is parameterized by our homography-based or flow-based neural representations so that it learns the underlying scene, which moves according to the explicit motion model. In contrast, the interference layer, which is difficult to model with the motion model, is stored in f2. The generic form of image formation is a linear combination of both networks, but it varies according to the task. We also use a few regularization terms for optimization, which are described in detail in the paper.
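The two-stream image formation can be sketched as follows. This is an illustrative reading of the caption, not the paper's exact formulation: it assumes that f1 is queried at motion-warped canonical coordinates, that f2 is queried directly at the frame coordinates and frame index, and that the combination is either additive or an alpha blend. The names `compose_layers`, `render_two_stream`, `warp`, and `alpha` are hypothetical.

```python
def compose_layers(scene, interference, alpha=None):
    """Linearly combine the two layer values at one pixel.

    alpha=None gives an additive combination (scene + interference);
    otherwise alpha in [0, 1] blends the layers as
    (1 - alpha) * scene + alpha * interference.
    """
    if alpha is None:
        return scene + interference
    return (1.0 - alpha) * scene + alpha * interference


def render_two_stream(f1, f2, warp, x, y, t, alpha=None):
    """Render pixel (x, y) of frame t from two layer functions.

    f1(cx, cy): scene layer, sampled in the canonical view.
    f2(x, y, t): interference layer, sampled per frame (an assumption here).
    warp(x, y, t): motion model mapping frame coordinates to canonical ones.
    """
    cx, cy = warp(x, y, t)
    scene = f1(cx, cy)
    interference = f2(x, y, t)
    return compose_layers(scene, interference, alpha)
```

Because only f1 is tied to the explicit motion model, content that follows that motion can be explained by the scene layer, while the remainder is left to f2; in practice the regularization terms mentioned in the caption are what keep the two layers from trading content arbitrarily.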
| Input | Li et al., 2013 | Double DIP | Liu et al., 2020 | Ours |
Figure 4. Comparison of reflection removal methods on real images. We used the baseline results reproduced by this, where video results are not available.
| Input | Liu et al., 2020 | Ours |
Figure 5. Comparison of fence removal methods on real images. We used the baseline results reproduced by this, where video results are not available. Note that the gray pixels in the fence layer of Liu et al. indicate empty regions, which correspond to the black pixels in our result.
Figure 6. Comparison of rain removal methods on real images. All results are visualized in videos.
Figure 7. Comparison of moiré removal methods on real images. All results are in videos.