Theory and practice of self-supervised frame-to-frame video restoration
ABSTRACT
Deep-learning techniques are now the state of the art in image and video restoration. However, obtaining sufficiently large and realistic datasets of paired degraded/clean data for supervised training can be challenging in many application scenarios. A series of methods have recently been proposed to train restoration networks using only degraded data, i.e. without requiring ground-truth images. These methods take inspiration from the noise2noise denoising method of Lehtinen et al. They rely heavily on the temporal redundancy of uncorrupted signals and use a neighboring frame from the degraded sequence as the target in the loss. They can be considered self-supervised in the sense that the supervision signal comes from the same degraded sequence that is being restored. In this presentation we will review some of these methods for diverse applications, ranging from demosaicking to multi-image super-resolution, and present them in a common framework. This framework can be seen as a generalization of noise2noise and accounts for a linear degradation operator and for motion between the output image and the target. We study in detail the case of the mean squared error loss, where a closed-form expression can be found for the optimal estimator which, under certain conditions, coincides with the estimator obtained by supervised training.
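
The equivalence with supervised training can be illustrated in the simplest special case, plain noise2noise with an identity degradation operator and no motion. The sketch below is not taken from the presentation: the names `fit_scalar_gain` and `compare_gains`, the scalar linear estimator, and the Gaussian signal and noise models are all assumptions chosen for illustration. It fits a one-parameter estimator by least squares, once against the clean signal and once against a second, independently corrupted copy standing in for a neighboring frame.

```python
import random


def fit_scalar_gain(inputs, targets):
    """Return the gain w minimizing the mean squared error of w * inputs vs targets."""
    numerator = sum(a * b for a, b in zip(inputs, targets))
    denominator = sum(a * a for a in inputs)
    return numerator / denominator


def compare_gains(num_samples=100_000, noise_std=0.5, seed=0):
    """Fit the same estimator with a clean target and with a noisy target.

    Assumptions (illustrative only): unit-variance Gaussian clean signal,
    zero-mean Gaussian noise that is independent between the two frames,
    identity degradation operator, and no motion between frames.
    """
    rng = random.Random(seed)
    clean = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    # Two degraded observations of the same clean content, like two
    # perfectly aligned neighboring frames of a static scene.
    frame_a = [x + rng.gauss(0.0, noise_std) for x in clean]
    frame_b = [x + rng.gauss(0.0, noise_std) for x in clean]

    w_supervised = fit_scalar_gain(frame_a, clean)          # target: ground truth
    w_self_supervised = fit_scalar_gain(frame_a, frame_b)   # target: other noisy frame
    return w_supervised, w_self_supervised


if __name__ == "__main__":
    w_sup, w_self = compare_gains()
    print(f"supervised gain:      {w_sup:.4f}")
    print(f"self-supervised gain: {w_self:.4f}")
```

Under these assumptions both gains should approach the same value, 1 / (1 + noise_std^2), as the number of samples grows, because the zero-mean noise in the target frame is uncorrelated with the input and averages out of the loss. Handling a non-identity degradation operator or motion between frames, as the framework in the presentation does, is beyond this sketch.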