IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022)
Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events. In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting.
To incorporate this prior knowledge, we first train an audio-visual network that learns the correspondence between auditory and visual information. The audio-visual network is then employed as a guider that conveys this prior knowledge of audio-visual correspondence to the video inpainting network.
This prior knowledge is transferred through two novel losses we propose: an audio-visual attention loss and an audio-visual pseudo-class consistency loss. Both losses improve video inpainting by encouraging the inpainted result to correspond closely to its synchronized audio.
Experimental results demonstrate that our proposed method can restore a wider range of video scenes and is particularly effective when the sounding object in the scene is partially occluded.
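The abstract does not spell out the formulas for the two guidance losses, so the following is only a minimal PyTorch sketch of how losses of this shape could be attached to a frozen audio-visual network. The `av_net` interface (returning an attention map and pseudo-class logits), the distance functions, and the weights `w_attn` / `w_cls` are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch (not the authors' code): combining an audio-visual
# attention term and a pseudo-class consistency term against a frozen
# audio-visual network that is assumed to return (attention_map, class_logits).
import torch
import torch.nn.functional as F


def audio_visual_guidance_loss(av_net, inpainted, ground_truth, audio,
                               w_attn=1.0, w_cls=1.0):
    """Guidance loss for the video inpainting network (illustrative only)."""
    with torch.no_grad():
        # Targets come from the ground-truth frames and the synchronized audio;
        # the audio-visual network is frozen and only supplies supervision.
        attn_gt, logits_gt = av_net(ground_truth, audio)

    attn_pred, logits_pred = av_net(inpainted, audio)

    # Attention term: the inpainted frames should attend to the sounding
    # region in the same way the ground-truth frames do.
    loss_attn = F.l1_loss(attn_pred, attn_gt)

    # Pseudo-class consistency term: the inpainted frames should keep the
    # same pseudo-class distribution as the ground truth.
    loss_cls = F.kl_div(F.log_softmax(logits_pred, dim=-1),
                        F.softmax(logits_gt, dim=-1),
                        reduction="batchmean")

    return w_attn * loss_attn + w_cls * loss_cls
```

In this sketch the audio-visual network only provides supervision targets, mirroring its role in the paper as a guider for the inpainting network.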
<aside> PDF file
</aside>
<aside> IEEE Xplore
</aside>
https://www.youtube.com/watch?v=wlvEb5ImN3M