We've developed a simple and general framework for image-to-image translation called Palette.
We've evaluated Palette on four challenging computer vision tasks, namely colorization, inpainting, uncropping, and JPEG restoration.
Palette is able outperform strong task-specific GANs without any task-specific customization or hyper-parameter tuning.
Many computer vision problems can be formulated as image-to-image translation. Examples include restoration tasks like super-resolution, colorization, and inpainting . The difficulty in these problems arises because for a single input image, we can have multiple plausible output images e.g. for colorization, given a black-and-white image, there can be several possible colorized versions of it.
A natural approach to image-to-image translation is to learn the conditional distribution of output images given the input, using deep generative models. Generative Adversarial Networks (GANs)
have emerged as the model family of choice for many image-to-image translation tasks
,
as they are capable of generating high fidelity outputs, are broadly applicable, and support efficient sampling. Nevertheless, GANs can be challenging to train , and often drop modes in the output distribution . Autoregressive Models , VAEs , and Normalizing Flows have also seen success in specific applications, but arguably, have not established the same level of sample quality and generality as GANs.
Diffusion and score-based models have received a surge of recent interest , resulting in several key advances in modeling continuous data. These models learn to reverse a sequential corruption process that iteratively adds small amounts of gaussian noise to data eventually converting it to random noise. Data generation in these models involves a sequence of refinement steps that gradually denoise random noise to produce samples from data distribution. While diffusion models have achieved great recent success in conditional generation tasks such as speech synthesis , class-conditional ImageNet generation , image super-resolution and many more, they have not been applied to a broader family of tasks, and it is not clear whether they can rival GANs in offering a versatile and general solution to the problem of image-to-image translation.
We introduce Palette, a simple and general framework for image-to-image translation using conditional diffusion models. We apply Palette on four challenging and diverse image-to-image translation tasks - image colorization, inpainting, uncropping and JPEG artifact removal . Palette achieves state of the art results in colorization while beating many strong prior task-specific GAN and Regression based methods on other tasks. Importantly, this is achieved without any task-specific customizations demonstrating a desirable degree of generality and flexibility. Furthermore, we show that a single generalist Palette model trained on multiple tasks simultaneously, performs as well or better than Palette models trained on individual tasks.