In this exploration of diffusion models, we'll start by experimenting with DeepFloyd, a pre-trained text-to-image model, to build intuition about how these models work. Then, we'll dive into the mathematics and implementation details of diffusion models, working our way up from basic denoising to building our own MNIST digit generator. Along the way, we'll explore various techniques like classifier-free guidance, image-to-image translation, and visual anagrams that showcase the versatility and power of diffusion models.
Before training a diffusion model, we'll build up intuition by experimenting with a pre-trained one. For our experiments we chose DeepFloyd, an open-source text-to-image model with a high degree of language understanding: it takes a string input and outputs an image that aligns with the prompt. The model consists of a frozen text encoder and three cascaded pixel diffusion modules, the first, second, and third stages. Each stage takes as input the output of the previous one and produces a higher-resolution version of the image, from 64x64 to 256x256 to 1024x1024. Below we show results for the first and second stages.
The results look incredible, and the images match the prompts. The second stage model is very coherent with the first, virtually producing an upscaled version of the original image produced by the first stage, although it can introduce artifacts or content not in the lower resolution images. Note that we can improve the quality of the image with a higher number of inference steps, particularly for the second stage model.
Diffusion works by gradually denoising an image, starting from pure noise, to hopefully produce an image distinguishable from noise and with little to no artifacts. Therefore, one could easily guess that a key part of diffusion is to take a clean image and add noise to it. This is the forward process, defined as

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
As an exercise, we implement the forward process: given a clean image $x_0$ and some timestep $t$, we produce a noisy image $x_t$ using the time-dependent parameter $\bar\alpha_t$. Excuse the low-resolution images produced by the first stage model.
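The forward process can be sketched in a few lines of NumPy. The linear noise schedule below is a toy one for illustration only, not DeepFloyd's actual schedule:

```python
import numpy as np

def forward_process(x0, t, alphas_cumprod, rng=np.random.default_rng(0)):
    """Noise a clean image x0 to timestep t:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps

# Toy linear schedule (illustrative values, not the model's real schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = np.zeros((64, 64))              # stand-in for a clean 64x64 image
x_t, eps = forward_process(x0, 750, alphas_cumprod)
```

Larger $t$ means smaller $\bar\alpha_t$, so the noise term dominates and the image drifts toward pure Gaussian noise.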
One could naively attempt to denoise the noisy images with classical methods like a Gaussian blur, hoping that the noise would be smoothed away. However, the results show such techniques to be far from effective.
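For reference, the classical baseline can be sketched as a separable Gaussian blur. This is a minimal NumPy version for illustration, not a production filter:

```python
import numpy as np

def gaussian_blur(img, sigma=2.0, radius=4):
    """Separable Gaussian blur: a classical (and ineffective) denoiser."""
    xs = np.arange(-radius, radius + 1)
    k = np.exp(-xs**2 / (2 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, radius, mode="reflect")
    # Convolve each row, then each column, with the 1D kernel.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)
```

The blur averages noise away, but it averages the signal away just as readily, which is why the results look muddy rather than clean.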
One could instead train a denoiser on a large dataset of paired clean and noisy images. In diffusion models, such a denoiser is trained to recover the Gaussian noise added to an image, from which one can compute an estimate of the original image. We show the outputs of DeepFloyd's first-stage denoiser, a UNet conditioned on the amount of Gaussian noise via the timestep $t$ it takes as input.
As expected, the denoiser struggles to recover high quality results from very noisy images in a single step - and reasonably so, as the task becomes exponentially harder with increased noise levels. This limitation motivated the development of iterative denoising, where instead of attempting to recover the clean image in one ambitious leap, we gradually denoise the image through multiple steps. The process starts from pure noise $x_T$ and progressively applies the denoising UNet to obtain less noisy versions $x_t$. While one could naively run the denoiser for all $T$ steps, this would be computationally expensive. Instead, we can leverage connections to differential equations to skip steps, dramatically speeding up the sampling process while maintaining quality. When moving from a noisier timestep $t$ to a less noisy timestep $t'$, we update the image according to

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\,x_t + v_\sigma$$

where $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$ and $\beta_t = 1 - \alpha_t$.
This formula, first introduced in DDPM, allows us to use a stride to apply our denoiser only at specific timesteps, making the process both efficient and effective. The equation combines our current estimate of the clean image $\hat{x}_0$, the noisy image $x_t$, and a noise term $v_\sigma$, weighted by coefficients that depend on the noise schedule parameters $\alpha_t$ and $\beta_t$.
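One strided update can be sketched as follows, with `alphas_cumprod` standing in for the cumulative noise schedule (the names here are ours, not DeepFloyd's API):

```python
import numpy as np

def ddpm_stride_step(x_t, x0_hat, t, t_prime, alphas_cumprod, v_sigma=0.0):
    """One strided DDPM update from a noisier timestep t to a cleaner t' < t:
      x_{t'} = sqrt(abar_{t'}) * beta / (1 - abar_t) * x0_hat
             + sqrt(alpha) * (1 - abar_{t'}) / (1 - abar_t) * x_t + v_sigma
    with alpha = abar_t / abar_{t'} and beta = 1 - alpha."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha = abar_t / abar_tp
    beta = 1.0 - alpha
    coef_x0 = np.sqrt(abar_tp) * beta / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha) * (1.0 - abar_tp) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + v_sigma
```

At each step, $\hat{x}_0$ comes from the UNet's noise estimate; the update blends that estimate with the current noisy image before moving on to the next strided timestep.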
Previously we iteratively denoised an image by starting from non-pure noise, i.e. a noisy image with some structure, partway through the schedule. What if we instead start from pure random noise at the very beginning of the schedule? Then our model generates images from scratch!
While our previous sampling approach produces images that are distinguishable from noise, they often lack clarity and detail - sometimes barely recognizable as their intended subjects. Classifier-Free Guidance (CFG) addresses this limitation and significantly improves image quality by combining conditional and unconditional denoising predictions. At each denoising step, we compute both a noise estimate $\epsilon_c$ conditioned on our text prompt and an unconditional estimate $\epsilon_u$ using an empty prompt. These are combined using

$$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$$

where $\gamma$ is the guidance scale - larger values produce higher-quality but less diverse images. While a full theoretical understanding remains elusive, CFG has become standard practice in modern diffusion models. The technique essentially exaggerates the difference between conditional and unconditional predictions, pushing generated images to more strongly align with the given prompt.
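The combination step itself is a one-liner; the default `gamma` below is only an illustrative value:

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma=7.5):
    """Classifier-free guidance: eps = eps_u + gamma * (eps_c - eps_u).
    gamma = 1 recovers the conditional estimate; gamma > 1 exaggerates it."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

Note that $\gamma = 0$ gives purely unconditional denoising and $\gamma = 1$ gives plain conditional denoising; the interesting regime is $\gamma > 1$.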
Rather than starting from pure noise, we can add varying amounts of noise to an existing image before denoising it. The more noise we add, the more creative freedom the model has to modify the image during denoising. It's like giving an artist a partially erased drawing - the less visible the original, the more room for reinterpretation. This technique, called SDEdit, allows for controlled edits while preserving the image's core structure.
We could also denoise only specific regions using a binary mask $m$. At each timestep, we update only the pixels where $m = 1$, while forcing the pixels where $m = 0$ to match the original image with the appropriate amount of noise for that timestep:

$$x_t \leftarrow m\, x_t + (1 - m)\,\text{forward}(x_{\text{orig}}, t)$$
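The masking step can be sketched as below, where the original image is re-noised to the current timestep with the same forward process used earlier (function names are ours):

```python
import numpy as np

def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod,
                 rng=np.random.default_rng(0)):
    """Inpainting constraint: keep denoised pixels where mask == 1,
    and force mask == 0 pixels to the original image noised to timestep t:
    x_t <- m * x_t + (1 - m) * forward(x_orig, t)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x_orig.shape)
    x_orig_t = np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps
    return mask * x_t + (1.0 - mask) * x_orig_t
```

Applying this after every denoising update keeps the unmasked region pinned to the source image while the masked region is free to change.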
More interestingly, the model was trained with text conditioning. Therefore, we can run the same image-to-image translation procedure, but guide the denoising with a text prompt rather than an empty one.
One can get even more creative and run two instances of the diffusion model with two different prompts to produce visual anagrams: images that reveal different content when viewed upside down. We do this by averaging the noise estimates from the two prompts during each denoising step, one for each orientation.
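The averaging step can be sketched as below, with `unet(x, t, prompt)` standing in for any prompt-conditioned noise estimator (a hypothetical interface, not DeepFloyd's actual API):

```python
import numpy as np

def anagram_noise(x_t, t, unet, prompt_up, prompt_down):
    """Visual-anagram noise estimate: denoise the image under one prompt,
    denoise its vertical flip under the other, unflip that estimate,
    and average the two."""
    eps_up = unet(x_t, t, prompt_up)
    eps_down = np.flipud(unet(np.flipud(x_t), t, prompt_down))
    return 0.5 * (eps_up + eps_down)
```

Because the second estimate is computed in the flipped frame and flipped back, both orientations are simultaneously pushed toward their respective prompts.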
Alternatively, we could produce hybrid images which combine content visible at different viewing distances by merging low and high frequencies from two sources. To do so, at each denoising step, we compute noise estimates for two different prompts, then create a composite by combining low frequencies from one with high frequencies from the other using Gaussian filters.
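The frequency split can be sketched with a small separable Gaussian blur as the low-pass filter (parameter values are illustrative):

```python
import numpy as np

def lowpass(img, sigma=2.0, radius=4):
    """Separable Gaussian low-pass filter."""
    xs = np.arange(-radius, radius + 1)
    k = np.exp(-xs**2 / (2 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def hybrid_noise(eps_far, eps_near, sigma=2.0):
    """Hybrid-image noise estimate: low frequencies from the prompt meant to
    dominate at a distance, high frequencies from the close-up prompt."""
    return lowpass(eps_far, sigma) + (eps_near - lowpass(eps_near, sigma))
```

The high-pass component is simply the residual after low-pass filtering, so the two estimates partition the frequency spectrum between the prompts.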
Now we build our own diffusion model, starting from the single-step denoising UNet, which takes a noisy image as input and produces a clean version by estimating the noise. Our UNet has the following architecture
To train our UNet to denoise a noisy image $z$ into a cleaner version $x$, we must first produce pairs of noisy and clean images from a dataset:

$$z = x + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
We input the noisy image into the UNet and optimize it to output the original image. For our model we chose a fixed noise level $\sigma$, the Adam optimizer, and the L2 distance between the reconstructed and original image as our loss function:

$$L = \mathbb{E}_{z,\,x}\,\|D_\theta(z) - x\|^2$$
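The pair construction and loss can be sketched as below; the default `sigma` here is an illustrative value, not necessarily the one we trained with:

```python
import numpy as np

def make_training_pair(x, sigma=0.5, rng=np.random.default_rng(0)):
    """Build a (noisy, clean) training pair: z = x + sigma * eps."""
    eps = rng.standard_normal(x.shape)
    return x + sigma * eps, x

def l2_loss(pred, target):
    """L2 objective: mean squared error between D_theta(z) and x."""
    return np.mean((pred - target) ** 2)
```

During training, each batch is noised on the fly, the UNet maps `z` back toward `x`, and Adam minimizes `l2_loss` over the reconstruction.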
Note that our model was trained to denoise digits noised with a single, fixed $\sigma$. We can extrapolate and see how the model performs at other noise levels. As expected, for higher noise levels the model struggles to reconstruct the original image.
By introducing time conditioning to our denoiser we can construct a diffusion model. We extend the architecture with a new input that lets the model account for the varying noise levels at different timesteps of the forward diffusion process. The modified architecture looks like the following
To inject the timestep $t$ (out of $T$ total for our model) we first normalize it and pass it through an FCBlock before adding it elementwise to the respective layer's feature maps. Just like the diffusion models we experimented with previously, we sample a clean image $x_0$, a timestep $t$, and noise $\epsilon \sim \mathcal{N}(0, I)$, and produce a noisy image as follows

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$$
Then we train our model on MNIST with an Adam optimizer and an exponentially decaying learning rate, computing the loss as

$$L = \mathbb{E}\,\|\epsilon_\theta(x_t, t) - \epsilon\|^2$$
To sample from our model we start from pure noise $x_T$ and iteratively denoise from $t = T$ down to $t = 1$, at which point the image should no longer contain any noise. At each timestep our model outputs the predicted noise $\epsilon_\theta(x_t, t)$ and we update the noisy image to produce a cleaner one according to the following equation

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sqrt{\beta_t}\, z, \qquad z \sim \mathcal{N}(0, I)$$
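The full sampling loop can be sketched as below, with `eps_model` standing in for our time-conditioned UNet and timesteps 1-indexed; no fresh noise is added on the final step:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng=np.random.default_rng(0)):
    """Ancestral DDPM sampling: start from pure noise x_T and iterate
      x_{t-1} = 1/sqrt(alpha_t) * (x_t - (1 - alpha_t)/sqrt(1 - abar_t) * eps)
              + sqrt(beta_t) * z,   z ~ N(0, I)   (no noise at t = 1)."""
    T = len(betas)
    alphas = 1.0 - betas
    abars = np.cumprod(alphas)
    x = rng.standard_normal(shape)             # x_T ~ N(0, I)
    for t in range(T, 0, -1):                  # t = T, T-1, ..., 1
        eps = eps_model(x, t)
        x = (x - (1 - alphas[t - 1]) / np.sqrt(1 - abars[t - 1]) * eps) \
            / np.sqrt(alphas[t - 1])
        if t > 1:
            x = x + np.sqrt(betas[t - 1]) * rng.standard_normal(shape)
    return x
```

A real run would plug in the trained UNet for `eps_model`; here any callable with the same signature works.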
Just like we conditioned the UNet on time to account for noise levels, we can condition it on a class to control which digit we produce. We do so by injecting the digit as a one-hot encoded vector into the layers through FCBlocks. However, the class now remains constant throughout the denoising process, and instead of shifting the layers' feature maps by the output of the FCBlock, we scale them. During training we also enable Classifier-Free Guidance (CFG) by dropping the class conditioning with some fixed probability, replacing it with a zero vector. Doing so allows the model to perform both class-conditioned and unconditional denoising.
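The conditioning dropout can be sketched as below; `p_uncond` is our hypothetical name for the drop probability:

```python
import numpy as np

def one_hot(label, num_classes=10):
    """One-hot encode a digit class."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def maybe_drop_class(c, p_uncond, rng=np.random.default_rng(0)):
    """With probability p_uncond, replace the one-hot class vector with zeros
    so the model also learns unconditional denoising (enabling CFG later)."""
    return np.zeros_like(c) if rng.random() < p_uncond else c
```

At sampling time, the zero vector plays the role of the empty prompt: the unconditional estimate it produces is combined with the class-conditioned one via the CFG formula.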