In this exploration of diffusion models, we'll start by experimenting with DeepFloyd, a pre-trained text-to-image model, to build intuition about how these models work. Then, we'll dive into the mathematics and implementation details of diffusion models, working our way up from basic denoising to building our own MNIST digit generator. Along the way, we'll explore various techniques like classifier-free guidance, image-to-image translation, and visual anagrams that showcase the versatility and power of diffusion models.
Before training a diffusion model, we'll build up intuition by experimenting with a pre-trained one. For our experiments we chose DeepFloyd, an open-source text-to-image model with a high degree of language understanding: it takes a string input and outputs an image that aligns with the prompt. The model consists of a frozen text encoder and three cascaded pixel diffusion modules, the first, second, and third stages. Each stage takes as input the output of the previous one and produces a higher-resolution version of the image, from 64x64 to 256x256 to 1024x1024. Below we show results for the first and second stages.
The results look incredible, and the images match the prompts. The second stage model is very coherent with the first, virtually producing an upscaled version of the original image produced by the first stage, although it can introduce artifacts or content not in the lower resolution images. Note that we can improve the quality of the image with a higher number of inference steps, particularly for the second stage model.
Diffusion works by gradually denoising an image, starting from pure noise, to hopefully produce an image distinguishable from noise and with little to no artifacts. Therefore, one could easily guess that a key part of diffusion is to take a clean image and add noise to it. This is the forward process, defined as

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
As an exercise, we implement the forward process: given a clean image $x_0$ and some timestep $t$, we produce a noisy image $x_t$ using the time-dependent parameter $\bar\alpha_t$. Excuse the low-resolution images produced by the first stage model.
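The forward process can be sketched in a few lines of NumPy. The linear noise schedule below is a toy one for illustration only, not DeepFloyd's actual schedule:

```python
import numpy as np

def forward_process(x0, t, alphas_cumprod, rng=np.random.default_rng(0)):
    """Noise a clean image x0 to timestep t:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    return x_t, eps

# Toy linear schedule (illustrative values, not the model's real schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = np.zeros((64, 64))              # stand-in for a clean 64x64 image
x_t, eps = forward_process(x0, 750, alphas_cumprod)
```

Larger $t$ means smaller $\bar\alpha_t$, so the noise term dominates and the image drifts toward pure Gaussian noise.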
One could naively attempt to denoise the noisy images with classical methods like a Gaussian blur, hoping that the noise would be smoothed away. However, the results show such techniques to be far from effective.
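For reference, the classical baseline can be sketched as a separable Gaussian blur. This is a minimal NumPy version for illustration, not a production filter:

```python
import numpy as np

def gaussian_blur(img, sigma=2.0, radius=4):
    """Separable Gaussian blur: a classical (and ineffective) denoiser."""
    xs = np.arange(-radius, radius + 1)
    k = np.exp(-xs**2 / (2 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, radius, mode="reflect")
    # Convolve each row, then each column, with the 1D kernel.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)
```

The blur averages noise away, but it averages the signal away just as readily, which is why the results look muddy rather than clean.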
One could instead train a denoiser on a large dataset of paired clean and noisy images. In diffusion models, such a denoiser is trained to recover the Gaussian noise added to an image, from which one can compute an estimate of the original image. We show the outputs of DeepFloyd's first-stage denoiser, a UNet conditioned on the amount of Gaussian noise via the timestep $t$ it takes as input.
As expected, the denoiser struggles to recover high quality results from very noisy images in a single step - and reasonably so, as the task becomes exponentially harder with increased noise levels. This limitation motivated the development of iterative denoising, where instead of attempting to recover the clean image in one ambitious leap, we gradually denoise the image through multiple steps. The process starts from pure noise $x_T$ and progressively applies the denoising UNet to obtain less noisy versions $x_t$. While one could naively run the denoiser for all $T$ steps, this would be computationally expensive. Instead, we can leverage connections to differential equations to skip steps, dramatically speeding up the sampling process while maintaining quality. When moving from a noisier timestep $t$ to a less noisy timestep $t'$, we update the image according to

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\,x_t + v_\sigma$$

where $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$ and $\beta_t = 1 - \alpha_t$.
This formula, first introduced in DDPM, allows us to use a stride to apply our denoiser only at specific timesteps, making the process both efficient and effective. The equation combines our current estimate of the clean image $\hat{x}_0$, the noisy image $x_t$, and a noise term $v_\sigma$, weighted by coefficients that depend on the noise schedule parameters $\alpha_t$ and $\beta_t$.
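One strided update can be sketched as follows, with `alphas_cumprod` standing in for the cumulative noise schedule (the names here are ours, not DeepFloyd's API):

```python
import numpy as np

def ddpm_stride_step(x_t, x0_hat, t, t_prime, alphas_cumprod, v_sigma=0.0):
    """One strided DDPM update from a noisier timestep t to a cleaner t' < t:
      x_{t'} = sqrt(abar_{t'}) * beta / (1 - abar_t) * x0_hat
             + sqrt(alpha) * (1 - abar_{t'}) / (1 - abar_t) * x_t + v_sigma
    with alpha = abar_t / abar_{t'} and beta = 1 - alpha."""
    abar_t, abar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
    alpha = abar_t / abar_tp
    beta = 1.0 - alpha
    coef_x0 = np.sqrt(abar_tp) * beta / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha) * (1.0 - abar_tp) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * x_t + v_sigma
```

At each step, $\hat{x}_0$ comes from the UNet's noise estimate; the update blends that estimate with the current noisy image before moving on to the next strided timestep.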
Previously we iteratively denoised an image by starting from non-pure noise, i.e. a noisy image with some structure, partway through the schedule. What if we instead start from pure random noise at the very beginning of the schedule? Then our model generates images from scratch!
While our previous sampling approach produces images that are distinguishable from noise, they often lack clarity and detail - sometimes barely recognizable as their intended subjects. Classifier-Free Guidance (CFG) addresses this limitation and significantly improves image quality by combining conditional and unconditional denoising predictions. At each denoising step, we compute both a noise estimate $\epsilon_c$ conditioned on our text prompt and an unconditional estimate $\epsilon_u$ using an empty prompt. These are combined using

$$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$$

where $\gamma$ is the guidance scale - larger values produce higher-quality but less diverse images. While a full theoretical understanding remains elusive, CFG has become standard practice in modern diffusion models. The technique essentially exaggerates the difference between conditional and unconditional predictions, pushing generated images to more strongly align with the given prompt.
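The combination step itself is a one-liner; the default `gamma` below is only an illustrative value:

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, gamma=7.5):
    """Classifier-free guidance: eps = eps_u + gamma * (eps_c - eps_u).
    gamma = 1 recovers the conditional estimate; gamma > 1 exaggerates it."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

Note that $\gamma = 0$ gives purely unconditional denoising and $\gamma = 1$ gives plain conditional denoising; the interesting regime is $\gamma > 1$.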
Rather than starting from pure noise, we can add varying amounts of noise to an existing image before denoising it. The more noise we add, the more creative freedom the model has to modify the image during denoising. It's like giving an artist a partially erased drawing - the less visible the original, the more room for reinterpretation. This technique, called SDEdit, allows for controlled edits while preserving the image's core structure.
We could also denoise only specific regions using a binary mask $m$. At each timestep, we update only the pixels where $m = 1$, while forcing the pixels where $m = 0$ to match the original image with the appropriate amount of noise for that timestep:

$$x_t \leftarrow m\, x_t + (1 - m)\,\text{forward}(x_{\text{orig}}, t)$$
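The masking step can be sketched as below, where the original image is re-noised to the current timestep with the same forward process used earlier (function names are ours):

```python
import numpy as np

def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod,
                 rng=np.random.default_rng(0)):
    """Inpainting constraint: keep denoised pixels where mask == 1,
    and force mask == 0 pixels to the original image noised to timestep t:
    x_t <- m * x_t + (1 - m) * forward(x_orig, t)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(x_orig.shape)
    x_orig_t = np.sqrt(abar_t) * x_orig + np.sqrt(1.0 - abar_t) * eps
    return mask * x_t + (1.0 - mask) * x_orig_t
```

Applying this after every denoising update keeps the unmasked region pinned to the source image while the masked region is free to change.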
More interestingly, the model was trained with text conditioning. Therefore, we can run the same image-to-image translation procedure, but guide the denoising with a text prompt rather than an empty one.
One can get even more creative and run two instances of the diffusion model with two different prompts to produce visual anagrams: images that reveal different content when viewed upside down. We do this by averaging the noise estimates from the two prompts during each denoising step, one for each orientation.
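The averaging step can be sketched as below, with `unet(x, t, prompt)` standing in for any prompt-conditioned noise estimator (a hypothetical interface, not DeepFloyd's actual API):

```python
import numpy as np

def anagram_noise(x_t, t, unet, prompt_up, prompt_down):
    """Visual-anagram noise estimate: denoise the image under one prompt,
    denoise its vertical flip under the other, unflip that estimate,
    and average the two."""
    eps_up = unet(x_t, t, prompt_up)
    eps_down = np.flipud(unet(np.flipud(x_t), t, prompt_down))
    return 0.5 * (eps_up + eps_down)
```

Because the second estimate is computed in the flipped frame and flipped back, both orientations are simultaneously pushed toward their respective prompts.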
Alternatively, we could produce hybrid images which combine content visible at different viewing distances by merging low and high frequencies from two sources. To do so, at each denoising step, we compute noise estimates for two different prompts, then create a composite by combining low frequencies from one with high frequencies from the other using Gaussian filters.
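The frequency split can be sketched with a small separable Gaussian blur as the low-pass filter (parameter values are illustrative):

```python
import numpy as np

def lowpass(img, sigma=2.0, radius=4):
    """Separable Gaussian low-pass filter."""
    xs = np.arange(-radius, radius + 1)
    k = np.exp(-xs**2 / (2 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def hybrid_noise(eps_far, eps_near, sigma=2.0):
    """Hybrid-image noise estimate: low frequencies from the prompt meant to
    dominate at a distance, high frequencies from the close-up prompt."""
    return lowpass(eps_far, sigma) + (eps_near - lowpass(eps_near, sigma))
```

The high-pass component is simply the residual after low-pass filtering, so the two estimates partition the frequency spectrum between the prompts.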
Now we build our own diffusion model, starting from the single-step denoising UNet, which takes a noisy image as input and produces a clean version by estimating the noise. Our UNet has the following architecture
To train our UNet to denoise a noisy image $z$ into a cleaner version $x$, we must first produce pairs of noisy and clean images from a dataset:

$$z = x + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
We input the noisy image into the UNet and optimize it to output the original image. For our model we chose a fixed noise level $\sigma$, the Adam optimizer, and the L2 distance between the reconstructed and original image as our loss function:

$$L = \mathbb{E}_{z,\,x}\,\|D_\theta(z) - x\|^2$$
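The pair construction and loss can be sketched as below; the default `sigma` here is an illustrative value, not necessarily the one we trained with:

```python
import numpy as np

def make_training_pair(x, sigma=0.5, rng=np.random.default_rng(0)):
    """Build a (noisy, clean) training pair: z = x + sigma * eps."""
    eps = rng.standard_normal(x.shape)
    return x + sigma * eps, x

def l2_loss(pred, target):
    """L2 objective: mean squared error between D_theta(z) and x."""
    return np.mean((pred - target) ** 2)
```

During training, each batch is noised on the fly, the UNet maps `z` back toward `x`, and Adam minimizes `l2_loss` over the reconstruction.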
Note that our model was trained to denoise digits noised with a single, fixed $\sigma$. We can extrapolate and see how the model performs at other noise levels. As expected, for higher noise levels the model struggles to reconstruct the original image.
By introducing time conditioning to our denoiser we can construct a diffusion model. We extend the architecture with a new input that lets the model account for the varying noise levels at different timesteps of the forward diffusion process. The modified architecture looks like the following
To inject the timestep $t$ (out of $T$ total for our model) we first normalize it and pass it through an FCBlock before adding it elementwise to the respective layer's feature maps. Just like the diffusion models we experimented with previously, we sample a clean image $x_0$, a timestep $t$, and noise $\epsilon \sim \mathcal{N}(0, I)$, and produce a noisy image as follows

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$$
Then we train our model on MNIST with an Adam optimizer and an exponentially decaying learning rate, computing the loss as

$$L = \mathbb{E}\,\|\epsilon_\theta(x_t, t) - \epsilon\|^2$$
To sample from our model we start from pure noise $x_T$ and iteratively denoise from $t = T$ down to $t = 1$, at which point the image should no longer contain any noise. At each timestep our model outputs the predicted noise $\epsilon_\theta(x_t, t)$ and we update the noisy image to produce a cleaner one according to the following equation

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sqrt{\beta_t}\, z, \qquad z \sim \mathcal{N}(0, I)$$
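The full sampling loop can be sketched as below, with `eps_model` standing in for our time-conditioned UNet and timesteps 1-indexed; no fresh noise is added on the final step:

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng=np.random.default_rng(0)):
    """Ancestral DDPM sampling: start from pure noise x_T and iterate
      x_{t-1} = 1/sqrt(alpha_t) * (x_t - (1 - alpha_t)/sqrt(1 - abar_t) * eps)
              + sqrt(beta_t) * z,   z ~ N(0, I)   (no noise at t = 1)."""
    T = len(betas)
    alphas = 1.0 - betas
    abars = np.cumprod(alphas)
    x = rng.standard_normal(shape)             # x_T ~ N(0, I)
    for t in range(T, 0, -1):                  # t = T, T-1, ..., 1
        eps = eps_model(x, t)
        x = (x - (1 - alphas[t - 1]) / np.sqrt(1 - abars[t - 1]) * eps) \
            / np.sqrt(alphas[t - 1])
        if t > 1:
            x = x + np.sqrt(betas[t - 1]) * rng.standard_normal(shape)
    return x
```

A real run would plug in the trained UNet for `eps_model`; here any callable with the same signature works.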
Just like we conditioned the UNet on time to account for noise levels, we can condition it on a class to control which digit we produce. We do so by injecting the digit as a one-hot encoded vector into the layers through FCBlocks. However, the class now remains constant throughout the denoising process, and instead of shifting the layers' feature maps by the output of the FCBlock, we scale them. During training we also enable Classifier-Free Guidance (CFG) by dropping the class conditioning with some fixed probability, replacing it with a zero vector. Doing so allows the model to perform both class-conditioned and unconditional denoising.
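The conditioning dropout can be sketched as below; `p_uncond` is our hypothetical name for the drop probability:

```python
import numpy as np

def one_hot(label, num_classes=10):
    """One-hot encode a digit class."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def maybe_drop_class(c, p_uncond, rng=np.random.default_rng(0)):
    """With probability p_uncond, replace the one-hot class vector with zeros
    so the model also learns unconditional denoising (enabling CFG later)."""
    return np.zeros_like(c) if rng.random() < p_uncond else c
```

At sampling time, the zero vector plays the role of the empty prompt: the unconditional estimate it produces is combined with the class-conditioned one via the CFG formula.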