1. Introduction
Recently, text-to-image diffusion models have expanded to encompass editing tasks. Most approaches [2, 5, 7] rely purely on text guidance for editing, placing a constraint on the user, since a prompt of limited length can only contain so much information. However, we know that an image is worth a thousand words (if not more), and the field has evolved to embrace reference-based editing as a more nuanced alternative. This shift enables users to leverage visual information directly from reference images, addressing the inherent limitation of text guidance in capturing complex visual attributes.
Reference-based editing proves particularly valuable when attempting to transfer specific visual characteristics, such as intricate textures or distinctive features of objects and individuals, which would demand impossibly precise textual descriptions. This advancement represents a natural progression toward making image editing more intuitive and expressive, as it aligns with what is arguably an individual's strongest form of communication: vision.
2. Background
There have been recent attempts at exemplar-based image editing [3, 4, 8] that allow for reference-based editing. However, we note that SOTA solutions demand training or tuning, or suffer from issues such as excessive manual input and limited editing capabilities (e.g., only style transfer).
A training-free approach is desirable as it eliminates
the need for computationally intensive retraining or fine-
tuning of models, making it more practical and accessi-
ble for a wide range of users. Furthermore, training-free
approaches preserve the original capabilities of pre-trained
models while focusing on lightweight manipulations, en-
abling users to achieve high-quality results without com-
promising usability or performance.
In summary, the goal is to enable reference-based image
editing that is training-free, requires no manual masking,
and offers fine-grained control over both the extraction of
specific information from a reference image and its precise
injection into the source image.
Our solution is inspired by previous work on attention [1–3, 7] in the context of diffusion, which has shown attention swapping and manipulation to be effective for transferring content and/or style from one image to another, and attention maps to be useful for producing masks.
3. Preliminaries
3.1. DDIM Inversion
Denoising Diffusion Implicit Models (DDIM) [6] extend
the diffusion framework by introducing a deterministic sam-
pling process, which enables both efficient sampling and
reversible transformations between noisy latents and clean
images. The deterministic behavior of DDIM allows inver-
sions which map a clean image x
0
to a noisy latent x
T
, con-
venient for tasks such as image editing and reconstruction
while preserving structural and semantic consistency.
The forward diffusion process gradually alters an image $x_0$ into a noisy latent $x_t$ over $T$ timesteps as follows:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\alpha_t}\, x_0,\ (1 - \alpha_t) I\right),$$
where $\alpha_t \in (0, 1)$ is a noise scheduling parameter that controls the level of noise injected at each step $t$.
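As a concrete illustration, this closed-form noising step can be written in a few lines. The sketch below is ours, assuming PyTorch and hypothetical names (forward_diffuse, alphas for the schedule); it is not taken from any specific implementation.

import torch

def forward_diffuse(x0: torch.Tensor, alphas: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_t) * x_0, (1 - alpha_t) * I).

    x0:     clean image or latent, e.g. shape (B, C, H, W)
    alphas: noise schedule with alphas[t] in (0, 1)
    t:      timestep index
    """
    alpha_t = alphas[t]
    noise = torch.randn_like(x0)  # epsilon ~ N(0, I)
    return alpha_t.sqrt() * x0 + (1 - alpha_t).sqrt() * noise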
To invert this process, DDIM defines a deterministic update rule to compute progressively noisier latents. At each step $t \in \{0, \ldots, T - 1\}$, the noisy latent is computed as:
$$x_{t+1} = \sqrt{\alpha_{t+1}}\, \hat{x}_0 + \sqrt{1 - \alpha_{t+1}} \cdot \hat{\epsilon},$$
where $\hat{x}_0$ and $\hat{\epsilon}$ are the estimated clean image and noise component, respectively. Specifically, $\hat{x}_0$ is computed as:
$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \alpha_t} \cdot \hat{\epsilon}}{\sqrt{\alpha_t}}.$$
Here, $\hat{\epsilon}$ is the noise predicted by the denoising model at timestep $t$, and $\alpha_t$ and $\alpha_{t+1}$ are the noise schedule coefficients for the current and next timesteps, respectively.
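For concreteness, one inversion step follows directly from the two equations above. The following sketch is a minimal PyTorch-style illustration; the function name ddim_inversion_step and its arguments are our own assumed notation.

import torch

def ddim_inversion_step(x_t: torch.Tensor, eps: torch.Tensor,
                        alpha_t: torch.Tensor, alpha_next: torch.Tensor) -> torch.Tensor:
    """One deterministic inversion step x_t -> x_{t+1}, given the predicted noise eps."""
    # Estimate the clean image: x0_hat = (x_t - sqrt(1 - alpha_t) * eps) / sqrt(alpha_t)
    x0_hat = (x_t - (1 - alpha_t).sqrt() * eps) / alpha_t.sqrt()
    # Deterministically re-noise toward the next (noisier) timestep
    return alpha_next.sqrt() * x0_hat + (1 - alpha_next).sqrt() * eps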
The inversion process begins with the clean image $x_0$, which is iteratively transformed into progressively noisier latents $x_t$ using the above update rule. The final noisy latent $x_T$ encodes both structural and semantic information from the original image while aligning with the learned diffusion trajectory. This latent serves as the starting point for downstream applications such as reconstruction or image editing.
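The full trajectory is then obtained by iterating this update from $t = 0$ to $T - 1$. The loop below is a minimal sketch under the assumptions that eps_model(x, t, cond) is a hypothetical noise-prediction callable (e.g. a diffusion UNet in latent space) and that alphas stores the coefficients alpha_0, ..., alpha_T; it is not tied to any particular library.

import torch

@torch.no_grad()
def ddim_invert(x0: torch.Tensor, eps_model, alphas: torch.Tensor, cond=None) -> torch.Tensor:
    """Map a clean image/latent x_0 to a noisy latent x_T along the DDIM trajectory."""
    x = x0
    T = alphas.shape[0] - 1  # alphas holds alpha_0, ..., alpha_T
    for t in range(T):
        eps = eps_model(x, t, cond)  # noise predicted at timestep t
        x0_hat = (x - (1 - alphas[t]).sqrt() * eps) / alphas[t].sqrt()
        x = alphas[t + 1].sqrt() * x0_hat + (1 - alphas[t + 1]).sqrt() * eps
    return x  # x_T: starting point for reconstruction or editing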
3.2. Attention in Diffusion UNets
Attention mechanisms in diffusion UNets allow the model
to capture long-range dependencies in the latent space, crit-
ical for tasks like image generation and editing. The atten-
tion mechanism operates on queries Q, keys K, and values
V , which are linear projections of the input. Intuitively, Q
represents the element being updated, K identifies elements
to attend to, and V provides the corresponding information.
The output of attention is computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$
where d is the dimensionality of Q and K. In the con-
text of diffusion models, self-attention helps encode spatial
dependencies within the image, while cross-attention inte-
grates conditioning information, such as text or reference
images, enabling guided generation.
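A minimal sketch of this computation is given below, assuming PyTorch tensors of shape (batch, tokens, channels); the shapes and names are illustrative and do not reflect the exact memory layout inside a diffusion UNet.

import torch
import torch.nn.functional as F

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V.

    In self-attention, Q, K, V are projections of the same spatial features;
    in cross-attention, K and V come from the conditioning (e.g. text tokens).
    """
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5  # (B, N_q, N_k) attention logits
    weights = F.softmax(scores, dim=-1)          # attention map; rows sum to 1
    return weights @ V                           # (B, N_q, d_v)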