A Prompt Can Only Do So Much:
Training-Free Exemplar-Based Image Editing
Kelvin Li
UC Berkeley
kelvin.li.jm@berkeley.edu
Jorge Diaz Chao
UC Berkeley
jdiazchao@berkeley.edu
Figure 1. An overview of our method. Throughout the diffusion process we extract features in areas of importance for the input images, which are later blended together adaptively in the reverse process with manual and/or automatic attention masks according to an edit prompt.
Abstract
Recent advances in text-to-image diffusion models allow for accurate reconstruction and high-quality text-conditioned image editing. However, text-only guidance can only do so much, and nuanced edits are hard to express through a prompt. There have been attempts at exemplar-based solutions, i.e., editing a source image utilizing another for reference. However, these solutions are trained or fine-tuned, or are subject to other issues like excessive manual input and editing capabilities limited to style transfer. We introduce a novel training-free sampling method that edits a source image with another as reference and an optional edit prompt. We do so by introducing the hybrid attention block, a modification to the self-attention blocks in the diffusion UNet during reverse diffusion, which blends masked attention outputs with different inputs. Moreover, we show that our solution is suitable for automation, with attention maps as masks that adapt over time toward optimal feature blending.
Figure 2. A preview of our results.
1. Introduction
Recently, text-to-image diffusion models have expanded to
encompass editing tasks. Most approaches [2, 5, 7] rely
purely on text guidance for editing, placing a constraint on the user, since a prompt of limited length can only contain so much information. However, we know that an image is worth a thousand words (if not more), and the field has evolved to embrace reference-based editing as a more nuanced alternative. This shift enables users to leverage visual information directly from reference images, addressing the inherent limitation of text guidance in capturing complex visual attributes.
Reference-based editing proves particularly valuable
when attempting to transfer specific visual characteristics,
such as intricate textures or distinctive features of objects
and individuals, which often demand impossibly precise textual descriptions. This advancement represents a natural progression in making image editing more intuitive and expressive, as it aligns with arguably an individual's strongest form of communication: vision.
2. Background
There have been recent attempts at exemplar-based image editing [3, 4, 8] which do allow for reference-based image editing. However, we noted that state-of-the-art solutions demand training or fine-tuning, or are subject to issues like excessive manual input and editing capabilities limited to style transfer.
A training-free approach is desirable as it eliminates
the need for computationally intensive retraining or fine-
tuning of models, making it more practical and accessi-
ble for a wide range of users. Furthermore, training-free
approaches preserve the original capabilities of pre-trained
models while focusing on lightweight manipulations, en-
abling users to achieve high-quality results without com-
promising usability or performance.
In summary, the goal is to enable reference-based image
editing that is training-free, requires no manual masking,
and offers fine-grained control over both the extraction of
specific information from a reference image and its precise
injection into the source image.
Our solution is inspired by previous work on attention in the context of diffusion [1, 3, 7], which has shown attention swapping and manipulation to be effective for transferring content and/or style from one image to another, as well as for producing masks from attention maps.
3. Preliminaries
3.1. DDIM Inversion
Denoising Diffusion Implicit Models (DDIM) [6] extend
the diffusion framework by introducing a deterministic sam-
pling process, which enables both efficient sampling and
reversible transformations between noisy latents and clean
images. The deterministic behavior of DDIM allows inversion, which maps a clean image $x_0$ to a noisy latent $x_T$, convenient for tasks such as image editing and reconstruction while preserving structural and semantic consistency.
The forward diffusion process gradually alters an image $x_0$ into a noisy latent $x_t$ over $T$ timesteps as follows:
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\, x_0,\ (1-\alpha_t)I\big),$$
where $\alpha_t \in (0, 1)$ is a noise scheduling parameter that controls the level of noise injected at each step $t$.
To invert this process, DDIM defines a deterministic update rule to compute progressively noisier latents. At each step $t \in \{0, \dots, T-1\}$, the noisy latent is computed as:
$$x_{t+1} = \sqrt{\alpha_{t+1}}\, \hat{x}_0 + \sqrt{1-\alpha_{t+1}} \cdot \hat{\epsilon},$$
where $\hat{x}_0$ and $\hat{\epsilon}$ are the estimated clean image and noise component, respectively. Specifically, $\hat{x}_0$ is computed as:
$$\hat{x}_0 = \frac{x_t - \sqrt{1-\alpha_t}\cdot \hat{\epsilon}}{\sqrt{\alpha_t}}.$$
Here, $\hat{\epsilon}$ is the noise predicted by the denoising model at timestep $t$, and $\alpha_t$ and $\alpha_{t+1}$ are the noise schedule coefficients for the current and next timesteps, respectively.
The inversion process begins with the clean image $x_0$, which is iteratively transformed into progressively noisier latents $x_t$ using the above update rule. The final noisy latent $x_T$ encodes both structural and semantic information from the original image while aligning with the learned diffusion trajectory. This latent serves as the starting point for downstream applications such as reconstruction or image editing.
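To make the update rule concrete, the following is a minimal sketch of a single inversion step under the notation above; the `unet` call signature, the `alphas` schedule array, and the conditioning argument are assumptions for illustration rather than the exact implementation.

```python
import torch

@torch.no_grad()
def ddim_inversion_step(x_t, t, t_next, unet, alphas, cond):
    """One deterministic DDIM inversion step: x_t -> x_{t+1} (noisier)."""
    # Predict the noise component at the current timestep (assumed UNet interface).
    eps = unet(x_t, t, encoder_hidden_states=cond).sample
    a_t, a_next = alphas[t], alphas[t_next]
    # Estimate the clean image from the current latent and predicted noise.
    x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    # Re-noise the estimate toward the next (noisier) timestep.
    return a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps
```

Iterating this step from $x_0$ up to $x_T$ yields the noisy latent used as the starting point for editing.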
3.2. Attention in Diffusion UNets
Attention mechanisms in diffusion UNets allow the model
to capture long-range dependencies in the latent space, crit-
ical for tasks like image generation and editing. The atten-
tion mechanism operates on queries Q, keys K, and values
V , which are linear projections of the input. Intuitively, Q
represents the element being updated, K identifies elements
to attend to, and V provides the corresponding information.
The output of attention is computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V,$$
where d is the dimensionality of Q and K. In the con-
text of diffusion models, self-attention helps encode spatial
dependencies within the image, while cross-attention inte-
grates conditioning information, such as text or reference
images, enabling guided generation.
Figure 3. An overview of the architecture. We run null-text DDIM inversion on the source and reference images and begin constructing the edit starting from the noised latent of the source image. Throughout the reverse diffusion process we inject the attention stored from the source and reference paths, using manual masks or automatic masks inferred from the cross-attention blocks of the inversions (for the attention masks) and from the previous denoising step of the edit (for the blending mask).
3.3. Masked Attention
Masked attention extends standard attention by introducing
a binary mask M to restrict focus to specific regions in the
latent space. The masked attention output is defined as:
$$\mathrm{MaskedAttn}(Q, K, V; M) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right) V,$$
where M assigns large negative values (e.g., −∞) to po-
sitions to be excluded. This mechanism is particularly rele-
vant in image editing, where it allows selective modification
of regions (e.g., applying edits to a specific area while pre-
serving the background). Masked attention facilitates pre-
cise control in tasks such as inpainting, object replacement,
or hybrid blending of source and reference content.
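As an illustration, here is a minimal sketch of the masked attention above, assuming `mask` is an additive tensor holding 0 for allowed positions and a large negative value for excluded ones:

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    """Masked scaled dot-product attention: Softmax(QK^T / sqrt(d) + M) V."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    if mask is not None:
        # mask holds 0 for kept positions and -inf (or a large negative) elsewhere.
        scores = scores + mask
    return F.softmax(scores, dim=-1) @ v
```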
3.4. Cross-Attention Maps in Diffusion UNets
In diffusion UNets, cross-attention maps provide insights
into how textual tokens influence specific spatial regions in
the latent space during the reverse and forward diffusion
[2]. These maps are extracted from the intermediate cross-
attention with the prompt. By multiplying the latent queries Q with the keys K derived from the text tokens, we get n attention maps for an n-token text guidance.
Each token in the prompt contributes a distinct cross-attention map that highlights the regions in the latent space influenced by that token. By averaging these maps across all timesteps, we can visualize the spatial relevance of individual tokens in guiding the generation process.
This shows that the latent representations, even at the very deep layers of the UNet where these attention blocks reside, still preserve spatial relations, meaning one can get a rough idea of the image region each block attends to for each token.
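As a rough sketch of how such per-token heatmaps can be obtained, assuming the cross-attention probabilities have been stored with shape (heads, h·w, n_tokens) per block and timestep (an assumption about the bookkeeping, not a fixed API):

```python
import torch

def token_heatmap(stored_probs, token_idx, h, w):
    """Average stored cross-attention maps for one prompt token.

    stored_probs: list of tensors of shape (heads, h*w, n_tokens),
    collected across timesteps from one cross-attention block (assumed layout).
    """
    maps = []
    for probs in stored_probs:
        # Take the chosen token's column and average over attention heads.
        maps.append(probs[..., token_idx].mean(dim=0))  # (h*w,)
    heatmap = torch.stack(maps).mean(dim=0).reshape(h, w)
    # Normalize to [0, 1] so the map can be visualized or thresholded.
    return (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
```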
4. Method
Our approach begins with a null-text DDIM inversion applied to both the source and reference images, denoted as $x_{s,0}$ and $x_{r,0}$, respectively. During each inversion step, we store the queries (Q), keys (K), and values (V) of the UNet's attention blocks. Additionally, the final noised latent of the source image $x_{s,T}$ produced by the DDIM inversion is retained for further use. This process is outlined by the source path and reference path in Figure 3.
4.1. Hybrid Attention
As outlined in Figure 1, the idea now is to utilize the information extracted during the DDIM inversion of the source and reference images in the form of attention inputs Q, K, and V. We filter the information of interest for the source and reference latents coming from the source path and reference path, and blend that information into one latent to continue the reverse diffusion through the edit path.
Figure 4. Our Hybrid Attention block.
The editing process starts from the noised latent of the source image, $x_{s,T}$, which serves as a strong initialization point for image editing. The latent encapsulates essential structural information about the image, facilitating accurate reconstruction. From this initialization, we iteratively apply the denoising UNet to produce the final edited image. At each denoising step, the self-attention (SA) blocks of the UNet are replaced with a novel hybrid attention (HA) mechanism, defined as follows:
$$O_s = \mathrm{MaskAttn}(Q_f, K_s, V_s; M_s)$$
$$O_r = \mathrm{MaskAttn}(Q_f, K_r, V_r; M_r)$$
$$\mathrm{HybridAttn} = O_s \cdot (1 - M_f) + O_r \cdot M_f$$
Here, $O_s$ represents the output attention from the source image, while $O_r$ corresponds to the output attention from the reference image. Note that we always compute these with $Q_f$, so that the edit latent queries the source and reference latents by combining with $K_s$ and $V_s$ or $K_r$ and $V_r$. We do this because, through experimentation, we found that maintaining the query maximizes coherence, while swapping the keys and values maximizes feature injection into the latent.
The masks $M_s$ and $M_r$ modulate the attention to control the regions of focus for the source and reference images, respectively. Specifically, $M_s$ filters out any content that is being edited out of the source image, and $M_r$ isolates the desired regions from the reference image.
The blending mask $M_f$ plays a critical role in defining the regions of injection for the output latent. When $M_f$ specifies the editing region, $1 - M_f$ covers the background. As a result, $O_s$ is used to reconstruct the background, preserving most of the source image's original content, while $O_r$ is injected into the editing region. This ensures that the final output retains the structural fidelity of the source image while incorporating the desired edits from the reference.
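A minimal sketch of the hybrid attention computation follows; the tensor shapes and the exact form of the masks ($M_s$, $M_r$ as additive attention masks, $M_f$ as a binary spatial mask) are assumptions for illustration.

```python
import math
import torch
import torch.nn.functional as F

def masked_attn(q, k, v, m):
    """Softmax(QK^T / sqrt(d) + M) V, with M holding 0 / -inf entries."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores + m, dim=-1) @ v

def hybrid_attention(q_f, k_s, v_s, k_r, v_r, m_s, m_r, m_f):
    """Blend source- and reference-conditioned attention outputs.

    q_f comes from the edit latent; (k_s, v_s) and (k_r, v_r) are the stored
    source / reference keys and values; m_s and m_r are additive attention
    masks; m_f is a binary blending mask in [0, 1] broadcastable over the output.
    """
    o_s = masked_attn(q_f, k_s, v_s, m_s)  # source-conditioned output
    o_r = masked_attn(q_f, k_r, v_r, m_r)  # reference-conditioned output
    # Inject the reference output inside the edit region, keep the source elsewhere.
    return o_s * (1 - m_f) + o_r * m_f
```

Keeping $Q_f$ while swapping keys and values is the design choice discussed above: the edit latent decides where to look, but what it retrieves comes from the source or reference.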
4.2. Optimization
The attention blocks across different UNet layers encode
distinct features, such as high-level structure or low-level
texture. Modulating the hybrid attention across different
layers impacts how information from the reference image
is integrated into the source.
Similarly, swapping attention at different timesteps dur-
ing the denoising process influences the final image. Early
timestep swaps guide the model toward reconstructing the
desired edits earlier in the diffusion process.
To manage this, we propose optimizing a schedule that
determines which layers and timesteps should utilize hybrid
attention, balancing the reconstruction of the source image
with the integration of edits.
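One simple way to express such a schedule (hypothetical layer names and ranges, not the exact configuration we use) is a per-layer timestep range that gates whether hybrid attention replaces self-attention:

```python
# Hypothetical schedule: hybrid attention is active for a layer only within
# its listed denoising-step range (here assuming 50 DDIM steps).
HYBRID_SCHEDULE = {
    "encoder.attn_top_1": (0, 50),   # all timesteps
    "encoder.attn_top_2": (0, 50),
    "decoder.attn_top_1": (0, 50),
    "decoder.attn_top_2": (10, 50),  # skip the earliest, most structural steps
}

def use_hybrid(layer_name: str, step: int) -> bool:
    """Return True if this layer should use hybrid attention at this step."""
    if layer_name not in HYBRID_SCHEDULE:
        return False
    start, end = HYBRID_SCHEDULE[layer_name]
    return start <= step < end
```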
4.3. Automation and Adaptation of Masks
Masks are central to our pipeline, as they define the content
to attend to and its injection regions in the output. While
users can provide manual source, reference, and blend-
ing masks, automation is often desirable. To automate
mask generation, we leverage cross-attention maps from
the UNet’s cross-attention blocks. These maps serve as
heatmaps that highlight token-level relevance in the image.
By thresholding these maps, binary masks can be created
and used in hybrid attention.
For more complex edits, the blending mask $M_f$ needs to adapt dynamically. Unlike the source mask, $M_f$ cannot always be predefined. To address this, we update $M_f$ using the cross-attention map from the previous timestep in the editing path. This enables the blending mask to evolve over time as the content being injected becomes more refined, granting the diffusion model flexibility to optimize the edits without manual intervention.
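A minimal sketch of this adaptive update, assuming the cross-attention probabilities from the previous editing step are available in the same layout as before:

```python
import torch

def update_blend_mask(prev_cross_attn, token_idx, h, w, threshold=0.5):
    """Derive the next blending mask M_f from the previous editing step.

    prev_cross_attn: (heads, h*w, n_tokens) cross-attention probabilities from
    the previous denoising step of the edit path (assumed shape).
    """
    heat = prev_cross_attn[..., token_idx].mean(dim=0).reshape(h, w)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    # Binary blending mask: 1 inside the evolving edit region, 0 elsewhere.
    return (heat > threshold).float()
```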
4.4. Post-Processing for Mask Refinement
Figure 5. Automated thresholding for masks derived from cross-
attention maps.
The cross-attention maps, particularly in early diffusion
steps, can be noisy and lack normalization, making bi-
nary thresholding challenging. To mitigate this, we apply
a Gaussian filter to denoise the cross-attention maps, fol-
lowed by normalization. Pixels above a chosen threshold
are selected to form the mask. To ensure smoothness, we perform dilation and erosion steps, resulting in cleaner and more robust masks.
Figure 6. Our results prove our method works across a wide range of editing domains. To the left, examples of replacement. Upper right shows an example of removal. Middle right shows an example of insertion. Bottom right shows a dynamic change, or non-rigid edit.
This combination of automated mask generation and re-
finement ensures that our method achieves high-quality ed-
its while minimizing manual input and bias in the masking
process.
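A minimal sketch of this refinement step using standard image-processing primitives (the exact filter size and threshold are illustrative):

```python
import numpy as np
from scipy import ndimage

def refine_mask(heatmap: np.ndarray, threshold: float = 0.5,
                sigma: float = 2.0, iterations: int = 2) -> np.ndarray:
    """Denoise, normalize, threshold, and smooth a cross-attention heatmap."""
    # The Gaussian filter suppresses the noise typical of early-step attention maps.
    smooth = ndimage.gaussian_filter(heatmap, sigma=sigma)
    smooth = (smooth - smooth.min()) / (smooth.max() - smooth.min() + 1e-8)
    mask = smooth > threshold
    # Dilation followed by erosion closes small holes and smooths ragged edges.
    mask = ndimage.binary_dilation(mask, iterations=iterations)
    mask = ndimage.binary_erosion(mask, iterations=iterations)
    return mask.astype(np.float32)
```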
5. Experiment
We applied our method to the UNet-based text-to-image
Stable Diffusion model using the publicly available weights
v2.1. All editing experiments were performed on real im-
ages. Specifically, hybrid attention was implemented for
the top two layers of the encoder and decoder in the UNet
across all timesteps.
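For reference, the model can be loaded as in the minimal sketch below, which uses the Hugging Face diffusers library; the repository id and dtype are assumptions about one possible setup rather than a prescribed configuration.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load publicly available Stable Diffusion v2.1 weights (repo id assumed).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
# A deterministic DDIM scheduler is needed for the inversion described in Sec. 3.1.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
unet = pipe.unet  # the attention blocks of this UNet are the ones we modify
```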
For figures labeled as None, either the reference image
was used as the source, or no reference image was provided.
These scenarios demonstrate the performance of our system
in such cases.
5.1. Layer Configuration and Hybrid Attention
We tested various configurations of layers and timestep
ranges to integrate the proposed hybrid attention blocks in
place of self-attention during the reverse diffusion process.
For most edits, the top two attention layers of the encoder and decoder yielded optimal results (a layer-selection sketch follows the list below). This is likely because:
- Deeper layers in the network encode general feature and class information. Modifying these layers significantly impacts the structure of the entire image, leading to undesirable global changes.
- Higher layers tend to focus on fine-grained details and superficial modifications, which are more suitable for localized edits.
- Higher layers maintain a better spatial relationship between the latent representations and the image, resulting in more effective and well-behaved masking.
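The sketch below illustrates one way such a layer selection could be expressed, filtering self-attention ("attn1") processors in the outermost encoder and decoder blocks of the diffusers UNet loaded above; the block names are illustrative and depend on the library version.

```python
def is_top_self_attention(name: str) -> bool:
    """Select self-attention ('attn1') processors in the outermost
    encoder (down) and decoder (up) blocks; names are illustrative."""
    top_blocks = ("down_blocks.0", "down_blocks.1", "up_blocks.2", "up_blocks.3")
    return ".attn1." in name and name.startswith(top_blocks)

# Example: list the attention processors that hybrid attention would replace.
hybrid_layers = [n for n in unet.attn_processors if is_top_self_attention(n)]
```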
5.2. Preliminary Results
Our method demonstrated promising results across a range
of editing applications, including replacement, removal,
insertion, and dynamic changes (e.g., pose adjustments).
With the appropriate configuration, the edited content dis-
played high coherence with the original image while adapt-
ing seamlessly to the overall style.
5.3. Automatic and Adaptive Attention Masks
As shown in Figure 7, we compared results using manual masks, automatic masks, and adaptive attention (AA)
masks. Automating the masking process reduces user input
and eliminates potential human bias, which is advantageous
in certain scenarios. However, the effectiveness of automa-
tion varies depending on the image.
Automating the Source and Reference Masks: For au-
tomation, we compute cross-attention maps for specific
tokens (e.g., “glass of wine” or “juice”) in the source
and reference images. This is achieved by performing a
prompt-guided reconstruction for each image and extracting
the query and key values from the UNet's cross-attention blocks. The resulting cross-attention maps effectively identify and isolate the regions corresponding to these tokens, thereby automating the source and reference masks.
Figure 7. Comparison of manual (top), automatic (middle), and adaptive attention masks (bottom).
Adaptive Blending Mask (AA): The blend mask, which
defines the region to be modified, cannot always be derived
directly from the source and reference masks. Instead, we
allow the diffusion model to make an initial guess using
the edit prompt. At subsequent timesteps, we update the
blend mask dynamically by leveraging the cross-attention
maps for relevant tokens (e.g., “juice”) from the previ-
ous timestep. This adaptive approach enables the diffusion
model to explore coherent new shapes and integrate changes
progressively over time. Results using AA masks resolve
many issues seen with basic automation and, in many cases,
produce results comparable to or better than manual mask-
ing.
5.4. Failure Cases and Observations
While automatic masking often improves certain aspects of
editing, such as reducing human bias, it is not without chal-
lenges. In some failure cases, noisy attention maps caused
overlapping and blending of content, leading to incoher-
ent results. Interestingly, even in these failures, the au-
tomated process demonstrated better adaptation of shapes
compared to manual masking. For instance, in one exam-
ple, the cap shape produced by automation appeared more
forward-facing and properly oriented, suggesting that the
automated pipeline avoids simple copying and pasting in
favor of generating new shapes consistent with the input prompts.
Figure 8. Preliminary results with automatic adaptive attention masks, where the bottom images reference themselves.
These observations suggest that different stages of the re-
verse diffusion process handle varying levels of abstraction,
and automated attention allows the model to focus on what
it deems relevant at each step. Although this can sacrifice
overall coherence in some cases, it improves adaptability
and spatial consistency in others.
Figure 9. Failures.
6. Conclusion
This research project is a work in progress. Our solution has proven successful at different tasks on some images while failing on others.
6.1. Limitation and Future Work
As we experimented with different configurations and stress-tested our solution, we gained many insights into how the attention works, what information each layer might encode, and how thresholding could be improved, amongst other things. Being a training-free sampling technique, our solution comes with limitations, relying heavily on the pre-trained model being used, so we are looking into testing our method on different models. On an important note, very recent research and advances on new forms of diffusion, like rectified flow, or alternatives to the UNet, like the vision transformer, have proven to be very powerful and efficient in reconstruction and image editing [7]. We believe that we can bring this concept to these newer architectures and expect better results.
References
[1] Jiwoo Chung, Sangeek Hyun, and Jae-Pil Heo. Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer, 2024.
[2] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control, 2022.
[3] Ying Hu, Chenyi Zhuang, and Pan Gao. Diffusest: Unleashing the capability of the diffusion model for style transfer, 2024.
[4] Pengzhi Li, Qiang Nie, Ying Chen, Xi Jiang, Kai Wu, Yuhuan Lin, Yong Liu, Jinlong Peng, Chengjie Wang, and Feng Zheng. Tuning-free image customization with image and text guidance, 2024.
[5] Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Semantic image inversion and editing using rectified stochastic differential equations. arXiv preprint arXiv:2410.10792, 2024.
[6] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. CoRR, abs/2010.02502, 2020.
[7] Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing, 2024.
[8] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models, 2022.