A Prompt Can Only Do So Much:
Training-Free Exemplar-Based Image Editing
Kelvin Li
UC Berkeley
kelvin.li.jm@berkeley.edu
Jorge Diaz Chao
UC Berkeley
jdiazchao@berkeley.edu
Figure 1. An overview of our method. Throughout the diffusion process we extract features in areas of importance for the input images, which are later blended together adaptively in the reverse process with manual and/or automatic attention masks according to an edit prompt.
Abstract
Recent advances in text-to-image diffusion models allow for accurate reconstruction and high-quality text-conditioned image editing. However, text-only guidance can only do so much, and nuanced edits are hard to express through a prompt. There have been attempts at exemplar-based solutions, i.e., editing a source image utilizing another for reference. However, these solutions are trained or fine-tuned, or are subject to other issues like excessive manual input and editing capabilities limited to style transfer. We introduce a novel training-free sampling method that edits a source image with another as reference and an optional edit prompt. We do so by introducing the hybrid attention block, a modification to the self-attention blocks in the diffusion UNet during reverse diffusion, which blends masked attention outputs with different inputs. Moreover, we show that our solution is suitable for automation, with attention maps as masks that adapt over time toward optimal feature blending.
Figure 2. A preview of our results.
1. Introduction
Recently, text-to-image diffusion models have expanded to
encompass editing tasks. Most approaches [2, 5, 7] rely
purely on text guidance for editing, placing a constraint on the user, since a prompt of limited length can only contain so much information. However, we know that an image is worth a thousand words (if not more), and the field has evolved to embrace reference-based editing as a more nuanced alternative. This shift enables users to leverage visual information directly from reference images, addressing the inherent limitation of text guidance in capturing complex visual attributes.
Reference-based editing proves particularly valuable
when attempting to transfer specific visual characteristics,
such as intricate textures or distinctive features of objects
and individuals, which often demand impossibly precise textual descriptions. This advancement represents a natural progression in making image editing more intuitive and expressive, as it aligns with arguably an individual's strongest form of communication: vision.
2. Background
There have been recent attempts at exemplar-based image editing [3, 4, 8] which do allow for reference-based image editing. However, we noted that state-of-the-art solutions demand training or fine-tuning, or are subject to issues like excessive manual input and editing capabilities limited to style transfer.
A training-free approach is desirable as it eliminates
the need for computationally intensive retraining or fine-
tuning of models, making it more practical and accessi-
ble for a wide range of users. Furthermore, training-free
approaches preserve the original capabilities of pre-trained
models while focusing on lightweight manipulations, en-
abling users to achieve high-quality results without com-
promising usability or performance.
In summary, the goal is to enable reference-based image
editing that is training-free, requires no manual masking,
and offers fine-grained control over both the extraction of
specific information from a reference image and its precise
injection into the source image.
Our solution is inspired by previous work on attention in the context of diffusion [1, 3, 7], which has shown attention swapping and manipulation to be effective for transferring content and/or style from one image to another, as well as for producing masks from attention maps.
3. Preliminaries
3.1. DDIM Inversion
Denoising Diffusion Implicit Models (DDIM) [6] extend
the diffusion framework by introducing a deterministic sam-
pling process, which enables both efficient sampling and
reversible transformations between noisy latents and clean
images. The deterministic behavior of DDIM allows inversion, which maps a clean image $x_0$ to a noisy latent $x_T$, convenient for tasks such as image editing and reconstruction while preserving structural and semantic consistency.
The forward diffusion process gradually alters an image $x_0$ into a noisy latent $x_t$ over $T$ timesteps as follows:
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\, x_0,\ (1-\alpha_t)I\big),$$
where $\alpha_t \in (0, 1)$ is a noise scheduling parameter that controls the level of noise injected at each step $t$.
To invert this process, DDIM defines a deterministic update rule to compute progressively noisier latents. At each step $t \in \{0, \dots, T-1\}$, the noisy latent is computed as:
$$x_{t+1} = \sqrt{\alpha_{t+1}}\, \hat{x}_0 + \sqrt{1-\alpha_{t+1}} \cdot \hat{\epsilon},$$
where $\hat{x}_0$ and $\hat{\epsilon}$ are the estimated clean image and noise component, respectively. Specifically, $\hat{x}_0$ is computed as:
$$\hat{x}_0 = \frac{x_t - \sqrt{1-\alpha_t}\cdot \hat{\epsilon}}{\sqrt{\alpha_t}}.$$
Here, $\hat{\epsilon}$ is the noise predicted by the denoising model at timestep $t$, and $\alpha_t$ and $\alpha_{t+1}$ are the noise schedule coefficients for the current and next timesteps, respectively.
The inversion process begins with the clean image $x_0$, which is iteratively transformed into progressively noisier latents $x_t$ using the above update rule. The final noisy latent $x_T$ encodes both structural and semantic information from the original image while aligning with the learned diffusion trajectory. This latent serves as the starting point for downstream applications such as reconstruction or image editing.
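To make the update rule concrete, the following is a minimal sketch of a single inversion step under the notation above; the `unet` call signature, the `alphas` schedule array, and the conditioning argument are assumptions for illustration rather than the exact implementation.

```python
import torch

@torch.no_grad()
def ddim_inversion_step(x_t, t, t_next, unet, alphas, cond):
    """One deterministic DDIM inversion step: x_t -> x_{t+1} (noisier)."""
    # Predict the noise component at the current timestep (assumed UNet interface).
    eps = unet(x_t, t, encoder_hidden_states=cond).sample
    a_t, a_next = alphas[t], alphas[t_next]
    # Estimate the clean image from the current latent and predicted noise.
    x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    # Re-noise the estimate toward the next (noisier) timestep.
    return a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps
```

Iterating this step from $x_0$ up to $x_T$ yields the noisy latent used as the starting point for editing.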
3.2. Attention in Diffusion UNets
Attention mechanisms in diffusion UNets allow the model
to capture long-range dependencies in the latent space, crit-
ical for tasks like image generation and editing. The atten-
tion mechanism operates on queries Q, keys K, and values
V , which are linear projections of the input. Intuitively, Q
represents the element being updated, K identifies elements
to attend to, and V provides the corresponding information.
The output of attention is computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V,$$
where d is the dimensionality of Q and K. In the con-
text of diffusion models, self-attention helps encode spatial
dependencies within the image, while cross-attention inte-
grates conditioning information, such as text or reference
images, enabling guided generation.
Figure 3. An overview of the architecture. We run null-text DDIM inversion on the source and reference images and begin constructing the edit starting from the noised latent of the source image. Throughout the reverse diffusion process we inject the attention stored from the source and reference paths, using manual masks or automatic masks inferred from the cross-attention blocks of the inversions (for the attention masks) and from the previous denoising step of the edit (for the blending mask).
3.3. Masked Attention
Masked attention extends standard attention by introducing
a binary mask M to restrict focus to specific regions in the
latent space. The masked attention output is defined as:
$$\mathrm{MaskedAttn}(Q, K, V; M) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right) V,$$
where M assigns large negative values (e.g., −∞) to po-
sitions to be excluded. This mechanism is particularly rele-
vant in image editing, where it allows selective modification
of regions (e.g., applying edits to a specific area while pre-
serving the background). Masked attention facilitates pre-
cise control in tasks such as inpainting, object replacement,
or hybrid blending of source and reference content.
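As an illustration, here is a minimal sketch of the masked attention above, assuming `mask` is an additive tensor holding 0 for allowed positions and a large negative value for excluded ones:

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    """Masked scaled dot-product attention: Softmax(QK^T / sqrt(d) + M) V."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    if mask is not None:
        # mask holds 0 for kept positions and -inf (or a large negative) elsewhere.
        scores = scores + mask
    return F.softmax(scores, dim=-1) @ v
```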
3.4. Cross-Attention Maps in Diffusion UNets
In diffusion UNets, cross-attention maps provide insights
into how textual tokens influence specific spatial regions in
the latent space during the reverse and forward diffusion
[2]. These maps are extracted from the intermediate cross-
attention with the prompt. By multiplying the latent queries Q with the keys K derived from the text tokens, we get n attention maps for an n-token text guidance.
Each token in the prompt contributes a distinct cross-attention map that highlights the regions in the latent space influenced by that token. By averaging these maps across all timesteps, we can visualize the spatial relevance of individual tokens in guiding the generation process.
This shows that the latent representations, even at the very deep layers of the UNet where these attention blocks reside, still preserve spatial relations, meaning one can get a rough idea of the image region each block attends to for each token.
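As a rough sketch of how such per-token heatmaps can be obtained, assuming the cross-attention probabilities have been stored with shape (heads, h·w, n_tokens) per block and timestep (an assumption about the bookkeeping, not a fixed API):

```python
import torch

def token_heatmap(stored_probs, token_idx, h, w):
    """Average stored cross-attention maps for one prompt token.

    stored_probs: list of tensors of shape (heads, h*w, n_tokens),
    collected across timesteps from one cross-attention block (assumed layout).
    """
    maps = []
    for probs in stored_probs:
        # Take the chosen token's column and average over attention heads.
        maps.append(probs[..., token_idx].mean(dim=0))  # (h*w,)
    heatmap = torch.stack(maps).mean(dim=0).reshape(h, w)
    # Normalize to [0, 1] so the map can be visualized or thresholded.
    return (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
```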
4. Method
Our approach begins with a null-text DDIM inversion applied to both the source and reference images, denoted as $x_{s,0}$ and $x_{r,0}$, respectively. During each inversion step, we store the queries (Q), keys (K), and values (V) of the UNet's attention blocks. Additionally, the final noised latent of the source image $x_{s,T}$ produced by the DDIM inversion is retained for further use. This process is outlined by the source path and reference path in Figure 3.
4.1. Hybrid Attention
As outlined in Figure 1, the idea now is to utilize the information extracted during the DDIM inversion of the source and reference images in the form of attention inputs Q, K, and V. We filter the information of interest for the source and reference latents coming from the source path and reference path, and blend that information into one latent to continue the reverse diffusion through the edit path.
Figure 4. Our Hybrid Attention block.
The editing process starts from the noised latent of the source image, $x_{s,T}$, which serves as a strong initialization point for image editing. The latent encapsulates essential structural information about the image, facilitating accurate reconstruction. From this initialization, we iteratively apply the denoising UNet to produce the final edited image. At each denoising step, the self-attention (SA) blocks of the UNet are replaced with a novel hybrid attention (HA) mechanism, defined as follows:
$$O_s = \mathrm{MaskAttn}(Q_f, K_s, V_s; M_s)$$
$$O_r = \mathrm{MaskAttn}(Q_f, K_r, V_r; M_r)$$
$$\mathrm{HybridAttn} = O_s \cdot (1 - M_f) + O_r \cdot M_f$$
Here, $O_s$ represents the output attention from the source image, while $O_r$ corresponds to the output attention from the reference image. Note that we always compute these with $Q_f$, so that the edit latent queries the source and reference latents by combining with $K_s$ and $V_s$ or $K_r$ and $V_r$. We do this because, through experimentation, we found that maintaining the query maximizes coherence, while swapping the keys and values maximizes feature injection into the latent.
The masks $M_s$ and $M_r$ modulate the attention to control the regions of focus for the source and reference images, respectively. Specifically, $M_s$ filters out any content that is being edited out of the source image, and $M_r$ isolates the desired regions from the reference image.
The blending mask $M_f$ plays a critical role in defining the regions of injection for the output latent. When $M_f$ specifies the editing region, $1 - M_f$ covers the background. As a result, $O_s$ is used to reconstruct the background, preserving most of the source image's original content, while $O_r$ is injected into the editing region. This ensures that the final output retains the structural fidelity of the source image while incorporating the desired edits from the reference.
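A minimal sketch of the hybrid attention computation follows; the tensor shapes and the exact form of the masks ($M_s$, $M_r$ as additive attention masks, $M_f$ as a binary spatial mask) are assumptions for illustration.

```python
import math
import torch
import torch.nn.functional as F

def masked_attn(q, k, v, m):
    """Softmax(QK^T / sqrt(d) + M) V, with M holding 0 / -inf entries."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores + m, dim=-1) @ v

def hybrid_attention(q_f, k_s, v_s, k_r, v_r, m_s, m_r, m_f):
    """Blend source- and reference-conditioned attention outputs.

    q_f comes from the edit latent; (k_s, v_s) and (k_r, v_r) are the stored
    source / reference keys and values; m_s and m_r are additive attention
    masks; m_f is a binary blending mask in [0, 1] broadcastable over the output.
    """
    o_s = masked_attn(q_f, k_s, v_s, m_s)  # source-conditioned output
    o_r = masked_attn(q_f, k_r, v_r, m_r)  # reference-conditioned output
    # Inject the reference output inside the edit region, keep the source elsewhere.
    return o_s * (1 - m_f) + o_r * m_f
```

Keeping $Q_f$ while swapping keys and values is the design choice discussed above: the edit latent decides where to look, but what it retrieves comes from the source or reference.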
4.2. Optimization
The attention blocks across different UNet layers encode
distinct features, such as high-level structure or low-level
texture. Modulating the hybrid attention across different
layers impacts how information from the reference image
is integrated into the source.
Similarly, swapping attention at different timesteps dur-
ing the denoising process influences the final image. Early
timestep swaps guide the model toward reconstructing the
desired edits earlier in the diffusion process.
To manage this, we propose optimizing a schedule that
determines which layers and timesteps should utilize hybrid
attention, balancing the reconstruction of the source image
with the integration of edits.
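One simple way to express such a schedule (hypothetical layer names and ranges, not the exact configuration we use) is a per-layer timestep range that gates whether hybrid attention replaces self-attention:

```python
# Hypothetical schedule: hybrid attention is active for a layer only within
# its listed denoising-step range (here assuming 50 DDIM steps).
HYBRID_SCHEDULE = {
    "encoder.attn_top_1": (0, 50),   # all timesteps
    "encoder.attn_top_2": (0, 50),
    "decoder.attn_top_1": (0, 50),
    "decoder.attn_top_2": (10, 50),  # skip the earliest, most structural steps
}

def use_hybrid(layer_name: str, step: int) -> bool:
    """Return True if this layer should use hybrid attention at this step."""
    if layer_name not in HYBRID_SCHEDULE:
        return False
    start, end = HYBRID_SCHEDULE[layer_name]
    return start <= step < end
```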
4.3. Automation and Adaptation of Masks
Masks are central to our pipeline, as they define the content
to attend to and its injection regions in the output. While
users can provide manual source, reference, and blend-
ing masks, automation is often desirable. To automate
mask generation, we leverage cross-attention maps from
the UNet’s cross-attention blocks. These maps serve as
heatmaps that highlight token-level relevance in the image.
By thresholding these maps, binary masks can be created
and used in hybrid attention.
For more complex edits, the blending mask $M_f$ needs to adapt dynamically. Unlike the source mask, $M_f$ cannot always be predefined. To address this, we update $M_f$ using the cross-attention map from the previous timestep in the editing path. This enables the blending mask to evolve over time as the content being injected becomes more refined, granting the diffusion model flexibility to optimize the edits without manual intervention.
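A minimal sketch of this adaptive update, assuming the cross-attention probabilities from the previous editing step are available in the same layout as before:

```python
import torch

def update_blend_mask(prev_cross_attn, token_idx, h, w, threshold=0.5):
    """Derive the next blending mask M_f from the previous editing step.

    prev_cross_attn: (heads, h*w, n_tokens) cross-attention probabilities from
    the previous denoising step of the edit path (assumed shape).
    """
    heat = prev_cross_attn[..., token_idx].mean(dim=0).reshape(h, w)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    # Binary blending mask: 1 inside the evolving edit region, 0 elsewhere.
    return (heat > threshold).float()
```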
4.4. Post-Processing for Mask Refinement
Figure 5. Automated thresholding for masks derived from cross-
attention maps.
The cross-attention maps, particularly in early diffusion
steps, can be noisy and lack normalization, making bi-
nary thresholding challenging. To mitigate this, we apply
a Gaussian filter to denoise the cross-attention maps, fol-
lowed by normalization. Pixels above a chosen threshold
are selected to form the mask. To ensure smoothness, we perform dilation and erosion steps, resulting in cleaner and more robust masks.
Figure 6. Our results prove our method works across a wide range of editing domains. To the left, examples of replacement. Upper right shows an example of removal. Middle right shows an example of insertion. Bottom right shows a dynamic change, or non-rigid edit.
This combination of automated mask generation and re-
finement ensures that our method achieves high-quality ed-
its while minimizing manual input and bias in the masking
process.
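A minimal sketch of this refinement step using standard image-processing primitives (the exact filter size and threshold are illustrative):

```python
import numpy as np
from scipy import ndimage

def refine_mask(heatmap: np.ndarray, threshold: float = 0.5,
                sigma: float = 2.0, iterations: int = 2) -> np.ndarray:
    """Denoise, normalize, threshold, and smooth a cross-attention heatmap."""
    # The Gaussian filter suppresses the noise typical of early-step attention maps.
    smooth = ndimage.gaussian_filter(heatmap, sigma=sigma)
    smooth = (smooth - smooth.min()) / (smooth.max() - smooth.min() + 1e-8)
    mask = smooth > threshold
    # Dilation followed by erosion closes small holes and smooths ragged edges.
    mask = ndimage.binary_dilation(mask, iterations=iterations)
    mask = ndimage.binary_erosion(mask, iterations=iterations)
    return mask.astype(np.float32)
```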
5. Experiment
We applied our method to the UNet-based text-to-image
Stable Diffusion model using the publicly available weights
v2.1. All editing experiments were performed on real im-
ages. Specifically, hybrid attention was implemented for
the top two layers of the encoder and decoder in the UNet
across all timesteps.
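For reference, the model can be loaded as in the minimal sketch below, which uses the Hugging Face diffusers library; the repository id and dtype are assumptions about one possible setup rather than a prescribed configuration.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load publicly available Stable Diffusion v2.1 weights (repo id assumed).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
# A deterministic DDIM scheduler is needed for the inversion described in Sec. 3.1.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
unet = pipe.unet  # the attention blocks of this UNet are the ones we modify
```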
For figures labeled as None, either the reference image
was used as the source, or no reference image was provided.
These scenarios demonstrate the performance of our system
in such cases.
5.1. Layer Configuration and Hybrid Attention
We tested various configurations of layers and timestep
ranges to integrate the proposed hybrid attention blocks in
place of self-attention during the reverse diffusion process.
For most edits, the top two attention layers of the encoder and decoder yielded optimal results (a layer-selection sketch follows the list below). This is likely because:
- Deeper layers in the network encode general feature and class information. Modifying these layers significantly impacts the structure of the entire image, leading to undesirable global changes.
- Higher layers tend to focus on fine-grained details and superficial modifications, which are more suitable for localized edits.
- Higher layers maintain a better spatial relationship between the latent representations and the image, resulting in more effective and well-behaved masking.
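The sketch below illustrates one way such a layer selection could be expressed, filtering self-attention ("attn1") processors in the outermost encoder and decoder blocks of the diffusers UNet loaded above; the block names are illustrative and depend on the library version.

```python
def is_top_self_attention(name: str) -> bool:
    """Select self-attention ('attn1') processors in the outermost
    encoder (down) and decoder (up) blocks; names are illustrative."""
    top_blocks = ("down_blocks.0", "down_blocks.1", "up_blocks.2", "up_blocks.3")
    return ".attn1." in name and name.startswith(top_blocks)

# Example: list the attention processors that hybrid attention would replace.
hybrid_layers = [n for n in unet.attn_processors if is_top_self_attention(n)]
```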
5.2. Preliminary Results
Our method demonstrated promising results across a range
of editing applications, including replacement, removal,
insertion, and dynamic changes (e.g., pose adjustments).
With the appropriate configuration, the edited content dis-
played high coherence with the original image while adapt-
ing seamlessly to the overall style.
5.3. Automatic and Adaptive Attention Masks
As shown in Figure 7, we compared results using manual masks, automatic masks, and adaptive attention (AA)
masks. Automating the masking process reduces user input
and eliminates potential human bias, which is advantageous
in certain scenarios. However, the effectiveness of automa-
tion varies depending on the image.
Automating the Source and Reference Masks: For au-
tomation, we compute cross-attention maps for specific
tokens (e.g., “glass of wine” or “juice”) in the source
and reference images. This is achieved by performing a
prompt-guided reconstruction for each image and extracting
the query and key values from the UNet's cross-attention blocks. The resulting cross-attention maps effectively identify and isolate the regions corresponding to these tokens, thereby automating the source and reference masks.
Figure 7. Comparison of manual (top), automatic (middle), and adaptive attention masks (bottom).
Adaptive Blending Mask (AA): The blend mask, which
defines the region to be modified, cannot always be derived
directly from the source and reference masks. Instead, we
allow the diffusion model to make an initial guess using
the edit prompt. At subsequent timesteps, we update the
blend mask dynamically by leveraging the cross-attention
maps for relevant tokens (e.g., “juice”) from the previ-
ous timestep. This adaptive approach enables the diffusion
model to explore coherent new shapes and integrate changes
progressively over time. Results using AA masks resolve
many issues seen with basic automation and, in many cases,
produce results comparable to or better than manual mask-
ing.
5.4. Failure Cases and Observations
While automatic masking often improves certain aspects of
editing, such as reducing human bias, it is not without chal-
lenges. In some failure cases, noisy attention maps caused
overlapping and blending of content, leading to incoher-
ent results. Interestingly, even in these failures, the au-
tomated process demonstrated better adaptation of shapes
compared to manual masking. For instance, in one exam-
ple, the cap shape produced by automation appeared more
forward-facing and properly oriented, suggesting that the
automated pipeline avoids simple copying and pasting in
favor of generating new shapes consistent with the input prompts.
Figure 8. Preliminary results with automatic adaptive attention masks, where the bottom images reference themselves.
These observations suggest that different stages of the re-
verse diffusion process handle varying levels of abstraction,
and automated attention allows the model to focus on what
it deems relevant at each step. Although this can sacrifice
overall coherence in some cases, it improves adaptability
and spatial consistency in others.
Figure 9. Failures.
6. Conclusion
This research project is a work in progress. Our solution has proven successful at different tasks on some images while failing on others.
6.1. Limitation and Future Work
As we experimented with different configurations and stress-tested our solution, we gained many insights into how the attention works, what information each layer might encode, and how thresholding could be improved, amongst other things. Being a training-free sampling technique, our solution comes with limitations, relying heavily on the pre-trained model being used, so we are looking into testing our method on different models. On an important note, very recent research and advances on new forms of diffusion, like rectified flow, or alternatives to the UNet, like the vision transformer, have proven to be very powerful and efficient in reconstruction and image editing [7]. We believe that we can bring this concept to these newer architectures and expect better results.
References
[1] Jiwoo Chung, Sangeek Hyun, and Jae-Pil Heo. Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer, 2024.
[2] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control, 2022.
[3] Ying Hu, Chenyi Zhuang, and Pan Gao. Diffusest: Unleashing the capability of the diffusion model for style transfer, 2024.
[4] Pengzhi Li, Qiang Nie, Ying Chen, Xi Jiang, Kai Wu, Yuhuan Lin, Yong Liu, Jinlong Peng, Chengjie Wang, and Feng Zheng. Tuning-free image customization with image and text guidance, 2024.
[5] Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Semantic image inversion and editing using rectified stochastic differential equations. arXiv preprint arXiv:2410.10792, 2024.
[6] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. CoRR, abs/2010.02502, 2020.
[7] Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing, 2024.
[8] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models, 2022.