Hello,
I’d like to confirm my understanding of the conditioning mechanisms in Latent Diffusion Models (LDMs):
A. Spatially aligned image conditions (images/masks/semantic maps)
- Downsample (interpolate) the condition to the latent resolution → concatenate it with the noisy latent z_t along the channel dimension and feed it into the UNet once, at the input stage (a minimal sketch follows this list).
- The injection is channel-concatenation based, not cross-attention based.
- τθ is the identity (no learned conditioning encoder is involved).
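To make (A) concrete, here is a minimal PyTorch sketch of what I believe happens; the channel counts, tensor shapes, and the commented-out unet call are illustrative placeholders of my own, not the actual LDM code:

```python
import torch
import torch.nn.functional as F

latent_channels, cond_channels = 4, 1           # e.g. z_t channels plus a binary mask channel
z_t = torch.randn(2, latent_channels, 32, 32)   # noisy latent at the current timestep
mask = torch.randn(2, cond_channels, 256, 256)  # spatially aligned condition at pixel resolution

# Downsample (interpolate) the condition to the latent resolution; apart from the resize,
# tau_theta is the identity -- no learned conditioning encoder is involved.
mask_lat = F.interpolate(mask, size=z_t.shape[-2:], mode="nearest")

# Concatenate along the channel dimension and feed the UNet once, at its input layer.
unet_input = torch.cat([z_t, mask_lat], dim=1)  # (2, latent_channels + cond_channels, 32, 32)
# eps = unet(unet_input, t)  # hypothetical UNet whose first conv accepts the extra channels
```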
B. Text / token-based conditions (y)
- Build an embedding τθ(y) via a tokenizer/Transformer encoder.
- Inject it via the SpatialTransformer blocks (cross-attention) at multiple UNet stages, using the embedding as the context that supplies K and V (see the sketch after this list).
- There is no channel concatenation here; the injection is purely cross-attention based.
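And for (B), a minimal sketch of the cross-attention injection as I understand it; TinyCrossAttention and all shapes are my own illustrative stand-ins for the SpatialTransformer blocks, which additionally contain self-attention and a feed-forward sublayer:

```python
import torch
import torch.nn as nn

class TinyCrossAttention(nn.Module):
    def __init__(self, dim, context_dim, heads=8):
        super().__init__()
        # Queries come from the UNet feature map (dim); keys/values from the conditioning (context_dim).
        self.attn = nn.MultiheadAttention(dim, heads, kdim=context_dim, vdim=context_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        # x: UNet feature map (B, C, H, W); context: tau_theta(y), e.g. text tokens (B, T, context_dim)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                     # (B, H*W, C): queries from spatial positions
        out, _ = self.attn(self.norm(tokens), context, context)   # K and V are taken from the conditioning
        tokens = tokens + out                                     # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 320, 32, 32)    # intermediate UNet feature map
text_emb = torch.randn(2, 77, 768)    # tau_theta(y) from the tokenizer/Transformer
out = TinyCrossAttention(320, 768)(feat, text_emb)  # the same kind of block is repeated at several UNet stages
```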
Could you please let me know if this is accurate?
Thank you.