Question on LDM Conditioning Mechanisms (Concatenation vs. Cross-Attention) #401

Description

@HONGcogml

Hello,

I’d like to confirm my understanding of the conditioning mechanisms in Latent Diffusion Models (LDMs):

A. Spatially aligned image conditions (images/masks/semantic maps)

  1. Downsample the condition (via interpolation) to the latent resolution, then concatenate it with the noisy latent z_t along the channel dimension and feed it into the UNet once, at the input stage (see the sketch after this list).
  2. This is channel-concatenation-based injection, not cross-attention-based injection.
  3. τθ = id, i.e., the conditioning encoder is the identity; no learned embedding network is involved.
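
If it helps to make case A concrete, here is a minimal PyTorch sketch. All shapes and names (`z_t`, `cond`, the commented-out `unet` call) are illustrative assumptions, not the actual ldm code:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of case A (assumed shapes/names, not the actual ldm code).
z_t = torch.randn(1, 4, 64, 64)        # noisy latent at 64x64
cond = torch.randn(1, 3, 512, 512)     # spatially aligned condition (e.g. a mask)

# 1. Downsample the condition to the latent resolution via interpolation.
cond_lat = F.interpolate(cond, size=z_t.shape[-2:], mode="bilinear")

# 2. Channel-concatenate and feed the UNet once at the input stage;
#    the UNet's in_channels must account for the extra condition channels.
unet_input = torch.cat([z_t, cond_lat], dim=1)   # (1, 4 + 3, 64, 64)
# eps = unet(unet_input, t)   # hypothetical call; tau_theta is the identity here
```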

B. Text / token-based conditions (y)

  1. Encode y into an embedding τθ(y) via a tokenizer and a Transformer encoder.
  2. Inject it via the SpatialTransformer blocks (cross-attention) at multiple UNet stages: the embedding serves as the context that supplies the keys and values (K, V), while the queries (Q) come from the UNet's spatial features (see the sketch after this list).
  3. This is cross-attention-based injection, not channel concatenation.
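
And a minimal single-head cross-attention sketch for case B. The dimensions and layer names here are assumptions for illustration; the real SpatialTransformer is multi-head, adds projections and residual connections, and is inserted at several UNet resolutions:

```python
import torch
import torch.nn as nn

# Minimal single-head sketch of case B (assumed dims/names, not the actual
# ldm.modules.attention code, which is multi-head with extra projections).
d_model, d_ctx, n_tokens = 320, 768, 77

to_q = nn.Linear(d_model, d_model, bias=False)  # Q from UNet spatial features
to_k = nn.Linear(d_ctx, d_model, bias=False)    # K from tau_theta(y)
to_v = nn.Linear(d_ctx, d_model, bias=False)    # V from tau_theta(y)

x = torch.randn(1, 64 * 64, d_model)            # flattened spatial features
context = torch.randn(1, n_tokens, d_ctx)       # text embedding tau_theta(y)

q, k, v = to_q(x), to_k(context), to_v(context)
attn = torch.softmax(q @ k.transpose(-2, -1) / d_model ** 0.5, dim=-1)
out = attn @ v                                   # (1, 64*64, d_model)
# Such blocks sit at multiple UNet stages, so the text condition influences
# the features at every resolution, not only at the input.
```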

Could you please let me know if this is accurate?
Thank you.
