Hello,
I’d like to confirm my understanding of the conditioning mechanisms in Latent Diffusion Models (LDMs):
A. Spatially aligned image conditions (images/masks/semantic maps)
- Downsample (interpolate) the condition to the latent resolution → concatenate it with the noisy latent z_t along the channel dimension and feed it into the UNet once, at the input stage (a minimal sketch follows this list).
- The injection is channel-concatenation based, not cross-attention based.
- τθ is the identity (no learned conditioning encoder is involved).
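To make (A) concrete, here is a minimal PyTorch sketch of what I believe happens; the channel counts, tensor shapes, and the commented-out unet call are illustrative placeholders of my own, not the actual LDM code:

```python
import torch
import torch.nn.functional as F

latent_channels, cond_channels = 4, 1           # e.g. z_t channels plus a binary mask channel
z_t = torch.randn(2, latent_channels, 32, 32)   # noisy latent at the current timestep
mask = torch.randn(2, cond_channels, 256, 256)  # spatially aligned condition at pixel resolution

# Downsample (interpolate) the condition to the latent resolution; apart from the resize,
# tau_theta is the identity -- no learned conditioning encoder is involved.
mask_lat = F.interpolate(mask, size=z_t.shape[-2:], mode="nearest")

# Concatenate along the channel dimension and feed the UNet once, at its input layer.
unet_input = torch.cat([z_t, mask_lat], dim=1)  # (2, latent_channels + cond_channels, 32, 32)
# eps = unet(unet_input, t)  # hypothetical UNet whose first conv accepts the extra channels
```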
B. Text / token-based conditions (y)
- Build an embedding τθ(y) via a tokenizer/Transformer encoder.
- Inject it via the SpatialTransformer blocks (cross-attention) at multiple UNet stages, using the embedding as the context that supplies K and V (see the sketch after this list).
- There is no channel concatenation here; the injection is purely cross-attention based.
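And for (B), a minimal sketch of the cross-attention injection as I understand it; TinyCrossAttention and all shapes are my own illustrative stand-ins for the SpatialTransformer blocks, which additionally contain self-attention and a feed-forward sublayer:

```python
import torch
import torch.nn as nn

class TinyCrossAttention(nn.Module):
    def __init__(self, dim, context_dim, heads=8):
        super().__init__()
        # Queries come from the UNet feature map (dim); keys/values from the conditioning (context_dim).
        self.attn = nn.MultiheadAttention(dim, heads, kdim=context_dim, vdim=context_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        # x: UNet feature map (B, C, H, W); context: tau_theta(y), e.g. text tokens (B, T, context_dim)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                     # (B, H*W, C): queries from spatial positions
        out, _ = self.attn(self.norm(tokens), context, context)   # K and V are taken from the conditioning
        tokens = tokens + out                                     # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 320, 32, 32)    # intermediate UNet feature map
text_emb = torch.randn(2, 77, 768)    # tau_theta(y) from the tokenizer/Transformer
out = TinyCrossAttention(320, 768)(feat, text_emb)  # the same kind of block is repeated at several UNet stages
```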
Could you please let me know if this is accurate?
Thank you.