UNet2DModel - ValueError: cross_attention_dim must be specified for CrossAttnDownBlock2D 

### Describe the bug
When I load the model for my training, I just want to use the unconditional model. But I have this problem. Could anyone tell me how to solve this?
 
ValueError: cross_attention_dim must be specified for CrossAttnDownBlock2D 
on  
self.unet = UNet2DModel.from_pretrained(model_key, subfolder="unet").to(self.device)
![image](https://github.com/huggingface/diffusers/assets/63243678/ed05dc41-e8bd-4bd1-8664-636d6f6f1ca7)


### Reproduction

from diffusers import AutoencoderKL, UNet2DModel, DDIMScheduler

import torch
import torch.nn as nn
import torch.nn.functional as F

class StableDiffusion(nn.Module):
    
    def __init__(self, device, sd_version='1.5', hf_key=None):
        super().__init__()

        self.device = device
        self.sd_version = sd_version
        # self.opt = opt
        t_range = [0.02, 0.98]
        print(f'[INFO] loading stable diffusion...')
        
        if hf_key is not None:
            print(f'[INFO] using hugging face custom model key: {hf_key}')
            model_key = hf_key
        elif self.sd_version == '2.1':
            model_key = "stabilityai/stable-diffusion-2-1-base"
        elif self.sd_version == '2.0':
            model_key = "stabilityai/stable-diffusion-2-base"
        elif self.sd_version == '1.5':
            model_key = "runwayml/stable-diffusion-v1-5"
        else:
            raise ValueError(f'Stable-diffusion version {self.sd_version} not supported.')

        # Create model
        self.vae = AutoencoderKL.from_pretrained(model_key, subfolder="vae").to(self.device)
        self.unet = UNet2DModel.from_pretrained(model_key, subfolder="unet").to(self.device)

        self.scheduler = DDIMScheduler.from_pretrained(model_key, subfolder="scheduler")

        self.num_train_timesteps = self.scheduler.config.num_train_timesteps
        self.min_step = int(self.num_train_timesteps * t_range[0])
        self.max_step = int(self.num_train_timesteps * t_range[1])
        self.alphas = self.scheduler.alphas_cumprod.to(self.device) # for convenience

        print(f'[INFO] loaded stable diffusion!')

    def defuse(self, pred_rgb):
        
        # interp to 512x512 to be fed into vae.
        # assert torch.isnan(pred_rgb).sum() == 0, print(pred_rgb)
        input_size = pred_rgb.size
        
        pred_rgb_512 = F.interpolate(pred_rgb, (512, 512), mode='bilinear', align_corners=False)
        
        # encode image into latents with vae, requires grad!
        latents = self.encode_imgs(pred_rgb_512)        

        t = torch.randint(self.min_step, self.max_step + 1, [1], dtype=torch.long, device=self.device)

        # predict the noise residual with unet, NO grad!
        with torch.no_grad():
            # add noise
            noise = torch.randn_like(latents)
            latents_noisy = self.scheduler.add_noise(latents, noise, t)
            
            # pred noise
            latent_model_input = torch.cat([latents_noisy] * 2)
            noise_pred = self.unet(latent_model_input, t).sample
        
        self.w = (1 - self.alphas[t])
        out = self.decode_latents(noise_pred)
        out = F.interpolate(out, input_size, mode='bilinear', align_corners=False)

        return out

### Logs

```shell
None
```


### System Info

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `diffusers` version: 0.28.0.dev0
- Platform: Linux-5.4.0-113-generic-x86_64-with-glibc2.17
- Python version: 3.8.19
- PyTorch version (GPU?): 1.13.1 (True)
- Huggingface_hub version: 0.22.2
- Transformers version: 4.39.3
- Accelerate version: 0.29.2
- xFormers version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

### Who can help?

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

UNet2DModel - ValueError: cross_attention_dim must be specified for CrossAttnDownBlock2D #7686

Describe the bug

Reproduction

Logs

System Info

Who can help?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

UNet2DModel - ValueError: cross_attention_dim must be specified for CrossAttnDownBlock2D #7686

Description

Describe the bug

Reproduction

Logs

System Info

Who can help?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions