Describe the bug
I am running a slightly modified version of Flux ControlNet training script in diffusers. The script is attached below. I am using DeepSpeed Stage-3 with the accelerate config below.
When I use only 1 GPU (configured via accelerate config file below), it takes around 42GB during training. When I use all 8 GPUs in a single node, it still takes around 42GB per GPU.
I don't know about the parallelization details of DeepSpeed but I would expect DeepSpeed Stage-3 to shard the model weights further and reduce the memory usage per GPU for 8 GPUs compared to single-GPU case.
PS: I am not sure if this issue is related to the CN training script in diffusers or accelerate. I have opened the same issue in accelerate.
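For context, a check along these lines can be dropped in right after accelerator.prepare() to read the per-rank peak memory and see whether ZeRO-3 actually partitioned the weights. This is only a sketch and not part of the attached script: flux_controlnet is a placeholder for the prepared model, and the ds_* attributes are DeepSpeed ZeRO-3 internals, so treat both as assumptions.

import torch

def report_memory_and_sharding(accelerator, model, tag=""):
    # Peak CUDA memory allocated on this rank so far.
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3

    # Under ZeRO-3, DeepSpeed replaces each parameter with a placeholder that
    # carries `ds_numel` (full size) and `ds_tensor` (the local shard). If
    # those attributes are missing, the weights were never partitioned.
    full_numel, local_numel = 0, 0
    for p in model.parameters():
        if hasattr(p, "ds_numel"):
            full_numel += p.ds_numel
            local_numel += p.ds_tensor.numel()
        else:
            full_numel += p.numel()
            local_numel += p.numel()

    print(
        f"[{tag}] rank {accelerator.process_index}: "
        f"peak {peak_gb:.1f} GB, local/full params {local_numel:,}/{full_numel:,}"
    )

# Usage, e.g. right after prepare() in the training script:
# flux_controlnet = accelerator.prepare(flux_controlnet)  # placeholder name
# report_memory_and_sharding(accelerator, flux_controlnet, tag="after prepare")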
Reproduction
Link to the script: https://pastebin.com/SdQZcQR8
Command used to run the script:
accelerate launch --config_file "./default_config_fsdp.yaml" train_controlnet_flux_minimum_working_example.py --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" --dataset_name=fusing/fill50k --conditioning_image_column=conditioning_image --image_column=image --caption_column=text --output_dir="./training_output/" --mixed_precision="bf16" --resolution=512 --learning_rate=1e-5 --max_train_steps=15000 --validation_steps=100 --checkpointing_steps=1 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" --validation_prompt "red circle with blue background" "cyan circle with brown floral background" --train_batch_size=1 --gradient_accumulation_steps=1 --report_to="wandb" --num_double_layers=4 --num_single_layers=0 --seed=42
Accelerate Config File
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
gpu_ids: all # "0"
num_machines: 1
num_processes: 8 # 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Logs
No response
System Info
- 🤗 Diffusers version: 0.32.0.dev0
- Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.10
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.26.2
- Transformers version: 4.46.3
- Accelerate version: 1.2.0.dev0
- PEFT version: not installed
- Bitsandbytes version: not installed
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
- Using GPU in script?:
- Using distributed or parallel set-up in script?: