Flux ControlNet Training Multi-GPU DeepSpeed Stage-3 doesn't reduce memory compared to Single GPU #10026

@enesmsahin

Description

Describe the bug

I am running a slightly modified version of the Flux ControlNet training script in diffusers; the script is linked below. I am using DeepSpeed ZeRO Stage-3 with the accelerate config below.

When I use only 1 GPU (configured via the accelerate config file below), training takes around 42 GB. When I use all 8 GPUs in a single node, it still takes around 42 GB per GPU.

I am not familiar with the parallelization details of DeepSpeed, but I would expect ZeRO Stage-3 to shard the model states across the 8 GPUs and reduce per-GPU memory usage compared to the single-GPU case.
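
For what it's worth, DeepSpeed ships a helper that estimates the per-GPU memory ZeRO-3 should need for a given model and GPU count; comparing its 1-GPU vs. 8-GPU estimates against the observed 42 GB might help localize the problem. A rough sketch (the from_transformer arguments mirror the training script's flags and are assumptions on my part):

import torch
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live
from diffusers import FluxControlNetModel, FluxTransformer2DModel

# Recreate the trainable ControlNet the way the training script does
# (--num_double_layers=4 --num_single_layers=0).
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)
controlnet = FluxControlNetModel.from_transformer(
    transformer, num_layers=4, num_single_layers=0
)

# Prints the estimated per-GPU memory for params/grads/optimizer states
# under ZeRO-3, with and without offload, for the given GPU count.
estimate_zero3_model_states_mem_needs_all_live(controlnet, num_gpus_per_node=8, num_nodes=1)

Note the estimator only covers the model passed to it; whatever the frozen FLUX transformer, VAE, and text encoders occupy comes on top and, if they are not sharded, would be identical for 1 and 8 GPUs.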

PS: I am not sure whether this issue stems from the ControlNet training script in diffusers or from accelerate itself, so I have opened the same issue in the accelerate repo.

Reproduction

Link to the script: https://pastebin.com/SdQZcQR8

Command used to run the script:

accelerate launch --config_file "./default_config_fsdp.yaml" train_controlnet_flux_minimum_working_example.py \
    --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
    --dataset_name=fusing/fill50k \
    --conditioning_image_column=conditioning_image \
    --image_column=image \
    --caption_column=text \
    --output_dir="./training_output/" \
    --mixed_precision="bf16" \
    --resolution=512 \
    --learning_rate=1e-5 \
    --max_train_steps=15000 \
    --validation_steps=100 \
    --checkpointing_steps=1 \
    --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
    --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 \
    --report_to="wandb" \
    --num_double_layers=4 \
    --num_single_layers=0 \
    --seed=42

Accelerate Config File

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
gpu_ids: all # "0"
num_machines: 1
num_processes: 8 # 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
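
For debugging, it may also help to confirm at runtime which modules are actually ZeRO-3 partitioned. As far as I understand, only the model passed through accelerator.prepare gets wrapped by DeepSpeed; frozen components kept outside prepare stay fully replicated on every GPU, which would explain a per-GPU footprint that doesn't change with GPU count. A small sketch (variable names like flux_controlnet/flux_transformer follow the training script; ds_numel is the attribute DeepSpeed attaches to partitioned parameters):

import torch

def report_sharding(tag, model):
    # ZeRO-3 replaces partitioned parameters with empty placeholders and
    # records the true element count in p.ds_numel; parameters of modules
    # that were never prepared by DeepSpeed lack this attribute.
    total, partitioned = 0, 0
    for p in model.parameters():
        total += 1
        if hasattr(p, "ds_numel"):
            partitioned += 1
    print(f"{tag}: {partitioned}/{total} parameters ZeRO-3 partitioned")

# Right after accelerator.prepare(...) in the training script:
# report_sharding("controlnet", flux_controlnet)    # expect: all partitioned
# report_sharding("transformer", flux_transformer)  # frozen; likely none

# Peak memory actually allocated by tensors on this rank:
print(f"max allocated: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")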

Logs

No response

System Info

  • 🤗 Diffusers version: 0.32.0.dev0
  • Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
  • Running on Google Colab?: No
  • Python version: 3.10.10
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.26.2
  • Transformers version: 4.46.3
  • Accelerate version: 1.2.0.dev0
  • PEFT version: not installed
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.5
  • xFormers version: not installed
  • Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@PromeAIpro @sayakpaul
