Conversation

patil-suraj (Contributor) commented Oct 10, 2022

Currently Stable Diffusion (and diffusers in general) doesn't work with bf16, because nearest upsampling in torch is not supported for bf16.

Minimal code to reproduce:

import torch
import torch.nn.functional as F

image = torch.randn(1, 4, 32, 32).to(device="cuda", dtype=torch.bfloat16)
# this raises an error because nearest upsampling has no bf16 kernel
out = F.interpolate(image, size=(64, 64), mode="nearest")

Additionally, in the pipelines we need to cast the images to fp32, since bf16 is not yet supported in numpy.
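
For example, the conversion at the end of the pipeline would look roughly like this (a minimal sketch of the pattern, not the exact pipeline code; the image tensor here is just a stand-in for the decoded VAE output):

import torch

image = torch.rand(1, 3, 512, 512, device="cuda", dtype=torch.bfloat16)
# numpy has no bfloat16 dtype, so cast to fp32 before converting
image_np = image.float().cpu().permute(0, 2, 3, 1).numpy()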

This is a draft PR that enables bf16 training/inference for Stable Diffusion by casting inputs to fp32 where bf16 is not supported.
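
The idea, roughly, is the following (a sketch of the approach with a hypothetical helper name; the actual change lives in the upsampling blocks):

import torch
import torch.nn.functional as F

def upsample_nearest_bf16_safe(hidden_states, scale_factor=2.0):
    # nearest upsampling has no bf16 kernel, so round-trip through fp32
    dtype = hidden_states.dtype
    if dtype == torch.bfloat16:
        hidden_states = hidden_states.to(torch.float32)
    hidden_states = F.interpolate(hidden_states, scale_factor=scale_factor, mode="nearest")
    return hidden_states.to(dtype)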

Not sure if this is the right way; curious to hear your feedback @patrickvonplaten @NouamaneTazi

fixes #771

HuggingFaceDocBuilderDev commented Oct 10, 2022

The documentation is not available anymore as the PR was closed or merged.

cos_dist = cosine_distance(image_embeds, self.concept_embeds).cpu()

# cast to float32 as numpy does not support bfloat16
if image_embeds.dtype == torch.bfloat16:
patrickvonplaten (Contributor) commented:

I don't think we need an if statement here, as .float() should always work and be correct (e.g. we can't do fp16 on CPU either, and calling numpy() will move it to fp32 anyway).
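
i.e. something along these lines (a sketch of the suggestion, assuming the result is converted to numpy right after):

# .float() is a no-op for fp32 and handles fp16/bf16 uniformly, so no dtype check is needed
cos_dist = cosine_distance(image_embeds, self.concept_embeds).float().cpu().numpy()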

NouamaneTazi (Member) replied:

@patrickvonplaten why would numpy() move the arrays to fp32? I thought that numpy arrays support fp16?
I believe we should use .half() instead, which should also work on CPU?
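
For reference, a quick check of what .numpy() does with the two dtypes (a small sketch to illustrate the question, not code from this PR):

import torch

x_fp16 = torch.randn(2, 2).half()
print(x_fp16.numpy().dtype)  # float16 -- numpy does support fp16

x_bf16 = torch.randn(2, 2).bfloat16()
# x_bf16.numpy() fails, since numpy has no bfloat16 dtype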

NouamaneTazi (Member) left a comment:

Unfortunately, some of these precision modifications will add more unrolled_elementwise_kernel<direct_copy_kernel_cuda> kernels, but at least they don't require a CPU-GPU sync and don't happen inside a loop. So LGTM :-)

@patil-suraj patil-suraj marked this pull request as ready for review October 11, 2022 09:34
@patil-suraj patil-suraj merged commit 797b290 into main Oct 11, 2022
@patil-suraj patil-suraj deleted the sd-bf16 branch October 11, 2022 10:02
prathikr pushed a commit to prathikr/diffusers that referenced this pull request Oct 26, 2022
* support bf16 for stable diffusion

* fix typo

* address review comments

PhaneeshB pushed a commit to nod-ai/diffusers that referenced this pull request Mar 1, 2023

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* support bf16 for stable diffusion

* fix typo

* address review comments
Successfully merging this pull request may close these issues.

BF16 doesn't work with dreambooth
