
Conversation

@yiyixuxu (Collaborator) commented on Jun 11, 2023

original repo: https://github.com/openai/shap-e

text-to-3D

import torch

from diffusers import ShapEPipeline

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

batch_size = 1
guidance_scale = 15.0
prompt = "a shark"

repo = "YiYiXu/shap-e"
pipe = ShapEPipeline.from_pretrained(repo)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(0)
images = pipe(
    prompt,
    num_images_per_prompt=batch_size,
    generator=generator,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
    output_type='pil').images

pipe.save_gif(images[0], "shark.gif")

[output GIF: yiyi_test_pipeline_example_1_out]

For comparison, generated from the original code:
[GIF: yiyi_run_model_decode_latent_images_1_out_0]
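
In case save_gif is not available in a given diffusers build, the rendered frames can also be written out with a small PIL-based helper (a sketch; images[0] is assumed to be the list of PIL frames for the first prompt, as in the save_gif call above, and later diffusers versions ship an equivalent export_to_gif utility in diffusers.utils):

def save_frames_as_gif(frames, path, duration=100):
    # frames: list of PIL images produced by the pipeline; duration is per-frame display time in ms
    frames[0].save(
        path,
        save_all=True,
        append_images=frames[1:],
        duration=duration,
        loop=0,
    )

save_frames_as_gif(images[0], "shark.gif")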

image-to-3D

from PIL import Image
import torch

from diffusers import ShapEImg2ImgPipeline

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

batch_size = 1
guidance_scale = 3.0

image = Image.open("corgi.png")

repo = "YiYiXu/shap-e-img2img"
pipe = ShapEImg2ImgPipeline.from_pretrained(repo)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(0)
images = pipe(
    image,
    num_images_per_prompt=batch_size,
    generator=generator,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
    output_type='pil').images

pipe.save_gif(images[0], "corgi_3d.gif")

Input image:
[image: corgi]

3D output:
[GIF: yiyi_test_pipeline_img2img_example_1_out]

As a reference, this is the 3D render generated with the original repo using the same inputs and seed:
[GIF: yiyi_run_model_img23d_out_0]
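
Depending on where the input image comes from, it can help to convert it to RGB (and optionally resize it) before passing it to the pipeline; the target size here is only an assumption and should match whatever the image encoder expects:

from PIL import Image

# normalize the input: RGB, square; 256x256 is an illustrative size
image = Image.open("corgi.png").convert("RGB").resize((256, 256))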

To-do:

  • refactor based on feedback
  • add image_to_3d
  • investigate numerical differences further (compare image quality)
  • tests & docs

adding conversion script

add pipeline

add step_index from pipeline, + remove permute

add zero pad token

remove copy from statement for betas_for_alpha_bar function
@yiyixuxu (Collaborator, Author) commented on Jun 12, 2023

[GIF: shark]

@patrickvonplaten
The generations seem maybe OK (I still need to compare more), but I've been really struggling to closely match the numerical outputs of the pipeline to the original repo. See below the testing script that compares pipeline outputs - it returns 0.07. We would normally want this number to be below 1e-3, no? Or should I wait until we add the decoder and compare the decoded outputs instead?

When I compared the model forward pass (see the equivalency test for the model forward pass below), the results matched nicely, with a max element difference below 1e-5, but I think the model is sensitive to differences in the text embedding inputs. E.g., if I run the same test with text embeddings generated from the original CLIP model vs. the transformers CLIP model that we use, the difference increases to 1e-3, and this discrepancy seems to be further amplified during the sampling process.

Not sure what to do here and would appreciate any feedback/advice :)

equivalency test for pipeline outputs

this script returns 0.07

import torch
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

batch_size = 4
guidance_scale = 15.
sigma_min = 1e-3
prompt = "a shark"

# diffusers 

from diffusers import ShapEPipeline
repo = "YiYiXu/shap-e"
pipe = ShapEPipeline.from_pretrained(repo)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(0)
latents_d = pipe(
    prompt,
    num_images_per_prompt=batch_size,
    generator=generator,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    sigma_min=sigma_min,
).latents

# original
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config

model = load_model('text300M', device=device)
diffusion = diffusion_from_config(load_config('diffusion'))

latents, _ = sample_latents(
    batch_size=batch_size,
    model=model,
    diffusion=diffusion,
    guidance_scale=guidance_scale,
    model_kwargs=dict(texts=[prompt] * batch_size),
    progress=True,
    clip_denoised=True,
    use_fp16=False,
    use_karras=True,
    karras_steps=64,
    sigma_min=sigma_min,
    sigma_max=160,
    s_churn=0,

)

# compare: max element-wise difference between original and diffusers latents
print("max diff latents:")
print(np.abs(latents.reshape(batch_size, 1024, 1024).detach().cpu().numpy() - latents_d.detach().cpu().numpy()).max())
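
Since the same max-absolute-difference check shows up in each of these tests, it could be pulled into a tiny helper (just a convenience sketch, not part of the PR):

import numpy as np
import torch

def max_abs_diff(a: torch.Tensor, b: torch.Tensor) -> float:
    # largest element-wise gap between two tensors (shapes must already match)
    return float(np.abs(a.detach().cpu().numpy() - b.detach().cpu().numpy()).max())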

equivalency test for model forward pass

  • TEST2 compares the model outputs with exactly the same inputs; the maximum element difference is 4.6790e-06
  • TEST1 compares the model outputs with embeddings generated by the original CLIP model (used by the original repo) vs. the transformers CLIP model (used by diffusers); the maximum element difference is 1e-3
  • I also compared the generated text embeddings themselves - the difference is 0.00019

import torch
import numpy as np
import clip

from diffusers.models.prior_transformer import PriorTransformer

from transformers import CLIPTextModelWithProjection, CLIPTokenizer
from shap_e.models.download import load_model

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# create original model
model = load_model('text300M', device=device)
transformer = model.wrapped

# create diffusers model
path_shape = "YiYiXu/shap-e"
transformer_d = PriorTransformer.from_pretrained(path_shape, subfolder="prior").to(device)

# inputs
batch_size = 1
torch.manual_seed(0)
x = torch.randn([batch_size, 1024, 1024], device=device)
t = torch.tensor([0] * batch_size, device=device)
prompt = ["a shark"] * batch_size

# create embeddings using original clip model
clip_name = "ViT-L/14"
download_root= "/home/yiyi_huggingface_co/shap-e/shap_e_model_cache"

clip_model, _ = clip.load(clip_name, device=device, download_root=download_root)
tokenize = clip.tokenize

embeddings = clip_model.encode_text(
        tokenize(list(prompt), truncate=True).to(device)
    ).float()
embeddings = embeddings / torch.linalg.norm(embeddings, dim=-1, keepdim=True)

# create embeddings using transformer clip 
repo = "openai/clip-vit-large-patch14"
d_text_encoder = CLIPTextModelWithProjection.from_pretrained(repo).to(device)
d_tokenizer = CLIPTokenizer.from_pretrained(repo)

tokens = d_tokenizer(
    prompt,
    padding="max_length",
    max_length=d_tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids
embeddings_d = d_text_encoder(tokens.to(device)).text_embeds.float()
embeddings_d = embeddings_d / torch.linalg.norm(embeddings_d, dim=-1, keepdim=True)

# compare the embeddings : 0.00019
print(f" compare embeddings: {np.abs(embeddings.detach().cpu().numpy() - embeddings_d.detach().cpu().numpy()).max()}")


# TEST1: compare the outputs using the respective embeddings: 0.0012

# original output
out = transformer(x, t, embeddings=embeddings)

# diffusers output (note the permute to match the layout expected by the diffusers PriorTransformer)
out_d = transformer_d(x.permute(0, 2, 1), 0, embeddings_d, return_dict=False)[0]

print(" ")
print(" test1 result")  # 0.0012
print((out_d.permute(0, 2, 1) - out).abs().max())

# TEST2: compare the outputs using the same (original CLIP) embeddings: 4.6790e-06

# original output
out = transformer(x, t, embeddings=embeddings)

# diffusers output, fed the exact same embeddings as the original model
out_d = transformer_d(x.permute(0, 2, 1), 0, embeddings, return_dict=False)[0]

print(" ")
print(" test2 result")  # 4.6790e-06
print((out_d.permute(0, 2, 1) - out).abs().max())

testing script: compare the pipeline output with sigma_min = 1

I also ran the equivalency test for the pipeline with sigmas from 160 to 1 (whereas the first pipeline test was run with the default sigma range 160 to 1e-3). This test returns 5e-4, so maybe the sampling becomes unstable when sigma gets really small.

import torch
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

batch_size = 4
guidance_scale = 15.
sigma_min = 1  # note: this test uses sigma_min = 1 instead of the default 1e-3
prompt = "a shark"

# diffusers 

from diffusers import ShapEPipeline
repo = "YiYiXu/shap-e"
pipe = ShapEPipeline.from_pretrained(repo)
pipe = pipe.to(device)

generator = torch.Generator(device=device).manual_seed(0)
latents_d = pipe(
    prompt,
    num_images_per_prompt=batch_size,
    generator=generator,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    sigma_min=sigma_min,
).latents

# create prompt_embeds from the diffusers pipeline and pass them to the original model
prompt_embeds = pipe._encode_prompt(
    [prompt], device, batch_size, True
)

# original
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config

model = load_model('text300M', device=device)
diffusion = diffusion_from_config(load_config('diffusion'))

latents, _ = sample_latents(
    batch_size=batch_size,
    model=model,
    diffusion=diffusion,
    guidance_scale=guidance_scale,
    model_kwargs=dict(embeddings=prompt_embeds[4:]),
    progress=True,
    clip_denoised=True,
    use_fp16=False,
    use_karras=True,
    karras_steps=64,
    sigma_min=sigma_min,
    sigma_max=160,
    s_churn=0,

)

# compare: max element-wise difference between original and diffusers latents
print("max diff latents:")
print(np.abs(latents.reshape(batch_size, 1024, 1024).detach().cpu().numpy() - latents_d.detach().cpu().numpy()).max())  # 0.00050234795
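
To put the sigma range in perspective, the tail of a Karras schedule with sigma_min = 1e-3 is orders of magnitude smaller than with sigma_min = 1, which would be consistent with the instability only showing up in the first test. A quick sketch using the standard Karras et al. (2022) formula (illustrative, not code from this PR):

import numpy as np

def karras_sigmas(n, sigma_min, sigma_max, rho=7.0):
    # standard Karras et al. (2022) noise schedule, from sigma_max down to sigma_min
    ramp = np.linspace(0, 1, n)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

print(karras_sigmas(64, 1e-3, 160)[-5:])  # last sigmas shrink toward 1e-3
print(karras_sigmas(64, 1.0, 160)[-5:])   # last sigmas stay around 1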

return t

# YiYi Notes: Taken from the original repo; will refactor to not introduce a dependency on scipy
def _sigma_to_t_yiyi(self, sigma):
Contributor:
Sounds good!
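
For reference, one scipy-free way to do this sigma-to-timestep inversion is to interpolate over the log-sigma schedule with numpy (an illustrative sketch with made-up argument names, not the code that landed in the PR):

import numpy as np

def sigma_to_t(sigma, sigmas, timesteps):
    # sigmas: decreasing 1-D array of noise levels aligned with timesteps
    # np.interp expects increasing x, so reverse both arrays
    log_sigmas = np.log(sigmas)
    return np.interp(np.log(sigma), log_sigmas[::-1], timesteps[::-1])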

@patrickvonplaten (Contributor)

In general the design looks good to me! I just noticed that we don't have any prior transformer tests, so I added them here: #3796. That PR also allows disabling the PT 2.0 attention processor, which should help with precision issues.

Could you maybe merge #3796 into your PR once it's merged, and then use set_default_attn_processor() to improve precision in the tests?

Think we're on a good way here to have a powerful new model class in diffusers 🚀
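
For example, a precision-sensitive test could fall back to the vanilla attention processor roughly like this (a sketch, assuming the PriorTransformer exposes set_default_attn_processor() like other diffusers models after #3796):

from diffusers.models.prior_transformer import PriorTransformer

prior = PriorTransformer.from_pretrained("YiYiXu/shap-e", subfolder="prior")
# drop back to the default (non-PT-2.0) attention processor for reproducible numerics
prior.set_default_attn_processor()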

@HuggingFaceDocBuilderDev commented on Jun 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@patrickvonplaten changed the title from "[WIP]add Shap-E" to "Add Shap-E" on Jul 6, 2023
Comment on lines +571 to +573
255.0,
255.0,
255.0,
Member:
👍

@patrickvonplaten patrickvonplaten merged commit 45f6d52 into main Jul 6, 2023
@patrickvonplaten patrickvonplaten deleted the shap-ee branch July 6, 2023 13:20