Issues with DPOTrainer and Qwen2-VL processor #2660

@baichuanzhou

Description

Hey guys, I am digging into the DPO implementation for VLMs and encountered this issue.
Here, in the `process_row` function:

```python
processor, tokenizer = processing_class, processing_class.tokenizer  # the processing class is a processor
processed_features = processor(images=features["images"], text=features["prompt"], add_special_tokens=False)

prompt_input_ids = processed_features["input_ids"][0]
pixel_values = processed_features["pixel_values"][0]
```

images are turned into `pixel_values` by indexing the first element of the returned `pixel_values`. (I assume this is because the dataset format requires that each input contain only one image.) However, after playing with Qwen2-VL's processor, I found that it always returns a 2D tensor, which makes indexing the first element essentially select only the first row of the pixel values.

If that is the case, I don't think `pixel_values` should be handled this way.

Here's the code I used to test Qwen2-VL's processor:

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained('Qwen/Qwen2-VL-7B-Instruct')
image = Image.open('some image')
outputs = processor(images=[image, image], text="<image><image>", return_tensors="pt")
print(outputs.pixel_values.size())
# always a 2D tensor; in my case it was torch.Size([3128, 1176])
```

And here I checked Qwen2-VL's preprocessing logic; it seems the processor always returns a flattened 2D tensor of patch rows.

Wouldn't indexing a 2D image tensor be a problem?
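To make the concern concrete, here is a small shape-only sketch (NumPy arrays stand in for torch tensors, and the 4D "per-image" layout is an assumption about how other image processors batch their outputs, not taken from the TRL code):

```python
import numpy as np

# A per-image processor typically returns a 4D batch:
# (num_images, channels, height, width), so [0] selects the first whole image.
standard_pixel_values = np.zeros((2, 3, 336, 336))
print(standard_pixel_values[0].shape)  # (3, 336, 336) - one complete image

# Qwen2-VL's processor instead flattens all images into patch rows:
# (total_num_patches, patch_dim), e.g. (3128, 1176) as observed above.
qwen_pixel_values = np.zeros((3128, 1176))
print(qwen_pixel_values[0].shape)  # (1176,) - a single patch row, not an image
```

With the flattened layout, `pixel_values[0]` silently discards everything after the first patch row instead of selecting an image.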

Labels: 🏋 DPO (Related to DPO), 🐛 bug (Something isn't working)
