clip : refactor, add `image_manipulation` and `llava_uhd` classes #13011

ngxson · 2025-04-18T12:46:50Z

This PR partially resolves #12944, but the public API for counting tokens will be added in a follow-up PR (cc @mattjcly).

The current PR contains only internal changes; the public API is not touched.

Motivation

The idea for this refactoring comes from my blog post, where I documented the preprocessing step of vision models. This step is purely algorithmic. The "slicing" and "grid" systems are roughly analogous to the pre-tokenizer regex in text models.

Currently, only some vision architectures use this system:

minicpm-v: grid is dynamically calculated
llava-1.6: grid is decided using a list of grid pinpoints
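
To illustrate what "decided using a list of grid pinpoints" can look like, here is a minimal sketch of a selection heuristic in the spirit of LLaVA-style preprocessing: among the candidate resolutions, prefer the one that keeps the most original pixels after an aspect-preserving fit, and break ties by the least wasted area. The PR description does not spell out the exact rule, so `img_size` and `select_best_resolution` are hypothetical names used for illustration only and are not taken from the diff.

```cpp
#include <algorithm>
#include <climits>
#include <vector>

// Hypothetical size type for illustration; not the struct used in clip.cpp.
struct img_size {
    int width;
    int height;
};

// Pick the candidate that preserves the most of the original image after an
// aspect-preserving fit, breaking ties by the least wasted (padded) area.
// Assumes `candidates` is non-empty and all dimensions are positive.
inline img_size select_best_resolution(const img_size & original,
                                       const std::vector<img_size> & candidates) {
    img_size best = candidates.front();
    int best_effective = -1;
    int least_wasted   = INT_MAX;
    for (const img_size & c : candidates) {
        // scale factor that fits the original inside the candidate
        const float scale = std::min((float) c.width  / original.width,
                                     (float) c.height / original.height);
        const int fit_w = (int) (original.width  * scale);
        const int fit_h = (int) (original.height * scale);
        // upscaling adds no information, so cap at the original pixel count
        const int effective = std::min(fit_w * fit_h, original.width * original.height);
        const int wasted    = c.width * c.height - effective;
        if (effective > best_effective ||
            (effective == best_effective && wasted < least_wasted)) {
            best_effective = effective;
            least_wasted   = wasted;
            best           = c;
        }
    }
    return best;
}
```

For a 1024x512 input and candidates 256x512, 512x256, 512x512, 768x256 and 256x768, this sketch picks 512x256, since it matches the aspect ratio with no padding.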

Implementation

After this refactoring, the overall flow is:

Given an image, get_slice_instructions is called to decide how to slice the image.
After getting the instructions, the image_manipulation class is used to actually modify the image (crop, resize, pad). The output of this step is a list of slices.
The slices are converted to clip_image_f32 and normalized.
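
The first two steps can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `image_u8`, `slice_rect`, `make_grid_slices`, `crop` and `slice_image` are hypothetical stand-ins, the grid is a plain regular grid, and resize and pad are omitted to keep the example short.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; the real types live in clip.cpp.
struct image_u8 {
    int nx = 0;                // width in pixels
    int ny = 0;                // height in pixels
    std::vector<uint8_t> buf;  // interleaved RGB, nx * ny * 3 bytes
};

struct slice_rect {
    int x, y;  // top-left corner inside the source image
    int w, h;  // slice size in pixels
};

// Step 1: decide how to slice. Here: a regular grid_x by grid_y grid, where the
// last row/column absorbs any remainder so the whole image is covered.
// Assumes grid_x and grid_y are positive and no larger than nx and ny.
inline std::vector<slice_rect> make_grid_slices(int nx, int ny, int grid_x, int grid_y) {
    std::vector<slice_rect> out;
    const int sw = nx / grid_x;
    const int sh = ny / grid_y;
    for (int gy = 0; gy < grid_y; gy++) {
        for (int gx = 0; gx < grid_x; gx++) {
            const int x = gx * sw;
            const int y = gy * sh;
            const int w = (gx == grid_x - 1) ? nx - x : sw;
            const int h = (gy == grid_y - 1) ? ny - y : sh;
            out.push_back({x, y, w, h});
        }
    }
    return out;
}

// Step 2a: copy one rectangle out of the source image, row by row.
// Assumes the rectangle is non-empty and lies fully inside src.
inline image_u8 crop(const image_u8 & src, const slice_rect & r) {
    image_u8 dst;
    dst.nx = r.w;
    dst.ny = r.h;
    dst.buf.resize((size_t) r.w * r.h * 3);
    for (int row = 0; row < r.h; row++) {
        const uint8_t * from = &src.buf[((size_t) (r.y + row) * src.nx + r.x) * 3];
        uint8_t *       to   = &dst.buf[(size_t) row * r.w * 3];
        std::copy(from, from + (size_t) r.w * 3, to);
    }
    return dst;
}

// Step 2b: apply every instruction and collect the resulting slices.
inline std::vector<image_u8> slice_image(const image_u8 & src,
                                         const std::vector<slice_rect> & rects) {
    std::vector<image_u8> slices;
    slices.reserve(rects.size());
    for (const slice_rect & r : rects) {
        slices.push_back(crop(src, r));
    }
    return slices;
}
```

Keeping the "decide" step separate from the "apply" step means the slicing decision can be inspected on its own, without touching any pixel data.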

sequenceDiagram
    participant User
    participant clip_image_preprocess
    participant llava_uhd as get_slice_instructions
    participant slice_image as slice_image
    participant image_manipulation
    participant normalize as normalize_image_u8_to_f32
    
    User->>clip_image_preprocess: submit clip_image_u8
    clip_image_preprocess->>llava_uhd: request slicing instructions
    llava_uhd-->>clip_image_preprocess: return list of slice_instructions
    
    clip_image_preprocess->>slice_image: call with slice_instructions
    
    loop for each slice_instruction
        slice_image->>image_manipulation: crop, resize, pad image
        image_manipulation-->>slice_image: return processed image slice
    end
    
    slice_image-->>clip_image_preprocess: return std::vector<clip_image_u8_ptr>
    
    loop for each clip_image_u8_ptr
        clip_image_preprocess->>normalize: convert image
        normalize-->>clip_image_preprocess: return clip_image_f32
    end
    
    clip_image_preprocess-->>User: return list of clip_image_f32

Test

Test script confirms that this does not break anything:

OK:   llama-gemma3-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-llava-cli cmp-nct/Yi-VL-6B-GGUF:Q5_K
OK:   llama-llava-cli guinmoon/MobileVLM-3B-GGUF:Q4_K_M
OK:   llama-llava-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-llava-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-llava-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-llava-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-minicpmv-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-minicpmv-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-minicpmv-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-qwen2vl-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M

ngxson added 3 commits April 18, 2025 12:11

clip : refactor, add image_manipulation and llava_uhd

13e5f59

refactor llava-1.6 preprocessing

dd08673

simplify logic for llava-1.5

1395a4a

ngxson requested a review from ggerganov April 18, 2025 12:46

github-actions bot added the examples label Apr 18, 2025

ngxson changed the title ~~clip : refactor, add image_manipulation and llava_uhd~~ clip : refactor, add image_manipulation and llava_uhd classes Apr 18, 2025

missing include

c22806d

ggerganov approved these changes Apr 19, 2025

ngxson merged commit 37b9f0d into ggml-org:master Apr 19, 2025
48 of 51 checks passed

lcarrere mentioned this pull request May 1, 2025

Re-enable upscaling of images smaller than the CLIP input size; fix MiniCPM evaluation on small bitmaps #13237

Merged
