@ngxson ngxson commented Apr 18, 2025

This PR partially resolves #12944; the public API for counting tokens will be added in a follow-up PR (cc @mattjcly).

This PR contains internal changes only; the public API is not touched.

Motivation

The idea of this refactoring comes from my blog post where I documented the preprocessing step of vision models. This step is purely algorithmic. The "slicing" and "grid" systems are somewhat equivalent to pre-tokenizer regex in text models.

Currently, only some vision architectures use this system:

  • minicpm-v: grid is dynamically calculated
  • llava-1.6: grid is decided using a list of grid pinpoints
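
To make the "grid pinpoints" idea concrete, here is a minimal sketch of how a best-fitting resolution could be picked from a list of candidates. It follows the selection rule I understand the LLaVA reference preprocessing to use (keep the most effective pixels, then waste the least area); treat that as an assumption rather than a description of the code in this PR. `size2d` and `select_best_resolution` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Hypothetical type for illustration; not the struct used in clip.cpp.
struct size2d {
    int width;
    int height;
};

// Pick the candidate resolution that preserves the most of the original image
// after an aspect-ratio-preserving fit, breaking ties by least wasted area.
size2d select_best_resolution(const size2d & original,
                              const std::vector<size2d> & candidates) {
    assert(!candidates.empty());
    assert(original.width > 0 && original.height > 0);

    size2d    best           = candidates.front();
    long long best_effective = -1;
    long long min_wasted     = LLONG_MAX;

    for (const size2d & c : candidates) {
        // Scale factor that fits the image inside the candidate without distortion.
        const double scale = std::min(
            static_cast<double>(c.width)  / original.width,
            static_cast<double>(c.height) / original.height);
        const long long fit_w = static_cast<long long>(original.width  * scale);
        const long long fit_h = static_cast<long long>(original.height * scale);

        // Upscaling adds no information, so cap at the original pixel count.
        const long long effective = std::min(
            fit_w * fit_h,
            static_cast<long long>(original.width) * original.height);
        const long long wasted =
            static_cast<long long>(c.width) * c.height - effective;

        if (effective > best_effective ||
            (effective == best_effective && wasted < min_wasted)) {
            best_effective = effective;
            min_wasted     = wasted;
            best           = c;
        }
    }
    return best;
}
```

For a 512x256 image and the candidates 256x512, 512x256, 512x512 and 768x256, this picks 512x256: three candidates keep every original pixel, and 512x256 is the one that wastes no area.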

Implementation

After this refactoring, the overall flow is:

  • Given an image, get_slice_instructions is called to decide how to slice the image
  • After getting the instructions, we use the image_manipulation class to actually modify the image (crop, resize, pad); the output of this step is a list of slices
  • Convert slices to clip_image_f32 and normalize them
sequenceDiagram
    participant User
    participant clip_image_preprocess
    participant llava_uhd as get_slice_instructions
    participant slice_image as slice_image
    participant image_manipulation
    participant normalize as normalize_image_u8_to_f32
    
    User->>clip_image_preprocess: submit clip_image_u8
    clip_image_preprocess->>llava_uhd: request slicing instructions
    llava_uhd-->>clip_image_preprocess: return list of slice_instructions
    
    clip_image_preprocess->>slice_image: call with slice_instructions
    
    loop for each slice_instruction
        slice_image->>image_manipulation: crop, resize, pad image
        image_manipulation-->>slice_image: return processed image slice
    end
    
    slice_image-->>clip_image_preprocess: return std::vector<clip_image_u8_ptr>
    
    loop for each clip_image_u8_ptr
        clip_image_preprocess->>normalize: convert image
        normalize-->>clip_image_preprocess: return clip_image_f32
    end
    
    clip_image_preprocess-->>User: return list of clip_image_f32
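
The three-stage flow above (decide the slices, apply them, normalize) can be sketched in a few self-contained functions. This is only an illustration under simplifying assumptions: the grid is a fixed cols x rows split, only cropping is shown (no resize or pad), and every name here (`image_u8`, `image_f32`, `slice_coords`, `make_grid_instructions`, `crop`, `normalize`, `preprocess`) is a hypothetical stand-in rather than the definitions in clip.cpp.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// All names below are hypothetical stand-ins, not the definitions in clip.cpp.
struct image_u8  { int nx = 0; int ny = 0; std::vector<uint8_t> buf; }; // RGB, row-major
struct image_f32 { int nx = 0; int ny = 0; std::vector<float>   buf; };
struct slice_coords { int x, y, w, h; };

// Step 1: decide how to slice. Assumption: a fixed cols x rows grid; real
// models derive the grid from the image size or from a pinpoint list.
std::vector<slice_coords> make_grid_instructions(const image_u8 & img, int cols, int rows) {
    assert(cols > 0 && rows > 0 && img.nx % cols == 0 && img.ny % rows == 0);
    const int w = img.nx / cols;
    const int h = img.ny / rows;
    std::vector<slice_coords> out;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            out.push_back({c * w, r * h, w, h});
        }
    }
    return out;
}

// Step 2: apply one instruction. Only cropping is shown here.
image_u8 crop(const image_u8 & src, const slice_coords & s) {
    assert(s.x >= 0 && s.y >= 0 && s.x + s.w <= src.nx && s.y + s.h <= src.ny);
    image_u8 dst;
    dst.nx = s.w;
    dst.ny = s.h;
    dst.buf.resize(static_cast<size_t>(3) * s.w * s.h);
    for (int y = 0; y < s.h; ++y) {
        for (int x = 0; x < s.w; ++x) {
            const size_t src_px = static_cast<size_t>(s.y + y) * src.nx + (s.x + x);
            const size_t dst_px = static_cast<size_t>(y) * s.w + x;
            for (int ch = 0; ch < 3; ++ch) {
                dst.buf[3 * dst_px + ch] = src.buf[3 * src_px + ch];
            }
        }
    }
    return dst;
}

// Step 3: convert u8 to f32 in [0, 1], then apply per-channel mean/std.
image_f32 normalize(const image_u8 & src,
                    const std::array<float, 3> & mean,
                    const std::array<float, 3> & stddev) {
    image_f32 dst;
    dst.nx = src.nx;
    dst.ny = src.ny;
    dst.buf.resize(src.buf.size());
    for (size_t i = 0; i < src.buf.size(); ++i) {
        const size_t ch = i % 3;
        dst.buf[i] = (src.buf[i] / 255.0f - mean[ch]) / stddev[ch];
    }
    return dst;
}

// Full flow: instructions -> u8 slices -> normalized f32 slices.
std::vector<image_f32> preprocess(const image_u8 & img, int cols, int rows,
                                  const std::array<float, 3> & mean,
                                  const std::array<float, 3> & stddev) {
    std::vector<image_f32> out;
    for (const slice_coords & s : make_grid_instructions(img, cols, rows)) {
        out.push_back(normalize(crop(img, s), mean, stddev));
    }
    return out;
}
```

Keeping the slicing decision separate from the pixel manipulation means the number and size of slices can be computed from the image dimensions alone, which is what a token-counting API would need.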

Test

The test script confirms that this change does not break any of the existing models:

OK:   llama-gemma3-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-llava-cli cmp-nct/Yi-VL-6B-GGUF:Q5_K
OK:   llama-llava-cli guinmoon/MobileVLM-3B-GGUF:Q4_K_M
OK:   llama-llava-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-llava-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-llava-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-llava-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-minicpmv-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-minicpmv-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-minicpmv-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-qwen2vl-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M

@ngxson ngxson requested a review from ggerganov April 18, 2025 12:46
@ngxson ngxson changed the title from "clip : refactor, add image_manipulation and llava_uhd" to "clip : refactor, add image_manipulation and llava_uhd classes" Apr 18, 2025
@ngxson ngxson merged commit 37b9f0d into ggml-org:master Apr 19, 2025
48 of 51 checks passed
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
…ml-org#13011)

* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
…ml-org#13011)

* clip : refactor, add `image_manipulation` and `llava_uhd`

* refactor llava-1.6 preprocessing

* simplify logic for llava-1.5

* missing include