clip : refactor, add image_manipulation and llava_uhd classes
          #13011
        
          
      
This PR partially resolves #12944; the public API for counting tokens will be added in a follow-up PR (cc @mattjcly).
This PR contains only internal changes; the public API is not touched.
Motivation
The idea of this refactoring comes from my blog post where I documented the preprocessing step of vision models. This step is purely algorithmic. The "slicing" and "grid" systems are somewhat equivalent to pre-tokenizer regex in text models.
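To make "purely algorithmic" concrete, here is a minimal sketch of a grid-slicing planner. It is an illustration under assumptions, not the code in this PR: `slice_rect`, `slice_plan` and `plan_grid_slices` are hypothetical names, and the uniform-grid rule is a simplification, not the llava_uhd algorithm. The point is only that the slice layout is a deterministic function of the image size, much like a pre-tokenizer regex is a deterministic function of the input text.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical types, for illustration only (not the structs used in clip.cpp).
struct slice_rect {
    int x, y; // top-left corner of the crop in the source image
    int w, h; // size of the crop
};

struct slice_plan {
    int grid_cols;
    int grid_rows;
    std::vector<slice_rect> slices; // row-major order
};

// Simplified rule (an assumption, NOT the llava_uhd algorithm): cover the
// image with a uniform grid whose cells are at most `max_slice` pixels wide
// and tall. No model weights are involved; the result depends only on sizes.
static slice_plan plan_grid_slices(int img_w, int img_h, int max_slice) {
    assert(img_w > 0 && img_h > 0 && max_slice > 0);

    slice_plan plan;
    plan.grid_cols = std::max(1, (img_w + max_slice - 1) / max_slice); // ceil
    plan.grid_rows = std::max(1, (img_h + max_slice - 1) / max_slice); // ceil

    for (int r = 0; r < plan.grid_rows; r++) {
        for (int c = 0; c < plan.grid_cols; c++) {
            // Integer cell boundaries, chosen so the cells tile the image
            // exactly (no gaps, no overlap) even when sizes do not divide.
            const int x0 = img_w * c       / plan.grid_cols;
            const int x1 = img_w * (c + 1) / plan.grid_cols;
            const int y0 = img_h * r       / plan.grid_rows;
            const int y1 = img_h * (r + 1) / plan.grid_rows;
            plan.slices.push_back({x0, y0, x1 - x0, y1 - y0});
        }
    }
    return plan;
}
```

With this simplified rule, a 1000x600 image and `max_slice = 448` give a 3x2 grid, i.e. six slices that together cover the whole image.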
Currently, only some vision architectures use this system:
Implementation
After this refactoring, the overall flow is:
1. get_slice_instructions is called to decide how to slice the image
2. The image_manipulation class is used to actually modify the image (crop, resize, pad); the output of this step is the slices
3. The slices are converted to clip_image_f32 and normalized

sequenceDiagram
    participant User
    participant clip_image_preprocess
    participant llava_uhd as get_slice_instructions
    participant slice_image as slice_image
    participant image_manipulation
    participant normalize as normalize_image_u8_to_f32
    User->>clip_image_preprocess: submit clip_image_u8
    clip_image_preprocess->>llava_uhd: request slicing instructions
    llava_uhd-->>clip_image_preprocess: return list of slice_instructions
    clip_image_preprocess->>slice_image: call with slice_instructions
    loop for each slice_instruction
        slice_image->>image_manipulation: crop, resize, pad image
        image_manipulation-->>slice_image: return processed image slice
    end
    slice_image-->>clip_image_preprocess: return std::vector<clip_image_u8_ptr>
    loop for each clip_image_u8_ptr
        clip_image_preprocess->>normalize: convert image
        normalize-->>clip_image_preprocess: return clip_image_f32
    end
    clip_image_preprocess-->>User: return list of clip_image_f32

Test
The test script confirms that this does not break anything: