HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model (ACL 2025 Main)

🤗 Dataset (HuggingFace) 📑 Paper (arXiv:2503.12941)

This repo is the official implementation of the ACL 2025 paper: HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model.

HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

Haiyang Guo*, Fanhu Zeng*, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu

[Figure: HiDe-LLaVA framework overview]

News

Abstract

Instruction tuning is widely used to enhance a pre-trained Multimodal Large Language Model (MLLM) to understand and follow human instructions by training it on a curated set of task-specific datasets. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLMs with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods.

Installation

The installation of our environment is the same as that of CoIN.

conda create -n hide python=3.10 -y
conda activate hide
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

To measure the metrics for caption tasks, please install the following three packages:

pip install nltk==3.9.1
pip install pycocotools==2.0.8
pip install pycocoevalcap==1.2

We recommend replacing the eval.py file under /envs/hide/lib/python3.10/site-packages/pycocoevalcap/ in your environment with the eval.py provided in this repository, to avoid unwanted error reports and time overhead.
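
A minimal sketch of that replacement, assuming a conda installation under ~/anaconda3 and that the provided eval.py sits at the repository root (adjust both paths to your setup):

# Back up the evaluation script shipped with pycocoevalcap, then drop in the patched version.
PYCOCO_DIR=~/anaconda3/envs/hide/lib/python3.10/site-packages/pycocoevalcap
cp "$PYCOCO_DIR/eval.py" "$PYCOCO_DIR/eval.py.bak"
cp ./eval.py "$PYCOCO_DIR/eval.py"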

Technical issues can be reported and addressed through the official GitHub issue trackers for both projects: CoIN and LLaVA.

UCIT Benchmark

Please download the images from the constituent datasets:

Image Source    Download Path
ArxivQA         images
ImageNet-R      images
IconQA          images
CLEVR-Math      images
VizWiz          images
Flickr30k       images

After downloading all of them, organize the data as follows:

|-- datasets
    |-- ArxivQA
        |-- images/
    |-- CLEVR
        |-- images
            |-- train/
            |-- test/
            |-- val/
    |-- Flickr30k
        |-- train/
        |-- val/
    |-- IconQA
        |-- iconqa_data/
            |-- iconqa/
    |-- ImageNet-R
        |-- train/
        |-- test/
    |-- VizWiz
        |-- train/
        |-- test/
        |-- val/
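
The expected skeleton can be created up front with a short shell snippet; this is only a convenience sketch, and the actual image folders still come from the downloads listed above:

# Create the directory skeleton, then copy or symlink the downloaded image folders into place.
mkdir -p datasets/ArxivQA/images \
         datasets/CLEVR/images/{train,test,val} \
         datasets/Flickr30k/{train,val} \
         datasets/IconQA/iconqa_data/iconqa \
         datasets/ImageNet-R/{train,test} \
         datasets/VizWiz/{train,test,val}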

Please download the instruction files from our HuggingFace page, then organize them as follows:

|-- instructions
    |-- ArxivQA
        |-- test_3000.json
        |-- train_4w.json
    |-- CLEVR
        |-- test_3000.json
        |-- train_4w.json
    |-- Flickr30k
        |-- test_3000.json
        |-- train_brief_4w.json
        |-- val_coco_type_3000.json
    |-- IconQA
        |-- test_3000.json
        |-- train.json
    |-- ImageNet-R
        |-- test_3000.json
        |-- train.json
    |-- VizWiz
        |-- test_3000.json
        |-- train.json
        |-- val_coco_type_3000.json
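
One way to fetch the instruction files is with huggingface-cli; the dataset repository id below is a placeholder, so substitute the id shown on our HuggingFace page:

pip install -U "huggingface_hub[cli]"
# <HF_DATASET_ID> is a placeholder for the dataset repository id on our HuggingFace page.
huggingface-cli download <HF_DATASET_ID> --repo-type dataset --local-dir ./instructions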

Pre-trained Weights

Please download LLaVA and CLIP, and use the config.json provided in this repository to replace the original config.json in LLaVA.
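
A hedged example of fetching the backbones with huggingface-cli; the checkpoint ids below (liuhaotian/llava-v1.5-7b and openai/clip-vit-large-patch14-336) are assumptions, so use whichever LLaVA and CLIP versions your experiments target:

# Model ids are assumptions; pick the LLaVA / CLIP checkpoints you actually use.
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir ./checkpoints/llava-v1.5-7b
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir ./checkpoints/clip-vit-large-patch14-336
# Overwrite the original LLaVA config with the one provided here (assuming it sits at the repository root).
cp ./config.json ./checkpoints/llava-v1.5-7b/config.json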

Training and Evaluation

Once the data and instructions are organized and placed correctly, you can train the model by running ./scripts/CoIN/Train_UCIT/train_all.sh. After training is completed, you can evaluate the performance by running ./scripts/CoIN/Eval_UCIT/Eval_all.sh. Be careful to modify the paths in all .sh files to your own actual paths.
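
A typical end-to-end run then looks like the following; the paths inside the scripts still need to be edited to your local dataset, instruction, and checkpoint locations:

conda activate hide
# Sequentially train on all UCIT tasks.
bash ./scripts/CoIN/Train_UCIT/train_all.sh
# Evaluate the resulting checkpoints on every task.
bash ./scripts/CoIN/Eval_UCIT/Eval_all.sh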

Citation

@article{guo2025hide,
  title={Hide-llava: Hierarchical decoupling for continual instruction tuning of multimodal large language model},
  author={Guo, Haiyang and Zeng, Fanhu and Xiang, Ziwei and Zhu, Fei and Wang, Da-Han and Zhang, Xu-Yao and Liu, Cheng-Lin},
  journal={arXiv preprint arXiv:2503.12941},
  year={2025}
}

Acknowledgement

This repository is built upon the LLaVA and CoIN projects. We sincerely thank the authors for their valuable contributions to the research community.
