Skip to content
/ IMG Public

Source code of our ACM MM'25 accepted paper "Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval".

Notifications You must be signed in to change notification settings

HuiGuanLab/IMG

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval

Repo for ACM MM'25 paper "Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval". This paper proposes solutions for the Video Moment Retrieval task from an audio-visual collaborative perspective.

You can find our pre-print paper from arxiv.

framework

Table of Contents

Set up the Environment

  • Ubuntu 20.04
  • CUDA 12.0
  • Python 3.7
  • torch 1.13.1
  • torchvision 0.7.0

Use Anaconda and easily build up the required environment by

cd IMG
conda env create -f env.yml

Data Preparation

Follow previous work ADPN, we use GloVe-840B-300d for text embeddings, I3D/CLIP+SF/InternVideo2 visual features and PANNs audio features for Charades-STA dataset, and I3D visual features and VGGish audio features for ActivityNet Captions dataset.

We have also prepared CLIP + SlowFast and InternVideo2 features, CLIP features are extracted by ourselves, SlowFast features is derived from here, while InternVideo2 features is derived from here.

We already prepare all data below.

Download here to get Charades-STA features and Activitynet-Caption's audio features and json files.

Download here to get Activitynet-Captions i3d features and GloVe embeddings, touch IMG/data/features, and ensure the following directory structure.

|--data
|  |--dataset
|     |--activitynet
|     |     |--train_qid.json
|     |     |--val_1_qid.json
|     |     |--val_2_qid.json
|     |--charades
|     |     |--charades_sta_test_qid.txt
|     |     |--charades_sta_train_qid.txt
|     |     |--charades.json
|     |     |--charades_audiomatter_qid.txt
|     |     |--charades_sta_train_tvr_format.jsonl
|     |     |--charades_sta_test_tvr_format.jsonl
|     |     |--charades_audiomatter_test_tvr_format.jsonl
|  |--features
|     |--activitynet
|     |     |--audio
|     |     |     |--VGGish.pickle
|     |     |--i3d_video
|     |     |     |--feature_shapes.json
|     |     |     |--v___c8enCfzqw.npy
|     |     |     |--...(*.npy)
|     |--charades
|     |     |--audio
|     |     |     |--0A8CF.npy
|     |     |     |--...(*.npy)
|     |     |--i3d_video
|     |     |     |--feature_shapes.json
|     |     |     |--0A8CF.npy
|     |     |     |--...(*.npy)
|     |     |--clip_features
|     |     |     |--visual_features
|     |     |     |     |--0A8CF.npy
|     |     |     |     |--...(*.npy)
|     |     |     |--slowfast_features
|     |     |     |     |--0A8CF.npz
|     |     |     |     |--...(*.npz)
|     |     |     |--text_features
|     |     |     |     |--qid_0.npy
|     |     |     |     |--...(*.npy)
|     |     |--iv2_features
|     |     |     |--visual_features_6b
|     |     |     |     |--0A8CF.pt
|     |     |     |     |--...(*.pt)
|     |     |     |--llama2_txt
|     |     |     |     |--qid0.pt
|     |     |     |     |--...(*.pt)

Training

python main.py --task <charades|activitynet|charadesAM> --mode train --gpu_idx <GPU INDEX>

Inference

python main.py --task <charades|activitynet|charadesAM> --mode test --gpu_idx <GPU INDEX>

Change the config model_name in main.py to the model_name of your checkpoint.

Acknowledgement

We follow the repo ADPN and VSLNet for the code-running framework to quickly implement our work. We appreciate these great jobs.

This work was supported by the Pioneer and Leading Goose R&D Program of Zhejiang (No. 2024C01110), National Natural Science Foundation of China (No. 62472385), Young Elite Scientists Sponsorship Program by China Association for Science and Technology (No. 2022QNRC001), Public Welfare Technology Research Project of Zhejiang Province (No. LGF21F020010), Fundamental Research Funds for the Provincial Universities of Zhejiang (No. FR2402ZD) and Zhejiang Provincial High-Level Talent Special Support Program.

Citation

If you find IMG useful for your project or research, welcome to 🌟 this repo.

About

Source code of our ACM MM'25 accepted paper "Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval".

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages