StreamingVLM enables real-time, stable understanding of effectively infinite video streams by keeping a compact KV cache and aligning training with streaming inference. It avoids both the quadratic cost of full attention and the pitfalls of sliding windows, runs at up to 8 FPS on a single H100, and achieves a 66.18% win rate against GPT-4o mini on a new long-video benchmark. It also improves general VQA performance without task-specific fine-tuning. Skim this section first for a quick overview.
Visit streamingvlm.hanlab.ai to see more examples and try our model.
Demo video: StreamingVLM_Realtime_Generation.mp4
./scripts/env_infer.sh
./scripts/env_sft.sh
You can set up the environments by running the scripts above.
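As an optional sanity check (assuming the scripts create Conda environments named streamingvlm-infer and streamingvlm-sft, the names used throughout this README), you can confirm the environments exist:
# List Conda environments and filter for the ones created above
conda env list | grep streamingvlm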
You can run inference with the commands below.
conda activate streamingvlm-infer
python streaming_vlm/inference/inference.py
First, download mit-han-lab/Inf-Stream-Train to /path/to/your/Inf-Stream-Train.
Then, download chenjoya/Live-WhisperX-526K to /path/to/your/Inf-Stream-Train/Livecc_sft.
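Both datasets appear to be Hugging Face Hub repositories, so one way to fetch them is with the Hugging Face CLI (a sketch; substitute your own target paths):
# Requires the Hugging Face CLI: pip install -U huggingface_hub
huggingface-cli download mit-han-lab/Inf-Stream-Train --repo-type dataset --local-dir /path/to/your/Inf-Stream-Train
huggingface-cli download chenjoya/Live-WhisperX-526K --repo-type dataset --local-dir /path/to/your/Inf-Stream-Train/Livecc_sft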
Preprocess the LiveCC dataset with the following commands, which flatten all files into the top-level Livecc_sft directory:
cd $DATASET_PATH/Livecc_sft
find . -type f -exec mv -t . {} +
Next, download mit-han-lab/Inf-Stream-Eval to /path/to/your/Inf-Stream-Eval.
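As with the training data, Inf-Stream-Eval looks like a Hugging Face Hub repository, so a possible download command is (again a sketch, not the only way):
huggingface-cli download mit-han-lab/Inf-Stream-Eval --repo-type dataset --local-dir /path/to/your/Inf-Stream-Eval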
Finally, set the dataset environment paths:
export DATASET_PATH=/path/to/your/Inf-Stream-Train
export EVAL_DATASET_PATH=/path/to/your/Inf-Stream-Eval
You can prepare the data by following the steps above in order.
*You can kick off SFT by executing the scripts below.*
conda activate streamingvlm-sft
./scripts/sft_stage_1.sh
./scripts/sft_stage_2.sh # High Quality Annealing Data
conda activate streamingvlm-infer
./scripts/eval_efficiency.sh
You can benchmark efficiency by running the script above.
First, arrange the OVOBench data in the following structure:
data/ovobench
├── AutoEvalMetaData
├── COIN
├── cross_task
├── Ego4D
├── hirest
├── MovieNet
├── OpenEQA
├── ovo_bench_new.json
├── perception_test
├── star
├── thumos
├── youcook2
└── YouTube_Games
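To scaffold this layout before filling in each sub-dataset, one option is to create the directory skeleton up front (a sketch; the actual videos and annotations still need to be placed in each folder):
# Create the expected OVOBench directory skeleton; ovo_bench_new.json goes directly under data/ovobench/
mkdir -p data/ovobench/{AutoEvalMetaData,COIN,cross_task,Ego4D,hirest,MovieNet,OpenEQA,perception_test,star,thumos,youcook2,YouTube_Games}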
Then, prepare the OVOBench environment and run evaluation:
./scripts/env_ovo.sh
conda activate streamingvlm-ovo
./scripts/eval_OVOBench.sh
You can start the OVOBench evaluation with these commands.
We use VLMEvalKit to evaluate VQA tasks.
conda activate streamingvlm-infer
./scripts/eval_VQA.sh
You can launch VQA evaluation with the script above.
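If VLMEvalKit is not already provided by the environment setup scripts (an assumption on our part), it can be installed from source before running the script above:
# Install VLMEvalKit from its official repository (skip if the infer environment already includes it)
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .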
conda activate streamingvlm-infer
./scripts/eval_Inf-Stream-Eval.sh
You can run our in-house Inf-Stream-Eval benchmark with the script above.
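If you run this in a fresh shell, the evaluation presumably needs the path exported during data preparation (our assumption, based on the earlier setup step), so re-export it first:
# Re-export the evaluation dataset path if it is not already set in this shell
export EVAL_DATASET_PATH=/path/to/your/Inf-Stream-Eval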
conda activate streamingvlm-infer
export LIVESPORTS3K_PATH=/path/to/your/LiveSports-3K/videos
./scripts/eval_LiveSports3k-cc.sh
You can evaluate LiveSports3K-CC with the path set above.
If you would like to change the inference FPS, use the following command, which patches qwen_vl_utils so that the sampling FPS is read from the QWENVL_FPS environment variable (default 2.0):
sed -i 's/^FPS = .*/FPS = float(os.environ.get("QWENVL_FPS", "2.0"))/' \
"$(python -c 'import inspect,qwen_vl_utils.vision_process as m; import os; print(os.path.abspath(inspect.getsourcefile(m)))')"
After patching, you can tweak the FPS at run time by setting QWENVL_FPS (for example, export QWENVL_FPS=4.0) before launching inference.
If you find StreamingVLM useful or relevant to your project or research, please cite our paper:
@misc{xu2025streamingvlmrealtimeunderstandinginfinite,
title={StreamingVLM: Real-Time Understanding for Infinite Video Streams},
author={Ruyi Xu and Guangxuan Xiao and Yukang Chen and Liuning He and Kelly Peng and Yao Lu and Song Han},
year={2025},
eprint={2510.09608},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.09608},
}

