This project explores a novel framework that enhances fake news detection in short videos by simulating the creative process using large language models (LLMs).
The figure below provides an overview of the proposed framework.
Figure 1. The framework consists of two main modules: (1) a data synthesis module that constructs diverse short video fake news samples via LLM-driven simulation, and (2) an enhanced model training module that dynamically selects and integrates synthetic samples into the training process through uncertainty-based re-training.
Our data synthesis module follows a two-step pipeline:
- **Textual Material Library:** We use Qwen-Max to analyze captions from real-world news short videos. Emotionally biased or semantically weak content is filtered out to ensure the quality and informativeness of the generated narratives.
- **Visual Material Library:** Visual materials are extracted from existing videos using the TransNet-v2 shot segmentation model. Each resulting clip is further annotated via Qwen-VL-Max to enable semantic-aware indexing and retrieval (see the sketch after this list).
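As a non-authoritative illustration of the visual-library step, the minimal Python sketch below indexes each detected clip as `<source_video_id>_S<start_frame>_E<end_frame>`, the same naming scheme used in the synthesized data. The helpers `detect_shots` and `describe_clip` are hypothetical placeholders standing in for TransNet-v2 segmentation and Qwen-VL-Max annotation; they are not part of the repository code.

```python
# Minimal sketch of visual material library construction.
# `detect_shots` and `describe_clip` are hypothetical placeholders for
# TransNet-v2 shot segmentation and Qwen-VL-Max annotation, respectively.
import cv2  # opencv-python


def detect_shots(video_path: str, chunk_len: int = 150) -> list[tuple[int, int]]:
    """Return (start_frame, end_frame) pairs for each shot.

    Placeholder: splits the video into fixed-length chunks; the real
    pipeline uses TransNet-v2 shot boundary detection instead.
    """
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return [(s, min(s + chunk_len - 1, n_frames - 1))
            for s in range(0, n_frames, chunk_len)]


def describe_clip(video_path: str, start: int, end: int) -> str:
    """Placeholder for a Qwen-VL-Max call producing a semantic description."""
    return f"<description of frames {start}-{end} of {video_path}>"


def build_visual_library(video_id: str, video_path: str) -> dict[str, str]:
    """Index each clip as <video_id>_S<start>_E<end> -> semantic description."""
    library = {}
    for start, end in detect_shots(video_path):
        clip_id = f"{video_id}_S{start}_E{end}"
        library[clip_id] = describe_clip(video_path, start, end)
    return library
```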
Figure 2. An example of an archival entry for a visual material clip.
Figure 3. Distribution of visual material types across Chinese and English libraries. "Real-shot video" dominates both libraries, indicating its central role in news reporting. Cultural variation is observed: e.g., surveillance recordings are more prevalent in the Chinese dataset.
Figure 4. Distribution of visual material scores in terms of visual quality, newsworthiness, and visual impact. Chinese materials generally score higher in newsworthiness, while English materials show stronger visual impact, reflecting different content creation preferences.
We design multiple fabrication paths to mimic real-world fake news strategies, categorized into completion-based and manipulation-based modes.
Figure 5. Four fabrication types: misleading substitution, groundless fabrication, fact distortion, and selective editing. Each path simulates a distinct manipulation strategy by controlling the pairing and rewriting process.
Figure 6. Using a real video as seed, the pipeline generates four fake samples, each reflecting a different fabrication strategy.
The following table summarizes the number of generated samples using our LLM-based pipeline.
| Fabrication Type | Chinese | English |
|---|---|---|
| TOVN → TOVM (Misleading Substitution) | 974 | 663 |
| TNVO → TMVO (Groundless Fabrication) | 958 | 802 |
| TOVO → TMVO (Fact Distortion) | 790 | 802 |
| TOVO → TOVM (Selective Editing) | 556 | 325 |
| Total | 3,278 | 2,592 |
Table 1. Counts of synthesized short fake news videos from the Chinese (FakeSV) and English (FakeTT) datasets.
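For use in code, the four fabrication paths can be written as a simple lookup from (input code, output code) pairs to their names. The dictionary below mirrors Table 1 and is purely illustrative; it is not part of the repository code.

```python
# Illustrative mapping of fabrication paths, mirroring Table 1
# (not part of the repository code).
FABRICATION_TYPES = {
    ("TOVN", "TOVM"): "Misleading Substitution",
    ("TNVO", "TMVO"): "Groundless Fabrication",
    ("TOVO", "TMVO"): "Fact Distortion",
    ("TOVO", "TOVM"): "Selective Editing",
}
```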
🔍 Note: For the synthesized data itself, see `enhanced_v1_fakesv.json` and `enhanced_v1_fakett.json` in the `/data` directory. Below is an example from the Chinese data:
{ "video_id": "3xgfgk5d4pn3j8k_TNVO_TMVO", "title": "2023年8月15日,阿富汗赫拉特省发生严重空袭事件", "ocr": "2023年8月15日,在阿富汗赫拉特省的一次空袭中,多个村庄遭受严重破坏。救援人员正在现场进行清理工作,并收集遇难者的遗体和遗物。当地居民自发组织了悼念活动,点燃蜡烛以表达对逝者的哀思。", "visual_materials": [ "3xgfgk5d4pn3j8k_S0_E207", "3xgfgk5d4pn3j8k_S208_E269", "3xgfgk5d4pn3j8k_S270_E273", "3xgfgk5d4pn3j8k_S274_E339" ], "label": "假", "source": "Generated_TNVO_TMVO" }🔧 Data Format Explanation
`video_id`: Unique ID for the synthesized fake news video. It follows the format `source_video_id + "_" + faking_type`, where the suffix `TNVO_TMVO` encodes the fabrication path.
We use the following symbols to describe modality states:

- `T` = Text
- `V` = Visual
- `O` = Original
- `M` = Manipulated
- `N` = NULL (i.e., missing)

For example, `TNVO_TMVO` indicates:
- Input (`TNVO`): missing text and original visual (`T` = `N`, `V` = `O`)
- Output (`TMVO`): manipulated text and original visual (`T` = `M`, `V` = `O`)

This implies that the sample is generated by fabricating a new textual narrative to match the original visual content.
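To make the encoding concrete, here is a minimal sketch that decodes a synthesized `video_id` into its source video and per-modality states. The function names are ours and are not taken from the repository.

```python
# Minimal sketch: decode a synthesized video_id such as
# "3xgfgk5d4pn3j8k_TNVO_TMVO" into its source id and modality states.
# Function names are illustrative, not from the repository.
MODALITY_STATE = {"O": "original", "M": "manipulated", "N": "missing (NULL)"}


def decode_code(code: str) -> dict[str, str]:
    """Turn e.g. 'TNVO' (T=N, V=O) into per-modality states."""
    return {"text": MODALITY_STATE[code[1]], "visual": MODALITY_STATE[code[3]]}


def decode_video_id(video_id: str) -> dict:
    """Split '<source_video_id>_<input_code>_<output_code>'."""
    source_id, input_code, output_code = video_id.rsplit("_", 2)
    return {
        "source_video_id": source_id,
        "input": decode_code(input_code),
        "output": decode_code(output_code),
    }


print(decode_video_id("3xgfgk5d4pn3j8k_TNVO_TMVO"))
# {'source_video_id': '3xgfgk5d4pn3j8k',
#  'input': {'text': 'missing (NULL)', 'visual': 'original'},
#  'output': {'text': 'manipulated', 'visual': 'original'}}
```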
`visual_materials`: A list of visual clips used to construct the final video. Each clip is denoted as `<source_video_id>_S<start_frame>_E<end_frame>`. For instance, `3xgfgk5d4pn3j8k_S0_E207` refers to a segment of video `3xgfgk5d4pn3j8k` spanning frames 0 to 207. The synthesized video is created by sequentially concatenating the clips listed in `visual_materials`.
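As a rough illustration of how such a sample can be materialized, the sketch below parses the clip identifiers and concatenates the corresponding frame ranges with OpenCV. It assumes the source videos are stored as `<source_video_id>.mp4` under some `video_dir` and share resolution and frame rate; the function is ours, not the repository's rendering code.

```python
# Illustrative concatenation of the frame ranges listed in `visual_materials`.
# Assumes source videos are <source_video_id>.mp4 under `video_dir` and share
# resolution/fps; not part of the repository code.
import os
import re
import cv2  # opencv-python

CLIP_RE = re.compile(r"^(?P<vid>.+)_S(?P<start>\d+)_E(?P<end>\d+)$")


def concat_clips(visual_materials: list[str], video_dir: str, out_path: str) -> None:
    writer = None
    for clip_id in visual_materials:
        m = CLIP_RE.match(clip_id)
        if not m:
            continue  # skip malformed clip identifiers
        vid, start, end = m["vid"], int(m["start"]), int(m["end"])
        cap = cv2.VideoCapture(os.path.join(video_dir, f"{vid}.mp4"))
        cap.set(cv2.CAP_PROP_POS_FRAMES, start)  # jump to the clip's first frame
        for _ in range(start, end + 1):
            ok, frame = cap.read()
            if not ok:
                break
            if writer is None:  # lazily open the writer with the first frame's shape
                fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
                h, w = frame.shape[:2]
                fourcc = cv2.VideoWriter_fourcc(*"mp4v")
                writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
            writer.write(frame)
        cap.release()
    if writer is not None:
        writer.release()
```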
To effectively leverage the augmented data, we propose an active learning-based retraining framework. It incrementally selects and incorporates informative synthetic samples based on model uncertainty and feature similarity, and is compatible with various detection backbones.
We provide a modular and extensible implementation in this repository, with SVFEND used as a backbone example. The core components are illustrated below.
```
.
├── data/              # Human-labeled and synthetic dataset files
├── figs/              # Visual illustrations used in README
├── models/            # Detection model definitions (e.g., SVFEND)
├── utils/             # Training utilities
├── main.py            # Entry script for training
├── AL_run.py          # Script for active learning-based training
├── requirements.txt   # Python environment dependencies
└── README.md
```
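As a reference point for `AL_run.py`, the sketch below shows one common way to implement the uncertainty side of the selection step: scoring synthetic samples by predictive entropy and keeping the top-k. This is an illustrative PyTorch example under simplifying assumptions (single-tensor inputs, an in-order data loader); the actual criterion in the repository also incorporates feature similarity.

```python
# Illustrative entropy-based uncertainty sampling (PyTorch).
# Simplifying assumptions: the model maps a single input tensor to class
# logits, and the loader iterates the synthetic set in order (shuffle=False).
import torch
import torch.nn.functional as F


@torch.no_grad()
def select_uncertain_samples(model, synthetic_loader, k: int, device: str = "cuda"):
    """Return dataset indices of the k synthetic samples with the highest
    predictive entropy, i.e. those the current model is least sure about."""
    model.eval()
    entropies = []
    for inputs, _ in synthetic_loader:        # labels are not used for scoring
        logits = model(inputs.to(device))     # [batch, num_classes]
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        entropies.append(entropy.cpu())
    entropies = torch.cat(entropies)
    return torch.topk(entropies, min(k, len(entropies))).indices.tolist()
```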