# Simplifying Traffic Anomaly Detection with Video Foundation Models

Svetlana Orlova, Tommie Kerssies, Brunó B. Englert, Gijs Dubbelman

Eindhoven University of Technology
Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, they are less effective for TAD, where self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity.
Figure: Video ViT-based encoder-only models set a new state of the art on both datasets, while being significantly more efficient than top-performing specialized methods. FPS measured using an NVIDIA A100 MIG GPU instance. † From prior work. ‡ Optimistic estimates using publicly available components of the model. “A→B”: trained on A, tested on B; D2K: DADA-2000.
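For intuition, the sketch below shows the kind of encoder-only setup studied here: a plain Video ViT encoder followed by a single linear head that classifies a clip as normal or anomalous. The class/module names, token pooling, and tensor shapes are illustrative assumptions rather than the exact training configuration; see TRAIN.md for the actual recipes.

```python
# Minimal sketch of an encoder-only TAD model: a plain Video ViT encoder
# followed by a linear classification head. The encoder is a stand-in here;
# any Video ViT that maps a clip to a sequence of patch tokens can be used.
import torch
import torch.nn as nn


class EncoderOnlyTAD(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.encoder = encoder                         # plain Video ViT backbone
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)  # normal vs. anomalous

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, C, T, H, W) -> tokens: (B, N, D)
        tokens = self.encoder(clip)
        pooled = tokens.mean(dim=1)                    # mean-pool space-time tokens
        return self.head(self.norm(pooled))            # per-clip logits


if __name__ == "__main__":
    # Dummy encoder producing random tokens, for a shape check only.
    class DummyEncoder(nn.Module):
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.randn(x.shape[0], 8 * 14 * 14, 768)

    model = EncoderOnlyTAD(DummyEncoder())
    logits = model(torch.randn(2, 3, 16, 224, 224))    # 16-frame 224x224 clip
    print(logits.shape)                                # torch.Size([2, 2])
```

The comparisons in the paper then come primarily from which pre-trained weights initialize the encoder, rather than from architectural changes.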
## Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

## Installation

Please follow the instructions in INSTALL.md.

## Data Preparation

Please follow the instructions in DATASET.md.

## Training

Please follow the instructions in TRAIN.md.

## Running

Please follow the instructions in RUN.md.
## Contact

Svetlana Orlova: [email protected], [email protected]
## Acknowledgements

Our code is mainly based on the VideoMAE codebase. For models whose Video ViT architecture is identical to ours, we used only their released weights: ViViT, VideoMAE2, SMILE, SIGMA, MME, and MGMAE (see the loading sketch below). We used fragments of the original implementations of MVD, InternVideo2, and UMT to integrate these models with our codebase.
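Because these backbones share the plain ViT architecture, switching between pre-training methods amounts to loading a different checkpoint into the same model. Below is a minimal sketch of such loading; the nesting keys and prefixes handled here are common conventions, not the exact format of every release.

```python
# Sketch: load released Video ViT weights into an identical architecture.
# Checkpoints from different projects nest their state dicts differently
# ("model", "module", "state_dict", ...) and may prefix parameter names;
# the keys and prefixes below are common conventions, not an exhaustive list.
import torch


def load_video_vit_weights(model: torch.nn.Module, ckpt_path: str) -> None:
    ckpt = torch.load(ckpt_path, map_location="cpu")

    # Unwrap the state dict if it is nested under a common key.
    for key in ("model", "module", "state_dict"):
        if isinstance(ckpt, dict) and key in ckpt:
            ckpt = ckpt[key]
            break

    # Strip common wrapper prefixes (e.g. from DDP or encoder-decoder wrappers).
    cleaned = {}
    for name, tensor in ckpt.items():
        for prefix in ("module.", "encoder.", "backbone."):
            if name.startswith(prefix):
                name = name[len(prefix):]
        cleaned[name] = tensor

    # strict=False: pre-training decoder/head weights are simply ignored.
    missing, unexpected = model.load_state_dict(cleaned, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```

In practice, individual checkpoints may need additional key remapping beyond this.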
## License

The majority of this project is released under the CC-BY-NC 4.0 license, as found in the LICENSE file. Portions of the project are available under separate license terms: ViViT, InternVideo2, SlowFast, and pytorch-image-models are licensed under the Apache 2.0 license. VideoMAE2, SMILE, MGMAE, UMT, and BEiT are licensed under the MIT license. SIGMA is licensed under the BSD 3-Clause Clear license.
## Citation

If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:

@inproceedings{orlova2025simplifying,
  title={Simplifying Traffic Anomaly Detection with Video Foundation Models},
  author={Orlova, Svetlana and Kerssies, Tommie and Englert, Brun{\'o} B and Dubbelman, Gijs},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}

@article{orlova2025simplifying,
  title={Simplifying Traffic Anomaly Detection with Video Foundation Models},
  author={Orlova, Svetlana and Kerssies, Tommie and Englert, Brun{\'o} B and Dubbelman, Gijs},
  journal={arXiv preprint arXiv:2507.09338},
  year={2025}
}

