Forensics-Bench

A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

Jin Wang1,*, Chenghui Lv5,4,* , Xian Li6,4 , Shichao Dong7 , Huadong Li8 , Kelu Yao4 ,
Chao Li4 , Wenqi Shao3 , Ping Luo1,2,†

1The University of Hong Kong, 2HKU Shanghai Intelligent Computing Research Center,
3Shanghai AI Laboratory, 4Zhejiang Laboratory, Hangzhou, China,
5Hangzhou Institute for Advanced Study, 6Zhejiang University, 7Alibaba, Beijing, China, 8MEGVII Technology

*Equal contribution (primary contact: wj0529@connect.hku.hk)
†Corresponding Author
overview

Visualization of Forensics-Bench. Forensics-Bench spans 5 key perspectives (forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models), covering 112 unique forgery detection types, with detailed categories under each perspective. We report the number of samples for each category and illustrate example images with their corresponding questions. Forensics-Bench enables comprehensive evaluations of LVLMs on diverse forgery detection types in the era of AIGC.

🔔News

🔥[2025-02-27]: Our paper has been accepted to CVPR 2025!

Introduction

We present Forensics-Bench, a new forgery detection evaluation benchmark suite that assesses LVLMs across a wide range of forgery detection tasks, requiring comprehensive recognition, localization, and reasoning capabilities over diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multiple-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. We conduct thorough evaluations of 22 open-source LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC.
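Scoring a multiple-choice benchmark of this kind reduces to comparing each model's chosen option letter against the ground truth. The sketch below is a minimal, hypothetical illustration: the record fields (`question`, `options`, `answer`, `prediction`) are illustrative placeholders, not the actual Forensics-Bench data schema.

```python
def score_responses(records):
    """Compute overall accuracy over multiple-choice records.

    Each record is assumed to hold the question text, its candidate
    options, the ground-truth option letter, and the model's predicted
    option letter. (Field names are illustrative, not the real schema.)
    """
    correct = sum(1 for r in records if r["prediction"] == r["answer"])
    return correct / len(records)

# Toy records standing in for benchmark samples.
records = [
    {"question": "Is this face image forged?",
     "options": ["A. Yes", "B. No"],
     "answer": "A", "prediction": "A"},
    {"question": "Which region of the image is spliced?",
     "options": ["A. Top-left", "B. Bottom-right"],
     "answer": "B", "prediction": "A"},
]
print(score_responses(records))  # 0.5
```

In practice the predicted letter still has to be parsed out of the model's free-form response before this comparison, which is a separate (and often non-trivial) step.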

Forensics-Bench

Overview

pipeline

An illustration of the data collection pipeline for Forensics-Bench. First, starting from the 5 designed perspectives of Forensics-Bench, we searched the Internet for related publicly available datasets. Then, we collated the retrieved datasets into a unified metadata format. Finally, we either manually transformed the original data into handcrafted Questions & Answers (Q&A) or performed the Q&A transformation with the aid of ChatGPT. Forensics-Bench supports evaluations over diverse kinds of forgeries across various perspectives.
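The unified-metadata step above can be sketched as follows. This is a hypothetical record and conversion, assuming one field per perspective plus a label; the field names and question wording are illustrative, not the actual format used by Forensics-Bench.

```python
# Hypothetical unified metadata record; one field per perspective.
metadata = {
    "image_path": "deepfake_0001.jpg",    # illustrative path
    "semantics": "human subject",         # forgery semantics perspective
    "modality": "RGB image",              # forgery modalities perspective
    "task": "binary classification",      # forgery tasks perspective
    "forgery_type": "face swap",          # forgery types perspective
    "forgery_model": "GAN",               # forgery models perspective
    "label": "fake",
}

def to_multiple_choice(meta):
    """Turn one metadata record into a binary-classification Q&A item."""
    question = "Is the image real or forged?"
    options = ["A. Real", "B. Forged"]
    answer = "B" if meta["label"] == "fake" else "A"
    return {"image": meta["image_path"], "question": question,
            "options": options, "answer": answer}

qa = to_multiple_choice(metadata)
print(qa["answer"])  # B
```

Keeping every source dataset in one schema like this is what makes it possible to slice evaluation results along any of the 5 perspectives later.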

Comparisons with Existing Benchmarks

In Forensics-Bench, we cover 4 forgery modalities, 4 forgery tasks, 21 forgery types, and 22 forgery models, comprehensively evaluating the perception, localization, and reasoning capabilities of LVLMs in the context of forgery detection. A detailed comparison with previous forgery detection evaluation benchmarks for Large Vision Language Models is provided in the table below.

comparison table

Experiment Results

Leaderboard

Quantitative results for the 22 open-source LVLMs and 3 proprietary LVLMs across the 5 perspectives of forgery detection are summarized below. Accuracy is used as the metric, and the overall score is computed across all data in Forensics-Bench.

Proprietary models: GPT-4o, Gemini-1.5-Pro, and Claude3V-Sonnet; all other models are open-source.

| Model | Overall | Semantics | Modality | Task | Type | Forgery Model |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA-NEXT-34B | 66.7 | 63.8 | 69.7 | 41.0 | 63.3 | 72.7 |
| LLaVA-v1.5-7B-XTuner | 65.7 | 61.2 | 68.3 | 41.9 | 58.2 | 66.8 |
| LLaVA-v1.5-13B-XTuner | 65.2 | 62.9 | 68.7 | 37.9 | 61.3 | 71.8 |
| InternVL-Chat-V1-2 | 62.2 | 58.8 | 67.9 | 43.4 | 60.1 | 69.5 |
| LLaVA-NEXT-13B | 58.0 | 64.0 | 66.7 | 32.0 | 62.3 | 71.2 |
| GPT-4o | 57.9 | 50.2 | 57.3 | 33.1 | 47.9 | 53.1 |
| mPLUG-Owl2 | 57.8 | 58.5 | 68.4 | 40.5 | 58.1 | 70.0 |
| LLaVA-v1.5-7B | 54.9 | 61.6 | 68.7 | 37.1 | 64.0 | 70.8 |
| LLaVA-v1.5-13B | 53.8 | 52.7 | 64.2 | 34.1 | 55.6 | 63.7 |
| Yi-VL-34B | 52.6 | 47.2 | 53.6 | 39.7 | 41.2 | 51.6 |
| CogVLM-Chat | 50.0 | 44.1 | 49.5 | 32.2 | 45.4 | 52.0 |
| Gemini-1.5-Pro | 48.3 | 42.6 | 42.7 | 37.8 | 43.7 | 41.6 |
| XComposer2 | 47.3 | 42.2 | 43.8 | 28.3 | 42.9 | 48.4 |
| LLaVA-InternLM2-7B | 45.0 | 40.8 | 52.2 | 30.5 | 42.6 | 50.3 |
| VisualGLM-6B | 43.9 | 38.9 | 39.1 | 30.3 | 35.1 | 39.2 |
| LLaVA-NEXT-7B | 42.9 | 49.0 | 53.1 | 32.1 | 55.7 | 63.7 |
| LLaVA-InternLM-7B | 42.3 | 37.7 | 39.4 | 30.2 | 39.9 | 47.5 |
| ShareGPT4V-7B | 41.7 | 44.6 | 46.9 | 32.3 | 51.8 | 58.5 |
| InternVL-Chat-V1-5 | 40.5 | 39.9 | 33.6 | 28.7 | 41.7 | 47.6 |
| DeepSeek-VL-7B | 40.1 | 35.4 | 30.8 | 24.6 | 29.8 | 38.6 |
| Yi-VL-6B | 39.1 | 38.2 | 39.4 | 30.7 | 39.4 | 48.1 |
| InstructBLIP-13B | 37.3 | 33.1 | 42.2 | 27.1 | 28.4 | 33.7 |
| QWen-VL-Chat | 34.5 | 29.6 | 32.1 | 27.1 | 32.4 | 34.3 |
| Claude3V-Sonnet | 33.8 | 28.4 | 28.5 | 32.1 | 29.9 | 28.9 |
| Monkey-Chat | 27.2 | 18.6 | 18.1 | 20.6 | 19.2 | 21.2 |
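The per-perspective columns above can be produced by grouping scored records on the relevant perspective field and computing accuracy within each group. A minimal sketch, assuming hypothetical record fields (`task`, `answer`, `prediction`) rather than the actual Forensics-Bench data format:

```python
from collections import defaultdict

def per_perspective_accuracy(records, perspective):
    """Group records by a perspective field; compute accuracy per group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = r[perspective]
        totals[key] += 1
        hits[key] += int(r["prediction"] == r["answer"])
    return {k: hits[k] / totals[k] for k in totals}

# Toy scored records; field names are illustrative.
records = [
    {"task": "binary classification", "answer": "A", "prediction": "A"},
    {"task": "binary classification", "answer": "B", "prediction": "B"},
    {"task": "spatial localization",  "answer": "C", "prediction": "A"},
]
print(per_perspective_accuracy(records, "task"))
# {'binary classification': 1.0, 'spatial localization': 0.0}
```

The same function applied with `perspective` set to each of the 5 metadata fields yields the 5 score columns of the leaderboard.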

Taskonomy Analysis

Analysis on forgery semantics. The following figure illustrates the detailed performance of the 25 LVLMs from the perspective of forgery semantics. Most LVLMs did not demonstrate a significant bias between human subjects and general subjects. This provides a promising starting point for developing future all-around forgery detectors under the LVLM paradigm.

forgery semantics results

Analysis on forgery modalities. The following figure shows the detailed performance of the 25 LVLMs from the perspective of forgery modalities. We find that top-performing LVLMs (such as LLaVA-NEXT-34B) achieved impressive binary classification performance on forgeries in the near-infrared (NIR) modality. Meanwhile, when the input contains both RGB images and texts, these LVLMs struggled to perform well. Designing robust forgery detectors that excel across different modalities thus remains an open problem.

forgery modalities results

Analysis on forgery tasks. The detailed performance of the 25 LVLMs from the perspective of forgery tasks is shown in the following figure. We find that most LVLMs demonstrated relatively strong performance on the forgery binary classification (BC) task, while having difficulty maintaining strong performance on forgery spatial localization with segmentation masks or bounding boxes (SLS/SLD) and forgery temporal localization (TL). These results reveal that most LVLMs still require improved localization and reasoning capabilities across different forgery detection tasks.

forgery tasks results

Analysis on forgery types. The detailed performance of the 25 LVLMs from the perspective of forgery types is illustrated in the following figure. First, we find that current LVLMs still struggle to perform well over a wide range of forgery types, such as face swap (multiple faces), copy-move (CM), removal (RM), and splicing (SPL). Second, we find that leading LVLMs like the LLaVA series already excel at certain forgery types, such as face spoofing (SPF), image enhancement (IE), style translation (ST), and out-of-context (OOC), indicating their potential to grow into more generalized forgery detectors.

forgery types results
forgery models results

Analysis on forgery models. The detailed performance of the 25 LVLMs from the perspective of forgery models is illustrated in the above figure. It is noticeable that leading LVLMs achieved excellent performance on forgeries created with spoofing methods, such as 3D masks (3D) and paper cut (PC). Besides, for forgeries synthesized by popular AI models, we find that current LVLMs performed better on forgeries produced by diffusion models (DF) than on those produced by GANs, which may expose the limited discerning capabilities of LVLMs for forgeries from different AI models. Moreover, we find that current LVLMs struggled to recognize forgeries generated by combinations of multiple AI models, such as Encoder-Decoder & Graphics-based methods (ED&GR) and Generative Adversarial Networks & Transformer (GAN&TR). Such forgeries may pose even more significant challenges to LVLMs' forgery detection capabilities in the future.

Case Study

Image 1

Please see our paper for more cases and further discussion.

BibTeX


@misc{wang2025forensicsbenchcomprehensiveforgerydetection,
  title={Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models},
  author={Jin Wang and Chenghui Lv and Xian Li and Shichao Dong and Huadong Li and Kelu Yao and Chao Li and Wenqi Shao and Ping Luo},
  year={2025},
  eprint={2503.15024},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.15024},
}