Forensics-Bench

A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

Jin Wang1,*, Chenghui Lv5,4,* , Xian Li6,4 , Shichao Dong7 , Huadong Li8 , Kelu Yao4 ,
Chao Li4 , Wenqi Shao3 , Ping Luo1,2,†

1The University of Hong Kong, 2HKU Shanghai Intelligent Computing Research Center,
3Shanghai AI Laboratory, 4Zhejiang Laboratory, Hangzhou, China,
5Hangzhou Institute for Advanced Study, 6Zhejiang University, 7Alibaba, Beijing, China, 8MEGVII Technology

*Equal contribution (primary contact: wj0529@connect.hku.hk)
†Corresponding Author
overview

Visualization of Forensics-Bench. Forensics-Bench spans 5 key perspectives (forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models), covering 112 unique forgery detection types, with detailed categories under each perspective. We report the number of samples for each category and illustrate example images with their corresponding questions. Forensics-Bench enables comprehensive evaluations of LVLMs on diverse forgery detection types in the era of AIGC.

🔔News

🔥[2025-02-27]: Our paper has been accepted to CVPR 2025!

Introduction

We present Forensics-Bench, a new forgery detection evaluation benchmark suite that assesses LVLMs across a wide range of forgery detection tasks, requiring comprehensive recognition, localization, and reasoning capabilities over diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multiple-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. We conduct thorough evaluations of 22 open-source LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC.
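Scoring a multiple-choice benchmark of this kind reduces to comparing each model's chosen option letter against the ground truth. The sketch below is a minimal, hypothetical illustration: the record fields (`question`, `options`, `answer`, `prediction`) are illustrative placeholders, not the actual Forensics-Bench data schema.

```python
def score_responses(records):
    """Compute overall accuracy over multiple-choice records.

    Each record is assumed to hold the question text, its candidate
    options, the ground-truth option letter, and the model's predicted
    option letter. (Field names are illustrative, not the real schema.)
    """
    correct = sum(1 for r in records if r["prediction"] == r["answer"])
    return correct / len(records)

# Toy records standing in for benchmark samples.
records = [
    {"question": "Is this face image forged?",
     "options": ["A. Yes", "B. No"],
     "answer": "A", "prediction": "A"},
    {"question": "Which region of the image is spliced?",
     "options": ["A. Top-left", "B. Bottom-right"],
     "answer": "B", "prediction": "A"},
]
print(score_responses(records))  # 0.5
```

In practice the predicted letter still has to be parsed out of the model's free-form response before this comparison, which is a separate (and often non-trivial) step.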

Forensics-Bench

Overview

pipeline

An illustration of the data collection pipeline for Forensics-Bench. First, starting from the 5 designed perspectives of Forensics-Bench, we searched the Internet for related publicly available datasets. Then, we collated the retrieved datasets into a unified metadata format. Finally, we either manually transformed the original data into handcrafted Questions & Answers (Q&A) or performed the Q&A transformation with the aid of ChatGPT. Forensics-Bench supports evaluations over diverse kinds of forgeries across various perspectives.
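The unified-metadata step above can be sketched as follows. This is a hypothetical record and conversion, assuming one field per perspective plus a label; the field names and question wording are illustrative, not the actual format used by Forensics-Bench.

```python
# Hypothetical unified metadata record; one field per perspective.
metadata = {
    "image_path": "deepfake_0001.jpg",    # illustrative path
    "semantics": "human subject",         # forgery semantics perspective
    "modality": "RGB image",              # forgery modalities perspective
    "task": "binary classification",      # forgery tasks perspective
    "forgery_type": "face swap",          # forgery types perspective
    "forgery_model": "GAN",               # forgery models perspective
    "label": "fake",
}

def to_multiple_choice(meta):
    """Turn one metadata record into a binary-classification Q&A item."""
    question = "Is the image real or forged?"
    options = ["A. Real", "B. Forged"]
    answer = "B" if meta["label"] == "fake" else "A"
    return {"image": meta["image_path"], "question": question,
            "options": options, "answer": answer}

qa = to_multiple_choice(metadata)
print(qa["answer"])  # B
```

Keeping every source dataset in one schema like this is what makes it possible to slice evaluation results along any of the 5 perspectives later.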

Comparisons with Existing Benchmarks

In Forensics-Bench, we cover 4 forgery modalities, 4 forgery tasks, 21 forgery types, and 22 forgery models, comprehensively evaluating the perception, localization, and reasoning capabilities of LVLMs in the context of forgery detection. A detailed comparison with previous forgery detection evaluation benchmarks for Large Vision Language Models is provided in the table below.

comparison table

Experiment Results

Leaderboard

Quantitative results for the 22 open-source LVLMs and 3 proprietary LVLMs across the 5 perspectives of forgery detection are summarized below. Accuracy is used as the metric, and the overall score is computed across all data in Forensics-Bench.

Proprietary models: GPT-4o, Gemini-1.5-Pro, and Claude3V-Sonnet; all other models are open-source.

| Model | Overall | Semantics | Modality | Task | Type | Forgery Model |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA-NEXT-34B | 66.7 | 63.8 | 69.7 | 41.0 | 63.3 | 72.7 |
| LLaVA-v1.5-7B-XTuner | 65.7 | 61.2 | 68.3 | 41.9 | 58.2 | 66.8 |
| LLaVA-v1.5-13B-XTuner | 65.2 | 62.9 | 68.7 | 37.9 | 61.3 | 71.8 |
| InternVL-Chat-V1-2 | 62.2 | 58.8 | 67.9 | 43.4 | 60.1 | 69.5 |
| LLaVA-NEXT-13B | 58.0 | 64.0 | 66.7 | 32.0 | 62.3 | 71.2 |
| GPT-4o | 57.9 | 50.2 | 57.3 | 33.1 | 47.9 | 53.1 |
| mPLUG-Owl2 | 57.8 | 58.5 | 68.4 | 40.5 | 58.1 | 70.0 |
| LLaVA-v1.5-7B | 54.9 | 61.6 | 68.7 | 37.1 | 64.0 | 70.8 |
| LLaVA-v1.5-13B | 53.8 | 52.7 | 64.2 | 34.1 | 55.6 | 63.7 |
| Yi-VL-34B | 52.6 | 47.2 | 53.6 | 39.7 | 41.2 | 51.6 |
| CogVLM-Chat | 50.0 | 44.1 | 49.5 | 32.2 | 45.4 | 52.0 |
| Gemini-1.5-Pro | 48.3 | 42.6 | 42.7 | 37.8 | 43.7 | 41.6 |
| XComposer2 | 47.3 | 42.2 | 43.8 | 28.3 | 42.9 | 48.4 |
| LLaVA-InternLM2-7B | 45.0 | 40.8 | 52.2 | 30.5 | 42.6 | 50.3 |
| VisualGLM-6B | 43.9 | 38.9 | 39.1 | 30.3 | 35.1 | 39.2 |
| LLaVA-NEXT-7B | 42.9 | 49.0 | 53.1 | 32.1 | 55.7 | 63.7 |
| LLaVA-InternLM-7B | 42.3 | 37.7 | 39.4 | 30.2 | 39.9 | 47.5 |
| ShareGPT4V-7B | 41.7 | 44.6 | 46.9 | 32.3 | 51.8 | 58.5 |
| InternVL-Chat-V1-5 | 40.5 | 39.9 | 33.6 | 28.7 | 41.7 | 47.6 |
| DeepSeek-VL-7B | 40.1 | 35.4 | 30.8 | 24.6 | 29.8 | 38.6 |
| Yi-VL-6B | 39.1 | 38.2 | 39.4 | 30.7 | 39.4 | 48.1 |
| InstructBLIP-13B | 37.3 | 33.1 | 42.2 | 27.1 | 28.4 | 33.7 |
| QWen-VL-Chat | 34.5 | 29.6 | 32.1 | 27.1 | 32.4 | 34.3 |
| Claude3V-Sonnet | 33.8 | 28.4 | 28.5 | 32.1 | 29.9 | 28.9 |
| Monkey-Chat | 27.2 | 18.6 | 18.1 | 20.6 | 19.2 | 21.2 |
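The per-perspective columns above can be produced by grouping scored records on the relevant perspective field and computing accuracy within each group. A minimal sketch, assuming hypothetical record fields (`task`, `answer`, `prediction`) rather than the actual Forensics-Bench data format:

```python
from collections import defaultdict

def per_perspective_accuracy(records, perspective):
    """Group records by a perspective field; compute accuracy per group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = r[perspective]
        totals[key] += 1
        hits[key] += int(r["prediction"] == r["answer"])
    return {k: hits[k] / totals[k] for k in totals}

# Toy scored records; field names are illustrative.
records = [
    {"task": "binary classification", "answer": "A", "prediction": "A"},
    {"task": "binary classification", "answer": "B", "prediction": "B"},
    {"task": "spatial localization",  "answer": "C", "prediction": "A"},
]
print(per_perspective_accuracy(records, "task"))
# {'binary classification': 1.0, 'spatial localization': 0.0}
```

The same function applied with `perspective` set to each of the 5 metadata fields yields the 5 score columns of the leaderboard.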

Taskonomy Analysis

Analysis on forgery semantics. The following figure illustrates the detailed performance of the 25 LVLMs from the perspective of forgery semantics. Most LVLMs did not demonstrate a significant bias between human subjects and general subjects. This provides a promising starting point for developing future all-around forgery detectors under the LVLM paradigm.

forgery semantics results

Analysis on forgery modalities. The following figure shows the detailed performance of the 25 LVLMs from the perspective of forgery modalities. We find that top-performing LVLMs (such as LLaVA-NEXT-34B) achieved impressive binary classification performance on forgeries in the near-infrared (NIR) modality. Meanwhile, when the input contains both RGB images and texts, these LVLMs struggled to perform well. Designing robust forgery detectors that excel across different modalities thus remains an open problem.

forgery modalities results

Analysis on forgery tasks. The detailed performance of the 25 LVLMs from the perspective of forgery tasks is shown in the following figure. We find that most LVLMs demonstrated relatively strong performance on the forgery binary classification (BC) task, while having difficulty maintaining strong performance on forgery spatial localization with segmentation masks or bounding boxes (SLS/SLD) and forgery temporal localization (TL). These results reveal that most LVLMs still require improved localization and reasoning capabilities across different forgery detection tasks.

forgery tasks results

Analysis on forgery types. The detailed performance of the 25 LVLMs from the perspective of forgery types is illustrated in the following figure. First, we find that current LVLMs still struggle to perform well over a wide range of forgery types, such as face swap (multiple faces), copy-move (CM), removal (RM), and splicing (SPL). Second, we find that leading LVLMs like the LLaVA series already excel at certain forgery types, such as face spoofing (SPF), image enhancement (IE), style translation (ST), and out-of-context (OOC), indicating their potential to grow into more generalized forgery detectors.

forgery types results
forgery models results

Analysis on forgery models. The detailed performance of the 25 LVLMs from the perspective of forgery models is illustrated in the above figure. It is noticeable that leading LVLMs achieved excellent performance on forgeries created with spoofing methods, such as 3D masks (3D) and paper cut (PC). Besides, for forgeries synthesized by popular AI models, we find that current LVLMs performed better on forgeries produced by diffusion models (DF) than on those produced by GANs, which may expose the limited discerning capabilities of LVLMs for forgeries from different AI models. Moreover, we find that current LVLMs struggled to recognize forgeries generated by combinations of multiple AI models, such as Encoder-Decoder & Graphics-based methods (ED&GR) and Generative Adversarial Networks & Transformer (GAN&TR). Such forgeries may pose even more significant challenges to LVLMs' forgery detection capabilities in the future.

Case Study

Image 1

Please see our paper for more cases and further discussion.

BibTeX


@misc{wang2025forensicsbenchcomprehensiveforgerydetection,
  title={Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models},
  author={Jin Wang and Chenghui Lv and Xian Li and Shichao Dong and Huadong Li and Kelu Yao and Chao Li and Wenqi Shao and Ping Luo},
  year={2025},
  eprint={2503.15024},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.15024},
}