🔥[2025-06-02]: Our paper is now available on arXiv and we welcome citations: CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
🔥[2025-05-30]: We developed a complete evaluation pipeline, and the implementation details are available on GitHub.
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their scientific reasoning capabilities remain insufficiently evaluated. Existing multimodal benchmarks mainly focus on general image comprehension or text-based reasoning, lacking authentic scientific contexts that require domain-specific knowledge integrated with visual evidence analysis. To address this gap, we introduce CSVQA, a diagnostic multimodal benchmark designed specifically to evaluate scientific reasoning through domain-grounded visual question answering.
Our benchmark comprises 1,378 carefully crafted question-answer pairs across diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to previous benchmarks, CSVQA emphasizes real-world scientific content and complex reasoning tasks.
Additionally, we propose a rigorous evaluation protocol that uses curated explanations to systematically assess whether model predictions are supported by valid intermediate reasoning steps.
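As a rough illustration of what such a check can look like, the sketch below uses an LLM judge to compare a model's reasoning against the curated reference explanation. The prompt wording, the VALID/INVALID verdict format, and the `call_llm` callable are illustrative assumptions, not the exact protocol from the paper.

```python
# Hypothetical sketch of a reasoning-step check: an LLM judge compares a model's
# chain-of-thought against the curated reference explanation for the same item.
from typing import Callable

JUDGE_TEMPLATE = """You are grading a student's solution to a STEM problem.
Question: {question}
Reference explanation: {explanation}
Reference answer: {answer}
Student reasoning and answer: {prediction}

Is the student's final answer correct AND supported by valid intermediate steps?
Reply with exactly one word: VALID or INVALID."""

def check_reasoning(sample: dict, prediction: str, call_llm: Callable[[str], str]) -> bool:
    """Return True if the judge deems the prediction's reasoning valid."""
    prompt = JUDGE_TEMPLATE.format(
        question=sample["question"],
        explanation=sample["explanation"],
        answer=sample["answer"],
        prediction=prediction,
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("VALID")
```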
The dataset is sourced from publicly available Chinese high school textbooks and examination papers across STEM disciplines. To ensure high-quality alignment, we use a four-phase quality-control pipeline that improves efficiency over traditional methods.
The process begins by parsing the source materials and applying OCR to extract textual and visual data. We then apply an automated alignment stage, powered by DeepSeekV3, to establish semantic correspondences between questions and answers.
Manual screening then addresses complex cases, such as multi-page layouts and mixed text-image formats. Finally, the benchmark undergoes three independent reviews: schema validation, integrity checks for completeness, and domain-specific audits with the help of annotators to ensure subject accuracy.
From an initial pool of approximately 100,000 raw entries, we filter out unsuitable question types like proof-based or diagram-drawing tasks, discard samples without associated images, and remove mismatched question-answer pairs as flagged by the LLM. A human-curated subset of high-quality multimodal items is retained for the final dataset.
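The automatic part of this filtering can be summarized by the minimal sketch below. The entry schema (`question_type`, `image_paths`, ...) and the `llm_flags_mismatch` callback are hypothetical; the released pipeline on GitHub remains the authoritative implementation.

```python
# Illustrative sketch of the automatic filtering stage described above.
EXCLUDED_TYPES = {"proof", "diagram_drawing"}

def keep_entry(entry: dict, llm_flags_mismatch) -> bool:
    """Apply the three automatic filters: question type, image presence, QA mismatch."""
    if entry.get("question_type") in EXCLUDED_TYPES:
        return False                       # proof-based / diagram-drawing tasks
    if not entry.get("image_paths"):
        return False                       # no associated image
    if llm_flags_mismatch(entry["question"], entry["answer"]):
        return False                       # LLM-flagged question-answer mismatch
    return True

def filter_pool(raw_entries, llm_flags_mismatch):
    """Reduce the ~100k raw pool to candidates for human curation."""
    return [e for e in raw_entries if keep_entry(e, llm_flags_mismatch)]
```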
CSVQA is distinguished by three features. First, its coverage of multiple STEM disciplines requires diverse domain knowledge and reasoning strategies. Second, the inclusion of 14 distinct visual modalities introduces significant variation in visual structure and complexity, testing a model’s ability to generalize across image types. Third, many questions are grounded in real-world scenarios and demand domain-specific knowledge, requiring models to go beyond pattern recognition and engage in context-aware reasoning.
An overview of the dataset’s composition is presented in the table below. CSVQA contains 1,378 expert-annotated questions of moderate average length, balancing language-processing load and reasoning depth. Nearly 81% of items are paired with a detailed explanation, which is particularly valuable for analyzing logical missteps in model predictions. In addition, we provide a bilingual (Chinese-English) version produced via translation, enabling a broader range of test scenarios.
Statistics | Number |
---|---|
Total Questions | 1,378 |
Image Types | 14 |
Easy : Medium : Hard | 22.6% : 67.4% : 10.0% |
Multiple-choice Questions | 1,278 |
Open Questions | 100 |
With an Explanation | 81.1% |
Image in the Question | 1,341 |
Image in Option | 37 |
Average Question Length | 69.7 |
Average Option Length | 12.1 |
Average Explanation Length | 123.5 |
For comparability with other benchmarks, the length statistics above are computed on the English translations.
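For readers who want to reproduce counts like those in the table, the sketch below shows one way to compute them from a JSON dump of the benchmark. The field names (`question_en`, `options_en`, `explanation_en`) are assumptions about the distribution format, not the official schema.

```python
# Minimal sketch: recompute dataset statistics from a hypothetical JSON dump.
import json
from statistics import mean

def word_len(text: str) -> int:
    return len(text.split())

def summarize(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {
        "total_questions": len(data),
        "multiple_choice": sum(1 for d in data if d.get("options_en")),
        "with_explanation_pct": 100.0 * sum(1 for d in data if d.get("explanation_en")) / len(data),
        "avg_question_len": mean(word_len(d["question_en"]) for d in data),
        "avg_explanation_len": mean(
            word_len(d["explanation_en"]) for d in data if d.get("explanation_en")
        ),
    }
```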
Accuracy of different models on CSVQA. Top scores in each column are underlined and bolded; second-best scores are bolded.
Model | Overall | Biology | Chemistry | Math | Physics | Open | MC |
---|---|---|---|---|---|---|---|
Random Choice | 5.2 | 5.1 | 6.2 | 4.5 | 5.7 | 0 | 5.7 |
Open-source VLM | |||||||
Fuyu-8B | 4.9 | 6.3 | 5.6 | 3.5 | 4.3 | 2.0 | 5.1 |
Deepseek-VL2 | 6.2 | 7.0 | 6.2 | 7.6 | 4.5 | 8.0 | 6.0 |
LLaVA1.5-13B | 7.5 | 10.7 | 9.4 | 5.4 | 5.5 | 4.0 | 7.8 |
MonoInternVL | 9.3 | 7.3 | 9.1 | 9.2 | 10.9 | 3.0 | 9.8 |
Idefics3-8b | 10.1 | 11.7 | 15.2 | 7.0 | 7.1 | 4.0 | 10.6 |
Pixtral-12B | 10.5 | 15.3 | 8.8 | 8.6 | 10.0 | 5.0 | 10.9 |
Phi-4 | 11.5 | 13.3 | 16.1 | 8.9 | 8.3 | 7.0 | 11.8 |
Gemma3-27B | 22.9 | 26.0 | 23.5 | 27.0 | 17.1 | 23.0 | 22.9 |
Internvl2-5-78B | 28.4 | 36.3 | 36.1 | 24.1 | 19.7 | 16.0 | 29.3 |
QVQ-72B | 36.6 | 40.7 | 41.3 | 33.7 | 32.0 | 32.0 | 36.9 |
Internvl3-78B | 37.4 | 46.0 | 41.1 | 36.5 | 28.9 | 30.0 | 38.0 |
Qwen2.5VL-72B | 38.5 | 45.7 | 40.8 | 37.5 | 32.2 | 29.0 | 39.2 |
Closed-source VLM | |||||||
GPT-4o | 23.6 | 28.0 | 23.5 | 23.5 | 20.6 | 18.0 | 24.0 |
Claude3.7 | 36.6 | 41.7 | 38.1 | 37.1 | 31.3 | 32.0 | 36.9 |
Gemini2.0-flash | 44.1 | 45.0 | 45.5 | 47.6 | 39.8 | 46.0 | 44.0 |
o1 | 49.6 | 46.2 | 45.1 | 59.0 | 49.1 | 41.3 | 50.2 |
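A hedged sketch of how per-subject accuracy could be computed for the multiple-choice subset is shown below. The option-letter regex and the `subject`/`answer` field names are illustrative assumptions; open-ended questions would need a separate answer-matching step (for example, the explanation-based judge sketched earlier).

```python
# Illustrative per-subject accuracy for single-answer multiple-choice items.
import re
from collections import defaultdict

def extract_choice(output: str) -> str | None:
    """Take the last standalone option letter (A-D) mentioned in the raw output."""
    letters = re.findall(r"\b([A-D])\b", output.upper())
    return letters[-1] if letters else None

def accuracy_by_subject(samples: list[dict], predictions: list[str]) -> dict:
    """Average exact matches of the extracted letter against the gold answer, per subject."""
    hits, totals = defaultdict(int), defaultdict(int)
    for sample, raw in zip(samples, predictions):
        subject = sample["subject"]        # e.g. "Physics"
        totals[subject] += 1
        if extract_choice(raw) == sample["answer"]:
            hits[subject] += 1
    return {s: 100.0 * hits[s] / totals[s] for s in totals}
```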
```bibtex
@misc{jian2025csvqachinesemultimodalbenchmark,
      title={CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs},
      author={Ai Jian and Weijie Qiu and Xiaokun Wang and Peiyu Wang and Yunzhuo Hao and Jiangbo Pei and Yichen Wei and Yi Peng and Xuchen Song},
      year={2025},
      eprint={2505.24120},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.24120},
}
```