🔥[2025-06-02]: Our paper is now available on arXiv and we welcome citations: CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs
🔥[2025-05-30]: We developed a complete evaluation pipeline, and the implementation details are available on GitHub.
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their scientific reasoning capabilities remain insufficiently evaluated. Existing multimodal benchmarks mainly focus on general image comprehension or text-based reasoning, lacking authentic scientific contexts that require domain-specific knowledge integrated with visual evidence analysis. To address this gap, we introduce CSVQA, a diagnostic multimodal benchmark designed specifically to evaluate scientific reasoning through domain-grounded visual question answering.
Our benchmark comprises 1,378 carefully crafted question-answer pairs across diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to previous benchmarks, CSVQA emphasizes real-world scientific content and complex reasoning tasks.
Additionally, we propose a rigorous evaluation protocol that uses curated explanations to systematically assess whether model predictions are supported by valid intermediate reasoning steps.
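As a rough illustration of what such a check can look like, the sketch below uses an LLM judge to compare a model's reasoning against the curated reference explanation. The prompt wording, the VALID/INVALID verdict format, and the `call_llm` callable are illustrative assumptions, not the exact protocol from the paper.

```python
# Hypothetical sketch of a reasoning-step check: an LLM judge compares a model's
# chain-of-thought against the curated reference explanation for the same item.
from typing import Callable

JUDGE_TEMPLATE = """You are grading a student's solution to a STEM problem.
Question: {question}
Reference explanation: {explanation}
Reference answer: {answer}
Student reasoning and answer: {prediction}

Is the student's final answer correct AND supported by valid intermediate steps?
Reply with exactly one word: VALID or INVALID."""

def check_reasoning(sample: dict, prediction: str, call_llm: Callable[[str], str]) -> bool:
    """Return True if the judge deems the prediction's reasoning valid."""
    prompt = JUDGE_TEMPLATE.format(
        question=sample["question"],
        explanation=sample["explanation"],
        answer=sample["answer"],
        prediction=prediction,
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("VALID")
```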
The dataset is sourced from publicly available Chinese high school textbooks and examination papers across STEM disciplines. To ensure high-quality alignment, we use a four-phase quality-control pipeline that improves efficiency over traditional methods.
The process begins by parsing the source materials and applying OCR to extract textual and visual data. We then apply an automated alignment stage, powered by DeepSeekV3, to establish semantic correspondences between questions and answers.
Manual screening then addresses complex cases, such as multi-page layouts and mixed text-image formats. Finally, the benchmark undergoes three independent reviews: schema validation, integrity checks for completeness, and domain-specific audits with the help of annotators to ensure subject accuracy.
From an initial pool of approximately 100,000 raw entries, we filter out unsuitable question types like proof-based or diagram-drawing tasks, discard samples without associated images, and remove mismatched question-answer pairs as flagged by the LLM. A human-curated subset of high-quality multimodal items is retained for the final dataset.
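The automatic part of this filtering can be summarized by the minimal sketch below. The entry schema (`question_type`, `image_paths`, ...) and the `llm_flags_mismatch` callback are hypothetical; the released pipeline on GitHub remains the authoritative implementation.

```python
# Illustrative sketch of the automatic filtering stage described above.
EXCLUDED_TYPES = {"proof", "diagram_drawing"}

def keep_entry(entry: dict, llm_flags_mismatch) -> bool:
    """Apply the three automatic filters: question type, image presence, QA mismatch."""
    if entry.get("question_type") in EXCLUDED_TYPES:
        return False                       # proof-based / diagram-drawing tasks
    if not entry.get("image_paths"):
        return False                       # no associated image
    if llm_flags_mismatch(entry["question"], entry["answer"]):
        return False                       # LLM-flagged question-answer mismatch
    return True

def filter_pool(raw_entries, llm_flags_mismatch):
    """Reduce the ~100k raw pool to candidates for human curation."""
    return [e for e in raw_entries if keep_entry(e, llm_flags_mismatch)]
```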
CSVQA is distinguished by three features. First, its coverage of multiple STEM disciplines requires diverse domain knowledge and reasoning strategies. Second, the inclusion of 14 distinct visual modalities introduces significant variation in visual structure and complexity, testing a model’s ability to generalize across image types. Third, many questions are grounded in real-world scenarios and demand domain-specific knowledge, requiring models to go beyond pattern recognition and engage in context-aware reasoning.
An overview of the dataset’s composition is presented in the table below. CSVQA contains 1,378 expert-annotated questions of moderate average length, balancing language-processing load and reasoning depth. Nearly 81% of items are paired with a detailed explanation, which is particularly valuable for analyzing logical missteps in model predictions. In addition, we provide a bilingual (Chinese-English) version produced via translation, enabling a broader range of test scenarios.
Statistics | Number |
---|---|
Total Questions | 1,378 |
Image Types | 14 |
Easy : Medium : Hard | 22.6% : 67.4% : 10.0% |
Multiple-choice Questions | 1,278 |
Open Questions | 100 |
With an Explanation | 81.1% |
Image in the Question | 1,341 |
Image in Option | 37 |
Average Question Length | 69.7 |
Average Option Length | 12.1 |
Average Explanation Length | 123.5 |
For comparability with other benchmarks, the length statistics above are computed on the English translations.
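For readers who want to reproduce counts like those in the table, the sketch below shows one way to compute them from a JSON dump of the benchmark. The field names (`question_en`, `options_en`, `explanation_en`) are assumptions about the distribution format, not the official schema.

```python
# Minimal sketch: recompute dataset statistics from a hypothetical JSON dump.
import json
from statistics import mean

def word_len(text: str) -> int:
    return len(text.split())

def summarize(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {
        "total_questions": len(data),
        "multiple_choice": sum(1 for d in data if d.get("options_en")),
        "with_explanation_pct": 100.0 * sum(1 for d in data if d.get("explanation_en")) / len(data),
        "avg_question_len": mean(word_len(d["question_en"]) for d in data),
        "avg_explanation_len": mean(
            word_len(d["explanation_en"]) for d in data if d.get("explanation_en")
        ),
    }
```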
Accuracy of different models on CSVQA. Top scores in each column are underlined and bolded; second-best scores are bolded.
Model | Overall | Biology | Chemistry | Math | Physics | Open | MC |
---|---|---|---|---|---|---|---|
Random Choice | 5.2 | 5.1 | 6.2 | 4.5 | 5.7 | 0 | 5.7 |
Open-source VLM | |||||||
Fuyu-8B | 4.9 | 6.3 | 5.6 | 3.5 | 4.3 | 2.0 | 5.1 |
Deepseek-VL2 | 6.2 | 7.0 | 6.2 | 7.6 | 4.5 | 8.0 | 6.0 |
LLaVA1.5-13B | 7.5 | 10.7 | 9.4 | 5.4 | 5.5 | 4.0 | 7.8 |
MonoInternVL | 9.3 | 7.3 | 9.1 | 9.2 | 10.9 | 3.0 | 9.8 |
Idefics3-8b | 10.1 | 11.7 | 15.2 | 7.0 | 7.1 | 4.0 | 10.6 |
Pixtral-12B | 10.5 | 15.3 | 8.8 | 8.6 | 10.0 | 5.0 | 10.9 |
Phi-4 | 11.5 | 13.3 | 16.1 | 8.9 | 8.3 | 7.0 | 11.8 |
Gemma3-27B | 22.9 | 26.0 | 23.5 | 27.0 | 17.1 | 23.0 | 22.9 |
Internvl2-5-78B | 28.4 | 36.3 | 36.1 | 24.1 | 19.7 | 16.0 | 29.3 |
QVQ-72B | 36.6 | 40.7 | 41.3 | 33.7 | 32.0 | 32.0 | 36.9 |
Internvl3-78B | 37.4 | 46.0 | 41.1 | 36.5 | 28.9 | 30.0 | 38.0 |
Qwen2.5VL-72B | 38.5 | 45.7 | 40.8 | 37.5 | 32.2 | 29.0 | 39.2 |
Closed-source VLM | |||||||
GPT-4o | 23.6 | 28.0 | 23.5 | 23.5 | 20.6 | 18.0 | 24.0 |
Claude3.7 | 36.6 | 41.7 | 38.1 | 37.1 | 31.3 | 32.0 | 36.9 |
Gemini2.0-flash | 44.1 | 45.0 | 45.5 | 47.6 | 39.8 | 46.0 | 44.0 |
o1 | 49.6 | 46.2 | 45.1 | 59.0 | 49.1 | 41.3 | 50.2 |
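A hedged sketch of how per-subject accuracy could be computed for the multiple-choice subset is shown below. The option-letter regex and the `subject`/`answer` field names are illustrative assumptions; open-ended questions would need a separate answer-matching step (for example, the explanation-based judge sketched earlier).

```python
# Illustrative per-subject accuracy for single-answer multiple-choice items.
import re
from collections import defaultdict

def extract_choice(output: str) -> str | None:
    """Take the last standalone option letter (A-D) mentioned in the raw output."""
    letters = re.findall(r"\b([A-D])\b", output.upper())
    return letters[-1] if letters else None

def accuracy_by_subject(samples: list[dict], predictions: list[str]) -> dict:
    """Average exact matches of the extracted letter against the gold answer, per subject."""
    hits, totals = defaultdict(int), defaultdict(int)
    for sample, raw in zip(samples, predictions):
        subject = sample["subject"]        # e.g. "Physics"
        totals[subject] += 1
        if extract_choice(raw) == sample["answer"]:
            hits[subject] += 1
    return {s: 100.0 * hits[s] / totals[s] for s in totals}
```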
```bibtex
@misc{jian2025csvqachinesemultimodalbenchmark,
      title={CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs},
      author={Ai Jian and Weijie Qiu and Xiaokun Wang and Peiyu Wang and Yunzhuo Hao and Jiangbo Pei and Yichen Wei and Yi Peng and Xuchen Song},
      year={2025},
      eprint={2505.24120},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.24120},
}
```