CSVQA

Chinese STEM Visual Question Answering

Ai Jian*, Weijie Qiu*, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, Xuchen Song†
* Equal contribution  † Corresponding author

csvqa.benchmark@gmail.com

If you have any questions, please contact us! 😆

🔔News

🔥[2025-06-02]: Our paper is now available on arXiv and we welcome citations: CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

🔥[2025-05-30]: We developed a complete evaluation pipeline, and the implementation details are available on GitHub.

Introduction

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their scientific reasoning capabilities remain insufficiently evaluated. Existing multimodal benchmarks mainly focus on general image comprehension or text-based reasoning, lacking authentic scientific contexts that require domain-specific knowledge integrated with visual evidence analysis. To address this gap, we introduce CSVQA, a diagnostic multimodal benchmark designed specifically to evaluate scientific reasoning through domain-grounded visual question answering.

Our benchmark comprises 1,378 carefully crafted question-answer pairs across diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to previous benchmarks, CSVQA emphasizes real-world scientific content and complex reasoning tasks.

Additionally, we propose a rigorous evaluation protocol that uses curated explanations to systematically assess whether model predictions are supported by valid intermediate reasoning steps.
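The full protocol is described in the paper. Purely as an illustrative sketch of how an explanation-grounded check could be wired up, the snippet below asks an LLM judge to compare a model's reasoning trace against the curated explanation; the judge model, prompt wording, and VALID/INVALID parsing are assumptions, not the authors' implementation.

    # Hypothetical sketch of an explanation-grounded reasoning check.
    # The judge model, prompt wording, and response parsing are assumptions.
    from openai import OpenAI

    client = OpenAI()  # any OpenAI-compatible endpoint could be substituted

    JUDGE_PROMPT = """You are grading a student's solution to a STEM problem.
    Question: {question}
    Reference explanation: {explanation}
    Student answer: {answer}
    Student reasoning: {reasoning}
    Reply VALID if the final answer is correct AND every intermediate step is
    consistent with the reference explanation; otherwise reply INVALID."""

    def reasoning_is_valid(question, explanation, answer, reasoning):
        """Return True when the judge deems the reasoning chain valid."""
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder judge model
            temperature=0,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                question=question, explanation=explanation,
                answer=answer, reasoning=reasoning)}],
        )
        return response.choices[0].message.content.strip().upper().startswith("VALID")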

CSVQA Challenges

Dataset Source

The dataset is sourced from publicly available Chinese high school textbooks and examination papers across STEM disciplines. To ensure high-quality alignment, we use a four-phase quality-control pipeline that improves efficiency over traditional methods.

The process begins by parsing the source materials and applying OCR to extract textual and visual content. We then run an automated alignment pipeline powered by DeepSeekV3 to establish semantic correspondences between questions and answers.

Manual screening then addresses complex cases, such as multi-page layouts and mixed text-image formats. Finally, the benchmark undergoes three independent reviews: schema validation, integrity checks for completeness, and annotator-assisted domain-specific audits to ensure subject accuracy.

From an initial pool of approximately 100,000 raw entries, we filter out unsuitable question types like proof-based or diagram-drawing tasks, discard samples without associated images, and remove mismatched question-answer pairs as flagged by the LLM. A human-curated subset of high-quality multimodal items is retained for the final dataset.
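As a loose illustration of this filtering stage (not the authors' actual code), the sketch below applies the three automatic filters described above; all field names here are hypothetical, not the real schema.

    # Illustrative sketch of the automatic filtering stage; the field names
    # ("question_type", "images", "llm_alignment_flag") are assumptions.
    UNSUITABLE_TYPES = {"proof", "diagram-drawing"}

    def keep_entry(entry: dict) -> bool:
        """Return True if a raw entry survives the three automatic filters."""
        if entry.get("question_type") in UNSUITABLE_TYPES:
            return False                      # drop proof / drawing tasks
        if not entry.get("images"):
            return False                      # drop samples without an image
        if entry.get("llm_alignment_flag") == "mismatch":
            return False                      # drop LLM-flagged QA mismatches
        return True

    def filter_pool(raw_entries: list[dict]) -> list[dict]:
        """Reduce the ~100k raw pool to candidates for human curation."""
        return [e for e in raw_entries if keep_entry(e)]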

Key Features

CSVQA is distinguished by three key features. First, its coverage of multiple STEM disciplines requires diverse domain knowledge and reasoning strategies. Second, the inclusion of 14 distinct visual modalities introduces significant variation in visual structure and complexity, testing a model's ability to generalize across image types. Third, many questions are grounded in real-world scenarios and demand domain-specific knowledge, requiring models to go beyond pattern recognition and engage in context-aware reasoning.

An overview of the dataset's composition is presented in the table below. CSVQA contains 1,378 expert-annotated questions of moderate average length, balancing language-processing load and reasoning depth. 81.1% of the items are paired with a detailed explanation, which is particularly valuable for analyzing logical missteps in model predictions. Furthermore, we provide a bilingual version of the dataset, generated via translation, which allows for a broader range of test scenarios.

Statistics                    Number
Total Questions               1,378
Image Types                   14
Easy : Medium : Hard          22.6% : 67.4% : 10.0%
Multiple-choice Questions     1,278
Open Questions                100
With an Explanation           81.1%
Image in the Question         1,341
Image in Option               37
Average Question Length       69.7
Average Option Length         12.1
Average Explanation Length    123.5

To allow comparison with other datasets, the length statistics are computed on the English translations.
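As a rough illustration of how such statistics could be computed, the snippet below measures whitespace-token lengths and explanation coverage; the record fields (question_en, explanation_en) are hypothetical, not the published schema.

    # Rough sketch for reproducing length statistics on English text; the
    # field names ("question_en", "explanation_en") are assumptions.
    from statistics import mean

    def avg_word_len(texts):
        """Average whitespace-tokenised length, ignoring empty entries."""
        lengths = [len(t.split()) for t in texts if t]
        return round(mean(lengths), 1) if lengths else 0.0

    def summarize(samples):
        """Print a few of the statistics reported in the table above."""
        print("Average Question Length:",
              avg_word_len(s["question_en"] for s in samples))
        print("Average Explanation Length:",
              avg_word_len(s.get("explanation_en") for s in samples))
        print("With an Explanation: {:.1%}".format(
            sum(1 for s in samples if s.get("explanation_en")) / len(samples)))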

CSVQA Benchmark

Leaderboard on CSVQA

Accuracy (%) of different models on CSVQA, broken down by subject and by question type (Open = open-ended, MC = multiple-choice). Top scores in each column are underlined and bolded; second-best scores are bolded.

Model                Overall   Biology   Chemistry   Math   Physics   Open     MC
Random Choice           5.2       5.1        6.2      4.5      5.7     0.0    5.7
Open-source VLM
Fuyu-8B                 4.9       6.3        5.6      3.5      4.3     2.0    5.1
Deepseek-VL2            6.2       7.0        6.2      7.6      4.5     8.0    6.0
LLaVA1.5-13B            7.5      10.7        9.4      5.4      5.5     4.0    7.8
MonoInternVL            9.3       7.3        9.1      9.2     10.9     3.0    9.8
Idefics3-8b            10.1      11.7       15.2      7.0      7.1     4.0   10.6
Pixtral-12B            10.5      15.3        8.8      8.6     10.0     5.0   10.9
Phi-4                  11.5      13.3       16.1      8.9      8.3     7.0   11.8
Gemma3-27B             22.9      26.0       23.5     27.0     17.1    23.0   22.9
Internvl2-5-78B        28.4      36.3       36.1     24.1     19.7    16.0   29.3
QVQ-72B                36.6      40.7       41.3     33.7     32.0    32.0   36.9
Internvl3-78B          37.4      46.0       41.1     36.5     28.9    30.0   38.0
Qwen2.5VL-72B          38.5      45.7       40.8     37.5     32.2    29.0   39.2
Closed-source VLM
GPT-4o                 23.6      28.0       23.5     23.5     20.6    18.0   24.0
Claude3.7              36.6      41.7       38.1     37.1     31.3    32.0   36.9
Gemini2.0-flash        44.1      45.0       45.5     47.6     39.8    46.0   44.0
o1                     49.6      46.2       45.1     59.0     49.1    41.3   50.2
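Comparable numbers should come from the official evaluation pipeline on GitHub; purely as a sketch of the scoring step, the snippet below computes per-subject accuracy on the multiple-choice split with a simple regex-based answer extractor. The record fields and the extraction rule are assumptions.

    # Illustrative per-subject accuracy for the multiple-choice split; the
    # record fields and regex-based answer extraction are assumptions.
    import re
    from collections import defaultdict

    def extract_choice(prediction):
        """Pull the last standalone option letter (A-D) out of a response."""
        matches = re.findall(r"\b([A-D])\b", prediction.upper())
        return matches[-1] if matches else None

    def accuracy_by_subject(records):
        """records: iterable of {"subject", "answer", "prediction"} dicts."""
        hits, totals = defaultdict(int), defaultdict(int)
        for r in records:
            totals[r["subject"]] += 1
            if extract_choice(r["prediction"]) == r["answer"]:
                hits[r["subject"]] += 1
        return {s: 100.0 * hits[s] / totals[s] for s in totals}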

BibTeX


    @misc{jian2025csvqachinesemultimodalbenchmark,
        title={CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs},
        author={Ai Jian and Weijie Qiu and Xiaokun Wang and Peiyu Wang and Yunzhuo Hao and Jiangbo Pei and Yichen Wei and Yi Peng and Xuchen Song},
        year={2025},
        eprint={2505.24120},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2505.24120},
    }