Can GPT-4V Score Drawn Models Automatically in NERIF?

Original title: NERIF: GPT-4V for Automatic Scoring of Drawn Models

Authors: Gyeong-Geon Lee, Xiaoming Zhai

This article explores using GPT-4V, OpenAI's vision-capable multimodal model, to automatically score student-drawn scientific models. The NERIF method combines scoring rubrics with instructional notes to prompt GPT-4V to assign scores. The study used a dataset of 900 student-drawn models across six science tasks and compared GPT-4V's scores to human expert scores, finding an average accuracy of 51%. GPT-4V performed better on 'Beginning' and 'Developing' models (64% and 62% accuracy, respectively) than on 'Proficient' models (26%). Qualitative analysis revealed how GPT-4V interprets image inputs, capturing the task context, evaluative judgments, and characteristics of the drawings. While there is clear room for improvement, the method proved workable, and some of the mis-assigned scores were understandable to the experts. The study suggests that GPT-4V holds promise for automatic scoring of student-drawn models in science assessment.
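To make the prompting setup concrete, here is a minimal sketch of how a rubric plus instructional notes might be paired with a student drawing in a single vision-model request. The rubric text, notes, model name, and helper function below are illustrative assumptions, not the study's actual NERIF materials or code.

```python
# Hypothetical sketch: assembling a rubric-based scoring prompt for a
# vision-capable chat model. All strings here are placeholders, not the
# actual NERIF rubric or notes from the paper.
import base64

RUBRIC = """Score the student-drawn model using this rubric:
- Beginning: the drawing shows isolated elements with no mechanism.
- Developing: the drawing shows some relevant elements and partial mechanism.
- Proficient: the drawing shows a complete, scientifically accurate mechanism."""

NOTES = "Instructional notes: judge the science content, not drawing quality."

def build_scoring_request(image_bytes: bytes,
                          model: str = "gpt-4-vision-preview") -> dict:
    """Assemble a chat-completion-style payload that pairs the rubric
    prompt with a base64-encoded image of the student's drawing."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are a science-assessment scorer."},
            {"role": "user",
             "content": [
                 {"type": "text",
                  "text": f"{RUBRIC}\n\n{NOTES}\n\n"
                          "Assign one level and explain briefly."},
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{b64}"}},
             ]},
        ],
    }

# Placeholder bytes stand in for a real PNG of a student drawing.
request = build_scoring_request(b"\x89PNG-placeholder")
```

The returned dictionary could then be sent to a chat-completions endpoint; the key idea NERIF adds is that the rubric and notes travel in the same prompt as the image, so the model scores against explicit criteria rather than its own defaults.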

Original article: https://arxiv.org/abs/2311.12990