-
The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts
Authors:
Hatice Gurdil,
Hatice Ozlem Anadol,
Yesim Beril Soguksu
Abstract:
This study investigated whether AI evaluators assess the content validity of B1-level English reading comprehension test items in a manner similar to human evaluators. A 25-item multiple-choice test was developed, and its items were evaluated by four human and four AI evaluators. No statistically significant difference was found between the scores given by human and AI evaluators, and similar evaluation trends were observed. The Content Validity Ratio (CVR) and the Item Content Validity Index (I-CVI) were calculated and compared using the Wilcoxon Signed-Rank Test, which likewise showed no statistically significant difference. The findings indicated that, in some cases, AI evaluators could replace human evaluators. However, differences on specific items were thought to arise from varying interpretations of the evaluation criteria. Ensuring linguistic clarity and clearly defining criteria could contribute to more consistent evaluations. In this regard, the development of hybrid evaluation systems, in which AI technologies are used alongside human experts, is recommended.
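For readers unfamiliar with the two indices named in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' code. It assumes Lawshe's formula for the CVR, (n_e - N/2) / (N/2), and the usual definition of the I-CVI as the proportion of evaluators who rate an item as relevant. The 4-point scale, the threshold of 3, the example ratings, and the function names are all hypothetical, and reusing the same ratings for both indices is a simplification.

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR: (n_e - N/2) / (N/2), ranging from -1 to 1."""
    half = n_experts / 2
    return (n_essential - half) / half


def item_cvi(ratings, relevant_threshold=3):
    """I-CVI: share of evaluators rating the item at or above the threshold."""
    return sum(r >= relevant_threshold for r in ratings) / len(ratings)


# Hypothetical ratings of one item by four evaluators (assumed 4-point scale).
ratings = [4, 3, 4, 2]

# Assumption: a rating of 3 or 4 counts as "essential" for the CVR as well.
n_essential = sum(r >= 3 for r in ratings)

cvr = content_validity_ratio(n_essential, len(ratings))
icvi = item_cvi(ratings)
print(cvr, icvi)
```

With four evaluators, one dissenting rating is enough to move the CVR from 1.0 to 0.5, which is one reason small panels can produce item-level differences between groups.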
Submitted 3 February, 2025;
originally announced March 2025.
-
A Comprehensive Guide to Item Recovery Using the Multidimensional Graded Response Model in R
Authors:
Yesim Beril Soguksu,
Ayse Bilicioglu Gunes,
Hatice Gurdil
Abstract:
The purpose of this study is to provide a step-by-step demonstration of item recovery for the Multidimensional Graded Response Model (MGRM) in R. Within this scope, a sample simulation design was constructed in which the test lengths were set to 20 and 40, the interdimensional correlations were set to 0.3 and 0.7, and the sample size was fixed at 2000. Parameter estimates were derived from the generated datasets for the 3-dimensional GRM, and bias and Root Mean Square Error (RMSE) values were calculated and visualized. In line with the aim of the study, the R code for all these steps is presented along with detailed explanations, enabling researchers to replicate and adapt the procedures for their own analyses. This study is expected to contribute to the literature by serving as a practical guide for implementing item recovery in the MGRM. In addition, the methods presented, including data generation, parameter estimation, and result visualization, are anticipated to benefit researchers even if they are not directly engaged in item recovery.
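The two recovery criteria named in the abstract can be illustrated with a short sketch. This is not the paper's R code. It is a Python illustration that assumes the standard definitions, where bias is the mean of (estimate - true value) across replications and RMSE is the square root of the mean squared difference. The parameter value, the estimates, and the function names are hypothetical.

```python
import math


def bias(estimates, true_value):
    """Mean signed error of the estimates around the generating value."""
    return sum(e - true_value for e in estimates) / len(estimates)


def rmse(estimates, true_value):
    """Root mean square error of the estimates around the generating value."""
    squared_errors = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical discrimination estimates for one item over four replications.
true_a = 1.0
a_estimates = [1.5, 0.5, 1.5, 0.5]

print(bias(a_estimates, true_a), rmse(a_estimates, true_a))
```

In this example the errors cancel, so the bias is zero while the RMSE is not. This is why recovery studies usually report both quantities.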
Submitted 24 December, 2024; v1 submitted 21 December, 2024;
originally announced December 2024.
-
Integration of Artificial Intelligence in Educational Measurement: Efficacy of ChatGPT in Data Generation within the Scope of Item Response Theory
Authors:
Hatice Gurdil,
Yesim Beril Soguksu,
Salih Salihoglu,
Fatma Coskun
Abstract:
The aim of this study is to investigate the effectiveness of ChatGPT 3.5 in developing algorithms for data generation within the framework of Item Response Theory (IRT) using the R programming language. In this context, validity examinations were conducted on data sets generated according to the Two-Parameter Logistic Model (2PLM) with algorithms written by ChatGPT 3.5 and by the researchers. These examinations considered whether the data sets met the IRT assumptions and the simulation conditions of the item parameters. As a result, it was determined that while ChatGPT 3.5 was quite successful in generating data that met the IRT assumptions, it was less effective than the algorithm developed by the researchers in meeting the simulation conditions of the item parameters. In this regard, ChatGPT 3.5 is recommended as a useful tool that researchers can use in developing data generation algorithms for IRT.
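As an illustration of what data generation under the 2PLM involves, here is a minimal sketch. It is neither the researchers' algorithm nor the one written by ChatGPT 3.5, and it is in Python rather than R. It assumes the standard 2PL response function P = 1 / (1 + exp(-a(theta - b))) without the optional scaling constant, and abilities drawn from a standard normal distribution. The function names, item parameters, and seed are hypothetical.

```python
import math
import random


def p_correct(theta, a, b):
    """2PL probability of a correct response for ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def simulate_2pl(n_persons, a_params, b_params, seed=0):
    """Return an n_persons x n_items matrix of 0/1 responses."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # assumed N(0, 1) ability distribution
        row = [
            1 if rng.random() < p_correct(theta, a, b) else 0
            for a, b in zip(a_params, b_params)
        ]
        data.append(row)
    return data


# Hypothetical item parameters for a three-item test.
a_params = [0.8, 1.2, 1.6]
b_params = [-1.0, 0.0, 1.0]
responses = simulate_2pl(200, a_params, b_params, seed=42)
```

A validity check in the spirit of the abstract would then fit a 2PL model to `responses` and compare the estimated item parameters with `a_params` and `b_params`.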
Submitted 3 July, 2024; v1 submitted 28 January, 2024;
originally announced February 2024.