-
Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison
Authors:
Aymeric de Chillaz,
Anna Sotnikova,
Patrick Jermann,
Antoine Bosselut
Abstract:
Generative AI systems have rapidly advanced, with multimodal input capabilities enabling reasoning beyond text-based tasks. In education, these advancements could influence assessment design and question answering, presenting both opportunities and challenges. To investigate these effects, we introduce a high-quality dataset of 201 university-level STEM questions, manually annotated with features…
▽ More
Generative AI systems have rapidly advanced, with multimodal input capabilities enabling reasoning beyond text-based tasks. In education, these advancements could influence assessment design and question answering, presenting both opportunities and challenges. To investigate these effects, we introduce a high-quality dataset of 201 university-level STEM questions, manually annotated with features such as image type, role, problem complexity, and question format. Our study analyzes how these features affect generative AI performance compared to students. We evaluate four model families with five prompting strategies, comparing results to the average of 546 student responses per question. Although the best model correctly answers on average 58.5 % of the questions using majority vote aggregation, human participants consistently outperform AI on questions involving visual components. Interestingly, human performance remains stable across question features but varies by subject, whereas AI performance is susceptible to both subject matter and question features. Finally, we provide actionable insights for educators, demonstrating how question design can enhance academic integrity by leveraging features that challenge current AI systems without increasing the cognitive burden for students.
△ Less
Submitted 2 July, 2025;
originally announced July 2025.
-
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Authors:
Angelika Romanou,
Negar Foroutan,
Anna Sotnikova,
Zeming Chen,
Sree Harsha Nelaturu,
Shivalika Singh,
Rishabh Maheshwary,
Micol Altomare,
Mohamed A. Haggag,
Snegha A,
Alfonso Amayuelas,
Azril Hafizi Amirudin,
Viraat Aryabumi,
Danylo Boiko,
Michael Chang,
Jenny Chim,
Gal Cohen,
Aditya Kumar Dalmia,
Abraham Diress,
Sharad Duwal,
Daniil Dzenhaliou,
Daniel Fernando Erazo Florez,
Fabian Farestam,
Joseph Marvin Imperial,
Shayekh Bin Islam
, et al. (34 additional authors not shown)
Abstract:
The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (\ie, multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other th…
▽ More
The performance differential of large language models (LLM) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (\ie, multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.
△ Less
Submitted 29 November, 2024;
originally announced November 2024.
-
Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants
Authors:
Beatriz Borges,
Negar Foroutan,
Deniz Bayazit,
Anna Sotnikova,
Syrielle Montariol,
Tanya Nazaretzky,
Mohammadreza Banaei,
Alireza Sakhaeirad,
Philippe Servant,
Seyed Parsa Neshaei,
Jibril Frej,
Angelika Romanou,
Gail Weiss,
Sepideh Mamooler,
Zeming Chen,
Simin Fan,
Silin Gao,
Mete Ismayilzada,
Debjit Paul,
Alexandre Schöpfer,
Andrej Janchevski,
Anja Tiede,
Clarence Linden,
Emanuele Troiani,
Francesco Salvi
, et al. (65 additional authors not shown)
Abstract:
AI assistants are being increasingly used by students enrolled in higher education institutions. While these tools provide opportunities for improved teaching and education, they also pose significant challenges for assessment and learning outcomes. We conceptualize these challenges through the lens of vulnerability, the potential for university assessments and learning outcomes to be impacted by…
▽ More
AI assistants are being increasingly used by students enrolled in higher education institutions. While these tools provide opportunities for improved teaching and education, they also pose significant challenges for assessment and learning outcomes. We conceptualize these challenges through the lens of vulnerability, the potential for university assessments and learning outcomes to be impacted by student use of generative AI. We investigate the potential scale of this vulnerability by measuring the degree to which AI assistants can complete assessment questions in standard university-level STEM courses. Specifically, we compile a novel dataset of textual assessment questions from 50 courses at EPFL and evaluate whether two AI assistants, GPT-3.5 and GPT-4 can adequately answer these questions. We use eight prompting strategies to produce responses and find that GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions. When grouping courses in our dataset by degree program, these systems already pass non-project assessments of large numbers of core courses in various degree programs, posing risks to higher education accreditation that will be amplified as these models improve. Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
△ Less
Submitted 27 November, 2024; v1 submitted 7 August, 2024;
originally announced August 2024.
-
"Flex Tape Can't Fix That": Bias and Misinformation in Edited Language Models
Authors:
Karina Halevy,
Anna Sotnikova,
Badr AlKhamissi,
Syrielle Montariol,
Antoine Bosselut
Abstract:
Model editing has emerged as a cost-effective strategy to update knowledge stored in language models. However, model editing can have unintended consequences after edits are applied: information unrelated to the edits can also be changed, and other general behaviors of the model can be wrongly altered. In this work, we investigate how model editing methods unexpectedly amplify model biases post-ed…
▽ More
Model editing has emerged as a cost-effective strategy to update knowledge stored in language models. However, model editing can have unintended consequences after edits are applied: information unrelated to the edits can also be changed, and other general behaviors of the model can be wrongly altered. In this work, we investigate how model editing methods unexpectedly amplify model biases post-edit. We introduce a novel benchmark dataset, Seesaw-CF, for measuring bias-related harms of model editing and conduct the first in-depth investigation of how different weight-editing methods impact model bias. Specifically, we focus on biases with respect to demographic attributes such as race, geographic origin, and gender, as well as qualitative flaws in long-form texts generated by edited language models. We find that edited models exhibit, to various degrees, more biased behavior as they become less confident in attributes for Asian, African, and South American subjects. Furthermore, edited models amplify sexism and xenophobia in text generations while remaining seemingly coherent and logical. Finally, editing facts about place of birth, country of citizenship, or gender have particularly negative effects on the model's knowledge about unrelated features like field of work.
△ Less
Submitted 3 October, 2024; v1 submitted 29 February, 2024;
originally announced March 2024.
-
Multilingual large language models leak human stereotypes across language boundaries
Authors:
Yang Trista Cao,
Anna Sotnikova,
Jieyu Zhao,
Linda X. Zou,
Rachel Rudinger,
Hal Daume III
Abstract:
Multilingual large language models have gained prominence for their proficiency in processing and generating text across languages. Like their monolingual counterparts, multilingual models are likely to pick up on stereotypes and other social biases present in their training data. In this paper, we study a phenomenon we term stereotype leakage, which refers to how training a model multilingually m…
▽ More
Multilingual large language models have gained prominence for their proficiency in processing and generating text across languages. Like their monolingual counterparts, multilingual models are likely to pick up on stereotypes and other social biases present in their training data. In this paper, we study a phenomenon we term stereotype leakage, which refers to how training a model multilingually may lead to stereotypes expressed in one language showing up in the models' behaviour in another. We propose a measurement framework for stereotype leakage and investigate its effect across English, Russian, Chinese, and Hindi and with GPT-3.5, mT5, and mBERT. Our findings show a noticeable leakage of positive, negative, and non-polar associations across all languages. We find that of these models, GPT-3.5 exhibits the most stereotype leakage, and Hindi is the most susceptible to leakage effects. WARNING: This paper contains model outputs which could be offensive in nature.
△ Less
Submitted 19 November, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Analyzing and modeling network travel patterns during the Ukraine invasion using crowd-sourced pervasive traffic data
Authors:
S. Travis Waller,
Moeid Qurashi,
Anna Sotnikova,
Lavina Karva,
Sai Chand
Abstract:
In 2022, Ukraine is suffering an invasion which has resulted in acute impacts playing out over time and geography. This paper examines the impact of the ongoing disruption on traffic behavior using analytics as well as zonal-based network models. The methodology is a data-driven approach that utilizes obtained travel-time conditions within an evolutionary algorithm framework which infers origin-de…
▽ More
In 2022, Ukraine is suffering an invasion which has resulted in acute impacts playing out over time and geography. This paper examines the impact of the ongoing disruption on traffic behavior using analytics as well as zonal-based network models. The methodology is a data-driven approach that utilizes obtained travel-time conditions within an evolutionary algorithm framework which infers origin-destination demand values in an automated process based on traffic assignment. Because of the automation of the implementation, numerous daily models can be approximated for multiple cities. The novelty of this paper versus the previously published core methodology includes an analysis to ensure the obtained data is appropriate since some data sources were disabled due to the ongoing disruption. Further, novelty includes a direct linkage of the analysis to the timeline of disruptions to examine the interaction in a new way. Finally, specific network metrics are identified which are particularly suited for conceptualizing the impact of conflict disruptions on traffic network conditions. The ultimate aim is to establish processes, concepts and analysis to advance the broader activity of rapidly quantifying the traffic impacts of conflict scenarios.
△ Less
Submitted 8 August, 2022;
originally announced August 2022.
-
Theory-Grounded Measurement of U.S. Social Stereotypes in English Language Models
Authors:
Yang Trista Cao,
Anna Sotnikova,
Hal Daumé III,
Rachel Rudinger,
Linda Zou
Abstract:
NLP models trained on text have been shown to reproduce human stereotypes, which can magnify harms to marginalized groups when systems are deployed at scale. We adapt the Agency-Belief-Communion (ABC) stereotype model of Koch et al. (2016) from social psychology as a framework for the systematic study and discovery of stereotypic group-trait associations in language models (LMs). We introduce the…
▽ More
NLP models trained on text have been shown to reproduce human stereotypes, which can magnify harms to marginalized groups when systems are deployed at scale. We adapt the Agency-Belief-Communion (ABC) stereotype model of Koch et al. (2016) from social psychology as a framework for the systematic study and discovery of stereotypic group-trait associations in language models (LMs). We introduce the sensitivity test (SeT) for measuring stereotypical associations from language models. To evaluate SeT and other measures using the ABC model, we collect group-trait judgments from U.S.-based subjects to compare with English LM stereotypes. Finally, we extend this framework to measure LM stereotyping of intersectional identities.
△ Less
Submitted 23 June, 2022;
originally announced June 2022.