-
Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants
Authors:
Beatriz Borges,
Negar Foroutan,
Deniz Bayazit,
Anna Sotnikova,
Syrielle Montariol,
Tanya Nazaretzky,
Mohammadreza Banaei,
Alireza Sakhaeirad,
Philippe Servant,
Seyed Parsa Neshaei,
Jibril Frej,
Angelika Romanou,
Gail Weiss,
Sepideh Mamooler,
Zeming Chen,
Simin Fan,
Silin Gao,
Mete Ismayilzada,
Debjit Paul,
Alexandre Schöpfer,
Andrej Janchevski,
Anja Tiede,
Clarence Linden,
Emanuele Troiani,
Francesco Salvi
, et al. (65 additional authors not shown)
Abstract:
AI assistants are being increasingly used by students enrolled in higher education institutions. While these tools provide opportunities for improved teaching and education, they also pose significant challenges for assessment and learning outcomes. We conceptualize these challenges through the lens of vulnerability, the potential for university assessments and learning outcomes to be impacted by…
▽ More
AI assistants are being increasingly used by students enrolled in higher education institutions. While these tools provide opportunities for improved teaching and education, they also pose significant challenges for assessment and learning outcomes. We conceptualize these challenges through the lens of vulnerability, the potential for university assessments and learning outcomes to be impacted by student use of generative AI. We investigate the potential scale of this vulnerability by measuring the degree to which AI assistants can complete assessment questions in standard university-level STEM courses. Specifically, we compile a novel dataset of textual assessment questions from 50 courses at EPFL and evaluate whether two AI assistants, GPT-3.5 and GPT-4 can adequately answer these questions. We use eight prompting strategies to produce responses and find that GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer across at least one prompting strategy for 85.1% of questions. When grouping courses in our dataset by degree program, these systems already pass non-project assessments of large numbers of core courses in various degree programs, posing risks to higher education accreditation that will be amplified as these models improve. Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
△ Less
Submitted 27 November, 2024; v1 submitted 7 August, 2024;
originally announced August 2024.
-
Soft Dice Confidence: A Near-Optimal Confidence Estimator for Selective Prediction in Semantic Segmentation
Authors:
Bruno Laboissiere Camargos Borges,
Bruno Machado Pacheco,
Danilo Silva
Abstract:
Selective prediction augments a model with the option to abstain from providing unreliable predictions. The key ingredient is a confidence score function, which should be directly related to the conditional risk. In the case of binary semantic segmentation, existing score functions either ignore the particularities of the evaluation metric or demand additional held-out data for tuning. We propose…
▽ More
Selective prediction augments a model with the option to abstain from providing unreliable predictions. The key ingredient is a confidence score function, which should be directly related to the conditional risk. In the case of binary semantic segmentation, existing score functions either ignore the particularities of the evaluation metric or demand additional held-out data for tuning. We propose the Soft Dice Confidence (SDC), a simple, tuning-free confidence score function that directly aligns with the Dice coefficient metric. We prove that, under conditional independence, the SDC is near optimal: we establish upper and lower bounds on the ratio between the SDC and the ideal (intractable) confidence score function and show that these bounds are very close to 1. Experiments on six public medical-imaging benchmarks and on synthetic data corroborate our theoretical findings. In fact, SDC outperformed all prior confidence estimators from the literature in all of our experiments, including those that rely on additional data. These results position SDC as a reliable and efficient confidence estimator for selective prediction in semantic segmentation.
△ Less
Submitted 30 June, 2025; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Innovation and Word Usage Patterns in Machine Learning
Authors:
Vítor Bandeira Borges,
Daniel Oliveira Cajueiro
Abstract:
In this study, we delve into the dynamic landscape of machine learning research evolution. Initially, through the utilization of Latent Dirichlet Allocation, we discern pivotal themes and fundamental concepts that have emerged within the realm of machine learning. Subsequently, we undertake a comprehensive analysis to track the evolutionary trajectories of these identified themes. To quantify the…
▽ More
In this study, we delve into the dynamic landscape of machine learning research evolution. Initially, through the utilization of Latent Dirichlet Allocation, we discern pivotal themes and fundamental concepts that have emerged within the realm of machine learning. Subsequently, we undertake a comprehensive analysis to track the evolutionary trajectories of these identified themes. To quantify the novelty and divergence of research contributions, we employ the Kullback-Leibler Divergence metric. This statistical measure serves as a proxy for ``surprise'', indicating the extent of differentiation between the content of academic papers and the subsequent developments in research. By amalgamating these insights, we gain the ability to ascertain the pivotal roles played by prominent researchers and the significance of specific academic venues (periodicals and conferences) within the machine learning domain.
△ Less
Submitted 6 November, 2023;
originally announced November 2023.
-
Let Me Teach You: Pedagogical Foundations of Feedback for Language Models
Authors:
Beatriz Borges,
Niket Tandon,
Tanja Käser,
Antoine Bosselut
Abstract:
Natural Language Feedback (NLF) is an increasingly popular mechanism for aligning Large Language Models (LLMs) to human preferences. Despite the diversity of the information it can convey, NLF methods are often hand-designed and arbitrary, with little systematic grounding. At the same time, research in learning sciences has long established several effective feedback models. In this opinion piece,…
▽ More
Natural Language Feedback (NLF) is an increasingly popular mechanism for aligning Large Language Models (LLMs) to human preferences. Despite the diversity of the information it can convey, NLF methods are often hand-designed and arbitrary, with little systematic grounding. At the same time, research in learning sciences has long established several effective feedback models. In this opinion piece, we compile ideas from pedagogy to introduce FELT, a feedback framework for LLMs that outlines various characteristics of the feedback space, and a feedback content taxonomy based on these variables, providing a general mapping of the feedback space. In addition to streamlining NLF designs, FELT also brings out new, unexplored directions for research in NLF. We make our taxonomy available to the community, providing guides and examples for mapping our categorizations to future research.
△ Less
Submitted 23 October, 2024; v1 submitted 1 July, 2023;
originally announced July 2023.
-
PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives
Authors:
Silin Gao,
Beatriz Borges,
Soyoung Oh,
Deniz Bayazit,
Saya Kanno,
Hiromi Wakaki,
Yuki Mitsufuji,
Antoine Bosselut
Abstract:
Sustaining coherent and engaging narratives requires dialogue or storytelling agents to understand how the personas of speakers or listeners ground the narrative. Specifically, these agents must infer personas of their listeners to produce statements that cater to their interests. They must also learn to maintain consistent speaker personas for themselves throughout the narrative, so that their co…
▽ More
Sustaining coherent and engaging narratives requires dialogue or storytelling agents to understand how the personas of speakers or listeners ground the narrative. Specifically, these agents must infer personas of their listeners to produce statements that cater to their interests. They must also learn to maintain consistent speaker personas for themselves throughout the narrative, so that their counterparts feel involved in a realistic conversation or story.
However, personas are diverse and complex: they entail large quantities of rich interconnected world knowledge that is challenging to robustly represent in general narrative systems (e.g., a singer is good at singing, and may have attended conservatoire). In this work, we construct a new large-scale persona commonsense knowledge graph, PeaCoK, containing ~100K human-validated persona facts. Our knowledge graph schematizes five dimensions of persona knowledge identified in previous studies of human interactive behaviours, and distils facts in this schema from both existing commonsense knowledge graphs and large-scale pretrained language models. Our analysis indicates that PeaCoK contains rich and precise world persona inferences that help downstream systems generate more consistent and engaging narratives.
△ Less
Submitted 26 May, 2023; v1 submitted 3 May, 2023;
originally announced May 2023.
-
REFINER: Reasoning Feedback on Intermediate Representations
Authors:
Debjit Paul,
Mete Ismayilzada,
Maxime Peyrard,
Beatriz Borges,
Antoine Bosselut,
Robert West,
Boi Faltings
Abstract:
Language models (LMs) have recently shown remarkable performance on reasoning tasks by explicitly generating intermediate inferences, e.g., chain-of-thought prompting. However, these intermediate inference steps may be inappropriate deductions from the initial context and lead to incorrect final predictions. Here we introduce REFINER, a framework for finetuning LMs to explicitly generate intermedi…
▽ More
Language models (LMs) have recently shown remarkable performance on reasoning tasks by explicitly generating intermediate inferences, e.g., chain-of-thought prompting. However, these intermediate inference steps may be inappropriate deductions from the initial context and lead to incorrect final predictions. Here we introduce REFINER, a framework for finetuning LMs to explicitly generate intermediate reasoning steps while interacting with a critic model that provides automated feedback on the reasoning. Specifically, the critic provides structured feedback that the reasoning LM uses to iteratively improve its intermediate arguments. Empirical evaluations of REFINER on three diverse reasoning tasks show significant improvements over baseline LMs of comparable scale. Furthermore, when using GPT-3.5 or ChatGPT as the reasoner, the trained critic significantly improves reasoning without finetuning the reasoner. Finally, our critic model is trained without expensive human-in-the-loop data but can be substituted with humans at inference time.
△ Less
Submitted 4 February, 2024; v1 submitted 4 April, 2023;
originally announced April 2023.
-
Laughing Heads: Can Transformers Detect What Makes a Sentence Funny?
Authors:
Maxime Peyrard,
Beatriz Borges,
Kristina Gligorić,
Robert West
Abstract:
The automatic detection of humor poses a grand challenge for natural language processing. Transformer-based systems have recently achieved remarkable results on this task, but they usually (1)~were evaluated in setups where serious vs humorous texts came from entirely different sources, and (2)~focused on benchmarking performance without providing insights into how the models work. We make progres…
▽ More
The automatic detection of humor poses a grand challenge for natural language processing. Transformer-based systems have recently achieved remarkable results on this task, but they usually (1)~were evaluated in setups where serious vs humorous texts came from entirely different sources, and (2)~focused on benchmarking performance without providing insights into how the models work. We make progress in both respects by training and analyzing transformer-based humor recognition models on a recently introduced dataset consisting of minimal pairs of aligned sentences, one serious, the other humorous. We find that, although our aligned dataset is much harder than previous datasets, transformer-based models recognize the humorous sentence in an aligned pair with high accuracy (78%). In a careful error analysis, we characterize easy vs hard instances. Finally, by analyzing attention weights, we obtain important insights into the mechanisms by which transformers recognize humor. Most remarkably, we find clear evidence that one single attention head learns to recognize the words that make a test sentence humorous, even without access to this information at training time.
△ Less
Submitted 25 August, 2021; v1 submitted 19 May, 2021;
originally announced May 2021.
-
Leveraging the Self-Transition Probability of Ordinal Pattern Transition Graph for Transportation Mode Classification
Authors:
I. Cardoso-Pereira,
J. B. Borges,
P. H. Barros,
A. F. Loureiro,
O. A. Rosso,
H. S. Ramos
Abstract:
The analysis of GPS trajectories is a well-studied problem in Urban Computing and has been used to track people. Analyzing people mobility and identifying the transportation mode used by them is essential for cities that want to reduce traffic jams and travel time between their points, thus helping to improve the quality of life of citizens. The trajectory data of a moving object is represented by…
▽ More
The analysis of GPS trajectories is a well-studied problem in Urban Computing and has been used to track people. Analyzing people mobility and identifying the transportation mode used by them is essential for cities that want to reduce traffic jams and travel time between their points, thus helping to improve the quality of life of citizens. The trajectory data of a moving object is represented by a discrete collection of points through time, i.e., a time series. Regarding its interdisciplinary and broad scope of real-world applications, it is evident the need of extracting knowledge from time series data. Mining this type of data, however, faces several complexities due to its unique properties. Different representations of data may overcome this. In this work, we propose the use of a feature retained from the Ordinal Pattern Transition Graph, called the probability of self-transition for transportation mode classification. The proposed feature presents better accuracy results than Permutation Entropy and Statistical Complexity, even when these two are combined. This is the first work, to the best of our knowledge, that uses Information Theory quantifiers to transportation mode classification, showing that it is a feasible approach to this kind of problem.
△ Less
Submitted 16 July, 2020;
originally announced July 2020.