-
Instruction Following by Boosting Attention of Large Language Models
Authors:
Vitoria Guardieiro,
Adam Stein,
Avishree Khare,
Eric Wong
Abstract:
Controlling the generation of large language models (LLMs) remains a central challenge to ensure their safe and reliable deployment. While prompt engineering and finetuning are common approaches, recent work has explored latent steering, a lightweight technique that alters LLM internal activations to guide generation. However, subsequent studies revealed latent steering's effectiveness to be limited, often underperforming simple instruction prompting. To address this limitation, we first establish a benchmark across diverse behaviors for standardized evaluation of steering techniques. Building on insights from this benchmark, we introduce Instruction Attention Boosting (InstABoost), a latent steering method that boosts the strength of instruction prompting by altering the model's attention during generation. InstABoost combines the strengths of existing approaches and is theoretically supported by prior work that suggests that in-context rule following in transformer-based models can be controlled by manipulating attention on instructions. Empirically, InstABoost demonstrates superior control success compared to both traditional prompting and latent steering.
Submitted 16 June, 2025;
originally announced June 2025.
-
Benchmarking Misuse Mitigation Against Covert Adversaries
Authors:
Davis Brown,
Mahdi Sabbaghi,
Luze Sun,
Alexander Robey,
George J. Pappas,
Eric Wong,
Hamed Hassani
Abstract:
Existing language model safety evaluations focus on overt attacks and low-stakes tasks. Realistic attackers can subvert current safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
Submitted 6 June, 2025;
originally announced June 2025.
-
The Road to Generalizable Neuro-Symbolic Learning Should be Paved with Foundation Models
Authors:
Adam Stein,
Aaditya Naik,
Neelay Velingker,
Mayur Naik,
Eric Wong
Abstract:
Neuro-symbolic learning was proposed to address challenges with training neural networks for complex reasoning tasks with the added benefits of interpretability, reliability, and efficiency. Neuro-symbolic learning methods traditionally train neural models in conjunction with symbolic programs, but they face significant challenges that limit them to simplistic problems. On the other hand, purely-neural foundation models now reach state-of-the-art performance through prompting rather than training, but they are often unreliable and lack interpretability. Supplementing foundation models with symbolic programs, which we call neuro-symbolic prompting, provides a way to use these models for complex reasoning tasks. Doing so raises the question: What role does specialized model training as part of neuro-symbolic learning have in the age of foundation models? To explore this question, we highlight three pitfalls of traditional neuro-symbolic learning with respect to the compute, data, and programs leading to generalization problems. This position paper argues that foundation models enable generalizable neuro-symbolic solutions, offering a path towards achieving the original goals of neuro-symbolic learning without the downsides of training from scratch.
Submitted 30 May, 2025;
originally announced May 2025.
-
Data Mining-Based Techniques for Software Fault Localization
Authors:
Peggy Cellier,
Mireille Ducassé,
Sébastien Ferré,
Olivier Ridoux,
W. Eric Wong
Abstract:
This chapter illustrates the basic concepts of fault localization using a data mining technique. It utilizes the Trityp program to illustrate the general method. Formal concept analysis and association rules are two well-known methods for symbolic data mining. In their original inception, they both consider data in the form of an object-attribute table. The chapter considers a debugging process in which a program is tested against different test cases. Two attributes, PASS and FAIL, represent the outcome of each test case. The chapter extends the analysis of data mining for fault localization to multiple-fault situations. It addresses how data mining can be further applied to fault localization for GUI components. Unlike traditional software, GUI test cases are usually event sequences, and each individual event has a unique corresponding event handler.
Submitted 23 May, 2025;
originally announced May 2025.
-
An Iterative Framework for Generative Backmapping of Coarse Grained Proteins
Authors:
Georgios Kementzidis,
Erin Wong,
John Nicholson,
Ruichen Xu,
Yuefan Deng
Abstract:
The techniques of data-driven backmapping from coarse-grained (CG) to fine-grained (FG) representation often struggle with accuracy, unstable training, and physical realism, especially when applied to complex systems such as proteins. In this work, we introduce a novel iterative framework by using conditional Variational Autoencoders and graph-based neural networks, specifically designed to tackle the challenges associated with such large-scale biomolecules. Our method enables stepwise refinement from CG beads to full atomistic details. We outline the theory of iterative generative backmapping and demonstrate via numerical experiments the advantages of multistep schemes by applying them to proteins of vastly different structures with very coarse representations. This multistep approach not only improves the accuracy of reconstructions but also makes the training process more computationally efficient for proteins with ultra-CG representations.
Submitted 23 May, 2025;
originally announced May 2025.
-
A Game-Theoretic Quantum Algorithm for Solving Magic Squares
Authors:
Sarah Chehade,
Andrea Delgado,
Elaine Wong
Abstract:
Variational quantum algorithms (VQAs) offer a promising near-term approach to finding optimal quantum strategies for playing non-local games. These games test quantum correlations beyond classical limits and enable entanglement verification. In this work, we present a variational framework for the Magic Square Game (MSG), a two-player non-local game with perfect quantum advantage. We construct a value Hamiltonian that encodes the game's parity and consistency constraints, then optimize parameterized quantum circuits to minimize this cost. Our approach builds on the stabilizer formalism, leverages commutation structure for circuit design, and is hardware-efficient. Compared to existing work, our contribution emphasizes algebraic structure and interpretability. We validate our method through numerical experiments and outline generalizations to larger games.
Submitted 19 May, 2025;
originally announced May 2025.
-
Probabilistic Stability Guarantees for Feature Attributions
Authors:
Helen Jin,
Anton Xue,
Weiqiu You,
Surbhi Goel,
Eric Wong
Abstract:
Stability guarantees have emerged as a principled way to evaluate feature attributions, but existing certification methods rely on heavily smoothed classifiers and often produce conservative guarantees. To address these limitations, we introduce soft stability and propose a simple, model-agnostic, sample-efficient stability certification algorithm (SCA) that yields non-trivial and interpretable guarantees for any attribution method. Moreover, we show that mild smoothing achieves a more favorable trade-off between accuracy and stability, avoiding the aggressive compromises made in prior certification methods. To explain this behavior, we use Boolean function analysis to derive a novel characterization of stability under smoothing. We evaluate SCA on vision and language tasks and demonstrate the effectiveness of soft stability in measuring the robustness of explanation methods.
Submitted 17 May, 2025; v1 submitted 18 April, 2025;
originally announced April 2025.
-
CTSketch: Compositional Tensor Sketching for Scalable Neurosymbolic Learning
Authors:
Seewon Choi,
Alaia Solko-Breslin,
Rajeev Alur,
Eric Wong
Abstract:
Many computational tasks benefit from being formulated as the composition of neural networks followed by a discrete symbolic program. The goal of neurosymbolic learning is to train the neural networks using only end-to-end input-output labels of the composite. We introduce CTSketch, a novel, scalable neurosymbolic learning algorithm. CTSketch uses two techniques to improve the scalability of neurosymbolic inference: decompose the symbolic program into sub-programs and summarize each sub-program with a sketched tensor. This strategy allows us to approximate the output distribution of the program with simple tensor operations over the input distributions and summaries. We provide theoretical insight into the maximum error of the approximation. Furthermore, we evaluate CTSketch on many benchmarks from the neurosymbolic literature, including some designed for evaluating scalability. Our results show that CTSketch pushes neurosymbolic learning to new scales that have previously been unattainable by obtaining high accuracy on tasks involving over one thousand inputs.
Submitted 31 March, 2025;
originally announced March 2025.
-
NSF-SciFy: Mining the NSF Awards Database for Scientific Claims
Authors:
Delip Rao,
Weiqiu You,
Eric Wong,
Chris Callison-Burch
Abstract:
We present NSF-SciFy, a large-scale dataset for scientific claim extraction derived from the National Science Foundation (NSF) awards database, comprising over 400K grant abstracts spanning five decades. While previous datasets relied on published literature, we leverage grant abstracts which offer a unique advantage: they capture claims at an earlier stage in the research lifecycle before publication takes effect. We also introduce a new task to distinguish between existing scientific claims and aspirational research intentions in proposals. Using zero-shot prompting with frontier large language models, we jointly extract 114K scientific claims and 145K investigation proposals from 16K grant abstracts in the materials science domain to create a focused subset called NSF-SciFy-MatSci. We use this dataset to evaluate three key tasks: (1) technical to non-technical abstract generation, where models achieve high BERTScore (0.85+ F1); (2) scientific claim extraction, where fine-tuned models outperform base models by 100% relative improvement; and (3) investigation proposal extraction, showing 90%+ improvement with fine-tuning. We introduce novel LLM-based evaluation metrics for robust assessment of claim/proposal extraction quality. As the largest scientific claim dataset to date -- with an estimated 2.8 million claims across all STEM disciplines funded by the NSF -- NSF-SciFy enables new opportunities for claim verification and meta-scientific research. We publicly release all datasets, trained models, and evaluation code to facilitate further research.
Submitted 15 March, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
Adaptively profiling models with task elicitation
Authors:
Davis Brown,
Prithvi Balehannina,
Helen Jin,
Shreya Havaldar,
Hamed Hassani,
Eric Wong
Abstract:
Language model evaluations often fail to characterize consequential failure modes, forcing experts to inspect outputs and build new benchmarks. We introduce task elicitation, a method that automatically builds new evaluations to profile model behavior. Task elicitation finds hundreds of natural-language tasks -- an order of magnitude more than prior work -- where frontier models exhibit systematic failures, in domains ranging from forecasting to online harassment. For example, we find that Sonnet 3.5 over-associates quantum computing and AGI and that o3-mini is prone to hallucination when fabrications are repeated in-context.
Submitted 20 May, 2025; v1 submitted 3 March, 2025;
originally announced March 2025.
-
Building a Software Stack for Quantum-HPC Integration
Authors:
Amir Shehata,
Peter Groszkowski,
Thomas Naughton,
Murali Gopalakrishnan Meena,
Elaine Wong,
Daniel Claudino,
Rafael Ferreira da Silva,
Thomas Beck
Abstract:
This paper presents a comprehensive software stack architecture for integrating quantum computing (QC) capabilities with High-Performance Computing (HPC) environments. While quantum computers show promise as specialized accelerators for scientific computing, their effective integration with classical HPC systems presents significant technical challenges. We propose a hardware-agnostic software framework that supports both current noisy intermediate-scale quantum devices and future fault-tolerant quantum computers, while maintaining compatibility with existing HPC workflows. The architecture includes a quantum gateway interface, standardized APIs for resource management, and robust scheduling mechanisms to handle both simultaneous and interleaved quantum-classical workloads. Key innovations include: (1) a unified resource management system that efficiently coordinates quantum and classical resources, (2) a flexible quantum programming interface that abstracts hardware-specific details, (3) a Quantum Platform Manager API that simplifies the integration of various quantum hardware systems, and (4) a comprehensive tool chain for quantum circuit optimization and execution. We demonstrate our architecture through implementation of quantum-classical algorithms, including the variational quantum linear solver, showcasing the framework's ability to handle complex hybrid workflows while maximizing resource utilization. This work provides a foundational blueprint for integrating QC capabilities into existing HPC infrastructures, addressing critical challenges in resource management, job scheduling, and efficient data movement between classical and quantum resources.
Submitted 3 March, 2025;
originally announced March 2025.
-
Scalable Coordinated Learning for H2M/R Applications over Optical Access Networks (Invited)
Authors:
Sourav Mondal,
Elaine Wong
Abstract:
One of the primary research interests in next-generation fiber-wireless access networks is human-to-machine/robot (H2M/R) collaborative communications facilitating Industry 5.0. This paper discusses scalable H2M/R communications across large geographical distances that also allow rapid onboarding of new machines/robots, as ~72% of training time is saved through global-local coordinated learning.
Submitted 27 February, 2025;
originally announced February 2025.
-
Where's the Bug? Attention Probing for Scalable Fault Localization
Authors:
Adam Stein,
Arthur Wayne,
Aaditya Naik,
Mayur Naik,
Eric Wong
Abstract:
Ensuring code correctness remains a challenging problem even as large language models (LLMs) become increasingly capable at code-related tasks. While LLM-based program repair systems can propose bug fixes using only a user's bug report, their effectiveness is fundamentally limited by their ability to perform fault localization (FL), a challenging problem for both humans and LLMs. Existing FL approaches rely on executable test cases, require training on costly and often noisy line-level annotations, or demand resource-intensive LLMs. In this paper, we present Bug Attention Probe (BAP), a method which learns state-of-the-art fault localization without any direct localization labels, outperforming traditional FL baselines and prompting of large-scale LLMs. We evaluate our approach across a variety of code settings, including real-world Java bugs from the standard Defects4J dataset as well as seven other datasets which span a diverse set of bug types and languages. Averaged across all eight datasets, BAP improves by 34.6% top-1 accuracy compared to the strongest baseline and 93.4% over zero-shot prompting GPT-4o. BAP is also significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost.
Submitted 19 February, 2025; v1 submitted 19 February, 2025;
originally announced February 2025.
-
The current cratering rate on the regular satellites of Jupiter, Saturn, and Uranus
Authors:
R. Brasser,
E. W. Wong,
S. C. Werner
Abstract:
We aim to compute the impact rates for objects with a diameter of 1 km onto the regular satellites of Jupiter, Saturn and Uranus using our latest dynamical simulations of the evolution of the outer solar system coupled with the best estimates of the current population of objects beyond Neptune and their size-frequency distribution. We use the outcome of the last 3.5 Gyr of evolution of the outer solar system from our database of simulations and combine this with observational constraints of the population beyond Neptune to compute the flux of objects entering the Centaur region, with uncertainties. The initial conditions resemble the current population rather than a near-circular, near-planar disc usually assumed just before the onset of giant planet migration. We obtain a better estimate of the impact probability of a Centaur with the satellites from enacting simulations of planetesimals flying past the satellites on hyperbolic orbits, which agree with literature precedents. We find that our impact rate of objects greater than 1 km in diameter with Jupiter is 0.0012/yr, which is a factor of 3-6 lower than previous estimates of 0.0044/yr from Nesvorny et al. (2023) and 0.0075/yr from Zahnle et al. (2003). On the other hand, our impact probabilities with the satellites scaled to the giant planets are consistent with these earlier literature estimates, as is the leakage rate of objects from beyond Neptune into the Centaur region. However, our absolute impact probabilities with the giant planets are lower. We attribute this to our choice of initial conditions.
Submitted 29 January, 2025;
originally announced January 2025.
-
Existence of Periodic and Stationary Solutions to Distribution-Dependent SDEs
Authors:
Wei Sun,
Ethan Wong
Abstract:
We investigate the periodic and stationary solutions of distribution-dependent stochastic differential equations. While generally, the semigroups associated with the equations are nonlinear, we show that the methods of weak convergence and Lyapunov functions can be combined to give efficient criteria for the existence of periodic and stationary solutions. Concrete examples are presented to illustrate the novel criteria.
Submitted 15 January, 2025;
originally announced January 2025.
-
OpenAI o1 System Card
Authors:
OpenAI,
Aaron Jaech,
Adam Kalai,
Adam Lerer,
Adam Richardson,
Ahmed El-Kishky,
Aiden Low,
Alec Helyar,
Aleksander Madry,
Alex Beutel,
Alex Carney,
Alex Iftimie,
Alex Karpenko,
Alex Tachard Passos,
Alexander Neitz,
Alexander Prokofiev,
Alexander Wei,
Allison Tam,
Ally Bennett,
Ananya Kumar,
Andre Saraiva,
Andrea Vallone,
Andrew Duberstein,
Andrew Kondrich
, et al. (238 additional authors not shown)
Abstract:
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
Submitted 21 December, 2024;
originally announced December 2024.
-
MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance
Authors:
Hallee E. Wong,
Jose Javier Gonzalez Ortiz,
John Guttag,
Adrian V. Dalca
Abstract:
Medical researchers and clinicians often need to perform novel segmentation tasks on a set of related images. Existing methods for segmenting a new dataset are either interactive, requiring substantial human effort for each image, or require an existing set of manually labeled images. We introduce a system, MultiverSeg, that enables practitioners to rapidly segment an entire new dataset without requiring access to any existing labeled data from that task or domain. Along with the image to segment, the model takes user interactions such as clicks, bounding boxes or scribbles as input, and predicts a segmentation. As the user segments more images, those images and segmentations become additional inputs to the model, providing context. As the context set of labeled images grows, the number of interactions required to segment each new image decreases. We demonstrate that MultiverSeg enables users to interactively segment new datasets efficiently, by amortizing the number of interactions per image to achieve an accurate segmentation. Compared to using a state-of-the-art interactive segmentation method, using MultiverSeg reduced the total number of scribble steps by 53% and clicks by 36% to achieve 90% Dice on sets of images from unseen tasks. We release code and model weights at https://multiverseg.csail.mit.edu
Submitted 19 December, 2024;
originally announced December 2024.
-
Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment
Authors:
Firdavs Nasriddinov,
Rafal Kocielnik,
Arushi Gupta,
Cherine Yang,
Elyssa Wong,
Anima Anandkumar,
Andrew Hung
Abstract:
This work introduces the first framework for reconstructing surgical dialogue from unstructured real-world recordings, which is crucial for characterizing teaching tasks. In surgical training, the formative verbal feedback that trainers provide to trainees during live surgeries is crucial for ensuring safety, correcting behavior immediately, and facilitating long-term skill acquisition. However, analyzing and quantifying this feedback is challenging due to its unstructured and specialized nature. Automated systems are essential to manage these complexities at scale, allowing for the creation of structured datasets that enhance feedback analysis and improve surgical education. Our framework integrates voice activity detection, speaker diarization, and automated speech recognition, with a novel enhancement that 1) removes hallucinations (non-existent utterances generated during speech recognition fueled by noise in the operating room) and 2) separates speech from trainers and trainees using few-shot voice samples. These aspects are vital for reconstructing accurate surgical dialogues and understanding the roles of operating room participants. Using data from 33 real-world surgeries, we demonstrated the system's capability to reconstruct surgical teaching dialogues and detect feedback instances effectively (F1 score of 0.79+/-0.07). Moreover, our hallucination removal step improves feedback detection performance by ~14%. Evaluation on downstream clinically relevant tasks of predicting Behavioral Adjustment of trainees and classifying Technical feedback showed performances comparable to manual annotations, with F1 scores of 0.82+/-0.03 and 0.81+/-0.03 respectively. These results highlight the effectiveness of our framework in supporting clinically relevant tasks and improving over manual methods.
Submitted 1 December, 2024;
originally announced December 2024.
-
Software Fault Localization Based on Multi-objective Feature Fusion and Deep Learning
Authors:
Xiaolei Hu,
Dongcheng Li,
W. Eric Wong,
Ya Zou
Abstract:
Software fault localization remains challenging due to limited feature diversity and low precision in traditional methods. This paper proposes a novel approach that integrates multi-objective optimization with deep learning models to improve both accuracy and efficiency in fault localization (FL). By framing feature selection as a multi-objective optimization problem (MOP), we extract and fuse three critical fault-related feature sets (spectrum-based, mutation-based, and text-based features) into a comprehensive feature fusion model. These features are then embedded within a deep learning architecture, comprising a multilayer perceptron (MLP) and gated recurrent network (GRN), which together enhance localization accuracy and generalizability. Experiments on the Defects4J benchmark dataset with 434 faults show that the proposed algorithm reduces processing time by 78.2% compared to single-objective methods. Additionally, our MLP and GRN models achieve a 94.2% improvement in localization accuracy compared to traditional FL methods, outperforming the state-of-the-art deep learning-based FL method by 7.67%. Further validation using the PROMISE dataset demonstrates the generalizability of the proposed model, showing a 4.6% accuracy improvement in cross-project tests over the state-of-the-art deep learning-based FL method.
Submitted 25 November, 2024;
originally announced November 2024.
-
Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment
Authors:
Arushi Gupta,
Rafal Kocielnik,
Jiayun Wang,
Firdavs Nasriddinov,
Cherine Yang,
Elyssa Wong,
Anima Anandkumar,
Andrew Hung
Abstract:
During surgical training, real-time feedback from trainers to trainees is important for preventing errors and enhancing long-term skill acquisition. Accurately predicting the effectiveness of this feedback, specifically whether it leads to a change in trainee behavior, is crucial for developing methods for improving surgical training and education. However, relying on human annotations to assess feedback effectiveness is laborious and prone to biases, underscoring the need for an automated, scalable, and objective method. Creating such an automated system poses challenges, as it requires an understanding of both the verbal feedback delivered by the trainer and the visual context of the real-time surgical scene. To address this, we propose a method that integrates information from transcribed verbal feedback and corresponding surgical video to predict feedback effectiveness. Our findings show that both transcribed feedback and surgical video are individually predictive of trainee behavior changes, and their combination achieves an AUROC of 0.70+/-0.02, improving prediction accuracy by up to 6.6%. Additionally, we introduce self-supervised fine-tuning as a strategy for enhancing surgical video representation learning, which is scalable and further enhances prediction performance. Our results demonstrate the potential of multi-modal learning to advance the automated assessment of surgical feedback.
Submitted 16 November, 2024;
originally announced November 2024.
-
Learning General-Purpose Biomedical Volume Representations using Randomized Synthesis
Authors:
Neel Dey,
Benjamin Billot,
Hallee E. Wong,
Clinton J. Wang,
Mengwei Ren,
P. Ellen Grant,
Adrian V. Dalca,
Polina Golland
Abstract:
Current volumetric biomedical foundation models struggle to generalize as public 3D datasets are small and do not cover the broad diversity of medical procedures, conditions, anatomical regions, and imaging protocols. We address this by creating a representation learning method that instead anticipates strong domain shifts at training time itself. We first propose a data engine that synthesizes highly variable training samples that would enable generalization to new biomedical contexts. To then train a single 3D network for any voxel-level task, we develop a contrastive learning method that pretrains the network to be stable against nuisance imaging variation simulated by the data engine, a key inductive bias for generalization. This network's features can be used as robust representations of input images for downstream tasks and its weights provide a strong, dataset-agnostic initialization for finetuning on new datasets. As a result, we set new standards across both multimodality registration and few-shot segmentation, a first for any 3D biomedical vision model, all without (pre-)training on any existing dataset of real images.
Submitted 2 March, 2025; v1 submitted 4 November, 2024;
originally announced November 2024.
-
Many-Objective Search-Based Coverage-Guided Automatic Test Generation for Deep Neural Networks
Authors:
Dongcheng Li,
W. Eric Wong,
Hu Liu,
Man Zhao
Abstract:
To ensure the reliability of DNN systems and address the test generation problem for neural networks, this paper proposes a fuzzing test generation technique based on many-objective optimization algorithms. Traditional fuzz testing employs random search, which lowers testing efficiency and tends to generate numerous invalid test cases. By utilizing many-objective optimization techniques, effective test cases can be generated. To achieve high test coverage, this paper proposes several improvement strategies. The frequency-based fuzz sampling strategy assigns priorities based on how often each piece of initial data has been selected, avoiding repeated selection of the same data and yielding higher-quality initial data than random sampling strategies. To address the issue that global search may yield tests that do not satisfy semantic constraints, a local search strategy based on Monte Carlo tree search is proposed to enhance the algorithm's local search capabilities. Furthermore, we improve the diversity of the population and the algorithm's global search capability by updating SPEA2's external archive based on a decomposition-based archiving strategy. To validate the effectiveness of the proposed approach, experiments were conducted on several public datasets and various neural network models. The results reveal that, compared to random and clustering-based sampling, the frequency-based fuzz sampling strategy provides a greater improvement in coverage rate in the later stages of iterations. On complex networks like VGG16, the improved SPEA2 algorithm increased the coverage rate by about 12% across several coverage metrics, and by approximately 40% on LeNet series networks. The experimental results also indicate that the newly generated test cases not only exhibit higher coverage rates but also generate adversarial samples that reveal model errors.
Submitted 1 November, 2024;
originally announced November 2024.
-
AR-Pro: Counterfactual Explanations for Anomaly Repair with Formal Properties
Authors:
Xiayan Ji,
Anton Xue,
Eric Wong,
Oleg Sokolsky,
Insup Lee
Abstract:
Anomaly detection is widely used for identifying critical errors and suspicious behaviors, but current methods lack interpretability. We leverage common properties of existing methods and recent advances in generative models to introduce counterfactual explanations for anomaly detection. Given an input, we generate its counterfactual as a diffusion-based repair that shows what a non-anomalous version should have looked like. A key advantage of this approach is that it enables a domain-independent formal specification of explainability desiderata, offering a unified framework for generating and evaluating explanations. We demonstrate the effectiveness of our anomaly explainability framework, AR-Pro, on vision (MVTec, VisA) and time-series (SWaT, WADI, HAI) anomaly datasets. The code used for the experiments is accessible at: https://github.com/xjiae/arpro.
Submitted 31 October, 2024;
originally announced October 2024.
-
GPT-4o System Card
Authors:
OpenAI,
Aaron Hurst,
Adam Lerer,
Adam P. Goucher,
Adam Perelman,
Aditya Ramesh,
Aidan Clark,
AJ Ostrow,
Akila Welihinda,
Alan Hayes,
Alec Radford,
Aleksander Mądry,
Alex Baker-Whitcomb,
Alex Beutel,
Alex Borzunov,
Alex Carney,
Alex Chow,
Alex Kirillov,
Alex Nichol,
Alex Paino,
Alex Renzin,
Alex Tachard Passos,
Alexander Kirillov,
Alexi Christakis
, et al. (395 additional authors not shown)
Abstract:
GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
Submitted 25 October, 2024;
originally announced October 2024.
-
Probabilistic Representation of Commutative Quantum Circuit Models
Authors:
Richard Yu,
Jorge Ramirez,
Elaine Wong
Abstract:
In commuting parametric quantum circuits, the Fourier series of the pairwise fidelity can be expressed as the characteristic function of random variables. Furthermore, expressiveness can be cast as the recurrence probability of a random walk on a lattice. This construction has been successfully applied to the group composed only of Pauli-Z rotations, and we generalize this probabilistic strategy to any commuting set of Pauli operators. We utilize an efficient algorithm by van den Berg and Temme (2020) using the tableau representation of Pauli strings to yield a unitary from the Clifford group that, under conjugation, simultaneously diagonalizes our commuting set of Pauli rotations. Furthermore, we fully characterize the underlying distribution of the random walk using stabilizer states and their basis state representations. This would allow us to tractably compute the lattice volume and variance matrix used to express the frame potential. Together, this demonstrates a scalable strategy to calculate the expressiveness of parametric quantum models.
Submitted 24 October, 2024;
originally announced October 2024.
-
A Hybrid Sampling and Multi-Objective Optimization Approach for Enhanced Software Defect Prediction
Authors:
Jie Zhang,
Dongcheng Li,
W. Eric Wong,
Shengrong Wang
Abstract:
Accurate early prediction of software defects is essential to maintain software quality and reduce maintenance costs. However, the field of software defect prediction (SDP) faces challenges such as class imbalances, high-dimensional feature spaces, and suboptimal prediction accuracy. To mitigate these challenges, this paper introduces a novel SDP framework that integrates hybrid sampling techniques, specifically Borderline SMOTE and Tomek Links, with a suite of multi-objective optimization algorithms, including NSGA-II, MOPSO, and MODE. The proposed model applies feature fusion through multi-objective optimization, enhancing both the generalization capability and stability of the predictions. Furthermore, the integration of parallel processing for these optimization algorithms significantly boosts the computational efficiency of the model. Comprehensive experiments conducted on datasets from NASA and PROMISE repositories demonstrate that the proposed hybrid sampling and multi-objective optimization approach improves data balance, eliminates redundant features, and enhances prediction accuracy. The experimental results also highlight the robustness of the feature fusion approach, confirming its superiority over existing state-of-the-art techniques in terms of predictive performance and applicability across diverse datasets.
Submitted 13 October, 2024;
originally announced October 2024.
-
Dolphin: A Programmable Framework for Scalable Neurosymbolic Learning
Authors:
Aaditya Naik,
Jason Liu,
Claire Wang,
Amish Sethi,
Saikat Dutta,
Mayur Naik,
Eric Wong
Abstract:
Neurosymbolic learning enables the integration of symbolic reasoning with deep learning but faces significant challenges in scaling to complex symbolic programs, large datasets, or both. We introduce DOLPHIN, a framework that tackles these challenges by supporting neurosymbolic programs in Python, executing complex symbolic reasoning on the CPU while vectorizing probabilistic computations and gradient propagation on the GPU. Across 13 benchmarks spanning tasks over text, image, and video data, with symbolic reasoning features like recursion and black-box functions, DOLPHIN converges to state-of-the-art accuracies on the more complex benchmarks while existing frameworks such as Scallop, ISED, and IndeCateR+ fail to converge within the time limit. On simpler benchmarks, DOLPHIN matches their performance, while achieving these results 1.71x to 62x faster than the baselines. Overall, DOLPHIN advances the scalability of neurosymbolic frameworks, achieving state-of-the-art efficiency and convergence on difficult benchmarks where existing frameworks struggle. The code is published at https://github.com/Dolphin-NeSy/Dolphin.
Submitted 28 May, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Smart Contract Vulnerability Detection based on Static Analysis and Multi-Objective Search
Authors:
Dongcheng Li,
W. Eric Wong,
Xiaodan Wang,
Sean Pan,
Liang-Seng Koh
Abstract:
This paper introduces a method for detecting vulnerabilities in smart contracts using static analysis and a multi-objective optimization algorithm. We focus on four types of vulnerabilities: reentrancy, call stack overflow, integer overflow, and timestamp dependencies. Initially, smart contracts are compiled into an abstract syntax tree to analyze relationships between contracts and functions, including calls, inheritance, and data flow. These analyses are transformed into static evaluations and intermediate representations that reveal internal relations. Based on these representations, we examine each contract's functions, variables, and data dependencies to detect the specified vulnerabilities. To enhance detection accuracy and coverage, we apply a multi-objective optimization algorithm to the static analysis process. This involves assigning initial numeric values to input data and monitoring changes in statement coverage and detection accuracy. Using coverage and accuracy as fitness values, we calculate the Pareto front and crowding distance values to select the best individuals for the new parent population, iterating until optimization criteria are met. We validate our approach using an open-source dataset collected from Etherscan, containing 6,693 smart contracts. Experimental results show that our method outperforms state-of-the-art tools in terms of coverage, accuracy, efficiency, and effectiveness in detecting the targeted vulnerabilities.
Submitted 30 September, 2024;
originally announced October 2024.
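The Pareto-front selection step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes two objectives to be maximized (coverage and accuracy); the function names and candidate values are hypothetical and are not taken from the paper, and the crowding-distance step is omitted.

```python
def dominates(a, b):
    """Return True if point a dominates point b (all objectives maximized).

    a dominates b when a is at least as good in every objective
    and strictly better in at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (coverage, accuracy) fitness pairs, for illustration only.
candidates = [(0.90, 0.60), (0.80, 0.80), (0.70, 0.70), (0.60, 0.95), (0.90, 0.50)]
front = pareto_front(candidates)
print(front)
```

In an evolutionary loop of the kind the abstract describes, a front like this would typically be ranked further (for example by crowding distance) before choosing parents; that part is left out here.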
-
The FIX Benchmark: Extracting Features Interpretable to eXperts
Authors:
Helen Jin,
Shreya Havaldar,
Chaehyeon Kim,
Anton Xue,
Weiqiu You,
Helen Qu,
Marco Gatti,
Daniel A Hashimoto,
Bhuvnesh Jain,
Amin Madani,
Masao Sako,
Lyle Ungar,
Eric Wong
Abstract:
Feature-based methods are commonly used to explain model predictions, but these methods often implicitly assume that interpretable features are readily available. However, this is often not the case for high-dimensional data, and it can be hard even for domain experts to mathematically specify which features are important. Can we instead automatically extract collections or groups of features that are aligned with expert knowledge? To address this gap, we present FIX (Features Interpretable to eXperts), a benchmark for measuring how well a collection of features aligns with expert knowledge. In collaboration with domain experts, we propose FIXScore, a unified expert alignment measure applicable to diverse real-world settings across cosmology, psychology, and medicine domains in vision, language, and time series data modalities. With FIXScore, we find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts.
Submitted 23 December, 2024; v1 submitted 20 September, 2024;
originally announced September 2024.
-
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
Authors:
Anton Xue,
Avishree Khare,
Rajeev Alur,
Surbhi Goel,
Eric Wong
Abstract:
We study how to subvert large language models (LLMs) from following prompt-specified rules. We first formalize rule-following as inference in propositional Horn logic, a mathematical system in which rules have the form "if $P$ and $Q$, then $R$" for some propositions $P$, $Q$, and $R$. Next, we prove that although small transformers can faithfully follow such rules, maliciously crafted prompts can still mislead both theoretical constructions and models learned from data. Furthermore, we demonstrate that popular attack algorithms on LLMs find adversarial prompts and induce attention patterns that align with our theory. Our novel logic-based framework provides a foundation for studying LLMs in rule-based settings, enabling a formal analysis of tasks like logical reasoning and jailbreak attacks.
Submitted 28 February, 2025; v1 submitted 21 June, 2024;
originally announced July 2024.
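For readers unfamiliar with propositional Horn logic, the rule form "if $P$ and $Q$, then $R$" described in the abstract above can be illustrated with classical forward chaining. This is a minimal sketch of the textbook inference procedure, not the paper's transformer construction; the function name and example rules are hypothetical.

```python
def forward_chain(rules, facts):
    """Derive every proposition reachable from facts under Horn rules.

    rules: iterable of (premises, conclusion) pairs, where premises is a
           set of proposition names and conclusion is a single name.
    facts: iterable of proposition names known to be true.
    """
    known = set(facts)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rule set: "if P and Q then R", "if R then S", "if T then U".
rules = [({"P", "Q"}, "R"), ({"R"}, "S"), ({"T"}, "U")]
derived = forward_chain(rules, {"P", "Q"})
print(sorted(derived))
```

Here U is never derived because its premise T is not among the known facts.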
-
Towards Compositionality in Concept Learning
Authors:
Adam Stein,
Aaditya Naik,
Yinjun Wu,
Mayur Naik,
Eric Wong
Abstract:
Concept-based interpretability methods offer a lens into the internals of foundation models by decomposing their embeddings into high-level concepts. These concept representations are most useful when they are compositional, meaning that the individual concepts compose to explain the full sample. We show that existing unsupervised concept extraction methods find concepts which are not compositional. To automatically discover compositional concept representations, we identify two salient properties of such representations, and propose Compositional Concept Extraction (CCE) for finding concepts which obey these properties. We evaluate CCE on five different datasets over image and text data. Our evaluation shows that CCE finds more compositional concept representations than baselines and yields better accuracy on four downstream classification tasks. Code and data are available at https://github.com/adaminsky/compositional_concepts .
Submitted 26 June, 2024;
originally announced June 2024.
-
Avoiding Copyright Infringement via Large Language Model Unlearning
Authors:
Guangyao Dou,
Zheyuan Liu,
Qing Lyu,
Kaize Ding,
Eric Wong
Abstract:
Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but also pose risks by learning and generating copyrighted material, leading to significant legal and ethical concerns. In real-world scenarios, model owners need to continuously address copyright infringement as new requests for content removal emerge at different time points. This leads to the need for sequential unlearning, where copyrighted content is removed sequentially as new requests arise. Despite its practical relevance, sequential unlearning in the context of copyright infringement has not been rigorously explored in existing literature. To address this gap, we propose Stable Sequential Unlearning (SSU), a novel framework designed to unlearn copyrighted content from LLMs over multiple time steps. Our approach works by identifying and removing specific weight updates in the model's parameters that correspond to copyrighted content. We improve unlearning efficacy by introducing random labeling loss and ensuring the model retains its general-purpose knowledge by adjusting targeted parameters. Experimental results show that SSU achieves an effective trade-off between unlearning efficacy and general-purpose language abilities, outperforming existing baselines.
Submitted 10 February, 2025; v1 submitted 16 June, 2024;
originally announced June 2024.
-
Data-Efficient Learning with Neural Programs
Authors:
Alaia Solko-Breslin,
Seewon Choi,
Ziyang Li,
Neelay Velingker,
Rajeev Alur,
Mayur Naik,
Eric Wong
Abstract:
Many computational tasks can be naturally expressed as a composition of a DNN followed by a program written in a traditional programming language or an API call to an LLM. We call such composites "neural programs" and focus on the problem of learning the DNN parameters when the training data consist of end-to-end input-output labels for the composite. When the program is written in a differentiable logic programming language, techniques from neurosymbolic learning are applicable, but in general, the learning for neural programs requires estimating the gradients of black-box components. We present an algorithm for learning neural programs, called ISED, that only relies on input-output samples of black-box components. For evaluation, we introduce new benchmarks that involve calls to modern LLMs such as GPT-4 and also consider benchmarks from the neurosymbolic learning literature. Our evaluation shows that for the latter benchmarks, ISED has comparable performance to state-of-the-art neurosymbolic frameworks. For the former, we use adaptations of prior work on gradient approximations of black-box components as a baseline, and show that ISED achieves comparable accuracy but in a more data- and sample-efficient manner.
Submitted 31 October, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Rethinking Programming Paradigms in the QC-HPC Context
Authors:
Silvina Caino-Lores,
Daniel Claudino,
Eugene Dumitrescu,
Travis S. Humble,
Sonia Lopez Alarcon,
Elaine Wong
Abstract:
Programming for today's quantum computers is making significant strides toward modern workflows compatible with high performance computing (HPC), but fundamental challenges still remain in the integration of these vastly different technologies. Quantum computing (QC) programming languages share some common ground, as do their emerging runtimes and algorithmic modalities. In this short paper, we explore avenues of refinement for the quantum processing unit (QPU) in the context of many-tasks management, asynchronous or otherwise, in order to understand the role it can play in linking QC with HPC. Through examples, we illustrate how its potential for scientific discovery might be realized.
Submitted 5 June, 2024;
originally announced June 2024.
-
DISCRET: Synthesizing Faithful Explanations For Treatment Effect Estimation
Authors:
Yinjun Wu,
Mayank Keoliya,
Kan Chen,
Neelay Velingker,
Ziyang Li,
Emily J Getzen,
Qi Long,
Mayur Naik,
Ravi B Parikh,
Eric Wong
Abstract:
Designing faithful yet accurate AI models is challenging, particularly in the field of individual treatment effect estimation (ITE). ITE prediction models deployed in critical settings such as healthcare should ideally be (i) accurate, and (ii) provide faithful explanations. However, current solutions are inadequate: state-of-the-art black-box models do not supply explanations, post-hoc explainers for black-box models lack faithfulness guarantees, and self-interpretable models greatly compromise accuracy. To address these issues, we propose DISCRET, a self-interpretable ITE framework that synthesizes faithful, rule-based explanations for each sample. A key insight behind DISCRET is that explanations can serve dually as database queries to identify similar subgroups of samples. We provide a novel RL algorithm to efficiently synthesize these explanations from a large search space. We evaluate DISCRET on diverse tasks involving tabular, image, and text data. DISCRET outperforms the best self-interpretable models and has accuracy comparable to the best black-box models while providing faithful explanations. DISCRET is available at https://github.com/wuyinjun-1993/DISCRET-ICML2024.
Submitted 2 June, 2024;
originally announced June 2024.
-
Analyzing Language Bias Between French and English in Conventional Multilingual Sentiment Analysis Models
Authors:
Ethan Parker Wong,
Faten M'hiri
Abstract:
Inspired by the 'Bias Considerations in Bilingual Natural Language Processing' report by Statistics Canada, this study delves into potential biases in multilingual sentiment analysis between English and French. Given a 50-50 dataset of French and English, we aim to determine if there exists a language bias and explore how the incorporation of more diverse datasets in the future might affect the equity of multilingual Natural Language Processing (NLP) systems. By employing Support Vector Machine (SVM) and Naive Bayes models on three balanced datasets, we reveal potential biases in multilingual sentiment classification. Utilizing Fairlearn, a tool for assessing bias in machine learning models, our findings indicate nuanced outcomes. French data outperforms English across accuracy, recall, and F1 score metrics in both models, hinting at a language bias favoring French. However, Fairlearn's metrics suggest that the SVM approaches equitable levels, with demographic parity ratios of 0.963, 0.989, and 0.985 for the three separate datasets, indicating near-equitable treatment across languages. In contrast, Naive Bayes demonstrates greater disparities, evidenced by demographic parity ratios of 0.813, 0.908, and 0.961. These findings reveal the importance of developing equitable multilingual NLP systems, particularly as we anticipate the inclusion of more datasets in various languages in the future.
Submitted 7 May, 2024;
originally announced May 2024.
-
Practice-informed Patterns for Organising Large Groups in Distributed Mixed Reality Collaboration
Authors:
Emily Wong,
Juan Sánchez Esquivel,
Jens Emil Grønbæk,
Germán Leiva,
Eduardo Velloso
Abstract:
Collaborating across dissimilar, distributed spaces presents numerous challenges for computer-aided spatial communication. Mixed reality (MR) can blend selected surfaces, allowing collaborators to work in blended f-formations (facing formations), even when their workstations are physically misaligned. Since collaboration often involves more than just participant pairs, this research examines how we might scale MR experiences for large-group collaboration. To do so, this study recruited collaboration designers (CDs) to evaluate and reimagine MR for large-scale collaboration. These CDs were engaged in a four-part user study that involved a technology probe, a semi-structured interview, a speculative low-fidelity prototyping activity and a validation session. The outcomes of this paper contribute (1) a set of collaboration design principles to inspire future computer-supported collaborative work, (2) eight collaboration patterns for blended f-formations and collaboration at scale and (3) theoretical implications for f-formations and space-place relationships. As a result, this work creates a blueprint for scaling collaboration across distributed spaces.
Submitted 9 May, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
Expressiveness of Commutative Quantum Circuits: A Probabilistic Approach
Authors:
Jorge M. Ramirez,
Elaine Wong,
Caio Alves,
Sarah Chehade,
Ryan Bennink
Abstract:
This study investigates the frame potential and expressiveness of commutative quantum circuits. Based on the Fourier series representation of these circuits, we express quantum expectation and pairwise fidelity as characteristic functions of random variables, and expressiveness as the recurrence probability of a random walk on a lattice. A central outcome of our work is a set of formulas to approximate the frame potential and expressiveness for any commutative quantum circuit, underpinned by convergence theorems in probability theory. We identify the lattice volume of the random walk as a means to approximate expressiveness based on circuit architecture. In the specific case of commutative circuits involving Pauli-$Z$ rotations, we provide theoretical results relating expressiveness and circuit structure. Our probabilistic representation also provides a means for bounding and approximately calculating the frame potential of a circuit through sampling methods.
Submitted 2 December, 2024; v1 submitted 30 April, 2024;
originally announced April 2024.
-
A Cross-Platform Execution Engine for the Quantum Intermediate Representation
Authors:
Elaine Wong,
Vicente Leyton Ortega,
Daniel Claudino,
Seth Johnson,
Sharmin Afrose,
Meenambika Gowrishankar,
Anthony M. Cabrera,
Travis S. Humble
Abstract:
Hybrid languages like the Quantum Intermediate Representation (QIR) are essential for programming systems that mix quantum and conventional computing models, while execution of these programs is often deferred to a system-specific implementation. Here, we describe and demonstrate the QIR Execution Engine (QIR-EE) for parsing, interpreting, and executing QIR across multiple hardware platforms. QIR-EE uses LLVM to execute hybrid instructions specifying quantum programs and, by design, presents extension points that support customized runtime and hardware environments. We demonstrate an implementation that uses the XACC quantum hardware-accelerator library to dispatch prototypical quantum programs on different commercial quantum platforms and numerical simulators, and we validate execution of QIR-EE on the IonQ Harmony and Quantinuum H1-1 hardware. Our results highlight the efficiency of hybrid executable architectures for handling mixed instructions, managing mixed data, and integrating with quantum computing frameworks to realize cross-platform execution.
Submitted 22 April, 2024;
originally announced April 2024.
-
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Authors:
Patrick Chao,
Edoardo Debenedetti,
Alexander Robey,
Maksym Andriushchenko,
Francesco Croce,
Vikash Sehwag,
Edgar Dobriban,
Nicolas Flammarion,
George J. Pappas,
Florian Tramer,
Hamed Hassani,
Eric Wong
Abstract:
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.
Submitted 31 October, 2024; v1 submitted 27 March, 2024;
originally announced April 2024.
-
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Authors:
Jiabao Ji,
Bairu Hou,
Alexander Robey,
George J. Pappas,
Hamed Hassani,
Yang Zhang,
Eric Wong,
Shiyu Chang
Abstract:
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, there do not exist defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction following benchmarks such as InstructionFollowing and AlpacaEval. The codes will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.
Submitted 28 February, 2024; v1 submitted 25 February, 2024;
originally announced February 2024.
-
Tyche: Stochastic In-Context Learning for Medical Image Segmentation
Authors:
Marianne Rakic,
Hallee E. Wong,
Jose Javier Gonzalez Ortiz,
Beth Cimini,
John Guttag,
Adrian V. Dalca
Abstract:
Existing learning-based solutions to medical image segmentation have two important shortcomings. First, for most new segmentation tasks, a new model has to be trained or fine-tuned. This requires extensive resources and machine learning expertise, and is therefore often infeasible for medical researchers and clinicians. Second, most existing segmentation methods produce a single deterministic segmentation mask for a given image. In practice, however, there is often considerable uncertainty about what constitutes the correct segmentation, and different expert annotators will often segment the same image differently. We tackle both of these problems with Tyche, a model that uses a context set to generate stochastic predictions for previously unseen tasks without the need to retrain. Tyche differs from other in-context segmentation methods in two important ways. (1) We introduce a novel convolution block architecture that enables interactions among predictions. (2) We introduce in-context test-time augmentation, a new mechanism to provide prediction stochasticity. When combined with appropriate model design and loss functions, Tyche can predict a set of plausible diverse segmentation candidates for new or unseen medical images and segmentation tasks without the need to retrain.
Submitted 24 January, 2024;
originally announced January 2024.
-
ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image
Authors:
Hallee E. Wong,
Marianne Rakic,
John Guttag,
Adrian V. Dalca
Abstract:
Biomedical image segmentation is a crucial part of both scientific research and clinical care. With enough labelled data, deep learning models can be trained to accurately automate specific biomedical image segmentation tasks. However, manually segmenting images to create training data is highly labor intensive and requires domain expertise. We present ScribblePrompt, a flexible neural network based interactive segmentation tool for biomedical imaging that enables human annotators to segment previously unseen structures using scribbles, clicks, and bounding boxes. Through rigorous quantitative experiments, we demonstrate that given comparable amounts of interaction, ScribblePrompt produces more accurate segmentations than previous methods on datasets unseen during training. In a user study with domain experts, ScribblePrompt reduced annotation time by 28% while improving Dice by 15% compared to the next best method. ScribblePrompt's success rests on a set of careful design decisions. These include a training strategy that incorporates both a highly diverse set of images and tasks, novel algorithms for simulated user interactions and labels, and a network that enables fast inference. We showcase ScribblePrompt in an interactive demo, provide code, and release a dataset of scribble annotations at https://scribbleprompt.csail.mit.edu
Submitted 16 July, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Initialization Matters for Adversarial Transfer Learning
Authors:
Andong Hua,
Jindong Gu,
Zhiyu Xue,
Nicholas Carlini,
Eric Wong,
Yao Qin
Abstract:
With the prevalence of the Pretraining-Finetuning paradigm in transfer learning, the robustness of downstream tasks has become a critical concern. In this work, we delve into adversarial robustness in transfer learning and reveal the critical role of initialization, including both the pretrained model and the linear head. First, we discover the necessity of an adversarially robust pretrained model. Specifically, we reveal that with a standard pretrained model, Parameter-Efficient Finetuning (PEFT) methods either fail to be adversarially robust or continue to exhibit significantly degraded adversarial robustness on downstream tasks, even with adversarial training during finetuning. Leveraging a robust pretrained model, surprisingly, we observe that a simple linear probing can outperform full finetuning and other PEFT methods with random initialization on certain datasets. We further identify that linear probing excels in preserving robustness from the robust pretraining. Based on this, we propose Robust Linear Initialization (RoLI) for adversarial finetuning, which initializes the linear head with the weights obtained by adversarial linear probing to maximally inherit the robustness from pretraining. Across five different image classification datasets, we demonstrate the effectiveness of RoLI and achieve new state-of-the-art results. Our code is available at https://github.com/DongXzz/RoLI.
Submitted 30 March, 2024; v1 submitted 9 December, 2023;
originally announced December 2023.
-
Deep Multimodal Fusion for Surgical Feedback Classification
Authors:
Rafal Kocielnik,
Elyssa Y. Wong,
Timothy N. Chu,
Lydia Lin,
De-An Huang,
Jiayun Wang,
Anima Anandkumar,
Andrew J. Hung
Abstract:
Quantification of real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvements in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., through visual cues like pointing to anatomic elements). In this work, we leverage a clinically-validated five-category classification of surgical feedback: "Anatomic", "Technical", "Procedural", "Praise" and "Visual Aid". We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities. The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale. Our automated classification of surgical feedback achieves AUCs ranging from 71.5 to 77.6 with the fusion improving performance by 3.1%. We also show that high-quality manual transcriptions of feedback audio from experts improve AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that a staged training strategy, which first pre-trains each modality separately and then trains them jointly, is more effective than training the different modalities all together. We also present intuitive findings on the importance of modalities for different feedback categories. This work offers an important first look at the feasibility of automated classification of real-world live surgical feedback based on text, audio, and video modalities.
Submitted 5 December, 2023;
originally announced December 2023.
-
Sum-of-Parts: Self-Attributing Neural Networks with End-to-End Learning of Feature Groups
Authors:
Weiqiu You,
Helen Qu,
Marco Gatti,
Bhuvnesh Jain,
Eric Wong
Abstract:
Self-attributing neural networks (SANNs) present a potential path towards interpretable models for high-dimensional problems, but often face significant trade-offs in performance. In this work, we formally prove a lower bound on errors of per-feature SANNs, whereas group-based SANNs can achieve zero error and thus high performance. Motivated by these insights, we propose Sum-of-Parts (SOP), a framework that transforms any differentiable model into a group-based SANN, where feature groups are learned end-to-end without group supervision. SOP achieves state-of-the-art performance for SANNs on vision and language tasks, and we validate that the groups are interpretable on a range of quantitative and semantic metrics. We further validate the utility of SOP explanations in model debugging and cosmological scientific discovery. Code is available at https://github.com/BrachioLab/sop.
Submitted 14 February, 2025; v1 submitted 24 October, 2023;
originally announced October 2023.
-
SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
Authors:
Chongyu Fan,
Jiancheng Liu,
Yihua Zhang,
Eric Wong,
Dennis Wei,
Sijia Liu
Abstract:
With evolving data regulations, machine unlearning (MU) has become an important tool for fostering trust and safety in today's AI models. However, existing MU methods focusing on data and/or weight perspectives often suffer limitations in unlearning accuracy, stability, and cross-domain applicability. To address these challenges, we introduce the concept of 'weight saliency' for MU, drawing parallels with input saliency in model explanation. This innovation directs MU's attention toward specific model weights rather than the entire model, improving effectiveness and efficiency. The resultant method that we call saliency unlearning (SalUn) narrows the performance gap with 'exact' unlearning (model retraining from scratch after removing the forgetting data points). To the best of our knowledge, SalUn is the first principled MU approach that can effectively erase the influence of forgetting data, classes, or concepts in both image classification and generation tasks. For example, SalUn yields a stability advantage in high-variance random data forgetting, e.g., with a 0.2% gap compared to exact unlearning on the CIFAR-10 dataset. Moreover, in preventing conditional diffusion models from generating harmful images, SalUn achieves nearly 100% unlearning accuracy, outperforming current state-of-the-art baselines like Erased Stable Diffusion and Forget-Me-Not. Codes are available at https://github.com/OPTML-Group/Unlearn-Saliency. (WARNING: This paper contains model outputs that may be offensive in nature.)
Submitted 4 April, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Jailbreaking Black Box Large Language Models in Twenty Queries
Authors:
Patrick Chao,
Alexander Robey,
Edgar Dobriban,
Hamed Hassani,
George J. Pappas,
Eric Wong
Abstract:
There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini.
Submitted 18 July, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Comparing Styles across Languages: A Cross-Cultural Exploration of Politeness
Authors:
Shreya Havaldar,
Matthew Pressimone,
Eric Wong,
Lyle Ungar
Abstract:
Understanding how styles differ across languages is advantageous for training both humans and computers to generate culturally appropriate text. We introduce an explanation framework to extract stylistic differences from multilingual LMs and compare styles across languages. Our framework (1) generates comprehensive style lexica in any language and (2) consolidates feature importances from LMs into comparable lexical categories. We apply this framework to compare politeness, creating the first holistic multilingual politeness dataset and exploring how politeness varies across four languages. Our approach enables an effective evaluation of how distinct linguistic categories contribute to stylistic variations and provides interpretable insights into how people communicate differently around the world.
Submitted 26 March, 2025; v1 submitted 10 October, 2023;
originally announced October 2023.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Authors:
Alexander Robey,
Eric Wong,
Hamed Hassani,
George J. Pappas
Abstract:
Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.
Submitted 11 June, 2024; v1 submitted 5 October, 2023;
originally announced October 2023.