-
SplInterp: Improving our Understanding and Training of Sparse Autoencoders
Authors:
Jeremy Budd,
Javier Ideami,
Benjamin Macdowall Rynne,
Keith Duggar,
Randall Balestriero
Abstract:
Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework, we discover that SAEs generalise ``$k$-means autoencoders'' to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal ``$k$-means-esque plus local principal component analysis (PCA)'' piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. And we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: https://github.com/splInterp2025/splInterp
Submitted 17 May, 2025;
originally announced May 2025.
-
On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o
Authors:
Rundong Liu,
Andre Frade,
Amal Vaidya,
Maxime Labonne,
Marcus Kaiser,
Bismayan Chakrabarti,
Jonathan Budd,
Sean Moran
Abstract:
This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising Pylint Score, Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: https://github.com/jpmorganchase/CodeQuest.
Submitted 11 February, 2025;
originally announced February 2025.
-
Graph Laplacian-based Bayesian Multi-fidelity Modeling
Authors:
Orazio Pinti,
Jeremy M. Budd,
Franca Hoffmann,
Assad A. Oberai
Abstract:
We present a novel probabilistic approach for generating multi-fidelity data while accounting for errors inherent in both low- and high-fidelity data. In this approach a graph Laplacian constructed from the low-fidelity data is used to define a multivariate Gaussian prior density for the coordinates of the true data points. In addition, a few high-fidelity data points are used to construct a conjugate likelihood term. Thereafter, Bayes' rule is applied to derive an explicit expression for the posterior density, which is also multivariate Gaussian. The maximum \textit{a posteriori} (MAP) estimate of this density is selected to be the optimal multi-fidelity estimate. It is shown that the MAP estimate and the covariance of the posterior density can be determined through the solution of linear systems of equations. Thereafter, two methods, one based on spectral truncation and another based on a low-rank approximation, are developed to solve these equations efficiently. The multi-fidelity approach is tested on a variety of problems in solid and fluid mechanics with data that represents vectors of quantities of interest and discretized spatial fields in one and two dimensions. The results demonstrate that by utilizing a small fraction of high-fidelity data, the multi-fidelity approach can significantly improve the accuracy of a large collection of low-fidelity data points.
Submitted 12 September, 2024;
originally announced September 2024.
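The prior-likelihood-posterior pipeline described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Gaussian-kernel graph weights, the regularised precision `omega * (L + tau * I)`, the isotropic noise level `sigma`, and the function names `graph_laplacian` and `map_estimate` are all assumptions made for illustration, and the dense solve stands in for the spectral-truncation and low-rank solvers the paper develops.

```python
import numpy as np


def graph_laplacian(points, length_scale=1.0):
    """Unnormalised graph Laplacian with Gaussian-kernel weights (assumed form)."""
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    weights = np.exp(-sq_dist / (2.0 * length_scale ** 2))
    np.fill_diagonal(weights, 0.0)  # no self-loops
    return np.diag(weights.sum(axis=1)) - weights


def map_estimate(x_low, hi_idx, x_high, omega=1.0, tau=1e-3, sigma=1e-2):
    """MAP estimate of the true points from low- and high-fidelity data.

    Assumed prior:      u ~ N(x_low, [omega * (L + tau * I)]^{-1})
    Assumed likelihood: x_high = u[hi_idx] + N(0, sigma^2 I)
    Both are Gaussian, so the posterior is Gaussian and its mean (the MAP
    estimate) solves a single linear system.
    """
    n = x_low.shape[0]
    hi_idx = np.asarray(hi_idx)
    prior_prec = omega * (graph_laplacian(x_low) + tau * np.eye(n))

    # Selection matrix picking out the points that have high-fidelity data.
    select = np.zeros((len(hi_idx), n))
    select[np.arange(len(hi_idx)), hi_idx] = 1.0

    post_prec = prior_prec + select.T @ select / sigma ** 2
    rhs = prior_prec @ x_low + select.T @ x_high / sigma ** 2
    return np.linalg.solve(post_prec, rhs)
```

In this sketch, points without high-fidelity data are corrected only through their graph coupling to the points that have it, which is the mechanism by which a small fraction of high-fidelity data can improve the whole low-fidelity collection.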
-
Weakly Convex Regularisers for Inverse Problems: Convergence of Critical Points and Primal-Dual Optimisation
Authors:
Zakhar Shumaylov,
Jeremy Budd,
Subhadip Mukherjee,
Carola-Bibiane Schönlieb
Abstract:
Variational regularisation is the primary method for solving inverse problems, and recently there has been considerable work leveraging deeply learned regularisation for enhanced performance. However, few results exist addressing the convergence of such regularisation, particularly within the context of critical points as opposed to global minimisers. In this paper, we present a generalised formulation of convergent regularisation in terms of critical points, and show that this is achieved by a class of weakly convex regularisers. We prove convergence of the primal-dual hybrid gradient method for the associated variational problem, and, given a Kurdyka-Lojasiewicz condition, an $\mathcal{O}(\log{k}/k)$ ergodic convergence rate. Finally, applying this theory to learned regularisation, we prove universal approximation for input weakly convex neural networks (IWCNN), and show empirically that IWCNNs can lead to improved performance of learned adversarial regularisers for computed tomography (CT) reconstruction.
Submitted 15 June, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Provably Convergent Data-Driven Convex-Nonconvex Regularization
Authors:
Zakhar Shumaylov,
Jeremy Budd,
Subhadip Mukherjee,
Carola-Bibiane Schönlieb
Abstract:
An emerging new paradigm for solving inverse problems is via the use of deep learning to learn a regularizer from data. This leads to high-quality results, but often at the cost of provable guarantees. In this work, we show how well-posedness and convergent regularization arise within the convex-nonconvex (CNC) framework for inverse problems. We introduce a novel input weakly convex neural network (IWCNN) construction to adapt the method of learned adversarial regularization to the CNC framework. Empirically we show that our method overcomes numerical issues of previous adversarial methods.
Submitted 2 November, 2023; v1 submitted 9 October, 2023;
originally announced October 2023.
-
Statistical Design and Analysis for Robust Machine Learning: A Case Study from COVID-19
Authors:
Davide Pigoli,
Kieran Baker,
Jobie Budd,
Lorraine Butler,
Harry Coppock,
Sabrina Egglestone,
Steven G. Gilmour,
Chris Holmes,
David Hurley,
Radka Jersakova,
Ivan Kiskin,
Vasiliki Koutra,
Jonathon Mellor,
George Nicholson,
Joe Packham,
Selina Patel,
Richard Payne,
Stephen J. Roberts,
Björn W. Schuller,
Ana Tendero-Cañadas,
Tracey Thornley,
Alexander Titcomb
Abstract:
Since early in the coronavirus disease 2019 (COVID-19) pandemic, there has been interest in using artificial intelligence methods to predict COVID-19 infection status based on vocal audio signals, for example cough recordings. However, existing studies have limitations in terms of data collection and of the assessment of the performances of the proposed predictive models. This paper rigorously assesses state-of-the-art machine learning techniques used to predict COVID-19 infection status based on vocal audio signals, using a dataset collected by the UK Health Security Agency. This dataset includes acoustic recordings and extensive study participant meta-data. We provide guidelines on testing the performance of methods to classify COVID-19 infection status based on acoustic features and we discuss how these can be extended more generally to the development and assessment of predictive methods based on public health datasets.
Submitted 27 February, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers
Authors:
Harry Coppock,
George Nicholson,
Ivan Kiskin,
Vasiliki Koutra,
Kieran Baker,
Jobie Budd,
Richard Payne,
Emma Karoune,
David Hurley,
Alexander Titcomb,
Sabrina Egglestone,
Ana Tendero Cañadas,
Lorraine Butler,
Radka Jersakova,
Jonathon Mellor,
Selina Patel,
Tracey Thornley,
Peter Diggle,
Sylvia Richardson,
Josef Packham,
Björn W. Schuller,
Davide Pigoli,
Steven Gilmour,
Stephen Roberts,
Chris Holmes
Abstract:
Recent work has reported that AI classifiers trained on audio recordings can accurately predict severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection status. Here, we undertake a large-scale study of audio-based deep learning classifiers, as part of the UK government's pandemic response. We collect and analyse a dataset of audio recordings from 67,842 individuals with linked metadata, including reverse transcription polymerase chain reaction (PCR) test outcomes, of whom 23,514 tested positive for SARS-CoV-2. Subjects were recruited via the UK government's National Health Service Test-and-Trace programme and the REal-time Assessment of Community Transmission (REACT) randomised surveillance survey. In an unadjusted analysis of our dataset, AI classifiers predict SARS-CoV-2 infection status with high accuracy (Receiver Operating Characteristic Area Under the Curve (ROC-AUC) 0.846 [0.838, 0.854]), consistent with the findings of previous studies. However, after matching on measured confounders, such as age, gender, and self-reported symptoms, our classifiers' performance is much weaker (ROC-AUC 0.619 [0.594, 0.644]). Upon quantifying the utility of audio-based classifiers in practical settings, we find them to be outperformed by simple predictive scores based on user-reported symptoms.
Submitted 2 March, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
A large-scale and PCR-referenced vocal audio dataset for COVID-19
Authors:
Jobie Budd,
Kieran Baker,
Emma Karoune,
Harry Coppock,
Selina Patel,
Ana Tendero Cañadas,
Alexander Titcomb,
Richard Payne,
David Hurley,
Sabrina Egglestone,
Lorraine Butler,
Jonathon Mellor,
George Nicholson,
Ivan Kiskin,
Vasiliki Koutra,
Radka Jersakova,
Rachel A. McKendry,
Peter Diggle,
Sylvia Richardson,
Björn W. Schuller,
Steven Gilmour,
Davide Pigoli,
Stephen Roberts,
Josef Packham,
Tracey Thornley
, et al. (1 additional author not shown)
Abstract:
The UK COVID-19 Vocal Audio Dataset is designed for the training and evaluation of machine learning models that classify SARS-CoV-2 infection status or associated respiratory symptoms using vocal audio. The UK Health Security Agency recruited voluntary participants through the national Test and Trace programme and the REACT-1 survey in England from March 2021 to March 2022, during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron variant sublineages. Audio recordings of volitional coughs, exhalations, and speech were collected in the 'Speak up to help beat coronavirus' digital survey alongside demographic, self-reported symptom and respiratory condition data, and linked to SARS-CoV-2 test results. The UK COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date. PCR results were linked to 70,794 of 72,999 participants and 24,155 of 25,776 positive cases. Respiratory symptoms were reported by 45.62% of participants. This dataset has additional potential uses for bioacoustics research, with 11.30% of participants reporting asthma, and 27.20% with linked influenza PCR test results.
Submitted 3 November, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
Joint reconstruction-segmentation on graphs
Authors:
Jeremy Budd,
Yves van Gennip,
Jonas Latz,
Simone Parisotto,
Carola-Bibiane Schönlieb
Abstract:
Practical image segmentation tasks concern images which must be reconstructed from noisy, distorted, and/or incomplete observations. A recent approach for solving such tasks is to perform this reconstruction jointly with the segmentation, using each to guide the other. However, this work has so far employed relatively simple segmentation methods, such as the Chan--Vese algorithm. In this paper, we present a method for joint reconstruction-segmentation using graph-based segmentation methods, which have been seeing increasing recent interest. Complications arise due to the large size of the matrices involved, and we show how these complications can be managed. We then analyse the convergence properties of our scheme. Finally, we apply this scheme to distorted versions of ``two cows'' images familiar from previous graph-based segmentation literature, first to a highly noised version and second to a blurred version, achieving highly accurate segmentations in both cases. We compare these results to those obtained by sequential reconstruction-segmentation approaches, finding that our method competes with, or even outperforms, those approaches in terms of reconstruction and segmentation accuracy.
Submitted 20 January, 2023; v1 submitted 11 August, 2022;
originally announced August 2022.
-
Go local: The key to controlling the COVID-19 pandemic in the post lockdown era
Authors:
Isabel Bennett,
Jobie Budd,
Erin M. Manning,
Ed Manley,
Mengdie Zhuang,
Ingemar J. Cox,
Michael Short,
Anne M. Johnson,
Deenan Pillay,
Rachel A. McKendry
Abstract:
The UK government announced its first wave of lockdown easing on 10 May 2020, two months after the non-pharmaceutical measures to reduce the spread of COVID-19 were first introduced on 23 March 2020. Analysis of reported case rate data from Public Health England and aggregated and anonymised crowd-level mobility data shows variability across local authorities in the UK. A locality-based approach to lockdown easing is needed, enabling local public health and associated health and social care services to rapidly respond to emerging hotspots of infection. National-level data will hide an increasing heterogeneity of COVID-19 infections and mobility, and new ways of real-time data presentation to the public are required. Data sources (including mobile) allow for faster visualisation than more traditional data sources, and are part of a wider trend towards near real-time analysis of outbreaks needed for timely, targeted local public health interventions. Real-time data visualisation may give early warnings of unusual levels of activity which warrant further investigation by local public health authorities.
Submitted 6 July, 2020;
originally announced July 2020.