-
Symbolic Regression with Multimodal Large Language Models and Kolmogorov Arnold Networks
Authors:
Thomas R. Harvey,
Fabian Ruehle,
Kit Fraser-Taliente,
James Halverson
Abstract:
We present a novel approach to symbolic regression using vision-capable large language models (LLMs) and the ideas behind Google DeepMind's Funsearch. The LLM is given a plot of a univariate function and tasked with proposing an ansatz for that function. The free parameters of the ansatz are fitted using standard numerical optimisers, and a collection of such ansätze make up the population of a…
▽ More
We present a novel approach to symbolic regression using vision-capable large language models (LLMs) and the ideas behind Google DeepMind's Funsearch. The LLM is given a plot of a univariate function and tasked with proposing an ansatz for that function. The free parameters of the ansatz are fitted using standard numerical optimisers, and a collection of such ansätze make up the population of a genetic algorithm. Unlike other symbolic regression techniques, our method does not require the specification of a set of functions to be used in regression, but with appropriate prompt engineering, we can arbitrarily condition the generative step. By using Kolmogorov Arnold Networks (KANs), we demonstrate that ``univariate is all you need'' for symbolic regression, and extend this method to multivariate functions by learning the univariate function on each edge of a trained KAN. The combined expression is then simplified by further processing with a language model.
△ Less
Submitted 12 May, 2025;
originally announced May 2025.
-
Analyzing the Impact of Low-Rank Adaptation for Cross-Domain Few-Shot Object Detection in Aerial Images
Authors:
Hicham Talaoubrid,
Anissa Mokraoui,
Ismail Ben Ayed,
Axel Prouvost,
Sonimith Hang,
Monit Korn,
Rémi Harvey
Abstract:
This paper investigates the application of Low-Rank Adaptation (LoRA) to small models for cross-domain few-shot object detection in aerial images. Originally designed for large-scale models, LoRA helps mitigate overfitting, making it a promising approach for resource-constrained settings. We integrate LoRA into DiffusionDet, and evaluate its performance on the DOTA and DIOR datasets. Our results s…
▽ More
This paper investigates the application of Low-Rank Adaptation (LoRA) to small models for cross-domain few-shot object detection in aerial images. Originally designed for large-scale models, LoRA helps mitigate overfitting, making it a promising approach for resource-constrained settings. We integrate LoRA into DiffusionDet, and evaluate its performance on the DOTA and DIOR datasets. Our results show that LoRA applied after an initial fine-tuning slightly improves performance in low-shot settings (e.g., 1-shot and 5-shot), while full fine-tuning remains more effective in higher-shot configurations. These findings highlight LoRA's potential for efficient adaptation in aerial object detection, encouraging further research into parameter-efficient fine-tuning strategies for few-shot learning. Our code is available here: https://github.com/HichTala/LoRA-DiffusionDet.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
Generative Modeling for Mathematical Discovery
Authors:
Jordan S. Ellenberg,
Cristofero S. Fraser-Taliente,
Thomas R. Harvey,
Karan Srivastava,
Andrew V. Sutherland
Abstract:
We present a new implementation of the LLM-driven genetic algorithm {\it funsearch}, whose aim is to generate examples of interest to mathematicians and which has already had some success in problems in extremal combinatorics. Our implementation is designed to be useful in practice for working mathematicians; it does not require expertise in machine learning or access to high-performance computing…
▽ More
We present a new implementation of the LLM-driven genetic algorithm {\it funsearch}, whose aim is to generate examples of interest to mathematicians and which has already had some success in problems in extremal combinatorics. Our implementation is designed to be useful in practice for working mathematicians; it does not require expertise in machine learning or access to high-performance computing resources. Applying {\it funsearch} to a new problem involves modifying a small segment of Python code and selecting a large language model (LLM) from one of many third-party providers. We benchmarked our implementation on three different problems, obtaining metrics that may inform applications of {\it funsearch} to new problems. Our results demonstrate that {\it funsearch} successfully learns in a variety of combinatorial and number-theoretic settings, and in some contexts learns principles that generalize beyond the problem originally trained on.
△ Less
Submitted 16 March, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Decoding Nature with Nature's Tools: Heterotic Line Bundle Models of Particle Physics with Genetic Algorithms and Quantum Annealing
Authors:
Steve Abel,
Andrei Constantin,
Thomas R. Harvey,
Andre Lukas,
Luca A. Nutricati
Abstract:
The string theory landscape may include a multitude of ultraviolet embeddings of the Standard Model, but identifying these has proven difficult due to the enormous number of available string compactifications. Genetic Algorithms (GAs) represent a powerful class of discrete optimisation techniques that can efficiently deal with the immensity of the string landscape, especially when enhanced with in…
▽ More
The string theory landscape may include a multitude of ultraviolet embeddings of the Standard Model, but identifying these has proven difficult due to the enormous number of available string compactifications. Genetic Algorithms (GAs) represent a powerful class of discrete optimisation techniques that can efficiently deal with the immensity of the string landscape, especially when enhanced with input from quantum annealers. In this letter we focus on geometric compactifications of the $E_8\times E_8$ heterotic string theory compactified on smooth Calabi-Yau threefolds with Abelian bundles. We make use of analytic formulae for bundle-valued cohomology to impose the entire range of spectrum requirements, something that has not been possible so far. For manifolds with a relatively low number of Kahler parameters we compare the GA search results with results from previous systematic scans, showing that GAs can find nearly all the viable solutions while visiting only a tiny fraction of the solution space. Moreover, we carry out GA searches on manifolds with a larger numbers of Kahler parameters where systematic searches are not feasible.
△ Less
Submitted 5 June, 2023;
originally announced June 2023.
-
SMART: Self-Morphing Adaptive Replanning Tree
Authors:
Zongyuan Shen,
James P. Wilson,
Shalabh Gupta,
Ryan Harvey
Abstract:
The paper presents an algorithm, called Self-Morphing Adaptive Replanning Tree (SMART), that facilitates fast replanning in dynamic environments. SMART performs risk based tree-pruning if the current path is obstructed by nearby moving obstacle(s), resulting in multiple disjoint subtrees. Then, for speedy recovery, it exploits these subtrees and performs informed tree-repair at hot-spots that lie…
▽ More
The paper presents an algorithm, called Self-Morphing Adaptive Replanning Tree (SMART), that facilitates fast replanning in dynamic environments. SMART performs risk based tree-pruning if the current path is obstructed by nearby moving obstacle(s), resulting in multiple disjoint subtrees. Then, for speedy recovery, it exploits these subtrees and performs informed tree-repair at hot-spots that lie at the intersection of subtrees to find a new path. The performance of SMART is comparatively evaluated with eight existing algorithms through extensive simulations. Two scenarios are considered with: 1) dynamic obstacles and 2) both static and dynamic obstacles. The results show that SMART yields significant improvements in replanning time, success rate and travel time. Finally, the performance of SMART is validated by a real laboratory experiment.
△ Less
Submitted 21 September, 2023; v1 submitted 10 May, 2023;
originally announced May 2023.
-
SMARRT: Self-Repairing Motion-Reactive Anytime RRT for Dynamic Environments
Authors:
Zongyuan Shen,
James Wilson,
Ryan Harvey,
Shalabh Gupta
Abstract:
This paper addresses the fast replanning problem in dynamic environments with moving obstacles. Since for randomly moving obstacles the future states are unpredictable, the proposed method, called SMARRT, reacts to obstacle motions and revises the path in real-time based on the current interfering obstacle state (i.e., position and velocity). SMARRT is fast and efficient and performs collision che…
▽ More
This paper addresses the fast replanning problem in dynamic environments with moving obstacles. Since for randomly moving obstacles the future states are unpredictable, the proposed method, called SMARRT, reacts to obstacle motions and revises the path in real-time based on the current interfering obstacle state (i.e., position and velocity). SMARRT is fast and efficient and performs collision checking only on the partial path segment close to the robot within a feasibility checking horizon. If the path is infeasible, then tree parts associated with the path inside the horizon are pruned while maintaining the maximal tree structure of already-explored regions. Then, a multi-resolution utility map is created to capture the environmental information used to compute the replanning utility for each cell on the multi-scale tiling. A hierarchical searching method is applied on the map to find the sampling cell efficiently. Finally, uniform samples are drawn within the sampling cell for fast replanning. The SMARRT method is validated via simulation runs, and the results are evaluated in comparison to four existing methods. The SMARRT method yields significant improvements in travel time, replanning time, and success rate compared against the existing methods.
△ Less
Submitted 10 September, 2021;
originally announced September 2021.
-
Heterotic String Model Building with Monad Bundles and Reinforcement Learning
Authors:
Andrei Constantin,
Thomas R. Harvey,
Andre Lukas
Abstract:
We use reinforcement learning as a means of constructing string compactifications with prescribed properties. Specifically, we study heterotic SO(10) GUT models on Calabi-Yau three-folds with monad bundles, in search of phenomenologically promising examples. Due to the vast number of bundles and the sparseness of viable choices, methods based on systematic scanning are not suitable for this class…
▽ More
We use reinforcement learning as a means of constructing string compactifications with prescribed properties. Specifically, we study heterotic SO(10) GUT models on Calabi-Yau three-folds with monad bundles, in search of phenomenologically promising examples. Due to the vast number of bundles and the sparseness of viable choices, methods based on systematic scanning are not suitable for this class of models. By focusing on two specific manifolds with Picard numbers two and three, we show that reinforcement learning can be used successfully to explore monad bundles. Training can be accomplished with minimal computing resources and leads to highly efficient policy networks. They produce phenomenologically promising states for nearly 100% of episodes and within a small number of steps. In this way, hundreds of new candidate standard models are found.
△ Less
Submitted 16 August, 2021;
originally announced August 2021.
-
CPPNet: A Coverage Path Planning Network
Authors:
Zongyuan Shen,
Palash Agrawal,
James P. Wilson,
Ryan Harvey,
Shalabh Gupta
Abstract:
This paper presents a deep-learning based CPP algorithm, called Coverage Path Planning Network (CPPNet). CPPNet is built using a convolutional neural network (CNN) whose input is a graph-based representation of the occupancy grid map while its output is an edge probability heat graph, where the value of each edge is the probability of belonging to the optimal TSP tour. Finally, a greedy search is…
▽ More
This paper presents a deep-learning based CPP algorithm, called Coverage Path Planning Network (CPPNet). CPPNet is built using a convolutional neural network (CNN) whose input is a graph-based representation of the occupancy grid map while its output is an edge probability heat graph, where the value of each edge is the probability of belonging to the optimal TSP tour. Finally, a greedy search is used to select the final optimized tour. CPPNet is trained and comparatively evaluated against the TSP tour. It is shown that CPPNet provides near-optimal solutions while requiring significantly less computational time, thus enabling real-time coverage path planning in partially unknown and dynamic environments.
△ Less
Submitted 3 August, 2021;
originally announced August 2021.
-
MRRT: Multiple Rapidly-Exploring Random Trees for Fast Online Replanning in Dynamic Environments
Authors:
Zongyuan Shen,
James P. Wilson,
Ryan Harvey,
Shalabh Gupta
Abstract:
This paper presents a novel algorithm, called MRRT, which uses multiple rapidly-exploring random trees for fast online replanning of autonomous vehicles in dynamic environments with moving obstacles. The proposed algorithm is built upon the RRT algorithm with a multi-tree structure. At the beginning, the RRT algorithm is applied to find the initial solution based on partial knowledge of the enviro…
▽ More
This paper presents a novel algorithm, called MRRT, which uses multiple rapidly-exploring random trees for fast online replanning of autonomous vehicles in dynamic environments with moving obstacles. The proposed algorithm is built upon the RRT algorithm with a multi-tree structure. At the beginning, the RRT algorithm is applied to find the initial solution based on partial knowledge of the environment. Then, the robot starts to execute this path. At each iteration, the new obstacle configurations are collected by the robot's sensor and used to replan the path. This new information can come from unknown static obstacles (e.g., seafloor layout) as well as moving obstacles. Then, to accommodate the environmental changes, two procedures are adopted: 1) edge pruning, and 2) tree regrowing. Specifically, the edge pruning procedure checks the collision status through the tree and only removes the invalid edges while maintaining the tree structure of already-explored regions. Due to removal of invalid edges, the tree could be broken into multiple disjoint trees. As such, the RRT algorithm is applied to regrow the trees. Specifically, a sample is created randomly and joined to all the disjoint trees in its local neighborhood by connecting to the nearest nodes. Finally, a new solution is found for the robot. The advantages of the proposed MRRT algorithm are as follows: i) retains the maximal tree structure by only pruning the edges which collide with the obstacles, ii) guarantees probabilistic completeness, and iii) is computational efficient for fast replanning since all disjoint trees are maintained for future connections and expanded simultaneously.
△ Less
Submitted 22 April, 2021;
originally announced April 2021.
-
Detecting Electric Devices in 3D Images of Bags
Authors:
Anthony Bagnall,
Paul Southam,
James Large,
Richard Harvey
Abstract:
The aviation and transport security industries face the challenge of screening high volumes of baggage for threats and contraband in the minimum time possible. Automation and semi-automation of this procedure offers the potential to increase security by detecting more threats and improve the customer experience by speeding up the process. Traditional 2D x-ray images are often extremely difficult t…
▽ More
The aviation and transport security industries face the challenge of screening high volumes of baggage for threats and contraband in the minimum time possible. Automation and semi-automation of this procedure offers the potential to increase security by detecting more threats and improve the customer experience by speeding up the process. Traditional 2D x-ray images are often extremely difficult to examine due to the fact that they are tightly packed and contain a wide variety of cluttered and occluded objects. Because of these limitations, major airports are introducing 3D x-ray Computed Tomography (CT) baggage scanning. We investigate whether we can automate the process of detecting electric devices in these 3D images of luggage. Detecting electrical devices is of particular concern as they can be used to conceal explosives. Given the massive volume of luggage that needs to be screened for this threat, the best way to automate the detection is to first filter whether a bag contains an electric device or not, and if it does, to identify the number of devices and their location. We present an algorithm, Unpack, Predict, eXtract, Repack (UXPR), which involves unpacking through segmenting the data at a range of scales using an algorithm known as the Sieve, predicting whether a segment is electrical or not based on the histogram of voxel intensities, then repacking the bag by ensembling the segments and predictions to identify the devices in bags. Through a range of experiments using data provided by ALERT (Awareness and Localization of Explosives-Related Threats) we show that this system can find a high proportion of devices with unsupervised segmentation if a similar device has been seen before, and shows promising results for detecting devices not seen at all based on the properties of its constituent parts.
△ Less
Submitted 25 April, 2020;
originally announced May 2020.
-
Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals
Authors:
Helen L Bear,
Richard Harvey
Abstract:
Visual lip gestures observed whilst lipreading have a few working definitions, the most common two are; `the visual equivalent of a phoneme' and `phonemes which are indistinguishable on the lips'. To date there is no formal definition, in part because to date we have not established a two-way relationship or mapping between visemes and phonemes. Some evidence suggests that visual speech is highly…
▽ More
Visual lip gestures observed whilst lipreading have a few working definitions, the most common two are; `the visual equivalent of a phoneme' and `phonemes which are indistinguishable on the lips'. To date there is no formal definition, in part because to date we have not established a two-way relationship or mapping between visemes and phonemes. Some evidence suggests that visual speech is highly dependent upon the speaker. So here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We test these phoneme to viseme maps to examine how similarly speakers talk visually and we use signed rank tests to measure the distance between individuals. We conclude that broadly speaking, speakers have the same repertoire of mouth gestures, where they differ is in the use of the gestures.
△ Less
Submitted 8 May, 2018;
originally announced May 2018.
-
Phoneme-to-viseme mappings: the good, the bad, and the ugly
Authors:
Helen L Bear,
Richard Harvey
Abstract:
Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is "a set of phonemes which have identical appearance on the lips". Therefore a phoneme falls into one viseme class but a viseme may represent many phonemes: a many to one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguit…
▽ More
Visemes are the visual equivalent of phonemes. Although not precisely defined, a working definition of a viseme is "a set of phonemes which have identical appearance on the lips". Therefore a phoneme falls into one viseme class but a viseme may represent many phonemes: a many to one mapping. This mapping introduces ambiguity between phonemes when using viseme classifiers. Not only is this ambiguity damaging to the performance of audio-visual classifiers operating on real expressive speech, there is also considerable choice between possible mappings. In this paper we explore the issue of this choice of viseme-to-phoneme map. We show that there is definite difference in performance between viseme-to-phoneme mappings and explore why some maps appear to work better than others. We also devise a new algorithm for constructing phoneme-to-viseme mappings from labeled speech data. These new visemes, `Bear' visemes, are shown to perform better than previously known units.
△ Less
Submitted 8 May, 2018;
originally announced May 2018.
-
Comparing phonemes and visemes with DNN-based lipreading
Authors:
Kwanchiva Thangthai,
Helen L Bear,
Richard Harvey
Abstract:
There is debate if phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on visemes with varying results. We compare the performance of a lipreading system by modeling visual speech using either 13 viseme or 38 phoneme units. We report t…
▽ More
There is debate if phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on visemes with varying results. We compare the performance of a lipreading system by modeling visual speech using either 13 viseme or 38 phoneme units. We report the accuracy of our system at both word and unit levels. The evaluation task is large vocabulary continuous speech using the TCD-TIMIT corpus. We complete our visual speech modeling via hybrid DNN-HMMs and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We use DCT and Eigenlips as a representation of mouth ROI image. The phoneme lipreading system word accuracy outperforms the viseme based system word accuracy. However, the phoneme system achieved lower accuracy at the unit level which shows the importance of the dictionary for decoding classification outputs into words.
△ Less
Submitted 8 May, 2018;
originally announced May 2018.
-
Decoding visemes: improving machine lipreading
Authors:
Helen L. Bear,
Richard Harvey
Abstract:
To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models with varying degrees of success. A few recent works suggest phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses p…
▽ More
To undertake machine lip-reading, we try to recognise speech from a visual signal. Current work often uses viseme classification supported by language models with varying degrees of success. A few recent works suggest phoneme classification, in the right circumstances, can outperform viseme classification. In this work we present a novel two-pass method of training phoneme classifiers which uses previously trained visemes in the first pass. With our new training algorithm, we show classification performance which significantly improves on previous lip-reading results.
△ Less
Submitted 3 October, 2017;
originally announced October 2017.
-
Finding phonemes: improving machine lip-reading
Authors:
Helen L. Bear,
Richard W. Harvey,
Yuxuan Lan
Abstract:
In machine lip-reading there is continued debate and research around the correct classes to be used for recognition. In this paper we use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a set of phoneme-to-viseme maps where each has a different quantity of visemes ranging from two to 45. Viseme classes are based upon the mapping of articulated pho…
▽ More
In machine lip-reading there is continued debate and research around the correct classes to be used for recognition. In this paper we use a structured approach for devising speaker-dependent viseme classes, which enables the creation of a set of phoneme-to-viseme maps where each has a different quantity of visemes ranging from two to 45. Viseme classes are based upon the mapping of articulated phonemes, which have been confused during phoneme recognition, into viseme groups. Using these maps, with the LiLIR dataset, we show the effect of changing the viseme map size in speaker-dependent machine lip-reading, measured by word recognition correctness and so demonstrate that word recognition with phoneme classifiers is not just possible, but often better than word recognition with viseme classifiers. Furthermore, there are intermediate units between visemes and phonemes which are better still.
△ Less
Submitted 3 October, 2017;
originally announced October 2017.
-
Speaker-independent machine lip-reading with speaker-dependent viseme classifiers
Authors:
Helen L. Bear,
Stephen J. Cox,
Richard W. Harvey
Abstract:
In machine lip-reading, which is identification of speech from visual-only information, there is evidence to show that visual speech is highly dependent upon the speaker [1]. Here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We use these maps to examine how similarly speakers talk visually. We conclude that broadly speaking, spea…
▽ More
In machine lip-reading, which is identification of speech from visual-only information, there is evidence to show that visual speech is highly dependent upon the speaker [1]. Here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We use these maps to examine how similarly speakers talk visually. We conclude that broadly speaking, speakers have the same repertoire of mouth gestures, where they differ is in the use of the gestures.
△ Less
Submitted 3 October, 2017;
originally announced October 2017.
-
Which phoneme-to-viseme maps best improve visual-only computer lip-reading?
Authors:
Helen L. Bear,
Richard W. Harvey,
Barry-John Theobald,
Yuxuan Lan
Abstract:
A critical assumption of all current visual speech recognition systems is that there are visual speech units called visemes which can be mapped to units of acoustic speech, the phonemes. Despite there being a number of published maps it is infrequent to see the effectiveness of these tested, particularly on visual-only lip-reading (many works use audio-visual speech). Here we examine 120 mappings…
▽ More
A critical assumption of all current visual speech recognition systems is that there are visual speech units called visemes which can be mapped to units of acoustic speech, the phonemes. Despite there being a number of published maps it is infrequent to see the effectiveness of these tested, particularly on visual-only lip-reading (many works use audio-visual speech). Here we examine 120 mappings and consider if any are stable across talkers. We show a method for devising maps based on phoneme confusions from an automated lip-reading system, and we present new mappings that show improvements for individual talkers.
△ Less
Submitted 3 October, 2017;
originally announced October 2017.
-
Some observations on computer lip-reading: moving from the dream to the reality
Authors:
Helen L. Bear,
Gari Owen,
Richard Harvey,
Barry-John Theobald
Abstract:
In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution,…
▽ More
In the quest for greater computer lip-reading performance there are a number of tacit assumptions which are either present in the datasets (high resolution for example) or in the methods (recognition of spoken visual units called visemes for example). Here we review these and other assumptions and show the surprising result that computer lip-reading is not heavily constrained by video resolution, pose, lighting and other practical factors. However, the working assumption that visemes, which are the visual equivalent of phonemes, are the best unit for recognition does need further examination. We conclude that visemes, which were defined over a century ago, are unlikely to be optimal for a modern computer lip-reading system.
△ Less
Submitted 3 October, 2017;
originally announced October 2017.
-
Resolution limits on visual speech recognition
Authors:
Helen L. Bear,
Richard Harvey,
Barry-John Theobald,
Yuxuan Lan
Abstract:
Visual-only speech recognition is dependent upon a number of factors that can be difficult to control, such as: lighting; identity; motion; emotion and expression. But some factors, such as video resolution are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test…
▽ More
Visual-only speech recognition is dependent upon a number of factors that can be difficult to control, such as: lighting; identity; motion; emotion and expression. But some factors, such as video resolution are controllable, so it is surprising that there is not yet a systematic study of the effect of resolution on lip-reading. Here we use a new data set, the Rosetta Raven data, to train and test recognizers so we can measure the affect of video resolution on recognition accuracy. We conclude that, contrary to common practice, resolution need not be that great for automatic lip-reading. However it is highly unlikely that automatic lip-reading can work reliably when the distance between the bottom of the lower lip and the top of the upper lip is less than four pixels at rest.
△ Less
Submitted 3 October, 2017;
originally announced October 2017.