-
Magistral
Authors:
Mistral-AI,
Abhinav Rastogi,
Albert Q. Jiang,
Andy Lo,
Gabrielle Berrada,
Guillaume Lample,
Jason Rute,
Joep Barmentlo,
Karmesh Yadav,
Kartik Khandelwal,
Khyathi Raghavi Chandu,
Léonard Blier,
Lucile Saulnier,
Matthieu Dinot,
Maxime Darrin,
Neha Gupta,
Roman Soletskyi,
Sagar Vaze,
Teven Le Scao,
Yihan Wang,
Adam Yang,
Alexander H. Liu,
Alexandre Sablayrolles,
Amélie Héliou
, et al. (76 additional authors not shown)
Abstract:
We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.
Submitted 12 June, 2025;
originally announced June 2025.
-
Learning Obfuscations Of LLM Embedding Sequences: Stained Glass Transform
Authors:
Jay Roberts,
Kyle Mylonakis,
Sidhartha Roy,
Kaan Kale
Abstract:
The high cost of ownership of AI compute infrastructure and the challenges of robust serving of large language models (LLMs) have led to a surge in managed Model-as-a-service deployments. Even when enterprises choose on-premises deployments, the compute infrastructure is typically shared across many teams in order to maximize the return on investment. In both scenarios the deployed models operate only on plaintext data, and so enterprise data owners must allow their data to appear in plaintext on a shared or multi-tenant compute infrastructure. This results in data owners with private or sensitive data being hesitant or restricted in what data they use with these types of deployments. In this work we introduce the Stained Glass Transform, a learned, stochastic, and sequence-dependent transformation of the word embeddings of an LLM which information-theoretically provides privacy to the input of the LLM while preserving the utility of the model. We theoretically connect a particular class of Stained Glass Transforms to the theory of mutual information of Gaussian Mixture Models. We then calculate a posteriori privacy estimates, based on mutual information, and verify the privacy and utility of instances of transformed embeddings through token-level metrics of privacy and standard LLM performance benchmarks.
Submitted 11 June, 2025;
originally announced June 2025.
-
BeamClean: Language Aware Embedding Reconstruction
Authors:
Kaan Kale,
Kyle Mylonakis,
Jay Roberts,
Sidhartha Roy
Abstract:
In this work, we consider an inversion attack on the obfuscated input embeddings sent to a language model on a server, where the adversary has no access to the language model or the obfuscation mechanism and sees only the obfuscated embeddings along with the model's embedding table. We propose BeamClean, an inversion attack that jointly estimates the noise parameters and decodes token sequences by integrating a language-model prior. Against Laplacian and Gaussian obfuscation mechanisms, BeamClean always surpasses naive distance-based attacks. This work highlights the necessity for and robustness of more advanced learned, input-dependent methods.
Submitted 19 May, 2025;
originally announced May 2025.
-
Beyond Accuracy: EcoL2 Metric for Sustainable Neural PDE Solvers
Authors:
Taniya Kapoor,
Abhishek Chandra,
Anastasios Stamou,
Stephen J Roberts
Abstract:
Real-world systems, from aerospace to railway engineering, are modeled with partial differential equations (PDEs) describing the physics of the system. Estimating robust solutions for such problems is essential. Deep learning-based architectures, such as neural PDE solvers, have recently gained traction as a reliable solution method. The current state of development of these approaches, however, primarily focuses on improving accuracy. The environmental impact of excessive computation, leading to increased carbon emissions, has largely been overlooked. This paper introduces a carbon emission measure for a range of PDE solvers. Our proposed metric, EcoL2, balances model accuracy with emissions across data collection, model training, and deployment. Experiments across both physics-informed machine learning and operator learning architectures demonstrate that the proposed metric presents a holistic assessment of model performance and emission cost. As such solvers grow in scale and deployment, EcoL2 represents a step toward building performant scientific machine learning systems with lower long-term environmental impact.
Submitted 18 May, 2025;
originally announced May 2025.
-
THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering
Authors:
Udita Patel,
Rutu Mulkar,
Jay Roberts,
Cibi Chakravarthy Senthilkumar,
Sujay Gandhi,
Xiaofei Zheng,
Naumaan Nayyar,
Parul Kalra,
Rafael Castrillo
Abstract:
We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free framework for RAG (Retrieval-Augmented Generation) based question answering (QA) applications. THELMA consists of six interdependent metrics specifically designed for holistic, fine-grained evaluation of RAG QA applications. The THELMA framework helps developers and application owners evaluate, monitor and improve end-to-end RAG QA pipelines without requiring labelled sources or reference responses. We also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.
Submitted 3 June, 2025; v1 submitted 16 May, 2025;
originally announced May 2025.
-
Dargana: fine-tuning EarthPT for dynamic tree canopy mapping from space
Authors:
Michael J. Smith,
Luke Fleming,
James E. Geach,
Ryan J. Roberts,
Freddie Kalaitzis,
James Banister
Abstract:
We present Dargana, a fine-tuned variant of the EarthPT time-series foundation model that achieves specialisation using <3% of its pre-training data volume and 5% of its pre-training compute. Dargana is fine-tuned to generate regularly updated classification of tree canopy cover at 10m resolution, distinguishing conifer and broadleaved tree types. Using Cornwall, UK, as a test case, the model achieves a pixel-level ROC-AUC of 0.98 and a PR-AUC of 0.83 on unseen satellite imagery. Dargana can identify fine structures like hedgerows and coppice below the training sample limit, and can track temporal changes to canopy cover such as new woodland establishment. Our results demonstrate how pre-trained Large Observation Models like EarthPT can be specialised for granular, dynamic land cover monitoring from space, providing a valuable, scalable tool for natural capital management and conservation.
Submitted 24 April, 2025;
originally announced April 2025.
-
Analysis of Distracted Pedestrians Crossing Behavior: An Immersive Virtual Reality Application
Authors:
Methusela Sulle,
Judith Mwakalonge,
Gurcan Comert,
Saidi Siuhi,
Nana Kankam Gyimah,
Jaylen Roberts,
Denis Ruganuza
Abstract:
Pedestrian safety is a critical public health priority, with pedestrian fatalities accounting for 18% of all U.S. traffic deaths in 2022. The rising prevalence of distracted walking, exacerbated by mobile device use, poses significant risks at signalized intersections. This study utilized an immersive virtual reality (VR) environment to simulate real-world traffic scenarios and assess pedestrian behavior under three conditions: undistracted crossing, crossing while using a mobile device, and crossing with light-emitting diode (LED) safety interventions. Analysis using ANOVA models identified speed and mobile-focused eye-tracking as significant predictors of crossing duration, revealing how distractions impair situational awareness and response times. While LED measures reduced delays, their limited effectiveness highlights the need for integrated strategies addressing both behavioral and physical factors. This study showcases VR's potential to analyze complex pedestrian behaviors, offering actionable insights for urban planners and policymakers aiming to enhance pedestrian safety.
Submitted 14 February, 2025;
originally announced March 2025.
-
Basic Category Usage in Vision Language Models
Authors:
Hunter Sawyer,
Jesse Roberts,
Kyle Moore
Abstract:
The field of psychology has long recognized a basic level of categorization that humans use when labeling visual stimuli, a term coined by Rosch in 1976. This level of categorization has been found to be used most frequently, to have higher information density, and to aid in visual language tasks with priming in humans. Here, we investigate basic level categorization in two recently released, open-source vision-language models (VLMs). This paper demonstrates that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both prefer basic level categorization consistent with human behavior. Moreover, the models' preferences are consistent with nuanced human behaviors like the biological versus non-biological basic level effects and the well established expert basic level shift, further suggesting that VLMs acquire cognitive categorization behaviors from the human data on which they are trained.
Submitted 16 March, 2025;
originally announced March 2025.
-
Investigating Human-Aligned Large Language Model Uncertainty
Authors:
Kyle Moore,
Jesse Roberts,
Daryl Watson,
Pamela Wisniewski
Abstract:
Recent work has sought to quantify large language model uncertainty to facilitate model control and modulate user trust. Previous works focus on measures of uncertainty that are theoretically grounded or reflect the average overt behavior of the model. In this work, we investigate a variety of uncertainty measures in order to identify measures that correlate with human group-level uncertainty. We find that Bayesian measures and a variation on entropy measures, top-k entropy, tend to agree with human behavior as a function of model size. We find that some strong measures decrease in human-similarity with model size, but, by multiple linear regression, we find that combining multiple uncertainty measures provides comparable human-alignment with reduced size-dependency.
Submitted 16 March, 2025;
originally announced March 2025.
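The abstract above names top-k entropy without defining it. As a rough illustration only, the sketch below assumes one common reading: the Shannon entropy of the k most probable next-token probabilities after renormalizing them to sum to 1. The function name `top_k_entropy` and this exact definition are assumptions for illustration and may differ from the measure used in the paper.

```python
import math


def top_k_entropy(probs, k):
    """Entropy (in nats) of the k largest probabilities, renormalized.

    Assumed definition for illustration; not taken from the paper.
    """
    # Keep only the k most probable outcomes.
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    # Renormalize so the kept probabilities sum to 1, then apply
    # the Shannon entropy formula, skipping zero-probability terms.
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


# Two equally likely candidates: entropy equals log(2).
uniform_pair = top_k_entropy([0.5, 0.5, 0.0], k=2)
# A single kept candidate carries no uncertainty.
single = top_k_entropy([0.7, 0.1, 0.1, 0.1], k=1)
```

Restricting to the top k outcomes keeps the long tail of very unlikely tokens from dominating the value, which is one plausible motivation for such a variant.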
-
Auto-Evaluation: A Critical Measure in Driving Improvements in Quality and Safety of AI-Generated Lesson Resources
Authors:
Hannah-Beth Clark,
Margaux Dowland,
Laura Benton,
Reka Budai,
Ibrahim Kaan Keskin,
Emma Searle,
Matthew Gregory,
Mark Hodierne,
William Gayne,
John Roberts
Abstract:
As a publicly funded body in the UK, Oak National Academy is in a unique position to innovate within this field as we have a comprehensive curriculum of approximately 13,000 open education resources (OER) for all National Curriculum subjects, designed and quality-assured by expert, human teachers. This has provided the corpus of content needed for building a high-quality AI-powered lesson planning tool, Aila, that is free to use and, therefore, accessible to all teachers across the country. Furthermore, using our evidence-informed curriculum principles, we have codified and exemplified each component of lesson design. To assess the quality of lessons produced by Aila at scale, we have developed an AI-powered auto-evaluation agent, facilitating informed improvements to enhance output quality. Through comparisons between human and auto-evaluations, we have begun to refine this agent further to increase its accuracy, measured by its alignment with an expert human evaluator. In this paper we present this iterative evaluation process through an illustrative case study focused on one quality benchmark - the level of challenge within multiple-choice quizzes. We also explore the contribution that this may make to similar projects and the wider sector.
Submitted 23 January, 2025;
originally announced February 2025.
-
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
Authors:
Jonathan Roberts,
Mohammad Reza Taesiri,
Ansh Sharma,
Akash Gupta,
Samuel Roberts,
Ioana Croitoru,
Simion-Vlad Bogolin,
Jialu Tang,
Florian Langer,
Vyas Raina,
Vatsal Raina,
Hanyi Xiong,
Vishaal Udandarao,
Jingyi Lu,
Shiyang Chen,
Sam Purkis,
Tianshuo Yan,
Wenye Lin,
Gyungin Shin,
Qiaochu Yang,
Anh Totti Nguyen,
David I. Atkinson,
Aaditya Baranwal,
Alexandru Coca,
Mikah Dang
, et al. (9 additional authors not shown)
Abstract:
Large Multimodal Models (LMMs) exhibit major shortfalls when interpreting images and, by some measures, have poorer spatial cognition than small children or animals. Despite this, they attain high scores on many popular visual benchmarks, with headroom rapidly eroded by an ongoing surge of model progress. To address this, there is a pressing need for difficult benchmarks that remain relevant for longer. We take this idea to its limit by introducing ZeroBench, a lightweight visual reasoning benchmark that is entirely impossible for contemporary frontier LMMs. Our benchmark consists of 100 manually curated questions and 334 less difficult subquestions. We evaluate 20 LMMs on ZeroBench, all of which score 0.0%, and rigorously analyse the errors. To encourage progress in visual understanding, we publicly release ZeroBench.
Submitted 6 March, 2025; v1 submitted 13 February, 2025;
originally announced February 2025.
-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes
, et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
RL + Transformer = A General-Purpose Problem Solver
Authors:
Micah Rentschler,
Jesse Roberts
Abstract:
What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., meta-learn)? In this study, we demonstrate that a pre-trained transformer fine-tuned with reinforcement learning over multiple episodes develops the ability to solve problems that it has never encountered before - an emergent ability called In-Context Reinforcement Learning (ICRL). This powerful meta-learner not only excels in solving unseen in-distribution environments with remarkable sample efficiency, but also shows strong performance in out-of-distribution environments. In addition, we show that it exhibits robustness to the quality of its training data, seamlessly stitches together behaviors from its context, and adapts to non-stationary environments. These behaviors demonstrate that an RL-trained transformer can iteratively improve upon its own solutions, making it an excellent general-purpose problem solver.
Submitted 23 January, 2025;
originally announced January 2025.
-
Leveraging Large Language Models and Machine Learning for Smart Contract Vulnerability Detection
Authors:
S M Mostaq Hossain,
Amani Altarawneh,
Jesse Roberts
Abstract:
As blockchain technology and smart contracts become widely adopted, securing them throughout every stage of the transaction process is essential. To improve smart contract security, this work detects vulnerabilities using classical Machine Learning (ML) models and fine-tuned Large Language Models (LLMs). The robustness of such work rests on a labeled smart contract dataset that includes annotated vulnerabilities, on which several LLMs, alongside various traditional machine learning algorithms such as the DistilBERT model, are trained and tested. We train and test machine learning algorithms to classify smart contract code according to vulnerability type in order to compare model performance. Fine-tuning the LLMs specifically for smart contract code classification should help in getting better results when detecting several types of well-known vulnerabilities, such as Reentrancy, Integer Overflow, Timestamp Dependency and Dangerous Delegatecall. Our initial experimental results show that our fine-tuned LLM surpasses the accuracy of any other model, achieving an accuracy of over 90%, and this advances the existing vulnerability detection benchmarks. Such performance provides a great deal of evidence for LLMs' ability to describe the subtle patterns in the code that traditional ML models could miss. Thus, we compared each of the ML and LLM models to give a good overview of each model's strengths, from which we can choose the most effective one for real-world applications in smart contract security. Our research combines machine learning and large language models to provide a rich and interpretable framework for detecting different smart contract vulnerabilities, which lays a foundation for a more secure blockchain ecosystem.
Submitted 4 January, 2025;
originally announced January 2025.
-
Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities
Authors:
Lawrence Wang,
Stephen J. Roberts
Abstract:
Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian - often referred to as the sharpness - is below a critical learning-rate threshold, then training is 'stable' and training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight lies in noting that, during these instabilities, the orientation of the Hessian eigenvectors rotates. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a consequence of network depth, and we prove that for any network with depth > 1, unstable growth in parameters causes rotations in the principal components of the Hessian, which promote exploration of the parameter space away from unstable directions. Our empirical studies reveal an implicit regularization effect in gradient descent with large learning rates operating beyond the stability threshold. We find these lead to excellent generalization performance on modern benchmark datasets.
Submitted 23 December, 2024;
originally announced December 2024.
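For context, the "critical learning-rate threshold" mentioned in the abstract above is commonly stated as a sharpness of 2/η for learning rate η. The short derivation below is a sketch under the simplifying assumption of a quadratic loss with a symmetric positive-definite Hessian; it is not taken from the paper, and the symbols θ, η, H and c_i are notation chosen here for illustration.

```latex
% Illustrative assumption: quadratic loss with symmetric positive-definite Hessian H.
L(\theta) = \tfrac{1}{2}\,\theta^{\top} H \theta, \qquad
\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) = (I - \eta H)\,\theta_t .

% Along the eigenvector of H with eigenvalue \lambda_i, the coefficient c_i evolves as
c_{i,t+1} = (1 - \eta \lambda_i)\, c_{i,t},

% which contracts if and only if
|1 - \eta \lambda_i| < 1 \iff 0 < \eta \lambda_i < 2 .

% Requiring this for every eigenvalue gives the stability condition on the sharpness:
\lambda_{\max}(H) < \frac{2}{\eta} .
```

Under this assumption the loss is a positively weighted sum of the squared coefficients, so it decreases monotonically exactly when the condition holds; once the sharpness exceeds 2/η, the component along the top eigenvector grows at each step, which corresponds to the unstable regime the abstract refers to.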
-
GAMEBoT: Transparent Assessment of LLM Reasoning in Games
Authors:
Wenye Lin,
Jonathan Roberts,
Yunhan Yang,
Samuel Albanie,
Zongqing Lu,
Kai Han
Abstract:
Large Language Models (LLMs) are increasingly deployed in real-world applications that demand complex reasoning. To track progress, robust benchmarks are required to evaluate their capabilities beyond superficial pattern recognition. However, current LLM reasoning benchmarks often face challenges such as insufficient interpretability, performance saturation or data contamination. To address these challenges, we introduce GAMEBoT, a gaming arena designed for rigorous and transparent assessment of LLM reasoning capabilities. GAMEBoT decomposes complex reasoning in games into predefined modular subproblems. This decomposition allows us to design a suite of Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs in addressing these subproblems before action selection. Furthermore, we develop a suite of rule-based algorithms to generate ground truth for these subproblems, enabling rigorous validation of the LLMs' intermediate reasoning steps. This approach facilitates evaluation of both the quality of final actions and the accuracy of the underlying reasoning process. GAMEBoT also naturally alleviates the risk of data contamination through dynamic games and head-to-head LLM competitions. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts. Project page: https://visual-ai.github.io/gamebot
Submitted 1 June, 2025; v1 submitted 18 December, 2024;
originally announced December 2024.
-
Towards Fairness in AI for Melanoma Detection: Systemic Review and Recommendations
Authors:
Laura N Montoya,
Jennafer Shae Roberts,
Belen Sanchez Hidalgo
Abstract:
Early and accurate melanoma detection is crucial for improving patient outcomes. Recent advancements in artificial intelligence (AI) have shown promise in this area, but the technology's effectiveness across diverse skin tones remains a critical challenge. This study conducts a systematic review and preliminary analysis of AI-based melanoma detection research published between 2013 and 2024, focusing on deep learning methodologies, datasets, and skin tone representation. Our findings indicate that while AI can enhance melanoma detection, there is a significant bias towards lighter skin tones. To address this, we propose including skin hue in addition to skin tone as represented by the L'Oréal Color Chart Map for a more comprehensive skin tone assessment technique. This research highlights the need for diverse datasets and robust evaluation metrics to develop AI models that are equitable and effective for all patients. By adopting best practices outlined in a PRISMA Equity framework tailored for healthcare and melanoma detection, we can work towards reducing disparities in melanoma outcomes.
Submitted 19 November, 2024;
originally announced November 2024.
-
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
Authors:
Jonathan Roberts,
Kai Han,
Samuel Albanie
Abstract:
As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared -- they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.
Submitted 23 April, 2025; v1 submitted 7 November, 2024;
originally announced November 2024.
-
Image2Struct: Benchmarking Structure Extraction for Vision-Language Models
Authors:
Josselin Somerville Roberts,
Tony Lee,
Chi Heem Wong,
Michihiro Yasunaga,
Yifan Mai,
Percy Liang
Abstract:
We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at https://crfm.stanford.edu/helm/image2struct/v1.0.1/.
Submitted 29 October, 2024;
originally announced October 2024.
-
VHELM: A Holistic Evaluation of Vision Language Models
Authors:
Tony Lee,
Haoqin Tu,
Chi Heem Wong,
Wenhao Zheng,
Yiyang Zhou,
Yifan Mai,
Josselin Somerville Roberts,
Michihiro Yasunaga,
Huaxiu Yao,
Cihang Xie,
Percy Liang
Abstract:
Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors. In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable fair comparisons across models. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full models (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark but not when evaluated on the other aspects. For transparency, we release the raw model generations and complete results on our website (https://crfm.stanford.edu/helm/vhelm/v2.0.1). VHELM is intended to be a living benchmark, and we hope to continue adding new datasets and models over time.
Submitted 24 October, 2024; v1 submitted 9 October, 2024;
originally announced October 2024.
-
Design Contradictions: Help or Hindrance?
Authors:
Aron E. Owen,
Jonathan C. Roberts
Abstract:
The need for innovative ideas in data visualisation drives us to explore new creative approaches. Combining two or more creative words, particularly those that contradict each other, can positively impact the creative process, sparking novel ideas and designs. As we move towards AI-driven design, an open question arises: do these design contradictions work positively with AI tools? Currently, the answer is no. AI systems, like large language models (LLMs), rely on algorithms that engender similarity, whereas creativity often requires divergence and novelty. This poster initiates a conversation on how to drive AI systems to be more creative and generate new ideas. This research invites us to reconsider traditional design methods and explore new approaches in an AI-driven world. Can we apply the same techniques used in traditional design, like the double diamond model, or do we need new methods for design engineering? How can we quickly design visualisations and craft new ideas with generative AI? This paper seeks to start this critical conversation and offers practical insights into the potential of AI in driving creativity in data visualisation.
Submitted 4 September, 2024;
originally announced September 2024.
-
Towards Metrics for Evaluating Creativity in Visualisation Design
Authors:
Aron E Owen,
Jonathan C Roberts
Abstract:
Creativity in visualisation design is essential for designers and data scientists who need to present data in innovative ways. It is often achieved through sketching or drafting low-fidelity prototypes. However, judging this innovation is often difficult. A creative visualisation test would offer a structured approach to enhancing visual thinking and design skills, which are vital across many fields. Such a test can facilitate objective evaluation, skill identification, benchmarking, fostering innovation, and improving learning outcomes. In developing such a test, we propose focusing on four criteria: Quantity, Correctness, Novelty, and Feasibility. These criteria integrate into a test that is easy to administer. We name it the Rowen Test of Creativity in Visualisation Design. We introduce the test, its scoring system, and results from its use with eight visualisation experts.
Submitted 3 September, 2024;
originally announced September 2024.
-
Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis
Authors:
Ray Umphrey,
Jesse Roberts,
Lindsey Roberts
Abstract:
This study explores the potential of large language models (LLMs) for identifying and examining intertextual relationships within biblical Koine Greek texts. By evaluating the performance of LLMs on various intertextuality scenarios, the study demonstrates that these models can detect direct quotations, allusions, and echoes between texts. The LLM's ability to generate novel intertextual observations and connections highlights its potential to uncover new insights. However, the model also struggles with long query passages and the inclusion of false intertextual dependencies, emphasizing the importance of expert evaluation. The expert-in-the-loop methodology presented offers a scalable approach for intertextual research into the complex web of intertextuality within and beyond the biblical corpus.
Submitted 29 September, 2024; v1 submitted 3 September, 2024;
originally announced September 2024.
-
Towards a Generative AI Design Dialogue
Authors:
Aron E. Owen,
Jonathan C. Roberts
Abstract:
Traditional visualisation designers often start with sketches before implementation. With generative AI, these sketches can be turned into AI-generated visualisations using specific prompts. However, guiding AI to create compelling visuals can be challenging. We propose a new design process where designers verbalise their thoughts during work, later converting these narratives into AI prompts. This approach helps AI generate accurate visuals and assists designers in refining their concepts, enhancing the overall design process. Blending human creativity with AI capabilities enables rapid iteration, leading to higher quality and more innovative visualisations, making design more accessible and efficient.
Submitted 19 August, 2024;
originally announced September 2024.
-
Fostering Creative Visualisation Skills Through Data-Art Exhibitions
Authors:
Jonathan C. Roberts
Abstract:
Data-art exhibitions offer a unique and real-world setting to foster creative visualisation skills among students. They serve as a real-world platform for students to display their work, bridging the gap between classroom learning and professional practice. Students must develop a technical solution, grasp the context, and produce work that is appropriate for public presentation. This scenario helps to encourage innovative thinking and engagement with the topic, and to enhance technical proficiency. We present our implementation of a data-art exhibition within a computing curriculum for third-year degree-level students. Students create art-based visualisations from selected datasets and present their work in a public exhibition. We have used this initiative over the course of two academic years with different cohorts, and reflect on its impact on student learning and creativity.
Submitted 29 August, 2024;
originally announced August 2024.
-
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Authors:
Jonathan Roberts,
Kai Han,
Samuel Albanie
Abstract:
Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area in which LMMs show potential is graph analysis, specifically the tasks an analyst might typically perform when interpreting figures, such as estimating the mean, intercepts or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB comprises 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.
Submitted 29 August, 2024; v1 submitted 21 August, 2024;
originally announced August 2024.
-
Visual Storytelling: A Methodological Approach to Designing and Implementing a Visualisation Poster
Authors:
Rhiannon Owen,
Jonathan Roberts
Abstract:
We present a design study of developing a visualisation poster. Posters can be difficult to create, and the story on a poster is not always clear. Using a case-study approach we propose three important aspects: the poster should have a clear focus (especially a hero visualisation), envisioning its use helps to drive the important aspects, and third the essence (its fundamental concept and guiding idea) must be clear. We will use case studies that have focused on the use of the Five Design-Sheet method (FdS) as a way to sketch and plan a visualisation, before successfully implementing and creating the visual poster. The case studies serve as a practical illustration of the workflow, offering a means to explain the three key processes involved: (1) comprehending the data, (2) employing a design study with the FdS (Five Design-Sheet), (3) crafting, evaluating and refining the visualisation.
Submitted 19 August, 2024;
originally announced August 2024.
-
Reasoning Beyond Bias: A Study on Counterfactual Prompting and Chain of Thought Reasoning
Authors:
Kyle Moore,
Jesse Roberts,
Thao Pham,
Douglas Fisher
Abstract:
Language models are known to absorb biases from their training data, leading to predictions driven by statistical regularities rather than semantic relevance. We investigate the impact of these biases on answer choice preferences in the Massive Multi-Task Language Understanding (MMLU) task. Our findings reveal that differences in learned regularities across answer options are predictive of model preferences and mirror human test-taking strategies. To address this issue, we introduce two novel methods: Counterfactual Prompting with Chain of Thought (CoT) and Counterfactual Prompting with Agnostically Primed CoT (APriCoT). We demonstrate that while Counterfactual Prompting with CoT alone is insufficient to mitigate bias, our novel Primed Counterfactual Prompting with CoT approach effectively reduces the influence of base-rate probabilities while improving overall accuracy. Our results suggest that mitigating bias requires a "System-2" like process and that CoT reasoning is susceptible to confirmation bias under some prompting methodologies. Our contributions offer practical solutions for developing more robust and fair language models.
Submitted 5 September, 2024; v1 submitted 16 August, 2024;
originally announced August 2024.
-
Creating Data Art: Authentic Learning and Visualisation Exhibition
Authors:
Jonathan C. Roberts
Abstract:
We present an authentic learning task designed for computing students, centred on the creation of data-art visualisations from chosen datasets for a public exhibition. This exhibition was showcased in the cinema foyer for two weeks in June, providing a real-world platform for students to display their work. Over the course of two years, we implemented this active learning task with two different cohorts of students. In this paper, we share our experiences and insights from these activities, highlighting the impact on student engagement and learning outcomes. We also provide a detailed description of the seven individual tasks that learners must perform: topic and data selection and analysis, research and art inspiration, design conceptualisation, proposed solution, visualisation creation, exhibition curation, and reflection. By integrating these tasks, students not only develop technical skills but also gain practical experience in presenting their work to a public audience, bridging the gap between academic learning and professional practice.
Submitted 14 August, 2024;
originally announced August 2024.
-
Engaging Data-Art: Conducting a Public Hands-On Workshop
Authors:
Jonathan C. Roberts
Abstract:
Data-art blends visualisation, data science, and artistic expression. It allows people to transform information and data into exciting and interesting visual narratives. Hosting a public data-art hands-on workshop enables participants to engage with data and learn fundamental visualisation techniques. However, being a public event, it presents a range of challenges. We outline our approach to organising and conducting a public workshop that caters to a wide age range, from children to adults. We divide the tutorial into three sections, focusing on data, sketching skills and visualisation. We place emphasis on public engagement and ensure that participants have fun while learning new skills.
Submitted 8 August, 2024;
originally announced August 2024.
-
Path-based Design Model for Constructing and Exploring Alternative Visualisations
Authors:
James Jackson,
Panagiotis D. Ritsos,
Peter W. S. Butcher,
Jonathan C. Roberts
Abstract:
We present a path-based design model and system for designing and creating visualisations. Our model represents a systematic approach to constructing visual representations of data or concepts following a predefined sequence of steps. The initial step involves outlining the overall appearance of the visualisation by creating a skeleton structure, referred to as a flowpath. Subsequently, we specify objects, visual marks, properties, and appearance, storing them in a gene. Lastly, we map data onto the flowpath, ensuring suitable morphisms. Alternative designs are created by exchanging values in the gene. For example, designs that share similar traits are created by making small incremental changes to the gene. Our design methodology fosters the generation of diverse creative concepts, space-filling visualisations, and traditional formats like bar charts, circular plots and pie charts. Through our implementation we showcase the model in action. As an example application, we integrate the output visualisations onto a smartwatch and visualisation dashboards. In this article we (1) introduce, define and explain the path model and discuss possibilities for its use, (2) present our implementation, results, and evaluation, and (3) demonstrate and evaluate an application of its use on a mobile watch.
Submitted 7 August, 2024;
originally announced August 2024.
-
Supporting the Digital Autonomy of Elders Through LLM Assistance
Authors:
Jesse Roberts,
Lindsey Roberts,
Alice Reed
Abstract:
The internet offers tremendous access to services, social connections, and needed products. However, to those without sufficient experience, engaging with businesses and friends across the internet can be daunting due to the ever-present danger of scammers and thieves, to say nothing of the myriad of potential computer viruses. Like a forest rich with both edible and poisonous plants, those familiar with the norms inhabit it safely with ease while newcomers need a guide. However, reliance on a human digital guide can be taxing and often impractical. We propose and pilot a simple but unexplored idea: could an LLM provide the necessary support to help the elderly who are separated by the digital divide safely achieve digital autonomy?
Submitted 22 July, 2024;
originally announced July 2024.
-
Large Language Model Recall Uncertainty is Modulated by the Fan Effect
Authors:
Jesse Roberts,
Kyle Moore,
Thao Pham,
Oseremhen Ewaleifoh,
Doug Fisher
Abstract:
This paper evaluates whether large language models (LLMs) exhibit cognitive fan effects, similar to those discovered by Anderson in humans, after being pre-trained on human textual data. We conduct two sets of in-context recall experiments designed to elicit fan effects. Consistent with human results, we find that LLM recall uncertainty, measured via token probability, is influenced by the fan effect. Our results show that removing uncertainty disrupts the observed effect. The experiments suggest the fan effect is consistent whether the fan value is induced in-context or in the pre-training data. Finally, these findings provide in-silico evidence that fan effects and typicality are expressions of the same phenomenon.
Submitted 29 September, 2024; v1 submitted 8 July, 2024;
originally announced July 2024.
-
Construct accurate multi-continuum micromorphic homogenisations in multi-D space-time with computer algebra
Authors:
A. J. Roberts
Abstract:
Homogenisation empowers the efficient macroscale system level prediction of physical scenarios with intricate microscale structures. Here we develop an innovative, powerful, rigorous and flexible framework for asymptotic homogenisation of dynamics at the \emph{finite} scale separation of real physics, with proven results underpinned by modern dynamical systems theory. The novel systematic approach removes most of the usual assumptions, whether implicit or explicit, of other methodologies. By no longer assuming averages, the methodology constructs so-called multi-continuum or micromorphic homogenisations systematically informed by the microscale physics. The developed framework and approach enable a user to straightforwardly choose and create such homogenisations with clear physical and theoretical support, and of highly controllable accuracy and fidelity.
Submitted 6 April, 2025; v1 submitted 3 July, 2024;
originally announced July 2024.
-
The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance
Authors:
Kyle Moore,
Jesse Roberts,
Thao Pham,
Oseremhen Ewaleifoh,
Doug Fisher
Abstract:
Cloze testing is a common method for measuring the behavior of large language models on a number of benchmark tasks. Using the MMLU dataset, we show that the base-rate probability (BRP) differences across answer tokens are significant and affect task performance, i.e., guess A if uncertain. We find that counterfactual prompting does sufficiently mitigate the BRP effect. The BRP effect is found to have a similar effect to test-taking strategies employed by humans, leading to the conflation of task performance and test-taking ability. We propose the Nvr-X-MMLU task, a variation of MMLU, which helps to disambiguate test-taking ability from task performance and reports the latter.
Submitted 30 September, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
AstroPT: Scaling Large Observation Models for Astronomy
Authors:
Michael J. Smith,
Ryan J. Roberts,
Eirini Angeloudi,
Marc Huertas-Company
Abstract:
This work presents AstroPT, an autoregressive pretrained transformer developed with astronomical use-cases in mind. The AstroPT models presented here have been pretrained on 8.6 million $512 \times 512$ pixel $grz$-band galaxy postage stamp observations from the DESI Legacy Survey DR8. We train a selection of foundation models of increasing size from 1 million to 2.1 billion parameters, and find that AstroPT follows a similar saturating log-log scaling law to textual models. We also find that the models' performance on downstream tasks, as measured by linear probing, improves with model size up to the model parameter saturation point. We believe that collaborative community development paves the best route towards realising an open source 'Large Observation Model' -- a model trained on data taken from the observational sciences at the scale seen in natural language processing. To this end, we release the source code, weights, and dataset for AstroPT under the MIT license, and invite potential collaborators to join us in collectively building and researching these models.
Submitted 23 May, 2024;
originally announced May 2024.
-
SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation
Authors:
Jonathan Roberts,
Kai Han,
Neil Houlsby,
Samuel Albanie
Abstract:
Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we present SciFIBench, a scientific figure interpretation benchmark consisting of 2000 questions split between two tasks across 8 categories. The questions are curated from arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control. We evaluate 28 LMMs on SciFIBench, finding it to be a challenging benchmark. Finally, we investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets from our benchmark. We release SciFIBench to encourage progress in this domain.
Submitted 5 December, 2024; v1 submitted 14 May, 2024;
originally announced May 2024.
-
Do Large Language Models Learn Human-Like Strategic Preferences?
Authors:
Jesse Roberts,
Kyle Moore,
Doug Fisher
Abstract:
In this paper, we evaluate whether LLMs learn to make human-like preference judgements in strategic scenarios as compared with known empirical results. Solar and Mistral are shown to exhibit stable value-based preference consistent with humans and exhibit human-like preference for cooperation in the prisoner's dilemma (including stake-size effect) and traveler's dilemma (including penalty-size effect). We establish a relationship between model size, value-based preference, and superficiality. Finally, results here show that models tending to be less brittle have relied on sliding window attention, suggesting a potential link. Additionally, we contribute a novel method for constructing preference relations from arbitrary LLMs and support for a hypothesis regarding human behavior in the traveler's dilemma.
Submitted 2 October, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
Feature-Action Design Patterns for Storytelling Visualizations with Time Series Data
Authors:
Saiful Khan,
Scott Jones,
Benjamin Bach,
Jaehoon Cha,
Min Chen,
Julie Meikle,
Jonathan C Roberts,
Jeyan Thiyagalingam,
Jo Wood,
Panagiotis D. Ritsos
Abstract:
We present a method to create storytelling visualization with time series data. Many personal decisions nowadays rely on access to dynamic data regularly, as we have seen during the COVID-19 pandemic. It is thus desirable to construct storytelling visualization for dynamic data that is selected by an individual for a specific context. Because of the need to tell data-dependent stories, predefined storyboards based on known data cannot accommodate dynamic data easily nor scale up to many different individuals and contexts. Motivated initially by the need to communicate time series data during the COVID-19 pandemic, we developed a novel computer-assisted method for meta-authoring of stories, which enables the design of storyboards that include feature-action patterns in anticipation of potential features that may appear in dynamically arrived or selected data. In addition to meta-storyboards involving COVID-19 data, we also present storyboards for telling stories about progress in a machine learning workflow. Our approach is complementary to traditional methods for authoring storytelling visualization, and provides an efficient means to construct data-dependent storyboards for different data-streams of similar contexts.
Submitted 5 February, 2024;
originally announced February 2024.
-
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
Authors:
Jonathan Roberts,
Timo Lüddecke,
Rehan Sheikh,
Kai Han,
Samuel Albanie
Abstract:
Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks but their knowledge and abilities in the geographic and geospatial domains are yet to be explored, despite potential wide-ranging benefits to navigation, environmental research, urban development, and disaster response. We conduct a series of experiments exploring various vision capabilities of MLLMs within these domains, particularly focusing on the frontier model GPT-4V, and benchmark its performance against open-source counterparts. Our methodology involves challenging these models with a small-scale geographic benchmark consisting of a suite of visual tasks, testing their abilities across a spectrum of complexity. The analysis uncovers not only where such models excel, including instances where they outperform humans, but also where they falter, providing a balanced view of their capabilities in the geographic domain. To enable the comparison and evaluation of future models, our benchmark will be publicly released.
Submitted 16 January, 2024; v1 submitted 24 November, 2023;
originally announced November 2023.
-
Rock Climbing Route Generation and Grading as Computational Creativity
Authors:
Jesse Roberts
Abstract:
In this paper, we bridge work in rock climbing route generation and grading into the computational creativity community. We provide the necessary background to situate that literature and demonstrate the domain's intellectual merit in the computational creativity community. We provide a guiding set of desiderata for future work in this area. We propose an approach to computational route grading. Finally, we identify important gaps in the literature and consider how they may be filled. This paper thus also serves as a pilot study, planting a flag for our ongoing research in this domain.
Submitted 3 November, 2023;
originally announced November 2023.
-
Few-Shot Learning Patterns in Financial Time-Series for Trend-Following Strategies
Authors:
Kieran Wood,
Samuel Kessler,
Stephen J. Roberts,
Stefan Zohren
Abstract:
Forecasting models for systematic trading strategies do not adapt quickly when financial market conditions rapidly change, as was seen in the advent of the COVID-19 pandemic in 2020, causing many forecasting models to take loss-making positions. To deal with such situations, we propose a novel time-series trend-following forecaster that can quickly adapt to new market conditions, referred to as regimes. We leverage recent developments from the deep learning community and use few-shot learning. We propose the Cross Attentive Time-Series Trend Network -- X-Trend -- which takes positions attending over a context set of financial time-series regimes. X-Trend transfers trends from similar patterns in the context set to make forecasts, then subsequently takes positions for a new distinct target regime. By quickly adapting to new financial regimes, X-Trend increases Sharpe ratio by 18.9% over a neural forecaster and 10-fold over a conventional Time-series Momentum strategy during the turbulent market period from 2018 to 2023. Our strategy recovers twice as quickly from the COVID-19 drawdown compared to the neural forecaster. X-Trend can also take zero-shot positions on novel unseen financial assets obtaining a 5-fold Sharpe ratio increase versus a neural time-series trend forecaster over the same period. Furthermore, the cross-attention mechanism allows us to interpret the relationship between forecasts and patterns in the context set.
Submitted 28 March, 2024; v1 submitted 16 October, 2023;
originally announced October 2023.
-
In Consideration of Indigenous Data Sovereignty: Data Mining as a Colonial Practice
Authors:
Jennafer Shae Roberts,
Laura N Montoya
Abstract:
Data mining reproduces colonialism, and Indigenous voices are being left out of the development of technology that relies on data, such as artificial intelligence. This research stresses the need for the inclusion of Indigenous Data Sovereignty and centers on the importance of Indigenous rights over their own data. Inclusion is necessary in order to integrate Indigenous knowledge into the design, development, and implementation of data-reliant technology. To support this hypothesis and address the problem, the CARE Principles for Indigenous Data Governance (Collective Benefit, Authority to Control, Responsibility, and Ethics) are applied. We cover how the colonial practices of data mining do not align with Indigenous convictions. The included case studies highlight connections to Indigenous rights in relation to the protection of data and environmental ecosystems, thus establishing how data governance can serve both the people and the Earth. By applying the CARE Principles to the issues that arise from data mining and neocolonialism, our goal is to provide a framework that can be used in technological development. The theory is that this could reflect outwards to promote data sovereignty generally and create new relationships between people and data that are ethical as opposed to driven by speed and profit.
Submitted 18 September, 2023;
originally announced September 2023.
-
Projected Task-Specific Layers for Multi-Task Reinforcement Learning
Authors:
Josselin Somerville Roberts,
Julia Di
Abstract:
Multi-task reinforcement learning could enable robots to scale across a wide variety of manipulation tasks in homes and workplaces. However, generalizing from one task to another and mitigating negative task interference still remains a challenge. Addressing this challenge by successfully sharing information across tasks will depend on how well the structure underlying the tasks is captured. In this work, we introduce our new architecture, Projected Task-Specific Layers (PTSL), that leverages a common policy with dense task-specific corrections through task-specific layers to better express shared and variable task information. We then show that our model outperforms the state of the art on the MT10 and MT50 benchmarks of Meta-World consisting of 10 and 50 goal-conditioned tasks for a Sawyer arm.
Submitted 6 March, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
A Skeleton-based Approach For Rock Crack Detection Towards A Climbing Robot Application
Authors:
Josselin Somerville Roberts,
Paul-Emile Giacomelli,
Yoni Gozlan,
Julia Di
Abstract:
Conventional wheeled robots are unable to traverse scientifically interesting, but dangerous, cave environments. Multi-limbed climbing robot designs, such as ReachBot, are able to grasp irregular surface features and execute climbing motions to overcome obstacles, given suitable grasp locations. To support grasp site identification, we present a method for detecting rock cracks and edges, the SKeleton Intersection Loss (SKIL). SKIL is a loss designed for thin object segmentation that leverages the skeleton of the label. A dataset of rock face images was collected, manually annotated, and augmented with generated data. A new group of metrics, LineAcc, has been proposed for thin object segmentation such that the impact of the object width on the score is minimized. In addition, the metric is less sensitive to translation which can often lead to a score of zero when computing classical metrics such as Dice on thin objects. Our fine-tuned models outperform previous methods on similar thin object segmentation tasks such as blood vessel segmentation and show promise for integration onto a robotic system.
Submitted 6 November, 2023; v1 submitted 10 September, 2023;
originally announced September 2023.
-
Efficient computational homogenisation of 2D beams of heterogeneous elasticity using the patch scheme
Authors:
Thien Tran-Duc,
J. E. Bunder,
A. J. Roberts
Abstract:
Modern 'smart' materials have complex heterogeneous microscale structure, often with unknown macroscale closure but one we need to realise for large scale engineering and science. The multiscale Equation-Free Patch Scheme empowers us to non-intrusively, efficiently, and accurately predict the large scale, system level, solutions through computations on only small sparse patches of the given detailed microscale system. Here the microscale system is that of a 2D beam of heterogeneous elasticity, with either fixed-fixed, fixed-free, or periodic boundary conditions. We demonstrate that the described multiscale Patch Scheme simply, efficiently, and stably predicts the beam's macroscale, with a controllable accuracy, at finite scale separation. Dynamical systems theory supports the scheme. This article points the way for others to use this systematic non-intrusive approach, via a developing toolbox of functions, to model and compute accurately macroscale system-levels of general complex physical and engineering systems.
Submitted 17 August, 2023;
originally announced August 2023.
-
Using Artificial Populations to Study Psychological Phenomena in Neural Models
Authors:
Jesse Roberts,
Kyle Moore,
Drew Wilenzick,
Doug Fisher
Abstract:
The recent proliferation of research into transformer-based natural language processing has led to a number of studies which attempt to detect the presence of human-like cognitive behavior in the models. We contend that, as is true of human psychology, the investigation of cognitive behavior in language models must be conducted in an appropriate population of an appropriate size for the results to be meaningful. We leverage work in uncertainty estimation in a novel approach to efficiently construct experimental populations. The resultant tool, PopulationLM, has been made open source. We provide theoretical grounding in the uncertainty estimation literature and motivation from current cognitive work regarding language models. We discuss the methodological lessons from other scientific communities and attempt to demonstrate their application to two artificial population studies. Through population-based experimentation we find that language models exhibit behavior consistent with typicality effects among categories highly represented in training. However, we find that language models don't tend to exhibit structural priming effects. Generally, our results show that single models tend to overestimate the presence of cognitive behaviors in neural models.
Submitted 15 August, 2023;
originally announced August 2023.
-
Challenges and Opportunities in Data Visualization Education: A Call to Action
Authors:
Benjamin Bach,
Mandy Keck,
Fateme Rajabiyazdi,
Tatiana Losev,
Isabel Meirelles,
Jason Dykes,
Robert S. Laramee,
Mashael AlKadi,
Christina Stoiber,
Samuel Huron,
Charles Perin,
Luiz Morais,
Wolfgang Aigner,
Doris Kosminsky,
Magdalena Boucher,
Søren Knudsen,
Areti Manataki,
Jan Aerts,
Uta Hinrichs,
Jonathan C. Roberts,
Sheelagh Carpendale
Abstract:
This paper is a call to action for research and discussion on data visualization education. As visualization evolves and spreads through our professional and personal lives, we need to understand how to support and empower a broad and diverse community of learners in visualization. Data Visualization is a diverse and dynamic discipline that combines knowledge from different fields, is tailored to suit diverse audiences and contexts, and frequently incorporates tacit knowledge. This complex nature leads to a series of interrelated challenges for data visualization education. Driven by a lack of consolidated knowledge, overview, and orientation for visualization education, the 21 authors of this paper, educators and researchers in data visualization, identify and describe 19 challenges informed by our collective practical experience. We organize these challenges around seven themes: People, Goals & Assessment, Environment, Motivation, Methods, Materials, and Change. Across these themes, we formulate 43 research questions to address these challenges. As part of our call to action, we then conclude with 5 cross-cutting opportunities and respective action items: embrace DIVERSITY+INCLUSION, build COMMUNITIES, conduct RESEARCH, act AGILE, and relish RESPONSIBILITY. We aim to inspire researchers, educators and learners to drive visualization education forward and discuss why, how, who and where we educate, as we learn to use visualization to address challenges across many scales and many domains in a rapidly changing world: viseducationchallenges.github.io.
Submitted 15 August, 2023;
originally announced August 2023.
-
DLSIA: Deep Learning for Scientific Image Analysis
Authors:
Eric J Roberts,
Tanny Chavez,
Alexander Hexemer,
Petrus H. Zwart
Abstract:
We introduce DLSIA (Deep Learning for Scientific Image Analysis), a Python-based machine learning library that empowers scientists and researchers across diverse scientific domains with a range of customizable convolutional neural network (CNN) architectures for a wide variety of tasks in image analysis to be used in downstream data processing, or for experiment-in-the-loop computing scenarios. DLSIA features easy-to-use architectures such as autoencoders, tunable U-Nets, and parameter-lean mixed-scale dense networks (MSDNets). Additionally, we introduce sparse mixed-scale networks (SMSNets), generated using random graphs and sparse connections. As experimental data continues to grow in scale and complexity, DLSIA provides accessible CNN construction and abstracts CNN complexities, allowing scientists to tailor their machine learning approaches, accelerate discoveries, foster interdisciplinary collaboration, and advance research in scientific image analysis.
Submitted 26 August, 2023; v1 submitted 2 August, 2023;
originally announced August 2023.
-
The Glamorisation of Unpaid Labour: AI and its Influencers
Authors:
Nana Mgbechikwere Nwachukwu,
Jennafer Shae Roberts,
Laura N Montoya
Abstract:
To harness the true potential of Artificial Intelligence (AI) for societal betterment, we need to move away from prioritising corporate interests which exploit Global South workers in the digital age. The unpaid labour and societal harms which are generated by Digital Value Networks (DVNs) disproportionately affect workers in Africa, Latin America, and India and need to be regulated. In this research, we discuss unethical practices to automate Human Intelligence Tasks (HITs) through gig work platforms and the capitalisation of data collection utilising influencers in social media. These are important areas of study in worker and user data practices, where ethical AI could be impactful. We provide suggestions for a path forward focused on responsible AI development.
Submitted 15 September, 2023; v1 submitted 31 July, 2023;
originally announced August 2023.