-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes, et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
Inspo: Writing with Crowds Alongside AI
Authors:
Chieh-Yang Huang,
Sanjana Gautam,
Shannon McClellan Brooks,
Ya-Fang Lin,
Tiffany Knearem,
Ting-Hao 'Kenneth' Huang
Abstract:
The use of artificial intelligence (AI) to support creative writing has bloomed in recent years. However, it is less well understood how AI compares to on-demand human support. We explored how writers interact with both AI and crowd worker writing assistants in creative writing. We replicated the interface of the prior crowd-writing system, Heteroglossia, and developed Inspo, a text editor allowing users to request suggestions from AI models and crowd workers. In a one-week deployment study involving eight creative writers, we examined how often participants selected crowd workers when fluent AI text generators were also available. Findings showed a consistent decline in crowd worker usage, with participants favoring AI due to its faster responses and more consistent quality. We conclude with suggestions for future systems, recommending designs that account for the unique strengths and weaknesses of human versus AI assistants, strategies to address automation bias, and sociocultural views of writing.
Submitted 19 October, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
Latent Diffusion Model for Medical Image Standardization and Enhancement
Authors:
Md Selim,
Jie Zhang,
Faraneh Fathi,
Michael A. Brooks,
Ge Wang,
Guoqiang Yu,
Jin Chen
Abstract:
Computed tomography (CT) serves as an effective tool for lung cancer screening, diagnosis, treatment, and prognosis, providing a rich source of features to quantify temporal and spatial tumor changes. Nonetheless, the diversity of CT scanners and customized acquisition protocols can introduce significant inconsistencies in texture features, even when assessing the same patient. This variability poses a fundamental challenge for subsequent research that relies on consistent image features. Existing CT image standardization models predominantly utilize GAN-based supervised or semi-supervised learning, but their performance remains limited. We present DiffusionCT, an innovative score-based DDPM model that operates in the latent space to transform disparate non-standard distributions into a standardized form. The architecture comprises a U-Net-based encoder-decoder, augmented by a DDPM model integrated at the bottleneck position. First, the encoder-decoder is trained independently, without embedding DDPM, to capture the latent representation of the input data. Second, the latent DDPM model is trained while keeping the encoder-decoder parameters fixed. Finally, the decoder uses the transformed latent representation to generate a standardized CT image, providing a more consistent basis for downstream analysis. Empirical tests on patient CT images indicate notable improvements in image standardization using DiffusionCT. Additionally, the model significantly reduces image noise in SPAD images, further validating the effectiveness of DiffusionCT for advanced imaging tasks.
Submitted 8 October, 2023;
originally announced October 2023.
-
DiffusionCT: Latent Diffusion Model for CT Image Standardization
Authors:
Md Selim,
Jie Zhang,
Michael A. Brooks,
Ge Wang,
Jin Chen
Abstract:
Computed tomography (CT) is one of the modalities for effective lung cancer screening, diagnosis, treatment, and prognosis. The features extracted from CT images are now used to quantify spatial and temporal variations in tumors. However, CT images obtained from various scanners with customized acquisition protocols may introduce considerable variations in texture features, even for the same patient. This presents a fundamental challenge to downstream studies that require consistent and reliable feature analysis. Existing CT image harmonization models rely on GAN-based supervised or semi-supervised learning, with limited performance. This work addresses the issue of CT image harmonization using a new diffusion-based model, named DiffusionCT, to standardize CT images acquired from different vendors and protocols. DiffusionCT operates in the latent space by mapping a latent non-standard distribution into a standard one. DiffusionCT incorporates an Unet-based encoder-decoder, augmented by a diffusion model integrated into the bottleneck part. The model is designed in two training phases. The encoder-decoder is first trained, without embedding the diffusion model, to learn the latent representation of the input data. The latent diffusion model is then trained in the next training phase while fixing the encoder-decoder. Finally, the decoder synthesizes a standardized image with the transformed latent representation. The experimental results demonstrate a significant improvement in the performance of the standardization task using DiffusionCT.
Submitted 25 March, 2023; v1 submitted 20 January, 2023;
originally announced January 2023.
-
AI-assisted Optimization of the ECCE Tracking System at the Electron Ion Collider
Authors:
C. Fanelli,
Z. Papandreou,
K. Suresh,
J. K. Adkins,
Y. Akiba,
A. Albataineh,
M. Amaryan,
I. C. Arsene,
C. Ayerbe Gayoso,
J. Bae,
X. Bai,
M. D. Baker,
M. Bashkanov,
R. Bellwied,
F. Benmokhtar,
V. Berdnikov,
J. C. Bernauer,
F. Bock,
W. Boeglin,
M. Borysova,
E. Brash,
P. Brindza,
W. J. Briscoe,
M. Brooks,
S. Bueltmann, et al. (258 additional authors not shown)
Abstract:
The Electron-Ion Collider (EIC) is a cutting-edge accelerator facility that will study the nature of the "glue" that binds the building blocks of the visible matter in the universe. The proposed experiment will be realized at Brookhaven National Laboratory in approximately 10 years from now, with detector design and R&D currently ongoing. Notably, EIC is one of the first large-scale facilities to leverage Artificial Intelligence (AI) already starting from the design and R&D phases. The EIC Comprehensive Chromodynamics Experiment (ECCE) is a consortium that proposed a detector design based on a 1.5T solenoid. The EIC detector proposal review concluded that the ECCE design will serve as the reference design for an EIC detector. Herein we describe a comprehensive optimization of the ECCE tracker using AI. The work required a complex parametrization of the simulated detector system. Our approach dealt with an optimization problem in a multidimensional design space driven by multiple objectives that encode the detector performance, while satisfying several mechanical constraints. We describe our strategy and show results obtained for the ECCE tracking system. The AI-assisted design is agnostic to the simulation framework and can be extended to other sub-detectors or to a system of sub-detectors to further optimize the performance of the EIC detector.
Submitted 19 May, 2022; v1 submitted 18 May, 2022;
originally announced May 2022.
-
Networked Restless Multi-Armed Bandits for Mobile Interventions
Authors:
Han-Ching Ou,
Christoph Siebenbrunner,
Jackson Killian,
Meredith B Brooks,
David Kempe,
Yevgeniy Vorobeychik,
Milind Tambe
Abstract:
Motivated by a broad class of mobile intervention problems, we propose and study restless multi-armed bandits (RMABs) with network effects. In our model, arms are partially recharging and connected through a graph, so that pulling one arm also improves the state of neighboring arms, significantly extending the previously studied setting of fully recharging bandits with no network effects. In mobile interventions, network effects may arise due to regular population movements (such as commuting between home and work). We show that network effects in RMABs induce strong reward coupling that is not accounted for by existing solution methods. We propose a new solution approach for networked RMABs, exploiting concavity properties which arise under natural assumptions on the structure of intervention effects. We provide sufficient conditions for optimality of our approach in idealized settings and demonstrate that it empirically outperforms state-of-the-art baselines in three mobile intervention domains using real-world graphs.
Submitted 28 January, 2022;
originally announced January 2022.
-
Retinal Vessel Segmentation Using A New Topological Method
Authors:
Martin Brooks
Abstract:
A novel topological segmentation of retinal images represents blood vessels as connected regions in the continuous image plane, having shape-related analytic and geometric properties. This paper presents topological segmentation results from the DRIVE retinal image database.
Submitted 3 August, 2016;
originally announced August 2016.
-
Persistence Lenses: Segmentation, Simplification, Vectorization, Scale Space and Fractal Analysis of Images
Authors:
Martin Brooks
Abstract:
A persistence lens is a hierarchy of disjoint upper and lower level sets of a continuous luminance image's Reeb graph. The boundary components of a persistence lens's interior components are Jordan curves that serve as a hierarchical segmentation of the image, and may be rendered as vector graphics. A persistence lens determines a varilet basis for the luminance image, in which image simplification is realized by subspace projection. Image scale space, and image fractal analysis, result from applying a scale measure to each basis function.
Submitted 21 June, 2016; v1 submitted 25 April, 2016;
originally announced April 2016.
-
Varilets: Additive Decomposition, Topological Total Variation, and Filtering of Scalar Fields
Authors:
Martin Brooks
Abstract:
Continuous interpolation of real-valued data is characterized by piecewise monotone functions on a compact metric space. Topological total variation of a piecewise monotone function $f: X \to \mathbb{R}$ is a homeomorphism-invariant generalization of 1D total variation. A varilet basis is a collection of piecewise monotone functions $\{g_i \mid i = 1, \dots, n\}$, called varilets, such that every linear combination $\sum a_i g_i$ has topological total variation $\sum |a_i|$. A varilet transform for $f$ is a varilet basis for which $f = \sum \alpha_i g_i$. Filtered versions of $f$ result from altering the coefficients $\alpha_i$.
Submitted 25 April, 2016; v1 submitted 16 March, 2015;
originally announced March 2015.
-
An Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation
Authors:
Daniel Lemire,
Martin Brooks,
Yuhong Yan
Abstract:
Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting a sequence into up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propose a quality metric for this problem using the $l_\infty$ norm, and we present an optimal linear time algorithm based on a novel formalism. Moreover, given a precomputation in time O(n log n) consisting of a labeling of all extrema, we compute any optimal segmentation in constant time. We experimentally compare its performance to two piecewise linear segmentation heuristics (top-down and bottom-up). We show that our algorithm is faster and more accurate. Applications include pattern recognition and qualitative modeling.
Submitted 7 September, 2007;
originally announced September 2007.
-
An Optimal Linear Time Algorithm for Quasi-Monotonic Segmentation
Authors:
Daniel Lemire,
Martin Brooks,
Yuhong Yan
Abstract:
Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting an array into up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propose a quality metric for this problem, present an optimal linear time algorithm based on a novel formalism, and experimentally compare its performance to a linear time top-down regression algorithm. We show that our algorithm is faster and more accurate. Applications include pattern recognition and qualitative modeling.
Submitted 23 February, 2007;
originally announced February 2007.
-
Distributed Control System for the Test Interferometer of the ALMA Project
Authors:
M. Pokorny,
M. Brooks,
B. Glendenning,
G. Harris,
R. Heald,
F. Stauffer,
J. Pisano
Abstract:
The control system (TICS) for the test interferometer being built to support the development of the Atacama Large Millimeter Array (ALMA) will itself be a prototype for the final ALMA array, providing a test for the distributed control system under development. TICS will be based on the ALMA Common Software (ACS) (developed at the European Southern Observatory), which provides CORBA-based services and a device management framework for the control software.
Simple device controllers will run on single board computers, one of which (known as an LCU) is located at each antenna; whereas complex, compound device controllers may run on centrally located computers. In either circumstance, client programs may obtain direct CORBA references to the devices and their properties. Monitor and control requests are sent to devices or properties, which then process and forward the commands to the appropriate hardware devices as required. Timing requirements are met by tagging commands with (future) timestamps synchronized to a timing pulse, which is regulated by a central reference generator, and is distributed to all hardware devices in the array. Monitoring is provided through a publish/subscribe CORBA-based service.
Submitted 4 December, 2001; v1 submitted 8 November, 2001;
originally announced November 2001.