-
Humanity's Last Exam
Authors:
Long Phan,
Alice Gatti,
Ziwen Han,
Nathaniel Li,
Josephina Hu,
Hugh Zhang,
Chen Bo Calvin Zhang,
Mohamed Shaaban,
John Ling,
Sean Shi,
Michael Choi,
Anish Agrawal,
Arnav Chopra,
Adam Khoja,
Ryan Kim,
Richard Ren,
Jason Hausenloy,
Oliver Zhang,
Mantas Mazeika,
Dmitry Dodonov,
Tung Nguyen,
Jaeho Lee,
Daron Anderson,
Mikhail Doroshenko,
Alun Cennyth Stokes,
et al. (1084 additional authors not shown)
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Submitted 19 April, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models
Authors:
Dominik Huber,
Martin Schreiber,
Martin Schulz,
Howard Pritchard,
Daniel Holmes
Abstract:
With Dynamic Resource Management (DRM), the resources assigned to a job can be changed dynamically during its execution. From the system's perspective, DRM opens a new level of flexibility in resource allocation and job scheduling and therefore has the potential to improve system efficiency metrics such as the utilization rate, job throughput, energy efficiency, and responsiveness. From the application perspective, users can tailor the resources they request to their needs, offering potential optimizations in queuing time or charged costs. Despite these obvious advantages and many attempts over the last decade to establish DRM in HPC, it remains a concept discussed in academia rather than being successfully deployed on production systems. This stems from the fact that support for DRM requires changes in all the layers of the HPC system software stack, including applications, programming models, process managers, and resource management software, as well as an extensive and holistic co-design process to establish new techniques and policies for scheduling and resource optimization. In this work, we therefore start with the assumption that resources are accessible by processes executed either on them (e.g., on CPU) or controlling them (e.g., GPU-offloading). Then, the overall DRM problem can be decomposed into dynamic process management (DPM) and dynamic resource mapping or allocation (DRA). The former determines which processes (or which change in processes) must be managed and the latter identifies the resources where they will be executed. The interfaces for such DPM/DRA in these layers need to be standardized, which requires a careful design to be interoperable while providing high flexibility. Based on a survey of existing approaches, we propose design principles that form the basis of a holistic approach to DRM in HPC and provide a prototype implementation using MPI.
Submitted 25 March, 2024;
originally announced March 2024.
-
Digital Twins and the Future of their Use Enabling Shift Left and Shift Right Cybersecurity Operations
Authors:
Ahmad Mohsin,
Helge Janicke,
Surya Nepal,
David Holmes
Abstract:
Digital Twins (DTs) optimize operations and monitor performance in Smart Critical Systems (SCS) domains like smart grids and manufacturing. DT-based cybersecurity solutions are in their infancy, lacking a unified strategy to overcome challenges spanning the next three to five decades. These challenges include reliable data accessibility from Cyber-Physical Systems (CPS) operating in unpredictable environments. Reliable data sources are pivotal for intelligent cybersecurity operations aided with underlying modeling capabilities across the SCS lifecycle, necessitating a DT. To address these challenges, we propose Security Digital Twins (SDTs) collecting real-time data from CPS, requiring the Shift Left and Shift Right (SLSR) design paradigm for SDT to implement both design-time and runtime cybersecurity operations. Incorporating virtual CPS components (VC) in Cloud/Edge, data fusion to SDT models is enabled with high reliability, providing threat insights and enhancing cyber resilience. VC-enabled SDT ensures accurate data feeds for security monitoring at both design time and runtime. This design paradigm shift propagates innovative SDT modeling and analytics for securing future critical systems. This vision paper outlines intelligent SDT design through innovative techniques, exploring hybrid intelligence with data-driven and rule-based semantic SDT models. Various operational use cases are discussed for securing smart critical systems through underlying modeling and analytics capabilities.
Submitted 24 September, 2023;
originally announced September 2023.
-
Points of non-linearity of functions generated by random neural networks
Authors:
David Holmes
Abstract:
We consider functions from the real numbers to the real numbers, output by a neural network with 1 hidden activation layer, arbitrary width, and ReLU activation function. We assume that the parameters of the neural network are chosen uniformly at random with respect to various probability distributions, and compute the expected distribution of the points of non-linearity. We use these results to explain why the network may be biased towards outputting functions with simpler geometry, and why certain functions with low information-theoretic complexity are nonetheless hard for a neural network to approximate.
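As an illustrative sketch only (not the paper's method): a network of the kind described can be written as f(x) = sum_i a_i * ReLU(w_i * x + b_i) + c, and each hidden unit has a kink at x = -b_i / w_i, so these are the candidate points of non-linearity. The function names below are hypothetical, and the standard normal sampling is an assumption, since the abstract does not say which distributions are studied.

```python
import random


def breakpoints(weights, biases):
    """Return the sorted kink locations x = -b / w of the hidden ReLU units.

    Units with w == 0 are constant in x and contribute no kink.
    """
    return sorted(-b / w for w, b in zip(weights, biases) if w != 0)


def sample_breakpoints(width, seed=0):
    """Sample hidden-layer parameters and return the candidate kinks.

    Assumption: weights and biases are drawn i.i.d. from N(0, 1); the
    abstract only says "various probability distributions".
    """
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(width)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return breakpoints(weights, biases)
```

A kink is an actual point of non-linearity of f only when the corresponding output weight a_i is non-zero and kinks at the same location do not cancel, which is why the sketch calls them candidates.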
Submitted 19 April, 2023;
originally announced April 2023.
-
Symbol-Level Synchronisation Channel Modelling With Real-World Application: From Davey-Mackay, Fritchman to Markov
Authors:
Shamin Achari,
Daniel Glenn Holmes,
Ling Cheng
Abstract:
Errors in realistic channels contain not only substitution errors, but synchronisation errors as well. Moreover, these errors are rarely statistically independent in nature. By extending the idea of the Fritchman channel model, a novel error category-based methodology for determining channel characteristics is described for memory channels which contain insertion, deletion, and substitution errors. The practicality of such a methodology is reinforced by making use of real communication data from a visible light communication system. Simulation results show that the error-free and error runs using this new method of defining the channel clearly deviate from the Davey-Mackay synchronisation model, which is memoryless in nature. This further emphasises the inherent memory in these synchronisation channels, which we are now able to characterise. Additionally, a new method to determine the parameters of a synchronisation memory channel using the Levenshtein distance metric is detailed. This method of channel modelling allows for more realistic communication models to be simulated and can easily extend to other areas of research such as DNA barcoding in the medical domain.
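For readers unfamiliar with the metric named above, the following is a minimal sketch of the Levenshtein distance, which counts the fewest insertions, deletions, and substitutions needed to turn one sequence into another. It shows only the metric itself; how the paper uses it to estimate channel parameters is not described in the abstract, and the function name is chosen here for illustration.

```python
def levenshtein(sent, received):
    """Minimum number of insertions, deletions and substitutions that
    turn `sent` into `received` (row-by-row dynamic programming)."""
    # prev[j] is the distance between the empty prefix of `sent`
    # and the first j symbols of `received`.
    prev = list(range(len(received) + 1))
    for i, s in enumerate(sent, start=1):
        curr = [i]  # turning i sent symbols into nothing takes i deletions
        for j, r in enumerate(received, start=1):
            cost = 0 if s == r else 1
            curr.append(min(
                prev[j] + 1,         # delete s
                curr[j - 1] + 1,     # insert r
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

Applied to a transmitted and a received symbol sequence, the result is the smallest number of edit events consistent with the pair, for example `levenshtein("kitten", "sitting")` evaluates to 3.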
Submitted 19 February, 2021;
originally announced February 2021.
-
Trip Recovery in Lower-Limb Prostheses using Reachable Sets of Predicted Human Motion
Authors:
Shannon M. Danforth,
Patrick D. Holmes,
Ram Vasudevan
Abstract:
People with lower-limb loss, the majority of whom use passive prostheses, exhibit a high incidence of falls each year. Powered lower-limb prostheses have the potential to reduce fall rates by actively helping the user recover from a stumble, but the unpredictability of the human response makes it difficult to design controllers that ensure a successful recovery. This paper presents a method called TRIP-RTD (Trip Recovery in Prostheses via Reachability-based Trajectory Design) for online trajectory planning in a knee prosthesis during and after a stumble that can accommodate a set of possible predictions of human behavior. Using this predicted set of human behavior, the proposed method computes a parameterized reachable set of trajectories for the human-prosthesis system. To ensure safety at run-time, TRIP-RTD selects a trajectory for the prosthesis that guarantees that all possible states of the human-prosthesis system at touchdown arrive in the basin of attraction of the nominal behavior of the system. In simulated stumble experiments where a nominal phase-based controller was unable to help the system recover, TRIP-RTD produced trajectories in under 101 ms that led to successful recoveries for all feasible solutions found.
Submitted 21 October, 2020;
originally announced October 2020.
-
Extending the Message Passing Interface (MPI) with User-Level Schedules
Authors:
Derek Schafer,
Sheikh Ghafoor,
Daniel Holmes,
Martin Ruefenacht,
Anthony Skjellum
Abstract:
Composability is one of seven reasons for the long-standing and continuing success of MPI. Extending MPI by composing its operations with user-level operations provides useful integration with the progress engine and completion notification methods of MPI. However, the existing extensibility mechanism in MPI (generalized requests) is not widely utilized and has significant drawbacks.
MPI can be generalized via scheduled communication primitives, for example, by utilizing implementation techniques from existing MPI-3 nonblocking collectives and from forthcoming MPI-4 persistent and partitioned APIs. Non-trivial schedules are used internally in some MPI libraries; however, they are not accessible to end users.
Message-based communication patterns can be built as libraries on top of MPI. Such libraries can have comparable implementation maturity and potentially higher performance than MPI library code, but do not require intimate knowledge of the MPI implementation. Libraries can provide performance-portable interfaces that cross MPI implementation boundaries. The ability to compose additional user-defined operations using the same progress engine benefits all kinds of general purpose HPC libraries.
We propose a definition for MPI schedules: a user-level programming model suitable for creating persistent collective communication composed with new application-specific sequences of user-defined operations managed by MPI and fully integrated with MPI progress and completion notification. The API proposed offers a path to standardization for extensible communication schedules involving user-defined operations. Our approach has the potential to introduce event-driven programming into MPI (beyond the tools interface), although connecting schedules with events comprises future work.
Early performance results described here are promising and indicate strong overlap potential.
Submitted 25 September, 2019;
originally announced September 2019.
-
Automated Camera-Based Estimation of Rehabilitation Criteria Following ACL Reconstruction
Authors:
Choong Hee Kim,
Shannon M. Danforth,
Patrick D. Holmes,
Daphna Raz,
Darlene Yao,
Asheesh Bedi,
Ram Vasudevan
Abstract:
Anterior cruciate ligament (ACL) reconstruction necessitates months of rehabilitation, during which a clinician evaluates whether a patient is ready to return to sports or occupation. Due to their time- and cost-intensive nature, these screenings to assess progress are unavailable to many. This paper introduces an automated, markerless, camera-based method for estimating rehabilitation criteria following ACL reconstruction. To evaluate the performance of this novel technique, data were collected weekly from 12 subjects as they used a leg press over the course of a 12-week rehabilitation period. The proposed camera-based method for estimating displacement and force was compared to encoder and force plate measurements. The leg press displacement and force values were estimated with 89.7% and 85.3% accuracy, respectively. These values were then used to calculate lower-limb symmetry and to track patient progress over time.
Submitted 25 October, 2018;
originally announced October 2018.