-
Evaluating the Performance and Robustness of LLMs in Materials Science Q&A and Property Predictions
Authors:
Hongchen Wang,
Kangming Li,
Scott Ramsay,
Yao Fehlis,
Edward Kim,
Jason Hattrick-Simpers
Abstract:
Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. In this study, we evaluate the performance and robustness of LLMs for materials science, focusing on domain-specific question answering and materials property prediction across diverse real-world and adversarial c…
▽ More
Large Language Models (LLMs) have the potential to revolutionize scientific research, yet their robustness and reliability in domain-specific applications remain insufficiently explored. In this study, we evaluate the performance and robustness of LLMs for materials science, focusing on domain-specific question answering and materials property prediction across diverse real-world and adversarial conditions. Three distinct datasets are used in this study: 1) a set of multiple-choice questions from undergraduate-level materials science courses, 2) a dataset including various steel compositions and yield strengths, and 3) a band gap dataset, containing textual descriptions of material crystal structures and band gap values. The performance of LLMs is assessed using various prompting strategies, including zero-shot chain-of-thought, expert prompting, and few-shot in-context learning. The robustness of these models is tested against various forms of 'noise', ranging from realistic disturbances to intentionally adversarial manipulations, to evaluate their resilience and reliability under real-world conditions. Additionally, the study showcases unique phenomena of LLMs during predictive tasks, such as mode collapse behavior when the proximity of prompt examples is altered and performance recovery from train/test mismatch. The findings aim to provide informed skepticism for the broad use of LLMs in materials science and to inspire advancements that enhance their robustness and reliability for practical applications.
△ Less
Submitted 11 March, 2025; v1 submitted 22 September, 2024;
originally announced September 2024.
-
Ill-Typed Programs Don't Evaluate
Authors:
Steven Ramsay,
Charlie Walpole
Abstract:
We introduce two-sided type systems, which are sequent calculi for typing formulas. Two-sided type systems allow for hypothetical reasoning over the typing of compound program expressions, and the refutation of typing formulas. By incorporating a type of all values, these type systems support more refined notions of well-typing and ill-typing, guaranteeing both that well-typed programs don't go wr…
▽ More
We introduce two-sided type systems, which are sequent calculi for typing formulas. Two-sided type systems allow for hypothetical reasoning over the typing of compound program expressions, and the refutation of typing formulas. By incorporating a type of all values, these type systems support more refined notions of well-typing and ill-typing, guaranteeing both that well-typed programs don't go wrong and that ill-typed programs don't evaluate - that is, reach a value. This makes two-sided type systems suitable for incorrectness reasoning in higher-order program verification, which we illustrate through an application to precise data-flow typing in a language with constructors and pattern matching. Finally, we investigate the internalisation of the meta-level negation in the system as a complement operator on types. This motivates an alternative semantics for the typing judgement, which guarantees that ill-typed programs don't evaluate, but in which well-typed programs may yet go wrong.
△ Less
Submitted 20 October, 2023; v1 submitted 13 July, 2023;
originally announced July 2023.
-
Effect Handlers for Programmable Inference
Authors:
Minh Nguyen,
Roly Perera,
Meng Wang,
Steven Ramsay
Abstract:
Inference algorithms for probabilistic programming are complex imperative programs with many moving parts. Efficient inference often requires customising an algorithm to a particular probabilistic model or problem, sometimes called inference programming. Most inference frameworks are implemented in languages that lack a disciplined approach to side effects, which can result in monolithic implement…
▽ More
Inference algorithms for probabilistic programming are complex imperative programs with many moving parts. Efficient inference often requires customising an algorithm to a particular probabilistic model or problem, sometimes called inference programming. Most inference frameworks are implemented in languages that lack a disciplined approach to side effects, which can result in monolithic implementations where the structure of the algorithms is obscured and inference programming is hard. Functional programming with typed effects offers a more structured and modular foundation for programmable inference, with monad transformers being the primary structuring mechanism explored to date.
This paper presents an alternative approach to inference programming based on algebraic effects. Using effect signatures to specify the key operations of the algorithms, and effect handlers to modularly interpret those operations for specific variants, we develop two abstract algorithms, or inference patterns, representing two important classes of inference: Metropolis-Hastings and particle filtering. We show how our approach reveals the algorithms' high-level structure, and makes it easy to tailor and recombine their parts into new variants. We implement the two inference patterns as a Haskell library, and discuss the pros and cons of algebraic effects vis-a-vis monad transformers as a structuring mechanism for modular imperative algorithm design.
△ Less
Submitted 23 December, 2024; v1 submitted 2 March, 2023;
originally announced March 2023.
-
Higher-Order MSL Horn Constraints
Authors:
Jerome Jochems,
Eddie Jones,
Steven Ramsay
Abstract:
The monadic shallow linear (MSL) class is a decidable fragment of first-order Horn clauses that was discovered and rediscovered around the turn of the century, with applications in static analysis and verification. We propose a new class of higher-order Horn constraints which extend MSL to higher-order logic and develop a resolution-based decision procedure. Higher-order MSL Horn constraints can q…
▽ More
The monadic shallow linear (MSL) class is a decidable fragment of first-order Horn clauses that was discovered and rediscovered around the turn of the century, with applications in static analysis and verification. We propose a new class of higher-order Horn constraints which extend MSL to higher-order logic and develop a resolution-based decision procedure. Higher-order MSL Horn constraints can quite naturally capture the complex patterns of call and return that are possible in higher-order programs, which make them well suited to higher-order program verification. In fact, we show that the higher-order MSL satisfiability problem and the HORS model checking problem are interreducible, so that higher-order MSL can be seen as a constraint-based approach to higher-order model checking. Finally, we describe an implementation of our decision procedure and its application to verified socket programming.
△ Less
Submitted 26 October, 2022;
originally announced October 2022.
-
CycleQ: An Efficient Basis for Cyclic Equational Reasoning
Authors:
Eddie Jones,
C-. H. Luke Ong,
Steven Ramsay
Abstract:
We propose a new cyclic proof system for automated, equational reasoning about the behaviour of pure functional programs. The key to the system is the way in which cyclic proof and equational reasoning are mediated by the use of contextual substitution as a cut rule. We show that our system, although simple, already subsumes several of the approaches to implicit induction variously known as "induc…
▽ More
We propose a new cyclic proof system for automated, equational reasoning about the behaviour of pure functional programs. The key to the system is the way in which cyclic proof and equational reasoning are mediated by the use of contextual substitution as a cut rule. We show that our system, although simple, already subsumes several of the approaches to implicit induction variously known as "inductionless induction", "rewriting induction", and "proof by consistency". By restricting the form of the traces, we show that global correctness in our system can be verified incrementally, taking advantage of the well-known size-change principle, which leads to an efficient implementation of proof search. Our CycleQ tool, accessible as a GHC plugin, shows promising results on a number of standard benchmarks.
△ Less
Submitted 14 June, 2022; v1 submitted 24 November, 2021;
originally announced November 2021.
-
Initial Limit Datalog: a New Extensible Class of Decidable Constrained Horn Clauses
Authors:
Toby Cathcart Burn,
Luke Ong,
Steven Ramsay,
Dominik Wagner
Abstract:
We present initial limit Datalog, a new extensible class of constrained Horn clauses for which the satisfiability problem is decidable. The class may be viewed as a generalisation to higher-order logic (with a simple restriction on types) of the first-order language limit Datalog$_Z$ (a fragment of Datalog modulo linear integer arithmetic), but can be instantiated with any suitable background theo…
▽ More
We present initial limit Datalog, a new extensible class of constrained Horn clauses for which the satisfiability problem is decidable. The class may be viewed as a generalisation to higher-order logic (with a simple restriction on types) of the first-order language limit Datalog$_Z$ (a fragment of Datalog modulo linear integer arithmetic), but can be instantiated with any suitable background theory. For example, the fragment is decidable over any countable well-quasi-order with a decidable first-order theory, such as natural number vectors under componentwise linear arithmetic, and words of a bounded, context-free language ordered by the subword relation. Formulas of initial limit Datalog have the property that, under some assumptions on the background theory, their satisfiability can be witnessed by a new kind of term model which we call entwined structures. Whilst the set of all models is typically uncountable, the set of all entwined structures is recursively enumerable, and model checking is decidable.
△ Less
Submitted 29 April, 2021;
originally announced April 2021.
-
Verifying Liveness Properties of ML Programs
Authors:
M. M. Lester,
R. P. Neatherway,
C. -H. L. Ong,
S. J. Ramsay
Abstract:
Higher-order recursion schemes are a higher-order analogue of Boolean Programs; they form a natural class of abstractions for functional programs. We present a new, efficient algorithm for checking CTL properties of the trees generated by higher-order recursion schemes, which is an extension of Kobayashi's intersection type-based model checking technique. We show that an implementation of this alg…
▽ More
Higher-order recursion schemes are a higher-order analogue of Boolean Programs; they form a natural class of abstractions for functional programs. We present a new, efficient algorithm for checking CTL properties of the trees generated by higher-order recursion schemes, which is an extension of Kobayashi's intersection type-based model checking technique. We show that an implementation of this algorithm, THORS, performs well on a number of small examples and we demonstrate how it can be used to verify liveness properties of OCaml programs. Example properties include statements such as "all opened sockets are eventually closed" and "the lock is held until the file is closed".
△ Less
Submitted 24 December, 2020;
originally announced December 2020.
-
Intensional Datatype Refinement
Authors:
Eddie Jones,
Steven Ramsay
Abstract:
The pattern-match safety problem is to verify that a given functional program will never crash due to non-exhaustive patterns in its function definitions. We present a refinement type system that can be used to solve this problem. The system extends ML-style type systems with algebraic datatypes by a limited form of structural subtyping and environment-level intersection. We describe a fully autom…
▽ More
The pattern-match safety problem is to verify that a given functional program will never crash due to non-exhaustive patterns in its function definitions. We present a refinement type system that can be used to solve this problem. The system extends ML-style type systems with algebraic datatypes by a limited form of structural subtyping and environment-level intersection. We describe a fully automatic, sound and complete type inference procedure for this system which, under reasonable assumptions, is worst-case linear-time in the program size. Compositionality is essential to obtaining this complexity guarantee. A prototype implementation for Haskell is able to analyse a selection of packages from the Hackage database in a few hundred milliseconds.
△ Less
Submitted 25 November, 2020; v1 submitted 4 August, 2020;
originally announced August 2020.
-
Bisimilarity in fresh-register automata
Authors:
Andrzej S. Murawski,
Steven J. Ramsay,
Nikos Tzevelekos
Abstract:
Register automata are a basic model of computation over infinite alphabets. Fresh-register automata extend register automata with the capability to generate fresh symbols in order to model computational scenarios involving name creation. This paper investigates the complexity of the bisimilarity problem for classes of register and fresh-register automata. We examine all main disciplines that have…
▽ More
Register automata are a basic model of computation over infinite alphabets. Fresh-register automata extend register automata with the capability to generate fresh symbols in order to model computational scenarios involving name creation. This paper investigates the complexity of the bisimilarity problem for classes of register and fresh-register automata. We examine all main disciplines that have appeared in the literature: general register assignments; assignments where duplicate register values are disallowed; and assignments without duplicates in which registers cannot be empty. In the general case, we show that the problem is EXPTIME-complete. However, the absence of duplicate values in registers enables us to identify inherent symmetries inside the associated bisimulation relations, which can be used to establish a polynomial bound on the depth of Attacker-winning strategies. Furthermore, they enable a highly succinct representation of the corresponding bisimulations. By exploiting results from group theory and computational group theory, we can then show solvability in PSPACE and NP respectively for the latter two register disciplines. In each case, we find that freshness does not affect the complexity class of the problem. The results allow us to close a complexity gap for language equivalence of deterministic register automata. We show that deterministic language inequivalence for the no-duplicates fragment is NP-complete, which disproves an old conjecture of Sakamoto. Finally, we discover that, unlike in the finite-alphabet case, the addition of pushdown store makes bisimilarity undecidable, even in the case of visibly pushdown storage.
△ Less
Submitted 5 February, 2025; v1 submitted 13 May, 2020;
originally announced May 2020.
-
Defunctionalization of Higher-Order Constrained Horn Clauses
Authors:
Long Pham,
Steven J. Ramsay,
C. -H. Luke Ong
Abstract:
Building on the successes of satisfiability modulo theories (SMT), Bjørner et al. initiated a research programme advocating Horn constraints as a suitable basis for automatic program verification. The notion of first-order constrained Horn clauses has recently been extended to higher-order logic by Cathcart Burn et al. To exploit the remarkable efficiency of SMT solving, a natural approach to solv…
▽ More
Building on the successes of satisfiability modulo theories (SMT), Bjørner et al. initiated a research programme advocating Horn constraints as a suitable basis for automatic program verification. The notion of first-order constrained Horn clauses has recently been extended to higher-order logic by Cathcart Burn et al. To exploit the remarkable efficiency of SMT solving, a natural approach to solve systems of higher-order Horn constraints is to reduce them to systems of first-order Horn constraints. This paper presents a defunctionalization algorithm to achieve the reduction.
Given a well-sorted higher-order constrained Horn clause (HoCHC) problem instance, the defunctionalization algorithm constructs a first-order well-sorted constrained Horn clause problem. In addition to well-sortedness of the algorithm's output, we prove that if an input HoCHC is solvable, then the result of its defunctionalization is solvable. The converse also holds, which we prove using a recent result on the continuous semantics of HoCHC. To our knowledge, this defunctionalization algorithm is the first sound and complete reduction from systems of higher-order Horn constraints to systems of first-order Horn constraints.
We have constructed DefMono, a prototype implementation of the defunctionalization algorithm. It first defunctionalizes an input HoCHC problem and then feeds the result into a backend SMT solver. We have evaluated the performance of DefMono empirically by comparison with two other higher-order verification tools.
△ Less
Submitted 22 February, 2019; v1 submitted 8 October, 2018;
originally announced October 2018.
-
Higher-Order Constrained Horn Clauses and Refinement Types
Authors:
Toby Cathcart Burn,
C. -H. Luke Ong,
Steven J. Ramsay
Abstract:
Motivated by applications in automated verification of higher-order functional programs, we develop a notion of constrained Horn clauses in higher-order logic and a decision problem concerning their satisfiability. We show that, although satisfiable systems of higher-order clauses do not generally have least models, there is a notion of canonical model obtained through a reduction to a problem con…
▽ More
Motivated by applications in automated verification of higher-order functional programs, we develop a notion of constrained Horn clauses in higher-order logic and a decision problem concerning their satisfiability. We show that, although satisfiable systems of higher-order clauses do not generally have least models, there is a notion of canonical model obtained through a reduction to a problem concerning a kind of monotone logic program. Following work in higher-order program verification, we develop a refinement type system in order to reason about and automate the search for models. This provides a sound but incomplete method for solving the decision problem. Finally, we show that an extension of the decision problem in which refinement types are used directly as guards on existential quantifiers can be reduced to the original problem. This result can be used to show that properties of higher-order functions that are definable using refinement types are also expressible using higher-order constrained Horn clauses.
△ Less
Submitted 1 August, 2017; v1 submitted 17 May, 2017;
originally announced May 2017.