-
Reasoning Models are Test Exploiters: Rethinking Multiple-Choice
Authors:
Narun Raman,
Taylor Lundy,
Kevin Leyton-Brown
Abstract:
When evaluating Large Language Models (LLMs) in question-answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of $15$ different question-answering benchmarks (e.g., MMLU, HLE) and $25$ different LLMs (including small models such as Qwen 7B and relatively large models such as Llama 70B). For each model-benchmark pair, we considered $5$ ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether "none of the above" sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning after being given a set of options tended to significantly outperform their free-text performance due to exploiting the information in the options. We conclude that MCQA is no longer a good proxy for assessing downstream performance of state-of-the-art models, and offer practical guidelines for designing more robust, bias-resistant benchmarks that better reflect LLMs' genuine reasoning capabilities.
Submitted 21 July, 2025;
originally announced July 2025.
-
NFTs as a Data-Rich Test Bed: Conspicuous Consumption and its Determinants
Authors:
Taylor Lundy,
Narun Raman,
Scott Duke Kominers,
Kevin Leyton-Brown
Abstract:
Conspicuous consumption occurs when a consumer derives value from a good based on its social meaning as a signal of wealth, taste, and/or community affiliation. Common conspicuous goods include designer footwear, country club memberships, and artwork; conspicuous goods also exist in the digital sphere, with non-fungible tokens (NFTs) as a prominent example. The NFT market merits deeper study for two key reasons: first, it is poorly understood relative to its economic scale; and second, it is unusually amenable to analysis because NFT transactions are publicly available on the blockchain, making them useful as a test bed for conspicuous consumption dynamics. This paper introduces a model that incorporates two previously identified elements of conspicuous consumption: the bandwagon effect (goods increase in value as they become more popular) and the snob effect (goods increase in value as they become rarer). Our model resolves the apparent tension between these two effects, exhibiting net complementarity between others' and one's own conspicuous consumption. We also introduce a novel dataset combining NFT transactions with embeddings of the corresponding NFT images computed using an off-the-shelf vision transformer architecture. We use our dataset to validate the model, showing that the bandwagon effect raises an NFT collection's value as more consumers join, while the snob effect drives consumers to seek rarer NFTs within a given collection.
Submitted 21 March, 2025;
originally announced March 2025.
-
STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models
Authors:
Narun Raman,
Taylor Lundy,
Thiago Amin,
Jesse Perla,
Kevin Leyton-Brown
Abstract:
How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into $58$ distinct elements, focusing on the logic of supply and demand, each grounded in up to $10$ distinct domains, $5$ perspectives, and $3$ types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting handwritten templates to target new domains and perspectives. Because it offers an automated way of generating fresh questions, auto-STEER mitigates the risk that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that it will serve as a useful tool both for evaluating and fine-tuning models for years to come. We demonstrate the usefulness of our benchmark via a case study on $27$ LLMs, ranging from small open-source models to the current state of the art. We examined each model's ability to solve microeconomic problems across our whole taxonomy and present the results across a range of prompting strategies and scoring metrics.
Submitted 18 February, 2025; v1 submitted 18 February, 2025;
originally announced February 2025.
-
Multidimensional Bayesian Utility Maximization: Tight Approximations to Welfare
Authors:
Kira Goldner,
Taylor Lundy
Abstract:
We initiate the study of multidimensional Bayesian utility maximization, focusing on the unit-demand setting where values are i.i.d. across both items and buyers. The seminal result of Hartline and Roughgarden '08 studies simple, information-robust mechanisms that maximize utility for $n$ i.i.d. agents and $m$ identical items via an approximation to social welfare as an upper bound, and they prove this gap between optimal utility and social welfare is $Θ(1+\log{n/m})$ in this setting. We extend these results to the multidimensional setting. To do so, we develop simple, prior-independent, approximately-optimal mechanisms, targeting the simplest benchmark of optimal welfare. We give a $(1- 1/e)$-approximation when there are more items than buyers, and a $Θ(\log{n/m})$-approximation when there are more buyers than items, and we prove that this bound is tight in both $n$ and $m$ by reducing the i.i.d. unit-demand setting to the identical items setting. Finally, we include an extensive discussion section on why Bayesian utility maximization is a promising research direction. In particular, we characterize complexities in this setting that defy our intuition from the welfare and revenue literature, and motivate why coming up with a better benchmark than welfare is a hard problem itself.
Submitted 15 February, 2025; v1 submitted 19 February, 2024;
originally announced February 2024.
-
STEER: Assessing the Economic Rationality of Large Language Models
Authors:
Narun Raman,
Taylor Lundy,
Samuel Amouyal,
Yoav Levine,
Kevin Leyton-Brown,
Moshe Tennenholtz
Abstract:
There is increasing interest in using LLMs as decision-making "agents." Doing so includes many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions -- and more broadly, determining whether an LLM agent is reliable enough to be trusted -- requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "STEER report card." Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.
Submitted 28 May, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
Pay to (Not) Play: Monetizing Impatience in Mobile Games
Authors:
Taylor Lundy,
Narun Raman,
Hu Fu,
Kevin Leyton-Brown
Abstract:
Mobile gaming is a rapidly growing and incredibly profitable sector; having grown seven-fold over the past 10 years, it now grosses over $100 billion annually. This growth was due in large part to a shift in monetization strategies: rather than charging players an upfront cost ("pay-to-play"), games often request optional microtransactions throughout gameplay ("free-to-play"). We focus on a common scenario in which games include wait times -- gating either items or game progression -- that players can pay to skip. Game designers typically say that they optimize for player happiness rather than revenue; however, prices for skips are typically set at levels that few players are willing to pay, leading to low purchase rates. Under a traditional analysis, it would seem that game designers fail at their stated goal if few players buy what they are selling. We argue that an alternate model can better explain this dynamic: players value tasks more highly as they are perceived to be more difficult. While skips can increase players' utilities by providing instant gratification, pricing skips too cheaply can lower players' utilities by decreasing the perceived amount of work needed to complete a task. We show that high revenue, high player utility, and low purchase rates can all coexist under this model, particularly under a realistic distribution of players having few buyers but a few big-spending "whales." We also investigate how a game designer should optimize prices under our model.
Submitted 15 December, 2023;
originally announced December 2023.
-
UNSAT Solver Synthesis via Monte Carlo Forest Search
Authors:
Chris Cameron,
Jason Hartford,
Taylor Lundy,
Tuan Truong,
Alan Milligan,
Rex Chen,
Kevin Leyton-Brown
Abstract:
We introduce Monte Carlo Forest Search (MCFS), a class of reinforcement learning (RL) algorithms for learning policies in tree MDPs, for which policy execution involves traversing an exponential-sized tree. Examples of such problems include proving unsatisfiability of a SAT formula; counting the number of solutions of a satisfiable SAT formula; and finding the optimal solution to a mixed-integer program. MCFS algorithms can be seen as extensions of Monte Carlo Tree Search (MCTS) to cases where, rather than finding a good path (solution) within a tree, the problem is to find a small tree within a forest of candidate trees. We instantiate and evaluate our ideas in an algorithm that we dub Knuth Synthesis, an MCFS algorithm that learns DPLL branching policies for solving the Boolean satisfiability (SAT) problem, with the objective of achieving good average-case performance on a given distribution of unsatisfiable problem instances. Knuth Synthesis is the first RL approach to avoid the prohibitive costs of policy evaluations in an exponentially-sized tree, leveraging two key ideas: first, we estimate tree size by randomly sampling paths and measuring their lengths, drawing on an unbiased approximation due to Knuth (1975); second, we query a strong solver at a user-defined depth rather than learning a policy across the whole tree, to focus our policy search on early decisions that offer the greatest potential for reducing tree size. We matched or exceeded the performance of a strong baseline on three well-known SAT distributions, facing problems that were two orders of magnitude more challenging than those addressed in previous RL studies.
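The tree-size estimate mentioned in the abstract can be illustrated with a minimal sketch of Knuth's random-probe estimator. This is not the paper's implementation: the function names, the `children` callback, and the nested-tuple tree encoding are assumptions made for illustration. Each probe walks one random root-to-leaf path and accumulates the product of the branching factors seen so far; the expected value of a probe equals the number of nodes in the tree.

```python
import random


def knuth_estimate(root, children, rng):
    """Run one random probe and return an unbiased estimate of the node count.

    `children(node)` is a hypothetical callback returning a list of child nodes
    (an empty list for a leaf).
    """
    estimate = 1  # counts the root
    weight = 1    # product of branching factors along the sampled path
    node = root
    while True:
        kids = children(node)
        if not kids:
            return estimate
        weight *= len(kids)
        estimate += weight  # estimated number of nodes at the next depth
        node = rng.choice(kids)


def mean_estimate(root, children, probes, seed=0):
    """Average several independent probes to reduce variance."""
    rng = random.Random(seed)
    total = sum(knuth_estimate(root, children, rng) for _ in range(probes))
    return total / probes


# Illustrative encoding: a node is a tuple of its children, and () is a leaf.
# This tree has 5 nodes: a root, one leaf child, and one child with two leaves.
example_tree = ((), ((), ()))
approx_size = mean_estimate(example_tree, list, probes=1000)
```

When every internal node has exactly two children, as in a binary branching tree, a probe along a path of length L returns 2^(L+1) - 1, so the estimate depends only on the sampled path's length.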
Submitted 12 July, 2024; v1 submitted 22 November, 2022;
originally announced November 2022.
-
The Perils of Learning Before Optimizing
Authors:
Chris Cameron,
Jason Hartford,
Taylor Lundy,
Kevin Leyton-Brown
Abstract:
Formulating real-world optimization problems often begins with making predictions from historical data (e.g., an optimizer that aims to recommend fast routes relies upon travel-time predictions). Typically, learning the prediction model used to generate the optimization problem and solving that problem are performed in two separate stages. Recent work has shown how such prediction models can be learned end-to-end by differentiating through the optimization task. Such methods often yield empirical improvements, which are typically attributed to end-to-end making better error tradeoffs than the standard loss function used in a two-stage solution. We refine this explanation and more precisely characterize when end-to-end can improve performance. When prediction targets are stochastic, a two-stage solution must make an a priori choice about which statistics of the target distribution to model -- we consider expectations over prediction targets -- while an end-to-end solution can make this choice adaptively. We show that the performance gap between a two-stage and end-to-end approach is closely related to the price of correlation (POC) concept in stochastic optimization and show the implications of some existing POC results for the predict-then-optimize problem. We then consider a novel and particularly practical setting, where multiple prediction targets are combined to obtain each of the objective function's coefficients. We give explicit constructions where (1) two-stage performs unboundedly worse than end-to-end; and (2) two-stage is optimal. We use simulations to experimentally quantify performance gaps and identify a wide range of real-world applications from the literature whose objective functions rely on multiple prediction targets, suggesting that end-to-end learning could yield significant improvements.
Submitted 16 December, 2021; v1 submitted 18 June, 2021;
originally announced June 2021.
-
Smarter Parking: Using AI to Identify Parking Inefficiencies in Vancouver
Authors:
Devon Graham,
Satish Kumar Sarraf,
Taylor Lundy,
Ali MohammadMehr,
Sara Uppal,
Tae Yoon Lee,
Hedayat Zarkoob,
Scott Duke Kominers,
Kevin Leyton-Brown
Abstract:
On-street parking is convenient, but has many disadvantages: on-street spots come at the expense of other road uses such as traffic lanes, transit lanes, bike lanes, or parklets; drivers looking for parking contribute substantially to traffic congestion and hence to greenhouse gas emissions; safety is reduced both because drivers looking for spots are more distracted than other road users and because people exiting parked cars pose a risk to cyclists. These social costs may not be worth paying when off-street parking lots are nearby and have surplus capacity. To see where this might be true in downtown Vancouver, we used artificial intelligence techniques to estimate the amount of time it would take drivers to park both on and off street for destinations throughout the city. For on-street parking, we developed (1) a deep-learning model of block-by-block parking availability based on data from parking meters and audits and (2) a computational simulation of drivers searching for an on-street spot. For off-street parking, we developed a computational simulation of the time it would take drivers to drive from their original destination to the nearest city-owned off-street lot and then to queue for a spot based on traffic and lot occupancy data. Finally, in both cases we also computed the time it would take the driver to walk from their parking spot to their original destination. We compared these time estimates for destinations in each block of Vancouver's downtown core and each hour of the day. We found many areas where parking off street would actually save drivers time over searching the streets for a spot, and many more where the time cost for parking off street was small. The identification of such areas provides an opportunity for the city to repurpose valuable curbside space for community-friendly uses more in line with its transportation goals.
Submitted 21 March, 2020;
originally announced March 2020.
-
Limitations of Incentive Compatibility on Discrete Type Spaces
Authors:
Taylor Lundy,
Hu Fu
Abstract:
In the design of incentive compatible mechanisms, a common approach is to enforce incentive compatibility as constraints in programs that optimize over feasible mechanisms. Such constraints are often imposed on sparsified representations of the type spaces, such as their discretizations or samples, in order for the program to be manageable. In this work, we explore limitations of this approach, by studying whether all dominant strategy incentive compatible mechanisms on a set $T$ of discrete types can be extended to the convex hull of $T$.
Dobzinski, Fu and Kleinberg (2015) answered the question affirmatively for all settings where types are single dimensional. It is not difficult to show that the same holds when the set of feasible outcomes is downward closed. In this work we show that the question has a negative answer for certain non-downward-closed settings with multi-dimensional types. This result should call for caution in the use of the said approach to enforcing incentive compatibility beyond single-dimensional preferences and downward closed feasible outcomes.
Submitted 23 November, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.