Showing 1–1 of 1 results for author: Dane, S

  1. arXiv:2505.00612  [pdf, ps, other]

    cs.AI

    Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

    Authors: D. Sculley, Will Cukierski, Phil Culliton, Sohier Dane, Maggie Demkin, Ryan Holbrook, Addison Howard, Paul Mooney, Walter Reade, Megan Risdal, Nate Keating

    Abstract: In this position paper, we observe that empirical evaluation in Generative AI is at a crisis point since traditional ML evaluation and benchmarking strategies are insufficient to meet the needs of evaluating modern GenAI models and systems. There are many reasons for this, including the fact that these models typically have nearly unbounded input and output spaces, typically do not have a well def…

    Submitted 28 May, 2025; v1 submitted 1 May, 2025; originally announced May 2025.