-
LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Authors:
Yu Fan,
Jingwei Ni,
Jakob Merane,
Etienne Salimbeni,
Yang Tian,
Yoan Hermstrüwer,
Yinya Huang,
Mubashara Akhtar,
Florian Geering,
Oliver Dreyer,
Daniel Brunner,
Markus Leippold,
Mrinmaya Sachan,
Alexander Stremitzer,
Christoph Engel,
Elliott Ash,
Joel Niklaus
Abstract:
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation on both open-ended and multiple-choice questions presents significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/
Submitted 29 May, 2025; v1 submitted 19 May, 2025;
originally announced May 2025.
-
Lawma: The Power of Specialization for Legal Annotation
Authors:
Ricardo Dominguez-Olmedo,
Vedant Nanda,
Rediet Abebe,
Stefan Bechtold,
Christoph Engel,
Jens Frankenreiter,
Krishna Gummadi,
Moritz Hardt,
Michael Livermore
Abstract:
Annotation and classification of legal text are central components of empirical legal research. Traditionally, these tasks are often delegated to trained research assistants. Motivated by the advances in language modeling, empirical legal scholars are increasingly turning to prompting commercial models, hoping that it will alleviate the significant cost of human annotation. Despite growing use, our understanding of how to best utilize large language models for legal annotation remains limited. To bridge this gap, we introduce CaselawQA, a benchmark comprising 260 legal annotation tasks, nearly all new to the machine learning community. We demonstrate that commercial models, such as GPT-4.5 and Claude 3.7 Sonnet, achieve non-trivial yet highly variable accuracy, generally falling short of the performance required for legal work. We then demonstrate that small, lightly fine-tuned models outperform commercial models. A few hundred to a thousand labeled examples are usually enough to achieve higher accuracy. Our work points to a viable alternative to the predominant practice of prompting commercial models. For concrete legal annotation tasks with some available labeled data, researchers are likely better off using a fine-tuned open-source model.
Submitted 23 April, 2025; v1 submitted 23 July, 2024;
originally announced July 2024.
-
Standardizing Paediatric Clinical Data: The Development of the conect4children (c4c) Cross Cutting Paediatric Data Dictionary
Authors:
Anando Sen,
Victoria Hedley,
John Owen,
Ronald Cornet,
Dipak Kalra,
Corinna Engel,
Avril Palmeri,
Joanne Lee,
Jean-Christophe Roze,
Joseph F Standing,
Adilia Warris,
Claudia Pansieri,
Rebecca Leary,
Mark Turner,
Volker Straub
Abstract:
Standardization of data items collected in paediatric clinical trials is an important but challenging issue. The Clinical Data Interchange Standards Consortium (CDISC) data standards are well understood by the pharmaceutical industry but lack the implementation of some paediatric specific concepts. When a paediatric concept is absent within CDISC standards, companies and research institutions take multiple approaches in the collection of paediatric data, leading to different implementations of standards and potentially limited utility for reuse. To overcome these challenges, the conect4children consortium has developed a cross-cutting paediatric data dictionary (CCPDD). The dictionary was built over three phases - scoping (including a survey sent out to ten industrial and 34 academic partners to gauge interest), creation of a longlist and consensus building for the final set of terms. The dictionary was finalized during a workshop with attendees from academia, hospitals, industry and CDISC. The attendees held detailed discussions on each data item and participated in the final vote on the inclusion of the item in the CCPDD. Nine industrial and 34 academic partners responded to the survey, which showed overall interest in the development of the CCPDD. Following the final vote on 27 data items, three were rejected, six were deferred to the next version and a final opinion was sought from CDISC. The first version of the CCPDD with 25 data items was released in August 2019. The continued use of the dictionary has the potential to ensure the collection of standardized data that is interoperable and can later be pooled and reused for other applications. The dictionary is already being used for case report form creation in three clinical trials. The CCPDD will also serve as one of the inputs to the Paediatric User Guide, which is being developed by CDISC.
Submitted 26 February, 2023;
originally announced February 2023.
-
AI training resources for GLAM: a snapshot
Authors:
Andrew Darby,
Catherine Nicole Coleman,
Claudia Engel,
Daniel van Strien,
Mike Trizna,
Zachary W. Painter
Abstract:
We take a snapshot of current resources available for teaching and learning AI with a focus on the Galleries, Libraries, Archives and Museums (GLAM) community. The review was carried out in 2021 and 2022. The review provides an overview of material we identified as being relevant, offers a description of this material and makes recommendations for future work in this area.
Submitted 10 May, 2022;
originally announced May 2022.