Showing 1–12 of 12 results for author: Bommarito, J

  1. arXiv:2504.07854  [pdf, other]

    cs.CL cs.AI

    The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

    Authors: Michael J Bommarito II, Jillian Bommarito, Daniel Martin Katz

    Abstract: Practically all large language models have been pre-trained on data that is subject to global uncertainty related to copyright infringement and breach of contract. This creates potential risk for users and developers due to this uncertain legal status. The KL3M Data Project directly confronts this critical issue by introducing the largest comprehensive training data pipeline that minimizes risks r…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: 27 pages, 7 figures, 9 tables

  2. arXiv:2504.04131  [pdf, other]

    cs.CL

    Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary

    Authors: Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito

    Abstract: We present NUPunkt and CharBoundary, two sentence boundary detection libraries optimized for high-precision, high-throughput processing of legal text in large-scale applications such as due diligence, e-discovery, and legal research. These libraries address the critical challenges posed by legal documents containing specialized citations, abbreviations, and complex sentence structures that confoun…

    Submitted 5 April, 2025; originally announced April 2025.

    Comments: 12 pages, 5 figures, 6 tables

  3. arXiv:2503.17247  [pdf, other]

    cs.CL cs.AI

    KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications

    Authors: Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito

    Abstract: We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased t…

    Submitted 21 March, 2025; originally announced March 2025.

    Comments: 12 pages, 7 tables, 3 figures; Source code available at https://github.com/alea-institute/kl3m-tokenizer-paper

  4. arXiv:2501.08365  [pdf]

    cs.CY cs.AI cs.CL cs.LG

    Towards Best Practices for Open Datasets for LLM Training

    Authors: Stefan Baack, Stella Biderman, Kasia Odrozek, Aviya Skowron, Ayah Bdeir, Jillian Bommarito, Jennifer Ding, Maximilian Gahntz, Paul Keller, Pierre-Carl Langlais, Greg Lindahl, Sebastian Majstorovic, Nik Marda, Guilherme Penedo, Maarten Van Segbroeck, Jennifer Wang, Leandro von Werra, Mitchell Baker, Julie Belião, Kasia Chmielinski, Marzieh Fadaee, Lisa Gutermuth, Hynek Kydlíček, Greg Leppert, EM Lewis-Jong, et al. (14 additional authors not shown)

    Abstract: Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in countries like the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to…

    Submitted 14 January, 2025; originally announced January 2025.

  5. arXiv:2302.12039  [pdf, other]

    cs.CL cs.AI

    Natural Language Processing in the Legal Domain

    Authors: Daniel Martin Katz, Dirk Hartung, Lauritz Gerlach, Abhik Jana, Michael J. Bommarito II

    Abstract: In this paper, we summarize the current state of the field of NLP & Law with a specific focus on recent technical and substantive developments. To support our analysis, we construct and analyze a nearly complete corpus of more than six hundred NLP & Law related papers published over the past decade. Our analysis highlights several major trends. Namely, we document an increasing number of papers wr…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: 13 pages, 7 figures, 2 tables, online source and data

  6. arXiv:2301.04408  [pdf, other]

    cs.CL cs.AI cs.CY

    GPT as Knowledge Worker: A Zero-Shot Evaluation of (AI)CPA Capabilities

    Authors: Jillian Bommarito, Michael Bommarito, Daniel Martin Katz, Jessica Katz

    Abstract: The global economy is increasingly dependent on knowledge workers to meet the needs of public and private organizations. While there is no single definition of knowledge work, organizations and industry groups still attempt to measure individuals' capability to engage in it. The most comprehensive assessment of capability readiness for professional knowledge workers is the Uniform CPA Examination…

    Submitted 11 January, 2023; originally announced January 2023.

    Comments: Source code and data available in online SI at https://github.com/mjbommar/gpt-as-knowledge-worker

  7. arXiv:2102.09904  [pdf, other]

    cs.MS cs.CY cs.SE physics.soc-ph

    An Empirical Analysis of the R Package Ecosystem

    Authors: Ethan Bommarito, Michael J Bommarito II

    Abstract: In this research, we present a comprehensive, longitudinal empirical summary of the R package ecosystem, including not just CRAN, but also Bioconductor and GitHub. We analyze more than 25,000 packages, 150,000 releases, and 15 million files across two decades, providing comprehensive counts and trends for common metrics across packages, releases, authors, licenses, and other important metadata. We…

    Submitted 19 February, 2021; originally announced February 2021.

    Comments: 20 pages, 3 figures, 23 tables

  8. arXiv:1806.04973  [pdf, other]

    cs.CL cs.DB

    OpenEDGAR: Open Source Software for SEC EDGAR Analysis

    Authors: Michael J Bommarito II, Daniel Martin Katz, Eric M Detterman

    Abstract: OpenEDGAR is an open source Python framework designed to rapidly construct research databases based on the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system operated by the US Securities and Exchange Commission (SEC). OpenEDGAR is built on the Django application framework, supports distributed compute across one or more servers, and includes functionality to (i) retrieve and parse…

    Submitted 13 June, 2018; originally announced June 2018.

    Comments: 12 pages, 3 figures, 2 tables

    ACM Class: I.2.7; F.2.2; H.3.1; H.3.3; I.7

  9. arXiv:1806.03688  [pdf, other]

    cs.CL cs.IR stat.ML

    LexNLP: Natural language processing and information extraction for legal and regulatory texts

    Authors: Michael J Bommarito II, Daniel Martin Katz, Eric M Detterman

    Abstract: LexNLP is an open source Python package focused on natural language processing and machine learning for legal and regulatory text. The package includes functionality to (i) segment documents, (ii) identify key text such as titles and section headings, (iii) extract over eighteen types of structured information like distances and dates, (iv) extract named entities such as companies and geopolitical…

    Submitted 10 June, 2018; originally announced June 2018.

    Comments: 9 pages, 0 figures; see also https://github.com/LexPredict/lexpredict-lexnlp

    ACM Class: I.2.7; F.2.2; H.3.1; H.3.3; I.7

  10. arXiv:1712.03846  [pdf, other]

    physics.soc-ph cs.SI

    Crowdsourcing accurately and robustly predicts Supreme Court decisions

    Authors: Daniel Martin Katz, Michael James Bommarito II, Josh Blackman

    Abstract: Scholars have increasingly investigated "crowdsourcing" as an alternative to expert-based judgment or purely data-driven approaches to predicting the future. Under certain conditions, scholars have found that crowdsourcing can outperform these other approaches. However, despite interest in the topic and a series of successful use cases, relatively few studies have applied empirical model thinking…

    Submitted 11 December, 2017; originally announced December 2017.

    Comments: 11 pages, 5 figures, 4 tables; preprint for public feedback

  11. arXiv:1407.6333  [pdf, ps, other]

    physics.soc-ph cs.SI

    Predicting the Behavior of the Supreme Court of the United States: A General Approach

    Authors: Daniel Martin Katz, Michael J Bommarito II, Josh Blackman

    Abstract: Building upon developments in theoretical and applied machine learning, as well as the efforts of various scholars including Guimera and Sales-Pardo (2011), Ruger et al. (2004), and Martin et al. (2004), we construct a model designed to predict the voting behavior of the Supreme Court of the United States. Using the extremely randomized tree method first proposed in Geurts, et al. (2006), a method…

    Submitted 23 July, 2014; originally announced July 2014.

    Comments: 17 pages, 6 figures; source available at https://github.com/mjbommar/scotus-predict

  12. arXiv:1003.4146  [pdf, other]

    cs.IR cs.CY cs.DL physics.soc-ph

    A Mathematical Approach to the Study of the United States Code

    Authors: Michael J. Bommarito II, Daniel Martin Katz

    Abstract: The United States Code (Code) is a document containing over 22 million words that represents a large and important source of Federal statutory law. Scholars and policy advocates often discuss the direction and magnitude of changes in various aspects of the Code. However, few have mathematically formalized the notions behind these discussions or directly measured the resulting representations. This…

    Submitted 22 March, 2010; originally announced March 2010.

    Comments: 5 pages, 6 figures, 2 tables.