Showing 1–5 of 5 results for author: Jonsson, L

  1. arXiv:2309.12269  [pdf, other]

    cs.CL cs.CY stat.AP

    The Cambridge Law Corpus: A Dataset for Legal AI Research

    Authors: Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek

    Abstract: We introduce the Cambridge Law Corpus (CLC), a dataset for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases,…

    Submitted 1 January, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Journal ref: Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2023

  2. arXiv:1906.02416  [pdf, other]

    stat.ML cs.CL cs.IR cs.LG

    Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models

    Authors: Alexander Terenin, Måns Magnusson, Leif Jonsson

    Abstract: To scale non-parametric extensions of probabilistic topic models such as Latent Dirichlet allocation to larger data sets, practitioners rely increasingly on parallel and distributed systems. In this work, we study data-parallel training for the hierarchical Dirichlet process (HDP) topic model. Based upon a representation of certain conditional distributions within an HDP, we propose a doubly spars…

    Submitted 6 October, 2020; v1 submitted 6 June, 2019; originally announced June 2019.

    Journal ref: Conference on Empirical Methods in Natural Language Processing, 2020

  3. Pólya Urn Latent Dirichlet Allocation: a doubly sparse massively parallel sampler

    Authors: Alexander Terenin, Måns Magnusson, Leif Jonsson, David Draper

    Abstract: Latent Dirichlet Allocation (LDA) is a topic model widely used in natural language processing and machine learning. Most approaches to training the model rely on iterative algorithms, which makes it difficult to run LDA on big corpora that are best analyzed in parallel and distributed computational environments. Indeed, current approaches to parallel inference either don't converge to the correct…

    Submitted 22 October, 2020; v1 submitted 11 April, 2017; originally announced April 2017.

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence 41(7):1709-1719, 2019

  4. arXiv:1602.00260  [pdf, other]

    stat.ML

    DOLDA - a regularized supervised topic model for high-dimensional multi-class regression

    Authors: Måns Magnusson, Leif Jonsson, Mattias Villani

    Abstract: Generating user interpretable multi-class predictions in data rich environments with many classes and explanatory covariates is a daunting task. We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle both many classes as well as many covariates. To handle many classes we use the recently proposed Diagonal Orthant (…

    Submitted 20 October, 2016; v1 submitted 31 January, 2016; originally announced February 2016.

  5. arXiv:1506.03784  [pdf, other]

    stat.ML stat.ME

    Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models

    Authors: Måns Magnusson, Leif Jonsson, Mattias Villani, David Broman

    Abstract: Topic models, and more specifically the class of Latent Dirichlet Allocation (LDA), are widely used for probabilistic modeling of text. MCMC sampling from the posterior distribution is typically performed using a collapsed Gibbs sampler. We propose a parallel sparse partially collapsed Gibbs sampler and compare its speed and efficiency to state-of-the-art samplers for topic models on five well-kno…

    Submitted 15 August, 2017; v1 submitted 11 June, 2015; originally announced June 2015.

    Comments: Accepted for publication in Journal of Computational and Graphical Statistics