Showing 1–5 of 5 results for author: Dahl, C M

  1. arXiv:2402.13604

    cs.CL econ.EM

    Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE

    Authors: Christian Møller Dahl, Torben Johansen, Christian Vedel

    Abstract: This paper introduces a new tool, OccCANINE, to automatically transform occupational descriptions into the HISCO classification system. The manual work involved in processing and classifying occupational descriptions is error-prone, tedious, and time-consuming. We finetune a preexisting language model (CANINE) to do this automatically, thereby performing in seconds and minutes what previously took…

    Submitted 2 April, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

    Comments: All code and guides on how to use OccCANINE are available on GitHub: https://github.com/christianvedels/OccCANINE

    ACM Class: I.2.7; I.7.0

  2. arXiv:2210.00503

    cs.CV

    DARE: A large-scale handwritten date recognition system

    Authors: Christian M. Dahl, Torben S. D. Johansen, Emil N. Sørensen, Christian E. Westermann, Simon F. Wittrock

    Abstract: Handwritten text recognition for historical documents is an important task but it remains difficult due to a lack of sufficient training data in combination with a large variability of writing styles and degradation of historical documents. While recurrent neural network architectures are commonly used for handwritten text recognition, they are often computationally expensive to train and the bene…

    Submitted 2 October, 2022; originally announced October 2022.

  3. arXiv:2102.03239

    cs.CV econ.EM stat.ML

    Applications of Machine Learning in Document Digitisation

    Authors: Christian M. Dahl, Torben S. D. Johansen, Emil N. Sørensen, Christian E. Westermann, Simon F. Wittrock

    Abstract: Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that 'large and detailed' usually implies 'costly and difficult', especially when the data medium is paper…

    Submitted 5 February, 2021; originally announced February 2021.

  4. arXiv:2102.00208

    cs.LG econ.EM stat.ME stat.ML

    Time Series (re)sampling using Generative Adversarial Networks

    Authors: Christian M. Dahl, Emil N. Sørensen

    Abstract: We propose a novel bootstrap procedure for dependent data based on Generative Adversarial networks (GANs). We show that the dynamics of common stationary time series processes can be learned by GANs and demonstrate that GANs trained on a single sample path can be used to generate additional samples from the process. We find that temporal convolutional neural networks provide a suitable design for…

    Submitted 30 January, 2021; originally announced February 2021.

  5. arXiv:2101.10862

    cs.CV econ.EM

    HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition

    Authors: Christian M. Dahl, Torben Johansen, Emil N. Sørensen, Simon Wittrock

    Abstract: Methods for linking individuals across historical data sets, typically in combination with AI based transcription models, are developing rapidly. Probably the single most important identifier for linking is personal names. However, personal names are prone to enumeration and transcription errors and although modern linking methods are designed to handle such challenges, these sources of errors are…

    Submitted 10 March, 2022; v1 submitted 22 January, 2021; originally announced January 2021.