Showing 1–12 of 12 results for author: Reul, C

  1. arXiv:2201.07661  [pdf, other]

    cs.CV

    Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning

    Authors: Christian Reul, Stefan Tomasek, Florian Langhanki, Uwe Springmann

    Abstract: This paper deals with the task of practical and open source Handwritten Text Recognition (HTR) on German medieval manuscripts. We report on our efforts to construct mixed recognition models which can be applied out-of-the-box without any further document-specific training but also serve as a starting point for finetuning by training a new model on a few pages of transcribed text (ground truth). To…

    Submitted 19 January, 2022; originally announced January 2022.

  2. arXiv:2106.07881  [pdf]

    cs.CV

    Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning

    Authors: Christian Reul, Christoph Wick, Maximilian Nöth, Andreas Büttner, Maximilian Wehner, Uwe Springmann

    Abstract: In order to apply Optical Character Recognition (OCR) to historical printings of Latin script fully automatically, we report on our efforts to construct a widely-applicable polyfont recognition model yielding text with a Character Error Rate (CER) around 2% when applied out-of-the-box. Moreover, we show how this model can be further finetuned to specific classes of printings with little manual and…

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: submitted to HIP'21

  3. OCR4all -- An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

    Authors: Christian Reul, Dennis Christ, Alexander Hartelt, Nico Balbach, Maximilian Wehner, Uwe Springmann, Christoph Wick, Christine Grundig, Andreas Büttner, Frank Puppe

    Abstract: Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition and post-processin…

    Submitted 9 September, 2019; originally announced September 2019.

    Comments: submitted to MDPI - Applied Sciences

    Journal ref: https://www.mdpi.com/2076-3417/9/22/4853/htm

  4. arXiv:1810.03436  [pdf]

    cs.CV

    State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines

    Authors: Christian Reul, Uwe Springmann, Christoph Wick, Frank Puppe

    Abstract: In this paper we evaluate Optical Character Recognition (OCR) of 19th century Fraktur scripts without book-specific training using mixed models, i.e. models trained to recognize a variety of fonts and typesets from previously unseen sources. We describe the training process leading to strong mixed OCR models and compare them to freely available models of the popular open source engines OCRopus and…

    Submitted 8 October, 2018; originally announced October 2018.

    Comments: Submitted to DHd 2019 (https://dhd2019.org/) which demands a... creative... submission format. Consequently, some captions might look weird and some links aren't clickable. Extended version with more technical details and some fixes to follow

  5. arXiv:1809.05501  [pdf]

    cs.CL

    Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

    Authors: Uwe Springmann, Christian Reul, Stefanie Dipper, Johannes Baiter

    Abstract: In this paper we describe a dataset of German and Latin ground truth (GT) for historical OCR in the form of printed text line images paired with their transcription. This dataset, called GT4HistOCR, consists of 313,173 line pairs covering a wide period of printing dates from incunabula from the 15th century to 19th century books printed in Fraktur types and is openly available un…

    Submitted 14 September, 2018; originally announced September 2018.

    Comments: Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on Automatic Text and Layout Recognition

  6. arXiv:1807.02004  [pdf]

    cs.CV

    Calamari - A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

    Authors: Christoph Wick, Christian Reul, Frank Puppe

    Abstract: Optical Character Recognition (OCR) on contemporary and historical data is still in the focus of many researchers. Especially historical prints require book specific trained OCR models to achieve applicable results (Springmann and Lüdeling, 2016, Reul et al., 2017a). To reduce the human effort for manually annotating ground truth (GT) various techniques such as voting and pretraining have shown to…

    Submitted 6 August, 2018; v1 submitted 5 July, 2018; originally announced July 2018.

    Comments: 11 pages, 3 figures

    Journal ref: Digital Humanities Quarterly 14 (2), 2020

  7. arXiv:1802.10038  [pdf, other]

    cs.CV

    Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning

    Authors: Christian Reul, Uwe Springmann, Christoph Wick, Frank Puppe

    Abstract: We combine three methods which significantly improve the OCR accuracy of OCR models trained on early printed books: (1) The pretraining method utilizes the information stored in already existing models trained on a variety of typesets (mixed models) instead of starting the training from scratch. (2) Performing cross fold training on a single set of ground truth data (line images and their transcri…

    Submitted 28 February, 2018; v1 submitted 27 February, 2018; originally announced February 2018.

    Comments: Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on Automatic Text and Layout Recognition

  8. arXiv:1802.10033  [pdf, other]

    cs.CV cs.DL

    Improving OCR Accuracy on Early Printed Books using Deep Convolutional Networks

    Authors: Christoph Wick, Christian Reul, Frank Puppe

    Abstract: This paper proposes a combination of a convolutional and a LSTM network to improve the accuracy of OCR on early printed books. While the standard model of line based OCR uses a single LSTM layer, we utilize a CNN- and Pooling-Layer combination in advance of an LSTM layer. Due to the higher amount of trainable parameters the performance of the network relies on a high amount of training examples to…

    Submitted 27 February, 2018; originally announced February 2018.

    Comments: 16 pages, 4 figures, 8 tables, submitted to JLCL Volume 33 (2018), Issue 1

  9. arXiv:1712.05586  [pdf]

    cs.CV

    Transfer Learning for OCRopus Model Training on Early Printed Books

    Authors: Christian Reul, Christoph Wick, Uwe Springmann, Frank Puppe

    Abstract: A method is presented that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books when only small amounts of diplomatic transcriptions are available. This is achieved by building from already existing models during training instead of starting from scratch. To overcome the discrepancies between the set of characters of the pretraine…

    Submitted 21 December, 2017; v1 submitted 15 December, 2017; originally announced December 2017.

  10. Improving OCR Accuracy on Early Printed Books by utilizing Cross Fold Training and Voting

    Authors: Christian Reul, Uwe Springmann, Christoph Wick, Frank Puppe

    Abstract: In this paper we introduce a method that significantly reduces the character error rates for OCR text obtained from OCRopus models trained on early printed books. The method uses a combination of cross fold training and confidence based voting. After allocating the available ground truth in different subsets several training processes are performed, each resulting in a specific OCR model. The OCR…

    Submitted 27 November, 2017; originally announced November 2017.

  11. arXiv:1701.07396  [pdf]

    cs.CV cs.AI

    LAREX - A semi-automatic open-source Tool for Layout Analysis and Region Extraction on Early Printed Books

    Authors: Christian Reul, Uwe Springmann, Frank Puppe

    Abstract: A semi-automatic open-source tool for layout analysis on early printed books is presented. LAREX uses a rule based connected components approach which is very fast, easily comprehensible for the user and allows an intuitive manual correction if necessary. The PageXML format is used to support integration into existing OCR workflows. Evaluations showed that LAREX provides an efficient and flexible…

    Submitted 20 January, 2017; originally announced January 2017.

  12. arXiv:1701.07395  [pdf]

    cs.CV

    Case Study of a highly automated Layout Analysis and OCR of an incunabulum: 'Der Heiligen Leben' (1488)

    Authors: Christian Reul, Marco Dittrich, Martin Gruner

    Abstract: This paper provides the first thorough documentation of a high quality digitization process applied to an early printed book from the incunabulum period (1450-1500). The entire OCR related workflow including preprocessing, layout analysis and text recognition is illustrated in detail using the example of 'Der Heiligen Leben', printed in Nuremberg in 1488. For each step the required time expenditur…

    Submitted 20 January, 2017; originally announced January 2017.