SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Authors: Ahmed Nassar, Andres Marafioti, Matteo Omenetti, Maksym Lysak, Nikolaos Livathinos, Christoph Auer, Lucas Morin, Rafael Teixeira de Lima, Yusik Kim, A. Said Gurbuz, Michele Dolfi, Miquel Farré, Peter W. J. Staar

Abstract: We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M parameters vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms -- significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available, datasets will be publicly available soon.

Submitted 14 March, 2025; originally announced March 2025.

Comments: 24 pages, 10 figures

arXiv:2502.09927 [pdf, other]

Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

Authors: Granite Vision Team, Leonid Karlinsky, Assaf Arbelle, Abraham Daniels, Ahmed Nassar, Amit Alfassi, Bo Wu, Eli Schwartz, Dhiraj Joshi, Jovana Kondic, Nimrod Shabtay, Pengyuan Li, Roei Herzig, Shafiq Abedin, Shaked Perek, Sivan Harary, Udi Barzelay, Adi Raz Goldfarb, Aude Oliva, Ben Wieles, Bishwaranjan Bhattacharjee, Brandon Huang, Christoph Auer, Dan Gutfreund, David Beymer, et al. (38 additional authors not shown)

Abstract: We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated safety classification approach in test-time that leverages a sparse set of attention vectors to identify potential harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results in standard benchmarks related to visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published Arxiv papers. We are releasing the model under the Apache-2 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.

Submitted 14 February, 2025; originally announced February 2025.

arXiv:2501.17887 [pdf, other]

Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion

Authors: Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, Peter W. J. Staar

Abstract: We introduce Docling, an easy-to-use, self-contained, MIT-licensed, open-source toolkit for document conversion, that can parse several types of popular document formats into a unified, richly structured representation. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. Docling is released as a Python package and can be used as a Python API or as a CLI tool. Docling's modular architecture and efficient document representation make it easy to implement extensions, new features, models, and customizations. Docling has been already integrated in other popular open-source frameworks (e.g., LangChain, LlamaIndex, spaCy), making it a natural fit for the processing of documents and the development of high-end applications. The open-source community has fully engaged in using, promoting, and developing for Docling, which gathered 10k stars on GitHub in less than a month and was reported as the No. 1 trending repository in GitHub worldwide in November 2024.

Submitted 27 January, 2025; originally announced January 2025.

Comments: Accepted to AAAI 25: Workshop on Open-Source AI for Mainstream Use

arXiv:2411.19710 [pdf, other]

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

Authors: Rafael Teixeira de Lima, Shubham Gupta, Cesar Berrospi, Lokesh Mishra, Michele Dolfi, Peter Staar, Panagiotis Vagenas

Abstract: Retrieval Augmented Generation (RAG) systems are a widespread application of Large Language Models (LLMs) in the industry. While many tools exist empowering developers to build their own systems, measuring their performance locally, with datasets reflective of the system's use cases, is a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question and answer (Q&A) datasets to assess retrieval performance can lead to non-optimal systems design, and that common tools for RAG dataset generation can lead to unbalanced data. We propose solutions to these issues based on the characterization of RAG datasets through labels and through label-targeted data generation. Finally, we show that fine-tuned small LLMs can efficiently generate Q&A datasets. We believe that these observations are invaluable to the know-your-data step of RAG systems development.

Submitted 29 November, 2024; originally announced November 2024.

Comments: to be published in the 31st International Conference on Computational Linguistics (COLING 2025)

arXiv:2409.18164 [pdf]

Data-Prep-Kit: getting your data ready for LLM application development

Authors: David Wood, Boris Lublinsky, Alexy Roytman, Shivdeep Singh, Constantin Adam, Abdulhamid Adebayo, Sungeun An, Yuan Chi Chang, Xuan-Hong Dang, Nirmit Desai, Michele Dolfi, Hajar Emami-Gohari, Revital Eres, Takuya Goto, Dhiraj Joshi, Yan Koyfman, Mohammad Nassar, Hima Patel, Paramesvaran Selvam, Yousaf Shah, Saptha Surendran, Daiki Tsuzuku, Petros Zerfos, Shahrokh Daijavad

Abstract: Data preparation is the first and a very important step towards any Large Language Model (LLM) development. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK they can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU Cores. DPK comes with a highly scalable, yet extensible set of modules that transform natural language and code data. If the user needs additional transforms, they can be easily developed using extensive DPK support for transform creation. These modules can be used independently or pipelined to perform a series of operations. In this paper, we describe DPK architecture and show its performance from a small scale to a very large number of CPUs. The modules from DPK have been used for the preparation of Granite Models [1] [2]. We believe DPK is a valuable contribution to the AI community to easily prepare data to enhance the performance of their LLM models or to fine-tune models with Retrieval-Augmented Generation (RAG).

Submitted 12 November, 2024; v1 submitted 26 September, 2024; originally announced September 2024.

Comments: 10 pages, 7 figures

arXiv:2408.09869 [pdf, other]

Docling Technical Report

Authors: Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Nikolaos Livathinos, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Fabian Lindlbauer, Kasper Dinkla, Lokesh Mishra, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, Peter W. J. Staar

Abstract: This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.

Submitted 9 December, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

Comments: Docling v1 report

arXiv:2406.19102 [pdf, other]

doi 10.18653/v1/2024.climatenlp-1.15

Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

Authors: Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, Peter Staar

Abstract: Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.

Submitted 27 June, 2024; originally announced June 2024.

Comments: Accepted at the NLP4Climate workshop in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

Journal ref: Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024), pages 193-214, Bangkok, Thailand. Association for Computational Linguistics

arXiv:2405.10725 [pdf, other]

INDUS: Effective and Efficient Language Models for Scientific Applications

Authors: Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Muthukumaran Ramasubramanian, Takuma Udagawa, Iksha Gurung, Nishan Pantha, Rong Zhang, Bharath Dandala, Rahul Ramachandran, Manil Maskey, Kaylin Bugbee, Mike Little, Elizabeth Fancher, Irina Gerasimov, Armin Mehrabian, Lauren Sanders, Sylvain Costes, Sergi Blanco-Cuaresma, Kelly Lockhart, Thomas Allen, Felix Grezes, Megan Ansdell, Alberto Accomazzi, Yousef El-Kurdi, et al. (11 additional authors not shown)

Abstract: Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely-related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models include: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning based text embedding model trained using a diverse set of datasets to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation for applications which have latency or resource constraints. We also created three new scientific benchmark datasets, CLIMATE-CHANGE NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR (IR) to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SCIBERT) encoders on these new tasks as well as existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings -- as a retrieval model for large-scale vector search applications and in automatic content tagging systems.

Submitted 30 October, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

Comments: EMNLP 2024 (Industry Track)

arXiv:2311.18481 [pdf, other]

doi 10.1609/aaai.v38i21.30574

ESG Accountability Made Easy: DocQA at Your Service

Authors: Lokesh Mishra, Cesar Berrospi, Kasper Dinkla, Diego Antognini, Francesco Fusco, Benedikt Bothur, Maksym Lysak, Nikolaos Livathinos, Ahmed Nassar, Panagiotis Vagenas, Lucas Morin, Christoph Auer, Michele Dolfi, Peter Staar

Abstract: We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.

Submitted 30 November, 2023; originally announced November 2023.

Comments: Accepted at the Demonstration Track of the 38th Annual AAAI Conference on Artificial Intelligence (AAAI 24)

Journal ref: AAAI 2024, 38, 23814-23816

arXiv:2305.14962 [pdf, other]

doi 10.1007/978-3-031-41679-8_27

ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

Authors: Christoph Auer, Ahmed Nassar, Maksym Lysak, Michele Dolfi, Nikolaos Livathinos, Peter Staar

Abstract: Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. Recovering the layout structure and content from PDF files or scanned material has remained a key problem for decades. ICDAR has a long tradition in hosting competitions to benchmark the state-of-the-art and encourage the development of novel solutions to document layout understanding. In this report, we present the results of our ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents, which posed the challenge to accurately segment the page layout in a broad range of document styles and domains, including corporate reports, technical literature and patents. To raise the bar over previous competitions, we engineered a hard competition dataset and proposed the recent DocLayNet dataset for training. We recorded 45 team registrations and received official submissions from 21 teams. In the presented solutions, we recognize interesting combinations of recent computer vision models, data augmentation strategies and ensemble methods to achieve remarkable accuracy in the task we posed. A clear trend towards adoption of vision-transformer based methods is evident. The results demonstrate substantial progress towards achieving robust and highly generalizing methods for document layout understanding.

Submitted 24 May, 2023; originally announced May 2023.

Comments: ICDAR 2023, 10 pages, 4 figures

arXiv:2209.03648 [pdf, other]

FETA: Towards Specializing Foundation Models for Expert Task Applications

Authors: Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, Peter W. J. Staar, Rogerio Feris, Leonid Karlinsky

Abstract: Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail part of the data distribution of the huge datasets used for FM pre-training. This underlines the necessity to explicitly evaluate and finetune FMs on such expert tasks, arguably ones that appear the most in practical real-world applications. In this paper, we propose a first of its kind FETA benchmark built around the task of teaching FMs to understand technical documentation, via learning to match their graphical illustrations to corresponding language descriptions. Our FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. FETA is equipped with a procedure for completely automatic annotation extraction (code would be released upon acceptance), allowing easy extension of FETA to more documentation types and application domains in the future. Our automatic annotation leads to an automated performance metric shown to be consistent with metrics computed on human-curated annotations (also released).
We provide multiple baselines and analysis of popular FMs on FETA leading to several interesting findings that we believe would be very valuable to the FM community, paving the way towards real-world application of FMs for practical expert tasks currently 'overlooked' by standard benchmarks focusing on common objects.

Submitted 19 December, 2022; v1 submitted 8 September, 2022; originally announced September 2022.

arXiv:2206.01062 [pdf, other]

doi 10.1145/3534678.3539043

DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Authors: Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, Peter W J Staar

Abstract: Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size.
Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.

Submitted 2 June, 2022; originally announced June 2022.

Comments: 9 pages, 6 figures, 5 tables. Accepted paper at SIGKDD 2022 conference

arXiv:2206.00785 [pdf, other]

doi 10.1109/CLOUD55607.2022.00060

Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness

Authors: Christoph Auer, Michele Dolfi, André Carvalho, Cesar Berrospi Ramis, Peter W. J. Staar

Abstract: Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a strong reliance on machine-learning methods on cloud infrastructure. Our key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced. Evidence for the scaling behavior and resource efficiency is provided for two alternative workload distribution strategies and deployment configurations. Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.

Submitted 1 June, 2022; originally announced June 2022.

Comments: 11 pages, 7 figures, to be published in IEEE CLOUD 2022

ACM Class: I.7.5; I.2.1; C.1.4; C.4

arXiv:2102.09395 [pdf, other]

Robust PDF Document Conversion Using Recurrent Neural Networks

Authors: Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, Peter Staar

Abstract: The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature. We demonstrate how a sequence of PDF printing commands can be used as input into a neural network and how the network can learn to classify each printing command according to its structural function in the page. This approach has three advantages: First, it can distinguish among more fine-grained labels (typically 10-20 labels as opposed to 1-5 with visual methods), which results in a more accurate and detailed document structure resolution. Second, it can take into account the text flow across pages more naturally compared to visual methods because it can concatenate the printing commands of sequential pages. Last, our proposed method needs less memory and it is computationally less expensive than visual methods. This allows us to deploy such models in production environments at a much lower cost. Through extensive architectural search in combination with advanced feature engineering, we were able to implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels.
The best model we achieved is currently served in production environments on our Corpus Conversion Service (CCS), which was presented at KDD18 (arXiv:1806.02284). This model enhances the capabilities of CCS significantly, as it eliminates the need for human annotated label ground-truth for every unseen document layout. This proved particularly useful when applied to a huge corpus of PDF articles related to COVID-19.

Submitted 18 February, 2021; originally announced February 2021.

Comments: 9 pages, 2 tables, 4 figures, uses aaai21.sty. Accepted at the "Thirty-Third Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-21)". Received the "IAAI-21 Innovative Application Award"

ACM Class: I.7.5; I.5.1; I.5.2; I.5.4; I.5.5; I.2.1

arXiv:1907.08400 [pdf, other]

An Information Extraction and Knowledge Graph Platform for Accelerating Biochemical Discoveries

Authors: Matteo Manica, Christoph Auer, Valery Weber, Federico Zipoli, Michele Dolfi, Peter Staar, Teodoro Laino, Costas Bekas, Akihiro Fujita, Hiroki Toda, Shuichi Hirose, Yasumitsu Orii

Abstract: Information extraction and data mining in biochemical literature is a daunting task that demands resource-intensive computation and appropriate means to scale knowledge ingestion. Being able to leverage this immense source of technical information helps to drastically reduce costs and time to solution in multiple application fields from food safety to pharmaceutics. We present a scalable document ingestion system that integrates data from databases and publications (in PDF format) in a biochemistry knowledge graph (BCKG). The BCKG is a comprehensive source of knowledge that can be queried to retrieve known biochemical facts and to generate novel insights. After describing the knowledge ingestion framework, we showcase an application of our system in the field of carbohydrate enzymes. The BCKG represents a way to scale knowledge ingestion and automatically exploit prior knowledge to accelerate discovery in biochemical sciences.

Submitted 19 July, 2019; originally announced July 2019.

Comments: 4 pages, 1 figure, Workshop on Applied Data Science for Healthcare at KDD, Anchorage, AK, 2019

arXiv:1903.12184 [pdf, other]

doi 10.1103/PhysRevB.100.075138

Understanding repulsively mediated superconductivity of correlated electrons via massively parallel DMRG

Authors: Adrian Kantian, Michele Dolfi, Matthias Troyer, Thierry Giamarchi

Abstract: The so-called minimal models of unconventional superconductivity are lattice models of interacting electrons derived from materials in which electron pairing arises from purely repulsive interactions. Showing unambiguously that a minimal model actually can have a superconducting ground state remains a challenge at nonperturbative interactions. We make a significant step in this direction by computing ground states of the 2D U-V Hubbard model - the minimal model of the quasi-1D superconductors - by parallelized DMRG, which allows for systematic control of any bias and is sign-problem-free. Using distributed-memory supercomputers and leveraging the advantages of the U-V model, we can treat unprecedented sizes of 2D strips and extrapolate their spin gap both to zero approximation error and the thermodynamic limit. Our results for the spin gap are shown to be compatible with a spin excitation spectrum that is either fully gapped or has zeros only in discrete points, and conversely that a Fermi liquid or magnetically ordered ground state is incompatible with them. Coupled with the enhancement to short-range correlations that we find exclusively in the $d_{xy}$ pairing-channel, this allows us to build an indirect case for the ground state of this model having superconducting order in the full 2D limit, and ruling out the other main possible phases, magnetic orders and Fermi liquids.

Submitted 6 August, 2019; v1 submitted 28 March, 2019; originally announced March 2019.

Comments: Final published version: 18 pages including appendix, 11 figures

Journal ref: Phys. Rev. B 100, 075138 (2019)

arXiv:1806.02284 [pdf, other]

doi 10.1145/3219819.3219834

Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale

Authors: Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas

Abstract: Over the past few decades, the amount of scientific articles and technical literature has increased exponentially in size. Consequently, there is a great need for systems that can ingest these documents at scale and make the contained knowledge discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) as well as the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. In this paper, we present a modular, cloud-based platform to ingest documents at scale. This platform, called the Corpus Conversion Service (CCS), implements a pipeline which allows users to parse and annotate documents (i.e. collect ground-truth), train machine-learning classification algorithms and ultimately convert any type of PDF or bitmap-documents to a structured content representation format. We will show that each of the modules is scalable due to an asynchronous microservice architecture and can therefore handle massive amounts of documents. Furthermore, we will show that our capability to gather ground-truth is accelerated by machine-learning algorithms by at least one order of magnitude. This allows us to both gather large amounts of ground-truth in very little time and obtain very good precision/recall metrics in the range of 99% with regard to content conversion to structured output. The CCS platform is currently deployed on IBM internal infrastructure and serving more than 250 active users for knowledge-engineering project engagements.

Submitted 24 May, 2018; originally announced June 2018.

Comments: Accepted paper at KDD 2018 conference

arXiv:1805.09687 [pdf, other]

doi 10.13140/RG.2.2.10858.82888

Corpus Conversion Service: A machine learning platform to ingest documents at scale [Poster abstract]

Authors: Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas

Abstract: Over the past few decades, the amount of scientific articles and technical literature has increased exponentially in size. Consequently, there is a great need for systems that can ingest these documents at scale and make their content discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) as well as the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. We present a platform to ingest documents at scale which is powered by Machine Learning techniques and allows the user to train custom models on document collections. We show precision/recall results greater than 97% with regard to conversion to structured formats, as well as scaling evidence for each of the microservices constituting the platform.

Submitted 15 May, 2018; originally announced May 2018.

Comments: Accepted in SysML 2018 (www.sysml.cc)

arXiv:1607.06352 [pdf, ps, other]

doi 10.1103/PhysRevA.94.063404

Density redistribution effects in fermionic optical lattices

Authors: Medha Soni, Michele Dolfi, Matthias Troyer

Abstract: We simulate a one-dimensional fermionic optical lattice to analyse heating due to non-adiabatic lattice loading. Our simulations reveal that, similar to the bosonic case, density redistribution effects are the major cause of heating in harmonic traps. We suggest protocols to modulate the local density distribution during the process of lattice loading, in order to reduce the excess energy. Our numerical results confirm that linear interpolation of the trapping potential and/or the interaction strength is an efficient method of doing so, bearing practical applications relevant to experiments.

Submitted 21 July, 2016; originally announced July 2016.

Comments: 10 pages, 16 pages

Journal ref: Phys. Rev. A 94, 063404 (2016)

arXiv:1510.02026 [pdf, other]

doi 10.1063/1.4939000

An Efficient Matrix Product Operator Representation of the Quantum-Chemical Hamiltonian

Authors: Sebastian Keller, Michele Dolfi, Matthias Troyer, Markus Reiher

Abstract: We describe how to efficiently construct the quantum chemical Hamiltonian operator in matrix product form. We present its implementation as a density matrix renormalization group (DMRG) algorithm for quantum chemical applications in a purely matrix product based framework. Existing implementations of DMRG for quantum chemistry are based on the traditional formulation of the method, which was developed from a viewpoint of Hilbert space decimation and attained a higher performance compared to straightforward implementations of matrix product based DMRG. The latter variationally optimizes a class of ansatz states known as matrix product states (MPS), where operators are correspondingly represented as matrix product operators (MPO). The MPO construction scheme presented here eliminates the previous performance disadvantages while retaining the additional flexibility provided by a matrix product approach; for example, the specification of expectation values becomes an input parameter. In this way, MPOs for different symmetries - abelian and non-abelian - and different relativistic and non-relativistic models may be solved by an otherwise unmodified program.

Submitted 7 October, 2015; originally announced October 2015.

Comments: 11 pages, 7 figures

Journal ref: J. Chem. Phys. 143, 244118 (2015)

arXiv:1509.04709 [pdf, other]

doi 10.1103/PhysRevB.92.195139

Pair Correlations in Doped Hubbard Ladders

Authors: Michele Dolfi, Bela Bauer, Sebastian Keller, Matthias Troyer

Abstract: Hubbard ladders are an important stepping stone to the physics of the two-dimensional Hubbard model. While many of their properties are accessible to numerical and analytical techniques, the question of whether weakly hole-doped Hubbard ladders are dominated by superconducting or charge-density-wave correlations has so far eluded a definitive answer. In particular, previous numerical simulations of Hubbard ladders have seen a much faster decay of superconducting correlations than expected based on analytical arguments. We revisit this question using a state-of-the-art implementation of the density matrix renormalization group algorithm that allows us to simulate larger system sizes with higher accuracy than before. Performing careful extrapolations of the results, we obtain improved estimates for the Luttinger liquid parameter and the correlation functions at long distances. Our results confirm that, as suggested by analytical considerations, superconducting correlations become dominant in the limit of very small doping.

Submitted 20 November, 2015; v1 submitted 15 September, 2015; originally announced September 2015.

Comments: 10 pages, 12 figures, data analysis included

Journal ref: Phys. Rev. B 92, 195139 (2015)

arXiv:1410.5829 [pdf, other]

doi 10.1103/PhysRevA.91.033407

Minimizing nonadiabaticities in optical-lattice loading

Authors: Michele Dolfi, Adrian Kantian, Bela Bauer, Matthias Troyer

Abstract: In the quest to reach lower temperatures of ultra-cold gases in optical lattice experiments, non-adiabaticities during lattice loading are one of the limiting factors that prevent the same low temperatures from being reached as in experiments without a lattice. Simulating the loading of a bosonic quantum gas into a one-dimensional optical lattice with and without a trap, we find that the redistribution of atomic density inside a global confining potential is by far the dominant source of heating. Based on these results we propose to adjust the trapping potential during loading to minimize changes to the density distribution. Our simulations confirm that a very simple linear interpolation of the trapping potential during loading already significantly decreases the heating of a quantum gas and we discuss how loading protocols minimizing density redistributions can be designed.

Submitted 23 March, 2015; v1 submitted 21 October, 2014; originally announced October 2014.

Comments: 6 pages, 7 figures

Journal ref: Phys. Rev. A 91, 033407 (2015)

arXiv:1407.0872 [pdf, other]

doi 10.1016/j.cpc.2014.08.019

Matrix Product State applications for the ALPS project

Authors: Michele Dolfi, Bela Bauer, Sebastian Keller, Alexandr Kosenkov, Timothée Ewart, Adrian Kantian, Thierry Giamarchi, Matthias Troyer

Abstract: The density-matrix renormalization group method has become a standard computational approach to the low-energy physics as well as dynamics of low-dimensional quantum systems. In this paper, we present a new set of applications, available as part of the ALPS package, that provide an efficient and flexible implementation of these methods based on a matrix-product state (MPS) representation. Our applications implement, within the same framework, algorithms to variationally find the ground state and low-lying excited states as well as simulate the time evolution of arbitrary one-dimensional and two-dimensional models. Implementing the conservation of quantum numbers for generic Abelian symmetries, we achieve performance competitive with the best codes in the community. Example results are provided for (i) a model of itinerant fermions in one dimension and (ii) a model of quantum magnetism.

Submitted 14 October, 2014; v1 submitted 3 July, 2014; originally announced July 2014.

Comments: 11+5 pages, 8 figures, 2 examples

Journal ref: Comput. Phys. Commun. 185, 3430 (2014)

arXiv:1404.1259 [pdf, other]

doi 10.1088/1742-5468/2014/06/P06012

Hybridization expansion Monte Carlo simulation of multi-orbital quantum impurity problems: matrix product formalism and improved Monte Carlo sampling

Authors: Hiroshi Shinaoka, Michele Dolfi, Matthias Troyer, Philipp Werner

Abstract: We explore two complementary modifications of the hybridization-expansion continuous-time Monte Carlo method, aiming at large multi-orbital quantum impurity problems. One idea is to compute the imaginary-time propagation using a matrix product states representation. We show that bond dimensions considerably smaller than the dimension of the Hilbert space are sufficient to obtain accurate results, and that this approach scales polynomially, rather than exponentially, with the number of orbitals. Based on scaling analyses, we conclude that a matrix product state implementation will outperform the exact-diagonalization based method for quantum impurity problems with more than 12 orbitals. The second idea is an improved Monte Carlo sampling scheme which is applicable to all variants of the hybridization expansion method. We show that this so-called sliding window sampling scheme speeds up the simulation by at least an order of magnitude for a broad range of model parameters, with the largest improvements at low temperature.

Submitted 30 June, 2014; v1 submitted 4 April, 2014; originally announced April 2014.

Comments: 24 pages, 8 figures

Journal ref: J. Stat. Mech., P06012 (2014)

arXiv:1401.3017 [pdf, other]

doi 10.1038/ncomms6137

Chiral spin liquid and emergent anyons in a Kagome lattice Mott insulator

Authors: B. Bauer, L. Cincio, B. P. Keller, M. Dolfi, G. Vidal, S. Trebst, A. W. W. Ludwig

Abstract: Topological phases in frustrated quantum spin systems have fascinated researchers for decades. One of the earliest proposals for such a phase was the chiral spin liquid put forward by Kalmeyer and Laughlin in 1987 as the bosonic analogue of the fractional quantum Hall effect. Elusive for many years, recent times have finally seen a number of models that realize this phase. However, these models are somewhat artificial and unlikely to be found in realistic materials. Here, we take an important step towards the goal of finding a chiral spin liquid in nature by examining a physically motivated model for a Mott insulator on the Kagome lattice with broken time-reversal symmetry. We first provide a theoretical justification for the emergent chiral spin liquid phase in terms of a network model perspective. We then present an unambiguous numerical identification and characterization of the universal topological properties of the phase, including ground state degeneracy, edge physics, and anyonic bulk excitations, by using a variety of powerful numerical probes, including the entanglement spectrum and modular transformations.

Submitted 13 January, 2014; originally announced January 2014.

Comments: 9 pages, 9 figures; partially supersedes arXiv:1303.6963

Journal ref: Nature Communications 5, 5137 (2014)

arXiv:1401.2000 [pdf, other]

A model project for reproducible papers: critical temperature for the Ising model on a square lattice

Authors: M. Dolfi, J. Gukelberger, A. Hehn, J. Imriška, K. Pakrouski, T. F. Rønnow, M. Troyer, I. Zintchenko, F. Chirigati, J. Freire, D. Shasha

Abstract: In this paper we present a simple, yet typical simulation in statistical physics, consisting of large-scale Monte Carlo simulations followed by an involved statistical analysis of the results. The purpose is to provide an example publication to explore tools for writing reproducible papers. The simulation estimates the critical temperature where the Ising model on the square lattice becomes magnetic to be Tc/J = 2.26934(6) using a finite size scaling analysis of the crossing points of Binder cumulants. We provide a virtual machine which can be used to reproduce all figures and results.

Submitted 9 January, 2014; originally announced January 2014.

Comments: Authors are listed in alphabetical order by institution and name. 5 pages, 4 figures

arXiv:1303.6963 [pdf, other]

Gapped and gapless spin liquid phases on the Kagome lattice from chiral three-spin interactions

Authors: Bela Bauer, Brendan P. Keller, Michele Dolfi, Simon Trebst, Andreas W. W. Ludwig

Abstract: We argue that a relatively simple model containing only SU(2)-invariant chiral three-spin interactions on a Kagome lattice of S=1/2 spins can give rise to both a gapped and a gapless quantum spin liquid. Our arguments are rooted in a formulation in terms of network models of edge states and are backed up by a careful numerical analysis. For a uniform choice of chirality on the lattice, we realize the Kalmeyer-Laughlin state, i.e. a gapped spin liquid which is identified as the nu=1/2 bosonic Laughlin state. For staggered chiralities, a gapless spin liquid emerges which exhibits gapless spin excitations along lines in momentum space, a feature that we probe by studying quasi-two-dimensional systems of finite width. We thus provide a single, appealingly simple spin model (i) for what is probably the simplest realization of the Kalmeyer-Laughlin state to date, as well as (ii) for a non-Fermi liquid state with lines of gapless SU(2) spin excitations.

Submitted 7 September, 2014; v1 submitted 27 March, 2013; originally announced March 2013.

Comments: 5+5 pages, 6+6 figures. Manuscript partially superseded by arXiv:1401.3017

arXiv:1203.6363 [pdf, other]

doi 10.1103/PhysRevLett.109.020604

Multigrid Algorithms for Tensor Network States

Authors: M. Dolfi, B. Bauer, M. Troyer, Z. Ristivojevic

Abstract: The widely used density matrix renormalization group (DMRG) method often fails to converge in systems with multiple length scales, such as lattice discretizations of continuum models and dilute or weakly doped lattice models. The local optimization employed by DMRG to optimize the wave function is ineffective in updating large-scale features. Here we present a multigrid algorithm that solves these convergence problems by optimizing the wave function at different spatial resolutions. We demonstrate its effectiveness by simulating bosons in continuous space, and study non-adiabaticity when ramping up the amplitude of an optical lattice. The algorithm can be generalized to tensor network methods, and be combined with the contractor renormalization group (CORE) method to study dilute and weakly doped lattice models.

Submitted 12 June, 2012; v1 submitted 28 March, 2012; originally announced March 2012.

Comments: 5 pages, 7 figures. Accepted for publication in PRL

Journal ref: Phys. Rev. Lett. 109, 020604 (2012)

arXiv:1103.0740 [pdf, ps, other]

Kinetics of double stranded DNA overstretching revealed by 0.5-2 pN force steps

Authors: Pasquale Bianco, Lorenzo Bongini, Luca Melli, Mario Dolfi, Vincenzo Lombardi

Abstract: A detailed description of the conformational plasticity of double stranded DNA (ds) is a necessary framework for understanding protein-DNA interactions. Until now, however, the structure and kinetics of the transition from the basic conformation of ds-DNA (B state) to the 1.7 times longer and partially unwound conformation (S state) have not been defined. The force-extension relation of the ds-DNA of lambda-phage is measured here with unprecedented resolution using dual laser optical tweezers that can impose millisecond force steps of 0.5-2 pN (25 C). This approach reveals the kinetics of the transition between intermediate states of ds-DNA and uncovers the load-dependence of the rate constant of the unitary reaction step. The DNA overstretching transition is shown to be essentially a two-state reaction composed of 5.85 nm steps, indicating cooperativity of ~25 base pairs. This mechanism increases the free energy for the unitary reaction to ~94 kBT, accounting for the stability of the basic conformation of DNA, and explains the absence of hysteresis in the force-extension relation at equilibrium. The novel description of the kinetics and energetics of the B-S transition of ds-DNA improves our understanding of the biological role of the S state in the interplay between mechanics and enzymology of the DNA-protein machinery.

Submitted 3 March, 2011; originally announced March 2011.

Comments: 13 pages, 10 figures
