-
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
Authors:
Ahmed Nassar,
Andres Marafioti,
Matteo Omenetti,
Maksym Lysak,
Nikolaos Livathinos,
Christoph Auer,
Lucas Morin,
Rafael Teixeira de Lima,
Yusik Kim,
A. Said Gurbuz,
Michele Dolfi,
Miquel Farré,
Peter W. J. Staar
Abstract:
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M parameters vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms -- significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available, datasets will be publicly available soon.
Submitted 14 March, 2025;
originally announced March 2025.
-
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
Authors:
Granite Vision Team,
Leonid Karlinsky,
Assaf Arbelle,
Abraham Daniels,
Ahmed Nassar,
Amit Alfassi,
Bo Wu,
Eli Schwartz,
Dhiraj Joshi,
Jovana Kondic,
Nimrod Shabtay,
Pengyuan Li,
Roei Herzig,
Shafiq Abedin,
Shaked Perek,
Sivan Harary,
Udi Barzelay,
Adi Raz Goldfarb,
Aude Oliva,
Ben Wieles,
Bishwaranjan Bhattacharjee,
Brandon Huang,
Christoph Auer,
Dan Gutfreund,
David Beymer
, et al. (38 additional authors not shown)
Abstract:
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated safety classification approach in test-time that leverages a sparse set of attention vectors to identify potential harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results in standard benchmarks related to visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published Arxiv papers. We are releasing the model under the Apache-2 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.
Submitted 14 February, 2025;
originally announced February 2025.
-
Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion
Authors:
Nikolaos Livathinos,
Christoph Auer,
Maksym Lysak,
Ahmed Nassar,
Michele Dolfi,
Panos Vagenas,
Cesar Berrospi Ramis,
Matteo Omenetti,
Kasper Dinkla,
Yusik Kim,
Shubham Gupta,
Rafael Teixeira de Lima,
Valery Weber,
Lucas Morin,
Ingmar Meijer,
Viktor Kuropiatnyk,
Peter W. J. Staar
Abstract:
We introduce Docling, an easy-to-use, self-contained, MIT-licensed, open-source toolkit for document conversion, that can parse several types of popular document formats into a unified, richly structured representation. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. Docling is released as a Python package and can be used as a Python API or as a CLI tool. Docling's modular architecture and efficient document representation make it easy to implement extensions, new features, models, and customizations. Docling has been already integrated in other popular open-source frameworks (e.g., LangChain, LlamaIndex, spaCy), making it a natural fit for the processing of documents and the development of high-end applications. The open-source community has fully engaged in using, promoting, and developing for Docling, which gathered 10k stars on GitHub in less than a month and was reported as the No. 1 trending repository in GitHub worldwide in November 2024.
Submitted 27 January, 2025;
originally announced January 2025.
-
Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems
Authors:
Rafael Teixeira de Lima,
Shubham Gupta,
Cesar Berrospi,
Lokesh Mishra,
Michele Dolfi,
Peter Staar,
Panagiotis Vagenas
Abstract:
Retrieval Augmented Generation (RAG) systems are a widespread application of Large Language Models (LLMs) in the industry. While many tools exist empowering developers to build their own systems, measuring their performance locally, with datasets reflective of the system's use cases, is a technological challenge. Solutions to this problem range from non-specific and cheap (most public datasets) to specific and costly (generating data from local documents). In this paper, we show that using public question and answer (Q&A) datasets to assess retrieval performance can lead to non-optimal systems design, and that common tools for RAG dataset generation can lead to unbalanced data. We propose solutions to these issues based on the characterization of RAG datasets through labels and through label-targeted data generation. Finally, we show that fine-tuned small LLMs can efficiently generate Q&A datasets. We believe that these observations are invaluable to the know-your-data step of RAG systems development.
Submitted 29 November, 2024;
originally announced November 2024.
-
Data-Prep-Kit: getting your data ready for LLM application development
Authors:
David Wood,
Boris Lublinsky,
Alexy Roytman,
Shivdeep Singh,
Constantin Adam,
Abdulhamid Adebayo,
Sungeun An,
Yuan Chi Chang,
Xuan-Hong Dang,
Nirmit Desai,
Michele Dolfi,
Hajar Emami-Gohari,
Revital Eres,
Takuya Goto,
Dhiraj Joshi,
Yan Koyfman,
Mohammad Nassar,
Hima Patel,
Paramesvaran Selvam,
Yousaf Shah,
Saptha Surendran,
Daiki Tsuzuku,
Petros Zerfos,
Shahrokh Daijavad
Abstract:
Data preparation is the first and a very important step towards any Large Language Model (LLM) development. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK they can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU Cores. DPK comes with a highly scalable, yet extensible set of modules that transform natural language and code data. If the user needs additional transforms, they can be easily developed using extensive DPK support for transform creation. These modules can be used independently or pipelined to perform a series of operations. In this paper, we describe DPK architecture and show its performance from a small scale to a very large number of CPUs. The modules from DPK have been used for the preparation of Granite Models [1] [2]. We believe DPK is a valuable contribution to the AI community to easily prepare data to enhance the performance of their LLM models or to fine-tune models with Retrieval-Augmented Generation (RAG).
Submitted 12 November, 2024; v1 submitted 26 September, 2024;
originally announced September 2024.
-
Docling Technical Report
Authors:
Christoph Auer,
Maksym Lysak,
Ahmed Nassar,
Michele Dolfi,
Nikolaos Livathinos,
Panos Vagenas,
Cesar Berrospi Ramis,
Matteo Omenetti,
Fabian Lindlbauer,
Kasper Dinkla,
Lokesh Mishra,
Yusik Kim,
Shubham Gupta,
Rafael Teixeira de Lima,
Valery Weber,
Lucas Morin,
Ingmar Meijer,
Viktor Kuropiatnyk,
Peter W. J. Staar
Abstract:
This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.
Submitted 9 December, 2024; v1 submitted 19 August, 2024;
originally announced August 2024.
-
Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs
Authors:
Lokesh Mishra,
Sohayl Dhibi,
Yusik Kim,
Cesar Berrospi Ramis,
Shubham Gupta,
Michele Dolfi,
Peter Staar
Abstract:
Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.
Submitted 27 June, 2024;
originally announced June 2024.
-
INDUS: Effective and Efficient Language Models for Scientific Applications
Authors:
Bishwaranjan Bhattacharjee,
Aashka Trivedi,
Masayasu Muraoka,
Muthukumaran Ramasubramanian,
Takuma Udagawa,
Iksha Gurung,
Nishan Pantha,
Rong Zhang,
Bharath Dandala,
Rahul Ramachandran,
Manil Maskey,
Kaylin Bugbee,
Mike Little,
Elizabeth Fancher,
Irina Gerasimov,
Armin Mehrabian,
Lauren Sanders,
Sylvain Costes,
Sergi Blanco-Cuaresma,
Kelly Lockhart,
Thomas Allen,
Felix Grezes,
Megan Ansdell,
Alberto Accomazzi,
Yousef El-Kurdi
, et al. (11 additional authors not shown)
Abstract:
Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely-related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models include: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning based text embedding model trained using a diverse set of datasets to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation for applications which have latency or resource constraints. We also created three new scientific benchmark datasets, CLIMATE-CHANGE NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR (IR) to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SCIBERT) encoders on these new tasks as well as existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings -- as a retrieval model for large-scale vector search applications and in automatic content tagging systems.
Submitted 30 October, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
ESG Accountability Made Easy: DocQA at Your Service
Authors:
Lokesh Mishra,
Cesar Berrospi,
Kasper Dinkla,
Diego Antognini,
Francesco Fusco,
Benedikt Bothur,
Maksym Lysak,
Nikolaos Livathinos,
Ahmed Nassar,
Panagiotis Vagenas,
Lucas Morin,
Christoph Auer,
Michele Dolfi,
Peter Staar
Abstract:
We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.
Submitted 30 November, 2023;
originally announced November 2023.
-
ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents
Authors:
Christoph Auer,
Ahmed Nassar,
Maksym Lysak,
Michele Dolfi,
Nikolaos Livathinos,
Peter Staar
Abstract:
Transforming documents into machine-processable representations is a challenging task due to their complex structures and variability in formats. Recovering the layout structure and content from PDF files or scanned material has remained a key problem for decades. ICDAR has a long tradition in hosting competitions to benchmark the state-of-the-art and encourage the development of novel solutions to document layout understanding. In this report, we present the results of our ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents, which posed the challenge to accurately segment the page layout in a broad range of document styles and domains, including corporate reports, technical literature and patents. To raise the bar over previous competitions, we engineered a hard competition dataset and proposed the recent DocLayNet dataset for training. We recorded 45 team registrations and received official submissions from 21 teams. In the presented solutions, we recognize interesting combinations of recent computer vision models, data augmentation strategies and ensemble methods to achieve remarkable accuracy in the task we posed. A clear trend towards adoption of vision-transformer based methods is evident. The results demonstrate substantial progress towards achieving robust and highly generalizing methods for document layout understanding.
Submitted 24 May, 2023;
originally announced May 2023.
-
FETA: Towards Specializing Foundation Models for Expert Task Applications
Authors:
Amit Alfassy,
Assaf Arbelle,
Oshri Halimi,
Sivan Harary,
Roei Herzig,
Eli Schwartz,
Rameswar Panda,
Michele Dolfi,
Christoph Auer,
Kate Saenko,
Peter W. J. Staar,
Rogerio Feris,
Leonid Karlinsky
Abstract:
Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail part of the data distribution of the huge datasets used for FM pre-training. This underlines the necessity to explicitly evaluate and finetune FMs on such expert tasks, arguably ones that appear the most in practical real-world applications. In this paper, we propose a first of its kind FETA benchmark built around the task of teaching FMs to understand technical documentation, via learning to match their graphical illustrations to corresponding language descriptions. Our FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. FETA is equipped with a procedure for completely automatic annotation extraction (code would be released upon acceptance), allowing easy extension of FETA to more documentation types and application domains in the future. Our automatic annotation leads to an automated performance metric shown to be consistent with metrics computed on human-curated annotations (also released). We provide multiple baselines and analysis of popular FMs on FETA leading to several interesting findings that we believe would be very valuable to the FM community, paving the way towards real-world application of FMs for practical expert tasks currently 'overlooked' by standard benchmarks focusing on common objects.
Submitted 19 December, 2022; v1 submitted 8 September, 2022;
originally announced September 2022.
-
DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
Authors:
Birgit Pfitzmann,
Christoph Auer,
Michele Dolfi,
Ahmed S Nassar,
Peter W J Staar
Abstract:
Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
Submitted 2 June, 2022;
originally announced June 2022.
-
Delivering Document Conversion as a Cloud Service with High Throughput and Responsiveness
Authors:
Christoph Auer,
Michele Dolfi,
André Carvalho,
Cesar Berrospi Ramis,
Peter W. J. Staar
Abstract:
Document understanding is a key business process in the data-driven economy since documents are central to knowledge discovery and business insights. Converting documents into a machine-processable format is a particular challenge here due to their huge variability in formats and complex structure. Accordingly, many algorithms and machine-learning methods emerged to solve particular tasks such as Optical Character Recognition (OCR), layout analysis, table-structure recovery, figure understanding, etc. We observe the adoption of such methods in document understanding solutions offered by all major cloud providers. Yet, publications outlining how such services are designed and optimized to scale in the cloud are scarce. In this paper, we focus on the case of document conversion to illustrate the particular challenges of scaling a complex data processing pipeline with a strong reliance on machine-learning methods on cloud infrastructure. Our key objective is to achieve high scalability and responsiveness for different workload profiles in a well-defined resource budget. We outline the requirements, design, and implementation choices of our document conversion service and reflect on the challenges we faced. Evidence for the scaling behavior and resource efficiency is provided for two alternative workload distribution strategies and deployment configurations. Our best-performing method achieves sustained throughput of over one million PDF pages per hour on 3072 CPU cores across 192 nodes.
Submitted 1 June, 2022;
originally announced June 2022.
-
Robust PDF Document Conversion Using Recurrent Neural Networks
Authors:
Nikolaos Livathinos,
Cesar Berrospi,
Maksym Lysak,
Viktor Kuropiatnyk,
Ahmed Nassar,
Andre Carvalho,
Michele Dolfi,
Christoph Auer,
Kasper Dinkla,
Peter Staar
Abstract:
The number of published PDF documents has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature. We demonstrate how a sequence of PDF printing commands can be used as input into a neural network and how the network can learn to classify each printing command according to its structural function in the page. This approach has three advantages: First, it can distinguish among more fine-grained labels (typically 10-20 labels as opposed to 1-5 with visual methods), which results in a more accurate and detailed document structure resolution. Second, it can take into account the text flow across pages more naturally compared to visual methods because it can concatenate the printing commands of sequential pages. Last, our proposed method needs less memory and it is computationally less expensive than visual methods. This allows us to deploy such models in production environments at a much lower cost. Through extensive architectural search in combination with advanced feature engineering, we were able to implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels. The best model we achieved is currently served in production environments on our Corpus Conversion Service (CCS), which was presented at KDD18 (arXiv:1806.02284). This model enhances the capabilities of CCS significantly, as it eliminates the need for human annotated label ground-truth for every unseen document layout. This proved particularly useful when applied to a huge corpus of PDF articles related to COVID-19.
Submitted 18 February, 2021;
originally announced February 2021.
-
An Information Extraction and Knowledge Graph Platform for Accelerating Biochemical Discoveries
Authors:
Matteo Manica,
Christoph Auer,
Valery Weber,
Federico Zipoli,
Michele Dolfi,
Peter Staar,
Teodoro Laino,
Costas Bekas,
Akihiro Fujita,
Hiroki Toda,
Shuichi Hirose,
Yasumitsu Orii
Abstract:
Information extraction and data mining in biochemical literature is a daunting task that demands resource-intensive computation and appropriate means to scale knowledge ingestion. Being able to leverage this immense source of technical information helps to drastically reduce costs and time to solution in multiple application fields from food safety to pharmaceutics. We present a scalable document ingestion system that integrates data from databases and publications (in PDF format) in a biochemistry knowledge graph (BCKG). The BCKG is a comprehensive source of knowledge that can be queried to retrieve known biochemical facts and to generate novel insights. After describing the knowledge ingestion framework, we showcase an application of our system in the field of carbohydrate enzymes. The BCKG represents a way to scale knowledge ingestion and automatically exploit prior knowledge to accelerate discovery in biochemical sciences.
Submitted 19 July, 2019;
originally announced July 2019.
-
Understanding repulsively mediated superconductivity of correlated electrons via massively parallel DMRG
Authors:
Adrian Kantian,
Michele Dolfi,
Matthias Troyer,
Thierry Giamarchi
Abstract:
The so-called minimal models of unconventional superconductivity are lattice models of interacting electrons derived from materials in which electron pairing arises from purely repulsive interactions. Showing unambiguously that a minimal model actually can have a superconducting ground state remains a challenge at nonperturbative interactions. We make a significant step in this direction by computing ground states of the 2D U-V Hubbard model - the minimal model of the quasi-1D superconductors - by parallelized DMRG, which allows for systematic control of any bias and that is sign-problem-free. Using distributed-memory supercomputers and leveraging the advantages of the U-V model, we can treat unprecedented sizes of 2D strips and extrapolate their spin gap both to zero approximation error and the thermodynamic limit. Our results for the spin gap are shown to be compatible with a spin excitation spectrum that is either fully gapped or has zeros only in discrete points, and conversely that a Fermi liquid or magnetically ordered ground state is incompatible with them. Coupled with the enhancement to short-range correlations that we find exclusively in the $d_{xy}$ pairing-channel, this allows us to build an indirect case for the ground state of this model having superconducting order in the full 2D limit, and ruling out the other main possible phases, magnetic orders and Fermi liquids.
Submitted 6 August, 2019; v1 submitted 28 March, 2019;
originally announced March 2019.
-
Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale
Authors:
Peter W J Staar,
Michele Dolfi,
Christoph Auer,
Costas Bekas
Abstract:
Over the past few decades, the amount of scientific articles and technical literature has increased exponentially in size. Consequently, there is a great need for systems that can ingest these documents at scale and make the contained knowledge discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) as well as the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. In this paper, we present a modular, cloud-based platform to ingest documents at scale. This platform, called the Corpus Conversion Service (CCS), implements a pipeline which allows users to parse and annotate documents (i.e. collect ground-truth), train machine-learning classification algorithms and ultimately convert any type of PDF or bitmap-documents to a structured content representation format. We will show that each of the modules is scalable due to an asynchronous microservice architecture and can therefore handle massive amounts of documents. Furthermore, we will show that our capability to gather ground-truth is accelerated by machine-learning algorithms by at least one order of magnitude. This allows us to both gather large amounts of ground-truth in very little time and obtain very good precision/recall metrics in the range of 99% with regard to content conversion to structured output. The CCS platform is currently deployed on IBM internal infrastructure and serving more than 250 active users for knowledge-engineering project engagements.
Submitted 24 May, 2018;
originally announced June 2018.
-
Corpus Conversion Service: A machine learning platform to ingest documents at scale [Poster abstract]
Authors:
Peter W J Staar,
Michele Dolfi,
Christoph Auer,
Costas Bekas
Abstract:
Over the past few decades, the amount of scientific articles and technical literature has increased exponentially in size. Consequently, there is a great need for systems that can ingest these documents at scale and make their content discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) as well as the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. We present a platform to ingest documents at scale which is powered by Machine Learning techniques and allows the user to train custom models on document collections. We show precision/recall results greater than 97% with regard to conversion to structured formats, as well as scaling evidence for each of the microservices constituting the platform.
Submitted 15 May, 2018;
originally announced May 2018.
-
Density redistribution effects in fermionic optical lattices
Authors:
Medha Soni,
Michele Dolfi,
Matthias Troyer
Abstract:
We simulate a one-dimensional fermionic optical lattice to analyse heating due to non-adiabatic lattice loading. Our simulations reveal that, similar to the bosonic case, density redistribution effects are the major cause of heating in harmonic traps. We suggest protocols to modulate the local density distribution during the process of lattice loading, in order to reduce the excess energy. Our numerical results confirm that linear interpolation of the trapping potential and/or the interaction strength is an efficient method of doing so, with practical applications relevant to experiments.
Submitted 21 July, 2016;
originally announced July 2016.
-
An Efficient Matrix Product Operator Representation of the Quantum-Chemical Hamiltonian
Authors:
Sebastian Keller,
Michele Dolfi,
Matthias Troyer,
Markus Reiher
Abstract:
We describe how to efficiently construct the quantum chemical Hamiltonian operator in matrix product form. We present its implementation as a density matrix renormalization group (DMRG) algorithm for quantum chemical applications in a purely matrix product based framework. Existing implementations of DMRG for quantum chemistry are based on the traditional formulation of the method, which was developed from a viewpoint of Hilbert space decimation and attained a higher performance compared to straightforward implementations of matrix product based DMRG. The latter variationally optimizes a class of ansatz states known as matrix product states (MPS), where operators are correspondingly represented as matrix product operators (MPO). The MPO construction scheme presented here eliminates the previous performance disadvantages while retaining the additional flexibility provided by a matrix product approach; for example, the specification of expectation values becomes an input parameter. In this way, MPOs for different symmetries - abelian and non-abelian - and different relativistic and non-relativistic models may be solved by an otherwise unmodified program.
Submitted 7 October, 2015;
originally announced October 2015.
-
Pair Correlations in Doped Hubbard Ladders
Authors:
Michele Dolfi,
Bela Bauer,
Sebastian Keller,
Matthias Troyer
Abstract:
Hubbard ladders are an important stepping stone to the physics of the two-dimensional Hubbard model. While many of their properties are accessible to numerical and analytical techniques, the question of whether weakly hole-doped Hubbard ladders are dominated by superconducting or charge-density-wave correlations has so far eluded a definitive answer. In particular, previous numerical simulations of Hubbard ladders have seen a much faster decay of superconducting correlations than expected based on analytical arguments. We revisit this question using a state-of-the-art implementation of the density matrix renormalization group algorithm that allows us to simulate larger system sizes with higher accuracy than before. Performing careful extrapolations of the results, we obtain improved estimates for the Luttinger liquid parameter and the correlation functions at long distances. Our results confirm that, as suggested by analytical considerations, superconducting correlations become dominant in the limit of very small doping.
Submitted 20 November, 2015; v1 submitted 15 September, 2015;
originally announced September 2015.
-
Minimizing nonadiabaticities in optical-lattice loading
Authors:
Michele Dolfi,
Adrian Kantian,
Bela Bauer,
Matthias Troyer
Abstract:
In the quest to reach lower temperatures of ultra-cold gases in optical lattice experiments, non-adiabaticities during lattice loading are one of the limiting factors that prevent reaching temperatures as low as in experiments without a lattice. Simulating the loading of a bosonic quantum gas into a one-dimensional optical lattice with and without a trap, we find that the redistribution of atomic density inside a global confining potential is by far the dominant source of heating. Based on these results we propose to adjust the trapping potential during loading to minimize changes to the density distribution. Our simulations confirm that a very simple linear interpolation of the trapping potential during loading already significantly decreases the heating of a quantum gas and we discuss how loading protocols minimizing density redistributions can be designed.
Submitted 23 March, 2015; v1 submitted 21 October, 2014;
originally announced October 2014.
-
Matrix Product State applications for the ALPS project
Authors:
Michele Dolfi,
Bela Bauer,
Sebastian Keller,
Alexandr Kosenkov,
Timothée Ewart,
Adrian Kantian,
Thierry Giamarchi,
Matthias Troyer
Abstract:
The density-matrix renormalization group method has become a standard computational approach to the low-energy physics as well as dynamics of low-dimensional quantum systems. In this paper, we present a new set of applications, available as part of the ALPS package, that provide an efficient and flexible implementation of these methods based on a matrix-product state (MPS) representation. Our applications implement, within the same framework, algorithms to variationally find the ground state and low-lying excited states as well as simulate the time evolution of arbitrary one-dimensional and two-dimensional models. Implementing the conservation of quantum numbers for generic Abelian symmetries, we achieve performance competitive with the best codes in the community. Example results are provided for (i) a model of itinerant fermions in one dimension and (ii) a model of quantum magnetism.
Submitted 14 October, 2014; v1 submitted 3 July, 2014;
originally announced July 2014.
-
Hybridization expansion Monte Carlo simulation of multi-orbital quantum impurity problems: matrix product formalism and improved Monte Carlo sampling
Authors:
Hiroshi Shinaoka,
Michele Dolfi,
Matthias Troyer,
Philipp Werner
Abstract:
We explore two complementary modifications of the hybridization-expansion continuous-time Monte Carlo method, aiming at large multi-orbital quantum impurity problems. One idea is to compute the imaginary-time propagation using a matrix product states representation. We show that bond dimensions considerably smaller than the dimension of the Hilbert space are sufficient to obtain accurate results, and that this approach scales polynomially, rather than exponentially with the number of orbitals. Based on scaling analyses, we conclude that a matrix product state implementation will outperform the exact-diagonalization based method for quantum impurity problems with more than 12 orbitals. The second idea is an improved Monte Carlo sampling scheme which is applicable to all variants of the hybridization expansion method. We show that this so-called sliding window sampling scheme speeds up the simulation by at least an order of magnitude for a broad range of model parameters, with the largest improvements at low temperature.
Submitted 30 June, 2014; v1 submitted 4 April, 2014;
originally announced April 2014.
-
Chiral spin liquid and emergent anyons in a Kagome lattice Mott insulator
Authors:
B. Bauer,
L. Cincio,
B. P. Keller,
M. Dolfi,
G. Vidal,
S. Trebst,
A. W. W. Ludwig
Abstract:
Topological phases in frustrated quantum spin systems have fascinated researchers for decades. One of the earliest proposals for such a phase was the chiral spin liquid put forward by Kalmeyer and Laughlin in 1987 as the bosonic analogue of the fractional quantum Hall effect. Elusive for many years, recent times have finally seen a number of models that realize this phase. However, these models are somewhat artificial and unlikely to be found in realistic materials. Here, we take an important step towards the goal of finding a chiral spin liquid in nature by examining a physically motivated model for a Mott insulator on the Kagome lattice with broken time-reversal symmetry. We first provide a theoretical justification for the emergent chiral spin liquid phase in terms of a network model perspective. We then present an unambiguous numerical identification and characterization of the universal topological properties of the phase, including ground state degeneracy, edge physics, and anyonic bulk excitations, by using a variety of powerful numerical probes, including the entanglement spectrum and modular transformations.
Submitted 13 January, 2014;
originally announced January 2014.
-
A model project for reproducible papers: critical temperature for the Ising model on a square lattice
Authors:
M. Dolfi,
J. Gukelberger,
A. Hehn,
J. Imriška,
K. Pakrouski,
T. F. Rønnow,
M. Troyer,
I. Zintchenko,
F. Chirigati,
J. Freire,
D. Shasha
Abstract:
In this paper we present a simple, yet typical simulation in statistical physics, consisting of large scale Monte Carlo simulations followed by an involved statistical analysis of the results. The purpose is to provide an example publication to explore tools for writing reproducible papers. The simulation estimates the critical temperature where the Ising model on the square lattice becomes magnetic to be $T_c/J = 2.26934(6)$ using a finite size scaling analysis of the crossing points of Binder cumulants. We provide a virtual machine which can be used to reproduce all figures and results.
Submitted 9 January, 2014;
originally announced January 2014.
-
Gapped and gapless spin liquid phases on the Kagome lattice from chiral three-spin interactions
Authors:
Bela Bauer,
Brendan P. Keller,
Michele Dolfi,
Simon Trebst,
Andreas W. W. Ludwig
Abstract:
We argue that a relatively simple model containing only SU(2)-invariant chiral three-spin interactions on a Kagome lattice of $S=1/2$ spins can give rise to both a gapped and a gapless quantum spin liquid. Our arguments are rooted in a formulation in terms of network models of edge states and are backed up by a careful numerical analysis. For a uniform choice of chirality on the lattice, we realize the Kalmeyer-Laughlin state, i.e. a gapped spin liquid which is identified as the $\nu=1/2$ bosonic Laughlin state. For staggered chiralities, a gapless spin liquid emerges which exhibits gapless spin excitations along lines in momentum space, a feature that we probe by studying quasi-two-dimensional systems of finite width. We thus provide a single, appealingly simple spin model (i) for what is probably the simplest realization of the Kalmeyer-Laughlin state to date, as well as (ii) for a non-Fermi liquid state with lines of gapless SU(2) spin excitations.
Submitted 7 September, 2014; v1 submitted 27 March, 2013;
originally announced March 2013.
-
Multigrid Algorithms for Tensor Network States
Authors:
M. Dolfi,
B. Bauer,
M. Troyer,
Z. Ristivojevic
Abstract:
The widely used density matrix renormalization group (DMRG) method often fails to converge in systems with multiple length scales, such as lattice discretizations of continuum models and dilute or weakly doped lattice models. The local optimization employed by DMRG to optimize the wave function is ineffective in updating large-scale features. Here we present a multigrid algorithm that solves these convergence problems by optimizing the wave function at different spatial resolutions. We demonstrate its effectiveness by simulating bosons in continuous space, and study non-adiabaticity when ramping up the amplitude of an optical lattice. The algorithm can be generalized to tensor network methods, and be combined with the contractor renormalization group (CORE) method to study dilute and weakly doped lattice models.
Submitted 12 June, 2012; v1 submitted 28 March, 2012;
originally announced March 2012.
-
Kinetics of double stranded DNA overstretching revealed by 0.5-2 pN force steps
Authors:
Pasquale Bianco,
Lorenzo Bongini,
Luca Melli,
Mario Dolfi,
Vincenzo Lombardi
Abstract:
A detailed description of the conformational plasticity of double-stranded DNA (ds-DNA) is a necessary framework for understanding protein-DNA interactions. Until now, however, the structure and kinetics of the transition from the basic conformation of ds-DNA (B state) to the 1.7 times longer and partially unwound conformation (S state) have not been defined. The force-extension relation of the ds-DNA of lambda-phage is measured here with unprecedented resolution using dual-laser optical tweezers that can impose millisecond force steps of 0.5-2 pN (25 °C). This approach reveals the kinetics of the transition between intermediate states of ds-DNA and uncovers the load-dependence of the rate constant of the unitary reaction step. The DNA overstretching transition turns out to be essentially a two-state reaction composed of 5.85 nm steps, indicating cooperativity of ~25 base pairs. This mechanism increases the free energy for the unitary reaction to ~94 $k_BT$, accounting for the stability of the basic conformation of DNA, and explains the absence of hysteresis in the force-extension relation at equilibrium. The novel description of the kinetics and energetics of the B-S transition of ds-DNA improves our understanding of the biological role of the S state in the interplay between mechanics and enzymology of the DNA-protein machinery.
Submitted 3 March, 2011;
originally announced March 2011.