-
SubGrapher: Visual Fingerprinting of Chemical Structures
Authors:
Lucas Morin,
Gerhard Ingmar Meijer,
Valéry Weber,
Luc Van Gool,
Peter W. J. Staar
Abstract:
Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code will be made publicly available.
Submitted 28 April, 2025;
originally announced April 2025.
-
BOGausS: Better Optimized Gaussian Splatting
Authors:
Stéphane Pateux,
Matthieu Gendrin,
Luce Morin,
Théo Ladune,
Xiaoran Jiang
Abstract:
3D Gaussian Splatting (3DGS) proposes an efficient solution for novel view synthesis. Its framework provides fast and high-fidelity rendering. Although less complex than other solutions such as Neural Radiance Fields (NeRF), there are still some challenges in building smaller models without sacrificing quality. In this study, we perform a careful analysis of the 3DGS training process and propose a new optimization methodology. Our Better Optimized Gaussian Splatting (BOGausS) solution is able to generate models up to ten times lighter than the original 3DGS with no quality degradation, thus significantly boosting the performance of Gaussian Splatting compared to the state of the art.
Submitted 2 April, 2025;
originally announced April 2025.
-
MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures
Authors:
Lucas Morin,
Valéry Weber,
Ahmed Nassar,
Gerhard Ingmar Meijer,
Luc Van Gool,
Yawei Li,
Peter Staar
Abstract:
The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures from text and images, yet the Markush structures remain largely unexplored due to their complex multi-modal nature. In this work, we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available.
Submitted 20 March, 2025;
originally announced March 2025.
-
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
Authors:
Ahmed Nassar,
Andres Marafioti,
Matteo Omenetti,
Maksym Lysak,
Nikolaos Livathinos,
Christoph Auer,
Lucas Morin,
Rafael Teixeira de Lima,
Yusik Kim,
A. Said Gurbuz,
Michele Dolfi,
Miquel Farré,
Peter W. J. Staar
Abstract:
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms, significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available; datasets will be publicly available soon.
Submitted 14 March, 2025;
originally announced March 2025.
-
Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion
Authors:
Nikolaos Livathinos,
Christoph Auer,
Maksym Lysak,
Ahmed Nassar,
Michele Dolfi,
Panos Vagenas,
Cesar Berrospi Ramis,
Matteo Omenetti,
Kasper Dinkla,
Yusik Kim,
Shubham Gupta,
Rafael Teixeira de Lima,
Valery Weber,
Lucas Morin,
Ingmar Meijer,
Viktor Kuropiatnyk,
Peter W. J. Staar
Abstract:
We introduce Docling, an easy-to-use, self-contained, MIT-licensed, open-source toolkit for document conversion that can parse several types of popular document formats into a unified, richly structured representation. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. Docling is released as a Python package and can be used as a Python API or as a CLI tool. Docling's modular architecture and efficient document representation make it easy to implement extensions, new features, models, and customizations. Docling has already been integrated in other popular open-source frameworks (e.g., LangChain, LlamaIndex, spaCy), making it a natural fit for the processing of documents and the development of high-end applications. The open-source community has fully engaged in using, promoting, and developing for Docling, which gathered 10k stars on GitHub in less than a month and was reported as the No. 1 trending repository in GitHub worldwide in November 2024.
Submitted 27 January, 2025;
originally announced January 2025.
-
Docling Technical Report
Authors:
Christoph Auer,
Maksym Lysak,
Ahmed Nassar,
Michele Dolfi,
Nikolaos Livathinos,
Panos Vagenas,
Cesar Berrospi Ramis,
Matteo Omenetti,
Fabian Lindlbauer,
Kasper Dinkla,
Lokesh Mishra,
Yusik Kim,
Shubham Gupta,
Rafael Teixeira de Lima,
Valery Weber,
Lucas Morin,
Ingmar Meijer,
Viktor Kuropiatnyk,
Peter W. J. Staar
Abstract:
This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.
Submitted 9 December, 2024; v1 submitted 19 August, 2024;
originally announced August 2024.
-
NERV++: An Enhanced Implicit Neural Video Representation
Authors:
Ahmed Ghorbel,
Wassim Hamidouche,
Luce Morin
Abstract:
Neural fields, also known as implicit neural representations (INRs), have shown a remarkable capability of representing, generating, and manipulating various data types, allowing for continuous data reconstruction at a low memory footprint. Though promising, INRs applied to video compression still need to improve their rate-distortion performance by a large margin, and require a huge number of parameters and long training iterations to capture high-frequency details, limiting their wider applicability. Resolving this problem remains challenging, but doing so would make INRs more accessible in compression tasks. We take a step towards resolving these shortcomings by introducing NeRV++, an enhanced implicit neural video representation that is a more straightforward yet effective enhancement over the original NeRV (neural representations for videos) decoder architecture, featuring separable conv2d residual blocks (SCRBs) that sandwich the upsampling block (UB), and a bilinear interpolation skip layer for improved feature representation. NeRV++ allows videos to be directly represented as a function approximated by a neural network, and significantly enhances the representation capacity beyond current INR-based video codecs. We evaluate our method on UVG, MCL JVC, and Bunny datasets, achieving competitive results for video compression with INRs. This achievement narrows the gap to autoencoder-based video coding, marking a significant stride in INR-based video compression research.
Submitted 28 February, 2024;
originally announced February 2024.
-
ESG Accountability Made Easy: DocQA at Your Service
Authors:
Lokesh Mishra,
Cesar Berrospi,
Kasper Dinkla,
Diego Antognini,
Francesco Fusco,
Benedikt Bothur,
Maksym Lysak,
Nikolaos Livathinos,
Ahmed Nassar,
Panagiotis Vagenas,
Lucas Morin,
Christoph Auer,
Michele Dolfi,
Peter Staar
Abstract:
We present Deep Search DocQA. This application enables information extraction from documents via a question-answering conversational assistant. The system integrates several technologies from different AI disciplines consisting of document conversion to machine-readable format (via computer vision), finding relevant data (via natural language processing), and formulating an eloquent response (via large language models). Users can explore over 10,000 Environmental, Social, and Governance (ESG) disclosure reports from over 2000 corporations. The Deep Search platform can be accessed at: https://ds4sd.github.io.
Submitted 30 November, 2023;
originally announced November 2023.
-
MolGrapher: Graph-based Visual Recognition of Chemical Structures
Authors:
Lucas Morin,
Martin Danelljan,
Maria Isabel Agea,
Ahmed Nassar,
Valery Weber,
Ingmar Meijer,
Peter Staar,
Fisher Yu
Abstract:
The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures, depicting the molecule structures. However, automatically parsing the exact chemical structure is a formidable challenge, due to the amount of detailed information, the diversity of drawing styles, and the need for training data. In this work, we introduce MolGrapher to recognize chemical structures visually. First, a deep keypoint detector detects the atoms. Second, we treat all candidate atoms and bonds as nodes and put them in a graph. This construct allows a natural graph representation of the molecule. Last, we classify atom and bond nodes in the graph with a Graph Neural Network. To address the lack of real training data, we propose a synthetic data generation pipeline producing diverse and realistic results. In addition, we introduce a large-scale benchmark of annotated real molecule images, USPTO-30K, to spur research on this critical topic. Extensive experiments on five datasets show that our approach significantly outperforms classical and learning-based methods in most settings. Code, models, and datasets are available.
Submitted 23 August, 2023;
originally announced August 2023.
-
ConvNeXt-ChARM: ConvNeXt-based Transform for Efficient Neural Image Compression
Authors:
Ahmed Ghorbel,
Wassim Hamidouche,
Luce Morin
Abstract:
Over the last few years, neural image compression has gained wide attention from research and industry, yielding promising end-to-end deep neural codecs outperforming their conventional counterparts in rate-distortion performance. Despite significant advancement, current methods, including attention-based transform coding, still need to be improved in reducing the coding rate while preserving the reconstruction fidelity, especially in non-homogeneous textured image areas. Those models also require more parameters and a higher decoding time. To tackle the above challenges, we propose ConvNeXt-ChARM, an efficient ConvNeXt-based transform coding framework, paired with a compute-efficient channel-wise auto-regressive prior to capture both global and local contexts from the hyper and quantized latent representations. The proposed architecture can be optimized end-to-end to fully exploit the context information and extract compact latent representation while reconstructing higher-quality images. Experimental results on four widely-used datasets showed that ConvNeXt-ChARM brings consistent and significant BD-rate (PSNR) reductions estimated on average to 5.24% and 1.22% over the versatile video coding (VVC) reference encoder (VTM-18.0) and the state-of-the-art learned image compression method SwinT-ChARM, respectively. Moreover, we provide model scaling studies to verify the computational efficiency of our approach and conduct several objective and subjective analyses to bring to the fore the performance gap between the next generation ConvNet, namely ConvNeXt, and Swin Transformer.
Submitted 12 July, 2023;
originally announced July 2023.
-
AICT: An Adaptive Image Compression Transformer
Authors:
Ahmed Ghorbel,
Wassim Hamidouche,
Luce Morin
Abstract:
Motivated by the efficiency investigation of the Transformer-based transform coding framework, namely SwinT-ChARM, we propose to enhance the latter, first, with a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). Current methods that still rely on ConvNet-based entropy coding are limited in modeling long-range dependencies due to their local connectivity and an increasing number of architectural biases and priors. On the contrary, the proposed ICT can capture both global and local contexts from the latent representations and better parameterize the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre/post-processor to accurately extract more compact latent representation while reconstructing higher-quality images. Extensive experimental results on benchmark datasets showed that the proposed adaptive image compression transformer (AICT) framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM.
Submitted 12 July, 2023;
originally announced July 2023.
-
Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient Neural Image Compression
Authors:
Ahmed Ghorbel,
Wassim Hamidouche,
Luce Morin
Abstract:
Recently, the performance of neural image compression (NIC) has steadily improved thanks to the last line of study, reaching or outperforming state-of-the-art conventional codecs. Despite significant progress, current NIC methods still rely on ConvNet-based entropy coding, limited in modeling long-range dependencies due to their local connectivity and the increasing number of architectural biases and priors, resulting in complex underperforming models with high decoding latency. Motivated by the efficiency investigation of the Transformer-based transform coding framework, namely SwinT-ChARM, we propose to enhance the latter, first, with a more straightforward yet effective Transformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). Through the proposed ICT, we can capture both global and local contexts from the latent representations and better parameterize the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre-/post-processor to accurately extract more compact latent codes while reconstructing higher-quality images. Extensive experimental results on benchmark datasets showed that the proposed framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM. Moreover, we provide model scaling studies to verify the computational efficiency of our approach and conduct several objective and subjective analyses to bring to the fore the performance gap between the adaptive image compression transformer (AICT) and the neural codec SwinT-ChARM.
Submitted 22 January, 2024; v1 submitted 5 July, 2023;
originally announced July 2023.
-
Quality Assessment of DIBR-synthesized views: An Overview
Authors:
Shishun Tian,
Lu Zhang,
Wenbin Zou,
Xia Li,
Ting Su,
Luce Morin,
Olivier Deforges
Abstract:
Depth-Image-Based Rendering (DIBR) is one of the fundamental techniques for generating new views in 3D video applications, such as Multi-View Videos (MVV), Free-Viewpoint Videos (FVV) and Virtual Reality (VR). However, the quality assessment of DIBR-synthesized views is quite different from that of traditional 2D images/videos. In recent years, several efforts have been made towards this topic, but there is a lack of detailed surveys in the literature. In this paper, we provide a comprehensive survey of various current approaches for DIBR-synthesized views. The currently accessible datasets of DIBR-synthesized views are first reviewed, followed by a summary analysis of the representative state-of-the-art objective metrics. Then, the performances of different objective metrics are evaluated and discussed on all available datasets. Finally, we discuss the potential challenges and suggest possible directions for future research.
Submitted 27 April, 2021; v1 submitted 16 November, 2019;
originally announced November 2019.
-
Numerical simulation of model problems in plasticity based on field dislocation mechanics
Authors:
Léo Morin,
Renald Brenner,
Pierre Suquet
Abstract:
The aim of this paper is to investigate the numerical implementation of the Field Dislocation Mechanics (FDM) theory for the simulation of dislocation-mediated plasticity. First, the mesoscale FDM theory of Acharya and Roy (2006) is recalled, which allows the set of equations to be expressed in the form of a static problem, corresponding to the determination of the local stress field for a given dislocation density distribution, complemented by an evolution problem, corresponding to the transport of the dislocation density. The static problem is solved using FFT-based techniques (Brenner et al., 2014). The main contribution of the present study is an efficient numerical scheme based on high resolution Godunov-type solvers to solve the evolution problem. Model problems of dislocation-mediated plasticity are finally considered in a simplified layer case. First, uncoupled problems with uniform velocity are considered, which allows the annihilation of dislocations and the expansion of dislocation loops to be reproduced. Then, the FDM theory is applied to several problems of dislocation microstructures subjected to a mechanical loading.
Submitted 6 November, 2019;
originally announced November 2019.