An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact

Abstract: The European Union Deforestation Regulation (EUDR) requires companies to prove their products do not contribute to deforestation, creating a critical demand for precise, asset-level environmental impact data. Current databases lack the necessary detail, relying heavily on broad financial metrics and manual data collection, which limits regulatory compliance and accurate environmental modeling. This study presents an automated, end-to-end data extraction pipeline that uses LLMs to create, clean, and validate structured databases, specifically targeting sectors with a high risk of deforestation. The pipeline introduces Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompting to enhance data extraction accuracy and a Retrieval-Augmented Validation (RAV) process that integrates real-time web searches for improved data reliability. Applied to SEC EDGAR filings in the Mining, Oil & Gas, and Utilities sectors, the pipeline demonstrates significant improvements over traditional zero-shot prompting approaches, particularly in extraction accuracy and validation coverage. This work advances NLP-driven automation for regulatory compliance, CSR (Corporate Social Responsibility), and ESG, with broad sectoral applicability.

Submitted 5 May, 2025; originally announced May 2025.

Comments: Accepted to ACL ClimateNLP 2025

arXiv:2502.15429 [pdf, other]

Pub-Guard-LLM: Detecting Fraudulent Biomedical Articles with Reliable Explanations

Authors: Lihu Chen, Shuojie Fu, Gabriel Freedman, Cemre Zor, Guy Martin, James Kinross, Uddhav Vaghela, Ovidiu Serban, Francesca Toni

Abstract: A significant and growing number of published scientific articles is found to involve fraudulent practices, posing a serious threat to the credibility and safety of research in fields such as medicine. We propose Pub-Guard-LLM, the first large language model-based system tailored to fraud detection of biomedical scientific articles. We provide three application modes for deploying Pub-Guard-LLM: vanilla reasoning, retrieval-augmented generation, and multi-agent debate. Each mode allows for textual explanations of predictions. To assess the performance of our system, we introduce an open-source benchmark, PubMed Retraction, comprising over 11K real-world biomedical articles, including metadata and retraction labels. We show that, across all modes, Pub-Guard-LLM consistently surpasses the performance of various baselines and provides more reliable explanations, namely explanations which are deemed more relevant and coherent than those generated by the baselines when evaluated by multiple assessment methods. By enhancing both detection performance and explainability in scientific fraud detection, Pub-Guard-LLM contributes to safeguarding research integrity with a novel, effective, open-source tool.

Submitted 8 April, 2025; v1 submitted 21 February, 2025; originally announced February 2025.

Comments: long paper under review

arXiv:2307.01157 [pdf, other]

A novel approach for predicting epidemiological forecasting parameters based on real-time signals and Data Assimilation

Authors: Romain Molinas, César Quilodrán Casas, Rossella Arcucci, Ovidiu Şerban

Abstract: This paper proposes a novel approach to predict epidemiological parameters by integrating new real-time signals from various sources of information, such as novel social media-based population density maps and Air Quality data. We implement an ensemble of Convolutional Neural Networks (CNN) models using various data sources and fusion methodology to build robust predictions and simulate several dynamic parameters that could improve the decision-making process for policymakers. Additionally, we used data assimilation to estimate the state of our system from fused CNN predictions. The combination of meteorological signals and social media-based population density maps improved the performance and flexibility of our prediction of the COVID-19 outbreak in London. While the proposed approach outperforms standard models, such as compartmental models traditionally used in disease forecasting (SEIR), generating robust and consistent predictions allows us to increase the stability of our model while increasing its accuracy.

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2305.15895 [pdf, other]

Collective Knowledge Graph Completion with Mutual Knowledge Distillation

Authors: Weihang Zhang, Ovidiu Serban, Jiahao Sun, Yi-ke Guo

Abstract: Knowledge graph completion (KGC), the task of predicting missing information based on the existing relational data inside a knowledge graph (KG), has drawn significant attention in recent years. However, the predictive power of KGC methods is often limited by the completeness of the existing knowledge graphs from different sources and languages. In monolingual and multilingual settings, KGs are potentially complementary to each other. In this paper, we study the problem of multi-KG completion, where we focus on maximizing the collective knowledge from different KGs to alleviate the incompleteness of individual KGs. Specifically, we propose a novel method called CKGC-CKD that uses relation-aware graph convolutional network encoder models on both individual KGs and a large fused KG in which seed alignments between KGs are regarded as edges for message propagation. An additional mutual knowledge distillation mechanism is also employed to maximize the knowledge transfer between the models of "global" fused KG and the "local" individual KGs. Experimental results on multilingual datasets have shown that our method outperforms all state-of-the-art models in the KGC task.

Submitted 25 May, 2023; originally announced May 2023.

Comments: Accepted at ENLSP-II workshop at NeurIPS 2022

arXiv:2110.03687 [pdf, other]

Protecting Retail Investors from Order Book Spoofing using a GRU-based Detection Model

Authors: Jean-Noël Tuccella, Philip Nadler, Ovidiu Şerban

Abstract: Market manipulation is tackled through regulation in traditional markets because of its detrimental effect on market efficiency and many participating financial actors. The recent increase of private retail investors due to new low-fee platforms and new asset classes such as decentralised digital currencies has increased the number of vulnerable actors due to lack of institutional sophistication and strong regulation. This paper proposes a method to detect illicit activity and inform investors on spoofing attempts, a well-known market manipulation technique. Our framework is based on a highly extendable Gated Recurrent Unit (GRU) model and allows the inclusion of market variables that can explain spoofing and potentially other illicit activities. The model is tested on granular order book data, in one of the most unregulated markets prone to spoofing with a large number of non-institutional traders. The results show that the model is performing well in an early detection context, allowing the identification of spoofing attempts soon enough to allow investors to react. This is the first step to a fully comprehensive model that will protect investors in various unregulated trading environments and regulators to identify illicit activity.

Submitted 8 October, 2021; originally announced October 2021.

arXiv:1609.04656 [pdf]

doi 10.1007/978-3-319-45982-0_5

Collective Awareness Platforms and Digital Social Innovation Mediating Consensus Seeking in Problem Situations

Authors: Atta Badii, Franco Bagnoli, Balint Balazs, Tommaso Castellani, Davide D'Orazio, Fernando Ferri, Patrizia Grifoni, Giovanna Pacini, Ovidiu Serban, Adriana Valente

Abstract: In this paper we show the results of our studies carried out in the framework of the European Project SciCafe2.0 in the area of Participatory Engagement models. We present a methodological approach built on participative engagements models and holistic framework for problem situation clarification and solution impacts assessment. Several online platforms for social engagement have been analysed to extract the main patterns of participative engagement. We present our own experiments through the SciCafe2.0 Platform and our insights from requirements elicitation.

Submitted 15 September, 2016; originally announced September 2016.

Journal ref: INSCI 2016, LNCS 9934, pp. 55-65, 2016

arXiv:1603.07534 [pdf, other]

Web Data Knowledge Extraction

Authors: Juan M. Tirado, Ovidiu Serban, Qiang Guo, Eiko Yoneki

Abstract: A constantly growing amount of information is available through the web. Unfortunately, extracting useful content from this massive amount of data still remains an open issue. The lack of standard data models and structures forces developers to create ad hoc solutions from scratch. The figure of the expert is still needed in many situations where developers do not have the correct background knowledge. This forces developers to spend time acquiring the needed background from the expert. In other directions, there are promising solutions employing machine learning techniques. However, increasing accuracy requires an increase in system complexity that cannot be endured in many projects. In this work, we approach the web knowledge extraction problem using an expert-centric methodology. This methodology defines a set of configurable, extendible and independent components that permit the reutilisation of large pieces of code among projects. Our methodology differs from similar solutions in its expert-driven design. This design makes it possible for a subject-matter expert to drive the knowledge extraction for a given set of documents. Additionally, we propose the utilization of machine assisted solutions that guide the expert during this process. To demonstrate the capabilities of our methodology, we present a real use case scenario in which public procurement data is extracted from the web-based repositories of several public institutions across Europe. We provide insightful details about the challenges we had to deal with in this use case and additional discussions about how to apply our methodology.

Submitted 24 March, 2016; originally announced March 2016.

MSC Class: 68U04

ACM Class: I.7; H.2; H.2.4; D.2.11; D.2.12; C.1.1; H.2.8; H.3; H.3.1; H.5.4

Showing 1–7 of 7 results for author: Şerban, O