-
Think Before You Attribute: Improving the Performance of LLMs Attribution Systems
Authors:
João Eduardo Batista,
Emil Vatai,
Mohamed Wahib
Abstract:
Large Language Models (LLMs) are increasingly applied in various science domains, yet their broader adoption remains constrained by a critical challenge: the lack of trustworthy, verifiable outputs. Current LLMs often generate answers without reliable source attribution, or worse, with incorrect attributions, posing a barrier to their use in scientific and high-stakes settings, where traceability and accountability are non-negotiable. To be reliable, attribution systems need high accuracy and must retrieve short spans of text, i.e., attribute to a sentence within a document rather than to a whole document. We propose a sentence-level pre-attribution step for Retrieval-Augmented Generation (RAG) systems that classifies sentences into three categories: not attributable, attributable to a single quote, and attributable to multiple quotes. By separating sentences before attribution, a proper attribution method can be selected for the type of sentence, or the attribution can be skipped altogether. Our results indicate that classifiers are well-suited for this task. In this work, we propose a pre-attribution step to reduce the computational complexity of attribution, provide a clean version of the HAGRID dataset, and provide an end-to-end attribution system that works out of the box.
Submitted 18 May, 2025;
originally announced May 2025.
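The pre-attribution step described in this abstract can be pictured as a routing stage: each sentence is assigned one of the three categories, and only then is an attribution method chosen (or skipped). The following is a minimal sketch of that routing only. The names `SentenceType`, `route_sentences`, and `toy_classifier` are hypothetical, and the digit-based heuristic is a stand-in assumption, not the classifier evaluated in the paper.

```python
from enum import Enum
from typing import Callable, Dict, List


class SentenceType(Enum):
    """The three categories named in the abstract."""
    NOT_ATTRIBUTABLE = 0
    SINGLE_QUOTE = 1
    MULTIPLE_QUOTES = 2


def route_sentences(
    sentences: List[str],
    classify: Callable[[str], SentenceType],
) -> Dict[SentenceType, List[str]]:
    """Group sentences by predicted category so each group can be
    handled by a different attribution method (or skipped)."""
    groups: Dict[SentenceType, List[str]] = {t: [] for t in SentenceType}
    for sentence in sentences:
        groups[classify(sentence)].append(sentence)
    return groups


def toy_classifier(sentence: str) -> SentenceType:
    """Placeholder heuristic (an assumption for illustration only):
    sentences with digits are treated as factual claims, and those
    that also contain ' and ' as claims needing multiple quotes."""
    has_digit = any(ch.isdigit() for ch in sentence)
    if has_digit and " and " in sentence.lower():
        return SentenceType.MULTIPLE_QUOTES
    if has_digit:
        return SentenceType.SINGLE_QUOTE
    return SentenceType.NOT_ATTRIBUTABLE


answer = [
    "Hello, happy to help.",
    "The tower was built in 1889.",
    "It opened in 1889 and was repainted in 1907.",
]
groups = route_sentences(answer, toy_classifier)
```

In a real system the `classify` argument would be a trained model; sentences in the `NOT_ATTRIBUTABLE` group would bypass retrieval entirely, which is where the reduction in attribution cost would come from.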
-
Embedding Domain-Specific Knowledge from LLMs into the Feature Engineering Pipeline
Authors:
João Eduardo Batista
Abstract:
Feature engineering is mandatory in the machine learning pipeline to obtain robust models. While evolutionary computation is well-known for its great results both in feature selection and feature construction, its methods are computationally expensive due to the large number of evaluations required to induce the final model. Part of the reason why these algorithms require a large number of evaluations is their lack of domain-specific knowledge, resulting in a lot of random guessing during evolution. In this work, we propose using Large Language Models (LLMs) as an initial feature construction step to add knowledge to the dataset. By doing so, our results show that the evolution can converge faster, saving us computational resources. The proposed approach only provides the names of the features in the dataset and the target objective to the LLM, making it usable even when working with datasets containing private data. While consistent improvements to test performance were only observed for one-third of the datasets (CSS, PM, and IM10), possibly due to problems being easily explored by LLMs, this approach only decreased the model performance in 1/77 test cases. Additionally, this work introduces the M6GP feature engineering algorithm to symbolic regression, showing it can improve the results of the random forest regressor and produce competitive results with its predecessor, M3GP.
Submitted 27 March, 2025;
originally announced March 2025.
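The abstract states that only the feature names and the target objective are passed to the LLM, never the data values. A minimal sketch of what such a prompt builder could look like follows. The function name `build_feature_prompt`, the prompt wording, and the example column names are all assumptions for illustration; the paper's actual prompt is not given in the abstract.

```python
from typing import List


def build_feature_prompt(feature_names: List[str], target: str) -> str:
    """Build a prompt from column names and the target only.
    No rows or values from the dataset are included, so private
    data never leaves the local machine."""
    listed = ", ".join(feature_names)
    return (
        f"The dataset has the following features: {listed}. "
        f"The goal is to predict: {target}. "
        "Suggest new features as arithmetic expressions over the existing ones."
    )


# Hypothetical column names, used only to exercise the function.
prompt = build_feature_prompt(["cement", "water", "age"], "compressive strength")
```

The expressions returned by the LLM would then be evaluated locally on the dataset and appended as extra columns before the evolutionary feature construction starts.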
-
Tadashi: Enabling AI-Based Automated Code Generation With Guaranteed Correctness
Authors:
Emil Vatai,
Aleksandr Drozd,
Ivan R. Ivanov,
Joao E. Batista,
Yinghao Ren,
Mohamed Wahib
Abstract:
Frameworks and domain-specific languages for auto-generating code have traditionally depended on human experts to implement rigorous methods ensuring the legality of code transformations. Recently, machine learning (ML) has gained traction for generating code optimized for specific hardware targets. However, ML approaches, particularly black-box neural networks, offer no guarantees on the correctness or legality of the transformations they produce. To address this gap, we introduce Tadashi, an end-to-end system that leverages the polyhedral model to support researchers in curating datasets critical for ML-based code generation. Tadashi provides an end-to-end system capable of applying, verifying, and evaluating candidate transformations on polyhedral schedules with both reliability and practicality. We formally prove that Tadashi guarantees the legality of generated transformations, demonstrate its low runtime overhead, and showcase its broad applicability. Tadashi is available at https://github.com/vatai/tadashi/.
Submitted 2 June, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
RakutenAI-7B: Extending Large Language Models for Japanese
Authors:
Rakuten Group,
Aaron Levine,
Connie Huang,
Chenguang Wang,
Eduardo Batista,
Ewa Szymanska,
Hongyi Ding,
Hou Wei Chou,
Jean-François Pessiot,
Johanes Effendi,
Justin Chiu,
Kai Torben Ohlhus,
Karan Chopra,
Keiji Shinzato,
Koji Murakami,
Lee Xiong,
Lei Chen,
Maki Kubota,
Maksim Tkachenko,
Miroku Lee,
Naoki Takahashi,
Prathyusha Jwalapuram,
Ryutaro Tatsushima,
Saurabh Jain,
Sunil Kumar Yadav,
et al. (5 additional authors not shown)
Abstract:
We introduce RakutenAI-7B, a suite of Japanese-oriented large language models that achieve the best performance on the Japanese LM Harness benchmarks among the open 7B models. Along with the foundation model, we release instruction- and chat-tuned models, RakutenAI-7B-instruct and RakutenAI-7B-chat respectively, under the Apache 2.0 license.
Submitted 21 March, 2024;
originally announced March 2024.
-
FlexCast: genuine overlay-based atomic multicast
Authors:
Eliã Batista,
Paulo Coelho,
Eduardo Alchieri,
Fernando Dotti,
Fernando Pedone
Abstract:
Atomic multicast is a communication abstraction where messages are propagated to groups of processes with reliability and order guarantees. Atomic multicast is at the core of strongly consistent storage and transactional systems. This paper presents FlexCast, the first genuine overlay-based atomic multicast protocol. Genuineness captures the essence of atomic multicast in that only the sender of a message and the message's destinations coordinate to order the message, leading to efficient protocols. Overlay-based protocols restrict how process groups can communicate. Limiting communication leads to simpler protocols and reduces the amount of information each process must keep about the rest of the system. FlexCast implements genuine atomic multicast using a complete DAG overlay. We experimentally evaluate FlexCast in a geographically distributed environment using gTPC-C, a variation of the TPC-C benchmark that takes into account geographical distribution and locality. We show that, by exploiting genuineness and workload locality, FlexCast outperforms well-established atomic multicast protocols without the inherent communication overhead of state-of-the-art non-genuine multicast protocols.
Submitted 28 September, 2023; v1 submitted 25 September, 2023;
originally announced September 2023.
-
Contributions to Context-Aware Smart Healthcare: A Security and Privacy Perspective
Authors:
Edgar Batista
Abstract:
The management of health data, from their gathering to their analysis, raises a number of challenging issues due to their highly confidential nature. In particular, this dissertation contributes to several security and privacy challenges within the smart health paradigm. More concretely, we first develop some contributions to context-aware environments enabling smart health scenarios. We present an extensive analysis of the security aspects of the underlying sensors and networks deployed in such environments, a novel user-centred privacy framework for analysing ubiquitous computing systems, and a complete analysis of the security and privacy challenges that need to be faced to implement cognitive cities properly. Second, we contribute to process mining, a popular analytical field that helps analyse business processes within organisations. Despite its popularity within the healthcare industry, we address two major issues: the high complexity of healthcare processes and the scarce research on privacy aspects. Regarding the first issue, we present a novel process discovery algorithm with a built-in heuristic that simplifies complex processes and, regarding the second, we propose two novel privacy-preserving process mining methods, which achieve a remarkable trade-off between accuracy and privacy. Last but not least, we present some smart health applications, namely a context-aware recommender system for routes, a platform supporting early mobilization programmes in hospital settings, and a health-oriented geographic information system. The results of this dissertation are intended to help the research community to enhance the security of the intelligent environments of the future as well as the privacy of the citizens regarding their personal and health data.
Submitted 28 June, 2022;
originally announced June 2022.
-
SoK: Cross-border Criminal Investigations and Digital Evidence
Authors:
Fran Casino,
Claudia Pina,
Pablo López-Aguilar,
Edgar Batista,
Agusti Solanas,
Constantinos Patsakis
Abstract:
Digital evidence underpins the majority of crimes, as its analysis is an integral part of almost every criminal investigation. Even if we temporarily disregard the numerous challenges in the collection and analysis of digital evidence, the exchange of the evidence among the different stakeholders has many thorny issues. Of specific interest are cross-border criminal investigations, as the complexity is significantly high due to the heterogeneity of legal frameworks, which, beyond time bottlenecks, can also become prohibitive. The aim of this article is to analyse the current state of practice of cross-border investigations, considering the efficacy of current collaboration protocols along with the challenges and drawbacks to be overcome. Further to performing a legally-oriented research treatise, we recall all the challenges raised in the literature and discuss them from a more practical yet global perspective. Thus, this article paves the way to enabling practitioners and stakeholders to leverage horizontal strategies to fill in the identified gaps in a timely and accurate manner.
Submitted 25 May, 2022;
originally announced May 2022.
-
On the Compression of Neural Networks Using $\ell_0$-Norm Regularization and Weight Pruning
Authors:
Felipe Dennis de Resende Oliveira,
Eduardo Luiz Ortiz Batista,
Rui Seara
Abstract:
Despite the growing availability of high-capacity computational platforms, implementation complexity has remained a great concern for the real-world deployment of neural networks. This concern is not exclusively due to the huge costs of state-of-the-art network architectures, but also due to the recent push towards edge intelligence and the use of neural networks in embedded applications. In this context, network compression techniques have been gaining interest due to their ability to reduce deployment costs while keeping inference accuracy at satisfactory levels. The present paper is dedicated to the development of a novel compression scheme for neural networks. To this end, a new form of $\ell_0$-norm-based regularization is first developed, which is capable of inducing strong sparseness in the network during training. Then, targeting the smaller weights of the trained network with pruning techniques, smaller yet highly effective networks can be obtained. The proposed compression scheme also involves the use of $\ell_2$-norm regularization to avoid overfitting as well as fine-tuning to improve the performance of the pruned network. Experimental results are presented aiming to show the effectiveness of the proposed scheme as well as to make comparisons with competing approaches.
Submitted 18 December, 2023; v1 submitted 10 September, 2021;
originally announced September 2021.
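The pruning stage mentioned in this abstract, which targets the smaller weights of an already trained network, can be illustrated with plain magnitude pruning. This is a minimal sketch under stated assumptions: the helper names `l0_norm` and `prune_by_magnitude`, the weight values, and the threshold of 0.05 are hypothetical, and the paper's specific $\ell_0$-norm regularizer used during training is not reproduced here.

```python
from typing import List


def l0_norm(weights: List[float]) -> int:
    """The l0 'norm': the number of non-zero weights."""
    return sum(1 for w in weights if w != 0.0)


def prune_by_magnitude(weights: List[float], threshold: float) -> List[float]:
    """Zero out every weight whose magnitude is below the threshold,
    keeping the remaining weights unchanged."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


# Hypothetical weights of one trained layer; a sparsity-inducing
# regularizer would push many of them close to zero during training.
weights = [0.8, -0.02, 0.003, -0.5, 0.04, 0.0]
pruned = prune_by_magnitude(weights, threshold=0.05)

nonzero_before = l0_norm(weights)
nonzero_after = l0_norm(pruned)
```

After pruning, the surviving weights would typically be fine-tuned for a few epochs with the zeroed positions held fixed, which corresponds to the fine-tuning step the abstract describes.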
-
An Open-Source Tool for Classification Models in Resource-Constrained Hardware
Authors:
Lucas Tsutsui da Silva,
Vinicius M. A. Souza,
Gustavo E. A. P. A. Batista
Abstract:
Applications that need to sense, measure, and gather real-time information from the environment frequently face three main restrictions: power consumption, cost, and lack of infrastructure. Most of the challenges imposed by these limitations can be better addressed by embedding Machine Learning (ML) classifiers in the hardware that senses the environment, creating smart sensors able to interpret the low-level data stream. However, for this approach to be cost-effective, we need highly efficient classifiers suitable for execution on resource-poor hardware, such as low-power microcontrollers. In this paper, we present an open-source tool named EmbML (Embedded Machine Learning) that implements a pipeline to develop classifiers for resource-constrained hardware. We describe its implementation details and provide a comprehensive analysis of its classifiers considering accuracy, classification time, and memory usage. Moreover, we compare the performance of its classifiers with classifiers produced by related tools to demonstrate that our tool provides a diverse set of classification algorithms that are both compact and accurate. Finally, we validate EmbML classifiers in a practical application of a smart sensor and trap for disease vector mosquitoes.
Submitted 12 May, 2021;
originally announced May 2021.
-
Plotting time: On the usage of CNNs for time series classification
Authors:
Nuno M. Rodrigues,
João E. Batista,
Leonardo Trujillo,
Bernardo Duarte,
Mario Giacobini,
Leonardo Vanneschi,
Sara Silva
Abstract:
We present a novel approach for time series classification where we represent time series data as plot images and feed them to a simple CNN, outperforming several state-of-the-art methods. We propose a simple and highly replicable way of plotting the time series, and feed these images as input to a non-optimized shallow CNN, without any normalization or residual connections. These representations are no more than default line plots using the time series data, where the only pre-processing applied is to reduce the number of white pixels in the image. We compare our method with different state-of-the-art methods specialized in time series classification on two real-world non-public datasets, as well as 98 datasets of the UCR dataset collection. The results show that our approach is very promising, achieving the best results on both real-world datasets and matching or beating the best state-of-the-art methods in six UCR datasets. We argue that, if a simple naive design like ours can obtain such good results, it is worth further exploring the capabilities of using image representation of time series data, along with more powerful CNNs, for classification and other related tasks.
Submitted 8 February, 2021;
originally announced February 2021.
-
Challenges in Benchmarking Stream Learning Algorithms with Real-world Data
Authors:
Vinicius M. A. Souza,
Denis M. dos Reis,
Andre G. Maletzke,
Gustavo E. A. P. A. Batista
Abstract:
Streaming data are increasingly present in real-world applications such as sensor measurements, satellite data feeds, stock markets, and financial data. The main characteristics of these applications are the online arrival of data observations at high speed and the susceptibility to changes in the data distributions due to the dynamic nature of real environments. The data stream mining community still faces some primary challenges and difficulties related to the comparison and evaluation of new proposals, mainly due to the lack of publicly available non-stationary real-world datasets. The comparison of stream algorithms proposed in the literature is not an easy task, as authors do not always follow the same recommendations, experimental evaluation procedures, datasets, and assumptions. In this paper, we mitigate problems related to the choice of datasets in the experimental evaluation of stream classifiers and drift detectors. To that end, we propose a new public data repository for benchmarking stream algorithms with real-world data. This repository contains the most popular datasets from the literature and new datasets related to a highly relevant public health problem that involves the recognition of disease vector insects using optical sensors. The main advantage of these new datasets is the prior knowledge of their characteristics and patterns of changes to evaluate new adaptive algorithm proposals adequately. We also present an in-depth discussion about the characteristics, reasons, and issues that lead to different types of changes in data distribution, as well as a critical review of common problems concerning the current benchmark datasets available in the literature.
Submitted 30 June, 2020; v1 submitted 30 April, 2020;
originally announced May 2020.
-
Improving the Detection of Burnt Areas in Remote Sensing using Hyper-features Evolved by M3GP
Authors:
João E. Batista,
Sara Silva
Abstract:
One problem found when working with satellite images is the radiometric variation within an image and across different images. Intending to improve remote sensing models for the classification of burnt areas, we set two objectives. The first is to understand the relationship between feature spaces and the predictive ability of the models, allowing us to explain the differences between learning and generalization when training and testing on different datasets. We find that training on datasets built from more than one image provides models that generalize better. These results are explained by visualizing the dispersion of values in the feature space. The second objective is to evolve hyper-features that improve the performance of different classifiers on a variety of test sets. We find the hyper-features to be beneficial, and obtain the best models with XGBoost, even if the hyper-features are optimized for a different method.
Submitted 31 January, 2020;
originally announced February 2020.
-
Ensemble Genetic Programming
Authors:
Nuno M. Rodrigues,
João E. Batista,
Sara Silva
Abstract:
Ensemble learning is a powerful paradigm that has been used in the top state-of-the-art machine learning methods like Random Forests and XGBoost. Inspired by the success of such methods, we have developed a new Genetic Programming method called Ensemble GP. The evolutionary cycle of Ensemble GP follows the same steps as other Genetic Programming systems, but with differences in the population structure, fitness evaluation and genetic operators. We have tested this method on eight binary classification problems, achieving results significantly better than standard GP, with much smaller models. Although other methods like M3GP and XGBoost were the best overall, Ensemble GP was able to achieve exceptionally good generalization results on a particularly hard problem where none of the other methods was able to succeed.
Submitted 21 January, 2020;
originally announced January 2020.
-
Technical Report: Implementation and Validation of a Smart Health Application
Authors:
Fran Casino,
Constantinos Patsakis,
Antoni Martinez-Balleste,
Frederic Borras,
Edgar Batista
Abstract:
In this article, we explain in detail the internal structures and databases of a smart health application. Moreover, we describe how to generate a statistically sound synthetic dataset using real-world medical data.
Submitted 13 June, 2017;
originally announced June 2017.