-
BiLCNet : BiLSTM-Conformer Network for Encrypted Traffic Classification with 5G SA Physical Channel Records
Authors:
Ke Ma,
Jialiang Lu,
Philippe Martins
Abstract:
Accurate and efficient traffic classification is vital for wireless network management, especially under encrypted payloads and dynamic application behavior, where traditional methods such as port-based identification and deep packet inspection (DPI) are increasingly inadequate. This work explores the feasibility of using physical channel data collected from the air interface of 5G Standalone (SA) networks for traffic sensing. We develop a preprocessing pipeline to transform raw channel records into structured representations with customized feature engineering to enhance downstream classification performance. To jointly capture temporal dependencies and both local and global structural patterns inherent in physical channel records, we propose a novel hybrid architecture: BiLSTM-Conformer Network (BiLCNet), which integrates the sequential modeling capability of Bidirectional Long Short-Term Memory networks (BiLSTM) with the spatial feature extraction strength of Conformer blocks. Evaluated on a noise-limited 5G SA dataset, our model achieves a classification accuracy of 93.9%, outperforming a series of conventional machine learning and deep learning algorithms. Furthermore, we demonstrate its generalization ability under zero-shot transfer settings, validating its robustness across traffic categories and varying environmental conditions.
Submitted 22 September, 2025;
originally announced September 2025.
-
On the under-reaching phenomenon in message-passing neural PDE solvers: revisiting the CFL condition
Authors:
Lucas Tesan,
Mikel M. Iparraguirre,
David Gonzalez,
Pedro Martins,
Elias Cueto
Abstract:
This paper proposes sharp lower bounds for the number of message passing iterations required in graph neural networks (GNNs) when solving partial differential equations (PDEs). This significantly reduces the need for exhaustive hyperparameter tuning. Bounds are derived for the three fundamental classes of PDEs (hyperbolic, parabolic and elliptic) by relating the physical characteristics of the problem in question to the message-passing requirement of GNNs. In particular, we investigate the relationship between the physical constants of the equations governing the problem, the spatial and temporal discretisation and the message passing mechanisms in GNNs.
When the number of message passing iterations is below these proposed limits, information does not propagate efficiently through the network, resulting in poor solutions, even for deep GNN architectures. In contrast, when the suggested lower bound is satisfied, the GNN parameterisation allows the model to accurately capture the underlying phenomenology, resulting in solvers of adequate accuracy.
Four example equations are provided to show the sharpness of the proposed lower bounds.
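The link between discretisation and message-passing depth can be illustrated with a CFL-style counting argument. The sketch below is only an intuition for the hyperbolic case and is not the bound derived in the paper: it assumes a constant wave speed `c`, a uniform mesh spacing `dx`, a time step `dt`, and that one message-passing iteration moves information across exactly one mesh edge. The function name and parameters are hypothetical.

```python
import math

def min_message_passing_steps(c: float, dt: float, dx: float) -> int:
    """Illustrative CFL-style estimate (an assumption, not the paper's result).

    In one time step a signal travelling at speed c covers a distance c * dt.
    If each message-passing iteration reaches one mesh edge of length dx,
    the network needs at least ceil(c * dt / dx) iterations to cover it.
    """
    if c < 0 or dt <= 0 or dx <= 0:
        raise ValueError("expected c >= 0, dt > 0 and dx > 0")
    return max(1, math.ceil(c * dt / dx))
```

Under these assumptions, halving `dx` at a fixed `dt` doubles the estimated number of iterations, which matches the qualitative behaviour described in the abstract.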
Submitted 9 July, 2025;
originally announced July 2025.
-
EuroLLM-9B: Technical Report
Authors:
Pedro Henrique Martins,
João Alves,
Patrick Fernandes,
Nuno M. Guerreiro,
Ricardo Rei,
Amin Farajian,
Mateusz Klimaszewski,
Duarte M. Alves,
José Pombal,
Nicolas Boizard,
Manuel Faysse,
Pierre Colombo,
François Yvon,
Barry Haddow,
José G. C. de Souza,
Alexandra Birch,
André F. T. Martins
Abstract:
This report presents EuroLLM-9B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-9B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. We describe the pre-training data collection and filtering pipeline, including the creation of EuroFilter, an AI-based multilingual filter, as well as the design of EuroBlocks-Synthetic, a novel synthetic dataset for post-training that enhances language coverage for European languages. Evaluation results demonstrate EuroLLM-9B's competitive performance on multilingual benchmarks and machine translation tasks, establishing it as the leading open European-made LLM of its size. To support open research and adoption, we release all major components of this work, including the base and instruction-tuned models, the EuroFilter classifier, and the synthetic post-training dataset.
Submitted 16 June, 2025; v1 submitted 4 June, 2025;
originally announced June 2025.
-
Transformer-based Ranking Approaches for Keyword Queries over Relational Databases
Authors:
Paulo Martins,
Altigran da Silva,
Johny Moreira,
Edleno de Moura
Abstract:
Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without requiring schema knowledge or query-language proficiency. Although numerous R-KwS methods have been proposed, most still focus on queries referring only to attribute values or primarily address performance enhancements, providing limited support for queries referencing schema elements. We previously introduced Lathe, a system that accommodates schema-based keyword queries and employs an eager CJN evaluation strategy to filter out spurious Candidate Joining Networks (CJNs). However, Lathe still faces challenges in accurately ranking CJNs when queries are ambiguous. In this work, we propose a new transformer-based ranking approach that provides a more context-aware evaluation of Query Matches (QMs) and CJNs. Our solution introduces a linearization process to convert relational structures into textual sequences suitable for transformer models. It also includes a data augmentation strategy aimed at handling diverse and ambiguous queries more effectively. Experimental results, comparing our transformer-based ranking to Lathe's original Bayesian-based method, show significant improvements in recall and R@k, demonstrating the effectiveness of our neural approach in delivering the most relevant query results.
Submitted 24 March, 2025;
originally announced March 2025.
-
Euclid Quick Data Release (Q1). Active galactic nuclei identification using diffusion-based inpainting of Euclid VIS images
Authors:
Euclid Collaboration,
G. Stevens,
S. Fotopoulou,
M. N. Bremer,
T. Matamoro Zatarain,
K. Jahnke,
B. Margalef-Bentabol,
M. Huertas-Company,
M. J. Smith,
M. Walmsley,
M. Salvato,
M. Mezcua,
A. Paulino-Afonso,
M. Siudek,
M. Talia,
F. Ricci,
W. Roster,
N. Aghanim,
B. Altieri,
S. Andreon,
H. Aussel,
C. Baccigalupi,
M. Baldi,
S. Bardelli,
P. Battaglia,
et al. (249 additional authors not shown)
Abstract:
Light emission from galaxies exhibits diverse brightness profiles, influenced by factors such as galaxy type, structural features and interactions with other galaxies. Elliptical galaxies feature more uniform light distributions, while spiral and irregular galaxies have complex, varied light profiles due to their structural heterogeneity and star-forming activity. In addition, galaxies with an active galactic nucleus (AGN) feature intense, concentrated emission from gas accretion around supermassive black holes, superimposed on regular galactic light, while quasi-stellar objects (QSO) are the extreme case of the AGN emission dominating the galaxy. The challenge of identifying AGN and QSO has been discussed many times in the literature, often requiring multi-wavelength observations. This paper introduces a novel approach to identify AGN and QSO from a single image. Diffusion models have been recently developed in the machine-learning literature to generate realistic-looking images of everyday objects. Utilising the spatial resolving power of the Euclid VIS images, we created a diffusion model trained on one million sources, without using any source pre-selection or labels. The model learns to reconstruct light distributions of normal galaxies, since the population is dominated by them. We condition the prediction of the central light distribution by masking the central few pixels of each source and reconstruct the light according to the diffusion model. We further use this prediction to identify sources that deviate from this profile by examining the reconstruction error of the few central pixels regenerated in each source's core. Our approach, solely using VIS imaging, features high completeness compared to traditional methods of AGN and QSO selection, including optical, near-infrared, mid-infrared, and X-rays.
Submitted 12 August, 2025; v1 submitted 19 March, 2025;
originally announced March 2025.
-
Approximate Evaluation Method for the Probability of the Union of Independent Events
Authors:
Edson Luiz Ursini,
Paulo S. Martins
Abstract:
The evaluation of the probability of union of a large number of independent events requires several combinations involving the factorial and the use of high performance computers with several hours of processing. Bounds and simplifications on the probability of the union are useful in the analysis of stochastic problems across various areas including (but not limited to) systems reliability, biological systems, real-time fault-tolerant systems, probability theory, information theory and communications. We propose an approximation to evaluate the probability of the union of several independent events that uses the arithmetic mean of the probability of all of them. The approximate results are very close to, but larger than the exact values. The method allows a much smaller number of operations with a similar result and more simplicity.
Submitted 5 March, 2025;
originally announced March 2025.
-
SigN: SIMBox Activity Detection Through Latency Anomalies at the Cellular Edge
Authors:
Anne Josiane Kouam,
Aline Carneiro Viana,
Philippe Martins,
Cedric Adjih,
Alain Tchana
Abstract:
Despite their widespread adoption, cellular networks face growing vulnerabilities due to their inherent complexity and the integration of advanced technologies. One of the major threats in this landscape is Voice over IP (VoIP) to GSM gateways, known as SIMBox devices. These devices use multiple SIM cards to route VoIP traffic through cellular networks, enabling international bypass fraud with losses of up to $3.11 billion annually. Beyond financial impact, SIMBox activity degrades network performance, threatens national security, and facilitates eavesdropping on communications. Existing detection methods for SIMBox activity are hindered by evolving fraud techniques and implementation complexities, limiting their practical adoption in operator networks. This paper addresses the limitations of current detection methods by introducing SigN, a novel approach to identifying SIMBox activity at the cellular edge. The proposed method focuses on detecting remote SIM card association, a technique used by SIMBox appliances to mimic human mobility patterns. The method detects latency anomalies between SIMBox and standard devices by analyzing cellular signaling during network attachment. Extensive indoor and outdoor experiments demonstrate that SIMBox devices generate significantly higher attachment latencies, particularly during the authentication phase, where latency is up to 23 times greater than that of standard devices. We attribute part of this overhead to immutable factors such as LTE authentication standards and Internet-based communication protocols. Therefore, our approach offers a robust, scalable, and practical solution to mitigate SIMBox activity risks at the network edge.
Submitted 3 February, 2025;
originally announced February 2025.
-
Thermodynamics-informed graph neural networks for real-time simulation of digital human twins
Authors:
Lucas Tesán,
David González,
Pedro Martins,
Elías Cueto
Abstract:
The growing importance of real-time simulation in the medical field has exposed the limitations and bottlenecks inherent in the digital representation of complex biological systems. This paper presents a novel methodology aimed at advancing current lines of research in soft tissue simulation. The proposed approach introduces a hybrid model that integrates the geometric bias of graph neural networks with the physical bias derived from the imposition of a metriplectic structure as soft and hard constraints in the architecture, being able to simulate hepatic tissue with dissipative properties. This approach provides an efficient solution capable of generating predictions at a high feedback rate while maintaining a remarkable generalization ability for previously unseen anatomies. This makes these features particularly relevant in the context of precision medicine and haptic rendering.
Based on the adopted methodologies, we propose a model that predicts human liver responses to traction and compression loads in as little as 7.3 milliseconds for optimized configurations and as fast as 1.65 milliseconds in the most efficient cases, all in the forward pass. The model achieves relative position errors below 0.15%, with stress tensor and velocity estimations maintaining relative errors under 7%. This demonstrates the robustness of the approach developed, which is capable of handling diverse load states and anatomies effectively. This work highlights the feasibility of integrating real-time simulation with patient-specific geometries through deep learning, paving the way for more robust digital human twins in medical applications.
Submitted 16 December, 2024;
originally announced December 2024.
-
3D Modelling to Address Pandemic Challenges: A Project-Based Learning Methodology
Authors:
Tânia Rocha,
Ana Ribeiro,
Joana Oliveira,
Ricardo Nunes,
Diana Carvalho,
Hugo Paredes,
Paulo Martins
Abstract:
The use of 3D modelling in medical education is a revolutionary tool during the learning process. In fact, this type of technology enables a more interactive teaching approach, making information retention more effective and enhancing students' understanding. 3D modelling allows for the creation of precise representations of the human body, as well as interaction with three-dimensional models, giving students a better spatial understanding of the different organs and systems and enabling simulations of surgical and technical procedures. This way, medical education is enriched with a more realistic and safe educational experience. The goal is to understand whether, when students and schools are challenged, they play an important role in addressing health issues in their community. School-led projects are directed towards educational scenarios that emphasize STEM education, tackling relevant public health problems through open-school initiatives. By implementing an educational scenario focused on 3D modelling and leveraging technology, we aim to raise community awareness on public health issues.
Submitted 13 November, 2024;
originally announced November 2024.
-
EuroLLM: Multilingual Language Models for Europe
Authors:
Pedro Henrique Martins,
Patrick Fernandes,
João Alves,
Nuno M. Guerreiro,
Ricardo Rei,
Duarte M. Alves,
José Pombal,
Amin Farajian,
Manuel Faysse,
Mateusz Klimaszewski,
Pierre Colombo,
Barry Haddow,
José G. C. de Souza,
Alexandra Birch,
André F. T. Martins
Abstract:
The quality of open-weight LLMs has seen significant improvement, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models: EuroLLM-1.7B and EuroLLM-1.7B-Instruct and report their performance on multilingual general benchmarks and machine translation.
Submitted 24 September, 2024;
originally announced September 2024.
-
Performance analysis of a RIS-assisted communications
Authors:
Hamza Adrat,
Laurent Decreusefond,
Philippe Martins
Abstract:
Reconfigurable Intelligent Surfaces (RIS) are currently considered for adoption in future 6G standards. ETSI and 3GPP have started feasibility and performance investigations of such a technology. This work proposes an analytical model to analyze RIS performance. It relies on a simple street model where obstacles and mobile units are all aligned. The RIS is positioned on a building parallel to the road. The coverage probability in the presence of obstacles and concurrent communications is then computed as a performance criterion.
Submitted 5 August, 2024;
originally announced August 2024.
-
Randomized heuristic repair for large-scale multidimensional knapsack problem
Authors:
Jean P. Martins
Abstract:
The multidimensional knapsack problem (MKP) is an NP-hard combinatorial optimization problem whose solution is determining a subset of maximum total profit items that do not violate capacity constraints. Due to its hardness, large-scale MKP instances are usually a target for metaheuristics, a context in which effective feasibility maintenance strategies are crucial. In 1998, Chu and Beasley proposed an effective heuristic repair that is still relevant for recent metaheuristics. However, due to its deterministic nature, the diversity of solutions this heuristic provides is insufficient for long runs. As a result, the search for new solutions ceases after a while. This paper proposes an efficiency-based randomization strategy for the heuristic repair that increases the variability of the repaired solutions without deteriorating quality and improves the overall results.
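To make the idea of a drop-then-add heuristic repair concrete, here is a minimal sketch. It is not the paper's method: the efficiency measure (profit divided by capacity-scaled total weight) is a simplification of the pseudo-utility used by Chu and Beasley, and the multiplicative `noise` perturbation is a hypothetical stand-in for an efficiency-based randomization. All names are illustrative, and every item is assumed to have a positive weight in at least one constraint.

```python
import random

def efficiencies(profits, weights, capacities):
    # Simplified efficiency (an assumption): profit / sum_i (w[i][j] / c[i]).
    return [p / sum(w[j] / c for w, c in zip(weights, capacities))
            for j, p in enumerate(profits)]

def repair(x, profits, weights, capacities, rng=None, noise=0.0):
    """Drop-then-add repair; weights[i][j] is the use of constraint i by item j."""
    n = len(profits)
    eff = efficiencies(profits, weights, capacities)
    if rng is not None and noise > 0:
        # Hypothetical randomization: perturb efficiencies before ordering.
        eff = [e * (1 + rng.uniform(-noise, noise)) for e in eff]
    order = sorted(range(n), key=lambda j: eff[j])  # least efficient first
    x = list(x)
    load = [sum(w[j] for j in range(n) if x[j]) for w in weights]

    # DROP phase: remove the least efficient items until all constraints hold.
    for j in order:
        if all(l <= c for l, c in zip(load, capacities)):
            break
        if x[j]:
            x[j] = 0
            load = [l - w[j] for l, w in zip(load, weights)]

    # ADD phase: greedily re-insert the most efficient items that still fit.
    for j in reversed(order):
        if not x[j] and all(l + w[j] <= c
                            for l, w, c in zip(load, weights, capacities)):
            x[j] = 1
            load = [l + w[j] for l, w in zip(load, weights)]
    return x
```

With `noise=0.0` the sketch is deterministic, so the same infeasible input always yields the same repaired solution. Passing a `random.Random` instance and a positive `noise` lets repeated calls return different feasible solutions, which is the kind of variability the abstract argues for.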
Submitted 24 May, 2024;
originally announced May 2024.
-
Closed-form congestion control via deep symbolic regression
Authors:
Jean Martins,
Igor Almeida,
Ricardo Souza,
Silvia Lins
Abstract:
As mobile networks embrace the 5G era, the interest in adopting Reinforcement Learning (RL) algorithms to handle challenges in ultra-low-latency and high throughput scenarios increases. Simultaneously, the advent of packetized fronthaul networks imposes demanding requirements that traditional congestion control mechanisms cannot accomplish, highlighting the potential of RL-based congestion control algorithms. Although learning RL policies optimized for satisfying the stringent fronthaul requirements is feasible, the adoption of neural network models in real deployments still poses some challenges regarding real-time inference and interpretability. This paper proposes a methodology to deal with such challenges while maintaining the performance and generalization capabilities provided by a baseline RL policy. The method consists of (1) training a congestion control policy specialized in fronthaul-like networks via reinforcement learning, (2) collecting state-action experiences from the baseline, and (3) performing deep symbolic regression on the collected dataset. The proposed process overcomes the challenges related to inference-time limitations through closed-form expressions that approximate the baseline performance (link utilization, delay, and fairness) and which can be directly implemented in any programming language. Finally, we analyze the inner workings of the closed-form expressions.
Submitted 28 March, 2024;
originally announced May 2024.
-
Solving the Multiobjective Quasi-Clique Problem
Authors:
Daniela Scherer dos Santos,
Kathrin Klamroth,
Pedro Martins,
Luís Paquete
Abstract:
Given a simple undirected graph $G$, a quasi-clique is a subgraph of $G$ whose density is at least $γ$ $(0 < γ\leq 1)$. Finding a maximum quasi-clique has been addressed from two different perspectives: $i)$ maximizing vertex cardinality for a given edge density; and $ii)$ maximizing edge density for a given vertex cardinality. However, when no a priori preference information about cardinality and density is available, a more natural approach is to consider the problem from a multiobjective perspective. We introduce the Multiobjective Quasi-clique Problem (MOQC), which aims to find a quasi-clique by simultaneously maximizing both vertex cardinality and edge density. To efficiently address this problem, we explore the relationship among MOQC, its single-objective counterpart problems, and a biobjective optimization problem, along with several properties of the MOQC problem and quasi-cliques. We propose a baseline approach using $\varepsilon$-constraint scalarization and introduce a Two-phase strategy, which applies a dichotomic search based on weighted sum scalarization in the first phase and an $\varepsilon$-constraint methodology in the second phase. Additionally, we present a Three-phase strategy that combines the dichotomic search used in Two-phase with a vertex-degree-based local search employing novel sufficient conditions to assess quasi-clique efficiency, followed by an $\varepsilon$-constraint in a final stage. Experimental results on real-world sparse graphs indicate that the integrated use of dichotomic search and local search, together with mechanisms to assess quasi-clique efficiency, makes the Three-phase strategy an effective approach for solving the MOQC problem in terms of running time and ability to produce new efficient quasi-cliques.
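The quasi-clique definition above can be checked directly. The sketch below assumes the usual edge-density definition for an induced subgraph on a vertex set $S$, namely $|E(S)| / \binom{|S|}{2}$, and treats a set with fewer than two vertices as having density 1 by convention. The function names are illustrative and do not come from the paper.

```python
from itertools import combinations

def edge_density(edges, vertices):
    """Density of the subgraph induced by `vertices`: |E(S)| / C(|S|, 2)."""
    s = set(vertices)
    n = len(s)
    if n < 2:
        return 1.0  # assumed convention: trivial subgraphs count as fully dense
    edge_set = {frozenset(e) for e in edges}
    induced = sum(1 for u, v in combinations(s, 2)
                  if frozenset((u, v)) in edge_set)
    return induced / (n * (n - 1) / 2)

def is_quasi_clique(edges, vertices, gamma):
    """True if the induced subgraph has density at least gamma (0 < gamma <= 1)."""
    return edge_density(edges, vertices) >= gamma
```

For a candidate vertex set, the pair (number of vertices, `edge_density`) gives the two objective values that the multiobjective formulation maximizes simultaneously.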
Submitted 16 March, 2024;
originally announced March 2024.
-
Ensuring connectedness for the Maximum Quasi-clique and Densest $k$-subgraph problems
Authors:
Daniela Scherer dos Santos,
Kathrin Klamroth,
Pedro Martins,
Luís Paquete
Abstract:
Given an undirected graph $G$, a quasi-clique is a subgraph of $G$ whose density is at least $γ$ $(0 < γ\leq 1)$. Two optimization problems can be defined for quasi-cliques: the Maximum Quasi-Clique (MQC) Problem, which finds a quasi-clique with maximum vertex cardinality, and the Densest $k$-Subgraph (DKS) Problem, which finds the densest subgraph given a fixed cardinality constraint. Most existing approaches to solve both problems often disregard the requirement of connectedness, which may lead to solutions containing isolated components that are meaningless for many real-life applications. To address this issue, we propose two flow-based connectedness constraints to be integrated into known Mixed-Integer Linear Programming (MILP) formulations for either MQC or DKS problems. We compare the performance of MILP formulations enhanced with our connectedness constraints in terms of both running time and number of solved instances against existing approaches that ensure quasi-clique connectedness. Experimental results demonstrate that our constraints are quite competitive, making them valuable for practical applications requiring connectedness.
Submitted 13 March, 2024;
originally announced March 2024.
-
Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
Authors:
Duarte M. Alves,
José Pombal,
Nuno M. Guerreiro,
Pedro H. Martins,
João Alves,
Amin Farajian,
Ben Peters,
Ricardo Rei,
Patrick Fernandes,
Sweta Agrawal,
Pierre Colombo,
José G. C. de Souza,
André F. T. Martins
Abstract:
While general-purpose large language models (LLMs) demonstrate proficiency on multiple tasks within the domain of translation, approaches based on open LLMs are competitive only when specializing on a single task. In this paper, we propose a recipe for tailoring LLMs to multiple tasks present in translation workflows. We perform continued pretraining on a multilingual mixture of monolingual and parallel data, creating TowerBase, followed by finetuning on instructions relevant for translation processes, creating TowerInstruct. Our final model surpasses open alternatives on several tasks relevant to translation workflows and is competitive with general-purpose closed LLMs. To facilitate future research, we release the Tower models, our specialization dataset, an evaluation framework for LLMs focusing on the translation ecosystem, and a collection of model generations, including ours, on our benchmark.
Submitted 27 February, 2024;
originally announced February 2024.
-
CroissantLLM: A Truly Bilingual French-English Language Model
Authors:
Manuel Faysse,
Patrick Fernandes,
Nuno M. Guerreiro,
António Loison,
Duarte M. Alves,
Caio Corro,
Nicolas Boizard,
João Alves,
Ricardo Rei,
Pedro H. Martins,
Antoni Bigata Casademunt,
François Yvon,
André F. T. Martins,
Gautier Viaud,
Céline Hudelot,
Pierre Colombo
Abstract:
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models and strong translation models. We evaluate our model through the FMTI framework, and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work in order to strengthen our understanding of multilinguality in language models.
Submitted 9 April, 2025; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Boosting Mixed-Initiative Co-Creativity in Game Design: A Tutorial
Authors:
Solange Margarido,
Licínio Roque,
Penousal Machado,
Pedro Martins
Abstract:
In recent years, there has been a growing application of mixed-initiative co-creative approaches in the creation of video games. The rapid advances in the capabilities of artificial intelligence (AI) systems further propel creative collaboration between humans and computational agents. In this tutorial, we present guidelines for researchers and practitioners to develop game design tools with a high degree of mixed-initiative co-creativity (MI-CCy). We begin by reviewing a selection of current works that will serve as case studies and categorize them by the type of game content they address. We introduce the MI-CCy Quantifier, a framework that can be used by researchers and developers to assess co-creative tools on their level of MI-CCy through a visual scheme of quantifiable criteria scales. We demonstrate the usage of the MI-CCy Quantifier by applying it to the selected works. This analysis enabled us to discern prevalent patterns within these tools, as well as features that contribute to a higher level of MI-CCy. We highlight current gaps in MI-CCy approaches within game design, which we propose as pivotal aspects to tackle in the development of forthcoming approaches.
Submitted 14 August, 2025; v1 submitted 11 January, 2024;
originally announced January 2024.
-
DeepContrast: Deep Tissue Contrast Enhancement using Synthetic Data Degradations and OOD Model Predictions
Authors:
Nuno Pimpão Martins,
Yannis Kalaidzidis,
Marino Zerial,
Florian Jug
Abstract:
Microscopy images are crucial for life science research, allowing detailed inspection and characterization of cellular and tissue-level structures and functions. However, microscopy data are unavoidably affected by image degradations such as noise and blur. Many such degradations also contribute to a loss of image contrast, which becomes especially pronounced in deeper regions of thick samples. Today, the best-performing methods to increase the quality of images are based on Deep Learning approaches, which typically require ground truth (GT) data during training. Our inability to counteract blurring and contrast loss when imaging deep into samples prevents the acquisition of such clean GT data. The fact that the forward process of blurring and contrast loss deep into tissue can be modeled allowed us to propose a new method that can circumvent the problem of unobtainable GT data. To this end, we first synthetically degraded the quality of microscopy images even further by using an approximate forward model for deep tissue image degradations. Then we trained a neural network that learned the inverse of this degradation function from our generated pairs of raw and degraded images. We demonstrated that networks trained in this way can be used out-of-distribution (OOD) to improve the quality of less severely degraded images, e.g. the raw data imaged in a microscope. Since the absolute level of degradation in such microscopy images can be stronger than the additional degradation introduced by our forward model, we also explored the effect of iterative predictions. Here, we observed that in each iteration the measured image contrast kept improving while detailed structures in the images were increasingly removed. Therefore, depending on the desired downstream analysis, a balance between contrast improvement and retention of image details has to be found.
Submitted 16 August, 2023;
originally announced August 2023.
-
Real-time Traffic Classification for 5G NSA Encrypted Data Flows With Physical Channel Records
Authors:
Xiao Fei,
Philippe Martins,
Jialiang Lu
Abstract:
The classification of fifth-generation New-Radio (5G-NR) mobile network traffic is an emerging topic in the field of telecommunications. It can be utilized for quality of service (QoS) management and dynamic resource allocation. However, traditional approaches such as Deep Packet Inspection (DPI) cannot be directly applied to encrypted data flows. Therefore, new real-time encrypted traffic classification algorithms need to be investigated to handle dynamic transmission. In this study, we examine the real-time encrypted 5G Non-Standalone (NSA) application-level traffic classification using physical channel records. Due to the vastness of their features, decision-tree-based gradient boosting algorithms are a viable approach for classification. We generate a noise-limited 5G NSA trace dataset with traffic from multiple applications. We develop a new pipeline to convert sequences of physical channel records into numerical vectors. A set of machine learning models is tested, and we propose our solution based on Light Gradient Boosting Machine (LGBM) due to its advantages in fast parallel training and low computational burden in practical scenarios. Our experiments demonstrate that our algorithm can achieve 95% accuracy on the classification task with a state-of-the-art response time as quick as 10 ms.
Submitted 15 July, 2023;
originally announced July 2023.
-
Challenges and Trends in User Trust Discourse in AI
Authors:
Sonia Sousa,
Jose Cravino,
Paulo Martins
Abstract:
The Internet revolution in 1990, followed by the data-driven and information revolution, has transformed the world as we know it. Nowadays, what seemed 10 to 20 years ago to be a science fiction idea (i.e., machines dominating the world) is seen as possible. This revolution also brought a need for new regulatory practices in which user trust and artificial intelligence (AI) discourse have a central role. This work aims to clarify some misconceptions about user trust in AI discourse and fight the tendency to design vulnerable interactions that lead to further breaches of trust, both real and perceived. Findings illustrate the lack of clarity in understanding user trust and its effects on computer science, especially in measuring user trust characteristics. It argues for clarifying those notions to avoid possible trust gaps and misinterpretations in AI adoption and appropriation.
Submitted 23 May, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Human-centered trust framework: An HCI perspective
Authors:
Sonia Sousa,
Jose Cravino,
Paulo Martins,
David Lamas
Abstract:
The rationale of this work is based on the current user trust discourse of Artificial Intelligence (AI). We aim to produce novel HCI approaches that use trust as a facilitator for the uptake (or appropriation) of current technologies. We propose a framework (HCTFrame) to guide non-experts to unlock the full potential of user trust in AI design. Results derived from a data triangulation of findings from three literature reviews demystify some misconceptions of user trust in computer science and AI discourse, and three case studies are conducted to assess the effectiveness of a psychometric scale in mapping potential users' trust breakdowns and concerns. This work primarily contributes to the fight against the tendency to design technical-centered vulnerable interactions, which can eventually lead to additional real and perceived breaches of trust. The proposed framework can be used to guide system designers on how to map and define user trust and the socioethical and organisational needs and characteristics of AI system design. It can also guide AI system designers on how to develop a prototype and operationalise a solution that meets user trust requirements. The article ends by providing some user research tools that can be employed to measure users' trust intentions and behaviours towards a proposed solution.
Submitted 15 May, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
-
Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
Authors:
Patrick Fernandes,
Aman Madaan,
Emmy Liu,
António Farinhas,
Pedro Henrique Martins,
Amanda Bertsch,
José G. C. de Souza,
Shuyan Zhou,
Tongshuang Wu,
Graham Neubig,
André F. T. Martins
Abstract:
Many recent advances in natural language generation have been fueled by training large language models on internet-scale data. However, this paradigm can lead to models that generate toxic, inaccurate, and unhelpful content, and automatic evaluation metrics often fail to identify these behaviors. As models become more capable, human feedback is an invaluable signal for evaluating and improving models. This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation. First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization. Next, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models. We also discuss existing datasets for human-feedback data collection, and concerns surrounding feedback collection. Finally, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
Submitted 31 May, 2023; v1 submitted 1 May, 2023;
originally announced May 2023.
-
Inferring Gene Regulatory Neural Networks for Bacterial Decision Making in Biofilms
Authors:
Samitha Somathilaka,
Daniel P. Martins,
Xu Li,
Yusong Li,
Sasitharan Balasubramaniam
Abstract:
Bacterial cells are sensitive to a range of external signals used to learn the environment. These incoming external signals are then processed using a Gene Regulatory Network (GRN), exhibiting similarities to modern computing algorithms. An in-depth analysis of gene expression dynamics suggests an inherited Gene Regulatory Neural Network (GRNN) behavior within the GRN that enables cellular decision-making based on received signals from the environment and neighbor cells. In this study, we extract a sub-network of the Pseudomonas aeruginosa GRN that is associated with one virulence factor, pyocyanin production, as a use case to investigate the GRNN behaviors. Further, using a Graph Neural Network (GNN) architecture, we model a single-species biofilm to reveal the role of GRNN dynamics in ecosystem-wide decision-making. Varying environmental conditions, we prove that the extracted GRNN computes input signals in a manner similar to the natural decision-making process of the cell. Identifying neural network behaviors in GRNs may lead to more accurate bacterial cell activity predictive models for many applications, including human health-related problems and agricultural applications. Further, this model can produce data on causal relationships throughout the network, enabling the possibility of designing tailor-made infection-controlling mechanisms. More interestingly, these GRNNs can perform computational tasks for bio-hybrid computing systems.
Submitted 10 January, 2023;
originally announced January 2023.
-
ESSYS* Sharing #UC: An Emotion-driven Audiovisual Installation
Authors:
Sérgio M. Rebelo,
Mariana Seiça,
Pedro Martins,
João Bicker,
Penousal Machado
Abstract:
We present ESSYS* Sharing #UC, an audiovisual installation artwork that reflects upon the emotional context related to the university and the city of Coimbra, based on the data shared about them on Twitter. The installation was presented in an urban art gallery of Círculo de Artes Plásticas de Coimbra during the summer and autumn of 2021. In the installation space, one may see a collection of typographic posters displaying the tweets and listen to an ever-changing ambient sound. The presented audiovisuals are created by an autonomous computational creative approach, which employs a neural classifier to recognize the emotional context of a tweet and uses the resulting data as feedstock for the audiovisual generation. The installation's space is designed to promote an approach and blend between the online and physical perceptions of the same location. We conducted multiple experiments with the proposed approach to evaluate its capability and performance. We also conducted interview-based evaluation sessions to understand how the installation elements, especially poster designs, are experienced by people regarding diversity, expressiveness and possible employment in other commercial and social scenarios.
Submitted 7 September, 2022;
originally announced September 2022.
-
Efficient Methods for Natural Language Processing: A Survey
Authors:
Marcos Treviso,
Ji-Ung Lee,
Tianchu Ji,
Betty van Aken,
Qingqing Cao,
Manuel R. Ciosici,
Michael Hassid,
Kenneth Heafield,
Sara Hooker,
Colin Raffel,
Pedro H. Martins,
André F. T. Martins,
Jessica Zosa Forde,
Peter Milder,
Edwin Simpson,
Noam Slonim,
Jesse Dodge,
Emma Strubell,
Niranjan Balasubramanian,
Leon Derczynski,
Iryna Gurevych,
Roy Schwartz
Abstract:
Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.
Submitted 24 March, 2023; v1 submitted 31 August, 2022;
originally announced September 2022.
-
Multi-agent reinforcement learning for intent-based service assurance in cellular networks
Authors:
Satheesh K. Perepu,
Jean P. Martins,
Ricardo Souza S,
Kaushik Dey
Abstract:
Recently, intent-based management has received considerable attention in telecom networks owing to stringent performance requirements for many of the use cases. Several approaches in the literature employ traditional closed-loop driven methods to fulfill the intents on the KPIs. However, these methods treat every closed loop as independent of the others, which degrades the combined performance. Also, such existing methods are not easily scalable. Multi-agent reinforcement learning (MARL) techniques have shown significant promise in many areas in which traditional closed-loop control falls short, typically for complex coordination and conflict management among loops. In this work, we propose a method based on MARL to achieve intent-based management without the need for knowing a model of the underlying system. Moreover, when there are conflicting intents, the MARL agents can implicitly incentivize the loops to cooperate and promote trade-offs, without human interaction, by prioritizing the important KPIs. Experiments have been performed on a network emulator for optimizing KPIs of three services. Results obtained demonstrate that the proposed system performs quite well and is able to fulfill all existing intents when there are enough resources or prioritize the KPIs when resources are scarce.
Submitted 26 August, 2022; v1 submitted 7 August, 2022;
originally announced August 2022.
-
Optimal Decision Diagrams for Classification
Authors:
Alexandre M. Florio,
Pedro Martins,
Maximilian Schiffer,
Thiago Serra,
Thibaut Vidal
Abstract:
Decision diagrams for classification have some notable advantages over decision trees, as their internal connections can be determined at training time and their width is not bound to grow exponentially with their depth. Accordingly, decision diagrams are usually less prone to data fragmentation in internal nodes. However, the inherent complexity of training these classifiers acted as a long-standing barrier to their widespread adoption. In this context, we study the training of optimal decision diagrams (ODDs) from a mathematical programming perspective. We introduce a novel mixed-integer linear programming model for training and demonstrate its applicability for many datasets of practical importance. Further, we show how this model can be easily extended for fairness, parsimony, and stability notions. We present numerical analyses showing that our model allows training ODDs in short computational times, and that ODDs achieve better accuracy than optimal decision trees, while allowing for improved stability without significant accuracy losses.
Submitted 28 May, 2022;
originally announced May 2022.
-
Chunk-based Nearest Neighbor Machine Translation
Authors:
Pedro Henrique Martins,
Zita Marinho,
André F. T. Martins
Abstract:
Semi-parametric models, which augment generation with retrieval, have led to impressive results in language modeling and machine translation, due to their ability to retrieve fine-grained information from a datastore of examples. One of the most prominent approaches, $k$NN-MT, exhibits strong domain adaptation capabilities by retrieving tokens from domain-specific datastores \citep{khandelwal2020nearest}. However, $k$NN-MT requires an expensive retrieval operation for every single generated token, leading to a very low decoding speed (around 8 times slower than a parametric model). In this paper, we introduce a chunk-based $k$NN-MT model which retrieves chunks of tokens from the datastore, instead of a single token. We propose several strategies for incorporating the retrieved chunks into the generation process, and for selecting the steps at which the model needs to search for neighbors in the datastore. Experiments on machine translation in two settings, static and "on-the-fly" domain adaptation, show that the chunk-based $k$NN-MT model leads to significant speed-ups (up to 4 times) with only a small drop in translation quality.
Submitted 7 November, 2022; v1 submitted 24 May, 2022;
originally announced May 2022.
-
Efficient Machine Translation Domain Adaptation
Authors:
Pedro Henrique Martins,
Zita Marinho,
André F. T. Martins
Abstract:
Machine translation models struggle when translating out-of-domain text, which makes domain adaptation a topic of critical importance. However, most domain adaptation methods focus on fine-tuning or training the entire or part of the model on every new domain, which can be costly. On the other hand, semi-parametric models have been shown to successfully perform domain adaptation by retrieving examples from an in-domain datastore (Khandelwal et al., 2021). A drawback of these retrieval-augmented models, however, is that they tend to be substantially slower. In this paper, we explore several approaches to speed up nearest neighbor machine translation. We adapt the methods recently proposed by He et al. (2021) for language modeling, and introduce a simple but effective caching strategy that avoids performing retrieval when similar contexts have been seen before. Translation quality and runtimes for several domains show the effectiveness of the proposed solutions.
Submitted 26 April, 2022;
originally announced April 2022.
-
Supporting Schema References in Keyword Queries over Relational Databases
Authors:
Paulo Martins,
Altigran da Silva,
João Cavalcanti,
Edleno de Moura
Abstract:
Relational Keyword Search (R-KwS) systems enable naive/informal users to explore and retrieve information from relational databases without knowing schema details or query languages. These systems take the keywords from the input query, locate the elements of the target database that correspond to these keywords, and look for ways to "connect" these elements using information on referential integrity constraints, i.e., key/foreign key pairs. Although several such systems have been proposed in the literature, most of them only support queries whose keywords refer to the contents of the target database and just very few support queries in which keywords refer to elements of the database schema. This paper proposes LATHE, a novel R-KwS designed to support such queries. To this end, in our work, we first generalize the well-known concepts of Query Matches (QMs) and Candidate Joining Networks (CJNs) to handle keywords referring to schema elements and propose new algorithms to generate them. Then, we introduce an approach to automatically select the CJNs that are more likely to represent the user intent when issuing a keyword query. This approach includes two major innovations: a ranking algorithm for selecting better QMs, yielding the generation of fewer but better CJNs, and an eager evaluation strategy for pruning void useless CJNs. We present a comprehensive set of experiments performed with query sets and datasets previously used in experiments with state-of-the-art R-KwS systems and methods. Our results indicate that LATHE can handle a wider variety of keyword queries while remaining highly effective, even for large databases with intricate schemas.
Submitted 11 March, 2022;
originally announced March 2022.
-
$\infty$-former: Infinite Memory Transformer
Authors:
Pedro Henrique Martins,
Zita Marinho,
André F. T. Martins
Abstract:
Transformers are unable to model long-term memories effectively, since the amount of computation they need to perform grows with the context length. While variations of efficient transformers have been proposed, they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the $\infty$-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the $\infty$-former's attention complexity becomes independent of the context length, trading off memory length with precision. In order to control where precision is more important, $\infty$-former maintains "sticky memories" being able to model arbitrarily long contexts while keeping the computation budget fixed. Experiments on a synthetic sorting task, language modeling, and document grounded dialogue generation demonstrate the $\infty$-former's ability to retain information from long sequences.
Submitted 25 March, 2022; v1 submitted 1 September, 2021;
originally announced September 2021.
-
Applying Intelligent Reflector Surfaces for Detecting Violent Expiratory Aerosol Cloud using Terahertz Signals
Authors:
Harun Šiljak,
Michael Taynnan Barros,
Nathan D'Arcy,
Daniel Perez Martins,
Nicola Marchetti,
Sasitharan Balasubramaniam
Abstract:
The recent COVID-19 pandemic has driven researchers from across the spectrum to develop novel solutions that can improve detection and understanding of the SARS-CoV-2 virus. In this article we propose the use of Intelligent Reflector Surfaces (IRS) emitting terahertz signals to detect airborne respiratory aerosol clouds that are emitted by people. Our proposed approach makes use of future IRS infrastructure to extend beyond communication functionality by adding environmental scanning for aerosol clouds. Simulations have also been conducted to analyze the accuracy of aerosol cloud detection based on a signal scanning and path optimization algorithm. Utilizing IRS for detecting respiratory aerosol clouds can add new value to telecommunication infrastructures by providing sensor monitoring data that can be used for public health.
Submitted 29 July, 2022; v1 submitted 17 August, 2021;
originally announced August 2021.
-
A Graph-based Molecular Communications Model Analysis of the Human Gut Bacteriome
Authors:
Samitha Somathilaka,
Daniel P. Martins,
Wiley Barton,
Orla O'Sullivan,
Paul D. Cotter,
Sasitharan Balasubramaniam
Abstract:
Alterations in the human gut bacteriome can be associated with human health issues, such as type-2 diabetes and cardiovascular disease. Both external and internal factors can drive changes in the composition and in the interactions of the human gut bacteriome, impacting negatively on the host cells. In this paper, we focus on the human gut bacteriome metabolism and we propose a two-layer network system to investigate its dynamics. Furthermore, we develop an in-silico simulation model (virtual GB), allowing us to study the impact of the metabolite exchange through molecular communications in the human gut bacteriome network system. Our results show that the regulation of molecular inputs can strongly affect bacterial population growth and create an unbalanced network, as shown by the shift in the node weights based on the molecular signals that are produced. Additionally, we show that the metabolite molecular communication production is greatly affected when directly manipulating the composition of the human gut bacteriome network in the virtual GB. These results indicate that our human GB interaction model can help to identify hidden behaviors of the human gut bacteriome depending on the molecular signal interactions. Moreover, the virtual GB can support the research and development of novel medical treatments based on the accurate control of bacterial growth and exchange of metabolites.
Submitted 16 July, 2021;
originally announced July 2021.
-
A Review on Bio-Cyber Interfaces for Intrabody Molecular Communications Systems
Authors:
Yevgeni Koucheryavy,
Anastasia Yastrebova,
Daniel P. Martins,
Sasitharan Balasubramaniam
Abstract:
The recent advancements in bio-engineering and wireless communications systems have motivated researchers to propose novel applications for telemedicine, therapeutics and human health monitoring. For instance, through wireless medical telemetry a healthcare worker can remotely measure biological signals and control certain processes in the organism required for the maintenance of the patient's health state. This technology can be further extended to use Bio-Nano devices to enable real-time monitoring of human health and storage of the gathered data in the cloud. This brings new challenges and opportunities for the development of biosensing networks, which will depend on extending the functionalities of current intrabody devices. In this paper we cover the recent progress made on implantable micro-scale devices and introduce the prospect of improving them to foster the development of new theranostics based on data collected at the nanoscale level.
Submitted 30 April, 2021;
originally announced April 2021.
-
Microfluidic-based Bacterial Molecular Computing on a Chip
Authors:
Daniel P. Martins,
Michael Taynnan Barros,
Benjamin O'Sullivan,
Ian Seymour,
Alan O'Riordan,
Lee Coffey,
Joseph Sweeney,
Sasitharan Balasubramaniam
Abstract:
Biocomputing systems based on engineered bacteria can lead to novel tools for environmental monitoring and detection of metabolic diseases. In this paper, we propose a Bacterial Molecular Computing on a Chip (BMCoC) using microfluidic and electrochemical sensing technologies. The computing can be flexibly integrated into the chip, but we focus on an engineered bacterial AND Boolean logic gate and ON-OFF switch sensors that produce secondary signals to change the pH and dissolved oxygen concentrations. We present a prototype with experimental results showing that the electrochemical sensors can detect small pH and dissolved oxygen concentration changes created by the engineered bacterial populations' molecular signals. Additionally, we present a theoretical model analysis of the BMCoC computation reliability when subjected to unwanted effects, i.e., molecular signal delays and noise, and electrochemical sensor threshold settings that are based on either standard or blind detectors. Our numerical analysis found that variations in the production delay and the molecular output signal concentration can affect the computation reliability of the AND logic gate and ON-OFF switch. The molecular communications of synthetic engineered cells for logic gates integrated with sensing systems can lead to a new breed of biochips that can be used for numerous diagnostic applications.
Submitted 15 April, 2021;
originally announced April 2021.
-
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Authors:
Sebastian Gehrmann,
Tosin Adewumi,
Karmanya Aggarwal,
Pawan Sasanka Ammanamanchi,
Aremu Anuoluwapo,
Antoine Bosselut,
Khyathi Raghavi Chandu,
Miruna Clinciu,
Dipanjan Das,
Kaustubh D. Dhole,
Wanyu Du,
Esin Durmus,
Ondřej Dušek,
Chris Emezue,
Varun Gangal,
Cristina Garbacea,
Tatsunori Hashimoto,
Yufang Hou,
Yacine Jernite,
Harsh Jhamtani,
Yangfeng Ji,
Shailza Jolly,
Mihir Kale,
Dhruv Kumar,
Faisal Ladhak
, et al. (31 additional authors not shown)
Abstract:
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.
Submitted 1 April, 2021; v1 submitted 2 February, 2021;
originally announced February 2021.
-
Launchers and Targets in Social Networks
Authors:
Pedro Martins,
Filipa Alarcão Martins
Abstract:
Influence propagation in social networks is a subject of growing interest. A relevant issue in those networks involves the identification of key influencers. These players have an important role in viral marketing strategies and message propagation, including political propaganda and fake news. In effect, an important way to fight malicious usage of social networks is to understand their properties, their structure and the way messages propagate.
This paper proposes two new indices for analysing message propagation in social networks, based on the network topological nature and the power of the message. The first index involves the strength of each node as a launcher of the message, dividing the nodes into launchers and non-launchers. The second index addresses the potential of each member as a receiver (target) of the message, dividing the nodes into targets and non-targets. Launcher individuals should indicate strong influencers and target individuals should identify the best target consumers. These indices can assist other known metrics when used to select efficient influencers in a social network. For instance, instead of choosing a strong and probably expensive member according to its degree in the network (number of followers), we may first select those belonging to the launchers group and look for the lowest-degree members, which are probably cheaper but still guarantee almost the same influence effectiveness as the largest-degree members.
In a different direction, using the second index, the strong target members should characterize relevant consumers of information in the network, which may include regular collectors of fake news.
We discuss these indices using small-world randomly generated graphs and a number of real-world social networks available in known dataset repositories.
Submitted 4 February, 2021; v1 submitted 27 January, 2021;
originally announced January 2021.
-
Evolving Intelligent Reflector Surface towards 6G for Public Health: Application in Airborne Virus Detection
Authors:
Harun Šiljak,
Nouman Ashraf,
Michael Taynnan Barros,
Daniel Perez Martins,
Bernard Butler,
Arman Farhang,
Nicola Marchetti,
Sasitharan Balasubramaniam
Abstract:
While metasurface-based intelligent reflecting surfaces (IRS) are an important emerging technology for future generations of wireless connectivity in their own right, the plans for the mass deployment of these surfaces motivate the question of their integration with other new and emerging technologies that would require mass proliferation. This question of integration and the vision of future communication systems as an invaluable component for public health motivated our new concept of Intelligent Reflector-Viral Detectors (IR-VD). In this novel scheme, we propose deployment of intelligent reflectors with strips of receptor-based viral detectors placed between the reflective surface tiles. Our proposed approach encodes information about the virus by flicking the angle of the reflected beams, using time variations between the beam deviations to represent the messages. This information includes the presence of the virus, its location and load size. The paper presents simulations to demonstrate the encoding process based on varying quantities of virus that have bound onto the IR-VD.
Submitted 4 September, 2020;
originally announced September 2020.
-
Sparse Text Generation
Authors:
Pedro Henrique Martins,
Zita Marinho,
André F. T. Martins
Abstract:
Current state-of-the-art text generators build on powerful language models such as GPT-2, achieving impressive performance. However, to avoid degenerate text, they require sampling from a modified softmax, via temperature parameters or ad-hoc truncation techniques, as in top-$k$ or nucleus sampling. This creates a mismatch between training and testing conditions. In this paper, we use the recently introduced entmax transformation to train and sample from a natively sparse language model, avoiding this mismatch. The result is a text generator with favorable performance in terms of fluency and consistency, fewer repetitions, and n-gram diversity closer to human text. In order to evaluate our model, we propose three new metrics for comparing sparse or truncated distributions: $ε$-perplexity, sparsemax score, and Jensen-Shannon divergence. Human-evaluated experiments in story completion and dialogue generation show that entmax sampling leads to more engaging and coherent stories and conversations.
Submitted 5 October, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
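The truncation techniques this abstract contrasts with entmax can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `softmax`, `top_k_truncate` and `nucleus_truncate` are hypothetical, and the code only shows the generic idea of rescaling a softmax with a temperature and then zeroing out low-probability tokens before renormalizing.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_truncate(probs, k):
    """Keep the k most probable entries, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def nucleus_truncate(probs, p):
    """Keep the smallest set of top entries whose mass reaches p, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.0, -1.0])
    print(top_k_truncate(probs, 2))
    print(nucleus_truncate(probs, 0.9))
```

Both truncations are applied only at sampling time, which is the train/test mismatch the abstract refers to: the model is trained against the full softmax but sampled from a modified distribution.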
-
Sparse and Structured Visual Attention
Authors:
Pedro Henrique Martins,
Vlad Niculae,
Zita Marinho,
André Martins
Abstract:
Visual attention mechanisms are widely used in multimodal tasks, such as visual question answering (VQA). One drawback of softmax-based attention mechanisms is that they assign some probability mass to all image regions, regardless of their adjacency structure and of their relevance to the text. In this paper, to better link the image structure with the text, we replace the traditional softmax attention mechanism with two alternative sparsity-promoting transformations: sparsemax, which is able to select only the relevant regions (assigning zero weight to the rest), and a newly proposed Total-Variation Sparse Attention (TVmax), which further encourages the joint selection of adjacent spatial locations. Experiments in VQA show gains in accuracy as well as higher similarity to human attention, which suggests better interpretability.
Submitted 8 July, 2021; v1 submitted 13 February, 2020;
originally announced February 2020.
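The sparsemax transformation named in this abstract can be sketched in a few lines of Python. This is the standard sort-and-threshold formulation (the Euclidean projection of the scores onto the probability simplex), not code from the paper; the function name `sparsemax` and the plain-list interface are assumptions made for illustration, and TVmax is not covered.

```python
def sparsemax(scores):
    """Project a score vector onto the probability simplex.

    Unlike softmax, entries far below the largest score receive exactly 0.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The support is a prefix of the sorted scores; keep the threshold
        # computed for the largest j that still satisfies the condition.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]

if __name__ == "__main__":
    # Scores for three hypothetical image regions.
    print(sparsemax([1.0, 0.8, 0.1]))
    print(sparsemax([2.0, 1.0, 0.1]))
```

When one score dominates, all the mass goes to that single region, which is the "assigning zero weight to the rest" behaviour the abstract describes.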
-
A Combined Stochastic and Physical Framework for Modeling Indoor 5G Millimeter Wave Propagation
Authors:
Georges Nassif,
Catherine Gloaguen,
Philippe Martins
Abstract:
Indoor coverage is a major challenge for 5G millimeter waves (mmWaves). In this paper, we address this problem through a novel theoretical framework that combines stochastic indoor environment modeling with advanced physical propagation simulation. This approach is particularly adapted to investigate indoor-to-indoor 5G mmWave propagation. Its system implementation, so-called iGeoStat, generates parameterized typical environments that account for the indoor spatial variations, then simulates radio propagation based on the physical interaction between electromagnetic waves and material properties. This framework is not dedicated to a particular environment, material, frequency or use case and aims to statistically understand the influence of indoor environment parameters on mmWave propagation properties, especially coverage and path loss. Its implementation raises numerous computational challenges that we solve by formulating an adapted link budget and designing new memory optimization algorithms. The first simulation results for two major 5G applications are validated with measurement data and show the efficiency of iGeoStat to simulate multiple diffusion in realistic environments, within a reasonable amount of time and memory resources. Generated output maps confirm that diffusion has a critical impact on indoor mmWave propagation and that proper physical modeling is of the utmost importance to generate relevant propagation models.
Submitted 17 February, 2020; v1 submitted 12 February, 2020;
originally announced February 2020.
-
Automatic detection of estuarine dolphin whistles in spectrogram images
Authors:
O. M. Serra,
F. P. R. Martins,
L. R. Padovese
Abstract:
An algorithm for detecting tonal vocalizations from estuarine dolphin (Sotalia guianensis) specimens without interference of a human operator is developed. The raw audio data collected from a passive monitoring sensor in the Cananéia underwater soundscape is converted to spectrogram images, containing the desired acoustic event (whistle) as a linear pattern in the images. Detection is a four-step method: first, ridge maps are obtained from the spectrogram images; second, a probabilistic Hough transform algorithm is applied to detect roughly linear ridges, which are adjusted to the true corresponding shape of the whistles via an active contour algorithm; third, feature vectors are built from the geometry of each detected curve; and fourth, the detections are fed to a random forest classifier to parse out false positives. We develop a system capable of reliably classifying roughly 97% of the characteristic patterns detected as Sotalia guianensis whistles or random empty detections.
Submitted 10 September, 2019;
originally announced September 2019.
-
Joint Learning of Named Entity Recognition and Entity Linking
Authors:
Pedro Henrique Martins,
Zita Marinho,
André F. T. Martins
Abstract:
Named entity recognition (NER) and entity linking (EL) are two fundamentally related tasks, since in order to perform EL, the mentions of entities first have to be detected. However, most entity linking approaches disregard the mention detection part, assuming that the correct mentions have been previously detected. In this paper, we perform joint learning of NER and EL to leverage their relatedness and obtain a more robust and generalisable system. For that, we introduce a model inspired by the Stack-LSTM approach (Dyer et al., 2015). We observe that multi-task learning of NER and EL improves the performance in both tasks compared with models trained with individual objectives. Furthermore, we achieve results competitive with the state-of-the-art in both NER and EL.
Submitted 18 July, 2019;
originally announced July 2019.
-
Computing the $k$-coverage of a wireless network
Authors:
Anaïs Vergne,
Laurent Decreusefond,
Philippe Martins
Abstract:
Coverage is one of the main quality-of-service metrics of a wireless network. $k$-coverage, that is, being covered simultaneously by $k$ network nodes, is synonymous with reliability and with numerous applications such as multiple-site MIMO features or handovers. We introduce here a new algorithm for computing the $k$-coverage of a wireless network. Our method is based on the observation that $k$-coverage can be interpreted as $k$ layers of $1$-coverage, or simply coverage. We use simplicial homology to compute the network's topology and a reduction algorithm to identify the layers of $1$-coverage. We provide figures and simulation results to illustrate our algorithm.
Submitted 29 December, 2018;
originally announced January 2019.
-
Towards Automating Precision Studies of Clone Detectors
Authors:
Vaibhav Saini,
Farima Farmahinifarahani,
Yadong Lu,
Di Yang,
Pedro Martins,
Hitesh Sajnani,
Pierre Baldi,
Cristina Lopes
Abstract:
Current research in clone detection suffers from poor ecosystems for evaluating precision of clone detection tools. Corpora of labeled clones are scarce and incomplete, making evaluation labor intensive and idiosyncratic, and limiting inter tool comparison. Precision-assessment tools are simply lacking. We present a semi-automated approach to facilitate precision studies of clone detection tools. The approach merges automatic mechanisms of clone classification with manual validation of clone pairs. We demonstrate that the proposed automatic approach has a very high precision and it significantly reduces the number of clone pairs that need human validation during precision experiments. Moreover, we aggregate the individual effort of multiple teams into a single evolving dataset of labeled clone pairs, creating an important asset for software clone research.
Submitted 13 December, 2018; v1 submitted 12 December, 2018;
originally announced December 2018.
-
Using Computer Vision Techniques for Moving Poster Design
Authors:
Sérgio Rebelo,
Pedro Martins,
João Bicker,
Penousal Machado
Abstract:
Graphic Design encompasses a wide range of activities from the design of traditional print media (e.g., books and posters) to site-specific (e.g., signage systems) and electronic media (e.g., interfaces). Its practice always explores the new possibilities of information and communication technologies. Therefore, interactivity and participation have become key features in the design process. Even in traditional print media, graphic designers are trying to enhance user experience and are exploring new interaction models. Moving posters are an example of this. This type of poster combines the specific features of the motion and print worlds in order to produce attractive forms of communication that explore and exploit the potential of digital screens. In our opinion, the next step towards the integration of moving posters with the surroundings where they operate is incorporating data from the environment, which also enables the seamless participation of the audience. As such, the adoption of computer vision techniques for moving poster design becomes a natural approach. Following this line of thought, we present a system wherein computer vision techniques are used to shape a moving poster. Although it is still a work in progress, the system is already able to sense the surrounding physical environment and translate the collected data into graphical information. The data is gathered from the environment in two ways: (1) directly using motion tracking; and (2) indirectly via contextual ambient data. In this sense, each user interaction with the system results in a different experience and in a unique poster design.
Submitted 27 November, 2018;
originally announced November 2018.
-
A deep learning approach for understanding natural language commands for mobile service robots
Authors:
Pedro Henrique Martins,
Luís Custódio,
Rodrigo Ventura
Abstract:
Using natural language to give instructions to robots is challenging, since natural language understanding is still largely an open problem. In this paper we address this problem by restricting our attention to commands modeled as one action, plus arguments (also known as slots). For action detection (also called intent detection) and slot filling, various architectures of Recurrent Neural Networks and Long Short-Term Memory (LSTM) networks were evaluated, with LSTMs achieving superior accuracy. As the action requested may not fall within the robot's capabilities, a Support Vector Machine (SVM) is used to determine whether it does or not. For the input of the neural networks, several word embedding algorithms were compared. Finally, to implement the system in a robot, a ROS package is created using a SMACH state machine. The proposed system is then evaluated both using well-known datasets and benchmarks in the context of domestic service robots.
Submitted 9 July, 2018;
originally announced July 2018.
-
Integrating Proactive Mode Changes in Mixed Criticality Systems
Authors:
Flavio R Massaro Jr.,
Paulo S. Martins,
Edson L. Ursini
Abstract:
In this work, we propose to integrate prediction algorithms into the scheduling of mode changes under Earliest-Deadline-First and Fixed-priority scheduling in mixed-criticality real-time systems. The method proactively schedules a mode change in the system based on state variables such as laxity, i.e., the temporal distance between the completion time of a task instance and its respective deadline, expressed as a percentage of the deadline (D) stipulated for the task, in order to minimize deadline misses. The simulation model was validated against an analytical model prior to the logical integration of the Kalman-based prediction algorithm. Two case studies are presented, one covering earliest-deadline-first and the other fixed-priority scheduling. The results show the gains from adopting the prediction approach for both scheduling paradigms, with a significant reduction in the number of missed deadlines for low-criticality tasks.
Submitted 29 June, 2018;
originally announced June 2018.
-
The Java Build Framework: Large Scale Compilation
Authors:
Pedro Martins,
Rohan Achar,
Cristina V. Lopes
Abstract:
Large repositories of source code for research tend to limit their utility to static analysis of the code, as they give no guarantees on whether the projects are compilable, much less runnable in any way. The immediate consequence of the lack of large compilable and runnable datasets is that research that requires such properties does not generalize beyond small benchmarks. We present the Java Build Framework, a method and tool capable of automatically compiling a large percentage of Java projects available in open source repositories like GitHub. Two elements are at the core: a very large repository of JAR files, and techniques of resolution of compilation faults and dependencies.
Submitted 12 April, 2018;
originally announced April 2018.