-
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
Authors:
Zhilin Wang,
Jiaqi Zeng,
Olivier Delalleau,
Hoo-Chang Shin,
Felipe Soares,
Alexander Bukharin,
Ellie Evans,
Yi Dong,
Oleksii Kuchaiev
Abstract:
Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We demonstrate that HelpSteer3-Preference can also be applied to train Generative RMs, and show how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): https://huggingface.co/datasets/nvidia/HelpSteer3#preference
Submitted 16 May, 2025;
originally announced May 2025.
-
Exploring a Large Language Model for Transforming Taxonomic Data into OWL: Lessons Learned and Implications for Ontology Development
Authors:
Filipi Miranda Soares,
Antonio Mauro Saraiva,
Luís Ferreira Pires,
Luiz Olavo Bonino da Silva Santos,
Dilvan de Abreu Moreira,
Fernando Elias Corrêa,
Kelly Rosa Braghetto,
Debora Pignatari Drucker,
Alexandre Cláudio Botazzo Delbem
Abstract:
Managing scientific names in ontologies that represent species taxonomies is challenging due to the ever-evolving nature of these taxonomies. Manually maintaining these names becomes increasingly difficult when dealing with thousands of scientific names. To address this issue, this paper investigates the use of ChatGPT-4 to automate the development of the :Organism module in the Agricultural Product Types Ontology (APTO) for species classification. Our methodology involved leveraging ChatGPT-4 to extract data from the GBIF Backbone API and generate OWL files for further integration in APTO. Two alternative approaches were explored: (1) issuing a series of prompts for ChatGPT-4 to execute tasks via the BrowserOP plugin and (2) directing ChatGPT-4 to design a Python algorithm to perform analogous tasks. Both approaches rely on a prompting method in which we provide instructions, context, input data, and an output indicator. The first approach showed scalability limitations, while the second approach used the Python algorithm to overcome these challenges, but it struggled with typographical errors in data handling. This study highlights the potential of large language models like ChatGPT-4 to streamline the management of species names in ontologies. Despite certain limitations, these tools offer promising advancements in automating taxonomy-related tasks and improving the efficiency of ontology development.
Submitted 25 April, 2025;
originally announced April 2025.
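The taxonomy-to-OWL step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' algorithm: the base IRI, the sample record, and its field names are hypothetical, and the record is hard-coded rather than fetched from the GBIF Backbone API. The sketch only shows how a chain of taxonomic ranks might be turned into nested OWL classes (Turtle syntax) under an :Organism root.

```python
# Hypothetical base IRI; the real APTO namespace is not given in the abstract.
BASE_IRI = "http://example.org/apto#"

# Ranks ordered from most general to most specific.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

# Illustrative record only; the field names are an assumption and the data is
# hard-coded so the sketch runs without network access.
SAMPLE_RECORD = {
    "kingdom": "Plantae",
    "family": "Poaceae",
    "genus": "Zea",
    "species": "Zea mays",
}


def local_name(name):
    """Turn a scientific name into a simple local identifier."""
    return name.strip().replace(" ", "_")


def record_to_turtle(record):
    """Emit one OWL class per available rank, each a subclass of the rank above."""
    lines = [
        "@prefix : <%s> ." % BASE_IRI,
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
        ":Organism a owl:Class .",
    ]
    parent = "Organism"
    for rank in RANKS:
        name = record.get(rank)
        if not name:
            continue  # skip ranks missing from the record
        child = local_name(name)
        lines.append(":%s a owl:Class ;" % child)
        lines.append("    rdfs:subClassOf :%s ;" % parent)
        lines.append('    rdfs:label "%s" .' % name)
        parent = child
    return "\n".join(lines)


if __name__ == "__main__":
    print(record_to_turtle(SAMPLE_RECORD))
```

Skipping missing ranks keeps the subclass chain connected even when a record is incomplete, which is one simple way to tolerate gaps in the input data.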
-
HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks
Authors:
Zhilin Wang,
Jiaqi Zeng,
Olivier Delalleau,
Daniel Egert,
Ellie Evans,
Hoo-Chang Shin,
Felipe Soares,
Yi Dong,
Oleksii Kuchaiev
Abstract:
Inference-Time Scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for inference-time scaling require tasks to have answers that can be verified, limiting their application to domains such as math, coding and logical reasoning. We take inspiration from how humans make first attempts, ask for detailed feedback from others and make improvements based on such feedback across a wide spectrum of open-ended endeavors. To this end, we collect HelpSteer3 data to train dedicated Feedback and Edit Models that are capable of performing inference-time scaling for open-ended general-domain tasks. In our setup, one model generates an initial response, which is given feedback by a second model; that feedback is then used by a third model to edit the response. We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback and edited responses. When scaled optimally, our setup based on 70B models from the Llama 3 family can reach SoTA performance on Arena Hard at 92.7 as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 with 90.4 and DeepSeek R1 with 92.3.
Submitted 30 May, 2025; v1 submitted 6 March, 2025;
originally announced March 2025.
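The three-model setup described in the abstract above (generate, give feedback, edit) can be sketched as a small control loop. This is only an illustration under stated assumptions: the function names, signatures, and toy stand-in models are hypothetical, and how the paper selects a final answer among edited candidates is not covered by the abstract, so the sketch simply returns all candidates.

```python
from typing import Callable, List


def feedback_edit_pipeline(
    prompt: str,
    generate: Callable[[str, int], str],
    give_feedback: Callable[[str, str, int], str],
    edit: Callable[[str, str, List[str]], str],
    num_drafts: int = 2,
    num_feedbacks: int = 2,
) -> List[str]:
    """Produce one edited response per initial draft.

    Scaling num_drafts and num_feedbacks mirrors the idea of spending more
    inference-time compute on drafts and feedback.
    """
    edited = []
    for draft_id in range(num_drafts):
        draft = generate(prompt, draft_id)
        feedbacks = [
            give_feedback(prompt, draft, fb_id) for fb_id in range(num_feedbacks)
        ]
        edited.append(edit(prompt, draft, feedbacks))
    return edited


# Toy stand-ins for the three models (hypothetical, for demonstration only).
def toy_generate(prompt: str, seed: int) -> str:
    return "draft%d for %s" % (seed, prompt)


def toy_feedback(prompt: str, response: str, seed: int) -> str:
    return "feedback%d on %s" % (seed, response)


def toy_edit(prompt: str, response: str, feedbacks: List[str]) -> str:
    return "%s [revised using %d feedback items]" % (response, len(feedbacks))


if __name__ == "__main__":
    for candidate in feedback_edit_pipeline(
        "q", toy_generate, toy_feedback, toy_edit
    ):
        print(candidate)
```

Passing the three models in as callables keeps the loop independent of any particular inference library.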
-
The nonlinear dynamics of a cantilever beam subject to axial flow in a tapered passage
Authors:
Filipe Soares,
José Antunes,
Christophe Vergez,
Vincent Debut,
Bruno Cochelin,
Fabrice Silva
Abstract:
A cantilever beam under axial flow, confined or not, is known to develop self-sustained oscillations at sufficiently large flow velocities. In recent decades, the analysis of this archetypal system has been mostly pursued under linearized conditions, to calculate the critical boundaries separating stable from unstable behavior. However, nonlinear analyses of the self-sustained oscillations ensuing from flutter instabilities are considerably rarer. Here we present a simplified one-dimensional nonlinear model describing a cantilever beam subjected to confined axial flow, for generic axial profiles of the fluid channels. In particular, we explore how the shape of the confinement walls affects the dynamics of the system. To simplify the problem, we consider symmetric channels with plane walls in either converging or diverging configurations. The beam is modeled in a modal framework, while bulk-flow equations, including singular head-loss terms, are used to model the flow-structure coupling forces. The dynamics of the system are first analyzed through linear stability analysis to assess the stabilizing/destabilizing effects of the channel wall configuration. Subsequently, we develop a systematic nonlinear analysis based on the continuation of periodic solutions. The harmonic balance method is used in conjunction with the asymptotic numerical method to calculate branches of periodic solutions. The continuation-based methods are used to investigate bifurcations with respect to both the reduced flow velocity and the channel slope parameter. From the results presented, we illustrate how continuation-based approaches and bifurcation analysis provide an efficient tool to analyze the nonlinear behavior of flow-induced vibration problems, particularly when reduced/simplified models are available.
Submitted 25 September, 2024;
originally announced October 2024.
-
Nemotron-4 340B Technical Report
Authors:
Nvidia,
Bo Adler,
Niket Agarwal,
Ashwath Aithal,
Dong H. Anh,
Pallab Bhattacharya,
Annika Brundyn,
Jared Casper,
Bryan Catanzaro,
Sharon Clay,
Jonathan Cohen,
Sirshak Das,
Ayush Dattagupta,
Olivier Delalleau,
Leon Derczynski,
Yi Dong,
Daniel Egert,
Ellie Evans,
Aleksander Ficek,
Denys Fridman,
Shaona Ghosh,
Boris Ginsburg,
Igor Gitman,
Tomasz Grzegorzek
, et al. (58 additional authors not shown)
Abstract:
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and their outputs. These models perform competitively with open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
Submitted 6 August, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
On the radiation from unbaffled pistons and their dipole equivalent
Authors:
Filipe Soares,
Vincent Debut
Abstract:
The radiation efficiency from simple vibrating planar surfaces is often used as a basis to describe the sound radiation from more complex structures, having important applications in various fields of acoustics. The low-frequency radiation efficiency of a baffled piston can easily be represented by a simple monopole source. Notably, the equivalent source strength is dependent on the piston surface area. However, the unbaffled case presents additional difficulties as the so-called "edge effects" significantly alter the piston radiation impedance. Consequently, a low-frequency equivalence between dipoles and unbaffled pistons is not as straightforward, since not only the piston area but also its shape will have an effect on the radiated sound. In this work, the search for a simple and generic equivalence between dipoles and unbaffled pistons is pursued. A finite element model is used to calculate the radiation efficiency from unbaffled pistons with the same surface area but different shapes. A broad set of results indicates that the "edge effects" can be accurately represented by a simple term dependent on the piston compactness (ratio of area to perimeter). Effectively, pistons with a smaller area-to-perimeter ratio will be less efficient radiators. Such a term allows the definition of an equivalent dipole source strength that approximates the low-frequency behavior of an unbaffled piston of arbitrary shape.
Submitted 20 March, 2024;
originally announced March 2024.
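To make the compactness measure in the abstract above concrete, the following minimal sketch computes the area-to-perimeter ratio for a few shapes of equal area. It does not reproduce the paper's edge-effect term or any acoustic quantity; the shape helpers and the chosen aspect ratio are illustrative assumptions.

```python
import math


def compactness(area, perimeter):
    """Compactness as defined in the abstract: ratio of area to perimeter."""
    return area / perimeter


def circle(area):
    """Return (area, perimeter) of a circle with the given area."""
    radius = math.sqrt(area / math.pi)
    return area, 2.0 * math.pi * radius


def rectangle(area, aspect):
    """Return (area, perimeter) of a rectangle; aspect = long side / short side."""
    short = math.sqrt(area / aspect)
    long_side = aspect * short
    return area, 2.0 * (short + long_side)


if __name__ == "__main__":
    area = 1.0  # same surface area for every shape
    shapes = {
        "circle": circle(area),
        "square": rectangle(area, 1.0),
        "4:1 rectangle": rectangle(area, 4.0),
    }
    for name, (a, p) in shapes.items():
        print(name, round(compactness(a, p), 4))
```

For a fixed area, the circle has the largest ratio and elongated rectangles have smaller ones, so under the abstract's finding the elongated shapes would be the less efficient radiators.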
-
Perceived Vulnerability to Disease Scale: Factorial structure, reliability, and validity in times of Portugal's COVID-19 pandemic lockdown
Authors:
Ana Paula Martins,
María C. Vega-Hernández,
Francisca Ribeiro Soares,
Rosa Marina Afonso
Abstract:
The present study examines the factor structure of a Portuguese version of the Perceived Vulnerability to Disease Scale (PVD), designed to assess individual differences in chronic concerns about transmission of infectious diseases. Method: Data were collected from a Portuguese convenience sample (n=1203) during the first COVID-19 pandemic lockdown. Results: The scale revealed, through an exploratory factor analysis (EFA) and a confirmatory factor analysis (CFA), a slight superiority of a three-factor model over the existing two-factor models of the 15-item original PVD and of the 10-item PVD established with another Portuguese sample (Ferreira et al., 2022). Conclusions: This higher level of differentiation in terms of a perceived resistance to infectious diseases could be explained by the pandemic context, which may have differentiated the responses regarding the perception of Resistance. On the other hand, this new factor increases the comprehensive and evaluative dimension and implications of the construct assessed by PVD.
Submitted 5 February, 2024;
originally announced February 2024.
-
Apodized Slanted Grating Couplers for LiDAR Applications
Authors:
Vahram Voskerchyan,
Francis,
Tian,
Francisco M. Soares,
David Alvarez Outerelo,
Francisco J. Diaz-Otero
Abstract:
Solid-state LiDAR systems traditionally rely on costly active components for efficient beam scanning. In this study, we propose a cost-effective, purely passive steering approach using apodized slanted grating couplers. Through apodization, we achieve a uniform upward emission profile and enhanced upward transmission. Theoretical calculations indicate successful steering of $91.5^\circ \times 42.8^\circ$. Experimental results closely match theoretical predictions, validating the capabilities of our passive steering concept. Additionally, the grating couplers, with a length of 3 mm, enable a far-field FWHM of $0.026^\circ$, further enhancing sensing resolution.
Submitted 16 January, 2024;
originally announced January 2024.
-
Federated Self-Supervised Learning of Monocular Depth Estimators for Autonomous Vehicles
Authors:
Elton F. de S. Soares,
Carlos Alberto V. Campos
Abstract:
Image-based depth estimation has gained significant attention in recent research on computer vision for autonomous vehicles in intelligent transportation systems. This focus stems from its cost-effectiveness and wide range of potential applications. Unlike binocular depth estimation methods that require two fixed cameras, monocular depth estimation methods only rely on a single camera, making them highly versatile. While state-of-the-art approaches for this task leverage self-supervised learning of deep neural networks in conjunction with tasks like pose estimation and semantic segmentation, none of them have explored the combination of federated learning and self-supervision to train models using unlabeled and private data captured by autonomous vehicles. The utilization of federated learning offers notable benefits, including enhanced privacy protection, reduced network consumption, and improved resilience to connectivity issues. To address this gap, we propose FedSCDepth, a novel method that combines federated learning and deep self-supervision to enable the learning of monocular depth estimators with comparable effectiveness and superior efficiency compared to the current state-of-the-art methods. Our evaluation experiments conducted on Eigen's Split of the KITTI dataset demonstrate that our proposed method achieves near state-of-the-art performance, with a test loss below 0.13 and requiring, on average, only 1.5k training steps and up to 0.415 GB of weight data transfer per autonomous vehicle on each round.
Submitted 7 October, 2023;
originally announced October 2023.
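The server-side aggregation step of federated learning mentioned in the abstract above can be sketched as follows. This assumes plain federated averaging (FedAvg), a common aggregation rule; the abstract does not state which rule FedSCDepth uses, and the function name, the weight layout (a dict of flat float lists), and the example numbers are hypothetical.

```python
from typing import Dict, List


def federated_average(
    client_weights: List[Dict[str, List[float]]],
    client_sizes: List[int],
) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


if __name__ == "__main__":
    # Two hypothetical vehicles holding 1 and 3 local samples respectively.
    vehicles = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
    print(federated_average(vehicles, [1, 3]))
```

Only the parameters travel to the server in such a scheme, which is why the raw images captured by each vehicle can stay private.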
-
A Polystore Architecture Using Knowledge Graphs to Support Queries on Heterogeneous Data Stores
Authors:
Leonardo Guerreiro Azevedo,
Renan Francisco Santos Souza,
Elton F. de S. Soares,
Raphael M. Thiago,
Julio Cesar Cardoso Tesolin,
Ann C. Oliveira,
Marcio Ferreira Moreno
Abstract:
Modern applications commonly need to manage dataset types composed of heterogeneous data and schemas, making it difficult to access them in an integrated way. A single data store to manage heterogeneous data using a common data model is not effective in such a scenario, which results in the domain data being fragmented in the data stores that best fit their storage and access requirements (e.g., NoSQL, relational DBMS, or HDFS). Besides, organization workflows independently consume these fragments, and usually, there is no explicit link among the fragments that would be useful to support an integrated view. The research challenge tackled by this work is to provide the means to query heterogeneous data residing on distinct data repositories that are not explicitly connected. We propose a federated database architecture by providing a single abstract global conceptual schema to users, allowing them to write their queries, encapsulating data heterogeneity, location, and linkage by employing: (i) meta-models to represent the global conceptual schema, the remote data local conceptual schemas, and mappings among them; (ii) provenance to create explicit links among the consumed and generated data residing in separate datasets. We evaluated the architecture through its implementation as a polystore service, following a microservice architecture approach, in a scenario that simulates a real case in the Oil & Gas industry. Also, we compared the proposed architecture to a relational multidatabase system based on foreign data wrappers, measuring the user's cognitive load to write a query (or query complexity) and the query processing time. The results demonstrated that the proposed architecture allows query writing two times less complex than the one written for the relational multidatabase system, adding an excess of no more than 30% in query processing time.
Submitted 15 March, 2024; v1 submitted 7 August, 2023;
originally announced August 2023.
-
Boundary conditions in hydrodynamic simulations of isolated galaxies and their impact on the gas-loss processes
Authors:
Anderson Caproni,
Gustavo A. Lanfranchi,
Amâncio C. S. Friaça,
Jennifer F. Soares
Abstract:
Three-dimensional hydrodynamic simulations are commonly used to study the evolution of the gaseous content in isolated galaxies, as well as its connection with galactic star formation histories. Stellar winds, supernova blasts, and black hole feedback are mechanisms usually invoked to drive galactic outflows and decrease the initial galactic gas reservoir. However, any simulation imposes the need to choose the limits of the simulated volume, which depend, for instance, on the size of the galaxy and the required numerical resolution, as well as the computational capability available to perform it. In this work, we discuss the effects of boundary conditions on the evolution of the gas fraction in a small-sized galaxy (tidal radius of about 1 kpc), like classical spheroidal galaxies in the Local Group. We found that open boundaries with sizes smaller than approximately 10 times the characteristic radius of the galactic dark-matter halo become inappropriate for this kind of simulation after about 0.6 Gyr of evolution, since they act as an infinite reservoir of gas due to dark-matter gravity. We also tested two different boundary conditions that avoid gas accretion from numerical frontiers: closed and selective boundary conditions. Our results indicate that the latter condition (which uses a velocity threshold criterion to open or close frontiers) is preferable since it minimizes the number of reversed shocks due to closed boundaries. Although the strategy of putting computational frontiers as far as possible from the galaxy itself is always desirable, simulations with the selective boundary condition can lead to similar results at lower computational costs.
Submitted 9 February, 2023;
originally announced February 2023.
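The selective boundary condition mentioned in the abstract above opens or closes a frontier based on a velocity threshold. The sketch below shows one possible reading of that idea for a single boundary cell: the boundary is treated as open (outflow) only when gas moves outward faster than a threshold, and as closed (reflecting) otherwise, so no gas is accreted through it. The function name, the sign convention, and the ghost-cell treatment are assumptions; the abstract does not specify the actual criterion.

```python
def selective_boundary(v_normal, threshold):
    """Return (mode, ghost_velocity) for one boundary cell.

    v_normal  -- gas velocity normal to the boundary, positive pointing outward
    threshold -- assumed minimum outward speed needed to open the boundary
    """
    if v_normal > threshold:
        # Open: zero-gradient outflow, the ghost cell copies the interior value.
        return "open", v_normal
    # Closed: reflecting wall, the ghost cell mirrors the normal velocity.
    return "closed", -v_normal


if __name__ == "__main__":
    for v in (5.0, 0.5, -2.0):
        print(v, selective_boundary(v, threshold=1.0))
```

Inflowing gas (negative normal velocity) always meets a closed boundary in this reading, which is what prevents the frontier from acting as a gas reservoir.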
-
Monolithically Integrated Wavelength-meter in InP with measurement bandwidth of 100nm centered on the C band
Authors:
Andrea Volpini,
Damiano Massella,
David Alvarez-Outerelo,
Francisco Soares,
Francisco J. Diaz-Otero
Abstract:
In this paper, we explore the creation of a monolithically integrated wavelength meter in InP. Devices of this type are a key requirement for many applications, and it is especially important to have them integrated with active components like lasers and gain sections. We present a wavelength meter based on multiple ring resonators that has been realized in a commercial MPW run and tested using a tunable laser. The designed circuit is theoretically capable of resolution down to 1.6 pm and a measurement speed down to 500 ps within a wavelength range of 100 nm.
Submitted 9 January, 2023;
originally announced January 2023.
-
Dynamics of antiproton plasma in a time-dependent harmonic trap
Authors:
Luiz Gustavo F. Soares,
Fernando Haas
Abstract:
An antiproton plasma confined in a quasi-1D device is described in terms of a self-consistent fluid formulation using a variational approach. Unlike previous treatments, the use of the time-dependent variational method allows the thermal and Coulomb effects to be retained. A certain Ansatz is proposed for the number density and fluid velocity fields, which reduces the problem essentially to ordinary nonlinear differential equations. In adiabatic cooling, the frequency of the trap potential is slowly decreased. An adiabatic equation of state is assumed for closure. The numerical simulation of the nonlinear dynamics is performed for realistic parameters.
Submitted 21 June, 2021;
originally announced June 2021.
-
A Research Agenda on Pediatric Chest X-Ray: Is Deep Learning Still in Childhood?
Authors:
Afonso U. Fonseca,
Gabriel S. Vieira,
Fabrízzio A. A. M. N. Soares,
Renato F. Bulcão-Neto
Abstract:
Several reasons explain the significant role that chest X-rays play in supporting clinical analysis and early disease detection in pediatric patients, such as low cost, high resolution, low radiation levels, and high availability. In the last decade, Deep Learning (DL) has been given special attention from the computer-aided diagnosis research community, outperforming the state of the art of many techniques, including those applied to pediatric chest X-rays (PCXR). Due to this increasing interest, much high-quality secondary research has also arisen, overviewing machine learning and DL algorithms on medical imaging and PCXR, in particular. However, these secondary studies follow different guidelines, hampering their reproduction or improvement by third parties regarding the identified trends and gaps. This paper proposes a "deep radiography" of primary research on DL techniques applied in PCXR images. We elaborated a Systematic Literature Mapping (SLM) protocol, including automatic search on six sources for studies published from January 1, 2010, to May 20, 2020, and selection criteria applied to a hundred research papers. As a result, this paper categorizes twenty-six relevant studies and provides a research agenda highlighting limitations, gaps, and trends for further investigations on DL usage in PCXR images. Besides the fact that there is no systematic mapping study on this research topic, to the best of the authors' knowledge, this work organizes the process of finding and selecting relevant studies and data gathering and synthesis in a reproducible way.
Submitted 7 October, 2020; v1 submitted 20 July, 2020;
originally announced July 2020.
-
BULNER: BUg Localization with word embeddings and NEtwork Regularization
Authors:
Jacson Rodrigues Barbosa,
Ricardo Marcondes Marcacini,
Ricardo Britto,
Frederico Soares,
Solange Rezende,
Auri M. R. Vincenzi,
Marcio E. Delamaro
Abstract:
Bug localization (BL) from the bug report is a strategic activity in the software maintenance process. Because BL is a costly and tedious activity, information retrieval-based and machine learning-based BL techniques could aid software engineers. We propose a method for BUg Localization with word embeddings and Network Regularization (BULNER). The preliminary results suggest that BULNER performs better than two state-of-the-art methods.
Submitted 26 August, 2019;
originally announced August 2019.
-
UFRGS Participation on the WMT Biomedical Translation Shared Task
Authors:
Felipe Soares,
Karin Becker
Abstract:
This paper describes the machine translation systems developed by the Universidade Federal do Rio Grande do Sul (UFRGS) team for the biomedical translation shared task. Our systems are based on statistical machine translation and neural machine translation, using the Moses and OpenNMT toolkits, respectively. We participated in four translation directions for the English/Spanish and English/Portuguese language pairs. To create our training data, we concatenated several parallel corpora, both from in-domain and out-of-domain sources, as well as terminological resources from UMLS. Our systems achieved the best BLEU scores according to the official shared task evaluation.
Submitted 6 May, 2019;
originally announced May 2019.
-
A Large Parallel Corpus of Full-Text Scientific Articles
Authors:
Felipe Soares,
Viviane Pereira Moreira,
Karin Becker
Abstract:
The Scielo database is an important source of scientific information in Latin America, containing articles from several research domains. A striking characteristic of Scielo is that many of its full-text contents are presented in more than one language, thus being a potential source of parallel corpora. In this article, we present the development of a parallel corpus from Scielo in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for all language pairs, and for a subset of trilingual articles also. We demonstrate the capabilities of our corpus by training a Statistical Machine Translation system (Moses) for each language pair, which outperformed related works on scientific articles. Sentence alignment was also manually evaluated, presenting an average of 98.8% correctly aligned sentences across all languages. Our parallel corpus is freely available in the TMX format, with complementary information regarding article metadata.
Submitted 6 May, 2019;
originally announced May 2019.
-
A Parallel Corpus of Theses and Dissertations Abstracts
Authors:
Felipe Soares,
Gabrielli Harumi Yamashita,
Michel Jose Anzanello
Abstract:
In Brazil, the governmental body responsible for overseeing and coordinating post-graduate programs, CAPES, keeps records of all theses and dissertations presented in the country. Information regarding such documents can be accessed online in the Theses and Dissertations Catalog (TDC), which contains abstracts in Portuguese and English, and additional metadata. Thus, this database can be a potential source of parallel corpora for the Portuguese and English languages. In this article, we present the development of a parallel corpus from TDC, which is made available by CAPES under the open data initiative. Approximately 240,000 documents were collected and aligned using the Hunalign tool. We demonstrate the capability of our developed corpus by training Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for both language directions, followed by a comparison with Google Translate (GT). Both translation models presented better BLEU scores than GT, with the NMT system being the most accurate. Sentence alignment was also manually evaluated, presenting an average of 82.30% correctly aligned sentences. Our parallel corpus is freely available in TMX format, with complementary information regarding document metadata.
Submitted 5 May, 2019;
originally announced May 2019.
-
BVS Corpus: A Multilingual Parallel Corpus of Biomedical Scientific Texts
Authors:
Felipe Soares,
Martin Krallinger
Abstract:
The BVS database (Health Virtual Library) is a centralized source of biomedical information for Latin America and the Caribbean, created in 1998 and coordinated by BIREME (Biblioteca Regional de Medicina) in agreement with the Pan American Health Organization (OPAS). Abstracts are available in English, Spanish, and Portuguese, with a subset in more than one language, thus being a possible source of parallel corpora. In this article, we present the development of parallel corpora from BVS in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for the EN/ES and EN/PT language pairs, as well as for a subset of trilingual articles. We demonstrate the capabilities of our corpus by training a Neural Machine Translation (OpenNMT) system for each language pair, which outperformed related works on scientific biomedical articles. Sentence alignment was also manually evaluated, presenting an average of 96% correctly aligned sentences across all languages. Our parallel corpus is freely available, with complementary information regarding article metadata.
Submitted 5 May, 2019;
originally announced May 2019.
-
Heart rate variability monitoring identifies asymptomatic toddlers exposed to Zika virus during pregnancy
Authors:
Christophe L. Herry,
Helena M. F. Soares,
Lavinia Schuler-Faccini,
Martin G. Frasch
Abstract:
Although Zika virus (ZIKV) seems to be prominently neurotropic, there are some reports of involvement of other organs, particularly the heart. Of special concern are those children exposed prenatally to ZIKV and born with no microcephaly or other congenital anomaly. Electrocardiogram (ECG)-derived heart rate variability (HRV) metrics represent an attractive, low-cost, widely deployable tool for early identification of such children. We hypothesized that HRV in such children would yield a biomarker of fetal ZIKV exposure. We investigated the HRV properties of 21 infants aged 4 to 25 months from Brazil. The infants were divided into two groups, the ZIKV-exposed (n=13) and controls (n=8). Single-channel ECG was recorded in each child at ~15 months of age and HRV was analyzed in 5 min segments to provide a comprehensive characterization of the degree of variability and complexity of the heart rate. Using a cubic Support Vector Machine (SVM) classifier, we identified babies as Zika cases or controls with a negative predictive value of 92% and a positive predictive value of 86%. Our results show that HRV metrics can help differentiate between ZIKV-affected, yet asymptomatic, and non-ZIKV-exposed babies. We identified the Grid Count as the best HRV measure in this study allowing such differentiation, regardless of the presence of microcephaly. We show that it is feasible to measure HRV in infants and toddlers using a small non-invasive portable ECG device and that such an approach may uncover memory of in utero exposure to ZIKV. This approach may be useful for future studies and low-cost screening tools involving this challenging-to-examine population.
Submitted 12 December, 2018;
originally announced December 2018.
-
Measure of gap and inequalities in basic education students proficiencies
Authors:
José Francisco Soares,
Erica Castilho Rodrigues,
Victor Maia Senna Delgado
Abstract:
This study uses students' performance on standardized tests as evidence of the quality of education and introduces a methodology based on the comparison of performance distributions to produce indicators for both the level achieved by the students and the learning gap between social groups, two inseparable dimensions of quality of education. In the first case, the study compares the distribution of the group observed with a reference distribution, which represents an ideal situation of where students should be. In the second, it compares the performance distributions of students belonging to social groups defined by socioeconomic characteristics. This article uses the Kullback-Leibler divergence to characterize the differences between the distributions. This measure takes into account types of differences not considered by other measures and has solid conceptual justifications. The proposed methodology is used to describe the quality of Brazilian basic education using the results of the tests applied biannually to all Brazilian students of basic education.
Submitted 31 May, 2018; v1 submitted 24 May, 2018;
originally announced May 2018.
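As a rough illustration of the comparison described in this abstract, the sketch below computes the Kullback-Leibler divergence between two discrete distributions. The proficiency shares, the function name `kl_divergence`, and the choice of direction (observed relative to reference) are assumptions made for illustration; the abstract does not specify them.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), in nats, for discrete distributions.

    p and q are sequences of probabilities over the same categories.
    Terms with p_i == 0 contribute nothing; if q_i == 0 where p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


# Hypothetical shares of students in four proficiency levels (illustrative only).
observed = [0.40, 0.30, 0.20, 0.10]
reference = [0.05, 0.20, 0.45, 0.30]

print("D(observed || reference):", kl_divergence(observed, reference))
print("D(reference || observed):", kl_divergence(reference, observed))
```

The two printed values differ because the divergence is not symmetric, so any indicator built on it has to fix which distribution plays the role of the reference.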
-
Local finiteness for Green's relations in semigroup varieties
Authors:
Mikhail V. Volkov,
Pedro V. Silva,
Filipa Soares
Abstract:
A semigroup variety V is said to be locally K-finite, where K stands for any of Green's relations H, R, L, D, or J, if every finitely generated semigroup from V has only finitely many K-classes. We characterize locally K-finite varieties of finite axiomatic rank in the language of "forbidden objects".
Submitted 21 November, 2017;
originally announced November 2017.
-
Local finiteness for Green relations in (I-)semigroup varieties
Authors:
Pedro V. Silva,
Filipa Soares
Abstract:
In this work, the lattice of varieties of semigroups and the lattice of varieties of I-semigroups (a common setting for both the variety of completely regular semigroups and the variety of inverse semigroups) are studied with respect to the following concepts: a variety V of (I-)semigroups is said to be locally K-finite, where K stands for any of the five Green's relations, if every finitely generated semigroup from V has only finitely many (distinct) K-classes.
Submitted 13 June, 2016;
originally announced June 2016.
-
Howson's property for semidirect products of semilattices by groups
Authors:
Pedro V. Silva,
Filipa Soares
Abstract:
An inverse semigroup $S$ is a Howson inverse semigroup if the intersection of finitely generated inverse subsemigroups of $S$ is finitely generated. Given a locally finite action $θ$ of a group $G$ on a semilattice $E$, it is proved that $E \ast_θ G$ is a Howson inverse semigroup if and only if $G$ is a Howson group. It is also shown that this equivalence fails for arbitrary actions.
Submitted 9 December, 2014;
originally announced December 2014.
-
The infinite partition of a line segment and multifractal objects
Authors:
A. I. L. de Araújo,
R. F. Soares,
J. P. de Oliveira,
G. Corso
Abstract:
We report an algorithm for the partition of a line segment according to a given ratio $ν$. At each step the length distribution among sets of the partition follows a binomial distribution. We call a $k$-set the set of elements with the same length at step $n$. The total number of elements is $2^n$ and the number of elements in the same $k$-set is $C_n^k$. In the limit of an infinite partition this object becomes a multifractal in which each $k$-set originates a fractal. We find the fractal spectrum $D_k$ and calculate where its maximum lies. Finally we find the values of $D_k$ for the limits $k/n \to 0$ and 1.
Submitted 7 November, 2008;
originally announced November 2008.
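The counting statements in this abstract can be checked with a small sketch. It assumes one plausible reading of the construction: at every step each piece is cut into two parts holding fractions `r` and `1 - r` of its length. The name `partition_lengths` and the parameter `r` are hypothetical, and the paper's ratio $ν$ may be parameterized differently.

```python
from collections import Counter
from fractions import Fraction
from math import comb


def partition_lengths(r, n):
    """Lengths of the 2**n pieces after n splitting steps of a unit segment.

    Assumption: every piece is cut into two parts holding fractions
    r and 1 - r of its length, with 0 < r < 1.
    """
    lengths = [Fraction(1)]
    for _ in range(n):
        lengths = [piece * factor for piece in lengths for factor in (r, 1 - r)]
    return lengths


r = Fraction(1, 3)  # illustrative value; r = 1/2 would make all pieces equal
n = 4
lengths = partition_lengths(r, n)
counts = Counter(lengths)

print("number of pieces:", len(lengths), "expected:", 2 ** n)
for k in range(n + 1):
    length_k = r ** k * (1 - r) ** (n - k)  # common length within the k-set
    print(k, length_k, counts[length_k], comb(n, k))
```

Exact rational arithmetic via `Fraction` is used so that pieces with the same length compare equal, which makes the binomial multiplicities visible without floating-point tolerance.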
-
Light-induced structural transformations in a single gallium nanoparticulate
Authors:
B. F. Soares,
K. F. MacDonald,
V. A. Fedotov,
N. I. Zheludev
Abstract:
In a single gallium nanoparticulate, self-assembled (from an atomic beam) in a nano-aperture at the tip of a tapered optical fiber, we have observed evidence for a sequence of reversible light-induced transformations between five different structural phases (gamma - epsilon - delta - beta - liquid), stimulated by optical excitation at nanowatt power levels.
Submitted 9 March, 2005;
originally announced March 2005.
-
Anisotropy and percolation threshold in a multifractal support
Authors:
L. S. Lucena,
J. E. Freitas,
G. Corso,
R. F. Soares
Abstract:
Recently a multifractal object, $Q_{mf}$, was proposed to study percolation properties in a multifractal support. The area and the number of neighbors of the blocks of $Q_{mf}$ show a non-trivial behavior. The value of the probability of occupation at the percolation threshold, $p_{c}$, is a function of $ρ$, a parameter of $Q_{mf}$ which is related to its anisotropy. We investigate the relation between $p_{c}$ and the average number of neighbors of the blocks as well as the anisotropy of $Q_{mf}$.
Submitted 14 August, 2003;
originally announced August 2003.
-
Percolation in a Multifractal
Authors:
G. Corso,
J. E. Freitas,
L. S. Lucena,
R. F. Soares
Abstract:
We build a multifractal object and use it as a support to study percolation.
We identify some differences between percolation in a multifractal and in a regular lattice. We use many samples of finite size lattices and draw the histogram of percolating lattices against site occupation probability. Depending on a parameter characterizing the multifractal and the lattice size, the histogram can have two peaks. The percolation threshold for the multifractal is lower than for the square lattice.
Percolation in the multifractal differs from percolation in the regular lattice in two respects. The first is related to the coordination number, which changes along the multifractal. The second comes from the way the weight of each cell in the multifractal affects the percolation cluster. We compute the fractal dimension of the percolating cluster. Despite the differences, percolation in a multifractal support is in the universality class of standard percolation.
Submitted 11 August, 2003; v1 submitted 20 December, 2002;
originally announced December 2002.
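For comparison with the regular-lattice baseline mentioned in this abstract, the sketch below estimates how often a finite square lattice percolates at a given site occupation probability. It covers only the ordinary square lattice with 4-neighbour connectivity, not the multifractal support; the function names, lattice size, and sample count are assumptions chosen for illustration.

```python
import random
from collections import deque


def percolates(grid):
    """True if occupied sites connect the top row to the bottom row.

    grid is a square list of lists of booleans; connectivity is 4-neighbour.
    """
    size = len(grid)
    seen = [[False] * size for _ in range(size)]
    queue = deque()
    # Start a breadth-first search from every occupied site in the top row.
    for col in range(size):
        if grid[0][col]:
            seen[0][col] = True
            queue.append((0, col))
    while queue:
        row, col = queue.popleft()
        if row == size - 1:
            return True
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            n_row, n_col = row + d_row, col + d_col
            if (0 <= n_row < size and 0 <= n_col < size
                    and grid[n_row][n_col] and not seen[n_row][n_col]):
                seen[n_row][n_col] = True
                queue.append((n_row, n_col))
    return False


def percolation_fraction(size, p, samples, rng):
    """Fraction of random size x size lattices that percolate at occupation p."""
    hits = 0
    for _ in range(samples):
        grid = [[rng.random() < p for _ in range(size)] for _ in range(size)]
        hits += percolates(grid)
    return hits / samples


rng = random.Random(0)  # fixed seed so repeated runs give the same estimates
for p in (0.4, 0.5, 0.6, 0.7):
    print(p, percolation_fraction(20, p, 200, rng))
```

On a finite lattice the fraction rises smoothly rather than jumping, with the steepest rise expected near the known square-lattice site threshold of roughly 0.593.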