-
TiEBe: Tracking Language Model Recall of Notable Worldwide Events Through Time
Authors:
Thales Sales Almeida,
Giovana Kerche Bonás,
João Guilherme Alves Santos,
Hugo Abonizio,
Rodrigo Nogueira
Abstract:
As the knowledge landscape evolves and large language models (LLMs) become increasingly widespread, there is a growing need to keep these models updated with current events. While existing benchmarks assess general factual recall, few studies explore how LLMs retain knowledge over time or across different regions. To address these gaps, we present the Timely Events Benchmark (TiEBe), a dataset of over 23,000 question-answer pairs centered on notable global and regional events, spanning more than 10 years of events, 23 regions, and 13 languages. TiEBe leverages structured retrospective data from Wikipedia to identify notable events through time. These events are then used to construct a benchmark to evaluate LLMs' understanding of global and regional developments, grounded in factual evidence beyond Wikipedia itself. Our results reveal significant geographic disparities in factual recall, emphasizing the need for more balanced global representation in LLM training. We also observe a Pearson correlation of more than 0.7 between models' performance in TiEBe and various countries' socioeconomic indicators, such as HDI. In addition, we examine the impact of language on factual recall by posing questions in the native language of the region where each event occurred, uncovering substantial performance gaps for low-resource languages.
Submitted 20 May, 2025; v1 submitted 13 January, 2025;
originally announced January 2025.
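The TiEBe abstract above reports a Pearson correlation above 0.7 between model performance and socioeconomic indicators such as HDI. As a minimal sketch of how such a coefficient is computed, the function below implements the standard Pearson formula in plain Python. The function name and any values passed to it are illustrative assumptions, not the authors' code or data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance numerator and the two standard-deviation terms
    # (the 1/n factors cancel, so they are omitted).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

In this setting, `xs` would hold one accuracy score per region and `ys` the matching indicator value for that region; a result near 1 indicates a strong positive linear relationship.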
-
The interplay between domain specialization and model size
Authors:
Roseval Malaquias Junior,
Ramon Pires,
Thales Sales Almeida,
Kenzo Sakiyama,
Roseli A. F. Romero,
Rodrigo Nogueira
Abstract:
Scaling laws for language models have often focused on finding the optimal model size and token count for training from scratch. However, achieving this optimal balance requires significant compute resources due to the extensive data demands when training models from randomly-initialized weights. Continued pretraining offers a cost-effective alternative, leveraging the compute investment from pretrained models to incorporate new knowledge without requiring extensive new data. Recent findings suggest that data quality influences constants in scaling laws, thereby altering the optimal parameter-token allocation ratio. Building on this insight, we investigate the interplay between domain specialization and model size during continued pretraining under compute-constrained scenarios. Our goal is to identify an optimal training regime for this scenario and detect patterns in this interplay that can be generalized across different model sizes and domains. To compare general and specialized training, we filtered a web-based dataset to extract data from three domains: legal, medical, and accounting. We pretrained models with 1.5B, 3B, 7B, and 14B parameters on both the unfiltered and filtered datasets, then evaluated their performance on domain-specific exams. Results show that as model size increases, specialized models outperform general models while requiring less training compute. Additionally, their growing compute efficiency leads to reduced forgetting of previously learned knowledge.
Submitted 29 March, 2025; v1 submitted 3 January, 2025;
originally announced January 2025.
-
Sabiá-3 Technical Report
Authors:
Hugo Abonizio,
Thales Sales Almeida,
Thiago Laitz,
Roseval Malaquias Junior,
Giovana Kerche Bonás,
Rodrigo Nogueira,
Ramon Pires
Abstract:
This report presents Sabiá-3, our new flagship language model, and Sabiazinho-3, a more cost-effective sibling. The models were trained on a large Brazilian-centric corpus. Evaluations across diverse professional and academic benchmarks show strong performance on Portuguese and Brazil-related tasks. Sabiá-3 shows large improvements in comparison to our previous best model, Sabiá-2 Medium, especially in reasoning-intensive tasks. Notably, Sabiá-3's average performance matches frontier LLMs, while it is offered at a three to four times lower cost per token, reinforcing the benefits of domain specialization.
Submitted 1 April, 2025; v1 submitted 15 October, 2024;
originally announced October 2024.
-
SurveySum: A Dataset for Summarizing Multiple Scientific Articles into a Survey Section
Authors:
Leandro Carísio Fernandes,
Gustavo Bartz Guedes,
Thiago Soares Laitz,
Thales Sales Almeida,
Rodrigo Nogueira,
Roberto Lotufo,
Jayr Pereira
Abstract:
Document summarization is the task of shortening texts into concise and informative summaries. This paper introduces a novel dataset designed for summarizing multiple scientific articles into a section of a survey. Our contributions are: (1) SurveySum, a new dataset addressing the gap in domain-specific summarization tools; (2) two specific pipelines to summarize scientific articles into a section of a survey; and (3) the evaluation of these pipelines using multiple metrics to compare their performance. Our results highlight the importance of high-quality retrieval stages and the impact of different configurations on the quality of generated summaries.
Submitted 29 August, 2024;
originally announced August 2024.
-
Measuring Cross-lingual Transfer in Bytes
Authors:
Leandro Rodrigues de Souza,
Thales Sales Almeida,
Roberto Lotufo,
Rodrigo Nogueira
Abstract:
Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models also have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors like language contamination and syntactic similarity. An emerging line of research suggests that the representations learned by language models contain two components: a language-specific and a language-agnostic component. The latter is responsible for transferring a more universal knowledge. However, there is a lack of comprehensive exploration of these properties across diverse target languages. To investigate this hypothesis, we conducted an experiment inspired by the work on the Scaling Laws for Transfer. We measured the amount of data transferred from a source language to a target language and found that models initialized from diverse languages perform similarly to a target language in a cross-lingual setting. This was surprising because the amount of data transferred to 10 diverse target languages, such as Spanish, Korean, and Finnish, was quite similar. We also found evidence that this transfer is not related to language contamination or language proximity, which strengthens the hypothesis that the model also relies on language-agnostic knowledge. Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining.
Submitted 11 April, 2024;
originally announced April 2024.
-
Sabiá-2: A New Generation of Portuguese Large Language Models
Authors:
Thales Sales Almeida,
Hugo Abonizio,
Rodrigo Nogueira,
Ramon Pires
Abstract:
We introduce Sabiá-2, a family of large language models trained on Portuguese texts. The models are evaluated on a diverse range of exams, including entry-level tests for Brazilian universities, professional certification exams, and graduate-level exams for various disciplines such as accounting, economics, engineering, law and medicine. Our results reveal that our best model so far, Sabiá-2 Medium, matches or surpasses GPT-4's performance in 23 out of 64 exams and outperforms GPT-3.5 in 58 out of 64 exams. Notably, specialization has a significant impact on a model's performance without the need to increase its size, allowing us to offer Sabiá-2 Medium at a price per token that is 10 times cheaper than GPT-4. Finally, we identified that math and coding are key abilities that need improvement.
Submitted 26 March, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams
Authors:
Ramon Pires,
Thales Sales Almeida,
Hugo Abonizio,
Rodrigo Nogueira
Abstract:
Recent advancements in language models have showcased human-comparable performance in academic entrance exams. However, existing studies often overlook questions that require the integration of visual comprehension, thus compromising the full spectrum and complexity inherent in real-world scenarios. To address this gap, we present a comprehensive framework to evaluate language models on entrance exams, which incorporates both textual and visual elements. We evaluate the two most recent editions of Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities. Our study not only reaffirms the capabilities of GPT-4 as the state of the art for handling complex multidisciplinary questions, but also pioneers a realistic assessment of multimodal language models on Portuguese examinations. One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement. Yet, despite improvements afforded by images or captions, mathematical questions remain a challenge for these state-of-the-art models. The code and data used in the experiments are available at https://github.com/piresramon/gpt-4-enem.
Submitted 23 November, 2023;
originally announced November 2023.
-
BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams
Authors:
Thales Sales Almeida,
Thiago Laitz,
Giovana K. Bonás,
Rodrigo Nogueira
Abstract:
One common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, despite being the fifth most spoken language worldwide, few such evaluations have been conducted in Portuguese. This is mainly due to the lack of high-quality datasets available to the community for carrying out evaluations in Portuguese. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, providing a valuable resource for advancing the state-of-the-art in multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state-of-the-art in natural language understanding and reasoning in Portuguese. The data and relevant code can be found at https://github.com/Portuguese-Benchmark-Datasets/BLUEX
Submitted 11 July, 2023;
originally announced July 2023.
-
Sabiá: Portuguese Large Language Models
Authors:
Ramon Pires,
Hugo Abonizio,
Thales Sales Almeida,
Rodrigo Nogueira
Abstract:
As the capabilities of language models continue to advance, it is conceivable that a "one-size-fits-all" model will remain the main paradigm. For instance, given the vast number of languages worldwide, many of which are low-resource, the prevalent practice is to pretrain a single model on multiple languages. In this paper, we add to the growing body of evidence that challenges this practice, demonstrating that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora. More specifically, we further pretrain GPT-J and LLaMA models on Portuguese texts using 3% or less of their original pretraining budget. Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By evaluating on datasets originally conceived in the target language as well as translated ones, we study the contributions of language-specific pretraining in terms of 1) capturing linguistic nuances and structures inherent to the target language, and 2) enriching the model's knowledge about a domain or culture. Our results indicate that the majority of the benefits stem from the domain-specific knowledge acquired through monolingual pretraining.
Submitted 9 November, 2023; v1 submitted 16 April, 2023;
originally announced April 2023.
-
NeuralSearchX: Serving a Multi-billion-parameter Reranker for Multilingual Metasearch at a Low Cost
Authors:
Thales Sales Almeida,
Thiago Laitz,
João Seródio,
Luiz Henrique Bonifacio,
Roberto Lotufo,
Rodrigo Nogueira
Abstract:
The widespread availability of search APIs (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this work, we describe NeuralSearchX, a metasearch engine based on a multi-purpose large reranking model to merge results and highlight sentences. Due to the homogeneity of our architecture, we could focus our optimization efforts on a single component. We compare our system with Microsoft's Biomedical Search and show that our design choices led to a much more cost-effective system with competitive QPS while having close to state-of-the-art results on a wide range of public benchmarks. Human evaluation on two domain-specific tasks shows that our retrieval system outperformed the Google API by a large margin in terms of nDCG@10 scores. By describing our architecture and implementation in detail, we hope that the community will build on our design choices. The system is available at https://neuralsearchx.nsx.ai.
Submitted 26 October, 2022;
originally announced October 2022.
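The NeuralSearchX abstract above compares systems by nDCG@10. As a minimal sketch of that metric, the code below computes nDCG at a cutoff `k` from a list of graded relevance labels given in ranked order. It assumes the common linear-gain, log2-discount formulation and normalizes against the ideal ordering of the same labels; the abstract does not say which variant the authors used, and the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # Rank i (1-based) is discounted by log2(i + 1).
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        # No relevant results at all; define the score as 0.
        return 0.0
    return dcg_at_k(relevances, k) / ideal
```

A ranking that already lists results in decreasing relevance scores 1.0, and moving a relevant result further down the list lowers the score.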
-
Mining Artifacts in Mycelium SEM Micrographs
Authors:
Thaicia Stona de Almeida
Abstract:
Mycelium is a promising biomaterial based on fungal mycelium, a highly porous, nanofibrous structure. Scanning electron micrographs are used to characterize its network, but the currently available tools for nanofibrous microstructures do not account for the particularities of biomaterials. Adopting software designed for artificial nanofibrous materials for mycelium characterization adds the uncertainty of imaging artifact formation to the analysis. The reported work combines supervised and unsupervised machine learning methods to automate the identification of artifacts in the mapped pores of mycelium microstructure.
Keywords: Machine learning; unsupervised learning; image processing; mycelium; microstructure informatics
Submitted 12 March, 2021;
originally announced March 2021.
-
Convex geometric reasoning for crystalline energies
Authors:
Thaicia Stona de Almeida
Abstract:
The present work revisits the classical Wulff problem restricted to crystalline integrands, a class of surface energies that gives rise to finitely faceted crystals. The general proof of the Wulff theorem was given by J.E. Taylor (1978) by methods of Geometric Measure Theory. This work follows a simpler and direct way through Minkowski Theory by taking advantage of the convex properties of the considered Wulff shapes.
Submitted 24 February, 2021;
originally announced February 2021.
-
Extending the ADM formalism to Weyl geometry
Authors:
A. B. Barreto,
T. S. Almeida,
C. Romero
Abstract:
In order to treat quantum cosmology in the framework of Weyl spacetimes we take the first step of extending the Arnowitt-Deser-Misner formalism to Weyl geometry. We then obtain an expression of the curvature tensor in terms of spatial quantities by splitting spacetime in (3+1)-dimensional form. We next write the Lagrangian of the gravitational field based on a Weyl-type gravity theory. We extend the general relativistic formalism in such a way that it can be applied to investigate the quantum cosmology of models whose spacetimes are endowed with a Weyl geometrical structure.
Submitted 29 March, 2015;
originally announced March 2015.
-
(2+1)-Dimensional Gravity in Weyl Integrable Spacetime
Authors:
J. E. Madriz Aguilar,
C. Romero,
J. B. Fonseca-Neto,
T. S. Almeida,
J. B. Formiga
Abstract:
We investigate (2+1)-dimensional gravity in a Weyl integrable spacetime (WIST). We show that, unlike general relativity, this scalar-tensor theory has a Newtonian limit for any dimension and that in three dimensions the congruence of world lines of particles of a pressureless fluid has a non-vanishing geodesic deviation. We present and discuss a class of static vacuum solutions generated by a circularly symmetric matter distribution that for certain values of the parameter w corresponds to a space-time with a naked singularity at the center of the matter distribution. We interpret all these results as being a direct consequence of the space-time geometry.
Submitted 13 March, 2015;
originally announced March 2015.
-
Wormholes in Wyman's solution
Authors:
J. B. Formiga,
T. S. Almeida
Abstract:
The most general solution of the Einstein field equations coupled with a massless scalar field is known as Wyman's solution. This solution is also present in the Brans-Dicke theory and, due to its importance, it has been studied in detail by many authors. However, this solution has not been studied from the perspective of a possible wormhole. In this paper, we perform a detailed analysis of this issue. It turns out that there is a wormhole. Although we prove that the so-called throat cannot be traversed by human beings, it can be traversed by particles and bodies that can last long enough.
Submitted 10 September, 2014; v1 submitted 1 April, 2014;
originally announced April 2014.
-
From Brans-Dicke gravity to a geometrical scalar-tensor theory
Authors:
T. S. Almeida,
M. L. Pucheu,
C. Romero,
J. B. Formiga
Abstract:
We consider an approach to Brans-Dicke theory of gravity in which the scalar field has a geometrical nature. By postulating the Palatini variation, we find out that the role played by the scalar field consists in turning the space-time geometry into a Weyl integrable manifold. This procedure leads to a scalar-tensor theory that differs from the original Brans-Dicke theory in many aspects and presents some new features.
Submitted 21 November, 2013;
originally announced November 2013.