Search | arXiv e-print repository

Grokking Explained: A Statistical Phenomenon

Authors: Breno W. Carvalho, Artur S. d'Avila Garcez, Luís C. Lamb, Emílio Vital Brazil

Abstract: Grokking, or delayed generalization, is an intriguing learning phenomenon where test set loss decreases sharply only after a model's training set loss has converged. This challenges conventional understanding of the training dynamics in deep learning networks. In this paper, we formalize and investigate grokking, highlighting that a key factor in its emergence is a distribution shift between train… ▽ More Grokking, or delayed generalization, is an intriguing learning phenomenon where test set loss decreases sharply only after a model's training set loss has converged. This challenges conventional understanding of the training dynamics in deep learning networks. In this paper, we formalize and investigate grokking, highlighting that a key factor in its emergence is a distribution shift between training and test data. We introduce two synthetic datasets specifically designed to analyze grokking. One dataset examines the impact of limited sampling, and the other investigates transfer learning's role in grokking. By inducing distribution shifts through controlled imbalanced sampling of sub-categories, we systematically reproduce the phenomenon, demonstrating that while small-sampling is strongly associated with grokking, it is not its cause. Instead, small-sampling serves as a convenient mechanism for achieving the necessary distribution shift. We also show that when classes form an equivariant map, grokking can be explained by the model's ability to learn from similar classes or sub-categories. Unlike earlier work suggesting that grokking primarily arises from high regularization and sparse data, we demonstrate that it can also occur with dense data and minimal hyper-parameter tuning. Our findings deepen the understanding of grokking and pave the way for developing better stopping criteria in future training processes. △ Less

Submitted 3 February, 2025; originally announced February 2025.

arXiv:2410.12348 [pdf, other]

SELF-BART : A Transformer-based Molecular Representation Model using SELFIES

Authors: Indra Priyadarsini, Seiji Takeda, Lisa Hamada, Emilio Vital Brazil, Eduardo Soares, Hajime Shinohara

Abstract: Large-scale molecular representation methods have revolutionized applications in material science, such as drug discovery, chemical modeling, and material design. With the rise of transformers, models now learn representations directly from molecular structures. In this study, we develop an encoder-decoder model based on BART that is capable of leaning molecular representations and generate new mo… ▽ More Large-scale molecular representation methods have revolutionized applications in material science, such as drug discovery, chemical modeling, and material design. With the rise of transformers, models now learn representations directly from molecular structures. In this study, we develop an encoder-decoder model based on BART that is capable of leaning molecular representations and generate new molecules. Trained on SELFIES, a robust molecular string representation, our model outperforms existing baselines in downstream tasks, demonstrating its potential in efficient and effective molecular data analysis and manipulation. △ Less

Submitted 16 October, 2024; originally announced October 2024.

Comments: NeurIPS AI4Mat 2024

arXiv:2407.20267 [pdf, other]

A Large Encoder-Decoder Family of Foundation Models For Chemical Language

Authors: Eduardo Soares, Victor Shirasuna, Emilio Vital Brazil, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt

Abstract: Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on spe… ▽ More Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, reducing dependence on annotated datasets and broadening chemical language representation understanding. This paper introduces a large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, which is equivalent to 4 billion of molecular tokens. The proposed foundation model supports different complex tasks, including quantum property prediction, and offer flexibility with two main variants (289M and $8\times289M$). Our experiments across multiple benchmark datasets validate the capacity of the proposed model in providing state-of-the-art results for different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for the reasoning tasks. We demonstrate that the produced latent space is separable compared to the state-of-the-art with few-shot learning capabilities. △ Less

Submitted 24 July, 2024; originally announced July 2024.

Comments: 14 pages, 3 figures, 14 tables

arXiv:2310.13802 [pdf, other]

Improving Molecular Properties Prediction Through Latent Space Fusion

Authors: Eduardo Soares, Akihiro Kishimoto, Emilio Vital Brazil, Seiji Takeda, Hiroshi Kajino, Renato Cerqueira

Abstract: Pre-trained Language Models have emerged as promising tools for predicting molecular properties, yet their development is in its early stages, necessitating further research to enhance their efficacy and address challenges such as generalization and sample efficiency. In this paper, we present a multi-view approach that combines latent spaces derived from state-of-the-art chemical models. Our appr… ▽ More Pre-trained Language Models have emerged as promising tools for predicting molecular properties, yet their development is in its early stages, necessitating further research to enhance their efficacy and address challenges such as generalization and sample efficiency. In this paper, we present a multi-view approach that combines latent spaces derived from state-of-the-art chemical models. Our approach relies on two pivotal elements: the embeddings derived from MHG-GNN, which represent molecular structures as graphs, and MoLFormer embeddings rooted in chemical language. The attention mechanism of MoLFormer is able to identify relations between two atoms even when their distance is far apart, while the GNN of MHG-GNN can more precisely capture relations among multiple atoms closely located. In this work, we demonstrate the superior performance of our proposed multi-view approach compared to existing state-of-the-art methods, including MoLFormer-XL, which was trained on 1.1 billion molecules, particularly in intricate tasks such as predicting clinical trial drug toxicity and inhibiting HIV replication. We assessed our approach using six benchmark datasets from MoleculeNet, where it outperformed competitors in five of them. Our study highlights the potential of latent space fusion and feature integration for advancing molecular property prediction. In this work, we use small versions of MHG-GNN and MoLFormer, which opens up an opportunity for further improvement when our approach uses a larger-scale dataset. △ Less

Submitted 20 October, 2023; originally announced October 2023.

Comments: 8 Pages, 4 Figures - Submited to the AI4Science Workshop - Neurips 2023

arXiv:2308.12152 [pdf, other]

Geo-Sketcher: Rapid 3D Geological Modeling using Geological and Topographic Map Sketches

Authors: Ronan Amorim, Emilio Vital Brazil, Faramarz Samavati, Mario Costa Sousa

Abstract: The construction of 3D geological models is an essential task in oil/gas exploration, development and production. However, it is a cumbersome, time-consuming and error-prone task mainly because of the model's geometric and topological complexity. The models construction is usually separated into interpretation and 3D modeling, performed by different highly specialized individuals, which leads to i… ▽ More The construction of 3D geological models is an essential task in oil/gas exploration, development and production. However, it is a cumbersome, time-consuming and error-prone task mainly because of the model's geometric and topological complexity. The models construction is usually separated into interpretation and 3D modeling, performed by different highly specialized individuals, which leads to inconsistencies and intensifies the challenges. In addition, the creation of models following geological rules is paramount for properly depicting static and dynamic properties of oil/gas reservoirs. In this work, we propose a sketch-based approach to expedite the creation of valid 3D geological models by mimicking how domain experts interpret geological structures, allowing creating models directly from interpretation sketches. Our sketch-based modeler (Geo-Sketcher) is based on sketches of standard 2D topographic and geological maps, comprised of lines, symbols and annotations. We developed a graph-based representation to enable (1) the automatic computation of the relative ages of rock series and layers, and (2) the embedding of specific geological rules directly in the sketching. We introduce the use of Hermite-Birkhoff Radial Basis Functions to interpolate the geological map constraints, and demonstrate the capabilities of our approach with a variety of results with different levels of complexity. △ Less

Submitted 21 August, 2023; originally announced August 2023.

Comments: 21 pages, 30 Figures

arXiv:2306.14919 [pdf, other]

Beyond Chemical Language: A Multimodal Approach to Enhance Molecular Property Prediction

Authors: Eduardo Soares, Emilio Vital Brazil, Karen Fiorela Aquino Gutierrez, Renato Cerqueira, Dan Sanders, Kristin Schmidt, Dmitry Zubarev

Abstract: We present a novel multimodal language model approach for predicting molecular properties by combining chemical language representation with physicochemical features. Our approach, MULTIMODAL-MOLFORMER, utilizes a causal multistage feature selection method that identifies physicochemical features based on their direct causal effect on a specific target property. These causal features are then inte… ▽ More We present a novel multimodal language model approach for predicting molecular properties by combining chemical language representation with physicochemical features. Our approach, MULTIMODAL-MOLFORMER, utilizes a causal multistage feature selection method that identifies physicochemical features based on their direct causal effect on a specific target property. These causal features are then integrated with the vector space generated by molecular embeddings from MOLFORMER. In particular, we employ Mordred descriptors as physicochemical features and identify the Markov blanket of the target property, which theoretically contains the most relevant features for accurate prediction. Our results demonstrate a superior performance of our proposed approach compared to existing state-of-the-art algorithms, including the chemical language-based MOLFORMER and graph neural networks, in predicting complex tasks such as biodegradability and PFAS toxicity estimation. Moreover, we demonstrate the effectiveness of our feature selection method in reducing the dimensionality of the Mordred feature space while maintaining or improving the model's performance. Our approach opens up promising avenues for future research in molecular property prediction by harnessing the synergistic potential of both chemical language and physicochemical features, leading to enhanced performance and advancements in the field. △ Less

Submitted 22 June, 2023; originally announced June 2023.

Comments: 14 pages, 6 Figures, 5 tables. Submited to NEURIPS 2023, Under review

ACM Class: J.2; I.2.1

arXiv:2305.07530 [pdf, other]

Retrospective End-User Walkthrough: A Method for Assessing How People Combine Multiple AI Models in Decision-Making Systems

Authors: Vagner Figueredo de Santana, Larissa Monteiro Da Fonseca Galeno, Emilio Vital Brazil, Aliza Heching, Renato Cerqueira

Abstract: Evaluating human-AI decision-making systems is an emerging challenge as new ways of combining multiple AI models towards a specific goal are proposed every day. As humans interact with AI in decision-making systems, multiple factors may be present in a task including trust, interpretability, and explainability, amongst others. In this context, this work proposes a retrospective method to support a… ▽ More Evaluating human-AI decision-making systems is an emerging challenge as new ways of combining multiple AI models towards a specific goal are proposed every day. As humans interact with AI in decision-making systems, multiple factors may be present in a task including trust, interpretability, and explainability, amongst others. In this context, this work proposes a retrospective method to support a more holistic understanding of how people interact with and connect multiple AI models and combine multiple outputs in human-AI decision-making systems. The method consists of employing a retrospective end-user walkthrough with the objective of providing support to HCI practitioners so that they may gain an understanding of the higher order cognitive processes in place and the role that AI model outputs play in human-AI decision-making. The method was qualitatively assessed with 29 participants (four participants in a pilot phase; 25 participants in the main user study) interacting with a human-AI decision-making system in the context of financial decision-making. The system combines visual analytics, three AI models for revenue prediction, AI-supported analogues analysis, and hypothesis testing using external news and natural language processing to provide multiple means for comparing companies. Beyond results on tasks and usability problems, outcomes presented suggest that the method is promising in highlighting why AI models are ignored, used, or trusted, and how future interactions are planned. We suggest that HCI practitioners researching human-AI interaction can benefit by adding this step to user studies in a debriefing stage as a retrospective Thinking-Aloud protocol would be applied, but with emphasis on revisiting tasks and understanding why participants ignored or connected predictions while performing a task. △ Less

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2303.05545 [pdf, other]

Position Paper on Dataset Engineering to Accelerate Science

Authors: Emilio Vital Brazil, Eduardo Soares, Lucas Villa Real, Leonardo Azevedo, Vinicius Segura, Luiz Zerkowski, Renato Cerqueira

Abstract: Data is a critical element in any discovery process. In the last decades, we observed exponential growth in the volume of available data and the technology to manipulate it. However, data is only practical when one can structure it for a well-defined task. For instance, we need a corpus of text broken into sentences to train a natural language machine-learning model. In this work, we will use the… ▽ More Data is a critical element in any discovery process. In the last decades, we observed exponential growth in the volume of available data and the technology to manipulate it. However, data is only practical when one can structure it for a well-defined task. For instance, we need a corpus of text broken into sentences to train a natural language machine-learning model. In this work, we will use the token \textit{dataset} to designate a structured set of data built to perform a well-defined task. Moreover, the dataset will be used in most cases as a blueprint of an entity that at any moment can be stored as a table. Specifically, in science, each area has unique forms to organize, gather and handle its datasets. We believe that datasets must be a first-class entity in any knowledge-intensive process, and all workflows should have exceptional attention to datasets' lifecycle, from their gathering to uses and evolution. We advocate that science and engineering discovery processes are extreme instances of the need for such organization on datasets, claiming for new approaches and tooling. Furthermore, these requirements are more evident when the discovery workflow uses artificial intelligence methods to empower the subject-matter expert. In this work, we discuss an approach to bringing datasets as a critical entity in the discovery process in science. We illustrate some concepts using material discovery as a use case. We chose this domain because it leverages many significant problems that can be generalized to other science fields. △ Less

Submitted 9 March, 2023; originally announced March 2023.

Comments: Published at 2nd Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE) https://ai-2-ase.github.io/papers/16%5cSubmission%5cAAAI_Dataset_Engineering-8.pdf

arXiv:2303.05288 [pdf, other]

Knowledge-augmented Risk Assessment (KaRA): a hybrid-intelligence framework for supporting knowledge-intensive risk assessment of prospect candidates

Authors: Carlos Raoni Mendes, Emilio Vital Brazil, Vinicius Segura, Renato Cerqueira

Abstract: Evaluating the potential of a prospective candidate is a common task in multiple decision-making processes in different industries. We refer to a prospect as something or someone that could potentially produce positive results in a given context, e.g., an area where an oil company could find oil, a compound that, when synthesized, results in a material with required properties, and so on. In many… ▽ More Evaluating the potential of a prospective candidate is a common task in multiple decision-making processes in different industries. We refer to a prospect as something or someone that could potentially produce positive results in a given context, e.g., an area where an oil company could find oil, a compound that, when synthesized, results in a material with required properties, and so on. In many contexts, assessing the Probability of Success (PoS) of prospects heavily depends on experts' knowledge, often leading to biased and inconsistent assessments. We have developed the framework named KARA (Knowledge-augmented Risk Assessment) to address these issues. It combines multiple AI techniques that consider SMEs (Subject Matter Experts) feedback on top of a structured domain knowledge-base to support risk assessment processes of prospect candidates in knowledge-intensive contexts. △ Less

Submitted 9 March, 2023; originally announced March 2023.

Comments: Published at 2nd Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE) https://ai-2-ase.github.io/papers/17%5cSubmission%5cKaRA_AI2ASE_Workshop-3.pdf. arXiv admin note: text overlap with arXiv:2211.04257

arXiv:2211.04257 [pdf, other]

Toward Human-AI Co-creation to Accelerate Material Discovery

Authors: Dmitry Zubarev, Carlos Raoni Mendes, Emilio Vital Brazil, Renato Cerqueira, Kristin Schmidt, Vinicius Segura, Juliana Jansen Ferreira, Dan Sanders

Abstract: There is an increasing need in our society to achieve faster advances in Science to tackle urgent problems, such as climate changes, environmental hazards, sustainable energy systems, pandemics, among others. In certain domains like chemistry, scientific discovery carries the extra burden of assessing risks of the proposed novel solutions before moving to the experimental stage. Despite several re… ▽ More There is an increasing need in our society to achieve faster advances in Science to tackle urgent problems, such as climate changes, environmental hazards, sustainable energy systems, pandemics, among others. In certain domains like chemistry, scientific discovery carries the extra burden of assessing risks of the proposed novel solutions before moving to the experimental stage. Despite several recent advances in Machine Learning and AI to address some of these challenges, there is still a gap in technologies to support end-to-end discovery applications, integrating the myriad of available technologies into a coherent, orchestrated, yet flexible discovery process. Such applications need to handle complex knowledge management at scale, enabling knowledge consumption and production in a timely and efficient way for subject matter experts (SMEs). Furthermore, the discovery of novel functional materials strongly relies on the development of exploration strategies in the chemical space. For instance, generative models have gained attention within the scientific community due to their ability to generate enormous volumes of novel molecules across material domains. These models exhibit extreme creativity that often translates in low viability of the generated candidates. In this work, we propose a workbench framework that aims at enabling the human-AI co-creation to reduce the time until the first discovery and the opportunity costs involved. This framework relies on a knowledge base with domain and process knowledge, and user-interaction components to acquire knowledge and advise the SMEs. Currently,the framework supports four main activities: generative modeling, dataset triage, molecule adjudication, and risk assessment. △ Less

Submitted 5 November, 2022; originally announced November 2022.

Comments: 9 pages, 5 figures, NeurIPS 2022 WS: AI4Science

arXiv:2010.00330 [pdf, other]

doi 10.1002/cpe.6544

Workflow Provenance in the Lifecycle of Scientific Machine Learning

Authors: Renan Souza, Leonardo G. Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A. S. Netto

Abstract: Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting the computational science and engineering domains, like geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses combining scientific data and ML models to provide for critical requirements, such as reproducibil… ▽ More Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting the computational science and engineering domains, like geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses combining scientific data and ML models to provide for critical requirements, such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, making such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. We contribute with (i) characterization of the lifecycle and taxonomy for data analyses; (ii) design principles to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned after an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs. The experiments show that the principles enable queries that integrate domain semantics with ML models while keeping low overhead (<1%), high scalability, and an order of magnitude of query acceleration under certain workloads against without our representation. △ Less

Submitted 25 August, 2021; v1 submitted 30 September, 2020; originally announced October 2020.

Comments: 21 pages, 10 figures, text overlap with arXiv:1910.04223, a workshop paper being extended in this journal paper

MSC Class: 65Y05; 68P15 ACM Class: I.2; H.2; C.4; J.2

Journal ref: Concurrency Computation Practice Experience. 2021;e6544

arXiv:1910.04223 [pdf, other]

Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering

Authors: Renan Souza, Leonardo Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, Marco A. S. Netto

Abstract: Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate a ML model from scratch or to explain to stakeholders how it was created. The m… ▽ More Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate a ML model from scratch or to explain to stakeholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute with a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the Oil and Gas industry, along with its evaluation using 48 GPUs in parallel. △ Less

Submitted 21 October, 2019; v1 submitted 9 October, 2019; originally announced October 2019.

Comments: 10 pages, 7 figures, Accepted at Workflows in Support of Large-scale Science (WORKS) co-located with the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2019, Denver, Colorado

MSC Class: 65Y05; 68P15 ACM Class: I.2; H.2; C.4; J.2

arXiv:1905.04307 [pdf, other]

Semantic Segmentation of Seismic Images

Authors: Daniel Civitarese, Daniela Szwarcman, Emilio Vital Brazil, Bianca Zadrozny

Abstract: Almost all work to understand Earth's subsurface on a large scale relies on the interpretation of seismic surveys by experts who segment the survey (usually a cube) into layers; a process that is very time demanding. In this paper, we present a new deep neural network architecture specially designed to semantically segment seismic images with a minimal amount of training data. To achieve this, we… ▽ More Almost all work to understand Earth's subsurface on a large scale relies on the interpretation of seismic surveys by experts who segment the survey (usually a cube) into layers; a process that is very time demanding. In this paper, we present a new deep neural network architecture specially designed to semantically segment seismic images with a minimal amount of training data. To achieve this, we make use of a transposed residual unit that replaces the traditional dilated convolution for the decode block. Also, instead of using a predefined shape for up-scaling, our network learns all the steps to upscale the features from the encoder. We train our neural network using the Penobscot 3D dataset; a real seismic dataset acquired offshore Nova Scotia, Canada. We compare our approach with two well-known deep neural network topologies: Fully Convolutional Network and U-Net. In our experiments, we show that our approach can achieve more than 99 percent of the mean intersection over union (mIOU) metric, outperforming the existing topologies. Moreover, our qualitative results show that the obtained model can produce masks very close to human interpretation with very little discontinuity. △ Less

Submitted 10 May, 2019; originally announced May 2019.

Comments: 7 pages, 5 figures

arXiv:1904.00770 [pdf, other]

Netherlands Dataset: A New Public Dataset for Machine Learning in Seismic Interpretation

Authors: Reinaldo Mozart Silva, Lais Baroni, Rodrigo S. Ferreira, Daniel Civitarese, Daniela Szwarcman, Emilio Vital Brazil

Abstract: Machine learning and, more specifically, deep learning algorithms have seen remarkable growth in their popularity and usefulness in the last years. This is arguably due to three main factors: powerful computers, new techniques to train deeper networks and larger datasets. Although the first two are readily available in modern computers and ML libraries, the last one remains a challenge for many do… ▽ More Machine learning and, more specifically, deep learning algorithms have seen remarkable growth in their popularity and usefulness in the last years. This is arguably due to three main factors: powerful computers, new techniques to train deeper networks and larger datasets. Although the first two are readily available in modern computers and ML libraries, the last one remains a challenge for many domains. It is a fact that big data is a reality in almost all fields nowadays, and geosciences are not an exception. However, to achieve the success of general-purpose applications such as ImageNet - for which there are +14 million labeled images for 1000 target classes - we not only need more data, we need more high-quality labeled data. When it comes to the Oil&Gas industry, confidentiality issues hamper even more the sharing of datasets. In this work, we present the Netherlands interpretation dataset, a contribution to the development of machine learning in seismic interpretation. The Netherlands F3 dataset acquisition was carried out in the North Sea, Netherlands offshore. The data is publicly available and contains pos-stack data, 8 horizons and well logs of 4 wells. For the purposes of our machine learning tasks, the original dataset was reinterpreted, generating 9 horizons separating different seismic facies intervals. The interpreted horizons were used to generate approximatelly 190,000 labeled images for inlines and crosslines. Finally, we present two deep learning applications in which the proposed dataset was employed and produced compelling results. △ Less

Submitted 26 March, 2019; originally announced April 2019.

Comments: 5 pages, 5 figures

arXiv:1903.12060 [pdf, other]

Penobscot Dataset: Fostering Machine Learning Development for Seismic Interpretation

Authors: Lais Baroni, Reinaldo Mozart Silva, Rodrigo S. Ferreira, Daniel Civitarese, Daniela Szwarcman, Emilio Vital Brazil

Abstract: We have seen in the past years the flourishing of machine and deep learning algorithms in several applications such as image classification and segmentation, object detection and recognition, among many others. This was only possible, in part, because datasets like ImageNet -- with +14 million labeled images -- were created and made publicly available, providing researches with a common ground to… ▽ More We have seen in the past years the flourishing of machine and deep learning algorithms in several applications such as image classification and segmentation, object detection and recognition, among many others. This was only possible, in part, because datasets like ImageNet -- with +14 million labeled images -- were created and made publicly available, providing researches with a common ground to compare their advances and extend the state-of-the-art. Although we have seen an increasing interest in machine learning in geosciences as well, we will only be able to achieve a significant impact in our community if we collaborate to build such a common basis. This is even more difficult when it comes to the Oil&Gas industry, in which confidentiality and commercial interests often hinder the sharing of datasets with others. In this letter, we present the Penobscot interpretation dataset, our contribution to the development of machine learning in geosciences, more specifically in seismic interpretation. The Penobscot 3D seismic dataset was acquired in the Scotian shelf, offshore Nova Scotia, Canada. The data is publicly available and comprises pre- and pos-stack data, 5 horizons and well logs of 2 wells. However, for the dataset to be of practical use for our tasks, we had to reinterpret the seismic, generating 7 horizons separating different seismic facies intervals. The interpreted horizons were used to generated +100,000 labeled images for inlines and crosslines. To demonstrate the utility of our dataset, results of two experiments are presented. △ Less

Submitted 21 March, 2019; originally announced March 2019.

Comments: 5 pages, 5 figures

arXiv:1406.7025 [pdf, other]

DASS: Detail Aware Sketch-Based Surface Modeling

Authors: Emilio Vital Brazil

Abstract: We present a sketch-based modeling system suitable for detail editing, based on a multilevel representation for surfaces. The main advantage of this representation allowing for the control of local (details) and global changes of the model. We used an adaptive mesh (4-8 mesh) and developed a label theory to construct a manifold structure, which is responsible for controlling local editing of the m… ▽ More We present a sketch-based modeling system suitable for detail editing, based on a multilevel representation for surfaces. The main advantage of this representation allowing for the control of local (details) and global changes of the model. We used an adaptive mesh (4-8 mesh) and developed a label theory to construct a manifold structure, which is responsible for controlling local editing of the model. The overall shape and global modifications are defined by a variational implicit surface (Hermite RBF). Our system assembles the manifold structures to allow the user to add details without changing the overall shape, as well as edit the overall shape while repositioning details coherently. △ Less

Submitted 26 June, 2014; originally announced June 2014.

ACM Class: I.3.5

arXiv:1204.2727 [pdf, ps, other]

doi 10.1016/j.dam.2014.12.006

The Cost of Perfection for Matchings in Graphs

Authors: Emilio Vital Brazil, Guilherme D. da Fonseca, Celina de Figueiredo, Diana Sasaki

Abstract: Perfect matchings and maximum weight matchings are two fundamental combinatorial structures. We consider the ratio between the maximum weight of a perfect matching and the maximum weight of a general matching. Motivated by the computer graphics application in triangle meshes, where we seek to convert a triangulation into a quadrangulation by merging pairs of adjacent triangles, we focus mainly on… ▽ More Perfect matchings and maximum weight matchings are two fundamental combinatorial structures. We consider the ratio between the maximum weight of a perfect matching and the maximum weight of a general matching. Motivated by the computer graphics application in triangle meshes, where we seek to convert a triangulation into a quadrangulation by merging pairs of adjacent triangles, we focus mainly on bridgeless cubic graphs. First, we characterize graphs that attain the extreme ratios. Second, we present a lower bound for all bridgeless cubic graphs. Third, we present upper bounds for subclasses of bridgeless cubic graphs, most of which are shown to be tight. Additionally, we present tight bounds for the class of regular bipartite graphs. △ Less

Submitted 4 December, 2014; v1 submitted 11 April, 2012; originally announced April 2012.

Journal ref: Discrete Applied Mathematics, 210:112-122, 2016

Showing 1–17 of 17 results for author: Brazil, E V