-
A Zero-Inflated Spatio-Temporal Model for Integrating Fishery-Dependent and Independent Data under Preferential Sampling
Authors:
Daniela Silva,
Raquel Menezes,
Gonçalo Araújo,
Ana Machado,
Renato Rosa,
Ana Moreno,
Alexandra Silva,
Susana Garrido
Abstract:
Sustainable management of marine ecosystems is vital for maintaining healthy fishery resources, and benefits from advanced scientific tools to accurately assess species distribution patterns. In fisheries science, two primary data sources are used: fishery-independent data (FID), collected through systematic surveys, and fishery-dependent data (FDD), obtained from commercial fishing activities. While these sources provide complementary information, their distinct sampling schemes - systematic for FID and preferential for FDD - pose significant integration challenges. This study introduces a novel spatio-temporal model that integrates FID and FDD, addressing challenges associated with zero-inflation and preferential sampling (PS) common in ecological data. The model employs a six-layer structure to differentiate between presence-absence and biomass observations, offering a robust framework for ecological studies affected by PS biases. Simulation results demonstrate the model's accuracy in parameter estimation across diverse PS scenarios and its ability to detect preferential signals. Application to the study of the distribution patterns of the European sardine populations along the southern Portuguese continental shelf illustrates the model's effectiveness in integrating diverse data sources and incorporating environmental and vessel-specific covariates. The model reveals spatio-temporal variability in sardine presence and biomass, providing actionable insights for fisheries management. Beyond ecology, this framework offers broad applicability to data integration challenges in other disciplines.
Submitted 11 September, 2025;
originally announced September 2025.
-
Joint model for zero-inflated data combining fishery-dependent and fishery-independent sources
Authors:
Daniela Silva,
Raquel Menezes,
Gonçalo Araújo,
Renato Rosa,
Ana Moreno,
Alexandra Silva,
Susana Garrido
Abstract:
Accurately identifying spatial patterns of species distribution is crucial for scientific insight and societal benefit, aiding our understanding of species fluctuations. The increasing quantity and quality of ecological datasets present heightened statistical challenges, complicating spatial species dynamics comprehension. Addressing the complex task of integrating multiple data sources to enhance spatial fish distribution understanding in marine ecology, this study introduces a pioneering five-layer Joint model. The model adeptly integrates fishery-independent and fishery-dependent data, accommodating zero-inflated data and distinct sampling processes. A comprehensive simulation study evaluates the model performance across various preferential sampling scenarios and sample sizes, elucidating its advantages and challenges. Our findings highlight the model's robustness in estimating preferential parameters, emphasizing differentiation between presence-absence and biomass observations. Evaluation of estimation of spatial covariance and prediction performance underscores the model's reliability. Augmenting sample sizes reduces parameter estimation variability, aligning with the principle that increased information enhances certainty. Assessing the contribution of each data source reveals successful integration, providing a comprehensive representation of biomass patterns. Empirical validation within a real-world context further solidifies the model's efficacy in capturing species' spatial distribution. This research advances methodologies for integrating diverse datasets with different sampling natures further contributing to a more informed understanding of spatial dynamics of marine species.
Submitted 31 March, 2025;
originally announced March 2025.
-
On Divergence Measures for Training GFlowNets
Authors:
Tiago da Silva,
Eliezer de Souza da Silva,
Diego Mesquita
Abstract:
Generative Flow Networks (GFlowNets) are amortized inference models designed to sample from unnormalized distributions over composable objects, with applications in generative modeling for tasks in fields such as causal discovery, NLP, and drug discovery. Traditionally, the training procedure for GFlowNets seeks to minimize the expected log-squared difference between a proposal (forward policy) and a target (backward policy) distribution, which enforces certain flow-matching conditions. While this training procedure is closely related to variational inference (VI), directly attempting standard Kullback-Leibler (KL) divergence minimization can lead to proven biased and potentially high-variance estimators. Therefore, we first review four divergence measures, namely, Renyi-$α$'s, Tsallis-$α$'s, reverse and forward KL's, and design statistically efficient estimators for their stochastic gradients in the context of training GFlowNets. Then, we verify that properly minimizing these divergences yields a provably correct and empirically effective training scheme, often leading to significantly faster convergence than previously proposed optimization. To achieve this, we design control variates based on the REINFORCE leave-one-out and score-matching estimators to reduce the variance of the learning objectives' gradients. Our work contributes by narrowing the gap between GFlowNets training and generalized variational approximations, paving the way for algorithmic ideas informed by the divergence minimization viewpoint.
Submitted 21 October, 2024; v1 submitted 11 October, 2024;
originally announced October 2024.
-
Clustering Survival Data using a Mixture of Non-parametric Experts
Authors:
Gabriel Buginga,
Edmundo de Souza e Silva
Abstract:
Survival analysis aims to predict the timing of future events across various fields, from medical outcomes to customer churn. However, the integration of clustering into survival analysis, particularly for precision medicine, remains underexplored. This study introduces SurvMixClust, a novel algorithm for survival analysis that integrates clustering with survival function prediction within a unified framework. SurvMixClust learns latent representations for clustering while also predicting individual survival functions using a mixture of non-parametric experts. Our evaluations on five public datasets show that SurvMixClust creates balanced clusters with distinct survival curves, outperforms clustering baselines, and competes with non-clustering survival models in predictive accuracy, as measured by the time-dependent c-index and log-rank metrics.
Submitted 24 May, 2024;
originally announced May 2024.
-
From Two-Dimensional to Three-Dimensional Environment with Q-Learning: Modeling Autonomous Navigation with Reinforcement Learning and no Libraries
Authors:
Ergon Cugler de Moraes Silva
Abstract:
Reinforcement learning (RL) algorithms have become indispensable tools in artificial intelligence, empowering agents to acquire optimal decision-making policies through interactions with their environment and feedback mechanisms. This study explores the performance of RL agents in both two-dimensional (2D) and three-dimensional (3D) environments, aiming to research the dynamics of learning across different spatial dimensions. A key aspect of this investigation is the absence of pre-made libraries for learning, with the algorithm developed exclusively through computational mathematics. The methodological framework centers on RL principles, employing a Q-learning agent class and distinct environment classes tailored to each spatial dimension. The research aims to address the question: How do reinforcement learning agents adapt and perform in environments of varying spatial dimensions, particularly in 2D and 3D settings? Through empirical analysis, the study evaluates agents' learning trajectories and adaptation processes, revealing insights into the efficacy of RL algorithms in navigating complex, multi-dimensional spaces. Reflections on the findings prompt considerations for future research, particularly in understanding the dynamics of learning in higher-dimensional environments.
Submitted 26 March, 2024;
originally announced March 2024.
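As a rough illustration of the library-free approach this abstract describes, the tabular Q-learning update can be written with only the Python standard library. This is a minimal sketch, not the author's code: the one-dimensional corridor environment, the hyperparameters, and every name (`step`, `train`, `N_STATES`, `ACTIONS`) are assumptions chosen for illustration.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (-1, +1)    # action index 0 moves left, index 1 moves right


def step(state, action):
    """Apply an action; reward is 1.0 on reaching the goal, else 0.0."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, or when both values tie.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q


q_table = train()
# Greedy action (as a move, -1 or +1) for each non-terminal state.
greedy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
```

Moving from a 2D to a 3D grid in this style would mainly change the state encoding and the action set; the update rule itself stays the same.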
-
To be, or not to be, that is the Question: Exploring the pseudorandom generation of texts to write Hamlet from the perspective of the Infinite Monkey Theorem
Authors:
Ergon Cugler de Moraes Silva
Abstract:
This article explores the theoretical and computational aspects of the Infinite Monkey Theorem, investigating the number of attempts and the time required for a set of pseudorandom characters to assemble and recite Hamlet's iconic phrase, "To be, or not to be, that is the Question." Drawing inspiration from Emile Borel's original concept (1913), the study delves into the practical implications of pseudorandomness using Python. Employing Python simulations to generate excerpts from Hamlet, the research navigates historical perspectives and bridges early theoretical foundations with contemporary computational approaches. A set of tests reveals the attempts and time required to generate incremental parts of the target phrase. Utilizing these results, growth factors are calculated, projecting estimated attempts and time for each text part. The findings indicate an astronomical challenge to generate the entire phrase, requiring approximately $2.68\times 10^{69}$ attempts and $2.95\times 10^{66}$ seconds - equivalent to $8.18\times 10^{62}$ hours or $9.32\times 10^{55}$ years. This temporal scale, exceeding the age of the universe by $6.75\times 10^{45}$ times, underscores the immense complexity and improbability of random literary creation. The article concludes with reflections on the mathematical intricacies and statistical feasibility within the context of the Infinite Monkey Theorem, emphasizing the theoretical musings surrounding infinite time and the profound limitations inherent in such endeavors. It closes on the idea that only infinity could write Hamlet randomly.
Submitted 25 February, 2024;
originally announced February 2024.
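The scale of the problem in this abstract can be sketched with a short calculation: if each attempt draws every character independently and uniformly from an alphabet of size k, an exact match of an n-character target has probability k^(-n), so the expected number of attempts is k^n. The sketch below is an illustration under stated assumptions, not the paper's method: the 54-symbol alphabet (`ALPHABET`) and the helper names are hypothetical, and the paper's own projection is derived from measured growth factors, so the figures need not agree.

```python
import random
import string

TARGET = "To be, or not to be, that is the Question"
# Assumed alphabet: upper- and lower-case ASCII letters, space, and comma.
ALPHABET = string.ascii_letters + " ,"


def expected_attempts(target, alphabet_size):
    """Expected number of independent uniform attempts until an exact match."""
    return alphabet_size ** len(target)


def simulate_attempts(target, alphabet, rng):
    """Draw random strings of len(target) symbols until one equals target."""
    attempts = 0
    while True:
        attempts += 1
        guess = "".join(rng.choice(alphabet) for _ in target)
        if guess == target:
            return attempts


rng = random.Random(0)
short_target = TARGET[:2]  # "To": small enough to simulate directly
trials = 50
mean_attempts = sum(
    simulate_attempts(short_target, ALPHABET, rng) for _ in range(trials)
) / trials

print(f"expected for {short_target!r}: {expected_attempts(short_target, len(ALPHABET))}")
print(f"simulated mean over {trials} trials: {mean_attempts:.0f}")
print(f"expected for full phrase: {expected_attempts(TARGET, len(ALPHABET)):.3e}")
```

Only the two-character prefix is simulated; each extra character multiplies the expected number of attempts by the alphabet size, which is why the full phrase is out of reach for direct simulation.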
-
Exploring pseudorandom value addition operations in datasets: A layered approach to escape from normal-Gaussian patterns
Authors:
Ergon Cugler de Moraes Silva
Abstract:
In the realm of statistical exploration, the manipulation of pseudo-random values to discern their impact on data distribution presents a compelling avenue of inquiry. This article investigates the question: Is it possible to add pseudo-random values without compelling a shift towards a normal distribution? Employing Python techniques, the study explores the nuances of pseudo-random value addition, aiming to unravel the interplay between randomness and resulting statistical characteristics. The Materials and Methods chapter details the construction of datasets comprising up to 300 billion pseudo-random values, employing three distinct layers of manipulation. The Results chapter visually and quantitatively explores the generated datasets, emphasizing distribution and standard deviation metrics. The study concludes with reflections on the implications of pseudo-random value manipulation and suggests avenues for future research. In the layered exploration, the first layer introduces subtle normalization with increasing summations, while the second layer enhances normality. The third layer disrupts typical distribution patterns, leaning towards randomness despite pseudo-random value summation. Standard deviation patterns across layers further illuminate the dynamic interplay of pseudo-random operations on statistical characteristics. While not aiming to disrupt academic norms, this work modestly contributes insights into data distribution complexities. Future studies are encouraged to delve deeper into the implications of data manipulation on statistical outcomes, extending the understanding of pseudo-random operations in diverse contexts.
Submitted 8 February, 2024;
originally announced February 2024.
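The drift toward normality that this abstract asks about is the central limit theorem at work, and it can be observed with a few lines of standard-library Python. The sketch below is an illustration only and does not reproduce the paper's three layers: it assumes independent uniform draws, and the helper names (`excess_kurtosis`, `summed_uniforms`) are hypothetical. Excess kurtosis is used as a simple normality indicator, since it is 0 for a normal distribution and -1.2 for a single uniform variable.

```python
import random
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis: fourth central moment / variance^2, minus 3."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    fourth = statistics.fmean((v - mean) ** 4 for v in values)
    return fourth / var ** 2 - 3.0


def summed_uniforms(n_terms, n_samples, rng):
    """Each sample is the sum of n_terms independent uniform(0, 1) draws."""
    return [sum(rng.random() for _ in range(n_terms)) for _ in range(n_samples)]


rng = random.Random(42)
kurt_single = excess_kurtosis(summed_uniforms(1, 20000, rng))
kurt_summed = excess_kurtosis(summed_uniforms(12, 20000, rng))

print(f"excess kurtosis, 1 term:   {kurt_single:.3f}")
print(f"excess kurtosis, 12 terms: {kurt_summed:.3f}")
```

For a sum of k independent uniform terms the theoretical excess kurtosis is -1.2/k, so the value moves toward 0 as more terms are added.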
-
Unsupervised Feature Based Algorithms for Time Series Extrinsic Regression
Authors:
David Guijo-Rubio,
Matthew Middlehurst,
Guilherme Arcencio,
Diego Furtado Silva,
Anthony Bagnall
Abstract:
Time Series Extrinsic Regression (TSER) involves using a set of training time series to form a predictive model of a continuous response variable that is not directly related to the regressor series. The TSER archive for comparing algorithms was released in 2022 with 19 problems. We increase the size of this archive to 63 problems and reproduce the previous comparison of baseline algorithms. We then extend the comparison to include a wider range of standard regressors and the latest versions of TSER models used in the previous study. We show that none of the previously evaluated regressors can outperform a regression adaptation of a standard classifier, rotation forest. We introduce two new TSER algorithms developed from related work in time series classification. FreshPRINCE is a pipeline estimator consisting of a transform into a wide range of summary features followed by a rotation forest regressor. DrCIF is a tree ensemble that creates features from summary statistics over random intervals. Our study demonstrates that both algorithms, along with InceptionTime, exhibit significantly better performance compared to the other 18 regressors tested. More importantly, these two proposals (DrCIF and FreshPRINCE) are the only models that significantly outperform the standard rotation forest regressor.
Submitted 2 May, 2023;
originally announced May 2023.
-
Exact Bayesian Inference for Geostatistical Models under Preferential Sampling
Authors:
Douglas Mateus da Silva,
Dani Gamerman
Abstract:
Preferential sampling is a common feature in geostatistics and occurs when the locations to be sampled are chosen based on information about the phenomena under study. In this case, point pattern models are commonly used as the probability law for the distribution of the locations. However, analytic intractability of the point process likelihood prevents its direct calculation. Many Bayesian (and non-Bayesian) approaches in non-parametric model specifications handle this difficulty with approximation-based methods. These approximations involve errors that are difficult to quantify and can lead to biased inference. This paper presents an approach for performing exact Bayesian inference for this setting without the need for model approximation. A qualitatively minor change on the traditional model is proposed to circumvent the likelihood intractability. This change enables the use of an augmented model strategy. Recent work on Bayesian inference for point pattern models can be adapted to the geostatistics setting and renders computational tractability for exact inference for the proposed methodology. Estimation of model parameters and prediction of the response at unsampled locations can then be obtained from the joint posterior distribution of the augmented model. Simulated studies showed good quality of the proposed model for estimation and prediction in a variety of preferentiality scenarios. The performance of our approach is illustrated in the analysis of real datasets and compares favourably against approximation-based approaches. The paper is concluded with comments regarding extensions of and improvements to the proposed methodology.
Submitted 25 October, 2022;
originally announced October 2022.
-
Model-robust Bayesian design through Generalised Additive Models for monitoring submerged shoals
Authors:
Dilishiya De Silva,
Rebecca Fisher,
Ben Radford,
Helen Thompson,
James McGree
Abstract:
Optimal sampling strategies are critical for surveys of deeper coral reef and shoal systems, due to the significant cost of accessing and field sampling these remote and poorly understood ecosystems. Additionally, well-established standard diver-based sampling techniques used in shallow reef systems cannot be deployed because of water depth. Here we develop a Bayesian design strategy to optimise sampling for a shoal deep reef system using three years of pilot data. Bayesian designs are generally found by maximising the expectation of a utility function with respect to the joint distribution of the parameters and the response conditional on an assumed statistical model. Unfortunately, specifying such a model a priori is difficult as knowledge of the data generating process is typically incomplete. To address this, we present an approach to find Bayesian designs that are robust to unknown model uncertainty. This is achieved through couching the specified model within a Generalised Additive Modelling framework and formulating prior information that allows the additive component to capture discrepancies between what is assumed and the underlying data generating process. The motivation for this is to enable Bayesian designs to be found under epistemic model uncertainty, a highly desirable property of Bayesian designs. Our approach is demonstrated initially on an exemplar design problem where a theoretic result is derived and used to explore the properties of optimal designs. We then apply our approach to design future monitoring of submerged shoals off the north-west coast of Australia with the aim of significantly improving on current monitoring designs.
Submitted 30 August, 2022;
originally announced August 2022.
-
MCEM and SAEM Algorithms for Geostatistical Models under Preferential Sampling
Authors:
Douglas Mateus da Silva,
Lourdes C. Contreras Montenegro
Abstract:
The problem of preferential sampling in geostatistics arises when the choice of location to be sampled is made with information about the phenomena under study. The geostatistical model under preferential sampling deals with this problem, but parameter estimation is challenging because the likelihood function has no closed form. We developed an MCEM and an SAEM algorithm for finding the maximum likelihood estimators of parameters of the model and compared our methodology with the existing ones: Monte Carlo likelihood approximation and Laplace approximation. Simulation studies were carried out to assess the quality of the proposed methods and showed good parameter estimation and prediction under preferential sampling. Finally, we illustrate our findings on the well-known moss data from Galicia.
Submitted 3 August, 2021;
originally announced August 2021.
-
The potential stickiness of pandemic-induced behavior changes in the United States
Authors:
Deborah Salon,
Matthew Wigginton Conway,
Denise Capasso da Silva,
Rishabh Singh Chauhan,
Sybil Derrible,
Kouros Mohammadian,
Sara Khoeini,
Nathan Parker,
Laura Mirtich,
Ali Shamshiripour,
Ehsan Rahimi,
Ram Pendyala
Abstract:
Human behavior is notoriously difficult to change, but a disruption of the magnitude of the COVID-19 pandemic has the potential to bring about long-term behavioral changes. During the pandemic, people have been forced to experience new ways of interacting, working, learning, shopping, traveling, and eating meals. A critical question going forward is how these experiences have actually changed preferences and habits in ways that might persist after the pandemic ends. Many observers have suggested theories about what the future will bring, but concrete evidence has been lacking. We present evidence on how much U.S. adults expect their own post-pandemic choices to differ from their pre-pandemic lifestyles in the areas of telecommuting, restaurant patronage, air travel, online shopping, transit use, car commuting, uptake of walking and biking, and home location. The analysis is based on a nationally-representative survey dataset collected between July and October 2020. Key findings include that the new normal will feature a doubling of telecommuting, reduced air travel, and improved quality of life for some.
Submitted 25 May, 2021; v1 submitted 29 April, 2021;
originally announced April 2021.
-
A database of travel-related behaviors and attitudes before, during, and after COVID-19 in the United States
Authors:
Rishabh Singh Chauhan,
Matthew Wigginton Conway,
Denise Capasso da Silva,
Deborah Salon,
Ali Shamshiripour,
Ehsan Rahimi,
Sara Khoeini,
Abolfazl Mohammadian,
Sybil Derrible,
Ram Pendyala
Abstract:
The COVID-19 pandemic has impacted billions of people around the world. To capture some of these impacts in the United States, we are conducting a nationwide longitudinal survey collecting information about activity and travel-related behaviors and attitudes before, during, and after the COVID-19 pandemic. The survey questions cover a wide range of topics including commuting, daily travel, air travel, working from home, online learning, shopping, and risk perception, along with attitudinal, socioeconomic, and demographic information. The survey is deployed over multiple waves to the same respondents to monitor how behaviors and attitudes evolve over time. Version 1.0 of the survey contains 8,723 Wave 1 responses that are publicly available. This article details the methodology adopted for the collection, cleaning, and processing of the data. In addition, the data are weighted to be representative of national and regional demographics. This survey dataset can aid researchers, policymakers, businesses, and government agencies in understanding both the extent of behavioral shifts and the likelihood that changes in behaviors will persist after COVID-19.
Submitted 9 October, 2021; v1 submitted 29 March, 2021;
originally announced March 2021.
-
Microphone Array Based Surveillance Audio Classification
Authors:
Dimitri Leandro de Oliveira Silva,
Tito Spadini,
Ricardo Suyama
Abstract:
The work assessed seven classical classifiers and two beamforming algorithms for detecting surveillance sound events. The tests included the use of AWGN with -10 dB to 30 dB SNR. Data augmentation was also employed to improve the algorithms' performance. The results showed that the combination of SVM and Delay-and-Sum (DaS) achieved the best accuracy (up to 86.0\%), but had a high computational cost ($\approx$ 402 ms), mainly due to DaS. SGD also seems to be a good alternative, since it achieved comparable accuracy (up to 85.3\%) with a shorter processing time ($\approx$ 165 ms).
Submitted 22 May, 2020;
originally announced May 2020.
-
Forecasting in Non-stationary Environments with Fuzzy Time Series
Authors:
Petrônio Cândido de Lima e Silva,
Carlos Alberto Severiano Junior,
Marcos Antonio Alves,
Rodrigo Silva,
Miri Weiss Cohen,
Frederico Gadelha Guimarães
Abstract:
In this paper we introduce a Non-Stationary Fuzzy Time Series (NSFTS) method with time-varying parameters adapted from the distribution of the data. In this approach, we employ Non-Stationary Fuzzy Sets, in which perturbation functions are used to adapt the membership function parameters in the knowledge base in response to statistical changes in the time series. The proposed method is capable of dynamically adapting its fuzzy sets to reflect the changes in the stochastic process based on the residual errors, without the need to retrain the model. This method can handle non-stationary and heteroskedastic data as well as scenarios with concept drift. The proposed approach allows the model to be trained only once and remain useful long after while keeping reasonable accuracy. The flexibility of the method was tested by means of computational experiments with eight synthetic non-stationary time series with several kinds of concept drift, four real market indices (Dow Jones, NASDAQ, SP500 and TAIEX), three real FOREX pairs (EUR-USD, EUR-GBP, GBP-USD), and two real cryptocoin exchange rates (Bitcoin-USD and Ethereum-USD). The Time Variant fuzzy time series and the Incremental Ensemble, two of the major approaches for handling non-stationary data sets, were used as competitor models. Non-parametric tests are employed to check the significance of the results. The proposed method shows resilience to concept drift, by adapting parameters of the model, while preserving the symbolic structure of the knowledge base.
Submitted 26 April, 2020;
originally announced April 2020.
-
Sound Event Recognition in a Smart City Surveillance Context
Authors:
Tito Spadini,
Dimitri Leandro de Oliveira Silva,
Ricardo Suyama
Abstract:
Due to the growing demand for improving surveillance capabilities in smart cities, systems need to be developed to provide better monitoring capabilities to competent authorities, agencies responsible for strategic resource management, and emergency call centers. This work assumes that, as a complementary monitoring solution, the use of a system capable of detecting the occurrence of sound events, performing the Sound Events Recognition (SER) task, is highly convenient. In order to contribute to the classification of such events, this paper explored several classifiers over the SESA dataset, composed of audios of three hazard classes (gunshots, explosions, and sirens) and a class of casual sounds that could be misinterpreted as some of the other sounds. The best result was obtained by SGD, with an accuracy of 72.13% with 6.81 ms classification time, reinforcing the viability of such an approach.
Submitted 1 February, 2020; v1 submitted 27 October, 2019;
originally announced October 2019.
-
Prior Specification for Bayesian Matrix Factorization via Prior Predictive Matching
Authors:
Eliezer de Souza da Silva,
Tomasz Kuśmierczyk,
Marcelo Hartmann,
Arto Klami
Abstract:
The behavior of many Bayesian models used in machine learning critically depends on the choice of prior distributions, controlled by some hyperparameters that are typically selected by Bayesian optimization or cross-validation. This requires repeated, costly posterior inference. We provide an alternative for selecting good priors without carrying out posterior inference, building on the prior predictive distribution that marginalizes out the model parameters. We estimate virtual statistics for data generated by the prior predictive distribution and then optimize over the hyperparameters to learn ones for which these virtual statistics match target values provided by the user or estimated from (a subset of) the observed data. We apply the principle to probabilistic matrix factorization, for which good solutions for prior selection have been missing. We show that for Poisson factorization models we can analytically determine the hyperparameters, including the number of factors, that best replicate the target statistics, and we study empirically the sensitivity of the approach to model mismatch. We also present a model-independent procedure that determines the hyperparameters for general models by stochastic optimization, and demonstrate this extension in the context of hierarchical matrix factorization models.
Submitted 30 September, 2022; v1 submitted 27 October, 2019;
originally announced October 2019.
-
Augmented Memory Networks for Streaming-Based Active One-Shot Learning
Authors:
Andreas Kvistad,
Massimiliano Ruocco,
Eliezer de Souza da Silva,
Erlend Aune
Abstract:
One of the major challenges in training deep architectures for predictive tasks is the scarcity and cost of labeled training data. Active Learning (AL) is one way of addressing this challenge. In stream-based AL, observations are continuously made available to the learner, which has to decide whether to request a label or to make a prediction. The goal is to reduce the request rate while at the same time maximizing prediction performance. In previous research, reinforcement learning has been used for learning the AL request/prediction strategy. In our work, we propose to equip a reinforcement learning process with memory augmented neural networks, to enhance the one-shot capabilities. Moreover, we introduce Class Margin Sampling (CMS) as an extension of the standard margin sampling to the reinforcement learning setting. This strategy aims to reduce training time and improve sample efficiency in the training process. We evaluate the proposed method on a classification task using empirical accuracy of label predictions and percentage of label requests. The results indicate that the proposed method, by making use of the memory augmented networks and CMS in the training process, outperforms existing baselines.
Submitted 4 September, 2019;
originally announced September 2019.
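The abstract above builds on standard margin sampling. As a rough illustration of that baseline only (not the paper's Class Margin Sampling, and not its learned reinforcement learning policy), the sketch below requests a label whenever the gap between the two most probable classes is small. The function names and the `threshold` value are assumptions made for this example.

```python
def margin(class_probs):
    """Gap between the two largest predicted class probabilities."""
    top, runner_up = sorted(class_probs, reverse=True)[:2]
    return top - runner_up


def should_request_label(class_probs, threshold=0.2):
    """Ask for a label when the model is torn between its top two classes.

    `threshold` is a hypothetical cut-off chosen for illustration only.
    """
    return margin(class_probs) < threshold


# Two observations arriving on a stream: one confident, one ambiguous.
stream = [[0.90, 0.05, 0.05], [0.45, 0.40, 0.15]]
decisions = [should_request_label(p) for p in stream]
```

A fixed threshold is the simplest possible request rule; the abstract describes learning the request/prediction strategy instead.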
-
Ensemble learning with Conformal Predictors: Targeting credible predictions of conversion from Mild Cognitive Impairment to Alzheimer's Disease
Authors:
Telma Pereira,
Sandra Cardoso,
Dina Silva,
Manuela Guerreiro,
Alexandre de Mendonça,
Sara C. Madeira
Abstract:
Most machine learning classifiers give predictions for new examples accurately, yet without indicating how trustworthy predictions are. In the medical domain, this hampers their integration in decision support systems, which could be useful in clinical practice. We use a supervised learning approach that combines Ensemble learning with Conformal Predictors to predict conversion from Mild Cognitive Impairment to Alzheimer's Disease. Our goal is to enhance the classification performance (Ensemble learning) and complement each prediction with a measure of credibility (Conformal Predictors). Our results showed the superiority of the proposed approach over a similar ensemble framework with standard classifiers.
Submitted 5 July, 2018; v1 submitted 4 July, 2018;
originally announced July 2018.
-
Factor Analysis of Interval Data
Authors:
Paula Cheira,
Paula Brito,
A. Pedro Duarte Silva
Abstract:
This paper presents a factor analysis model for symbolic data, focusing on the particular case of interval-valued variables. The proposed method describes the correlation structure among the measured interval-valued variables in terms of a few underlying, but unobservable, uncorrelated interval-valued variables, called \textit{common factors}. Uniform and Triangular distributions are considered within each observed interval. We obtain the corresponding sample mean, variance and covariance assuming a general Triangular distribution.
In our proposal, factors are extracted either by Principal Component or by Principal Axis Factoring, performed on the correlation matrix of the interval-valued variables. To estimate the values of the common factors, usually called \textit{factor scores}, two approaches are considered, which are inspired by methods for real-valued data: the Bartlett and the Anderson-Rubin methods. In both cases, the estimated values are obtained by solving an optimization problem that minimizes a function of the weighted squared Mallows distance between quantile functions. Explicit expressions for the quantile function and the squared Mallows distance are derived assuming a general Triangular distribution.
The applicability of the method is illustrated using two sets of data: temperature and precipitation in cities of the United States of America between the years 1971 and 2000 and measures of car characteristics of different makes and models. Moreover, the method is evaluated on synthetic data with predefined correlation structures.
Submitted 14 September, 2017;
originally announced September 2017.
-
Prediction Measures in Nonlinear Beta Regression Models
Authors:
Patrícia Leone Espinheira,
Luana C. Meireles da Silva,
Alisson de Oliveira Silva,
Raydonal Ospina
Abstract:
Nonlinear models are frequently applied to determine the optimal supply of natural gas to a given residential unit based on economic and technical factors, or used to fit nonlinear data from biochemical and pharmaceutical assays. In this article we propose PRESS statistics and prediction coefficients, namely $P^2$ statistics, for a class of nonlinear beta regression models. We aim at using both prediction coefficients and goodness-of-fit measures as a scheme of model selection criteria. In this sense, we introduce for beta regression models under nonlinearity the use of model selection criteria based on robust pseudo-$R^2$ statistics. Monte Carlo simulation results on the finite sample behavior of both the prediction-based model selection criteria $P^2$ and the pseudo-$R^2$ statistics are provided. Three applications to real data are presented. The linear application relates to the distribution of natural gas for home usage in São Paulo, Brazil. Faced with the economic risk of overestimating or underestimating the distribution of gas, it has been necessary to construct prediction limits; selecting the best predictive and fitted model with which to construct the best prediction limits is the aim of the first application. Additionally, the two nonlinear applications presented also highlight the importance of considering both the goodness-of-prediction and the goodness-of-fit of the competing models.
Submitted 22 May, 2017;
originally announced May 2017.
-
New Algorithms for Computing a Single Component of the Discrete Fourier Transform
Authors:
G. Jerônimo da Silva Jr.,
R. M. Campello de Souza,
H. M. de Oliveira
Abstract:
This paper introduces the theory and hardware implementation of two new algorithms for computing a single component of the discrete Fourier transform. In terms of multiplicative complexity, both algorithms are more efficient, in general, than the well-known Goertzel algorithm.
Submitted 9 March, 2015;
originally announced March 2015.
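For context on the baseline named in the abstract above, the following is a minimal sketch of the classical Goertzel recursion for a single DFT component. It is not one of the paper's two new algorithms, which the abstract does not describe; the function name `goertzel` and the test signal are chosen for illustration.

```python
import cmath
import math


def goertzel(x, k):
    """Return X[k] = sum_n x[n] * exp(-2j*pi*k*n/N) using the Goertzel recursion."""
    n_samples = len(x)
    omega = 2.0 * math.pi * k / n_samples
    coeff = 2.0 * math.cos(omega)  # the only multiplier used inside the loop

    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:
        # Second-order recursion: s[n] = x[n] + 2*cos(omega)*s[n-1] - s[n-2]
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s

    # One complex multiplication at the end recovers the DFT component:
    # X[k] = exp(j*omega) * s[N-1] - s[N-2]
    return cmath.exp(1j * omega) * s_prev - s_prev2


signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.5, 4.0]
component = goertzel(signal, 3)
```

For a real input, the loop costs one real multiplication per sample, which is the kind of multiplicative cost the abstract compares against.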
-
Prediction Measures in Beta Regression Models
Authors:
Patrícia L. Espinheira,
Luana Cecília Meireles da Silva,
Alisson de Oliveira Silva
Abstract:
We consider the issue of constructing PRESS statistics and coefficients of prediction for a class of beta regression models. We aim at displaying measures of the predictive power of the model regardless of goodness-of-fit. Monte Carlo simulation results on the finite sample behavior of such measures are provided. We also present an application that relates to the distribution of natural gas for home usage in São Paulo, Brazil. Faced with the economic risk of overestimating or underestimating the distribution of gas, it was necessary to construct prediction limits using beta regression models (Espinheira et al., 2014). Thus arises the aim of this work: the selection of the best predictive model with which to construct the best prediction limits.
Submitted 20 January, 2015;
originally announced January 2015.
-
An Integer Programming Formulation Applied to Optimum Allocation in Multivariate Stratified Sampling
Authors:
Jose Andre de Moura Brito,
Gustavo Silva Semaan,
Pedro Luis do Nascimento Silva,
Nelson Maculan
Abstract:
The problem of optimal allocation of samples in surveys using a stratified sampling plan was first discussed by Neyman in 1934. Since then, many researchers have studied the problem of sample allocation in multivariate surveys, and several methods have been proposed. Basically, these methods are divided into two classes. The first involves forming a weighted average of the stratum variances and finding the optimal allocation for the average variance. The second class is associated with methods that require an acceptable coefficient of variation for each of the variables on which the allocation is to be done. In particular, this paper proposes a new optimization approach to the second problem. This approach is based on an integer programming formulation. Several experiments showed that the proposed approach is an efficient way to solve this problem, based on a comparison with another approach from the literature.
Submitted 24 September, 2013;
originally announced September 2013.
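As background for the abstract above, the sketch below shows the classical single-variable Neyman allocation and the coefficient of variation (CV) of the stratified mean that a given allocation yields. It is not the paper's integer programming formulation. The function names and the two-stratum numbers are assumptions made for illustration.

```python
import math


def neyman_allocation(n_total, stratum_sizes, stratum_sds):
    """Neyman allocation: n_h proportional to N_h * S_h (real-valued, not rounded)."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n_total * w / total for w in weights]


def cv_of_stratified_mean(allocation, stratum_sizes, stratum_sds, stratum_means):
    """CV of the stratified mean estimator under simple random sampling per stratum."""
    population = sum(stratum_sizes)
    variance = 0.0
    mean = 0.0
    for n_h, size, sd, mu in zip(allocation, stratum_sizes, stratum_sds, stratum_means):
        w_h = size / population
        # Stratum contribution with finite population correction (1 - n_h / N_h).
        variance += w_h ** 2 * sd ** 2 / n_h * (1.0 - n_h / size)
        mean += w_h * mu
    return math.sqrt(variance) / mean


# Hypothetical two-stratum population, used only to exercise the functions.
sizes, sds, means = [1000, 1000], [1.0, 3.0], [10.0, 20.0]
allocation = neyman_allocation(100, sizes, sds)
cv = cv_of_stratified_mean(allocation, sizes, sds, means)
```

With several survey variables, each variable has its own CV under a shared allocation, and sample sizes must be whole numbers. This is where a constrained integer formulation such as the one the abstract proposes becomes relevant.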
-
Hierarchical Nystrom Methods for Constructing Markov State Models for Conformational Dynamics
Authors:
Yuan Yao,
Raymond Z. Cui,
Gregory R. Bowman,
Daniel Silva,
Jian Sun,
Xuhui Huang
Abstract:
Markov state models (MSMs) have become a popular approach for investigating the conformational dynamics of proteins and other biomolecules. MSMs are typically built from numerous molecular dynamics simulations by dividing the sampled configurations into a large number of microstates based on geometric criteria. The resulting microstate model can then be coarse-grained into a more understandable macro state model by lumping together rapidly mixing microstates into larger, metastable aggregates. However, finite sampling often results in the creation of many poorly sampled microstates. During coarse-graining, these states are mistakenly identified as being kinetically important because transitions to/from them appear to be slow. In this paper we propose a formalism based on an algebraic principle for matrix approximation, i.e. the Nystrom method, to deal with such poorly sampled microstates. Our scheme builds a hierarchy of microstates from high to low populations and progressively applies spectral clustering on sets of microstates within each level of the hierarchy. It helps spectral clustering identify metastable aggregates with highly populated microstates rather than being distracted by lowly populated states. We demonstrate the ability of this algorithm to discover the major metastable states on two model systems, the alanine dipeptide and TrpZip2.
Submitted 5 January, 2013;
originally announced January 2013.
-
Dynamics of Snoring Sounds and Its Connection with Obstructive Sleep Apnea
Authors:
Adriano M. Alencar,
Diego Greatti Vaz da Silva,
Carolina Beatriz Oliveira,
Andre P. Vieira,
Henrique T. Moriya,
Geraldo Lorenzi-Filho
Abstract:
Snoring is extremely common in the general population and when irregular may indicate the presence of obstructive sleep apnea. We analyze the overnight sequence of wave packets --- the snore sound --- recorded during full polysomnography in patients referred to the sleep laboratory due to suspected obstructive sleep apnea. We hypothesize that irregular snore, with duration in the range between 10 and 100 seconds, correlates with respiratory obstructive events. We find that the number of irregular snores --- easily accessible, and quantified by what we call the snore time interval index (STII) --- is in good agreement with the well-known apnea-hypopnea index, which expresses the severity of obstructive sleep apnea and is extracted only from polysomnography. In addition, the Hurst analysis of the snore sound itself, which calculates the fluctuations in the signal as a function of time interval, is used to build a classifier that is able to distinguish between patients with no or mild apnea and patients with moderate or severe apnea.
Submitted 10 August, 2012;
originally announced August 2012.
-
A novel algorithm for simultaneous SNP selection in high-dimensional genome-wide association studies
Authors:
Verena Zuber,
A. Pedro Duarte Silva,
Korbinian Strimmer
Abstract:
Background: Identification of causal SNPs in most genome-wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, modern, computationally expensive regression methods that consider all markers simultaneously, and thus incorporate dependencies among SNPs, are increasingly employed for SNP selection.
Results: We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine the effectiveness in finding true causal SNPs.
Conclusions: Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here is available from http://strimmerlab.org/software/care/ .
Submitted 25 October, 2012; v1 submitted 14 March, 2012;
originally announced March 2012.