-
Integrating Statistical Significance and Discriminative Power in Pattern Discovery
Authors:
Leonardo Alexandre,
Rafael S. Costa,
Rui Henriques
Abstract:
Pattern discovery plays a central role in both descriptive and predictive tasks across multiple domains. Actionable patterns must meet rigorous statistical significance criteria and, in the presence of target variables, further uphold discriminative power. Our work addresses the underexplored area of guiding pattern discovery by integrating statistical significance and discriminative power criteria into state-of-the-art algorithms while preserving pattern quality. We also address how pattern quality thresholds, imposed by some algorithms, can be rectified to accommodate these additional criteria. To test the proposed methodology, we select the triclustering task as the guiding pattern discovery case and extend well-known greedy and multi-objective optimization triclustering algorithms, $δ$-Trimax and TriGen, that use various pattern quality criteria, such as Mean Squared Residual (MSR), Least Squared Lines (LSL), and Multi Slope Measure (MSL). Results from three case studies show the role of the proposed methodology in discovering patterns with pronounced improvements of discriminative power and statistical significance without quality deterioration, highlighting its importance in guiding the search in a supervised manner. Although the proposed methodology is motivated over multivariate time series data, it can be straightforwardly extended to pattern discovery tasks involving multivariate, N-way (N>3), transactional, and sequential data structures.
Availability: The code is freely available at https://github.com/JupitersMight/MOF_Triclustering under the MIT license.
Submitted 22 January, 2024;
originally announced January 2024.
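To make the Mean Squared Residual (MSR) criterion named in the abstract above concrete, the following is a minimal sketch, not the authors' code. It assumes the additive residue form often used for triclusters: each cell value minus its three slice means plus twice the overall mean. The function name and the nested-list input layout are illustrative assumptions.

```python
from itertools import product


def mean_squared_residual(tensor):
    """MSR of a dense 3-way subspace given as nested lists tensor[i][j][k].

    Assumed residue: r_ijk = a_ijk - a_iJK - a_IjK - a_IJk + 2 * a_IJK,
    where capital indices denote means over that dimension.
    """
    n_i, n_j, n_k = len(tensor), len(tensor[0]), len(tensor[0][0])
    cells = list(product(range(n_i), range(n_j), range(n_k)))

    # Overall mean and the mean of each slice along every dimension.
    mean_all = sum(tensor[i][j][k] for i, j, k in cells) / len(cells)
    mean_i = [sum(tensor[i][j][k] for j in range(n_j) for k in range(n_k)) / (n_j * n_k)
              for i in range(n_i)]
    mean_j = [sum(tensor[i][j][k] for i in range(n_i) for k in range(n_k)) / (n_i * n_k)
              for j in range(n_j)]
    mean_k = [sum(tensor[i][j][k] for i in range(n_i) for j in range(n_j)) / (n_i * n_j)
              for k in range(n_k)]

    total = 0.0
    for i, j, k in cells:
        residue = tensor[i][j][k] - mean_i[i] - mean_j[j] - mean_k[k] + 2 * mean_all
        total += residue * residue
    return total / len(cells)
```

Under this assumed definition, a subspace whose values decompose as a sum of one term per dimension has an MSR of zero, and lower values indicate a more coherent pattern.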
-
TriSig: Assessing the statistical significance of triclusters
Authors:
Leonardo Alexandre,
Rafael S. Costa,
Rui Henriques
Abstract:
Tensor data analysis allows researchers to uncover novel patterns and relationships that cannot be obtained from matrix data alone. The information inferred from the patterns provides valuable insights into disease progression, bioproduction processes, weather fluctuations, and group dynamics. However, spurious and redundant patterns hamper this process. This work aims at proposing a statistical frame to assess the probability of patterns in tensor data to deviate from null expectations, extending well-established principles for assessing the statistical significance of patterns in matrix data. A comprehensive discussion on binomial testing for false positive discoveries is provided in light of: variable dependencies, temporal dependencies and misalignments, and p-value corrections under the Benjamini-Hochberg procedure. Results gathered from the application of state-of-the-art triclustering algorithms over distinct real-world case studies in biochemical and biotechnological domains confer validity to the proposed statistical frame while revealing vulnerabilities of some triclustering searches. The proposed assessment can be incorporated into existing triclustering algorithms to mitigate false positive/spurious discoveries and further prune the search space, reducing their computational complexity.
Availability: The code is freely available at https://github.com/JupitersMight/TriSig under the MIT license.
Submitted 12 June, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
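The two statistical ingredients named in the abstract above, binomial testing and the Benjamini-Hochberg correction, can be sketched generically as follows. This is not the TriSig implementation. The function names are illustrative, and the assumption that each pattern is summarized by a support count, a number of trials, and a null probability is made here only for the example.

```python
from math import comb


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p): a one-sided binomial test.

    Here p stands for the (assumed) probability that a single observation
    supports the pattern under the null model.
    """
    return sum(comb(trials, i) * p ** i * (1 - p) ** (trials - i)
               for i in range(successes, trials + 1))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

A pattern would be kept as significant only if its tail probability survives the correction across all patterns tested together.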
-
Scaling pattern mining through non-overlapping variable partitioning
Authors:
Leonardo Alexandre,
Rafael S. Costa,
Rui Henriques
Abstract:
Biclustering algorithms play a central role in the biotechnological and biomedical domains. The knowledge extracted supports the extraction of putative regulatory modules, essential to understanding diseases, aiding therapy research, and advancing biological knowledge. However, given the NP-hard nature of the biclustering task, algorithms with optimality guarantees tend to scale poorly in the presence of high-dimensionality data. To this end, we propose a pipeline for clustering-based vertical partitioning that takes into consideration both parallelization and cross-partition pattern merging needs. Given a specific type of pattern coherence, these clusters are built based on the likelihood that variables form those patterns. Subsequently, the patterns extracted per cluster are merged into a final set of closed patterns. This approach is evaluated using five published datasets. Results show that in some of the tested data, execution times yield statistically significant improvements when variables are clustered together based on the likelihood to form specific types of patterns, as opposed to partitions based on dissimilarity or randomness. This work offers a first step toward understanding the efficiency impact of vertical partitioning criteria along the different stages of pattern mining and biclustering algorithms.
Availability: All the code is freely available at https://github.com/JupitersMight/pattern_merge under the MIT license.
Submitted 10 December, 2022;
originally announced December 2022.
-
User-Specific Bicluster-based Collaborative Filtering: Handling Preference Locality, Sparsity and Subjectivity
Authors:
Miguel G. Silva,
Rui Henriques,
Sara C. Madeira
Abstract:
Collaborative Filtering (CF), the most common approach to build Recommender Systems, became pervasive in our daily lives as consumers of products and services. However, challenges limit the effectiveness of Collaborative Filtering approaches when dealing with recommendation data, mainly due to the diversity and locality of user preferences, structural sparsity of user-item ratings, subjectivity of rating scales, and increasingly high item dimensionality and user bases. To answer some of these challenges, some authors proposed successful approaches combining CF with Biclustering techniques.
This work assesses the effectiveness of Biclustering approaches for CF, comparing the impact of algorithmic choices, and identifies principles for superior Biclustering-based CF. As a result, we propose USBCF, a Biclustering-based CF approach that creates user-specific models from strongly coherent and statistically significant rating patterns, corresponding to subspaces of shared preferences across users. Evaluation on real-world data reveals that USBCF achieves competitive predictive accuracy against state-of-the-art CF methods. Moreover, USBCF successfully suppresses the main shortcomings of the previously proposed state-of-the-art biclustering-based CF by increasing coverage, and coclustering-based CF by strengthening subspace homogeneity.
Submitted 15 November, 2022;
originally announced November 2022.
-
EEG to fMRI Synthesis Benefits from Attentional Graphs of Electrode Relationships
Authors:
David Calhas,
Rui Henriques
Abstract:
Topographical structures represent connections between entities and provide a comprehensive design of complex systems. Currently these structures are used to discover correlates of neuronal and haemodynamical activity. In this work, we incorporate them with neural processing techniques to perform regression, using electrophysiological activity to retrieve haemodynamics. To this end, we use Fourier features, attention mechanisms, shared space between modalities and incorporation of style in the latent representation. By combining these techniques, we propose several models that significantly outperform current state-of-the-art of this task in resting state and task-based recording settings. We report which EEG electrodes are the most relevant for the regression task and which relations impacted it the most. In addition, we observe that haemodynamic activity at the scalp, in contrast with sub-cortical regions, is relevant to the learned shared space. Overall, these results suggest that EEG electrode relationships are pivotal to retain information necessary for haemodynamical activity retrieval.
Submitted 7 March, 2022;
originally announced March 2022.
-
On the Role of Multi-Objective Optimization to the Transit Network Design Problem
Authors:
Vasco D. Silva,
Anna Finamore,
Rui Henriques
Abstract:
Ongoing traffic changes, including those triggered by the COVID-19 pandemic, reveal the necessity to adapt our public transport systems to the ever-changing users' needs. This work shows that single and multi objective stances can be synergistically combined to better answer the transit network design problem (TNDP). Single objective formulations are dynamically inferred from the rating of networks in the approximated (multi-objective) Pareto Front, where a regression approach is used to infer the optimal weights of transfer needs, times, distances, coverage, and costs. As a guiding case study, the solution is applied to the multimodal public transport network in the city of Lisbon, Portugal. The system takes individual trip data given by smartcard validations at CARRIS buses and METRO subway stations and uses them to estimate the origin-destination demand in the city. Then, Genetic Algorithms are used, considering both single and multi objective approaches, to redesign the bus network that better fits the observed traffic demand. The proposed TNDP optimization proved to improve results, with reductions in objective functions of up to 28.3%. The system managed to extensively reduce the number of routes, and all passenger related objectives, including travel time and transfers per trip, significantly improve. Grounded on automated fare collection data, the system can incrementally redesign the bus network to dynamically handle ongoing changes to the city traffic.
Submitted 27 January, 2022;
originally announced January 2022.
-
Context-aware demand prediction in bike sharing systems: incorporating spatial, meteorological and calendrical context
Authors:
Cláudio Sardinha,
Anna C. Finamore,
Rui Henriques
Abstract:
Bike sharing demand is increasing in large cities worldwide. The proper functioning of bike-sharing systems is, nevertheless, dependent on a balanced geographical distribution of bicycles throughout the day. In this context, understanding the spatiotemporal distribution of check-ins and check-outs is key for station balancing and bike relocation initiatives. Still, recent contributions from deep learning and distance-based predictors show limited success in forecasting bike sharing demand. This consistent observation is hypothesized to be driven by: i) the strong dependence between demand and the meteorological and situational context of stations; and ii) the absence of spatial awareness, as most predictors are unable to model the effects of high-low station load on nearby stations.
This work proposes a comprehensive set of new principles to incorporate both historical and prospective sources of spatial, meteorological, situational and calendrical context in predictive models of station demand. To this end, a new recurrent neural network layering composed of serial long short-term memory (LSTM) components is proposed with two major contributions: i) the feeding of multivariate time series masks produced from historical context data at the input layer, and ii) the time-dependent regularization of the forecasted time series using prospective context data. This work further assesses the impact of incorporating different sources of context, showing the relevance of the proposed principles for the community even though not all improvements from the context-aware predictors yield statistical significance.
Submitted 3 May, 2021;
originally announced May 2021.
-
DI2: prior-free and multi-item discretization of biomedical data and its applications
Authors:
Leonardo Alexandre,
Rafael S. Costa,
Rui Henriques
Abstract:
Motivation: A considerable number of data mining approaches for biomedical data analysis, including state-of-the-art associative models, require a form of data discretization. Although diverse discretization approaches have been proposed, they generally work under a strict set of statistical assumptions which are arguably insufficient to handle the diversity and heterogeneity of clinical and molecular variables within a given dataset. In addition, although an increasing number of symbolic approaches in bioinformatics are able to assign multiple items to values occurring near discretization boundaries for superior robustness, there are no reference principles on how to perform multi-item discretizations.
Results: In this study, an unsupervised discretization method, DI2, for variables with arbitrarily skewed distributions is proposed. DI2 provides robust guarantees of generalization by placing data corrections using the Kolmogorov-Smirnov test before statistically fitting distribution candidates. DI2 further supports multi-item assignments. Results gathered from biomedical data show its relevance to improve classic discretization choices.
Software: available at https://github.com/JupitersMight/DI2
Submitted 7 March, 2021;
originally announced March 2021.
-
EEG to fMRI Synthesis: Is Deep Learning a candidate?
Authors:
David Calhas,
Rui Henriques
Abstract:
Advances in signal, image and video generation underlie major breakthroughs in generative medical imaging tasks, including Brain Image Synthesis. Still, the extent to which functional Magnetic Resonance Imaging (fMRI) can be mapped from the brain electrophysiology remains largely unexplored. This work provides the first comprehensive view on how to use state-of-the-art principles from Neural Processing to synthesize fMRI data from electroencephalographic (EEG) data. Given the distinct spatiotemporal nature of haemodynamic and electrophysiological signals, this problem is formulated as the task of learning a mapping function between multivariate time series with highly dissimilar structures. A comparison of state-of-the-art synthesis approaches, including Autoencoders, Generative Adversarial Networks and Pairwise Learning, is undertaken. Results highlight the feasibility of EEG to fMRI brain image mappings, pinpointing the role of current advances in Machine Learning and showing the relevance of upcoming contributions to further improve performance. EEG to fMRI synthesis offers a way to enhance and augment brain image data, and guarantee access to more affordable, portable and long-lasting protocols of brain activity monitoring. The code used in this manuscript is available on GitHub and the datasets are open source.
Submitted 29 September, 2020;
originally announced September 2020.
-
fMRI Multiple Missing Values Imputation Regularized by a Recurrent Denoiser
Authors:
David Calhas,
Rui Henriques
Abstract:
Functional Magnetic Resonance Imaging (fMRI) is a neuroimaging technique with pivotal importance due to its scientific and clinical applications. As with any widely used imaging modality, there is a need to ensure its quality, with missing values being highly frequent due to the presence of artifacts or sub-optimal imaging resolutions. Our work focuses on missing values imputation on multivariate signal data. To do so, a new imputation method is proposed consisting of two major steps: spatial-dependent signal imputation and time-dependent regularization of the imputed signal. A novel layer, to be used in deep learning architectures, is proposed in this work, bringing back the concept of chained equations for multiple imputation. Finally, a recurrent layer is applied to tune the signal, such that it captures its true patterns. Both operations yield an improved robustness against state-of-the-art alternatives.
Submitted 26 September, 2020;
originally announced September 2020.
-
TripMD: Driving patterns investigation via Motif Analysis
Authors:
Maria Inês Silva,
Roberto Henriques
Abstract:
Processing driving data and investigating driving behavior has been receiving increasing interest in the last decades, with applications ranging from car insurance pricing to policy making. A common strategy to analyze driving behavior is to study the maneuvers being performed by the driver. In this paper, we propose TripMD, a system that extracts the most relevant driving patterns from sensor recordings (such as acceleration) and provides a visualization that allows for an easy investigation. Additionally, we test our system using the UAH-DriveSet dataset, a publicly available naturalistic driving dataset. We show that (1) our system can extract a rich number of driving patterns from a single driver that are meaningful to understand driving behaviors and (2) our system can be used to identify the driving behavior of an unknown driver from a set of drivers whose behavior we know.
Submitted 5 July, 2021; v1 submitted 7 July, 2020;
originally announced July 2020.
-
Exploring time-series motifs through DTW-SOM
Authors:
Maria Inês Silva,
Roberto Henriques
Abstract:
Motif discovery is a fundamental step in data mining tasks for time-series data such as clustering, classification and anomaly detection. Even though many papers have addressed the problem of how to find motifs in time-series by proposing new motif discovery algorithms, not much work has been done on the exploration of the motifs extracted by these algorithms. In this paper, we argue that visually exploring time-series motifs computed by motif discovery algorithms can be useful to understand and debug results. To explore the output of motif discovery algorithms, we propose the use of an adapted Self-Organizing Map, the DTW-SOM, on the list of motif centers. In short, DTW-SOM is a vanilla Self-Organizing Map with three main differences, namely (1) the use of the Dynamic Time Warping distance instead of the Euclidean distance, (2) the adoption of two new network initialization routines (a random sample initialization and an anchor initialization) and (3) the adjustment of the Adaptation phase of the training to work with variable-length time-series sequences. We test DTW-SOM in a synthetic motif dataset and two real time-series datasets from the UCR Time Series Classification Archive. After an exploration of results, we conclude that DTW-SOM is capable of extracting relevant information from a set of motifs and displaying it in a visualization that is space-efficient.
Submitted 17 April, 2020;
originally announced April 2020.
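The abstract above replaces the Euclidean distance with Dynamic Time Warping so that sequences of different lengths can be compared. A minimal sketch of the textbook DTW recurrence follows. It is not the DTW-SOM code: the function name, the absolute-difference local cost, and the absence of a warping window are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j]; each step may
    advance in a, in b, or in both, which is what lets lengths differ.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(cost[i - 1][j],      # advance in a only
                                     cost[i][j - 1],      # advance in b only
                                     cost[i - 1][j - 1])  # advance in both
    return cost[n][m]
```

For instance, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0 because the repeated 2 can be aligned to the single 2, whereas a Euclidean distance is not even defined for sequences of unequal length.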
-
Finding manoeuvre motifs in vehicle telematics
Authors:
Maria Inês Silva,
Roberto Henriques
Abstract:
Driving behaviour has a great impact on road safety. A popular way of analysing driving behaviour is to move the focus to the manoeuvres as they give useful information about the driver who is performing them. In this paper, we investigate a new way of identifying manoeuvres from vehicle telematics data, through motif detection in time-series. We implement a modified version of the Extended Motif Discovery (EMD) algorithm, a classical variable-length motif detection algorithm for time-series and we applied it to the UAH-DriveSet, a publicly available naturalistic driving dataset. After a systematic exploration of the extracted motifs, we were able to conclude that the EMD algorithm was not only capable of extracting simple manoeuvres such as accelerations, brakes and curves, but also more complex manoeuvres, such as lane changes and overtaking manoeuvres, which validates motif discovery as a worthwhile line for future research.
Submitted 10 February, 2020;
originally announced February 2020.
-
Fitting IVIM with Variable Projection and Simplicial Optimization
Authors:
Shreyas Fadnavis,
Hamza Farooq,
Maryam Afzali,
Christoph Lenglet,
Tryphon Georgiou,
Hu Cheng,
Sharlene Newman,
Shahnawaz Ahmed,
Rafael Neto Henriques,
Eric Peterson,
Serge Koudoro,
Ariel Rokem,
Eleftherios Garyfallidis
Abstract:
Fitting multi-exponential models to Diffusion MRI (dMRI) data has always been challenging due to various underlying complexities. In this work, we introduce a novel and robust fitting framework for the standard two-compartment IVIM microstructural model. This framework provides a significant improvement over the existing methods and helps estimate the associated diffusion and perfusion parameters of IVIM in an automatic manner. As a part of this work we provide capabilities to switch between more advanced global optimization methods such as simplicial homology (SH) and differential evolution (DE). Our experiments show that the results obtained from this simultaneous fitting procedure disentangle the model parameters in a reduced subspace. The proposed framework extends the seminal work originated in the MIX framework, with improved procedures for multi-stage fitting. This framework has been made available as an open-source Python implementation and disseminated to the community through the DIPY project.
Submitted 15 February, 2020; v1 submitted 27 September, 2019;
originally announced October 2019.
-
On the use of Pairwise Distance Learning for Brain Signal Classification with Limited Observations
Authors:
David Calhas,
Enrique Romero,
Rui Henriques
Abstract:
The increasing access to brain signal data using electroencephalography creates new opportunities to study electrophysiological brain activity and perform ambulatory diagnoses of neuronal diseases. This work proposes a pairwise distance learning approach for Schizophrenia classification relying on the spectral properties of the signal. Given the limited number of observations (i.e. the case and/or control individuals) in clinical trials, we propose a Siamese neural network architecture to learn a discriminative feature space from pairwise combinations of observations per channel. In this way, the multivariate order of the signal is used as a form of data augmentation, further supporting the network generalization ability. Convolutional layers with parameters learned under a cosine contrastive loss are proposed to adequately explore spectral images derived from the brain signal. Results on a case-control population show that the features extracted using the proposed neural network lead to an improved Schizophrenia diagnosis (+10pp in accuracy and sensitivity) against spectral features, thus suggesting the existence of non-trivial, discriminative electrophysiological brain patterns.
Submitted 6 September, 2019; v1 submitted 5 June, 2019;
originally announced June 2019.
-
Order-Preserving Pattern Matching Indeterminate Strings
Authors:
Diogo Costa,
Luís M. S. Russo,
Rui Henriques,
Hideo Bannai,
Alexandre P. Francisco
Abstract:
Given an indeterminate string pattern $p$ and an indeterminate string text $t$, the problem of order-preserving pattern matching with character uncertainties ($μ$OPPM) is to find all substrings of $t$ that satisfy one of the possible orderings defined by $p$. When the text and pattern are determinate strings, we are in the presence of the well-studied exact order-preserving pattern matching (OPPM) problem with diverse applications on time series analysis. Despite its relevance, the exact OPPM problem suffers from two major drawbacks: 1) the inability to deal with indetermination in the text, thus preventing the analysis of noisy time series; and 2) the inability to deal with indetermination in the pattern, thus imposing the strict satisfaction of the orders among all pattern positions. This paper provides the first polynomial algorithm to answer the $μ$OPPM problem when indetermination is observed on the pattern or text. Given two strings with length $m$ and $O(r)$ uncertain characters per string position, we show that the $μ$OPPM problem can be solved in $O(mr\lg r)$ time when one string is indeterminate and $r\in\mathbb{N}^+$. Mappings into satisfiability problems are provided when indetermination is observed on both the pattern and the text, and results concerning the general problem complexity are presented as well, with $μ$OPPM problem proved to be NP-hard in general.
Submitted 7 May, 2019;
originally announced May 2019.
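For readers unfamiliar with the exact (determinate) OPPM problem that the abstract above generalizes, the following is a naive sketch of what an order-preserving match is. It is not the paper's algorithm and does not handle indeterminate characters. The function names are illustrative, and comparing dense ranks per window is a simple choice made here for clarity rather than efficiency.

```python
def dense_ranks(values):
    """Map each value to its rank among the distinct values (ties share a rank)."""
    rank_of = {v: r for r, v in enumerate(sorted(set(values)))}
    return [rank_of[v] for v in values]


def order_preserving_matches(text, pattern):
    """Start positions where a window of text has the same relative order as pattern.

    Two sequences are order-isomorphic exactly when their dense-rank
    sequences are equal, so each window is compared by its ranks.
    """
    m = len(pattern)
    target = dense_ranks(pattern)
    return [start for start in range(len(text) - m + 1)
            if dense_ranks(text[start:start + m]) == target]
```

With `text = [10, 20, 15, 30, 25]` and `pattern = [1, 3, 2]`, the windows starting at positions 0 and 2 match because both follow the low, high, middle shape of the pattern.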
-
Sustainable computational science: the ReScience initiative
Authors:
Nicolas P. Rougier,
Konrad Hinsen,
Frédéric Alexandre,
Thomas Arildsen,
Lorena Barba,
Fabien C. Y. Benureau,
C. Titus Brown,
Pierre de Buyl,
Ozan Caglayan,
Andrew P. Davison,
Marc André Delsuc,
Georgios Detorakis,
Alexandra K. Diem,
Damien Drix,
Pierre Enel,
Benoît Girard,
Olivia Guest,
Matt G. Hall,
Rafael Neto Henriques,
Xavier Hinaut,
Kamil S Jaron,
Mehdi Khamassi,
Almar Klein,
Tiina Manninen,
Pietro Marchesi
, et al. (20 additional authors not shown)
Abstract:
Computer science offers a large set of tools for prototyping, writing, running, testing, validating, sharing and reproducing results; however, computational science lags behind. In the best case, authors may provide their source code as a compressed archive and they may feel confident their research is reproducible. But this is not exactly true. James Buckheit and David Donoho proposed more than two decades ago that an article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code, and data that produced the result. This implies new workflows, in particular in peer review. Existing journals have been slow to adapt: source codes are rarely requested and hardly ever executed to check that they produce the results advertised in the article. ReScience is a peer-reviewed journal that targets computational research and encourages the explicit replication of already published research, promoting new and open-source implementations in order to ensure that the original research can be replicated from its description. To achieve this goal, the whole publishing chain is radically different from other traditional scientific journals. ReScience resides on GitHub where each new implementation of a computational study is made available together with comments, explanations, and software tests.
Submitted 11 November, 2017; v1 submitted 14 July, 2017;
originally announced July 2017.
-
Learning the structure of Bayesian Networks: A quantitative assessment of the effect of different algorithmic schemes
Authors:
Stefano Beretta,
Mauro Castelli,
Ivo Goncalves,
Roberto Henriques,
Daniele Ramazzotti
Abstract:
One of the most challenging tasks when adopting Bayesian Networks (BNs) is learning their structure from data. This task is complicated by the huge search space of possible solutions, and by the fact that the problem is NP-hard. Hence, full enumeration of all the possible solutions is not always feasible and approximations are often required. However, to the best of our knowledge, a quantitative analysis of the performance and characteristics of the different heuristics to solve this problem has never been done before.
For this reason, in this work, we provide a detailed comparison of many different state-of-the-art methods for structural learning on simulated data, considering BNs with both discrete and continuous variables, and with different rates of noise in the data. In particular, we investigate the performance of different widespread scores and algorithmic approaches proposed for the inference and the statistical pitfalls within them.
Submitted 3 August, 2018; v1 submitted 27 April, 2017;
originally announced April 2017.
-
On When and How to use SAT to Mine Frequent Itemsets
Authors:
Rui Henriques,
Inês Lynce,
Vasco Manquinho
Abstract:
A new stream of research was born in the last decade with the goal of mining itemsets of interest using Constraint Programming (CP). This has promoted a natural way to combine complex constraints in a highly flexible manner. Although state-of-the-art CP solutions formulate the task using Boolean variables, the few attempts to adopt propositional Satisfiability (SAT) have shown unsatisfactory performance. This work deepens the study of when and how to use SAT for the frequent itemset mining (FIM) problem by defining different encodings with multiple task-driven enumeration options and search strategies. Although in the majority of scenarios SAT-based solutions appear to be non-competitive with their CP peers, results show a variety of interesting cases where SAT encodings are the best option.
Submitted 26 July, 2012;
originally announced July 2012.
-
Automatic Test Generation for Space
Authors:
Ulisses Araujo Costa,
Daniela da Cruz,
Pedro Rangel Henriques
Abstract:
The European Space Agency (ESA) uses an engine to perform tests in the Ground Segment infrastructure, especially the Operational Simulator. This engine uses many different tools to ensure the development of regression testing infrastructure, and these tests perform black-box testing of the C++ simulator implementation. VST (VisionSpace Technologies) is one of the companies that provides these services to ESA, and it needs a tool to automatically infer tests from the existing C++ code, instead of manually writing scripts to perform tests. With this motivation in mind, this paper explores automatic testing approaches and tools in order to propose a system that satisfies VST's needs.
Submitted 22 June, 2012;
originally announced June 2012.
-
Modeling Languages: metrics and assessing tools
Authors:
Daniela Fonte,
Ismael Vilas Boas,
José Azevedo,
José João Peixoto,
Pedro Faria,
Pedro Silva,
Tiago Sá,
Ulisses Costa,
Daniela da Cruz,
Pedro Rangel Henriques
Abstract:
Any traditional engineering field has metrics to rigorously assess the quality of its products. Engineers know that the output must satisfy the requirements, must comply with the production and market rules, and must be competitive.
Professionals in the new field of software engineering started a few years ago to define metrics to appraise their product: individual programs and software systems. This concern motivates the need to assess not only the outcome but also the process and tools employed in its development. In this context, assessing the quality of programming languages is a legitimate objective; in a similar way, it makes sense to be concerned with models and modeling approaches, as more and more people start the software development process with a modeling phase.
In this paper we introduce and motivate the assessment of model quality in the software development cycle. After a general discussion of this topic, we focus on the most popular modeling language, UML, and present metrics for it. Through a case study, we present and explore two tools. To conclude, we identify what is still lacking on the tools side.
Submitted 20 June, 2012;
originally announced June 2012.