-
A data-science pipeline to enable the Interpretability of Many-Objective Feature Selection
Authors:
Uchechukwu F. Njoku,
Alberto Abelló,
Besim Bilalli,
Gianluca Bontempi
Abstract:
Many-Objective Feature Selection (MOFS) approaches use four or more objectives to determine the relevance of a subset of features in a supervised learning task. As a consequence, MOFS typically returns a large set of non-dominated solutions, which have to be assessed by the data scientist in order to proceed with the final choice. Given the multi-variate nature of the assessment, which may include criteria (e.g. fairness) not related to predictive accuracy, this step is often not straightforward and suffers from the lack of existing tools. For instance, it is common to make use of a tabular presentation of the solutions, which provides little information about the trade-offs and the relations between criteria over the set of solutions.
This paper proposes an original methodology to support data scientists in the interpretation and comparison of the MOFS outcome by combining post-processing and visualisation of the set of solutions. The methodology supports the data scientist in the selection of an optimal feature subset by providing her with high-level information at three different levels: objectives, solutions, and individual features.
The methodology is experimentally assessed on two feature selection tasks adopting a GA-based MOFS with six objectives (number of selected features, balanced accuracy, F1-Score, variance inflation factor, statistical parity, and equalised odds). The results show the added value of the methodology in the selection of the final subset of features.
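As a rough illustration of what a "set of non-dominated solutions" means, the sketch below filters candidate feature subsets by Pareto dominance. It is not the paper's GA-based method: the function names, the three-objective example, and its values are hypothetical, and every objective is assumed to be expressed so that lower is better (e.g. 1 minus balanced accuracy).

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is no worse than `b` on every objective and
    strictly better on at least one (all objectives are minimised)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(solutions: List[Sequence[float]]) -> List[int]:
    """Return the indices of solutions that no other solution dominates."""
    return [
        i
        for i, s in enumerate(solutions)
        if not any(dominates(t, s) for j, t in enumerate(solutions) if j != i)
    ]


# Hypothetical objective vectors:
# (number of features, 1 - balanced accuracy, statistical parity difference)
candidates = [
    (3, 0.10, 0.05),
    (5, 0.08, 0.07),
    (5, 0.12, 0.07),  # dominated by the first two candidates
    (2, 0.20, 0.01),
]
front = non_dominated(candidates)
```

With more objectives, fewer solutions dominate one another, which is why the returned set tends to grow large and becomes hard to inspect in a table.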
Submitted 30 November, 2023;
originally announced November 2023.
-
Federated Learning Enables Big Data for Rare Cancer Boundary Detection
Authors:
Sarthak Pati,
Ujjwal Baid,
Brandon Edwards,
Micah Sheller,
Shih-Han Wang,
G Anthony Reina,
Patrick Foley,
Alexey Gruzdev,
Deepthi Karkada,
Christos Davatzikos,
Chiharu Sako,
Satyam Ghodasara,
Michel Bilello,
Suyash Mohan,
Philipp Vollmuth,
Gianluca Brugnara,
Chandrakanth J Preetha,
Felix Sahm,
Klaus Maier-Hein,
Maximilian Zenk,
Martin Bendszus,
Wolfgang Wick,
Evan Calabrese,
Jeffrey Rudie,
Javier Villanueva-Meyer
, et al. (254 additional authors not shown)
Abstract:
Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even not feasible) due to various limitations. Federated ML (FL) provides an alternative to train accurate and generalizable ML models, by only sharing numerical model updates. Here we present findings from the largest FL study to-date, involving data from 71 healthcare institutions across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, utilizing the largest dataset of such patients ever used in the literature (25,256 MRI scans from 6,314 patients). We demonstrate a 33% improvement over a publicly trained model to delineate the surgically targetable tumor, and 23% improvement over the tumor's entire extent. We anticipate our study to: 1) enable more studies in healthcare informed by large and diverse data, ensuring meaningful results for rare diseases and underrepresented populations, 2) facilitate further quantitative analyses for glioblastoma via performance optimization of our consensus model for eventual public release, and 3) demonstrate the effectiveness of FL at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing.
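The idea of "only sharing numerical model updates" can be sketched as a sample-weighted average of per-site parameters. The abstract does not state which aggregation rule the study used, so the weighted averaging below is an assumption, and the function name, parameter layout, and example values are hypothetical.

```python
from typing import Dict, List


def federated_average(
    updates: List[Dict[str, List[float]]],
    num_samples: List[int],
) -> Dict[str, List[float]]:
    """Combine per-site parameter vectors into one consensus model.

    Each site contributes in proportion to its number of training samples.
    Only these numeric parameters are exchanged; raw data stays on site.
    """
    total = sum(num_samples)
    averaged: Dict[str, List[float]] = {}
    for name in updates[0]:
        length = len(updates[0][name])
        averaged[name] = [
            sum(u[name][k] * n for u, n in zip(updates, num_samples)) / total
            for k in range(length)
        ]
    return averaged


# Two hypothetical sites holding 1 and 3 training samples respectively.
site_updates = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
consensus = federated_average(site_updates, num_samples=[1, 3])
```

In a real deployment this step would repeat over many rounds, with each site retraining locally from the latest consensus model before sending its next update.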
Submitted 25 April, 2022; v1 submitted 22 April, 2022;
originally announced April 2022.
-
Eris: Measuring discord among multidimensional data sources
Authors:
Alberto Abello,
James Cheney
Abstract:
Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and data fusion. To solve the latter, it is mostly assumed that ground truth can be determined. However, in general, the data gathering processes in the different sources are imperfect and cannot provide an accurate merging of values. Thus, in the absence of ways to determine ground truth, it is important to at least quantify how far from being internally consistent a dataset is. Hence, we propose definitions of concordant data and define a discordance metric as a way of measuring disagreement to improve decision making based on trustworthiness.
We define the discord measurement problem of numerical attributes in which given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the alignment of different conceptualizations of the same reality (e.g., granularities or units), we wish to assess whether the different sources are concordant, or if not, use the discordance metric to quantify how discordant they are. We also define a set of algebraic operators to describe the alignments of different data sources with correctness guarantees, together with two alternative relational database implementations that reduce the problem to linear or quadratic programming. These are evaluated against both COVID-19 and synthetic data, and our experimental results show that discordance measurement can be performed efficiently in realistic situations.
Submitted 17 August, 2023; v1 submitted 31 January, 2022;
originally announced January 2022.
-
Improving Web API Usage Logging
Authors:
Rediana Koçi,
Xavier Franch,
Petar Jovanovic,
Alberto Abelló
Abstract:
A Web API (WAPI) is a type of API whose interaction with its consumers is done through the Internet. While being accessed through the Internet can be challenging, mostly when WAPIs evolve, it gives providers the possibility to monitor their usage, and understand and analyze consumers' behavior. Currently, WAPI usage is mostly logged for traffic monitoring and troubleshooting. Even though the logs contain invaluable information regarding consumers' behavior, they are not sufficiently used by providers. In this paper, we first consider two phases of the application development lifecycle, and based on them we distinguish two different types of usage logs, namely development logs and production logs. For each of them we show the potential analyses (e.g., WAPI usability evaluation, consumers' needs identification) that can be performed, as well as the main impediments that may be caused by an unsuitable log format. We then conduct a case study using logs of the same WAPI from different deployments and different formats, to demonstrate the occurrence of these impediments and at the same time the importance of a proper log format. Next, based on the case study results, we present the main quality issues of WAPI log data and explain their impact on data analyses. For each issue, we give practical suggestions on how to deal with it and how to mitigate its root cause.
Submitted 19 March, 2021;
originally announced March 2021.
-
MEDAL: An AI-driven Data Fabric Concept for Elastic Cloud-to-Edge Intelligence
Authors:
Vasileios Theodorou,
Ilias Gerostathopoulos,
Iyad Alshabani,
Alberto Abello,
David Breitgand
Abstract:
Current Cloud solutions for Edge Computing are inefficient for data-centric applications, as they focus on the IaaS/PaaS level and they miss the data modeling and operations perspective. Consequently, Edge Computing opportunities are lost due to cumbersome and data assets-agnostic processes for end-to-end deployment over the Cloud-to-Edge continuum. In this paper, we introduce MEDAL, an intelligent Cloud-to-Edge Data Fabric to support Data Operations (DataOps) across the continuum and to automate management and orchestration operations over a combined view of the data and the resource layer. MEDAL facilitates building and managing data workflows on top of existing flexible and composable data services, seamlessly exploiting and federating IaaS/PaaS/SaaS resources across different Cloud and Edge environments. We describe the MEDAL Platform as a usable tool for Data Scientists and Engineers, encompassing our concept, and we illustrate its application through a connected cars use case.
Submitted 25 February, 2021;
originally announced February 2021.
-
A Cost-based Storage Format Selector for Materialization in Big Data Frameworks
Authors:
Rana Faisal Munir,
Alberto Abelló,
Oscar Romero,
Maik Thiele,
Wolfgang Lehner
Abstract:
Modern big data frameworks (such as Hadoop and Spark) allow multiple users to do large-scale analysis simultaneously. Typically, users deploy Data-Intensive Workflows (DIWs) for their analytical tasks. These DIWs of different users share many common parts (i.e., 50-80%), which can be materialized to reuse them in future executions. The materialization improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on Distributed File Systems (DFS) by using a fixed data format. However, a fixed choice might not be the optimal one for every situation. For example, it is well-known that different data fragmentation strategies (i.e., horizontal, vertical or hybrid) behave better or worse according to the access patterns of the subsequent operations.
In this paper, we present a cost-based approach which helps decide the most appropriate storage format in every situation. A generic cost-based storage format selector framework considering the three fragmentation strategies is presented. Then, we use our framework to instantiate cost models for specific Hadoop data formats (namely SequenceFile, Avro and Parquet), and test it with realistic use cases. Our solution gives on average 33% speedup over SequenceFile, 11% speedup over Avro, 32% speedup over Parquet, and overall, it provides up to 25% performance gain.
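The general shape of a cost-based selector can be sketched as "estimate the cost of writing once plus reading under the expected access pattern, then take the cheapest format". The sketch below is only a toy: the function name, the operation labels, and all cost figures are hypothetical placeholders, not the paper's cost models or measurements.

```python
from typing import Dict, List


def select_format(
    write_cost: Dict[str, float],
    read_cost: Dict[str, Dict[str, float]],
    workload: List[str],
) -> str:
    """Pick the format with the lowest estimated total cost.

    Total cost = cost of materializing once + cost of each later read.
    """

    def total(fmt: str) -> float:
        return write_cost[fmt] + sum(read_cost[fmt][op] for op in workload)

    return min(write_cost, key=total)


# Hypothetical per-format cost estimates (arbitrary units).
write_cost = {"SequenceFile": 1.0, "Avro": 1.5, "Parquet": 2.0}
read_cost = {
    "SequenceFile": {"scan": 3.0, "projection": 3.0},
    "Avro": {"scan": 2.0, "projection": 2.5},
    "Parquet": {"scan": 3.0, "projection": 1.0},
}

# Two later projections favour the columnar layout in this toy setup,
# while a single full scan favours a row-oriented one.
choice_projection = select_format(write_cost, read_cost, ["projection", "projection"])
choice_scan = select_format(write_cost, read_cost, ["scan"])
```

The point of the toy is that the best choice flips with the access pattern of the subsequent operations, which is the motivation the abstract gives for not fixing the format in advance.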
Submitted 11 June, 2018;
originally announced June 2018.
-
Day-ahead Trading of Aggregated Energy Flexibility - Full Version
Authors:
Emmanouil Valsomatzis,
Torben Bach Pedersen,
Alberto Abello
Abstract:
Flexibility of small loads, in particular from Electric Vehicles (EVs), has recently attracted a lot of interest due to their possibility of participating in the energy market and the new commercial potentials. Different from existing work, the aggregation techniques proposed in this paper produce flexible aggregated loads from EVs taking into account technical market requirements. They can be further transformed into the so-called flexible orders and be traded in the day-ahead market by a Balance Responsible Party (BRP). As a result, the BRP can achieve at least 20% cost reduction on average in energy purchase compared to traditional charging based on 2017 real electricity prices from the Danish electricity market.
Submitted 24 May, 2018; v1 submitted 6 May, 2018;
originally announced May 2018.
-
PRESISTANT: Learning based assistant for data pre-processing
Authors:
Besim Bilalli,
Alberto Abelló,
Tomàs Aluja-Banet,
Robert Wrembel
Abstract:
Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., classification task). A given data pre-processing operator (e.g., transformation) can have positive, negative or zero impact on the final result of the analysis. Expert users have the required knowledge to find the right pre-processing operators. However, non-experts are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim at providing assistance to non-expert users by recommending data pre-processing operators that are ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of 5 different classification algorithms, such as J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
Submitted 2 March, 2018;
originally announced March 2018.
-
An Integration-Oriented Ontology to Govern Evolution in Big Data Ecosystems
Authors:
Sergi Nadal,
Oscar Romero,
Alberto Abelló,
Panos Vassiliadis,
Stijn Vansummeren
Abstract:
Big Data architectures allow heterogeneous data from multiple sources to be flexibly stored and processed in their original format. The structure of those data, commonly supplied by means of REST APIs, is continuously evolving. Thus, data analysts need to adapt their analytical processes after each API release. This gets more challenging when performing an integrated or historical analysis. To cope with such complexity, in this paper, we present the Big Data Integration ontology, the core construct to govern the data integration process under schema evolution by systematically annotating it with information regarding the schema of the sources. We present a query rewriting algorithm that, using the annotated ontology, converts queries posed over the ontology to queries over the sources. To cope with syntactic evolution in the sources, we present an algorithm that semi-automatically adapts the ontology upon new releases. This guarantees that ontology-mediated queries correctly retrieve data from the most recent schema version, as well as correctness in historical queries. A functional and performance evaluation on real-world APIs is performed to validate our approach.
Submitted 16 January, 2018;
originally announced January 2018.