-
Hip Fracture Patient Pathways and Agent-based Modelling
Authors:
Alison N. O'Connor,
Stephen E. Ryan,
Gauri Vaidya,
Paul Harford,
Meghana Kshirsagar
Abstract:
Increased healthcare demand is significantly straining European services. Digital solutions including advanced modelling techniques offer a promising solution to optimising patient flow without impacting day-to-day healthcare provision. In this work we outline an ongoing project that aims to optimise healthcare resources using agent-based simulations.
Submitted 18 October, 2024; v1 submitted 30 September, 2024;
originally announced October 2024.
-
A Blueprint for Auditing Generative AI
Authors:
Jakob Mokander,
Justin Curl,
Mihir Kshirsagar
Abstract:
The widespread use of generative AI systems is coupled with significant ethical and social challenges. As a result, policymakers, academic researchers, and social advocacy groups have all called for such systems to be audited. However, existing auditing procedures fail to address the governance challenges posed by generative AI systems, which display emergent capabilities and are adaptable to a wide range of downstream tasks. In this chapter, we address that gap by outlining a novel blueprint for how to audit such systems. Specifically, we propose a three-layered approach, whereby governance audits (of technology providers that design and disseminate generative AI systems), model audits (of generative AI systems after pre-training but prior to their release), and application audits (of applications based on top of generative AI systems) complement and inform each other. We show how audits on these three levels, when conducted in a structured and coordinated manner, can be a feasible and effective mechanism for identifying and managing some of the ethical and social risks posed by generative AI systems. That said, it is important to remain realistic about what auditing can reasonably be expected to achieve. For this reason, the chapter also discusses the limitations not only of our three-layered approach but also of the prospect of auditing generative AI systems at all. Ultimately, this chapter seeks to expand the methodological toolkit available to technology providers and policymakers who wish to analyse and evaluate generative AI systems from technical, ethical, and legal perspectives.
Submitted 7 July, 2024;
originally announced July 2024.
-
A Novel ML-driven Test Case Selection Approach for Enhancing the Performance of Grammatical Evolution
Authors:
Krishn Kumar Gupt,
Meghana Kshirsagar,
Douglas Mota Dias,
Joseph P. Sullivan,
Conor Ryan
Abstract:
Computational cost in metaheuristics such as Evolutionary Algorithms (EAs) is often a major concern, particularly with their ability to scale. In data-based training, traditional EAs typically use a significant portion, if not all, of the dataset for model training and fitness evaluation in each generation. This makes EAs suffer from high computational costs incurred during the fitness evaluation of the population, particularly when working with large datasets. To mitigate this issue, we propose a Machine Learning (ML)-driven Distance-based Selection (DBS) algorithm that reduces the fitness evaluation time by optimizing test cases. We test our algorithm by applying it to 24 benchmark problems from Symbolic Regression (SR) and digital circuit domains and then using Grammatical Evolution (GE) to train models using the reduced dataset. We use GE to test DBS on SR and produce a system flexible enough to test it on digital circuit problems further. The quality of the solutions is tested and compared against the conventional training method to measure the coverage of training data selected using DBS, i.e., how well the subset matches the statistical properties of the entire dataset. Moreover, the effect of optimized training data on run time and the effective size of the evolved solutions is analyzed. Experimental and statistical evaluations of the results show our method empowered GE to yield superior or comparable solutions to the baseline (using the full datasets) with smaller sizes and demonstrates computational efficiency in terms of speed.
Submitted 21 December, 2023;
originally announced December 2023.
-
Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data
Authors:
Mayana Pereira,
Meghana Kshirsagar,
Sumit Mukherjee,
Rahul Dodhia,
Juan Lavista Ferres,
Rafael de Sousa
Abstract:
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic data set generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generator MWEM PGM can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.
Submitted 29 October, 2023;
originally announced October 2023.
-
How Algorithms Shape the Distribution of Political Advertising: Case Studies of Facebook, Google, and TikTok
Authors:
Orestis Papakyriakopoulos,
Christelle Tessono,
Arvind Narayanan,
Mihir Kshirsagar
Abstract:
Online platforms play an increasingly important role in shaping democracy by influencing the distribution of political information to the electorate. In recent years, political campaigns have spent heavily on the platforms' algorithmic tools to target voters with online advertising. While the public interest in understanding how platforms perform the task of shaping the political discourse has never been higher, the efforts of the major platforms to make the necessary disclosures to understand their practices fall woefully short. In this study, we collect and analyze a dataset containing over 800,000 ads and 2.5 million videos about the 2020 U.S. presidential election from Facebook, Google, and TikTok. We conduct the first large-scale data analysis of public data to critically evaluate how these platforms amplified or moderated the distribution of political advertisements. We conclude with recommendations for how to improve the disclosures so that the public can hold the platforms and political advertisers accountable.
Submitted 13 July, 2022; v1 submitted 9 June, 2022;
originally announced June 2022.
-
An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises
Authors:
Mayana Pereira,
Meghana Kshirsagar,
Sumit Mukherjee,
Rahul Dodhia,
Juan Lavista Ferres
Abstract:
Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models while respecting the privacy of individual data providers. The effect of DP on the fairness of the resulting trained models is not yet well understood. In this contribution, we systematically study the effects of differentially private synthetic data generation on classification. We analyze disparities in model utility and bias caused by the synthetic dataset, measured through algorithmic fairness metrics. Our first set of results shows that although there seems to be a clear negative correlation between privacy and utility (the more private, the less accurate) across all data synthesizers we evaluated, more privacy does not necessarily imply more bias. Additionally, we assess the effects of utilizing synthetic datasets for model training and model evaluation. We show that results obtained on synthetic data can misestimate the actual model performance when it is deployed on real data. We hence advocate for defining proper testing protocols in scenarios where differentially private synthetic datasets are utilized for model training and evaluation.
Submitted 15 June, 2021;
originally announced June 2021.
-
Becoming Good at AI for Good
Authors:
Meghana Kshirsagar,
Caleb Robinson,
Siyu Yang,
Shahrzad Gholami,
Ivan Klyuzhin,
Sumit Mukherjee,
Md Nasir,
Anthony Ortiz,
Felipe Oviedo,
Darren Tanner,
Anusua Trivedi,
Yixi Xu,
Ming Zhong,
Bistra Dilkina,
Rahul Dodhia,
Juan M. Lavista Ferres
Abstract:
AI for good (AI4G) projects involve developing and applying artificial intelligence (AI) based solutions to further goals in areas such as sustainability, health, humanitarian aid, and social justice. Developing and deploying such solutions must be done in collaboration with partners who are experts in the domain in question and who already have experience in making progress towards such goals. Based on our experiences, we detail the different aspects of this type of collaboration broken down into four high-level categories: communication, data, modeling, and impact, and distill eleven takeaways to guide such projects in the future. We briefly describe two case studies to illustrate how some of these takeaways were applied in practice during our past collaborations.
Submitted 3 May, 2021; v1 submitted 23 April, 2021;
originally announced April 2021.
-
What Makes a Dark Pattern... Dark? Design Attributes, Normative Considerations, and Measurement Methods
Authors:
Arunesh Mathur,
Jonathan Mayer,
Mihir Kshirsagar
Abstract:
There is a rapidly growing literature on dark patterns, user interface designs -- typically related to shopping or privacy -- that researchers deem problematic. Recent work has been predominantly descriptive, documenting and categorizing objectionable user interfaces. These contributions have been invaluable in highlighting specific designs for researchers and policymakers. But the current literature lacks a conceptual foundation: What makes a user interface a dark pattern? Why are certain designs problematic for users or society?
We review recent work on dark patterns and demonstrate that the literature does not reflect a singular concern or consistent definition, but rather, a set of thematically related considerations. Drawing from scholarship in psychology, economics, ethics, philosophy, and law, we articulate a set of normative perspectives for analyzing dark patterns and their effects on individuals and society. We then show how future research on dark patterns can go beyond subjective criticism of user interface designs and apply empirical methods grounded in normative perspectives.
Submitted 12 January, 2021;
originally announced January 2021.
-
Virtual Classrooms and Real Harms: Remote Learning at U.S. Universities
Authors:
Shaanan Cohney,
Ross Teixeira,
Anne Kohlbrenner,
Arvind Narayanan,
Mihir Kshirsagar,
Yan Shvartzshnaider,
Madelyn Sanfilippo
Abstract:
Universities have been forced to rely on remote educational technology to facilitate the rapid shift to online learning. In doing so, they acquire new risks of security vulnerabilities and privacy violations. To help universities navigate this landscape, we develop a model that describes the actors, incentives, and risks, informed by surveying 49 educators and 14 administrators at U.S. universities. Next, we develop a methodology for administrators to assess security and privacy risks of these products. We then conduct a privacy and security analysis of 23 popular platforms using a combination of sociological analyses of privacy policies and 129 state laws, alongside a technical assessment of platform software. Based on our findings, we develop recommendations for universities to mitigate the risks to their stakeholders.
Submitted 15 June, 2021; v1 submitted 10 December, 2020;
originally announced December 2020.
-
Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset
Authors:
Ryan Amos,
Gunes Acar,
Eli Lucherini,
Mihir Kshirsagar,
Arvind Narayanan,
Jonathan Mayer
Abstract:
Automated analysis of privacy policies has proved a fruitful research direction, with developments such as automated policy summarization, question answering systems, and compliance detection. Prior research has been limited to analysis of privacy policies from a single point in time or from short spans of time, as researchers did not have access to a large-scale, longitudinal, curated dataset. To address this gap, we developed a crawler that discovers, downloads, and extracts archived privacy policies from the Internet Archive's Wayback Machine. Using the crawler and following a series of validation and quality control steps, we curated a dataset of 1,071,488 English language privacy policies, spanning over two decades and over 130,000 distinct websites.
Our analyses of the data paint a troubling picture of the transparency and accessibility of privacy policies. By comparing the occurrence of tracking-related terminology in our dataset to prior web privacy measurements, we find that privacy policies have consistently failed to disclose the presence of common tracking technologies and third parties. We also find that over the last twenty years privacy policies have become even more difficult to read, doubling in length and increasing a full grade in the median reading level. Our data indicate that self-regulation for first-party websites has stagnated, while self-regulation for third parties has increased but is dominated by online advertising trade associations. Finally, we contribute to the literature on privacy regulation by demonstrating the historic impact of the GDPR on privacy policies.
Submitted 20 July, 2021; v1 submitted 20 August, 2020;
originally announced August 2020.
-
Interpretable Network Propagation with Application to Expanding the Repertoire of Human Proteins that Interact with SARS-CoV-2
Authors:
Jeffrey N. Law,
Kyle Akers,
Nure Tasnina,
Catherine M. Della Santina,
Shay Deutsch,
Meghana Kshirsagar,
Judith Klein-Seetharaman,
Mark Crovella,
Padmavathy Rajagopalan,
Simon Kasif,
T. M. Murali
Abstract:
Background: Network propagation has been widely used for nearly 20 years to predict gene functions and phenotypes. Despite the popularity of this approach, little attention has been paid to the question of provenance tracing in this context, e.g., determining how much any experimental observation in the input contributes to the score of every prediction.
Results: We design a network propagation framework with two novel components and apply it to predict human proteins that directly or indirectly interact with SARS-CoV-2 proteins. First, we trace the provenance of each prediction to its experimentally validated sources, which in our case are human proteins experimentally determined to interact with viral proteins. Second, we design a technique that helps to reduce the manual adjustment of parameters by users. We find that for every top-ranking prediction, the highest contribution to its score arises from a direct neighbor in a human protein-protein interaction network. We further analyze these results to develop functional insights on SARS-CoV-2 that expand on known biology such as the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents.
Conclusions: We examine how our provenance tracing method can be generalized to a broad class of network-based algorithms. We provide a useful resource for the SARS-CoV-2 community that implicates many previously undocumented proteins with putative functional relationships to viral infection. This resource includes potential drugs that can be opportunistically repositioned to target these proteins. We also discuss how our overall framework can be extended to other, newly-emerging viruses.
Submitted 19 November, 2021; v1 submitted 2 June, 2020;
originally announced June 2020.
-
Learning task structure via sparsity grouped multitask learning
Authors:
Meghana Kshirsagar,
Eunho Yang,
Aurélie C. Lozano
Abstract:
Sparse mapping has been a key methodology in many high-dimensional scientific problems. When multiple tasks share the set of relevant features, learning them jointly in a group drastically improves the quality of relevant feature selection. However, in practice this technique is used limitedly since such grouping information is usually hidden. In this paper, our goal is to recover the group structure on the sparsity patterns and leverage that information in the sparse learning. Toward this, we formulate a joint optimization problem in the task parameter and the group membership, by constructing an appropriate regularizer to encourage sparse learning as well as correct recovery of task groups. We further demonstrate that our proposed method recovers groups and the sparsity patterns in the task parameters accurately by extensive experiments.
Submitted 14 September, 2017; v1 submitted 13 May, 2017;
originally announced May 2017.
-
Survey on Modelling Methods Applicable to Gene Regulatory Network
Authors:
Chanda Panse,
Dr. Manali Kshirsagar
Abstract:
A Gene Regulatory Network (GRN) plays an important role in gaining insight into the cellular life cycle. It gives information about the environmental conditions under which genes of particular interest are over-expressed or under-expressed. Modelling a GRN amounts to finding interactive relationships between genes. An interaction can be positive or negative. For inference of a GRN, time series data provided by microarray technology is used. Key factors to be considered while constructing a GRN are scalability, robustness, reliability and maximum detection of true positive interactions between genes. This paper gives a detailed technical review of existing methods applied for building GRNs, along with scope for future work.
Submitted 9 October, 2013;
originally announced October 2013.