-
Fair In-Context Learning via Latent Concept Variables
Authors:
Karuna Bhaila,
Minh-Hao Van,
Kennedy Edemacu,
Chen Zhao,
Feng Chen,
Xintao Wu
Abstract:
The emerging in-context learning (ICL) ability of large language models (LLMs) has prompted their use for predictive tasks in various domains with different types of data facilitated by serialization methods. However, with increasing applications in high-stakes domains, it has been shown that LLMs can inherit social bias and discrimination from their pre-training data. In this work, we investigate this inherent bias in LLMs during in-context learning with tabular data. We focus on an optimal demonstration selection approach that utilizes latent concept variables for resource-efficient task adaptation. We design data augmentation strategies that reduce correlation between predictive outcomes and sensitive variables, helping to promote fairness during latent concept learning. We utilize the learned concept and select demonstrations from a training dataset to obtain fair predictions during inference while maintaining model utility. The latent concept variable is learned using a smaller internal LLM, and the selected demonstrations can be used for inference with larger external LLMs. We empirically verify that the fair latent variable approach improves fairness results on tabular datasets compared to multiple heuristic demonstration selection methods.
Submitted 4 November, 2024;
originally announced November 2024.
-
Privacy Preserving Prompt Engineering: A Survey
Authors:
Kennedy Edemacu,
Xintao Wu
Abstract:
Pre-trained language models (PLMs) have demonstrated significant proficiency in solving a wide range of general natural language processing (NLP) tasks. Researchers have observed a direct correlation between the performance of these models and their sizes. As a result, the sizes of these models have notably expanded in recent years, persuading researchers to adopt the term large language models (LLMs) to characterize the larger-sized PLMs. The size expansion comes with a distinct capability called in-context learning (ICL), which represents a special form of prompting and allows the models to be utilized through the presentation of demonstration examples without modifications to the model parameters. Despite its promise, privacy concerns have become a major obstacle to the widespread use of ICL. Multiple studies have examined the privacy risks linked to ICL and prompting in general, and have devised techniques to alleviate these risks. Thus, there is a need to organize these mitigation techniques for the benefit of the community. This survey provides a systematic overview of the privacy protection methods employed during ICL and prompting in general. We review, analyze, and compare different methods under this paradigm. Furthermore, we provide a summary of the resources accessible for the development of these frameworks. Finally, we discuss the limitations of these frameworks and offer a detailed examination of the promising areas that necessitate further exploration.
Submitted 10 April, 2024; v1 submitted 9 April, 2024;
originally announced April 2024.
-
DP-TabICL: In-Context Learning with Differentially Private Tabular Data
Authors:
Alycia N. Carey,
Karuna Bhaila,
Kennedy Edemacu,
Xintao Wu
Abstract:
In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks by conditioning on demonstrations of question-answer pairs and it has been shown to have comparable performance to costly model retraining and fine-tuning. Recently, ICL has been extended to allow tabular data to be used as demonstration examples by serializing individual records into natural language formats. However, it has been shown that LLMs can leak information contained in prompts, and since tabular data often contain sensitive information, understanding how to protect the underlying tabular data used in ICL is a critical area of research. This work serves as an initial investigation into how to use differential privacy (DP) -- the long-established gold standard for data privacy and anonymization -- to protect tabular data used in ICL. Specifically, we investigate the application of DP mechanisms for private tabular ICL via data privatization prior to serialization and prompting. We formulate two private ICL frameworks with provable privacy guarantees in both the local (LDP-TabICL) and global (GDP-TabICL) DP scenarios via injecting noise into individual records or group statistics, respectively. We evaluate our DP-based frameworks on eight real-world tabular datasets and across multiple ICL and DP settings. Our evaluations show that DP-based ICL can protect the privacy of the underlying tabular data while achieving comparable performance to non-LLM baselines, especially under high privacy regimes.
Submitted 8 March, 2024;
originally announced March 2024.
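The local and global settings described in the DP-TabICL abstract above (noise added to individual records versus to group statistics, before serialization) can be illustrated with a minimal sketch. The abstract does not name the mechanisms, so the choices here are assumptions: k-ary randomized response stands in for the local case and the Laplace mechanism on a count stands in for the global case. The function names and the "feature is value" serialization template are hypothetical and are not taken from the paper.

```python
import math
import random


def randomized_response(value, domain, epsilon, rng):
    """Local DP (assumed mechanism): k-ary randomized response.

    Keeps the true value with probability e^eps / (e^eps + k - 1),
    otherwise reports a uniformly chosen different value from `domain`.
    """
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_keep:
        return value
    return rng.choice([v for v in domain if v != value])


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(count, epsilon, rng):
    """Global DP (assumed mechanism): Laplace noise on a count.

    A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    """
    return count + laplace_noise(1.0 / epsilon, rng)


def serialize(record):
    """Hypothetical template turning one record into a prompt sentence."""
    return ", ".join(f"{k} is {v}" for k, v in record.items()) + "."


if __name__ == "__main__":
    rng = random.Random(0)
    record = {"age": 39, "workclass": "Private"}
    domain = ["Private", "Self-employed", "Government"]

    # Local setting: privatize the record itself, then serialize it.
    private_record = dict(record)
    private_record["workclass"] = randomized_response(
        record["workclass"], domain, epsilon=1.0, rng=rng
    )
    print(serialize(private_record))

    # Global setting: privatize a group statistic instead of the records.
    print(noisy_count(120, epsilon=1.0, rng=rng))
```

In both cases privatization happens before any text reaches the LLM, which matches the "data privatization prior to serialization and prompting" ordering stated in the abstract. Smaller values of epsilon give stronger privacy and noisier demonstrations.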
-
Scalable Multi-Party Privacy-Preserving Gradient Tree Boosting over Vertically Partitioned Dataset with Outsourced Computations
Authors:
Kennedy Edemacu,
Beakcheol Jang,
Jong Wook Kim
Abstract:
Due to privacy concerns, multi-party gradient tree boosting algorithms have become widely popular amongst machine learning researchers and practitioners. However, few existing works have focused on vertically partitioned datasets, and those that do are either not scalable or tend to leak information. Thus, in this work, we propose SSXGB, a scalable and secure multi-party gradient tree boosting framework for vertically partitioned datasets with partially outsourced computations. Specifically, we employ an additive homomorphic encryption (HE) scheme for security. We design two sub-protocols based on the HE scheme to perform non-linear operations associated with gradient tree boosting algorithms. Next, we propose secure training and secure prediction algorithms under the SSXGB framework. Then we provide theoretical security and communication analysis for the proposed framework. Finally, we evaluate the performance of the framework with experiments using two real-world datasets.
Submitted 7 February, 2022;
originally announced February 2022.
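The additive homomorphic property that the SSXGB abstract above relies on can be shown with a toy example. The abstract does not say which HE scheme is used, so Paillier is assumed here purely for illustration. The primes are tiny and offer no security, and all function names are hypothetical. The sketch needs Python 3.8 or later for the modular inverse via `pow(x, -1, n)`.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # with g = n + 1, L(g^lam mod n^2) = lam mod n
    return n, (lam, mu)


def encrypt(n, m, rng):
    """Encrypt integer m (0 <= m < n) as (1 + m*n) * r^n mod n^2."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n


def add(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


def scale(n, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    return pow(c, k, n * n)


if __name__ == "__main__":
    rng = random.Random(0)
    n, priv = keygen(1789, 1867)  # demo-sized primes, insecure
    c1 = encrypt(n, 1234, rng)
    c2 = encrypt(n, 4321, rng)
    print(decrypt(n, priv, add(n, c1, c2)))   # sum of the two plaintexts
    print(decrypt(n, priv, scale(n, c1, 3)))  # three times the first plaintext
```

Only addition and multiplication by a known constant are available on ciphertexts in a scheme of this kind, which is consistent with the abstract's statement that separate sub-protocols are needed for the non-linear operations in gradient tree boosting.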
-
Reliability Check via Weight Similarity in Privacy-Preserving Multi-Party Machine Learning
Authors:
Kennedy Edemacu,
Beakcheol Jang,
Jong Wook Kim
Abstract:
Multi-party machine learning is a paradigm in which multiple participants collaboratively train a machine learning model to achieve a common learning objective without sharing their privately owned data. The paradigm has recently received a lot of attention from the research community aimed at addressing its associated privacy concerns. In this work, we focus on addressing the concerns of data privacy, model privacy, and data quality associated with privacy-preserving multi-party machine learning, i.e., we present a scheme for privacy-preserving collaborative learning that checks the participants' data quality while guaranteeing data and model privacy. In particular, we propose a novel metric called weight similarity that is securely computed and used to check whether a participant can be categorized as a reliable participant (holds good quality data) or not. The problems of model and data privacy are tackled by integrating homomorphic encryption in our scheme and uploading encrypted weights, which prevent leakages to the server and malicious participants, respectively. The analytical and experimental evaluations of our scheme demonstrate that it is accurate and ensures data and model privacy.
Submitted 14 January, 2021;
originally announced January 2021.
-
Packet Drop Attack Detection Techniques in Wireless Ad hoc Networks: A Review
Authors:
Kennedy Edemacu,
Martin Euku,
Richard Ssekibuule
Abstract:
Wireless ad hoc networks have gained a lot of attention due to their ease and low cost of deployment. This has made ad hoc networks of great importance in numerous military and civilian applications. However, the lack of centralized management makes these networks vulnerable to a number of security attacks. One such attack is the packet drop attack, in which a compromised node maliciously drops packets. Several techniques have been proposed to detect the packet drop attack in wireless ad hoc networks. In this paper, we review some of these detection techniques and comparatively analyze them based on their ability to detect the attack under different attack strategies (partial and/or cooperative attacks) and environments, and on the computational and communication overheads incurred during detection.
Submitted 8 October, 2014;
originally announced October 2014.