-
On Subset Retrieval and Group Testing Problems with Differential Privacy Constraints
Authors:
Mira Gonen,
Michael Langberg,
Alex Sprintson
Abstract:
This paper focuses on the design and analysis of privacy-preserving techniques for group testing and infection status retrieval. Our work is motivated by the need to provide accurate information on the status of disease spread among a group of individuals while protecting the privacy of the infection status of any single individual involved. The paper is motivated by practical scenarios, such as controlling the spread of infectious diseases, where individuals might be reluctant to participate in testing if their outcomes are not kept confidential.
The paper makes the following contributions. First, we present a differential privacy framework for the subset retrieval problem, which focuses on sharing the infection status of individuals with administrators and decision-makers. We characterize the trade-off between the accuracy of subset retrieval and the degree of privacy guaranteed to the individuals. In particular, we establish tight lower and upper bounds on the achievable level of accuracy subject to the differential privacy constraints. We then formulate the differential privacy framework for the noisy group testing problem in which noise is added either before or after the pooling process. We establish a reduction between the private subset retrieval and noisy group testing problems and show that the converse and achievability schemes for subset retrieval carry over to differentially private group testing.
Submitted 22 January, 2025;
originally announced January 2025.
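As a rough illustration of the privacy-accuracy trade-off described in the abstract above, the sketch below uses randomized response, a standard textbook mechanism that satisfies $\varepsilon$-differential privacy for a single binary status. It is not the scheme from the paper; the function names and the aggregate estimator are assumptions made for illustration only.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true status; equals e^eps / (1 + e^eps)."""
    return 1.0 / (1.0 + math.exp(-epsilon))


def randomized_response(status, epsilon, rng=random):
    """Report the true infection status with probability p, the flipped one otherwise.

    The likelihood ratio between the two possible inputs is p / (1 - p) = e^eps,
    which is the epsilon-differential-privacy condition for a single bit.
    """
    if rng.random() < keep_probability(epsilon):
        return status
    return not status


def estimate_infected_fraction(reports, epsilon):
    """Unbiased estimate of the true infected fraction from noisy reports (epsilon > 0)."""
    p = keep_probability(epsilon)
    observed = sum(reports) / len(reports)
    # E[observed] = (1 - p) + f * (2p - 1), solved here for the true fraction f.
    return (observed - (1 - p)) / (2 * p - 1)
```

A smaller `epsilon` pushes the keep probability toward 1/2, so each report reveals less about the individual while the aggregate estimate becomes noisier.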
-
Group Testing on General Set-Systems
Authors:
Mira Gonen,
Michael Langberg,
Alex Sprintson
Abstract:
Group testing is one of the fundamental problems in coding theory and combinatorics in which one is to identify a subset of contaminated items from a given ground set. There has been renewed interest in group testing recently due to its applications in diagnostic virology, including pool testing for the novel coronavirus. The majority of existing works on group testing focus on the \emph{uniform} setting in which any subset of size $d$ from a ground set $V$ of size $n$ is potentially contaminated. In this work, we consider a {\em generalized} version of group testing with an arbitrary set-system of potentially contaminated sets. The generalized problem is characterized by a hypergraph $H=(V,E)$, where $V$ represents the ground set and edges $e\in E$ represent potentially contaminated sets. The problem of generalized group testing is motivated by practical settings in which not all subsets of a given size $d$ may be potentially contaminated, rather, due to social dynamics, geographical limitations, or other considerations, there exist subsets that can be readily ruled out. For example, in the context of pool testing, the edge set $E$ may consist of families, work teams, or students in a classroom, i.e., subsets likely to be mutually contaminated. The goal in studying the generalized setting is to leverage the additional knowledge characterized by $H=(V,E)$ to significantly reduce the number of required tests. The paper considers both adaptive and non-adaptive group testing and makes the following contributions. First, for the non-adaptive setting, we show that finding an optimal solution for the generalized version of group testing is NP-hard. For this setting, we present a solution that requires $O(d\log{|E|})$ tests, where $d$ is the maximum size of a set $e \in E$. Our solutions generalize those given for the traditional setting and are shown to be of order-optimal size $O(\log{|E|})$ for hypergraphs with edges that have ``large'' symmetric differences. 
For the adaptive setting, when edges in $E$ are of size exactly $d$, we present a solution of size $O(\log{|E|}+d\log^2{d})$ that comes close to the lower bound of $\Omega(\log{|E|} + d)$.
Submitted 10 February, 2022;
originally announced February 2022.
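The generalized setting in the abstract above can be made concrete with a small sketch: candidate contaminated sets are the edges of a hypergraph, a pooled test is positive exactly when the pool intersects the contaminated edge, and decoding keeps the edges consistent with every outcome. The helper names and the toy instance are assumptions for illustration; this is a brute-force decoder, not the test design from the paper.

```python
def pool_outcome(pool, contaminated):
    """A pooled test is positive iff the pool contains at least one contaminated item."""
    return bool(pool & contaminated)


def consistent_edges(edges, pools, outcomes):
    """Return the candidate edges that would have produced exactly the observed outcomes."""
    return [
        e for e in edges
        if all(pool_outcome(p, e) == o for p, o in zip(pools, outcomes))
    ]


# Toy instance: three disjoint candidate groups (e.g. households) over six items.
edges = [frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})]
pools = [frozenset({0, 1}), frozenset({2, 3})]  # two non-adaptive tests

truth = frozenset({2, 3})
outcomes = [pool_outcome(p, truth) for p in pools]
decoded = consistent_edges(edges, pools, outcomes)
```

Here two tests separate three candidate edges, whereas the uniform setting with $d=2$ and $n=6$ would have to distinguish 15 candidate pairs.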
-
Minimizing the alphabet size in codes with restricted error sets
Authors:
Mira Gonen,
Michael Langberg,
Alex Sprintson
Abstract:
This paper focuses on error-correcting codes that can handle a predefined set of specific error patterns. The need for such codes arises in many settings of practical interest, including wireless communication and flash memory systems. In many such settings, a smaller field size is achievable than that offered by MDS and other standard codes. We establish a connection between the minimum alphabet size for this generalized setting and the combinatorial properties of a hypergraph that represents the prespecified collection of error patterns. We also show a connection between error and erasure correcting codes in this specialized setting. This allows us to establish bounds on the minimum alphabet size and show an advantage of non-linear codes over linear codes in a generalized setting. We also consider a variation of the problem which allows a small probability of decoding error and relate it to an approximate version of hypergraph coloring.
Submitted 4 February, 2021;
originally announced February 2021.
-
Minimizing the alphabet size of erasure codes with restricted decoding sets
Authors:
Mira Gonen,
Ishay Haviv,
Michael Langberg,
Alex Sprintson
Abstract:
A Maximum Distance Separable code over an alphabet $F$ is defined via an encoding function $C:F^k \rightarrow F^n$ that allows one to retrieve a message $m \in F^k$ from the codeword $C(m)$ even after erasing any $n-k$ of its symbols. The minimum possible alphabet size of general (non-linear) MDS codes for given parameters $n$ and $k$ is unknown and forms one of the central open problems in coding theory. The paper initiates the study of the alphabet size of codes in a generalized setting where the coding scheme is required to handle a pre-specified subset of all possible erasure patterns, naturally represented by an $n$-vertex $k$-uniform hypergraph. We relate the minimum possible alphabet size of such codes to the strong chromatic number of the hypergraph and analyze the tightness of the obtained bounds for both the linear and non-linear settings. We further consider variations of the problem which allow a small probability of decoding error.
Submitted 14 May, 2020;
originally announced May 2020.
-
BaTFLED: Bayesian Tensor Factorization Linked to External Data
Authors:
Nathan H Lazar,
Mehmet Gönen,
Kemal Sönmez
Abstract:
The vast majority of current machine learning algorithms are designed to predict single responses or a vector of responses, yet many types of response are more naturally organized as matrices or higher-order tensor objects where characteristics are shared across modes. We present a new machine learning algorithm BaTFLED (Bayesian Tensor Factorization Linked to External Data) that predicts values in a three-dimensional response tensor using input features for each of the dimensions. BaTFLED uses a probabilistic Bayesian framework to learn projection matrices mapping input features for each mode into latent representations that multiply to form the response tensor. By utilizing a Tucker decomposition, the model can capture weights for interactions between latent factors for each mode in a small core tensor. Priors that encourage sparsity in the projection matrices and core tensor allow for feature selection and model regularization. This method is shown to far outperform elastic net and neural net models on 'cold start' tasks from data simulated in a three-mode structure. Additionally, we apply the model to predict dose-response curves in a panel of breast cancer cell lines treated with drug compounds that was used as a Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge.
Submitted 22 December, 2016; v1 submitted 9 December, 2016;
originally announced December 2016.
-
Approximation and Heuristic Algorithms for Probabilistic Physical Search on General Graphs
Authors:
Noam Hazon,
Mira Gonen,
Max Kleb
Abstract:
We consider an agent seeking to obtain an item, potentially available at different locations in a physical environment. The traveling costs between locations are known in advance, but there is only probabilistic knowledge regarding the possible prices of the item at any given location. Given such a setting, the problem is to find a plan that maximizes the probability of acquiring the good while minimizing both travel and purchase costs. Sample applications include agents in search-and-rescue or exploration missions, e.g., a rover on Mars seeking to mine a specific mineral. These probabilistic physical search problems have been previously studied, but we present the first approximation and heuristic algorithms for solving such problems on general graphs. We establish an interesting connection between these problems and classical graph-search problems, which led us to provide the approximation algorithms and hardness of approximation results for our settings. We further suggest several heuristics for practical use, and demonstrate their effectiveness with simulation on real graph structure and synthetic graphs.
Submitted 27 September, 2015;
originally announced September 2015.
-
Kernelized Bayesian Matrix Factorization
Authors:
Mehmet Gönen,
Suleiman A. Khan,
Samuel Kaski
Abstract:
We extend kernelized matrix factorization with a fully Bayesian treatment and with an ability to work with multiple side information sources expressed as different kernels. Kernel functions have been introduced to matrix factorization to integrate side information about the rows and columns (e.g., objects and users in recommender systems), which is necessary for making out-of-matrix (i.e., cold start) predictions. We discuss specifically bipartite graph inference, where the output matrix is binary, but extensions to more general matrices are straightforward. We extend the state of the art in two key aspects: (i) A fully conjugate probabilistic formulation of the kernelized matrix factorization problem enables an efficient variational approximation, whereas fully Bayesian treatments are not computationally feasible in the earlier approaches. (ii) Multiple side information sources are included, treated as different kernels in multiple kernel learning that additionally reveals which side information sources are informative. Our method outperforms alternatives in predicting drug-protein interactions on two data sets. We then show that our framework can also be used for solving multilabel learning problems by considering samples and labels as the two domains on which matrix factorization operates. Our algorithm obtains the lowest Hamming loss values on 10 out of 14 multilabel classification data sets compared to five state-of-the-art multilabel learning algorithms.
Submitted 8 May, 2013; v1 submitted 6 November, 2012;
originally announced November 2012.
-
Bayesian Efficient Multiple Kernel Learning
Authors:
Mehmet Gonen
Abstract:
Multiple kernel learning algorithms are proposed to combine kernels in order to obtain a better similarity measure or to integrate feature representations coming from different data sources. Most of the previous research on such methods is focused on the computational efficiency issue. However, it is still not feasible to combine many kernels using existing Bayesian approaches due to their high time complexity. We propose a fully conjugate Bayesian formulation and derive a deterministic variational approximation, which allows us to combine hundreds or thousands of kernels very efficiently. We briefly explain how the proposed method can be extended for multiclass learning and semi-supervised learning. Experiments with large numbers of kernels on benchmark data sets show that our inference method is quite fast, requiring less than a minute. On one bioinformatics and three image recognition data sets, our method outperforms previously reported results with better generalization performance.
Submitted 27 June, 2012;
originally announced June 2012.
-
Coded Cooperative Data Exchange Problem for General Topologies
Authors:
Mira Gonen,
Michael Langberg
Abstract:
We consider the "coded cooperative data exchange problem" for general graphs. In this problem, one is given a graph G=(V,E) representing clients in a broadcast network, each of which initially holds a (not necessarily disjoint) set of information packets, and wishes to design a communication scheme in which eventually all clients will hold all the packets of the network. Communication is performed in rounds, where in each round a single client broadcasts a single (possibly encoded) information packet to its neighbors in G. The objective is to design a broadcast scheme that satisfies all clients with the minimum number of broadcast rounds.
The coded cooperative data exchange problem has seen significant research over the last few years, mostly when the graph G is the complete broadcast graph in which each client is adjacent to all other clients in the network, but also on general topologies, both in the fractional and integral setting. In this work we focus on the integral setting in general undirected topologies G. We tie the data exchange problem on G to certain well-studied combinatorial properties of G and in doing so show that solving the problem exactly or even approximately within a multiplicative factor of $\log{|V|}$ is intractable (i.e., NP-hard). We then turn to study efficient data exchange schemes yielding a number of communication rounds comparable to our intractability result. Our communication schemes do not involve encoding, and in doing so yield bounds on the "coding advantage" in the setting at hand.
Submitted 9 February, 2012;
originally announced February 2012.
-
An $O(\log n)$-approximation for the Set Cover Problem with Set Ownership
Authors:
Mira Gonen,
Yuval Shavitt
Abstract:
In highly distributed Internet measurement systems, distributed agents periodically measure the Internet using a tool called {\tt traceroute}, which discovers a path in the network graph. Each agent performs many traceroute measurements to a set of destinations in the network, and thus reveals a portion of the Internet graph as it is seen from the agent locations. In every period we need to check whether previously discovered edges still exist in this period, a process termed {\em validation}. To this end we maintain a database of all the different measurements performed by each agent. Our aim is to be able to {\em validate} the existence of all previously discovered edges in the minimum possible time. In this work we formulate the validation problem as a generalization of the well-known set cover problem. We reduce the set cover problem to the validation problem, thus proving that the validation problem is ${\cal NP}$-hard. We present an $O(\log n)$-approximation algorithm for the validation problem, where $n$ is the number of edges that need to be validated. We also show that unless ${\cal P = NP}$ the approximation ratio of the validation problem is $\Omega(\log n)$.
Submitted 21 July, 2008;
originally announced July 2008.
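For context on the $O(\log n)$ ratio mentioned in the abstract above, the sketch below shows the classical greedy heuristic for plain set cover, which is known to achieve a logarithmic approximation ratio. It does not model set ownership and is not the algorithm from the paper; the function name and the toy data (traceroute paths as sets of edge identifiers) are assumptions for illustration.

```python
def greedy_set_cover(universe, subsets):
    """Classical greedy set cover: repeatedly pick the subset covering the most
    still-uncovered elements. Returns the indices of the chosen subsets."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Index of the subset with the largest overlap with what is still uncovered
        # (ties are broken in favour of the lowest index).
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Hypothetical data: edges to validate, and the edges each stored measurement traverses.
edges_to_validate = {1, 2, 3, 4, 5}
measurements = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
cover = greedy_set_cover(edges_to_validate, measurements)
```

On this toy input the heuristic first picks the measurement covering three edges and then the one covering both remaining edges.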
-
Bounding the Bias of Tree-Like Sampling in IP Topologies
Authors:
Reuven Cohen,
Mira Gonen,
Avishai Wool
Abstract:
It is widely believed that the Internet's AS-graph degree distribution obeys a power-law form. Most of the evidence showing the power-law distribution is based on BGP data. However, it was recently argued that since BGP collects data in a tree-like fashion, it only produces a sample of the degree distribution, and this sample may be biased. This argument was backed by simulation data and mathematical analysis, which demonstrated that under certain conditions a tree sampling procedure can produce an artificial power-law in the degree distribution. Thus, although the observed degree distribution of the AS-graph follows a power-law, this phenomenon may be an artifact of the sampling process. In this work we provide some evidence to the contrary. We show, by analysis and simulation, that when the underlying graph degree distribution obeys a power-law with an exponent larger than 2, a tree-like sampling process produces a negligible bias in the sampled degree distribution. Furthermore, recent data collected from the DIMES project, which is not based on BGP sampling, indicates that the underlying AS-graph indeed obeys a power-law degree distribution with an exponent larger than 2. By combining this empirical data with our analysis, we conclude that the bias in the degree distribution calculated from BGP data is negligible.
Submitted 30 November, 2006;
originally announced November 2006.
-
A Geographic Directed Preferential Internet Topology Model
Authors:
Sagy Bar,
Mira Gonen,
Avishai Wool
Abstract:
The goal of this work is to model the peering arrangements between Autonomous Systems (ASes). Most existing models of the AS-graph assume an undirected graph. However, peering arrangements are mostly asymmetric Customer-Provider arrangements, which are better modeled as directed edges. Furthermore, it is well known that the AS-graph, and in particular its clustering structure, is influenced by geography.
We introduce a new model that describes the AS-graph as a directed graph, with an edge going from the customer to the provider, but also models symmetric peer-to-peer arrangements, and takes geography into account. We are able to mathematically analyze its power-law exponent and number of leaves. Beyond the analysis we have implemented our model as a synthetic network generator we call GdTang. Experimentation with GdTang shows that the networks it produces are more realistic than those generated by other network generators, in terms of its power-law exponent, fractions of customer-provider and symmetric peering arrangements, and the size of its dense core. We believe that our model is the first to manifest realistic regional dense cores that have a clear geographic flavor. Our synthetic networks also exhibit path inflation effects that are similar to those observed in the real AS graph.
Submitted 14 February, 2005;
originally announced February 2005.