-
More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Authors:
Shahar Levy,
Nir Mazor,
Lihi Shalmon,
Michael Hassid,
Gabriel Stanovsky
Abstract:
Retrieval-augmented generation (RAG) provides LLMs with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen .
Submitted 6 March, 2025;
originally announced March 2025.
-
Mildly Accurate Computationally Differentially Private Inner Product Protocols Imply Oblivious Transfer
Authors:
Iftach Haitner,
Noam Mazor,
Jad Silbak,
Eliad Tsfadia,
Chao Yan
Abstract:
In distributed differential privacy, multiple parties collaboratively analyze their combined data while protecting the privacy of each party's data from the eyes of the others. Interestingly, for certain fundamental two-party functions like inner product and Hamming distance, the accuracy of distributed solutions significantly lags behind what can be achieved in the centralized model. However, under computational differential privacy, these limitations can be circumvented using oblivious transfer via secure multi-party computation. Yet, no results show that oblivious transfer is indeed necessary for accurately estimating a non-Boolean functionality. In particular, for the inner-product functionality, it was previously unknown whether oblivious transfer is necessary even for the best possible constant additive error.
In this work, we prove that any computationally differentially private protocol that estimates the inner product over $\{-1,1\}^n \times \{-1,1\}^n$ up to an additive error of $O(n^{1/6})$, can be used to construct oblivious transfer. In particular, our result implies that protocols with sub-polynomial accuracy are equivalent to oblivious transfer. In this accuracy regime, our result improves upon Haitner, Mazor, Silbak, and Tsfadia [STOC '22] who showed that a key-agreement protocol is necessary.
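For context on the accuracy gap the abstract refers to, the following is a minimal sketch of the centralized baseline only: a trusted curator who holds both vectors and releases the inner product with Laplace noise. It is not the protocol or the reduction from the paper. The function names, the choice of the Laplace mechanism, and the neighboring relation (one coordinate of one vector flips, so the sensitivity is 2) are illustrative assumptions.

```python
import random


def inner_product(x, y):
    """Exact inner product of two equal-length vectors over {-1, 1}."""
    return sum(a * b for a, b in zip(x, y))


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_inner_product(x, y, epsilon, rng):
    """Centralized release of <x, y> with Laplace noise (assumed baseline).

    Flipping one coordinate of x (or of y) changes the inner product by
    exactly 2, so the sensitivity is 2 and the noise scale is 2 / epsilon.
    """
    sensitivity = 2.0
    return inner_product(x, y) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 1000
    x = [rng.choice((-1, 1)) for _ in range(n)]
    y = [rng.choice((-1, 1)) for _ in range(n)]
    print("exact:", inner_product(x, y))
    print("noisy:", private_inner_product(x, y, epsilon=1.0, rng=rng))
```

In this sketch the expected additive error is on the order of 1/epsilon and does not grow with n, which is the kind of centralized accuracy that the distributed setting is compared against.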
Submitted 21 February, 2025;
originally announced February 2025.
-
On the Complexity of Two-Party Differential Privacy
Authors:
Iftach Haitner,
Noam Mazor,
Jad Silbak,
Eliad Tsfadia
Abstract:
In distributed differential privacy, the parties perform analysis over their joint data while preserving the privacy for both datasets. Interestingly, for a few fundamental two-party functions such as inner product and Hamming distance, the accuracy of the distributed solution lags way behind what is achievable in the client-server setting. McGregor, Mironov, Pitassi, Reingold, Talwar, and Vadhan [FOCS '10] proved that this gap is inherent, showing upper bounds on the accuracy of (any) distributed solution for these functions. These limitations can be bypassed when settling for computational differential privacy, where the data is differentially private only in the eyes of a computationally bounded observer, using public-key cryptography primitives.
We prove that the use of public-key cryptography is necessary for bypassing the limitation of McGregor et al., showing that a non-trivial solution for the inner-product, or the Hamming distance, implies the existence of a key-agreement protocol. Our bound implies a combinatorial proof for the fact that non-Boolean inner product of independent (strong) Santha-Vazirani sources is a good condenser. We obtain our main result by showing that the inner-product of a (single, strong) SV source with a uniformly random seed is a good condenser, even when the seed and source are dependent.
Submitted 17 June, 2022; v1 submitted 17 August, 2021;
originally announced August 2021.
-
On the Communication Complexity of Key-Agreement Protocols
Authors:
Iftach Haitner,
Noam Mazor,
Rotem Oshman,
Omer Reingold,
Amir Yehudayoff
Abstract:
Key-agreement protocols whose security is proven in the random oracle model are an important alternative to protocols based on public-key cryptography. In the random oracle model, the parties and the eavesdropper have access to a shared random function (an "oracle"), but the parties are limited in the number of queries they can make to the oracle. The random oracle serves as an abstraction for black-box access to a symmetric cryptographic primitive, such as a collision resistant hash. Unfortunately, as shown by Impagliazzo and Rudich [STOC '89] and Barak and Mahmoody [Crypto '09], such protocols can only guarantee limited secrecy: the key of any $\ell$-query protocol can be revealed by an $O(\ell^2)$-query adversary. This quadratic gap between the query complexity of the honest parties and the eavesdropper matches the gap obtained by the Merkle's Puzzles protocol of Merkle [CACM '78].
In this work we tackle a new aspect of key-agreement protocols in the random oracle model: their communication complexity. In Merkle's Puzzles, to obtain secrecy against an eavesdropper that makes roughly $\ell^2$ queries, the honest parties need to exchange $Ω(\ell)$ bits. We show that for protocols with certain natural properties, which Merkle's Puzzles has, such high communication is unavoidable. Specifically, this is the case if the honest parties' queries are uniformly random, or alternatively if the protocol uses non-adaptive queries and has only two rounds. Our proof for the first setting uses a novel reduction from the set-disjointness problem in two-party communication complexity. For the second setting we prove the lower bound directly, using information-theoretic arguments.
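The quadratic query gap and the $Ω(\ell)$ communication of Merkle's Puzzles can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the construction analyzed in the paper: SHA-256 stands in for the random oracle, the domain has size $\ell^2$, and the names `oracle`, `merkle_puzzles`, and `eavesdrop` are hypothetical.

```python
import hashlib
import random


def oracle(x: int) -> bytes:
    """Stand-in for the random oracle (assumption: SHA-256 of the input)."""
    return hashlib.sha256(x.to_bytes(8, "big")).digest()


def merkle_puzzles(ell: int, rng: random.Random):
    """Toy Merkle's Puzzles over a domain of size ell**2.

    Alice makes ell uniformly random queries and sends the ell outputs.
    Bob queries uniformly random points until one output matches, then
    sends that output back. The shared key is the common preimage.
    """
    domain = ell * ell
    alice_table = {}
    for _ in range(ell):
        x = rng.randrange(domain)
        alice_table[oracle(x)] = x
    alice_message = set(alice_table)  # about ell oracle outputs on the wire

    bob_queries = 0
    while True:
        x = rng.randrange(domain)
        bob_queries += 1
        y = oracle(x)
        if y in alice_message:  # expected after roughly ell attempts
            break

    key_alice = alice_table[y]  # Alice looks up Bob's reply
    key_bob = x
    return key_alice, key_bob, y, len(alice_message), bob_queries


def eavesdrop(ell: int, y: bytes):
    """Eve inverts the reply by exhaustive search: up to ell**2 queries."""
    queries = 0
    for x in range(ell * ell):
        queries += 1
        if oracle(x) == y:
            return x, queries
    return None, queries


if __name__ == "__main__":
    ell = 64
    ka, kb, y, sent, bq = merkle_puzzles(ell, random.Random(0))
    ke, eq = eavesdrop(ell, y)
    print("keys agree:", ka == kb, "| outputs sent:", sent)
    print("Bob queries:", bq, "| Eve queries:", eq)
```

In this sketch Alice's message alone carries about $\ell$ oracle outputs, which mirrors the communication cost that the abstract shows is unavoidable for protocols with uniformly random queries.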
Submitted 6 May, 2021; v1 submitted 5 May, 2021;
originally announced May 2021.
-
Channels of Small Log-Ratio Leakage and Characterization of Two-Party Differentially Private Computation
Authors:
Iftach Haitner,
Noam Mazor,
Ronen Shaltiel,
Jad Silbak
Abstract:
Consider a PPT two-party protocol $π=(A,B)$ in which the parties get no private inputs and obtain outputs $O^A,O^B\in \{0,1\}$, and let $V^A$ and $V^B$ denote the parties' individual views. Protocol $π$ has $α$-agreement if $\Pr[O^A=O^B]=1/2+α$. The leakage of $π$ is the amount of information a party obtains about the event $\{O^A=O^B\}$; that is, the leakage $ε$ is the maximum, over $P\in\{A,B\}$, of the distance between $V^P|O^A=O^B$ and $V^P|O^A\neq O^B$. Typically, this distance is measured in statistical distance, or, in the computational setting, in computational indistinguishability. For this choice, Wullschleger [TCC 09] showed that if $α\gg ε$ then the protocol can be transformed into an OT protocol.
We consider measuring the protocol leakage by the log-ratio distance (which was popularized by its use in the differential privacy framework). The log-ratio distance between $X$ and $Y$ over domain $Ω$ is the minimal $ε>0$ for which, for every $v\in Ω$, $\log(\Pr[X=v]/\Pr[Y=v])\in [-ε,ε]$. In the computational setting, we use computational indistinguishability from having log-ratio distance $ε$. We show that a protocol with (noticeable) accuracy $α\in Ω(ε^2)$ can be transformed into an OT protocol (note that this allows $ε\gg α$). We complete the picture, in this respect, showing that a protocol with $α\in o(ε^2)$ does not necessarily imply OT. Our results hold for both the information theoretic and the computational settings, and can be viewed as a "fine grained" approach to "weak OT amplification".
We then use the above result to fully characterize the complexity of differentially private two-party computation for the XOR function, answering the open question put by Goyal, Khurana, Mironov, Pandey, and Sahai [ICALP 16] and Haitner, Nissim, Omri, Shaltiel, and Silbak [FOCS 18].
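The log-ratio distance defined in the abstract can be computed directly for finite distributions. The following is a minimal sketch, assuming distributions are given as dictionaries from outcomes to probabilities; the function name `log_ratio_distance` and the convention of returning infinity when the supports differ are illustrative choices, not taken from the paper.

```python
import math


def log_ratio_distance(p, q):
    """Smallest eps with |log(Pr[X=v] / Pr[Y=v])| <= eps for every outcome v.

    p and q map outcomes to probabilities. If some outcome has positive
    probability under exactly one distribution, no finite eps works and
    the function returns infinity (an assumed convention).
    """
    worst = 0.0
    for v in set(p) | set(q):
        pv = p.get(v, 0.0)
        qv = q.get(v, 0.0)
        if pv == 0.0 and qv == 0.0:
            continue
        if pv == 0.0 or qv == 0.0:
            return math.inf
        worst = max(worst, abs(math.log(pv / qv)))
    return worst


if __name__ == "__main__":
    x = {0: 0.5, 1: 0.5}
    y = {0: 0.25, 1: 0.75}
    print(log_ratio_distance(x, y))
```

For the two distributions in the usage example, the largest ratio is 0.5/0.25 = 2, so the distance is log 2. Unlike statistical distance, this measure is sensitive to every single outcome, which is why it matches the pointwise guarantee of differential privacy.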
Submitted 9 May, 2021; v1 submitted 3 May, 2021;
originally announced May 2021.
-
Lower Bounds on the Time/Memory Tradeoff of Function Inversion
Authors:
Dror Chawin,
Iftach Haitner,
Noam Mazor
Abstract:
We study time/memory tradeoffs of function inversion: an algorithm, i.e., an inverter, equipped with an $s$-bit advice on a randomly chosen function $f : [n] \to [n]$ and using $q$ oracle queries to $f$, tries to invert a randomly chosen output $y$ of $f$, i.e., to find $x\in f^{-1}(y)$. Much progress has been made regarding adaptive function inversion, in which the inverter is allowed to make adaptive oracle queries. Hellman [IEEE Transactions on Information Theory 80] presented an adaptive inverter that inverts with high probability a random $f$. Fiat and Naor [SICOMP 00] proved that for any $s$, $q$ with $s^3q = n^3$ (ignoring low-order terms), an $s$-advice, $q$-query variant of Hellman's algorithm inverts a constant fraction of the image points of any function. Yao [STOC 90] proved a lower bound of $sq \geq n$ for this problem. Closing the gap between the above lower and upper bounds is a long-standing open question. Very little is known for the non-adaptive variant of the question. The only known upper bounds, i.e., inverters, are the trivial ones (with $s+q = n$), and the only lower bound is the above bound of Yao. In a recent work, Corrigan-Gibbs and Kogan [TCC 19] partially justified the difficulty of finding lower bounds on non-adaptive inverters, showing that a lower bound on the time/memory tradeoff of non-adaptive inverters implies a lower bound on low-depth Boolean circuits; such bounds, for a strong enough choice of parameters, are notoriously hard to prove. We make progress on the above intriguing question, both for the adaptive and the non-adaptive case, proving the following lower bounds on restricted families of inverters.
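The trivial non-adaptive inverter with $s+q = n$ mentioned in the abstract can be sketched as follows. This is an illustration under simplifying assumptions, not code from the paper: the advice is counted in table entries rather than bits (ignoring logarithmic factors), the domain is `range(n)`, and the names `preprocess` and `invert` are hypothetical.

```python
import random


def preprocess(f, s):
    """Advice: a table of f(x) -> x for the first s domain points."""
    return {f(x): x for x in range(s)}


def invert(f, n, s, advice, y):
    """Invert y using the advice plus q = n - s non-adaptive queries.

    The query set range(s, n) is fixed in advance and does not depend on
    any oracle answer, which is what makes the inverter non-adaptive.
    Returns (preimage or None, number of oracle queries made).
    """
    if y in advice:
        return advice[y], 0
    answers = [(x, f(x)) for x in range(s, n)]  # all queries issued up front
    for x, fx in answers:
        if fx == y:
            return x, len(answers)
    return None, len(answers)


if __name__ == "__main__":
    rng = random.Random(0)
    n, s = 1000, 300
    table = [rng.randrange(n) for _ in range(n)]
    f = table.__getitem__
    advice = preprocess(f, s)
    y = f(rng.randrange(n))
    x, used = invert(f, n, s, advice, y)
    print("inverted:", f(x) == y, "| queries used:", used)
```

Every image point has a preimage either among the first $s$ points (covered by the advice) or among the remaining $n-s$ points (covered by the queries), so this inverter always succeeds, at the cost of $s+q = n$.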
Submitted 9 May, 2021; v1 submitted 3 May, 2021;
originally announced May 2021.