-
Optimal mass estimation in the conditional sampling model
Authors:
Tomer Adar,
Eldar Fischer,
Amit Levi
Abstract:
The conditional sampling model, introduced by Canonne, Ron and Servedio (SODA 2014, SIAM J. Comput. 2015) and independently by Chakraborty, Fischer, Goldhirsh and Matsliah (ITCS 2013, SIAM J. Comput. 2016), is a common framework for a number of studies concerning strengthened models of distribution testing. A core task in these investigations is that of estimating the mass of individual elements. The above-mentioned works, and the improvement of Kumar, Meel and Pote (AISTATS 2025), provided polylogarithmic algorithms for this task.
In this work we shatter the polylogarithmic barrier, and provide an estimator for the mass of individual elements that uses only $O(\log \log N) + O(\mathrm{poly}(1/\varepsilon))$ conditional samples. We complement this result with an $Ω(\log\log N)$ lower bound.
We then show that our mass estimator provides an improvement (and in some cases a unifying framework) for a number of related tasks, such as testing by learning of any label-invariant property, and distance estimation between two (unknown) distributions. By considering some known lower bounds, this also shows that the full power of the conditional model is indeed required for the doubly-logarithmic upper bound.
Finally, we exponentially improve the previous lower bound on testing by learning of label-invariant properties from double-logarithmic to $Ω(\log N)$ conditional samples, whereas our testing by learning algorithm provides an upper bound of $O(\mathrm{poly}(1/\varepsilon)\cdot\log N \log \log N)$.
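For readers unfamiliar with the model, the following is a minimal sketch of a simulated conditional sampling oracle, together with a naive pairwise comparison that estimates the ratio of two element masses by conditioning on a two-element set. It is not the estimator of this paper and does not achieve its query bounds; the names `CondOracle`, `cond_sample` and `estimate_ratio` are hypothetical, and the oracle is simulated from an explicitly known distribution purely for illustration.

```python
import random


class CondOracle:
    """Simulated conditional sampling oracle over a known distribution.

    Illustration only: a real tester has no direct access to `probs`.
    """

    def __init__(self, probs, seed=0):
        self.probs = list(probs)      # probs[i] is the mass of element i
        self.rng = random.Random(seed)
        self.queries = 0              # number of conditional samples drawn

    def cond_sample(self, subset):
        """Draw one sample from the distribution conditioned on `subset`."""
        elems = list(subset)
        weights = [self.probs[i] for i in elems]
        if sum(weights) == 0:
            raise ValueError("conditioning set has zero mass")
        self.queries += 1
        return self.rng.choices(elems, weights=weights, k=1)[0]


def estimate_ratio(oracle, x, y, samples=2000):
    """Naive estimate of p(x)/p(y) using samples conditioned on {x, y}."""
    hits_x = sum(oracle.cond_sample([x, y]) == x for _ in range(samples))
    hits_y = samples - hits_x
    if hits_y == 0:
        return float("inf")
    return hits_x / hits_y
```

For example, with masses `[0.1, 0.2, 0.3, 0.4]`, calling `estimate_ratio(oracle, 3, 0)` should return a value near 4, at a cost of one conditional sample per trial.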
Submitted 29 March, 2025; v1 submitted 16 March, 2025;
originally announced March 2025.
-
Jailbreak Attack Initializations as Extractors of Compliance Directions
Authors:
Amit Levi,
Rom Himelstein,
Yaniv Nemcovsky,
Avi Mendelson,
Chaim Baskin
Abstract:
Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks utilize arbitrary or hand-picked initializations. This work shows that each gradient-based jailbreak attack and subsequent initialization gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs. A reference implementation is available at: https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation.
Submitted 5 June, 2025; v1 submitted 13 February, 2025;
originally announced February 2025.
-
Testing vs Estimation for Index-Invariant Properties in the Huge Object Model
Authors:
Sourav Chakraborty,
Eldar Fischer,
Arijit Ghosh,
Amit Levi,
Gopinath Mishra,
Sayantan Sen
Abstract:
The Huge Object model of property testing [Goldreich and Ron, TheoretiCS 23] concerns properties of distributions supported on $\{0,1\}^n$, where $n$ is so large that even reading a single sampled string is unrealistic. Instead, query access is provided to the samples, and the efficiency of the algorithm is measured by the total number of queries that were made to them.
Index-invariant properties under this model were defined in [Chakraborty et al., COLT 23], as a compromise between enduring the full intricacies of string testing when considering unconstrained properties, and giving up completely on the string structure when considering label-invariant properties. Index-invariant properties are those that are invariant through a consistent reordering of the bits of the involved strings.
Here we provide an adaptation of Szemerédi's regularity method for this setting, and in particular show that if an index-invariant property admits an $ε$-test with a number of queries depending only on the proximity parameter $ε$, then it also admits a distance estimation algorithm whose number of queries depends only on the approximation parameter.
Submitted 3 December, 2024;
originally announced December 2024.
-
Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer
Authors:
Gesa Mittmann,
Sara Laiouar-Pedari,
Hendrik A. Mehrtens,
Sarah Haggenmüller,
Tabea-Clara Bucher,
Tirtha Chanda,
Nadine T. Gaisa,
Mathias Wagner,
Gilbert Georg Klamminger,
Tilman T. Rau,
Christina Neppl,
Eva Maria Compérat,
Andreas Gocht,
Monika Hämmerle,
Niels J. Rupp,
Jula Westhoff,
Irene Krücken,
Maximillian Seidl,
Christian M. Schürch,
Marcus Bauer,
Wiebke Solass,
Yu Chun Tam,
Florian Weber,
Rainer Grobholz,
Jaroslaw Augustyniak, et al. (41 additional authors not shown)
Abstract:
The aggressiveness of prostate cancer, the most common cancer in men worldwide, is primarily assessed based on histopathological data using the Gleason scoring system. While artificial intelligence (AI) has shown promise in accurately predicting Gleason scores, these predictions often lack inherent explainability, potentially leading to distrust in human-machine interactions. To address this issue, we introduce a novel dataset of 1,015 tissue microarray core images, annotated by an international group of 54 pathologists. The annotations provide detailed localized pattern descriptions for Gleason grading in line with international guidelines. Utilizing this dataset, we develop an inherently explainable AI system based on a U-Net architecture that provides predictions leveraging pathologists' terminology. This approach circumvents post-hoc explainability methods while maintaining or exceeding the performance of methods trained directly for Gleason pattern segmentation (Dice score: 0.713 $\pm$ 0.003 trained on explanations vs. 0.691 $\pm$ 0.010 trained on Gleason patterns). By employing soft labels during training, we capture the intrinsic uncertainty in the data, yielding strong results in Gleason pattern segmentation even in the context of high interobserver variability. With the release of this dataset, we aim to encourage further research into segmentation in medical tasks with high levels of subjectivity and to advance the understanding of pathologists' reasoning processes.
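The Dice scores quoted above measure overlap between a predicted and a reference segmentation. As a minimal sketch, assuming binary masks flattened to 0/1 sequences, the coefficient can be computed as below. The function name `dice_score` and the empty-mask convention are assumptions for illustration; the abstract does not specify the exact evaluation protocol (for instance per-class averaging or the handling of soft labels).

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A & B| / (|A| + |B|) for two binary masks.

    `pred` and `target` are equal-length flat sequences of 0/1 values.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total
```

A score of 1 means the two masks coincide, and 0 means they are disjoint.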
Submitted 19 October, 2024;
originally announced October 2024.
-
Improved Bounds for High-Dimensional Equivalence and Product Testing using Subcube Queries
Authors:
Tomer Adar,
Eldar Fischer,
Amit Levi
Abstract:
We study property testing in the subcube conditional model introduced by Bhattacharyya and Chakraborty (2017). We obtain the first equivalence test for $n$-dimensional distributions that is quasi-linear in $n$, improving the previously known $\tilde{O}(n^2/\varepsilon^2)$ query complexity bound to $\tilde{O}(n/\varepsilon^2)$. We extend this result to general finite alphabets with logarithmic cost in the alphabet size.
By exploiting the specific structure of the queries that we use (which are more restrictive than general subcube queries), we obtain a cubic improvement over the best known test for distributions over $\{1,\ldots,N\}$ under the interval querying model of Canonne, Ron and Servedio (2015), attaining a query complexity of $\tilde{O}((\log N)/\varepsilon^2)$, which for fixed $\varepsilon$ almost matches the known lower bound of $Ω((\log N)/\log\log N)$. We also derive a product test for $n$-dimensional distributions with $\tilde{O}(n / \varepsilon^2)$ queries, and provide an $Ω(\sqrt{n} / \varepsilon^2)$ lower bound for this property.
Submitted 5 August, 2024;
originally announced August 2024.
-
Support Testing in the Huge Object Model
Authors:
Tomer Adar,
Eldar Fischer,
Amit Levi
Abstract:
The Huge Object model is a distribution testing model in which we are given access to independent samples from an unknown distribution over the set of strings $\{0,1\}^n$, but are only allowed to query a few bits from the samples. We investigate the problem of testing whether a distribution is supported on $m$ elements in this model. It turns out that the behavior of this property is surprisingly intricate, especially when also considering the question of adaptivity.
We prove lower and upper bounds for both adaptive and non-adaptive algorithms in the one-sided and two-sided error regime. Our bounds are tight when $m$ is fixed to a constant (and the distance parameter $\varepsilon$ is the only variable). For the general case, our bounds are at most $O(\log m)$ apart. In particular, our results show a surprising $O(\log \varepsilon^{-1})$ gap between the number of queries required for non-adaptive testing as compared to adaptive testing. For one-sided error testing, we also show that an $O(\log m)$ gap between the number of samples and the number of queries is necessary. Our results utilize a wide variety of combinatorial and probabilistic methods.
Submitted 17 September, 2024; v1 submitted 30 August, 2023;
originally announced August 2023.
-
Accelerated deep self-supervised ptycho-laminography for three-dimensional nanoscale imaging of integrated circuits
Authors:
Iksung Kang,
Yi Jiang,
Mirko Holler,
Manuel Guizar-Sicairos,
A. F. J. Levi,
Jeffrey Klug,
Stefan Vogt,
George Barbastathis
Abstract:
Three-dimensional inspection of nanostructures such as integrated circuits is important for security and reliability assurance. Two scanning operations are required: ptychographic to recover the complex transmissivity of the specimen; and rotation of the specimen to acquire multiple projections covering the 3D spatial frequency domain. Two types of rotational scanning are possible: tomographic and laminographic. For flat, extended samples, for which the full 180 degree coverage is not possible, the latter is preferable because it provides better coverage of the 3D spatial frequency domain compared to limited-angle tomography. Laminography is also preferable because the amount of attenuation through the sample is approximately the same for all projections. However, both techniques are time-consuming because of extensive acquisition and computation time. Here, we demonstrate the acceleration of ptycho-laminographic reconstruction of integrated circuits with 16-times fewer angular samples and 4.67-times faster computation by using a physics-regularized deep self-supervised learning architecture. We check the fidelity of our reconstruction against a densely sampled reconstruction that uses full scanning and no learning. As already reported elsewhere [Zhou and Horstmeyer, Opt. Express, 28(9), pp. 12872-12896], we observe improvement of reconstruction quality even over the densely sampled reconstruction, due to the ability of the self-supervised learning kernel to fill the missing cone.
Submitted 10 April, 2023;
originally announced April 2023.
-
Streaming Euclidean MST to a Constant Factor
Authors:
Vincent Cohen-Addad,
Xi Chen,
Rajesh Jayaram,
Amit Levi,
Erik Waingarten
Abstract:
We study streaming algorithms for the fundamental geometric problem of computing the cost of the Euclidean Minimum Spanning Tree (MST) on an $n$-point set $X \subset \mathbb{R}^d$. In the streaming model, the points in $X$ can be added and removed arbitrarily, and the goal is to maintain an approximation in small space. In low dimensions, $(1+ε)$ approximations are possible in sublinear space [Frahling, Indyk, Sohler, SoCG '05]. However, for high dimensional spaces the best known approximation for this problem was $\tilde{O}(\log n)$, due to [Chen, Jayaram, Levi, Waingarten, STOC '22], improving on the prior $O(\log^2 n)$ bound due to [Indyk, STOC '04] and [Andoni, Indyk, Krauthgamer, SODA '08]. In this paper, we break the logarithmic barrier, and give the first constant factor sublinear space approximation to Euclidean MST. For any $ε\geq 1$, our algorithm achieves an $\tilde{O}(ε^{-2})$ approximation in $n^{O(ε)}$ space.
We complement this by proving that any single pass algorithm which obtains a better than $1.10$-approximation must use $Ω(\sqrt{n})$ space, demonstrating that $(1+ε)$ approximations are not possible in high dimensions, and that our algorithm is tight up to a constant. Nevertheless, we demonstrate that $(1+ε)$ approximations are possible in sublinear space with $O(1/ε)$ passes over the stream. More generally, for any $α\geq 2$, we give an $α$-pass streaming algorithm which achieves a $(1+O(\frac{\log α+ 1}{ αε}))$ approximation in $n^{O(ε)} d^{O(1)}$ space. Our streaming algorithms are linear sketches, and therefore extend to the massively-parallel computation model (MPC). Thus, our results imply the first $(1+ε)$-approximation to Euclidean MST in a constant number of rounds in the MPC model.
Submitted 13 December, 2022;
originally announced December 2022.
-
Learnable Graph Convolutional Attention Networks
Authors:
Adrián Javaloy,
Pablo Sanchez-Martin,
Amit Levi,
Isabel Valera
Abstract:
Existing Graph Neural Networks (GNNs) compute the message exchange between nodes by either aggregating uniformly (convolving) the features of all the neighboring nodes, or by applying a non-uniform score (attending) to the features. Recent works have shown the strengths and weaknesses of the resulting GNN architectures, respectively, GCNs and GATs. In this work, we aim at exploiting the strengths of both approaches to their full extent. To this end, we first introduce the graph convolutional attention layer (CAT), which relies on convolutions to compute the attention scores. Unfortunately, as in the case of GCNs and GATs, we show that there exists no clear winner between the three (neither theoretically nor in practice) as their performance directly depends on the nature of the data (i.e., of the graph and features). This result brings us to the main contribution of our work, the learnable graph convolutional attention network (L-CAT): a GNN architecture that automatically interpolates between GCN, GAT and CAT in each layer, by adding only two scalar parameters. Our results demonstrate that L-CAT is able to efficiently combine different GNN layers along the network, outperforming competing methods in a wide range of datasets, and resulting in a more robust model that reduces the need for cross-validation.
Submitted 28 February, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Vildehaye: A Family of Versatile, Widely-Applicable, and Field-Proven Lightweight Wildlife Tracking and Sensing Tags
Authors:
Sivan Toledo,
Shai Mendel,
Anat Levi,
Yoni Vortman,
Wiebke Ullmann,
Lena-Rosa Scherer,
Jan Pufelski,
Frank van Maarseveen,
Bas Denissen,
Allert Bijleveld,
Yotam Orchan,
Yoav Bartan,
Sivan Margalit,
Idan Talmon,
Ran Nathan
Abstract:
We describe the design and implementation of Vildehaye, a family of versatile, widely-applicable, and field-proven tags for wildlife sensing and radio tracking. The family includes 6 distinct hardware designs for tags, 3 add-on boards, a programming adapter, and base stations; modular firmware for tags and base stations (both standalone low-power embedded base stations and base stations tethered to a computer running Linux or Windows); and desktop software for programming and configuring tags, monitoring tags, and downloading and processing sensor data. The tags are versatile: they support multiple packet formats, data rates, and frequency bands; they can be configured for minimum mass (down to less than 1g), making them applicable to a wide range of flying and terrestrial animals, or for inclusion of important sensors and large memories; they can transmit packets compatible with time-of-arrival transmitter-localization systems, tag identification and state packets, and they can reliably upload sensor data through their radio link. The system has been designed, upgraded, and maintained as an academic research project, but it has been extensively used by 5 different groups of ecologists in 4 countries over a period of 5 years. More than 7100 tags have been produced and most of these have been deployed. Production used 41 manufacturing runs. The tags have been used in studies that so far resulted in 9 scientific publications in ecology (including in Science). The paper describes innovative design aspects of Vildehaye, field-use experiences, and lessons from the design, implementation, and maintenance of the system. Both the hardware and software of the system are open.
Submitted 4 May, 2022;
originally announced June 2022.
-
Graph Attention Retrospective
Authors:
Kimon Fountoulakis,
Amit Levi,
Shenghao Yang,
Aseem Baranwal,
Aukosh Jagannath
Abstract:
Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular models is graph attention networks. They were introduced to allow a node to aggregate information from features of neighbor nodes in a non-uniform way, in contrast to simple graph convolution which does not distinguish the neighbors of a node. In this paper, we theoretically study the behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here, the node features are obtained from a mixture of Gaussians and the edges from a stochastic block model. We show that in an "easy" regime, where the distance between the means of the Gaussians is large enough, graph attention is able to distinguish inter-class from intra-class edges. Thus it maintains the weights of important edges and significantly reduces the weights of unimportant edges. Consequently, we show that this implies perfect node classification. In the "hard" regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. In addition, we show that graph attention convolution cannot (almost) perfectly classify the nodes even if intra-class edges could be separated from inter-class edges. Beyond perfect node classification, we provide a positive result on graph attention's robustness against structural noise in the graph. In particular, our robustness result implies that graph attention can be strictly better than both the simple graph convolution and the best linear classifier of node features. We evaluate our theoretical results on synthetic and real-world data.
Submitted 21 May, 2023; v1 submitted 25 February, 2022;
originally announced February 2022.
-
No Community Can Do Everything: Why People Participate in Similar Online Communities
Authors:
Nathan TeBlunthuis,
Charles Kiene,
Isabella Brown,
Laura Alia Levi,
Nicole McGinnis,
Benjamin Mako Hill
Abstract:
Large-scale quantitative analyses have shown that individuals frequently talk to each other about similar things in different online spaces. Why do these overlapping communities exist? We provide an answer grounded in the analysis of 20 interviews with active participants in clusters of highly related subreddits. Within a broad topical area, there are a diversity of benefits an online community can confer. These include (a) specific information and discussion, (b) socialization with similar others, and (c) attention from the largest possible audience. A single community cannot meet all three needs. Our findings suggest that topical areas within an online community platform tend to become populated by groups of specialized communities with diverse sizes, topical boundaries, and rules. Compared with any single community, such systems of overlapping communities are able to provide a greater range of benefits.
Submitted 10 February, 2022; v1 submitted 11 January, 2022;
originally announced January 2022.
-
New Streaming Algorithms for High Dimensional EMD and MST
Authors:
Xi Chen,
Rajesh Jayaram,
Amit Levi,
Erik Waingarten
Abstract:
We study streaming algorithms for two fundamental geometric problems: computing the cost of a Minimum Spanning Tree (MST) of an $n$-point set $X \subset \{1,2,\dots,Δ\}^d$, and computing the Earth Mover Distance (EMD) between two multi-sets $A,B \subset \{1,2,\dots,Δ\}^d$ of size $n$. We consider the turnstile model, where points can be added and removed. We give a one-pass streaming algorithm for MST and a two-pass streaming algorithm for EMD, both achieving an approximation factor of $\tilde{O}(\log n)$ and using polylog$(n,d,Δ)$-space only. Furthermore, our algorithm for EMD can be compressed to a single pass with a small additive error. Previously, the best known sublinear-space streaming algorithms for either problem achieved an approximation of $O(\min\{ \log n , \log (Δd)\} \log n)$ [Andoni-Indyk-Krauthgamer '08, Backurs-Dong-Indyk-Razenshteyn-Wagner '20]. For MST, we also prove that any constant space streaming algorithm can only achieve an approximation of $Ω(\log n)$, analogous to the $Ω(\log n)$ lower bound for EMD of [Andoni-Indyk-Krauthgamer '08].
Our algorithms are based on an improved analysis of a recursive space partitioning method known generically as the Quadtree. Specifically, we show that the Quadtree achieves an $\tilde{O}(\log n)$ approximation for both EMD and MST, improving on the $O(\min\{ \log n , \log (Δd)\} \log n)$ approximation of [Andoni-Indyk-Krauthgamer '08, Backurs-Dong-Indyk-Razenshteyn-Wagner '20].
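To make the recursive grid idea concrete, the sketch below computes a classical grid-imbalance estimate of EMD: at each level of a randomly shifted nested grid, it sums the per-cell difference between the number of points from $A$ and from $B$, weighted by the cell side length. This is an offline illustration in the spirit of the Quadtree, not the streaming algorithm or the analysis of this paper; the function name `quadtree_emd_estimate`, the choice of levels, and the absence of any normalization are assumptions.

```python
import random


def quadtree_emd_estimate(A, B, delta, seed=0):
    """Grid-imbalance estimate of EMD between equal-size point lists.

    A and B are lists of d-dimensional integer tuples in {1, ..., delta}^d.
    """
    if len(A) != len(B):
        raise ValueError("A and B must have the same number of points")
    d = len(A[0])
    rng = random.Random(seed)
    # One random shift shared by all levels keeps the grids nested.
    shift = [rng.randrange(delta) for _ in range(d)]
    estimate = 0.0
    side = 1
    while side < 2 * delta:
        counts = {}
        for sign, points in ((1, A), (-1, B)):
            for p in points:
                cell = tuple((p[i] + shift[i]) // side for i in range(d))
                counts[cell] = counts.get(cell, 0) + sign
        # Unmatched points in a cell must travel roughly `side` to be matched.
        estimate += side * sum(abs(c) for c in counts.values())
        side *= 2
    return estimate
```

Identical multisets give an estimate of zero, since every cell is balanced at every level; the further apart the unmatched points are, the more levels contribute.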
Submitted 5 November, 2021;
originally announced November 2021.
-
A Survey on Ransomware: Evolution, Taxonomy, and Defense Solutions
Authors:
Harun Oz,
Ahmet Aris,
Albert Levi,
A. Selcuk Uluagac
Abstract:
In recent years, ransomware has been one of the most notorious malware targeting end users, governments, and business organizations. It has become a very profitable business for cybercriminals with revenues of millions of dollars, and a very serious threat to organizations with financial loss of billions of dollars. Numerous studies were proposed to address the ransomware threat, including surveys that cover certain aspects of ransomware research. However, no study exists in the literature that gives the complete picture on ransomware and ransomware defense research with respect to the diversity of targeted platforms. Since ransomware is already prevalent in PCs/workstations/desktops/laptops, is becoming more prevalent in mobile devices, and has already hit IoT/CPS recently, and will likely grow further in the IoT/CPS domain very soon, understanding ransomware and analyzing defense mechanisms with respect to target platforms is becoming more imperative. In order to fill this gap and motivate further research, in this paper, we present a comprehensive survey on ransomware and ransomware defense research with respect to PCs/workstations, mobile devices, and IoT/CPS platforms. Specifically, covering 137 studies over the period of 1990-2020, we give a detailed overview of ransomware evolution, comprehensively analyze the key building blocks of ransomware, present a taxonomy of notable ransomware families, and provide an extensive overview of ransomware defense research (i.e., analysis, detection, and recovery) with respect to platforms of PCs/workstations, mobile devices, and IoT/CPS. Moreover, we derive an extensive list of open issues for future ransomware research. We believe this survey will motivate further research by giving a complete picture on state-of-the-art ransomware research.
Submitted 24 February, 2022; v1 submitted 11 February, 2021;
originally announced February 2021.
-
Erasure-Resilient Sublinear-Time Graph Algorithms
Authors:
Amit Levi,
Ramesh Krishnan S. Pallavoor,
Sofya Raskhodnikova,
Nithin Varma
Abstract:
We investigate sublinear-time algorithms that take partially erased graphs represented by adjacency lists as input. Our algorithms make degree and neighbor queries to the input graph and work with a specified fraction of adversarial erasures in adjacency entries. We focus on two computational tasks: testing if a graph is connected or $\varepsilon$-far from connected and estimating the average degree. For testing connectedness, we discover a threshold phenomenon: when the fraction of erasures is less than $\varepsilon$, this property can be tested efficiently (in time independent of the size of the graph); when the fraction of erasures is at least $\varepsilon$, then a number of queries linear in the size of the graph representation is required. Our erasure-resilient algorithm (for the special case with no erasures) is an improvement over the previously known algorithm for connectedness in the standard property testing model and has optimal dependence on the proximity parameter $\varepsilon$. For estimating the average degree, our results provide an "interpolation" between the query complexity for this computational task in the model with no erasures in two different settings: with only degree queries, investigated by Feige (SIAM J. Comput. '06), and with degree queries and neighbor queries, investigated by Goldreich and Ron (Random Struct. Algorithms '08) and Eden et al. (ICALP '17). We conclude with a discussion of our model and open questions raised by our work.
Submitted 29 November, 2020;
originally announced November 2020.
-
Learning and Testing Junta Distributions with Subcube Conditioning
Authors:
Xi Chen,
Rajesh Jayaram,
Amit Levi,
Erik Waingarten
Abstract:
We study the problems of learning and testing junta distributions on $\{-1,1\}^n$ with respect to the uniform distribution, where a distribution $p$ is a $k$-junta if its probability mass function $p(x)$ depends on a subset of at most $k$ variables. The main contribution is an algorithm for finding relevant coordinates in a $k$-junta distribution with subcube conditioning [BC18, CCKLW20]. We give two applications:
1. An algorithm for learning $k$-junta distributions with $\tilde{O}(k/ε^2) \log n + O(2^k/ε^2)$ subcube conditioning queries, and
2. An algorithm for testing $k$-junta distributions with $\tilde{O}((k + \sqrt{n})/ε^2)$ subcube conditioning queries.
All our algorithms are optimal up to poly-logarithmic factors.
Our results show that subcube conditioning, as a natural model for accessing high-dimensional distributions, enables significant savings in learning and testing junta distributions compared to the standard sampling model. This addresses an open question posed by Aliakbarpour, Blais, and Rubinfeld [ABR17].
Submitted 26 April, 2020;
originally announced April 2020.
-
Random Restrictions of High-Dimensional Distributions and Uniformity Testing with Subcube Conditioning
Authors:
Clément L. Canonne,
Xi Chen,
Gautam Kamath,
Amit Levi,
Erik Waingarten
Abstract:
We give a nearly-optimal algorithm for testing uniformity of distributions supported on $\{-1,1\}^n$, which makes $\tilde O (\sqrt{n}/\varepsilon^2)$ queries to a subcube conditional sampling oracle (Bhattacharyya and Chakraborty (2018)). The key technical component is a natural notion of random restriction for distributions on $\{-1,1\}^n$, and a quantitative analysis of how such a restriction affects the mean vector of the distribution. Along the way, we consider the problem of mean testing with independent samples and provide a nearly-optimal algorithm.
Submitted 4 February, 2021; v1 submitted 17 November, 2019;
originally announced November 2019.
-
Hard properties with (very) short PCPPs and their applications
Authors:
Omri Ben-Eliezer,
Eldar Fischer,
Amit Levi,
Ron D. Rothblum
Abstract:
We show that there exist properties that are maximally hard for testing, while still admitting PCPPs with a proof size very close to linear. Specifically, for every fixed $\ell$, we construct a property $\mathcal{P}^{(\ell)}\subseteq\{0,1\}^n$ satisfying the following: Any testing algorithm for $\mathcal{P}^{(\ell)}$ requires $Ω(n)$ many queries, and yet $\mathcal{P}^{(\ell)}$ has a constant query PCPP whose proof size is $O(n\cdot \log^{(\ell)}n)$, where $\log^{(\ell)}$ denotes the $\ell$ times iterated log function (e.g., $\log^{(2)}n = \log \log n$). The best previously known upper bound on the PCPP proof size for a maximally hard to test property was $O(n \cdot \mathrm{poly}\log{n})$.
As an immediate application, we obtain stronger separations between the standard testing model and both the tolerant testing model and the erasure-resilient testing model: for every fixed $\ell$, we construct a property that has a constant-query tester, but requires $Ω(n/\log^{(\ell)}(n))$ queries for every tolerant or erasure-resilient tester.
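The iterated logarithm $\log^{(\ell)}$ used in the proof-size bound above can be illustrated with a minimal sketch. The function name `iterated_log` and the choice of base 2 are assumptions made for illustration; the abstract does not fix the base.

```python
import math


def iterated_log(n: float, ell: int) -> float:
    """Apply the logarithm ell times to n (base 2 is an assumption here).

    iterated_log(n, 2) corresponds to log log n, matching the example
    given in the abstract.
    """
    value = float(n)
    for _ in range(ell):
        if value <= 0:
            raise ValueError("iterated logarithm is undefined for this input")
        value = math.log2(value)
    return value


if __name__ == "__main__":
    n = 2 ** 16
    # The extra factor over linear proof size shrinks quickly as ell grows.
    for ell in range(4):
        print(ell, iterated_log(n, ell))
```

For $n = 2^{16}$, the factor $\log^{(\ell)} n$ drops from 16 at $\ell = 1$ to 4 at $\ell = 2$ and 2 at $\ell = 3$, which gives a sense of how close to linear the $O(n\cdot \log^{(\ell)}n)$ bound is.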
Submitted 15 November, 2019; v1 submitted 7 September, 2019;
originally announced September 2019.
-
Nearly optimal edge estimation with independent set queries
Authors:
Xi Chen,
Amit Levi,
Erik Waingarten
Abstract:
We study the problem of estimating the number of edges of an unknown, undirected graph $G=([n],E)$ with access to an independent set oracle. When queried about a subset $S\subseteq [n]$ of vertices the independent set oracle answers whether $S$ is an independent set in $G$ or not. Our first main result is an algorithm that computes a $(1+ε)$-approximation of the number of edges $m$ of the graph using $\min(\sqrt{m},n / \sqrt{m})\cdot\textrm{poly}(\log n,1/ε)$ independent set queries. This improves the upper bound of $\min(\sqrt{m},n^2/m)\cdot\textrm{poly}(\log n,1/ε)$ by Beame et al. \cite{BHRRS18}. Our second main result shows that $\min(\sqrt{m},n/\sqrt{m})/\textrm{polylog}(n)$ independent set queries are necessary, thus establishing that our algorithm is optimal up to a factor of $\textrm{poly}(\log n, 1/ε)$.
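The query model described above can be made concrete with a small sketch. Everything here is hypothetical scaffolding for illustration: the class `IndependentSetOracle`, the helper `count_edges_naive`, and the use of vertices numbered from 0 to n-1 are assumptions. The baseline shown is the trivial exact counter that queries every pair of vertices; it is not the algorithm from the paper, and it uses $\binom{n}{2}$ queries rather than the sublinear bound stated in the abstract.

```python
from itertools import combinations


class IndependentSetOracle:
    """Hypothetical oracle: hides a graph and answers independent set queries."""

    def __init__(self, edges):
        # Store each undirected edge as a frozenset of its two endpoints.
        self._edges = {frozenset(edge) for edge in edges}
        self.queries = 0  # number of queries asked so far

    def is_independent(self, subset):
        """Return True iff no hidden edge has both endpoints in subset."""
        self.queries += 1
        vertices = set(subset)
        return not any(edge <= vertices for edge in self._edges)


def count_edges_naive(oracle, n):
    """Exact edge count by querying all pairs (a baseline, not the paper's method).

    A pair {u, v} is not independent exactly when (u, v) is an edge.
    """
    return sum(
        1
        for u, v in combinations(range(n), 2)
        if not oracle.is_independent((u, v))
    )


if __name__ == "__main__":
    # A triangle on vertices 0, 1, 2 plus an isolated vertex 3.
    oracle = IndependentSetOracle([(0, 1), (1, 2), (0, 2)])
    print(count_edges_naive(oracle, 4), "edges using", oracle.queries, "queries")
```

The query counter makes the cost of the baseline visible, which is the quantity the bounds in the abstract are about.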
Submitted 9 July, 2019;
originally announced July 2019.
-
Ordered Graph Limits and Their Applications
Authors:
Omri Ben-Eliezer,
Eldar Fischer,
Amit Levi,
Yuichi Yoshida
Abstract:
The emerging theory of graph limits exhibits an analytic perspective on graphs, showing that many important concepts and tools in graph theory and its applications can be described more naturally (and sometimes proved more easily) in analytic language. We extend the theory of graph limits to the ordered setting, presenting a limit object for dense vertex-ordered graphs, which we call an orderon. As a special case, this yields limit objects for matrices whose rows and columns are ordered, and for dynamic graphs that expand (via vertex insertions) over time. Along the way, we devise an ordered locality-preserving variant of the cut distance between ordered graphs, showing that two graphs are close with respect to this distance if and only if they are similar in terms of their ordered subgraph frequencies. We show that the space of orderons is compact with respect to this distance notion, which is key to a successful analysis of combinatorial objects through their limits.
We derive several applications of the ordered limit theory in extremal combinatorics, sampling, and property testing in ordered graphs. In particular, we prove a new ordered analogue of the well-known result by Alon and Stav [RS&A'08] on the furthest graph from a hereditary property; this is the first known result of this type in the ordered setting. Unlike the unordered regime, here the random graph model $G(n, p)$ with an ordering over the vertices is not always asymptotically the furthest from the property for some $p$. However, using our ordered limit theory, we show that random graphs generated by a stochastic block model, where the blocks are consecutive in the vertex ordering, are (approximately) the furthest. Additionally, we describe an alternative analytic proof of the ordered graph removal lemma [Alon et al., FOCS'17].
Submitted 31 July, 2023; v1 submitted 5 November, 2018;
originally announced November 2018.
-
Sublinear-Time Quadratic Minimization via Spectral Decomposition of Matrices
Authors:
Amit Levi,
Yuichi Yoshida
Abstract:
We design a sublinear-time approximation algorithm for quadratic function minimization problems with a better error bound than the previous algorithm by Hayashi and Yoshida (NIPS'16). Our approximation algorithm can be modified to handle the case where the minimization is done over a sphere. The analysis of our algorithms is obtained by combining results from graph limit theory, along with a novel spectral decomposition of matrices. Specifically, we prove that a matrix $A$ can be decomposed into a structured part and a pseudorandom part, where the structured part is a block matrix with a polylogarithmic number of blocks, such that in each block all the entries are the same, and the pseudorandom part has a small spectral norm, achieving better error bound than the existing decomposition theorem of Frieze and Kannan (FOCS'96). As an additional application of the decomposition theorem, we give a sublinear-time approximation algorithm for computing the top singular values of a matrix.
Submitted 27 June, 2018;
originally announced June 2018.
-
Lower Bounds for Tolerant Junta and Unateness Testing via Rejection Sampling of Graphs
Authors:
Amit Levi,
Erik Waingarten
Abstract:
We introduce a new model for testing graph properties which we call the \emph{rejection sampling model}. We show that testing bipartiteness of $n$-node graphs using rejection sampling queries requires complexity $\widetilde{Ω}(n^2)$. Via reductions from the rejection sampling model, we give three new lower bounds for tolerant testing of Boolean functions of the form $f\colon\{0,1\}^n\to \{0,1\}$:
$\bullet$ Tolerant $k$-junta testing with \emph{non-adaptive} queries requires $\widetilde{Ω}(k^2)$ queries.
$\bullet$ Tolerant unateness testing requires $\widetilde{Ω}(n)$ queries.
$\bullet$ Tolerant unateness testing with \emph{non-adaptive} queries requires $\widetilde{Ω}(n^{3/2})$ queries.
Given the $\widetilde{O}(k^{3/2})$-query non-adaptive junta tester of Blais \cite{B08}, we conclude that non-adaptive tolerant junta testing requires more queries than non-tolerant junta testing. In addition, given the $\widetilde{O}(n^{3/4})$-query unateness tester of Chen, Waingarten, and Xie \cite{CWX17b} and the $\widetilde{O}(n)$-query non-adaptive unateness tester of Baleshzar, Chakrabarty, Pallavoor, Raskhodnikova, and Seshadhri \cite{BCPRS17}, we conclude that tolerant unateness testing requires more queries than non-tolerant unateness testing, in both adaptive and non-adaptive settings. These lower bounds provide the first separation between tolerant and non-tolerant testing for a natural property of Boolean functions.
Submitted 2 May, 2018;
originally announced May 2018.
-
Tolerant Junta Testing and the Connection to Submodular Optimization and Function Isomorphism
Authors:
Eric Blais,
Clément L. Canonne,
Talya Eden,
Amit Levi,
Dana Ron
Abstract:
A function $f\colon \{-1,1\}^n \to \{-1,1\}$ is a $k$-junta if it depends on at most $k$ of its variables. We consider the problem of tolerant testing of $k$-juntas, where the testing algorithm must accept any function that is $ε$-close to some $k$-junta and reject any function that is $ε'$-far from every $k'$-junta for some $ε'= O(ε)$ and $k' = O(k)$.
Our first result is an algorithm that solves this problem with query complexity polynomial in $k$ and $1/ε$. This result is obtained via a new polynomial-time approximation algorithm for submodular function minimization (SFM) under large cardinality constraints, which holds even when only given an approximate oracle access to the function.
Our second result considers the case where $k'=k$. We show how to obtain a smooth tradeoff between the amount of tolerance and the query complexity in this setting. Specifically, we design an algorithm that given $ρ\in(0,1/2)$ accepts any function that is $\frac{ερ}{16}$-close to some $k$-junta and rejects any function that is $ε$-far from every $k$-junta. The query complexity of the algorithm is $O\big( \frac{k\log k}{ερ(1-ρ)^k} \big)$.
Finally, we show how to apply the second result to the problem of tolerant isomorphism testing between two unknown Boolean functions $f$ and $g$. We give an algorithm for this problem whose query complexity only depends on the (unknown) smallest $k$ such that either $f$ or $g$ is close to being a $k$-junta.
Submitted 3 November, 2016; v1 submitted 13 July, 2016;
originally announced July 2016.
-
On the Converse of Talagrand's Influence Inequality
Authors:
Saleet Klein,
Amit Levi,
Muli Safra,
Clara Shikhelman,
Yinon Spinka
Abstract:
In 1994, Talagrand showed a generalization of the celebrated KKL theorem. In this work, we prove that the converse of this generalization also holds. Namely, for any sequence of numbers $0<a_1,a_2,\ldots,a_n\le 1$ such that $\sum_{j=1}^n a_j/(1-\log a_j)\ge C$ for some constant $C>0$, it is possible to find a roughly balanced Boolean function $f$ such that $\textrm{Inf}_j[f] < a_j$ for every $1 \le j \le n$.
Submitted 23 June, 2015; v1 submitted 21 June, 2015;
originally announced June 2015.
-
Approximately Counting Triangles in Sublinear Time
Authors:
Talya Eden,
Amit Levi,
Dana Ron,
C. Seshadhri
Abstract:
We consider the problem of estimating the number of triangles in a graph. This problem has been extensively studied in both theory and practice, but all existing algorithms read the entire graph. In this work we design a {\em sublinear-time\/} algorithm for approximating the number of triangles in a graph, where the algorithm is given query access to the graph. The allowed queries are degree queries, vertex-pair queries and neighbor queries.
We show that for any given approximation parameter $0<ε<1$, the algorithm provides an estimate $\widehat{t}$ such that with high constant probability, $(1-ε)\cdot t< \widehat{t}<(1+ε)\cdot t$, where $t$ is the number of triangles in the graph $G$. The expected query complexity of the algorithm is $\!\left(\frac{n}{t^{1/3}} + \min\left\{m, \frac{m^{3/2}}{t}\right\}\right)\cdot {\rm poly}(\log n, 1/ε)$, where $n$ is the number of vertices in the graph and $m$ is the number of edges, and the expected running time is $\!\left(\frac{n}{t^{1/3}} + \frac{m^{3/2}}{t}\right)\cdot {\rm poly}(\log n, 1/ε)$. We also prove that $Ω\!\left(\frac{n}{t^{1/3}} + \min\left\{m, \frac{m^{3/2}}{t}\right\}\right)$ queries are necessary, thus establishing that the query complexity of this algorithm is optimal up to polylogarithmic factors in $n$ (and the dependence on $1/ε$).
Submitted 22 September, 2015; v1 submitted 3 April, 2015;
originally announced April 2015.