-
Making Robust Generalizers Less Rigid with Loss Concentration
Authors:
Matthew J. Holland,
Toma Hamada
Abstract:
While the traditional formulation of machine learning tasks is in terms of performance on average, in practice we are often interested in how well a trained model performs on rare or difficult data points at test time. To achieve more robust and balanced generalization, methods applying sharpness-aware minimization to a subset of worst-case examples have proven successful for image classification tasks, but only using overparameterized neural networks under which the relative difference between "easy" and "hard" data points becomes negligible. In this work, we show how such a strategy can dramatically break down under simpler models where the difficulty gap becomes more extreme. As a more flexible alternative, instead of typical sharpness, we propose and evaluate a training criterion which penalizes poor loss concentration, which can be easily combined with loss transformations such as exponential tilting, conditional value-at-risk (CVaR), or distributionally robust optimization (DRO) that control tail emphasis.
Submitted 20 May, 2025; v1 submitted 7 August, 2024;
originally announced August 2024.
-
Criterion Collapse and Loss Distribution Control
Authors:
Matthew J. Holland
Abstract:
In this work, we consider the notion of "criterion collapse," in which optimization of one metric implies optimality in another, with a particular focus on conditions for collapse into error probability minimizers under a wide variety of learning criteria, ranging from DRO and OCE risks (CVaR, tilted ERM) to non-monotonic criteria underlying recent ascent-descent algorithms explored in the literature (Flooding, SoftAD). We show how collapse in the context of losses with a Bernoulli distribution goes far beyond existing results for CVaR and DRO, then expand our scope to include surrogate losses, showing conditions where monotonic criteria such as tilted ERM cannot avoid collapse, whereas non-monotonic alternatives can.
Submitted 21 May, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Reductive Quantum Phase Estimation
Authors:
Nicholas J. C. Papadopoulos,
Jarrod T. Reilly,
John Drew Wilson,
Murray J. Holland
Abstract:
Estimating a quantum phase is a necessary task in a wide range of fields of quantum science. To accomplish this task, two well-known methods have been developed in distinct contexts, namely, Ramsey interferometry (RI) in atomic and molecular physics and quantum phase estimation (QPE) in quantum computing. We demonstrate that these canonical examples are instances of a larger class of phase estimation protocols, which we call reductive quantum phase estimation (RQPE) circuits. Here we present an explicit algorithm that allows one to create an RQPE circuit. This circuit distinguishes an arbitrary set of phases with fewer qubits and unitary applications, thereby solving a general class of quantum hypothesis testing to which RI and QPE belong. We further demonstrate a trade-off between measurement precision and phase distinguishability, which allows one to tune the circuit to be optimal for a specific application.
Submitted 11 July, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Soft ascent-descent as a stable and flexible alternative to flooding
Authors:
Matthew J. Holland,
Kosuke Nakatani
Abstract:
As a heuristic for improving test accuracy in classification, the "flooding" method proposed by Ishida et al. (2020) sets a threshold for the average surrogate loss at training time; above the threshold, gradient descent is run as usual, but below the threshold, a switch to gradient ascent is made. While setting the threshold is non-trivial and is usually done with validation data, this simple technique has proved remarkably effective in terms of accuracy. On the other hand, what if we are also interested in other metrics such as model complexity or average surrogate loss at test time? As an attempt to achieve better overall performance with less fine-tuning, we propose a softened, pointwise mechanism called SoftAD (soft ascent-descent) that downweights points on the borderline, limits the effects of outliers, and retains the ascent-descent effect of flooding, with no additional computational overhead. We contrast formal stationarity guarantees with those for flooding, and empirically demonstrate how SoftAD can realize classification accuracy competitive with flooding (and the more expensive alternative SAM) while enjoying a much smaller loss generalization gap and model norm.
Submitted 21 October, 2024; v1 submitted 15 October, 2023;
originally announced October 2023.
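The flooding rule summarized in the abstract above (descend while the average training loss is above a threshold, ascend once it falls below) can be illustrated with a minimal sketch. This is not the authors' code and it does not implement SoftAD; the function and parameter names (`flooded_loss`, `flooding_step`, `flood_level`, `lr`) are hypothetical, and the objective |L - b| + b is the form given by Ishida et al. (2020).

```python
def flooded_loss(avg_loss: float, flood_level: float) -> float:
    """Flooded objective |L - b| + b for average loss L and threshold b."""
    return abs(avg_loss - flood_level) + flood_level


def flooding_step(w: float, grad: float, avg_loss: float,
                  flood_level: float, lr: float) -> float:
    """One scalar update under flooding.

    `grad` is the gradient of the (unflooded) average loss at `w`.
    Above the threshold the step is ordinary gradient descent; otherwise
    the sign flips and the step is gradient ascent. At exact equality
    this sketch ascends, which is an arbitrary choice.
    """
    direction = 1.0 if avg_loss > flood_level else -1.0
    return w - lr * direction * grad
```

For instance, with `flood_level=0.2`, an average loss of 0.5 yields a descent step, while an average loss of 0.1 yields an ascent step with the same gradient.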
-
MCQUIC -- A Multicast Extension for QUIC
Authors:
Max Franke,
Jake Holland,
Stefan Schmid
Abstract:
Mass live content, such as world cups, the Superbowl or the Olympics, attracts audiences of hundreds of millions of viewers. While such events were predominantly consumed on TV, more and more viewers follow big events on the Internet, which poses a scalability challenge: current unicast delivery over the web comes with large overheads and is inefficient. An attractive alternative is multicast-based transmission; however, current solutions have several drawbacks, mostly related to security and privacy, which prevent them from being implemented in browsers.
In this paper we introduce a multicast extension to QUIC, a widely popular transport protocol standardized by the IETF, that solves several of these problems. It enables multicast delivery by offering encryption as well as integrity verification of packets distributed over multicast and automatic unicast fallback, which solves one of multicast's major obstacles to large scale deployment. It is transparent to applications and can be easily utilized by simply enabling an option in QUIC. This extension is solely focused on the transport layer and uses already existing multicast mechanisms on the network layer.
Submitted 30 June, 2023;
originally announced June 2023.
-
DeTorrent: An Adversarial Padding-only Traffic Analysis Defense
Authors:
James K Holland,
Jason Carpenter,
Se Eun Oh,
Nicholas Hopper
Abstract:
While anonymity networks like Tor aim to protect the privacy of their users, they are vulnerable to traffic analysis attacks such as Website Fingerprinting (WF) and Flow Correlation (FC). Recent implementations of WF and FC attacks, such as Tik-Tok and DeepCoFFEA, have shown that the attacks can be effectively carried out, threatening user privacy. Consequently, there is a need for effective traffic analysis defense.
There are a variety of existing defenses, but most are either ineffective, incur high latency and bandwidth overhead, or require additional infrastructure. As a result, we aim to design a traffic analysis defense that is efficient and highly resistant to both WF and FC attacks. We propose DeTorrent, which uses competing neural networks to generate and evaluate traffic analysis defenses that insert 'dummy' traffic into real traffic flows. DeTorrent operates with moderate overhead and without delaying traffic. In a closed-world WF setting, it reduces an attacker's accuracy by 61.5%, a reduction 10.5% better than the next-best padding-only defense. Against the state-of-the-art FC attacker, DeTorrent reduces the true positive rate for a $10^{-5}$ false positive rate to about .12, which is less than half that of the next-best defense. We also demonstrate DeTorrent's practicality by deploying it alongside the Tor network and find that it maintains its performance when applied to live traffic.
Submitted 22 September, 2023; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Robust variance-regularized risk minimization with concomitant scaling
Authors:
Matthew J. Holland
Abstract:
Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well as or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.
Submitted 8 February, 2024; v1 submitted 27 January, 2023;
originally announced January 2023.
-
Padding-only defenses add delay in Tor
Authors:
Ethan Witwer,
James Holland,
Nicholas Hopper
Abstract:
Website fingerprinting is an attack that uses size and timing characteristics of encrypted downloads to identify targeted websites. Since this can defeat the privacy goals of anonymity networks such as Tor, many algorithms to defend against this attack in Tor have been proposed in the literature. These algorithms typically consist of some combination of the injection of dummy "padding" packets with the delay of actual packets to disrupt timing patterns. For usability reasons, Tor is intended to provide low latency; as such, many authors focus on padding-only defenses in the belief that they are "zero-delay." We demonstrate through Shadow simulations that by increasing queue lengths, padding-only defenses add delay when deployed network-wide, so they should not be considered "zero-delay." We further argue that future defenses should also be evaluated using network-wide deployment simulations.
Submitted 4 August, 2022;
originally announced August 2022.
-
Flexible risk design using bi-directional dispersion
Authors:
Matthew J. Holland
Abstract:
Many novel notions of "risk" (e.g., CVaR, tilted risk, DRO risk) have been proposed and studied, but these risks are all at least as sensitive as the mean to loss tails on the upside, and tend to ignore deviations on the downside. We study a complementary new risk class that penalizes loss deviations in a bi-directional manner, while having more flexibility in terms of tail sensitivity than is offered by mean-variance. This class lets us derive high-probability learning guarantees without explicit gradient clipping, and empirical tests using both simulated and real data illustrate a high degree of control over key properties of the test loss distribution incurred by gradient-based learners.
Submitted 16 February, 2023; v1 submitted 27 March, 2022;
originally announced March 2022.
-
Towards Reproducible Network Traffic Analysis
Authors:
Jordan Holland,
Paul Schmitt,
Prateek Mittal,
Nick Feamster
Abstract:
Analysis techniques are critical for gaining insight into network traffic given both the higher proportion of encrypted traffic and increasing data rates. Unfortunately, the domain of network traffic analysis suffers from a lack of standardization, leading to incomparable results and barriers to reproducibility. Unlike other disciplines, no standard dataset format exists, forcing researchers and practitioners to create bespoke analysis pipelines for each individual task. Without standardization researchers cannot compare "apples-to-apples", preventing us from knowing with certainty if a new technique represents a methodological advancement or if it simply benefits from a different interpretation of a given dataset.
In this work, we examine irreproducibility that arises from the lack of standardization in network traffic analysis. First, we study the literature, highlighting evidence of irreproducible research based on different interpretations of popular public datasets. Next, we investigate the underlying issues that have led to the status quo and prevent reproducible research. Third, we outline the standardization requirements that any solution aiming to fix reproducibility issues must address. We then introduce pcapML, an open source system which increases reproducibility of network traffic analysis research by enabling metadata information to be directly encoded into raw traffic captures in a generic manner. Finally, we use the standardization pcapML provides to create the pcapML benchmarks, an open source leaderboard website and repository built to track the progress of network traffic analysis methods.
Submitted 23 March, 2022;
originally announced March 2022.
-
A Survey of Learning Criteria Going Beyond the Usual Risk
Authors:
Matthew J. Holland,
Kazuki Tanabe
Abstract:
Virtually all machine learning tasks are characterized using some form of loss function, and "good performance" is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of "what makes for a desirable loss distribution?" in place of tacit use of the expected loss.
Submitted 29 November, 2023; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Robust learning with anytime-guaranteed feedback
Authors:
Matthew J. Holland
Abstract:
Under data distributions which may be heavy-tailed, many stochastic gradient-based learning algorithms are driven by feedback queried at points with almost no performance guarantees on their own. Here we explore a modified "anytime online-to-batch" mechanism which for smooth objectives admits high-probability error bounds while requiring only lower-order moment bounds on the stochastic gradients. Using this conversion, we can derive a wide variety of "anytime robust" procedures, for which the task of performance analysis can be effectively reduced to regret control, meaning that existing regret bounds (for the bounded gradient case) can be robustified and leveraged in a straightforward manner. As a direct takeaway, we obtain an easily implemented stochastic gradient-based algorithm for which all queried points formally enjoy sub-Gaussian error bounds, and in practice show noteworthy gains on real-world data applications.
Submitted 24 May, 2021;
originally announced May 2021.
-
Spectral risk-based learning using unbounded losses
Authors:
Matthew J. Holland,
El Mehdi Haress
Abstract:
In this work, we consider the setting of learning problems under a wide class of spectral risk (or "L-risk") functions, where a Lipschitz-continuous spectral density is used to flexibly assign weight to extreme loss values. We obtain excess risk guarantees for a derivative-free learning procedure under unbounded heavy-tailed loss distributions, and propose a computationally efficient implementation which empirically outperforms traditional risk minimizers in terms of balancing spectral risk and misclassification error.
Submitted 11 May, 2021;
originally announced May 2021.
-
Better scalability under potentially heavy-tailed feedback
Authors:
Matthew J. Holland
Abstract:
We study scalable alternatives to robust gradient descent (RGD) techniques that can be used when the losses and/or gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we instead focus computational effort on robustly choosing (or newly constructing) a strong candidate based on a collection of cheap stochastic sub-processes which can be run in parallel. The exact selection process depends on the convexity of the underlying objective, but in all cases, our selection technique amounts to a robust form of boosting the confidence of weak learners. In addition to formal guarantees, we also provide empirical analysis of robustness to perturbations to experimental conditions, under both sub-Gaussian and heavy-tailed data, along with applications to a variety of benchmark datasets. The overall take-away is an extensible procedure that is simple to implement, trivial to parallelize, which keeps the formal merits of RGD methods but scales much better to large learning problems.
Submitted 14 December, 2020;
originally announced December 2020.
-
RegulaTor: A Straightforward Website Fingerprinting Defense
Authors:
James K Holland,
Nicholas Hopper
Abstract:
Website Fingerprinting (WF) attacks are used by local passive attackers to determine the destination of encrypted internet traffic by comparing the sequences of packets sent to and received by the user to a previously recorded data set. As a result, WF attacks are of particular concern to privacy-enhancing technologies such as Tor. In response, a variety of WF defenses have been developed, though they tend to incur high bandwidth and latency overhead or require additional infrastructure, thus making them difficult to implement in practice. Some lighter-weight defenses have been presented as well; still, they attain only moderate effectiveness against recently published WF attacks. In this paper, we aim to present a realistic and novel defense, RegulaTor, which takes advantage of common patterns in web browsing traffic to reduce both defense overhead and the accuracy of current WF attacks. In the closed-world setting, RegulaTor reduces the accuracy of the state-of-the-art attack, Tik-Tok, against comparable defenses from 66% to 25.4%. To achieve this performance, it requires limited added latency and a bandwidth overhead 39.3% less than the leading moderate-overhead defense. In the open-world setting, RegulaTor limits a precision-tuned Tik-Tok attack to an F-score of .135, compared to .625 for the best comparable defense.
Submitted 21 September, 2021; v1 submitted 11 December, 2020;
originally announced December 2020.
-
Learning with risks based on M-location
Authors:
Matthew J. Holland
Abstract:
In this work, we study a new class of risks defined in terms of the location and deviation of the loss distribution, generalizing far beyond classical mean-variance risk functions. The class is easily implemented as a wrapper around any smooth loss, it admits finite-sample stationarity guarantees for stochastic gradient methods, it is straightforward to interpret and adjust, with close links to M-estimators of the loss location, and has a salient effect on the test loss distribution.
Submitted 25 April, 2021; v1 submitted 4 December, 2020;
originally announced December 2020.
-
Evaluating Snowflake as an Indistinguishable Censorship Circumvention Tool
Authors:
Kyle MacMillan,
Jordan Holland,
Prateek Mittal
Abstract:
Tor is the most well-known tool for circumventing censorship. Unfortunately, Tor traffic has been shown to be detectable using deep-packet inspection. WebRTC is a popular web framework that enables browser-to-browser connections. Snowflake is a novel pluggable transport that leverages WebRTC to connect Tor clients to the Tor network. In theory, Snowflake was created to be indistinguishable from other WebRTC services. In this paper, we evaluate the indistinguishability of Snowflake. We collect over 6,500 DTLS handshakes from Snowflake, Facebook Messenger, Google Hangouts, and Discord WebRTC connections and show that Snowflake is identifiable among these applications with 100% accuracy. We show that several features, including the extensions offered and the number of packets in the handshake, distinguish Snowflake among these services. Finally, we suggest recommendations for improving identification resistance in Snowflake. We have made the dataset publicly available.
Submitted 14 October, 2020; v1 submitted 23 July, 2020;
originally announced August 2020.
-
New Directions in Automated Traffic Analysis
Authors:
Jordan Holland,
Paul Schmitt,
Nick Feamster,
Prateek Mittal
Abstract:
Despite the use of machine learning for many network traffic analysis tasks in security, from application identification to intrusion detection, the aspects of the machine learning pipeline that ultimately determine the performance of the model -- feature selection and representation, model selection, and parameter tuning -- remain manual and painstaking. This paper presents a method to automate many aspects of traffic analysis, making it easier to apply machine learning techniques to a wider variety of traffic analysis tasks. We introduce nPrint, a tool that generates a unified packet representation that is amenable to representation learning and model training. We integrate nPrint with automated machine learning (AutoML), resulting in nPrintML, a public system that largely eliminates feature extraction and model tuning for a wide variety of traffic analysis tasks. We have evaluated nPrintML on eight separate traffic analysis tasks and released nPrint and nPrintML to enable future work to extend these methods.
Submitted 19 October, 2021; v1 submitted 6 August, 2020;
originally announced August 2020.
-
Making learning more transparent using conformalized performance prediction
Authors:
Matthew J. Holland
Abstract:
In this work, we study some novel applications of conformal inference techniques to the problem of providing machine learning procedures with more transparent, accurate, and practical performance guarantees. We provide a natural extension of the traditional conformal prediction framework, done in such a way that we can make valid and well-calibrated predictive statements about the future performance of arbitrary learning algorithms, when passed an as-yet unseen training set. In addition, we include some nascent empirical examples to illustrate potential applications.
Submitted 8 July, 2020;
originally announced July 2020.
-
Classifying Network Vendors at Internet Scale
Authors:
Jordan Holland,
Ross Teixeira,
Paul Schmitt,
Kevin Borgolte,
Jennifer Rexford,
Nick Feamster,
Jonathan Mayer
Abstract:
In this paper, we develop a method to create a large, labeled dataset of visible network device vendors across the Internet by mapping network-visible IP addresses to device vendors. We use Internet-wide scanning, banner grabs of network-visible devices across the IPv4 address space, and clustering techniques to assign labels to more than 160,000 devices. We subsequently probe these devices and use features extracted from the responses to train a classifier that can accurately classify device vendors. Finally, we demonstrate how this method can be used to understand broader trends across the Internet by predicting device vendors in traceroutes from CAIDA's Archipelago measurement system and subsequently examining vendor distributions across these traceroutes.
Submitted 24 June, 2020; v1 submitted 23 June, 2020;
originally announced June 2020.
-
Learning with CVaR-based feedback under potentially heavy tails
Authors:
Matthew J. Holland,
El Mehdi Haress
Abstract:
We study learning algorithms that seek to minimize the conditional value-at-risk (CVaR), when all the learner knows is that the losses incurred may be heavy-tailed. We begin by studying a general-purpose estimator of CVaR for potentially heavy-tailed random variables, which is easy to implement in practice, and requires nothing more than finite variance and a distribution function that does not change too fast or slow around just the quantile of interest. With this estimator in hand, we then derive a new learning algorithm which robustly chooses among candidates produced by stochastic gradient-driven sub-processes. For this procedure we provide high-probability excess CVaR bounds, and to complement the theory we conduct empirical tests of the underlying CVaR estimator and the learning algorithm derived from it.
Submitted 2 June, 2020;
originally announced June 2020.
-
Improved scalability under heavy tails, without strong convexity
Authors:
Matthew J. Holland
Abstract:
Real-world data is laden with outlying values. The challenge for machine learning is that the learner typically has no prior knowledge of whether the feedback it receives (losses, gradients, etc.) will be heavy-tailed or not. In this work, we study a simple algorithmic strategy that can be leveraged when both losses and gradients can be heavy-tailed. The core technique introduces a simple robust validation sub-routine, which is used to boost the confidence of inexpensive gradient-based sub-processes. Compared with recent robust gradient descent methods from the literature, dimension dependence (both risk bounds and cost) is substantially improved, without relying upon strong convexity or expensive per-step robustification. Empirically, we also show that under heavy-tailed losses, the proposed procedure cannot simply be replaced with naive cross-validation. Taken together, we have a scalable method with transparent guarantees, which performs well without prior knowledge of how "convenient" the feedback it receives will be.
Submitted 14 December, 2020; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Better scalability under potentially heavy-tailed gradients
Authors:
Matthew J. Holland
Abstract:
We study a scalable alternative to robust gradient descent (RGD) techniques that can be used when the gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we choose a candidate which does not diverge too far from the majority of cheap stochastic sub-processes run for a single pass over partitioned data. In addition to formal guarantees, we also provide empirical analysis of robustness to perturbations to experimental conditions, under both sub-Gaussian and heavy-tailed data. The result is a procedure that is simple to implement, trivial to parallelize, which keeps the formal strength of RGD methods but scales much better to large learning problems.
Submitted 14 December, 2020; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Identifying Disinformation Websites Using Infrastructure Features
Authors:
Austin Hounsel,
Jordan Holland,
Ben Kaiser,
Kevin Borgolte,
Nick Feamster,
Jonathan Mayer
Abstract:
Platforms have struggled to keep pace with the spread of disinformation. Current responses like user reports, manual analysis, and third-party fact checking are slow and difficult to scale, and as a result, disinformation can spread unchecked for some time after being created. Automation is essential for enabling platforms to respond rapidly to disinformation. In this work, we explore a new direction for automated detection of disinformation websites: infrastructure features. Our hypothesis is that while disinformation websites may be perceptually similar to authentic news websites, there may also be significant non-perceptual differences in the domain registrations, TLS/SSL certificates, and web hosting configurations. Infrastructure features are particularly valuable for detecting disinformation websites because they are available before content goes live and reaches readers, enabling early detection. We demonstrate the feasibility of our approach on a large corpus of labeled website snapshots. We also present results from a preliminary real-time deployment, successfully discovering disinformation websites while highlighting unexplored challenges for automated disinformation detection.
Submitted 28 September, 2020; v1 submitted 28 February, 2020;
originally announced March 2020.
-
Comparing the Effects of DNS, DoT, and DoH on Web Performance
Authors:
Austin Hounsel,
Kevin Borgolte,
Paul Schmitt,
Jordan Holland,
Nick Feamster
Abstract:
Nearly every service on the Internet relies on the Domain Name System (DNS), which translates a human-readable name to an IP address before two endpoints can communicate. Today, DNS traffic is unencrypted, leaving users vulnerable to eavesdropping and tampering. Past work has demonstrated that DNS queries can reveal a user's browsing history and even what smart devices they are using at home. In response to these privacy concerns, two new protocols have been proposed: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT). Instead of sending DNS queries and responses in the clear, DoH and DoT establish encrypted connections between users and resolvers. By doing so, these protocols provide privacy and security guarantees that traditional DNS (Do53) lacks.
In this paper, we measure the effect of Do53, DoT, and DoH on query response times and page load times from five global vantage points. We find that although DoH and DoT response times are generally higher than Do53, both protocols can perform better than Do53 in terms of page load times. However, as throughput decreases and substantial packet loss and latency are introduced, web pages load fastest with Do53. Additionally, web pages successfully load more often with Do53 and DoT than DoH. Based on these results, we provide several recommendations to improve DNS performance, such as opportunistic partial responses and wire format caching.
Submitted 23 February, 2020; v1 submitted 18 July, 2019;
originally announced July 2019.
-
PAC-Bayes under potentially heavy tails
Authors:
Matthew J. Holland
Abstract:
We derive PAC-Bayesian learning guarantees for heavy-tailed losses, and obtain a novel optimal Gibbs posterior which enjoys finite-sample excess risk bounds at logarithmic confidence. Our core technique itself makes use of PAC-Bayesian inequalities in order to derive a robust risk estimator, which by design is easy to compute. In particular, only assuming that the first three moments of the loss distribution are bounded, the learning algorithm derived from this estimator achieves nearly sub-Gaussian statistical error, up to the quality of the prior.
Submitted 18 December, 2019; v1 submitted 20 May, 2019;
originally announced May 2019.
-
Measuring Irregular Geographic Exposure on the Internet
Authors:
Jordan Holland,
Jared Smith,
Max Schuchard
Abstract:
We examine the extent of needless traffic exposure by the routing infrastructure to nations geographically irrelevant to packet transmission. We quantify what countries are geographically logical to observe on a network path traveling between two nations through the use of convex hulls circumscribing major population centers. We then compare that to the nation states observed in over 2.5 billion measured paths. We examine both the entire geographic topology of the Internet and a subset of the topology that a Tor user would typically interact with. We find that 44% of paths across the entire geographic topology of the Internet and 33% of paths in the user experience subset unnecessarily expose traffic to one or more nations. Finally, we consider the scenario where countries exercise both legal and physical control over autonomous systems, gaining access to traffic outside of their geographic borders, but carried by organizations that fall under the AS's registered country's legal jurisdiction. At least 49% of paths in both measurements expose traffic to a geographically irrelevant country when considering both the physical and legal countries that a path traverses.
Submitted 31 May, 2019; v1 submitted 19 April, 2019;
originally announced April 2019.
-
Robust descent using smoothed multiplicative noise
Authors:
Matthew J. Holland
Abstract:
To improve the off-sample generalization of classical procedures minimizing the empirical risk under potentially heavy-tailed data, new robust learning algorithms have been proposed in recent years, with generalized median-of-means strategies being particularly salient. These procedures enjoy performance guarantees in the form of sharp risk bounds under weak moment assumptions on the underlying loss, but typically suffer from a large computational overhead and substantial bias when the data happens to be sub-Gaussian, limiting their utility. In this work, we propose a novel robust gradient descent procedure which makes use of a smoothed multiplicative noise applied directly to observations before constructing a sum of soft-truncated gradient coordinates. We show that the procedure has competitive theoretical guarantees, with the major advantage of a simple implementation that does not require an iterative sub-routine for robustification. Empirical tests reinforce the theory, showing more efficient generalization over a much wider class of data distributions.
Submitted 15 October, 2018;
originally announced October 2018.
-
Classification using margin pursuit
Authors:
Matthew J. Holland
Abstract:
In this work, we study a new approach to optimizing the margin distribution realized by binary classifiers. The classical approach to this problem is simply maximization of the expected margin, while more recent proposals consider simultaneous variance control and proxy objectives based on robust location estimates, in the vein of keeping the margin distribution sharply concentrated in a desirable region. While conceptually appealing, these new approaches are often computationally unwieldy, and theoretical guarantees are limited. Given this context, we propose an algorithm which searches the hypothesis space in such a way that a pre-set "margin level" ends up being a distribution-robust estimator of the margin location. This procedure is easily implemented using gradient descent, and admits finite-sample bounds on the excess risk under unbounded inputs. Empirical tests on real-world benchmark data reinforce the basic principles highlighted by the theory, and are suggestive of a promising new technique for classification.
Submitted 11 October, 2018;
originally announced October 2018.