-
NetSSM: Multi-Flow and State-Aware Network Trace Generation using State-Space Models
Authors:
Andrew Chu,
Xi Jiang,
Shinan Liu,
Arjun Bhagoji,
Francesco Bronzino,
Paul Schmitt,
Nick Feamster
Abstract:
Access to raw network traffic data is essential for many computer networking tasks, from traffic modeling to performance evaluation. Unfortunately, this data is scarce due to high collection costs and governance rules. Previous efforts explore this challenge by generating synthetic network data, but fail to reliably handle multi-flow sessions, struggle to reason about stateful communication in mod…
▽ More
Access to raw network traffic data is essential for many computer networking tasks, from traffic modeling to performance evaluation. Unfortunately, this data is scarce due to high collection costs and governance rules. Previous efforts explore this challenge by generating synthetic network data, but fail to reliably handle multi-flow sessions, struggle to reason about stateful communication in moderate to long-duration network sessions, and lack robust evaluations tied to real-world utility. We propose a new method based on state-space models called NetSSM that generates raw network traffic at the packet-level granularity. Our approach captures interactions between multiple, interleaved flows -- an objective unexplored in prior work -- and effectively reasons about flow-state in sessions to capture traffic characteristics. NetSSM accomplishes this by learning from and producing traces 8x and 78x longer than existing transformer-based approaches. Evaluation results show that our method generates high-fidelity traces that outperform prior efforts in existing benchmarks. We also find that NetSSM's traces have high semantic similarity to real network data regarding compliance with standard protocol requirements and flow and session-level traffic characteristics.
△ Less
Submitted 28 March, 2025;
originally announced March 2025.
-
Cruise Control: Dynamic Model Selection for ML-Based Network Traffic Analysis
Authors:
Johann Hugon,
Paul Schmitt,
Anthony Busson,
Francesco Bronzino
Abstract:
Modern networks increasingly rely on machine learning models for real-time insights, including traffic classification, application quality of experience inference, and intrusion detection. However, existing approaches prioritize prediction accuracy without considering deployment constraints or the dynamism of network traffic, leading to potentially suboptimal performance. Because of this, deployin…
▽ More
Modern networks increasingly rely on machine learning models for real-time insights, including traffic classification, application quality of experience inference, and intrusion detection. However, existing approaches prioritize prediction accuracy without considering deployment constraints or the dynamism of network traffic, leading to potentially suboptimal performance. Because of this, deploying ML models in real-world networks with tight performance constraints remains an open challenge. In contrast with existing work that aims to select an optimal candidate model for each task based on offline information, we propose an online, system-driven approach to dynamically select the best ML model for network traffic analysis. To this end, we present Cruise Control, a system that pre-trains several models for a given task with different accuracy-cost tradeoffs and selects the most appropriate model based on lightweight signals representing the system's current traffic processing ability. Experimental results using two real-world traffic analysis tasks demonstrate Cruise Control's effectiveness in adapting to changing network conditions. Our evaluation shows that Cruise Control improves median accuracy by 2.78% while reducing packet loss by a factor of four compared to offline-selected models.
△ Less
Submitted 19 December, 2024;
originally announced December 2024.
-
Micro and macro facial expressions by driven animations in realistic Virtual Humans
Authors:
Rubens Halbig Montanha,
Giovana Nascimento Raupp,
Ana Carolina Policarpo Schmitt,
Victor Flávio de Andrade Araujo,
Soraia Raupp Musse
Abstract:
Computer Graphics (CG) advancements have allowed the creation of more realistic Virtual Humans (VH) through modern techniques for animating the VH body and face, thereby affecting perception. From traditional methods, including blend shapes, to driven animations using facial and body tracking, these advancements can potentially enhance the perception of comfort and realism in relation to VHs. Prev…
▽ More
Computer Graphics (CG) advancements have allowed the creation of more realistic Virtual Humans (VH) through modern techniques for animating the VH body and face, thereby affecting perception. From traditional methods, including blend shapes, to driven animations using facial and body tracking, these advancements can potentially enhance the perception of comfort and realism in relation to VHs. Previously, Psychology studied facial movements in humans, with some works separating expressions into macro and micro expressions. Also, some previous CG studies have analyzed how macro and micro expressions are perceived, replicating psychology studies in VHs, encompassing studies with realistic and cartoon VHs, and exploring different VH technologies. However, instead of using facial tracking animation methods, these previous studies animated the VHs using blendshapes interpolation. To understand how the facial tracking technique alters the perception of VHs, this paper extends the study to macro and micro expressions, employing two datasets to transfer real facial expressions to VHs and analyze how their expressions are perceived. Our findings suggest that transferring facial expressions from real actors to VHs significantly diminishes the accuracy of emotion perception compared to VH facial animations created by artists.
△ Less
Submitted 28 August, 2024;
originally announced August 2024.
-
Feasibility of State Space Models for Network Traffic Generation
Authors:
Andrew Chu,
Xi Jiang,
Shinan Liu,
Arjun Bhagoji,
Francesco Bronzino,
Paul Schmitt,
Nick Feamster
Abstract:
Many problems in computer networking rely on parsing collections of network traces (e.g., traffic prioritization, intrusion detection). Unfortunately, the availability and utility of these collections is limited due to privacy concerns, data staleness, and low representativeness. While methods for generating data to augment collections exist, they often fall short in replicating the quality of rea…
▽ More
Many problems in computer networking rely on parsing collections of network traces (e.g., traffic prioritization, intrusion detection). Unfortunately, the availability and utility of these collections is limited due to privacy concerns, data staleness, and low representativeness. While methods for generating data to augment collections exist, they often fall short in replicating the quality of real-world traffic In this paper, we i) survey the evolution of traffic simulators/generators and ii) propose the use of state-space models, specifically Mamba, for packet-level, synthetic network trace generation by modeling it as an unsupervised sequence generation problem. Early evaluation shows that state-space models can generate synthetic network traffic with higher statistical similarity to real traffic than the state-of-the-art. Our approach thus has the potential to reliably generate realistic, informative synthetic network traces for downstream tasks.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Beyond Data Points: Regionalizing Crowdsourced Latency Measurements
Authors:
Taveesh Sharma,
Paul Schmitt,
Francesco Bronzino,
Nick Feamster,
Nicole Marwell
Abstract:
Despite significant investments in access network infrastructure, universal access to high-quality Internet connectivity remains a challenge. Policymakers often rely on large-scale, crowdsourced measurement datasets to assess the distribution of access network performance across geographic areas. These decisions typically rest on the assumption that Internet performance is uniformly distributed wi…
▽ More
Despite significant investments in access network infrastructure, universal access to high-quality Internet connectivity remains a challenge. Policymakers often rely on large-scale, crowdsourced measurement datasets to assess the distribution of access network performance across geographic areas. These decisions typically rest on the assumption that Internet performance is uniformly distributed within predefined social boundaries. However, this assumption may not be valid for two reasons: crowdsourced measurements often exhibit non-uniform sampling densities within geographic areas; and predefined social boundaries may not align with the actual boundaries of Internet infrastructure. In this paper, we present a spatial analysis on crowdsourced datasets for constructing stable boundaries for sampling Internet performance. We hypothesize that greater stability in sampling boundaries will reflect the true nature of Internet performance disparities than misleading patterns observed as a result of data sampling variations. We apply and evaluate a series of statistical techniques to: aggregate Internet performance over geographic regions; overlay interpolated maps with various sampling unit choices; and spatially cluster boundary units to identify contiguous areas with similar performance characteristics. We assess the effectiveness of the techniques we apply by comparing the similarity of the resulting boundaries for monthly samples drawn from the dataset. Our evaluation shows that the combination of techniques we apply achieves higher similarity compared to directly calculating central measures of network metrics over census tracts or neighborhood boundaries. These findings underscore the important role of spatial modeling in accurately assessing and optimizing the distribution of Internet performance, to inform policy, network operations, and long-term planning decisions.
△ Less
Submitted 26 October, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
VidPlat: A Tool for Fast Crowdsourcing of Quality-of-Experience Measurements
Authors:
Xu Zhang,
Hanchen Li,
Paul Schmitt,
Marshini Chetty,
Nick Feamster,
Junchen Jiang
Abstract:
For video or web services, it is crucial to measure user-perceived quality of experience (QoE) at scale under various video quality or page loading delays. However, fast QoE measurements remain challenging as they must elicit subjective assessment from human users. Previous work either (1) automates QoE measurements by letting crowdsourcing raters watch and rate QoE test videos or (2) dynamically…
▽ More
For video or web services, it is crucial to measure user-perceived quality of experience (QoE) at scale under various video quality or page loading delays. However, fast QoE measurements remain challenging as they must elicit subjective assessment from human users. Previous work either (1) automates QoE measurements by letting crowdsourcing raters watch and rate QoE test videos or (2) dynamically prunes redundant QoE tests based on previously collected QoE measurements. Unfortunately, it is hard to combine both ideas because traditional crowdsourcing requires QoE test videos to be pre-determined before a crowdsourcing campaign begins. Thus, if researchers want to dynamically prune redundant test videos based on other test videos' QoE, they are forced to launch multiple crowdsourcing campaigns, causing extra overheads to re-calibrate or train raters every time.
This paper presents VidPlat, the first open-source tool for fast and automated QoE measurements, by allowing dynamic pruning of QoE test videos within a single crowdsourcing task. VidPlat creates an indirect shim layer between researchers and the crowdsourcing platforms. It allows researchers to define a logic that dynamically determines which new test videos need more QoE ratings based on the latest QoE measurements, and it then redirects crowdsourcing raters to watch QoE test videos dynamically selected by this logic. Other than having fewer crowdsourcing campaigns, VidPlat also reduces the total number of QoE ratings by dynamically deciding when enough ratings are gathered for each test video. It is an open-source platform that future researchers can reuse and customize. We have used VidPlat in three projects (web loading, on-demand video, and online gaming). We show that VidPlat can reduce crowdsourcing cost by 31.8% - 46.0% and latency by 50.9% - 68.8%.
△ Less
Submitted 11 November, 2023;
originally announced November 2023.
-
NetDiffusion: Network Data Augmentation Through Protocol-Constrained Traffic Generation
Authors:
Xi Jiang,
Shinan Liu,
Aaron Gember-Jacobson,
Arjun Nitin Bhagoji,
Paul Schmitt,
Francesco Bronzino,
Nick Feamster
Abstract:
Datasets of labeled network traces are essential for a multitude of machine learning (ML) tasks in networking, yet their availability is hindered by privacy and maintenance concerns, such as data staleness. To overcome this limitation, synthetic network traces can often augment existing datasets. Unfortunately, current synthetic trace generation methods, which typically produce only aggregated flo…
▽ More
Datasets of labeled network traces are essential for a multitude of machine learning (ML) tasks in networking, yet their availability is hindered by privacy and maintenance concerns, such as data staleness. To overcome this limitation, synthetic network traces can often augment existing datasets. Unfortunately, current synthetic trace generation methods, which typically produce only aggregated flow statistics or a few selected packet attributes, do not always suffice, especially when model training relies on having features that are only available from packet traces. This shortfall manifests in both insufficient statistical resemblance to real traces and suboptimal performance on ML tasks when employed for data augmentation. In this paper, we apply diffusion models to generate high-resolution synthetic network traffic traces. We present NetDiffusion, a tool that uses a finely-tuned, controlled variant of a Stable Diffusion model to generate synthetic network traffic that is high fidelity and conforms to protocol specifications. Our evaluation demonstrates that packet captures generated from NetDiffusion can achieve higher statistical similarity to real data and improved ML model performance than current state-of-the-art approaches (e.g., GAN-based approaches). Furthermore, our synthetic traces are compatible with common network analysis tools and support a myriad of network tasks, suggesting that NetDiffusion can serve a broader spectrum of network analysis and testing tasks, extending beyond ML-centric applications.
△ Less
Submitted 12 October, 2023;
originally announced October 2023.
-
SoK: The Ghost Trilemma
Authors:
Sulagna Mukherjee,
Srivatsan Ravi,
Paul Schmitt,
Barath Raghavan
Abstract:
Trolls, bots, and sybils distort online discourse and compromise the security of networked platforms. User identity is central to the vectors of attack and manipulation employed in these contexts. However it has long seemed that, try as it might, the security community has been unable to stem the rising tide of such problems. We posit the Ghost Trilemma, that there are three key properties of iden…
▽ More
Trolls, bots, and sybils distort online discourse and compromise the security of networked platforms. User identity is central to the vectors of attack and manipulation employed in these contexts. However it has long seemed that, try as it might, the security community has been unable to stem the rising tide of such problems. We posit the Ghost Trilemma, that there are three key properties of identity -- sentience, location, and uniqueness -- that cannot be simultaneously verified in a fully-decentralized setting. Many fully-decentralized systems -- whether for communication or social coordination -- grapple with this trilemma in some way, perhaps unknowingly. In this Systematization of Knowledge (SoK) paper, we examine the design space, use cases, problems with prior approaches, and possible paths forward. We sketch a proof of this trilemma and outline options for practical, incrementally deployable schemes to achieve an acceptable tradeoff of trust in centralized trust anchors, decentralized operation, and an ability to withstand a range of attacks, while protecting user privacy.
△ Less
Submitted 19 January, 2024; v1 submitted 4 August, 2023;
originally announced August 2023.
-
Internet Localization of Multi-Party Relay Users: Inherent Friction Between Internet Services and User Privacy
Authors:
Sean Flynn,
Francesco Bronzino,
Paul Schmitt
Abstract:
Internet privacy is increasingly important on the modern Internet. Users are looking to control the trail of data that they leave behind on the systems that they interact with. Multi-Party Relay (MPR) architectures lower the traditional barriers to adoption of privacy enhancing technologies on the Internet. MPRs are unique from legacy architectures in that they are able to offer privacy guarantees…
▽ More
Internet privacy is increasingly important on the modern Internet. Users are looking to control the trail of data that they leave behind on the systems that they interact with. Multi-Party Relay (MPR) architectures lower the traditional barriers to adoption of privacy enhancing technologies on the Internet. MPRs are unique from legacy architectures in that they are able to offer privacy guarantees without paying significant performance penalties. Apple's iCloud Private Relay is a recently deployed MPR service, creating the potential for widespread consumer adoption of the architecture. However, many current Internet-scale systems are designed based on assumptions that may no longer hold for users of privacy enhancing systems like Private Relay. There are inherent tensions between systems that rely on data about users -- estimated location of a user based on their IP address, for example -- and the trend towards a more private Internet.
This work studies a core function that is widely used to control network and application behavior, IP geolocation, in the context of iCloud Private Relay usage. We study the location accuracy of popular IP geolocation services compared against the published location dataset that Apple publicly releases to explicitly aid in geolocating PR users. We characterize geolocation service performance across a number of dimensions, including different countries, IP version, infrastructure provider, and time. Our findings lead us to conclude that existing approaches to IP geolocation (e.g., frequently updated databases) perform inadequately for users of the MPR architecture. For example, we find median location errors >1,000 miles in some countries for IPv4 addresses using IP2Location. Our findings lead us to conclude that new, privacy-focused, techniques for inferring user location may be required as privacy becomes a default user expectation on the Internet.
△ Less
Submitted 8 July, 2023;
originally announced July 2023.
-
AC-DC: Adaptive Ensemble Classification for Network Traffic Identification
Authors:
Xi Jiang,
Shinan Liu,
Saloua Naama,
Francesco Bronzino,
Paul Schmitt,
Nick Feamster
Abstract:
Accurate and efficient network traffic classification is important for many network management tasks, from traffic prioritization to anomaly detection. Although classifiers using pre-computed flow statistics (e.g., packet sizes, inter-arrival times) can be efficient, they may experience lower accuracy than techniques based on raw traffic, including packet captures. Past work on representation lear…
▽ More
Accurate and efficient network traffic classification is important for many network management tasks, from traffic prioritization to anomaly detection. Although classifiers using pre-computed flow statistics (e.g., packet sizes, inter-arrival times) can be efficient, they may experience lower accuracy than techniques based on raw traffic, including packet captures. Past work on representation learning-based classifiers applied to network traffic captures has shown to be more accurate, but slower and requiring considerable additional memory resources, due to the substantial costs in feature preprocessing. In this paper, we explore this trade-off and develop the Adaptive Constraint-Driven Classification (AC-DC) framework to efficiently curate a pool of classifiers with different target requirements, aiming to provide comparable classification performance to complex packet-capture classifiers while adapting to varying network traffic load.
AC-DC uses an adaptive scheduler that tracks current system memory availability and incoming traffic rates to determine the optimal classifier and batch size to maximize classification performance given memory and processing constraints. Our evaluation shows that AC-DC improves classification performance by more than 100% compared to classifiers that rely on flow statistics alone; compared to the state-of-the-art packet-capture classifiers, AC-DC achieves comparable performance (less than 12.3% lower in F1-Score), but processes traffic over 150x faster.
△ Less
Submitted 22 February, 2023;
originally announced February 2023.
-
Enabling Personalized Video Quality Optimization with VidHoc
Authors:
Xu Zhang,
Paul Schmitt,
Marshini Chetty,
Nick Feamster,
Junchen Jiang
Abstract:
The emerging video applications greatly increase the demand in network bandwidth that is not easy to scale. To provide higher quality of experience (QoE) under limited bandwidth, a recent trend is to leverage the heterogeneity of quality preferences across individual users. Although these efforts have suggested the great potential benefits, service providers still have not deployed them to realize…
▽ More
The emerging video applications greatly increase the demand in network bandwidth that is not easy to scale. To provide higher quality of experience (QoE) under limited bandwidth, a recent trend is to leverage the heterogeneity of quality preferences across individual users. Although these efforts have suggested the great potential benefits, service providers still have not deployed them to realize the promised QoE improvement. The missing piece is an automation of online per-user QoE modeling and optimization scheme for new users. Previous efforts either optimize QoE by known per-user QoE models or learn a user's QoE model by offline approaches, such as analysis of video viewing history and in-lab user study. Relying on such offline modeling is problematic, because QoE optimization will start late for collecting enough data to train an unbiased QoE model. In this paper, we propose VidHoc, the first automatic system that jointly personalizes QoE model and optimizes QoE in an online manner for each new user. VidHoc can build per-user QoE models within a small number of video sessions as well as maintain good QoE. We evaluate VidHoc in a pilot deployment to fifteen users for four months with the care of statistical validity. Compared with other baselines, the results show that VidHoc can save 17.3% bandwidth while maintaining the same QoE or improve QoE by 13.9% with the same bandwidth.
△ Less
Submitted 29 November, 2022;
originally announced November 2022.
-
Constraint-based Task Specification and Trajectory Optimization for Sequential Manipulation
Authors:
Mun Seng Phoon,
Philipp S. Schmitt,
Georg v. Wichert
Abstract:
To economically deploy robotic manipulators the programming and execution of robot motions must be swift. To this end, we propose a novel, constraint-based method to intuitively specify sequential manipulation tasks and to compute time-optimal robot motions for such a task specification. Our approach follows the ideas of constraint-based task specification by aiming for a minimal and object-centri…
▽ More
To economically deploy robotic manipulators the programming and execution of robot motions must be swift. To this end, we propose a novel, constraint-based method to intuitively specify sequential manipulation tasks and to compute time-optimal robot motions for such a task specification. Our approach follows the ideas of constraint-based task specification by aiming for a minimal and object-centric task description that is largely independent of the underlying robot kinematics. We transform this task description into a non-linear optimization problem. By solving this problem we obtain a (locally) time-optimal robot motion, not just for a single motion, but for an entire manipulation sequence. We demonstrate the capabilities of our approach in a series of experiments involving five distinct robot models, including a highly redundant mobile manipulator.
△ Less
Submitted 19 August, 2022;
originally announced August 2022.
-
Towards Reproducible Network Traffic Analysis
Authors:
Jordan Holland,
Paul Schmitt,
Prateek Mittal,
Nick Feamster
Abstract:
Analysis techniques are critical for gaining insight into network traffic given both the higher proportion of encrypted traffic and increasing data rates. Unfortunately, the domain of network traffic analysis suffers from a lack of standardization, leading to incomparable results and barriers to reproducibility. Unlike other disciplines, no standard dataset format exists, forcing researchers and p…
▽ More
Analysis techniques are critical for gaining insight into network traffic given both the higher proportion of encrypted traffic and increasing data rates. Unfortunately, the domain of network traffic analysis suffers from a lack of standardization, leading to incomparable results and barriers to reproducibility. Unlike other disciplines, no standard dataset format exists, forcing researchers and practitioners to create bespoke analysis pipelines for each individual task. Without standardization researchers cannot compare "apples-to-apples", preventing us from knowing with certainty if a new technique represents a methodological advancement or if it simply benefits from a different interpretation of a given dataset.
In this work, we examine irreproducibility that arises from the lack of standardization in network traffic analysis. First, we study the literature, highlighting evidence of irreproducible research based on different interpretations of popular public datasets. Next, we investigate the underlying issues that have lead to the status quo and prevent reproducible research. Third, we outline the standardization requirements that any solution aiming to fix reproducibility issues must address. We then introduce pcapML, an open source system which increases reproducibility of network traffic analysis research by enabling metadata information to be directly encoded into raw traffic captures in a generic manner. Finally, we use the standardization pcapML provides to create the pcapML benchmarks, an open source leaderboard website and repository built to track the progress of network traffic analysis methods.
△ Less
Submitted 23 March, 2022;
originally announced March 2022.
-
nuReality: A VR environment for research of pedestrian and autonomous vehicle interactions
Authors:
Paul Schmitt,
Nicholas Britten,
JiHyun Jeong,
Amelia Coffey,
Kevin Clark,
Shweta Sunil Kothawade,
Elena Corina Grigore,
Adam Khaw,
Christopher Konopka,
Linh Pham,
Kim Ryan,
Christopher Schmitt,
Aryaman Pandya,
Emilio Frazzoli
Abstract:
We present nuReality, a virtual reality 'VR' environment designed to test the efficacy of vehicular behaviors to communicate intent during interactions between autonomous vehicles 'AVs' and pedestrians at urban intersections. In this project we focus on expressive behaviors as a means for pedestrians to readily recognize the underlying intent of the AV's movements. VR is an ideal tool to use to te…
▽ More
We present nuReality, a virtual reality 'VR' environment designed to test the efficacy of vehicular behaviors to communicate intent during interactions between autonomous vehicles 'AVs' and pedestrians at urban intersections. In this project we focus on expressive behaviors as a means for pedestrians to readily recognize the underlying intent of the AV's movements. VR is an ideal tool to use to test these situations as it can be immersive and place subjects into these potentially dangerous scenarios without risk. nuReality provides a novel and immersive virtual reality environment that includes numerous visual details (road and building texturing, parked cars, swaying tree limbs) as well as auditory details (birds chirping, cars honking in the distance, people talking). In these files we present the nuReality environment, its 10 unique vehicle behavior scenarios, and the Unreal Engine and Autodesk Maya source files for each scenario. The files are publicly released as open source at www.nuReality.org, to support the academic community studying the critical AV-pedestrian interaction.
△ Less
Submitted 12 January, 2022;
originally announced January 2022.
-
LEAF: Navigating Concept Drift in Cellular Networks
Authors:
Shinan Liu,
Francesco Bronzino,
Paul Schmitt,
Arjun Nitin Bhagoji,
Nick Feamster,
Hector Garcia Crespo,
Timothy Coyle,
Brian Ward
Abstract:
Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Yet, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target to be predicted changes. Mitigating concept drift is an essential part of operationalizing machine learning models in…
▽ More
Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Yet, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target to be predicted changes. Mitigating concept drift is an essential part of operationalizing machine learning models in general, but is of particular importance in networking's highly dynamic deployment environments. In this paper, we first characterize concept drift in a large cellular network for a major metropolitan area in the United States. We find that concept drift occurs across many important key performance indicators (KPIs), independently of the model, training set size, and time interval -- thus necessitating practical approaches to detect, explain, and mitigate it. We then show that frequent model retraining with newly available data is not sufficient to mitigate concept drift, and can even degrade model accuracy further. Finally, we develop a new methodology for concept drift mitigation, Local Error Approximation of Features (LEAF). LEAF works by detecting drift; explaining the features and time intervals that contribute the most to drift; and mitigates it using forgetting and over-sampling. We evaluate LEAF against industry-standard mitigation approaches (notably, periodic retraining) with more than four years of cellular KPI data. Our initial tests with a major cellular provider in the US show that LEAF consistently outperforms periodic and triggered retraining on complex, real-world data while reducing costly retraining operations.
△ Less
Submitted 2 February, 2023; v1 submitted 7 September, 2021;
originally announced September 2021.
-
A Pluralist Approach to Democratizing Online Discourse
Authors:
Jay Chen,
Barath Raghavan,
Paul Schmitt,
Tai Liu
Abstract:
Online discourse takes place in corporate-controlled spaces thought by users to be public realms. These platforms in name enable free speech but in practice implement varying degrees of censorship either by government edict or by uneven and unseen corporate policy. This kind of censorship has no countervailing accountability mechanism, and as such platform owners, moderators, and algorithms shape…
▽ More
Online discourse takes place in corporate-controlled spaces thought by users to be public realms. These platforms in name enable free speech but in practice implement varying degrees of censorship either by government edict or by uneven and unseen corporate policy. This kind of censorship has no countervailing accountability mechanism, and as such platform owners, moderators, and algorithms shape public discourse without recourse or transparency.
Systems research has explored approaches to decentralizing or democratizing Internet infrastructure for decades. In parallel, the Internet censorship literature is replete with efforts to measure and overcome online censorship. However, in the course of designing specialized open-source platforms and tools, projects generally neglect the needs of supportive but uninvolved `average' users. In this paper, we propose a pluralistic approach to democratizing online discourse that considers both the systems-related and user-facing issues as first-order design goals.
△ Less
Submitted 28 August, 2021;
originally announced August 2021.
-
Characterizing Service Provider Response to the COVID-19 Pandemic in the United States
Authors:
Shinan Liu,
Paul Schmitt,
Francesco Bronzino,
Nick Feamster
Abstract:
The COVID-19 pandemic has resulted in dramatic changes to the daily habits of billions of people. Users increasingly have to rely on home broadband Internet access for work, education, and other activities. These changes have resulted in corresponding changes to Internet traffic patterns. This paper aims to characterize the effects of these changes with respect to Internet service providers in the…
▽ More
The COVID-19 pandemic has resulted in dramatic changes to the daily habits of billions of people. Users increasingly have to rely on home broadband Internet access for work, education, and other activities. These changes have resulted in corresponding changes to Internet traffic patterns. This paper aims to characterize the effects of these changes with respect to Internet service providers in the United States. We study three questions: (1)How did traffic demands change in the United States as a result of the COVID-19 pandemic?; (2)What effects have these changes had on Internet performance?; (3)How did service providers respond to these changes? We study these questions using data from a diverse collection of sources. Our analysis of interconnection data for two large ISPs in the United States shows a 30-60% increase in peak traffic rates in the first quarter of 2020. In particular, we observe traffic downstream peak volumes for a major ISP increase of 13-20% while upstream peaks increased by more than 30%. Further, we observe significant variation in performance across ISPs in conjunction with the traffic volume shifts, with evident latency increases after stay-at-home orders were issued, followed by a stabilization of traffic after April. Finally, we observe that in response to changes in usage, ISPs have aggressively augmented capacity at interconnects, at more than twice the rate of normal capacity augmentation. Similarly, video conferencing applications have increased their network footprint, more than doubling their advertised IP address space.
△ Less
Submitted 1 November, 2020;
originally announced November 2020.
-
Traffic Refinery: Cost-Aware Data Representation for Machine Learning on Network Traffic
Authors:
Francesco Bronzino,
Paul Schmitt,
Sara Ayoubi,
Hyojoon Kim,
Renata Teixeira,
Nick Feamster
Abstract:
Network management often relies on machine learning to make predictions about performance and security from network traffic. Often, the representation of the traffic is as important as the choice of the model. The features that the model relies on, and the representation of those features, ultimately determine model accuracy, as well as where and whether the model can be deployed in practice. Thus…
▽ More
Network management often relies on machine learning to make predictions about performance and security from network traffic. Often, the representation of the traffic is as important as the choice of the model. The features that the model relies on, and the representation of those features, ultimately determine model accuracy, as well as where and whether the model can be deployed in practice. Thus, the design and evaluation of these models ultimately requires understanding not only model accuracy but also the systems costs associated with deploying the model in an operational network. Towards this goal, this paper develops a new framework and system that enables a joint evaluation of both the conventional notions of machine learning performance (e.g., model accuracy) and the systems-level costs of different representations of network traffic. We highlight these two dimensions for two practical network management tasks, video streaming quality inference and malware detection, to demonstrate the importance of exploring different representations to find the appropriate operating point. We demonstrate the benefit of exploring a range of representations of network traffic and present Traffic Refinery, a proof-of-concept implementation that both monitors network traffic at 10 Gbps and transforms traffic in real time to produce a variety of feature representations for machine learning. Traffic Refinery both highlights this design space and makes it possible to explore different representations for learning, balancing systems costs related to feature extraction and model training against model accuracy.
△ Less
Submitted 7 June, 2021; v1 submitted 27 October, 2020;
originally announced October 2020.
-
Pretty Good Phone Privacy
Authors:
Paul Schmitt,
Barath Raghavan
Abstract:
To receive service in today's cellular architecture, phones uniquely identify themselves to towers and thus to operators. This is now a cause of major privacy violations, as operators now sell and leak identity and location data of hundreds of millions of mobile users.
In this paper, we take an end-to-end perspective on the cellular architecture and find key points of decoupling that enable us t…
▽ More
To receive service in today's cellular architecture, phones uniquely identify themselves to towers and thus to operators. This is now a cause of major privacy violations, as operators now sell and leak identity and location data of hundreds of millions of mobile users.
In this paper, we take an end-to-end perspective on the cellular architecture and find key points of decoupling that enable us to protect user identity and location privacy with no changes to physical infrastructure, no added latency, and no requirement of direct cooperation from existing operators.
We describe Pretty Good Phone Privacy (PGPP) and demonstrate how our modified backend stack (NGC) works with real phones to provide ordinary yet privacy-preserving connectivity. We explore inherent privacy and efficiency tradeoffs in a simulation of a large metropolitan region. We show how PGPP maintains today's control overheads while significantly improving user identity and location privacy.
△ Less
Submitted 28 December, 2020; v1 submitted 18 September, 2020;
originally announced September 2020.
-
New Directions in Automated Traffic Analysis
Authors:
Jordan Holland,
Paul Schmitt,
Nick Feamster,
Prateek Mittal
Abstract:
Despite the use of machine learning for many network traffic analysis tasks in security, from application identification to intrusion detection, the aspects of the machine learning pipeline that ultimately determine the performance of the model -- feature selection and representation, model selection, and parameter tuning -- remain manual and painstaking. This paper presents a method to automate m…
▽ More
Despite the use of machine learning for many network traffic analysis tasks in security, from application identification to intrusion detection, the aspects of the machine learning pipeline that ultimately determine the performance of the model -- feature selection and representation, model selection, and parameter tuning -- remain manual and painstaking. This paper presents a method to automate many aspects of traffic analysis, making it easier to apply machine learning techniques to a wider variety of traffic analysis tasks. We introduce nPrint, a tool that generates a unified packet representation that is amenable for representation learning and model training. We integrate nPrint with automated machine learning (AutoML), resulting in nPrintML, a public system that largely eliminates feature extraction and model tuning for a wide variety of traffic analysis tasks. We have evaluated nPrintML on eight separate traffic analysis tasks and released nPrint and nPrintML to enable future work to extend these methods.
△ Less
Submitted 19 October, 2021; v1 submitted 6 August, 2020;
originally announced August 2020.
-
Can Encrypted DNS Be Fast?
Authors:
Austin Hounsel,
Paul Schmitt,
Kevin Borgolte,
Nick Feamster
Abstract:
In this paper, we study the performance of encrypted DNS protocols and conventional DNS from thousands of home networks in the United States, over one month in 2020. We perform these measurements from the homes of 2,693 participating panelists in the Federal Communications Commission's (FCC) Measuring Broadband America program. We found that clients do not have to trade DNS performance for privacy…
▽ More
In this paper, we study the performance of encrypted DNS protocols and conventional DNS from thousands of home networks in the United States, over one month in 2020. We perform these measurements from the homes of 2,693 participating panelists in the Federal Communications Commission's (FCC) Measuring Broadband America program. We found that clients do not have to trade DNS performance for privacy. For certain resolvers, DoT was able to perform faster than DNS in median response times, even as latency increased. We also found significant variation in DoH performance across recursive resolvers. Based on these results, we recommend that DNS clients (e.g., web browsers) should periodically conduct simple latency and response time measurements to determine which protocol and resolver a client should use. No single DNS protocol nor resolver performed the best for all clients.
△ Less
Submitted 27 July, 2021; v1 submitted 14 July, 2020;
originally announced July 2020.
-
Classifying Network Vendors at Internet Scale
Authors:
Jordan Holland,
Ross Teixeira,
Paul Schmitt,
Kevin Borgolte,
Jennifer Rexford,
Nick Feamster,
Jonathan Mayer
Abstract:
In this paper, we develop a method to create a large, labeled dataset of visible network device vendors across the Internet by mapping network-visible IP addresses to device vendors. We use Internet-wide scanning, banner grabs of network-visible devices across the IPv4 address space, and clustering techniques to assign labels to more than 160,000 devices. We subsequently probe these devices and us…
▽ More
In this paper, we develop a method to create a large, labeled dataset of visible network device vendors across the Internet by mapping network-visible IP addresses to device vendors. We use Internet-wide scanning, banner grabs of network-visible devices across the IPv4 address space, and clustering techniques to assign labels to more than 160,000 devices. We subsequently probe these devices and use features extracted from the responses to train a classifier that can accurately classify device vendors. Finally, we demonstrate how this method can be used to understand broader trends across the Internet by predicting device vendors in traceroutes from CAIDA's Archipelago measurement system and subsequently examining vendor distributions across these traceroutes.
△ Less
Submitted 24 June, 2020; v1 submitted 23 June, 2020;
originally announced June 2020.
-
Beyond the Trees: Resilient Multipath for Last-mile WISP Networks
Authors:
Bilal Saleem,
Paul Schmitt,
Jay Chen,
Barath Raghavan
Abstract:
Expanding the reach of the Internet is a topic of widespread interest today. Google and Facebook, among others, have begun investing substantial research efforts toward expanding Internet access at the edge. Compared to data center networks, which are relatively over-engineered, last-mile networks are highly constrained and end up being ultimately responsible for the performance issues that impact…
▽ More
Expanding the reach of the Internet is a topic of widespread interest today. Google and Facebook, among others, have begun investing substantial research efforts toward expanding Internet access at the edge. Compared to data center networks, which are relatively over-engineered, last-mile networks are highly constrained and end up being ultimately responsible for the performance issues that impact the user experience.
The most viable and cost-effective approach for providing last-mile connectivity has proved to be Wireless ISPs (WISPs), which rely on point-to-point wireless backhaul infrastructure to provide connectivity using cheap commodity wireless hardware. However, individual WISP network links are known to have poor reliability and the networks as a whole are highly cost constrained.
Motivated by these observations, we propose Wireless ISPs with Redundancy (WISPR), which leverages the cost-performance tradeoff inherent in commodity wireless hardware to move toward a greater number of inexpensive links in WISP networks thereby lowering costs. To take advantage of this new path diversity, we introduce a new, general protocol that provides increased performance, reliability, or a combination of the two.
△ Less
Submitted 27 February, 2020;
originally announced February 2020.
-
Encryption without Centralization: Distributing DNS Queries Across Recursive Resolvers
Authors:
Austin Hounsel,
Paul Schmitt,
Kevin Borgolte,
Nick Feamster
Abstract:
Emerging protocols such as DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) improve the privacy of DNS queries and responses. While this trend towards encryption is positive, deployment of these protocols has in some cases resulted in further centralization of the DNS, which introduces new challenges. In particular, centralization has consequences for performance, privacy, and availability; a potential…
▽ More
Emerging protocols such as DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) improve the privacy of DNS queries and responses. While this trend towards encryption is positive, deployment of these protocols has in some cases resulted in further centralization of the DNS, which introduces new challenges. In particular, centralization has consequences for performance, privacy, and availability; a potentially greater concern is that it has become more difficult to control the choice of DNS recursive resolver, particularly for IoT devices. Ultimately, the best strategy for selecting among one or more recursive resolvers may ultimately depend on circumstance, user, and even device. Accordingly, the DNS architecture must permit flexibility in allowing users, devices, and applications to specify these strategies. Towards this goal of increased de-centralization and improved flexibility, this paper presents the design and implementation of a refactored DNS resolver architecture that allows for de-centralized name resolution, preserving the benefits of encrypted DNS while satisfying other desirable properties, including performance and privacy.
△ Less
Submitted 21 September, 2021; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Comparing the Effects of DNS, DoT, and DoH on Web Performance
Authors:
Austin Hounsel,
Kevin Borgolte,
Paul Schmitt,
Jordan Holland,
Nick Feamster
Abstract:
Nearly every service on the Internet relies on the Domain Name System (DNS), which translates a human-readable name to an IP address before two endpoints can communicate. Today, DNS traffic is unencrypted, leaving users vulnerable to eavesdropping and tampering. Past work has demonstrated that DNS queries can reveal a user's browsing history and even what smart devices they are using at home. In r…
▽ More
Nearly every service on the Internet relies on the Domain Name System (DNS), which translates a human-readable name to an IP address before two endpoints can communicate. Today, DNS traffic is unencrypted, leaving users vulnerable to eavesdropping and tampering. Past work has demonstrated that DNS queries can reveal a user's browsing history and even what smart devices they are using at home. In response to these privacy concerns, two new protocols have been proposed: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT). Instead of sending DNS queries and responses in the clear, DoH and DoT establish encrypted connections between users and resolvers. By doing so, these protocols provide privacy and security guarantees that traditional DNS (Do53) lacks.
In this paper, we measure the effect of Do53, DoT, and DoH on query response times and page load times from five global vantage points. We find that although DoH and DoT response times are generally higher than Do53, both protocols can perform better than Do53 in terms of page load times. However, as throughput decreases and substantial packet loss and latency are introduced, web pages load fastest with Do53. Additionally, web pages successfully load more often with Do53 and DoT than DoH. Based on these results, we provide several recommendations to improve DNS performance, such as opportunistic partial responses and wire format caching.
△ Less
Submitted 23 February, 2020; v1 submitted 18 July, 2019;
originally announced July 2019.
-
Inferring Streaming Video Quality from Encrypted Traffic: Practical Models and Deployment Experience
Authors:
Paul Schmitt,
Francesco Bronzino,
Sara Ayoubi,
Guilherme Martins,
Renata Teixeira,
Nick Feamster
Abstract:
Inferring the quality of streaming video applications is important for Internet service providers, but the fact that most video streams are encrypted makes it difficult to do so. We develop models that infer quality metrics (\ie, startup delay and resolution) for encrypted streaming video services. Our paper builds on previous work, but extends it in several ways. First, the model works in deploym…
▽ More
Inferring the quality of streaming video applications is important for Internet service providers, but the fact that most video streams are encrypted makes it difficult to do so. We develop models that infer quality metrics (\ie, startup delay and resolution) for encrypted streaming video services. Our paper builds on previous work, but extends it in several ways. First, the model works in deployment settings where the video sessions and segments must be identified from a mix of traffic and the time precision of the collected traffic statistics is more coarse (\eg, due to aggregation). Second, we develop a single composite model that works for a range of different services (i.e., Netflix, YouTube, Amazon, and Twitch), as opposed to just a single service. Third, unlike many previous models, the model performs predictions at finer granularity (\eg, the precise startup delay instead of just detecting short versus long delays) allowing to draw better conclusions on the ongoing streaming quality. Fourth, we demonstrate the model is practical through a 16-month deployment in 66 homes and provide new insights about the relationships between Internet "speed" and the quality of the corresponding video streams, for a variety of services; we find that higher speeds provide only minimal improvements to startup delay and resolution.
△ Less
Submitted 14 August, 2019; v1 submitted 17 January, 2019;
originally announced January 2019.
-
Robust, Compliant Assembly via Optimal Belief Space Planning
Authors:
Florian Wirnshofer,
Philipp S. Schmitt,
Wendelin Feiten,
Georg v. Wichert,
Wolfram Burgard
Abstract:
In automated manufacturing, robots must reliably assemble parts of various geometries and low tolerances. Ideally, they plan the required motions autonomously. This poses a substantial challenge due to high-dimensional state spaces and non-linear contact-dynamics. Furthermore, object poses and model parameters, such as friction, are not exactly known and a source of uncertainty. The method propose…
▽ More
In automated manufacturing, robots must reliably assemble parts of various geometries and low tolerances. Ideally, they plan the required motions autonomously. This poses a substantial challenge due to high-dimensional state spaces and non-linear contact-dynamics. Furthermore, object poses and model parameters, such as friction, are not exactly known and a source of uncertainty. The method proposed in this paper models the task of parts assembly as a belief space planning problem over an underlying impedance-controlled, compliant system. To solve this planning problem we introduce an asymptotically optimal belief space planner by extending an optimal, randomized, kinodynamic motion planner to non-deterministic domains. Under an expansiveness assumption we establish probabilistic completeness and asymptotic optimality. We validate our approach in thorough, simulated and real-world experiments of multiple assembly tasks. The experiments demonstrate our planner's ability to reliably assemble objects, solely based on CAD models as input.
△ Less
Submitted 9 November, 2018;
originally announced November 2018.
-
Oblivious DNS: Practical Privacy for DNS Queries
Authors:
Paul Schmitt,
Anne Edmundson,
Nick Feamster
Abstract:
Virtually every Internet communication typically involves a Domain Name System (DNS) lookup for the destination server that the client wants to communicate with. Operators of DNS recursive resolvers---the machines that receive a client's query for a domain name and resolve it to a corresponding IP address---can learn significant information about client activity. Past work, for example, indicates…
▽ More
Virtually every Internet communication typically involves a Domain Name System (DNS) lookup for the destination server that the client wants to communicate with. Operators of DNS recursive resolvers---the machines that receive a client's query for a domain name and resolve it to a corresponding IP address---can learn significant information about client activity. Past work, for example, indicates that DNS queries reveal information ranging from web browsing activity to the types of devices that a user has in their home. Recognizing the privacy vulnerabilities associated with DNS queries, various third parties have created alternate DNS services that obscure a user's DNS queries from his or her Internet service provider. Yet, these systems merely transfer trust to a different third party. We argue that no single party ought to be able to associate DNS queries with a client IP address that issues those queries. To this end, we present Oblivious DNS (ODNS), which introduces an additional layer of obfuscation between clients and their queries. To do so, ODNS uses its own authoritative namespace; the authoritative servers for the ODNS namespace act as recursive resolvers for the DNS queries that they receive, but they never see the IP addresses for the clients that initiated these queries. We present an initial deployment of ODNS; our experiments show that ODNS introduces minimal performance overhead, both for individual queries and for web page loads. We design ODNS to be compatible with existing DNS protocols and infrastructure, and we are actively working on an open standard with the IETF.
△ Less
Submitted 11 December, 2018; v1 submitted 1 June, 2018;
originally announced June 2018.
-
Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining
Authors:
Ryan J. Urbanowicz,
Randal S. Olson,
Peter Schmitt,
Melissa Meeker,
Jason H. Moore
Abstract:
Modern biomedical data mining requires feature selection methods that can (1) be applied to large scale feature spaces (e.g. `omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) are computationally…
▽ More
Modern biomedical data mining requires feature selection methods that can (1) be applied to large scale feature spaces (e.g. `omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) are computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the `Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open source framework called ReBATE (Relief-Based Algorithm Training Environment). We apply a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods, over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types.
△ Less
Submitted 2 April, 2018; v1 submitted 22 November, 2017;
originally announced November 2017.
-
OCDN: Oblivious Content Distribution Networks
Authors:
Anne Edmundson,
Paul Schmitt,
Nick Feamster,
Jennifer Rexford
Abstract:
As publishers increasingly use Content Distribution Networks (CDNs) to distribute content across geographically diverse networks, CDNs themselves are becoming unwitting targets of requests for both access to user data and content takedown. From copyright infringement to moderation of online speech, CDNs have found themselves at the forefront of many recent legal quandaries. At the heart of the ten…
▽ More
As publishers increasingly use Content Distribution Networks (CDNs) to distribute content across geographically diverse networks, CDNs themselves are becoming unwitting targets of requests for both access to user data and content takedown. From copyright infringement to moderation of online speech, CDNs have found themselves at the forefront of many recent legal quandaries. At the heart of the tension, however, is the fact that CDNs have rich information both about the content they are serving and the users who are requesting that content. This paper offers a technical contribution that is relevant to this ongoing tension with the design of an Oblivious CDN (OCDN); the system is both compatible with the existing Web ecosystem of publishers and clients and hides from the CDN both the content it is serving and the users who are requesting that content. OCDN is compatible with the way that publishers currently host content on CDNs. Using OCDN, publishers can use multiple CDNs to publish content; clients retrieve content through a peer-to-peer anonymizing network of proxies. Our prototype implementation and evaluation of OCDN show that the system can obfuscate both content and clients from the CDN operator while still delivering content with good performance.
△ Less
Submitted 4 November, 2017;
originally announced November 2017.
-
Rangzen: Anonymously Getting the Word Out in a Blackout
Authors:
Adam Lerner,
Giulia Fanti,
Yahel Ben-David,
Jesus Garcia,
Paul Schmitt,
Barath Raghavan
Abstract:
In recent years governments have shown themselves willing to impose blackouts to shut off key communication infrastructure during times of civil strife, and to surveil citizen communications whenever possible. However, it is exactly during such strife that citizens need reliable and anonymous communications the most. In this paper, we present Rangzen, a system for anonymous broadcast messaging dur…
▽ More
In recent years governments have shown themselves willing to impose blackouts to shut off key communication infrastructure during times of civil strife, and to surveil citizen communications whenever possible. However, it is exactly during such strife that citizens need reliable and anonymous communications the most. In this paper, we present Rangzen, a system for anonymous broadcast messaging during network blackouts. Rangzen is distinctive in both aim and design. Our aim is to provide an anonymous, one-to-many messaging layer that requires only users' smartphones and can withstand network-level attacks. Our design is a delay-tolerant mesh network which deprioritizes adversarial messages by means of a social graph while preserving user anonymity. We built a complete implementation that runs on Android smartphones, present benchmarks of its performance and battery usage, and present simulation results suggesting Rangzen's efficacy at scale.
△ Less
Submitted 10 December, 2016;
originally announced December 2016.