-
The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems
Authors:
Lei Zhang,
Vaastav Anand,
Zhiqiang Xie,
Ymir Vigfusson,
Jonathan Mace
Abstract:
Today's distributed tracing frameworks are ill-equipped to troubleshoot rare edge-case requests. The crux of the problem is a trade-off between specificity and overhead. On the one hand, frameworks can indiscriminately select requests to trace when they enter the system (head sampling), but this is unlikely to capture a relevant edge-case trace because the framework cannot know which requests will be problematic until after the fact. On the other hand, frameworks can trace everything and later keep only the interesting edge-case traces (tail sampling), but this imposes high overhead on the traced application and enormous data ingestion costs.
In this paper we circumvent this trade-off for any edge-case with symptoms that can be programmatically detected, such as high tail latency, errors, and bottlenecked queues. We propose a lightweight and always-on distributed tracing system, Hindsight, which implements a retroactive sampling abstraction: instead of eagerly ingesting and processing traces, Hindsight lazily retrieves trace data only after symptoms of a problem are detected. Hindsight is analogous to a car dash-cam that, upon detecting a sudden jolt in momentum, persists the last hour of footage. Developers using Hindsight receive the exact edge-case traces they desire without undue overhead or dependence on luck. Our evaluation shows that Hindsight scales to millions of requests per second, adds nanosecond-level overhead to generate trace data, handles GB/s of data per node, transparently integrates with existing distributed tracing systems, and successfully persists full, detailed traces in real-world use cases when edge-case problems are detected.
Submitted 26 April, 2022; v1 submitted 11 February, 2022;
originally announced February 2022.
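The retroactive sampling idea in the Hindsight abstract above can be illustrated with a small sketch. This is not Hindsight's implementation or API: the class `RetroactiveBuffer`, its methods, and the fixed-capacity in-memory buffer are all hypothetical names and assumptions chosen for illustration. The sketch records every event cheaply into a bounded buffer that overwrites old data, and copies out the events of one trace only when a symptom triggers retrieval.

```python
from collections import deque


class RetroactiveBuffer:
    """Hypothetical per-node trace buffer (an illustration, not Hindsight's API)."""

    def __init__(self, capacity):
        # A bounded deque silently discards the oldest event when full,
        # so recording stays cheap and memory use stays fixed.
        self.events = deque(maxlen=capacity)
        # Traces that were persisted because a symptom was detected.
        self.persisted = {}

    def record(self, trace_id, payload):
        """Always-on recording: append the event, nothing is ingested yet."""
        self.events.append((trace_id, payload))

    def trigger(self, trace_id):
        """Called after a symptom is detected: keep what is still buffered."""
        kept = [payload for tid, payload in self.events if tid == trace_id]
        self.persisted[trace_id] = kept
        return kept


if __name__ == "__main__":
    buf = RetroactiveBuffer(capacity=4)
    for tid, payload in [("a", "s1"), ("b", "s1"), ("a", "s2"), ("b", "s2"), ("a", "s3")]:
        buf.record(tid, payload)
    # Suppose request "a" later shows high latency: retrieve its buffered events.
    print(buf.trigger("a"))
```

In this sketch, events older than the buffer capacity are lost, which mirrors the dash-cam analogy in the abstract: only recent footage is available when the trigger fires.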
-
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Authors:
Arpan Gujarati,
Reza Karimi,
Safya Alzayat,
Wei Hao,
Antoine Kaufmann,
Ymir Vigfusson,
Jonathan Mace
Abstract:
Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times. Yet the underlying execution times are not fundamentally unpredictable; on the contrary, we observe that inference using Deep Neural Network (DNN) models has deterministic performance. Here, starting with the predictable execution times of individual DNN inferences, we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance. We evaluate our implementation, Clockwork, using production trace workloads, and show that Clockwork can support thousands of models while simultaneously meeting 100ms latency targets for 99.9999% of requests. We further demonstrate that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation.
Submitted 26 October, 2020; v1 submitted 3 June, 2020;
originally announced June 2020.
-
Aggressive, Repetitive, Intentional, Visible, and Imbalanced: Refining Representations for Cyberbullying Classification
Authors:
Caleb Ziems,
Ymir Vigfusson,
Fred Morstatter
Abstract:
Cyberbullying is a pervasive problem in online communities. To identify cyberbullying cases in large-scale social networks, content moderators depend on machine learning classifiers for automatic cyberbullying detection. However, existing models remain unfit for real-world applications, largely due to a shortage of publicly available training data and a lack of standard criteria for assigning ground truth labels. In this study, we address the need for reliable data using an original annotation framework. Inspired by social sciences research into bullying behavior, we characterize the nuanced problem of cyberbullying using five explicit factors to represent its social and linguistic aspects. We model this behavior using social network and language-based features, which improve classifier performance. These results demonstrate the importance of representing and modeling cyberbullying as a social phenomenon.
Submitted 3 April, 2020;
originally announced April 2020.
-
MITHRIL: Mining Sporadic Associations for Cache Prefetching
Authors:
Juncheng Yang,
Reza Karimi,
Trausti Sæmundsson,
Avani Wildani,
Ymir Vigfusson
Abstract:
The growing pressure on cloud application scalability has accentuated storage performance as a critical bottleneck. Although cache replacement algorithms have been extensively studied, cache prefetching (reducing latency by retrieving items before they are actually requested) remains an underexplored area. Existing approaches to history-based prefetching, in particular, provide too few benefits for real systems for the resources they cost. We propose MITHRIL, a prefetching layer that efficiently exploits historical patterns in cache request associations. MITHRIL is inspired by sporadic association rule mining and relies only on the timestamps of requests. Through evaluation of 135 block-storage traces, we show that MITHRIL is effective, giving an average 55% hit ratio increase over LRU and PROBABILITY GRAPH, and a 36% hit ratio gain over AMP, at reasonable cost. We further show that MITHRIL can supplement any cache replacement algorithm and be readily integrated into existing systems. Furthermore, we demonstrate that the improvement comes from MITHRIL being able to capture mid-frequency blocks.
Submitted 21 May, 2017;
originally announced May 2017.
-
Wireless Scheduling Algorithms in Complex Environments
Authors:
Helga Gudmundsdottir,
Eyjólfur I Ásgeirsson,
Marijke H. L. Bodlaender,
Joseph T. Foley,
Magnús M. Halldórsson,
Ymir Vigfusson
Abstract:
Efficient spectrum use in wireless sensor networks through spatial reuse requires effective models of packet reception at the physical layer in the presence of interference. Despite recent progress in analytic and simulation research into worst-case behavior from interference effects, these efforts generally assume geometric path loss and isotropic transmission, assumptions which have not been borne out in experiments.
Our paper aims to provide a methodology for grounding theoretical results on wireless interference in experimental reality. We develop a new framework for wireless algorithms in which distance-based path loss is replaced by an arbitrary gain matrix, typically obtained by measurements of received signal strength (RSS). Gain matrices allow for the modeling of complex environments, e.g., with obstacles and walls. We experimentally evaluate the framework in two indoor testbeds with 20 and 60 motes, and confirm superior predictive performance in packet reception rate for a gain matrix model over a geometric distance-based model.
At the heart of our approach is a new parameter $ζ$ called metricity which indicates how close the gain matrix is to a distance metric, effectively measuring the complexity of the environment. A powerful theoretical feature of this parameter is that all known SINR scheduling algorithms that work in general metric spaces carry over to arbitrary gain matrices and achieve equivalent performance guarantees in terms of $ζ$ as previously obtained in terms of the path loss constant. Our experiments confirm the sensitivity of $ζ$ to the nature of the environment. Finally, we show analytically and empirically how multiple channels can be leveraged to improve metricity and thereby performance. We believe our contributions will facilitate experimental validation for recent advances in algorithms for physical wireless interference models.
Submitted 16 May, 2014; v1 submitted 8 January, 2014;
originally announced January 2014.