-
Proof: Accelerating Approximate Aggregation Queries with Expensive Predicates
Authors:
Daniel Kang,
John Guibas,
Peter Bailis,
Tatsunori Hashimoto,
Yi Sun,
Matei Zaharia
Abstract:
Given a dataset $\mathcal{D}$, we are interested in computing the mean of a subset of $\mathcal{D}$ which matches a predicate. ABae leverages stratified sampling and proxy models to efficiently compute this statistic given a sampling budget $N$. In this document, we theoretically analyze ABae and show that the MSE of the estimate decays at rate $O(N_1^{-1} + N_2^{-1} + N_1^{1/2}N_2^{-3/2})$, where…
▽ More
Given a dataset $\mathcal{D}$, we are interested in computing the mean of a subset of $\mathcal{D}$ which matches a predicate. ABae leverages stratified sampling and proxy models to efficiently compute this statistic given a sampling budget $N$. In this document, we theoretically analyze ABae and show that the MSE of the estimate decays at rate $O(N_1^{-1} + N_2^{-1} + N_1^{1/2}N_2^{-3/2})$, where $N=K \cdot N_1+N_2$ for some integer constant $K$ and $K \cdot N_1$ and $N_2$ represent the number of samples used in Stage 1 and Stage 2 of ABae respectively. Hence, if a constant fraction of the total sample budget $N$ is allocated to each stage, we will achieve a mean squared error of $O(N^{-1})$ which matches the rate of mean squared error of the optimal stratified sampling algorithm given a priori knowledge of the predicate positive rate and standard deviation per stratum.
△ Less
Submitted 28 July, 2021; v1 submitted 26 July, 2021;
originally announced July 2021.
-
Allocation of Fungible Resources via a Fast, Scalable Price Discovery Method
Authors:
Akshay Agrawal,
Stephen Boyd,
Deepak Narayanan,
Fiodar Kazhamiaka,
Matei Zaharia
Abstract:
We consider the problem of assigning or allocating resources to a set of jobs. We consider the case when the resources are fungible, that is, the job can be done with any mix of the resources, but with different efficiencies. In our formulation we maximize a total utility subject to a given limit on the resource usage, which is a convex optimization problem and so is tractable. In this paper we de…
▽ More
We consider the problem of assigning or allocating resources to a set of jobs. We consider the case when the resources are fungible, that is, the job can be done with any mix of the resources, but with different efficiencies. In our formulation we maximize a total utility subject to a given limit on the resource usage, which is a convex optimization problem and so is tractable. In this paper we develop a custom, parallelizable algorithm for solving the resource allocation problem that scales to large problems, with millions of jobs. Our algorithm is based on the dual problem, in which the dual variables associated with the resource usage limit can be interpreted as resource prices. Our method updates the resource prices in each iteration, ultimately discovering the optimal resource prices, from which an optimal allocation is obtained. We provide an open-source implementation of our method, which can solve problems with millions of jobs in a few seconds on CPU, and under a second on a GPU; our software can solve smaller problems in milliseconds. On large problems, our implementation is up to three orders of magnitude faster than a commerical solver for convex optimization.
△ Less
Submitted 18 April, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
Efficient Online ML API Selection for Multi-Label Classification Tasks
Authors:
Lingjiao Chen,
Matei Zaharia,
James Zou
Abstract:
Multi-label classification tasks such as OCR and multi-object recognition are a major focus of the growing machine learning as a service industry. While many multi-label prediction APIs are available, it is challenging for users to decide which API to use for their own data and budget, due to the heterogeneity in those APIs' price and performance. Recent work shows how to select from single-label…
▽ More
Multi-label classification tasks such as OCR and multi-object recognition are a major focus of the growing machine learning as a service industry. While many multi-label prediction APIs are available, it is challenging for users to decide which API to use for their own data and budget, due to the heterogeneity in those APIs' price and performance. Recent work shows how to select from single-label prediction APIs. However the computation complexity of the previous approach is exponential in the number of labels and hence is not suitable for settings like OCR. In this work, we propose FrugalMCT, a principled framework that adaptively selects the APIs to use for different data in an online fashion while respecting user's budget. The API selection problem is cast as an integer linear program, which we show has a special structure that we leverage to develop an efficient online API selector with strong performance guarantees. We conduct systematic experiments using ML APIs from Google, Microsoft, Amazon, IBM, Tencent and other providers for tasks including multi-label image classification, scene text recognition and named entity recognition. Across diverse tasks, FrugalMCT can achieve over 90% cost reduction while matching the accuracy of the best single API, or up to 8% better accuracy while matching the best API's cost.
△ Less
Submitted 16 July, 2022; v1 submitted 17 February, 2021;
originally announced February 2021.