Showing 1–15 of 15 results for author: Mussmann, S

  1. arXiv:2401.06692

    cs.CL cs.AI cs.LG

    An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models

    Authors: Gantavya Bhatt, Yifang Chen, Arnav M. Das, Jifan Zhang, Sang T. Truong, Stephen Mussmann, Yinglun Zhu, Jeffrey Bilmes, Simon S. Du, Kevin Jamieson, Jordan T. Ash, Robert D. Nowak

    Abstract: Supervised finetuning (SFT) on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities observed in modern large language models (LLMs). However, the annotation efforts required to produce high quality responses for instructions are becoming prohibitively expensive, especially as the number of tasks spanned by instruction datasets continues t…

    Submitted 7 July, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: Accepted to Findings of the Association for Computational Linguistics: ACL 2024

  2. arXiv:2306.09910

    cs.LG cs.AI cs.CV

    LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning

    Authors: Jifan Zhang, Yifang Chen, Gregory Canal, Stephen Mussmann, Arnav M. Das, Gantavya Bhatt, Yinglun Zhu, Jeffrey Bilmes, Simon Shaolei Du, Kevin Jamieson, Robert D Nowak

    Abstract: Labeled data are critical to modern machine learning applications, but obtaining labels can be expensive. To mitigate this cost, machine learning methods, such as transfer learning, semi-supervised learning and active learning, aim to be label-efficient: achieving high predictive performance from relatively few labeled examples. While obtaining the best label-efficiency in practice often requires…

    Submitted 1 March, 2024; v1 submitted 16 June, 2023; originally announced June 2023.

  3. arXiv:2304.14108

    cs.CV cs.CL cs.LG

    DataComp: In search of the next generation of multimodal datasets

    Authors: Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, et al. (9 additional authors not shown)

    Abstract: Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Commo…

    Submitted 20 October, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

    Comments: NeurIPS 2023 Datasets and Benchmarks Track

  4. arXiv:2303.04068

    cs.DB cs.CV cs.SD eess.AS

    VOCALExplore: Pay-as-You-Go Video Data Exploration and Model Building [Technical Report]

    Authors: Maureen Daum, Enhao Zhang, Dong He, Stephen Mussmann, Brandon Haynes, Ranjay Krishna, Magdalena Balazinska

    Abstract: We introduce VOCALExplore, a system designed to support users in building domain-specific models over video datasets. VOCALExplore supports interactive labeling sessions and trains models using user-supplied labels. VOCALExplore maximizes model quality by automatically deciding how to select samples based on observed skew in the collected labels. It also selects the optimal video representations t…

    Submitted 29 September, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

  5. arXiv:2211.09283

    cs.LG

    Active Learning with Expected Error Reduction

    Authors: Stephen Mussmann, Julia Reisler, Daniel Tsai, Ehsan Mousavi, Shayne O'Brien, Moises Goldszmidt

    Abstract: Active learning has been studied extensively as a method for efficient data collection. Among the many approaches in the literature, Expected Error Reduction (EER) (Roy and McCallum) has been shown to be an effective method for active learning: select the candidate sample that, in expectation, maximally decreases the error on an unlabeled set. However, EER requires the model to be retrained for every…

    Submitted 16 November, 2022; originally announced November 2022.

  6. arXiv:2103.02761

    cs.LG stat.ML

    Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation

    Authors: Mayee F. Chen, Benjamin Cohen-Wang, Stephen Mussmann, Frederic Sala, Christopher Ré

    Abstract: Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this via a framework centered on model misspecification…

    Submitted 3 March, 2021; originally announced March 2021.

    Comments: To appear in AISTATS 2021

  7. arXiv:2010.05103

    cs.CL cs.LG

    On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

    Authors: Stephen Mussmann, Robin Jia, Percy Liang

    Abstract: Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., $99.99\%$ of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: State-of-the-art models trained on QQP and WikiQA…

    Submitted 10 October, 2020; originally announced October 2020.

    Comments: In Findings of EMNLP 2020

  8. arXiv:2007.04612

    cs.LG stat.ML

    Concept Bottleneck Models

    Authors: Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang

    Abstract: We seek to learn models that we can interact with using high-level concepts: if the model did not think there was a bone spur in the x-ray, would it still predict severe arthritis? State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs", as they are trained end-to-end to go directly from raw input (e.g., pixels) to output (e.g., arthri…

    Submitted 28 December, 2020; v1 submitted 9 July, 2020; originally announced July 2020.

    Comments: Edited for clarity from the ICML 2020 version

  9. arXiv:1906.11829

    cs.LG stat.ML

    Selection via Proxy: Efficient Data Selection for Deep Learning

    Authors: Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia

    Abstract: Data selection methods, such as active learning and core-set selection, are useful tools for machine learning on large datasets. However, they can be prohibitively expensive to apply in deep learning because they depend on feature representations that need to be learned. In this work, we show that we can greatly improve the computational efficiency by using a small proxy model to perform data sele…

    Submitted 26 October, 2020; v1 submitted 26 June, 2019; originally announced June 2019.

    Comments: ICLR 2020

  10. arXiv:1906.11385

    cs.DS cs.AI cs.CC

    A Tight Analysis of Greedy Yields Subexponential Time Approximation for Uniform Decision Tree

    Authors: Ray Li, Percy Liang, Stephen Mussmann

    Abstract: Decision Tree is a classic formulation of active learning: given $n$ hypotheses with nonnegative weights summing to 1 and a set of tests that each partition the hypotheses, output a decision tree using the provided tests that uniquely identifies each hypothesis and has minimum (weighted) average depth. Previous works showed that the greedy algorithm achieves an $O(\log n)$ approximation ratio for t…

    Submitted 21 October, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

    Comments: 40 pages, 5 figures

  11. arXiv:1812.01815

    cs.LG stat.ML

    Uncertainty Sampling is Preconditioned Stochastic Gradient Descent on Zero-One Loss

    Authors: Stephen Mussmann, Percy Liang

    Abstract: Uncertainty sampling, a popular active learning algorithm, is used to reduce the amount of data required to learn a classifier, but it has been observed in practice to converge to different parameters depending on the initialization and sometimes to even better parameters than standard training on all the data. In this work, we give a theoretical explanation of this phenomenon, showing that uncert…

    Submitted 4 December, 2018; originally announced December 2018.

    Comments: NeurIPS 2018

  12. arXiv:1806.06123

    cs.LG stat.ML

    On the Relationship between Data Efficiency and Error for Uncertainty Sampling

    Authors: Stephen Mussmann, Percy Liang

    Abstract: While active learning offers potential cost savings, the actual data efficiency---the reduction in amount of labeled data needed to obtain the same error rate---observed in practice is mixed. This paper poses a basic question: when is active learning actually helpful? We provide an answer for logistic regression with the popular active learning algorithm, uncertainty sampling. Empirically, on 21 d…

    Submitted 15 June, 2018; originally announced June 2018.

  13. arXiv:1802.09751

    cs.AI cs.DS

    Generalized Binary Search For Split-Neighborly Problems

    Authors: Stephen Mussmann, Percy Liang

    Abstract: In sequential hypothesis testing, Generalized Binary Search (GBS) greedily chooses the test with the highest information gain at each step. It is known that GBS obtains the gold standard query cost of $O(\log n)$ for problems satisfying the $k$-neighborly condition, which requires any two tests to be connected by a sequence of tests where neighboring tests disagree on at most $k$ hypotheses. In th…

    Submitted 27 February, 2018; originally announced February 2018.

    Comments: AISTATS 2018

  14. arXiv:1707.03372

    cs.LG stat.ML

    Fast Amortized Inference and Learning in Log-linear Models with Randomly Perturbed Nearest Neighbor Search

    Authors: Stephen Mussmann, Daniel Levy, Stefano Ermon

    Abstract: Inference in log-linear models scales linearly in the size of output space in the worst-case. This is often a bottleneck in natural language processing and computer vision tasks when the output space is feasibly enumerable but very large. We propose a method to perform inference in log-linear models with sublinear amortized cost. Our idea hinges on using Gumbel random variable perturbations and a…

    Submitted 11 July, 2017; originally announced July 2017.

    Comments: In UAI proceedings

  15. arXiv:1501.00614

    cs.CV

    Understanding Trajectory Behavior: A Motion Pattern Approach

    Authors: Mahdi M. Kalayeh, Stephen Mussmann, Alla Petrakova, Niels da Vitoria Lobo, Mubarak Shah

    Abstract: Mining the underlying patterns in gigantic and complex data is of great importance to data analysts. In this paper, we propose a motion pattern approach to mine frequent behaviors in trajectory data. Motion patterns, defined by a set of highly similar flow vector groups in a spatial locality, have been shown to be very effective in extracting dominant motion behaviors in video sequences. Inspired…

    Submitted 3 January, 2015; originally announced January 2015.