
Showing 1–6 of 6 results for author: Merrick, L

  1. arXiv:2508.10975  [pdf, ps, other]

    cs.LG cs.CL

    BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

    Authors: DatologyAI: Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza, Fan Pan, Jack Urbanek, Paul Burstein, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Charvi Bannur, Christina Baek, Darren Teh, David Schwab, Haakon Mongstad, Haoli Yin, Josh Wills, Kaleigh Mentzer, Luke Merrick, Ricardo Monti, Rishabh Adiga, et al. (6 additional authors not shown)

    Abstract: Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we i…

    Submitted 19 August, 2025; v1 submitted 14 August, 2025; originally announced August 2025.

    Comments: Blog version can be viewed at: http://blog.datologyai.com/beyondweb

  2. arXiv:2412.04506  [pdf, other]

    cs.CL cs.IR cs.LG

    Arctic-Embed 2.0: Multilingual Retrieval Without Compromise

    Authors: Puxuan Yu, Luke Merrick, Gaurav Nuti, Daniel Campos

    Abstract: This paper presents the training methodology of Arctic-Embed 2.0, a set of open-source text embedding models built for accurate and efficient multilingual retrieval. While prior works have suffered from degraded English retrieval quality, Arctic-Embed 2.0 delivers competitive retrieval quality on multilingual and English-only benchmarks, and supports Matryoshka Representation Learning (MRL) for ef…

    Submitted 13 December, 2024; v1 submitted 3 December, 2024; originally announced December 2024.

    Comments: 10 pages, 5 figures, 3 tables

  3. arXiv:2407.18887  [pdf, other]

    cs.LG cs.CL

    Embedding And Clustering Your Data Can Improve Contrastive Pretraining

    Authors: Luke Merrick

    Abstract: Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: 16 pages, 3 figures, 2 tables

  4. arXiv:2405.05374  [pdf, other]

    cs.CL cs.AI cs.IR

    Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models

    Authors: Luke Merrick, Danmei Xu, Gaurav Nuti, Daniel Campos

    Abstract: This report describes the training dataset creation and recipe behind the family of \texttt{arctic-embed} text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of their size on the MTEB Retrieval leaderboard, with the…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: 17 pages, 11 figures, 9 tables

  5. arXiv:1910.00174  [pdf, other]

    cs.LG stat.ML

    Randomized Ablation Feature Importance

    Authors: Luke Merrick

    Abstract: Given a model $f$ that predicts a target $y$ from a vector of input features $\pmb{x} = x_1, x_2, \ldots, x_M$, we seek to measure the importance of each feature with respect to the model's ability to make a good prediction. To this end, we consider how (on average) some measure of goodness or badness of prediction (which we term "loss" $\ell$), changes when we hide or ablate each feature from the…

    Submitted 1 October, 2019; v1 submitted 30 September, 2019; originally announced October 2019.

  6. arXiv:1909.08128  [pdf, other]

    cs.LG cs.AI stat.ML

    The Explanation Game: Explaining Machine Learning Models Using Shapley Values

    Authors: Luke Merrick, Ankur Taly

    Abstract: A number of techniques have been proposed to explain a machine learning model's prediction by attributing it to the corresponding input features. Popular among these are techniques that apply the Shapley value method from cooperative game theory. While existing papers focus on the axiomatic motivation of Shapley values, and efficient techniques for computing them, they offer little justification f…

    Submitted 25 June, 2020; v1 submitted 17 September, 2019; originally announced September 2019.