
Showing 1–7 of 7 results for author: Nussbaum, Z

Searching in archive cs.
  1. arXiv:2505.15511  [pdf, ps, other]

    cs.LG

    NOMAD Projection

    Authors: Brandon Duderstadt, Zach Nussbaum, Laurens van der Maaten

    Abstract: The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept up with the pace of dataset scaling. This presents a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this p…

    Submitted 21 May, 2025; originally announced May 2025.

  2. arXiv:2502.07972  [pdf, other]

    cs.CL cs.AI cs.IR

    Training Sparse Mixture Of Experts Text Embedding Models

    Authors: Zach Nussbaum, Brandon Duderstadt

    Abstract: Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. However, this scaling approach introduces significant deployment challenges, including increased inference latency and memory usage. These challenges are particularly severe in retrieval-augmented generation (RAG) applications, where large models' increas…

    Submitted 9 March, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

  3. arXiv:2412.01007  [pdf, other]

    cs.CL cs.IR

    CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking

    Authors: Tarun Suresh, Revanth Gangi Reddy, Yifei Xu, Zach Nussbaum, Andriy Mulyar, Brandon Duderstadt, Heng Ji

    Abstract: Effective code retrieval plays a crucial role in advancing code generation, bug fixing, and software maintenance, particularly as software systems increase in complexity. While current code embedding models have demonstrated promise in retrieving code snippets for small-scale, well-defined tasks, they often underperform in more demanding real-world applications such as bug localization within GitH…

    Submitted 3 March, 2025; v1 submitted 1 December, 2024; originally announced December 2024.

    Comments: Published as a conference paper at ICLR 2025. First and second author had equal contribution

  4. arXiv:2406.18587  [pdf, other]

    cs.CV cs.AI

    Nomic Embed Vision: Expanding the Latent Space

    Authors: Zach Nussbaum, Brandon Duderstadt, Andriy Mulyar

    Abstract: This technical report describes the training of nomic-embed-vision, a highly performant, open-code, open-weights image embedding model that shares the same latent space as nomic-embed-text. Together, nomic-embed-vision and nomic-embed-text form the first unified latent space to achieve high performance across vision, language, and multimodal tasks.

    Submitted 6 June, 2024; originally announced June 2024.

  5. arXiv:2402.01613  [pdf, ps, other]

    cs.CL cs.AI

    Nomic Embed: Training a Reproducible Long Context Text Embedder

    Authors: Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar

    Abstract: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 lic…

    Submitted 3 February, 2025; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted to TMLR https://openreview.net/forum?id=IPmzyQSiQE

  6. arXiv:2311.04931  [pdf, other]

    cs.CL cs.AI

    GPT4All: An Ecosystem of Open Source Compressed Language Models

    Authors: Yuvanesh Anand, Zach Nussbaum, Adam Treat, Aaron Miller, Richard Guo, Ben Schmidt, GPT4All Community, Brandon Duderstadt, Andriy Mulyar

    Abstract: Large language models (LLMs) have recently achieved human-level performance on a range of professional and academic benchmarks. The accessibility of these models has lagged behind their performance. State-of-the-art LLMs require costly infrastructure; are only accessible via rate-limited, geo-locked, and censored web interfaces; and lack publicly available code and technical reports. In this paper…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: Accepted at NLP-OSS at EMNLP 2023

  7. arXiv:2107.13098  [pdf, other]

    cs.CV cs.LG

    A Tale Of Two Long Tails

    Authors: Daniel D'souza, Zach Nussbaum, Chirag Agarwal, Sara Hooker

    Abstract: As machine learning models are increasingly employed to assist human decision-makers, it becomes critical to communicate the uncertainty associated with these model predictions. However, the majority of work on uncertainty has focused on traditional probabilistic or ranking approaches - where the model assigns low probabilities or scores to uncertain examples. While this captures what examples are…

    Submitted 27 July, 2021; originally announced July 2021.

    Comments: Preliminary results accepted to Workshop on Uncertainty and Robustness in Deep Learning (UDL), ICML, 2021