Showing 1–16 of 16 results for author: Shah, D J

  1. arXiv:2506.14111  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Essential-Web v1.0: 24T tokens of organized web data

    Authors: Essential AI: Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah, Ishaan Shah, Karl Stratos, Khoi Nguyen, Kurt Smith, Michael Callahan, Peter Rushton, Philip Monk, Platon Mazarakis, Saad Jamal, Saurabh Srivastava, Somanshu Singla, Ashish Vaswani

    Abstract: Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels ar…

    Submitted 19 June, 2025; v1 submitted 16 June, 2025; originally announced June 2025.

    Comments: include MegaMath-Web-Pro

  2. arXiv:2505.02222  [pdf, other]

    cs.LG stat.ML

    Practical Efficiency of Muon for Pretraining

    Authors: Essential AI: Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani

    Abstract: We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study th…

    Submitted 19 May, 2025; v1 submitted 4 May, 2025; originally announced May 2025.

  3. arXiv:2504.04022  [pdf, other]

    cs.CL cs.AI

    Rethinking Reflection in Pre-Training

    Authors: Essential AI: Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, et al. (4 additional authors not shown)

    Abstract: A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model c…

    Submitted 4 April, 2025; originally announced April 2025.

  4. On in-silico estimation of left ventricular end-diastolic pressure from cardiac strains

    Authors: Emilio A. Mendiola, Raza Rana Mehdi, Dipan J. Shah, Reza Avazmohammadi

    Abstract: Left ventricular diastolic dysfunction (LVDD) is a group of diseases that adversely affect the passive phase of the cardiac cycle and can lead to heart failure. While left ventricular end-diastolic pressure (LVEDP) is a valuable prognostic measure in LVDD patients, traditional invasive methods of measuring LVEDP present risks and limitations, highlighting the need for alternative approaches. This…

    Submitted 28 May, 2024; originally announced May 2024.

    Journal ref: 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1-4

  5. arXiv:2405.18334  [pdf, other]

    cs.DB cs.CV cs.LG

    SketchQL Demonstration: Zero-shot Video Moment Querying with Sketches

    Authors: Renzhi Wu, Pramod Chunduri, Dristi J Shah, Ashmitha Julius Aravind, Ali Payani, Xu Chu, Joy Arulraj, Kexin Rong

    Abstract: In this paper, we will present SketchQL, a video database management system (VDBMS) for retrieving video moments with a sketch-based query interface. This novel interface allows users to specify object trajectory events with simple mouse drag-and-drop operations. Users can use trajectories of single objects as building blocks to compose complex events. Using a pre-trained model that encodes trajec…

    Submitted 30 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Journal ref: Published on International Conference on Very Large Databases 2024

  6. arXiv:2112.03858  [pdf, other]

    cs.CL

    Reducing Target Group Bias in Hate Speech Detectors

    Authors: Darsh J Shah, Sinong Wang, Han Fang, Hao Ma, Luke Zettlemoyer

    Abstract: The ubiquity of offensive and hateful content on online fora necessitates the need for automatic solutions that detect such content competently across target groups. In this paper we show that text classification models trained on large publicly available datasets despite having a high overall performance, may significantly under-perform on several protected groups. On the \citet{vidgen2020learnin…

    Submitted 7 December, 2021; originally announced December 2021.

  7. arXiv:2104.08668  [pdf, other]

    cs.CL

    Generating Related Work

    Authors: Darsh J Shah, Regina Barzilay

    Abstract: Communicating new research ideas involves highlighting similarities and differences with past work. Authors write fluent, often long sections to survey the distinction of a new paper with related work. In this work we model generating related work sections while being cognisant of the motivation behind citing papers. Our content planning model generates a tree of cited papers before a surface real…

    Submitted 17 April, 2021; originally announced April 2021.

  8. arXiv:2104.03465  [pdf, other]

    cs.CL

    Nutribullets Hybrid: Multi-document Health Summarization

    Authors: Darsh J Shah, Lili Yu, Tao Lei, Regina Barzilay

    Abstract: We present a method for generating comparative summaries that highlights similarities and contradictions in input documents. The key challenge in creating such summaries is the lack of large parallel training data required for training typical summarization systems. To this end, we introduce a hybrid generation approach inspired by traditional concept-to-text systems. To enable accurate comparison…

    Submitted 7 April, 2021; originally announced April 2021.

    Comments: NAACL 2021 Camera Ready

  9. arXiv:2103.11921  [pdf, other]

    cs.CL

    Nutri-bullets: Summarizing Health Studies by Composing Segments

    Authors: Darsh J Shah, Lili Yu, Tao Lei, Regina Barzilay

    Abstract: We introduce Nutri-bullets, a multi-document summarization task for health and nutrition. First, we present two datasets of food and health summaries from multiple scientific studies. Furthermore, we propose a novel extract-compose model to solve the problem in the regime of limited parallel data. We explicitly select key spans from several abstracts using a policy network, followed…

    Submitted 22 March, 2021; originally announced March 2021.

    Comments: 12 pages

    Journal ref: AAAI 2021 Camera Ready

  10. arXiv:1910.10274  [pdf, other]

    cs.CL

    Capturing Greater Context for Question Generation

    Authors: Luu Anh Tuan, Darsh J Shah, Regina Barzilay

    Abstract: Automatic question generation can benefit many applications ranging from dialogue systems to reading comprehension. While questions are often asked with respect to long documents, there are many challenges with modeling such long documents. Many existing techniques generate questions by effectively looking at one sentence at a time, leading to questions that are easy and not reflective of the huma…

    Submitted 22 October, 2019; originally announced October 2019.

  11. arXiv:1909.13838  [pdf, other]

    cs.CL

    Automatic Fact-guided Sentence Modification

    Authors: Darsh J Shah, Tal Schuster, Regina Barzilay

    Abstract: Online encyclopediae like Wikipedia contain large amounts of text that need frequent corrections and updates. The new information may contradict existing content in encyclopediae. In this paper, we focus on rewriting such dynamically changing articles. This is a challenging constrained generation task, as the output must be consistent with the new information and fit into the rest of the existing…

    Submitted 2 December, 2019; v1 submitted 30 September, 2019; originally announced September 2019.

    Comments: AAAI 2020

  12. arXiv:1908.09805  [pdf, other]

    cs.CL cs.CY

    The Limitations of Stylometry for Detecting Machine-Generated Fake News

    Authors: Tal Schuster, Roei Schuster, Darsh J Shah, Regina Barzilay

    Abstract: Recent developments in neural language models (LMs) have raised concerns about their potential misuse for automatically spreading misinformation. In light of these concerns, several studies have proposed to detect machine-generated fake news by capturing their stylistic differences from human-written text. These approaches, broadly termed stylometry, have found success in source attribution and mi…

    Submitted 20 February, 2020; v1 submitted 26 August, 2019; originally announced August 2019.

    Comments: Accepted for Computational Linguistics journal (squib). Previously posted with title "Are We Safe Yet? The Limitations of Distributional Features for Fake News Detection"

  13. arXiv:1908.05267  [pdf, other]

    cs.CL

    Towards Debiasing Fact Verification Models

    Authors: Tal Schuster, Darsh J Shah, Yun Jie Serene Yeo, Daniel Filizzola, Enrico Santus, Regina Barzilay

    Abstract: Fact verification requires validating a claim in the context of evidence. We show, however, that in the popular FEVER dataset this might not necessarily be the case. Claim-only classifiers perform competitively with top evidence-aware models. In this paper, we investigate the cause of this phenomenon, identifying strong cues for predicting labels solely based on the claim, without considering any…

    Submitted 30 August, 2019; v1 submitted 14 August, 2019; originally announced August 2019.

    Comments: EMNLP IJCNLP 2019

  14. arXiv:1906.06870  [pdf, other]

    cs.CL

    Robust Zero-Shot Cross-Domain Slot Filling with Example Values

    Authors: Darsh J Shah, Raghav Gupta, Amir A Fayazi, Dilek Hakkani-Tur

    Abstract: Task-oriented dialog systems increasingly rely on deep learning-based slot filling models, usually needing extensive labeled training data for target domains. Often, however, little to no target domain training data may be available, or the training and target domain schemas may be misaligned, as is common for web forms on similar websites. Prior zero-shot slot filling models use slot descriptions…

    Submitted 17 June, 2019; originally announced June 2019.

    Comments: To appear in ACL 2019

  15. arXiv:1809.02256  [pdf, other]

    cs.CL

    Multi-Source Domain Adaptation with Mixture of Experts

    Authors: Jiang Guo, Darsh J Shah, Regina Barzilay

    Abstract: We propose a mixture-of-experts approach for unsupervised domain adaptation from multiple sources. The key idea is to explicitly capture the relationship between a target example and different source domains. This relationship, expressed by a point-to-set metric, determines how to combine predictors trained on various domains. The metric is learned in an unsupervised fashion using meta-training. E…

    Submitted 16 October, 2018; v1 submitted 6 September, 2018; originally announced September 2018.

    Comments: 11 pages, EMNLP 2018

  16. arXiv:1809.02255  [pdf, other]

    cs.CL

    Adversarial Domain Adaptation for Duplicate Question Detection

    Authors: Darsh J Shah, Tao Lei, Alessandro Moschitti, Salvatore Romeo, Preslav Nakov

    Abstract: We address the problem of detecting duplicate questions in forums, which is an important step towards automating the process of answering new questions. As finding and annotating such potential duplicates manually is very tedious and costly, automatic methods based on machine learning are a viable alternative. However, many forums do not have annotated data, i.e., questions labeled by experts as d…

    Submitted 6 September, 2018; originally announced September 2018.

    Comments: EMNLP 2018 short paper - camera ready. 8 pages