Showing 1–11 of 11 results for author: Pham, C M

Searching in archive cs.
  1. arXiv:2505.22945  [pdf, other]

    cs.CL cs.AI

    OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature

    Authors: Alisha Srivastava, Emir Korukluoglu, Minh Nhat Le, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer

    Abstract: Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented i…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: preprint, 25 pages

  2. arXiv:2505.18128  [pdf, ps, other]

    cs.CL

    Frankentext: Stitching random text fragments into long-form narratives

    Authors: Chau Minh Pham, Jenna Russell, Dzung Pham, Mohit Iyyer

    Abstract: We introduce Frankentexts, a new type of long-form narratives produced by LLMs under the extreme constraint that most tokens (e.g., 90%) must be copied verbatim from human writings. This task presents a challenging test of controllable generation, requiring models to satisfy a writing prompt, integrate disparate text fragments, and still produce a coherent narrative. To generate Frankentexts, we i…

    Submitted 28 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

  3. arXiv:2505.14549  [pdf, ps, other]

    cs.CR cs.AI

    Can Large Language Models Really Recognize Your Name?

    Authors: Dzung Pham, Peter Kairouz, Niloofar Mireshghallah, Eugene Bagdasarian, Chau Minh Pham, Amir Houmansadr

    Abstract: Large language models (LLMs) are increasingly being used to protect sensitive user data. However, current LLM-based privacy solutions assume that these models can reliably detect personally identifiable information (PII), particularly named entities. In this paper, we challenge that assumption by revealing systematic failures in LLM-based privacy tasks. Specifically, we show that modern LLMs regul…

    Submitted 20 May, 2025; originally announced May 2025.

  4. arXiv:2503.07919  [pdf, other]

    cs.AI cs.CL cs.LG

    BEARCUBS: A benchmark for computer-using web agents

    Authors: Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer

    Abstract: Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seek…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 16 pages

  5. arXiv:2502.14854  [pdf, other]

    cs.CL

    CLIPPER: Compression enables long-context synthetic data generation

    Authors: Chau Minh Pham, Yapei Chang, Mohit Iyyer

    Abstract: LLM developers are increasingly reliant on synthetic data, but generating high-quality data for complex long-context reasoning tasks remains challenging. We introduce CLIPPER, a compression-based approach for generating synthetic data tailored to narrative claim verification - a task that requires reasoning over a book to verify a given claim. Instead of generating claims directly from the raw tex…

    Submitted 20 February, 2025; originally announced February 2025.

  6. arXiv:2502.13028  [pdf, other]

    cs.CL

    Whose story is it? Personalizing story generation by inferring author styles

    Authors: Nischal Ashok Kumar, Chau Minh Pham, Mohit Iyyer, Andrew Lan

    Abstract: Personalization is critical for improving user experience in interactive writing and educational applications, yet remains understudied in story generation. We study the task of personalizing story generation, where our goal is to mimic an author's writing style, given other stories written by them. We collect Mythos, a dataset of 3.6k stories from 112 authors, with an average of 16 stories per au…

    Submitted 21 May, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

    Comments: preprint: 55 pages

  7. arXiv:2407.08792  [pdf, ps, other]

    cs.CR

    ProxyGPT: Enabling User Anonymity in LLM Chatbots via (Un)Trustworthy Volunteer Proxies

    Authors: Dzung Pham, Jade Sheffey, Chau Minh Pham, Amir Houmansadr

    Abstract: Popular large language model (LLM) chatbots such as ChatGPT and Claude require users to create an account with an email or a phone number before allowing full access to their services. This practice ties users' personally identifiable information (PII) to their sensitive conversational data, thus posing significant privacy risks. Unfortunately, existing private LLM solutions based on cryptography…

    Submitted 11 June, 2025; v1 submitted 11 July, 2024; originally announced July 2024.

  8. arXiv:2406.19928  [pdf, other]

    cs.CL cs.HC cs.IR

    Interactive Topic Models with Optimal Transport

    Authors: Garima Dhanania, Sheshera Mysore, Chau Minh Pham, Mohit Iyyer, Hamed Zamani, Andrew McCallum

    Abstract: Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics in a corpus when analysts are unfamiliar with the corpus, analysts also commonly start with an understanding of the content present in a corpus. This may be through categories obtained from an initial pass over the corpus or a desire to analyze the corpus through a predefined set of…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Pre-print; Work in progress

  9. arXiv:2406.19371  [pdf, other]

    cs.CL

    Suri: Multi-constraint Instruction Following for Long-form Text Generation

    Authors: Chau Minh Pham, Simeng Sun, Mohit Iyyer

    Abstract: Existing research on instruction following largely focuses on tasks with simple instructions and short responses. In this work, we explore multi-constraint instruction following for generating long-form text. We create Suri, a dataset with 20K human-written long-form texts paired with LLM-generated backtranslated instructions that contain multiple complex constraints. Because of prohibitive challe…

    Submitted 1 October, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

    Comments: Accepted to EMNLP'24 (Findings)

  10. arXiv:2311.01449  [pdf, other]

    cs.CL

    TopicGPT: A Prompt-based Topic Modeling Framework

    Authors: Chau Minh Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, Mohit Iyyer

    Abstract: Topic modeling is a well-established technique for exploring text corpora. Conventional topic models (e.g., LDA) represent topics as bags of words that often require "reading the tea leaves" to interpret; additionally, they offer users minimal control over the formatting and specificity of resulting topics. To tackle these issues, we introduce TopicGPT, a prompt-based framework that uses large lan…

    Submitted 1 April, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

    Comments: Accepted to NAACL 2024 (Main conference)

  11. arXiv:2306.03280  [pdf, other]

    cs.HC

    AHA!: Facilitating AI Impact Assessment by Generating Examples of Harms

    Authors: Zana Buçinca, Chau Minh Pham, Maurice Jakesch, Marco Tulio Ribeiro, Alexandra Olteanu, Saleema Amershi

    Abstract: While demands for change and accountability for harmful AI consequences mount, foreseeing the downstream effects of deploying AI systems remains a challenging task. We developed AHA! (Anticipating Harms of AI), a generative framework to assist AI practitioners and decision-makers in anticipating potential harms and unintended consequences of AI systems prior to development or deployment. Given an…

    Submitted 5 June, 2023; originally announced June 2023.