Search | arXiv e-print repository

Evaluating Style-Personalized Text Generation: Challenges and Directions

Authors: Anubhav Jangra, Bahareh Sarrafzadeh, Adrian de Wynter, Silviu Cucerzan, Sujay Kumar Jauhar

Abstract: While prior research has built tools and benchmarks towards style personalized text generation, there has been limited exploration of evaluation in low-resource author style personalized text generation space. Through this work, we question the effectiveness of the widely adopted evaluation metrics like BLEU and ROUGE, and explore other evaluation paradigms such as style embeddings and LLM-as-judg… ▽ More While prior research has built tools and benchmarks towards style personalized text generation, there has been limited exploration of evaluation in low-resource author style personalized text generation space. Through this work, we question the effectiveness of the widely adopted evaluation metrics like BLEU and ROUGE, and explore other evaluation paradigms such as style embeddings and LLM-as-judge to holistically evaluate the style personalized text generation task. We evaluate these metrics and their ensembles using our style discrimination benchmark, that spans eight writing tasks, and evaluates across three settings, domain discrimination, authorship attribution, and LLM personalized vs non-personalized discrimination. We provide conclusive evidence to adopt ensemble of diverse evaluation metrics to effectively evaluate style personalized text generation. △ Less

Submitted 8 August, 2025; originally announced August 2025.

arXiv:2505.16023 [pdf, ps, other]

Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild

Authors: Sheshera Mysore, Debarati Das, Hancheng Cao, Bahareh Sarrafzadeh

Abstract: As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large-scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bi… ▽ More As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large-scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bing Copilot and WildChat. Our analysis goes beyond simple task classification or satisfaction estimation common in prior work and instead characterizes how users interact with LLMs through the course of a session. We identify prototypical behaviors in how users interact with LLMs in prompts following their original request. We refer to these as Prototypical Human-AI Collaboration Behaviors (PATHs) and find that a small group of PATHs explain a majority of the variation seen in user-LLM interaction. These PATHs span users revising intents, exploring texts, posing questions, adjusting style or injecting new content. Next, we find statistically significant correlations between specific writing intents and PATHs, revealing how users' intents shape their collaboration behaviors. We conclude by discussing the implications of our findings on LLM alignment. △ Less

Submitted 21 June, 2025; v1 submitted 21 May, 2025; originally announced May 2025.

Comments: Pre-print under-review

arXiv:2503.16789 [pdf, ps, other]

Conversational User-AI Intervention: A Study on Prompt Rewriting for Improved LLM Response Generation

Authors: Rupak Sarkar, Bahareh Sarrafzadeh, Nirupama Chandrasekaran, Nagu Rangan, Philip Resnik, Longqi Yang, Sujay Kumar Jauhar

Abstract: Human-LLM conversations are increasingly becoming more pervasive in peoples' professional and personal lives, yet many users still struggle to elicit helpful responses from LLM Chatbots. One of the reasons for this issue is users' lack of understanding in crafting effective prompts that accurately convey their information needs. Meanwhile, the existence of real-world conversational datasets on the… ▽ More Human-LLM conversations are increasingly becoming more pervasive in peoples' professional and personal lives, yet many users still struggle to elicit helpful responses from LLM Chatbots. One of the reasons for this issue is users' lack of understanding in crafting effective prompts that accurately convey their information needs. Meanwhile, the existence of real-world conversational datasets on the one hand, and the text understanding faculties of LLMs on the other, present a unique opportunity to study this problem, and its potential solutions at scale. Thus, in this paper we present the first LLM-centric study of real human-AI chatbot conversations, focused on investigating aspects in which user queries fall short of expressing information needs, and the potential of using LLMs to rewrite suboptimal user prompts. Our findings demonstrate that rephrasing ineffective prompts can elicit better responses from a conversational system, while preserving the user's original intent. Notably, the performance of rewrites improves in longer conversations, where contextual inferences about user needs can be made more accurately. Additionally, we observe that LLMs often need to -- and inherently do -- make \emph{plausible} assumptions about a user's intentions and goals when interpreting prompts. Our findings largely hold true across conversational domains, user intents, and LLMs of varying sizes and families, indicating the promise of using prompt rewriting as a solution for better human-AI interactions. △ Less

Submitted 25 June, 2025; v1 submitted 20 March, 2025; originally announced March 2025.

Comments: 8 pages, ACL style

arXiv:2312.06908 [pdf, other]

"I Want It That Way": Enabling Interactive Decision Support Using Large Language Models and Constraint Programming

Authors: Connor Lawless, Jakob Schoeffer, Lindy Le, Kael Rowan, Shilad Sen, Cristina St. Hill, Jina Suh, Bahareh Sarrafzadeh

Abstract: A critical factor in the success of decision support systems is the accurate modeling of user preferences. Psychology research has demonstrated that users often develop their preferences during the elicitation process, highlighting the pivotal role of system-user interaction in developing personalized systems. This paper introduces a novel approach, combining Large Language Models (LLMs) with Cons… ▽ More A critical factor in the success of decision support systems is the accurate modeling of user preferences. Psychology research has demonstrated that users often develop their preferences during the elicitation process, highlighting the pivotal role of system-user interaction in developing personalized systems. This paper introduces a novel approach, combining Large Language Models (LLMs) with Constraint Programming to facilitate interactive decision support. We study this hybrid framework through the lens of meeting scheduling, a time-consuming daily activity faced by a multitude of information workers. We conduct three studies to evaluate the novel framework, including a diary study (n=64) to characterize contextual scheduling preferences, a quantitative evaluation of the system's performance, and a user study (n=10) with a prototype system. Our work highlights the potential for a hybrid LLM and optimization approach for iterative preference elicitation and design considerations for building systems that support human-system collaborative decision-making processes. △ Less

Submitted 1 October, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Journal ref: ACM Trans. Interact. Intell. Syst., Vol. 14, No. 3, Article 22. 2024

arXiv:2311.09180 [pdf, other]

Pearl: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers

Authors: Sheshera Mysore, Zhuoran Lu, Mengting Wan, Longqi Yang, Bahareh Sarrafzadeh, Steve Menezes, Tina Baghaee, Emmanuel Barajas Gonzalez, Jennifer Neville, Tara Safavi

Abstract: Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author's communication style, specialized knowledge, and values. In this paper, we address this challenge by proposing… ▽ More Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author's communication style, specialized knowledge, and values. In this paper, we address this challenge by proposing Pearl, a LLM writing assistant personalized with a retriever that is trained to be generation-calibrated for personalization. Generation calibration ensures that our retriever selects historic user authored documents to augment an LLM prompt such that they are likely to help an LLM generation better adhere to a users' preferences. We propose two key novelties for training such a retriever: (1) A training data selection method that identifies user requests likely to benefit from personalization and documents that provide that benefit; and (2) A scale-calibrating KL-divergence objective that ensures that our retriever scores remain proportional to the downstream generation quality from using the document for personalized generation. In a series of holistic evaluations, we demonstrate the effectiveness of Pearl in generating long-form texts on multiple social media datasets. Finally, we demonstrate how a generation-calibrated retriever can double as a performance predictor -- detecting low quality retrieval, and improving potentially under-performing outputs via revision with LLMs. △ Less

Submitted 4 November, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: Accepted to Workshop on Customizable NLP at EMNLP 2024

arXiv:2309.08104 [pdf, other]

Rhythm of Work: Mixed-methods Characterization of Information Workers Scheduling Preferences and Practices

Authors: Lu Sun, Lillio Mok, Shilad Sen, Bahar Sarrafzadeh

Abstract: As processes around hybrid work, spatially distant collaborations, and work-life boundaries grow increasingly complex, managing workers' schedules for synchronous meetings has become a critical aspect of building successful global teams. However, gaps remain in our understanding of workers' scheduling preferences and practices, which we aim to fill in this large-scale, mixed-methods study of indiv… ▽ More As processes around hybrid work, spatially distant collaborations, and work-life boundaries grow increasingly complex, managing workers' schedules for synchronous meetings has become a critical aspect of building successful global teams. However, gaps remain in our understanding of workers' scheduling preferences and practices, which we aim to fill in this large-scale, mixed-methods study of individuals calendars in a multinational organization. Using interviews with eight participants, survey data from 165 respondents, and telemetry data from millions of meetings scheduled by 211 thousand workers, we characterize scheduling preferences, practices, and their relationship with each other and organizational factors. We find that temporal preferences can be broadly classified as either cyclical, such as suitability of certain days, or relational, such as dispersed meetings, at various time scales. Furthermore, our results suggest that these preferences are disconnected from actual practice--albeit with several notable exceptions--and that individual differences are associated with factors like meeting load, time-zones, importance of meetings to job function, and job titles. We discuss key themes for our findings, along with the implications for calendar and scheduling systems and socio-technical systems more broadly. △ Less

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2108.03350 [pdf, other]

doi 10.1145/1122445.1122456

Learning to Represent Human Motives for Goal-directed Web Browsing

Authors: Jyun-Yu Jiang, Chia-Jung Lee, Longqi Yang, Bahareh Sarrafzadeh, Brent Hecht, Jaime Teevan

Abstract: Motives or goals are recognized in psychology literature as the most fundamental drive that explains and predicts why people do what they do, including when they browse the web. Although providing enormous value, these higher-ordered goals are often unobserved, and little is known about how to leverage such goals to assist people's browsing activities. This paper proposes to take a new approach to… ▽ More Motives or goals are recognized in psychology literature as the most fundamental drive that explains and predicts why people do what they do, including when they browse the web. Although providing enormous value, these higher-ordered goals are often unobserved, and little is known about how to leverage such goals to assist people's browsing activities. This paper proposes to take a new approach to address this problem, which is fulfilled through a novel neural framework, Goal-directed Web Browsing (GoWeB). We adopt a psychologically-sound taxonomy of higher-ordered goals and learn to build their representations in a structure-preserving manner. Then we incorporate the resulting representations for enhancing the experiences of common activities people perform on the web. Experiments on large-scale data from Microsoft Edge web browser show that GoWeB significantly outperforms competitive baselines for in-session web page recommendation, re-visitation classification, and goal-based web page grouping. A follow-up analysis further characterizes how the variety of human motives can affect the difference observed in human behavioral patterns. △ Less

Submitted 6 August, 2021; originally announced August 2021.

Comments: Accepted by RecSys 2021

arXiv:2008.08165 [pdf, other]

Characterizing Stage-Aware Writing Assistance in Collaborative Document Authoring

Authors: Bahareh Sarrafzadeh, Sujay Kumar Jauhar, Michael Gamon, Edward Lank, Ryen White

Abstract: Writing is a complex non-linear process that begins with a mental model of intent, and progresses through an outline of ideas, to words on paper (and their subsequent refinement). Despite past research in understanding writing, Web-scale consumer and enterprise collaborative digital writing environments are yet to greatly benefit from intelligent systems that understand the stages of document evol… ▽ More Writing is a complex non-linear process that begins with a mental model of intent, and progresses through an outline of ideas, to words on paper (and their subsequent refinement). Despite past research in understanding writing, Web-scale consumer and enterprise collaborative digital writing environments are yet to greatly benefit from intelligent systems that understand the stages of document evolution, providing opportune assistance based on authors' situated actions and context. In this paper, we present three studies that explore temporal stages of document authoring. We first survey information workers at a large technology company about their writing habits and preferences, concluding that writers do in fact conceptually progress through several distinct phases while authoring documents. We also explore, qualitatively, how writing stages are linked to document lifespan. We supplement these qualitative findings with an analysis of the longitudinal user interaction logs of a popular digital writing platform over several million documents. Finally, as a first step towards facilitating an intelligent digital writing assistant, we conduct a preliminary investigation into the utility of user interaction log data for predicting the temporal stage of a document. Our results support the benefit of tools tailored to writing stages, identify primary tasks associated with these stages, and show that it is possible to predict stages from anonymous interaction logs. Together, these results argue for the benefit and feasibility of more tailored digital writing assistance. △ Less

Submitted 18 August, 2020; originally announced August 2020.

Comments: Accepted for publication at CSCW 2020

Journal ref: CSCW 2020

arXiv:2005.01716 [pdf, other]

Hierarchical Knowledge Graphs: A Novel Information Representation for Exploratory Search Tasks

Authors: Bahareh Sarrafzadeh, Adam Roegiest, Edward Lank

Abstract: In exploratory search tasks, alongside information retrieval, information representation is an important factor in sensemaking. In this paper, we explore a multi-layer extension to knowledge graphs, hierarchical knowledge graphs (HKGs), that combines hierarchical and network visualizations into a unified data representation asa tool to support exploratory search. We describe our algorithm to const… ▽ More In exploratory search tasks, alongside information retrieval, information representation is an important factor in sensemaking. In this paper, we explore a multi-layer extension to knowledge graphs, hierarchical knowledge graphs (HKGs), that combines hierarchical and network visualizations into a unified data representation asa tool to support exploratory search. We describe our algorithm to construct these visualizations, analyze interaction logs to quantitatively demonstrate performance parity with networks and performance advantages over hierarchies, and synthesize data from interaction logs, interviews, and thinkalouds on a testbed data set to demonstrate the utility of the unified hierarchy+network structure in our HKGs. Alongside the above study, we perform an additional mixed methods analysis of the effect of precision and recall on the performance of hierarchical knowledge graphs for two different exploratory search tasks. While the quantitative data shows a limited effect of precision and recall on user performance and user effort, qualitative data combined with post-hoc statistical analysis provides evidence that the type of exploratory search task (e.g., learning versus investigating) can be impacted by precision and recall. Furthermore, our qualitative analyses find that users are unable to perceive differences in the quality of extracted information. We discuss the implications of our results and analyze other factors that more significantly impact exploratory search performance in our experimental tasks. △ Less

Submitted 4 May, 2020; originally announced May 2020.

Comments: 35 Pages of main content, Extension of previous work published at SigIR 2017, Currently under review at TOIS 2020

ACM Class: H.5; H.4

arXiv:1901.04375 [pdf, other]

doi 10.1145/3289600.3291028

Characterizing and Predicting Email Deferral Behavior

Authors: Bahareh Sarrafzadeh, Ahmed Hassan Awadallah, Christopher H. Lin, Chia-Jung Lee, Milad Shokouhi, Susan T. Dumais

Abstract: Email triage involves going through unhandled emails and deciding what to do with them. This familiar process can become increasingly challenging as the number of unhandled email grows. During a triage session, users commonly defer handling emails that they cannot immediately deal with to later. These deferred emails, are often related to tasks that are postponed until the user has more time or th… ▽ More Email triage involves going through unhandled emails and deciding what to do with them. This familiar process can become increasingly challenging as the number of unhandled email grows. During a triage session, users commonly defer handling emails that they cannot immediately deal with to later. These deferred emails, are often related to tasks that are postponed until the user has more time or the right information to deal with them. In this paper, through qualitative interviews and a large-scale log analysis, we study when and what enterprise email users tend to defer. We found that users are more likely to defer emails when handling them involves replying, reading carefully, or clicking on links and attachments. We also learned that the decision to defer emails depends on many factors such as user's workload and the importance of the sender. Our qualitative results suggested that deferring is very common, and our quantitative log analysis confirms that 12% of triage sessions and 16% of daily active users had at least one deferred email on weekdays. We also discuss several deferral strategies such as marking emails as unread and flagging that are reported by our interviewees, and illustrate how such patterns can be also observed in user logs. Inspired by the characteristics of deferred emails and contextual factors involved in deciding if an email should be deferred, we train a classifier for predicting whether a recently triaged email is actually deferred. Our experimental results suggests that deferral can be classified with modest effectiveness. Overall, our work provides novel insights about how users handle their emails and how deferral can be modeled. △ Less

Submitted 14 January, 2019; originally announced January 2019.

Journal ref: WSDM 2019

Showing 1–10 of 10 results for author: Sarrafzadeh, B