-
OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
Authors:
Ziyi Wang,
Yuxuan Lu,
Wenbo Li,
Amirali Amini,
Bo Sun,
Yakov Bart,
Weimin Lyu,
Jiri Gesi,
Tian Wang,
Jing Huang,
Yu Su,
Upol Ehsan,
Malihe Alikhani,
Toby Jia-Jun Li,
Lydia Chilton,
Dakuo Wang
Abstract:
Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale with a given persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.
Submitted 7 July, 2025; v1 submitted 5 June, 2025;
originally announced June 2025.
-
Schemex: Interactive Structural Abstraction from Examples with Contrastive Refinement
Authors:
Sitong Wang,
Samia Menon,
Dingzeyu Li,
Xiaojuan Ma,
Richard Zemel,
Lydia B. Chilton
Abstract:
Each type of creative or communicative work is underpinned by an implicit structure. People learn these structures from examples - a process known in cognitive science as schema induction. However, inducing schemas is challenging, as structural patterns are often obscured by surface-level variation. We present Schemex, an interactive visual workflow that scaffolds schema induction through clustering, abstraction, and contrastive refinement. Schemex supports users through visual representations and interactive exploration that connect abstract structures to concrete examples, promoting transparency, adaptability, and effective human-AI collaboration. In our user study, participants reported significantly greater insight and confidence in the schemas developed with Schemex compared to those created using a baseline of an AI reasoning model. We conclude by discussing the broader implications of structural abstraction and contrastive refinement across domains.
Submitted 16 April, 2025;
originally announced April 2025.
-
AgentDynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations
Authors:
Jenny Ma,
Riya Sahni,
Karthik Sreedhar,
Lydia B. Chilton
Abstract:
Multi-agent large language model simulations have the potential to model complex human behaviors and interactions. If the mechanics are set up properly, unanticipated and valuable social dynamics can surface. However, it is challenging to consistently enforce simulation mechanics while still allowing for notable and emergent dynamics. We present AgentDynEx, an AI system that helps set up simulations from user-specified mechanics and dynamics. AgentDynEx uses LLMs to guide users through a Configuration Matrix to identify core mechanics and define milestones to track dynamics. It also introduces a method called nudging, where the system dynamically reflects on simulation progress and gently intervenes if it begins to deviate from intended outcomes. A technical evaluation found that nudging enables simulations to have more complex mechanics and maintain their notable dynamics compared to simulations without nudging. We discuss the importance of nudging as a technique for balancing mechanics and dynamics of multi-agent simulations.
Submitted 13 April, 2025;
originally announced April 2025.
-
Schemex: Discovering Design Patterns from Examples through Iterative Abstraction and Refinement
Authors:
Sitong Wang,
Lydia B. Chilton
Abstract:
Expertise is often built by learning from examples. This process, known as schema induction, helps us identify patterns from examples. Despite its importance, schema induction remains a challenging cognitive task. Recent advances in generative AI reasoning capabilities offer new opportunities to support schema induction through human-AI collaboration. We present Schemex, an AI-powered workflow that enhances human schema induction through three stages: clustering, abstraction, and refinement via contrasting examples. We conducted an initial evaluation of Schemex through two real-world case studies: writing abstracts for HCI papers and creating news TikToks. Qualitative analysis demonstrates the high accuracy and usefulness of the generated schemas. We also discuss future work on developing more flexible methods for workflow construction to help humans focus on high-level thinking.
Submitted 20 February, 2025;
originally announced February 2025.
-
Beyond Training: Social Dynamics of AI Adoption in Industry
Authors:
Riya Sahni,
Lydia B. Chilton
Abstract:
While organizations continue to invest in AI tools like M365 Copilot, little is known about how individual employees engage with these technologies once deployed. This study examines M365 Copilot adoption behaviors among a group of 10 experienced users across many industries in the United States. Findings reveal a strong preference for informal learning methods over structured training. Even though 9 out of 10 participants acknowledged that formal training for Copilot tools would be useful, 7 out of 10 stated that they ignored the Copilot onboarding videos provided to them, citing reasons such as time constraints, preference for self-guided learning, or reliance on external resources like ChatGPT. No participants used formal training as their primary learning method. Instead, experiential learning (trial and error, 8 participants) and social learning (peer discussions, 6 participants) emerged as dominant learning strategies. We discuss opportunities for promoting social learning of AI tools in the workplace.
Submitted 18 February, 2025;
originally announced February 2025.
-
Simulating Cooperative Prosocial Behavior with Multi-Agent LLMs: Evidence and Mechanisms for AI Agents to Inform Policy Decisions
Authors:
Karthik Sreedhar,
Alice Cai,
Jenny Ma,
Jeffrey V. Nickerson,
Lydia B. Chilton
Abstract:
Human prosocial cooperation is essential for our collective health, education, and welfare. However, designing social systems to maintain or incentivize prosocial behavior is challenging because people can act selfishly to maximize personal gain. This complex and unpredictable aspect of human behavior makes it difficult for policymakers to foresee the implications of their designs. Recently, multi-agent LLM systems have shown remarkable capabilities in simulating human-like behavior and replicating some human lab experiments. This paper studies how well multi-agent systems can simulate prosocial human behavior, such as that seen in the public goods game (PGG), and whether multi-agent systems can exhibit "unbounded actions" seen outside the lab in real-world scenarios. We find that multi-agent LLM systems successfully replicate human behavior from lab experiments of the public goods game with three experimental treatments: priming, transparency, and varying endowments. Beyond replicating existing experiments, we find that multi-agent LLM systems can replicate the expected human behavior when combining experimental treatments, even if no previous study combined those specific treatments. Lastly, we find that multi-agent systems can exhibit a rich set of unbounded actions that people do in the real world outside of the lab, such as collaborating and even cheating. In sum, these studies are steps towards a future where LLMs can be used to inform policy decisions that encourage people to act in a prosocial manner.
Submitted 17 February, 2025;
originally announced February 2025.
-
AI Humor Generation: Cognitive, Social and Creative Skills for Effective Humor
Authors:
Sean Kim,
Lydia B. Chilton
Abstract:
Humor is a social binding agent. It is an act of creativity that can provoke emotional reactions on a broad range of topics. Humor has long been thought to be "too human" for AI to generate. However, humans are complex, and humor requires our complex set of skills: cognitive reasoning, social understanding, a broad base of knowledge, creative thinking, and audience understanding. We explore whether giving AI such skills enables it to write humor. We target one audience: Gen Z humor fans. We ask people to rate meme caption humor from three sources: 1) highly upvoted human captions, 2) basic LLMs, and 3) LLM captions with humor skills. We find that users like LLM captions with humor skills more than basic LLMs and almost on par with top-rated humor written by people. We discuss how giving AI human-like skills can help it generate communication that resonates with people.
Submitted 11 February, 2025;
originally announced February 2025.
-
The Role of Human Creativity in the Presence of AI Creativity Tools at Work: A Case Study on AI-Driven Content Transformation in Journalism
Authors:
Sitong Wang,
Jocelyn McKinnon-Crowley,
Tao Long,
Kian Loong Lua,
Keren Henderson,
Kevin Crowston,
Jeffrey V. Nickerson,
Mark Hansen,
Lydia B. Chilton
Abstract:
As AI becomes more capable, it is unclear how human creativity will remain essential in jobs that incorporate AI. We conducted a 14-week study of a student newsroom using an AI tool to convert web articles into social media videos. Most treated the tool as a creative springboard, yet still had to edit many AI outputs. The tool enabled the team to publish successful content, receiving over 500,000 views. Yet creators sometimes treated AI as an unquestioned expert, accepting flawed suggestions. Editorial critique was essential to spot errors and guide creative solutions when AI failed. We discuss how AI's inherent gaps ensure human creativity remains vital.
Submitted 7 February, 2025;
originally announced February 2025.
-
Audience Impressions of Narrative Structures and Personal Language Style in Science Communication on Social Media
Authors:
Grace Li,
Yuanyang Teng,
Juna Kawai-Yue,
Unaisah Ahmed,
Anatta S. Tantiwongse,
Jessica Y. Liang,
Dorothy Zhang,
Kynnedy Simone Smith,
Tao Long,
Mina Lee,
Lydia B Chilton
Abstract:
Science communication increases public interest in science by educating, engaging, and encouraging everyday people to participate in the sciences. But traditional science communication is often too formal and inaccessible for general audiences. However, there is a growing trend on social media to make it more approachable using three techniques: relatable examples to make explanations concrete, step-by-step walkthroughs to improve understanding, and personal language to drive engagement. These techniques are flashy and often garner more engagement from social media users, but the effectiveness of these techniques in actually explaining the science is unknown. Furthermore, many scientists struggle with adopting these science communication strategies for social media, fearing it might undermine their authority. We conducted a reader study to understand how these science communication techniques on social media affect readers' understanding of and engagement with the science. We found that while most readers prefer these techniques, they had diverse preferences for when and where these techniques are used. With these findings, we conducted a writer study to understand how scientists' varying comfort levels with these strategies can be supported by presenting different structure and style options. We found that the side-by-side comparison of options helped writers make editorial decisions. Instead of adhering to one direction of science communication, writers explored a continuum of options, which helped them identify which communication strategies they wanted to implement.
Submitted 18 February, 2025; v1 submitted 7 February, 2025;
originally announced February 2025.
-
JumpStarter: Human-AI Planning with Task-Structured Context Curation
Authors:
Xuanming Zhang,
Sitong Wang,
Jenny Ma,
Alyssa Hwang,
Zhou Yu,
Lydia B. Chilton
Abstract:
Human-AI planning for complex goals remains challenging with current large language models (LLMs), which rely on linear chat histories and simplistic memory mechanisms. Despite advances in long-context prompting, users still manually manage information, leading to a high cognitive burden. Hence, we propose JumpStarter, a system that enables LLMs to collaborate with humans on complex goals by dynamically decomposing tasks to help users manage context. We specifically introduce task-structured context curation, a novel framework that breaks down a user's goal into a hierarchy of actionable subtasks, and scopes context to localized decision points, enabling finer-grained personalization and reuse. The framework is realized through three core mechanisms: context elicitation, selection, and reuse. We demonstrate that task-structured context curation significantly improves plan quality by 16% over ablations. Our user study shows that JumpStarter helped users generate plans with 79% higher quality compared to ChatGPT.
Submitted 22 May, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
DynEx: Dynamic Code Synthesis with Structured Design Exploration for Accelerated Exploratory Programming
Authors:
Jenny Ma,
Karthik Sreedhar,
Vivian Liu,
Pedro Alejandro Perez,
Sitong Wang,
Riya Sahni,
Lydia B. Chilton
Abstract:
Recent advancements in large language models have significantly expedited the process of generating front-end code. This allows users to rapidly prototype user interfaces and ideate through code, a process known as exploratory programming. However, existing LLM code generation tools focus more on technical implementation details than on finding the right design for a particular problem. We present DynEx, an LLM-based method for design exploration in accelerated exploratory programming. DynEx introduces a technique to explore the design space through a structured Design Matrix before creating the prototype with a modular, stepwise approach to LLM code generation. Code is generated sequentially, and users can test and approve each step before moving on to the next. A user study of 10 experts found that DynEx increased design exploration and enabled the creation of more complex and varied prototypes compared to a Claude Artifact baseline. We conclude with a discussion of the implications of design exploration for exploratory programming.
Submitted 7 February, 2025; v1 submitted 1 October, 2024;
originally announced October 2024.
-
Copying style, Extracting value: Illustrators' Perception of AI Style Transfer and its Impact on Creative Labor
Authors:
Julien Porquet,
Sitong Wang,
Lydia B. Chilton
Abstract:
Generative text-to-image models are disrupting the lives of creative professionals. Specifically, illustrators are threatened by models that claim to extract and reproduce their style. Yet, research on style transfer has rarely focused on their perspectives. We provided four illustrators with a model fine-tuned to their style and conducted semi-structured interviews about the model's successes, limitations, and potential uses. Evaluating their output, artists reported that style transfer successfully copies aesthetic fragments but is limited by content-style disentanglement and lacks the crucial emergent quality of their style. They also deemed the others' copies more successful. Understanding the results of style transfer as "boundary objects," we analyze how they can simultaneously be considered unsuccessful by artists and poised to replace their work by others. We connect our findings to critical HCI frameworks, demonstrating that style transfer, rather than merely a Creativity Support Tool, should also be understood as a supply chain optimization tool.
Submitted 30 January, 2025; v1 submitted 25 September, 2024;
originally announced September 2024.
-
DIDUP: Dynamic Iterative Development for UI Prototyping
Authors:
Jenny Ma,
Karthik Sreedhar,
Vivian Liu,
Sitong Wang,
Pedro Alejandro Perez,
Lydia B. Chilton
Abstract:
Large language models (LLMs) are remarkably good at writing code. A particularly valuable case of human-LLM collaboration is code-based UI prototyping, a method for creating interactive prototypes that allows users to view and fully engage with a user interface. We conduct a formative study of GPT Pilot, a leading LLM-generated code-prototyping system, and find that its inflexibility towards change once development has started leads to weaknesses in failure prevention and dynamic planning; it closely resembles the linear workflow of the waterfall model. We introduce DIDUP, a system for code-based UI prototyping that follows an iterative spiral model, which takes changes and iterations that come up during the development process into account. We propose three novel mechanisms for LLM-generated code-prototyping systems: (1) adaptive planning, where plans should be dynamic and reflect changes during implementation, (2) code injection, where the system should write a minimal amount of code and inject it instead of rewriting code so users have a better mental model of the code evolution, and (3) lightweight state management, a simplified version of source control so users can quickly revert to different working states. Together, these mechanisms enable users to rapidly develop and iterate on prototypes.
Submitted 11 July, 2024;
originally announced July 2024.
-
STORYSUMM: Evaluating Faithfulness in Story Summarization
Authors:
Melanie Subbiah,
Faisal Ladhak,
Akankshya Mishra,
Griffin Adams,
Lydia B. Chilton,
Kathleen McKeown
Abstract:
Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, STORYSUMM, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.
Submitted 1 April, 2025; v1 submitted 8 July, 2024;
originally announced July 2024.
-
LogoMotion: Visually-Grounded Code Synthesis for Creating and Editing Animation
Authors:
Vivian Liu,
Rubaiat Habib Kazi,
Li-Yi Wei,
Matthew Fisher,
Timothy Langlois,
Seth Walker,
Lydia Chilton
Abstract:
Creating animation takes time, effort, and technical expertise. To help novices with animation, we present LogoMotion, an AI code generation approach that helps users create semantically meaningful animation for logos. LogoMotion automatically generates animation code with a method called visually-grounded code synthesis and program repair. This method performs visual analysis, instantiates a design concept, and conducts visual checking to generate animation code. LogoMotion provides novices with code-connected AI editing widgets that help them edit the motion, grouping, and timing of their animation. In a comparison study on 276 animations, LogoMotion was found to produce more content-aware animation than an industry-leading tool. In a user evaluation (n=16) comparing against a prompt-only baseline, these code-connected widgets helped users edit animations with control, iteration, and creative expression.
Submitted 24 February, 2025; v1 submitted 11 May, 2024;
originally announced May 2024.
-
Scrolly2Reel: Retargeting Graphics for Social Media Using Narrative Beats
Authors:
Duy K. Nguyen,
Jenny Ma,
Pedro Alejandro Perez,
Lydia B. Chilton
Abstract:
Content retargeting is crucial for social media creators. Once great content is created, it is important to reach as broad an audience as possible. This is particularly important in journalism, where younger audiences are shifting away from print and towards short-video platforms. Many newspapers already create rich graphics for the web that they want to be able to reuse for social media. One example is scrollytelling sequences, or "scrollies": immersive articles with graphics like animation, charts, and 3D visualizations that appear as a user scrolls. We present a system that helps transform scrollies into social media videos. By using the scriptwriting concept of narrative beats to extract fundamental storytelling units, we can create videos that are more aligned with narration, and allow for better pacing and stylistic changes. Narrative beats are thus an important primitive for retargeting content so that it matches the style of a new medium while maintaining the cohesiveness of the original content.
Submitted 19 June, 2024; v1 submitted 26 March, 2024;
originally announced March 2024.
-
MoodSmith: Enabling Mood-Consistent Multimedia for AI-Generated Advocacy Campaigns
Authors:
Samia Menon,
Sitong Wang,
Lydia Chilton
Abstract:
Emotion is vital to information and message processing, playing a key role in attitude formation. Consequently, creating a mood that evokes an emotional response is essential to any compelling piece of outreach communication. Many nonprofits and charities, despite having established messages, face challenges in creating advocacy campaign videos for social media. It requires significant creative and cognitive efforts to ensure that videos achieve the desired mood across multiple dimensions: script, visuals, and audio. We introduce MoodSmith, an AI-powered system that helps users explore mood possibilities for their message and create advocacy campaigns that are mood-consistent across dimensions. To achieve this, MoodSmith uses emotive language and plotlines for scripts, artistic style and color palette for visuals, and positivity and energy for audio. Our studies show that MoodSmith can effectively achieve a variety of moods, and the produced videos are consistent across media dimensions.
Submitted 18 March, 2024;
originally announced March 2024.
-
Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers
Authors:
Melanie Subbiah,
Sean Zhang,
Lydia B. Chilton,
Kathleen McKeown
Abstract:
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.
Submitted 11 July, 2024; v1 submitted 1 March, 2024;
originally announced March 2024.
-
Not Just Novelty: A Longitudinal Study on Utility and Customization of an AI Workflow
Authors:
Tao Long,
Katy Ilonka Gero,
Lydia B. Chilton
Abstract:
Generative AI brings novel and impressive abilities to help people in everyday tasks. There are many AI workflows that solve real and complex problems by chaining AI outputs together with human interaction. Although there is an undeniable lure of AI, it is uncertain how useful generative AI workflows are after the novelty wears off. Additionally, workflows built with generative AI have the potential to be easily customized to fit users' individual needs, but do users take advantage of this? We conducted a three-week longitudinal study with 12 users to understand the familiarization and customization of generative AI tools for science communication. Our study revealed that there exists a familiarization phase, during which users were exploring the novel capabilities of the workflow and discovering which aspects they found useful. After this phase, users understood the workflow and were able to anticipate the outputs. Surprisingly, after familiarization the perceived utility of the system was rated higher than before, indicating that the perceived utility of AI is not just a novelty effect. The increase in benefits mainly comes from end-users' ability to customize prompts, and thus potentially appropriate the system to their own needs. This points to a future where generative AI systems can allow us to design for appropriation.
Submitted 31 May, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Simulating Human Strategic Behavior: Comparing Single and Multi-agent LLMs
Authors:
Karthik Sreedhar,
Lydia Chilton
Abstract:
When creating policies, plans, or designs for people, it is challenging for designers to foresee all of the ways in which people may reason and behave. Recently, Large Language Models (LLMs) have been shown to be able to simulate human reasoning. We extend this work by measuring LLMs' ability to simulate strategic reasoning in the ultimatum game, a classic economics bargaining experiment. Experimental evidence shows human strategic reasoning is complex; people will often choose to punish other players to enforce social norms even at personal expense. We test if LLMs can replicate this behavior in simulation, comparing two structures: single LLMs and multi-agent systems. We compare their abilities to (1) simulate human-like reasoning in the ultimatum game, (2) simulate two player personalities, greedy and fair, and (3) create robust strategies that are logically complete and consistent with personality. Our evaluation shows that multi-agent systems are more accurate than single LLMs (88 percent vs. 50 percent) in simulating human reasoning and actions for personality pairs. Thus, there is potential to use LLMs to simulate human strategic reasoning to help decision and policy-makers perform preliminary explorations of how people behave in systems.
Submitted 1 July, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
PodReels: Human-AI Co-Creation of Video Podcast Teasers
Authors:
Sitong Wang,
Zheng Ning,
Anh Truong,
Mira Dontcheva,
Dingzeyu Li,
Lydia B. Chilton
Abstract:
Video podcast teasers are short videos that can be shared on social media platforms to capture interest in the full episodes of a video podcast. These teasers enable long-form podcasters to reach new audiences and gain new followers. However, creating a compelling teaser from an hour-long episode is challenging. Selecting interesting clips requires significant mental effort; editing the chosen clips into a cohesive, well-produced teaser is time-consuming. To support the creation of video podcast teasers, we first investigate what makes a good teaser. We combine insights from both audience comments and creator interviews to determine a set of essential ingredients. We also identify a common workflow shared by creators during the process. Based on these findings, we introduce a human-AI co-creative tool called PodReels to assist video podcasters in creating teasers. Our user study shows that PodReels significantly reduces creators' mental demand and improves their efficiency in producing video podcast teasers.
Submitted 9 May, 2024; v1 submitted 9 November, 2023;
originally announced November 2023.
-
Eliciting Topic Hierarchies from Large Language Models
Authors:
Grace Li,
Tao Long,
Lydia B. Chilton
Abstract:
Current research has explored how Generative AI can support the brainstorming process for content creators, but a gap remains in exploring support-tools for the pre-writing process. Specifically, our research is focused on supporting users in finding topics at the right level of specificity for their audience. This process is called topic scoping. Topic scoping is a cognitively demanding task, requiring users to actively recall subtopics in a given domain. This manual approach also reduces the diversity of subtopics that a user is able to explore. We propose using Large Language Models (LLMs) to support the process of topic scoping by iteratively generating subtopics at increasing levels of specificity: dynamically creating topic hierarchies. We tested three different prompting strategies and found that increasing the amount of context included in the prompt improves subtopic generation by 20 percentage points. Finally, we discuss applications of this research in education, content creation, and product management.
Submitted 17 June, 2024; v1 submitted 30 October, 2023;
originally announced October 2023.
-
Challenges and Opportunities for the Design of Smart Speakers
Authors:
Tao Long,
Lydia B. Chilton
Abstract:
Advances in voice technology and voice user interfaces (VUIs) -- such as Alexa, Siri, and Google Home -- have opened up the potential for many new types of interaction. However, despite the potential of these devices reflected by the growing market and body of VUI research, there is a lingering sense that the technology is still underused. In this paper, we conducted a systematic literature review of 35 papers to identify and synthesize 127 VUI design guidelines into five themes. Additionally, we conducted semi-structured interviews with 15 smart speaker users to understand their use and non-use of the technology. From the interviews, we distill four design challenges that contribute the most to non-use. Based on their (non-)use, we identify four opportunity spaces for designers to explore such as focusing on information support while multitasking (cooking, driving, childcare, etc), incorporating users' mental models for smart speakers, and integrating calm design principles.
Submitted 9 June, 2023;
originally announced June 2023.
-
Tweetorial Hooks: Generative AI Tools to Motivate Science on Social Media
Authors:
Tao Long,
Dorothy Zhang,
Grace Li,
Batool Taraif,
Samia Menon,
Kynnedy Simone Smith,
Sitong Wang,
Katy Ilonka Gero,
Lydia B. Chilton
Abstract:
Communicating science and technology is essential for the public to understand and engage in a rapidly changing world. Tweetorials are an emerging phenomenon where experts explain STEM topics on social media in creative and engaging ways. However, STEM experts struggle to write an engaging "hook" in the first tweet that captures the reader's attention. We propose methods to use large language models (LLMs) to help users scaffold their process of writing a relatable hook for complex scientific topics. We demonstrate that LLMs can help writers find everyday experiences that are relatable and interesting to the public, avoid jargon, and spark curiosity. Our evaluation shows that the system reduces cognitive load and helps people write better hooks. Lastly, we discuss the importance of interactivity with LLMs to preserve the correctness, effectiveness, and authenticity of the writing.
Submitted 5 December, 2023; v1 submitted 20 May, 2023;
originally announced May 2023.
-
STORYWARS: A Dataset and Instruction Tuning Baselines for Collaborative Story Understanding and Generation
Authors:
Yulun Du,
Lydia Chilton
Abstract:
Collaborative stories, which are texts created through the collaborative efforts of multiple authors with different writing styles and intentions, pose unique challenges for NLP models. Understanding and generating such stories remains an underexplored area due to the lack of open-domain corpora. To address this, we introduce STORYWARS, a new dataset of over 40,000 collaborative stories written by 9,400 different authors from an online platform. We design 12 task types, comprising 7 understanding and 5 generation task types, on STORYWARS, deriving 101 diverse story-related tasks in total as a multi-task benchmark covering all fully-supervised, few-shot, and zero-shot scenarios. Furthermore, we present our instruction-tuned model, INSTRUCTSTORY, for the story tasks showing that instruction tuning, in addition to achieving superior results in zero-shot and few-shot scenarios, can also obtain the best performance on the fully-supervised tasks in STORYWARS, establishing strong multi-task benchmark performances on STORYWARS.
Submitted 14 May, 2023;
originally announced May 2023.
-
ReelFramer: Human-AI Co-Creation for News-to-Video Translation
Authors:
Sitong Wang,
Samia Menon,
Tao Long,
Keren Henderson,
Dingzeyu Li,
Kevin Crowston,
Mark Hansen,
Jeffrey V. Nickerson,
Lydia B. Chilton
Abstract:
Short videos on social media are the dominant way young people consume content. News outlets aim to reach audiences through news reels -- short videos conveying news -- but struggle to translate traditional journalistic formats into short, entertaining videos. To translate news into social media reels, we support journalists in reframing the narrative. In literature, narrative framing is a high-level structure that shapes the overall presentation of a story. We identified three narrative framings for reels that adapt social media norms but preserve news value, each with a different balance of information and entertainment. We introduce ReelFramer, a human-AI co-creative system that helps journalists translate print articles into scripts and storyboards. ReelFramer supports exploring multiple narrative framings to find one appropriate to the story. AI suggests foundational narrative details, including characters, plot, setting, and key information. ReelFramer also supports visual framing; AI suggests character and visual detail designs before generating a full storyboard. Our studies show that narrative framing introduces the necessary diversity to translate various articles into reels, and establishing foundational details helps generate scripts that are more relevant and coherent. We also discuss the benefits of using narrative framing and foundational details in content retargeting.
Submitted 10 March, 2024; v1 submitted 19 April, 2023;
originally announced April 2023.
-
Generative Disco: Text-to-Video Generation for Music Visualization
Authors:
Vivian Liu,
Tao Long,
Nathan Raw,
Lydia Chilton
Abstract:
Visuals can enhance our experience of music, owing to the way they can amplify the emotions and messages conveyed within it. However, creating music visualization is a complex, time-consuming, and resource-intensive process. We introduce Generative Disco, a generative AI system that helps generate music visualizations with large language models and text-to-video generation. The system helps users visualize music in intervals by finding prompts to describe the images that intervals start and end on and interpolating between them to the beat of the music. We introduce design patterns for improving these generated videos: transitions, which express shifts in color, time, subject, or style, and holds, which help focus the video on subjects. A study with professionals showed that transitions and holds were a highly expressive framework that enabled them to build coherent visual narratives. We conclude on the generalizability of these patterns and the potential of generated video for creative professionals.
Submitted 28 September, 2023; v1 submitted 17 April, 2023;
originally announced April 2023.
-
SafeText: A Benchmark for Exploring Physical Safety in Language Models
Authors:
Sharon Levy,
Emily Allaway,
Melanie Subbiah,
Lydia Chilton,
Desmond Patton,
Kathleen McKeown,
William Yang Wang
Abstract:
Understanding what constitutes safe text is an important issue in natural language processing and can often prevent the deployment of models deemed harmful and unsafe. One such type of safety that has been scarcely studied is commonsense physical safety, i.e. text that is not explicitly violent and requires additional commonsense knowledge to comprehend that it leads to physical harm. We create the first benchmark dataset, SafeText, comprising real-life scenarios with paired safe and physically unsafe pieces of advice. We utilize SafeText to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice. As a result, we argue for further studies of safety and the assessment of commonsense physical safety in models before release.
Submitted 18 October, 2022;
originally announced October 2022.
-
How Much is Performance Worth to Users? A Quantitative Approach
Authors:
Adam Hastings,
Lydia B. Chilton,
Simha Sethumadhavan
Abstract:
Architects and systems designers artfully balance multiple competing design constraints during the design process but are unable to translate between system metrics and end user experience. This work presents three methodologies to fill in this gap. The first is an incentive-compatible methodology that determines a "ground truth" measurement of users' value of speed in terms of US dollars, and finds that users would accept performance losses of 10%, 20%, and 30% to their personal computer in exchange for $2.27, $4.07, and $4.43 per day, respectively. However, while highly accurate, the methodology is a painstaking process and does not scale with large numbers of participants. To allow for scalability, we introduce a second methodology -- a lab-based simulation experiment -- which finds that users would accept a permanent performance loss of 10%, 20%, and 30% to their personal computer in exchange for $127, $169, and $823, respectively. Finally, to allow for even greater scalability, we introduce a third methodology -- a survey -- and observe that the lack of incentive compatibility and the lack of hands-on experience with throttled device performance skews the results significantly, thus demonstrating the need for lab-based or incentive-compatible study designs. By quantifying the tradeoff between user satisfaction and performance, we enable architects and systems designers to make more nuanced tradeoffs between design requirements.
Submitted 27 April, 2022;
originally announced April 2022.
-
Opal: Multimodal Image Generation for News Illustration
Authors:
Vivian Liu,
Han Qiao,
Lydia Chilton
Abstract:
Advances in multimodal AI have presented people with powerful ways to create images from text. Recent work has shown that text-to-image generations are able to represent a broad range of subjects and artistic styles. However, finding the right visual language for text prompts is difficult. In this paper, we address this challenge with Opal, a system that produces text-to-image generations for news illustration. Given an article, Opal guides users through a structured search for visual concepts and provides a pipeline allowing users to generate illustrations based on an article's tone, keywords, and related artistic styles. Our evaluation shows that Opal efficiently generates diverse sets of news illustrations, visual assets, and concept ideas. Users with Opal generated two times more usable results than users without. We discuss how structured exploration can help users better understand the capabilities of human AI co-creative systems.
Submitted 16 August, 2022; v1 submitted 19 April, 2022;
originally announced April 2022.
-
Eliciting Gestures for Novel Note-taking Interactions
Authors:
Katy Ilonka Gero,
Lydia B. Chilton,
Chris Melancon,
Mike Cleron
Abstract:
Handwriting recognition is improving in leaps and bounds, and this opens up new opportunities for stylus-based interactions. In particular, note-taking applications can become a more intelligent user interface, incorporating new features like autocomplete and integrated search. In this work we ran a gesture elicitation study, asking 21 participants to imagine how they would interact with an imaginary, intelligent note-taking application. We report agreement on the elicited gestures, finding that while existing common interactions are prevalent (like double taps and long presses) a number of more novel interactions (like dragging selected items to hotspots or using annotations) were also well-represented. We discuss the mental models participants drew on when explaining their gestures and what kind of feedback users might need to move to more stylus-centric interactions.
Submitted 22 December, 2021;
originally announced December 2021.
-
PopBlends: Strategies for Conceptual Blending with Large Language Models
Authors:
Sitong Wang,
Savvas Petridis,
Taeahn Kwon,
Xiaojuan Ma,
Lydia B. Chilton
Abstract:
Pop culture is an important aspect of communication. On social media people often post pop culture reference images that connect an event, product or other entity to a pop culture domain. Creating these images is a creative challenge that requires finding a conceptual connection between the users' topic and a pop culture domain. In cognitive theory, this task is called conceptual blending. We present a system called PopBlends that automatically suggests conceptual blends. The system explores three approaches that involve both traditional knowledge extraction methods and large language models. Our annotation study shows that all three methods provide connections with similar accuracy, but with very different characteristics. Our user study shows that people found twice as many blend suggestions as they did without the system, and with half the mental demand. We discuss the advantages of combining large language models with knowledge bases for supporting divergent and convergent thinking.
Submitted 19 February, 2023; v1 submitted 8 November, 2021;
originally announced November 2021.
-
Lightweight Decoding Strategies for Increasing Specificity
Authors:
Katy Ilonka Gero,
Chris Kedzie,
Savvas Petridis,
Lydia Chilton
Abstract:
Language models are known to produce vague and generic outputs. We propose two unsupervised decoding strategies based on either word-frequency or point-wise mutual information to increase the specificity of any model that outputs a probability distribution over its vocabulary at generation time. We test the strategies in a prompt completion task; with human evaluations, we find that both strategies increase the specificity of outputs with only modest decreases in sensibility. We also briefly present a summarization use case, where these strategies can produce more specific summaries.
Submitted 22 October, 2021;
originally announced October 2021.
-
Sparks: Inspiration for Science Writing using Language Models
Authors:
Katy Ilonka Gero,
Vivian Liu,
Lydia B. Chilton
Abstract:
Large-scale language models are rapidly improving, performing well on a wide variety of tasks with little to no customization. In this work we investigate how language models can support science writing, a challenging writing task that is both open-ended and highly constrained. We present a system for generating "sparks", sentences related to a scientific concept intended to inspire writers. We find that our sparks are more coherent and diverse than a competitive language model baseline, and approach a human-created gold standard. In a study with 13 PhD students writing on topics of their own selection, we find three main use cases of sparks: aiding with crafting detailed sentences, providing interesting angles to engage readers, and demonstrating common reader perspectives. We also report on the various reasons sparks were considered unhelpful, and discuss how we might improve language models as writing support tools.
Submitted 14 October, 2021;
originally announced October 2021.
-
Design Guidelines for Prompt Engineering Text-to-Image Generative Models
Authors:
Vivian Liu,
Lydia B. Chilton
Abstract:
Text-to-image generative models are a new and powerful way to generate visual artwork. However, the open-ended nature of text as interaction is double-edged; while users can input anything and have access to an infinite range of generations, they also must engage in brute-force trial and error with the text prompt when the result quality is poor. We conduct a study exploring what prompt keywords and model hyperparameters can help produce coherent outputs. In particular, we study prompts structured to include subject and style keywords and investigate success and failure modes of these prompts. Our evaluation of 5493 generations over the course of five experiments spans 51 abstract and concrete subjects as well as 51 abstract and figurative styles. From this evaluation, we present design guidelines that can help people produce better outcomes from text-to-image generative models.
Submitted 28 September, 2023; v1 submitted 14 September, 2021;
originally announced September 2021.
-
Hierarchical Summarization for Longform Spoken Dialog
Authors:
Daniel Li,
Thomas Chen,
Albert Tung,
Lydia Chilton
Abstract:
Every day we are surrounded by spoken dialog. This medium delivers rich diverse streams of information auditorily; however, systematically understanding dialog can often be non-trivial. Despite the pervasiveness of spoken dialog, automated speech understanding and quality information extraction remains markedly poor, especially when compared to written prose. Furthermore, compared to understanding text, auditory communication poses many additional challenges such as speaker disfluencies, informal prose styles, and lack of structure. These concerns all demonstrate the need for a distinctly speech tailored interactive system to help users understand and navigate the spoken language domain. While individual automatic speech recognition (ASR) and text summarization methods already exist, they are imperfect technologies; neither consider user purpose and intent nor address spoken language induced complications. Consequently, we design a two stage ASR and text summarization pipeline and propose a set of semantic segmentation and merging algorithms to resolve these speech modeling challenges. Our system enables users to easily browse and navigate content as well as recover from errors in these underlying technologies. Finally, we present an evaluation of the system which highlights user preference for hierarchical summarization as a tool to quickly skim audio and identify content of interest to the user.
Submitted 21 August, 2021;
originally announced August 2021.
-
Low-Level Linguistic Controls for Style Transfer and Content Preservation
Authors:
Katy Gero,
Chris Kedzie,
Jonathan Reeve,
Lydia Chilton
Abstract:
Despite the success of style transfer in image processing, it has seen limited progress in natural language generation. Part of the problem is that content is not as easily decoupled from style in the text domain. Curiously, in the field of stylometry, content does not figure prominently in practical methods of discriminating stylistic elements, such as authorship and genre. Rather, syntax and function words are the most salient features. Drawing on this work, we model style as a suite of low-level linguistic controls, such as frequency of pronouns, prepositions, and subordinate clause constructions. We train a neural encoder-decoder model to reconstruct reference sentences given only content words and the setting of the controls. We perform style transfer by keeping the content words fixed while adjusting the controls to be indicative of another style. In experiments, we show that the model reliably responds to the linguistic controls and perform both automatic and manual evaluations on style transfer. We find we can fool a style classifier 84% of the time, and that our model produces highly diverse and stylistically distinctive outputs. This work introduces a formal, extendable model of style that can add control to any neural text generation system.
Submitted 8 November, 2019;
originally announced November 2019.
-
Cicero: Multi-Turn, Contextual Argumentation for Accurate Crowdsourcing
Authors:
Quanze Chen,
Jonathan Bragg,
Lydia B. Chilton,
Daniel S. Weld
Abstract:
Traditional approaches for ensuring high quality crowdwork have failed to achieve high-accuracy on difficult problems. Aggregating redundant answers often fails on the hardest problems when the majority is confused. Argumentation has been shown to be effective in mitigating these drawbacks. However, existing argumentation systems only support limited interactions and show workers general justifications, not context-specific arguments targeted to their reasoning.
This paper presents Cicero, a new workflow that improves crowd accuracy on difficult tasks by engaging workers in multi-turn, contextual discussions through real-time, synchronous argumentation. Our experiments show that compared to previous argumentation systems which only improve the average individual worker accuracy by 6.8 percentage points on the Relation Extraction domain, our workflow achieves 16.7 percentage point improvement. Furthermore, previous argumentation approaches don't apply to tasks with many possible answers; in contrast, Cicero works well in these cases, raising accuracy from 66.7% to 98.8% on the Codenames domain.
Submitted 25 October, 2018;
originally announced October 2018.
-
The Labor Economics of Paid Crowdsourcing
Authors:
John Horton,
Lydia Chilton
Abstract:
Crowdsourcing is a form of "peer production" in which work traditionally performed by an employee is outsourced to an "undefined, generally large group of people in the form of an open call." We present a model of workers supplying labor to paid crowdsourcing projects. We also introduce a novel method for estimating a worker's reservation wage--the smallest wage a worker is willing to accept for a task and the key parameter in our labor supply model. It shows that the reservation wages of a sample of workers from Amazon's Mechanical Turk (AMT) are approximately log normally distributed, with a median wage of $1.38/hour. At the median wage, the point elasticity of extensive labor supply is 0.43. We discuss how to use our calibrated model to make predictions in applied work. Two experimental tests of the model show that many workers respond rationally to offered incentives. However, a non-trivial fraction of subjects appear to set earnings targets. These "target earners" consider not just the offered wage--which is what the rational model predicts--but also their proximity to earnings goals. Interestingly, a number of workers clearly prefer earning total amounts evenly divisible by 5, presumably because these amounts make good targets.
Submitted 5 January, 2010;
originally announced January 2010.