-
LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
Authors:
Ho Yin 'Sam' Ng,
Ting-Yao Hsu,
Aashish Anantha Ramakrishnan,
Branislav Kveton,
Nedim Lipka,
Franck Dernoncourt,
Dongwon Lee,
Tong Yu,
Sungchul Kim,
Ryan A. Rossi,
Ting-Hao 'Kenneth' Huang
Abstract:
Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite languag…
▽ More
Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite language models' personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document--each with its image, caption, and figure-mentioning paragraphs--as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.
△ Less
Submitted 17 June, 2025; v1 submitted 6 June, 2025;
originally announced June 2025.
-
From Selection to Generation: A Survey of LLM-based Active Learning
Authors:
Yu Xia,
Subhojyoti Mukherjee,
Zhouhang Xie,
Junda Wu,
Xintong Li,
Ryan Aponte,
Hanjia Lyu,
Joe Barrow,
Hongjie Chen,
Franck Dernoncourt,
Branislav Kveton,
Tong Yu,
Ruiyi Zhang,
Jiuxiang Gu,
Nesreen K. Ahmed,
Yu Wang,
Xiang Chen,
Hanieh Deilamsalehy,
Sungchul Kim,
Zhengmian Hu,
Yue Zhao,
Nedim Lipka,
Seunghyun Yoon,
Ting-Hao Kenneth Huang,
Zichao Wang
, et al. (9 additional authors not shown)
Abstract:
Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the incre…
▽ More
Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.
△ Less
Submitted 31 May, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
Prompting in the Dark: Assessing Human Performance in Prompt Engineering for Data Labeling When Gold Labels Are Absent
Authors:
Zeyu He,
Saniya Naphade,
Ting-Hao 'Kenneth' Huang
Abstract:
Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," w…
▽ More
Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," where users iteratively prompt LLMs to label data without using manually-labeled benchmarks. We developed PromptingSheet, a Google Sheets add-on that enables users to compose, revise, and iteratively label data through spreadsheets. Through a study with 20 participants, we found that prompting in the dark was highly unreliable-only 9 participants improved labeling accuracy after four or more iterations. Automated prompt optimization tools like DSPy also struggled when few gold labels were available. Our findings highlight the importance of gold labels and the needs, as well as the risks, of automated support in human prompt engineering, providing insights for future tool design.
△ Less
Submitted 16 February, 2025;
originally announced February 2025.
-
Using Contextually Aligned Online Reviews to Measure LLMs' Performance Disparities Across Language Varieties
Authors:
Zixin Tang,
Chieh-Yang Huang,
Tsung-Che Li,
Ho Yin Sam Ng,
Hen-Hsen Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms,…
▽ More
A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms, such as Booking.com, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, like reviews for the same hotel with the same rating using the same language (e.g., Mandarin Chinese) but different language varieties (e.g., Taiwan Mandarin, Mainland Mandarin). To prove this concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs in a sentiment analysis task. Our results show that LLMs consistently underperform in Taiwan Mandarin.
△ Less
Submitted 20 March, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
Authors:
Ting-Yao E. Hsu,
Yi-Li Hsu,
Shaurya Rohatgi,
Chieh-Yang Huang,
Ho Yin Sam Ng,
Ryan Rossi,
Sungchul Kim,
Tong Yu,
Lun-Wei Ku,
C. Lee Giles,
Ting-Hao K. Huang
Abstract:
Since the SciCap datasets launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SciCap Challenge took place, inviting global teams to use an expanded SciCap dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advan…
▽ More
Since the SciCap datasets launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SciCap Challenge took place, inviting global teams to use an expanded SciCap dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SciCap Challenge and details the performance of various models on its data, capturing a snapshot of the fields state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by authors. Following this key finding, we conducted detailed analyses to answer this question: Have advanced LMMs solved the task of generating captions for scientific figures?
△ Less
Submitted 18 February, 2025; v1 submitted 31 January, 2025;
originally announced January 2025.
-
Hashtag Re-Appropriation for Audience Control on Recommendation-Driven Social Media Xiaohongshu (rednote)
Authors:
Ruyuan Wan,
Lingbo Tong,
Tiffany Knearem,
Toby Jia-Jun Li,
Ting-Hao 'Kenneth' Huang,
Qunfang Wu
Abstract:
Algorithms have played a central role in personalized recommendations on social media. However, they also present significant obstacles for content creators trying to predict and manage their audience reach. This issue is particularly challenging for marginalized groups seeking to maintain safe spaces. Our study explores how women on Xiaohongshu (rednote), a recommendation-driven social platform,…
▽ More
Algorithms have played a central role in personalized recommendations on social media. However, they also present significant obstacles for content creators trying to predict and manage their audience reach. This issue is particularly challenging for marginalized groups seeking to maintain safe spaces. Our study explores how women on Xiaohongshu (rednote), a recommendation-driven social platform, proactively re-appropriate hashtags (e.g., #Baby Supplemental Food) by using them in posts unrelated to their literal meaning. The hashtags were strategically chosen from topics that would be uninteresting to the male audience they wanted to block. Through a mixed-methods approach, we analyzed the practice of hashtag re-appropriation based on 5,800 collected posts and interviewed 24 active users from diverse backgrounds to uncover users' motivations and reactions towards the re-appropriation. This practice highlights how users can reclaim agency over content distribution on recommendation-driven platforms, offering insights into self-governance within algorithmic-centered power structures.
△ Less
Submitted 3 March, 2025; v1 submitted 30 January, 2025;
originally announced January 2025.
-
Understanding How Paper Writers Use AI-Generated Captions in Figure Caption Writing
Authors:
Ho Yin,
Ng,
Ting-Yao Hsu,
Jiyoo Min,
Sungchul Kim,
Ryan A. Rossi,
Tong Yu,
Hyunggu Jung,
Ting-Hao 'Kenneth' Huang
Abstract:
Figures and their captions play a key role in scientific publications. However, despite their importance, many captions in published papers are poorly crafted, largely due to a lack of attention by paper authors. While prior AI research has explored caption generation, it has mainly focused on reader-centered use cases, where users evaluate generated captions rather than actively integrating them…
▽ More
Figures and their captions play a key role in scientific publications. However, despite their importance, many captions in published papers are poorly crafted, largely due to a lack of attention by paper authors. While prior AI research has explored caption generation, it has mainly focused on reader-centered use cases, where users evaluate generated captions rather than actively integrating them into their writing. This paper addresses this gap by investigating how paper authors incorporate AI-generated captions into their writing process through a user study involving 18 participants. Each participant rewrote captions for two figures from their own recently published work, using captions generated by state-of-the-art AI models as a resource. By analyzing video recordings of the writing process through interaction analysis, we observed that participants often began by copying and refining AI-generated captions. Paper writers favored longer, detail-rich captions that integrated textual and visual elements but found current AI models less effective for complex figures. These findings highlight the nuanced and diverse nature of figure caption composition, revealing design opportunities for AI systems to better support the challenges of academic writing.
△ Less
Submitted 10 January, 2025;
originally announced January 2025.
-
Multi-LLM Collaborative Caption Generation in Scientific Documents
Authors:
Jaeyoung Kim,
Jongho Lee,
Hong-Jun Choi,
Ting-Yao Hsu,
Chieh-Yang Huang,
Sungchul Kim,
Ryan Rossi,
Tong Yu,
Clyde Lee Giles,
Ting-Hao 'Kenneth' Huang,
Sungchul Choi
Abstract:
Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Mo…
▽ More
Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at https://github.com/teamreboott/MLBCAP
△ Less
Submitted 5 January, 2025;
originally announced January 2025.
-
CoCoLoFa: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds
Authors:
Min-Hsuan Yeh,
Ruyuan Wan,
Ting-Hao 'Kenneth' Huang
Abstract:
Detecting logical fallacies in texts can help users spot argument flaws, but automating this detection is not easy. Manually annotating fallacies in large-scale, real-world text data to create datasets for developing and validating detection models is costly. This paper introduces CoCoLoFa, the largest known logical fallacy dataset, containing 7,706 comments for 648 news articles, with each commen…
▽ More
Detecting logical fallacies in texts can help users spot argument flaws, but automating this detection is not easy. Manually annotating fallacies in large-scale, real-world text data to create datasets for developing and validating detection models is costly. This paper introduces CoCoLoFa, the largest known logical fallacy dataset, containing 7,706 comments for 648 news articles, with each comment labeled for fallacy presence and type. We recruited 143 crowd workers to write comments embodying specific fallacy types (e.g., slippery slope) in response to news articles. Recognizing the complexity of this writing task, we built an LLM-powered assistant into the workers' interface to aid in drafting and refining their comments. Experts rated the writing quality and labeling validity of CoCoLoFa as high and reliable. BERT-based models fine-tuned using CoCoLoFa achieved the highest fallacy detection (F1=0.86) and classification (F1=0.87) performance on its test set, outperforming the state-of-the-art LLMs. Our work shows that combining crowdsourcing and LLMs enables us to more effectively construct datasets for complex linguistic phenomena that crowd workers find challenging to produce on their own.
△ Less
Submitted 4 October, 2024;
originally announced October 2024.
-
A Learning-based Quadcopter Controller with Extreme Adaptation
Authors:
Dingqi Zhang,
Antonio Loquercio,
Jerry Tang,
Ting-Hao Wang,
Jitendra Malik,
Mark W. Mueller
Abstract:
This paper introduces a learning-based low-level controller for quadcopters, which adaptively controls quadcopters with significant variations in mass, size, and actuator capabilities. Our approach leverages a combination of imitation learning and reinforcement learning, creating a fast-adapting and general control framework for quadcopters that eliminates the need for precise model estimation or…
▽ More
This paper introduces a learning-based low-level controller for quadcopters, which adaptively controls quadcopters with significant variations in mass, size, and actuator capabilities. Our approach leverages a combination of imitation learning and reinforcement learning, creating a fast-adapting and general control framework for quadcopters that eliminates the need for precise model estimation or manual tuning. The controller estimates a latent representation of the vehicle's system parameters from sensor-action history, enabling it to adapt swiftly to diverse dynamics. Extensive evaluations in simulation demonstrate the controller's ability to generalize to unseen quadcopter parameters, with an adaptation range up to 16 times broader than the training set. In real-world tests, the controller is successfully deployed on quadcopters with mass differences of 3.7 times and propeller constants varying by more than 100 times, while also showing rapid adaptation to disturbances such as off-center payloads and motor failures. These results highlight the potential of our controller in extreme adaptation to simplify the design process and enhance the reliability of autonomous drone operations in unpredictable environments. The video and code are at: https://github.com/muellerlab/xadapt_ctrl
△ Less
Submitted 8 June, 2025; v1 submitted 19 September, 2024;
originally announced September 2024.
-
What Color Scheme is More Effective in Assisting Readers to Locate Information in a Color-Coded Article?
Authors:
Ho Yin Ng,
Zeyu He,
Ting-Hao 'Kenneth' Huang
Abstract:
Color coding, a technique assigning specific colors to cluster information types, has proven advantages in aiding human cognitive activities, especially reading and comprehension. The rise of Large Language Models (LLMs) has streamlined document coding, enabling simple automatic text labeling with various schemes. This has the potential to make color-coding more accessible and benefit more users.…
▽ More
Color coding, a technique assigning specific colors to cluster information types, has proven advantages in aiding human cognitive activities, especially reading and comprehension. The rise of Large Language Models (LLMs) has streamlined document coding, enabling simple automatic text labeling with various schemes. This has the potential to make color-coding more accessible and benefit more users. However, the impact of color choice on information seeking is understudied. We conducted a user study assessing various color schemes' effectiveness in LLM-coded text documents, standardizing contrast ratios to approximately 5.55:1 across schemes. Participants performed timed information-seeking tasks in color-coded scholarly abstracts. Results showed non-analogous and yellow-inclusive color schemes improved performance, with the latter also being more preferred by participants. These findings can inform better color scheme choices for text annotation. As LLMs advance document coding, we advocate for more research focusing on the "color" aspect of color-coding techniques.
△ Less
Submitted 26 August, 2024; v1 submitted 12 August, 2024;
originally announced August 2024.
-
Generating Educational Materials with Different Levels of Readability using LLMs
Authors:
Chieh-Yang Huang,
Jing Wei,
Ting-Hao 'Kenneth' Huang
Abstract:
This study introduces the leveled-text generation task, aiming to rewrite educational materials to specific readability levels while preserving meaning. We assess the capability of GPT-3.5, LLaMA-2 70B, and Mixtral 8x7B, to generate content at various readability levels through zero-shot and few-shot prompting. Evaluating 100 processed educational materials reveals that few-shot prompting signific…
▽ More
This study introduces the leveled-text generation task, aiming to rewrite educational materials to specific readability levels while preserving meaning. We assess the capability of GPT-3.5, LLaMA-2 70B, and Mixtral 8x7B, to generate content at various readability levels through zero-shot and few-shot prompting. Evaluating 100 processed educational materials reveals that few-shot prompting significantly improves performance in readability manipulation and information preservation. LLaMA-2 70B performs better in achieving the desired difficulty range, while GPT-3.5 maintains original meaning. However, manual inspection highlights concerns such as misinformation introduction and inconsistent edit distribution. These findings emphasize the need for further research to ensure the quality of generated educational content.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
How Does Conversation Length Impact User's Satisfaction? A Case Study of Length-Controlled Conversations with LLM-Powered Chatbots
Authors:
Shih-Hong Huang,
Ya-Fang Lin,
Zeyu He,
Chieh-Yang Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
Users can discuss a wide range of topics with large language models (LLMs), but they do not always prefer solving problems or getting information through lengthy conversations. This raises an intriguing HCI question: How does instructing LLMs to engage in longer or shorter conversations affect conversation quality? In this paper, we developed two Slack chatbots using GPT-4 with the ability to vary…
▽ More
Users can discuss a wide range of topics with large language models (LLMs), but they do not always prefer solving problems or getting information through lengthy conversations. This raises an intriguing HCI question: How does instructing LLMs to engage in longer or shorter conversations affect conversation quality? In this paper, we developed two Slack chatbots using GPT-4 with the ability to vary conversation lengths and conducted a user study. Participants asked the chatbots both highly and less conversable questions, engaging in dialogues with 0, 3, 5, and 7 conversational turns. We found that the conversation quality does not differ drastically across different conditions, while participants had mixed reactions. Our study demonstrates LLMs' ability to change conversation length and the potential benefits for users resulting from such changes, but we caution that changes in text form may not necessarily imply changes in quality or content.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
SciCapenter: Supporting Caption Composition for Scientific Figures with Machine-Generated Captions and Ratings
Authors:
Ting-Yao Hsu,
Chieh-Yang Huang,
Shih-Hong Huang,
Ryan Rossi,
Sungchul Kim,
Tong Yu,
C. Lee Giles,
Ting-Hao K. Huang
Abstract:
Crafting effective captions for figures is important. Readers heavily depend on these captions to grasp the figure's message. However, despite a well-developed set of AI technologies for figures and captions, these have rarely been tested for usefulness in aiding caption writing. This paper introduces SciCapenter, an interactive system that puts together cutting-edge AI technologies for scientific…
▽ More
Crafting effective captions for figures is important. Readers heavily depend on these captions to grasp the figure's message. However, despite a well-developed set of AI technologies for figures and captions, these have rarely been tested for usefulness in aiding caption writing. This paper introduces SciCapenter, an interactive system that puts together cutting-edge AI technologies for scientific figure captions to aid caption composition. SciCapenter generates a variety of captions for each figure in a scholarly article, providing scores and a comprehensive checklist to assess caption quality across multiple critical aspects, such as helpfulness, OCR mention, key takeaways, and visual properties reference. Users can directly edit captions in SciCapenter, resubmit for revised evaluations, and iteratively refine them. A user study with Ph.D. students indicates that SciCapenter significantly lowers the cognitive load of caption writing. Participants' feedback further offers valuable design insights for future systems aiming to enhance caption writing.
△ Less
Submitted 26 March, 2024;
originally announced March 2024.
-
If in a Crowdsourced Data Annotation Pipeline, a GPT-4
Authors:
Zeyu He,
Chieh-Yang Huang,
Chien-Kuang Cornelia Ding,
Shaurya Rohatgi,
Ting-Hao 'Kenneth' Huang
Abstract:
Recent studies indicated GPT-4 outperforms online crowd workers in data labeling accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies were criticized for deviating from standard crowdsourcing practices and emphasizing individual workers' performances over the whole data-annotation process. This paper compared GPT-4 and an ethical and well-executed MTurk pipeline, w…
▽ More
Recent studies indicated GPT-4 outperforms online crowd workers in data labeling accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies were criticized for deviating from standard crowdsourcing practices and emphasizing individual workers' performances over the whole data-annotation process. This paper compared GPT-4 and an ethical and well-executed MTurk pipeline, with 415 workers labeling 3,177 sentence segments from 200 scholarly articles using the CODA-19 scheme. Two worker interfaces yielded 127,080 labels, which were then used to infer the final labels through eight label-aggregation algorithms. Our evaluation showed that despite best practices, MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 achieved 83.6%. Interestingly, when combining GPT-4's labels with crowd labels collected via an advanced worker interface for aggregation, 2 out of the 8 algorithms achieved an even higher accuracy (87.5%, 87.0%). Further analysis suggested that, when the crowd's and GPT-4's labeling strengths are complementary, aggregating them could increase labeling accuracy.
△ Less
Submitted 28 June, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Inspo: Writing with Crowds Alongside AI
Authors:
Chieh-Yang Huang,
Sanjana Gautam,
Shannon McClellan Brooks,
Ya-Fang Lin,
Tiffany Knearem,
Ting-Hao 'Kenneth' Huang
Abstract:
The use of artificial intelligence (AI) to support creative writing has bloomed in recent years. However, it is less well understood how AI compares to on-demand human support. We explored how writers interact with both AI and crowd worker writing assistants in creative writing. We replicated the interface of the prior crowd-writing system, Heteroglossia, and developed Inspo, a text editor allowin…
▽ More
The use of artificial intelligence (AI) to support creative writing has bloomed in recent years. However, it is less well understood how AI compares to on-demand human support. We explored how writers interact with both AI and crowd worker writing assistants in creative writing. We replicated the interface of the prior crowd-writing system, Heteroglossia, and developed Inspo, a text editor allowing users to request suggestions from AI models and crowd workers. In a one-week deployment study involving eight creative writers, we examined how often participants selected crowd workers when fluent AI text generators were also available. Findings showed a consistent decline in crowd worker usage, with participants favoring AI due to its faster responses and more consistent quality. We conclude with suggestions for future systems, recommending designs that account for the unique strengths and weaknesses of human versus AI assistants, strategies to address automation bias, and sociocultural views of writing.
△ Less
Submitted 19 October, 2024; v1 submitted 28 November, 2023;
originally announced November 2023.
-
GPT-4 as an Effective Zero-Shot Evaluator for Scientific Figure Captions
Authors:
Ting-Yao Hsu,
Chieh-Yang Huang,
Ryan Rossi,
Sungchul Kim,
C. Lee Giles,
Ting-Hao K. Huang
Abstract:
There is growing interest in systems that generate captions for scientific figures. However, assessing these systems output poses a significant challenge. Human evaluation requires academic expertise and is costly, while automatic evaluation depends on often low-quality author-written captions. This paper investigates using large language models (LLMs) as a cost-effective, reference-free method fo…
▽ More
There is growing interest in systems that generate captions for scientific figures. However, assessing these systems output poses a significant challenge. Human evaluation requires academic expertise and is costly, while automatic evaluation depends on often low-quality author-written captions. This paper investigates using large language models (LLMs) as a cost-effective, reference-free method for evaluating figure captions. We first constructed SCICAP-EVAL, a human evaluation dataset that contains human judgments for 3,600 scientific figure captions, both original and machine-made, for 600 arXiv figures. We then prompted LLMs like GPT-4 and GPT-3 to score (1-6) each caption based on its potential to aid reader understanding, given relevant context such as figure-mentioning paragraphs. Results show that GPT-4, used as a zero-shot evaluator, outperformed all other models and even surpassed assessments made by Computer Science and Informatics undergraduates, achieving a Kendall correlation score of 0.401 with Ph.D. students rankings
△ Less
Submitted 23 October, 2023;
originally announced October 2023.
-
Location-Aware Visual Question Generation with Lightweight Models
Authors:
Nicholas Collin Suwono,
Justin Chih-Yao Chen,
Tun Min Hung,
Ting-Hao Kenneth Huang,
I-Bin Liao,
Yung-Hui Li,
Lun-Wei Ku,
Shao-Hua Sun
Abstract:
This work introduces a novel task, location-aware visual question generation (LocaVQG), which aims to generate engaging questions from data relevant to a particular geographical location. Specifically, we represent such location-aware information with surrounding images and a GPS coordinate. To tackle this task, we present a dataset generation pipeline that leverages GPT-4 to produce diverse and s…
▽ More
This work introduces a novel task, location-aware visual question generation (LocaVQG), which aims to generate engaging questions from data relevant to a particular geographical location. Specifically, we represent such location-aware information with surrounding images and a GPS coordinate. To tackle this task, we present a dataset generation pipeline that leverages GPT-4 to produce diverse and sophisticated questions. Then, we aim to learn a lightweight model that can address the LocaVQG task and fit on an edge device, such as a mobile phone. To this end, we propose a method which can reliably generate engaging questions from location-aware information. Our proposed method outperforms baselines regarding human evaluation (e.g., engagement, grounding, coherence) and automatic evaluation metrics (e.g., BERTScore, ROUGE-2). Moreover, we conduct extensive ablation studies to justify our proposed techniques for both generating the dataset and solving the task.
△ Less
Submitted 23 October, 2023;
originally announced October 2023.
-
Unmasking Nationality Bias: A Study of Human Perception of Nationalities in AI-Generated Articles
Authors:
Pranav Narayanan Venkit,
Sanjana Gautam,
Ruchi Panchanadikar,
Ting-Hao `Kenneth' Huang,
Shomir Wilson
Abstract:
We investigate the potential for nationality biases in natural language processing (NLP) models using human evaluation methods. Biased NLP models can perpetuate stereotypes and lead to algorithmic discrimination, posing a significant challenge to the fairness and justice of AI systems. Our study employs a two-step mixed-methods approach that includes both quantitative and qualitative analysis to i…
▽ More
We investigate the potential for nationality biases in natural language processing (NLP) models using human evaluation methods. Biased NLP models can perpetuate stereotypes and lead to algorithmic discrimination, posing a significant challenge to the fairness and justice of AI systems. Our study employs a two-step mixed-methods approach that includes both quantitative and qualitative analysis to identify and understand the impact of nationality bias in a text generation model. Through our human-centered quantitative analysis, we measure the extent of nationality bias in articles generated by AI sources. We then conduct open-ended interviews with participants, performing qualitative coding and thematic analysis to understand the implications of these biases on human readers. Our findings reveal that biased NLP models tend to replicate and amplify existing societal biases, which can translate to harm if used in a sociotechnical setting. The qualitative analysis from our interviews offers insights into the experience readers have when encountering such articles, highlighting the potential to shift a reader's perception of a country. These findings emphasize the critical role of public perception in shaping AI's impact on society and the need to correct biases in AI systems.
△ Less
Submitted 8 August, 2023;
originally announced August 2023.
-
Good Data, Large Data, or No Data? Comparing Three Approaches in Developing Research Aspect Classifiers for Biomedical Papers
Authors:
Shreya Chandrasekhar,
Chieh-Yang Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
The rapid growth of scientific publications, particularly during the COVID-19 pandemic, emphasizes the need for tools to help researchers efficiently comprehend the latest advancements. One essential part of understanding scientific literature is research aspect classification, which categorizes sentences in abstracts to Background, Purpose, Method, and Finding. In this study, we investigate the i…
▽ More
The rapid growth of scientific publications, particularly during the COVID-19 pandemic, emphasizes the need for tools to help researchers efficiently comprehend the latest advancements. One essential part of understanding scientific literature is research aspect classification, which categorizes sentences in abstracts to Background, Purpose, Method, and Finding. In this study, we investigate the impact of different datasets on model performance for the crowd-annotated CODA-19 research aspect classification task. Specifically, we explore the potential benefits of using the large, automatically curated PubMed 200K RCT dataset and evaluate the effectiveness of large language models (LLMs), such as LLaMA, GPT-3, ChatGPT, and GPT-4. Our results indicate that using the PubMed 200K RCT dataset does not improve performance for the CODA-19 task. We also observe that while GPT-4 performs well, it does not outperform the SciBERT model fine-tuned on the CODA-19 dataset, emphasizing the importance of a dedicated and task-aligned datasets dataset for the target task. Our code is available at https://github.com/Crowd-AI-Lab/CODA-19-exp.
△ Less
Submitted 7 June, 2023;
originally announced June 2023.
-
ConvXAI: Delivering Heterogeneous AI Explanations via Conversations to Support Human-AI Scientific Writing
Authors:
Hua Shen,
Chieh-Yang Huang,
Tongshuang Wu,
Ting-Hao 'Kenneth' Huang
Abstract:
Despite a surge collection of XAI methods, users still struggle to obtain required AI explanations. Previous research suggests chatbots as dynamic solutions, but the effective design of conversational XAI agents for practical human needs remains under-explored. This paper focuses on Conversational XAI for AI-assisted scientific writing tasks. Drawing from human linguistic theories and formative st…
▽ More
Despite a surge collection of XAI methods, users still struggle to obtain required AI explanations. Previous research suggests chatbots as dynamic solutions, but the effective design of conversational XAI agents for practical human needs remains under-explored. This paper focuses on Conversational XAI for AI-assisted scientific writing tasks. Drawing from human linguistic theories and formative studies, we identify four design rationales: "multifaceted", "controllability", "mix-initiative", "context-aware drill-down". We incorporate them into an interactive prototype, ConvXAI, which facilitates heterogeneous AI explanations for scientific writing through dialogue. In two studies with 21 users, ConvXAI outperforms a GUI-based baseline on improving human-perceived understanding and writing improvement. The paper further discusses the practical human usage patterns in interacting with ConvXAI for scientific co-writing.
△ Less
Submitted 27 October, 2023; v1 submitted 16 May, 2023;
originally announced May 2023.
-
Does Human Collaboration Enhance the Accuracy of Identifying LLM-Generated Deepfake Texts?
Authors:
Adaku Uchendu,
Jooyoung Lee,
Hua Shen,
Thai Le,
Ting-Hao 'Kenneth' Huang,
Dongwon Lee
Abstract:
Advances in Large Language Models (e.g., GPT-4, LLaMA) have improved the generation of coherent sentences resembling human writing on a large scale, resulting in the creation of so-called deepfake texts. However, this progress poses security and privacy concerns, necessitating effective solutions for distinguishing deepfake texts from human-written ones. Although prior works studied humans' abilit…
▽ More
Advances in Large Language Models (e.g., GPT-4, LLaMA) have improved the generation of coherent sentences resembling human writing on a large scale, resulting in the creation of so-called deepfake texts. However, this progress poses security and privacy concerns, necessitating effective solutions for distinguishing deepfake texts from human-written ones. Although prior works studied humans' ability to detect deepfake texts, none has examined whether "collaboration" among humans improves the detection of deepfake texts. In this study, to address this gap of understanding on deepfake texts, we conducted experiments with two groups: (1) nonexpert individuals from the AMT platform and (2) writing experts from the Upwork platform. The results demonstrate that collaboration among humans can potentially improve the detection of deepfake texts for both groups, increasing detection accuracies by 6.36% for non-experts and 12.76% for experts, respectively, compared to individuals' detection accuracies. We further analyze the explanations that humans used for detecting a piece of text as deepfake text, and find that the strongest indicator of deepfake texts is their lack of coherence and consistency. Our study provides useful insights for future tools and framework designs to facilitate the collaborative human detection of deepfake texts. The experiment datasets and AMT implementations are available at: https://github.com/huashen218/llm-deepfake-human-study.git
△ Less
Submitted 9 October, 2023; v1 submitted 3 April, 2023;
originally announced April 2023.
-
What Types of Questions Require Conversation to Answer? A Case Study of AskReddit Questions
Authors:
Shih-Hong Huang,
Chieh-Yang Huang,
Ya-Fang Lin,
Ting-Hao 'Kenneth' Huang
Abstract:
The proliferation of automated conversational systems such as chatbots, spoken-dialogue systems, and smart speakers, has significantly impacted modern digital life. However, these systems are primarily designed to provide answers to well-defined questions rather than to support users in exploring complex, ill-defined questions. In this paper, we aim to push the boundaries of conversational systems…
▽ More
The proliferation of automated conversational systems such as chatbots, spoken-dialogue systems, and smart speakers, has significantly impacted modern digital life. However, these systems are primarily designed to provide answers to well-defined questions rather than to support users in exploring complex, ill-defined questions. In this paper, we aim to push the boundaries of conversational systems by examining the types of nebulous, open-ended questions that can best be answered through conversation. We first sampled 500 questions from one million open-ended requests posted on AskReddit, and then recruited online crowd workers to answer eight inquiries about these questions. We also performed open coding to categorize the questions into 27 different domains. We found that the issues people believe require conversation to resolve satisfactorily are highly social and personal. Our work provides insights into how future research could be geared to align with users' needs.
△ Less
Submitted 3 April, 2023; v1 submitted 30 March, 2023;
originally announced March 2023.
-
Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization
Authors:
Chieh-Yang Huang,
Ting-Yao Hsu,
Ryan Rossi,
Ani Nenkova,
Sungchul Kim,
Gromit Yeuk-Yin Chan,
Eunyee Koh,
Clyde Lee Giles,
Ting-Hao 'Kenneth' Huang
Abstract:
Good figure captions help paper readers understand complex scientific figures. Unfortunately, even published papers often have poorly written captions. Automatic caption generation could aid paper writers by providing good starting captions that can be refined for better quality. Prior work often treated figure caption generation as a vision-to-language task. In this paper, we show that it can be…
▽ More
Good figure captions help paper readers understand complex scientific figures. Unfortunately, even published papers often have poorly written captions. Automatic caption generation could aid paper writers by providing good starting captions that can be refined for better quality. Prior work often treated figure caption generation as a vision-to-language task. In this paper, we show that it can be more effectively tackled as a text summarization task in scientific documents. We fine-tuned PEGASUS, a pre-trained abstractive summarization model, to specifically summarize figure-referencing paragraphs (e.g., "Figure 3 shows...") into figure captions. Experiments on large-scale arXiv figures show that our method outperforms prior vision methods in both automatic and human evaluations. We further conducted an in-depth investigation focused on two key challenges: (i) the common presence of low-quality author-written captions and (ii) the lack of clear standards for good captions. Our code and data are available at: https://github.com/Crowd-AI-Lab/Generating-Figure-Captions-as-a-Text-Summarization-Task.
△ Less
Submitted 11 August, 2023; v1 submitted 23 February, 2023;
originally announced February 2023.
-
Conveying the Predicted Future to Users: A Case Study of Story Plot Prediction
Authors:
Chieh-Yang Huang,
Saniya Naphade,
Kavya Laalasa Karanam,
Ting-Hao 'Kenneth' Huang
Abstract:
Creative writing is hard: Novelists struggle with writer's block daily. While automatic story generation has advanced recently, it is treated as a "toy task" for advancing artificial intelligence rather than helping people. In this paper, we create a system that produces a short description that narrates a predicted plot using existing story generation approaches. Our goal is to assist writers in…
▽ More
Creative writing is hard: Novelists struggle with writer's block daily. While automatic story generation has advanced recently, it is treated as a "toy task" for advancing artificial intelligence rather than helping people. In this paper, we create a system that produces a short description that narrates a predicted plot using existing story generation approaches. Our goal is to assist writers in crafting a consistent and compelling story arc. We conducted experiments on Amazon Mechanical Turk (AMT) to examine the quality of the generated story plots in terms of consistency and storiability. The results show that short descriptions produced by our frame-enhanced GPT-2 (FGPT-2) were rated as the most consistent and storiable among all models; FGPT-2's outputs even beat some random story snippets written by humans. Next, we conducted a preliminary user study using a story continuation task where AMT workers were given access to machine-generated story plots and asked to write a follow-up story. FGPT-2 could positively affect the writing process, though people favor other baselines more. Our study shed some light on the possibilities of future creative writing support systems beyond the scope of completing sentences. Our code is available at: https://github.com/appleternity/Story-Plot-Generation.
△ Less
Submitted 17 February, 2023;
originally announced February 2023.
-
Nationality Bias in Text Generation
Authors:
Pranav Narayanan Venkit,
Sanjana Gautam,
Ruchi Panchanadikar,
Ting-Hao 'Kenneth' Huang,
Shomir Wilson
Abstract:
Little attention is placed on analyzing nationality bias in language models, especially when nationality is highly used as a factor in increasing the performance of social NLP models. This paper examines how a text generation model, GPT-2, accentuates pre-existing societal biases about country-based demonyms. We generate stories using GPT-2 for various nationalities and use sensitivity analysis to…
▽ More
Little attention is placed on analyzing nationality bias in language models, especially when nationality is highly used as a factor in increasing the performance of social NLP models. This paper examines how a text generation model, GPT-2, accentuates pre-existing societal biases about country-based demonyms. We generate stories using GPT-2 for various nationalities and use sensitivity analysis to explore how the number of internet users and the country's economic status impacts the sentiment of the stories. To reduce the propagation of biases through large language models (LLM), we explore the debiasing method of adversarial triggering. Our results show that GPT-2 demonstrates significant bias against countries with lower internet users, and adversarial triggering effectively reduces the same.
△ Less
Submitted 14 February, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
Too Slow to Be Useful? On Incorporating Humans in the Loop of Smart Speakers
Authors:
Shih-Hong Huang,
Chieh-Yang Huang,
Yuxin Deng,
Hua Shen,
Szu-Chi Kuan,
Ting-Hao 'Kenneth' Huang
Abstract:
Real-time crowd-powered systems, such as Chorus/Evorus, VizWiz, and Apparition, have shown how incorporating humans into automated systems could supplement where the automatic solutions fall short. However, one unspoken bottleneck of applying such architectures to more scenarios is the longer latency of including humans in the loop of automated systems. For the applications that have hard constrai…
▽ More
Real-time crowd-powered systems, such as Chorus/Evorus, VizWiz, and Apparition, have shown how incorporating humans into automated systems could supplement where the automatic solutions fall short. However, one unspoken bottleneck of applying such architectures to more scenarios is the longer latency of including humans in the loop of automated systems. For the applications that have hard constraints in turnaround times, human-operated components' longer latency and large speed variation seem to be apparent deal breakers. This paper explicates and quantifies these limitations by using a human-powered text-based backend to hold conversations with users through a voice-only smart speaker. Smart speakers must respond to users' requests within seconds, so the workers behind the scenes only have a few seconds to compose answers. We measured the end-to-end system latency and the conversation quality with eight pairs of participants, showing the challenges and superiority of such systems.
△ Less
Submitted 7 December, 2022;
originally announced December 2022.
-
Multi-VQG: Generating Engaging Questions for Multiple Images
Authors:
Min-Hsuan Yeh,
Vicent Chen,
Ting-Hao 'Kenneth' Haung,
Lun-Wei Ku
Abstract:
Generating engaging content has drawn much recent attention in the NLP community. Asking questions is a natural way to respond to photos and promote awareness. However, most answers to questions in traditional question-answering (QA) datasets are factoids, which reduce individuals' willingness to answer. Furthermore, traditional visual question generation (VQG) confines the source data for questio…
▽ More
Generating engaging content has drawn much recent attention in the NLP community. Asking questions is a natural way to respond to photos and promote awareness. However, most answers to questions in traditional question-answering (QA) datasets are factoids, which reduce individuals' willingness to answer. Furthermore, traditional visual question generation (VQG) confines the source data for question generation to single images, resulting in a limited ability to comprehend time-series information of the underlying event. In this paper, we propose generating engaging questions from multiple images. We present MVQG, a new dataset, and establish a series of baselines, including both end-to-end and dual-stage architectures. Results show that building stories behind the image sequence enables models to generate engaging questions, which confirms our assumption that people typically construct a picture of the event in their minds before asking questions. These results open up an exciting challenge for visual-and-language models to implicitly construct a story behind a series of photos to allow for creativity and experience sharing and hence draw attention to downstream applications.
△ Less
Submitted 17 November, 2022; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Let's Talk! Striking Up Conversations via Conversational Visual Question Generation
Authors:
Shih-Han Chan,
Tsai-Lun Yang,
Yun-Wei Chu,
Chi-Yang Hsu,
Ting-Hao Huang,
Yu-Shian Chiu,
Lun-Wei Ku
Abstract:
An engaging and provocative question can open up a great conversation. In this work, we explore a novel scenario: a conversation agent views a set of the user's photos (for example, from social media platforms) and asks an engaging question to initiate a conversation with the user. The existing vision-to-question models mostly generate tedious and obvious questions, which might not be ideals conve…
▽ More
An engaging and provocative question can open up a great conversation. In this work, we explore a novel scenario: a conversation agent views a set of the user's photos (for example, from social media platforms) and asks an engaging question to initiate a conversation with the user. The existing vision-to-question models mostly generate tedious and obvious questions, which might not be ideals conversation starters. This paper introduces a two-phase framework that first generates a visual story for the photo set and then uses the story to produce an interesting question. The human evaluation shows that our framework generates more response-provoking questions for starting conversations than other vision-to-question baselines.
△ Less
Submitted 19 May, 2022;
originally announced May 2022.
-
Empathy-Centric Design At Scale
Authors:
Andrea Mauri,
Yen-Chia Hsu,
Marco Brambilla,
Aisling Ann O'Kane,
Ting-Hao 'Kenneth' Huang,
Himanshu Verma
Abstract:
EmpathiCH aims at bringing together and blend different expertise to develop new research agenda in the context of "Empathy-Centric Design at Scale". The main research question is to investigate how new technologies can contribute to the elicitation of empathy across and within multiple stakeholders at scale; and how empathy can be used to design solutions to societal problems that are not only ef…
▽ More
EmpathiCH aims at bringing together and blend different expertise to develop new research agenda in the context of "Empathy-Centric Design at Scale". The main research question is to investigate how new technologies can contribute to the elicitation of empathy across and within multiple stakeholders at scale; and how empathy can be used to design solutions to societal problems that are not only effective but also balanced, inclusive, and aware of their effect on society. Through presentations, participatory sessions, and a living experiment -- where data about the peoples' interactions is collected throughout the event -- we aim to make this workshop the ideal venue to foster collaboration, build networks, and shape the future direction of "Empathy-Centric Design at Scale".
△ Less
Submitted 13 April, 2022;
originally announced April 2022.
-
Are Shortest Rationales the Best Explanations for Human Understanding?
Authors:
Hua Shen,
Tongshuang Wu,
Wenbo Guo,
Ting-Hao 'Kenneth' Huang
Abstract:
Existing self-explaining models typically favor extracting the shortest possible rationales - snippets of an input text "responsible for" corresponding output - to explain the model prediction, with the assumption that shorter rationales are more intuitive to humans. However, this assumption has yet to be validated. Is the shortest rationale indeed the most human-understandable? To answer this que…
▽ More
Existing self-explaining models typically favor extracting the shortest possible rationales - snippets of an input text "responsible for" corresponding output - to explain the model prediction, with the assumption that shorter rationales are more intuitive to humans. However, this assumption has yet to be validated. Is the shortest rationale indeed the most human-understandable? To answer this question, we design a self-explaining model, LimitedInk, which allows users to extract rationales at any target length. Compared to existing baselines, LimitedInk achieves compatible end-task performance and human-annotated rationale agreement, making it a suitable representation of the recent class of self-explaining models. We use LimitedInk to conduct a user study on the impact of rationale length, where we ask human judges to predict the sentiment label of documents based only on LimitedInk-generated rationales with different lengths. We show rationales that are too short do not help humans predict labels better than randomly masked text, suggesting the need for more careful design of the best human rationales.
△ Less
Submitted 16 March, 2022;
originally announced March 2022.
-
SciCap: Generating Captions for Scientific Figures
Authors:
Ting-Yao Hsu,
C. Lee Giles,
Ting-Hao 'Kenneth' Huang
Abstract:
Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientif…
▽ More
Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientific figures. To this end, we introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. After pre-processing - including figure-type classification, sub-figure identification, text normalization, and caption text selection - SCICAP contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2%) figure type. The experimental results showed both opportunities and steep challenges of generating captions for scientific figures.
△ Less
Submitted 25 October, 2021; v1 submitted 22 October, 2021;
originally announced October 2021.
-
Empowering Local Communities Using Artificial Intelligence
Authors:
Yen-Chia Hsu,
Ting-Hao 'Kenneth' Huang,
Himanshu Verma,
Andrea Mauri,
Illah Nourbakhsh,
Alessandro Bozzon
Abstract:
Artificial Intelligence (AI) is increasingly used to analyze large amounts of data in various practices, such as object recognition. We are specifically interested in using AI-powered systems to engage local communities in developing plans or solutions for pressing societal and environmental concerns. Such local contexts often involve multiple stakeholders with different and even contradictory age…
▽ More
Artificial Intelligence (AI) is increasingly used to analyze large amounts of data in various practices, such as object recognition. We are specifically interested in using AI-powered systems to engage local communities in developing plans or solutions for pressing societal and environmental concerns. Such local contexts often involve multiple stakeholders with different and even contradictory agendas, resulting in mismatched expectations of these systems' behaviors and desired outcomes. There is a need to investigate if AI models and pipelines can work as expected in different contexts through co-creation and field deployment. Based on case studies in co-creating AI-powered systems with local people, we explain challenges that require more attention and provide viable paths to bridge AI research with citizen needs. We advocate for developing new collaboration approaches and mindsets that are needed to co-create AI-powered systems in multi-stakeholder contexts to address local concerns.
△ Less
Submitted 26 April, 2022; v1 submitted 5 October, 2021;
originally announced October 2021.
-
FinQA: A Dataset of Numerical Reasoning over Financial Data
Authors:
Zhiyu Chen,
Wenhu Chen,
Charese Smiley,
Sameena Shah,
Iana Borova,
Dylan Langdon,
Reema Moussa,
Matt Beane,
Ting-Hao Huang,
Bryan Routledge,
William Yang Wang
Abstract:
The sheer volume of financial statements makes it difficult for humans to access and analyze a business's financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks on general domain, the finance…
▽ More
The sheer volume of financial statements makes it difficult for humans to access and analyze a business's financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks on general domain, the finance domain includes complex numerical reasoning and understanding of heterogeneous representations. To facilitate analytical progress, we propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts. We also annotate the gold reasoning programs to ensure full explainability. We further introduce baselines and conduct comprehensive experiments in our dataset. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge and in complex multi-step numerical reasoning on that knowledge. Our dataset -- the first of its kind -- should therefore enable significant, new community research into complex application domains. The dataset and code are publicly available\url{https://github.com/czyssrs/FinQA}.
△ Less
Submitted 7 May, 2022; v1 submitted 31 August, 2021;
originally announced September 2021.
-
ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences
Authors:
Yanjun Gao,
Ting-hao Huang,
Rebecca J. Passonneau
Abstract:
Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence i…
▽ More
Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves comparable performance as two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.
△ Less
Submitted 22 June, 2021;
originally announced June 2021.
-
Plot and Rework: Modeling Storylines for Visual Storytelling
Authors:
Chi-Yang Hsu,
Yun-Wei Chu,
Ting-Hao 'Kenneth' Huang,
Lun-Wei Ku
Abstract:
Writing a coherent and engaging story is not easy. Creative writers use their knowledge and worldview to put disjointed elements together to form a coherent storyline, and work and rework iteratively toward perfection. Automated visual storytelling (VIST) models, however, make poor use of external knowledge and iterative generation when attempting to create stories. This paper introduces PR-VIST,…
▽ More
Writing a coherent and engaging story is not easy. Creative writers use their knowledge and worldview to put disjointed elements together to form a coherent storyline, and work and rework iteratively toward perfection. Automated visual storytelling (VIST) models, however, make poor use of external knowledge and iterative generation when attempting to create stories. This paper introduces PR-VIST, a framework that represents the input image sequence as a story graph in which it finds the best path to form a storyline. PR-VIST then takes this path and learns to generate the final story via an iterative training process. This framework produces stories that are superior in terms of diversity, coherence, and humanness, per both automatic and human evaluations. An ablation study shows that both plotting and reworking contribute to the model's superiority.
△ Less
Submitted 7 July, 2021; v1 submitted 14 May, 2021;
originally announced May 2021.
-
Semantic Frame Forecast
Authors:
Chieh-Yang Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
This paper introduces semantic frame forecast, a task that predicts the semantic frames that will occur in the next 10, 100, or even 1,000 sentences in a running story. Prior work focused on predicting the immediate future of a story, such as one to a few sentences ahead. However, when novelists write long stories, generating a few sentences is not enough to help them gain high-level insight to de…
▽ More
This paper introduces semantic frame forecast, a task that predicts the semantic frames that will occur in the next 10, 100, or even 1,000 sentences in a running story. Prior work focused on predicting the immediate future of a story, such as one to a few sentences ahead. However, when novelists write long stories, generating a few sentences is not enough to help them gain high-level insight to develop the follow-up story. In this paper, we formulate a long story as a sequence of "story blocks," where each block contains a fixed number of sentences (e.g., 10, 100, or 200). This formulation allows us to predict the follow-up story arc beyond the scope of a few sentences. We represent a story block using the term frequencies (TF) of semantic frames in it, normalized by each frame's inverse document frequency (IDF). We conduct semantic frame forecast experiments on 4,794 books from the Bookcorpus and 7,962 scientific abstracts from CODA-19, with block sizes ranging from 5 to 1,000 sentences. The results show that automated models can forecast the follow-up story blocks better than the random, prior, and replay baselines, indicating the task's feasibility. We also learn that the models using the frame representation as features outperform all the existing approaches when the block size is over 150 sentences. The human evaluation also shows that the proposed frame representation, when visualized as word clouds, is comprehensible, representative, and specific to humans. Our code is available at https://github.com/appleternity/FrameForecasting.
△ Less
Submitted 12 April, 2021;
originally announced April 2021.
-
Explaining the Road Not Taken
Authors:
Hua Shen,
Ting-Hao 'Kenneth' Huang
Abstract:
It is unclear if existing interpretations of deep neural network models respond effectively to the needs of users. This paper summarizes the common forms of explanations (such as feature attribution, decision rules, or probes) used in over 200 recent papers about natural language processing (NLP), and compares them against user questions collected in the XAI Question Bank. We found that although u…
▽ More
It is unclear if existing interpretations of deep neural network models respond effectively to the needs of users. This paper summarizes the common forms of explanations (such as feature attribution, decision rules, or probes) used in over 200 recent papers about natural language processing (NLP), and compares them against user questions collected in the XAI Question Bank. We found that although users are interested in explanations for the road not taken -- namely, why the model chose one result and not a well-defined, seemly similar legitimate counterpart -- most model interpretations cannot answer these questions.
△ Less
Submitted 30 March, 2021; v1 submitted 27 March, 2021;
originally announced March 2021.
-
Assessing the Helpfulness of Learning Materials with Inference-Based Learner-Like Agent
Authors:
Yun-Hsuan Jen,
Chieh-Yang Huang,
Mei-Hua Chen,
Ting-Hao 'Kenneth' Huang,
Lun-Wei Ku
Abstract:
Many English-as-a-second language learners have trouble using near-synonym words (e.g., small vs.little; briefly vs.shortly) correctly, and often look for example sentences to learn how two nearly synonymous terms differ. Prior work uses hand-crafted scores to recommend sentences but has difficulty in adopting such scores to all the near-synonyms as near-synonyms differ in various ways. We notice…
▽ More
Many English-as-a-second language learners have trouble using near-synonym words (e.g., small vs.little; briefly vs.shortly) correctly, and often look for example sentences to learn how two nearly synonymous terms differ. Prior work uses hand-crafted scores to recommend sentences but has difficulty in adopting such scores to all the near-synonyms as near-synonyms differ in various ways. We notice that the helpfulness of the learning material would reflect on the learners' performance. Thus, we propose the inference-based learner-like agent to mimic learner behavior and identify good learning materials by examining the agent's performance. To enable the agent to behave like a learner, we leverage entailment modeling's capability of inferring answers from the provided materials. Experimental results show that the proposed agent is equipped with good learner-like behavior to achieve the best performance in both fill-in-the-blank (FITB) and good example sentence selection tasks. We further conduct a classroom user study with college ESL learners. The results of the user study show that the proposed agent can find out example sentences that help students learn more easily and efficiently. Compared to other models, the proposed agent improves the score of more than 17% of students after learning.
△ Less
Submitted 5 October, 2020;
originally announced October 2020.
-
How Useful Are the Machine-Generated Interpretations to General Users? A Human Evaluation on Guessing the Incorrectly Predicted Labels
Authors:
Hua Shen,
Ting-Hao Kenneth Huang
Abstract:
Explaining to users why automated systems make certain mistakes is important and challenging. Researchers have proposed ways to automatically produce interpretations for deep neural network models. However, it is unclear how useful these interpretations are in helping users figure out why they are getting an error. If an interpretation effectively explains to users how the underlying deep neural n…
▽ More
Explaining to users why automated systems make certain mistakes is important and challenging. Researchers have proposed ways to automatically produce interpretations for deep neural network models. However, it is unclear how useful these interpretations are in helping users figure out why they are getting an error. If an interpretation effectively explains to users how the underlying deep neural network model works, people who were presented with the interpretation should be better at predicting the model's outputs than those who were not. This paper presents an investigation on whether or not showing machine-generated visual interpretations helps users understand the incorrectly predicted labels produced by image classifiers. We showed the images and the correct labels to 150 online crowd workers and asked them to select the incorrectly predicted labels with or without showing them the machine-generated visual interpretations. The results demonstrated that displaying the visual interpretations did not increase, but rather decreased, the average guessing accuracy by roughly 10%.
△ Less
Submitted 27 August, 2020; v1 submitted 26 August, 2020;
originally announced August 2020.
-
Project RISE: Recognizing Industrial Smoke Emissions
Authors:
Yen-Chia Hsu,
Ting-Hao 'Kenneth' Huang,
Ting-Yao Hu,
Paul Dille,
Sean Prendi,
Ryan Hoffman,
Anastasia Tsuhlares,
Jessica Pachuta,
Randy Sargent,
Illah Nourbakhsh
Abstract:
Industrial smoke emissions pose a significant concern to human health. Prior works have shown that using Computer Vision (CV) techniques to identify smoke as visual evidence can influence the attitude of regulators and empower citizens to pursue environmental justice. However, existing datasets are not of sufficient quality nor quantity to train the robust CV models needed to support air quality a…
▽ More
Industrial smoke emissions pose a significant concern to human health. Prior works have shown that using Computer Vision (CV) techniques to identify smoke as visual evidence can influence the attitude of regulators and empower citizens to pursue environmental justice. However, existing datasets are not of sufficient quality nor quantity to train the robust CV models needed to support air quality advocacy. We introduce RISE, the first large-scale video dataset for Recognizing Industrial Smoke Emissions. We adopted a citizen science approach to collaborate with local community members to annotate whether a video clip has smoke emissions. Our dataset contains 12,567 clips from 19 distinct views from cameras that monitored three industrial facilities. These daytime clips span 30 days over two years, including all four seasons. We ran experiments using deep neural networks to establish a strong performance baseline and reveal smoke recognition challenges. Our survey study discussed community feedback, and our data analysis displayed opportunities for integrating citizen scientists and crowd workers into the application of Artificial Intelligence for Social Impact.
△ Less
Submitted 29 April, 2024; v1 submitted 12 May, 2020;
originally announced May 2020.
-
CODA-19: Using a Non-Expert Crowd to Annotate Research Aspects on 10,000+ Abstracts in the COVID-19 Open Research Dataset
Authors:
Ting-Hao 'Kenneth' Huang,
Chieh-Yang Huang,
Chien-Kuang Cornelia Ding,
Yen-Chia Hsu,
C. Lee Giles
Abstract:
This paper introduces CODA-19, a human-annotated dataset that codes the Background, Purpose, Method, Finding/Contribution, and Other sections of 10,966 English abstracts in the COVID-19 Open Research Dataset. CODA-19 was created by 248 crowd workers from Amazon Mechanical Turk within 10 days, and achieved labeling quality comparable to that of experts. Each abstract was annotated by nine different…
▽ More
This paper introduces CODA-19, a human-annotated dataset that codes the Background, Purpose, Method, Finding/Contribution, and Other sections of 10,966 English abstracts in the COVID-19 Open Research Dataset. CODA-19 was created by 248 crowd workers from Amazon Mechanical Turk within 10 days, and achieved labeling quality comparable to that of experts. Each abstract was annotated by nine different workers, and the final labels were acquired by majority vote. The inter-annotator agreement (Cohen's kappa) between the crowd and the biomedical expert (0.741) is comparable to inter-expert agreement (0.788). CODA-19's labels have an accuracy of 82.2% when compared to the biomedical expert's labels, while the accuracy between experts was 85.0%. Reliable human annotations help scientists access and integrate the rapidly accelerating coronavirus literature, and also serve as the battery of AI/NLP research, but obtaining expert annotations can be slow. We demonstrated that a non-expert crowd can be rapidly employed at scale to join the fight against COVID-19.
△ Less
Submitted 17 September, 2020; v1 submitted 5 May, 2020;
originally announced May 2020.
-
Heteroglossia: In-Situ Story Ideation with the Crowd
Authors:
Chieh-Yang Huang,
Shih-Hong Huang,
Ting-Hao 'Kenneth' Huang
Abstract:
Ideation is essential for creative writing. Many authors struggle to come up with ideas throughout the writing process, yet modern writing tools fail to provide on-the-spot assistance for writers when they get stuck. This paper introduces Heteroglossia, an add-on for Google Docs that allows writers to elicit story ideas from the online crowd using their text editors. Writers can share snippets of…
▽ More
Ideation is essential for creative writing. Many authors struggle to come up with ideas throughout the writing process, yet modern writing tools fail to provide on-the-spot assistance for writers when they get stuck. This paper introduces Heteroglossia, an add-on for Google Docs that allows writers to elicit story ideas from the online crowd using their text editors. Writers can share snippets of their working drafts and ask the crowd to provide follow-up story ideas based on it. Heteroglossia employs a strategy called "role play", where each worker is assigned a fictional character in a story and asked to brainstorm plot ideas from that character's perspective. Our deployment with two experienced story writers shows that Heteroglossia is easy to use and can generate interesting ideas. Heteroglossia allows us to gain insight into how future technologies can be developed to support ideation in creative writing
△ Less
Submitted 15 January, 2020; v1 submitted 13 January, 2020;
originally announced January 2020.
-
Smell Pittsburgh: Engaging Community Citizen Science for Air Quality
Authors:
Yen-Chia Hsu,
Jennifer Cross,
Paul Dille,
Michael Tasota,
Beatrice Dias,
Randy Sargent,
Ting-Hao 'Kenneth' Huang,
Illah Nourbakhsh
Abstract:
Urban air pollution has been linked to various human health concerns, including cardiopulmonary diseases. Communities who suffer from poor air quality often rely on experts to identify pollution sources due to the lack of accessible tools. Taking this into account, we developed Smell Pittsburgh, a system that enables community members to report odors and track where these odors are frequently conc…
▽ More
Urban air pollution has been linked to various human health concerns, including cardiopulmonary diseases. Communities who suffer from poor air quality often rely on experts to identify pollution sources due to the lack of accessible tools. Taking this into account, we developed Smell Pittsburgh, a system that enables community members to report odors and track where these odors are frequently concentrated. All smell report data are publicly accessible online. These reports are also sent to the local health department and visualized on a map along with air quality data from monitoring stations. This visualization provides a comprehensive overview of the local pollution landscape. Additionally, with these reports and air quality data, we developed a model to predict upcoming smell events and send push notifications to inform communities. We also applied regression analysis to identify statistically significant effects of push notifications on user engagement. Our evaluation of this system demonstrates that engaging residents in documenting their experiences with pollution odors can help identify local air pollution patterns, and can empower communities to advocate for better air quality. All citizen-contributed smell data are publicly accessible and can be downloaded from https://smellpgh.org.
△ Less
Submitted 20 November, 2020; v1 submitted 26 December, 2019;
originally announced December 2019.
-
Knowledge-Enriched Visual Storytelling
Authors:
Chao-Chun Hsu,
Zi-Yuan Chen,
Chi-Yang Hsu,
Chih-Chia Li,
Tzu-Yuan Lin,
Ting-Hao 'Kenneth' Huang,
Lun-Wei Ku
Abstract:
Stories are diverse and highly personalized, resulting in a large possible output space for story generation. Existing end-to-end approaches produce monotonous stories because they are limited to the vocabulary and knowledge in a single training dataset. This paper introduces KG-Story, a three-stage framework that allows the story generation model to take advantage of external Knowledge Graphs to…
▽ More
Stories are diverse and highly personalized, resulting in a large possible output space for story generation. Existing end-to-end approaches produce monotonous stories because they are limited to the vocabulary and knowledge in a single training dataset. This paper introduces KG-Story, a three-stage framework that allows the story generation model to take advantage of external Knowledge Graphs to produce interesting stories. KG-Story distills a set of representative words from the input prompts, enriches the word set by using external knowledge graphs, and finally generates stories based on the enriched word set. This distill-enrich-generate framework allows the use of external resources not only for the enrichment phase, but also for the distillation and generation phases. In this paper, we show the superiority of KG-Story for visual storytelling, where the input prompt is a sequence of five photos and the output is a short story. Per the human ranking evaluation, stories generated by KG-Story are on average ranked better than that of the state-of-the-art systems. Our code and output stories are available at https://github.com/zychen423/KE-VIST.
△ Less
Submitted 3 December, 2019;
originally announced December 2019.
-
On Automating Conversations
Authors:
Ting-Hao 'Kenneth' Huang
Abstract:
From 2016 to 2018, we developed and deployed Chorus, a system that blends real-time human computation with artificial intelligence (AI) and has real-world, open conversations with users. We took a top-down approach that started with a working crowd-powered system, Chorus, and then created a framework, Evorus, that enables Chorus to automate itself over time. Over our two-year deployment, more than…
▽ More
From 2016 to 2018, we developed and deployed Chorus, a system that blends real-time human computation with artificial intelligence (AI) and has real-world, open conversations with users. We took a top-down approach that started with a working crowd-powered system, Chorus, and then created a framework, Evorus, that enables Chorus to automate itself over time. Over our two-year deployment, more than 420 users talked with Chorus, having over 2,200 conversation sessions. This line of work demonstrated how a crowd-powered conversational assistant can be automated over time, and more importantly, how such a system can be deployed to talk with real users to help them with their everyday tasks. This position paper discusses two sets of challenges that we explored during the development and deployment of Chorus and Evorus: the challenges that come from being an "agent" and those that arise from the subset of conversations that are more difficult to automate.
△ Less
Submitted 24 October, 2019; v1 submitted 21 October, 2019;
originally announced October 2019.
-
On Using Chatbots to Promote Smoking Cessation Among Adolescents of Low Socioeconomic Status
Authors:
Patricia Simon,
Suchitra Krishnan-Sarin,
Ting-Hao 'Kenneth' Huang
Abstract:
Reducing youth tobacco use is critical for improving child health since tobacco use is associated with respiratory problems, and nicotine may interfere with healthy brain development. While tobacco regulation has contributed to declines in cigarette use among youth, these declines have occurred more quickly for youth of high socioeconomic status (SES) compared to youth of low SES. A major barrier…
▽ More
Reducing youth tobacco use is critical for improving child health since tobacco use is associated with respiratory problems, and nicotine may interfere with healthy brain development. While tobacco regulation has contributed to declines in cigarette use among youth, these declines have occurred more quickly for youth of high socioeconomic status (SES) compared to youth of low SES. A major barrier to smoking cessation for adolescents of low SES is coordination of access and transportation to in-person treatment sessions. Low-SES youth may have family obligations that limit their ability to access in-person treatment. At the same time, mobile use among adolescents is high: 85% have smartphones. Additionally, adolescents engage in texting at high rates, suggesting that they are well-suited for mobile instant messaging interventions. Mobile interventions have shown promise for youth, but their use remains low. Thus, more research is needed to develop effective and engaging mobile interventions to increase quit rates. In this paper, we provide a brief review of approaches to adolescent smoking cessation and describe the promise of chatbots for smoking cessation.
△ Less
Submitted 19 October, 2019;
originally announced October 2019.
-
InstructableCrowd: Creating IF-THEN Rules for Smartphones via Conversations with the Crowd
Authors:
Ting-Hao 'Kenneth' Huang,
Amos Azaria,
Oscar J. Romero,
Jeffrey P. Bigham
Abstract:
Natural language interfaces have become a common part of modern digital life. Chatbots utilize text-based conversations to communicate with users; personal assistants on smartphones such as Google Assistant take direct speech commands from their users; and speech-controlled devices such as Amazon Echo use voice as their only input mode. In this paper, we introduce InstructableCrowd, a crowd-powere…
▽ More
Natural language interfaces have become a common part of modern digital life. Chatbots utilize text-based conversations to communicate with users; personal assistants on smartphones such as Google Assistant take direct speech commands from their users; and speech-controlled devices such as Amazon Echo use voice as their only input mode. In this paper, we introduce InstructableCrowd, a crowd-powered system that allows users to program their devices via conversation. The user verbally expresses a problem to the system, in which a group of crowd workers collectively respond and program relevant multi-part IF-THEN rules to help the user. The IF-THEN rules generated by InstructableCrowd connect relevant sensor combinations (e.g., location, weather, device acceleration, etc.) to useful effectors (e.g., text messages, device alarms, etc.). Our study showed that non-programmers can use the conversational interface of InstructableCrowd to create IF-THEN rules that have similar quality compared with the rules created manually. InstructableCrowd generally illustrates how users may converse with their devices, not only to trigger simple voice commands, but also to personalize their increasingly powerful and complicated devices.
△ Less
Submitted 12 September, 2019;
originally announced September 2019.
-
Visual Story Post-Editing
Authors:
Ting-Yao Hsu,
Chieh-Yang Huang,
Yen-Chia Hsu,
Ting-Hao 'Kenneth' Huang
Abstract:
We introduce the first dataset for human edits of machine-generated visual stories and explore how these collected edits may be used for the visual story post-editing task. The dataset, VIST-Edit, includes 14,905 human edited versions of 2,981 machine-generated visual stories. The stories were generated by two state-of-the-art visual storytelling models, each aligned to 5 human-edited versions. We…
▽ More
We introduce the first dataset for human edits of machine-generated visual stories and explore how these collected edits may be used for the visual story post-editing task. The dataset, VIST-Edit, includes 14,905 human edited versions of 2,981 machine-generated visual stories. The stories were generated by two state-of-the-art visual storytelling models, each aligned to 5 human-edited versions. We establish baselines for the task, showing how a relatively small set of human edits can be leveraged to boost the performance of large visual storytelling models. We also discuss the weak correlation between automatic evaluation scores and human ratings, motivating the need for new automatic metrics.
△ Less
Submitted 4 June, 2019;
originally announced June 2019.
-
Dixit: Interactive Visual Storytelling via Term Manipulation
Authors:
Chao-Chun Hsu,
Yu-Hua Chen,
Zi-Yuan Chen,
Hsin-Yu Lin,
Ting-Hao 'Kenneth' Huang,
Lun-Wei Ku
Abstract:
In this paper, we introduce Dixit, an interactive visual storytelling system that the user interacts with iteratively to compose a short story for a photo sequence. The user initiates the process by uploading a sequence of photos. Dixit first extracts text terms from each photo which describe the objects (e.g., boy, bike) or actions (e.g., sleep) in the photo, and then allows the user to add new t…
▽ More
In this paper, we introduce Dixit, an interactive visual storytelling system that the user interacts with iteratively to compose a short story for a photo sequence. The user initiates the process by uploading a sequence of photos. Dixit first extracts text terms from each photo which describe the objects (e.g., boy, bike) or actions (e.g., sleep) in the photo, and then allows the user to add new terms or remove existing terms. Dixit then generates a short story based on these terms. Behind the scenes, Dixit uses an LSTM-based model trained on image caption data and FrameNet to distill terms from each image and utilizes a transformer decoder to compose a context-coherent story. Users change images or terms iteratively with Dixit to create the most ideal story. Dixit also allows users to manually edit and rate stories. The proposed procedure opens up possibilities for interpretable and controllable visual storytelling, allowing users to understand the story formation rationale and to intervene in the generation process.
△ Less
Submitted 31 May, 2019; v1 submitted 6 March, 2019;
originally announced March 2019.