-
Measuring Sycophancy of Language Models in Multi-turn Dialogues
Authors:
Jiseung Hong,
Grace Byun,
Seungone Kim,
Kai Shu
Abstract:
Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy--conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark fo…
▽ More
Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy--conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.
△ Less
Submitted 28 May, 2025;
originally announced May 2025.
-
Real-World Deployment of Cloud Autonomous Mobility System Using 5G Networks for Outdoor and Indoor Environments
Authors:
Yufeng Yang,
Minghao Ning,
Keqi Shu,
Aladdin Saleh,
Ehsan Hashemi,
Amir Khajepour
Abstract:
The growing complexity of both outdoor and indoor mobility systems demands scalable, cost-effective, and reliable perception and communication frameworks. This work presents the real-world deployment and evaluation of a Cloud Autonomous Mobility (CAM) system that leverages distributed sensor nodes connected via 5G networks, which integrates LiDAR- and camera-based perception at infrastructure unit…
▽ More
The growing complexity of both outdoor and indoor mobility systems demands scalable, cost-effective, and reliable perception and communication frameworks. This work presents the real-world deployment and evaluation of a Cloud Autonomous Mobility (CAM) system that leverages distributed sensor nodes connected via 5G networks, which integrates LiDAR- and camera-based perception at infrastructure units, cloud computing for global information fusion, and Ultra-Reliable Low Latency Communications (URLLC) to enable real-time situational awareness and autonomous operation. The CAM system is deployed in two distinct environments: a dense urban roundabout and a narrow indoor hospital corridor. Field experiments show improved traffic monitoring, hazard detection, and asset management capabilities. The paper also discusses practical deployment challenges and shares key insights for scaling CAM systems. The results highlight the potential of cloud-based infrastructure perception to advance both outdoor and indoor intelligent transportation systems.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
SACM: SEEG-Audio Contrastive Matching for Chinese Speech Decoding
Authors:
Hongbin Wang,
Zhihong Jia,
Yuanzhong Shen,
Ziwei Wang,
Siyang Li,
Kai Shu,
Feng Hu,
Dongrui Wu
Abstract:
Speech disorders such as dysarthria and anarthria can severely impair the patient's ability to communicate verbally. Speech decoding brain-computer interfaces (BCIs) offer a potential alternative by directly translating speech intentions into spoken words, serving as speech neuroprostheses. This paper reports an experimental protocol for Mandarin Chinese speech decoding BCIs, along with the corres…
▽ More
Speech disorders such as dysarthria and anarthria can severely impair the patient's ability to communicate verbally. Speech decoding brain-computer interfaces (BCIs) offer a potential alternative by directly translating speech intentions into spoken words, serving as speech neuroprostheses. This paper reports an experimental protocol for Mandarin Chinese speech decoding BCIs, along with the corresponding decoding algorithms. Stereo-electroencephalography (SEEG) and synchronized audio data were collected from eight drug-resistant epilepsy patients as they conducted a word-level reading task. The proposed SEEG and Audio Contrastive Matching (SACM), a contrastive learning-based framework, achieved decoding accuracies significantly exceeding chance levels in both speech detection and speech decoding tasks. Electrode-wise analysis revealed that a single sensorimotor cortex electrode achieved performance comparable to that of the full electrode array. These findings provide valuable insights for developing more accurate online speech decoding BCIs.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
Decision Making in Urban Traffic: A Game Theoretic Approach for Autonomous Vehicles Adhering to Traffic Rules
Authors:
Keqi Shu,
Minghao Ning,
Ahmad Alghooneh,
Shen Li,
Mohammad Pirani,
Amir Khajepour
Abstract:
One of the primary challenges in urban autonomous vehicle decision-making and planning lies in effectively managing intricate interactions with diverse traffic participants characterized by unpredictable movement patterns. Additionally, interpreting and adhering to traffic regulations within rapidly evolving traffic scenarios pose significant hurdles. This paper proposed a rule-based autonomous ve…
▽ More
One of the primary challenges in urban autonomous vehicle decision-making and planning lies in effectively managing intricate interactions with diverse traffic participants characterized by unpredictable movement patterns. Additionally, interpreting and adhering to traffic regulations within rapidly evolving traffic scenarios pose significant hurdles. This paper proposed a rule-based autonomous vehicle decision-making and planning framework which extracts right-of-way from traffic rules to generate behavioural parameters, integrating them to effectively adhere to and navigate through traffic regulations. The framework considers the strong interaction between traffic participants mathematically by formulating the decision-making and planning problem into a differential game. By finding the Nash equilibrium of the problem, the autonomous vehicle is able to find optimal decisions. The proposed framework was tested under simulation as well as full-size vehicle platform, the results show that the ego vehicle is able to safely interact with surrounding traffic participants while adhering to traffic rules.
△ Less
Submitted 15 May, 2025;
originally announced May 2025.
-
SAP-CoPE: Social-Aware Planning using Cooperative Pose Estimation with Infrastructure Sensor Nodes
Authors:
Minghao Ning,
Yufeng Yang,
Shucheng Huang,
Jiaming Zhong,
Keqi Shu,
Chen Sun,
Ehsan Hashemi,
Amir Khajepour
Abstract:
Autonomous driving systems must operate safely in human-populated indoor environments, where challenges such as limited perception and occlusion sensitivity arise when relying solely on onboard sensors. These factors generate difficulties in the accurate recognition of human intentions and the generation of comfortable, socially aware trajectories. To address these issues, we propose SAP-CoPE, a s…
▽ More
Autonomous driving systems must operate safely in human-populated indoor environments, where challenges such as limited perception and occlusion sensitivity arise when relying solely on onboard sensors. These factors generate difficulties in the accurate recognition of human intentions and the generation of comfortable, socially aware trajectories. To address these issues, we propose SAP-CoPE, a social-aware planning framework that integrates cooperative infrastructure with a novel 3D human pose estimation method and a model predictive control-based controller. This real-time framework formulates an optimization problem that accounts for uncertainty propagation in the camera projection matrix while ensuring human joint coherence. The proposed method is adaptable to single- or multi-camera configurations and can incorporate sparse LiDAR point-cloud data. To enhance safety and comfort in human environments, we integrate a human personal space field based on human pose into a model predictive controller, enabling the system to navigate while avoiding discomfort zones. Extensive evaluations in both simulated and real-world settings demonstrate the effectiveness of our approach in generating socially aware trajectories for autonomous systems.
△ Less
Submitted 8 April, 2025;
originally announced April 2025.
-
TransNet: Transfer Knowledge for Few-shot Knowledge Graph Completion
Authors:
Lihui Liu,
Zihao Wang,
Dawei Zhou,
Ruijie Wang,
Yuchen Yan,
Bo Xiong,
Sihong He,
Kai Shu,
Hanghang Tong
Abstract:
Knowledge graphs (KGs) are ubiquitous and widely used in various applications. However, most real-world knowledge graphs are incomplete, which significantly degrades their performance on downstream tasks. Additionally, the relationships in real-world knowledge graphs often follow a long-tail distribution, meaning that most relations are represented by only a few training triplets. To address these…
▽ More
Knowledge graphs (KGs) are ubiquitous and widely used in various applications. However, most real-world knowledge graphs are incomplete, which significantly degrades their performance on downstream tasks. Additionally, the relationships in real-world knowledge graphs often follow a long-tail distribution, meaning that most relations are represented by only a few training triplets. To address these challenges, few-shot learning has been introduced. Few-shot KG completion aims to make accurate predictions for triplets involving novel relations when only a limited number of training triplets are available. Although many methods have been proposed, they typically learn each relation individually, overlooking the correlations between different tasks and the relevant information in previously trained tasks. In this paper, we propose a transfer learning-based few-shot KG completion method (TransNet). By learning the relationships between different tasks, TransNet effectively transfers knowledge from similar tasks to improve the current task's performance. Furthermore, by employing meta-learning, TransNet can generalize effectively to new, unseen relations. Extensive experiments on benchmark datasets demonstrate the superiority of TransNet over state-of-the-art methods. Code can be found at https://github.com/lihuiliullh/TransNet/tree/main
△ Less
Submitted 29 March, 2025;
originally announced April 2025.
-
Can Multimodal LLMs Perform Time Series Anomaly Detection?
Authors:
Xiongxiao Xu,
Haoran Wang,
Yueqing Liang,
Philip S. Yu,
Yue Zhao,
Kai Shu
Abstract:
Large language models (LLMs) have been increasingly used in time series analysis. However, the potential of multimodal LLMs (MLLMs), particularly vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: C…
▽ More
Large language models (LLMs) have been increasingly used in time series analysis. However, the potential of multimodal LLMs (MLLMs), particularly vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: Can multimodal LLMs perform time series anomaly detection? To answer this, we propose VisualTimeAnomaly benchmark to evaluate MLLMs in time series anomaly detection (TSAD). Our approach transforms time series numerical data into the image format and feed these images into various MLLMs, including proprietary models (GPT-4o and Gemini-1.5) and open-source models (LLaVA-NeXT and Qwen2-VL), each with one larger and one smaller variant. In total, VisualTimeAnomaly contains 12.4k time series images spanning 3 scenarios and 3 anomaly granularities with 9 anomaly types across 8 MLLMs. Starting with the univariate case (point- and range-wise anomalies), we extend our evaluation to more practical scenarios, including multivariate and irregular time series scenarios, and variate-wise anomalies. Our study reveals several key insights:
1) MLLMs detect range- and variate-wise anomalies more effectively than point-wise anomalies.
2) MLLMs are highly robust to irregular time series, even with 25% of the data missing.
3) Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs demonstrate superior effectiveness on multivariate time series.
To the best of our knowledge, this is the first work to comprehensively investigate MLLMs for TSAD, particularly for multivariate and irregular time series scenarios. We release our dataset and code at https://github.com/mllm-ts/VisualTimeAnomaly to support future research.
△ Less
Submitted 24 February, 2025;
originally announced February 2025.
-
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
Authors:
Yue Huang,
Chujie Gao,
Siyuan Wu,
Haoran Wang,
Xiangqi Wang,
Yujun Zhou,
Yanbo Wang,
Jiayi Ye,
Jiawen Shi,
Qihui Zhang,
Yuan Li,
Han Bao,
Zhaoyi Liu,
Tianrui Guan,
Dongping Chen,
Ruoxi Chen,
Kehan Guo,
Andy Zou,
Bryan Hooi Kuen-Yew,
Caiming Xiong,
Elias Stengel-Eskin,
Hongyang Zhang,
Hongzhi Yin,
Huan Zhang,
Huaxiu Yao
, et al. (41 additional authors not shown)
Abstract:
Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, a…
▽ More
Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, as well as industry practices and standards. Based on this analysis, we propose a set of guiding principles for GenFMs, developed through extensive multidisciplinary collaboration that integrates technical, ethical, legal, and societal perspectives. Second, we introduce TrustGen, the first dynamic benchmarking platform designed to evaluate trustworthiness across multiple dimensions and model types, including text-to-image, large language, and vision-language models. TrustGen leverages modular components--metadata curation, test case generation, and contextual variation--to enable adaptive and iterative assessments, overcoming the limitations of static evaluation methods. Using TrustGen, we reveal significant progress in trustworthiness while identifying persistent challenges. Finally, we provide an in-depth discussion of the challenges and future directions for trustworthy GenFMs, which reveals the complex, evolving nature of trustworthiness, highlighting the nuanced trade-offs between utility and trustworthiness, and consideration for various downstream applications, identifying persistent challenges and providing a strategic roadmap for future research. This work establishes a holistic framework for advancing trustworthiness in GenAI, paving the way for safer and more responsible integration of GenFMs into critical applications. To facilitate advancement in the community, we release the toolkit for dynamic evaluation.
△ Less
Submitted 11 May, 2025; v1 submitted 20 February, 2025;
originally announced February 2025.
-
Benchmarking LLMs for Political Science: A United Nations Perspective
Authors:
Yueqing Liang,
Liangwei Yang,
Chen Wang,
Congying Xia,
Rui Meng,
Xiongxiao Xu,
Haoran Wang,
Ali Payani,
Kai Shu
Abstract:
Large Language Models (LLMs) have achieved significant advances in natural language processing, yet their potential for high-stake political decision-making remains largely unexplored. This paper addresses the gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where the stakes are particularly high and political decisions can have far-reaching consequenc…
▽ More
Large Language Models (LLMs) have achieved significant advances in natural language processing, yet their potential for high-stake political decision-making remains largely unexplored. This paper addresses the gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where the stakes are particularly high and political decisions can have far-reaching consequences. We introduce a novel dataset comprising publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. Using this dataset, we propose the United Nations Benchmark (UNBench), the first comprehensive benchmark designed to evaluate LLMs across four interconnected political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation. These tasks span the three stages of the UN decision-making process--drafting, voting, and discussing--and aim to assess LLMs' ability to understand and simulate political dynamics. Our experimental analysis demonstrates the potential and challenges of applying LLMs in this domain, providing insights into their strengths and limitations in political science. This work contributes to the growing intersection of AI and political science, opening new avenues for research and practical applications in global governance. The UNBench Repository can be accessed at: https://github.com/yueqingliang1/UNBench.
△ Less
Submitted 19 February, 2025;
originally announced February 2025.
-
Understanding and Tackling Label Errors in Individual-Level Nature Language Understanding
Authors:
Yunpeng Xiao,
Youpeng Zhao,
Kai Shu
Abstract:
Natural language understanding (NLU) is a task that enables machines to understand human language. Some tasks, such as stance detection and sentiment analysis, are closely related to individual subjective perspectives, thus termed individual-level NLU. Previously, these tasks are often simplified to text-level NLU tasks, ignoring individual factors. This not only makes inference difficult and unex…
▽ More
Natural language understanding (NLU) is a task that enables machines to understand human language. Some tasks, such as stance detection and sentiment analysis, are closely related to individual subjective perspectives, thus termed individual-level NLU. Previously, these tasks are often simplified to text-level NLU tasks, ignoring individual factors. This not only makes inference difficult and unexplainable but often results in a large number of label errors when creating datasets. To address the above limitations, we propose a new NLU annotation guideline based on individual-level factors. Specifically, we incorporate other posts by the same individual and then annotate individual subjective perspectives after considering all individual posts. We use this guideline to expand and re-annotate the stance detection and topic-based sentiment analysis datasets. We find that error rates in the samples were as high as 31.7\% and 23.3\%. We further use large language models to conduct experiments on the re-annotation datasets and find that the large language models perform well on both datasets after adding individual factors. Both GPT-4o and Llama3-70B can achieve an accuracy greater than 87\% on the re-annotation datasets. We also verify the effectiveness of individual factors through ablation studies. We call on future researchers to add individual factors when creating such datasets. Our re-annotation dataset can be found at https://github.com/24yearsoldstudent/Individual-NLU
△ Less
Submitted 18 February, 2025;
originally announced February 2025.
-
Beyond Minimax Optimality: A Subgame Perfect Gradient Method
Authors:
Benjamin Grimmer,
Kevin Shu,
Alex L. Wang
Abstract:
The study of unconstrained convex optimization has historically been concerned with worst-case a priori convergence rates. The development of the Optimized Gradient Method (OGM), due to Drori and Teboulle, Kim and Fessler, marked a major milestone in this study, as OGM achieves the optimal worst-case convergence rate among all gradient-span first-order methods. However, this notion of worst-case o…
▽ More
The study of unconstrained convex optimization has historically been concerned with worst-case a priori convergence rates. The development of the Optimized Gradient Method (OGM), due to Drori and Teboulle, Kim and Fessler, marked a major milestone in this study, as OGM achieves the optimal worst-case convergence rate among all gradient-span first-order methods. However, this notion of worst-case optimality is relatively coarse and allows OGM to have worst-case performance even on instances where stronger convergence guarantees are possible. For example, OGM is known to converge at its worst-case rate even on the toy example $Lx^2/$, where exact convergence in just two steps is possible.
We introduce a notion of optimality which is stronger than minimax optimality that requires a method to give optimal dynamic guarantees that exploit any "non-adversarialness" in the first-order oracle's reported information. We then give an algorithm which achieves this stronger optimality notion: the Subgame Perfect Gradient Method (SPGM). SPGM is a refinement of OGM whose update rules and convergence guarantees are dynamically computed in response to first-order information seen during the algorithm's execution. From a game-theoretic viewpoint, OGM can be seen as one side of a Nash Equilibrium for the "minimization game" whereas SPGM can be seen as one side of a Subgame Perfect Equilibrium for the same game. We also show that SPGM can be implemented with minimal computational and storage overhead in each iteration and provide a Julia implementation.
△ Less
Submitted 27 January, 2025; v1 submitted 9 December, 2024;
originally announced December 2024.
-
Graph with Sequence: Broad-Range Semantic Modeling for Fake News Detection
Authors:
Junwei Yin,
Min Gao,
Kai Shu,
Wentao Li,
Yinqiu Huang,
Zongwei Wang
Abstract:
The rapid proliferation of fake news on social media threatens social stability, creating an urgent demand for more effective detection methods. While many promising approaches have emerged, most rely on content analysis with limited semantic depth, leading to suboptimal comprehension of news content.To address this limitation, capturing broader-range semantics is essential yet challenging, as it…
▽ More
The rapid proliferation of fake news on social media threatens social stability, creating an urgent demand for more effective detection methods. While many promising approaches have emerged, most rely on content analysis with limited semantic depth, leading to suboptimal comprehension of news content.To address this limitation, capturing broader-range semantics is essential yet challenging, as it introduces two primary types of noise: fully connecting sentences in news graphs often adds unnecessary structural noise, while highly similar but authenticity-irrelevant sentences introduce feature noise, complicating the detection process. To tackle these issues, we propose BREAK, a broad-range semantics model for fake news detection that leverages a fully connected graph to capture comprehensive semantics while employing dual denoising modules to minimize both structural and feature noise. The semantic structure denoising module balances the graph's connectivity by iteratively refining it between two bounds: a sequence-based structure as a lower bound and a fully connected graph as the upper bound. This refinement uncovers label-relevant semantic interrelations structures. Meanwhile, the semantic feature denoising module reduces noise from similar semantics by diversifying representations, aligning distinct outputs from the denoised graph and sequence encoders using KL-divergence to achieve feature diversification in high-dimensional space. The two modules are jointly optimized in a bi-level framework, enhancing the integration of denoised semantics into a comprehensive representation for detection. Extensive experiments across four datasets demonstrate that BREAK significantly outperforms existing fake news detection methods.
△ Less
Submitted 6 February, 2025; v1 submitted 7 December, 2024;
originally announced December 2024.
-
ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented Argumentation with LLM Judges
Authors:
Kaustubh D. Dhole,
Kai Shu,
Eugene Agichtein
Abstract:
Computational argumentation, which involves generating answers or summaries for controversial topics like abortion bans and vaccination, has become increasingly important in today's polarized environment. Sophisticated LLM capabilities offer the potential to provide nuanced, evidence-based answers to such questions through Retrieval-Augmented Argumentation (RAArg), leveraging real-world evidence f…
▽ More
Computational argumentation, which involves generating answers or summaries for controversial topics like abortion bans and vaccination, has become increasingly important in today's polarized environment. Sophisticated LLM capabilities offer the potential to provide nuanced, evidence-based answers to such questions through Retrieval-Augmented Argumentation (RAArg), leveraging real-world evidence for high-quality, grounded arguments. However, evaluating RAArg remains challenging, as human evaluation is costly and difficult for complex, lengthy answers on complicated topics. At the same time, re-using existing argumentation datasets is no longer sufficient, as they lack long, complex arguments and realistic evidence from potentially misleading sources, limiting holistic evaluation of retrieval effectiveness and argument quality. To address these gaps, we investigate automated evaluation methods using multiple fine-grained LLM judges, providing better and more interpretable assessments than traditional single-score metrics and even previously reported human crowdsourcing. To validate the proposed techniques, we introduce ConQRet, a new benchmark featuring long and complex human-authored arguments on debated topics, grounded in real-world websites, allowing an exhaustive evaluation across retrieval effectiveness, argument quality, and groundedness. We validate our LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed LLM Judges and the ConQRet benchmark can enable rapid progress in computational argumentation and can be naturally extended to other complex retrieval-augmented generation tasks.
△ Less
Submitted 6 December, 2024;
originally announced December 2024.
-
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Authors:
Dawei Li,
Bohan Jiang,
Liangjie Huang,
Alimohammad Beigi,
Chengshuai Zhao,
Zhen Tan,
Amrita Bhattacharjee,
Yuxuan Jiang,
Canyu Chen,
Tianhao Wu,
Kai Shu,
Lu Cheng,
Huan Liu
Abstract:
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short of judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are levera…
▽ More
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short of judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this promising research area. Paper list and more resources about LLM-as-a-judge can be found at https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and https://llm-as-a-judge.github.io.
△ Less
Submitted 5 February, 2025; v1 submitted 25 November, 2024;
originally announced November 2024.
-
Piecing It All Together: Verifying Multi-Hop Multimodal Claims
Authors:
Haoran Wang,
Aman Rangapur,
Xiongxiao Xu,
Yueqing Liang,
Haroon Gharwi,
Carl Yang,
Kai Shu
Abstract:
Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal e…
▽ More
Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 15k multi-hop claims paired with multimodal evidence, generated and refined using large language models, with additional input from human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV. We hope this dataset and its evaluation task will encourage future research in multimodal multi-hop claim verification.
△ Less
Submitted 12 December, 2024; v1 submitted 14 November, 2024;
originally announced November 2024.
-
ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?
Authors:
Canyu Chen,
Jian Yu,
Shan Chen,
Che Liu,
Zhongwei Wan,
Danielle Bitterman,
Fei Wang,
Kai Shu
Abstract:
Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost have still been mainly adopted in clinical prediction tasks. An emerging question is Can LLMs beat traditional ML models in clinical prediction? Thus, we build a…
▽ More
Large Language Models (LLMs) hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost have still been mainly adopted in clinical prediction tasks. An emerging question is Can LLMs beat traditional ML models in clinical prediction? Thus, we build a new benchmark ClinicalBench to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales, diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction yet, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs' development for healthcare and real-world clinical practice.
△ Less
Submitted 10 November, 2024;
originally announced November 2024.
-
Can Knowledge Editing Really Correct Hallucinations?
Authors:
Baixiang Huang,
Canyu Chen,
Xiongxiao Xu,
Ali Payani,
Kai Shu
Abstract:
Large Language Models (LLMs) suffer from hallucinations, referring to the non-factual information in generated content, despite their superior capacities across tasks. Meanwhile, knowledge editing has been developed as a new popular paradigm to correct erroneous factual knowledge encoded in LLMs with the advantage of avoiding retraining from scratch. However, a common issue of existing evaluation…
▽ More
Large Language Models (LLMs) suffer from hallucinations, referring to the non-factual information in generated content, despite their superior capacities across tasks. Meanwhile, knowledge editing has been developed as a new popular paradigm to correct erroneous factual knowledge encoded in LLMs with the advantage of avoiding retraining from scratch. However, a common issue of existing evaluation datasets for knowledge editing is that they do not ensure that LLMs actually generate hallucinated answers to the evaluation questions before editing. When LLMs are evaluated on such datasets after being edited by different techniques, it is hard to directly adopt the performance to assess the effectiveness of different knowledge editing methods in correcting hallucinations. Thus, the fundamental question remains insufficiently validated: Can knowledge editing really correct hallucinations in LLMs? We proposed HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations. First, we rigorously construct a massive hallucination dataset with 9 domains, 26 topics and more than 6,000 hallucinations. Then, we assess the performance of knowledge editing methods in a holistic way on five dimensions including Efficacy, Generalization, Portability, Locality, and Robustness. Through HalluEditBench, we have provided new insights into the potentials and limitations of different knowledge editing methods in correcting hallucinations, which could inspire future improvements and facilitate progress in the field of knowledge editing.
△ Less
Submitted 3 March, 2025; v1 submitted 21 October, 2024;
originally announced October 2024.
-
Composing Optimized Stepsize Schedules for Gradient Descent
Authors:
Benjamin Grimmer,
Kevin Shu,
Alex L. Wang
Abstract:
Recent works by Altschuler and Parrilo and the authors have shown that it is possible to accelerate the convergence of gradient descent on smooth convex functions, even without momentum, just by picking special stepsizes. In this paper, we provide a general theory for composing stepsize schedules capturing all recent advances in this area and more. We propose three notions of ``composable'' stepsi…
▽ More
Recent works by Altschuler and Parrilo and the authors have shown that it is possible to accelerate the convergence of gradient descent on smooth convex functions, even without momentum, just by picking special stepsizes. In this paper, we provide a general theory for composing stepsize schedules capturing all recent advances in this area and more. We propose three notions of ``composable'' stepsize schedules with elementary associated composition operations for combining them. From these operations, in addition to recovering recent works, we construct three highly optimized sequences of stepsize schedules. We first construct optimized stepsize schedules of every length generalizing the exponentially spaced silver stepsizes. We then construct highly optimized stepsizes schedules for minimizing final objective gap or gradient norm, improving on prior rates by constants and, more importantly, matching or beating the numerically computed minimax optimal schedules. We conjecture these schedules are in fact minimax (information theoretic) optimal. Several novel tertiary results follow from our theory including recovery of the recent dynamic gradient norm minimizing short stepsizes and extending them to objective gap minimization.
△ Less
Submitted 21 October, 2024;
originally announced October 2024.
-
Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach
Authors:
Xiongxiao Xu,
Solomon Abera Bekele,
Brice Videau,
Kai Shu
Abstract:
Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures, from small wearable devices to large-scale leadership computing facilities. The predominant methods in energy management optimization are focused on CPUs. However, GPUs are increasingly significant and account for the majority of energy consumption in heterogeneous hig…
▽ More
Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures, from small wearable devices to large-scale leadership computing facilities. The predominant methods in energy management optimization are focused on CPUs. However, GPUs are increasingly significant and account for the majority of energy consumption in heterogeneous high performance computing (HPC) systems. Moreover, they typically rely on either purely offline training or a hybrid of offline and online training, which are impractical and lead to energy loss during data collection. Therefore, this paper studies a novel and practical online energy optimization problem for GPUs in HPC scenarios. The problem is challenging due to the inherent performance-energy trade-offs of GPUs, the exploration & exploitation dilemma across frequencies, and the lack of explicit performance counters in GPUs. To address these challenges, we formulate the online energy consumption optimization problem as a multi-armed bandit framework and develop a novel bandit based framework EnergyUCB. EnergyUCB is designed to dynamically adjust GPU core frequencies in real-time, reducing energy consumption with minimal impact on performance. Specifically, the proposed framework EnergyUCB (1) balances the performance-energy trade-off in the reward function, (2) effectively navigates the exploration & exploitation dilemma when adjusting GPU core frequencies online, and (3) leverages the ratio of GPU core utilization to uncore utilization as a real-time GPU performance metric. Experiments on a wide range of real-world HPC benchmarks demonstrate that EnergyUCB can achieve substantial energy savings. The code of EnergyUCB is available at https://github.com/XiongxiaoXu/EnergyUCB-Bandit.
△ Less
Submitted 3 October, 2024;
originally announced October 2024.
-
FODA-PG for Enhanced Medical Imaging Narrative Generation: Adaptive Differentiation of Normal and Abnormal Attributes
Authors:
Kai Shu,
Yuzhuo Jia,
Ziyang Zhang,
Jiechao Gao
Abstract:
Automatic Medical Imaging Narrative generation aims to alleviate the workload of radiologists by producing accurate clinical descriptions directly from radiological images. However, the subtle visual nuances and domain-specific terminology in medical images pose significant challenges compared to generic image captioning tasks. Existing approaches often neglect the vital distinction between normal…
▽ More
Automatic Medical Imaging Narrative generation aims to alleviate the workload of radiologists by producing accurate clinical descriptions directly from radiological images. However, the subtle visual nuances and domain-specific terminology in medical images pose significant challenges compared to generic image captioning tasks. Existing approaches often neglect the vital distinction between normal and abnormal findings, leading to suboptimal performance. In this work, we propose FODA-PG, a novel Fine-grained Organ-Disease Adaptive Partitioning Graph framework that addresses these limitations through domain-adaptive learning. FODA-PG constructs a granular graphical representation of radiological findings by separating disease-related attributes into distinct "disease-specific" and "disease-free" categories based on their clinical significance and location. This adaptive partitioning enables our model to capture the nuanced differences between normal and pathological states, mitigating the impact of data biases. By integrating this fine-grained semantic knowledge into a powerful transformer-based architecture and providing rigorous mathematical justifications for its effectiveness, FODA-PG generates precise and clinically coherent reports with enhanced generalization capabilities. Extensive experiments on the IU-Xray and MIMIC-CXR benchmarks demonstrate the superiority of our approach over state-of-the-art methods, highlighting the importance of domain adaptation in medical report generation.
△ Less
Submitted 5 September, 2024;
originally announced September 2024.
-
Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges
Authors:
Baixiang Huang,
Canyu Chen,
Kai Shu
Abstract:
Accurate attribution of authorship is crucial for maintaining the integrity of digital content, improving forensic investigations, and mitigating the risks of misinformation and plagiarism. Addressing the imperative need for proper authorship attribution is essential to uphold the credibility and accountability of authentic authorship. The rapid advancements of Large Language Models (LLMs) have bl…
▽ More
Accurate attribution of authorship is crucial for maintaining the integrity of digital content, improving forensic investigations, and mitigating the risks of misinformation and plagiarism. Addressing the imperative need for proper authorship attribution is essential to uphold the credibility and accountability of authentic authorship. The rapid advancements of Large Language Models (LLMs) have blurred the lines between human and machine authorship, posing significant challenges for traditional methods. We presents a comprehensive literature review that examines the latest research on authorship attribution in the era of LLMs. This survey systematically explores the landscape of this field by categorizing four representative problems: (1) Human-written Text Attribution; (2) LLM-generated Text Detection; (3) LLM-generated Text Attribution; and (4) Human-LLM Co-authored Text Attribution. We also discuss the challenges related to ensuring the generalization and explainability of authorship attribution methods. Generalization requires the ability to generalize across various domains, while explainability emphasizes providing transparent and understandable insights into the decisions made by these models. By evaluating the strengths and limitations of existing methods and benchmarks, we identify key open problems and future research directions in this field. This literature review serves a roadmap for researchers and practitioners interested in understanding the state of the art in this rapidly evolving field. Additional resources and a curated list of papers are available and regularly updated at https://llm-authorship.github.io
△ Less
Submitted 9 January, 2025; v1 submitted 16 August, 2024;
originally announced August 2024.
-
Model Attribution in LLM-Generated Disinformation: A Domain Generalization Approach with Supervised Contrastive Learning
Authors:
Alimohammad Beigi,
Zhen Tan,
Nivedh Mudiam,
Canyu Chen,
Kai Shu,
Huan Liu
Abstract:
Model attribution for LLM-generated disinformation poses a significant challenge in understanding its origins and mitigating its spread. This task is especially challenging because modern large language models (LLMs) produce disinformation with human-like quality. Additionally, the diversity in prompting methods used to generate disinformation complicates accurate source attribution. These methods…
▽ More
Model attribution for LLM-generated disinformation poses a significant challenge in understanding its origins and mitigating its spread. This task is especially challenging because modern large language models (LLMs) produce disinformation with human-like quality. Additionally, the diversity in prompting methods used to generate disinformation complicates accurate source attribution. These methods introduce domain-specific features that can mask the fundamental characteristics of the models. In this paper, we introduce the concept of model attribution as a domain generalization problem, where each prompting method represents a unique domain. We argue that an effective attribution model must be invariant to these domain-specific features. It should also be proficient in identifying the originating models across all scenarios, reflecting real-world detection challenges. To address this, we introduce a novel approach based on Supervised Contrastive Learning. This method is designed to enhance the model's robustness to variations in prompts and focuses on distinguishing between different source LLMs. We evaluate our model through rigorous experiments involving three common prompting methods: ``open-ended'', ``rewriting'', and ``paraphrasing'', and three advanced LLMs: ``llama 2'', ``chatgpt'', and ``vicuna''. Our results demonstrate the effectiveness of our approach in model attribution tasks, achieving state-of-the-art performance across diverse and unseen datasets.
△ Less
Submitted 14 August, 2024; v1 submitted 30 July, 2024;
originally announced July 2024.
-
Can Editing LLMs Inject Harm?
Authors:
Canyu Chen,
Baixiang Huang,
Zekun Li,
Zhaorun Chen,
Shiyang Lai,
Xiongxiao Xu,
Jia-Chen Gu,
Jindong Gu,
Huaxiu Yao,
Chaowei Xiao,
Xifeng Yan,
William Yang Wang,
Philip Torr,
Dawn Song,
Kai Shu
Abstract:
Knowledge editing has been increasingly adopted to correct the false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation wi…
▽ More
Knowledge editing has been increasingly adopted to correct the false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack including Misinformation Injection and Bias Injection. For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection. Then, we find that editing attacks can inject both types of misinformation into LLMs, and the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also one single biased sentence injection can cause a bias increase in general outputs of LLMs, which are even highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. Then, we further illustrate the high stealthiness of editing attacks, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show the hardness of defending editing attacks with empirical evidence. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques on compromising the safety alignment of LLMs and the feasibility of disseminating misinformation or bias with LLMs as new channels.
△ Less
Submitted 16 August, 2024; v1 submitted 29 July, 2024;
originally announced July 2024.
-
A Strengthened Conjecture on the Minimax Optimal Constant Stepsize for Gradient Descent
Authors:
Benjamin Grimmer,
Kevin Shu,
Alex L. Wang
Abstract:
Drori and Teboulle [4] conjectured that the minimax optimal constant stepsize for N steps of gradient descent is given by the stepsize that balances performance on Huber and quadratic objective functions. This was numerically supported by semidefinite program (SDP) solves of the associated performance estimation problems up to $N\approx 100$. This note presents a strengthened version of the initia…
▽ More
Drori and Teboulle [4] conjectured that the minimax optimal constant stepsize for N steps of gradient descent is given by the stepsize that balances performance on Huber and quadratic objective functions. This was numerically supported by semidefinite program (SDP) solves of the associated performance estimation problems up to $N\approx 100$. This note presents a strengthened version of the initial conjecture. Specifically, we conjecture the existence of a certificate for the convergence rate with a very specific low-rank structure. This structure allows us to bypass SDPs and to numerically verify both conjectures up to $N = 20160$.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Taxonomy-Guided Zero-Shot Recommendations with LLMs
Authors:
Yueqing Liang,
Liangwei Yang,
Chen Wang,
Xiongxiao Xu,
Philip S. Yu,
Kai Shu
Abstract:
With the emergence of large language models (LLMs) and their ability to perform a variety of tasks, their application in recommender systems (RecSys) has shown promise. However, we are facing significant challenges when deploying LLMs into RecSys, such as limited prompt length, unstructured item information, and un-constrained generation of recommendations, leading to sub-optimal performance. To a…
▽ More
With the emergence of large language models (LLMs) and their ability to perform a variety of tasks, their application in recommender systems (RecSys) has shown promise. However, we are facing significant challenges when deploying LLMs into RecSys, such as limited prompt length, unstructured item information, and un-constrained generation of recommendations, leading to sub-optimal performance. To address these issues, we propose a novel method using a taxonomy dictionary. This method provides a systematic framework for categorizing and organizing items, improving the clarity and structure of item information. By incorporating the taxonomy dictionary into LLM prompts, we achieve efficient token utilization and controlled feature generation, leading to more accurate and contextually relevant recommendations. Our Taxonomy-guided Recommendation (TaxRec) approach features a two-step process: one-time taxonomy categorization and LLM-based recommendation, enabling zero-shot recommendations without the need for domain-specific fine-tuning. Experimental results demonstrate TaxRec significantly enhances recommendation quality compared to traditional zero-shot approaches, showcasing its efficacy as personal recommender with LLMs. Code is available at https://github.com/yueqingliang1/TaxRec.
△ Less
Submitted 19 February, 2025; v1 submitted 20 June, 2024;
originally announced June 2024.
-
LMO-DP: Optimizing the Randomization Mechanism for Differentially Private Fine-Tuning (Large) Language Models
Authors:
Qin Yang,
Meisam Mohammad,
Han Wang,
Ali Payani,
Ashish Kundu,
Kai Shu,
Yan Yan,
Yuan Hong
Abstract:
Differentially Private Stochastic Gradient Descent (DP-SGD) and its variants have been proposed to ensure rigorous privacy for fine-tuning large-scale pre-trained language models. However, they rely heavily on the Gaussian mechanism, which may overly perturb the gradients and degrade the accuracy, especially in stronger privacy regimes (e.g., the privacy budget $ε< 3$). To address such limitations…
▽ More
Differentially Private Stochastic Gradient Descent (DP-SGD) and its variants have been proposed to ensure rigorous privacy for fine-tuning large-scale pre-trained language models. However, they rely heavily on the Gaussian mechanism, which may overly perturb the gradients and degrade the accuracy, especially in stronger privacy regimes (e.g., the privacy budget $ε< 3$). To address such limitations, we propose a novel Language Model-based Optimal Differential Privacy (LMO-DP) mechanism, which takes the first step to enable the tight composition of accurately fine-tuning (large) language models with a sub-optimal DP mechanism, even in strong privacy regimes (e.g., $0.1\leq ε<3$). Furthermore, we propose a novel offline optimal noise search method to efficiently derive the sub-optimal DP that significantly reduces the noise magnitude. For instance, fine-tuning RoBERTa-large (with 300M parameters) on the SST-2 dataset can achieve an accuracy of 92.20% (given $ε=0.3$, $δ=10^{-10}$) by drastically outperforming the Gaussian mechanism (e.g., $\sim 50\%$ for small $ε$ and $δ$). We also draw similar findings on the text generation tasks on GPT-2. Finally, to our best knowledge, LMO-DP is also the first solution to accurately fine-tune Llama-2 with strong differential privacy guarantees. The code will be released soon and available upon request.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
SST: Multi-Scale Hybrid Mamba-Transformer Experts for Long-Short Range Time Series Forecasting
Authors:
Xiongxiao Xu,
Canyu Chen,
Yueqing Liang,
Baixiang Huang,
Guangji Bai,
Liang Zhao,
Kai Shu
Abstract:
Despite significant progress in time series forecasting, existing forecasters often overlook the heterogeneity between long-range and short-range time series, leading to performance degradation in practical applications. In this work, we highlight the need of distinct objectives tailored to different ranges. We point out that time series can be decomposed into global patterns and local variations,…
▽ More
Despite significant progress in time series forecasting, existing forecasters often overlook the heterogeneity between long-range and short-range time series, leading to performance degradation in practical applications. In this work, we highlight the need of distinct objectives tailored to different ranges. We point out that time series can be decomposed into global patterns and local variations, which should be addressed separately in long- and short-range time series. To meet the objectives, we propose a multi-scale hybrid Mamba-Transformer experts model State Space Transformer (SST). SST leverages Mamba as an expert to extract global patterns in coarse-grained long-range time series, and Local Window Transformer (LWT), the other expert to focus on capturing local variations in fine-grained short-range time series. With an input-dependent mechanism, State Space Model (SSM)-based Mamba is able to selectively retain long-term patterns and filter out fluctuations, while LWT employs a local window to enhance locality-awareness capability, thus effectively capturing local variations. To adaptively integrate the global patterns and local variations, a long-short router dynamically adjusts contributions of the two experts. SST achieves superior performance with scaling linearly $O(L)$ on time series length $L$. The comprehensive experiments demonstrate the SST can achieve SOTA results in long-short range time series forecasting while maintaining low memory footprint and computational cost. The code of SST is available at https://github.com/XiongxiaoXu/SST.
△ Less
Submitted 22 August, 2024; v1 submitted 23 April, 2024;
originally announced April 2024.
-
Accelerated Objective Gap and Gradient Norm Convergence for Gradient Descent via Long Steps
Authors:
Benjamin Grimmer,
Kevin Shu,
Alex L. Wang
Abstract:
This work considers gradient descent for L-smooth convex optimization with stepsizes larger than the classic regime where descent can be ensured. The stepsize schedules considered are similar to but differ slightly from the recent silver stepsizes of Altschuler and Parrilo. For one of our stepsize sequences, we prove a $O\left(N^{- 1.2716\dots}\right)$ convergence rate in terms of objective gap de…
▽ More
This work considers gradient descent for L-smooth convex optimization with stepsizes larger than the classic regime where descent can be ensured. The stepsize schedules considered are similar to but differ slightly from the recent silver stepsizes of Altschuler and Parrilo. For one of our stepsize sequences, we prove a $O\left(N^{- 1.2716\dots}\right)$ convergence rate in terms of objective gap decrease and for the other, we show the same rate of decrease for squared-gradient-norm decrease. This first result improves on the recent result of Altschuler and Parrilo by a constant factor, while the second results improve on the exponent of the prior best squared-gradient-norm convergence guarantee of $O\left(N^{-1}\right)$.
△ Less
Submitted 12 April, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
Re-Search for The Truth: Multi-round Retrieval-augmented Large Language Models are Strong Fake News Detectors
Authors:
Guanghua Li,
Wensheng Lu,
Wei Zhang,
Defu Lian,
Kezhong Lu,
Rui Mao,
Kai Shu,
Hao Liao
Abstract:
The proliferation of fake news has had far-reaching implications on politics, the economy, and society at large. While Fake news detection methods have been employed to mitigate this issue, they primarily depend on two essential elements: the quality and relevance of the evidence, and the effectiveness of the verdict prediction mechanism. Traditional methods, which often source information from st…
▽ More
The proliferation of fake news has had far-reaching implications on politics, the economy, and society at large. While Fake news detection methods have been employed to mitigate this issue, they primarily depend on two essential elements: the quality and relevance of the evidence, and the effectiveness of the verdict prediction mechanism. Traditional methods, which often source information from static repositories like Wikipedia, are limited by outdated or incomplete data, particularly for emerging or rare claims. Large Language Models (LLMs), known for their remarkable reasoning and generative capabilities, introduce a new frontier for fake news detection. However, like traditional methods, LLM-based solutions also grapple with the limitations of stale and long-tail knowledge. Additionally, retrieval-enhanced LLMs frequently struggle with issues such as low-quality evidence retrieval and context length constraints. To address these challenges, we introduce a novel, retrieval-augmented LLMs framework--the first of its kind to automatically and strategically extract key evidence from web sources for claim verification. Employing a multi-round retrieval strategy, our framework ensures the acquisition of sufficient, relevant evidence, thereby enhancing performance. Comprehensive experiments across three real-world datasets validate the framework's superiority over existing methods. Importantly, our model not only delivers accurate verdicts but also offers human-readable explanations to improve result interpretability.
△ Less
Submitted 13 March, 2024;
originally announced March 2024.
-
Can Large Language Models Identify Authorship?
Authors:
Baixiang Huang,
Canyu Chen,
Kai Shu
Abstract:
The ability to accurately identify authorship is crucial for verifying content authenticity and mitigating misinformation. Large Language Models (LLMs) have demonstrated an exceptional capacity for reasoning and problem-solving. However, their potential in authorship analysis remains under-explored. Traditional studies have depended on hand-crafted stylistic features, whereas state-of-the-art appr…
▽ More
The ability to accurately identify authorship is crucial for verifying content authenticity and mitigating misinformation. Large Language Models (LLMs) have demonstrated an exceptional capacity for reasoning and problem-solving. However, their potential in authorship analysis remains under-explored. Traditional studies have depended on hand-crafted stylistic features, whereas state-of-the-art approaches leverage text embeddings from pre-trained language models. These methods, which typically require fine-tuning on labeled data, often suffer from performance degradation in cross-domain applications and provide limited explainability. This work seeks to address three research questions: (1) Can LLMs perform zero-shot, end-to-end authorship verification effectively? (2) Are LLMs capable of accurately attributing authorship among multiple candidates authors (e.g., 10 and 20)? (3) Can LLMs provide explainability in authorship analysis, particularly through the role of linguistic features? Moreover, we investigate the integration of explicit linguistic features to guide LLMs in their reasoning processes. Our assessment demonstrates LLMs' proficiency in both tasks without the need for domain-specific fine-tuning, providing explanations into their decision making via a detailed analysis of linguistic features. This establishes a new benchmark for future research on LLM-based authorship analysis.
△ Less
Submitted 22 October, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
Can Large Language Model Agents Simulate Human Trust Behavior?
Authors:
Chengxing Xie,
Canyu Chen,
Feiran Jia,
Ziyu Ye,
Shiyang Lai,
Kai Shu,
Jindong Gu,
Adel Bibi,
Ziniu Hu,
David Jurgens,
James Evans,
Philip Torr,
Bernard Ghanem,
Guohao Li
Abstract:
Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in social science and role-playing applications. However, one fundamental question remains: can LLM agents really simulate human behavior? In this paper, we focus on one critical and elemental behavior in human interactions, trust, and investigate whether LLM agents can simulate human trust behavio…
▽ More
Large Language Model (LLM) agents have been increasingly adopted as simulation tools to model humans in social science and role-playing applications. However, one fundamental question remains: can LLM agents really simulate human behavior? In this paper, we focus on one critical and elemental behavior in human interactions, trust, and investigate whether LLM agents can simulate human trust behavior. We first find that LLM agents generally exhibit trust behavior, referred to as agent trust, under the framework of Trust Games, which are widely recognized in behavioral economics. Then, we discover that GPT-4 agents manifest high behavioral alignment with humans in terms of trust behavior, indicating the feasibility of simulating human trust behavior with LLM agents. In addition, we probe the biases of agent trust and differences in agent trust towards other LLM agents and humans. We also explore the intrinsic properties of agent trust under conditions including external manipulations and advanced reasoning strategies. Our study provides new insights into the behaviors of LLM agents and the fundamental analogy between LLMs and humans beyond value alignment. We further illustrate broader implications of our discoveries for applications where trust is paramount.
△ Less
Submitted 1 November, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Baichuan2-Sum: Instruction Finetune Baichuan2-7B Model for Dialogue Summarization
Authors:
Jianfei Xiao,
Yancan Chen,
Yimin Ou,
Hanyi Yu,
Kai Shu,
Yiyong Xiao
Abstract:
Large language models (LLMs) like Llama, Baichuan and Bloom models show remarkable ability with instruction fine-tuning in many natural language tasks. Nevertheless, for the dialogue summarization task, which aims to generate summaries for different roles in dialogue, most of the state-of-the-art methods conduct on small models (e.g Bart and Bert). Existing methods try to add task specified optimi…
▽ More
Large language models (LLMs) like Llama, Baichuan and Bloom models show remarkable ability with instruction fine-tuning in many natural language tasks. Nevertheless, for the dialogue summarization task, which aims to generate summaries for different roles in dialogue, most of the state-of-the-art methods conduct on small models (e.g Bart and Bert). Existing methods try to add task specified optimization on small models like adding global-local centrality score to models. In this paper, we propose an instruction fine-tuning model: Baichuan2-Sum, for role-oriented diaglouge summarization. By setting different instructions for different roles, the model can learn from the dialogue interactions and output the expected summaries. Furthermore, we applied NEFTune technique to add suitable noise during training to improve the results. The experiments demonstrate that the proposed model achieves the new state-of-the-art results on two public dialogue summarization datasets: CSDS and SAMSUM. We release our model and related codes to facilitate future studies on dialogue summarization task.
△ Less
Submitted 3 April, 2024; v1 submitted 27 January, 2024;
originally announced January 2024.
-
Understanding the concerns and choices of public when using large language models for healthcare
Authors:
Yunpeng Xiao,
Kyrie Zhixuan Zhou,
Yueqing Liang,
Kai Shu
Abstract:
Large language models (LLMs) have shown their potential in biomedical fields. However, how the public uses them for healthcare purposes such as medical Q\&A, self-diagnosis, and daily healthcare information seeking is under-investigated. This paper adopts a mixed-methods approach, including surveys (N=214) and interviews (N=17) to investigate how and why the public uses LLMs for healthcare. We fou…
▽ More
Large language models (LLMs) have shown their potential in biomedical fields. However, how the public uses them for healthcare purposes such as medical Q\&A, self-diagnosis, and daily healthcare information seeking is under-investigated. This paper adopts a mixed-methods approach, including surveys (N=214) and interviews (N=17) to investigate how and why the public uses LLMs for healthcare. We found that participants generally believed LLMs as a healthcare tool have gained popularity, and are often used in combination with other information channels such as search engines and online health communities to optimize information quality. Based on the findings, we reflect on the ethical and effective use of LLMs for healthcare and propose future research directions.
△ Less
Submitted 12 September, 2024; v1 submitted 17 January, 2024;
originally announced January 2024.
-
TrustLLM: Trustworthiness in Large Language Models
Authors:
Yue Huang,
Lichao Sun,
Haoran Wang,
Siyuan Wu,
Qihui Zhang,
Yuan Li,
Chujie Gao,
Yixin Huang,
Wenhan Lyu,
Yixuan Zhang,
Xiner Li,
Zhengliang Liu,
Yixin Liu,
Yijue Wang,
Zhikun Zhang,
Bertie Vidgen,
Bhavya Kailkhura,
Caiming Xiong,
Chaowei Xiao,
Chunyuan Li,
Eric Xing,
Furong Huang,
Hao Liu,
Heng Ji,
Hongyi Wang
, et al. (45 additional authors not shown)
Abstract:
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in…
▽ More
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.
△ Less
Submitted 30 September, 2024; v1 submitted 10 January, 2024;
originally announced January 2024.
-
CSGNN: Conquering Noisy Node labels via Dynamic Class-wise Selection
Authors:
Yifan Li,
Zhen Tan,
Kai Shu,
Zongsheng Cao,
Yu Kong,
Huan Liu
Abstract:
Graph Neural Networks (GNNs) have emerged as a powerful tool for representation learning on graphs, but they often suffer from overfitting and label noise issues, especially when the data is scarce or imbalanced. Different from the paradigm of previous methods that rely on single-node confidence, in this paper, we introduce a novel Class-wise Selection for Graph Neural Networks, dubbed CSGNN, whic…
▽ More
Graph Neural Networks (GNNs) have emerged as a powerful tool for representation learning on graphs, but they often suffer from overfitting and label noise issues, especially when the data is scarce or imbalanced. Different from the paradigm of previous methods that rely on single-node confidence, in this paper, we introduce a novel Class-wise Selection for Graph Neural Networks, dubbed CSGNN, which employs a neighbor-aggregated latent space to adaptively select reliable nodes across different classes. Specifically, 1) to tackle the class imbalance issue, we introduce a dynamic class-wise selection mechanism, leveraging the clustering technique to identify clean nodes based on the neighbor-aggregated confidences. In this way, our approach can avoid the pitfalls of biased sampling which is common with global threshold techniques. 2) To alleviate the problem of noisy labels, built on the concept of the memorization effect, CSGNN prioritizes learning from clean nodes before noisy ones, thereby iteratively enhancing model performance while mitigating label noise. Through extensive experiments, we demonstrate that CSGNN outperforms state-of-the-art methods in terms of both effectiveness and robustness.
△ Less
Submitted 14 December, 2023; v1 submitted 19 November, 2023;
originally announced November 2023.
-
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment
Authors:
Haoran Wang,
Kai Shu
Abstract:
To ensure AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potent…
▽ More
To ensure AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potential harm that LLMs can inflict. Existing attack methods on LLMs often rely on poisoned training data or the injection of malicious prompts. These approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. Additionally, these models often demand substantial computational resources for implementation, making them less practical for real-world applications. In this work, we study a different attack scenario, called Trojan Activation Attack (TA^2), which injects trojan steering vectors into the activation layers of LLMs. These malicious steering vectors can be triggered at inference time to steer the models toward attacker-desired behaviors by manipulating their activations. Our experiment results on four primary alignment tasks show that TA^2 is highly effective and adds little or no overhead to attack efficiency. Additionally, we discuss potential countermeasures against such activation attacks.
△ Less
Submitted 15 August, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Beyond Detection: Unveiling Fairness Vulnerabilities in Abusive Language Models
Authors:
Yueqing Liang,
Lu Cheng,
Ali Payani,
Kai Shu
Abstract:
This work investigates the potential of undermining both fairness and detection performance in abusive language detection. In a dynamic and complex digital world, it is crucial to investigate the vulnerabilities of these detection models to adversarial fairness attacks to improve their fairness robustness. We propose a simple yet effective framework FABLE that leverages backdoor attacks as they al…
▽ More
This work investigates the potential of undermining both fairness and detection performance in abusive language detection. In a dynamic and complex digital world, it is crucial to investigate the vulnerabilities of these detection models to adversarial fairness attacks to improve their fairness robustness. We propose a simple yet effective framework FABLE that leverages backdoor attacks as they allow targeted control over the fairness and detection performance. FABLE explores three types of trigger designs (i.e., rare, artificial, and natural triggers) and novel sampling strategies. Specifically, the adversary can inject triggers into samples in the minority group with the favored outcome (i.e., "non-abusive") and flip their labels to the unfavored outcome, i.e., "abusive". Experiments on benchmark datasets demonstrate the effectiveness of FABLE attacking fairness and utility in abusive language detection.
△ Less
Submitted 5 December, 2023; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Combating Misinformation in the Age of LLMs: Opportunities and Challenges
Authors:
Canyu Chen,
Kai Shu
Abstract:
Misinformation such as fake news and rumors is a serious threat on information ecosystems and public trust. The emergence of Large Language Models (LLMs) has great potential to reshape the landscape of combating misinformation. Generally, LLMs can be a double-edged sword in the fight. On the one hand, LLMs bring promising opportunities for combating misinformation due to their profound world knowl…
▽ More
Misinformation such as fake news and rumors is a serious threat on information ecosystems and public trust. The emergence of Large Language Models (LLMs) has great potential to reshape the landscape of combating misinformation. Generally, LLMs can be a double-edged sword in the fight. On the one hand, LLMs bring promising opportunities for combating misinformation due to their profound world knowledge and strong reasoning abilities. Thus, one emergent question is: how to utilize LLMs to combat misinformation? On the other hand, the critical challenge is that LLMs can be easily leveraged to generate deceptive misinformation at scale. Then, another important question is: how to combat LLM-generated misinformation? In this paper, we first systematically review the history of combating misinformation before the advent of LLMs. Then we illustrate the current efforts and present an outlook for these two fundamental questions respectively. The goal of this survey paper is to facilitate the progress of utilizing LLMs for fighting misinformation and call for interdisciplinary efforts from different stakeholders for combating LLM-generated misinformation.
△ Less
Submitted 8 November, 2023;
originally announced November 2023.
-
Laser cooling of positronium
Authors:
K. Shu,
Y. Tajima,
R. Uozumi,
N. Miyamoto,
S. Shiraishi,
T. Kobayashi,
A. Ishida,
K. Yamada,
R. W. Gladen,
T. Namba,
S. Asai,
K. Wada,
I. Mochizuki,
T. Hyodo,
K. Ito,
K. Michishio,
B. E. O'Rourke,
N. Oshima,
K. Yoshioka
Abstract:
When laser radiation is skilfully applied, atoms and molecules can be cooled allowing precise measurements and control of quantum systems. This is essential in fundamental studies of physics as well as practical applications such as precision spectroscopy, quantum-statistical-property manifesting ultracold gases, and quantum computing. In laser cooling, repeated cycles of laser photon absorption a…
▽ More
When laser radiation is skilfully applied, atoms and molecules can be cooled allowing precise measurements and control of quantum systems. This is essential in fundamental studies of physics as well as practical applications such as precision spectroscopy, quantum-statistical-property manifesting ultracold gases, and quantum computing. In laser cooling, repeated cycles of laser photon absorption and direction-independent spontaneous emission can slow atoms and molecules to otherwise unattainable velocities. Simple systems can provide a rigorous testing ground for fundamental theories of physics; one such system is the purely leptonic positronium, an exotic atom of an electron and its antiparticle, the positron. However, the cooling of positronium has hitherto remained unrealised. Here, we demonstrate laser cooling of positronium. A novel laser system of a train of broadband pulses with successively increasing central frequencies was used to overcome major challenges presented by the short lifetime of positronium and the significant Doppler broadening and recoil as a consequence of its very light mass. One-dimensional chirp cooling of the dilute positronium gas in a counter-propagating configuration gave a final velocity distribution corresponding to approximately 1 K in a short time of 100 ns. This study on a pure leptonic system is a major step in the field of low-temperature fundamental physics of antimatter, and is complementary to the laser cooling of antihydrogen, a hadron-containing exotic atom. Progress in this field is vital in elucidating the origin of the matter-antimatter asymmetry in the universe. The application of laser cooling to positronium may afford a unique opportunity to rigorously test bound-state quantum electrodynamics. Moreover, laser cooling of positronium is key to the realisation of Bose-Einstein condensation in this matter-antimatter system.
△ Less
Submitted 15 October, 2023; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Explainable Claim Verification via Knowledge-Grounded Reasoning with Large Language Models
Authors:
Haoran Wang,
Kai Shu
Abstract:
Claim verification plays a crucial role in combating misinformation. While existing works on claim verification have shown promising results, a crucial piece of the puzzle that remains unsolved is to understand how to verify claims without relying on human-annotated data, which is expensive to create at a large scale. Additionally, it is important for models to provide comprehensive explanations t…
▽ More
Claim verification plays a crucial role in combating misinformation. While existing works on claim verification have shown promising results, a crucial piece of the puzzle that remains unsolved is to understand how to verify claims without relying on human-annotated data, which is expensive to create at a large scale. Additionally, it is important for models to provide comprehensive explanations that can justify their decisions and assist human fact-checkers. This paper presents First-Order-Logic-Guided Knowledge-Grounded (FOLK) Reasoning that can verify complex claims and generate explanations without the need for annotated evidence using Large Language Models (LLMs). FOLK leverages the in-context learning ability of LLMs to translate the claim into a First-Order-Logic (FOL) clause consisting of predicates, each corresponding to a sub-claim that needs to be verified. Then, FOLK performs FOL-Guided reasoning over a set of knowledge-grounded question-and-answer pairs to make veracity predictions and generate explanations to justify its decision-making process. This process makes our model highly explanatory, providing clear explanations of its reasoning process in human-readable form. Our experiment results indicate that FOLK outperforms strong baselines on three datasets encompassing various claim verification challenges. Our code and data are available.
△ Less
Submitted 19 October, 2023; v1 submitted 8 October, 2023;
originally announced October 2023.
-
Can LLM-Generated Misinformation Be Detected?
Authors:
Canyu Chen,
Kai Shu
Abstract:
The advent of Large Language Models (LLMs) has made a transformative impact. However, the potential that LLMs such as ChatGPT can be exploited to generate misinformation has posed a serious concern to online safety and public trust. A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation? We propose to tackle this question from the pe…
▽ More
The advent of Large Language Models (LLMs) has made a transformative impact. However, the potential that LLMs such as ChatGPT can be exploited to generate misinformation has posed a serious concern to online safety and public trust. A fundamental research question is: will LLM-generated misinformation cause more harm than human-written misinformation? We propose to tackle this question from the perspective of detection difficulty. We first build a taxonomy of LLM-generated misinformation. Then we categorize and validate the potential real-world methods for generating misinformation with LLMs. Then, through extensive empirical investigation, we discover that LLM-generated misinformation can be harder to detect for humans and detectors compared to human-written misinformation with the same semantics, which suggests it can have more deceptive styles and potentially cause more harm. We also discuss the implications of our discovery on combating misinformation in the age of LLMs and the countermeasures.
△ Less
Submitted 23 April, 2024; v1 submitted 24 September, 2023;
originally announced September 2023.
-
Investigating Online Financial Misinformation and Its Consequences: A Computational Perspective
Authors:
Aman Rangapur,
Haoran Wang,
Kai Shu
Abstract:
The rapid dissemination of information through digital platforms has revolutionized the way we access and consume news and information, particularly in the realm of finance. However, this digital age has also given rise to an alarming proliferation of financial misinformation, which can have detrimental effects on individuals, markets, and the overall economy. This research paper aims to provide a…
▽ More
The rapid dissemination of information through digital platforms has revolutionized the way we access and consume news and information, particularly in the realm of finance. However, this digital age has also given rise to an alarming proliferation of financial misinformation, which can have detrimental effects on individuals, markets, and the overall economy. This research paper aims to provide a comprehensive survey of online financial misinformation, including its types, sources, and impacts. We first discuss the characteristics and manifestations of financial misinformation, encompassing false claims and misleading content. We explore various case studies that illustrate the detrimental consequences of financial misinformation on the economy. Finally, we highlight the potential impact and implications of detecting financial misinformation. Early detection and mitigation strategies can help protect investors, enhance market transparency, and preserve financial stability. We emphasize the importance of greater awareness, education, and regulation to address the issue of online financial misinformation and safeguard individuals and businesses from its harmful effects. In conclusion, this research paper sheds light on the pervasive issue of online financial misinformation and its wide-ranging consequences. By understanding the types, sources, and impacts of misinformation, stakeholders can work towards implementing effective detection and prevention measures to foster a more informed and resilient financial ecosystem.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.
-
Accelerated Gradient Descent via Long Steps
Authors:
Benjamin Grimmer,
Kevin Shu,
Alex L. Wang
Abstract:
Recently Grimmer [1] showed for smooth convex optimization by utilizing longer steps periodically, gradient descent's textbook $LD^2/2T$ convergence guarantees can be improved by constant factors, conjecturing an accelerated rate strictly faster than $O(1/T)$ could be possible. Here we prove such a big-O gain, establishing gradient descent's first accelerated convergence rate in this setting. Name…
▽ More
Recently Grimmer [1] showed for smooth convex optimization by utilizing longer steps periodically, gradient descent's textbook $LD^2/2T$ convergence guarantees can be improved by constant factors, conjecturing an accelerated rate strictly faster than $O(1/T)$ could be possible. Here we prove such a big-O gain, establishing gradient descent's first accelerated convergence rate in this setting. Namely, we prove a $O(1/T^{1.0564})$ rate for smooth convex minimization by utilizing a nonconstant nonperiodic sequence of increasingly large stepsizes. It remains open if one can achieve the $O(1/T^{1.178})$ rate conjectured by Das Gupta et. al. [2] or the optimal gradient method rate of $O(1/T^2)$. Big-O convergence rate accelerations from long steps follow from our theory for strongly convex optimization, similar to but somewhat weaker than those concurrently developed by Altschuler and Parrilo [3].
△ Less
Submitted 26 September, 2023; v1 submitted 18 September, 2023;
originally announced September 2023.
-
Fin-Fact: A Benchmark Dataset for Multimodal Financial Fact Checking and Explanation Generation
Authors:
Aman Rangapur,
Haoran Wang,
Ling Jian,
Kai Shu
Abstract:
Fact-checking in financial domain is under explored, and there is a shortage of quality dataset in this domain. In this paper, we propose Fin-Fact, a benchmark dataset for multimodal fact-checking within the financial domain. Notably, it includes professional fact-checker annotations and justifications, providing expertise and credibility. With its multimodal nature encompassing both textual and v…
▽ More
Fact-checking in financial domain is under explored, and there is a shortage of quality dataset in this domain. In this paper, we propose Fin-Fact, a benchmark dataset for multimodal fact-checking within the financial domain. Notably, it includes professional fact-checker annotations and justifications, providing expertise and credibility. With its multimodal nature encompassing both textual and visual content, Fin-Fact provides complementary information sources to enhance factuality analysis. Its primary objective is combating misinformation in finance, fostering transparency, and building trust in financial reporting and news dissemination. By offering insightful explanations, Fin-Fact empowers users, including domain experts and end-users, to understand the reasoning behind fact-checking decisions, validating claim credibility, and fostering trust in the fact-checking process. The Fin-Fact dataset, along with our experimental codes is available at https://github.com/IIT-DM/Fin-Fact/.
△ Less
Submitted 1 May, 2024; v1 submitted 15 September, 2023;
originally announced September 2023.
-
Symmetric Hyperbolic Polynomials
Authors:
Grigoriy Blekherman,
Julia Lindberg,
Kevin Shu
Abstract:
Hyperbolic polynomials have been of recent interest due to applications in a wide variety of fields. We seek to better understand these polynomials in the case when they are symmetric, i.e. invariant under all permutations of variables. We give a complete characterization of the set of symmetric hyperbolic polynomials of degree 3, and a large class of symmetric hyperbolic polynomials of degree 4.…
▽ More
Hyperbolic polynomials have been of recent interest due to applications in a wide variety of fields. We seek to better understand these polynomials in the case when they are symmetric, i.e. invariant under all permutations of variables. We give a complete characterization of the set of symmetric hyperbolic polynomials of degree 3, and a large class of symmetric hyperbolic polynomials of degree 4. For a class of polynomials, which we call hook-shaped, we relate symmetric hyperbolic polynomials to a class of linear maps of univariate polynomials preserving hyperbolicity, and give evidence toward a beautiful characterization of all such hook-shaped symmetric hyperbolic polynomials. We show that hyperbolicity cones of a class of symmetric hyperbolic polynomials, including all symmetric hyperbolic cubics, are spectrahedral. Finally, we connect testing hyperbolicity of a symmetric polynomial to the degree principle for symmetric nonnegative polynomials.
△ Less
Submitted 18 August, 2023;
originally announced August 2023.
-
Development of an optimal laser for chirp cooling of positronium based on chirped pulse-train generator
Authors:
Kenji Shu,
Naoki Miyamoto,
Yuto Motohashi,
Ryosuke Uozumi,
Yohei Tajima,
Kosuke Yoshioka
Abstract:
We report the development and characterization of a pulsed 243 nm laser that is optimal for the cooling of positronium (Ps). The laser, which is based on the recent chirped pulse-train generator (CPTG) demonstrated by K. Yamada et al. (Phys. Rev. Appl. 16, 014009 (2021)), was designed to output a train of pulses with linewidths of 10 GHz, and with the center frequency of each pulse shifting upward…
▽ More
We report the development and characterization of a pulsed 243 nm laser that is optimal for the cooling of positronium (Ps). The laser, which is based on the recent chirped pulse-train generator (CPTG) demonstrated by K. Yamada et al. (Phys. Rev. Appl. 16, 014009 (2021)), was designed to output a train of pulses with linewidths of 10 GHz, and with the center frequency of each pulse shifting upward (up-chirped) in time by $4.9\times10^2\,\mathrm{GHz\,μs^{-1}}$. These parameters were determined by the mechanism of chirp cooling, which is the best scheme for cooling many Ps atoms to the recoil temperature of laser cooling. To achieve the designed performance, we drove an optical phase modulator in the CPTG with a deep modulation depth based on the operating principle of the cooling laser. Time-resolved spectroscopic measurements confirmed that the developed laser satisfied the chirp rate and linewidth requirements for efficient chirp cooling. Combined with pulse energy of hundreds of microjoules, we believe that the experimental demonstration of Ps laser cooling has become possible using realistic methods for the generation and velocity measurement of Ps.
△ Less
Submitted 1 August, 2023;
originally announced August 2023.
-
Emulating Reader Behaviors for Fake News Detection
Authors:
Junwei Yin,
Min Gao,
Kai Shu,
Zehua Zhao,
Yinqiu Huang,
Jia Wang
Abstract:
The wide dissemination of fake news has affected our lives in many aspects, making fake news detection important and attracting increasing attention. Existing approaches make substantial contributions in this field by modeling news from a single-modal or multi-modal perspective. However, these modal-based methods can result in sub-optimal outcomes as they ignore reader behaviors in news consumptio…
▽ More
The wide dissemination of fake news has affected our lives in many aspects, making fake news detection important and attracting increasing attention. Existing approaches make substantial contributions in this field by modeling news from a single-modal or multi-modal perspective. However, these modal-based methods can result in sub-optimal outcomes as they ignore reader behaviors in news consumption and authenticity verification. For instance, they haven't taken into consideration the component-by-component reading process: from the headline, images, comments, to the body, which is essential for modeling news with more granularity. To this end, we propose an approach of Emulating the behaviors of readers (Ember) for fake news detection on social media, incorporating readers' reading and verificating process to model news from the component perspective thoroughly. Specifically, we first construct intra-component feature extractors to emulate the behaviors of semantic analyzing on each component. Then, we design a module that comprises inter-component feature extractors and a sequence-based aggregator. This module mimics the process of verifying the correlation between components and the overall reading and verification sequence. Thus, Ember can handle the news with various components by emulating corresponding sequences. We conduct extensive experiments on nine real-world datasets, and the results demonstrate the superiority of Ember.
△ Less
Submitted 27 June, 2023;
originally announced June 2023.
-
MUSER: A MUlti-Step Evidence Retrieval Enhancement Framework for Fake News Detection
Authors:
Hao Liao,
Jiaohao Peng,
Zhanyi Huang,
Wei Zhang,
Guanghua Li,
Kai Shu,
Xing Xie
Abstract:
The ease of spreading false information online enables individuals with malicious intent to manipulate public opinion and destabilize social stability. Recently, fake news detection based on evidence retrieval has gained popularity in an effort to identify fake news reliably and reduce its impact. Evidence retrieval-based methods can improve the reliability of fake news detection by computing the…
▽ More
The ease of spreading false information online enables individuals with malicious intent to manipulate public opinion and destabilize social stability. Recently, fake news detection based on evidence retrieval has gained popularity in an effort to identify fake news reliably and reduce its impact. Evidence retrieval-based methods can improve the reliability of fake news detection by computing the textual consistency between the evidence and the claim in the news. In this paper, we propose a framework for fake news detection based on MUlti-Step Evidence Retrieval enhancement (MUSER), which simulates the steps of human beings in the process of reading news, summarizing, consulting materials, and inferring whether the news is true or fake. Our model can explicitly model dependencies among multiple pieces of evidence, and perform multi-step associations for the evidence required for news verification through multi-step retrieval. In addition, our model is able to automatically collect existing evidence through paragraph retrieval and key evidence selection, which can save the tedious process of manual evidence collection. We conducted extensive experiments on real-world datasets in different languages, and the results demonstrate that our proposed model outperforms state-of-the-art baseline methods for detecting fake news by at least 3% in F1-Macro and 4% in F1-Micro. Furthermore, it provides interpretable evidence for end users.
△ Less
Submitted 23 June, 2023;
originally announced June 2023.
-
Data Augmentation for Seizure Prediction with Generative Diffusion Model
Authors:
Kai Shu,
Le Wu,
Yuchang Zhao,
Aiping Liu,
Ruobing Qian,
Xun Chen
Abstract:
Data augmentation (DA) can significantly strengthen the electroencephalogram (EEG)-based seizure prediction methods. However, existing DA approaches are just the linear transformations of original data and cannot explore the feature space to increase diversity effectively. Therefore, we propose a novel diffusion-based DA method called DiffEEG. DiffEEG can fully explore data distribution and genera…
▽ More
Data augmentation (DA) can significantly strengthen the electroencephalogram (EEG)-based seizure prediction methods. However, existing DA approaches are just the linear transformations of original data and cannot explore the feature space to increase diversity effectively. Therefore, we propose a novel diffusion-based DA method called DiffEEG. DiffEEG can fully explore data distribution and generate samples with high diversity, offering extra information to classifiers. It involves two processes: the diffusion process and the denoised process. In the diffusion process, the model incrementally adds noise with different scales to EEG input and converts it into random noise. In this way, the representation of data can be learned. In the denoised process, the model utilizes learned knowledge to sample synthetic data from random noise input by gradually removing noise. The randomness of input noise and the precise representation enable the synthetic samples to possess diversity while ensuring the consistency of feature space. We compared DiffEEG with original, down-sampling, sliding windows and recombination methods, and integrated them into five representative classifiers. The experiments demonstrate the effectiveness and generality of our method. With the contribution of DiffEEG, the Multi-scale CNN achieves state-of-the-art performance, with an average sensitivity, FPR, AUC of 95.4%, 0.051/h, 0.932 on the CHB-MIT database and 93.6%, 0.121/h, 0.822 on the Kaggle database.
△ Less
Submitted 9 December, 2024; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Investigating Gender Euphoria and Dysphoria on TikTok: Characterization and Comparison
Authors:
SJ Dillon,
Yueqing Liang,
H. Russell Bernard,
Kai Shu
Abstract:
With the emergence of short video-sharing platforms, engagement with social media sites devoted to opinion and knowledge dissemination has rapidly increased. Among these platforms, TikTok is one of the most popular globally and has become the platform of choice for transgender and nonbinary individuals, who have formed a large community to mobilize personal experience and exchange information. The…
▽ More
With the emergence of short video-sharing platforms, engagement with social media sites devoted to opinion and knowledge dissemination has rapidly increased. Among these platforms, TikTok is one of the most popular globally and has become the platform of choice for transgender and nonbinary individuals, who have formed a large community to mobilize personal experience and exchange information. The knowledge produced in online spaces can influence the ways in which people understand and experience their own gender and transitions, as they hear about others and weigh experiential and medical knowledge against their own. This paper extends current research and past interview methods on gender euphoria and gender dysphoria to analyze what and how online communities on TikTok discuss these two types of gender experiences. Our findings indicate that gender euphoria and gender dysphoria are differently described in online TikTok spaces. These findings indicate similarities in the words used to describe gender dysphoria as well as gender euphoria in both the comments of videos and content creators' hashtags. Finally, our results show that gender euphoria is described in more similar terms between transfeminine and transmasculine experiences than gender dysphoria, which appears to be more differentiated by gendering experience and transition goals. We hope this paper can provide insights for future research on understanding transgender and nonbinary individuals in online communities.
△ Less
Submitted 11 March, 2025; v1 submitted 31 May, 2023;
originally announced May 2023.