STRICT: Stress Test of Rendering Images Containing Text

Authors: Tianyu Zhang, Xinyu Wang, Zhenghan Tai, Lu Li, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang

Abstract: While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce STRICT, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text; and (3) the rate at which models fail to follow instructions for generating text. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at https://github.com/tianyu-z/STRICT-Bench.

Submitted 25 May, 2025; originally announced May 2025.

Comments: 13 pages

arXiv:2505.17747 [pdf, ps, other]

Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks

Authors: Maureen de Seyssel, Jie Chi, Skyler Seto, Maartje ter Hoeve, Masha Fedzechkina, Natalie Schluter

Abstract: We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired by speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al., 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance. Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.

Submitted 2 June, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

arXiv:2505.17196 [pdf, ps, other]

Shape it Up! Restoring LLM Safety during Finetuning

Authors: ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau

Abstract: Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal, a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial responses, tracking how safety risk evolves throughout the response, segment by segment. This leads to the Safety Trajectory Assessment of Response (STAR), a token-level signal that enables shaping to operate dynamically over the training sequence. Building on this, we present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families, all without compromising capability on intended tasks. We encourage future safety research to build on dynamic shaping principles for stronger mitigation against evolving finetuning risks.

Submitted 22 May, 2025; originally announced May 2025.

arXiv:2505.16277 [pdf, ps, other]

Spontaneous Speech Variables for Evaluating LLMs Cognitive Plausibility

Authors: Sheng-Fu Wang, Laurent Prevot, Jou-an Chi, Ri-Sheng Huang, Shu-Kai Hsieh

Abstract: The achievements of Large Language Models in Natural Language Processing, especially for high-resource languages, call for a better understanding of their characteristics from a cognitive perspective. Researchers have attempted to evaluate artificial models by testing their ability to predict behavioral (e.g., eye-tracking fixations) and physiological (e.g., brain responses) variables during language processing (e.g., reading/listening). In this paper, we propose using spontaneous speech corpora to derive production variables (speech reductions, prosodic prominences) and applying them in a similar fashion. More precisely, we extract. We then test models trained with a standard procedure on different pretraining datasets (written, spoken, and mixed genres) for their ability to predict these two variables. Our results show that, after some fine-tuning, the models can predict these production variables well above baselines. We also observe that spoken genre training data provides more accurate predictions than written genres. These results contribute to the broader effort of using high-quality speech corpora as benchmarks for LLMs.

Submitted 22 May, 2025; originally announced May 2025.

Comments: The 14th Workshop on Cognitive Modeling and Computational Linguistics (CMCL). May 3, 2025. Collocated with NAACL 2025

arXiv:2505.11909 [pdf, other]

Bridging the Inter-Domain Gap through Low-Level Features for Cross-Modal Medical Image Segmentation

Authors: Pengfei Lyu, Pak-Hei Yeung, Xiaosheng Yu, Jing Xia, Jianning Chi, Chengdong Wu, Jagath C. Rajapakse

Abstract: This paper addresses the task of cross-modal medical image segmentation by exploring unsupervised domain adaptation (UDA) approaches. We propose a model-agnostic UDA framework, LowBridge, which builds on a simple observation that cross-modal images share some similar low-level features (e.g., edges) as they are depicting the same structures. Specifically, we first train a generative model to recover the source images from their edge features, followed by training a segmentation model on the generated source images, separately. At test time, edge features from the target images are input to the pretrained generative model to generate source-style target domain images, which are then segmented using the pretrained segmentation network. Despite its simplicity, extensive experiments on various publicly available datasets demonstrate that LowBridge achieves state-of-the-art performance, outperforming eleven existing UDA approaches under different settings. Notably, further ablation studies show that LowBridge is agnostic to different types of generative and segmentation models, suggesting its potential to be seamlessly plugged with the most advanced models to achieve even more outstanding results in the future. The code is available at https://github.com/JoshuaLPF/LowBridge.

Submitted 17 May, 2025; originally announced May 2025.

Comments: 11 pages, 2 figures

arXiv:2504.14493 [pdf, ps, other]

FinSage: A Multi-aspect RAG System for Financial Filings Question Answering

Authors: Xinyu Wang, Jijun Chi, Zhenghan Tai, Tung Sum Thomas Kwok, Muzhi Li, Zhuhong Li, Hailin He, Yuchen Hua, Peng Lu, Suyuchen Wang, Yihong Wu, Jerry Huang, Jingrui Tian, Fengran Mo, Yufei Cui, Ling Zhou

Abstract: Leveraging large language models in real-world settings often entails a need to utilize domain-specific data and tools in order to follow the complex regulations that need to be followed for acceptable use. Within financial sectors, modern enterprises increasingly rely on Retrieval-Augmented Generation (RAG) systems to address complex compliance requirements in financial document workflows. However, existing solutions struggle to account for the inherent heterogeneity of data (e.g., text, tables, diagrams) and evolving nature of regulatory standards used in financial filings, leading to compromised accuracy in critical information extraction. We propose the FinSage framework as a solution, utilizing a multi-aspect RAG framework tailored for regulatory compliance analysis in multi-modal financial documents. FinSage introduces three innovative components: (1) a multi-modal pre-processing pipeline that unifies diverse data formats and generates chunk-level metadata summaries, (2) a multi-path sparse-dense retrieval system augmented with query expansion (HyDE) and metadata-aware semantic search, and (3) a domain-specialized re-ranking module fine-tuned via Direct Preference Optimization (DPO) to prioritize compliance-critical content. Extensive experiments demonstrate that FinSage achieves an impressive recall of 92.51% on 75 expert-curated questions derived from surpasses the best baseline method on the FinanceBench question answering datasets by 24.06% in accuracy. Moreover, FinSage has been successfully deployed as a financial question-answering agent in online meetings, where it has already served more than 1,200 people.

Submitted 6 June, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

arXiv:2504.13914 [pdf, other]

Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.

Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

arXiv:2504.11536 [pdf, other]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Authors: Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong

Abstract: While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.

Submitted 17 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

Comments: fix typos

arXiv:2504.01956 [pdf, other]

VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

Authors: Hanyang Wang, Fangfu Liu, Jiawei Chi, Yueqi Duan

Abstract: Recovering 3D scenes from sparse views is a challenging task because the problem is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, they still suffer from performance degradation when input views have minimal overlap and insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering research has started to explore the potential of video generative priors to create 3D scenes from sparse views. Despite impressive improvements, these approaches are limited by slow inference time and the lack of 3D constraints, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video to 3D applications. Project Page: https://hanyang-21.github.io/VideoScene

Submitted 3 April, 2025; v1 submitted 2 April, 2025; originally announced April 2025.

Comments: Accepted by CVPR 2025; Project Page: https://hanyang-21.github.io/VideoScene

arXiv:2502.19707 [pdf, other]

Weakly Supervised Segmentation Framework for Thyroid Nodule Based on High-confidence Labels and High-rationality Losses

Authors: Jianning Chi, Zelan Li, Geng Lin, MingYang Sun, Xiaosheng Yu

Abstract: Weakly supervised segmentation methods can delineate thyroid nodules in ultrasound images efficiently using training data with coarse labels, but suffer from: 1) low-confidence pseudo-labels that follow topological priors, introducing significant label noise, and 2) low-rationality loss functions that rigidly compare segmentation with labels, ignoring discriminative information for nodules with diverse and complex shapes. To solve these issues, we clarify the objective and references for weakly supervised ultrasound image segmentation, presenting a framework with high-confidence pseudo-labels to represent topological and anatomical information and high-rationality losses to capture multi-level discriminative features. Specifically, we fuse geometric transformations of four-point annotations and MedSAM model results prompted by specific annotations to generate high-confidence box, foreground, and background labels. Our high-rationality learning strategy includes: 1) Alignment loss measuring spatial consistency between segmentation and box label, and topological continuity within the foreground label, guiding the network to perceive nodule location; 2) Contrastive loss pulling features from labeled foreground regions while pushing features from labeled foreground and background regions, guiding the network to learn nodule and background feature distribution; 3) Prototype correlation loss measuring consistency between correlation maps derived by comparing features with foreground and background prototypes, refining uncertain regions to accurate nodule edges. Experimental results show that our method achieves state-of-the-art performance on the TN3K and DDTI datasets. The code is available at https://github.com/bluehenglee/MLI-MSC.

Submitted 26 February, 2025; originally announced February 2025.

Comments: 10 pages, 6 figures

ACM Class: J.3.3

arXiv:2502.05389 [pdf, other]

The Role of Prosody in Spoken Question Answering

Authors: Jie Chi, Maureen de Seyssel, Natalie Schluter

Abstract: Spoken language understanding research to date has generally carried a heavy text perspective. Most datasets are derived from text, which is then subsequently synthesized into speech, and most models typically rely on automatic transcriptions of speech. This is to the detriment of prosody--additional information carried by the speech signal beyond the phonetics of the words themselves and difficult to recover from text alone. In this work, we investigate the role of prosody in Spoken Question Answering. By isolating prosodic and lexical information on the SLUE-SQA-5 dataset, which consists of natural speech, we demonstrate that models trained on prosodic information alone can perform reasonably well by utilizing prosodic cues. However, we find that when lexical information is available, models tend to predominantly rely on it. Our findings suggest that while prosodic cues provide valuable supplementary information, more effective integration methods are required to ensure prosody contributes more significantly alongside lexical features.

Submitted 7 February, 2025; originally announced February 2025.

Comments: accepted to NAACL 2025 Findings

arXiv:2501.12127 [pdf, ps, other]

On the cohomology of simple Shimura varieties with non quasi-split local groups

Authors: Jingren Chi, Thomas J. Haines

Abstract: We study the Scholze test functions for bad reduction of simple Shimura varieties at a prime where the underlying local group is any inner form of a product of Weil restrictions of general linear groups. Using global methods, we prove that these test functions satisfy a vanishing property of their twisted orbital integrals, and we prove that the pseudostabilization base changes of such functions exist (even though the local group need not be quasi-split) and can be expressed in terms of explicit distributions in the stable Bernstein center. We then deduce applications to the stable trace formula and local Hasse-Weil zeta functions for these Shimura varieties.

Submitted 1 February, 2025; v1 submitted 21 January, 2025; originally announced January 2025.

Comments: 48 pages

arXiv:2411.17713 [pdf, other]

Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations

Authors: Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel, Zechun Liu, Changsheng Zhao, Yangyang Shi, Tijmen Blankevoort, Mahesh Pasupuleti, Bilge Soran, Zacharie Delpierre Coudert, Rachad Alao, Raghuraman Krishnamoorthi, Vikas Chandra

Abstract: This paper presents Llama Guard 3-1B-INT4, a compact and efficient Llama Guard model, which has been open-sourced to the community during Meta Connect 2024. We demonstrate that Llama Guard 3-1B-INT4 can be deployed on resource-constrained devices, achieving a throughput of at least 30 tokens per second and a time-to-first-token of 2.5 seconds or less on a commodity Android mobile CPU. Notably, our experiments show that Llama Guard 3-1B-INT4 attains comparable or superior safety moderation scores to its larger counterpart, Llama Guard 3-1B, despite being approximately 7 times smaller in size (440MB).

Submitted 18 November, 2024; originally announced November 2024.

arXiv:2411.11033 [pdf, other]

REACCEPT: Automated Co-evolution of Production and Test Code Based on Dynamic Validation and Large Language Models

Authors: Jianlei Chi, Xiaotian Wang, Yuhan Huang, Lechen Yu, Di Cui, Jianguo Sun, Jun Sun

Abstract: Synchronizing production and test code, known as PT co-evolution, is critical for software quality in the software development lifecycle. Existing methods for automatic PT co-evolution either utilize predefined heuristic rules or rely on simple application of machine learning techniques. Due to the limitations of underlying techniques, existing methods either only partially automate PT co-evolution (e.g., only automate obsolete test code identification) or result in low accuracy. In this paper, we propose REACCEPT, a novel approach that leverages large language models and dynamic validation to fully automate PT co-evolution (i.e., capable of both identifying and updating obsolete test cases). REACCEPT relies on experience-based prompt template generation, dynamic validation, and retrieval-augmented generation techniques to accomplish automated PT co-evolution. To evaluate REACCEPT's effectiveness, we conducted extensive experiments with a dataset of 537 Java projects and compared REACCEPT's performance with several state-of-the-art methods. Results show that REACCEPT achieved an update accuracy of 60.16% on correctly identified obsolete test code, surpassing the state-of-the-art technique CEPROT by 90%. This confirms that REACCEPT can effectively assist developers in maintaining test code, improving overall software quality and reducing maintenance effort.

Submitted 17 November, 2024; originally announced November 2024.

Comments: 21 pages, 8 figures

arXiv:2411.10414 [pdf, other]

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

Authors: Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, Mahesh Pasupuleti

Abstract: We introduce Llama Guard 3 Vision, a multimodal LLM-based safeguard for human-AI conversations that involve image understanding: it can be used to safeguard content for both multimodal LLM inputs (prompt classification) and outputs (response classification). Unlike the previous text-only Llama Guard versions (Inan et al., 2023; Llama Team, 2024b,a), it is specifically designed to support image reasoning use cases and is optimized to detect harmful multimodal (text and image) prompts and text responses to these prompts. Llama Guard 3 Vision is fine-tuned on Llama 3.2-Vision and demonstrates strong performance on the internal benchmarks using the MLCommons taxonomy. We also test its robustness against adversarial attacks. We believe that Llama Guard 3 Vision serves as a good starting point to build more capable and robust content moderation tools for human-AI conversation with multimodal capabilities.

Submitted 15 November, 2024; originally announced November 2024.

arXiv:2410.19332 [pdf, other]

Beyond Point Annotation: A Weakly Supervised Network Guided by Multi-Level Labels Generated from Four-Point Annotation for Thyroid Nodule Segmentation in Ultrasound Image

Authors: Jianning Chi, Zelan Li, Huixuan Wu, Wenjun Zhang, Ying Huang

Abstract: Weakly-supervised methods typically guide pixel-wise training by comparing the predictions to single-level labels containing diverse segmentation-related information at once, but struggle to represent delicate feature differences between nodule and background regions and confuse incorrect information, resulting in underfitting or overfitting in the segmentation predictions. In this work, we propose a weakly-supervised network that generates multi-level labels from four-point annotation to refine diverse constraints for delicate nodule segmentation. The Distance-Similarity Fusion Prior, which refers to the point annotations, filters out information irrelevant to nodules. The bounding box and pure foreground/background labels, generated from the point annotation, guarantee the rationality of the prediction in the arrangement of target localization and the spatial distribution of target/background regions, respectively. Our proposed network outperforms existing weakly-supervised methods on two public datasets with respect to accuracy and robustness, improving the applicability of deep-learning based segmentation in the clinical practice of thyroid nodule diagnosis.

Submitted 25 October, 2024; originally announced October 2024.

arXiv:2410.18210 [pdf, other]

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks

Authors: Samuele Poppi, Zheng-Xin Yong, Yifei He, Bobbie Chern, Han Zhao, Aobo Yang, Jianfeng Chi

Abstract: Recent advancements in Large Language Models (LLMs) have sparked widespread concerns about their safety. Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning with a few adversarially chosen instruction-following examples, i.e., fine-tuning attacks. We take a further step to understand fine-tuning attacks in multilingual LLMs. We first discover cross-lingual generalization of fine-tuning attacks: using a few adversarially chosen instruction-following examples in one language, multilingual LLMs can also be easily compromised (e.g., multilingual LLMs fail to refuse harmful prompts in other languages). Motivated by this finding, we hypothesize that safety-related information is language-agnostic and propose a new method termed Safety Information Localization (SIL) to identify the safety-related information in the model parameter space. Through SIL, we validate this hypothesis and find that only changing 20% of weight parameters in fine-tuning attacks can break safety alignment across all languages. Furthermore, we provide evidence for the alternative pathways hypothesis for why freezing safety-related parameters does not prevent fine-tuning attacks, and we demonstrate that our attack vector can still jailbreak LLMs adapted to new languages.

Submitted 27 February, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

Comments: 15 pages, 6 figures, 7 tables

arXiv:2410.13722 [pdf, other]

Persistent Pre-Training Poisoning of LLMs

Authors: Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, Daphne Ippolito

Abstract: Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2409.14586 [pdf, other]

Backtracking Improves Generation Safety

Authors: Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith

Abstract: Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to "undo" and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times more safe than the baseline model (6.1% $\to$ 1.5%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so.

Submitted 22 September, 2024; originally announced September 2024.

arXiv:2409.08575 [pdf, ps, other]

A Simple approach for precision calculation of Bethe logarithm

Authors: San-Jiang Yang, Jing Chi, Wan-Ping Zhou, Li-Yan Tang, Zhen-Xiang Zhong, Ting-Yun Shi, Hao-Xue Qiao

Abstract: In this article we propose a simple approach for the precision calculation of the Bethe logarithm. The leading contributions are obtained using specific operators, while the remaining terms are eliminated by adjusting the parameter $λ$. Through the use of dimensional regularization, singular divergences are algebraically canceled. Compared to the standard form of the Bethe logarithm, our approach significantly reduces the complexity of constructing pseudostates in numerical evaluations. Using this approach we obtain a highly precise result for the Bethe logarithm of the ground state of hydrogen, achieving 49 significant digits. For multi-electron systems this approach appears to be simple and efficient as well.

Submitted 13 September, 2024; originally announced September 2024.

Comments: 8 pages, 5 tables

arXiv:2408.12832 [pdf, other]

LIMP: Large Language Model Enhanced Intent-aware Mobility Prediction

Authors: Songwei Li, Jie Feng, Jiawei Chi, Xinyuan Hu, Xiaomeng Zhao, Fengli Xu

Abstract: Human mobility prediction is essential for applications like urban planning and transportation management, yet it remains challenging due to the complex, often implicit, intentions behind human behavior. Existing models predominantly focus on spatiotemporal patterns, paying less attention to the underlying intentions that govern movements. Recent advancements in large language models (LLMs) offer a promising alternative research angle for integrating commonsense reasoning into mobility prediction. However, it is a non-trivial problem because LLMs are not natively built for mobility intention inference, and they also face scalability issues and integration difficulties with spatiotemporal models. To address these challenges, we propose a novel LIMP (LLMs for Intent-aware Mobility Prediction) framework. Specifically, LIMP introduces an "Analyze-Abstract-Infer" (A2I) agentic workflow to unleash LLM's commonsense reasoning power for mobility intention inference. Besides, we design an efficient fine-tuning scheme to transfer reasoning power from commercial LLM to smaller-scale, open-source language model, ensuring LIMP's scalability to millions of mobility records. Moreover, we propose a transformer-based intention-aware mobility prediction model to effectively harness the intention inference ability of LLM. Evaluated on two real-world datasets, LIMP significantly outperforms baseline models, demonstrating improved accuracy in next-location prediction and effective intention inference. The interpretability of intention-aware mobility prediction highlights our LIMP framework's potential for real-world applications. Code and data can be found at https://github.com/tsinghua-fib-lab/LIMP.

Submitted 23 August, 2024; originally announced August 2024.

Comments: 13 pages

arXiv:2408.07362 [pdf, other]

BadMerging: Backdoor Attacks Against Model Merging

Authors: Jinghuai Zhang, Jianfeng Chi, Zheng Li, Kunlin Cai, Yang Zhang, Yuan Tian

Abstract: Fine-tuning pre-trained models for downstream tasks has led to a proliferation of open-sourced task-specific models. Recently, Model Merging (MM) has emerged as an effective approach to facilitate knowledge transfer among these independently fine-tuned models. MM directly combines multiple fine-tuned task-specific models into a merged model without additional training, and the resulting model shows enhanced capabilities in multiple tasks. Although MM provides great utility, it may come with security risks because an adversary can exploit MM to affect multiple downstream tasks. However, the security risks of MM have barely been studied. In this paper, we first find that MM, as a new learning paradigm, introduces unique challenges for existing backdoor attacks due to the merging process. To address these challenges, we introduce BadMerging, the first backdoor attack specifically designed for MM. Notably, BadMerging allows an adversary to compromise the entire merged model by contributing as few as one backdoored task-specific model. BadMerging comprises a two-stage attack mechanism and a novel feature-interpolation-based loss to enhance the robustness of embedded backdoors against the changes of different merging parameters. Considering that a merged model may incorporate tasks from different domains, BadMerging can jointly compromise the tasks provided by the adversary (on-task attack) and other contributors (off-task attack) and solve the corresponding unique challenges with novel attack designs. Extensive experiments show that BadMerging achieves remarkable attacks against various MM algorithms. Our ablation study demonstrates that the proposed attack designs can progressively contribute to the attack performance. Finally, we show that prior defense mechanisms fail to defend against our attacks, highlighting the need for more advanced defense.

Submitted 2 September, 2024; v1 submitted 14 August, 2024; originally announced August 2024.

Comments: To appear in ACM Conference on Computer and Communications Security (CCS), 2024

arXiv:2407.21783 [pdf, other]

The Llama 3 Herd of Models

Authors: Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, et al. (536 additional authors not shown)

Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Submitted 23 November, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

arXiv:2406.06839 [pdf, other]

EAVE: Efficient Product Attribute Value Extraction via Lightweight Sparse-layer Interaction

Authors: Li Yang, Qifan Wang, Jianfeng Chi, Jiahao Liu, Jingang Wang, Fuli Feng, Zenglin Xu, Yi Fang, Lifu Huang, Dongfang Liu

Abstract: Product attribute value extraction involves identifying the specific values associated with various attributes from a product profile. While existing methods often prioritize the development of effective models to improve extraction performance, there has been limited emphasis on extraction efficiency. However, in real-world scenarios, products are typically associated with multiple attributes, necessitating multiple extractions to obtain all corresponding values. In this work, we propose an Efficient product Attribute Value Extraction (EAVE) approach via lightweight sparse-layer interaction. Specifically, we employ a heavy encoder to separately encode the product context and attribute. The resulting non-interacting heavy representations of the context can be cached and reused for all attributes. Additionally, we introduce a light encoder to jointly encode the context and the attribute, facilitating lightweight interactions between them. To enrich the interaction within the lightweight encoder, we design a sparse-layer interaction module to fuse the non-interacting heavy representation into the lightweight encoder. Comprehensive evaluation on two benchmarks demonstrates that our method achieves significant efficiency gains with neutral or marginal loss in performance when the context is long and the number of attributes is large. Our code is available at https://anonymous.4open.science/r/EAVE-EA18.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2404.09945 [pdf, ps, other]

Witt vector affine Springer fibers

Authors: Jingren Chi

Abstract: We establish dimension formulas for the Witt vector affine Springer fibers associated to a reductive group over a mixed characteristic local field, under the assumption that the group is essentially tamely ramified and the residue characteristic is not bad. Besides the discriminant valuations that show up in classical works on the usual affine Springer fibers, our formula also involves the Artin conductors and the Kottwitz invariants of the relevant conjugacy classes.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2403.01777 [pdf, other]

NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models

Authors: Lizhou Fan, Wenyue Hua, Xiang Li, Kaijie Zhu, Mingyu Jin, Lingyao Li, Haoyang Ling, Jinkui Chi, Jindong Wang, Xin Ma, Yongfeng Zhang

Abstract: Understanding the reasoning capabilities of Multimodal Large Language Models (MLLMs) is an important area of research. In this study, we introduce a dynamic benchmark, NPHardEval4V, aimed at addressing the existing gaps in evaluating the pure reasoning abilities of MLLMs. Our benchmark aims to provide a venue to disentangle the effect of various factors, such as image recognition and instruction following, from the overall performance of the models, allowing us to focus solely on evaluating their reasoning abilities. It is built by converting textual descriptions of questions from NPHardEval to image representations. Our findings reveal significant discrepancies in reasoning abilities across different models and highlight the relatively weak performance of MLLMs compared to LLMs in terms of reasoning. We also investigate the impact of different prompting styles, including visual, text, and combined visual and text prompts, on the reasoning abilities of MLLMs, demonstrating the different impacts of multimodal inputs on model performance. Unlike traditional benchmarks, which focus primarily on static evaluations, our benchmark will be updated monthly to prevent overfitting and ensure a more authentic and fine-grained evaluation of the models. We believe that this benchmark can aid in understanding and guide the further development of reasoning abilities in MLLMs. The benchmark dataset and code are available at https://github.com/lizhouf/NPHardEval4V

Submitted 5 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

Comments: 16 pages, 10 figures, 2 tables

arXiv:2401.01773 [pdf, other]

A Global Analysis of Pre-Earthquake Ionospheric Anomalies

Authors: Luke Cullen, Andy W Smith, Asadullah H Galib, Debvrat Varshney, Edward J E Brown, Peter J Chi, Xiangning Chu, Filip Svoboda

Abstract: Local ionospheric density anomalies have been reported in the days prior to major earthquakes. This global study statistically investigates whether consistent ionospheric anomalies occur in the 24 hours prior to earthquakes across different regions, magnitudes, temporal and spatial scales. We match earthquake data to Total Electron Content (TEC) data from 2000-2020 at a higher resolution and cadence than previously assessed. Globally, no significant, consistent anomaly is found. Regionally, statistically significant ionospheric anomalies arise in the 12 hours prior to earthquakes with $p \leq 0.01$ following Wilcoxon tests. For the Japanese region we find a median negative ionospheric anomaly of around 0.5 TECU between 3 and 8 hours before earthquakes. For the South American region, the median TEC is enhanced by up to ~2 TECU, between 7 and 10 hours before an event. We show that the results are robust to different definitions of the "local" region and earthquake magnitude. This demonstrates the promise of monitoring the ionosphere as part of a multimodal earthquake forecasting system.

Submitted 3 January, 2024; originally announced January 2024.

Comments: 12 pages, 4 figures. Presented at AGU fall meeting 2022 (https://agu.confex.com/agu/fm22/meetingapp.cgi/Paper/1142329)

arXiv:2312.06674 [pdf, other]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa

Abstract: We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model's capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. We are making Llama Guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for AI safety.

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2310.18606 [pdf, other]

doi 10.1145/3637528.3671758

Where have you been? A Study of Privacy Risk for Point-of-Interest Recommendation

Authors: Kunlin Cai, Jinghuai Zhang, Zhiqing Hong, Will Shand, Guang Wang, Desheng Zhang, Jianfeng Chi, Yuan Tian

Abstract: As location-based services (LBS) have grown in popularity, more human mobility data has been collected. The collected data can be used to build machine learning (ML) models for LBS to enhance their performance and improve overall experience for users. However, the convenience comes with the risk of privacy leakage since this type of data might contain sensitive information related to user identities, such as home/work locations. Prior work focuses on protecting mobility data privacy during transmission or prior to release, lacking the privacy risk evaluation of mobility data-based ML models. To better understand and quantify the privacy leakage in mobility data-based ML models, we design a privacy attack suite containing data extraction and membership inference attacks tailored for point-of-interest (POI) recommendation models, one of the most widely used mobility data-based ML models. These attacks in our attack suite assume different adversary knowledge and aim to extract different types of sensitive information from mobility data, providing a holistic privacy risk assessment for POI recommendation models. Our experimental evaluation using two real-world mobility datasets demonstrates that current POI recommendation models are vulnerable to our attacks. We also present unique findings to understand what types of mobility data are more susceptible to privacy attacks. Finally, we evaluate defenses against these attacks and highlight future directions and challenges. Our attack suite is released at https://github.com/KunlinChoi/POIPrivacy.

Submitted 5 July, 2024; v1 submitted 28 October, 2023; originally announced October 2023.

Comments: 18 pages

Journal ref: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2024)

arXiv:2306.09468 [pdf, other]

FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods

Authors: Xiaotian Han, Jianfeng Chi, Yu Chen, Qifan Wang, Han Zhao, Na Zou, Xia Hu

Abstract: This paper introduces the Fair Fairness Benchmark (FFB), a benchmarking framework for in-processing group fairness methods. Ensuring fairness in machine learning is important for ethical compliance. However, there exist challenges in comparing and developing fairness methods due to inconsistencies in experimental settings, lack of accessible algorithmic implementations, and limited extensibility of current fairness packages and tools. To address these issues, we introduce an open-source standardized benchmark for evaluating in-processing group fairness methods and provide a comprehensive analysis of state-of-the-art methods to ensure different notions of group fairness. This work offers the following key contributions: the provision of flexible, extensible, minimalistic, and research-oriented open-source code; the establishment of unified fairness method benchmarking pipelines; and extensive benchmarking, which yields key insights from 45,079 experiments and 14,428 GPU hours. We believe that our work will significantly facilitate the growth and development of the fairness research community.

Submitted 10 June, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: ICLR2024

arXiv:2212.10011 [pdf, other]

PLUE: Language Understanding Evaluation Benchmark for Privacy Policies in English

Authors: Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, Kai-Wei Chang

Abstract: Privacy policies provide individuals with information about their rights and how their personal information is handled. Natural language understanding (NLU) technologies can support individuals and practitioners to better understand privacy practices described in lengthy and complex documents. However, existing efforts that use NLU technologies are limited by processing the language in a way exclusive to a single task focusing on certain privacy practices. To this end, we introduce the Privacy Policy Language Understanding Evaluation (PLUE) benchmark, a multi-task benchmark for evaluating privacy policy language understanding across various tasks. We also collect a large corpus of privacy policies to enable privacy policy domain-specific language model pre-training. We evaluate several generic pre-trained language models and continue pre-training them on the collected corpus. We demonstrate that domain-specific continual pre-training offers performance improvements across all tasks.

Submitted 12 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: ACL 2023. Code is released at https://github.com/JFChi/PLUE

arXiv:2211.05749 [pdf, other]

doi 10.1137/23M155493X

Linear Discriminant Analysis with the Randomized Kaczmarz Method

Authors: Jocelyn T. Chi, Deanna Needell

Abstract: We present a randomized Kaczmarz method for linear discriminant analysis (rkLDA), an iterative randomized approach to binary-class Gaussian model linear discriminant analysis (LDA) for very large data. We harness a least squares formulation and mobilize the stochastic gradient descent framework to obtain a randomized classifier with performance that can achieve comparable accuracy to that of full data LDA. We present analysis for the expected change in the LDA discriminant function if one employs the randomized Kaczmarz solution in lieu of the full data least squares solution that accounts for both the Gaussian modeling assumptions on the data and algorithmic randomness. Our analysis shows how the expected change depends on quantities inherent in the data such as the scaled condition number and Frobenius norm of the input data, how well the linear model fits the data, and choices from the randomized algorithm. Our experiments demonstrate that rkLDA can offer a viable alternative to full data LDA on a range of step-sizes and numbers of iterations.

Submitted 7 January, 2025; v1 submitted 10 November, 2022; originally announced November 2022.

arXiv:2209.04968 [pdf, other]

Population-Based Hierarchical Non-negative Matrix Factorization for Survey Data

Authors: Xiaofu Ding, Xinyu Dong, Olivia McGough, Chenxin Shen, Annie Ulichney, Ruiyao Xu, William Swartworth, Jocelyn T. Chi, Deanna Needell

Abstract: Motivated by the problem of identifying potential hierarchical population structure on modern survey data containing a wide range of complex data types, we introduce population-based hierarchical non-negative matrix factorization (PHNMF). PHNMF is a variant of hierarchical non-negative matrix factorization based on feature similarity. As such, it enables an automatic and interpretable approach for identifying and understanding hierarchical structure in a data matrix constructed from a wide range of data types. Our numerical experiments on synthetic and real survey data demonstrate that PHNMF can recover latent hierarchical population structure in complex data with high accuracy. Moreover, the recovered subpopulation structure is meaningful and can be useful for improving downstream inference.

Submitted 11 September, 2022; originally announced September 2022.

arXiv:2207.00012 [pdf, other]

doi 10.1145/3534678.3539484

Reliable Representations Make A Stronger Defender: Unsupervised Structure Refinement for Robust GNN

Authors: Kuan Li, Yang Liu, Xiang Ao, Jianfeng Chi, Jinghua Feng, Hao Yang, Qing He

Abstract: Benefiting from the message passing mechanism, Graph Neural Networks (GNNs) have been successful on numerous tasks over graph data. However, recent studies have shown that attackers can catastrophically degrade the performance of GNNs by maliciously modifying the graph structure. A straightforward solution to remedy this issue is to model the edge weights by learning a metric function between pairwise representations of two end nodes, which attempts to assign low weights to adversarial edges. The existing methods use either raw features or representations learned by supervised GNNs to model the edge weights. However, both strategies face immediate problems: raw features cannot represent various properties of nodes (e.g., structure information), and representations learned by a supervised GNN may suffer from the poor performance of the classifier on the poisoned graph. We need representations that carry both feature information and as much correct structure information as possible and are insensitive to structural perturbations. To this end, we propose an unsupervised pipeline, named STABLE, to optimize the graph structure. Finally, we input the well-refined graph into a downstream classifier. For this part, we design an advanced GCN that significantly enhances the robustness of vanilla GCN without increasing the time complexity. Extensive experiments on four real-world graph benchmarks demonstrate that STABLE outperforms the state-of-the-art methods and successfully defends against various attacks.

Submitted 21 April, 2023; v1 submitted 30 June, 2022; originally announced July 2022.

Comments: Accepted in KDD2022

arXiv:2205.11485 [pdf, other]

Conditional Supervised Contrastive Learning for Fair Text Classification

Authors: Jianfeng Chi, William Shand, Yaodong Yu, Kai-Wei Chang, Han Zhao, Yuan Tian

Abstract: Contrastive representation learning has gained much attention due to its superior performance in learning representations from both image and sequential data. However, the learned representations could potentially lead to performance disparities in downstream tasks, such as increased silencing of underrepresented groups in toxicity comment classification. In light of this challenge, in this work, we study learning fair representations that satisfy a notion of fairness known as equalized odds for text classification via contrastive learning. Specifically, we first theoretically analyze the connections between learning representations with a fairness constraint and conditional supervised contrastive objectives, and then propose to use conditional supervised contrastive objectives to learn fair representations for text classification. We conduct experiments on two text datasets to demonstrate the effectiveness of our approaches in balancing the trade-offs between task performance and bias mitigation among existing baselines for text classification. Furthermore, we also show that the proposed methods are stable in different hyperparameter settings.

Submitted 31 October, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: Findings of EMNLP 2022

arXiv:2204.08952 [pdf, other]

Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

Authors: Md Rizwan Parvez, Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, Kai-Wei Chang

Abstract: Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expands the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%. Our ablation studies provide further insights into the effectiveness of our approach.

Submitted 22 April, 2023; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: EACL 2023

arXiv:2203.09360 [pdf, other]

doi 10.1109/TIFS.2022.3208471

Behavior-aware Account De-anonymization on Ethereum Interaction Graph

Authors: Jiajun Zhou, Chenkai Hu, Jianlei Chi, Jiajing Wu, Meng Shen, Qi Xuan

Abstract: Blockchain technology is decentralized, traceable and tamper-proof, which creates a reliable decentralized trust mechanism, further accelerating the development of blockchain finance. However, the anonymization of blockchain hinders market regulation, resulting in increasing illegal activities such as money laundering, gambling and phishing fraud on blockchain financial platforms. Thus, financial security has become a top priority in the blockchain ecosystem, calling for effective market regulation. In this paper, we consider identifying Ethereum accounts from a graph classification perspective, and propose an end-to-end graph neural network framework named Ethident, to characterize the behavior patterns of accounts and further achieve account de-anonymization. Specifically, we first construct an Account Interaction Graph (AIG) using raw Ethereum data. Then we design a hierarchical graph attention encoder named HGATE as the backbone of our framework, which can effectively characterize the node-level account features and subgraph-level behavior patterns. To alleviate account label scarcity, we further introduce a contrastive self-supervision mechanism as regularization to jointly train our framework. Comprehensive experiments on Ethereum datasets demonstrate that our framework achieves superior performance in account identification, yielding a 1.13% to 4.93% relative improvement over the previous state-of-the-art. Furthermore, detailed analyses illustrate the effectiveness of Ethident in identifying and understanding the behavior of known participants in Ethereum (e.g. exchanges, miners, etc.), as well as that of the lawbreakers (e.g. phishing scammers, hackers, etc.), which may aid in risk assessment and market regulation.

Submitted 13 September, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

Comments: Accepted by IEEE Transactions on Information Forensics & Security

Journal ref: in IEEE Transactions on Information Forensics and Security, vol. 17, pp. 3433-3448, 2022

arXiv:2112.03470 [pdf]

doi 10.1016/j.prostr.2022.01.060

Structural Health Monitoring of a Foot Bridge in Virtual Reality Environment

Authors: Furkan Luleci, Liangding Li, Jiapeng Chi, Dirk Reiners, Carolina Cruz-Neira, F. Necati Catbas

Abstract: Ageing civil infrastructure systems require imminent attention before any failure mechanism becomes critical. Structural Health Monitoring (SHM) is employed to track inputs and/or responses of structural systems for decision support. Inspections and structural health monitoring require field visits, and subsequently expert assessment of critical elements on site, which may be both time-consuming and costly. Also, fieldwork including visits and inspections may pose danger and require personal protective equipment and structure closures during the fieldwork. To address some of these issues, a Virtual Reality (VR) collaborative application is developed to bring the structure and SHM data from the field to the office such that many experts from different places can simultaneously virtually visit the bridge structure for final assessment. In this work, we present an SHM system in a VR environment that includes the technical and visual information necessary for the engineers to make decisions for a footbridge on the campus of the University of Central Florida. In this VR application, for the visualization stage, UAV (Unmanned Air Vehicle) photogrammetry and LiDAR (Light Detection and Ranging) methods are used to capture the bridge. For the technical assessment stage, Finite Element Analysis (FEA) and Operational Modal Analysis (OMA) from vibration data as part of SHM are analyzed. To better visualize the dynamic response of the structure immersively, the operational behaviour from the FEA is reflected on the LiDAR point cloud model. The multi-user feature allowing teams to collaborate simultaneously is essential for decision-making activities. In conclusion, the proposed VR environment offers the potential to provide beneficial features with further automated and real-time improvements along with the SHM and FEA models.

Submitted 3 March, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

arXiv:2111.10476 [pdf, other]

Towards Return Parity in Markov Decision Processes

Authors: Jianfeng Chi, Jian Shen, Xinyi Dai, Weinan Zhang, Yuan Tian, Han Zhao

Abstract: Algorithmic decisions made by machine learning models in high-stakes domains may have lasting impacts over time. However, naive applications of standard fairness criteria in static settings over temporal domains may lead to delayed and adverse effects. To understand the dynamics of performance disparity, we study a fairness problem in Markov decision processes (MDPs). Specifically, we propose return parity, a fairness notion that requires MDPs from different demographic groups that share the same state and action spaces to achieve approximately the same expected time-discounted rewards. We first provide a decomposition theorem for return disparity, which decomposes the return disparity of any two MDPs sharing the same state and action spaces into the distance between group-wise reward functions, the discrepancy of group policies, and the discrepancy between state visitation distributions induced by the group policies. Motivated by our decomposition theorem, we propose algorithms to mitigate return disparity via learning a shared group policy with state visitation distributional alignment using integral probability metrics. We conduct experiments to corroborate our results, showing that the proposed algorithm can successfully close the disparity gap while maintaining the performance of policies on two real-world recommender system benchmark datasets.

Submitted 25 February, 2022; v1 submitted 19 November, 2021; originally announced November 2021.

Comments: AISTATS 2022. Code is released at https://github.com/JFChi/Return-Parity-MDP

arXiv:2110.11707 [pdf, other]

Variational Wasserstein Barycenters with c-Cyclical Monotonicity

Authors: Jinjin Chi, Zhiyao Yang, Jihong Ouyang, Ximing Li

Abstract: The Wasserstein barycenter, built on the theory of optimal transport, provides a powerful framework to aggregate probability distributions, and it has increasingly attracted great attention within the machine learning community. However, it suffers from a severe computational burden, especially for high dimensional and continuous settings. To this end, we develop a novel continuous approximation method for the Wasserstein barycenters problem given sample access to the input distributions. The basic idea is to introduce a variational distribution as the approximation of the true continuous barycenter, so as to frame the barycenters computation problem as an optimization problem, where parameters of the variational distribution adjust the proxy distribution to be similar to the barycenter. Leveraging the variational distribution, we construct a tractable dual formulation for the regularized Wasserstein barycenter problem with c-cyclical monotonicity, which can be efficiently solved by stochastic optimization. We provide theoretical analysis on convergence and demonstrate the practical effectiveness of our method on real applications of subset posterior aggregation and synthetic data.

Submitted 17 December, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

arXiv:2105.03228 [pdf, other]

SEAGLE: A Scalable Exact Algorithm for Large-Scale Set-Based GxE Tests in Biobank Data

Authors: Jocelyn T. Chi, Ilse C. F. Ipsen, Tzu-Hung Hsiao, Ching-Heng Lin, Li-San Wang, Wan-Ping Lee, Tzu-Pin Lu, Jung-Ying Tzeng

Abstract: The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in GxE assessment, especially for set-based GxE variance component (VC) tests, which are a widely used strategy to boost overall GxE signals and to evaluate the joint GxE effect of multiple variants from a biologically meaningful unit (e.g., gene). In this work, we focus on continuous traits and present SEAGLE, a Scalable Exact AlGorithm for Large-scale set-based GxE tests, to permit GxE VC tests for biobank-scale data. SEAGLE employs modern matrix computations to achieve the same "exact" results as the original GxE VC tests without imposing additional assumptions or relying on approximations. SEAGLE can easily accommodate sample sizes on the order of $10^5$, is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate SEAGLE's performance through extensive simulations. We illustrate its utility by conducting genome-wide gene-based GxE analysis on the Taiwan Biobank data to explore the interaction of gene and physical activity status on body mass index.

Submitted 14 May, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

arXiv:2102.12013 [pdf, other]

Understanding and Mitigating Accuracy Disparity in Regression

Authors: Jianfeng Chi, Yuan Tian, Geoffrey J. Gordon, Han Zhao

Abstract: With the widespread deployment of large-scale prediction systems in high-stakes domains, e.g., face recognition, criminal justice, etc., disparity in prediction accuracy between different demographic subgroups has called for fundamental understanding of the source of such disparity and algorithmic intervention to mitigate it. In this paper, we study the accuracy disparity problem in regression. We first propose an error decomposition theorem, which decomposes the accuracy disparity into the distance between marginal label distributions and the distance between conditional representations, to help explain why such accuracy disparity appears in practice. Motivated by this error decomposition and the general idea of distribution alignment with statistical distances, we then propose an algorithm to reduce this disparity, and analyze the game-theoretic optima of the proposed objective functions. To corroborate our theoretical findings, we also conduct experiments on five benchmark datasets. The experimental results suggest that our proposed algorithms can effectively mitigate accuracy disparity while maintaining the predictive power of the regression models.

Submitted 12 June, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

Comments: ICML 2021

arXiv:2101.00123 [pdf, other]

Intent Classification and Slot Filling for Privacy Policies

Authors: Wasi Uddin Ahmad, Jianfeng Chi, Tu Le, Thomas Norton, Yuan Tian, Kai-Wei Chang

Abstract: Understanding privacy policies is crucial for users as it empowers them to learn about the information that matters to them. Sentences written in a privacy policy document explain privacy practices, and the constituent text spans convey further specific information about that practice. We refer to predicting the privacy practice explained in a sentence as intent classification and identifying the text spans sharing specific information as slot filling. In this work, we propose PolicyIE, an English corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications. The PolicyIE corpus is a challenging real-world benchmark with limited labeled examples reflecting the cost of collecting large-scale annotations from domain experts. We present two alternative neural approaches as baselines: (1) intent classification and slot filling as a joint sequence tagging task and (2) modeling them as a sequence-to-sequence (Seq2Seq) learning task. The experimental results show that both approaches perform comparably in intent classification, while the Seq2Seq method outperforms the sequence tagging approach in slot filling by a large margin. We perform a detailed error analysis to reveal the challenges of the proposed corpus.

Submitted 4 June, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

Comments: ACL 2021 (camera ready)

arXiv:2010.10805 [pdf, other]

SeqTrans: Automatic Vulnerability Fix via Sequence to Sequence Learning

Authors: Jianlei Chi, Yu Qu, Ting Liu, Qinghua Zheng, Heng Yin

Abstract: Software vulnerabilities are now reported at an unprecedented speed due to the recent development of automated vulnerability hunting tools. However, fixing vulnerabilities still mainly depends on programmers' manual efforts. Developers need to deeply understand the vulnerability and try to affect the system's functions as little as possible. In this paper, with the advancement of Neural Machine Translation (NMT) techniques, we provide a novel approach called SeqTrans to exploit historical vulnerability fixes to provide suggestions and automatically fix the source code. To capture the contextual information around the vulnerable code, we propose to leverage data flow dependencies to construct code sequences and feed them into the state-of-the-art transformer model. A fine-tuning strategy is introduced to overcome the small sample size problem. We evaluate SeqTrans on a dataset containing 1,282 commits that fix 624 vulnerabilities in 205 Java projects. Results show that the accuracy of SeqTrans outperforms the latest techniques and achieves 23.3% in statement-level fix and 25.3% in CVE-level fix. We also look deeper into the results and observe that the NMT model performs very well on certain kinds of vulnerabilities like CWE-287 (Improper Authentication) and CWE-863 (Incorrect Authorization).

Submitted 22 March, 2022; v1 submitted 21 October, 2020; originally announced October 2020.

Comments: 22 pages, 20 figures, 7 tables

arXiv:2010.08980 [pdf, other]

Querent Intent in Multi-Sentence Questions

Authors: Laurie Burchell, Jie Chi, Tom Hosking, Nina Markl, Bonnie Webber

Abstract: Multi-sentence questions (MSQs) are sequences of questions connected by relations which, unlike sequences of standalone questions, need to be answered as a unit. Following Rhetorical Structure Theory (RST), we recognise that different "question discourse relations" between the subparts of MSQs reflect different speaker intents, and consequently elicit different answering strategies. Correctly identifying these relations is therefore a crucial step in automatically answering MSQs. We identify five different types of MSQs in English, and define five novel relations to describe them. We extract over 162,000 MSQs from Stack Exchange to enable future research. Finally, we implement a high-precision baseline classifier based on surface features.

Submitted 18 October, 2020; originally announced October 2020.

Comments: LAW XIV, COLING 2020

arXiv:2010.04133 [pdf, other]

A User-Friendly Computational Framework for Robust Structured Regression with the L$_2$ Criterion

Authors: Jocelyn T. Chi, Eric C. Chi

Abstract: We introduce a user-friendly computational framework for implementing robust versions of a wide variety of structured regression methods with the L$_{2}$ criterion. In addition to introducing an algorithm for performing L$_{2}$E regression, our framework enables robust regression with the L$_{2}$ criterion for additional structural constraints, works without requiring complex tuning procedures on the precision parameter, can be used to identify heterogeneous subpopulations, and can incorporate readily available non-robust structured regression solvers. We provide convergence guarantees for the framework and demonstrate its flexibility with some examples. Supplementary materials for this article are available online.

Submitted 13 September, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

arXiv:2010.02557 [pdf, other]

PolicyQA: A Reading Comprehension Dataset for Privacy Policies

Authors: Wasi Uddin Ahmad, Jianfeng Chi, Yuan Tian, Kai-Wei Chang

Abstract: Privacy policy documents are long and verbose. A question answering (QA) system can assist users in finding the information that is relevant and important to them. Prior studies in this domain frame the QA task as retrieving the most relevant text segment or a list of sentences from the policy document given a question. In contrast, we argue that providing users with a short text span from policy documents reduces the burden of searching for the target information in a lengthy text segment. In this paper, we present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices. We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.

Submitted 6 October, 2020; originally announced October 2020.

Comments: EMNLP Findings 2020 (short paper)

arXiv:2007.06099 [pdf, ps, other]

Multiplicative Perturbation Bounds for Multivariate Multiple Linear Regression in Schatten $p$-Norms

Authors: Jocelyn T. Chi, Ilse C. F. Ipsen

Abstract: Multivariate multiple linear regression (MMLR), which occurs in a number of practical applications, generalizes traditional least squares (multivariate linear regression) to multiple right-hand sides. We extend recent MLR analyses to sketched MMLR in general Schatten $p$-norms by interpreting the sketched problem as a multiplicative perturbation. Our work represents an extension of Maher's results on Schatten $p$-norms. We derive expressions for the exact and perturbed solutions in terms of projectors for easy geometric interpretation. We also present a geometric interpretation of the action of the sketching matrix in terms of relevant subspaces. We show that a key term in assessing the accuracy of the sketched MMLR solution can be viewed as a tangent of a largest principal angle between subspaces under some assumptions. Our results enable additional interpretation of the difference between an orthogonal and oblique projector with the same range.

Submitted 12 July, 2020; originally announced July 2020.

arXiv:2007.03128 [pdf, other]

doi 10.1017/pasa.2020.39

Neutron Star Extreme Matter Observatory: A kilohertz-band gravitational-wave detector in the global network

Authors: K. Ackley, V. B. Adya, P. Agrawal, P. Altin, G. Ashton, M. Bailes, E. Baltinas, A. Barbuio, D. Beniwal, C. Blair, D. Blair, G. N. Bolingbroke, V. Bossilkov, S. Shachar Boublil, D. D. Brown, B. J. Burridge, J. Calderon Bustillo, J. Cameron, H. Tuong Cao, J. B. Carlin, S. Chang, P. Charlton, C. Chatterjee, D. Chattopadhyay, X. Chen , et al. (139 additional authors not shown)

Abstract: Gravitational waves from coalescing neutron stars encode information about nuclear matter at extreme densities, inaccessible by laboratory experiments. The late inspiral is influenced by the presence of tides, which depend on the neutron star equation of state. Neutron star mergers are expected to often produce rapidly-rotating remnant neutron stars that emit gravitational waves. These will provide clues to the extremely hot post-merger environment. This signature of nuclear matter in gravitational waves contains most information in the 2-4 kHz frequency band, which is outside of the most sensitive band of current detectors. We present the design concept and science case for a neutron star extreme matter observatory (NEMO): a gravitational-wave interferometer optimized to study nuclear physics with merging neutron stars. The concept uses high circulating laser power, quantum squeezing and a detector topology specifically designed to achieve the high-frequency sensitivity necessary to probe nuclear matter using gravitational waves. Above one kHz, the proposed strain sensitivity is comparable to full third-generation detectors at a fraction of the cost. Such sensitivity changes expected event rates for detection of post-merger remnants from approximately one per few decades with two A+ detectors to a few per year, and potentially allows for the first gravitational-wave observations of supernovae, isolated neutron stars, and other exotica.

Submitted 5 November, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Comments: Accepted for publication in PASA

Journal ref: PASA (2020) 37, e047

arXiv:1910.04523 [pdf, ps, other]

doi 10.1103/PhysRevC.100.034328

Systematic investigations of positive-parity doublet bands with three-quasiparticle configurations in $^{125,127,129,131}$Cs

Authors: Rui Guo, Wu-Ji Sun, Jian Li, Dong Yang, Yonghao Liu, Chengkun Ru, Jihuai Chi

Abstract: The experimental features of positive-parity doublet bands in the odd-$A$ cesium isotopes $^{125,127,129,131}$Cs, including angular momentum alignment, energy staggering, $B(M1)/B(E2)$, etc., are studied systematically and compared to those of the candidate chiral bands in the adjacent odd-odd Cs isotopes. The configuration assignments and the dynamics of these bands are discussed. Self-consistent tilted axis cranking relativistic mean-field calculations are performed with configurations reassigned to these bands. The experimental level schemes of the four nuclei are well reproduced, and the calculations also show that the four nuclei have obvious triaxial deformations and thus support the candidate chiral doublet bands in $^{125,127,129,131}$Cs.

Submitted 10 October, 2019; originally announced October 2019.

Comments: 18 pages, 10 figures

Journal ref: Physical Review C 100, 034328 (2019)

Showing 1–50 of 68 results for author: Chi, J