Showing 1–50 of 576 results for author: Liang, P

  1. arXiv:2506.08534  [pdf, ps, other]

    eess.IV cs.AI cs.CV

    DCD: A Semantic Segmentation Model for Fetal Ultrasound Four-Chamber View

    Authors: Donglian Li, Hui Guo, Minglang Chen, Huizhen Chen, Jialing Chen, Bocheng Liang, Pengchen Liang, Ying Tan

    Abstract: Accurate segmentation of anatomical structures in the apical four-chamber (A4C) view of fetal echocardiography is essential for early diagnosis and prenatal evaluation of congenital heart disease (CHD). However, precise segmentation remains challenging due to ultrasound artifacts, speckle noise, anatomical variability, and boundary ambiguity across different gestational stages. To reduce the workl…

    Submitted 10 June, 2025; originally announced June 2025.

  2. arXiv:2506.06958  [pdf, ps, other]

    cs.CY cs.AI cs.MA

    Position: Simulating Society Requires Simulating Thought

    Authors: Chance Jiajie Li, Jiayi Wu, Zhenze Mo, Ao Qu, Yuhan Tang, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Jinhua Zhao, Paul Liang, Luis Alonso, Kent Larson

    Abstract: Simulating society with large language models (LLMs), we argue, requires more than generating plausible behavior -- it demands cognitively grounded reasoning that is structured, revisable, and traceable. LLM-based agents are increasingly used to emulate individual and group behavior -- primarily through prompting and supervised fine-tuning. Yet they often lack internal coherence, causal reasoning,…

    Submitted 7 June, 2025; originally announced June 2025.

  3. arXiv:2506.06211  [pdf, other]

    cs.CL cs.AI cs.CV

    PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

    Authors: Hengzhi Li, Brendon Jiang, Alexander Naehu, Regan Song, Justin Zhang, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang

    Abstract: Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, o…

    Submitted 6 June, 2025; originally announced June 2025.

  4. arXiv:2506.03724  [pdf, ps, other]

    math.FA

    Uncertainty principles for free metaplectic transformation and associated metaplectic operators

    Authors: Ping Liang, Pei Dang, Weixiong Mai

    Abstract: In this paper, we systematically investigate the Heisenberg-Pauli-Weyl uncertainty principle for free metaplectic transformation, as well as metaplectic operators. Specifically, we obtain two different types of the uncertainty principle for free metaplectic transformations in terms of the so-called phase derivative, one of which can be generalized to the $L^p$-case with $1\le p\le 2$. The obtained…

    Submitted 4 June, 2025; originally announced June 2025.

    Comments: 42 pages

  5. arXiv:2506.02891  [pdf, ps, other]

    cs.CV

    OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis

    Authors: Jiewen Hu, Leena Mathur, Paul Pu Liang, Louis-Philippe Morency

    Abstract: In recent years, there has been increasing interest in automatic facial behavior analysis systems from computing communities such as vision, multimodal interaction, robotics, and affective computing. Building upon the widespread utility of prior open-source facial analysis systems, we introduce OpenFace 3.0, an open-source toolkit capable of facial landmark detection, facial action unit detection,…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: IEEE FG 2025, © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work

  6. arXiv:2506.02308  [pdf, ps, other]

    cs.LG cs.AI

    MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping

    Authors: Xiaojun Shan, Qi Cao, Xing Han, Haofei Yu, Paul Pu Liang

    Abstract: Recent advances in multimodal foundation models have achieved state-of-the-art performance across a range of tasks. These breakthroughs are largely driven by new pre-training paradigms that leverage large-scale, unlabeled multimodal data, followed by instruction fine-tuning on curated labeled datasets and high-quality prompts. While there is growing interest in scaling instruction fine-tuning to e…

    Submitted 6 June, 2025; v1 submitted 2 June, 2025; originally announced June 2025.

  7. arXiv:2506.01456  [pdf]

    q-bio.GN cs.AI cs.LG q-bio.NC

    GenDMR: A dynamic multimodal role-swapping network for identifying risk gene phenotypes

    Authors: Lina Qin, Cheng Zhu, Chuqi Zhou, Yukun Huang, Jiayi Zhu, Ping Liang, Jinju Wang, Yixing Huang, Cheng Luo, Dezhong Yao, Ying Tan

    Abstract: Recent studies have shown that integrating multimodal data fusion techniques for imaging and genetic features is beneficial for the etiological analysis and predictive diagnosis of Alzheimer's disease (AD). However, there are several critical flaws in current deep learning methods. Firstly, there has been insufficient discussion and exploration regarding the selection and encoding of genetic infor…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: 31 pages, 9 figures

  8. arXiv:2506.00711  [pdf, other]

    cs.LG cs.AI cs.CV

    QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

    Authors: Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang

    Abstract: Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med…

    Submitted 31 May, 2025; originally announced June 2025.

  9. arXiv:2506.00613  [pdf, ps, other]

    cs.RO cs.AI

    Evaluating Robot Policies in a World Model

    Authors: Julian Quevedo, Percy Liang, Sherry Yang

    Abstract: Robotics has broad applications from automating house chores to taking care of patients. However, evaluating robot control policies is challenging, as real-world testing is expensive, while handcrafted simulations often fail to accurately reflect real-world conditions, resulting in poor correlation between simulated evaluation and real-world outcomes. In this work, we investigate World-model-based…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: https://world-model-eval.github.io

  10. arXiv:2506.00239  [pdf, ps, other]

    cs.AI

    SMELLNET: A Large-scale Dataset for Real-world Smell Recognition

    Authors: Dewei Feng, Carol Li, Wei Dai, Paul Pu Liang

    Abstract: The ability of AI to sense and identify various substances based on their smell alone can have profound impacts on allergen detection (e.g., smelling gluten or peanuts in a cake), monitoring the manufacturing process, and sensing hormones that indicate emotional states, stress levels, and diseases. Despite these broad impacts, there are virtually no large scale benchmarks, and therefore little pro…

    Submitted 30 May, 2025; originally announced June 2025.

    Comments: 22 pages, 13 figures

  11. arXiv:2505.23802  [pdf, ps, other]

    cs.CL cs.AI

    MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

    Authors: Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, Akshay Swaminathan, Philip Chung, Fateme Nateghi, Asad Aali, Ashwin Nayak, Shivam Vedak, Sneha S. Jain, Birju Patel, Oluseyi Fayanju, Shreya Shah, et al. (56 additional authors not shown)

    Abstract: While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcatego…

    Submitted 2 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

  12. arXiv:2505.19430  [pdf, ps, other]

    cs.CL cs.AI

    Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation

    Authors: Keane Ong, Rui Mao, Deeksha Varshney, Paul Pu Liang, Erik Cambria, Gianmarco Mengaldo

    Abstract: Counterfactual reasoning typically involves considering alternatives to actual events. While often applied to understand past events, a distinct form, forward counterfactual reasoning, focuses on anticipating plausible future developments. This type of reasoning is invaluable in dynamic financial markets, where anticipating market developments can powerfully unveil potential risks and opportunities…

    Submitted 5 June, 2025; v1 submitted 25 May, 2025; originally announced May 2025.

  13. arXiv:2505.18880  [pdf, ps, other]

    cs.CV cs.AI

    REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

    Authors: Weihan Xu, Yimeng Ma, Jingyue Huang, Yang Li, Wenye Ma, Taylor Berg-Kirkpatrick, Julian McAuley, Paul Pu Liang, Hao-Wen Dong

    Abstract: Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot 'quote' from the input videos, i.e., insert short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a…

    Submitted 24 May, 2025; originally announced May 2025.

  14. arXiv:2505.15216  [pdf, ps, other]

    cs.CR cs.AI cs.CL cs.LG

    BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

    Authors: Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, et al. (9 additional authors not shown)

    Abstract: AI agents have the potential to significantly alter the cybersecurity landscape. To help us understand this change, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: 78 pages

  15. arXiv:2505.14462  [pdf, ps, other]

    cs.CV cs.CL

    RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

    Authors: Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie

    Abstract: As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its appli…

    Submitted 20 May, 2025; originally announced May 2025.

  16. arXiv:2505.12546  [pdf, other]

    cs.CL cs.CY cs.LG

    Extracting memorized pieces of (copyrighted) books from open-weight language models

    Authors: A. Feder Cooper, Aaron Gokaslan, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang

    Abstract: Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression. Drawing on adversarial ML and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we leverage a recen…

    Submitted 18 May, 2025; originally announced May 2025.

  17. arXiv:2505.07782  [pdf, ps, other]

    cs.LG

    MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

    Authors: Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai

    Abstract: We introduce MLE-Dojo, a Gym-style framework for systematically training with reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experimen…

    Submitted 12 May, 2025; originally announced May 2025.

  18. arXiv:2505.03901  [pdf, other]

    cs.SE

    Unveiling the Role of ChatGPT in Software Development: Insights from Developer-ChatGPT Interactions on GitHub

    Authors: Ruiyin Li, Peng Liang, Yifei Wang, Yangxiao Cai, Weisong Sun, Zengyang Li

    Abstract: The advent of Large Language Models (LLMs) has introduced a new paradigm in software engineering, with generative AI tools like ChatGPT gaining widespread adoption among developers. While ChatGPT's potential has been extensively discussed, there is limited empirical evidence exploring its real-world usage by developers. This study bridges this gap by conducting a large-scale empirical analysis of…

    Submitted 6 May, 2025; originally announced May 2025.

    Comments: 25 pages, 10 images, 2 tables, Manuscript submitted to a journal (2025)

  19. arXiv:2505.03007  [pdf, other]

    cs.CV

    NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results

    Authors: Nikolay Safonov, Alexey Bryncev, Andrey Moskalenko, Dmitry Kulikov, Dmitry Vatolin, Radu Timofte, Haibo Lei, Qifan Gao, Qing Luo, Yaqing Li, Jie Song, Shaozhe Hao, Meisong Zheng, Jingyi Xu, Chengbin Wu, Jiahui Liu, Ying Chen, Xin Deng, Mai Xu, Peipei Liang, Jie Ma, Junjie Jin, Yingxue Pang, Fangzhou Luo, Kai Chen, et al. (6 additional authors not shown)

    Abstract: This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such vid…

    Submitted 5 May, 2025; originally announced May 2025.

  20. arXiv:2505.02351  [pdf]

    cs.DC

    Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques

    Authors: Jie Kong, Junxiang Zhang, Jiheng Xu, Yalong Li, Shouhua Zhang, Jiehan Zhou, Yuhai Liu, Peng Liang, Quan Zhang, Luohan Jiang

    Abstract: In the field of deep learning, traditional attention mechanisms face significant challenges related to high computational complexity and large memory consumption when processing long sequence data. To address these limitations, we propose Opt-GPTQ, an optimized Gradient-based Post Training Quantization (GPTQ) combining the Grouped Query Attention (GQA) mechanism with paging memory management, opti…

    Submitted 5 May, 2025; originally announced May 2025.

  21. arXiv:2505.01113  [pdf, other]

    cs.RO cs.CV cs.NE

    NeuroLoc: Encoding Navigation Cells for 6-DOF Camera Localization

    Authors: Xun Li, Jian Yang, Fenli Jia, Muyu Wang, Qi Wu, Jun Wu, Jinpeng Mi, Jilin Hu, Peidong Liang, Xuan Tang, Ke Li, Xiong You, Xian Wei

    Abstract: Recently, camera localization has been widely adopted in autonomous robotic navigation due to its efficiency and convenience. However, autonomous navigation in unknown environments often suffers from scene ambiguity, environmental disturbances, and dynamic object transformation in camera localization. To address this problem, inspired by the biological brain navigation mechanism (such as grid cell…

    Submitted 2 May, 2025; originally announced May 2025.

  22. arXiv:2504.20781  [pdf, other]

    cs.SE cs.AI

    Using LLMs in Generating Design Rationale for Software Architecture Decisions

    Authors: Xiyu Zhou, Ruiyin Li, Peng Liang, Beiqi Zhang, Mojtaba Shahin, Zengyang Li, Chen Yang

    Abstract: Design Rationale (DR) for software architecture decisions refers to the reasoning underlying architectural choices, which provides valuable insights into the different phases of the architecting process throughout software development. However, in practice, DR is often inadequately documented due to a lack of motivation and effort from developers. With the recent advancements in Large Language Mod…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: 28 pages, 5 images, 7 tables, Manuscript submitted to a journal (2025)

  23. arXiv:2504.16485  [pdf, other]

    cs.SE cs.AI

    On Developers' Self-Declaration of AI-Generated Code: An Analysis of Practices

    Authors: Syed Mohammad Kashif, Peng Liang, Amjed Tahir

    Abstract: AI code generation tools have gained significant popularity among developers, who use them to assist in software development due to their capability to generate code. Existing studies mainly explored the quality, e.g., correctness and security, of AI-generated code, while in real-world software development, the prerequisite is to distinguish AI-generated code from human-written code, which emphasi…

    Submitted 23 April, 2025; originally announced April 2025.

    Comments: 35 pages, 17 images, 8 tables, Manuscript submitted to a journal (2025)

  24. arXiv:2504.13392  [pdf, ps, other]

    cs.CV cs.HC

    POET: Supporting Prompting Creativity and Personalization with Automated Expansion of Text-to-Image Generation

    Authors: Evans Xu Han, Alice Qian Zhang, Hong Shen, Haiyi Zhu, Paul Pu Liang, Jane Hsieh

    Abstract: State-of-the-art visual generative AI tools hold immense potential to assist users in the early ideation stages of creative tasks -- offering the ability to generate (rather than search for) novel and unprecedented (instead of existing) images of considerable quality that also adhere to boundless combinations of user specifications. However, many large-scale text-to-image systems are designed for…

    Submitted 17 April, 2025; originally announced April 2025.

  25. arXiv:2504.09897  [pdf, other]

    cs.CV

    TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models

    Authors: Jaewoo Lee, Keyang Xuan, Chanakya Ekbote, Sandeep Polisetty, Yi R. Fung, Paul Pu Liang

    Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. However, these capabilities come with an increased model scale. While post-training pruning reduces model size in unimodal models, its application to MLLMs often yields limited success. Our analysis discovers that conventional methods fail to account for the unique token a…

    Submitted 17 May, 2025; v1 submitted 14 April, 2025; originally announced April 2025.

    Comments: ACL Findings 2025

  26. arXiv:2504.07316  [pdf, other]

    cs.CL

    Alice: Proactive Learning with Teacher's Demonstrations for Weak-to-Strong Generalization

    Authors: Shujin Wu, Cheng Qian, Yi R. Fung, Paul Pu Liang, Heng Ji

    Abstract: The growing capabilities of large language models (LLMs) present a key challenge of maintaining effective human oversight. Weak-to-strong generalization (W2SG) offers a promising framework for supervising increasingly capable LLMs using weaker ones. Traditional W2SG methods rely on passive learning, where a weak teacher provides noisy demonstrations to train a strong student. This hinders students…

    Submitted 11 April, 2025; v1 submitted 9 April, 2025; originally announced April 2025.

  27. arXiv:2503.24270  [pdf, other]

    cs.CV cs.AI

    Visual Acoustic Fields

    Authors: Yuelei Li, Hyunjin Kim, Fangneng Zhan, Ri-Zhao Qiu, Mazeyu Ji, Xiaojun Shan, Xueyan Zou, Paul Liang, Hanspeter Pfister, Xiaolong Wang

    Abstract: Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localiz…

    Submitted 31 March, 2025; v1 submitted 31 March, 2025; originally announced March 2025.

  28. arXiv:2503.20536  [pdf, other]

    cs.SE

    Knowledge-Based Multi-Agent Framework for Automated Software Architecture Design

    Authors: Yiran Zhang, Ruiyin Li, Peng Liang, Weisong Sun, Yang Liu

    Abstract: Architecture design is a critical step in software development. However, creating a high-quality architecture is often costly due to the significant need for human expertise and manual effort. Recently, agents built upon Large Language Models (LLMs) have achieved remarkable success in various software engineering tasks. Despite this progress, the use of agents to automate the architecture design p…

    Submitted 26 March, 2025; originally announced March 2025.

  29. arXiv:2503.17514  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Language Models May Verbatim Complete Text They Were Not Explicitly Trained On

    Authors: Ken Ziyu Liu, Christopher A. Choquette-Choo, Matthew Jagielski, Peter Kairouz, Sanmi Koyejo, Percy Liang, Nicolas Papernot

    Abstract: An important question today is whether a given text was used to train a large language model (LLM). A "completion" test is often employed: check if the LLM completes a sufficiently complex text. This, however, requires a ground-truth definition of membership; most commonly, it is defined as a member based on the $n$-gram overlap between the target text and any text in the dataset. In this wor…

    Submitted 25 March, 2025; v1 submitted 21 March, 2025; originally announced March 2025.

    Comments: Main text: 9 pages, 7 figures, 1 table. Appendix: 29 pages, 20 tables, 15 figures

  30. arXiv:2503.16861  [pdf, other]

    cs.AI

    In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI

    Authors: Shayne Longpre, Kevin Klyman, Ruth E. Appel, Sayash Kapoor, Rishi Bommasani, Michelle Sahar, Sean McGregor, Avijit Ghosh, Borhane Blili-Hamelin, Nathan Butters, Alondra Nelson, Amit Elazari, Andrew Sellars, Casey John Ellis, Dane Sherrets, Dawn Song, Harley Geiger, Ilona Cohen, Lauren McIlvenny, Madhulika Srikumar, Mark M. Jaycox, Markus Anderljung, Nadine Farid Johnson, Nicholas Carlini, Nicolas Miailhe, et al. (9 additional authors not shown)

    Abstract: The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and…

    Submitted 25 March, 2025; v1 submitted 21 March, 2025; originally announced March 2025.

  31. arXiv:2503.16434  [pdf, other]

    cs.HC cs.AI

    Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving

    Authors: Steven-Shine Chen, Jimin Lee, Paul Pu Liang

    Abstract: Humans have long relied on visual aids like sketches and diagrams to support reasoning and problem-solving. Visual tools, like auxiliary lines in geometry or graphs in calculus, are essential for understanding complex ideas. However, many tutoring systems remain text-based, providing feedback only through natural language. Leveraging recent advances in Large Multimodal Models (LMMs), this paper in…

    Submitted 1 April, 2025; v1 submitted 11 February, 2025; originally announced March 2025.

    Comments: To be published in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA 25)

  32. arXiv:2503.13335  [pdf, other]

    cs.CL cs.AI cs.LG stat.AP

    Reliable and Efficient Amortized Model-based Evaluation

    Authors: Sang Truong, Yuheng Tu, Percy Liang, Bo Li, Sanmi Koyejo

    Abstract: Comprehensive evaluations of language models (LM) during both development and deployment phases are necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnostic) as well as safety risks (e.g., racial bias, toxicity, or misinformation). The average score across a wide range of benchmarks provides a signal that helps guide the use o…

    Submitted 17 March, 2025; originally announced March 2025.

  33. arXiv:2503.13205  [pdf, other]

    cs.AI cs.CL cs.CV cs.HC cs.MA

    MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways

    Authors: Zhen Chen, Zhihao Peng, Xusheng Liang, Cheng Wang, Peigan Liang, Linsheng Zeng, Minjie Ju, Yixuan Yuan

    Abstract: Inpatient pathways demand complex clinical decision-making based on comprehensive patient information, posing critical challenges for clinicians. Despite advancements in large language models (LLMs) in medical applications, limited research has focused on artificial intelligence (AI) inpatient pathway systems, due to the lack of large-scale inpatient datasets. Moreover, existing medical benchmarks ty…

    Submitted 17 March, 2025; originally announced March 2025.

  34. arXiv:2503.09639  [pdf, other]

    cs.MA cs.AI cs.CL cs.CY cs.HC

    Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy? A Case Study on Vaccine Hesitancy

    Authors: Abe Bohan Hou, Hongru Du, Yichen Wang, Jingyu Zhang, Zixiao Wang, Paul Pu Liang, Daniel Khashabi, Lauren Gardner, Tianxing He

    Abstract: Can we simulate a sandbox society with generative agents to model human behavior, thereby reducing the over-reliance on real human trials for assessing public policies? In this work, we investigate the feasibility of simulating health-related decision-making, using vaccine hesitancy, defined as the delay in acceptance or refusal of vaccines despite the availability of vaccination services (MacDona…

    Submitted 2 April, 2025; v1 submitted 11 March, 2025; originally announced March 2025.

  35. arXiv:2503.08330  [pdf, other]

    cs.RO

    KiteRunner: Language-Driven Cooperative Local-Global Navigation Policy with UAV Mapping in Outdoor Environments

    Authors: Shibo Huang, Chenfan Shi, Jian Yang, Hanlin Dong, Jinpeng Mi, Ke Li, Jianfeng Zhang, Miao Ding, Peidong Liang, Xiong You, Xian Wei

    Abstract: Autonomous navigation in open-world outdoor environments faces challenges in integrating dynamic conditions, long-distance spatial reasoning, and semantic understanding. Traditional methods struggle to balance local planning, global planning, and semantic task execution, while existing large language models (LLMs) enhance semantic comprehension but lack spatial reasoning capabilities. Although dif…

    Submitted 11 March, 2025; originally announced March 2025.

  36. arXiv:2503.07667  [pdf, other]

    cs.LG cs.AI cs.CV eess.SP

    CLIMB: Data Foundations for Large Scale Multimodal Clinical Foundation Models

    Authors: Wei Dai, Peilin Chen, Malinda Lu, Daniel Li, Haowen Wei, Hejie Cui, Paul Pu Liang

    Abstract: Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multi…

    Submitted 20 March, 2025; v1 submitted 8 March, 2025; originally announced March 2025.

  37. arXiv:2503.06976  [pdf, other]

    cs.CV

    Task-Specific Knowledge Distillation from the Vision Foundation Model for Enhanced Medical Image Segmentation

    Authors: Pengchen Liang, Haishan Huang, Bin Pu, Jianguo Chen, Xiang Hua, Jing Zhang, Weibo Ma, Zhuangzhuang Chen, Yiwei Li, Qing Chang

    Abstract: Large-scale pre-trained models, such as Vision Foundation Models (VFMs), have demonstrated impressive performance across various downstream tasks by transferring generalized knowledge, especially when target data is limited. However, their high computational cost and the domain gap between natural and medical images limit their practical application in medical segmentation tasks. Motivated by this…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: 29 pages, 10 figures, 16 tables

  38. Fits like a Flex-Glove: Automatic Design of Personalized FPCB-Based Tactile Sensing Gloves

    Authors: Devin Murphy, Yichen Li, Crystal Owens, Layla Stanton, Young Joong Lee, Paul Pu Liang, Yiyue Luo, Antonio Torralba, Wojciech Matusik

    Abstract: Resistive tactile sensing gloves have captured the interest of researchers spanning diverse domains, such as robotics, healthcare, and human-computer interaction. However, existing fabrication methods often require labor-intensive assembly or costly equipment, limiting accessibility. Leveraging flexible printed circuit board (FPCB) technology, we present an automated pipeline for generating resist…

    Submitted 8 March, 2025; originally announced March 2025.

    Comments: 8 pages, 6 figures, to be published in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25)

  39. arXiv:2503.05777  [pdf, other]

    cs.CL cs.AI cs.CY

    Medical Hallucinations in Foundation Models and Their Impact on Healthcare

    Authors: Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai Xu, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Hae Won Park, Samir Tulebaev, Cynthia Breazeal

    Abstract: Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examine…

    Submitted 25 February, 2025; originally announced March 2025.

  40. arXiv:2503.05731  [pdf, other]

    cs.CY cs.AI

    AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons

    Authors: Shaona Ghosh, Heather Frase, Adina Williams, Sarah Luger, Paul Röttger, Fazl Barez, Sean McGregor, Kenneth Fricklas, Mala Kumar, Quentin Feuillade--Montixi, Kurt Bollacker, Felix Friedrich, Ryan Tsang, Bertie Vidgen, Alicia Parrish, Chris Knotz, Eleonora Presani, Jonathan Bennion, Marisa Ferrara Boston, Mike Kuniavsky, Wiebke Hutiri, James Ezick, Malek Ben Salem, Rajat Sahay, Sujata Goswami, et al. (77 additional authors not shown)

    Abstract: The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance…

    Submitted 18 April, 2025; v1 submitted 19 February, 2025; originally announced March 2025.

    Comments: 51 pages, 8 figures and an appendix

  41. arXiv:2503.02321  [pdf, ps, other]

    eess.IV cs.CV

    Rapid Bone Scintigraphy Enhancement via Semantic Prior Distillation from Segment Anything Model

    Authors: Pengchen Liang, Leijun Shi, Huiping Yao, Bin Pu, Jianguo Chen, Lei Zhao, Haishan Huang, Zhuangzhuang Chen, Zhaozhao Xu, Lite Xu, Qing Chang, Yiwei Li

    Abstract: Rapid bone scintigraphy is crucial for diagnosing skeletal disorders and detecting tumor metastases in children, as it shortens scan duration and reduces discomfort. However, accelerated acquisition often degrades image quality, impairing the visibility of fine anatomical details and potentially compromising diagnosis. To overcome this limitation, we introduce the first application of SAM-based se…

    Submitted 4 June, 2025; v1 submitted 4 March, 2025; originally announced March 2025.

    Comments: 12 pages, 9 figures, 8 tables

  42. arXiv:2502.19687  [pdf, other]

    cs.SE cs.CR

    Unveiling Security Weaknesses in Autonomous Driving Systems: An In-Depth Empirical Study

    Authors: Wenyuan Cheng, Zengyang Li, Peng Liang, Ran Mo, Hui Liu

    Abstract: The advent of Autonomous Driving Systems (ADS) has marked a significant shift towards intelligent transportation, with implications for public safety and traffic efficiency. While these systems integrate a variety of technologies and offer numerous benefits, their security is paramount, as vulnerabilities can have severe consequences for safety and trust. This study aims to systematically investig…

    Submitted 26 February, 2025; originally announced February 2025.

    Comments: Preprint accepted for publication in Information and Software Technology, 2025

  43. arXiv:2502.19645  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Authors: Moo Jin Kim, Chelsea Finn, Percy Liang

    Abstract: Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many…

    Submitted 28 April, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

    Comments: Accepted to Robotics: Science and Systems (RSS) 2025. Project website: https://openvla-oft.github.io/

  44. arXiv:2502.19412  [pdf, other]

    cs.CL

    The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

    Authors: Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer

    Abstract: Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of…

    Submitted 2 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  45. arXiv:2502.18707  [pdf, other]

    quant-ph

    Purified pseudomode model for nonlinear system-bath interactions

    Authors: Cheng Zhang, Neil Lambert, Xin-Qi Li, Mauro Cirio, Pengfei Liang

    Abstract: The theory of purified pseudomodes [arXiv:2412.04264 (2024)] was recently developed to provide a numerical tool for the analysis of the properties of a quantum system and the environment it couples to via linear system-bath interactions. Here we extend this theory to allow for the description of general nonlinear system-bath interactions. We demonstrate the validity of our method by considering th…

    Submitted 25 February, 2025; originally announced February 2025.

    Comments: 13 pages, 2 figures

  46. arXiv:2502.17955  [pdf, other]

    cs.CL cs.AI

    Language Models' Factuality Depends on the Language of Inquiry

    Authors: Tushar Aggarwal, Kumar Tanmay, Ayush Agrawal, Kumar Ayush, Hamid Palangi, Paul Pu Liang

    Abstract: Multilingual language models (LMs) are expected to recall factual knowledge consistently across languages, yet they often fail to transfer knowledge between languages even when they possess the correct information in one of the languages. For example, we find that an LM may correctly identify Rashed Al Shashai as being from Saudi Arabia when asked in Arabic, but consistently fails to do so when as…

    Submitted 25 February, 2025; originally announced February 2025.

  47. arXiv:2502.16671  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

    Authors: Hengzhi Li, Megan Tjandrasuwita, Yi R. Fung, Armando Solar-Lezama, Paul Pu Liang

    Abstract: As AI becomes more closely integrated with people's daily activities, socially intelligent AI that can understand and interact seamlessly with humans in daily lives is increasingly important. However, current works in AI social reasoning all rely on language-only or language-dominant approaches to benchmark and training models, resulting in systems that are improving in verbal communication but st…

    Submitted 6 June, 2025; v1 submitted 23 February, 2025; originally announced February 2025.

  48. arXiv:2502.16282  [pdf, other]

    cs.LG cs.AI

    Understanding the Emergence of Multimodal Representation Alignment

    Authors: Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, Paul Pu Liang

    Abstract: Multimodal representation learning is fundamentally about transforming incomparable modalities into comparable representations. While prior research primarily focused on explicitly aligning these representations through targeted learning objectives and model architectures, a recent line of work has found that independently trained unimodal models of increasing scale and performance can become impl…

    Submitted 22 February, 2025; originally announced February 2025.

    Comments: 21 pages, 22 figures, 3 tables

  49. arXiv:2502.15109  [pdf, ps, other]

    cs.CL cs.LG

    Social Genome: Grounded Social Reasoning Abilities of Multimodal Models

    Authors: Leena Mathur, Marian Qian, Paul Pu Liang, Louis-Philippe Morency

    Abstract: Social reasoning abilities are crucial for AI systems to effectively interpret and respond to multimodal human communication and interaction within social contexts. We introduce SOCIAL GENOME, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models. SOCIAL GENOME contains 272 videos of interactions and 1,486 human-annotated reasoning traces related to inferen…

    Submitted 3 June, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

    Comments: Under Review, 24 pages

  50. arXiv:2502.14888  [pdf, other]

    cs.CV cs.AI

    Multi-Faceted Multimodal Monosemanticity

    Authors: Hanqi Yan, Xiangxiang Cui, Lu Yin, Paul Pu Liang, Yulan He, Yifei Wang

    Abstract: Humans experience the world through multiple modalities, such as vision, language, and speech, making it natural to explore the commonality and distinctions among them. In this work, we take a data-driven approach to address this question by analyzing interpretable, monosemantic features extracted from deep multimodal models. Specifically, we investigate CLIP, a prominent visual-language represen…

    Submitted 23 May, 2025; v1 submitted 16 February, 2025; originally announced February 2025.