Showing 1–50 of 450 results for author: Cai, Z

  1. arXiv:2505.10250  [pdf, other]

    cs.CV

    ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

    Authors: Wenhao Shen, Wanqi Yin, Xiaofeng Yang, Cheng Chen, Chaoyue Song, Zhongang Cai, Lei Yang, Hao Wang, Guosheng Lin

    Abstract: Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns a Diffusion…

    Submitted 15 May, 2025; originally announced May 2025.

    Comments: Accepted by ICML 2025. Code: https://github.com/shenwenhao01/ADHMR

  2. arXiv:2505.07062  [pdf, ps, other]

    cs.CV cs.AI

    Seed1.5-VL Technical Report

    Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng , et al. (172 additional authors not shown)

    Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed with a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…

    Submitted 11 May, 2025; originally announced May 2025.

  3. arXiv:2505.03364  [pdf, other]

    cs.HC

    DroidRetriever: An Autonomous Navigation and Information Integration System Facilitating Mobile Sensemaking

    Authors: Yiheng Bian, Yunpeng Song, Guiyu Ma, Rongrong Zhu, Zhongmin Cai

    Abstract: Users regularly rely on mobile applications for their daily information needs, and mobile sensemaking is prevalent in various domains such as education, healthcare, business intelligence, and emergency response, where timely and context-aware information-processing and decision-making is critical. However, valuable information is often scattered across the closed ecosystems within various applicat…

    Submitted 6 May, 2025; originally announced May 2025.

  4. arXiv:2505.00358  [pdf, other]

    cs.LG cs.AI cs.CL

    R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

    Authors: Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala

    Abstract: Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohib…

    Submitted 1 May, 2025; originally announced May 2025.

  5. arXiv:2505.00021  [pdf, other]

    cs.CL cs.AI

    Ustnlp16 at SemEval-2025 Task 9: Improving Model Performance through Imbalance Handling and Focal Loss

    Authors: Zhuoang Cai, Zhenghao Li, Yang Liu, Liyuan Guo, Yangqiu Song

    Abstract: Classification tasks often suffer from imbalanced data distribution, which presents challenges in food hazard detection due to severe class imbalances, short and unstructured text, and overlapping semantic categories. In this paper, we present our system for SemEval-2025 Task 9: Food Hazard Detection, which addresses these issues by applying data augmentation techniques to improve classif…

    Submitted 24 April, 2025; originally announced May 2025.

  6. arXiv:2504.20965  [pdf, other]

    cs.LG

    AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security

    Authors: Zikui Cai, Shayan Shabihi, Bang An, Zora Che, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein, Furong Huang

    Abstract: We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage. In AegisLLM, a structured workflow of autonomous agents - orchestrator, deflector, responder, and evaluator - collaborate to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization. We show that scaling agentic reasoning system at test-time - both…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: ICLR 2025 Workshop BuildingTrust

  7. AlphaFuse: Learn ID Embeddings for Sequential Recommendation in Null Space of Language Embeddings

    Authors: Guoqing Hu, An Zhang, Shuo Liu, Zhibo Cai, Xun Yang, Xiang Wang

    Abstract: Recent advancements in sequential recommendation have underscored the potential of Large Language Models (LLMs) for enhancing item embeddings. However, existing approaches face three key limitations: 1) the degradation of the semantic space when high-dimensional language embeddings are mapped to lower-dimensional ID embeddings, 2) the underutilization of language embeddings, and 3) the reliance on…

    Submitted 29 April, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

    Comments: Accepted by SIGIR'25

  8. arXiv:2504.17834  [pdf, other]

    cs.IR cs.CL

    Unveiling the Hidden: Movie Genre and User Bias in Spoiler Detection

    Authors: Haokai Zhang, Shengtao Zhang, Zijian Cai, Heng Wang, Ruixuan Zhu, Zinan Zeng, Minnan Luo

    Abstract: Spoilers in movie reviews are important on platforms like IMDb and Rotten Tomatoes, offering benefits and drawbacks. They can guide some viewers' choices but also affect those who prefer no plot details in advance, making effective spoiler detection essential. Existing spoiler detection methods mainly analyze review text, often overlooking the impact of movie genres and user bias, limiting their e…

    Submitted 28 April, 2025; v1 submitted 24 April, 2025; originally announced April 2025.

    Comments: 11 pages, 6 figures, under review

  9. arXiv:2504.17784  [pdf, other]

    cs.RO

    Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation

    Authors: Yuyin Yang, Zetao Cai, Yang Tian, Jia Zeng, Jiangmiao Pang

    Abstract: Bimanual manipulation is a challenging yet crucial robotic capability, demanding precise spatial localization and versatile motion trajectories, which pose significant challenges to existing approaches. Existing approaches fall into two categories: keyframe-based strategies, which predict gripper poses in keyframes and execute them via motion planners, and continuous control methods, which estimat…

    Submitted 24 April, 2025; originally announced April 2025.

    Comments: Published at Robotics: Science and Systems (RSS) 2025

  10. arXiv:2504.17353  [pdf, other]

    cs.CL cs.CV cs.MM

    M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction

    Authors: Chengguang Gan, Sunbowen Lee, Zhixi Cai, Yanbin Wei, Lei Zheng, Yunhao Liang, Shiwen Ni, Tatsunori Mori

    Abstract: Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability t…

    Submitted 24 April, 2025; originally announced April 2025.

  11. arXiv:2504.16074  [pdf, other]

    cs.CL

    PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

    Authors: Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Weike Wang , et al. (27 additional authors not shown)

    Abstract: We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, t…

    Submitted 22 April, 2025; originally announced April 2025.

    Comments: 21 pages, 8 figures, 4 tables

  12. arXiv:2504.13426  [pdf, other]

    cs.LG

    Simplifying Graph Convolutional Networks with Redundancy-Free Neighbors

    Authors: Jielong Lu, Zhihao Wu, Zhiling Cai, Yueyang Pi, Shiping Wang

    Abstract: In recent years, Graph Convolutional Networks (GCNs) have gained popularity for their exceptional ability to process graph-structured data. Existing GCN-based approaches typically employ a shallow model architecture due to the over-smoothing phenomenon. Current approaches to mitigating over-smoothing primarily involve adding supplementary components to GCN architectures, such as residual connectio…

    Submitted 21 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

  13. arXiv:2504.03137  [pdf, other]

    cs.AI cs.CL

    LightPROF: A Lightweight Reasoning Framework for Large Language Model on Knowledge Graph

    Authors: Tu Ao, Yanhua Yu, Yuling Wang, Yang Deng, Zirui Guo, Liang Pang, Pinghui Wang, Tat-Seng Chua, Xiao Zhang, Zhen Cai

    Abstract: Large Language Models (LLMs) have impressive capabilities in text understanding and zero-shot reasoning. However, delays in knowledge updates may cause them to reason incorrectly or produce harmful results. Knowledge Graphs (KGs) provide rich and reliable contextual information for the reasoning process of LLMs by structurally organizing and connecting a wide range of entities and relations. Exist…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: This paper has been accepted by AAAI 2025

  14. arXiv:2504.00034  [pdf, other]

    quant-ph cs.LG

    Quantum Generative Models for Image Generation: Insights from MNIST and MedMNIST

    Authors: Chi-Sheng Chen, Wei An Hou, Hsiang-Wei Hu, Zhen-Sheng Cai

    Abstract: Quantum generative models offer a promising new direction in machine learning by leveraging quantum circuits to enhance data generation capabilities. In this study, we propose a hybrid quantum-classical image generation framework that integrates variational quantum circuits into a diffusion-based model. To improve training dynamics and generation quality, we introduce two novel noise strategies: i…

    Submitted 3 April, 2025; v1 submitted 30 March, 2025; originally announced April 2025.

  15. arXiv:2503.23793  [pdf, other]

    cs.CV

    Pan-LUT: Efficient Pan-sharpening via Learnable Look-Up Tables

    Authors: Zhongnan Cai, Yingying Wang, Yunlong Lin, Hui Zheng, Ge Meng, Zixu Lin, Jiaxin Xie, Junbin Lu, Yue Huang, Xinghao Ding

    Abstract: Recently, deep learning-based pan-sharpening algorithms have achieved notable advancements over traditional methods. However, many deep learning-based approaches incur substantial computational overhead during inference, especially with high-resolution images. This excessive computational demand limits the applicability of these methods in real-world scenarios, particularly in the absence of dedic…

    Submitted 31 March, 2025; originally announced March 2025.

    Comments: 12 pages, 6 figures

  16. arXiv:2503.23356  [pdf, other]

    cs.CV

    ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts

    Authors: Linfeng Tang, Yeda Wang, Zhanchuan Cai, Junjun Jiang, Jiayi Ma

    Abstract: Current image fusion methods struggle to address the composite degradations encountered in real-world imaging scenarios and lack the flexibility to accommodate user-specific requirements. In response to these challenges, we propose a controllable image fusion framework with language-vision prompts, termed ControlFusion, which adaptively neutralizes composite degradations. On the one hand, we devel…

    Submitted 30 March, 2025; originally announced March 2025.

  17. GazeSwipe: Enhancing Mobile Touchscreen Reachability through Seamless Gaze and Finger-Swipe Integration

    Authors: Zhuojiang Cai, Jingkai Hong, Zhimin Wang, Feng Lu

    Abstract: Smartphones with large screens provide users with increased display and interaction space but pose challenges in reaching certain areas with the thumb when using the device with one hand. To address this, we introduce GazeSwipe, a multimodal interaction technique that combines eye gaze with finger-swipe gestures, enabling intuitive and low-friction reach on mobile touchscreens. Specifically, we de…

    Submitted 26 March, 2025; originally announced March 2025.

  18. arXiv:2503.19586  [pdf, other]

    cs.CL q-bio.NC

    Distinct social-linguistic processing between humans and large audio-language models: Evidence from model-brain alignment

    Authors: Hanlin Wu, Xufeng Duan, Zhenguang Cai

    Abstract: Voice-based AI development faces unique challenges in processing both linguistic and paralinguistic information. This study compares how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension, asking whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms. We compared two LALMs' (Qwen2-Audio and Ul…

    Submitted 25 March, 2025; originally announced March 2025.

    Comments: Accepted by the 14th edition of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2025)

  19. arXiv:2503.19263  [pdf, other]

    cs.CV

    DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

    Authors: Fucai Ke, Vijay Kumar B G, Xingjian Leng, Zhixi Cai, Zaid Khan, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi, Manmohan Chandraker

    Abstract: Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face…

    Submitted 24 March, 2025; originally announced March 2025.

  20. arXiv:2503.10624  [pdf, other]

    cs.CV cs.AI cs.GR

    ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

    Authors: Boqian Li, Haiwen Feng, Zeyu Cai, Michael J. Black, Yuliang Xiu

    Abstract: Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that est…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: Page: https://boqian-li.github.io/ETCH/, Code: https://github.com/boqian-li/ETCH

  21. arXiv:2503.09929  [pdf, ps, other]

    cs.CV

    Emotion Recognition with CLIP and Sequential Learning

    Authors: Weiwei Zhou, Chenkun Ling, Zefeng Cai

    Abstract: Human emotion recognition plays a crucial role in facilitating seamless interactions between humans and computers. In this paper, we present our innovative methodology for tackling the Valence-Arousal (VA) Estimation Challenge, the Expression Recognition Challenge, and the Action Unit (AU) Detection Challenge, all within the framework of the 8th Workshop and Competition on Affective Behavior Analy…

    Submitted 12 March, 2025; originally announced March 2025.

  22. arXiv:2503.09376  [pdf, other]

    cs.RO eess.SY

    Robust Self-Reconfiguration for Fault-Tolerant Control of Modular Aerial Robot Systems

    Authors: Rui Huang, Siyu Tang, Zhiqian Cai, Lin Zhao

    Abstract: Modular Aerial Robotic Systems (MARS) consist of multiple drone units assembled into a single, integrated rigid flying platform. With inherent redundancy, MARS can self-reconfigure into different configurations to mitigate rotor or unit failures and maintain stable flight. However, existing works on MARS self-reconfiguration often overlook the practical controllability of intermediate structures f…

    Submitted 12 March, 2025; originally announced March 2025.

  23. arXiv:2503.09351  [pdf, other]

    cs.RO eess.SY

    Robust Fault-Tolerant Control and Agile Trajectory Planning for Modular Aerial Robotic Systems

    Authors: Rui Huang, Zhenyu Zhang, Siyu Tang, Zhiqian Cai, Lin Zhao

    Abstract: Modular Aerial Robotic Systems (MARS) consist of multiple drone units that can self-reconfigure to adapt to various mission requirements and fault conditions. However, existing fault-tolerant control methods exhibit significant oscillations during docking and separation, impacting system stability. To address this issue, we propose a novel fault-tolerant control reallocation method that adapts to…

    Submitted 12 March, 2025; originally announced March 2025.

  24. arXiv:2503.09326  [pdf, other]

    cs.CL cs.AI

    A Survey on Enhancing Causal Reasoning Ability of Large Language Models

    Authors: Xin Li, Zhuo Cai, Shoujin Wang, Kun Yu, Fang Chen

    Abstract: Large language models (LLMs) have recently shown remarkable performance in language tasks and beyond. However, due to their limited inherent causal reasoning ability, LLMs still face challenges in handling tasks that require robust causal reasoning ability, such as health-care and economic analysis. As a result, a growing body of research has focused on enhancing the causal reasoning ability of LL…

    Submitted 12 March, 2025; originally announced March 2025.

  25. arXiv:2503.08576  [pdf, other]

    cs.CV

    RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding

    Authors: Xichen Tan, Yunfan Ye, Yuanjing Luo, Qian Wan, Fang Liu, Zhiping Cai

    Abstract: Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To effectively assess their video comprehension capabilities, long video understanding benchmarks, such as Video-MME and MLVU, are proposed. However, these benchmarks directly use uniform frame sampling for testing, which results in significant information loss and affects the accuracy of the evaluation…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: 37 pages, 36 figures

  26. arXiv:2503.07298  [pdf, other]

    cs.CV

    ALLVB: All-in-One Long Video Understanding Benchmark

    Authors: Xichen Tan, Yuanjing Luo, Yunfan Ye, Fang Liu, Zhiping Cai

    Abstract: From image to video understanding, the capabilities of Multi-modal LLMs (MLLMs) are increasingly powerful. However, most existing video understanding benchmarks are relatively short, which makes them inadequate for effectively evaluating the long-sequence modeling capabilities of MLLMs. This highlights the urgent need for a comprehensive and integrated long video understanding benchmark to assess…

    Submitted 1 April, 2025; v1 submitted 10 March, 2025; originally announced March 2025.

    Comments: AAAI 2025

  27. arXiv:2503.07056  [pdf]

    cs.LG cs.AI

    Generative method for aerodynamic optimization based on classifier-free guided denoising diffusion probabilistic model

    Authors: Shisong Deng, Qiang Zhang, Zhengyang Cai

    Abstract: Inverse design approach, which directly generates optimal aerodynamic shape with neural network models to meet designated performance targets, has drawn enormous attention. However, the current state-of-the-art inverse design approach for airfoils, which is based on generative adversarial network, demonstrates insufficient precision in its generating and training processes and struggles to reveal…

    Submitted 10 March, 2025; originally announced March 2025.

    Comments: Under Review

  28. arXiv:2503.03803  [pdf, other]

    cs.CV

    EgoLife: Towards Egocentric Life Assistant

    Authors: Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu

    Abstract: We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, social…

    Submitted 5 March, 2025; originally announced March 2025.

    Comments: Accepted to CVPR 2025. Project Page: https://egolife-ai.github.io/. Code: https://github.com/EvolvingLMMs-Lab/EgoLife

  29. arXiv:2503.03430  [pdf, other]

    cs.CV

    CoSDH: Communication-Efficient Collaborative Perception via Supply-Demand Awareness and Intermediate-Late Hybridization

    Authors: Junhao Xu, Yanan Zhang, Zhi Cai, Di Huang

    Abstract: Multi-agent collaborative perception enhances perceptual capabilities by utilizing information from multiple agents and is considered a fundamental solution to the problem of weak single-vehicle perception in autonomous driving. However, existing collaborative perception methods face a dilemma between communication efficiency and perception accuracy. To address this issue, we propose a novel commu…

    Submitted 5 March, 2025; originally announced March 2025.

    Comments: Accepted at CVPR 2025

  30. arXiv:2502.20846  [pdf, other]

    cs.DC cs.PF eess.SY

    AARC: Automated Affinity-aware Resource Configuration for Serverless Workflows

    Authors: Lingxiao Jin, Zinuo Cai, Zebin Chen, Hongyu Zhao, Ruhui Ma

    Abstract: Serverless computing is increasingly adopted for its ability to manage complex, event-driven workloads without the need for infrastructure provisioning. However, traditional resource allocation in serverless platforms couples CPU and memory, which may not be optimal for all functions. Existing decoupling approaches, while offering some flexibility, are not designed to handle the vast configuration…

    Submitted 28 February, 2025; originally announced February 2025.

    Comments: Accepted by the 62nd Design Automation Conference (DAC 2025)

  31. arXiv:2502.20422  [pdf, other]

    cs.CL cs.AI

    SEKI: Self-Evolution and Knowledge Inspiration based Neural Architecture Search via Large Language Models

    Authors: Zicheng Cai, Yaohua Tang, Yutao Lai, Hua Wang, Zhi Chen, Hao Chen

    Abstract: We introduce SEKI, a novel large language model (LLM)-based neural architecture search (NAS) method. Inspired by the chain-of-thought (CoT) paradigm in modern LLMs, SEKI operates in two key stages: self-evolution and knowledge distillation. In the self-evolution stage, LLMs initially lack sufficient reference examples, so we implement an iterative refinement mechanism that enhances architectures b…

    Submitted 27 February, 2025; originally announced February 2025.

  32. arXiv:2502.18072  [pdf, other]

    cs.RO cs.AI cs.MA

    MRBTP: Efficient Multi-Robot Behavior Tree Planning and Collaboration

    Authors: Yishuai Cai, Xinglin Chen, Zhongxuan Cai, Yunxin Mao, Minglong Li, Wenjing Yang, Ji Wang

    Abstract: Multi-robot task planning and collaboration are critical challenges in robotics. While Behavior Trees (BTs) have been established as a popular control architecture and are plannable for a single robot, the development of effective multi-robot BT planning algorithms remains challenging due to the complexity of coordinating diverse action spaces. We propose the Multi-Robot Behavior Tree Planning (MR…

    Submitted 25 February, 2025; originally announced February 2025.

  33. arXiv:2502.15348  [pdf]

    cs.CL cs.AI

    Constructing a Norm for Children's Scientific Drawing: Distribution Features Based on Semantic Similarity of Large Language Models

    Authors: Yi Zhang, Fan Wei, Jingyi Li, Yan Wang, Yanyan Yu, Jianli Chen, Zipo Cai, Xinyu Liu, Wei Wang, Peng Wang, Zhong Wang

    Abstract: The use of children's drawings to examine their conceptual understanding has been proven to be an effective method, but there are two major problems with previous research: 1. The content of the drawings heavily relies on the task, and the ecological validity of the conclusions is low; 2. The interpretation of drawings relies too much on the subjective feelings of the researchers. To address thi…

    Submitted 21 February, 2025; originally announced February 2025.

  34. arXiv:2502.13943  [pdf, other]

    cs.AI cs.CL cs.LG

    AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence

    Authors: Yuliang Liu, Junjie Lu, Zhaoling Chen, Chaofeng Qu, Jason Klein Liu, Chonghan Liu, Zefan Cai, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin

    Abstract: Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or setting the reasoning step's length into a fixed size. These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose Ada…

    Submitted 19 February, 2025; originally announced February 2025.

    Comments: 17 pages

  35. arXiv:2502.12574  [pdf, other]

    cs.LG cs.AI

    HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading

    Authors: Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Anima Anandkumar

    Abstract: Transformer-based large language models (LLMs) demonstrate impressive performance in long context generation. Extending the context length has disproportionately shifted the memory footprint of LLMs during inference to the key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer l…

    Submitted 18 February, 2025; originally announced February 2025.

  36. arXiv:2502.10203  [pdf, other]

    cs.LG cs.DC

    AI-in-the-Loop Sensing and Communication Joint Design for Edge Intelligence

    Authors: Zhijie Cai, Xiaowen Cao, Xu Chen, Yuanhao Cui, Guangxu Zhu, Kaibin Huang, Shuguang Cui

    Abstract: Recent breakthroughs in artificial intelligence (AI), wireless communications, and sensing technologies have accelerated the evolution of edge intelligence. However, conventional systems still grapple with issues such as low communication efficiency, redundant data acquisition, and poor model generalization. To overcome these challenges, we propose an innovative framework that enhances edge intell…

    Submitted 14 February, 2025; originally announced February 2025.

  37. arXiv:2502.05674  [pdf, other]

    eess.AS cs.SD

    Less is More for Synthetic Speech Detection in the Wild

    Authors: Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

    Abstract: Driven by advances in self-supervised learning for speech, state-of-the-art synthetic speech detectors have achieved low error rates on popular benchmarks such as ASVspoof. However, prior benchmarks do not address the wide range of real-world variability in speech. Are reported error rates realistic in real-world conditions? To assess detector failure modes and robustness under controlled distribu…

    Submitted 13 February, 2025; v1 submitted 8 February, 2025; originally announced February 2025.

  38. arXiv:2502.05209  [pdf, other]

    cs.CR cs.AI

    Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

    Authors: Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell

    Abstract: Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, this approach suffers from two limitations. First, input-output evaluations cannot evaluate realistic risks from open-weight…

    Submitted 12 April, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

  39. arXiv:2502.04519  [pdf, other]

    eess.AS cs.LG

    GenVC: Self-Supervised Zero-Shot Voice Conversion

    Authors: Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

    Abstract: Zero-shot voice conversion has recently made substantial progress, but many models still depend on external supervised systems to disentangle speaker identity and linguistic content. Furthermore, current methods often use parallel conversion, where the converted speech inherits the source utterance's temporal structure, restricting speaker similarity and privacy. To overcome these limitations, we…

    Submitted 6 February, 2025; originally announced February 2025.

  40. arXiv:2502.01612  [pdf, other]

    cs.LG cs.AI

    Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges

    Authors: Nayoung Lee, Ziyang Cai, Avi Schwarzschild, Kangwook Lee, Dimitris Papailiopoulos

    Abstract: Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulat…

    Submitted 13 February, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

    Comments: Added references

  41. arXiv:2502.01299  [pdf]

    q-bio.NC cs.CL

    Probabilistic adaptation of language comprehension for individual speakers: Evidence from neural oscillations

    Authors: Hanlin Wu, Xiaohui Rao, Zhenguang G. Cai

    Abstract: Listeners adapt language comprehension based on their mental representations of speakers, but how these representations are dynamically updated remains unclear. We investigated whether listeners probabilistically adapt their comprehension based on the likelihood of speakers producing stereotype-incongruent utterances. Our findings reveal two potential mechanisms: a speaker-general mechanism that a…

    Submitted 3 February, 2025; originally announced February 2025.

  42. arXiv:2502.00372  [pdf, other]

    cs.CV

    NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning

    Authors: Zhixi Cai, Fucai Ke, Simindokht Jahangard, Maria Garcia de la Banda, Reza Haffari, Peter J. Stuckey, Hamid Rezatofighi

    Abstract: Visual Grounding (VG) tasks, such as referring expression detection and segmentation, are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting challenges for methods that require reasoning like human cognition. Recent advances in large language methods (L…

    Submitted 8 March, 2025; v1 submitted 1 February, 2025; originally announced February 2025.

  43. arXiv:2502.00319  [pdf, other]

    cs.LG cs.DC eess.SP

    Physics-Inspired Distributed Radio Map Estimation

    Authors: Dong Yang, Yue Wang, Songyang Zhang, Yingshu Li, Zhipeng Cai

    Abstract: To gain panoramic awareness of spectrum coverage in complex wireless environments, data-driven learning approaches have recently been introduced for radio map estimation (RME). While existing deep learning based methods conduct RME given spectrum measurements gathered from dispersed sensors in the region of interest, they rely on centralized data at a fusion center, which however raises critical c…

    Submitted 31 January, 2025; originally announced February 2025.

  44. arXiv:2501.17610  [pdf, other]

    cs.DC

    FeedSign: Robust Full-parameter Federated Fine-tuning of Large Models with Extremely Low Communication Overhead of One Bit

    Authors: Zhijie Cai, Haolong Chen, Guangxu Zhu

    Abstract: Federated fine-tuning (FFT) attempts to fine-tune a pre-trained model with private data from distributed clients by exchanging models rather than data under the orchestration of a parameter server (PS). To overcome the bottleneck forged by the growing communication and memory overhead for clients in such systems due to the growing model sizes, we propose FeedSign, an FFT algorithm in whic…

    Submitted 31 March, 2025; v1 submitted 29 January, 2025; originally announced January 2025.

  45. arXiv:2501.15579  [pdf, other]

    cs.CV cs.CL

    An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training

    Authors: Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen

    Abstract: The clinical adoption of artificial intelligence (AI) in medical imaging requires models that are both diagnostically accurate and interpretable to clinicians. While current multimodal biomedical foundation models prioritize performance, their black-box nature hinders explaining the decision-making process in clinically meaningful concepts. Here, we present ConceptCLIP, the first explainable biome…

    Submitted 26 April, 2025; v1 submitted 26 January, 2025; originally announced January 2025.

  46. arXiv:2501.14785  [pdf, other]

    stat.ML cs.AI cs.LG cs.SI

    ED-Filter: Dynamic Feature Filtering for Eating Disorder Classification

    Authors: Mehdi Naseriparsa, Suku Sukunesan, Zhen Cai, Osama Alfarraj, Amr Tolba, Saba Fathi Rabooki, Feng Xia

    Abstract: Eating disorders (ED) are critical psychiatric problems that have alarmed the mental health community. Mental health professionals are increasingly recognizing the utility of data derived from social media platforms such as Twitter. However, high dimensionality and extensive feature sets of Twitter data present remarkable challenges for ED classification. To overcome these hurdles, we introduce a…

    Submitted 4 January, 2025; originally announced January 2025.

  47. arXiv:2501.13335  [pdf, other]

    cs.CV

    Deblur-Avatar: Animatable Avatars from Motion-Blurred Monocular Videos

    Authors: Xianrui Luo, Juewen Peng, Zhongang Cai, Lei Yang, Fan Yang, Zhiguo Cao, Guosheng Lin

    Abstract: We introduce a novel framework for modeling high-fidelity, animatable 3D human avatars from motion-blurred monocular video inputs. Motion blur is prevalent in real-world dynamic video capture, especially due to human movements in 3D human avatar modeling. Existing methods either (1) assume sharp image inputs, failing to address the detail loss introduced by motion blur, or (2) mainly consider blur…

    Submitted 5 March, 2025; v1 submitted 22 January, 2025; originally announced January 2025.

  48. arXiv:2501.12946  [pdf, other]

    cs.SI

    Less is More: Simple yet Effective Heuristic Community Detection with Graph Convolution Network

    Authors: Hong Wang, Yinglong Zhang, Zhangqi Zhao, Zhicong Cai, Xuewen Xia, Xing Xu

    Abstract: Community detection is crucial in data mining. Traditional methods primarily focus on graph structure, often neglecting the significance of attribute features. In contrast, deep learning-based approaches incorporate attribute features and local structural information through contrastive learning, improving detection performance. However, existing algorithms' complex design and joint optimization m…

    Submitted 22 January, 2025; originally announced January 2025.

    Comments: 19 pages, 6 figures

  49. arXiv:2501.09782  [pdf, other]

    cs.CV cs.GR cs.HC cs.MM cs.RO

    SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation

    Authors: Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Atsushi Yamashita, Lei Yang, Ziwei Liu

    Abstract: Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform…

    Submitted 16 January, 2025; originally announced January 2025.

    Comments: An extension of SMPLer-X [arXiv:2309.17448]. Homepage: https://caizhongang.com/projects/SMPLer-X/

  50. arXiv:2501.08643  [pdf, other]

    cs.CV

    MonSter: Marry Monodepth to Stereo Unleashes Power

    Authors: Junda Cheng, Longliang Liu, Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Yong Deng, Jinliang Zang, Yurui Chen, Zhipeng Cai, Xin Yang

    Abstract: Stereo matching recovers depth from image correspondences. Existing methods struggle to handle ill-posed regions with limited matching cues, such as occlusions and textureless areas. To address this, we propose MonSter, a novel method that leverages the complementary strengths of monocular depth estimation and stereo matching. MonSter integrates monocular depth and stereo matching into a dual-bran…

    Submitted 15 January, 2025; originally announced January 2025.