Showing 1–50 of 181 results for author: Zhong, W

Searching in archive cs.
  1. arXiv:2507.04636  [pdf, ps, other]

    cs.CL

    Put Teacher in Student's Shoes: Cross-Distillation for Ultra-compact Model Compression Framework

    Authors: Maolin Wang, Jun Chu, Sicong Xie, Xiaoling Zang, Yao Zhao, Wenliang Zhong, Xiangyu Zhao

    Abstract: In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance acro…

    Submitted 6 July, 2025; originally announced July 2025.

    Comments: Accepted by KDD 2025

  2. arXiv:2506.23590  [pdf, ps, other]

    cs.CV

    CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

    Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin

    Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or significantly increase inference time. In this work, we observe that LVLMs' a…

    Submitted 30 June, 2025; originally announced June 2025.

  3. arXiv:2506.15167  [pdf, ps, other]

    cs.IT cs.AI

    LLM Agent for Hyper-Parameter Optimization

    Authors: Wanzhe Wang, Jianqiu Peng, Menghao Hu, Weihuang Zhong, Tong Zhang, Shuai Wang, Yixin Zhang, Mingjie Shao, Wanli Ni

    Abstract: Hyper-parameters are essential and critical for the performance of communication algorithms. However, current hyper-parameters optimization approaches for Warm-Start Particles Swarm Optimization with Crossover and Mutation (WS-PSO-CM) algorithm, designed for radio map-enabled unmanned aerial vehicle (UAV) trajectory and communication, are primarily heuristic-based, exhibiting low levels of automat…

    Submitted 9 July, 2025; v1 submitted 18 June, 2025; originally announced June 2025.

    Comments: 6 pages, 6 figures

  4. arXiv:2506.11003  [pdf, other]

    cs.SE cs.AI

    EmbedAgent: Benchmarking Large Language Models in Embedded System Development

    Authors: Ruiyang Xu, Jialun Cao, Mingyuan Wu, Wenliang Zhong, Yaojie Lu, Ben He, Xianpei Han, Shing-Chi Cheung, Le Sun

    Abstract: Large Language Models (LLMs) have shown promise in various tasks, yet few benchmarks assess their capabilities in embedded system development. In this paper, we introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development, such as Embedded System Programmer, Architect, and Integrator. This paradigm enables LLMs to be tested in tasks that bridge the gap betwe…

    Submitted 19 April, 2025; originally announced June 2025.

    Comments: 21 pages

  5. arXiv:2505.24544  [pdf, ps, other]

    cs.CL cs.AI

    Cross-Attention Speculative Decoding

    Authors: Wei Zhong, Manasa Bharadwaj, Yixiao Wang, Nikhil Verma, Yipeng Ji, Chul Lee

    Abstract: Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and…

    Submitted 30 May, 2025; originally announced May 2025.

  6. arXiv:2505.18822  [pdf, ps, other]

    cs.AI cs.CL

    AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting

    Authors: Shijue Huang, Hongru Wang, Wanjun Zhong, Zhaochen Su, Jiazhan Feng, Bowen Cao, Yi R. Fung

    Abstract: Modern large reasoning models demonstrate impressive problem-solving capabilities by employing sophisticated reasoning strategies. However, they often struggle to balance efficiency and effectiveness, frequently generating unnecessarily lengthy reasoning chains for simple problems. In this work, we propose AdaCtrl, a novel framework to support both difficulty-aware adaptive reasoning budget alloca…

    Submitted 24 May, 2025; originally announced May 2025.

  7. arXiv:2505.18612  [pdf, ps, other]

    cs.CV

    Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

    Authors: Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, Guanbin Li

    Abstract: Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-t…

    Submitted 2 July, 2025; v1 submitted 24 May, 2025; originally announced May 2025.

    Comments: Project page: https://weizhi-zhong.github.io/Mod-Adapter

  8. arXiv:2505.16980  [pdf, ps, other]

    cs.CV cs.MM

    Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction

    Authors: Dong Li, Wenqi Zhong, Wei Yu, Yingwei Pan, Dingwen Zhang, Ting Yao, Junwei Han, Tao Mei

    Abstract: Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal incons…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: CVPR 2025

  9. arXiv:2505.14299  [pdf, ps, other]

    cs.MA

    Empowering LLMs in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy

    Authors: Zihao Feng, Xiaoxue Wang, Bowen Wu, Weihong Zhong, Zhen Xu, Hailong Cao, Tiejun Zhao, Ying Li, Baoxun Wang

    Abstract: Task-oriented dialogue systems based on Large Language Models (LLMs) have gained increasing attention across various industries and achieved significant results. Current approaches condense complex procedural workflows into a single agent to achieve satisfactory performance on large-scale LLMs. However, these approaches face challenges to achieve comparable performance on fine-tuned lightweight LL…

    Submitted 20 May, 2025; originally announced May 2025.

  10. arXiv:2505.14116  [pdf, ps, other]

    cs.CL

    Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst

    Authors: Hongru Wang, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z. Pan, Zeming Liu, Kam-Fai Wong

    Abstract: Inference-time scaling has attracted much attention, as it significantly enhances the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, being difficult to create and acquire. In this work, we i…

    Submitted 20 May, 2025; originally announced May 2025.

  11. arXiv:2505.07062  [pdf, ps, other]

    cs.CV cs.AI

    Seed1.5-VL Technical Report

    Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, et al. (172 additional authors not shown)

    Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM of 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluati…

    Submitted 11 May, 2025; originally announced May 2025.

  12. arXiv:2505.06105  [pdf, other]

    eess.IV cs.CV

    S2MNet: Speckle-To-Mesh Net for Three-Dimensional Cardiac Morphology Reconstruction via Echocardiogram

    Authors: Xilin Gong, Yongkai Chen, Shushan Wu, Fang Wang, Ping Ma, Wenxuan Zhong

    Abstract: Echocardiogram is the most commonly used imaging modality in cardiac assessment due to its non-invasive nature, real-time capability, and cost-effectiveness. Despite its advantages, most clinical echocardiograms provide only two-dimensional views, limiting the ability to fully assess cardiac anatomy and function in three dimensions. While three-dimensional echocardiography exists, it often suffers…

    Submitted 9 May, 2025; originally announced May 2025.

  13. arXiv:2504.19452  [pdf, ps, other]

    cs.LG physics.comp-ph

    Geometry-Informed Neural Operator Transformer

    Authors: Qibang Liu, Weiheng Zhong, Hadi Meidani, Diab Abueidda, Seid Koric, Philippe Geubelle

    Abstract: Machine-learning-based surrogate models offer significant computational efficiency and faster simulations compared to traditional numerical methods, especially for problems requiring repeated evaluations of partial differential equations. This work introduces the Geometry-Informed Neural Operator Transformer (GINOT), which integrates the transformer architecture with the neural operator framework…

    Submitted 9 July, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

  14. arXiv:2504.15681  [pdf, other]

    cs.CV

    Vidi: Large Multimodal Models for Video Understanding and Editing

    Authors: Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu

    Abstract: Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components…

    Submitted 24 April, 2025; v1 submitted 22 April, 2025; originally announced April 2025.

  15. arXiv:2504.14870  [pdf, ps, other]

    cs.AI cs.CL

    Acting Less is Reasoning More! Teaching Model to Act Efficiently

    Authors: Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, Heng Ji

    Abstract: Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools during long-form reasoning, such as search engines and code interpreters, to solve tasks beyond the capabilities of internal reasoning. While reinforcement learning (RL) has shown promise in training such agents, most existing approaches typically optimize only for final correctness w…

    Submitted 31 May, 2025; v1 submitted 21 April, 2025; originally announced April 2025.

  16. arXiv:2504.14772  [pdf, other]

    cs.CL cs.LG stat.ML

    Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

    Authors: Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, Ping Ma

    Abstract: The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and lingui…

    Submitted 20 April, 2025; originally announced April 2025.

  17. arXiv:2504.13914  [pdf, other]

    cs.CL

    Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

    Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, et al. (249 additional authors not shown)

    Abstract: We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For in…

    Submitted 29 April, 2025; v1 submitted 10 April, 2025; originally announced April 2025.

  18. arXiv:2504.11536  [pdf, other]

    cs.CL cs.AI

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Authors: Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong

    Abstract: While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhanc…

    Submitted 17 April, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

    Comments: fix typos

  19. arXiv:2504.04188  [pdf, ps, other]

    cs.IR cs.LG

    Towards Principled Learning for Re-ranking in Recommender Systems

    Authors: Qunwei Li, Linghui Li, Jianbin Lin, Wenliang Zhong

    Abstract: As the final stage of recommender systems, re-ranking presents ordered item lists to users that best match their interests. It plays such a critical role and has become a trending research topic with much attention from both academia and industry. Recent advances of re-ranking are focused on attentive listwise modeling of interactions and mutual influences among items to be re-ranked. However, pri…

    Submitted 5 April, 2025; originally announced April 2025.

  20. arXiv:2503.19502  [pdf, other]

    physics.geo-ph cs.AI

    Towards Long-Range ENSO Prediction with an Explainable Deep Learning Model

    Authors: Qi Chen, Yinghao Cui, Guobin Hong, Karumuri Ashok, Yuchun Pu, Xiaogu Zheng, Xuanze Zhang, Wei Zhong, Peng Zhan, Zhonglei Wang

    Abstract: El Niño-Southern Oscillation (ENSO) is a prominent mode of interannual climate variability with far-reaching global impacts. Its evolution is governed by intricate air-sea interactions, posing significant challenges for long-term prediction. In this study, we introduce CTEFNet, a multivariate deep learning model that synergizes convolutional neural networks and transformers to enhance ENSO forecas…

    Submitted 25 March, 2025; originally announced March 2025.

  21. arXiv:2502.19915   

    cs.AI

    LLM-driven Effective Knowledge Tracing by Integrating Dual-channel Difficulty

    Authors: Jiahui Cen, Jianghao Lin, Weixuan Zhong, Dong Zhou, Jin Chen, Aimin Yang, Yongmei Zhou

    Abstract: Knowledge Tracing (KT) is a fundamental technology in intelligent tutoring systems used to simulate changes in students' knowledge state during learning, track personalized knowledge mastery, and predict performance. However, current KT models face three major challenges: (1) When encountering new questions, models face cold-start problems due to sparse interaction records, making precise modeling…

    Submitted 29 April, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

    Comments: During a careful review of our base-experiment results, we discovered a possible error in the way some data were recorded. To ensure the integrity and accuracy of our work, we must correct these results and revise the corresponding analysis before making the manuscript publicly available

  22. arXiv:2502.09790  [pdf, other]

    astro-ph.EP astro-ph.IM cs.LG

    ExoMiner++: Enhanced Transit Classification and a New Vetting Catalog for 2-Minute TESS Data

    Authors: Hamed Valizadegan, Miguel J. S. Martinho, Jon M. Jenkins, Joseph D. Twicken, Douglas A. Caldwell, Patrick Maynard, Hongbo Wei, William Zhong, Charles Yates, Sam Donald, Karen A. Collins, David Latham, Khalid Barkaoui, Michael L. Calkins, Kylee Carden, Nikita Chazov, Gilbert A. Esquerdo, Tristan Guillot, Vadim Krushinsky, Grzegorz Nowak, Benjamin V. Rackham, Amaury Triaud, Richard P. Schwarz, Denise Stephens, Chris Stockdale, et al. (2 additional authors not shown)

    Abstract: We present ExoMiner++, an enhanced deep learning model that builds on the success of ExoMiner to improve transit signal classification in 2-minute TESS data. ExoMiner++ incorporates additional diagnostic inputs, including periodogram, flux trend, difference image, unfolded flux, and spacecraft attitude control data, all of which are crucial for effectively distinguishing transit signals from more…

    Submitted 19 May, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

  23. arXiv:2502.07465   

    cs.LG cs.AI

    Crime Forecasting: A Spatio-temporal Analysis with Deep Learning Models

    Authors: Li Mao, Wei Du, Shuo Wen, Qi Li, Tong Zhang, Wei Zhong

    Abstract: This study uses deep-learning models to predict city partition crime counts on specific days. It helps police enhance surveillance, gather intelligence, and proactively prevent crimes. We formulate crime count prediction as a spatiotemporal sequence challenge, where both input data and prediction targets are spatiotemporal sequences. In order to improve the accuracy of crime forecasting, we introd…

    Submitted 13 February, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

    Comments: The paper was submitted without the consent of all co-authors. The content of the paper is incomplete and requires substantial additional work before it can be considered a complete and coherent submission

  24. arXiv:2502.07221  [pdf, other]

    cs.CV

    MLLM4PUE: Toward Universal Embeddings in Digital Pathology through Multimodal LLMs

    Authors: Qifeng Zhou, Thao M. Dang, Wenliang Zhong, Yuzhi Guo, Hehuan Ma, Saiyang Na, Haiqing Li, Junzhou Huang

    Abstract: Pathology plays a critical role in diagnosing a wide range of diseases, yet existing approaches often rely heavily on task-specific models trained on extensive, well-labeled datasets. These methods face sustainability challenges due to the diversity of pathologies and the labor-intensive nature of data collection. To address these limitations, we highlight the need for universal multimodal embeddi…

    Submitted 16 March, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

  25. arXiv:2501.13573  [pdf, other]

    cs.CL

    Improving Contextual Faithfulness of Large Language Models via Retrieval Heads-Induced Optimization

    Authors: Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yangfan Ye, Weihong Zhong, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Bing Qin

    Abstract: Ensuring contextual faithfulness in retrieval-augmented large language models (LLMs) is crucial for building trustworthy information-seeking systems, particularly in long-form question-answering (LFQA) scenarios. In this work, we identify a salient correlation between LFQA faithfulness and retrieval heads, a set of attention heads responsible for retrieving contextual information. Leveraging this…

    Submitted 23 January, 2025; originally announced January 2025.

    Comments: Submitted to ARR October 2024

  26. arXiv:2501.12326  [pdf, other]

    cs.AI cs.CL cs.CV cs.HC

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, et al. (10 additional authors not shown)

    Abstract: This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks.…

    Submitted 21 January, 2025; originally announced January 2025.

  27. arXiv:2501.10761  [pdf, other]

    cs.CV

    Infrared and Visible Image Fusion: From Data Compatibility to Task Adaption

    Authors: Jinyuan Liu, Guanyao Wu, Zhu Liu, Di Wang, Zhiying Jiang, Long Ma, Wei Zhong, Xin Fan, Risheng Liu

    Abstract: Infrared-visible image fusion (IVIF) is a critical task in computer vision, aimed at integrating the unique features of both infrared and visible spectra into a unified representation. Since 2018, the field has entered the deep learning era, with an increasing variety of approaches introducing a range of networks and loss functions to enhance visual performance. However, challenges such as data co…

    Submitted 18 January, 2025; originally announced January 2025.

  28. arXiv:2501.06271  [pdf, other]

    q-bio.QM cs.AI cs.CE

    Large Language Models for Bioinformatics

    Authors: Wei Ruan, Yanjun Lyu, Jing Zhang, Jiazhang Cai, Peng Shu, Yang Ge, Yao Lu, Shang Gao, Yue Wang, Peilong Wang, Lin Zhao, Tao Wang, Yufang Liu, Luyang Fang, Ziyu Liu, Zhengliang Liu, Yiwei Li, Zihao Wu, Junhao Chen, Hanqi Jiang, Yi Pan, Zhenyuan Yang, Jingyuan Chen, Shizhe Liang, Wei Zhang, et al. (30 additional authors not shown)

    Abstract: With the rapid advancements in large language model (LLM) technology and the emergence of bioinformatics-specific language models (BioLMs), there is a growing need for a comprehensive analysis of the current landscape, computational characteristics, and diverse applications. This survey aims to address this need by providing a thorough review of BioLMs, focusing on their evolution, classification,…

    Submitted 9 January, 2025; originally announced January 2025.

    Comments: 64 pages, 1 figure

  29. arXiv:2412.17787  [pdf, other]

    cs.CV cs.CL

    Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective

    Authors: Xinmiao Yu, Xiaocheng Feng, Yun Li, Minghui Liao, Ya-Qi Yu, Xiachong Feng, Weihong Zhong, Ruihan Chen, Mengkang Hu, Jihao Wu, Dandan Tu, Duyu Tang, Bing Qin

    Abstract: Recent Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images from charts, tables, and documents. However, the abundant text within such images may increase the model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the ins…

    Submitted 23 December, 2024; originally announced December 2024.

  30. arXiv:2412.14656  [pdf, other]

    cs.CL

    Length Controlled Generation for Black-box LLMs

    Authors: Yuxuan Gu, Wenjie Wang, Xiaocheng Feng, Weihong Zhong, Kun Zhu, Lei Huang, Tat-Seng Chua, Bing Qin

    Abstract: Large language models (LLMs) have demonstrated impressive instruction following capabilities, while still struggling to accurately manage the length of the generated text, which is a fundamental requirement in many real-world applications. Existing length control methods involve fine-tuning the parameters of LLMs, which is inefficient and suboptimal for practical use. In this paper, we propose a n…

    Submitted 19 December, 2024; originally announced December 2024.

    Comments: Preprint

  31. arXiv:2412.05756  [pdf, other]

    cs.CV

    Compositional Image Retrieval via Instruction-Aware Contrastive Learning

    Authors: Wenliang Zhong, Weizhi An, Feng Jiang, Hehuan Ma, Yuzhi Guo, Junzhou Huang

    Abstract: Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference. CIR is inherently an instruction-following task, as the model needs to interpret and apply modifications to the image. In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR)…

    Submitted 7 December, 2024; originally announced December 2024.

    Comments: 9 pages, 8 figures

  32. arXiv:2412.03508  [pdf, other]

    cs.RO

    Design and Control of an Ultra-Slender Push-Pull Multisection Continuum Manipulator for In-Situ Inspection of Aeroengine

    Authors: Weiheng Zhong, Yuancan Huang, Da Hong, Nianfeng Shao

    Abstract: Since the shape of industrial endoscopes is passively altered according to the contact around it, manual inspection approaches of aeroengines through the inspection ports have unreachable areas, and it's difficult to traverse multistage blades and inspect them simultaneously, which requires engine disassembly or the cooperation of multiple operators, resulting in efficiency decline and increased c…

    Submitted 4 December, 2024; originally announced December 2024.

    Comments: This work has been accepted by IROS 2024

  33. arXiv:2412.01806  [pdf, other]

    cond-mat.stat-mech cs.AI cs.CL

    Random Tree Model of Meaningful Memory

    Authors: Weishun Zhong, Tankut Can, Antonis Georgiou, Ilya Shnayderman, Mikhail Katkov, Misha Tsodyks

    Abstract: Traditional studies of memory for meaningful narratives focus on specific stories and their semantic structures but do not address common quantitative features of recall across different narratives. We introduce a statistical ensemble of random trees to represent narratives as hierarchies of key points, where each node is a compressed representation of its descendant leaves, which are the original…

    Submitted 23 February, 2025; v1 submitted 2 December, 2024; originally announced December 2024.

    Comments: 21 pages, 5 figures; included new derivations

  34. arXiv:2412.00171  [pdf, other]

    cs.RO cs.CV

    RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World

    Authors: Weixin Mao, Weiheng Zhong, Zhou Jiang, Dong Fang, Zhongyue Zhang, Zihan Lan, Haosheng Li, Fan Jia, Tiancai Wang, Haoqiang Fan, Osamu Yoshie

    Abstract: Existing robot policies predominantly adopt the task-centric approach, requiring end-to-end task data collection. This results in limited generalization to new tasks and difficulties in pinpointing errors within long-horizon, multi-stage tasks. To address this, we propose RoboMatrix, a skill-centric hierarchical framework designed for scalable robot task planning and execution in open-world enviro…

    Submitted 25 March, 2025; v1 submitted 29 November, 2024; originally announced December 2024.

    Comments: 17 pages, 16 figures

  35. arXiv:2411.11464  [pdf, other]

    math.ST cs.LG stat.ML

    PALMS: Parallel Adaptive Lasso with Multi-directional Signals for Latent Networks Reconstruction

    Authors: Zhaoyu Xing, Wei Zhong

    Abstract: Large-scale networks exist in many fields and play an important role in real-world dynamics. However, the networks are usually latent and expensive to detect, which becomes the main challenge for many applications and empirical analysis. Several statistical methods were proposed to infer the edges, but the complexity of the algorithms makes them hard to apply to large-scale networks. In this pap…

    Submitted 18 November, 2024; originally announced November 2024.

    Comments: 48 pages

    MSC Class: 62-08; ACM Class: C.2.4

  36. arXiv:2411.10032  [pdf, other]

    cs.CV cs.AI

    VMID: A Multimodal Fusion LLM Framework for Detecting and Identifying Misinformation of Short Videos

    Authors: Weihao Zhong, Yinhao Xiao, Minghui Xu, Xiuzhen Cheng

    Abstract: Short video platforms have become important channels for news dissemination, offering a highly engaging and immediate way for users to access current events and share information. However, these platforms have also emerged as significant conduits for the rapid spread of misinformation, as fake news and rumors can leverage the visual appeal and wide reach of short videos to circulate extensively am…

    Submitted 15 November, 2024; originally announced November 2024.

    Comments: arXiv admin note: text overlap with arXiv:2211.10973 by other authors

  37. arXiv:2410.23757  [pdf, other]

    cs.IR

    Identify Then Recommend: Towards Unsupervised Group Recommendation

    Authors: Yue Liu, Shihao Zhu, Tianyuan Yang, Jian Ma, Wenliang Zhong

    Abstract: Group Recommendation (GR), which aims to recommend items to groups of users, has become a promising and practical direction for recommendation systems. This paper points out two issues of the state-of-the-art GR models. (1) The pre-defined and fixed number of user groups is inadequate for real-time industrial recommendation systems, where the group distribution can shift dynamically. (2) The train…

    Submitted 31 October, 2024; originally announced October 2024.

    Comments: 26 pages

  38. arXiv:2410.22380  [pdf, other]

    cs.LG cs.AI

    Discrete Modeling via Boundary Conditional Diffusion Processes

    Authors: Yuxuan Gu, Xiaocheng Feng, Lei Huang, Yingsheng Wu, Zekun Zhou, Weihong Zhong, Kun Zhu, Bing Qin

    Abstract: We present a novel framework for efficiently and effectively extending the powerful continuous diffusion processes to discrete modeling. Previous approaches have suffered from the discrepancy between discrete data and continuous modeling. Our study reveals that the absence of guidance from discrete boundaries in learning probability contours is one of the main reasons. To address this issue, we p…

    Submitted 29 October, 2024; originally announced October 2024.

    Comments: NeurIPS 2024 poster

  39. arXiv:2410.20424  [pdf, other]

    cs.AI cs.CL

    AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

    Authors: Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, Ge Zhang

    Abstract: Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and compreh…

    Submitted 5 November, 2024; v1 submitted 27 October, 2024; originally announced October 2024.

    Comments: 44 pages, 10 figures

  40. arXiv:2410.13298  [pdf, other]

    cs.CL cs.AI

    Advancing Large Language Model Attribution through Self-Improving

    Authors: Lei Huang, Xiaocheng Feng, Weitao Ma, Liang Zhao, Yuchun Fan, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin

    Abstract: Teaching large language models (LLMs) to generate text with citations to evidence sources can mitigate hallucinations and enhance verifiability in information-seeking systems. However, improving this capability requires high-quality attribution data, which is costly and labor-intensive. Inspired by recent advances in self-improvement that enhance LLMs without manual annotation, we present START, a…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: Accepted by EMNLP 2024 Main Conference

  41. arXiv:2410.01490  [pdf, other]

    cs.CL

    Extending Context Window of Large Language Models from a Distributional Perspective

    Authors: Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin

    Abstract: Scaling the rotary position embedding (RoPE) has become a common method for extending the context window of RoPE-based large language models (LLMs). However, existing scaling methods often rely on empirical approaches and lack a profound understanding of the internal distribution within RoPE, resulting in suboptimal performance in extending the context window length. In this paper, we propose to o…

    Submitted 3 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

    Comments: 14 pages, 8 figures, Accepted to EMNLP 2024

  42. arXiv:2409.14165  [pdf]

    cs.AI cs.CL cs.LG cs.RO eess.SY

    A Survey on Large Language Model-empowered Autonomous Driving

    Authors: Yuxuan Zhu, Shiyi Wang, Wenqing Zhong, Nianchen Shen, Yunqi Li, Siqi Wang, Zhiheng Li, Cathy Wu, Zhengbing He, Li Li

    Abstract: Artificial intelligence (AI) plays a crucial role in autonomous driving (AD) research, propelling its development towards intelligence and efficiency. Currently, the development of AD technology follows two main technical paths: modularization and end-to-end. Modularization decomposes the driving task into modules such as perception, prediction, planning, and control, and trains them separately. Due…

    Submitted 30 November, 2024; v1 submitted 21 September, 2024; originally announced September 2024.

  43. arXiv:2409.09030  [pdf, other]

    cs.SE cs.AI cs.CL

    Agents in Software Engineering: Survey, Landscape, and Vision

    Authors: Yanlin Wang, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, Zibin Zheng

    Abstract: In recent years, Large Language Models (LLMs) have achieved remarkable success and have been widely used in various downstream tasks, especially in the tasks of the software engineering (SE) field. We find that many studies combining LLMs with SE have employed the concept of agents either explicitly or implicitly. However, there is a lack of an in-depth survey to sort out the development context o…

    Submitted 23 September, 2024; v1 submitted 13 September, 2024; originally announced September 2024.

    Comments: 12 pages, 4 figures

  44. arXiv:2408.17053  [pdf, other]

    cs.LG

    Estimating Conditional Average Treatment Effects via Sufficient Representation Learning

    Authors: Pengfei Shi, Wei Zhong, Xinyu Zhang, Ningtao Wang, Xing Fu, Weiqiang Wang, Yin Jin

    Abstract: Estimating the conditional average treatment effects (CATE) is very important in causal inference and has a wide range of applications across many fields. In the estimation process of CATE, the unconfoundedness assumption is typically required to ensure the identifiability of the regression problems. When estimating CATE using high-dimensional data, there have been many variable selection methods…

    Submitted 13 December, 2024; v1 submitted 30 August, 2024; originally announced August 2024.

  45. arXiv:2408.12001  [pdf, ps, other]

    econ.TH cs.GT

    Rank-Guaranteed Auctions

    Authors: Wei He, Jiangtao Li, Weijie Zhong

    Abstract: We propose a combinatorial ascending auction that is "approximately" optimal, requiring minimal rationality to achieve this level of optimality, and is robust to strategic and distributional uncertainties. Specifically, the auction is rank-guaranteed, meaning that for any menu M and any valuation profile, the ex-post revenue is guaranteed to be at least as high as the highest revenue achievable fr…

    Submitted 21 August, 2024; originally announced August 2024.

  46. arXiv:2408.07637  [pdf, other]

    q-bio.NC cond-mat.dis-nn cs.CL

    Hierarchical Working Memory and a New Magic Number

    Authors: Weishun Zhong, Mikhail Katkov, Misha Tsodyks

    Abstract: The extremely limited working memory span, typically around four items, contrasts sharply with our everyday experience of processing much larger streams of sensory information concurrently. This disparity suggests that working memory can organize information into compact representations such as chunks, yet the underlying neural mechanisms remain largely unknown. Here, we propose a recurrent neural…

    Submitted 14 August, 2024; originally announced August 2024.

    Comments: 16 pages, 7 figures

  47. arXiv:2408.07367  [pdf, other]

    cs.RO

    Risk Occupancy: A New and Efficient Paradigm through Vehicle-Road-Cloud Collaboration

    Authors: Jiaxing Chen, Wei Zhong, Bolin Gao, Yifei Liu, Hengduo Zou, Jiaxi Liu, Yanbo Lu, Jin Huang, Zhihua Zhong

    Abstract: This study introduces the 4D Risk Occupancy within a vehicle-road-cloud architecture, integrating the road surface spatial, risk, and temporal dimensions, and endowing the algorithm with beyond-line-of-sight, all-angles, and efficient abilities. The algorithm simplifies risk modeling by focusing on directly observable information and key factors, drawing on the concept of Occupancy Grid Maps (OGM)…

    Submitted 17 August, 2024; v1 submitted 14 August, 2024; originally announced August 2024.

    Comments: 13 pages, 9 figures

  48. arXiv:2408.05542  [pdf, other]

    cs.SE

    You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search

    Authors: Yanlin Wang, Lianghong Guo, Ensheng Shi, Wenqing Chen, Jiachi Chen, Wanjun Zhong, Menghan Wang, Hui Li, Hongyu Zhang, Ziyu Lyu, Zibin Zheng

    Abstract: Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries. While the performance of code search models improves with an increase in high-quality data, obtaining such data can be challenging and expensive. Recently, large language models (LLMs) such as ChatGPT have made remarkable progress in both natural and programming…

    Submitted 17 August, 2024; v1 submitted 10 August, 2024; originally announced August 2024.

    Comments: Accepted at ICSME 2023

  49. arXiv:2408.05416  [pdf, other]

    cs.CV cs.AI cs.MM

    High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

    Authors: Weizhi Zhong, Junfan Lin, Peixin Chen, Liang Lin, Guanbin Li

    Abstract: Audio-driven talking face video generation has attracted increasing attention due to its huge industrial potential. Some previous methods focus on learning a direct mapping from audio to visual content. Despite progress, they often struggle with the ambiguity of the mapping process, leading to flawed results. An alternative strategy involves facial structural representations (e.g., facial landmark…

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: submitted to IEEE Transactions on Image Processing (TIP)

  50. arXiv:2408.05412  [pdf, ps, other]

    cs.CV cs.AI cs.MM

    Style-Preserving Lip Sync via Audio-Aware Style Reference

    Authors: Weizhi Zhong, Jichang Li, Yinqi Cai, Ming Li, Feng Gao, Liang Lin, Guanbin Li

    Abstract: Audio-driven lip sync has recently drawn significant attention due to its widespread application in the multimedia domain. Individuals exhibit distinct lip shapes when speaking the same utterance, attributed to the unique speaking styles of individuals, posing a notable challenge for audio-driven lip sync. Earlier methods for this task often bypassed the modeling of personalized speaking styles, r…

    Submitted 18 June, 2025; v1 submitted 9 August, 2024; originally announced August 2024.

    Comments: submitted to IEEE Transactions on Multimedia (TMM)