
Showing 1–50 of 1,156 results for author: Huang, W

Searching in archive cs.
  1. arXiv:2507.01027  [pdf, ps, other]

    cs.LG

    DBellQuant: Breaking the Bell with Double-Bell Transformation for LLMs Post Training Binarization

    Authors: Zijian Ye, Wei Huang, Yifei Yu, Tianhe Ren, Zhongrui Wang, Xiaojuan Qi

    Abstract: Large language models (LLMs) demonstrate remarkable performance but face substantial computational and memory challenges that limit their practical deployment. Quantization has emerged as a promising solution; however, its effectiveness is often limited by quantization errors arising from weight distributions that are not quantization-friendly and the presence of activation outliers. To address th…

    Submitted 18 June, 2025; originally announced July 2025.

    Comments: 19 pages; Appendix added

  2. arXiv:2507.00458  [pdf, ps, other]

    eess.AS cs.SD

    Mitigating Language Mismatch in SSL-Based Speaker Anonymization

    Authors: Zhe Zhang, Wen-Chin Huang, Xin Wang, Xiaoxiao Miao, Junichi Yamagishi

    Abstract: Speaker anonymization aims to protect speaker identity while preserving content information and the intelligibility of speech. However, most speaker anonymization systems (SASs) are developed and evaluated using only English, resulting in degraded utility for other languages. This paper investigates language mismatch in SASs for Japanese and Mandarin speech. First, we fine-tune a self-supervised l…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted to Interspeech 2025

  3. arXiv:2507.00444  [pdf, ps, other]

    cs.ET

    DiffCkt: A Diffusion Model-Based Hybrid Neural Network Framework for Automatic Transistor-Level Generation of Analog Circuits

    Authors: Chengjie Liu, Jiajia Li, Yabing Feng, Wenhao Huang, Weiyu Chen, Yuan Du, Jun Yang, Li Du

    Abstract: Analog circuit design consists of the pre-layout and layout phases. Among them, the pre-layout phase directly decides the final circuit performance, but heavily depends on experienced engineers to do manual design according to specific application scenarios. To overcome these challenges and automate the analog circuit pre-layout design phase, we introduce DiffCkt: a diffusion model-based hybrid ne…

    Submitted 1 July, 2025; originally announced July 2025.

    Comments: Accepted by ICCAD2025

  4. arXiv:2506.24102  [pdf, ps, other]

    cs.CV

    DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

    Authors: Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi

    Abstract: Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack the ground locations and relations for visual entities. Several grounded caption datasets face the problems of missing detailed descriptions, relations, and massive object descriptions on high-resolution images. To fill t…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: Datasets and Models: https://github.com/lxtGH/DenseWorld-1M

  5. arXiv:2506.23979  [pdf, ps, other]

    cs.CL

    TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

    Authors: Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

    Abstract: Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propos…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 33 pages, 15 tables, 11 figures

  6. arXiv:2506.23546  [pdf, ps, other]

    q-bio.NC cond-mat.dis-nn cs.LG cs.NE

    Neural Langevin Machine: a local asymmetric learning rule can be creative

    Authors: Zhendong Yu, Weizhong Huang, Haiping Huang

    Abstract: Fixed points of recurrent neural networks can be leveraged to store and generate information. These fixed points can be captured by the Boltzmann-Gibbs measure, which leads to neural Langevin dynamics that can be used for sampling and learning a real dataset. We call this type of generative model a neural Langevin machine, which is interpretable due to its analytic form of distribution and is simple…

    Submitted 30 June, 2025; originally announced June 2025.

    Comments: 15 pages, 3 figures, with Github link in the paper

  7. arXiv:2506.22902  [pdf, ps, other]

    cs.CV eess.IV

    Point Cloud Compression and Objective Quality Assessment: A Survey

    Authors: Yiling Xu, Yujie Zhang, Shuting Xia, Kaifa Yang, He Huang, Ziyu Shan, Wenjie Huang, Qi Yang, Le Yang

    Abstract: The rapid growth of 3D point cloud data, driven by applications in autonomous driving, robotics, and immersive environments, has led to critical demand for efficient compression and quality assessment techniques. Unlike traditional 2D media, point clouds present unique challenges due to their irregular structure, high data volume, and complex attributes. This paper provides a comprehensive survey…

    Submitted 28 June, 2025; originally announced June 2025.

  8. arXiv:2506.22200  [pdf, ps, other]

    cs.LG cs.AI

    EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework

    Authors: Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yue Wang, Yuzhi Zhang

    Abstract: Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), an efficient variant of PPO that lowers RL's computational cost, still faces limited exploration, low sample efficiency and instability, constraining its performance on complex reasoning tasks. To address these limitations…

    Submitted 30 June, 2025; v1 submitted 27 June, 2025; originally announced June 2025.

  9. arXiv:2506.21154  [pdf, ps, other]

    stat.ME cs.AI cs.LG

    Transformer-Based Spatial-Temporal Counterfactual Outcomes Estimation

    Authors: He Li, Haoang Chi, Mingyu Liu, Wanrong Huang, Liyang Xu, Wenjing Yang

    Abstract: The real world naturally has dimensions of time and space. Therefore, estimating the counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attr…

    Submitted 26 June, 2025; originally announced June 2025.

    Comments: 24 pages, accepted at ICML 2025

  10. arXiv:2506.19482  [pdf, ps, other]

    cs.LG cs.AI

    Fast and Distributed Equivariant Graph Neural Networks by Virtual Node Learning

    Authors: Yuelin Zhang, Jiacheng Cen, Jiaqi Han, Wenbing Huang

    Abstract: Equivariant Graph Neural Networks (GNNs) have achieved remarkable success across diverse scientific applications. However, existing approaches face critical efficiency challenges when scaling to large geometric graphs and suffer significant performance degradation when the input graphs are sparsified for computational tractability. To address these limitations, we introduce FastEGNN and DistEGNN,…

    Submitted 24 June, 2025; originally announced June 2025.

  11. arXiv:2506.18959  [pdf, ps, other]

    cs.IR cs.CL cs.LG

    From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

    Authors: Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu

    Abstract: Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm terme…

    Submitted 26 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

  12. arXiv:2506.18883  [pdf, ps, other]

    cs.CV

    Universal Video Temporal Grounding with Generative Multi-modal Large Language Models

    Authors: Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie

    Abstract: This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language under…

    Submitted 23 June, 2025; originally announced June 2025.

  13. arXiv:2506.17968  [pdf, ps, other]

    cs.LG cs.AI cs.CV math.PR stat.ML

    h-calibration: Rethinking Classifier Recalibration with Probabilistic Error-Bounded Objective

    Authors: Wenjian Huang, Guiping Cao, Jiahao Xia, Jingkun Chen, Hao Wang, Jianguo Zhang

    Abstract: Deep neural networks have demonstrated remarkable performance across numerous learning tasks but often suffer from miscalibration, resulting in unreliable probability outputs. This has inspired many recent works on mitigating miscalibration, particularly through post-hoc recalibration methods that aim to obtain calibrated probabilities without sacrificing the classification performance of pre-trai…

    Submitted 22 June, 2025; originally announced June 2025.

  14. arXiv:2506.17912  [pdf, ps, other]

    cs.CV cs.MM

    PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis

    Authors: Chuhao Jin, Haosen Li, Bingzi Zhang, Che Liu, Xiting Wang, Ruihua Song, Wenbing Huang, Ying Qin, Fuzheng Zhang, Di Zhang

    Abstract: Recent advances in large language models (LLMs) have enabled breakthroughs in many multimodal generation tasks, but a significant performance gap still exists in text-to-motion generation, where LLM-based methods lag far behind non-LLM methods. We identify the granularity of motion tokenization as a critical bottleneck: fine-grained tokenization induces local dependency issues, where LLMs overemph…

    Submitted 22 June, 2025; originally announced June 2025.

    Comments: 14 pages, 7 figures

  15. arXiv:2506.17596  [pdf, ps, other]

    cs.CV

    A Multimodal In Vitro Diagnostic Method for Parkinson's Disease Combining Facial Expressions and Behavioral Gait Data

    Authors: Wei Huang, Yinxuan Xu, Yintao Zhou, Zhengyu Li, Jing Huang, Meng Pang

    Abstract: Parkinson's disease (PD), characterized by its incurable nature, rapid progression, and severe disability, poses significant challenges to the lives of patients and their families. Given the aging population, the need for early detection of PD is increasing. In vitro diagnosis has garnered attention due to its non-invasive nature and low cost. However, existing methods present several challenges:…

    Submitted 21 June, 2025; originally announced June 2025.

    Comments: 8 pages, 4 figures, accepted by CogSci 2025

  16. arXiv:2506.15451  [pdf, ps, other]

    cs.CL

    AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need

    Authors: Zhouhong Gu, Xiaoxuan Zhu, Yin Cai, Hao Shen, Xingzhou Chen, Qingyi Wang, Jialin Li, Xiaoran Shi, Haoran Guo, Wenxuan Huang, Hongwei Feng, Yanghua Xiao, Zheyu Ye, Yao Hu, Shaosheng Cao

    Abstract: Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framewo…

    Submitted 18 June, 2025; originally announced June 2025.

  17. arXiv:2506.14837  [pdf, ps, other]

    cs.CV cs.AI

    Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction

    Authors: Chengzhi Xu, Yuyang Wang, Lai Wei, Lichao Sun, Weiran Huang

    Abstract: Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precis…

    Submitted 15 June, 2025; originally announced June 2025.

  18. arXiv:2506.14642  [pdf, ps, other]

    cs.CV

    3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting

    Authors: Yuke Xing, Jiarui Wang, Peizhi Niu, Wenjie Huang, Guangtao Zhai, Yiling Xu

    Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framewo…

    Submitted 17 June, 2025; originally announced June 2025.

  19. arXiv:2506.14070  [pdf, ps, other]

    cs.AI

    Into the Unknown: Applying Inductive Spatial-Semantic Location Embeddings for Predicting Individuals' Mobility Beyond Visited Places

    Authors: Xinglei Wang, Tao Cheng, Stephen Law, Zichao Zeng, Ilya Ilyankou, Junyuan Liu, Lu Yin, Weiming Huang, Natchapon Jongwiriyanurak

    Abstract: Predicting individuals' next locations is a core task in human mobility modelling, with wide-ranging implications for urban planning, transportation, public policy and personalised mobility services. Traditional approaches largely depend on location embeddings learned from historical mobility patterns, limiting their ability to encode explicit spatial information, integrate rich urban semantic con…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: 10 pages, 5 figures

  20. arXiv:2506.12909  [pdf, ps, other]

    cs.CL

    SciDA: Scientific Dynamic Assessor of LLMs

    Authors: Junting Zhou, Tingjia Miao, Yiyan Liao, Qichao Wang, Zhoufutu Wen, Yanqin Wang, Yunjie Huang, Ge Yan, Leqi Wang, Yucheng Xia, Hongwan Gao, Yuansong Zeng, Renjie Zheng, Chen Dun, Yitao Liang, Tong Yang, Wenhao Huang, Ge Zhang

    Abstract: Advances in the reasoning capabilities of Large Language Models (LLMs) enable them to solve scientific problems with enhanced efficacy. A high-quality benchmark for comprehensive and appropriate assessment is therefore significant, while existing ones either confront the risk of data contamination or lack involved disciplines. To be specific, due to the data source overlap of LLMs training and sta…

    Submitted 15 June, 2025; originally announced June 2025.

  21. arXiv:2506.12474  [pdf, ps, other]

    cs.LG cs.AI

    Generalizable Trajectory Prediction via Inverse Reinforcement Learning with Mamba-Graph Architecture

    Authors: Wenyun Li, Wenjie Huang, Zejian Deng, Chen Sun

    Abstract: Accurate driving behavior modeling is fundamental to safe and efficient trajectory prediction, yet remains challenging in complex traffic scenarios. This paper presents a novel Inverse Reinforcement Learning (IRL) framework that captures human-like decision-making by inferring diverse reward functions, enabling robust cross-scenario adaptability. The learned reward function is utilized to maximize…

    Submitted 14 June, 2025; originally announced June 2025.

  22. arXiv:2506.11532  [pdf, ps, other]

    eess.AS cs.SD

    From Sharpness to Better Generalization for Speech Deepfake Detection

    Authors: Wen Huang, Xuechen Liu, Xin Wang, Junichi Yamagishi, Yanmin Qian

    Abstract: Generalization remains a critical challenge in speech deepfake detection (SDD). While various approaches aim to improve robustness, generalization is typically assessed through performance metrics like equal error rate without a theoretical framework to explain model performance. This work investigates sharpness as a theoretical proxy for generalization in SDD. We analyze how sharpness responds to…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  23. arXiv:2506.11498  [pdf, ps, other]

    cs.CL

    Lag-Relative Sparse Attention In Long Context Training

    Authors: Manlai Liang, Wanyi Huang, Mandi Liu, Huaijun Li, Jinlong Li

    Abstract: Large Language Models (LLMs) have made significant strides in natural language processing and generation, yet their ability to handle long-context input remains constrained by the quadratic complexity of attention computation and the linearly increasing key-value memory footprint. To reduce computational costs and memory, key-value cache compression techniques are commonly applied at inference time, but…

    Submitted 13 June, 2025; originally announced June 2025.

  24. arXiv:2506.11357  [pdf, ps, other]

    cs.LG stat.ML

    Generalization Bound of Gradient Flow through Training Trajectory and Data-dependent Kernel

    Authors: Yilan Chen, Zhichao Wang, Wei Huang, Andi Han, Taiji Suzuki, Arya Mazumdar

    Abstract: Gradient-based optimization methods have shown remarkable empirical success, yet their theoretical generalization properties remain only partially understood. In this paper, we establish a generalization bound for gradient flow that aligns with the classical Rademacher complexity bounds for kernel methods (specifically those based on the RKHS norm and kernel trace) through a data-dependent kernel ca…

    Submitted 12 June, 2025; originally announced June 2025.

  25. arXiv:2506.11121  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR

    Authors: Wei-Ping Huang, Guan-Ting Lin, Hung-yi Lee

    Abstract: Despite progress in end-to-end ASR, real-world domain mismatches still cause performance drops, which Test-Time Adaptation (TTA) aims to mitigate by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. In this work, we identify a previously overlooked challenge: TTA can inter…

    Submitted 9 June, 2025; originally announced June 2025.

  26. arXiv:2506.10521  [pdf, ps, other]

    cs.AI cs.CL

    Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

    Authors: Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, et al. (2 additional authors not shown)

    Abstract: Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on ev…

    Submitted 25 June, 2025; v1 submitted 12 June, 2025; originally announced June 2025.

    Comments: 82 pages

  27. arXiv:2506.09736  [pdf, ps, other]

    cs.CV cs.AI

    Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning

    Authors: Yuting Li, Lai Wei, Kaipeng Zheng, Jingyuan Huang, Linghe Kong, Lichao Sun, Weiran Huang

    Abstract: Despite the rapid progress of multimodal large language models (MLLMs), they have largely overlooked the importance of visual processing. In a simple yet revealing experiment, we find that language-only models, when provided with image captions, can achieve comparable or even better performance than MLLMs that consume raw visual inputs. This suggests that current MLLMs may generate a…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Technical Report

  28. arXiv:2506.09513  [pdf, ps, other]

    cs.CL cs.AI cs.MA

    ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

    Authors: Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu

    Abstract: Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is co…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 24 pages, 6 figures, 7 tables

  29. arXiv:2506.09420  [pdf, ps, other]

    cs.AI cs.CL cs.HC cs.LG cs.MA

    A Call for Collaborative Intelligence: Why Human-Agent Systems Should Precede AI Autonomy

    Authors: Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Chunyu Miao, Dongyuan Li, Aiwei Liu, Yue Zhou, Yankai Chen, Weizhi Zhang, Yangning Li, Liancheng Fang, Renhe Jiang, Philip S. Yu

    Abstract: Recent improvements in large language models (LLMs) have led many researchers to focus on building fully autonomous AI agents. This position paper questions whether this approach is the right path forward, as these autonomous systems still have problems with reliability, transparency, and understanding the actual requirements of humans. We suggest a different approach: LLM-based Human-Agent Systems…

    Submitted 11 June, 2025; originally announced June 2025.

  30. arXiv:2506.09284  [pdf, ps, other]

    cs.RO cs.AI cs.CV

    UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation

    Authors: Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei

    Abstract: Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods of visual affordance prediction often rely on manually annotated data or condition only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance know…

    Submitted 10 June, 2025; originally announced June 2025.

  31. arXiv:2506.09113  [pdf, ps, other]

    cs.CV

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Authors: Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, et al. (19 additional authors not shown)

    Abstract: Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational models still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core tec…

    Submitted 28 June, 2025; v1 submitted 10 June, 2025; originally announced June 2025.

    Comments: Seedance 1.0 Technical Report

  32. arXiv:2506.08526  [pdf, ps, other]

    cs.CV

    Robust Visual Localization via Semantic-Guided Multi-Scale Transformer

    Authors: Zhongtao Tian, Wenhao Huang, Zhidong Chen, Xiao Wei Sun

    Abstract: Visual localization remains challenging in dynamic environments where fluctuating lighting, adverse weather, and moving objects disrupt appearance cues. Despite advances in feature representation, current absolute pose regression methods struggle to maintain consistency under varying conditions. To address this challenge, we propose a framework that synergistically combines multi-scale feature lea…

    Submitted 10 June, 2025; originally announced June 2025.

  33. arXiv:2506.05918  [pdf, ps, other]

    cs.LG

    Over-PINNs: Enhancing Physics-Informed Neural Networks via Higher-Order Partial Derivative Overdetermination of PDEs

    Authors: Wenxuan Huo, Qiang He, Gang Zhu, Weifeng Huang

    Abstract: Partial differential equations (PDEs) serve as the cornerstone of mathematical physics. In recent years, Physics-Informed Neural Networks (PINNs) have significantly reduced the dependence on large datasets by embedding physical laws directly into the training of neural networks. However, when dealing with complex problems, the accuracy of PINNs still has room for improvement. To address this issue…

    Submitted 6 June, 2025; originally announced June 2025.

  34. arXiv:2506.05685  [pdf, ps, other]

    cs.IR

    NGA: Non-autoregressive Generative Auction with Global Externalities for Advertising Systems

    Authors: Zuowu Zheng, Ze Wang, Fan Yang, Wenqing Ye, Weihua Huang, Wenqiang He, Teng Zhang, Xingxing Wang

    Abstract: Online advertising auctions are fundamental to internet commerce, demanding solutions that not only maximize revenue but also ensure incentive compatibility, high-quality user experience, and real-time efficiency. While recent learning-based auction frameworks have improved context modeling by capturing intra-list dependencies among ads, they remain limited in addressing global externalities and o…

    Submitted 5 June, 2025; originally announced June 2025.

  35. arXiv:2506.05083  [pdf, ps, other]

    cs.CV

    SeedEdit 3.0: Fast and High-Quality Generative Image Editing

    Authors: Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang

    Abstract: We introduce SeedEdit 3.0, a companion to our T2I model Seedream 3.0, which significantly improves over our previous SeedEdit versions in both edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. In addition to model upgrading with T2I, in this report, we present several key improvements. First, we develop an enhanced data curation pipeline wit…

    Submitted 6 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

    Comments: Website: https://seed.bytedance.com/tech/seededit

  36. arXiv:2506.03157  [pdf, ps, other]

    q-bio.BM cs.LG

    UniSim: A Unified Simulator for Time-Coarsened Dynamics of Biomolecules

    Authors: Ziyang Yu, Wenbing Huang, Yang Liu

    Abstract: Molecular Dynamics (MD) simulations are essential for understanding the atomic-level behavior of molecular systems, giving insights into their transitions and interactions. However, classical MD techniques are limited by the trade-off between accuracy and efficiency, while recent deep learning-based improvements have mostly focused on single-domain molecules, lacking transferability to unfamiliar…

    Submitted 5 June, 2025; v1 submitted 20 May, 2025; originally announced June 2025.

    Comments: ICML 2025 poster

  37. arXiv:2506.03107  [pdf, ps, other]

    cs.CV

    ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

    Authors: Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang

    Abstract: Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To ad…

    Submitted 11 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

    Comments: Website: https://boese0601.github.io/bytemorph Dataset: https://huggingface.co/datasets/ByteDance-Seed/BM-6M Benchmark: https://huggingface.co/datasets/ByteDance-Seed/BM-Bench Code: https://github.com/ByteDance-Seed/BM-code Demo: https://huggingface.co/spaces/Boese0601/ByteMorph-Demo

  38. arXiv:2506.02334  [pdf, ps, other]

    cs.CV

    Generalized Category Discovery via Reciprocal Learning and Class-Wise Distribution Regularization

    Authors: Duo Liu, Zhiquan Tan, Linglan Zhao, Zhongqiang Zhang, Xiangzhong Fang, Weiran Huang

    Abstract: Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer from inferior base discrimination due to unreliable…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: ICML2025 Poster

  39. arXiv:2506.01977  [pdf, ps, other]

    cs.LG cs.AI

    Towards Unsupervised Training of Matching-based Graph Edit Distance Solver via Preference-aware GAN

    Authors: Wei Huang, Hanchen Wang, Dong Wen, Shaozhen Ma, Wenjie Zhang, Xuemin Lin

    Abstract: Graph Edit Distance (GED) is a fundamental graph similarity metric widely used in various applications. However, computing GED is an NP-hard problem. A recent state-of-the-art hybrid GED solver has shown promising performance by formulating GED as a bipartite graph matching problem, then leveraging a generative diffusion model to predict node matching between two graphs, from which both the GED and…

    Submitted 15 May, 2025; originally announced June 2025.

  40. arXiv:2506.01701  [pdf, ps, other]

    cs.CV cs.AI

    Data Pruning by Information Maximization

    Authors: Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

    Abstract: In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning.…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: ICLR 2025

  41. arXiv:2506.01011  [pdf, other]

    cs.CR

    Autoregressive Images Watermarking through Lexical Biasing: An Approach Resistant to Regeneration Attack

    Authors: Siqi Hui, Yiren Song, Sanping Zhou, Ye Deng, Wenli Huang, Jinjun Wang

    Abstract: Autoregressive (AR) image generation models have gained increasing attention for their breakthroughs in synthesis quality, highlighting the need for robust watermarking to prevent misuse. However, existing in-generation watermarking techniques are primarily designed for diffusion models, where watermarks are embedded within diffusion latent states. This design poses significant challenges for dire…

    Submitted 1 June, 2025; originally announced June 2025.

  42. arXiv:2506.00866  [pdf, ps, other]

    stat.ML cs.LG stat.ME

    Projection Pursuit Density Ratio Estimation

    Authors: Meilin Wang, Wei Huang, Mingming Gong, Zheng Zhang

    Abstract: Density ratio estimation (DRE) is a paramount task in machine learning, for its broad applications across multiple domains, such as covariate shift adaptation, causal inference, independence tests and beyond. Parametric methods for estimating the density ratio possibly lead to biased results if models are misspecified, while conventional non-parametric methods suffer from the curse of dimensionali…

    Submitted 1 June, 2025; originally announced June 2025.

  43. arXiv:2505.23922  [pdf, ps, other]

    cs.CV cs.CL

    ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

    Authors: David Ma, Huaqing Yuan, Xingjian Wang, Qianbo Zang, Tianci Liu, Xinyang He, Yanbin Wei, Jiawei Guo, Ni Jiahui, Zhenzhu Yang, Meng Cao, Shanghaoran Quan, Yizhi Li, Wangchunshu Zhou, Jiaheng Liu, Wenhao Huang, Ge Zhang, Shiwen Ni, Xiaojie Jin

    Abstract: Although long-video understanding demands that models capture hierarchical temporal information -- from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours) -- existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To ad…

    Submitted 29 May, 2025; originally announced May 2025.

  44. arXiv:2505.23810  [pdf, ps, other]

    cs.CL cs.AI

    MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

    Authors: Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu

    Abstract: Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long complex dialogue sessions, including frequent motivation transfer, sophisticated cross-turn dependency, is criticized all along. Nevertheless, no existing benchmarks can fully reflect these weaknesses. We present MARS-Benc…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 29 pages, 13 figures

  45. arXiv:2505.23053  [pdf, ps, other]

    cs.IR cs.AI

    Augment or Not? A Comparative Study of Pure and Augmented Large Language Model Recommenders

    Authors: Wei-Hsiang Huang, Chen-Wei Ke, Wei-Ning Chiu, Yu-Xuan Su, Chun-Chun Yang, Chieh-Yuan Cheng, Yun-Nung Chen, Pu-Jen Cheng

    Abstract: Large language models (LLMs) have introduced new paradigms for recommender systems by enabling richer semantic understanding and incorporating implicit world knowledge. In this study, we propose a systematic taxonomy that classifies existing approaches into two categories: (1) Pure LLM Recommenders, which rely solely on LLMs, and (2) Augmented LLM Recommenders, which integrate additional non-LLM t…

    Submitted 28 May, 2025; originally announced May 2025.

  46. arXiv:2505.23024  [pdf, ps, other]

    cs.LG

    An Empirical Study of Federated Prompt Learning for Vision Language Model

    Authors: Zhihao Wang, Wenke Huang, Tian Chen, Zekun Shi, Guancheng Wan, Yu Qiao, Bin Yang, Jian Wang, Bing Li, Mang Ye

    Abstract: The Vision Language Model (VLM) excels in aligning vision and language representations, and prompt learning has emerged as a key technique for adapting such models to downstream tasks. However, the application of prompt learning with VLM in federated learning (FL) scenarios remains underexplored. This paper systematically investigates the behavioral differences between language prompt learning…

    Submitted 28 May, 2025; originally announced May 2025.

  47. arXiv:2505.22453  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

    Authors: Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun

    Abstract: Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to ite…

    Submitted 28 May, 2025; originally announced May 2025.

  48. Parental Collaboration and Closeness: Envisioning with New Couple Parents

    Authors: Ya-Fang Lin, Xiaotian Li, Wan-Hsuan Huang, Charan Pushpanathan Prabavathi, Jie Cai, John M. Carroll

    Abstract: Couples often experience a decrease in closeness as they cope with the demands of parenthood. Existing technologies have supported parenting and parental collaboration. However, these technologies do not adequately support closeness in co-parenting. We use scenarios and design probes to brainstorm with 10 new parent couples to explore and envision possibilities for technologies to support closenes…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: DIS 2025

  49. arXiv:2505.22334  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

    Authors: Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang

    Abstract: Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs…

    Submitted 28 May, 2025; originally announced May 2025.

  50. arXiv:2505.22195  [pdf, other]

    cs.CV

    S2AFormer: Strip Self-Attention for Efficient Vision Transformer

    Authors: Guoan Xu, Wenfeng Huang, Wenjing Jia, Jiamao Li, Guangwei Gao, Guo-Jun Qi

    Abstract: Vision Transformer (ViT) has made significant advancements in computer vision, thanks to its token mixer's sophisticated ability to capture global dependencies between all tokens. However, the quadratic growth in computational demands as the number of tokens increases limits its practical efficiency. Although recent methods have combined the strengths of convolutions and self-attention to achieve…

    Submitted 28 May, 2025; originally announced May 2025.

    Comments: 12 pages, 6 figures, 8 tables